Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis

10/15/2019, by Xihui Liu, et al.

Semantic image synthesis aims at generating photorealistic images from semantic layouts. Previous approaches with conditional generative adversarial networks (GAN) show state-of-the-art performance on this task, which either feed the semantic label maps as inputs to the generator, or use them to modulate the activations in normalization layers via affine transformations. We argue that convolutional kernels in the generator should be aware of the distinct semantic labels at different locations when generating images. In order to better exploit the semantic layout for the image generator, we propose to predict convolutional kernels conditioned on the semantic label map to generate the intermediate feature maps from the noise maps and eventually generate the images. Moreover, we propose a feature pyramid semantics-embedding discriminator, which is more effective in enhancing fine details and semantic alignments between the generated images and the input semantic layouts than previous multi-scale discriminators. We achieve state-of-the-art results on both quantitative metrics and subjective evaluation on various semantic segmentation datasets, demonstrating the effectiveness of our approach.


1 Introduction

Recently, generative adversarial networks (GAN) Goodfellow et al. (2014) have shown stunning results in generating photorealistic images of faces Karras et al. (2017, 2018) and simple objects Zhang et al. (2018); Brock et al. (2018); Lucic et al. (2019). However, generating photorealistic images of complex scenes with different types of objects and stuff remains a challenging problem. We consider semantic image synthesis, which aims at generating photorealistic images conditioned on semantic layouts. It has wide applications in controllable image synthesis and interactive image manipulation. State-of-the-art methods for this task are mostly GAN-based.

A fundamental question to semantic image synthesis is how to exploit the semantic layout information in the generator. Most previous GAN-based approaches feed the label maps as inputs, and generate images by an encoder-decoder network Isola et al. (2017); Wang et al. (2018); Park et al. (2019). Nonetheless, since the semantic label maps are only fed into the network once at the input layer, the layout information cannot be well preserved in the generator. To mitigate the problem, SPADE Park et al. (2019) uses the label maps to predict spatially-adaptive affine transformations for modulating the activations in normalization layers. However, such feature modulation by simple affine transformations is limited in representational power and flexibility.

On the other hand, we rethink the functionality of convolutional layers for image synthesis. In a generation network, each convolutional layer learns "how to draw" by generating fine features at each location based on a local neighborhood of input features. The same translation-invariant convolutional kernels are applied to all samples and at all spatial locations, irrespective of the distinct semantic labels at different locations and the unique semantic layout of each sample. We argue that different convolutional kernels should be used for generating different objects or stuff.

Motivated by the two aforementioned aspects, we propose to predict spatially-varying conditional convolution kernels based on the input semantic layout, so that the layout information can more explicitly and effectively control the image generation process. However, naively predicting all convolutional kernels is infeasible, because it requires a large number of learnable parameters, which causes overfitting and consumes too much GPU memory. Inspired by recent works on lightweight convolutional neural networks Chollet (2017); Howard et al. (2017); Ma et al. (2018), we adopt depthwise separable convolutions, which factorize a convolution operation into a conditional depthwise convolution and a conventional pointwise convolution (i.e., a $1 \times 1$ convolution), and predict only the depthwise kernels. The conditional kernel weights for each spatial location are predicted from the semantic layout by a global-context-aware weight prediction network. Our proposed conditional convolution enables the semantic layout to better control the generation process, without a heavy increase in network parameters and computational cost.

Most existing methods for semantic image synthesis adopt a multi-scale PatchGAN discriminator Wang et al. (2018); Park et al. (2019), but its limited representation power cannot match the increased capacity of the generator. We believe that a robust discriminator should focus on two indispensable and complementary aspects of the images: high-fidelity details, and semantic alignment with the input layout map. Motivated by the two principles, we propose to utilize multi-scale feature pyramids for promoting high-fidelity details such as texture and edges, and exploit patch-based semantic-embeddings to enhance the spatial semantic alignment between the generated images and the input semantic layout.

The contributions of this paper are summarized as follows. (1) We propose a novel approach for semantic image synthesis by learning to predict layout-to-image conditional convolution kernels based on the semantic layout. Such conditional convolution operations enable the semantic layout to adaptively control the generation process based on the distinct semantic labels at different locations. (2) We propose a feature pyramid semantics-embedding discriminator which is more effective in encouraging high-fidelity details and semantic alignment with the input layout map. (3) With the proposed approach, CC-FPSE, we achieve state-of-the-art results on the Cityscapes, COCO-Stuff, and ADE20K datasets, demonstrating the effectiveness of our approach in generating images of complex scenes.

Figure 1: Semantic image synthesis results by previous approaches and our approach. Best viewed in color. Zoom in for details. Key differences are highlighted by red boxes.

2 Related Work

Generative adversarial networks (GAN) Goodfellow et al. (2014) have achieved great success in image synthesis Brock et al. (2018); Karras et al. (2018); Lucic et al. (2019). Conditional GANs synthesize images based on given conditions, which can be class labels Zhang et al. (2018); Brock et al. (2018), sentence descriptions Zhang et al. (2017); Xu et al. (2018), or, in our task, semantic layouts. Our work is also related to image-to-image translation Isola et al. (2017); Liu et al. (2017), which translates one possible representation of an image into another representation.

Semantic image synthesis aims at synthesizing photorealistic images given the semantic layout. Pix2pix Isola et al. (2017) adopted an encoder-decoder generator which takes semantic label maps as inputs. Pix2pixHD Wang et al. (2018) proposed a coarse-to-fine generator and multi-scale discriminators to generate high-resolution images. SPADE Park et al. (2019) used the semantic label maps to predict affine transformation parameters for modulating the activations in normalization layers. Besides GAN-based approaches, CRN Chen and Koltun (2017) used a cascaded refinement network trained with a regression loss. SIMS Qi et al. (2018) developed a semi-parametric approach, retrieving image fragments and refining them with a refinement network. Our method differs from previous GAN-based approaches in how the semantic layout information controls the generation process. We propose to predict spatially-varying convolutional kernels conditioned on the semantic layout, so that the layout can explicitly control the generation process.

Dynamic filter networks Jia et al. (2016) were the first attempt to generate filters dynamically based on the inputs. Ha et al. Ha et al. (2016) proposed HyperNetworks, where a hyper-network is used to generate weights for another network. This idea has been applied to applications such as neural style transfer Shen et al. (2018), super-resolution Jo et al. (2018); Hu et al. (2019), image segmentation Harley et al. (2017); Wu et al. (2018), motion prediction Xue et al. (2016), and tracking Li et al. (2017). However, most of them only predicted a limited number of filters, and it would be computation- and memory-intensive to use dynamically predicted filters in every layer. Su et al. Su et al. (2019) proposed pixel-adaptive CNNs, which multiply a conventional convolutional filter with a spatially-varying kernel to obtain adaptive convolutional kernels. Zhao et al. Zhao et al. (2018) adopted a shared filter bank and predicted adaptive weights to linearly combine the basis filters. Such operations are still based on conventional convolutions, so the input information has limited capacity in controlling or influencing the adaptive convolutional kernels, and the behavior of the generation network is still dominated by the conventional convolutional kernels. Our approach differs from previous work in several aspects. Firstly, we predict the convolutional kernels conditioned on the layout information, so that the conditional information can explicitly control the generation process. Secondly, we reduce the computation and memory costs by introducing depthwise separable convolutions, while still enabling the conditional information to control the generation process by directly predicting the convolutional kernel weights.

3 Method

Figure 2:

(Left) The structure of a Conditional Convolution Block (CC Block). (Right) The overall framework of our proposed CC-FPSE. The weight prediction network predicts weights for CC Blocks in the generator. The conditional convolution generator is built up of Conditional Convolution (CC) Blocks shown on the left. The feature pyramid semantics-embedding (FPSE) discriminator predicts real/fake scores as well as semantic alignment scores. L-ReLU in the CC Block denotes Leaky ReLU.

We propose a novel approach for semantic image synthesis with conditional generative adversarial networks. The proposed framework, CC-FPSE, is composed of a novel generator with conditional convolutions predicted by the weight prediction network, and a feature-pyramid semantics-embedding discriminator as shown in Figure 2 (right). The proposed generator is able to fully utilize the semantic layout information to control the image generation process by predicting the convolution kernels in multiple layers of the generation network with limited computational resources. The proposed discriminator is able to supervise the generation of fine details and forces the spatial alignment between the generated image and the input semantic layout by embedding both images and label maps into a joint feature space.

3.1 Learning to Predict Conditional Convolutions for Image Generator

Our proposed generator takes a low-resolution noise map as input. It alternately applies the proposed conditional convolution blocks He et al. (2016) and upsampling layers to gradually refine the intermediate feature maps and eventually generate the output image. In conventional convolution layers, the same convolution kernels are applied to all samples and at all spatial locations regardless of their distinct semantic layouts. We argue that such a convolution operation is not flexible and effective enough for semantic image synthesis. In semantic image synthesis, the convolution layers gradually generate refined features at each location given the coarse features in a local neighborhood. Since different objects or stuff should be generated differently, we would like the convolution layer to be aware of the unique semantic label at the target location.

In order to better incorporate the layout information into the image generation process, we propose to predict convolutional kernel weights based on the semantic layout. Given an input feature map $X \in \mathbb{R}^{C_{\text{in}} \times H \times W}$, we aim to produce an output feature map $Y \in \mathbb{R}^{C_{\text{out}} \times H \times W}$ by a convolution layer with kernel size $k \times k$. We adopt a weight prediction network that takes the semantic label map as input and outputs the predicted convolutional kernel weights for each conditional convolution layer. However, naively predicting all the kernel weights causes excessive computational costs and GPU memory usage. To solve the problem, we factorize the convolution layer into a depthwise convolution and a pointwise convolution, and only predict the weights of the lightweight depthwise convolution.

3.1.1 Efficient Conditional Convolution Blocks for Image Generation

A conventional convolution kernel has $C_{\text{out}} \times C_{\text{in}} \times k \times k$ weight parameters. A naive solution for generating spatially-varying convolution kernels would need to predict $H \times W \times C_{\text{out}} \times C_{\text{in}} \times k \times k$ weight parameters. This is impractical because the convolution operation is the basic building block of the generator and is stacked many times. Such a network is not only computation- and memory-intensive, but also prone to overfitting the training data.

To solve the problem, we introduce the depthwise separable convolution Chollet (2017) and only predict the depthwise convolutional kernel weights, which substantially reduces the number of parameters to predict. In particular, we factorize the convolution into a conditional depthwise convolution and a conventional pointwise convolution (i.e., a $1 \times 1$ convolution). The conditional depthwise convolution performs spatial filtering over each input channel independently, and its spatially-varying kernel weights are dynamically predicted based on the semantic layout. The predicted weights for the conditional depthwise convolution are denoted as $V \in \mathbb{R}^{C_{\text{in}} \times H \times W \times k \times k}$, and its output feature maps are denoted as $\hat{X}$. The conditional depthwise convolution is formulated as,

$$\hat{X}_{c,h,w} = \sum_{i,j \in \{-\lfloor k/2 \rfloor, \dots, \lfloor k/2 \rfloor\}} V_{c,h,w,i,j} \, X_{c,h+i,w+j}, \qquad (1)$$

where $(h, w)$ denotes the spatial coordinates of the feature maps, $k$ denotes the convolution kernel size, and $c$ denotes the channel index. The convolutional kernels in $V$, each of size $k \times k$, operate at each channel and each spatial location of $X$ independently to generate the output feature maps $\hat{X}$. Then we exploit a conventional pointwise convolution ($1 \times 1$ convolution) to map the $C_{\text{in}}$ input channels to $C_{\text{out}}$ output channels, and its output is denoted as $\tilde{X}$.
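To make Eq. (1) concrete, below is a minimal PyTorch-style sketch (not the authors' released implementation) of a spatially-varying depthwise convolution. The per-location kernels `v` are assumed to come from a separate weight prediction network, and all shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def conditional_depthwise_conv(x, v, k=3):
    """Spatially-varying depthwise convolution, a sketch of Eq. (1).

    x: input features,    shape (B, C, H, W)
    v: predicted kernels, shape (B, C * k * k, H, W),
       i.e. one k x k kernel per channel and per spatial location.
    """
    b, c, h, w = x.shape
    # Extract the k x k neighborhood around every location: (B, C*k*k, H*W).
    patches = F.unfold(x, kernel_size=k, padding=k // 2)
    patches = patches.view(b, c, k * k, h, w)
    kernels = v.view(b, c, k * k, h, w)
    # Weighted sum over the neighborhood, independently for each channel.
    return (patches * kernels).sum(dim=2)  # (B, C, H, W)

# Toy usage with random features and random "predicted" kernels.
x = torch.randn(2, 8, 16, 16)
v = torch.randn(2, 8 * 9, 16, 16)
y = conditional_depthwise_conv(x, v, k=3)
print(y.shape)  # torch.Size([2, 8, 16, 16])
```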

In addition, we also propose a conditional attention operation to gate the information flow passed to the next layer. The conditional attention weights $A \in \mathbb{R}^{C_{\text{out}} \times H \times W}$ are predicted in the same way as the conditional convolution kernels, which will be detailed later. An element-wise product between the predicted attention weights and the convolution output produces the gated feature maps,

$$Y_{c,h,w} = A_{c,h,w} \cdot \tilde{X}_{c,h,w}, \qquad (2)$$

where $c$ is the channel index and $(h, w)$ denotes the spatial location in the feature maps.

The numbers of parameters to predict for the conditional convolution and the conditional attention are $C_{\text{in}} \times H \times W \times k^2$ ($k = 3$ in our implementation) and $C_{\text{out}} \times H \times W$, respectively, which is orders of magnitude smaller than the $C_{\text{out}} \times C_{\text{in}} \times H \times W \times k^2$ parameters required to directly predict the whole convolutional kernel weights.
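As a rough numerical illustration of the savings, using the per-location counts above with hypothetical channel and kernel sizes (not values taken from the paper):

```python
# Hypothetical layer sizes, for illustration only.
c_in, c_out, k = 1024, 1024, 3

full_kernel = c_out * c_in * k * k  # naive: a full kernel per spatial location
depthwise   = c_in * k * k          # conditional depthwise kernel per location
attention   = c_out                 # conditional attention weight per location

print(full_kernel, depthwise + attention)     # 9437184 vs 10240
print(full_kernel / (depthwise + attention))  # roughly 920x fewer values to predict
```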

By predicting unique convolutional kernel weights for each spatial location, the image generation process becomes more flexible and adaptive to the semantic layout conditions. In the meantime, we keep an affordable parameter size and computational cost by introducing the depthwise separable convolutions. We define a ResBlock-like structure, named the Conditional Convolution Block, with the operations introduced above. As shown in Figure 2 (left), it consists of a conventional batch normalization layer, a conditional depthwise convolution with predicted spatially-varying kernels, a conventional pointwise convolution, followed by a conditional attention layer, and finally the non-linear activation layer. There are also identity additive skip connections for every two such blocks.
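A minimal sketch of how such a block could be assembled, reusing the `conditional_depthwise_conv` helper from the earlier snippet; the attention weights are assumed to already lie in [0, 1] (the sigmoid is applied in the weight prediction network), and the skip connection across every two blocks is omitted for brevity:

```python
import torch
import torch.nn as nn

class CCBlock(nn.Module):
    """Conditional Convolution Block (sketch): BN -> conditional depthwise conv
    -> 1x1 pointwise conv -> conditional attention gate -> Leaky ReLU.
    Assumes conditional_depthwise_conv from the earlier sketch is in scope;
    the identity skip connection across every two blocks is omitted here."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.k = k
        self.bn = nn.BatchNorm2d(c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x, depthwise_w, attn_w):
        # depthwise_w: (B, c_in * k * k, H, W) predicted kernels, Eq. (1)
        # attn_w:      (B, c_out, H, W) attention weights in [0, 1], Eq. (2)
        h = self.bn(x)
        h = conditional_depthwise_conv(h, depthwise_w, self.k)
        h = self.pointwise(h)  # mix channels with a conventional 1x1 conv
        h = h * attn_w         # gate the information flow
        return self.act(h)
```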

3.1.2 Conditional Weight Prediction and Overall Generator Structure

The conditional weight prediction network predicts the conditional weights given the input semantic layout. A simple design would be to stack several convolutional layers. In SPADE Park et al. (2019), two convolutional layers with 3×3 kernels are applied to the downsampled semantic label map to generate the adaptive scale and bias for their proposed adaptive normalization layer. However, downsampling a semantic label map to a very small size by nearest-neighbor interpolation inevitably loses much useful information. Moreover, such a structure has only a small receptive field, which prevents the weight prediction from incorporating long-range context information. If there is a large area of the same semantic label, pixels inside this area can only access a local neighborhood with identical semantic labels. So they will be processed by identical predicted weights, regardless of their relative positions inside the object or stuff.

Therefore, we design a global-context-aware weight prediction network with a feature pyramid structure Lin et al. (2017). The architecture of our weight prediction network is shown in Figure 2 (right). The label map is first downsampled through the layout encoder, and then upsampled by the decoder with lateral connections from the encoder. The features at different levels of the feature pyramid are concatenated with the original semantic map to obtain the global-context-aware semantic feature maps, which are used to predict the conditional convolution weights and conditional attention weights separately. We use two convolutional layers to predict the conditional convolution weights. To predict the conditional attention weights, we adopt two convolutional layers and a Sigmoid activation layer.

With the encoder-decoder structure of the weight prediction network, our predicted weights are aware of not only the local neighborhood, but also long-range context and relative locations.
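The following sketch illustrates one possible form of such a global-context-aware weight predictor at a single pyramid level; the channel widths, number of scales, and layer choices are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightPredictionNet(nn.Module):
    """FPN-style weight predictor (sketch): encode the one-hot label map
    bottom-up, decode top-down with a lateral connection, then predict
    per-location depthwise kernels and attention weights (one level shown)."""
    def __init__(self, num_classes, c_gen, width=64, k=3):
        super().__init__()
        self.down1 = nn.Conv2d(num_classes, width, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(width, width, 3, stride=2, padding=1)
        self.lateral1 = nn.Conv2d(width, width, 1)
        # Heads: two convs for kernels; two convs plus a sigmoid for attention.
        self.kernel_head = nn.Sequential(
            nn.Conv2d(width + num_classes, width, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(width, c_gen * k * k, 3, padding=1))
        self.attn_head = nn.Sequential(
            nn.Conv2d(width + num_classes, width, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(width, c_gen, 3, padding=1), nn.Sigmoid())

    def forward(self, label_map):
        # label_map: (B, num_classes, H, W), one-hot and float.
        f1 = F.leaky_relu(self.down1(label_map), 0.2)  # 1/2 resolution
        f2 = F.leaky_relu(self.down2(f1), 0.2)         # 1/4 resolution
        # Top-down path with a lateral connection.
        p1 = self.lateral1(f1) + F.interpolate(f2, scale_factor=2, mode='nearest')
        # Concatenate the resized raw label map for explicit layout information.
        s = F.interpolate(label_map, size=p1.shape[-2:], mode='nearest')
        feat = torch.cat([p1, s], dim=1)
        return self.kernel_head(feat), self.attn_head(feat)
```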

The overall generator network is built of a series of Conditional Convolution Blocks and upsampling layers, with the conditional weights predicted by the weight prediction network.

3.2 Feature Pyramid Semantics-Embedding Discriminator

We believe that a good discriminator should focus on two indispensable and complementary aspects: high-fidelity details such as texture and edges, and semantic alignment with the input semantic map. Existing methods for semantic image synthesis apply a multi-scale PatchGAN discriminator Wang et al. (2018); Park et al. (2019), where images concatenated with the semantic label maps are scaled to multiple resolutions and fed into different discriminators with identical structures. However, such a discriminator still struggles to discriminate fine details, and it does not impose strong constraints on the spatial semantic alignment between the generated image and the input label map.

Motivated by the aforementioned two design principles, we propose a more effective discriminator design. We create multi-scale feature pyramids to promote high-fidelity details such as texture and edges, and exploit a semantics-embedding discriminator to enforce the spatial semantic alignment between the generated images and the input semantic layout.

3.2.1 Feature Pyramid Discriminator

Current image generation methods tend to produce images with blurry edges and textures and obvious artifacts. This problem suggests that we should pay more attention to low-level details when designing discriminator architectures. On the other hand, the discriminator should also have a global view of the high-level semantics. The previously introduced multi-scale PatchGAN discriminator Wang et al. (2018) attempts to balance a large receptive field and fine details by using multiple discriminators at different scales. The same image at different scales is independently fed into different discriminators, leading to increased network parameters, memory footprint, and computational cost.

Inspired by the evolution from image pyramids to feature pyramids Lin et al. (2017), we propose a single feature pyramid discriminator that produces a multi-scale feature representation with both global semantics and low-level texture and edge information. As shown in Figure 2 (right), our feature pyramid discriminator takes the input image at a single scale. The bottom-up pathway produces a feature hierarchy consisting of multi-scale feature maps, and the top-down pathway gradually upsamples the spatially coarse but semantically rich feature maps. The lateral connections combine the high-level semantic feature maps from the top-down pathway with the low-level feature maps from the bottom-up pathway. As a result, the combined multi-scale features are semantically strong while also containing fine low-level details such as edges and textures, so the discriminator poses stronger constraints on both the semantic information and the fine details.
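A compact sketch of this single-image feature pyramid backbone (channel widths and depth are illustrative assumptions); each returned scale can then be scored as described in the next subsection:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramidD(nn.Module):
    """Feature pyramid discriminator backbone (sketch): a bottom-up feature
    hierarchy plus a top-down pathway with lateral connections, returning
    multi-scale feature maps that mix semantics with low-level detail."""
    def __init__(self, width=64):
        super().__init__()
        self.c1 = nn.Conv2d(3, width, 4, stride=2, padding=1)              # 1/2
        self.c2 = nn.Conv2d(width, width * 2, 4, stride=2, padding=1)      # 1/4
        self.c3 = nn.Conv2d(width * 2, width * 4, 4, stride=2, padding=1)  # 1/8
        self.lat2 = nn.Conv2d(width * 2, width * 4, 1)
        self.lat1 = nn.Conv2d(width, width * 4, 1)

    def forward(self, img):
        f1 = F.leaky_relu(self.c1(img), 0.2)
        f2 = F.leaky_relu(self.c2(f1), 0.2)
        f3 = F.leaky_relu(self.c3(f2), 0.2)
        # Top-down: upsample coarse, semantically strong maps and fuse them
        # with finer, detail-rich maps through lateral 1x1 convolutions.
        p3 = f3
        p2 = self.lat2(f2) + F.interpolate(p3, scale_factor=2, mode='nearest')
        p1 = self.lat1(f1) + F.interpolate(p2, scale_factor=2, mode='nearest')
        return [p1, p2, p3]  # each scale is scored separately
```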

3.2.2 Semantic Embeddings for Discriminator

In conventional discriminators for semantic image synthesis, an image and its corresponding semantic label map are concatenated and fed into the discriminator as its input. However, there is no guarantee that the discriminator makes use of the label map for distinguishing real and fake images. In other words, the discriminator could satisfy the training constraints by only discriminating whether an image is real or not, without considering whether it matches well with the label map. Inspired by the projection discriminator Miyato and Koyama (2018), which computes the dot product between the class label embedding and the image feature vector as part of the output discriminator score, we adapt this idea to our scenario where the condition is a spatial label map. In order to encourage semantic alignment between the generated images and the conditional semantic layout, we propose a patch-based semantics-embedding discriminator.

Our discriminator takes only the real or generated images as inputs, and produces a feature pyramid at different scales. Let $F_i$ denote the feature map at the $i$-th scale, with spatial resolution $H_i \times W_i$ and $C_i$ channels. The feature vector at each spatial location of $F_i$ represents a patch in the original image. The conventional PatchGAN discriminator classifies whether each patch is real or not by predicting a score for each spatial location of the feature map. In contrast, we force the discriminator to classify not only whether each patch is real or fake, but also whether the patch features match the semantic labels within that patch in a joint embedding space.

We downsample the label map to the same spatial resolution as $F_i$, and embed the one-hot label at each spatial location into a $C_i$-dimensional vector. The embedded semantic layout is denoted as $S_i$. We calculate the inner product between $F_i$ and $S_i$ at each spatial location to obtain a semantic matching score map, where each value represents the semantic alignment score of the corresponding patch in the original image. The semantic matching score is added to the conventional real/fake score to form the final discriminator score. In this way, the discriminator not only guides the generator to generate high-fidelity images, but also drives the generated images to be better semantically aligned with the conditional semantic layout.
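A minimal sketch of the patch-level scoring at one pyramid scale, combining a conventional real/fake score with the projection-style inner product described above (the 1x1 convolution heads are illustrative choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchSemanticScore(nn.Module):
    """Patch score at one pyramid scale (sketch): a conventional real/fake
    score plus an inner product between patch features and embedded labels."""
    def __init__(self, feat_ch, num_classes):
        super().__init__()
        self.realfake = nn.Conv2d(feat_ch, 1, kernel_size=1)
        # Embed one-hot labels into the same channel dimension as the features.
        self.label_embed = nn.Conv2d(num_classes, feat_ch, kernel_size=1, bias=False)

    def forward(self, feat, label_map):
        # feat: (B, C_i, H_i, W_i); label_map: (B, num_classes, H, W), one-hot.
        s = F.interpolate(label_map, size=feat.shape[-2:], mode='nearest')
        s = self.label_embed(s)                      # (B, C_i, H_i, W_i)
        match = (feat * s).sum(dim=1, keepdim=True)  # per-patch inner product
        return self.realfake(feat) + match           # final per-patch score
```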

3.3 Loss Functions and Training Scheme

The generator and the discriminator of our network are trained alternately. The discriminator adopts the hinge loss for distinguishing real and fake images, while the generator is optimized with multiple losses, including the hinge-based adversarial loss, the discriminator feature matching loss, and the perceptual loss, following previous works Wang et al. (2018); Park et al. (2019),

$$\mathcal{L}_D = -\mathbb{E}_{(x,s)}\big[\min(0, -1 + D(x, s))\big] - \mathbb{E}_{(z,s)}\big[\min(0, -1 - D(G(z, s), s))\big], \qquad (3)$$
$$\mathcal{L}_G = -\mathbb{E}_{(z,s)}\big[D(G(z, s), s)\big] + \lambda_P \mathcal{L}_P + \lambda_{FM} \mathcal{L}_{FM}, \qquad (4)$$

where $x$ is a real image, $s$ is the semantic label map, and $z$ is the input noise map. $\mathcal{L}_P$ denotes the perceptual loss, which matches VGG-extracted features between the generated images and the original images. $\mathcal{L}_{FM}$ denotes the discriminator feature matching loss, which matches the discriminator intermediate features between the generated images and the original images. $\lambda_P$ and $\lambda_{FM}$ denote the weights for the perceptual loss and the feature matching loss, respectively.
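A minimal sketch of the hinge objectives in Eqs. (3)-(4), assuming `d_real` and `d_fake` are per-patch discriminator score maps and that the perceptual and feature matching terms are computed elsewhere; the default weights follow the values stated in Section 4.2:

```python
import torch

def d_hinge_loss(d_real, d_fake):
    """Hinge loss for the discriminator over per-patch score maps (Eq. 3)."""
    loss_real = torch.relu(1.0 - d_real).mean()
    loss_fake = torch.relu(1.0 + d_fake).mean()
    return loss_real + loss_fake

def g_loss(d_fake, perceptual, feat_match, lambda_p=10.0, lambda_fm=20.0):
    """Generator loss (Eq. 4): hinge adversarial term plus weighted perceptual
    and discriminator feature matching terms."""
    return -d_fake.mean() + lambda_p * perceptual + lambda_fm * feat_match
```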

4 Experiments

4.1 Datasets and Evaluation Metrics

We experiment on Cityscapes Cordts et al. (2016), COCO-Stuff Caesar et al. (2018), and ADE20K Zhou et al. (2017) datasets. The Cityscapes dataset has 3,000 training images and 500 validation images of urban street scenes. COCO-Stuff is the most challenging dataset, containing 118,000 training images and 5,000 validation images from complex scenes. ADE20K dataset provides 20,000 training images and 2,000 validation images from both outdoor and indoor scenes. All images are annotated with semantic segmentation masks.

We evaluate our approach from three aspects. We first compare images synthesized by our approach and by previous approaches, and conduct a human perceptual evaluation to compare the visual quality of the generated images. We then evaluate the segmentation performance on the generated images using segmentation models pretrained on the original datasets; we use the same segmentation models as in Park et al. (2019) for testing. The segmentation performance is measured by mean Intersection-over-Union (mIOU) and pixel accuracy. Finally, we measure the distribution distance between the generated images and real images with the Fréchet Inception Distance (FID) Heusel et al. (2017).
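For reference, a small sketch of how mIOU could be computed from predicted and ground-truth label maps via a confusion matrix (the FID computation, which compares Inception feature statistics of generated and real images, is omitted):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred, gt: integer label maps of the same shape, labels in [0, num_classes)."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    # Accumulate the confusion matrix: rows are ground truth, columns are predictions.
    np.add.at(conf, (gt.reshape(-1), pred.reshape(-1)), 1)
    inter = np.diag(conf).astype(np.float64)
    union = conf.sum(0) + conf.sum(1) - np.diag(conf)
    valid = union > 0
    return (inter[valid] / union[valid]).mean()
```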

4.2 Implementation Details

The training and generated image resolution is 256×256 for the COCO-Stuff and ADE20K datasets, and 512×256 for the Cityscapes dataset. For the generator, synchronized batch normalization across GPUs is adopted for better estimation of the batch statistics. For the discriminator, we utilize instance normalization. We use Leaky ReLU activations to avoid the sparse gradients caused by ReLU activations. We adopt the ADAM Kingma and Ba (2014) optimizer with separate learning rates for the generator and the discriminator. The weight for the perceptual loss is 10 and the weight for the discriminator feature matching loss is 20. Following Park et al. (2019), to enable multi-modal synthesis and style-guided synthesis, we apply a style encoder and a KL-divergence loss with loss weight 0.05. Our models are trained on 16 TITAN X GPUs with a batch size of 32. We train for 200 epochs on the Cityscapes and ADE20K datasets, and 100 epochs on the COCO-Stuff dataset. Code is available at https://github.com/xh-liu/CC-FPSE.

4.3 Qualitative Results and Human Perceptual Evaluation

Figure 3: Results comparison with previous approaches. Best viewed in color. Zoom in for details.
Figure 4: Semantic image synthesis results on COCO-Stuff and ADE20K. Best viewed in color.

We compare our results with the previous approaches pix2pixHD Wang et al. (2018) and SPADE Park et al. (2019), as shown in Figure 3. The images generated by our approach show significant improvement over previous approaches on challenging scenes: they have finer details such as edges and textures, fewer artifacts, and match better with the input semantic layout. Figure 4 shows more images generated by our proposed approach. More results and comparisons are provided in the supplementary material.

We also conduct a human perceptual evaluation to compare the quality of images generated by our method and by the previous state-of-the-art method, SPADE Park et al. (2019). In particular, we randomly sample 500 semantic label maps from the validation set of each dataset. In each comparison, the worker is shown a semantic label map together with two generated images, one by our approach and one by SPADE, and is asked to choose the image of higher quality that matches better with the semantic layout. On the Cityscapes, COCO-Stuff, and ADE20K datasets respectively, 55%, 76%, and 61% of the images generated by our method are preferred over those by SPADE. The human perceptual evaluation validates that our approach generates higher-fidelity images that are better spatially aligned with the semantic layout.

4.4 Quantitative Results

Method                        | COCO-Stuff          | Cityscapes          | ADE20K
                              | mIOU  Accu  FID     | mIOU  Accu  FID     | mIOU  Accu  FID
CRN Chen and Koltun (2017)    | 23.7  40.4  70.4    | 52.4  77.1  104.7   | 22.4  68.8  73.3
SIMS Qi et al. (2018)         | N/A   N/A   N/A     | 47.2  75.5  49.7    | N/A   N/A   N/A
pix2pixHD Wang et al. (2018)  | 14.6  45.7  111.5   | 58.3  81.4  95.0    | 20.3  69.2  81.8
SPADE Park et al. (2019)      | 37.4  67.9  22.6    | 62.3  81.9  71.8    | 38.5  79.9  33.9
Ours                          | 41.6  70.7  19.2    | 65.5  82.3  54.3    | 43.7  82.9  31.7

Table 1: Results of the proposed and previous approaches on multiple public datasets. Higher mIOU/accuracy and lower FID indicate better performance.
              | Baseline | (1)       | (2)      | (3)       | (4)         | (5)         | (6)        | CC-FPSE (Ours)
Generator     | SPADE    | CC w/o FP | CC w/ FP | CC w/o FP | SPADE w/ FP | SPADE w/ FP | CC w/ FP   | CC w/ FP
Discriminator | MsPatch  | MsPatch   | MsPatch  | FP+SE     | MsPatch+SE  | FP+SE       | MsPatch+SE | FP+SE
mIOU          | 35.2     | 36.2      | 36.7     | 40.4      | 38.0        | 39.17       | 40.4       | 41.3

Table 2: Ablation studies on the COCO-Stuff dataset.

Table 1 shows the segmentation performance and FID scores of the results by our approach and by previous approaches. CRN Chen and Koltun (2017) uses cascaded refinement networks trained with a regression loss, without using GANs. SIMS Qi et al. (2018) is a semi-parametric approach which retrieves reference segments from a memory bank and refines the canvas with a refinement network. Both pix2pixHD Wang et al. (2018) and SPADE Park et al. (2019) are GAN-based approaches. Pix2pixHD takes the semantic label map as the generator input, and uses a multi-scale generator and multi-scale discriminators to generate high-resolution images. SPADE takes a noise vector as input, and the semantic label map is used to modulate the activations in normalization layers via learned affine transformations. Our approach performs consistently better than previous approaches, which demonstrates the effectiveness of the proposed approach. Note that SIMS has a better FID score than the GAN-based approaches on Cityscapes, because it generates images by refining segments retrieved from real data. However, it has poor segmentation performance, because it might retrieve semantically mismatched patches.

4.5 Ablation Studies

We conduct controlled experiments to verify the effectiveness of each component in our approach. We use the SPADE Park et al. (2019) model as our baseline, and gradually add or remove each component of the framework. Our full model is denoted as CC-FPSE in the last column. The segmentation mIOU scores of the generated images for each experiment are shown in Table 2. (To be comparable with the ablation study results in Park et al. (2019), we report the model performance at 50 epochs.)

Conditional convolutions for generator. We first replace the SPADE layers with our conditional convolution layers to incorporate the semantic layout information, in the experiments denoted as "CC". Comparing the baseline with (1) (CC generator vs. SPADE generator, both with the MsPatch discriminator), (5) with CC-FPSE (Ours) (CC generator vs. SPADE generator, both with the FPSE discriminator), and (4) with (6) (CC generator vs. SPADE generator, both with the MsPatch+SE discriminator), the results indicate that our conditional convolutions are able to better exploit the semantic layout information for adaptively generating high-quality images.

Feature pyramid weight prediction network. Next, we replace the feature pyramid structure in the weight prediction network with a stack of two convolutional layers; these settings are denoted as "w/ FP" and "w/o FP", respectively. Comparing (1) with (2), and (3) with CC-FPSE (Ours), we find that removing the feature pyramid structure from the weight prediction network leads to inferior performance, indicating that global and long-range information is necessary for predicting the convolutional weights.

FPSE Discriminator. We fix the generator ("CC w/ FP" or "SPADE w/ FP") and test different designs of the discriminator to demonstrate the effectiveness of our FPSE discriminator. We enforce spatial semantic alignment with the semantic layout via the introduced semantics-embedding constraint on the discriminator. Comparing (2) with (6) indicates the effectiveness of the semantics-embedding discriminator: with the semantics-embedding constraint, the discriminator is driven to classify the correspondence between image patches and the semantic layout, so the generator is encouraged to generate images that are better aligned with the semantic layout. Furthermore, we replace the multi-scale discriminator with the feature pyramid structure, denoted as "FP+SE", which is our proposed discriminator design. The comparison between (6) and the last column, CC-FPSE (Ours), indicates that the feature pyramid discriminator structure, which combines low-level and semantic features at different scales, leads to a further performance improvement.

5 Conclusion

We propose a novel approach (CC-FPSE) for image synthesis from a given semantic layout, which makes better use of the semantic layout information to generate images with high-quality details and well-aligned semantics. Our generator better exploits the semantic layout to control the generation process by predicting spatially-varying weights for the conditional convolution layers. Our feature pyramid semantics-embedding discriminator guides the generator to generate images that contain high-fidelity details and align well with the conditional semantic layout. Our approach achieves state-of-the-art performance and generates photorealistic images on the Cityscapes, COCO-Stuff, and ADE20K datasets.

Acknowledgments

This work is supported in part by SenseTime Group Limited, in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118, CUHK14207319, CUHK14208417, CUHK14239816, and in part by CUHK Direct Grant. We thank Lu Sheng for proofreading and helpful suggestions on the paper.

References

  • A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §1, §2.
  • H. Caesar, J. Uijlings, and V. Ferrari (2018) Coco-stuff: thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1209–1218. Cited by: §4.1.
  • Q. Chen and V. Koltun (2017) Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1511–1520. Cited by: §2, §4.4, Table 1.
  • F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258. Cited by: §1, §3.1.1.
  • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §4.1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, Cited by: §1, §2.
  • D. Ha, A. Dai, and Q. V. Le (2016) Hypernetworks. arXiv preprint arXiv:1609.09106. Cited by: §2.
  • A. W. Harley, K. G. Derpanis, and I. Kokkinos (2017) Segmentation-aware convolutional networks using local attention masks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5038–5047. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §4.1.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1.
  • X. Hu, H. Mu, X. Zhang, Z. Wang, J. Sun, and T. Tan (2019) Meta-sr: a magnification-arbitrary network for super-resolution. arXiv preprint arXiv:1903.00875. Cited by: §2.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §1, §2, §2.
  • X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool (2016) Dynamic filter networks. In Advances in Neural Information Processing Systems, pp. 667–675. Cited by: §2.
  • Y. Jo, S. Wug Oh, J. Kang, and S. Joo Kim (2018) Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3224–3232. Cited by: §2.
  • T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §1.
  • T. Karras, S. Laine, and T. Aila (2018) A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948. Cited by: §1, §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • Z. Li, R. Tao, E. Gavves, C. G. Snoek, and A. W. Smeulders (2017) Tracking by natural language specification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6495–6503. Cited by: §2.
  • T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125. Cited by: §3.1.2, §3.2.1.
  • M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pp. 700–708. Cited by: §2.
  • M. Lucic, M. Tschannen, M. Ritter, X. Zhai, O. Bachem, and S. Gelly (2019) High-fidelity image generation with fewer labels. arXiv preprint arXiv:1903.02271. Cited by: §1, §2.
  • N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131. Cited by: §1.
  • T. Miyato and M. Koyama (2018) CGANs with projection discriminator. arXiv preprint arXiv:1802.05637. Cited by: §3.2.2.
  • T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. arXiv preprint arXiv:1903.07291. Cited by: §1, §1, §2, §3.1.2, §3.2, §3.3, §4.1, §4.2, §4.3, §4.3, §4.4, §4.5, Table 1, footnote 2.
  • X. Qi, Q. Chen, J. Jia, and V. Koltun (2018) Semi-parametric image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8808–8816. Cited by: §2, Table 1.
  • F. Shen, S. Yan, and G. Zeng (2018) Neural style transfer via meta networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8061–8069. Cited by: §2.
  • H. Su, V. Jampani, D. Sun, O. Gallo, E. Learned-Miller, and J. Kautz (2019) Pixel-adaptive convolutional neural networks. arXiv preprint arXiv:1904.05373. Cited by: §2.
  • T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807. Cited by: §1, §1, §2, §3.2.1, §3.2, §3.3, §4.3, §4.4, Table 1.
  • J. Wu, D. Li, Y. Yang, C. Bajaj, and X. Ji (2018) Dynamic filtering with large sampling field for convnets. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 185–200. Cited by: §2.
  • T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He (2018) Attngan: fine-grained text to image generation with attentional generative adversarial networks. CVPR. Cited by: §2.
  • T. Xue, J. Wu, K. Bouman, and B. Freeman (2016) Visual dynamics: probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, pp. 91–99. Cited by: §2.
  • H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §1, §2.
  • H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas (2017) Stackgan: text to photo-realistic image synthesis with stacked generative adversarial networks. ICCV. Cited by: §2.
  • F. Zhao, J. Zhao, S. Yan, and J. Feng (2018) Dynamic conditional networks for few-shot learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–35. Cited by: §2.
  • B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 633–641. Cited by: §4.1.

6 Appendix

Examples of images generated by our approach are shown in Figure 5. Our proposed approach is able to synthesize images of diverse scenes. Moreover, we show semantic image synthesis results compared with the previous approaches pix2pixHD and SPADE in Figure 6. Some differences between the images generated by different approaches are highlighted with red boxes. Our proposed approach generates high-quality images with fine details. It can generate small objects based on the label map, which previous approaches are likely to ignore. For example, in the first row of Figure 6, our approach generates a driver inside the bus based on the semantic layout, while the other approaches fail to generate the driver.

Figure 5: Semantic image synthesis results by our proposed approach. Best viewed in color. Zoom in for details.
Figure 6: Semantic image synthesis results by previous approaches and our approach. Best viewed in color. Zoom in for details.