Semantic Bottleneck Scene Generation

by   Samaneh Azadi, et al.
berkeley college

Coupling the high-fidelity generation capabilities of label-conditional image synthesis methods with the flexibility of unconditional generative models, we propose a semantic bottleneck GAN model for unconditional synthesis of complex scenes. We assume pixel-wise segmentation labels are available during training and use them to learn the scene structure. During inference, our model first synthesizes a realistic segmentation layout from scratch, then synthesizes a realistic scene conditioned on that layout. For the former, we use an unconditional progressive segmentation generation network that captures the distribution of realistic semantic scene layouts. For the latter, we use a conditional segmentation-to-image synthesis network that captures the distribution of photo-realistic images conditioned on the semantic layout. When trained end-to-end, the resulting model outperforms state-of-the-art generative models in unsupervised image synthesis on two challenging domains in terms of the Frechet Inception Distance and user-study evaluations. Moreover, we demonstrate the generated segmentation maps can be used as additional training data to strongly improve recent segmentation-to-image synthesis networks.







1 Introduction

Significant strides have been made on generative models for image synthesis, with a variety of methods based on Generative Adversarial Networks (GANs) [14] achieving state-of-the-art performance. At lower resolutions or in specialized domains, GAN-based methods are able to synthesize samples which are near-indistinguishable from real samples [7]. However, generating complex, high-resolution scenes from scratch remains a challenging problem. As image resolution and complexity increase, the coherence of synthesized images decreases: samples contain convincing local textures, but lack a consistent global structure.

Stochastic decoder-based models, such as conditional GANs, were recently proposed to alleviate some of these issues. In particular, both Pix2PixHD [39] and SPADE [32] are able to synthesize high-quality scenes using a strong conditioning mechanism based on semantic segmentation labels during the scene generation process. Global structure encoded in the segmentation layout of the scene is what allows these models to focus primarily on generating convincing local content consistent with that structure.

A key practical drawback of such conditional models is that they require full segmentation layouts as input. Thus, unlike unconditional generative approaches which synthesize images from randomly sampled noise, these models are limited to generating images from a set of scenes that is prescribed in advance, typically either through segmentation labels from an existing dataset, or scenes that are hand-crafted by experts.

Figure 1: We adversarially train the segmentation synthesis network to generate realistic segmentation maps, and then use a conditional image synthesis network to generate the final image. Fine-tuning these two components end-to-end results in state-of-the-art unconditional synthesis of complex scenes.

To overcome these limitations, we propose a new model, the Semantic Bottleneck GAN, which couples high-fidelity generation capabilities of label-conditional models with the flexibility of unconditional image generation. This in turn enables our model to synthesize an unlimited number of novel complex scenes, while still maintaining high-fidelity output characteristic of image-conditional models.

Our Semantic Bottleneck GAN first unconditionally generates a pixel-wise semantic label map of a scene (i.e. for each spatial location it outputs a class label), and then generates a realistic scene image by conditioning on that semantic map. By factorizing the task into these two steps, we are able to separately tackle the problems of producing convincing segmentation layouts (i.e. a useful global structure) and filling these layouts with convincing appearances (i.e. local structure). When trained end-to-end, the model yields samples which have a coherent global structure as well as fine local details. Empirical evaluation shows that our Semantic Bottleneck GAN achieves a new state-of-the-art on two complex datasets, Cityscapes and ADE-Indoor, as measured both by the Fréchet Inception Distance (FID) and by user studies. Additionally, we observe that the synthesized segmentation label maps produced as part of the end-to-end image synthesis process in Semantic Bottleneck GAN can also be used to improve the performance of the state-of-the-art semantic image synthesis network [32], resulting in higher-quality outputs when conditioning on ground truth segmentation layouts. Our code will be available at

Figure 2: Schematic of Semantic Bottleneck GAN. Starting from random noise, we synthesize a segmentation layout and use a discriminator to bias the segmentation synthesis network towards realistic looking segmentation layouts. The generated layout is then provided as input to a conditional image synthesis network to synthesize the final image. A second discriminator is used to bias the conditional image synthesis network towards realistic images paired with real segmentation layouts. Finally, a third unconditional discriminator is used to bias the conditional image synthesis network towards generating images that match the ground truth.

2 Related Work

Generative Adversarial Networks (GANs)

GANs [14] are a powerful class of implicit generative models successfully applied to various image synthesis tasks such as image style transfer [18, 47], unsupervised representation learning [11, 33, 36], image super-resolution [23, 13], and text-to-image synthesis [45, 40, 34]. Training GANs is notoriously hard and recent efforts focused on improving neural architectures [21, 44, 9], loss functions [1], regularization [15, 30], large-scale training [7], self-supervision [10], and sampling [7, 4]. One compelling approach which enables generation of high-resolution images is based on progressive training: a model is trained to first synthesize lower-resolution images, then the resolution is gradually increased until the desired resolution is achieved [21]. Recently, BigGAN [7] showed that GANs significantly benefit from large-scale training, both in terms of model size and batch size. We note that these models are able to synthesize high-quality images in settings where objects are very prominent and centrally placed or follow some well-defined structure, as the corresponding distribution is easier to capture. In contrast, when the scenes are more complex and the amount of data is limited, the task becomes extremely challenging for these state-of-the-art models. The aim of this work is to improve the performance in the context of complex scenes and a small number of training examples.

GANs on discrete domains

GANs for discrete domains have been investigated in several works [22, 43, 24, 6, 26]. Training in this domain is even more challenging as the samples from discrete distributions are not differentiable with respect to the network parameters. This problem can be somewhat alleviated by using the Gumbel-softmax distribution, which is a continuous approximation to a multinomial distribution parameterized in terms of the softmax function [22]. We will show how to apply a similar principle to learn the distribution of discrete segmentation masks.

Conditional image synthesis

In conditional image synthesis one aims to generate images by conditioning on an input which can be provided in the form of an image [18, 47, 3, 5, 25], a text phrase [37, 45, 35, 2, 17], a scene graph [20, 2], a class label or a semantic layout [31, 8, 39, 32]. These conditional GAN methods learn a mapping that translates samples from the source distribution into samples from the target domain.

The text-to-image synthesis model proposed in [17] decomposes the synthesis task into multiple steps. First, given the text description, a semantic layout is constructed by generating object bounding boxes and refining each box by estimating object shapes. Then, an image is synthesized conditioned on the generated semantic layout from the first step. Our work shares the same high-level idea of decomposing the image generation problem into the semantic layout synthesis and the conditional semantic-layout-to-image synthesis. A key difference is that we focus on unconditional image generation which results in a novel semantic layout generation pipeline and end-to-end network design.

3 Semantic Bottleneck GAN (SB-GAN)

We propose an unconditional Semantic Bottleneck GAN architecture to learn the distribution of complex scenes. To tackle the problems of learning both the global layout and the local structure, we divide this synthesis problem into two parts: an unconditional segmentation map synthesis network and a conditional segmentation-to-image synthesis model. Our first network is designed to coarsely learn the scene distribution by synthesizing semantic layouts. It generates per-pixel semantic categories following the progressive GAN model architecture (ProGAN) [21]. The second network populates the synthesized semantic layouts with texture by predicting RGB pixel values using Spatially-Adaptive Normalization (SPADE) [32], following the architecture of the state-of-the-art semantic synthesis network in [32]. We assume the ground truth segmentation masks are available for all or part of the target scene dataset. In the following sections, we will first discuss our semantic bottleneck synthesis pipeline and summarize the SPADE network for image synthesis. We will then couple these two networks in an end-to-end final design which we refer to as Semantic Bottleneck GAN (SB-GAN).

3.1 Semantic bottleneck synthesis

Our goal here is to learn a (coarse) estimate of the scene distribution from samples corresponding to real segmentation maps with $K$ semantic categories. Starting from random noise, we generate a tensor $Y \in \{1, \dots, K\}^{N \times H \times W}$ which represents a per-pixel segmentation class, with $H$ and $W$ indicating the height and width, respectively, of the generated map and $N$ the batch size. In practice, we progressively train from a low to a high resolution using the ProGAN architecture [21] coupled with the Improved WGAN loss function [15] on the ground truth discrete-valued segmentation maps. In contrast to ProGAN, in which the generator outputs continuous RGB values, we predict per-pixel discrete semantic class labels. This task is extremely challenging as it requires the network to capture the intricate relationship between segmentation classes and their spatial dependencies. To this end, we apply the Gumbel-softmax trick [19, 29] coupled with a straight-through estimator [19], which we describe in detail below.

Applying a softmax function to the last layer of the generator (i.e. logits) leads to an output that can be interpreted as a probability score for each pixel belonging to each of the $K$ semantic classes. This results in probability maps $P^{ij} \in [0, 1]^K$, with $\sum_{k=1}^{K} P^{ij}_k = 1$ for each spatial location $(i, j)$. To sample a semantic class from this multinomial distribution, we would ideally apply the following well-known procedure at each spatial location: (1) sample $K$ i.i.d. samples, $G_1, \dots, G_K$, from the standard Gumbel distribution, (2) add these samples to each logit, and (3) take the index of the maximal value. This reparametrization indeed allows for an efficient forward-pass, but is not differentiable. Nevertheless, the max operator can be replaced with the softmax function and the quality of the approximation can be controlled by varying the temperature hyperparameter $\tau$; the smaller $\tau$, the closer the approximation is to the categorical distribution [19]:

$$S^{ij}_k = \frac{\exp\{(\log P^{ij}_k + G_k)/\tau\}}{\sum_{l=1}^{K} \exp\{(\log P^{ij}_l + G_l)/\tau\}}. \tag{1}$$

Similar to the real samples, the synthesized samples fed to the GAN discriminator should still contain discrete category labels. As a result, for the forward pass, we simply compute $\arg\max_k S^{ij}_k$, while for the backward pass, we use the soft predicted scores directly, a strategy also known as straight-through estimation [19].
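The sampling procedure described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name, the (K, H, W) array layout, and the lower clipping constant for the uniform samples are assumptions made for the example.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Draw a relaxed and a discrete class sample for every pixel.

    logits: array of shape (K, H, W) holding unnormalized class scores
            (hypothetical layout chosen for this sketch).
    Returns (soft, hard), both of shape (K, H, W):
      soft sums to 1 over the class axis (temperature-scaled softmax),
      hard is the one-hot encoding of the per-pixel argmax of soft.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Standard Gumbel noise: G = -log(-log(U)), with U ~ Uniform(0, 1).
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    # Perturb the logits, scale by the temperature, softmax over classes.
    y = (logits + gumbel) / tau
    y = y - y.max(axis=0, keepdims=True)  # subtract max for stability
    soft = np.exp(y) / np.exp(y).sum(axis=0, keepdims=True)
    # Discrete labels used in the forward pass, encoded as one-hot maps.
    labels = soft.argmax(axis=0)
    hard = np.eye(logits.shape[0])[labels].transpose(2, 0, 1)
    return soft, hard
```

NumPy has no automatic differentiation, so only the forward computation is shown. In an autodiff framework, a common way to obtain the straight-through behaviour is to output `hard + soft - soft_detached`, where `soft_detached` is a copy of `soft` excluded from gradient computation: the forward value equals `hard`, while gradients flow through `soft`.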

3.2 Semantic image synthesis

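Our second sub-network converts the synthesized semantic layouts into photo-realistic images using spatially-adaptive normalization [32]. The segmentation masks are employed to spread the semantic information throughout the generator by modulating the activations with a spatially adaptive learned transformation. We follow the same generator and discriminator architectures and loss functions used in [32], where the generator contains a series of SPADE residual blocks with upsampling layers. The loss functions to train SPADE are summarized as:

$$\mathcal{L}_{D_{SPD}} = -\mathbb{E}_{(x, y)}\left[\min(0, -1 + D_{SPD}(y, x))\right] - \mathbb{E}_{y}\left[\min(0, -1 - D_{SPD}(y, G_{SPD}(z, y)))\right],$$
$$\mathcal{L}_{G_{SPD}} = -\mathbb{E}_{y}\left[D_{SPD}(y, G_{SPD}(z, y))\right] + \lambda_1 \mathcal{L}_1^{VGG} + \lambda_2 \mathcal{L}_1^{FM}, \tag{2}$$

where $G_{SPD}$, $D_{SPD}$ stand for the SPADE generator and discriminator, $z$ is the input noise, $\lambda_1$ and $\lambda_2$ are weighting coefficients, and $\mathcal{L}_1^{VGG}$ and $\mathcal{L}_1^{FM}$ represent the VGG and discriminator feature matching loss functions, respectively [32, 39]. We pre-train this network using pairs of real RGB images, $x$, and their corresponding real segmentation masks, $y$, from the target scene data set.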
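The adversarial part of the SPADE objective can be sketched as follows, assuming the hinge formulation used in [32]. The function and argument names are hypothetical, the discriminator scores are taken as given arrays, and the VGG and feature matching terms are passed in as precomputed scalars rather than evaluated here.

```python
import numpy as np

def discriminator_hinge_loss(real_scores, fake_scores):
    """Hinge loss for a discriminator.

    Equivalent to -E[min(0, -1 + D(real))] - E[min(0, -1 - D(fake))],
    rewritten with max for readability.
    """
    real_term = np.maximum(0.0, 1.0 - real_scores).mean()
    fake_term = np.maximum(0.0, 1.0 + fake_scores).mean()
    return float(real_term + fake_term)

def generator_loss(fake_scores, vgg_loss, fm_loss, lambda_vgg, lambda_fm):
    """Adversarial term plus weighted VGG and feature matching terms.

    lambda_vgg and lambda_fm are illustrative weights; their values
    are not taken from the paper.
    """
    adversarial = -np.mean(fake_scores)
    return float(adversarial + lambda_vgg * vgg_loss + lambda_fm * fm_loss)
```

The discriminator is penalized only for real scores below 1 and fake scores above -1, so confidently classified samples contribute no gradient.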

In the next section, we will describe how to employ the synthesized segmentation masks in an end-to-end manner to improve the performance of both the semantic bottleneck and the semantic image synthesis sub-networks.

3.3 End-to-end framework

After training the semantic bottleneck synthesis model to synthesize segmentation masks and the semantic image synthesis model to stochastically map segmentations to photo-realistic images, we adversarially fine-tune the parameters of both networks in an end-to-end approach by introducing an unconditional discriminator network on top of the SPADE generator (see Figure 2).

This second discriminator, $D_2$, has the same architecture as the SPADE discriminator, but is designed to distinguish between real RGB images and the fake ones generated from the synthesized semantic layouts. Unlike the SPADE conditional GAN loss in Section 3.2, which examines pairs of input segmentations and output images, the GAN loss on $D_2$, $\mathcal{L}_{D_2}$, is unconditional and only compares real images to synthesized ones, as shown in equation 3:

$$\mathcal{L}_{D_2} = -\mathbb{E}_{x}\left[\min(0, -1 + D_2(x))\right] - \mathbb{E}_{z}\left[\min(0, -1 - D_2(G(z)))\right],$$
$$\mathcal{L}_{G} = -\mathbb{E}_{z}\left[D_2(G(z))\right] + \mathcal{L}_{G_{SPD}} + \lambda \mathcal{L}_{G_{SB}}, \qquad G(z) = G_{SPD}(z, G_{SB}(z)), \tag{3}$$

where $G_{SB}$ represents the semantic bottleneck synthesis generator, $\mathcal{L}_{G_{SPD}}$ is the SPADE generator loss from Section 3.2, and $\mathcal{L}_{G_{SB}}$ is the improved WGAN loss used to pretrain $G_{SB}$ described in Section 3.1. In contrast to the conditional discriminator in SPADE, which enforces consistency between the input semantic map and the output image, $D_2$ is primarily concerned with the overall quality of the final output. The hyperparameter $\lambda$ determines the ratio between the two generators during fine-tuning. The parameters of both generators, $G_{SB}$ and $G_{SPD}$, as well as the corresponding discriminators, $D_{SB}$ and $D_{SPD}$, are updated in this end-to-end fine-tuning.
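How the terms are combined during fine-tuning can be sketched as below. This assumes that the unconditional discriminator uses the same hinge formulation as SPADE; the function names are hypothetical and the individual generator losses are treated as precomputed scalars.

```python
import numpy as np

def d2_hinge_loss(real_image_scores, fake_image_scores):
    """Unconditional hinge loss: scores depend on images only,
    not on the segmentation layout they were generated from."""
    real_term = np.maximum(0.0, 1.0 - real_image_scores).mean()
    fake_term = np.maximum(0.0, 1.0 + fake_image_scores).mean()
    return float(real_term + fake_term)

def end_to_end_generator_loss(d2_fake_scores, spade_g_loss, sb_g_loss, lam):
    """Unconditional adversarial term, plus the SPADE generator loss,
    plus the bottleneck generator loss weighted by lam."""
    adversarial = -np.mean(d2_fake_scores)
    return float(adversarial + spade_g_loss + lam * sb_g_loss)
```

A larger `lam` puts more weight on keeping the generated layouts realistic under the segmentation discriminator, while a smaller value lets the image-level signal dominate the update of the bottleneck generator.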

We illustrate our final end-to-end network in Figure 2. Jointly fine-tuning the two networks in an end-to-end fashion allows the two networks to reinforce each other, leading to improved performance. The gradients with respect to RGB images synthesized by SPADE are back-propagated to the segmentation synthesis model, thereby encouraging it to synthesize segmentation layouts that lead to higher quality final images. Hence, SPADE plays the role of a loss function for synthesizing segmentations, but in the RGB space, providing a goal that was absent from the initial training. Similarly, fine-tuning SPADE with synthesized segmentations allows it to adapt to a more diverse set of scene layouts, which improves the quality of generated samples.

4 Experiments and Results

We evaluate the performance of the proposed approach on two datasets containing images with complex scenes, where the ground truth segmentation masks are available during training (possibly only for a subset of the images). We also study the role of the two network components, semantic bottleneck and semantic image synthesis, on the final result. We compare the performance of SB-GAN against the state-of-the-art BigGAN model [7] as well as a ProGAN [21] baseline that has been trained on the RGB images directly. We evaluate our method using Fréchet Inception Distance (FID) as well as a user study.

Figure 3: Images synthesized by different methods trained on Cityscapes-5K. Zoom in for more detail. Although both models capture the general scene layout, SB-GAN (1st row) generates more convincing objects such as buildings and cars.
Figure 4: Images synthesized by different methods trained on Cityscapes-25K. Zoom in for more detail. Images synthesized by BigGAN (3rd row) are blurry and sometimes defective in local structures.


We study the performance of our model on the Cityscapes and ADE-Indoor datasets, two domains with complex scene images.

  • Cityscapes-5K [12] contains street scene images in German cities with training and validation set sizes of 3,000 and 500 images, respectively. Ground truth segmentation masks with 33 semantic classes are available for all images in this dataset.

  • Cityscapes-25K [12] contains street scene images in German cities with training and validation set sizes of 23,000 and 500 images, respectively. Cityscapes-5K is a subset of this dataset, providing 3,000 images in the training set here as well as the entire validation set. Fine ground truth annotations are only provided for this subset, with the remaining 20,000 training images containing only coarse annotations. We extract the corresponding fine annotations for the rest of the training images using the state-of-the-art segmentation model [42, 41] trained on the annotated training samples from Cityscapes-5K. This dataset contains 19 semantic classes.

  • ADE-Indoor is a subset of the ADE20K dataset [46] containing 4,377 challenging training images from indoor scenes and 433 validation images with 95 semantic categories.

Figure 5: Images synthesized by different methods trained on ADE-Indoor. This dataset is very challenging, causing mode collapse for the BigGAN model (3rd row). In contrast, samples generated by SB-GAN (1st row) are generally of higher quality and much more structured than those of ProGAN (2nd row).


We use the Fréchet Inception Distance (FID) [16] as well as a user study to evaluate the quality of the generated samples. To compute FID, the real data and generated samples are first embedded in a specific layer of a pre-trained Inception network. Then, a multivariate Gaussian is fit to the data, and the distance is computed as $\mathrm{FID}(x, g) = \|\mu_x - \mu_g\|_2^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}\right)$, where $\mu$ and $\Sigma$ denote the empirical mean and covariance, and subscripts $x$ and $g$ denote the real and generated data respectively. FID is shown to be sensitive to both the addition of spurious modes and to mode dropping [38, 27]. On the Cityscapes dataset, we ran five trials where we computed FID on 500 random synthetic images and 500 real validation images, and report the average score. On ADE-Indoor, the same process is repeated on batches of 433 images.
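The distance can be computed directly from two sets of embeddings. The following is a minimal NumPy sketch, assuming the Inception features have already been extracted (that step is omitted) and stored as arrays of shape (n, d) with d of at least 2. The function name is illustrative, and evaluating the trace of the matrix square root through the eigenvalues of the covariance product is one of several possible choices.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_real, feats_gen: arrays of shape (n_samples, d), d >= 2.
    """
    mu_x = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    sigma_x = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # The product of two positive semi-definite matrices has real,
    # non-negative eigenvalues, and the trace of its square root is the
    # sum of their square roots. Clip small numerical negatives to zero.
    eigvals = np.linalg.eigvals(sigma_x @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_x - mu_g) ** 2)
    cov_term = np.trace(sigma_x) + np.trace(sigma_g) - 2.0 * trace_sqrt
    return float(mean_term + cov_term)
```

Two identical feature sets give a distance of zero up to floating-point error, and shifting one set by a constant vector changes only the mean term.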

Implementation details

In all our experiments, we set , and . The initial generator and discriminator learning rates for training SPADE both in the pretraining and end-to-end steps are and , respectively. The learning rate for the semantic bottleneck synthesis sub-network is set to in the pretraining step and to in the end-to-end fine-tuning on Cityscapes, and to for ADE-Indoor. The temperature hyperparameter, , is always set to 1. For BigGAN, we followed the setup in [28]111Configuration as in, where we modified the code to allow for non-square images of Cityscapes. We used one class label for all images to have an unconditional BigGAN model. For both datasets we varied the batch size (using values in ), the learning rate, and the location of the self-attention block. We trained the final model for K iterations.

4.1 Qualitative results

In Figures 3, 4, and 5, we provide qualitative comparisons of the competing methods on the three aforementioned datasets. We observe that both Cityscapes-5K and ADE-Indoor are very challenging for the state-of-the-art ProGAN and BigGAN models, likely due to the complexity of the data and small number of training instances. Even at a resolution of on the ADE-Indoor dataset, BigGAN suffers from mode collapse, as illustrated in the last row of Figure 5. In contrast, SB-GAN significantly improves on the structure of the scene distribution, and provides samples of higher quality. On Cityscapes-25K, the performance improvement of SB-GAN is more modest due to the large number of training images available. It is worth emphasizing that in this case only 3K ground truth segmentations for training SB-GAN are available. Compared to BigGAN, images synthesized by SB-GAN are sharper and contain more structural details (e.g., one can zoom in on the synthesized cars). More qualitative examples are presented in the Appendix.

4.2 Quantitative evaluation

To provide a thorough empirical evaluation of the proposed approach, we generate samples for each dataset and report the FID scores of the resulting images (averaged across 5 sets of generated samples). We evaluate SB-GAN both before and after end-to-end fine-tuning, and compare our method to two strong baselines, ProGAN [21] and BigGAN [7]. The results are detailed in Tables 1 and 2.

Dataset           ProGAN   SB-GAN w/o FT   SB-GAN
Cityscapes-5k      92.57           83.20    65.49
Cityscapes-25k     63.87           71.13    62.97
ADE-Indoor        104.83           91.80    85.27
Table 1: FID of the synthesized samples (lower is better), averaged over 5 random sets of samples. Images were synthesized at resolution of on Cityscapes and on ADE-Indoor.
Figure 6: The effect of fine-tuning on the baseline setup for the Cityscapes-25K dataset. We observe that both the global structure of the segmentations and the performance of semantic image synthesis improve after fine-tuning, resulting in images of higher quality.

First, in the low-data regime, even without fine-tuning, our Semantic Bottleneck GAN produces higher quality samples and significantly outperforms the baselines on Cityscapes-5K and ADE-Indoor. The advantage of our proposed method is even more striking on smaller datasets. While competing methods are unable to learn a high-quality model of the underlying distribution without having access to a large number of samples, SB-GAN is less sensitive to the number of training data points. Secondly, we observe that by jointly training the semantic bottleneck and image synthesis components, SB-GAN produces state-of-the-art results across all three datasets.

We were not able to successfully train BigGAN at a resolution of due to instability observed during training and mode collapse. Table 2, however, shows the results for a lower-resolution setting, for which we were able to successfully train BigGAN. We report the results before the training collapses. BigGAN is, to a certain extent, able to capture the distribution of Cityscapes-25K, but fails completely on ADE-Indoor. Interestingly, BigGAN fails to capture the distribution of Cityscapes-5K even at resolution. The standard deviation of the FID scores computed in Tables 1 and 2 is within of the mean for Cityscapes and within of the mean for ADE-Indoor.

Dataset           ProGAN   BigGAN   SB-GAN
Cityscapes-25k     56.7     64.82    54.92
ADE-Indoor         85.94   156.65    81.39
Table 2: FID of the synthesized samples (lower is better), averaged over 5 random sets of samples. Images were synthesized at resolution of on Cityscapes and on ADE-Indoor.

Generating by conditioning on real segmentations

To independently assess the impact of end-to-end training on the conditional image synthesis sub-network, we evaluate the quality of generated samples when conditioning on ground truth validation segmentations from each dataset. Comparisons to the baseline network SPADE [32] are provided in Table 3 and Figure 8. We observe that the image synthesis component of SB-GAN consistently outperforms SPADE across all three datasets, indicating that fine-tuning on synthetic labels produced by the segmentation generator improves the conditional image generator. Please refer to the Appendix for more qualitative examples.

Dataset           SPADE   SB-GAN
Cityscapes-5k     72.12    60.39
Cityscapes-25k    60.83    54.13
ADE-Indoor        50.30    48.15
Table 3: FID of the synthesized samples when conditioned on the ground truth labels (lower is better), averaged over 5 random sets of samples. For SB-GAN, we train the entire model end-to-end, extract the trained SPADE sub-network, and synthesize samples conditioned on the ground truth labels.
Figure 7: The effect of fine-tuning (FT) on the baseline setup for ADE-Indoor dataset. We observe that both the global structure of the segmentations and the performance of semantic image synthesis have been improved after fine-tuning, resulting in images of higher quality.
Figure 8: The effect of SB-GAN on improving the performance of the state-of-the-art semantic image synthesis model (SPADE [32]) on ground truth segmentations of Cityscapes-25K (left) and ADE-Indoor (right) validation sets. For SB-GAN, we train the entire model end-to-end, extract the trained SPADE sub-network, and synthesize samples conditioned on the ground truth labels.

Fine-tuning ablation study

To further dissect the effect of end-to-end training, we perform a study on different components of SB-GAN. In particular, we consider four settings: (1) SB-GAN before end-to-end fine-tuning, (2) fine-tuning only the semantic bottleneck synthesis component, (3) fine-tuning only the conditional image synthesis component, and (4) fine-tuning all components jointly. The results on the Cityscapes-5K dataset (resolution ) are reported in Table 4. Finally, the impact of fine-tuning on the quality of samples can be observed in Figures 6 and 11.

No fine-tuning   Bottleneck only   Image synthesis only   End-to-end
         70.15             66.22                  63.04        58.67
Table 4: Ablation study of various components of SB-GAN. We report FID scores of SB-GAN before fine-tuning, fine-tuning only the semantic bottleneck synthesis component, fine-tuning only the image synthesis component, and full end-to-end fine-tuning. Experiments are performed on the Cityscapes-5K dataset at a resolution of .

4.3 Human evaluation

We used Amazon Mechanical Turk (AMT) to study and compare the performance of different methods in terms of user assessments. We evaluate the performance of each model on each dataset through 600 pairs of (synthesized images, evaluators) containing 200 unique synthesized images. For each image, evaluators were asked to select a quality score from 1 to 4, indicating terrible and high quality images, respectively. Results are summarized in Table 5 and are consistent with FID-based evaluations, with SB-GAN as the winner in all datasets once again.

Dataset           ProGAN   BigGAN   SB-GAN
Cityscapes-5k       2.08        -     2.48
Cityscapes-25k      2.53     2.27     2.61
ADE-Indoor          2.35     1.96     2.49
Table 5: Average user evaluation scores when each user has selected a quality score in the range of 1 (terrible quality) to 4 (high quality) for each image.

5 Conclusion

We proposed an end-to-end Semantic Bottleneck GAN model that synthesizes semantic layouts from scratch, and then generates photo-realistic scenes conditioned on the synthesized layouts. Through extensive quantitative and qualitative evaluations, we showed that this novel end-to-end training pipeline significantly outperforms the state-of-the-art models in unconditional synthesis of complex scenes. In addition, Semantic Bottleneck GAN strongly improves the performance of the state-of-the-art semantic image synthesis model in synthesizing photo-realistic images from ground truth segmentations.

We believe that the idea of applying a semantic bottleneck to other generative models should be explored in future work. In addition, novel ways to train GANs with discrete outputs could be explored, especially techniques to deal with the non-differentiable nature of the generated outputs.


This work was supported by Google through Google Cloud Platform research credits. We thank Marvin Ritter for help with issues related to the compare_gan library [27]. We are grateful to the members of BAIR for fruitful discussions. SA is supported by the Facebook graduate fellowship.


  • [1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. 2017.
  • [2] Oron Ashual and Lior Wolf. Specifying object attributes and relations in interactive scene generation. In ICCV, 2019.
  • [3] Samaneh Azadi, Matthew Fisher, Vladimir Kim, Zhaowen Wang, Eli Shechtman, and Trevor Darrell. Multi-content gan for few-shot font style transfer. CVPR, 2018.
  • [4] Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian Goodfellow, and Augustus Odena. Discriminator Rejection Sampling. arXiv preprint arXiv:1810.06758, 2019.
  • [5] Samaneh Azadi, Deepak Pathak, Sayna Ebrahimi, and Trevor Darrell. Compositional gan: Learning image-conditional binary composition. arXiv preprint arXiv:1807.07560, 2019.
  • [6] Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. Netgan: Generating graphs via random walks. arXiv preprint arXiv:1803.00816, 2018.
  • [7] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. 2019.
  • [8] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In ICCV, 2017.
  • [9] Ting Chen, Mario Lucic, Neil Houlsby, and Sylvain Gelly. On self modulation for generative adversarial networks. In ICLR, 2019.
  • [10] Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, and Neil Houlsby. Self-supervised gans via auxiliary rotation loss. In CVPR, 2019.
  • [11] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, 2016.
  • [12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [13] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
  • [14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
  • [15] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In NeurIPS, 2017.
  • [16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. In NeurIPS, 2017.
  • [17] Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In CVPR, 2018.
  • [18] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [19] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In ICLR, 2017.
  • [20] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In CVPR, 2018.
  • [21] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2017.
  • [22] Matt J Kusner and José Miguel Hernández-Lobato. GANs for sequences of discrete elements with the Gumbel-softmax distribution. arXiv preprint arXiv:1611.04051, 2016.
  • [23] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
  • [24] Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. Adversarial ranking for language generation. In NeurIPS, 2017.
  • [25] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NeurIPS, 2017.
  • [26] Sidi Lu, Yaoming Zhu, Weinan Zhang, Jun Wang, and Yong Yu. Neural text generation: past, present and beyond.
  • [27] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs Created Equal? A Large-scale Study. In NeurIPS, 2018.
  • [28] Mario Lucic, Michael Tschannen, Marvin Ritter, Xiaohua Zhai, Olivier Bachem, and Sylvain Gelly. High-fidelity image generation with fewer labels. In ICML, 2019.
  • [29] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2016.
  • [30] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
  • [31] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.
  • [32] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
  • [33] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
  • [34] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. MirrorGAN: Learning text-to-image generation by redescription. In CVPR, 2019.
  • [35] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. MirrorGAN: Learning text-to-image generation by redescription. In CVPR, 2019.
  • [36] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
  • [37] Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. In NeurIPS, 2016.
  • [38] Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. In NeurIPS, 2018.
  • [39] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, 2018.
  • [40] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018.
  • [41] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
  • [42] Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. Dilated residual networks. In CVPR, 2017.
  • [43] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.
  • [44] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019.
  • [45] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • [46] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017.
  • [47] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.


Appendix A Additional results

In Figures 9, 10, 11, and 12, we show additional synthetic results from our proposed SB-GAN model, including both the synthesized segmentations and their corresponding synthesized images, on the Cityscapes-25K and ADE-Indoor datasets. As mentioned in the paper, on the Cityscapes-25K dataset, fine ground truth annotations are only provided for the Cityscapes-5K subset. We extract the corresponding fine annotations for the rest of the training images using the state-of-the-art segmentation model [42, 41] trained on the annotated training samples from Cityscapes-5K.
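
The annotation step described above can be illustrated with a minimal sketch. It assumes the segmentation model outputs per-pixel class scores of shape (H, W, C) and that a label map is obtained by a per-pixel argmax; the function names and the `predict_scores` callable are hypothetical placeholders, not the authors' code or the API of the model in [42, 41].

```python
import numpy as np

def pseudo_label(scores: np.ndarray) -> np.ndarray:
    """Turn per-pixel class scores (H, W, C) into a label map (H, W)."""
    # Assumption: the predicted class of a pixel is the one with the highest score.
    return scores.argmax(axis=-1).astype(np.uint8)

def build_annotations(images, fine_labels, predict_scores):
    """Keep ground-truth labels where they exist, otherwise use model predictions.

    images:         list of image arrays
    fine_labels:    dict mapping image index -> ground-truth (H, W) label map
    predict_scores: hypothetical callable, image -> (H, W, C) class scores
    """
    annotations = []
    for idx, image in enumerate(images):
        if idx in fine_labels:
            # Annotated sample: use the provided fine annotation.
            annotations.append(fine_labels[idx])
        else:
            # Unannotated sample: fall back to the model's prediction.
            annotations.append(pseudo_label(predict_scores(image)))
    return annotations
```

In this sketch, the annotated subset plays the role of Cityscapes-5K and the remaining images receive predicted label maps, so every training image ends up paired with a segmentation.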

Moreover, Figures 13 and 14 present additional examples illustrating the impact of SB-GAN on improving the performance of SPADE [32], the state-of-the-art semantic image synthesis model, on ground truth segmentations. The third row in these two figures shows examples of the synthesized images conditioned on ground truth labels when the SPADE sub-network is extracted from a trained SB-GAN model.

Figure 9: Segmentations and their corresponding images synthesized by SB-GAN trained on the Cityscapes-25K dataset.
Figure 10: Segmentations and their corresponding images synthesized by SB-GAN trained on the Cityscapes-25K dataset.
Figure 11: Segmentations and their corresponding images synthesized by SB-GAN trained on the ADE-Indoor dataset.
Figure 12: Segmentations and their corresponding images synthesized by SB-GAN trained on the ADE-Indoor dataset.
Figure 13: The effect of SB-GAN on improving the performance of the state-of-the-art semantic image synthesis model (SPADE) on ground truth segmentations of the Cityscapes-25K validation set. For SB-GAN, we train the entire model end-to-end, extract the trained SPADE sub-network, and synthesize samples conditioned on the ground truth labels.
Figure 14: The effect of SB-GAN on improving the performance of the state-of-the-art semantic image synthesis model (SPADE) on ground truth segmentations of the ADE-Indoor validation set. For SB-GAN, we train the entire model end-to-end, extract the trained SPADE sub-network, and synthesize samples conditioned on the ground truth labels.