Semantic Bottleneck Scene Generation
Coupling the high-fidelity generation capabilities of label-conditional image synthesis methods with the flexibility of unconditional generative models, we propose a semantic bottleneck GAN model for unconditional synthesis of complex scenes. We assume pixel-wise segmentation labels are available during training and use them to learn the scene structure. During inference, our model first synthesizes a realistic segmentation layout from scratch, then synthesizes a realistic scene conditioned on that layout. For the former, we use an unconditional progressive segmentation generation network that captures the distribution of realistic semantic scene layouts. For the latter, we use a conditional segmentation-to-image synthesis network that captures the distribution of photo-realistic images conditioned on the semantic layout. When trained end-to-end, the resulting model outperforms state-of-the-art generative models in unsupervised image synthesis on two challenging domains in terms of the Frechet Inception Distance and user-study evaluations. Moreover, we demonstrate the generated segmentation maps can be used as additional training data to strongly improve recent segmentation-to-image synthesis networks.
Significant strides have been made on generative models for image synthesis, with a variety of methods based on Generative Adversarial Networks (GANs) achieving state-of-the-art performance. At lower resolutions or in specialized domains, GAN-based methods are able to synthesize samples which are near-indistinguishable from real samples. However, generating complex, high-resolution scenes from scratch remains a challenging problem. As image resolution and complexity increase, the coherence of synthesized images decreases: samples contain convincing local textures, but lack a consistent global structure.
Stochastic decoder-based models, such as conditional GANs, were recently proposed to alleviate some of these issues. In particular, both Pix2PixHD  and SPADE  are able to synthesize high-quality scenes using a strong conditioning mechanism based on semantic segmentation labels during the scene generation process. Global structure encoded in the segmentation layout of the scene is what allows these models to focus primarily on generating convincing local content consistent with that structure.
A key practical drawback of such conditional models is that they require full segmentation layouts as input. Thus, unlike unconditional generative approaches which synthesize images from randomly sampled noise, these models are limited to generating images from a set of scenes that is prescribed in advance, typically either through segmentation labels from an existing dataset, or scenes that are hand-crafted by experts.
To overcome these limitations, we propose a new model, the Semantic Bottleneck GAN, which couples high-fidelity generation capabilities of label-conditional models with the flexibility of unconditional image generation. This in turn enables our model to synthesize an unlimited number of novel complex scenes, while still maintaining high-fidelity output characteristic of image-conditional models.
Our Semantic Bottleneck GAN first unconditionally generates a pixel-wise semantic label map of a scene (i.e. for each spatial location it outputs a class label), and then generates a realistic scene image by conditioning on that semantic map. By factorizing the task into these two steps, we are able to separately tackle the problems of producing convincing segmentation layouts (i.e. a useful global structure) and filling these layouts with convincing appearances (i.e. local structure). When trained end-to-end, the model yields samples which have a coherent global structure as well as fine local details. Empirical evaluation shows that our Semantic Bottleneck GAN achieves a new state-of-the-art on two complex datasets, Cityscapes and ADE-Indoor, as measured both by the Fréchet Inception Distance (FID) and by user studies. Additionally, we observe that the synthesized segmentation label maps produced as part of the end-to-end image synthesis process in Semantic Bottleneck GAN can also be used to improve the performance of the state-of-the-art semantic image synthesis network, resulting in higher-quality outputs when conditioning on ground truth segmentation layouts. Our code will be available at https://github.com/azadis/SB-GAN.
GANs are a powerful class of implicit generative models successfully applied to various image synthesis tasks such as image style transfer [18, 47], unsupervised representation learning [11, 33, 36], image super-resolution [23, 13], and text-to-image synthesis [45, 40, 34]. Training GANs is notoriously hard and recent efforts focused on improving neural architectures [21, 44, 9]1], regularization [15, 30], large-scale training, self-supervision, and sampling [7, 4]. One compelling approach which enables generation of high-resolution images is based on progressive training: a model is trained to first synthesize lower-resolution images, then the resolution is gradually increased until the desired resolution is achieved. Recently, BigGAN showed that GANs significantly benefit from large-scale training, both in terms of model size and batch size. We note that these models are able to synthesize high-quality images in settings where objects are very prominent and centrally placed or follow some well-defined structure, as the corresponding distribution is easier to capture. In contrast, when the scenes are more complex and the amount of data is limited, the task becomes extremely challenging for these state-of-the-art models. The aim of this work is to improve the performance in the context of complex scenes and a small number of training examples.
GANs for discrete domains have been investigated in several works [22, 43, 24, 6, 26]. Training in this domain is even more challenging as the samples from discrete distributions are not differentiable with respect to the network parameters. This problem can be somewhat alleviated by using the Gumbel-softmax distribution, which is a continuous approximation to a multinomial distribution parameterized in terms of the softmax function. We will show how to apply a similar principle to learn the distribution of discrete segmentation masks.
In conditional image synthesis one aims to generate images by conditioning on an input which can be provided in the form of an image [18, 47, 3, 5, 25], a text phrase [37, 45, 35, 2, 17], a scene graph [20, 2], a class label or a semantic layout [31, 8, 39, 32]. These conditional GAN methods learn a mapping that translates samples from the source distribution into samples from the target domain.
The text-to-image synthesis model proposed in prior work decomposes the synthesis task into multiple steps. First, given the text description, a semantic layout is constructed by generating object bounding boxes and refining each box by estimating object shapes. Then, an image is synthesized conditioned on the generated semantic layout from the first step. Our work shares the same high-level idea of decomposing the image generation problem into semantic layout synthesis and conditional semantic-layout-to-image synthesis. A key difference is that we focus on unconditional image generation, which results in a novel semantic layout generation pipeline and end-to-end network design.
We propose an unconditional Semantic Bottleneck GAN architecture to learn the distribution of complex scenes. To tackle the problems of learning both the global layout and the local structure, we divide this synthesis problem into two parts: an unconditional segmentation map synthesis network and a conditional segmentation-to-image synthesis model. Our first network is designed to coarsely learn the scene distribution by synthesizing semantic layouts. It generates per-pixel semantic categories following the progressive GAN model architecture (ProGAN). The second network populates the synthesized semantic layouts with texture by predicting RGB pixel values using Spatially-Adaptive Normalization (SPADE), following the architecture of the state-of-the-art semantic image synthesis network. We assume the ground truth segmentation masks are available for all or part of the target scene dataset. In the following sections, we will first discuss our semantic bottleneck synthesis pipeline and summarize the SPADE network for image synthesis. We will then couple these two networks in an end-to-end final design which we refer to as Semantic Bottleneck GAN (SB-GAN).
Our goal here is to learn a (coarse) estimate of the scene distribution from samples corresponding to real segmentation maps with $K$ semantic categories. Starting from random noise, we generate a tensor $Y \in \{1, \dots, K\}^{N \times 1 \times H \times W}$ which represents a per-pixel segmentation class, with $H$ and $W$ indicating the height and width, respectively, of the generated map and $N$ the batch size. In practice, we progressively train from a low to a high resolution using the ProGAN architecture coupled with the Improved WGAN loss function on the ground truth discrete-valued segmentation maps. In contrast to ProGAN, in which the generator outputs continuous RGB values, we predict per-pixel discrete semantic class labels. This task is extremely challenging as it requires the network to capture the intricate relationship between segmentation classes and their spatial dependencies. To this end, we apply the Gumbel-softmax trick [19, 29] coupled with a straight-through estimator, which we describe in detail below.
The Gumbel-softmax relaxation is controlled by a temperature hyperparameter $\tau$: the smaller $\tau$, the closer the approximation is to the categorical distribution.
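For reference, the standard form of the Gumbel-softmax relaxation at a single pixel is given below. The notation is an assumption made for illustration and may differ from the original equation: $\pi_k$ is the predicted probability of class $k$ out of $K$ classes, $g_k$ is independent Gumbel noise, and $\tau$ is the temperature.

```latex
% Standard Gumbel-softmax relaxation (assumed notation, one pixel, K classes)
y_k \;=\; \frac{\exp\!\big((\log \pi_k + g_k)/\tau\big)}
               {\sum_{j=1}^{K} \exp\!\big((\log \pi_j + g_j)/\tau\big)},
\qquad g_k \sim \mathrm{Gumbel}(0, 1) \ \text{i.i.d.}
```

As $\tau \to 0$ the vector $(y_1, \dots, y_K)$ approaches a one-hot sample from the categorical distribution defined by $\pi$; larger $\tau$ gives smoother, more uniform outputs.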
Similar to the real samples, the synthesized samples fed to the GAN discriminator should still contain discrete category labels. As a result, for the forward pass, we simply compute the per-pixel arg max over the predicted class scores, while for the backward pass, we use the soft predicted scores directly, a strategy also known as straight-through estimation.
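The forward and backward behavior described above can be sketched for a single pixel in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the scores are an ordinary list of floats, and the backward pass is written out by hand as the softmax vector-Jacobian product instead of relying on an autodiff framework.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via inverse transform sampling."""
    u = rng.random()
    # Clamp away from 0 and 1 so that both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, tau, rng):
    """Soft (relaxed) sample: softmax((logits + gumbel_noise) / tau)."""
    noisy = [(l + sample_gumbel(rng)) / tau for l in logits]
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def straight_through_forward(soft):
    """Forward pass: replace the soft scores by a one-hot arg max."""
    k = max(range(len(soft)), key=lambda i: soft[i])
    return [1.0 if i == k else 0.0 for i in range(len(soft))]


def straight_through_backward(soft, grad_output, tau):
    """Backward pass: differentiate as if the output had been `soft`.

    Returns the gradient with respect to the logits, i.e. the softmax
    vector-Jacobian product scaled by 1 / tau.
    """
    dot = sum(g * s for g, s in zip(grad_output, soft))
    return [s * (g - dot) / tau for s, g in zip(soft, grad_output)]


# Usage: one pixel with three hypothetical class scores.
rng = random.Random(0)
soft = gumbel_softmax([2.0, 0.5, -1.0], tau=1.0, rng=rng)
hard = straight_through_forward(soft)  # what the discriminator sees
grad = straight_through_backward(soft, [1.0, 0.0, 0.0], tau=1.0)
```

The discriminator only ever receives `hard`, which is a valid discrete label, while the generator still receives a usable gradient through `soft`.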
Our second sub-network converts the synthesized semantic layouts into photo-realistic images using spatially-adaptive normalization. The segmentation masks are employed to spread the semantic information throughout the generator by modulating the activations with a spatially adaptive learned transformation. We follow the same generator and discriminator architectures and loss functions used in SPADE, where the generator contains a series of SPADE residual blocks with upsampling layers. Writing $G_{\mathrm{SPD}}$ and $D_{\mathrm{SPD}}$ for the SPADE generator and discriminator, the generator objective combines a conditional adversarial loss with the VGG and discriminator feature matching loss functions [32, 39], and the discriminator is trained with the corresponding conditional adversarial loss.
We pre-train this network using pairs of real RGB images and their corresponding real segmentation masks from the target scene data set.
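One common way to write the SPADE training losses is the hinge formulation below. This is a sketch under assumed notation, not a verbatim copy of the original equation: $x$ is a real image, $y$ its segmentation mask, $G_{\mathrm{SPD}}$ and $D_{\mathrm{SPD}}$ are the SPADE generator and discriminator, $\mathcal{L}_{\mathrm{VGG}}$ and $\mathcal{L}_{\mathrm{FM}}$ are the VGG and discriminator feature matching losses, and $\lambda_1, \lambda_2$ are their weights.

```latex
\begin{align}
\mathcal{L}_{D_{\mathrm{SPD}}} &=
  -\mathbb{E}_{(x, y)}\big[\min\big(0,\, -1 + D_{\mathrm{SPD}}(x, y)\big)\big]
  -\mathbb{E}_{y}\big[\min\big(0,\, -1 - D_{\mathrm{SPD}}(G_{\mathrm{SPD}}(y), y)\big)\big], \\
\mathcal{L}_{G_{\mathrm{SPD}}} &=
  -\mathbb{E}_{y}\big[D_{\mathrm{SPD}}(G_{\mathrm{SPD}}(y), y)\big]
  + \lambda_1 \mathcal{L}_{\mathrm{VGG}} + \lambda_2 \mathcal{L}_{\mathrm{FM}}.
\end{align}
```

The discriminator always sees the segmentation mask alongside the image, which is what makes this loss conditional.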
In the next section, we will describe how to employ the synthesized segmentation masks in an end-to-end manner to improve the performance of both the semantic bottleneck and the semantic image synthesis sub-networks.
After training the semantic bottleneck synthesis model to synthesize segmentation masks and the semantic image synthesis model to stochastically map segmentations to photo-realistic images, we adversarially fine-tune the parameters of both networks in an end-to-end approach by introducing an unconditional discriminator network on top of the SPADE generator (see Figure 2).
This second discriminator, $D_2$, has the same architecture as the SPADE discriminator, but is designed to distinguish between real RGB images and the fake ones generated from the synthesized semantic layouts. Unlike the SPADE conditional GAN loss in equation 3.2, which examines pairs of input segmentations and output images, the GAN loss on $D_2$ is unconditional and only compares real images to synthesized ones (equation 3).
In equation 3, $G_{\mathrm{SB}}$ represents the semantic bottleneck synthesis generator, and the objective includes the improved WGAN loss used to pretrain $G_{\mathrm{SB}}$, described in Section 3.1. In contrast to the conditional discriminator in SPADE, which enforces consistency between the input semantic map and the output image, $D_2$ is primarily concerned with the overall quality of the final output. The hyperparameter $\lambda$ determines the ratio between the two generators during fine-tuning. The parameters of both generators, $G_{\mathrm{SB}}$ and $G_{\mathrm{SPD}}$, as well as the corresponding discriminators, $D_{\mathrm{SB}}$ and $D_{\mathrm{SPD}}$, are updated in this end-to-end fine-tuning.
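An objective consistent with this description can be written as follows. The exact form is an assumption: we take the unconditional loss on the second discriminator $D_2$ to be a hinge loss, write $z$ for the input noise and $x$ for a real image, and let $G(z) = G_{\mathrm{SPD}}(G_{\mathrm{SB}}(z))$ denote the composed generator.

```latex
\begin{align}
\mathcal{L}_{D_2} &=
  -\mathbb{E}_{x}\big[\min\big(0,\, -1 + D_2(x)\big)\big]
  -\mathbb{E}_{z}\big[\min\big(0,\, -1 - D_2(G(z))\big)\big], \\
\mathcal{L}_{G} &=
  -\mathbb{E}_{z}\big[D_2(G(z))\big]
  + \mathcal{L}_{G_{\mathrm{SPD}}} + \lambda\, \mathcal{L}_{G_{\mathrm{SB}}},
\qquad G(z) = G_{\mathrm{SPD}}\big(G_{\mathrm{SB}}(z)\big).
\end{align}
```

Here $\mathcal{L}_{G_{\mathrm{SPD}}}$ is the SPADE generator loss, $\mathcal{L}_{G_{\mathrm{SB}}}$ is the improved WGAN generator loss of the semantic bottleneck network, and $\lambda$ balances the two generators.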
We illustrate our final end-to-end network in Figure 2. Jointly fine-tuning the two networks in an end-to-end fashion allows the two networks to reinforce each other, leading to improved performance. The gradients with respect to RGB images synthesized by SPADE are back-propagated to the segmentation synthesis model, thereby encouraging it to synthesize segmentation layouts that lead to higher quality final images. Hence, SPADE plays the role of a loss function for synthesizing segmentations, but in the RGB space, providing a goal that was absent from the initial training. Similarly, fine-tuning SPADE with synthesized segmentations allows it to adapt to a more diverse set of scene layouts, which improves the quality of generated samples.
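To make the interplay of the loss terms during fine-tuning concrete, the following plain-Python sketch evaluates an unconditional hinge loss on discriminator scores and combines the generator-side terms with a weight `lam`. It is an illustration under assumptions: the hinge form, the function names, and the scalar placeholders for the SPADE and semantic bottleneck generator losses are all hypothetical, and the scores stand in for discriminator outputs on a batch.

```python
def hinge_d_loss(real_scores, fake_scores):
    """Unconditional hinge loss for a discriminator.

    Penalizes real scores below +1 and fake scores above -1.
    """
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def hinge_g_loss(fake_scores):
    """Generator side of the hinge loss: raise the scores of fake images."""
    return -sum(fake_scores) / len(fake_scores)


def end_to_end_generator_loss(fake_scores, spade_g_loss, sb_g_loss, lam):
    """Combine the unconditional adversarial term with both generator losses.

    `spade_g_loss` and `sb_g_loss` are assumed to be precomputed scalars;
    `lam` weights the semantic bottleneck generator term.
    """
    return hinge_g_loss(fake_scores) + spade_g_loss + lam * sb_g_loss


# Usage with made-up scores for a batch of two real and two fake images.
d_loss = hinge_d_loss(real_scores=[2.0, 0.0], fake_scores=[-2.0, 0.0])
g_loss = end_to_end_generator_loss([1.0, 3.0], spade_g_loss=0.5,
                                   sb_g_loss=0.25, lam=2.0)
```

Because the adversarial term depends on the final RGB output, its gradient reaches both the image generator and, through the straight-through estimator, the segmentation generator.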
We evaluate the performance of the proposed approach on two datasets containing images with complex scenes, where the ground truth segmentation masks are available during training (possibly only for a subset of the images). We also study the role of the two network components, semantic bottleneck and semantic image synthesis, on the final result. We compare the performance of SB-GAN against the state-of-the-art BigGAN model  as well as a ProGAN  baseline that has been trained on the RGB images directly. We evaluate our method using Fréchet Inception Distance (FID) as well as a user study.
We study the performance of our model on the Cityscapes and ADE-Indoor datasets, two domains with complex scene images.
Cityscapes-5K  contains street scene images in German cities with training and validation set sizes of 3,000 and 500 images, respectively. Ground truth segmentation masks with 33 semantic classes are available for all images in this dataset.
Cityscapes-25K contains street scene images in German cities with training and validation set sizes of 23,000 and 500 images, respectively. Cityscapes-5K is a subset of this dataset, providing 3,000 images in the training set here as well as the entire validation set. Fine ground truth annotations are only provided for this subset, with the remaining 20,000 training images containing only coarse annotations. We extract the corresponding fine annotations for the rest of the training images using the state-of-the-art segmentation model [42, 41] trained on the annotated training samples from Cityscapes-5K. This dataset contains 19 semantic classes.
ADE-Indoor is a subset of the ADE20K dataset  containing 4,377 challenging training images from indoor scenes and 433 validation images with 95 semantic categories.
We use the Fréchet Inception Distance (FID) as well as a user study to evaluate the quality of the generated samples. To compute FID, the real data and generated samples are first embedded in a specific layer of a pre-trained Inception network. Then, a multivariate Gaussian is fit to the data, and the distance is computed as $\mathrm{FID}(x, y) = \|\mu_x - \mu_y\|_2^2 + \mathrm{Tr}\big(\Sigma_x + \Sigma_y - 2(\Sigma_x \Sigma_y)^{1/2}\big)$, where $\mu$ and $\Sigma$ denote the empirical mean and covariance, and subscripts $x$ and $y$ denote the real and generated data respectively. FID is shown to be sensitive to both the addition of spurious modes and to mode dropping [38, 27]. On the Cityscapes dataset, we ran five trials where we computed FID on 500 random synthetic images and 500 real validation images, and report the average score. On ADE-Indoor, the same process is repeated on batches of 433 images.
In all our experiments, we set , and . The initial generator and discriminator learning rates for training SPADE both in the pretraining and end-to-end steps are and , respectively. The learning rate for the semantic bottleneck synthesis sub-network is set to in the pretraining step and to in the end-to-end fine-tuning on Cityscapes, and to for ADE-Indoor. The temperature hyperparameter, $\tau$, is always set to 1. For BigGAN, we followed the setup of the compare_gan library (configuration as in https://github.com/google/compare_gan/blob/master/example_configs/biggan_imagenet128.gin), where we modified the code to allow for non-square images of Cityscapes. We used one class label for all images to have an unconditional BigGAN model. For both datasets we varied the batch size (using values in ), the learning rate, and the location of the self-attention block. We trained the final model for K iterations.
In Figures 3, 4, and 5, we provide qualitative comparisons of the competing methods on the three aforementioned datasets. We observe that both Cityscapes-5K and ADE-Indoor are very challenging for the state-of-the-art ProGAN and BigGAN models, likely due to the complexity of the data and small number of training instances. Even at a resolution of on the ADE-Indoor dataset, BigGAN suffers from mode collapse, as illustrated in the last row of Figure 5. In contrast, SB-GAN significantly improves on the structure of the scene distribution, and provides samples of higher quality. On Cityscapes-25K, the performance improvement of SB-GAN is more modest due to the large number of training images available. It is worth emphasizing that in this case only 3K ground truth segmentations for training SB-GAN are available. Compared to BigGAN, images synthesized by SB-GAN are sharper and contain more structural details (e.g., one can zoom-in on the synthesized cars). More qualitative examples are presented in the Appendix.
To provide a thorough empirical evaluation of the proposed approach, we generate samples for each dataset and report the FID scores of the resulting images (averaged across 5 sets of generated samples). We evaluate SB-GAN both before and after end-to-end fine-tuning, and compare our method to two strong baselines, ProGAN and BigGAN. The results are detailed in Tables 1 and 2.
Columns of the FID comparison table: ProGAN | SB-GAN W/O FT | SB-GAN
First, in the low-data regime, even without fine-tuning, our Semantic Bottleneck GAN produces higher quality samples and significantly outperforms the baselines on Cityscapes-5K and ADE-Indoor. The advantage of our proposed method is even more striking on smaller datasets. While competing methods are unable to learn a high-quality model of the underlying distribution without having access to a large number of samples, SB-GAN is less sensitive to the number of training data points. Secondly, we observe that by jointly training the semantic bottleneck and image synthesis components, SB-GAN produces state-of-the-art results across all three datasets.
We were not able to successfully train BigGAN at a resolution of due to instability observed during training and mode collapse. Table 2, however, shows the results for a lower-resolution setting, for which we were able to successfully train BigGAN. We report the results before the training collapses. BigGAN is, to a certain extent, able to capture the distribution of Cityscapes-25K, but fails completely on ADE-Indoor. Interestingly, BigGAN fails to capture the distribution of Cityscapes-5K even at resolution. The standard deviation of the FID scores computed in Tables 1 and 2 is within of the mean for Cityscapes and within of the mean for ADE-Indoor.
To independently assess the impact of end-to-end training on the conditional image synthesis sub-network, we evaluate the quality of generated samples when conditioning on ground truth validation segmentations from each dataset. Comparisons to the baseline network SPADE  are provided in Table 3 and Figure 8. We observe that the image synthesis component of SB-GAN consistently outperforms SPADE across all three datasets, indicating that fine-tuning on synthetic labels produced by the segmentation generator improves the conditional image generator. Please refer to the Appendix for more qualitative examples.
To further dissect the effect of end-to-end training, we perform a study on different components of SB-GAN. In particular, we consider four settings: (1) SB-GAN before end-to-end fine-tuning, (2) fine-tuning only the semantic bottleneck synthesis component, (3) fine-tuning only the conditional image synthesis component, and (4) fine-tuning all components jointly. The results on the Cityscapes-5K dataset (resolution ) are reported in Table 4. Finally, the impact of fine-tuning on the quality of samples can be observed in Figures 6 and 11.
Table 4 columns: No FT | FT SB | FT SPADE | FT Both
We proposed an end-to-end Semantic Bottleneck GAN model that synthesizes semantic layouts from scratch, and then generates photo-realistic scenes conditioned on the synthesized layouts. Through extensive quantitative and qualitative evaluations, we showed that this novel end-to-end training pipeline significantly outperforms the state-of-the-art models in unconditional synthesis of complex scenes. In addition, Semantic Bottleneck GAN strongly improves the performance of the state-of-the-art semantic image synthesis model in synthesizing photo-realistic images from ground truth segmentations.
We believe that the idea of applying a semantic bottleneck to other generative models should be explored in future work. In addition, novel ways to train GANs with discrete outputs could be explored, especially techniques to deal with the non-differentiable nature of the generated outputs.
This work was supported by Google through Google Cloud Platform research credits. We thank Marvin Ritter for help with issues related to the compare_gan library. We are grateful to the members of BAIR for fruitful discussions. SA is supported by the Facebook graduate fellowship.
The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
Neural text generation: past, present and beyond. 2018.
The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2016.
Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.
Assessing generative models via precision and recall. In NeurIPS, 2018.
In Figures 9, 10, 11, and 12, we show additional synthetic results from our proposed SB-GAN model including both the synthesized segmentations and their corresponding synthesized images from the Cityscapes-25K and ADE-Indoor datasets. As mentioned in the paper, on the Cityscapes-25K dataset, fine ground truth annotations are only provided for the Cityscapes-5K subset. We extract the corresponding fine annotations for the rest of the training images using the state-of-the-art segmentation model [42, 41] trained on the annotated training samples from Cityscapes-5K.
Moreover, Figures 13 and 14 present additional examples illustrating the impact of SB-GAN on improving the performance of SPADE, the state-of-the-art semantic image synthesis model, on ground truth segmentations. The third row in these two figures shows examples of the synthesized images conditioned on ground truth labels when the SPADE sub-network is extracted from a trained SB-GAN model.