
Diverse Image Synthesis from Semantic Layouts via Conditional IMLE

Most existing methods for conditional image synthesis are only able to generate a single plausible image for any given input, or at best a fixed number of plausible images. In this paper, we focus on the problem of generating images from semantic segmentation maps and present a simple new method that can generate an arbitrary number of images with diverse appearance for the same semantic layout. Unlike most existing approaches which adopt the GAN framework, our method is based on the recently introduced Implicit Maximum Likelihood Estimation framework. Compared to the leading approach, our method is able to generate more diverse images while producing fewer artifacts despite using the same architecture. The learned latent space also has sensible structure despite the lack of supervision that encourages such behaviour.





1 Introduction

Conditional image synthesis is a problem of great importance in computer vision. In recent years, the community has made great progress towards generating images of high visual fidelity on a variety of tasks. However, most proposed methods are only able to generate a single image given each input, even though most image synthesis problems are ill-posed, i.e., there are multiple equally plausible images that are consistent with the same input. Ideally, we should aim to predict a distribution of all plausible images rather than just a single plausible image, which is a problem known as multimodal image synthesis [38]. This problem is hard for two reasons:

  1. Model: Most state-of-the-art approaches for image synthesis use generative adversarial nets (GANs) [11], which suffer from the well-documented issue of mode collapse. In the context of conditional image synthesis, this leads to a model that generates only a single plausible image for each given input regardless of the latent noise and fails to learn the distribution of plausible images.

  2. Data: Multiple different ground truth images for the same input are not available in most datasets. Instead, only one ground truth image is given, and the model has to learn to generate other plausible images in an unsupervised fashion.

In this paper, we focus on the problem of multimodal image synthesis from semantic layouts, where the goal is to generate multiple diverse images for the same semantic layout. Existing methods are either only able to generate a fixed number of images [3] or are difficult to train [38] due to the need to balance the training of several different neural nets that serve opposing roles.

To sidestep these issues, unlike most image synthesis approaches, we step outside of the GAN framework and propose a method based on the recently introduced method of Implicit Maximum Likelihood Estimation (IMLE) [21]. Unlike GANs, IMLE by design avoids mode collapse and is able to train the same types of neural net architectures as generators in GANs, namely neural nets with random noise drawn from an analytic distribution as input.

This approach offers two advantages:

  1. Unlike [3], we can generate an arbitrary number of images for each input by simply sampling different noise vectors.

  2. Unlike [38], which requires the simultaneous training of three neural nets that serve opposing roles, our model is much simpler: it only consists of a single neural net. Consequently, training is much more stable.

2 Related Work

2.1 Unimodal Prediction

Most modern image synthesis methods are based on generative adversarial nets (GANs) [11]. Most of these methods are capable of producing only a single image for each given input, due to the problem of mode collapse. Various works have explored conditioning on different types of information. Some methods condition on a scalar that contains little information, such as object category and attribute [23, 9, 5]. Other methods condition on richer labels, such as text descriptions [26], surface normal maps [33], previous frames in a video [22, 31] and images [34, 13, 37]. Some methods condition on input images only in the generator, but not in the discriminator [25, 19, 36, 20]. Other work [15, 26, 28] explores conditioning on attributes that can be modified manually by the user at test time; these methods are not true multimodal methods because they require manual changes to the input (rather than just sampling from a fixed distribution) to generate a different image.

Another common approach to image synthesis is to treat it as a simple regression problem. To ensure high perceptual quality, the loss is usually defined on some transformation of the raw pixels. This paradigm has been applied to super-resolution [1, 14], style transfer [14] and video frame prediction [30, 24, 8]. These methods are by design unimodal methods because neural nets are functions, and so can only produce point estimates.

Various methods have been developed for the problem of image synthesis from semantic layouts. For example, Karacan et al. [16] developed a conditional GAN-based model for generating images from semantic layouts and labelled image attributes. It is important to note that the method requires supervision on the image attributes and is therefore a unimodal method. Isola et al. [13] developed a conditional GAN that can generate images solely from semantic layouts. However, it is only able to generate a single plausible image for each semantic layout, due to the problem of mode collapse in GANs. Wang et al. [32] further refined the approach of [13], focusing on the high-resolution setting. While these methods are able to generate images of high visual fidelity, they are all unimodal methods.

2.2 Fixed Number of Modes

A simple approach to generating a fixed number of different outputs for the same input is to use different branches or models for each desired output. For example, [12] proposed a model that outputs a fixed number of different predictions simultaneously, an approach that was adopted by Chen and Koltun [3] to generate different images for the same semantic layout. Unlike most approaches, [3] does not use the GAN framework; instead, it uses a simple feedforward convolutional network. On the other hand, Ghosh et al. [10] use a GAN framework in which multiple generators are introduced, each of which generates a different mode. The above methods all have two limitations: (1) they are only able to generate a fixed number of images for the same input, and (2) they cannot generate continuous changes.

2.3 Arbitrary Number of Modes

A number of GAN-based approaches propose adding learned regularizers that discourage mode collapse. BiGAN/ALI [6, 7] trains a model to reconstruct the latent code from the image; however, when applied to the conditional setting, significant mode collapse still occurs because the encoder is not trained to optimality and so cannot perfectly invert the generator. VAE-GAN [17] combines a GAN with a VAE, which does not suffer from mode collapse. However, image quality suffers because the generator is trained on latent codes sampled from the encoder/approximate posterior, and is never trained on latent codes sampled from the prior. At test time, only the prior is available, resulting in a mismatch between training and test conditions. Zhu et al. [38] proposed Bicycle-GAN, which combines both of the above approaches. While this alleviates the above issues, it is difficult to train, because it requires training three different neural nets simultaneously, namely the generator, the discriminator and the encoder. Because they serve opposing roles and effectively regularize one another, it is important to strike just the right balance, which makes it hard to train successfully in practice.

A number of methods for colourization [2, 35, 18] predict a discretized marginal distribution over colours of each individual pixel. While this approach is able to capture multimodality in the marginal distribution, ensuring global consistency between different parts of the image is not easy, since there are correlations between the colours of different pixels. This approach is not able to learn such correlations because it does not learn the joint distribution over the colours of all pixels.

3 Method

3.1 Formulation

Given a semantic segmentation map, our goal is to generate arbitrarily many plausible images that are all consistent with the segmentation. More formally, given a segmentation map $x \in \{0,1\}^{H \times W \times C}$, where $H \times W$ is the size of the image and $C$ is the number of semantic classes, the goal is to generate a plausible color image $y \in \mathbb{R}^{H \times W \times 3}$ that is consistent with $x$. Each pixel in the segmentation map is represented as a one-hot encoding of the semantic category it belongs to, that is, $x^{i,j} \in \{0,1\}^{C}$ with $\sum_{c=1}^{C} x^{i,j}_c = 1$.

We consider the conditional probability distribution $p(y \mid x)$. A plausible image that is consistent with $x$ is a mode of this distribution; because there could be many plausible images that are consistent with the same segmentation, $p(y \mid x)$ is usually multimodal. A method that performs unimodal prediction can be seen as producing a point estimate of this distribution. The ability to generate a high-quality image essentially corresponds to the ability to return an image that is close to some mode.

Because our goal is to generate multiple plausible images, producing a point estimate of $p(y \mid x)$ is not enough. Instead, we need to model the full distribution.

We model $p(y \mid x)$ using a probabilistic model with parameters $\theta$ (we will describe what this model looks like later). We will hereafter denote the distribution represented by the model as $p_\theta(y \mid x)$. Training the model is equivalent to estimating $\theta$; a standard method for parameter estimation is maximum likelihood estimation (MLE). That is, we want to maximize the log-likelihood of the ground truth image $y$ that corresponds to the semantic layout $x$. Let $\{(x_i, y_i)\}_{i=1}^{n}$ denote the training set; we would like to train the model by optimizing the following objective:

$$\max_{\theta} \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i)$$

The probabilistic model that we use is an implicit probabilistic model, which is defined directly in terms of a sampling procedure. This contrasts with classical probabilistic models (sometimes known as prescribed probabilistic models), which are defined in terms of probability density functions (PDFs). Our implicit model is defined in terms of the following sampling procedure:

  1. Draw $z \sim \mathcal{N}(0, I)$

  2. Return $y := T_\theta(x, z)$ as a sample

Here $T_\theta$ represents a deep neural network, which takes the label map $x$ and random vector $z$ as input and outputs the synthesized image $y$. In other words, the model is the same as the generator in conditional GANs (but it does not have a discriminator and will not be trained using the GAN objective).
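The two-step sampling procedure can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `toy_generator` is a hypothetical stand-in for the neural network (a fixed linear map followed by a sigmoid), and the shapes and names are chosen for brevity rather than taken from the paper.

```python
import numpy as np

def toy_generator(x, z, weights):
    """Hypothetical stand-in for the generator network (not the paper's model).

    x: one-hot label map, shape (H, W, C)
    z: noise, shape (H, W, P)
    weights: shape (C + P, 3), mixes input channels into RGB
    """
    inp = np.concatenate([x, z], axis=-1)          # (H, W, C + P)
    return 1.0 / (1.0 + np.exp(-(inp @ weights)))  # (H, W, 3)

def sample_images(x, weights, num_samples, noise_channels, rng):
    """Repeat the two steps: draw Gaussian noise, then push it through the net."""
    h, w, _ = x.shape
    samples = []
    for _ in range(num_samples):
        z = rng.standard_normal((h, w, noise_channels))  # step 1: draw noise
        samples.append(toy_generator(x, z, weights))     # step 2: return the output
    return np.stack(samples)

rng = np.random.default_rng(0)
H, W, C, P = 4, 8, 3, 2                  # assumed toy sizes
labels = rng.integers(0, C, size=(H, W))
x = np.eye(C)[labels]                    # one-hot encoding, shape (H, W, C)
weights = rng.standard_normal((C + P, 3))
images = sample_images(x, weights, num_samples=5, noise_channels=P, rng=rng)
```

Because the only source of randomness is the noise vector, any number of images can be produced for the same label map by drawing more noise vectors.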

3.2 Implicit Maximum Likelihood Estimation

It is not possible to train an implicit model using MLE because the log-likelihood cannot be expressed in closed form or evaluated numerically. Fortunately, Li and Malik [21] recently introduced Implicit Maximum Likelihood Estimation (IMLE), a method for training probabilistic models that does not need to compute the actual log-likelihood, but is equivalent to maximum likelihood under appropriate conditions.

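More formally, given a set of training examples $y_1, \dots, y_n$ and an (unconditional) implicit probabilistic model $p_\theta(y)$, IMLE draws i.i.d. samples $\tilde{y}_1^\theta, \dots, \tilde{y}_m^\theta$ from $p_\theta$ and optimizes the parameters $\theta$ such that each data example is close to its nearest sample in expectation. It can be written as the optimization problem:

$$\min_{\theta} \; \mathbb{E}_{\tilde{y}_1^\theta, \dots, \tilde{y}_m^\theta} \left[ \sum_{i=1}^{n} \min_{j \in \{1, \dots, m\}} \left\| \tilde{y}_j^\theta - y_i \right\|_2^2 \right]$$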
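For a single draw of samples, the quantity inside the expectation can be computed directly. The sketch below assumes squared Euclidean distance and flat vectors; the function name `imle_loss` and the toy numbers are illustrative, not from the paper.

```python
import numpy as np

def imle_loss(data, samples):
    """Sum over data points of the squared distance to the nearest sample.

    data: shape (n, d); samples: shape (m, d)
    Returns the loss and the index of the nearest sample for each data point.
    """
    diffs = data[:, None, :] - samples[None, :, :]   # (n, m, d)
    sq_dists = np.sum(diffs ** 2, axis=-1)           # (n, m)
    nearest = np.argmin(sq_dists, axis=1)            # one sample index per data point
    loss = sq_dists[np.arange(len(data)), nearest].sum()
    return loss, nearest

data = np.array([[0.0, 0.0], [10.0, 10.0]])
samples = np.array([[0.0, 1.0], [9.0, 10.0], [5.0, 5.0]])
loss, nearest = imle_loss(data, samples)
```

Note the direction of the minimum: it runs over samples for each data point, so every data point contributes to the loss and pulls some sample towards it, while samples that are nearest to no data point (the third one here) are simply ignored.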

To apply IMLE to conditional image synthesis, we need to model all the different distributions $p(y \mid x)$ for different semantic layouts $x$. Therefore, in the conditional setting, the samples corresponding to different $x$'s should be segregated, and the nearest neighbour search should be over only the samples that correspond to the segmentation associated with the ground truth. We also use a different distance metric $\mathcal{L}(\cdot, \cdot)$, which is defined in Section 3.3. The modified algorithm is stated in Algorithm 1. The size of the random batch $|S|$, the number of random vectors $m$, the number of inner iterations $K$, the size of the minibatch $|\tilde{S}|$ and the learning rate $\eta$ are hyperparameters.

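  Input: Training semantic segmentation maps $\{x_i\}_{i=1}^{n}$ and the corresponding ground truth images $\{y_i\}_{i=1}^{n}$
  Initialize parameters $\theta$ for neural net $T_\theta$
  for epoch = 1 to $E$ do
     Pick a random batch $S \subseteq \{1, \dots, n\}$
     for $i \in S$ do
        Generate $m$ i.i.d. random vectors $z_{i,1}, \dots, z_{i,m}$
        for $j = 1$ to $m$ do
           $\tilde{y}_{i,j} \leftarrow T_\theta(x_i, z_{i,j})$
        end for
        $\sigma(i) \leftarrow \arg\min_{j} \mathcal{L}(y_i, \tilde{y}_{i,j})$
     end for
     for $k = 1$ to $K$ do
        Pick a random minibatch $\tilde{S} \subseteq S$
        $\theta \leftarrow \theta - \eta \, \nabla_\theta \left( \sum_{i \in \tilde{S}} \mathcal{L}(y_i, \tilde{y}_{i,\sigma(i)}) \right) / |\tilde{S}|$
     end for
  end for
Algorithm 1 Conditional IMLE for Image Synthesis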
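The structure of Algorithm 1 can be sketched end to end on a toy problem. Everything specific below is an assumption made so that the example is self-contained: the generator is a linear map rather than a deep network, squared Euclidean distance stands in for the perceptual distance, the gradient is written out by hand, and all hyperparameter values are arbitrary.

```python
import numpy as np

def generator(theta, x, z):
    """Toy linear generator (an assumption): output = x A + z B."""
    A, B = theta
    return x @ A + z @ B

def train_conditional_imle(xs, ys, noise_dim, epochs, batch_size, num_noise,
                           inner_iters, minibatch_size, lr, rng):
    n, out_dim = ys.shape
    theta = [0.1 * rng.standard_normal((xs.shape[1], out_dim)),
             0.1 * rng.standard_normal((noise_dim, out_dim))]
    history = []
    for _ in range(epochs):
        batch = rng.choice(n, size=batch_size, replace=False)
        selected = {}
        for i in batch:
            # Samples are generated per input, so the nearest-neighbour
            # search only sees samples conditioned on the same x_i.
            zs = rng.standard_normal((num_noise, noise_dim))
            outputs = generator(theta, xs[i][None, :], zs)   # (num_noise, out_dim)
            dists = np.sum((outputs - ys[i]) ** 2, axis=1)
            selected[int(i)] = zs[np.argmin(dists)]          # keep the best noise vector
        for _ in range(inner_iters):
            mini = rng.choice(batch, size=minibatch_size, replace=False)
            X = xs[mini]
            Z = np.stack([selected[int(i)] for i in mini])
            residual = generator(theta, X, Z) - ys[mini]
            history.append(float(np.mean(np.sum(residual ** 2, axis=1))))
            # Hand-derived gradient of the mean squared distance.
            theta[0] -= lr * 2.0 * X.T @ residual / len(mini)
            theta[1] -= lr * 2.0 * Z.T @ residual / len(mini)
    return theta, history

rng = np.random.default_rng(0)
xs = rng.standard_normal((32, 4))            # toy "layouts"
ys = xs @ rng.standard_normal((4, 3))        # toy "ground truth images"
theta, history = train_conditional_imle(
    xs, ys, noise_dim=2, epochs=50, batch_size=16, num_noise=10,
    inner_iters=5, minibatch_size=8, lr=0.01, rng=rng)
```

The selected noise vector for each example is held fixed during the inner loop while the output is recomputed with the current parameters, which is what makes the matched-sample loss differentiable with respect to the parameters.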

3.3 Architecture

To allow for direct comparability to Cascaded Refinement Networks (CRN) [3], which is the leading method for multimodal image synthesis from semantic layouts, we use the same architecture as CRN, with minor modifications to convert CRN into an implicit probabilistic model. We will first review CRN in 3.3.1 and discuss our improvements in 3.3.2.

3.3.1 Architecture

The Cascaded Refinement Network is a coarse-to-fine architecture that consists of multiple modules $M^0, M^1, \dots$. Each module operates at one resolution, and the resolution is doubled from one module to the next, so module $M^i$ operates at $2^i$ times the resolution of the first module $M^0$. All layers in the same module operate at the same resolution.

$M^0$ takes the semantic segmentation map $x$ (downsampled to the resolution of $M^0$) as input and produces a feature output $F^0$. Every other module $M^i$ takes the concatenation of the semantic map (downsampled to the resolution of $M^i$) and the feature output $F^{i-1}$ (upsampled to the same resolution) as input and produces a feature output $F^i$. Note that bilinear interpolation is used for upsampling/downsampling. The final module is followed by a convolutional layer that outputs the synthesized image with 3 channels.

Inside each module $M^i$, there are two convolutional layers with layer normalization and leaky ReLU activation. The number of channels decreases as the resolution increases: 1024 in the earliest (coarsest) modules, then 512, then 128, and finally 32 in the last module.

CRN uses a perceptual loss function based on VGG-19 features [29]. Given the ground truth image $y$ and synthesized image $\tilde{y}$, the loss function is:

$$\mathcal{L}(y, \tilde{y}) = \sum_{i=1}^{l} \lambda_i \left\| \Phi_i(y) - \Phi_i(\tilde{y}) \right\|_1 \qquad (1)$$

Here $\Phi_i(\cdot)$ represents the feature outputs of the following layers in VGG-19: 'conv1_2', 'conv2_2', 'conv3_2', 'conv4_2', and 'conv5_2'. Hyperparameters $\lambda_i$ are set such that the loss of each layer makes the same contribution to the total loss. We use this loss function as the distance metric in the IMLE objective.
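The shape of such a layer-weighted feature loss can be illustrated without a pretrained network. In the sketch below, `toy_features` is a hypothetical stand-in for the VGG-19 activations (it simply average-pools the image at several scales), and an L1 difference per layer is assumed; only the weighting idea carries over.

```python
import numpy as np

def toy_features(image, num_levels=3):
    """Stand-in for VGG-19 layer outputs (assumption): repeated 2x2 average pooling."""
    feats = []
    current = image
    for _ in range(num_levels):
        h, w, c = current.shape
        current = current[: h - h % 2, : w - w % 2]   # crop to even size
        current = current.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
        feats.append(current)
    return feats

def perceptual_loss(y, y_hat, lambdas):
    """Weighted sum over layers of the L1 distance between feature maps."""
    return sum(lam * np.abs(fy - fh).sum()
               for lam, fy, fh in zip(lambdas, toy_features(y), toy_features(y_hat)))

y = np.zeros((8, 8, 3))
y_hat = np.ones((8, 8, 3))
# One simple way to equalize layer contributions: weight each layer by
# the inverse of its number of entries, so larger layers do not dominate.
lambdas = [1.0 / f.size for f in toy_features(y)]
loss = perceptual_loss(y, y_hat, lambdas)
```

With these weights each of the three toy layers contributes the same amount to the total, which mirrors the stated goal of the layer weights; how the weights are actually chosen in CRN is not specified here.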

3.3.2 Improvements for Multimodal Prediction

The original CRN synthesizes only one image for the same semantic layout input. To generate multiple images for the same input, Chen and Koltun [3] adopted the approach of [12] and increased the number of output images from 1 to $k$. This allows the generation of $k$ different samples for the same input, but the number is fixed. As a result, if the number of modes is greater than $k$, some modes will be missing from the prediction.

We adopt a different approach for modelling the multimodality. Instead of increasing the number of output channels, we add additional input channels to the architecture and feed random noise via these channels. This new model can be then interpreted as an implicit probabilistic model, which we can train using conditional IMLE.

Additional Noise Channels

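We incorporate random noise by concatenating the semantic label map $x$ with a random vector $z$ reshaped to the appropriate size. $x$ is of size $H \times W \times C$ and hence $z$ should have size $H \times W \times d$, where $d$ is the number of noise channels. Let $\tilde{x}$ denote the concatenation of $x$ and $z$ along the channel dimension. $M^0$ now takes $\tilde{x}$ (downsampled to the resolution of $M^0$) as input, and each of the other modules $M^i$ takes the concatenation of $\tilde{x}$ (downsampled to the resolution of $M^i$) and the feature output $F^{i-1}$ (upsampled to the same resolution) as input. Consequently, the only architectural change is to the first layer of each module, where the number of input channels increases by $d$.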
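The effect on the first layer of each module is simple bookkeeping, sketched below. The list of per-module feature widths is a hypothetical example, and the 19 classes and 10 noise channels are taken from the experimental section purely for illustration.

```python
def module_input_channels(num_classes, feature_channels, noise_channels=0):
    """Input channel count of the first layer of each module.

    feature_channels[i] is the number of output channels of module i
    (hypothetical values in the example below).
    """
    # The first module only sees the label map (plus noise channels, if any).
    counts = [num_classes + noise_channels]
    # Every later module also receives the previous module's features.
    for prev in feature_channels[:-1]:
        counts.append(num_classes + noise_channels + prev)
    return counts

widths = [1024, 512, 128, 32]                        # assumed feature widths
without_noise = module_input_channels(19, widths)
with_noise = module_input_channels(19, widths, noise_channels=10)
increase = [b - a for a, b in zip(without_noise, with_noise)]
```

Every module's first layer grows by exactly the number of noise channels, and nothing else in the network changes.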

Noise Encoder

Because the input segmentation maps are provided at a high resolution, a noise input of size $H \times W \times d$ could be very high-dimensional. This can require generating many samples during training, which can be slow. To solve this issue, we propose forcing the noise to lie on a low-dimensional manifold, which improves the sample efficiency. To this end, we add a noise encoder module, which is a 3-layer convolutional network that takes $x$ and noise sampled from a Gaussian as input and outputs an encoded noise vector of the same size as $z$. We replace the original $z$ with the encoded noise and leave the rest of the architecture unchanged.

3.4 Dataset and Loss Rebalancing

In practice, we found datasets can be strongly biased towards objects with relatively common appearance. As a result, naïve training can result in limited diversity among the images generated by the trained model. To address this, we propose two strategies to rebalance the dataset and loss.

Dataset Rebalancing

We first rebalance the dataset to increase the chance of rare images being sampled when populating $S$ (as shown in Algorithm 1). To this end, for each image in the training set, we calculate the average pixel vector of each semantic class in that image. More concretely, we compute the following for each image $k$ and category $p$:

$$c_k(p) = \frac{\sum_{i,j} \mathbf{1}[x_k^{i,j} = p] \; y_k^{i,j}}{\sum_{i,j} \mathbf{1}[x_k^{i,j} = p]}$$

where the indicator $\mathbf{1}[x_k^{i,j} = p]$ is 1 if pixel $(i,j)$ of $x_k$ belongs to category $p$ and 0 otherwise.

For each category $p$, we consider the set of average pixel vectors for that category in all training images, i.e., $\{ c_k(p) : k \in \{1, \dots, n\} \text{ such that category } p \text{ appears in } x_k \}$. We then fit a Gaussian kernel density estimate to this set and obtain an estimate of the distribution of average pixels of category $p$. Let $D_p(\cdot)$ denote the estimated probability density function (PDF) for category $p$. Given the $k$-th training example, we define the rarity score of category $p$:

$$R_p(k) = \begin{cases} \dfrac{1}{D_p(c_k(p))} & \text{if category } p \text{ appears in } x_k \\ 0 & \text{otherwise} \end{cases}$$

We allocate a portion of the batch $S$ in Algorithm 1 to each of the top five categories that have the largest overall area across the dataset. For each category, we sample training images based on the rarity score and effectively upweight images containing objects with rare appearance. The rationale for selecting the categories with the largest areas is that they tend to appear more frequently and be visually more prominent. If we were to allocate a fixed portion of the batch to rare categories, we would risk overfitting to images containing those categories.
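The rebalancing steps can be sketched for a single category as follows. Several details are assumptions made for illustration: the rarity score is taken to be the reciprocal of the estimated density, the kernel density estimate uses a fixed isotropic bandwidth, and the sampling probabilities are simply the normalized scores.

```python
import numpy as np

def average_class_pixel(image, labels, category):
    """Mean RGB vector of the pixels labelled `category`, or None if absent."""
    mask = labels == category
    if not mask.any():
        return None
    return image[mask].mean(axis=0)

def gaussian_kde_density(points, query, bandwidth):
    """Gaussian kernel density estimate at `query` (fixed bandwidth, assumed)."""
    dim = points.shape[1]
    sq_dists = np.sum((points - query) ** 2, axis=1)
    norm = (2.0 * np.pi * bandwidth ** 2) ** (dim / 2.0)
    return np.mean(np.exp(-sq_dists / (2.0 * bandwidth ** 2))) / norm

def rarity_scores(mean_colors, bandwidth=0.1):
    """Assumed rarity score: reciprocal of the estimated density."""
    return np.array([1.0 / gaussian_kde_density(mean_colors, c, bandwidth)
                     for c in mean_colors])

# Tiny example of the per-image average for one category.
image = np.zeros((2, 2, 3))
image[0, 0] = [1.0, 0.0, 0.0]
image[0, 1] = [0.0, 1.0, 0.0]
labels = np.array([[1, 1], [0, 0]])
mean_color = average_class_pixel(image, labels, category=1)

# Nine images with a common appearance and one with a rare appearance.
rng = np.random.default_rng(0)
common = 0.5 + 0.02 * rng.standard_normal((9, 3))
rare = np.array([[0.9, 0.1, 0.1]])
mean_colors = np.vstack([common, rare])
scores = rarity_scores(mean_colors)
probs = scores / scores.sum()     # sampling probabilities over the ten images
```

The image with the unusual average colour receives the highest sampling probability, which is the intended upweighting of rare appearances.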

Loss Rebalancing

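The same training image can contain both common and rare objects. Therefore, we modify the loss function so that the objects with rare appearance are upweighted. For each training example $k$, we define a rarity score mask $M_k$:

$$M_k^{i,j} = R_p(k) \quad \text{if pixel } (i,j) \text{ belongs to category } p$$

where $R_p(k)$ is the rarity score of category $p$ for the $k$-th training example. We then normalize $M_k$ so that every entry lies in $(0, 1]$:

$$\hat{M}_k = \frac{M_k}{\max_{i,j} M_k^{i,j}}$$

The mask is then applied to the loss function (1) and the new loss becomes:

$$\mathcal{L}(y, \tilde{y}) = \sum_{i=1}^{l} \lambda_i \left\| \hat{M}_k^{i} \circ \left[ \Phi_i(y) - \Phi_i(\tilde{y}) \right] \right\|_1$$

Here $\hat{M}_k^{i}$ is the rarity score mask downsampled to match the size of $\Phi_i(\cdot)$.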
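A small sketch shows how such a mask reweights a feature difference. The details are assumptions: the mask is normalized by its maximum, downsampling uses average pooling, the loss is an L1 difference, and the rarity scores are made-up numbers.

```python
import numpy as np

def rarity_mask(labels, rarity_by_category):
    """Per-pixel mask holding the rarity score of each pixel's category.

    Normalized by its maximum (an assumption) so the largest entry is 1.
    """
    mask = np.zeros(labels.shape, dtype=float)
    for category, score in rarity_by_category.items():
        mask[labels == category] = score
    return mask / mask.max()

def downsample(mask, factor):
    """Average-pool the mask so it matches a coarser feature map (assumed scheme)."""
    h, w = mask.shape
    return mask.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def masked_l1(feat_y, feat_hat, mask):
    """L1 distance between feature maps, weighted per location by the mask."""
    return float(np.abs(mask[..., None] * (feat_y - feat_hat)).sum())

labels = np.array([[0, 0, 1, 1]] * 4)             # left half: category 0, right half: category 1
mask = rarity_mask(labels, {0: 2.0, 1: 8.0})      # hypothetical rarity scores
coarse_mask = downsample(mask, 2)

target = np.zeros((4, 4, 3))
error_on_common = target.copy()
error_on_common[:, :2] = 1.0                      # mistake on the common category
error_on_rare = target.copy()
error_on_rare[:, 2:] = 1.0                        # same mistake on the rare category
loss_common = masked_l1(target, error_on_common, mask)
loss_rare = masked_l1(target, error_on_rare, mask)
```

The same pixel-level mistake costs four times as much on the rare category as on the common one, matching the ratio of the two rarity scores.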

4 Experiment

Figure 1: Comparison of histogram of hues between two datasets. Red is Cityscapes and blue is GTA-5.
Figure 2: Comparison of generated images for the same semantic layout. (a) CRN. (b) Our model.
Figure 3: Samples generated by our model. For both (a) and (b), the image at the top-left corner is the input semantic layout and the other 19 images are samples generated by our model conditioned on the same semantic layout.
(a) Change from daytime to night time
(b) Change of car colors
Figure 4: Images generated by interpolating between latent noise vectors.
Figure 5: Style consistency with the same random vector. (a) is the original input-output pair. We use the same random vector used in (a) and apply it to (b),(c),(d) and (e)

4.1 Dataset

The choice of dataset is very important for multimodal conditional image synthesis. The most common dataset in the unimodal setting is the Cityscapes dataset [4]. However, it is not suitable for the multimodal setting because most images in the dataset are taken under similar weather conditions and at similar times of day, and the amount of variation in object colours is limited. This lack of diversity limits what any multimodal method can do. On the other hand, the GTA-5 dataset [27] has much greater variation in terms of weather conditions and object appearance. To demonstrate this, we compare the colour distributions of both datasets and present the distribution of hues of both datasets in Figure 1. As shown, Cityscapes is concentrated around a single mode in terms of hue, whereas GTA-5 has much greater variation in hue. Additionally, the GTA-5 dataset includes more than 20000 images and so is much larger than Cityscapes.

4.2 Experimental Setting

We train our model on 12403 training images and evaluate on the validation set (6383 images). Due to computational resource limitations, we conduct experiments at the resolution. We add 10 noise channels and set the hyperparameters shown in Algorithm 1 to the following values: , , , and .

The leading method for image synthesis from semantic layouts in the multimodal setting is the CRN [3] with diversity loss that generates nine different images for each semantic segmentation map and is the baseline that we compare to.

4.3 Quantitative Comparison

We quantitatively compare both the diversity and the quality of the images generated by our model and by CRN.

Diversity Evaluation

Our method is able to generate an arbitrary number of different images for the same input by simply feeding in different random noise vectors. However, the baseline can only generate nine images for each input. To allow for direct comparison, we use our model to generate 100 images for each semantic layout in the test set and then use $k$-means to divide the generated images into nine clusters. We then randomly pick one image from each cluster and compare the nine selected images with the nine images generated by CRN.

Then, for each method, we compute the variance over the nine images of all pixels that belong to a particular category and take the average over all spatial locations and colour channels. This yields an average variance for each category in each image. Next, we take the average of the average variance over the entire test set and obtain the mean variance for each category. Since the nine images generated by our model are randomly generated, we repeat this procedure 10 times to reduce stochasticity. Results are shown in Table 1.
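The per-layout, per-category variance described above can be computed as in the following sketch. The array shapes and the function name are assumptions, and the example uses two images instead of nine to keep the arithmetic easy to follow.

```python
import numpy as np

def category_variance(images, labels, category):
    """Average variance across images of the pixels belonging to one category.

    images: shape (k, H, W, 3), the k images generated for one layout
    labels: shape (H, W), the category index of each pixel
    Returns None if the category does not appear in the layout.
    """
    mask = labels == category
    if not mask.any():
        return None
    pixels = images[:, mask, :]               # (k, num_pixels, 3)
    # Variance over the k images, then mean over locations and colour channels.
    return float(pixels.var(axis=0).mean())

labels = np.array([[0, 0], [1, 1]])
images = np.zeros((2, 2, 2, 3))
images[1, 0, :, :] = 2.0                      # second image differs only on category 0
var_cat0 = category_variance(images, labels, 0)
var_cat1 = category_variance(images, labels, 1)
```

Averaging these values over all test layouts gives one mean variance per category; a larger value means the generated images differ more from one another on that category.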


Image Quality Evaluation

We now evaluate the generated image quality by human evaluation. Since it is difficult for humans to compare images with different styles, we selected the images that are closest in distance to the ground truth image among the images generated by CRN and our method. We then asked 62 human subjects to evaluate the images generated for 20 semantic layouts. For each semantic layout, they were asked to compare the image generated by CRN to the image generated by our method and judge which image exhibited more obvious synthetic patterns. The result is shown in Table 2.

Semantic Class    Our model    CRN
Road              8.804        7.725
Sidewalk          10.41        6.941
Building          6.362        3.071
Wall              5.500        2.760
Fence             1.901        0.7764
Pole              2.534        1.907
Traffic light     1.168        0.4820
Traffic sign      1.703        1.378
Vegetation        1.716        0.8079
Terrain           5.018        2.664
Sky               1.645        2.267
Person            2.103        0.8848
Rider             0.2772       0.1338
Car               1.390        1.298
Truck             1.872        2.004
Bus               0.3267       0.1791
Train             0.01217      0.002938
Motorcycle        0.1878       0.06913
Bicycle           0.01628      0.005704
Overall           3.415        2.870
Table 1: Comparison of variance for each category and variance averaged over all categories. Our method outperforms CRN in most categories, indicating greater diversity in the images generated by our method.
Figure 6: Comparison of artifacts in generated images. (a) CRN. (b) Our model.
% of Images Containing More Artifacts
Our method
Table 2: Average percentage of images that are judged by humans to exhibit more obvious synthetic patterns. Lower is better.

4.4 Qualitative Evaluation

A qualitative comparison is shown in Fig. 2, where the results generated by our model are visibly more diverse. Our method also generates fewer artifacts compared to CRN, which is especially interesting because the architecture and the distance metric are the same. As shown in Fig. 6, the images generated by CRN have grid-like artifacts which are not present in the images generated by our method. More examples generated by our model are shown in Fig. 3.


We also perform linear interpolation of noise vectors to evaluate the quality of the learned latent space of noise vectors. As shown in Fig. 4(a), by interpolating between the noise vectors corresponding to generated images during daytime and nighttime respectively, we obtain a smooth transition from daytime to nighttime. We also show the transition in car colour in Fig. 4(b). This suggests that the learned latent space is sensible and captures the variation along both the daytime-nighttime axis and the colour axis. More examples and animations are available in the supplementary materials.
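Linear interpolation between two noise vectors is straightforward to write down. In the sketch below the function name and the toy vectors are illustrative; in practice each interpolated vector would be fed to the trained generator together with the same semantic layout.

```python
import numpy as np

def interpolate_noise(z_start, z_end, num_steps):
    """Evenly spaced points on the line segment from z_start to z_end."""
    ts = np.linspace(0.0, 1.0, num_steps)
    return np.stack([(1.0 - t) * z_start + t * z_end for t in ts])

z_day = np.zeros(3)                      # toy stand-in for a "daytime" noise vector
z_night = np.array([2.0, 4.0, 6.0])      # toy stand-in for a "nighttime" noise vector
path = interpolate_noise(z_day, z_night, num_steps=5)
```

A smooth change in the generated images along such a path is what indicates that nearby noise vectors map to similar appearances.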

Scene Editing

A successful method for image synthesis from semantic layouts enables users to manually edit the semantic map to synthesize desired imagery. One can do this simply by adding/deleting objects or changing the class label of a certain object. In Figure 7 we show several such changes. Note that all four inputs use the same random vector; as shown, the images are highly consistent in terms of style, which is quite useful because the style should remain the same after editing the layout. We further demonstrate this in Fig. 5 where we apply the random vector used in (a) to vastly different segmentation maps in (b),(c),(d),(e) and the sunset style is preserved across the different segmentation maps.

Figure 7: Scene editing. (a) is the original input semantic map and the generated output. (b) adds a car on the road. (c) changes the grass on the left to road and changes the sidewalk on the right to grass. (d) deletes our own car, changes the building on the right to tree and changes all road to grass.

5 Conclusion

We presented a new method based on IMLE for multimodal image synthesis from semantic layouts. Unlike prior approaches, our method can generate arbitrarily many images for the same semantic layout and is easy to train. We demonstrated that our method can generate more diverse images with fewer artifacts compared to the leading approach [3], despite using the same architecture. In addition, our model is able to learn a sensible latent space of noise vectors without supervision. We showed that by interpolating between noise vectors, our model can generate continuous changes. At the same time, using the same noise vector across different semantic layouts results in images of consistent style.


Appendix A Video of Interpolations

We generated a video that shows smooth transitions between different renderings of the same scene. Frames of the generated video are shown in Figure 8.

Figure 8: Frames of video generated by smoothly interpolating between latent noise vectors.

Appendix B Video Generation from Evolving Scene Layouts

We generated videos of a car moving farther away from the camera and then back towards the camera by generating individual frames independently using our model with different semantic segmentation maps as input. For the video to have consistent appearance, we must be able to consistently select the same mode across all frames. In Figure 9, we show that our model has this capability: we are able to select a mode consistently by using the same latent noise vector across all frames.

Figure 9: Frames from two videos of a moving car generated using our method. In both videos, we feed in scene layouts with cars of varying sizes to our model to generate different frames. In (a), we use the same latent noise vector across all frames. In (b), we interpolate between two latent noise vectors, one of which corresponds to a daytime scene and the other to a night time scene. The consistency of style across frames demonstrates that the learned space of latent noise vectors is semantically meaningful and that scene layout and style are successfully disentangled by our model.

Here we demonstrate one potential benefit of modelling multiple modes instead of a single mode. We tried generating a video from the same sequence of scene layouts using pix2pix [13], which only models a single mode. (For pix2pix, we used a pretrained model trained on Cityscapes, which is easier for the purposes of generating consistent frames because Cityscapes is less diverse than GTA-5.) In Figure 10, we show the difference between adjacent frames in the videos generated by our model and pix2pix. As shown, our model is able to generate consistent appearance across frames (as evidenced by the small difference between adjacent frames). On the other hand, pix2pix is not able to generate consistent appearance across frames, because it arbitrarily picks a mode to generate and does not permit control over which mode it generates.

Figure 10: Comparison of the difference between adjacent frames of synthesized moving car video. Darker pixels indicate smaller difference and lighter pixels indicate larger difference. (a) shows results for the video generated by our model. (b) shows results for the video generated by pix2pix [13].

Appendix C More Generated Samples

Figure 11: Samples generated by our model. The image at the top-left corner is the input semantic layout and the other 19 images are samples generated by our model conditioned on the same semantic layout.
Figure 12: Samples generated by our model. The image at the top-left corner is the input semantic layout and the other 19 images are samples generated by our model conditioned on the same semantic layout.