Toward Multimodal Image-to-Image Translation

by   Jun-Yan Zhu, et al.

Many image-to-image translation problems are ambiguous, as a single input image may correspond to multiple possible outputs. In this work, we aim to model a distribution of possible outputs in a conditional generative modeling setting. The ambiguity of the mapping is distilled in a low-dimensional latent vector, which can be randomly sampled at test time. A generator learns to map the given input, combined with this latent code, to the output. We explicitly encourage the connection between output and the latent code to be invertible. This helps prevent a many-to-one mapping from the latent code to the output during training, also known as the problem of mode collapse, and produces more diverse results. We explore several variants of this approach by employing different training objectives, network architectures, and methods of injecting the latent code. Our proposed method encourages bijective consistency between the latent encoding and output modes. We present a systematic comparison of our method and other variants on both perceptual realism and diversity.



1 Introduction

Deep learning techniques have made rapid progress in conditional image generation. For example, networks have been used to inpaint missing image regions (Pathak et al., 2016; Yang et al., 2017; Isola et al., 2017), create sentence-conditioned generations (Zhang et al., 2017a), add color to grayscale images (Iizuka et al., 2016; Larsson et al., 2016; Zhang et al., 2016; Isola et al., 2017), and translate sketches to photos (Sangkloy et al., 2017; Isola et al., 2017). However, most techniques in this space have focused on generating a single result conditioned on the input. In this work, our focus is to model a distribution of potential results, as many of these problems may be multimodal or ambiguous in nature. For example, in a night-to-day translation task (see Figure 1), an image captured at night may correspond to many possible day images with different types of lighting, sky and clouds. There are two main goals of the conditional generation problem: producing results which are (1) perceptually realistic and (2) diverse, while remaining faithful to the input. This multimodal mapping from a high-dimensional input to a distribution of high-dimensional outputs makes the conditional generative modeling task challenging. In existing approaches, this leads to the common problem of mode collapse (Goodfellow, 2016), where the generator learns to generate only a small number of unique outputs. We systematically study a family of solutions to this problem, which learn a low-dimensional latent code for aspects of possible outputs which are not contained in the input image. The generator then produces an output conditioned on both the given input and this learned latent code.

We start with the pix2pix framework (Isola et al., 2017), which has previously been shown to produce good-quality results for a variety of image-to-image translation tasks. The method trains a generator network, conditioned on the input image, with two losses: (1) a regression loss to produce a similar output to the known paired ground truth image and (2) a learned discriminator loss to encourage realism. The authors note that trivially appending a randomly drawn latent code did not help produce diverse results, and using random dropout at test time only helped marginally. Instead, we propose encouraging a bijection between the output and latent space. Not only do we perform the direct task of mapping from the input and latent code to the output, we also jointly learn an encoder from the output back to the latent space. This discourages two different latent codes from generating the same output (non-injective mapping). During training, the learned encoder is trained to find a latent code vector that corresponds to the ground truth output image, while passing enough information to the generator to resolve any ambiguities about the mode of output. For example, when generating a day image from a night one (Figure 1), the latent vector may encode information about the sky color, lighting effects on the ground and the density and shape of clouds. In both cases, applying the encoder and generator, in either order, should be consistent, resulting in either the same latent code or the same image.

Figure 1: Multimodal image-to-image translation using our proposed method: given an input image from one domain (night image of a scene), we aim to model a distribution of potential outputs (corresponding day images) in the target domain, producing both realistic and diverse results.

In this work, we instantiate this idea by exploring several objective functions inspired by the literature in unconditional generative modeling:

  • cVAE-GAN (Conditional Variational Autoencoder GAN): One popular approach to modeling a multimodal output distribution is to learn the distribution of the latent encoding given the output, as popularized by VAEs (Kingma and Welling, 2014). In the conditional setup (similar to the unconditional analogue (Larsen et al., 2016)), we enforce that the latent distribution encoded from the desired output image maps back to the output via the conditional generator. The latent distribution is regularized using KL-divergence to be close to a standard normal distribution, so that random codes can be sampled during inference. This variational objective is then optimized jointly with the discriminator loss.

  • cLR-GAN (Conditional Latent Regressor GAN): Another approach to enforcing mode capture in the latent encoding is to explicitly model the inverse mapping. Starting from a randomly sampled latent encoding, the conditional generator should produce an output which, when given as input to the encoder, yields the same latent code, enforcing self-consistency. This method can be seen as a conditional formulation of the “latent regressor” model (Donahue et al., 2016; Dumoulin et al., 2016) and is also related to InfoGAN (Chen et al., 2016).

  • BicycleGAN: Finally, we combine both these approaches to enforce the connection between latent encoding and output in both directions jointly and achieve improved performance. We show that our method can produce both diverse and visually appealing results across a wide range of image-to-image translation problems, significantly more diverse than other baselines, including naively adding noise in the pix2pix framework. In addition to the loss function, we study the performance with respect to several encoder networks, as well as different ways of injecting the latent code into the generator network.

We perform a systematic evaluation of these variants by using real humans to judge photo-realism and an automated distance metric to assess output diversity. Code and data are available at

Figure 2: Overview: (a) Test time usage of all the methods. To produce a sample output, a latent code $z$ is first randomly sampled from a known distribution (e.g., a standard normal distribution). A generator $G$ maps input image $A$ (blue) and latent sample $z$ to produce output sample $\hat{B}$ (yellow). (b) pix2pix+noise (Isola et al., 2017) baseline, with additional input $B$ (brown) that corresponds to $A$. (c) cVAE-GAN (and cAE-GAN) start from ground truth target image $B$ and encode it into the latent space. The generator then attempts to map the input image $A$ along with a sampled $z$ back into the original image $B$. (d) cLR-GAN randomly samples a latent code from a known distribution, uses it to map $A$ into the output $\hat{B}$, and then tries to reconstruct the latent code from it. (e) Our hybrid BicycleGAN method combines constraints in both directions.

2 Related Work

Generative modeling

Parametric modeling of the natural image distribution is a challenging problem. Classically, this problem has been tackled using autoencoders (Hinton and Salakhutdinov, 2006; Vincent et al., 2008) or Restricted Boltzmann machines (Smolensky, 1986). Variational autoencoders (Kingma and Welling, 2014) provide an effective approach to model stochasticity within the network by reparametrization of a latent distribution. Other approaches to modeling stochasticity include autoregressive models (Oord et al., 2016a, b) which can model multimodality via sequential conditional prediction. These approaches are trained with a pixel-wise independent loss on samples of natural images using maximum likelihood and stochastic back-propagation. This is a disadvantage because two images, which are close regarding a pixel-wise independent metric, may be far apart on the manifold of natural images. Generative adversarial networks (Goodfellow et al., 2014) overcome this issue by learning the loss function using a discriminator network, and have recently been very successful (Denton et al., 2015; Radford et al., 2016; Donahue et al., 2016; Dumoulin et al., 2016; Reed et al., 2016; Zhao et al., 2017; Zhu et al., 2016; Arjovsky and Bottou, 2017; Zhang et al., 2017a; Chen et al., 2016). Our method builds on the conditional version of VAE (Kingma and Welling, 2014) and InfoGAN (Chen et al., 2016) or latent regressor (Donahue et al., 2016) model via alternating joint optimization to learn diverse and realistic samples. We revisit this connection in Section 3.4.

Conditional image generation

Potentially, all of the methods defined above could be easily conditioned. While conditional VAEs (Sohn et al., 2015), PixelCNN (van den Oord et al., 2016), and conditional autoregressive models (Oord et al., 2016b, a) have shown promise (Walker et al., 2016; Xue et al., 2016; Guadarrama et al., 2017), image-to-image conditional GANs have led to a substantial boost in the quality of the results. However, the quality has been attained at the expense of multimodality, as the generator learns to largely ignore the random noise vector when conditioned on a relevant context (Pathak et al., 2016; Sangkloy et al., 2017; Xian et al., 2017; Yang et al., 2017; Isola et al., 2017; Zhu et al., 2017). In fact, it has even been shown that ignoring the noise leads to more stable training (Mathieu et al., 2016; Pathak et al., 2016; Isola et al., 2017).

Explicitly-encoded multimodality

One way to express multiple modes in the output is to encode them, conditioned on some mode-related context in addition to the input image. For example, color and shape scribbles and other interfaces were used as conditioning in iGAN (Zhu et al., 2016), pix2pix (Isola et al., 2017), Scribbler (Sangkloy et al., 2017) and interactive colorization (Zhang et al., 2017b). An effective option explored by concurrent work (Ghosh et al., 2017; Chen and Koltun, 2017; Bansal et al., 2017) is to use a mixture of models. Though able to produce multiple discrete answers, these methods are unable to produce continuous changes. While there has been some degree of success for generating multimodal outputs in unconditional and text-conditional setups (Goodfellow et al., 2014; Nguyen et al., 2017; Reed et al., 2016; Dinh et al., 2017; Larsen et al., 2016), conditional image-to-image generation is still far from achieving the same results, unless explicitly encoded as discussed above. In this work, we learn conditional generation models for modeling multiple modes of output by enforcing tight connections in both image and latent space.

3 Multimodal Image-to-Image Translation

Our goal is to learn a multimodal mapping between two image domains, for example, edges and photographs, or day and night images. Consider the input domain $\mathcal{A} \subset \mathbb{R}^{H \times W \times 3}$ which is to be mapped to an output domain $\mathcal{B} \subset \mathbb{R}^{H \times W \times 3}$. During training, we are given a dataset of paired instances from these domains, $\{(A \in \mathcal{A}, B \in \mathcal{B})\}$, which is representative of a joint distribution $p(A, B)$. It is important to note that there could be multiple plausible paired instances $B$ that would correspond to an input instance $A$, but the training dataset usually contains only one such pair. However, given a new instance $A$ during test time, our model should be able to generate a diverse set of outputs $\hat{B}$, corresponding to different modes in the distribution $p(B|A)$.

While conditional GANs have achieved success in image-to-image translation tasks (Isola et al., 2017; Zhu et al., 2017), they are primarily limited to generating a deterministic output $\hat{B}$ given the input image $A$. On the other hand, we would like to learn the mapping that could sample the output $\hat{B}$ from the true conditional distribution given $A$, and produce results which are both diverse and realistic. To do so, we learn a low-dimensional latent space $z \in \mathbb{R}^Z$, which encapsulates the ambiguous aspects of the output mode which are not present in the input image. For example, a sketch of a shoe could map to a variety of colors and textures, which could get compressed in this latent code. We then learn a deterministic mapping $G: (A, z) \to B$ to the output. To enable stochastic sampling, we desire the latent code vector $z$ to be drawn from some prior distribution $p(z)$; we use a standard Gaussian distribution $\mathcal{N}(0, I)$ in this work.

We first discuss a simple extension of existing methods and its strengths and weaknesses, motivating the development of our proposed approach in the subsequent subsections.

3.1 Baseline: pix2pix+noise ($z \to \hat{B}$)

The recently proposed pix2pix model (Isola et al., 2017) has shown high-quality results in the image-to-image translation setting. It uses conditional adversarial networks (Goodfellow et al., 2014; Mirza and Osindero, 2014) to help produce perceptually realistic results. GANs train a generator $G$ and discriminator $D$ by formulating their objective as an adversarial game. The discriminator attempts to differentiate between real images from the dataset and fake samples produced by the generator. Randomly drawn noise $z$ is added to attempt to induce stochasticity. We illustrate the formulation in Figure 2(b) and describe it below.

$$\mathcal{L}_{\text{GAN}}(G, D) = \mathbb{E}_{A, B \sim p(A, B)}[\log D(A, B)] + \mathbb{E}_{A \sim p(A),\, z \sim p(z)}[\log(1 - D(A, G(A, z)))] \tag{1}$$

To encourage the output of the generator to match the input as well as stabilize the GAN training, we use an $\ell_1$ loss between the output and the ground truth image.

$$\mathcal{L}_1^{\text{image}}(G) = \mathbb{E}_{A, B \sim p(A, B),\, z \sim p(z)} \|B - G(A, z)\|_1 \tag{2}$$

The final loss function uses the GAN and $\ell_1$ terms, balanced by $\lambda$.

$$G^* = \arg\min_G \max_D \; \mathcal{L}_{\text{GAN}}(G, D) + \lambda \mathcal{L}_1^{\text{image}}(G) \tag{3}$$

In this scenario, there is no incentive for the generator to make use of the noise vector, which encodes random information. It was also noted in the preliminary experiments in (Isola et al., 2017) that the generator simply ignored the added noise, and hence the noise was removed in their final experiments. This observation is consistent with Pathak et al. (2016) and Mathieu et al. (2016), and with the mode collapse phenomenon observed in unconditional cases (Salimans et al., 2016; Goodfellow, 2016). In this paper, we explore different ways to explicitly force the generator to use the latent encoding by making it capture relevant information rather than being random.

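3.2 Conditional Variational Autoencoder GAN: cVAE-GAN ($B \to z \to \hat{B}$)

One way to force the latent code $z$ to be “useful” is to directly map the ground truth $B$ to it using an encoding function $E$. The generator $G$ then uses both the latent code and the input image $A$ to synthesize the desired output $\hat{B}$. The overall model can be easily understood as the reconstruction of $B$, with latent encoding $z$ concatenated with the paired $A$ in the middle, similar to an autoencoder (Hinton and Salakhutdinov, 2006). This interpretation is better shown in Figure 2(c).

This approach has been successfully investigated in Variational Auto-Encoders (Kingma and Welling, 2014) in the unconditional scenario without the adversarial objective. Extending it to the conditional scenario, the distribution $Q(z|B)$ of latent code $z$ is modeled using the encoder $E$ with a Gaussian assumption, $Q(z|B) \triangleq E(B)$. To reflect this, Equation 1 is modified to sample $z \sim E(B)$ using the re-parameterization trick, allowing direct back-propagation (Kingma and Welling, 2014).

$$\mathcal{L}_{\text{GAN}}^{\text{VAE}}(G, D, E) = \mathbb{E}_{A, B \sim p(A, B)}[\log D(A, B)] + \mathbb{E}_{A, B \sim p(A, B),\, z \sim E(B)}[\log(1 - D(A, G(A, z)))] \tag{4}$$

We make the corresponding change in the $\ell_1$ loss term in Equation 2 as well to obtain $\mathcal{L}_1^{\text{VAE}}(G, E) = \mathbb{E}_{A, B \sim p(A, B),\, z \sim E(B)} \|B - G(A, z)\|_1$. Further, the latent distribution encoded by $E(B)$ is encouraged to be close to a random Gaussian, so that $z$ can be sampled at inference time (i.e., when $B$ is not known).

$$\mathcal{L}_{\text{KL}}(E) = \mathbb{E}_{B \sim p(B)}[\mathcal{D}_{\text{KL}}(E(B) \,\|\, \mathcal{N}(0, I))] \tag{5}$$

where $\mathcal{D}_{\text{KL}}(p \,\|\, q) = \int p(z) \log \frac{p(z)}{q(z)} \, dz$. This forms our cVAE-GAN objective, a conditional version of the VAE-GAN (Larsen et al., 2016):

$$G^*, E^* = \arg\min_{G, E} \max_D \; \mathcal{L}_{\text{GAN}}^{\text{VAE}}(G, D, E) + \lambda \mathcal{L}_1^{\text{VAE}}(G, E) + \lambda_{\text{KL}} \mathcal{L}_{\text{KL}}(E) \tag{6}$$

As a baseline, we also consider the deterministic version of this approach, i.e., dropping the KL-divergence and encoding $z = E(B)$. We call it cAE-GAN and show a comparison in the experiments. However, cAE-GAN gives no guarantee on the distribution of the latent space $z$, which makes test-time sampling of $z$ difficult.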
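The re-parameterization trick and the KL term can be made concrete with a minimal sketch. It assumes the encoder outputs a mean and a log-variance per latent dimension (a common parameterization of a diagonal Gaussian, not confirmed by the text above), and the numeric values are hypothetical. For a diagonal Gaussian, the KL divergence to $\mathcal{N}(0, I)$ has the closed form $\frac{1}{2}\sum_i (\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1)$.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), elementwise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, logvar))


rng = random.Random(0)
mu = [0.5, -1.0]      # hypothetical encoder mean for one image B
logvar = [0.0, -2.0]  # hypothetical encoder log-variance
z = reparameterize(mu, logvar, rng)
kl = kl_to_standard_normal(mu, logvar)
```

Because the randomness enters only through `eps`, the sample `z` is a differentiable function of `mu` and `logvar`, which is what allows gradients to reach the encoder. The KL term is zero exactly when the encoder outputs the prior (`mu = 0`, `logvar = 0`).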

3.3 Conditional Latent Regressor GAN: cLR-GAN ($z \to \hat{B} \to \hat{z}$)

We explore another method of forcing the generator network to utilize the latent code embedding $z$, while staying close to the actual test time distribution $p(z)$, but from the latent code’s perspective. We start from the latent code $z$, as shown in Figure 2(d), and then enforce that $E(G(A, z))$ maps back to the randomly drawn latent code with an $\ell_1$ loss. Note that the encoder $E$ here is producing a point estimate for $\hat{z}$, whereas the encoder in the previous section was predicting a Gaussian distribution.

$$\mathcal{L}_1^{\text{latent}}(G, E) = \mathbb{E}_{A \sim p(A),\, z \sim p(z)} \|z - E(G(A, z))\|_1 \tag{7}$$

We also include the discriminator loss $\mathcal{L}_{\text{GAN}}(G, D)$ (Equation 1) on $\hat{B}$ to encourage the network to generate realistic results, and the full loss can be written as:

$$G^*, E^* = \arg\min_{G, E} \max_D \; \mathcal{L}_{\text{GAN}}(G, D) + \lambda_{\text{latent}} \mathcal{L}_1^{\text{latent}}(G, E) \tag{8}$$

The $\ell_1$ loss for the ground truth image $B$ is not used. In this case, since the noise vector is randomly drawn, we do not want the predicted $\hat{B}$ to be the ground truth, but rather a realistic result. The above objective bears similarity to the “latent regressor” model (Donahue et al., 2016; Dumoulin et al., 2016; Chen et al., 2016), where the generated sample $\hat{B}$ is encoded to generate a latent vector.
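To see why the latent regression term discourages the generator from ignoring the latent code, consider a toy sketch in which images and codes are scalars. The functions `g_uses_z`, `g_ignores_z` and `make_encoder` are hypothetical stand-ins, not the networks used in this work, and the toy encoder is built for one fixed input for simplicity.

```python
import random


def latent_l1(generator, encoder, a, zs):
    """Monte Carlo estimate of E_z |z - E(G(A, z))| for scalar codes."""
    return sum(abs(z - encoder(generator(a, z))) for z in zs) / len(zs)


# Toy stand-ins (not the paper's networks): images and codes are scalars.
def g_uses_z(a, z):
    return a + 0.1 * z  # output varies with the latent code


def g_ignores_z(a, z):
    return a            # collapsed generator: same output for every z


def make_encoder(a):
    # Encoder that inverts g_uses_z for this fixed input a.
    return lambda b_hat: (b_hat - a) / 0.1


rng = random.Random(0)
zs = [rng.gauss(0.0, 1.0) for _ in range(1000)]
a = 2.0
loss_uses = latent_l1(g_uses_z, make_encoder(a), a, zs)
loss_ignores = latent_l1(g_ignores_z, make_encoder(a), a, zs)
```

When the output carries the code, the encoder can recover it and the loss is essentially zero. When the generator discards `z`, no encoder can recover it from the output, so the loss stays near the mean absolute value of the prior samples.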

3.4 Our Hybrid Model: BicycleGAN

We combine the cVAE-GAN and cLR-GAN objectives in a hybrid model. For cVAE-GAN, the encoding is learned from real data, but a random latent code may not yield realistic images at test time: the KL loss may not be well optimized, and perhaps more importantly, the adversarial classifier $D$ does not have a chance to see randomly sampled results during training. In cLR-GAN, the latent space is easily sampled from a simple distribution, but the generator is trained without the benefit of seeing ground truth input-output pairs. We propose to train with constraints in both directions, aiming to take advantage of both cycles ($B \to z \to \hat{B}$ and $z \to \hat{B} \to \hat{z}$), hence the name BicycleGAN.

$$G^*, E^* = \arg\min_{G, E} \max_D \; \mathcal{L}_{\text{GAN}}^{\text{VAE}}(G, D, E) + \lambda \mathcal{L}_1^{\text{VAE}}(G, E) + \mathcal{L}_{\text{GAN}}(G, D) + \lambda_{\text{latent}} \mathcal{L}_1^{\text{latent}}(G, E) + \lambda_{\text{KL}} \mathcal{L}_{\text{KL}}(E) \tag{9}$$

where the hyper-parameters $\lambda$, $\lambda_{\text{latent}}$, and $\lambda_{\text{KL}}$ control the importance of each term.

In the unconditional GAN setting, it has been observed that using samples from both random and encoded vectors helps further improve the results (Larsen et al., 2016). Hence, we also report one variant which is the full objective shown above (Equation 9), but without the reconstruction loss on the latent space $\mathcal{L}_1^{\text{latent}}$. We call it cVAE-GAN++, as it is based on cVAE-GAN with an additional loss $\mathcal{L}_{\text{GAN}}(G, D)$ that encourages the discriminator to also see the randomly generated samples.
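The relationship between the variants can be summarized as which loss terms each one switches on. The sketch below is illustrative only: the term names, loss values and weights are placeholders chosen for the example, not values from the experiments.

```python
# Which loss terms each variant in Section 3 switches on.
VARIANT_TERMS = {
    "cVAE-GAN":   {"gan_vae", "l1_vae", "kl"},
    "cLR-GAN":    {"gan", "l1_latent"},
    "cVAE-GAN++": {"gan_vae", "l1_vae", "kl", "gan"},
    "BicycleGAN": {"gan_vae", "l1_vae", "kl", "gan", "l1_latent"},
}


def total_objective(variant, terms, weights):
    """Weighted sum of the active loss terms for one variant.

    Terms without an entry in `weights` (the GAN terms) get weight 1.
    """
    active = VARIANT_TERMS[variant]
    return sum(weights.get(name, 1.0) * value
               for name, value in terms.items() if name in active)


# Placeholder numbers for illustration only, not values from the paper.
terms = {"gan_vae": 0.7, "l1_vae": 0.2, "kl": 1.5, "gan": 0.6, "l1_latent": 0.4}
weights = {"l1_vae": 2.0, "l1_latent": 0.5, "kl": 0.1}

bicycle = total_objective("BicycleGAN", terms, weights)
cvae = total_objective("cVAE-GAN", terms, weights)
clr = total_objective("cLR-GAN", terms, weights)
```

With shared weights, the BicycleGAN value equals the sum of the cVAE-GAN and cLR-GAN values, which mirrors how the hybrid objective is assembled from the two cycles.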

4 Implementation Details

The code and additional results are publicly available on our website. Please refer to supplement for more details about the datasets, architectures, and training procedures.

Figure 3: Alternatives for injecting z into generator. Latent code z is injected by spatial replication and concatenation into the generator network. We tried two alternatives, (left) injecting into the input layer and (right) every intermediate layer in the encoder.

Network architecture For generator $G$, we use the U-Net (Ronneberger et al., 2015), which contains an encoder-decoder architecture with symmetric skip connections. The architecture has been shown to produce strong results in the unimodal image prediction setting, when there is spatial correspondence between input and output pairs. For discriminator $D$, we use two PatchGAN discriminators (Isola et al., 2017) at different scales, which aim to predict real vs. fake for overlapping image patches of two different sizes. For the encoder $E$, we experiment with two networks: (1) E_CNN: a CNN with a few convolutional and downsampling layers and (2) E_ResNet: a classifier with several residual blocks (He et al., 2016).

Training details We build our model on the Least Squares GANs (LSGANs) variant (Mao et al., 2017), which uses a least-squares objective instead of a cross entropy loss. LSGANs produce high quality results with stable training. We also find that not conditioning the discriminator leads to better results (also discussed in (Pathak et al., 2016)), and hence choose to do the same for all methods. We set the parameters , and in all our experiments. In practice, we tie the weights for the generators in the cVAE-GAN and cLR-GAN. We observe that using two separate discriminators yields slightly better visual results compared to discriminators with shared weights. We train our networks from scratch using Adam (Kingma and Ba, 2015) with a batch size of and with learning rate of . We choose latent dimension to be across all the datasets.
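As a rough illustration of the least-squares objective, the sketch below uses target labels 1 for real and 0 for fake and a factor of 0.5, which is one common coding of LSGAN. The exact form used here is an assumption, and the discriminator scores are hypothetical.

```python
def lsgan_d_loss(d_real, d_fake):
    """Discriminator loss: push scores on real toward 1 and on fake toward 0."""
    real_term = sum((s - 1.0) ** 2 for s in d_real) / len(d_real)
    fake_term = sum(s ** 2 for s in d_fake) / len(d_fake)
    return 0.5 * (real_term + fake_term)


def lsgan_g_loss(d_fake):
    """Generator loss: push discriminator scores on fake samples toward 1."""
    return 0.5 * sum((s - 1.0) ** 2 for s in d_fake) / len(d_fake)


# Hypothetical discriminator outputs for a small batch.
d_real = [0.9, 1.1, 0.8]
d_fake = [0.2, -0.1, 0.4]
d_loss = lsgan_d_loss(d_real, d_fake)
g_loss = lsgan_g_loss(d_fake)
```

Unlike the cross-entropy loss, the squared penalty keeps growing for samples that are far from the target score even when they are on the correct side of the decision boundary.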

Injecting the latent code $z$ to generator $G$. How to propagate the information encoded by latent code $z$ to the image generation process is critical to our applications. We explore two choices as shown in Figure 3: (1) add_to_input: We simply extend a $Z$-dimensional latent code $z$ to an $H \times W \times Z$ spatial tensor and concatenate it with the $H \times W \times 3$ input image. (2) add_to_all: Alternatively, we add $z$ to each intermediate layer of the network $G$.
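The add_to_input option amounts to copying the latent code to every spatial position and concatenating along the channel axis. A minimal sketch with nested lists follows; the image size and code length are illustrative, and a real implementation would use a tensor library.

```python
def add_to_input(image, z):
    """Concatenate z to every pixel of an H x W x C image (nested lists).

    Returns an H x W x (C + len(z)) nested list.
    """
    return [[list(pixel) + list(z) for pixel in row] for row in image]


# Tiny 2 x 3 RGB image and a 2-dimensional latent code (illustrative sizes).
image = [[[0.1, 0.2, 0.3] for _ in range(3)] for _ in range(2)]
z = [1.5, -0.5]
stacked = add_to_input(image, z)
```

The add_to_all option would apply the same replicate-and-concatenate step to the intermediate feature maps, using each layer's own spatial size.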

Figure 4: Example Results We show example results of our hybrid model BicycleGAN. The left column shows the input. The second shows the ground truth output. The final four columns show randomly generated samples. We show results of our method on night→day, edges→shoes, edges→handbags, and maps→satellites. Models and additional examples are available at

Figure 5: Qualitative method comparison We compare results on the labels → facades dataset across different methods. The BicycleGAN method produces results which are both realistic and diverse.

5 Experiments

Method                               Realism: AMT Fooling Rate [%]   Diversity: VGG-16 Distance
Random real images                   50.0%                           3.520 ± .021
pix2pix+noise (Isola et al., 2017)   27.93 ± 2.40%                   0.338 ± .002
cAE-GAN                              13.64 ± 1.80%                   2.304 ± .012
cVAE-GAN                             24.93 ± 2.27%                   1.350 ± .013
cVAE-GAN++                           29.19 ± 2.43%                   1.425 ± .014
cLR-GAN                              29.23 ± 2.48%                   1.374 ± .022 [1]
BicycleGAN                           34.33 ± 2.69%                   1.469 ± .014

[1] We found that cLR-GAN resulted in severe mode collapse, resulting in a portion of the images producing the same result. Those images were omitted from this calculation.

Figure 6: Realism vs Diversity. We measure diversity using average feature distance in the VGG-16 space using cosine distance summed across five layers, and realism using a real vs. fake Amazon Mechanical Turk test on the Google maps → satellites task. The pix2pix+noise baseline produces little diversity. Using only the cAE-GAN method produces large artifacts during sampling. The hybrid BicycleGAN method, which combines cVAE-GAN++ and cLR-GAN, produces results which have higher realism while maintaining diversity.

Datasets We test our method on several image-to-image translation problems from prior work, including edges → photos (Yu and Grauman, 2014; Zhu et al., 2016), Google maps → satellite (Isola et al., 2017), labels → images (Cordts et al., 2016), and outdoor night → day images (Laffont et al., 2014). These problems are all one-to-many mappings. We train all the models on images. Code and data are available at

Methods We train the following models described in Section 3: pix2pix+noise, cAE-GAN, cVAE-GAN, cVAE-GAN++, cLR-GAN and the hybrid model BicycleGAN.

5.1 Qualitative Evaluation

We show qualitative comparison results in Figure 5. We observe that pix2pix+noise typically produces a single realistic output, but does not produce any meaningful variation. cAE-GAN adds variation to the output, but typically reduces the quality of results, as shown for an example on facades in Figure 4. We observe more variation in the cVAE-GAN, as the latent space is encouraged to encode information about ground truth outputs. However, the space is not densely populated, so drawing random samples may cause artifacts in the output. The cLR-GAN shows less variation in the output, and sometimes suffers from mode collapse. When combining these methods in the hybrid method BicycleGAN, however, we observe results which are both diverse and realistic. We show example results in Figure 4. Please see our website for a full set of results.

5.2 Quantitative Evaluation

We perform a quantitative analysis of the diversity, realism, and latent space distribution of our six variants and baselines. We quantitatively test on the Google maps → satellites dataset.


We randomly draw samples from our model and compute the average distance in a deep feature space. In the context of style transfer, image super-resolution (Johnson et al., 2016), and feature inversion (Dosovitskiy and Brox, 2016), pretrained networks have been used as a “perceptual loss” and explicitly optimized over. In the context of generative modeling, they have been used as a held-out “validation” score, for example to assess how semantic the samples from a generative model are (Salimans et al., 2016) or the semantic accuracy of a grayscale colorization (Zhang et al., 2016).

In Figure 6, we show the diversity score using the cosine distance, averaged across spatial dimensions, and summed across the five conv layers preceding the pool layers on the VGG-16 network (Simonyan and Zisserman, 2014), pre-trained for ImageNet classification (Russakovsky et al., 2015). The maximum score is 5.0, as all the feature responses are nonnegative. For each method, we compute the average distance between 1900 pairs of randomly generated output images (sampled from 100 input images). Random pairs of ground truth real images in the domain produce an average variation of 3.520 using cosine distance. As we are measuring samples which correspond to a specific input, a system which stays faithful to the input should definitely not exceed this score.
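The score described above can be sketched as follows, assuming the per-layer features have already been extracted and are nonzero at every position. The hand-made feature values stand in for real VGG-16 activations and are purely illustrative.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def diversity_score(feats_x, feats_y):
    """Sum over layers of the spatially averaged cosine distance.

    feats_x and feats_y are lists (one entry per layer) of lists (one entry
    per spatial position) of channel vectors.
    """
    total = 0.0
    for layer_x, layer_y in zip(feats_x, feats_y):
        dists = [cosine_distance(u, v) for u, v in zip(layer_x, layer_y)]
        total += sum(dists) / len(dists)
    return total


# Hand-made nonnegative "features" for five layers with two positions each.
same = [[[1.0, 2.0], [3.0, 1.0]] for _ in range(5)]
axis_x = [[[1.0, 0.0], [2.0, 0.0]] for _ in range(5)]
axis_y = [[[0.0, 1.0], [0.0, 3.0]] for _ in range(5)]
```

Identical features give a score of zero, while nonnegative features that never overlap give a distance of 1 per layer, which is why the maximum over five layers is 5.0.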

The pix2pix system (Isola et al., 2017) produces a single point estimate. Adding noise to the system (pix2pix+noise) produces a diversity score of 0.338, confirming the finding in (Isola et al., 2017) that adding noise does not produce large variation. Using a cAE-GAN model to encode the ground truth image into a latent code does increase the variation. The cVAE-GAN, cVAE-GAN++, and BicycleGAN models all place explicit constraints on the latent space, and the cLR-GAN model places an implicit constraint through sampling. These four methods all produce similar diversity scores. We note that high diversity scores may also indicate that nonrealistic images are being generated, causing meaningless variation. Next, we investigate the visual realism of our samples.

Perceptual Realism To judge the visual realism of our results, we use human judgments, as proposed in (Zhang et al., 2016) and later used in (Isola et al., 2017; Zhu et al., 2017). The test sequentially presents a real and generated image to a human for 1 second each, in a random order, asks them to identify the fake, and measures the “fooling" rate. Figure 6(left) shows the realism across methods. The pix2pix+noise model achieves high realism score, but without large diversity, as discussed in the previous section. The cAE-GAN helps produce diversity, but this comes at a large cost to the visual realism. Because the distribution of the learned latent space is unclear, random samples may be from unpopulated regions of the space. Adding the KL-divergence loss in the latent space, used in the cVAE-GAN model recovers the visual realism. Furthermore, as expected, checking randomly drawn vectors in the cVAE-GAN++ model slightly increases realism. The cLR-GAN, which draws vectors from the predefined distribution randomly, produces similar realism and diversity scores. However, the cLR-GAN model resulted in large mode collapse - approximately of the outputs produced the same result, independent of the input image. The full hybrid BicycleGAN gets the best of both worlds, as it does not suffer from mode collapse and also has the highest realism score by a significant margin.

Encoder          E_ResNet     E_ResNet       E_CNN        E_CNN
Injecting $z$    add_to_all   add_to_input   add_to_all   add_to_input
maps → satellite
Table 1: The encoding performance with respect to the different encoder architectures and methods of injecting $z$. Here we report the reconstruction loss $\|B - G(A, E(B))\|_1$.


Encoder architecture The authors of the pix2pix framework (Isola et al., 2017) have conducted extensive ablation studies on discriminators and generators. Here we focus on the performance of two encoders, E_CNN and E_ResNet, for our applications on the maps and facades datasets. We find that E_ResNet can better encode the output image, as measured by the image reconstruction loss $\|B - G(A, E(B))\|_1$ on validation datasets, as shown in Table 1.

Methods of injecting latent code We evaluate two ways of injecting latent code $z$: add_to_input and add_to_all (Section 4), using the same reconstruction loss $\|B - G(A, E(B))\|_1$. Table 1 shows that the two methods give similar performance. This indicates that the U-Net (Ronneberger et al., 2015) can already propagate the information well to the output without the additional skip connections from $z$.

Figure 7: Different labels → facades results trained with varying length of the latent code $z$.

Latent code length We study the BicycleGAN model results with respect to the varying number of dimensions of latent codes in Figure 7. A low-dimensional latent code may limit the amount of diversity that can be expressed by the model. On the contrary, a high-dimensional latent code can potentially encode more information about an output image, at the cost of making sampling quite difficult. The optimal length of $z$ largely depends on individual datasets and applications, and how much ambiguity there is in the output.

6 Conclusions

In conclusion, we have evaluated a few methods for combating the problem of mode collapse in the conditional generative setting. We find that by combining multiple objectives for encouraging a bijective mapping between the latent and output spaces, we obtain results which are more realistic and diverse. We see many interesting avenues of future work, including directly enforcing a distribution in the latent space that encodes semantically meaningful attributes to allow for image-to-image transformations with user controllable parameters.


We thank Phillip Isola and Tinghui Zhou for helpful discussions. This work was supported in part by Adobe Inc., DARPA, AFRL, DoD MURI award N000141110688, NSF awards IIS-1633310, IIS-1427425, IIS-1212798, the Berkeley Artificial Intelligence Research (BAIR) Lab, and hardware donations from NVIDIA. JYZ is supported by Facebook Graduate Fellowship, RZ by Adobe Research Fellowship, and DP by NVIDIA Graduate Fellowship.

