Mode Seeking Generative Adversarial Networks for Diverse Image Synthesis

03/13/2019 ∙ by Qi Mao, et al.

Most conditional generation tasks expect diverse outputs given a single conditional context. However, conditional generative adversarial networks (cGANs) often focus on the prior conditional information and ignore the input noise vectors, which contribute to the output variations. Recent attempts to resolve the mode collapse issue for cGANs are usually task-specific and computationally expensive. In this work, we propose a simple yet effective regularization term to address the mode collapse issue for cGANs. The proposed method explicitly maximizes the ratio of the distance between generated images with respect to the corresponding latent codes, thus encouraging the generators to explore more minor modes during training. This mode seeking regularization term is readily applicable to various conditional generation tasks without imposing training overhead or modifying the original network structures. We validate the proposed algorithm on three conditional image synthesis tasks including categorical generation, image-to-image translation, and text-to-image synthesis with different baseline models. Both qualitative and quantitative results demonstrate the effectiveness of the proposed regularization method for improving diversity without loss of quality.



Code Repositories


MSGAN: Mode Seeking Generative Adversarial Networks for Diverse Image Synthesis (CVPR2019)


1 Introduction

Generative adversarial networks (GANs) [8] have been shown to effectively capture complex, high-dimensional image data and have numerous applications. Built upon GANs, conditional GANs (cGANs) [20] take external information as additional inputs. For image synthesis, cGANs can be applied to various tasks with different conditional contexts. With class labels, cGANs can be applied to categorical image generation. With text sentences, cGANs can be applied to text-to-image synthesis [22, 31]. With images, cGANs have been used in tasks including image-to-image translation [10, 11, 15, 16, 34, 35], semantic manipulation [30] and style transfer [29].

For most conditional generation tasks, the mappings are inherently multimodal, i.e., a single input context corresponds to multiple plausible outputs. A straightforward approach to handling multimodality is to take random noise vectors along with the conditional contexts as inputs, where the contexts determine the main content and the noise vectors are responsible for variations. For instance, in the dog-to-cat image-to-image translation task [15], the input dog images determine content such as the orientation of heads and the positions of facial landmarks, while the noise vectors help generate different species. However, cGANs usually suffer from the mode collapse problem [8, 25], where generators produce samples from only a single mode or a few modes of the distribution and ignore the others. The noise vectors are ignored or have only a minor impact, since cGANs pay more attention to learning from the high-dimensional and structured conditional contexts.

There are two main approaches to address the mode collapse problem in GANs. A number of methods focus on discriminators by introducing different divergence metrics [1, 18] and optimization processes [6, 19, 25]. The other methods use auxiliary networks such as multiple generators [7, 17] and additional encoders [2, 4, 5, 26]. However, mode collapse is relatively less studied in cGANs. Some recent efforts have been made in the image-to-image translation task to improve diversity [10, 15, 35]. Similar to the second category in the unconditional setting, these approaches introduce additional encoders and loss functions to encourage a one-to-one relationship between the output and the latent code. These methods either entail heavy computational overheads on training or require auxiliary networks that are often task-specific and cannot be easily extended to other frameworks.

In this work, we propose a mode seeking regularization method that can be applied to cGANs for various tasks to alleviate the mode collapse problem. Given two latent vectors and the corresponding output images, we propose to maximize the ratio of the distance between images with respect to the distance between latent vectors. In other words, this regularization term encourages generators to generate dissimilar images during training. As a result, generators can explore the target distribution, and enhance the chances of generating samples from different modes. On the other hand, we can train the discriminators with dissimilar generated samples to provide gradients from minor modes that are likely to be ignored otherwise. This mode seeking regularization method incurs marginal computational overheads and can be easily embedded in different cGAN frameworks to improve the diversity of synthesized images.

We validate the proposed regularization algorithm through an extensive evaluation of three conditional image synthesis tasks with different baseline models. First, for categorical image generation, we apply the proposed method on DCGAN [21] using the CIFAR-10 [13] dataset. Second, for image-to-image translation, we embed the proposed regularization scheme in Pix2Pix [11] and DRIT [15] using the facades [3], maps [11], Yosemite [34], and cat↔dog [15] datasets. Third, for text-to-image synthesis, we incorporate StackGAN++ [31] with the proposed regularization term using the CUB-200-2011 [28] dataset. We evaluate the diversity of synthesized images using perceptual distance metrics [33].

However, the diversity metric alone cannot guarantee the similarity between the distribution of generated images and the distribution of real data. Therefore, we adopt two recently proposed bin-based metrics [23]: the Number of Statistically-Different Bins (NDB) metric, which determines the relative proportions of samples falling into clusters predetermined by real data, and the Jensen-Shannon Divergence (JSD), which measures the similarity between bin distributions. Furthermore, to verify that we do not achieve diversity at the expense of realism, we evaluate our method with the Fréchet Inception Distance (FID) [9] as the metric for quality. Experimental results demonstrate that the proposed regularization method helps existing models from various applications achieve better diversity without loss of image quality. Figure 1 shows the effectiveness of the proposed regularization method for existing models.

The main contributions of this work are:

  • We propose a simple yet effective mode seeking regularization method to address the mode collapse problem in cGANs. This regularization scheme can be readily extended into existing frameworks with marginal training overheads and modifications.

  • We demonstrate the generalizability of the proposed regularization method on three different conditional generation tasks: categorical generation, image-to-image translation, and text-to-image synthesis.

  • Extensive experiments show that the proposed method helps existing models from different tasks achieve better diversity without sacrificing the visual quality of the generated images.

Our code and pre-trained models are available at

Figure 2: Illustration of motivation. The real data distribution contains numerous modes. However, when mode collapse occurs, generators only produce samples from a few modes. From the data distribution when mode collapse occurs, we observe that for latent vectors $\mathbf{z}_1$ and $\mathbf{z}_2$, the distance between their mapped images $\mathbf{I}_1$ and $\mathbf{I}_2$ becomes shorter at a disproportionate rate as the distance between the two latent vectors decreases. We present on the right the ratio of the distance between images with respect to the distance of the corresponding latent vectors, where we can spot an anomalous case (colored in red) where mode collapse occurs. The observation motivates us to leverage the ratio as the training objective explicitly.

2 Related Work

Conditional generative adversarial networks.

Generative adversarial networks [1, 8, 18, 21] have been widely used for image synthesis. With adversarial training, generators are encouraged to capture the distribution of real images. On the basis of GANs, conditional GANs synthesize images based on various contexts. For instance, cGANs can generate high-resolution images conditioned on low-resolution images [14], translate images between different visual domains [10, 11, 15, 16, 34, 35], generate images with a desired style [29], and synthesize images according to sentences [22, 31]. Although cGANs have achieved success in various applications, existing approaches suffer from the mode collapse problem. Since the conditional contexts provide strong structural prior information for the output images and have higher dimensions than the input noise vectors, generators tend to ignore the input noise vectors, which are responsible for the variation of generated images. As a result, the generators are prone to produce images with similar appearances. In this work, we aim to address the mode collapse problem for cGANs.

Reducing mode collapse.

Some methods focus on the discriminator, using different optimization processes [19] and divergence metrics [1, 18] to stabilize training. The minibatch discrimination scheme [25] allows the discriminator to discriminate between whole mini-batches of samples instead of between individual samples. In [6], Durugkar et al. use multiple discriminators to address this issue. The other methods use auxiliary networks to alleviate the mode collapse issue. ModeGAN [2] and VEEGAN [26] enforce a bijective mapping between the input noise vectors and generated images with additional encoder networks. Multiple generators [7] and weight-sharing generators [17] are developed to capture more modes of the distribution. However, these approaches either entail heavy computational overheads or require modifications of the network structure, and may not be easily applicable to cGANs.

In the field of cGANs, some efforts [10, 15, 35] have been recently made to address the mode collapse issue on the image-to-image translation task. Similar to ModeGAN and VEEGAN, additional encoders are introduced to provide a bijection constraint between the generated images and input noise vectors. However, these approaches require other task-specific networks and objective functions. The additional components make the methods less generalizable and incur extra computational loads on training. In contrast, we propose a simple regularization term that imposes no training overheads and requires no modifications of the network structure. Therefore, the proposed method can be readily applied to various conditional generation tasks. Recently, the concurrent work [32] also adopts a loss term similar to our work for reducing mode collapse for cGANs.

3 Diverse Conditional Image Synthesis

3.1 Preliminaries

The training process of GANs can be formulated as a mini-max problem: a discriminator $D$ learns to be a classifier by assigning higher discriminative values to the real data samples and lower ones to the generated ones. Meanwhile, a generator $G$ aims to fool $D$ by synthesizing realistic examples. Through adversarial training, the gradients from $D$ will guide $G$ toward generating samples with a distribution similar to that of the real data.

The mode collapse problem with GANs is well known in the literature. Several methods [2, 25, 26] attribute missing modes to the lack of a penalty when this issue occurs. Since all modes usually have similar discriminative values, larger modes are likely to be favored through the training process based on gradient descent. On the other hand, it is difficult to generate samples from minor modes.

The mode missing problem becomes worse in cGANs. Generally, conditional contexts are high-dimensional and structured (e.g., images and sentences) as opposed to the noise vectors. As such, the generators are likely to focus on the contexts and ignore the noise vectors, which account for diversity.

3.2 Mode Seeking GANs

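In this work, we propose to alleviate the missing mode problem from the generator perspective. Figure 2 illustrates the main ideas of our approach. Let a latent vector $\mathbf{z}$ from the latent code space $\mathcal{Z}$ be mapped to the image space $\mathcal{I}$. When mode collapse occurs, the mapped images are collapsed into a few modes. Furthermore, when two latent codes $\mathbf{z}_1$ and $\mathbf{z}_2$ are closer, the mapped images $\mathbf{I}_1 = G(\mathbf{c}, \mathbf{z}_1)$ and $\mathbf{I}_2 = G(\mathbf{c}, \mathbf{z}_2)$, generated under the same conditional context $\mathbf{c}$, are more likely to be collapsed into the same mode. To address this issue, we propose a mode seeking regularization term to directly maximize the ratio of the distance between $G(\mathbf{c}, \mathbf{z}_1)$ and $G(\mathbf{c}, \mathbf{z}_2)$ with respect to the distance between $\mathbf{z}_1$ and $\mathbf{z}_2$,

$$\mathcal{L}_{ms} = \max_G \left( \frac{d_{\mathbf{I}}(G(\mathbf{c}, \mathbf{z}_1), G(\mathbf{c}, \mathbf{z}_2))}{d_{\mathbf{z}}(\mathbf{z}_1, \mathbf{z}_2)} \right), \quad (1)$$

where $d_{*}(\cdot)$ denotes the distance metric.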
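The ratio described above can be illustrated with a small, framework-agnostic sketch. It is not the authors' implementation: the mean absolute difference used as the distance, the function names, and the two toy generators are assumptions made for illustration only. In actual training the same quantity would be computed on tensors so that gradients reach the generator.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat vectors.

    Assumption: used here as the distance for both images and latent codes.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mode_seeking_ratio(image_1, image_2, z_1, z_2):
    """Distance between two generated images divided by the distance
    between the latent codes that produced them."""
    latent_gap = l1_distance(z_1, z_2)
    if latent_gap == 0.0:
        raise ValueError("the two latent codes must differ")
    return l1_distance(image_1, image_2) / latent_gap


# Hypothetical toy generators mapping (context c, latent z) to a 4-pixel image.
def collapsed_generator(c, z):
    # Ignores z entirely, mimicking mode collapse.
    return [c, c, c, c]


def diverse_generator(c, z):
    # Uses z, so different latent codes give different images.
    return [c + z[0], c - z[0], c + z[1], c - z[1]]


z_1, z_2 = [0.0, 0.0], [1.0, 2.0]
ratio_collapsed = mode_seeking_ratio(
    collapsed_generator(0.5, z_1), collapsed_generator(0.5, z_2), z_1, z_2
)
ratio_diverse = mode_seeking_ratio(
    diverse_generator(0.5, z_1), diverse_generator(0.5, z_2), z_1, z_2
)
```

A generator that ignores the latent code yields a ratio of zero, which is exactly the case the regularization term is meant to push away from.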

The regularization term offers a virtuous circle for training cGANs. It encourages the generator to explore the image space and enhances the chances of generating samples of minor modes. On the other hand, the discriminator is forced to pay attention to generated samples from minor modes. Figure 2 shows a mode collapse situation where two close samples, $\mathbf{z}_1$ and $\mathbf{z}_2$, are mapped onto the same mode $M_2$. However, with the proposed regularization term, $\mathbf{z}_1$ is mapped to $\mathbf{I}_a$, which belongs to an unexplored mode $M_1$. With the adversarial mechanism, the generator will thus have better chances to generate samples of $M_1$ in the following training steps.

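As shown in Figure 5, the proposed regularization term can be easily integrated with existing cGANs by appending it to the original objective function:

$$\mathcal{L}_{new} = \mathcal{L}_{ori} + \lambda_{ms} \mathcal{L}_{ms}, \quad (2)$$

where $\mathcal{L}_{ori}$ denotes the original objective function and $\lambda_{ms}$ the weight that controls the importance of the regularization. Here, $\mathcal{L}_{ori}$ can be a simple loss function. For example, in the categorical generation task,

$$\mathcal{L}_{ori} = \mathbb{E}_{\mathbf{c}, \mathbf{y}}[\log D(\mathbf{c}, \mathbf{y})] + \mathbb{E}_{\mathbf{c}, \mathbf{z}}[\log (1 - D(\mathbf{c}, G(\mathbf{c}, \mathbf{z})))], \quad (3)$$

where $\mathbf{c}$, $\mathbf{y}$, and $\mathbf{z}$ denote class labels, real images, and noise vectors, respectively. In the image-to-image translation task [11],

$$\mathcal{L}_{ori} = \mathcal{L}_{GAN} + \mathbb{E}_{\mathbf{x}, \mathbf{y}, \mathbf{z}}[\lVert \mathbf{y} - G(\mathbf{x}, \mathbf{z}) \rVert_1], \quad (4)$$

where $\mathbf{x}$ denotes input images and $\mathcal{L}_{GAN}$ is the typical GAN loss. $\mathcal{L}_{ori}$ can be an arbitrarily complex objective function from any task, as shown in Figure 5 (b). We name the proposed method Mode Seeking GANs (MSGANs).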
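One practical question is how a term that should be maximized is appended to a generator loss that is minimized. The sketch below shows one possible way, under stated assumptions: the image-to-latent distance ratio is turned into a penalty by taking its reciprocal, with a small `eps` for numerical safety. The function name, the default weight of 1.0, and `eps` are hypothetical choices for illustration; the text above only specifies that the ratio is maximized and weighted.

```python
def generator_objective(original_loss, ms_ratio, lambda_ms=1.0, eps=1e-5):
    """Original generator loss plus a weighted mode seeking penalty.

    original_loss: value of the task's own objective (any cGAN loss).
    ms_ratio: image distance divided by latent distance for one pair.
    """
    if ms_ratio < 0.0:
        raise ValueError("a distance ratio cannot be negative")
    # Assumption: maximizing the ratio is implemented as minimizing its
    # reciprocal; eps avoids division by zero when outputs fully collapse.
    ms_penalty = 1.0 / (ms_ratio + eps)
    return original_loss + lambda_ms * ms_penalty


# Same original loss, different amounts of output variation.
loss_collapsed = generator_objective(original_loss=0.7, ms_ratio=0.01)
loss_diverse = generator_objective(original_loss=0.7, ms_ratio=2.0)
```

With this form, a pair of nearly identical outputs produces a large penalty, while the original objective and the network architecture are left untouched.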

(a) Proposed regularization
(b) Applying proposed regularization on StackGAN++
Figure 5: Proposed regularization. (a) We propose a regularization term that maximizes the ratio of the distance between generated images with respect to the distance between their corresponding input latent codes. (b) The proposed regularization method can be applied to arbitrary cGANs. Taking StackGAN++ [31], a model for text-to-image synthesis, as an example, we can easily apply the regularization term regardless of the complex tree-like structure of the original model.

4 Experiments

We evaluate the proposed regularization method through extensive quantitative and qualitative evaluation. We apply MSGANs to the baseline models from three representative conditional image synthesis tasks: categorical generation, image-to-image translation, and text-to-image synthesis. Note that we augment the original objective functions with the proposed regularization term while maintaining the original network architectures and hyper-parameters. We employ the $L_1$ norm distance as the distance metric for both $d_{\mathbf{I}}$ and $d_{\mathbf{z}}$ and set the hyper-parameter $\lambda_{ms} = 1$ in all experiments. For more implementation and evaluation details, please refer to the appendixes.

Table 1: NDB and JSD results on the CIFAR-10 dataset. Rows: metrics and models; columns: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.

4.1 Evaluation Metrics

We conduct evaluations using the following metrics.

FID. To evaluate the quality of the generated images, we use FID [9] to measure the distance between the generated distribution and the real one through features extracted by the Inception Network [27]. Lower FID values indicate better quality of the generated images.
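For reference, FID is commonly written as the Fréchet distance between two Gaussians fitted to the Inception features of the real and generated sets. The symbols $\mu_r, \Sigma_r$ (mean and covariance of real features) and $\mu_g, \Sigma_g$ (mean and covariance of generated features) are notation introduced here, not taken from the text above.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Both terms vanish when the two feature distributions have identical means and covariances, which is why lower values indicate generated images closer to the real ones.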

LPIPS. To evaluate diversity, we employ LPIPS [33] following [10, 15, 35]. LPIPS measures the average feature distance between generated samples. A higher LPIPS score indicates better diversity among the generated images.
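The way a pairwise perceptual distance becomes a diversity score can be sketched as follows. This is only an illustration: `toy_distance` is a hypothetical stand-in for the LPIPS network, and the sketch averages over all pairs, whereas the evaluation in Appendix B samples random pairs.

```python
from itertools import combinations


def average_pairwise_distance(samples, distance):
    """Mean distance over all unordered pairs of samples."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def toy_distance(a, b):
    # Hypothetical stand-in for a learned perceptual distance such as LPIPS:
    # mean absolute difference between two flat "images".
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Three outputs generated for the same conditional context.
collapsed = [[0.2, 0.2], [0.2, 0.2], [0.2, 0.2]]
diverse = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

score_collapsed = average_pairwise_distance(collapsed, toy_distance)
score_diverse = average_pairwise_distance(diverse, toy_distance)
```

Identical outputs give a score of zero, so a model whose outputs ignore the latent code is penalized by this metric regardless of image quality.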

NDB and JSD. To measure the similarity between the distributions of real and generated images, we adopt two bin-based metrics, NDB and JSD, proposed in [23]. These metrics evaluate the extent of mode missing in generative models. Following [23], the training samples are first clustered using K-means into different bins, which can be viewed as modes of the real data distribution. Then each generated sample is assigned to the bin of its nearest neighbor. We calculate the bin proportions of the training samples and the synthesized samples to evaluate the difference between the generated distribution and the real data distribution. The NDB score and the JSD of the bin proportions are then computed to measure mode collapse. A lower NDB score and JSD mean that the generated data distribution approaches the real data distribution better by fitting more modes. Please refer to [23] for more details.
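The bin-based procedure can be sketched in a few lines. This is a simplified illustration rather than the official implementation of [23]: the cluster centroids are assumed to be given (K-means is omitted), each sample is assigned to its nearest centroid, and a bin counts as statistically different when a two-proportion z-test exceeds a threshold of 1.96. The function names, the threshold, and the toy data are all assumptions.

```python
import math


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bin_counts(samples, centroids):
    """Count how many samples fall into the bin of each centroid."""
    counts = [0] * len(centroids)
    for sample in samples:
        nearest = min(
            range(len(centroids)),
            key=lambda k: squared_distance(sample, centroids[k]),
        )
        counts[nearest] += 1
    return counts


def proportions(counts):
    total = sum(counts)
    return [count / total for count in counts]


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two bin distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def ndb(real_counts, fake_counts, z_threshold=1.96):
    """Number of bins whose real and generated proportions differ
    significantly under a two-proportion z-test (assumed threshold)."""
    n_real, n_fake = sum(real_counts), sum(fake_counts)
    different = 0
    for real, fake in zip(real_counts, fake_counts):
        pooled = (real + fake) / (n_real + n_fake)
        std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_real + 1 / n_fake))
        gap = abs(real / n_real - fake / n_fake)
        if std_err > 0 and gap / std_err > z_threshold:
            different += 1
    return different


# Toy 1-D data: two real modes, and a generator that covers only one of them.
centroids = [[0.0], [10.0]]
real_samples = [[0.0]] * 50 + [[10.0]] * 50
collapsed_samples = [[0.5]] * 100

real_counts = bin_counts(real_samples, centroids)
fake_counts = bin_counts(collapsed_samples, centroids)
ndb_score = ndb(real_counts, fake_counts)
jsd_score = jsd(proportions(real_counts), proportions(fake_counts))
```

In this toy case both bins are flagged, since one mode is over-represented and the other is missing entirely, which mirrors how the metrics expose mode collapse.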

4.2 Conditioned on Class Label

We first validate the proposed method on categorical generation. In categorical generation, networks take class labels as conditional contexts to synthesize images of different categories. We apply the regularization term to the baseline framework DCGAN [21].

We conduct experiments on the CIFAR-10 [13] dataset, which includes images of ten categories. Since images in the CIFAR-10 dataset are of size $32 \times 32$ and upsampling degrades the image quality, we do not compute LPIPS in this task. Table 1 and Table 2 present the results of NDB, JSD, and FID. MSGAN mitigates the mode collapse issue in most classes while maintaining image quality.

Table 2: FID results on the CIFAR-10 dataset.
Figure 6: Diversity comparison. The proposed regularization term helps Pix2Pix learn more diverse results.
Table 3: Quantitative results on the facades and maps datasets, comparing Pix2Pix [11], MSGAN, and BicycleGAN [35].
Figure 7: Diversity comparison. We compare MSGAN with DRIT on the dog-to-cat, cat-to-dog, and winter-to-summer translation tasks. Our model produces more diverse samples than DRIT.
Figure 8: Visualization of the bins on dog→cat translation. The translated results of DRIT collapse into a few modes, while the generated images of MSGAN fit the real data distribution better.

4.3 Conditioned on Image

Image-to-image translation aims to learn the mapping between two visual domains. Conditioned on images from the source domain, models attempt to synthesize corresponding images in the target domain. Despite the multimodal nature of the image-to-image translation task, early work [11, 34] abandons noise vectors and performs one-to-one mapping since the latent codes are easily ignored during training as shown in [11, 35]. To achieve multimodality, several recent attempts [10, 15, 35] introduce additional encoder networks and objective functions to impose a bijection constraint between the latent code space and the image space. To demonstrate the generalizability, we apply the proposed method to a unimodal model Pix2Pix [11] using paired training data and a multimodal model DRIT [15] using unpaired images.

Figure 9: Diversity comparison. We show examples of StackGAN++ [31] and MSGAN on the CUB-200-2011 dataset of text-to-image synthesis. When the text code is fixed, the latent codes in MSGAN help to generate more diverse appearances and poses of birds as well as different backgrounds.
Table 4: Quantitative results on the Yosemite (Summer↔Winter) and the Cat↔Dog datasets. Columns: Summer2Winter, Winter2Summer, Cat2Dog, Dog2Cat.
Table 5: Quantitative results on the CUB-200-2011 dataset, comparing StackGAN++ [31] and MSGAN. We conduct experiments in two settings: 1) Conditioned on text descriptions, where every description can be mapped to different text codes. 2) Conditioned on text codes, where the text codes are fixed so that their effects are excluded.
Figure 10: Linear interpolation between two latent codes in MSGAN. Image synthesis results with linear interpolation between two latent codes in the dog-to-cat translation and text-to-image synthesis.

4.3.1 Conditioned on Paired Images

We take Pix2Pix as the baseline model. We also compare MSGAN to BicycleGAN [35] which generates diverse images with paired training images. For fair comparisons, architectures of the generator and the discriminator in all methods follow the ones in BicycleGAN [35].

We conduct experiments on the facades and maps datasets. MSGAN obtains consistent improvements on all metrics over Pix2Pix. Moreover, MSGAN demonstrates diversity comparable to BicycleGAN, which applies an additional encoder network. Figure 6 and Table 3 present the qualitative and quantitative results, respectively.

4.3.2 Conditioned on Unpaired Images

We choose DRIT [15], one of the state-of-the-art frameworks for generating diverse images with unpaired training data, as the baseline framework. Though DRIT synthesizes diverse images in most cases, mode collapse occurs in some challenging shape-variation cases (e.g., translation between cats and dogs). To demonstrate the robustness of the proposed method, we evaluate on the shape-preserving Yosemite (summer↔winter) [34] dataset and the cat↔dog [15] dataset, which requires shape variations.

As the quantitative results in Table 4 show, MSGAN performs favorably against DRIT in all metrics on both datasets. Especially on the challenging cat↔dog dataset, MSGAN obtains substantial diversity gains. From the statistical point of view, we visualize the bin proportions of the dog-to-cat translation in Figure 8. The graph shows the severe mode collapse issue of DRIT and the substantial improvement with the proposed regularization term. Qualitatively, Figure 7 shows that MSGAN discovers more modes without loss of visual quality.

4.4 Conditioned on Text

Text-to-image synthesis aims to generate images conditioned on text descriptions. We integrate the proposed regularization term into StackGAN++ [31] using the CUB-200-2011 [28] dataset. To improve diversity, StackGAN++ introduces a Conditioning Augmentation (CA) module that re-parameterizes text descriptions into text codes of the Gaussian distribution. Instead of applying the regularization term on the semantically meaningful text codes, we focus on exploiting the latent codes randomly sampled from the prior distribution. However, for a fair comparison, we evaluate MSGAN against StackGAN++ in two settings: 1) Perform generation without fixing text codes for text descriptions. In this case, text codes also provide variations for output images. 2) Perform generation with fixed text codes. In this setting, the effects of text codes are excluded.

Table 5 presents quantitative comparisons between MSGAN and StackGAN++. MSGAN improves the diversity of StackGAN++ and maintains visual quality. To better illustrate the role that latent codes play in diversity, we show qualitative comparisons with the text codes fixed. In this setting, we do not consider the diversity resulting from CA. Figure 9 illustrates that the latent codes of StackGAN++ have minor effects on the variations of the image. In contrast, the latent codes of MSGAN contribute to various appearances and poses of birds.

4.5 Interpolation of Latent Space in MSGANs

We perform linear interpolation between two given latent codes and generate the corresponding images to better understand how well MSGANs exploit the latent space. Figure 10 shows the interpolation results on the dog-to-cat translation and the text-to-image synthesis task. In the dog-to-cat translation, we can see that the coat colors and patterns vary smoothly along with the latent vectors. In the text-to-image synthesis, both the orientations of birds and the appearances of footholds change gradually with the variations of the latent codes.
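The interpolation itself is simple to reproduce. The sketch below generates evenly spaced latent codes between two endpoints; each code would then be fed to the generator together with a fixed conditional context. The function name and the toy two-dimensional codes are illustrative assumptions.

```python
def interpolate_latents(z_start, z_end, steps):
    """Return `steps` latent codes evenly spaced from z_start to z_end."""
    if steps < 2:
        raise ValueError("need at least two steps to include both endpoints")
    if len(z_start) != len(z_end):
        raise ValueError("latent codes must have the same dimension")
    codes = []
    for i in range(steps):
        t = i / (steps - 1)  # interpolation weight, from 0.0 to 1.0
        codes.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return codes


# Toy 2-D latent codes; real models use higher-dimensional noise vectors.
codes = interpolate_latents([0.0, 1.0], [1.0, -1.0], steps=5)
```

If the generator uses the latent space well, the images produced along this path should change gradually rather than staying fixed or jumping abruptly.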

5 Conclusions

In this work, we present a simple but effective mode seeking regularization term on the generator to address mode collapse in cGANs. By maximizing the distance between generated images with respect to that between the corresponding latent codes, the regularization term forces the generators to explore more minor modes. The proposed regularization method can be readily integrated with existing cGAN frameworks without imposing training overheads or modifications of network structures. We demonstrate the generalizability of the proposed method on three different conditional generation tasks: categorical generation, image-to-image translation, and text-to-image synthesis. Both qualitative and quantitative results show that the proposed regularization term helps the baseline frameworks improve diversity without sacrificing the visual quality of the generated images.


References

  • [1] M. Arjovsky and L. Bottou (2017) Wasserstein generative adversarial networks. In ICML.
  • [2] T. Che, Y. Li, A. P. Jacob, and W. Li (2017) Mode regularized generative adversarial networks. In ICLR.
  • [3] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR.
  • [4] J. Donahue and T. Darrell (2017) Adversarial feature learning. In ICLR.
  • [5] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, and A. Courville (2017) Adversarially learned inference. In ICLR.
  • [6] I. Durugkar and S. Mahadevan (2017) Generative multi-adversarial networks. In ICLR.
  • [7] A. Ghosh, V. Kulharia, V. Namboodiri, and P. K. Dokania (2018) Multi-agent diverse generative adversarial networks. In CVPR.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, and Y. Bengio (2014) Generative adversarial nets. In NIPS.
  • [9] M. Heusel, H. Ramsauer, T. Unterthiner, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS.
  • [10] X. Huang, M. Liu, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In ECCV.
  • [11] P. Isola and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR.
  • [12] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR.
  • [13] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
  • [14] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, and W. Shi (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR.
  • [15] H. Lee, H. Tseng, J. Huang, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In ECCV.
  • [16] M. Liu and J. Kautz (2017) Unsupervised image-to-image translation networks. In NIPS.
  • [17] M. Liu and O. Tuzel (2016) Coupled generative adversarial networks. In NIPS.
  • [18] X. Mao, Q. Li, H. Xie, R. YK, and S. P. Smolley (2017) Least squares generative adversarial networks. In ICCV.
  • [19] L. Metz, B. Poole, and J. Sohl-Dickstein (2017) Unrolled generative adversarial networks. In ICLR.
  • [20] S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
  • [21] A. Radford and S. Chintala (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR.
  • [22] S. Reed, Z. Akata, X. Yan, L. Logeswaran, and H. Lee (2016) Generative adversarial text to image synthesis. In ICML.
  • [23] E. Richardson and Y. Weiss (2018) On GANs and GMMs. In NIPS.
  • [24] O. Ronneberger and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI.
  • [25] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, and X. Chen (2016) Improved techniques for training GANs. In NIPS.
  • [26] A. Srivastava, L. Valkoz, C. Russell, and C. Sutton (2017) VEEGAN: reducing mode collapse in GANs using implicit variational learning. In NIPS.
  • [27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR.
  • [28] C. Wah, S. Branson, P. Welinder, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
  • [29] M. Wand (2016) Precomputed real-time texture synthesis with Markovian generative adversarial networks. In ECCV.
  • [30] T. Wang, M. Liu, J. Zhu, A. Tao, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR.
  • [31] T. Xu, H. Li, S. Zhang, X. Wang, and D. Metaxas (2018) StackGAN++: realistic image synthesis with stacked generative adversarial networks. TPAMI.
  • [32] D. Yang, S. Hong, Y. Jang, T. Zhao, and H. Lee (2019) Diversity-sensitive conditional generative adversarial networks. In ICLR.
  • [33] R. Zhang, P. Isola, A. A. Efros, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR.
  • [34] J. Zhu, T. Park, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.
  • [35] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, and E. Shechtman (2017) Toward multimodal image-to-image translation. In NIPS.

Appendix A Implementation Details

Table 9 summarizes the datasets and baseline models used on various tasks. For all of the baseline methods, we augment the original objective functions with the proposed regularization term. Note that we retain the original network architecture design and use the default hyper-parameter settings for training.


Since the images in the CIFAR-10 [13] dataset are of size $32 \times 32$, we modify the structure of the generator and discriminator in DCGAN [21], as shown in Table 10. We use the batch size of , learning rate of and Adam [12] optimizer with and to train both the baseline and MSGAN network.


We adopt the generator and discriminator in BicycleGAN [35] to build the Pix2Pix [11] model. As in BicycleGAN, we use a U-Net network [24] for the generator, and inject the latent codes into every layer of the generator. The architecture of the discriminator is a two-scale PatchGAN network [11]. For training, both the Pix2Pix and MSGAN frameworks use the same hyper-parameters as the officially released version.


DRIT [15] involves two stages of image-to-image translation in the training process. We only apply the mode seeking regularization term to the generators in the first stage. Our implementation is modified from the officially released code.


StackGAN++ [31] is a tree-like structure with multiple generators and discriminators. We use the output images from the last generator and the input latent codes to calculate the mode seeking regularization term. The implementation is based on the officially released code.

Appendix B Evaluation Details

We employ the official implementations of FID, NDB and JSD, and LPIPS. For NDB and JSD, we use the K-means method on training samples to obtain the clusters. Then the generated samples are assigned to the nearest cluster to compute the bin proportions. As suggested by the author of [23], there are at least training samples for each cluster. Therefore, we cluster the number of bins  in all tasks, where denotes the number of training samples for computing the clusters. We have verified that the performance is consistent within a large range of . For evaluation, we randomly generate images for a given conditional context on various tasks. We conduct five independent trials and report the mean and standard deviation over the trials. More evaluation details of one trial are presented as follows.
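The bin assignment described above can be sketched as follows. This is a simplified illustration, not the official NDB/JSD code: the cluster centers are assumed to be given (for example, produced by K-means on the training samples), samples are plain tuples of floats, and the statistical test that NDB applies to each bin is omitted. All function names are hypothetical.

```python
import math


def nearest_center(sample, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    def sq_dist(center):
        return sum((s - c) ** 2 for s, c in zip(sample, center))
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i]))


def bin_proportions(samples, centers):
    """Fraction of samples assigned to each cluster (bin)."""
    counts = [0] * len(centers)
    for sample in samples:
        counts[nearest_center(sample, centers)] += 1
    return [count / len(samples) for count in counts]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two bin-proportion vectors."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy data: two clusters, training samples spread evenly over both.
centers = [(0.0, 0.0), (10.0, 10.0)]
train = [(0.0, 1.0), (1.0, 0.0), (9.0, 10.0), (10.0, 9.0)]
# A mode-collapsed generator that only covers the first cluster.
generated = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]

p_train = bin_proportions(train, centers)
p_gen = bin_proportions(generated, centers)
divergence = jsd(p_train, p_gen)
```

A generator that covers both clusters in the same proportions as the training data would give a divergence of zero; the collapsed generator in this toy example gives a clearly positive value.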

Conditioned on Class Label.

We randomly generate images for each class label. We use all the training samples and the generated samples to compute FID. For NDB and JSD, we employ the training samples in each class to calculate clusters.

Conditioned on Image.

We randomly generate images for each input image in the test set. For LPIPS, we randomly select pairs of the images of each context in the test set to compute LPIPS and average all the values for this trial. Then, we randomly choose input images and their corresponding generated images to form generated samples. We use the generated samples and all samples in the training set to compute FID. For NDB and JSD, we employ all the training samples for clustering and choose bins for facades, and bins for other datasets.

Conditioned on Text.

We randomly select sentences and generate images for each sentence, which forms generated samples. Then, we randomly select samples for computing FID and cluster them into bins for NDB and JSD. For LPIPS, we randomly choose pairs for each sentence and average the values of all the pairs for this trial.
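The pair-based diversity score used for LPIPS can be sketched as below. The sketch is only illustrative: `l1_distance` is a stand-in for a perceptual distance and is not LPIPS, images are flat lists of floats, and the function names and the number of pairs are hypothetical.

```python
import random
from itertools import combinations


def l1_distance(a, b):
    """Stand-in distance (mean absolute difference); NOT the LPIPS metric."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def average_pairwise_distance(images, num_pairs, distance, rng):
    """Average a distance over randomly chosen pairs of distinct images.

    `images` holds the outputs generated for one conditional context.
    """
    all_pairs = list(combinations(range(len(images)), 2))
    chosen = rng.sample(all_pairs, min(num_pairs, len(all_pairs)))
    return sum(distance(images[i], images[j]) for i, j in chosen) / len(chosen)


rng = random.Random(0)
# Three one-pixel "images" generated for the same context.
varied = [[0.0], [1.0], [2.0]]
identical = [[1.0], [1.0], [1.0]]

score_varied = average_pairwise_distance(varied, 3, l1_distance, rng)
score_identical = average_pairwise_distance(identical, 3, l1_distance, rng)
```

Averaging these per-context scores over all contexts in a trial, and then over trials, gives the kind of summary statistic reported in the tables.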

Appendix C Ablation Study on the Regularization Term

C.1 The Weighting Parameter

To analyze the influence of the regularization term, we conduct an ablation study by varying the weighting parameter on the image-to-image translation task using the facades dataset. Table 6 presents the quantitative results for different values of the weighting parameter. Increasing the weighting parameter improves the diversity of the generated images. Nevertheless, once it exceeds a threshold value, training becomes unstable, which yields synthesized images of low quality and even low diversity. As a result, we empirically set the weighting parameter for all experiments.

Table 6: Quantitative results with different values of the weighting parameter on the facades dataset.

C.2 The Design Choice of the Distance Metric

We have also explored another design choice for the distance metric. We conduct experiments using a discriminator feature distance in our regularization term, in a way similar to the feature matching loss [30],


where denotes the layer of the discriminator. We apply it to Pix2Pix on the facades dataset. Table 7 shows that MSGAN using feature distance also improves over Pix2Pix. However, MSGAN using distance has higher diversity. Therefore, we employ MSGAN using distance for all experiments.
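For concreteness, one way a feature-distance variant of the regularization term could be written is sketched below. The notation is an assumption made for illustration: $G$ is the generator, $c$ the conditional context, $z_1$ and $z_2$ two latent codes, $D^{l}$ the features of the $l$-th discriminator layer, and $L$ the number of layers used. The choice of the $L_1$ norm and of averaging over layers is also assumed.

```latex
\mathcal{L}_{ms}^{feat}
  = \max_{G} \left(
      \frac{\frac{1}{L} \sum_{l=1}^{L}
            \left\lVert D^{l}\big(G(c, z_1)\big) - D^{l}\big(G(c, z_2)\big) \right\rVert_1}
           {\left\lVert z_1 - z_2 \right\rVert_1}
    \right)
```

Under these assumptions the numerator measures how far apart two generated images are in the discriminator's feature space, while the denominator measures how far apart their latent codes are, so the ratio has the same structure as the image-space version of the term.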

Table 7: Quantitative results on the facades dataset.

Appendix D Computational Overheads

We compare MSGAN with Pix2Pix and BicycleGAN in terms of training time, memory consumption, and number of model parameters on an NVIDIA TITAN X GPU. Table 8 shows that our method incurs only marginal overhead. In contrast, BicycleGAN requires a longer time per iteration and more memory because of its additional encoder and second discriminator network.

Model Time (s) Memory (MB) Parameters (M)
Pix2Pix [11]
BicycleGAN [35]
Table 8: Comparisons of computational overheads on the facades dataset.

Appendix E Additional Results

We present more results in Figures 11 to 17: categorical generation in Figure 11, image-to-image translation in Figures 12 to 16, and text-to-image synthesis in Figure 17.

Context: Class Label | Paired Images | Unpaired Images | Text
Dataset: CIFAR-10 [13] | Facades [3], Maps [11] | Yosemite [34] (Summer, Winter), Cat↔Dog [15] (Cat, Dog) | CUB-200-2011 [28]
Split (per dataset or domain): train, test
Baseline: DCGAN [21] | Pix2Pix [11] | DRIT [15] | StackGAN++ [31]
Table 9: Statistics of different generation tasks. We summarize the number of training and testing images in each generation task. The baseline model used for each task is also provided.
Layer | Generator | Discriminator
1 | Dconv(N512-K4-S1-P0), BN, Relu | Conv(N128-K4-S2-P1), Leaky-Relu
2 | Dconv(N256-K4-S2-P1), BN, Relu | Conv(N256-K4-S2-P1), BN, Leaky-Relu
3 | Dconv(N128-K4-S2-P1), BN, Relu | Conv(N512-K4-S2-P1), BN, Leaky-Relu
4 | Dconv(N3-K4-S2-P1), Tanh | Conv(N1-K4-S1-P0), Sigmoid
Table 10: The architecture of the generator and discriminator of DCGAN. We employ the following abbreviations: N = number of filters, K = kernel size, S = stride size, P = padding size. “Conv”, “Dconv”, and “BN” denote the convolutional layer, transposed convolutional layer, and batch normalization, respectively.
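The layer settings in Table 10 can be checked against the CIFAR-10 image size with the standard output-size formulas for convolutions and transposed convolutions. The sketch below assumes that the latent code enters the generator as a 1×1 spatial map and that no dilation or output padding is used; the function and variable names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def dconv_out(size, kernel, stride, padding):
    """Spatial output size of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# (kernel, stride, padding) per layer, read off Table 10.
GENERATOR = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1)]
DISCRIMINATOR = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 0)]


def trace(size, layers, out_fn):
    """Return the spatial size before the first layer and after each layer."""
    sizes = [size]
    for kernel, stride, padding in layers:
        size = out_fn(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


generator_sizes = trace(1, GENERATOR, dconv_out)
discriminator_sizes = trace(32, DISCRIMINATOR, conv_out)
```

Under these assumptions the generator upsamples 1 → 4 → 8 → 16 → 32 and the discriminator downsamples 32 → 16 → 8 → 4 → 1, so the generator output matches the 32×32 CIFAR-10 resolution and the discriminator produces a single score per image.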

Figure 11: More categorical generation results of CIFAR-10. We show the results of DCGAN [21] with the proposed mode seeking regularization term on categorical generation task.
Figure 12: More image-to-image translation results of facades and maps. Top three rows: facades, bottom three rows: maps.
Figure 13: More image-to-image translation results of Yosemite, Summer→Winter.
Figure 14: More image-to-image translation results of Yosemite, Winter→Summer.
Figure 15: More image-to-image translation results of Cat→Dog.
Figure 16: More image-to-image translation results of Dog→Cat.
Figure 17: More text-to-image synthesis results of CUB-200-2011.