MSGAN: Mode Seeking Generative Adversarial Networks for Diverse Image Synthesis (CVPR2019)
Most conditional generation tasks expect diverse outputs given a single conditional context. However, conditional generative adversarial networks (cGANs) often focus on the prior conditional information and ignore the input noise vectors, which contribute to the output variations. Recent attempts to resolve the mode collapse issue for cGANs are usually task-specific and computationally expensive. In this work, we propose a simple yet effective regularization term to address the mode collapse issue for cGANs. The proposed method explicitly maximizes the ratio of the distance between generated images with respect to the corresponding latent codes, thus encouraging the generators to explore more minor modes during training. This mode seeking regularization term is readily applicable to various conditional generation tasks without imposing training overhead or modifying the original network structures. We validate the proposed algorithm on three conditional image synthesis tasks including categorical generation, image-to-image translation, and text-to-image synthesis with different baseline models. Both qualitative and quantitative results demonstrate the effectiveness of the proposed regularization method for improving diversity without loss of quality.
Generative adversarial networks (GANs) have been shown to capture complex, high-dimensional image data effectively, with numerous applications. Built upon GANs, conditional GANs (cGANs) take external information as additional inputs. For image synthesis, cGANs can be applied to various tasks with different conditional contexts. With class labels, cGANs can be applied to categorical image generation. With text sentences, cGANs can be applied to text-to-image synthesis [22, 31]. With images, cGANs have been used in tasks including image-to-image translation [10, 11, 15, 16, 34, 35], semantic manipulation, and style transfer.
For most conditional generation tasks, the mappings are multimodal in nature, i.e., a single input context corresponds to multiple plausible outputs. A straightforward approach to handle multimodality is to take random noise vectors along with the conditional contexts as inputs, where the contexts determine the main content and the noise vectors are responsible for variations. For instance, in the dog-to-cat image-to-image translation task, the input dog images decide contents like orientations of heads and positions of facial landmarks, while the noise vectors help the generation of different species. However, cGANs usually suffer from the mode collapse [8, 25] problem, where generators only produce samples from a single or few modes of the distribution and ignore other modes. The noise vectors are ignored or have only a minor impact, since cGANs pay more attention to learning from the high-dimensional and structured conditional contexts.
There are two main approaches to address the mode collapse problem in GANs. A number of methods focus on discriminators by introducing different divergence metrics [1, 18] and optimization processes [6, 19, 25]. The other methods use auxiliary networks such as multiple generators [7, 17] and additional encoders [2, 4, 5, 26]. However, mode collapse is relatively less studied in cGANs. Some recent efforts have been made in the image-to-image translation task to improve diversity [10, 15, 35]. Similar to the second category in the unconditional setting, these approaches introduce additional encoders and loss functions to encourage a one-to-one relationship between the output and the latent code. These methods either entail heavy computational overheads on training or require auxiliary networks that are often task-specific and cannot be easily extended to other frameworks.
In this work, we propose a mode seeking regularization method that can be applied to cGANs for various tasks to alleviate the mode collapse problem. Given two latent vectors and the corresponding output images, we propose to maximize the ratio of the distance between images with respect to the distance between latent vectors. In other words, this regularization term encourages generators to generate dissimilar images during training. As a result, generators can explore the target distribution, and enhance the chances of generating samples from different modes. On the other hand, we can train the discriminators with dissimilar generated samples to provide gradients from minor modes that are likely to be ignored otherwise. This mode seeking regularization method incurs marginal computational overheads and can be easily embedded in different cGAN frameworks to improve the diversity of synthesized images.
We validate the proposed regularization algorithm through an extensive evaluation of three conditional image synthesis tasks with different baseline models. First, for categorical image generation, we apply the proposed method on DCGAN using the CIFAR-10 dataset. Second, for image-to-image translation, we embed the proposed regularization scheme in Pix2Pix and DRIT using the facades, maps, Yosemite, and cat↔dog datasets. Third, for text-to-image synthesis, we incorporate StackGAN++ with the proposed regularization term using the CUB-200-2011 dataset. We evaluate the diversity of synthesized images using perceptual distance metrics.
However, the diversity metric alone cannot guarantee the similarity between the distribution of generated images and the distribution of real data. Therefore, we adopt two recently proposed bin-based metrics: the Number of Statistically-Different Bins (NDB) metric, which determines the relative proportions of samples falling into clusters predetermined by real data, and the Jensen-Shannon Divergence (JSD) distance, which measures the similarity between bin distributions. Furthermore, to verify that we do not achieve diversity at the expense of realism, we evaluate our method with the Fréchet Inception Distance (FID) as the metric for quality. Experimental results demonstrate that the proposed regularization method helps existing models from various applications achieve better diversity without loss of image quality. Figure 1 shows the effectiveness of the proposed regularization method for existing models.
The main contributions of this work are:
We propose a simple yet effective mode seeking regularization method to address the mode collapse problem in cGANs. This regularization scheme can be readily extended into existing frameworks with marginal training overheads and modifications.
We demonstrate the generalizability of the proposed regularization method on three different conditional generation tasks: categorical generation, image-to-image translation, and text-to-image synthesis.
Extensive experiments show that the proposed method helps existing models from different tasks achieve better diversity without sacrificing the visual quality of the generated images.
Our code and pre-trained models are available at https://github.com/HelenMao/MSGAN/.
Generative adversarial networks [1, 8, 18, 21] have been widely used for image synthesis. With adversarial training, generators are encouraged to capture the distribution of real images. On the basis of GANs, conditional GANs synthesize images based on various contexts. For instance, cGANs can generate high-resolution images conditioned on low-resolution images, translate images between different visual domains [10, 11, 15, 16, 34, 35], generate images with a desired style, and synthesize images according to sentences [22, 31]. Although cGANs have achieved success in various applications, existing approaches suffer from the mode collapse problem. Since the conditional contexts provide strong structural prior information for the output images and have higher dimensions than the input noise vectors, generators tend to ignore the input noise vectors, which are responsible for the variation of generated images. As a result, the generators are prone to produce images with similar appearances. In this work, we aim to address the mode collapse problem for cGANs.
Some methods focus on the discriminator, using different optimization processes and divergence metrics [1, 18] to stabilize the training process. The minibatch discrimination scheme allows the discriminator to discriminate between whole mini-batches of samples instead of between individual samples. Durugkar et al. use multiple discriminators to address this issue. The other methods use auxiliary networks to alleviate the mode collapse issue. ModeGAN and VEEGAN enforce a bijective mapping between the input noise vectors and generated images with additional encoder networks. Multiple generators and weight-sharing generators are developed to capture more modes of the distribution. However, these approaches either entail heavy computational overheads or require modifications of the network structure, and may not be easily applicable to cGANs.
In the field of cGANs, some efforts [10, 15, 35] have recently been made to address the mode collapse issue on the image-to-image translation task. Similar to ModeGAN and VEEGAN, additional encoders are introduced to provide a bijection constraint between the generated images and input noise vectors. However, these approaches require other task-specific networks and objective functions. The additional components make the methods less generalizable and incur extra computational loads on training. In contrast, we propose a simple regularization term that imposes no training overheads and requires no modifications of the network structure. Therefore, the proposed method can be readily applied to various conditional generation tasks. Recently, a concurrent work also adopted a loss term similar to ours for reducing mode collapse in cGANs.
The training process of GANs can be formulated as a mini-max problem: a discriminator $D$ learns to be a classifier by assigning higher discriminative values to the real data samples and lower ones to the generated ones. Meanwhile, a generator $G$ aims to fool $D$ by synthesizing realistic examples. Through adversarial training, the gradients from $D$ will guide $G$ toward generating samples with a distribution similar to that of the real data.
The mode collapse problem with GANs is well known in the literature. Several methods [2, 25, 26] attribute the missing mode to the lack of penalty when this issue occurs. Since all modes usually have similar discriminative values, larger modes are likely to be favored through the training process based on gradient descent. On the other hand, it is difficult to generate samples from minor modes.
The mode missing problem becomes worse in cGANs. Generally, conditional contexts are high-dimensional and structured (e.g., images and sentences) as opposed to the noise vectors. As such, the generators are likely to focus on the contexts and ignore the noise vectors, which account for diversity.
In this work, we propose to alleviate the missing mode problem from the generator perspective. Figure 2 illustrates the main ideas of our approach. Let a latent vector $z$ from the latent code space $\mathcal{Z}$ be mapped to the image space $\mathcal{I}$. When mode collapse occurs, the mapped images are collapsed into a few modes. Furthermore, when two latent codes $z_1$ and $z_2$ are closer, the mapped images $I_1 = G(c, z_1)$ and $I_2 = G(c, z_2)$ are more likely to be collapsed into the same mode. To address this issue, we propose a mode seeking regularization term to directly maximize the ratio of the distance between $G(c, z_1)$ and $G(c, z_2)$ with respect to the distance between $z_1$ and $z_2$,
$$\mathcal{L}_{ms} = \max_G \left( \frac{d_I(G(c, z_1), G(c, z_2))}{d_z(z_1, z_2)} \right),$$
where $c$ is the conditional context and $d_*(\cdot)$ denotes the distance metric.
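As a minimal sketch of the ratio described above, the following computes the distance between two generated images divided by the distance between their latent codes. It is an illustration under stated assumptions rather than the released implementation: images and latent codes are flattened Python lists, the distance is a mean absolute difference, and the helper names `l1_distance` and `mode_seeking_ratio` are hypothetical.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mode_seeking_ratio(img1, img2, z1, z2):
    """Image distance divided by latent distance for two generated samples."""
    return l1_distance(img1, img2) / l1_distance(z1, z2)


# Two nearby latent codes (toy values, not from the paper).
z1, z2 = [0.0, 0.0], [0.2, 0.0]

# Collapsed generator: nearly identical outputs for both codes.
collapsed = mode_seeking_ratio([0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.6], z1, z2)

# Diverse generator: clearly different outputs for the same pair of codes.
diverse = mode_seeking_ratio([0.5, 0.5, 0.5, 0.5], [0.1, 0.9, 0.5, 0.1], z1, z2)
```

A collapsed generator yields a small ratio and a diverse one a large ratio, which is the quantity the regularization term pushes up. The sketch does not guard against identical latent codes, for which the denominator is zero.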
The regularization term offers a virtuous circle for training cGANs. It encourages the generator to explore the image space and enhances the chances of generating samples from minor modes. On the other hand, the discriminator is forced to pay attention to generated samples from minor modes. Figure 2 shows a mode collapse situation where two close samples, $z_1$ and $z_2$, are mapped onto the same mode. However, with the proposed regularization term, one of them is mapped to an image that belongs to a previously unexplored mode. With the adversarial mechanism, the generator will thus have better chances to generate samples of that mode in the following training steps.
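As shown in Figure 5, the proposed regularization term can be easily integrated with existing cGANs by appending it to the original objective function,
$$\mathcal{L}_{new} = \mathcal{L}_{ori} + \lambda_{ms} \mathcal{L}_{ms},$$
where $\mathcal{L}_{ori}$ denotes the original objective function and $\lambda_{ms}$ the weight that controls the importance of the regularization term $\mathcal{L}_{ms}$. Here, $\mathcal{L}_{ori}$ can be a simple loss function. For example, in the categorical generation task,
$$\mathcal{L}_{ori} = \mathbb{E}_{c,y}[\log D(c, y)] + \mathbb{E}_{c,z}[\log(1 - D(c, G(c, z)))],$$
where $c$, $y$, and $z$ denote class labels, real images, and noise vectors, respectively. In the image-to-image translation task,
$$\mathcal{L}_{ori} = \mathcal{L}_{GAN} + \mathbb{E}_{x,y,z}[\|y - G(x, z)\|_1],$$
where $x$ denotes input images and $\mathcal{L}_{GAN}$ is the typical GAN loss. $\mathcal{L}_{ori}$ can also be an arbitrarily complex objective function from any task, as shown in Figure 5 (b). We name the proposed method Mode Seeking GANs (MSGANs).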
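To make the combination of the original objective and the regularization term concrete, the sketch below adds a mode seeking penalty to a generator loss that is being minimized. The text states the regularizer as a ratio to maximize; turning it into a minimized penalty through its reciprocal, the small constant `eps`, the default weight of 1.0, and the function name are all assumptions made for illustration and are not confirmed by the text above.

```python
def regularized_generator_loss(original_loss, ms_ratio, lambda_ms=1.0, eps=1e-5):
    """Add a mode seeking penalty to a generator loss that is minimized.

    ms_ratio is the image distance divided by the latent distance for a pair
    of generated samples. A larger ratio is better, so its reciprocal is used
    as the penalty (an assumption of this sketch). eps avoids division by zero.
    """
    ms_penalty = 1.0 / (ms_ratio + eps)
    return original_loss + lambda_ms * ms_penalty


# Same original loss, different diversity: the collapsed pair is penalized more.
loss_collapsed = regularized_generator_loss(0.7, ms_ratio=0.25)
loss_diverse = regularized_generator_loss(0.7, ms_ratio=3.0)
```

Because the penalty is only an extra scalar term, the original network architecture and the rest of the objective stay untouched, which matches the claim that the method needs no structural changes.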
We evaluate the proposed regularization method through extensive quantitative and qualitative evaluation. We apply MSGANs to the baseline models from three representative conditional image synthesis tasks: categorical generation, image-to-image translation, and text-to-image synthesis. Note that we augment the original objective functions with the proposed regularization term while maintaining the original network architectures and hyper-parameters. We employ the $L_1$ norm distance as our distance metric for both the image distance $d_I$ and the latent distance $d_z$, and set the hyper-parameter $\lambda_{ms} = 1$ in all experiments. For more implementation and evaluation details, please refer to the appendixes.
We conduct evaluations using the following metrics.
FID. To evaluate the quality of the generated images, we use FID to measure the distance between the generated distribution and the real one through features extracted by the Inception network. Lower FID values indicate better quality of the generated images.
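The Fréchet distance between two Gaussians with means $\mu_1, \mu_2$ and covariances $\Sigma_1, \Sigma_2$ is $\|\mu_1 - \mu_2\|^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2})$. The sketch below evaluates it in the simplified case where both covariances are assumed diagonal, so the trace term reduces to a sum of squared differences of per-dimension standard deviations. This is only an illustration: the actual metric uses full covariance matrices over Inception features, and the function name and toy feature vectors are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def frechet_distance_diag(feats_a, feats_b):
    """Frechet distance between two feature sets, assuming diagonal covariances."""
    dims = len(feats_a[0])
    total = 0.0
    for d in range(dims):
        col_a = [f[d] for f in feats_a]
        col_b = [f[d] for f in feats_b]
        mu_a, mu_b = mean(col_a), mean(col_b)
        # Population standard deviation per dimension (a choice of this sketch).
        sd_a, sd_b = sqrt(pvariance(col_a)), sqrt(pvariance(col_b))
        total += (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2
    return total


real = [[0.0, 1.0], [2.0, 3.0]]
shifted = [[1.0, 2.0], [3.0, 4.0]]  # same spread, mean shifted by 1 per dimension

same_score = frechet_distance_diag(real, real)
shifted_score = frechet_distance_diag(real, shifted)
```

Identical feature sets give a distance of zero, and the score grows as the means or spreads of the two sets drift apart.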
LPIPS. To evaluate diversity, we employ LPIPS following [10, 15, 35]. LPIPS measures the average feature distance between generated samples. A higher LPIPS score indicates better diversity among the generated images.
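The diversity score can be read as an average pairwise distance over samples generated for the same context. The sketch below shows that averaging step only. It assumes a pluggable `distance` function and uses a plain mean absolute difference as a stand-in for the learned perceptual distance, and it averages over all pairs rather than a random subset; both choices and the helper names are illustrative assumptions.

```python
from itertools import combinations


def mean_pairwise_distance(samples, distance):
    """Average of distance(a, b) over all unordered pairs of samples."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def l1(a, b):
    """Stand-in distance: mean absolute difference of two flat vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Outputs for one conditional context (toy values).
collapsed_outputs = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
diverse_outputs = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

collapsed_score = mean_pairwise_distance(collapsed_outputs, l1)
diverse_score = mean_pairwise_distance(diverse_outputs, l1)
```

A generator that ignores its noise input scores zero here, while varied outputs score higher.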
NDB and JSD. To measure the similarity between the distribution of real images and that of generated ones, we adopt two bin-based metrics, NDB and JSD. These metrics evaluate the extent of mode missing in generative models. Following the original formulation of these metrics, the training samples are first clustered using K-means into different bins, which can be viewed as modes of the real data distribution. Then each generated sample is assigned to the bin of its nearest neighbor. We calculate the bin proportions of the training samples and the synthesized samples to evaluate the difference between the generated distribution and the real data distribution. The NDB score and the JSD of the bin proportions are then computed to measure the mode collapse. A lower NDB score and JSD mean that the generated data distribution approaches the real data distribution better by fitting more modes. Please refer to the work that introduced these metrics for more details.
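The bin-based procedure above can be sketched in a few steps: assign samples to bins, compute bin proportions, then compare the proportions. The sketch assumes the cluster centers are already available (no K-means step is shown), assigns each sample to its nearest center, and counts statistically different bins with a two-proportion z-test at a threshold of 1.96. The test choice, the threshold, and all function names are assumptions for illustration, not the official implementation.

```python
from math import log, sqrt


def nearest_bin(sample, centers):
    """Index of the center closest to sample (squared Euclidean distance)."""
    dists = [sum((s - c) ** 2 for s, c in zip(sample, center)) for center in centers]
    return dists.index(min(dists))


def bin_proportions(samples, centers):
    """Fraction of samples assigned to each bin."""
    counts = [0] * len(centers)
    for s in samples:
        counts[nearest_bin(s, centers)] += 1
    return [c / len(samples) for c in counts]


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * log(a / b, 2) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def ndb(p, q, n_p, n_q, z_threshold=1.96):
    """Number of bins whose proportions differ under a two-proportion z-test."""
    different = 0
    for a, b in zip(p, q):
        pooled = (a * n_p + b * n_q) / (n_p + n_q)
        se = sqrt(pooled * (1 - pooled) * (1 / n_p + 1 / n_q))
        if se > 0 and abs(a - b) / se > z_threshold:
            different += 1
    return different


centers = [[0.0], [10.0]]                       # two modes of the real data
real = [[0.1], [-0.2], [9.8], [10.3]]           # covers both modes
generated = [[0.0], [0.3], [0.1], [-0.1]]       # collapsed onto one mode

p_real = bin_proportions(real, centers)
p_gen = bin_proportions(generated, centers)
```

With these toy values the real samples split evenly across the two bins while the generated ones fall into a single bin, so both the JSD and the NDB count are greater than zero.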
We first validate the proposed method on categorical generation. In categorical generation, networks take class labels as conditional contexts to synthesize images of different categories. We apply the regularization term to the baseline framework DCGAN.
We conduct experiments on the CIFAR-10 dataset, which includes images of ten categories. Since images in the CIFAR-10 dataset are of size $32 \times 32$ and upsampling degrades the image quality, we do not compute LPIPS in this task. Table 1 and Table 2 present the results of NDB, JSD, and FID. MSGAN mitigates the mode collapse issue in most classes while maintaining image quality.
Image-to-image translation aims to learn the mapping between two visual domains. Conditioned on images from the source domain, models attempt to synthesize corresponding images in the target domain. Despite the multimodal nature of the image-to-image translation task, early work [11, 34] abandons noise vectors and performs one-to-one mapping, since the latent codes are easily ignored during training, as shown in [11, 35]. To achieve multimodality, several recent attempts [10, 15, 35] introduce additional encoder networks and objective functions to impose a bijection constraint between the latent code space and the image space. To demonstrate the generalizability, we apply the proposed method to a unimodal model, Pix2Pix, using paired training data, and a multimodal model, DRIT, using unpaired images.
We take Pix2Pix as the baseline model. We also compare MSGAN to BicycleGAN, which generates diverse images with paired training images. For fair comparisons, the architectures of the generator and the discriminator in all methods follow the ones in BicycleGAN.
We conduct experiments on the facades and maps datasets. MSGAN obtains consistent improvements on all metrics over Pix2Pix. Moreover, MSGAN demonstrates comparable diversity to BicycleGAN, which applies an additional encoder network. Figure 6 and Table 3 present the qualitative and quantitative results, respectively.
We choose DRIT, one of the state-of-the-art frameworks for generating diverse images with unpaired training data, as the baseline framework. Though DRIT synthesizes diverse images in most cases, mode collapse occurs in some challenging shape-variation cases (e.g., translation between cats and dogs). To demonstrate the robustness of the proposed method, we evaluate on the shape-preserving Yosemite (summer↔winter) dataset and the cat↔dog dataset, which requires shape variations.
As the quantitative results in Table 4 show, MSGAN performs favorably against DRIT in all metrics on both datasets. On the challenging cat↔dog dataset in particular, MSGAN obtains substantial diversity gains. From the statistical point of view, we visualize the bin proportions of the dog-to-cat translation in Figure 8. The graph shows the severe mode collapse issue of DRIT and the substantial improvement with the proposed regularization term. Qualitatively, Figure 7 shows that MSGAN discovers more modes without loss of visual quality.
For text-to-image synthesis, we apply the proposed regularization term to StackGAN++ using the CUB-200-2011 dataset. To improve diversity, StackGAN++ introduces a Conditioning Augmentation (CA) module that re-parameterizes text descriptions into text codes drawn from a Gaussian distribution. Instead of applying the regularization term on the semantically meaningful text codes, we focus on exploiting the latent codes randomly sampled from the prior distribution. However, for a fair comparison, we evaluate MSGAN against StackGAN++ in two settings: 1) Perform generation without fixing text codes for text descriptions. In this case, text codes also provide variations for output images. 2) Perform generation with fixed text codes. In this setting, the effects of text codes are excluded.
Table 5 presents quantitative comparisons between MSGAN and StackGAN++. MSGAN improves the diversity of StackGAN++ and maintains visual quality. To better illustrate the role that latent codes play in diversity, we show qualitative comparisons with the text codes fixed. In this setting, we do not consider the diversity resulting from CA. Figure 9 illustrates that the latent codes of StackGAN++ have minor effects on the variations of the image. In contrast, the latent codes of MSGAN contribute to various appearances and poses of birds.
We perform linear interpolation between two given latent codes and generate the corresponding images to better understand how well MSGANs exploit the latent space. Figure 10 shows the interpolation results on the dog-to-cat translation and the text-to-image synthesis task. In the dog-to-cat translation, we can see that the coat colors and patterns vary smoothly along with the latent vectors. In the text-to-image synthesis, both the orientations of birds and the appearances of footholds change gradually with the variations of the latent codes.
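Linear interpolation between two latent codes can be sketched as follows. The function name, the list-based latent codes, and the number of steps are illustrative assumptions; each returned code would then be passed to the generator together with a fixed conditional context.

```python
def interpolate_latents(z_start, z_end, steps):
    """Return `steps` latent codes evenly spaced from z_start to z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    codes = []
    for i in range(steps):
        t = i / (steps - 1)  # interpolation weight, from 0.0 to 1.0
        codes.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return codes


codes = interpolate_latents([0.0, 1.0], [1.0, -1.0], steps=5)
```

The first and last entries reproduce the two endpoints, and the middle entry is their midpoint.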
In this work, we present a simple but effective mode seeking regularization term on the generator to address mode collapse in cGANs. By maximizing the distance between generated images with respect to that between the corresponding latent codes, the regularization term forces the generators to explore more minor modes. The proposed regularization method can be readily integrated with existing cGAN frameworks without imposing training overheads or modifications of network structures. We demonstrate the generalizability of the proposed method on three different conditional generation tasks: categorical generation, image-to-image translation, and text-to-image synthesis. Both qualitative and quantitative results show that the proposed regularization term helps the baseline frameworks improve diversity without sacrificing the visual quality of the generated images.
The cityscapes dataset for semantic urban scene understanding. In CVPR.
Image-to-image translation with conditional adversarial networks. In CVPR.
Photo-realistic single image super-resolution using a generative adversarial network. In CVPR.
The unreasonable effectiveness of deep features as a perceptual metric. In CVPR.
Table 9 summarizes the datasets and baseline models used on various tasks. For all of the baseline methods, we augment the original objective functions with the proposed regularization term. Note that we retain the original network architecture design and use the default hyper-parameter settings for training.
Since the images in the CIFAR-10 dataset are of size $32 \times 32$, we modify the structure of the generator and discriminator in DCGAN, as shown in Table 10. We use the same batch size, learning rate, and Adam optimizer settings to train both the baseline and the MSGAN networks.
We adopt the generator and discriminator in BicycleGAN to build the Pix2Pix model. As in BicycleGAN, we use a U-Net network for the generator and inject the latent codes into every layer of the generator. The architecture of the discriminator is a two-scale PatchGAN network. For training, both the Pix2Pix and MSGAN frameworks use the same hyper-parameters as the officially released version (https://github.com/junyanz/BicycleGAN/).
DRIT involves two stages of image-to-image translation in the training process. We only apply the mode seeking regularization term to the generators in the first stage, and our implementation is modified from the officially released code (https://github.com/HsinYingLee/DRIT).
StackGAN++ is a tree-like structure with multiple generators and discriminators. We use the output images from the last generator and the input latent codes to calculate the mode seeking regularization term. The implementation is based on the officially released code (https://github.com/hanzhanggit/StackGAN-v2).
We employ the official implementations of FID (https://github.com/bioinf-jku/TTUR), NDB and JSD (https://github.com/eitanrich/gans-n-gmms), and LPIPS (https://github.com/richzhang/PerceptualSimilarity). For NDB and JSD, we use the K-means method on training samples to obtain the clusters. Then the generated samples are assigned to the nearest cluster to compute the bin proportions. As suggested by the authors of these metrics, each cluster should contain a minimum number of training samples. Therefore, we set the number of bins in all tasks according to the number of training samples used for computing the clusters. We have verified that the performance is consistent within a large range of bin counts. For evaluation, we randomly generate images for a given conditional context on various tasks. We conduct five independent trials and report the mean and standard deviation based on the result of each trial. More evaluation details of one trial are presented as follows.
Conditioned on Class Label.
We randomly generate images for each class label. We use all the training samples and the generated samples to compute FID. For NDB and JSD, we employ the training samples in each class to calculate clusters.
Conditioned on Image.
We randomly generate images for each input image in the test set. For LPIPS, we randomly select pairs of generated images for each context in the test set, compute LPIPS for each pair, and average all the values for this trial. Then, we randomly choose input images and their corresponding generated images to form the generated samples. We use the generated samples and all samples in the training set to compute FID. For NDB and JSD, we employ all the training samples for clustering, with one bin count for facades and another for the other datasets.
Conditioned on Text.
We randomly select sentences and generate images for each sentence, which form the generated samples. Then, we randomly select samples for computing FID, and cluster them into bins for NDB and JSD. For LPIPS, we randomly choose pairs for each sentence and average the values of all the pairs for this trial.
To analyze the influence of the regularization term, we conduct an ablation study by varying the weighting parameter $\lambda_{ms}$ on the image-to-image translation task using the facades dataset. Table 6 presents the quantitative results with different values of $\lambda_{ms}$. It can be observed that increasing $\lambda_{ms}$ improves the diversity of the generated images. Nevertheless, as the weighting parameter becomes larger than a threshold value, the training becomes unstable, which yields low-quality and even low-diversity synthesized images. As a result, we empirically set the weighting parameter $\lambda_{ms} = 1$ for all experiments.
We have explored another design choice for the distance metric. We conduct experiments using a discriminator feature distance in our regularization term in a way similar to the feature matching loss: the distance between two generated images is computed between the features they produce at the layers of the discriminator. We apply it to Pix2Pix on the facades dataset. Table 7 shows that MSGAN using the feature distance also obtains an improvement over Pix2Pix. However, MSGAN using the $L_1$ distance has higher diversity. Therefore, we employ MSGAN using the $L_1$ distance for all experiments.
We compare MSGAN with Pix2Pix and BicycleGAN in terms of training time, memory consumption, and model parameters on an NVIDIA TITAN X GPU. Table 8 shows that our method incurs marginal overheads. In contrast, BicycleGAN requires a longer time per iteration and more memory because of its additional encoder and second discriminator network.
Table 9: Datasets and baseline models used on various tasks.
|Context||Class Label||Paired Images||Unpaired Images||Text|
|Dataset||CIFAR-10||Facades, Maps||Yosemite, Cat↔Dog||CUB-200-2011|
|Baseline||DCGAN||Pix2Pix||DRIT||StackGAN++|
Table 10: Structure of the generator and discriminator of the modified DCGAN. We employ the following abbreviations: N = number of filters, K = kernel size, S = stride size, P = padding size. “Conv”, “Dconv”, and “BN” denote the convolutional layer, transposed convolutional layer, and batch normalization, respectively.
|Generator||Discriminator|
|Dconv(N512-K4-S1-P0), BN, Relu|| |
|Dconv(N256-K4-S2-P1), BN, Relu||Conv(N256-K4-S2-P1), BN, Leaky-Relu|
|Dconv(N128-K4-S2-P1), BN, Relu||Conv(N512-K4-S2-P1), BN, Leaky-Relu|
|Dconv(N3-K4-S2-P1), Tanh||Conv(N1-K4-S1-P0), Sigmoid|
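As a quick check of the layer settings listed above, the sketch below applies the standard output-size formulas for transposed and regular convolutions (no output padding, dilation 1). The 1x1 spatial input to the generator and the 16x16 input to the first listed discriminator layer are assumptions, since the input sizes are not stated in the table, and the function names are hypothetical.

```python
def dconv_out(size, kernel, stride, padding):
    """Output size of a transposed convolution (no output padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel


def conv_out(size, kernel, stride, padding):
    """Output size of a regular convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) for the four generator layers listed in the table.
generator_layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1)]
size = 1  # assumed 1x1 spatial input
generator_sizes = []
for k, s, p in generator_layers:
    size = dconv_out(size, k, s, p)
    generator_sizes.append(size)

# (kernel, stride, padding) for the three discriminator layers listed in the table.
discriminator_layers = [(4, 2, 1), (4, 2, 1), (4, 1, 0)]
size = 16  # assumed input size to the first listed layer
discriminator_sizes = []
for k, s, p in discriminator_layers:
    size = conv_out(size, k, s, p)
    discriminator_sizes.append(size)

print(generator_sizes)      # [4, 8, 16, 32]
print(discriminator_sizes)  # [8, 4, 1]
```

Under these assumptions the generator ends at a 32x32 output, which matches the CIFAR-10 image size, and the listed discriminator layers reduce a 16x16 feature map to a single score.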