VARGAN: Variance Enforcing Network Enhanced GAN

09/05/2021 ∙ by Sanaz Mohammadjafari, et al. ∙ Ryerson University

Generative adversarial networks (GANs) are one of the most widely used generative models. GANs can learn complex multi-modal distributions and generate realistic samples. Despite their major success in generating synthetic data, GANs can suffer from an unstable training process and mode collapse. In this paper, we introduce a new GAN architecture called variance enforcing GAN (VARGAN), which incorporates a third network to introduce diversity in the generated samples. The third network measures the diversity of the generated samples, and this measure is used to penalize the generator's loss for low diversity samples. The network is trained on the available training data and on undesired distributions with limited modality. On a set of synthetic and real-world image data, VARGAN generates a more diverse set of samples compared to recent state-of-the-art models. High diversity and low computational complexity, as well as fast convergence, make VARGAN a promising model to alleviate mode collapse.




1 Introduction

Generative Adversarial Networks (GANs) have emerged as a powerful generative model for learning data distributions. GANs have displayed great potential in generating high quality images (Karras et al., 2017; Brock et al., 2018), and have been successfully used for image super-resolution (Bin et al., 2017; Ledig et al., 2017) and image-to-image translation (Isola et al., 2017; Zhu et al., 2017a). GANs consist of a generator network responsible for generating samples similar to the true distribution, and a discriminator network that discriminates between the samples from the true and generated distributions. GANs aim to generate a set of samples that represent the true distribution, preserve the image quality and cover all the modes. Achieving these goals relies on several factors such as model structure, the networks' objective functions, parameter tuning and training procedure. Although GANs have shown great potential in generating synthetic data, their unstable training procedure can cause non-convergence and mode collapse (Salimans et al., 2016).

Mode collapse refers to generating a limited number of modes from multi-modal training data, and failing to generate a representative set of samples. This problem has limited GANs’ potential in real-world applications such as MRI scan generation for brain segmentation (Mok and Chung, 2018) and electromagnetic engineered surface (EES) generation to improve the radio signal coverage in telecommunication (Mohammadjafari et al., 2021). A large and growing body of literature has investigated various methods to alleviate the mode collapse issue in GANs through modified loss functions (Arjovsky et al., 2017; Che et al., 2016; Gulrajani et al., 2017), altered neural network structures (Zuo et al., 2019; Lin et al., 2018), different training methods (Metz et al., 2016) and latent space regularization (Gurumurthy et al., 2017; Li et al., 2021). Regardless of the methodological differences, the proposed studies fall short of identifying a global solution.

In this work, we focus on the problem of mode collapse in GANs for image generation. In particular, we are interested in binary and gray-scale images because of GANs’ practical applications. For instance, effective usage of GANs in generating binary EES designs has been established in (Mohammadjafari et al., 2021). However, cases of mode collapse were observed in the results, which affected the sample generation process. Moreover, research on image generation has been mostly restricted to colored image datasets. Therefore, our research can offer particular insights into binary and gray-scale image generation. Furthermore, generating binary and gray-scale images requires a less complex model structure and, as a result, has lower computational complexity than generating colored images. On the other hand, binary images pose certain challenges. For instance, the category of a binary EES design can change with small changes in the image, which makes the training process of GANs for these binary images more challenging and calls for more research in this direction.

Research goal

In this paper, we propose a novel GAN architecture called variance enforcing GAN (VARGAN) to alleviate the mode collapse problem in image generation, and increase the diversity of generated samples. We evaluate our method on two widely used datasets in the literature namely, 2D synthetic data and stacked MNIST, as well as on the real-life practical problem of generating EES designs. Furthermore, considering the practical importance of conditional GANs, we examine the impact of conditioning on the mode collapse.


The main contribution of our paper is proposing a novel GAN architecture called VARGAN, which uses a third network called variance enforcing network (VarNet) to increase the number of generated modes, and reduce the mode collapse. Major contributions of our study can be summarized as follows:

  • Our proposed method introduces a third network trained to compute the generated samples’ diversity. VARGAN shows performance improvement over state-of-the-art GAN models in the literature.

  • We explore the impact of conditioning and labels’ length on mode collapse. We hypothesize that conditioning the GANs on auxiliary information can majorly impact the number of uniquely generated modes. The results show how VARGAN’s performance, unlike other models, is not impacted by the length of conditioning labels.

  • We perform an extensive numerical study to evaluate GAN variants’ performance on addressing the mode collapse using a variety of synthetic and real datasets. The results are analyzed based on various performance metrics employed in recent studies. Specifically, we provide a detailed comparative analysis for different GAN architectures on a practical problem, EES generation, which has a significant application in telecommunication industry. Alleviating mode collapse problem and increasing the generated unique designs can greatly contribute to EES generation process by identifying designs that are not otherwise easy to obtain.

Organization of the paper:

The rest of this paper is organized as follows. Section 2 summarizes the recent work on methods to address the mode collapse issue in GANs. In Section 3, generic GAN structures are presented, and the methodology of the proposed architecture, as well as its training process are explained. Section 3 also covers the comparison of proposed method with the state-of-the-art GAN models. Section 4 provides a discussion on the model performance for different types of datasets. Finally, the paper is concluded in Section 5, and possible future directions are discussed.

2 Background

In this section, we review recent studies on GANs with a special focus on mode collapse issue. A summary of the recent papers is reported in Table 1, which contains information on the methodology, GAN structures and datasets used in the numerical experiments.

Salimans et al. (2016) proposed several approaches based on modifications in GANs’ architecture and loss functions. Feature matching, historical averaging and mini-batch discrimination are examples of the methods proposed to improve convergence and reduce mode collapse in GANs. The mini-batch discrimination approach adds a new component to the last layer of the discriminator by computing the distance among the samples' features in that layer. The authors claim that closer samples are more likely to be created by the generator, and that their proposed enhancements help with the discriminator’s decision making and improve the GAN’s convergence.

| Paper | Proposed method | Methodology | GAN structure | Dataset |
| --- | --- | --- | --- | --- |
| Salimans et al. (2016) | Minibatch discrimination | Distance of extracted features in the last layer of discriminator added to discriminator’s loss as penalty | | |
| Che et al. (2016) | Mode regularized GAN | Jointly train an encoder; add penalty factors to generator’s loss function | ConvGAN | Stacked MNIST |
| Arjovsky et al. (2017) | Wasserstein GAN | Wasserstein distance plus weight clipping | ConvGAN | LSUN-Bedrooms |
| Gulrajani et al. (2017) | WGAN-GP | Replace weight clipping in Wasserstein GANs with an enforced Lipschitz constraint | ConvGAN | LSUN-Bedrooms |
| Tolstikhin et al. (2017) | AdaGAN | Mixture of GANs trained on reweighted samples | ConvGAN | MNIST, Stacked MNIST |
| | | | FF GAN | Synthetic data |
| Park et al. (2018) | MEGAN | Mixture of GANs | ConvGAN | CelebA, LSUN-Church outdoor |
| Lin et al. (2018) | PacGAN | Discriminator’s architecture receives multiple samples at a time | ConvGAN | Stacked MNIST |
| | | | FF GAN | Synthetic data |
| Ghosh et al. (2018) | MADGAN | Multi-generators | FF GAN | Synthetic data |
| | | | ConvGAN | Stacked MNIST |
| | | | cCGAN | Night-to-day |
| Elfeki et al. (2019) | GDPP | Use negative correlations within a subset as diversity measure added to generator’s loss | FF GAN | Synthetic data |
| | | | ConvGAN | Stacked MNIST |
| Mao et al. (2019) | MSGAN | Image difference divided by noise difference added to generator’s loss function | Conditioned on labels | CIFAR-10 |
| | | | Conditioned on images | Winter-to-summer |
| | | | Text to image synthesis | CUB-200-2011 |
| Our study | VARGAN | A third network trained on samples’ diversity adds a penalty to the generator’s loss to encourage diversity | FF GAN | Synthetic data |
| | | | | Proprietary dataset |
| | | | | Gray-scale images |
Table 1: Summary of the relevant papers addressing the mode collapse (FF: Feed-forward, ConvGAN: Convolutional GAN, cCGAN: Conditional Convolutional GAN).

Altering the loss function to stabilize GANs’ training and to increase the number of covered modes is investigated in various studies. These methods include introducing new distance metrics such as Wasserstein distance (Arjovsky et al., 2017), incorporating regularization penalties for cost functions of the networks (Che et al., 2016; Gulrajani et al., 2017), and encouraging diversity in generated samples through an unsupervised penalty loss based on the samples in the last layer of the discriminator (Elfeki et al., 2019). Previous studies often evaluated the impact of modifications in neural networks’ structures by changing the GAN networks' architectures and by incorporating multiple networks to cover the missed modes (Park et al., 2018; Hoang et al., 2018; Srivastava et al., 2017; Zhong et al., 2019). For instance, Lin et al. (2018) altered the discriminator’s architecture to receive merged samples of the data with the same label (fake or real) and generate a single label for the merged group instead of one label per sample. Their proposed method, PacGAN, showed great improvement in fake samples’ diversity. GAN construction using multiple networks is inspired by the theory established for mode collapse in (Arjovsky and Bottou, 2017), where the disjoint distribution of real and generated data is considered as the source of instability and missed modes. Multi-agent GAN (MADGAN) is one particular approach that contains multiple generators and one discriminator network to encourage different generators toward separate modes of the data, and increase the variety of generated samples (Ghosh et al., 2018). Tolstikhin et al. (2017) proposed adaptive GAN (AdaGAN), which incrementally adds a new model to a mixture of GANs, and evaluates a new GAN model on re-weighted samples.

The majority of the studies investigating mode collapse consider vanilla GAN structures and exclude conditional GANs, as shown in Table 1. However, conditional GAN architectures have a wide range of applications in many areas (Isola et al., 2017; Zhu et al., 2017a; Mohammadjafari et al., 2021). They also suffer from mode collapse, which has been investigated in a limited number of studies. Zhu et al. (2017b) proposed a new hybrid GAN structure, namely Bicycle GAN, which incorporates bidirectional mapping from latent code to output. This study was implemented on particular conditional tasks, and has a significant computational complexity overhead. Mao et al. (2019) proposed a new general approach called MSGAN, which adds a regularization term to the generator’s loss function and is applicable to various GAN architectures. The MSGAN approach encourages the generator to map two different samples conditioned on the same context to be as distant from each other as possible.

Our VARGAN method utilizes some of the aforementioned enhancements such as modified loss functions and adding a new penalty term to the generator’s loss (e.g., see (Elfeki et al., 2019)), which computes the generated samples’ diversity. However, in contrast to the previous approaches, our proposed penalty term is calculated by a third network trained on various sets of training samples with their relative diversity level. Different than multi-network approaches, our method does not incorporate a new generator or a discriminator, and the third network is solely responsible to provide feedback to the generator. Moreover, our approach incorporates the same strategy of modified structures as (Lin et al., 2018) for its new network’s architecture.

3 Methodology

In this section, we describe our methods, datasets and the experimental setup. We first review the structure of the vanilla GAN and its training procedure, and then explain the VARGAN architecture in detail. In addition, we discuss how our proposed framework compares to the existing GAN architectures.

3.1 Standard GAN models

A generic GAN architecture consists of two networks known as the generator and the discriminator, both competing against each other (see Figure 1). The generator learns the distribution $p_g$ from a uniform or normal distribution input noise $p_z(z)$ mapped through the function $G(z; \theta_g)$ to the samples. $G$ is a differentiable function, usually defined as a neural network with a set of parameters $\theta_g$. The discriminator $D(x; \theta_d)$, which is also a differentiable function with parameters $\theta_d$, maps input samples to a probability value representing whether each sample is real or generated. Both networks are trained simultaneously, where the discriminator aims to maximize and the generator tries to minimize the objective value. Therefore, GAN is modelled as a min-max game with a value function $V(D, G)$, that is

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (1)$$

Figure 1: GAN structure consists of two competing networks, namely, discriminator and generator

The discriminator aims to maximize the value function by generating a probability value of one for the samples from the real data and zero for the samples from the generated distribution $p_g$. On the other hand, the generator minimizes the objective value by trying to trick the discriminator into outputting one for the generated samples. The training procedure of the GANs’ networks is illustrated in Figure 4.
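As a small illustration of the min-max game described above, the sketch below estimates the standard GAN value function from a handful of discriminator outputs. The function name `gan_value` and the example probabilities are hypothetical and chosen only for illustration; they are not taken from the paper's implementation.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs on real samples, each in (0, 1).
    d_fake: discriminator outputs on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A confident, correct discriminator pushes the value towards its maximum of 0.
strong = gan_value(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])

# A discriminator that cannot separate real from generated samples outputs
# 0.5 everywhere, which gives V = -log 4.
confused = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The discriminator prefers the first situation (higher value), while the generator is rewarded when the discriminator is pushed towards the second.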

(a) Discriminator training
(b) Generator training
Figure 4: GAN model training procedure including discriminator and generator feedback signals

Vanilla GAN has no control over the category of the generated samples. Therefore, conditional GANs were proposed to incorporate supplemental data as labels in both the generator and discriminator, and generate samples of a desirable category (Mirza and Osindero, 2014). Convolutional GANs are another important variant of GANs that use a convolutional neural network (CNN) instead of a feed-forward structure in their networks (Radford et al., 2015).

3.2 VARGAN

We propose a new GAN framework called VARGAN illustrated in Figure 5 by introducing a third network called VarNet to the vanilla GAN structure. The VarNet is used to reduce the mode collapse by encouraging diversity in the generated samples, which results in increased number of generated modes. Below, we describe VARGAN architecture and the training procedure.

Figure 5: VARGAN structure including generator, discriminator and VarNet networks

3.2.1 Data samples

The discriminator network in GANs utilizes two sets of data for its training process, namely, the real and fake data. A GAN aims to learn the real data distribution, which contains a certain number of unique modes. This number is therefore the target, and the highest number of unique modes that a generator can create. It is important to note that for datasets with an unknown number of unique modes, we can select different target values to encourage diversity. The fake data distribution is created by the generator network, which needs to improve in quality and diversity to reach the true distribution.

In the VARGAN framework, we introduce another set of data called limited modality data, which defines the undesirable distributions with a restricted number of modes. The limited modality data is used for the VarNet training process. The undesirable distributions include a lower number of modes than the target data, with different sample distributions. Unlike real and fake data, limited modality data is created from the real data by following a specific set of rules.

The first step in creating the limited modality data is to define the number of modes that the data sample covers, which is less than the target number of modes. Then, we select that many different data points from the batch of training data and repeat each of them equally often (the batch size divided by the number of selected modes) to create a uniform set of different categories in the batch. Each created batch of data with a different number of modes has a relative Mode Coverage Ratio (MCR) value between zero and one, which is defined as follows:


An MCR value of zero indicates a single unique design in the sample, representing the lowest level of diversity. As the number of modes increases, the MCR value increases as well. Equation (2) gives the MCR value as a function of the number of modes in the batch divided by the target number of modes, that is, the fraction of the target modes that are generated. Three constants control the MCR trajectory. Comparing Figures 8(a) and 8(b), which use two different values for one of these constants, shows how that constant controls the final value of MCR for a high number of modes. Figure 8 also illustrates how another constant controls the initial value of MCR. A small MCR value for a low number of covered modes can increase the generator’s loss at the beginning steps of the training process. This can adversely affect the model convergence, and needs to be controlled. Moreover, a high MCR value for a low number of modes can prevent the generator from further increasing the diversity. The remaining constant controls the slope of the MCR, and since the slope affects the model’s convergence process, it affects the convergence to the target MCR value. These constants are determined through preliminary analysis (see details in Appendix A). Generation of limited modality data and the relative MCR remains a challenging part of the VARGAN framework, since it enforces a few constraints on the number of selected modes. One constraint is that the number of modes should divide the batch size because of the VarNet structure explained in the following sections.
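The construction of a limited modality batch described above can be sketched as follows. This is a minimal sketch under stated assumptions: the helper name `make_limited_modality_batch` and its arguments are hypothetical, samples are plain Python lists, and "different data points" is approximated by drawing distinct positions from the batch (which does not by itself guarantee distinct classes). The MCR label attached to each batch is left out of the sketch.

```python
import random


def make_limited_modality_batch(real_batch, num_modes, rng=random):
    """Build a batch that contains only `num_modes` distinct samples.

    Each selected sample is repeated batch_size / num_modes times, so
    `num_modes` has to divide the batch size.
    """
    batch_size = len(real_batch)
    if not 0 < num_modes <= batch_size or batch_size % num_modes != 0:
        raise ValueError("num_modes must be positive and divide the batch size")
    # Draw distinct positions from the batch (assumption: distinct positions
    # stand in for distinct modes).
    chosen = rng.sample(real_batch, num_modes)
    repeats = batch_size // num_modes
    return [sample for sample in chosen for _ in range(repeats)]


# Toy batch of six 2-dimensional samples.
real_batch = [[float(i), float(-i)] for i in range(6)]
limited = make_limited_modality_batch(real_batch, num_modes=2, rng=random.Random(0))
```

With a batch size of six, valid choices for `num_modes` in this sketch are 1, 2, 3 and 6, which mirrors the divisibility constraint mentioned in the text.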

(a) MCR formulation trajectory, first hyperparameter setting
(b) MCR formulation trajectory, second hyperparameter setting
Figure 8: MCR formulation trajectory with respect to the number of generated modes divided by the target number of modes, using different values for the hyperparameters

3.2.2 VARGAN architecture

The VARGAN framework maintains the same discriminator and generator architecture as the standard GANs. The new network, VarNet, is a differentiable function created by a neural network with its own set of parameters that associates each set of input samples with an MCR value. As shown in Figure 5, the VarNet receives a sample from the real data distribution and a sample from the limited modality data, and maps each of them to a single scalar value between zero and one representing the level of diversity of the samples. The VarNet network employs the PacGAN structure (Lin et al., 2018) with a degree of packing equal to the batch size. The packing degree refers to the number of samples with the same label that are packed together. In our case, the packing degree is equal to the batch size, indicating that VarNet receives the whole batch of the data and maps it to one MCR value. Note that since the packing degree is equal to the batch size, the number of modes needs to divide the batch size to have a reasonable MCR value for each batch. A sample of the limited modality data with a batch size of six passed to a VarNet is illustrated in Figure 9.
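To make the "packing degree equal to the batch size" idea concrete, the sketch below packs a whole batch into one input vector and maps it to a single value in (0, 1). It is only an illustration: `TinyVarNet` is a hypothetical single-layer stand-in with random weights, and the sigmoid output is an assumption used to keep the score between zero and one; the actual VarNet layers are not specified here.

```python
import math
import random


def pack_batch(batch):
    """Concatenate every sample in the batch into one long input vector."""
    return [value for sample in batch for value in sample]


class TinyVarNet:
    """Hypothetical stand-in for VarNet: packed batch -> one scalar in (0, 1)."""

    def __init__(self, batch_size, sample_dim, seed=0):
        rng = random.Random(seed)
        # The input width is batch_size * sample_dim because the whole batch
        # is packed into a single input.
        self.in_features = batch_size * sample_dim
        self.weights = [rng.uniform(-0.1, 0.1) for _ in range(self.in_features)]
        self.bias = 0.0

    def __call__(self, batch):
        packed = pack_batch(batch)
        if len(packed) != self.in_features:
            raise ValueError("batch does not match the packing degree")
        logit = sum(w * x for w, x in zip(self.weights, packed)) + self.bias
        return 1.0 / (1.0 + math.exp(-logit))  # sigmoid keeps the score in (0, 1)


# Limited modality batch: two distinct samples, each repeated three times.
batch = [[0.0, 1.0]] * 3 + [[1.0, 0.0]] * 3
varnet = TinyVarNet(batch_size=6, sample_dim=2)
score = varnet(batch)
```

The point of the sketch is the shape bookkeeping: one batch in, one diversity score out, rather than one output per sample as in a standard discriminator.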

Figure 9: Limited modality data with a batch size of six passed to a sample VarNet structure with one output

3.2.3 VARGAN training process

The VARGAN training procedure is presented in Figure 12. The loss function for VarNet training is as follows:


The network receives two sets of inputs for the training process. The first one is a sample of training data with the target distribution and a high MCR value as labels. The second input is a set of limited modality data with different MCR values. The VarNet aims to minimize the loss value by mapping the training data samples to label one and the limited modality data to the relative MCR. The structure and loss function of the discriminator for VARGAN remain the same as in the generic GAN architecture (see Figure 4(a)). However, the generator’s loss function has a new addition provided in Equation (4) as follows:


The new term calculates the loss between the target level of MCR, which is one, and the MCR of the generator’s outputs measured by VarNet. This term is also multiplied by a positive coefficient, which controls the impact of VarNet on the generator’s loss. The value of this coefficient is selected based on hyperparameter tuning experiments, and for the majority of the experiments it is found to be one.
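The structure of the penalized generator loss can be sketched as below. This is a sketch under an explicit assumption: the distance between the target MCR and the VarNet output is taken to be a squared error, which may differ from the loss actually used in the paper. The names `generator_loss`, `varnet_mcr` and `penalty_weight` are hypothetical.

```python
def generator_loss(adversarial_loss, varnet_mcr, penalty_weight=1.0, target_mcr=1.0):
    """Adversarial generator loss plus a weighted diversity penalty.

    adversarial_loss: the usual generator loss from the discriminator feedback.
    varnet_mcr: MCR predicted by VarNet for the generated batch, in [0, 1].
    penalty_weight: positive coefficient controlling the impact of VarNet.
    """
    # Assumption: squared error between the target MCR (one) and the prediction.
    diversity_penalty = (target_mcr - varnet_mcr) ** 2
    return adversarial_loss + penalty_weight * diversity_penalty


# Same adversarial loss, different diversity as judged by VarNet.
collapsed = generator_loss(adversarial_loss=0.7, varnet_mcr=0.2)
diverse = generator_loss(adversarial_loss=0.7, varnet_mcr=1.0)
```

A batch that VarNet scores as fully diverse adds no penalty, while a low-diversity batch increases the generator's loss in proportion to the coefficient.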

(a) VarNet training
(b) Generator training
Figure 12: Training procedures of the VARGAN networks including VarNet and generator (the discriminator’s training process is similar to vanilla GAN and is not illustrated).

3.2.4 Comparison with other architectures

VARGAN employs similar strategies to other available GAN architectures to deal with mode collapse issue. Our proposed framework incorporates the generated samples diversity in each iteration to penalize the generator towards a multi-modal output. The idea of penalizing the generator for the diversity is also discussed in (Elfeki et al., 2019) where Determinantal Point Process (DPP) is used to compute the diversity from the features of the discriminator’s last layer. However, in our work, a third network is trained to evaluate the samples diversity. Also note that our approach differs from multi-network models. VARGAN does not rely on any additional generator or discriminator networks. The third network (VarNet) sends an additional feedback to the generator to help with the diversity. We incorporate the same methodology of modified structures as (Lin et al., 2018) for the VarNet architecture to help the model differentiate between samples. However, the main contribution of our model to address the mode collapse issue is the additional penalty term added to the generator. Moreover, mini-batch discrimination (Salimans et al., 2016) incorporates a distance measure in the discriminator architecture to help distinguish between real and fake samples, which is not employed in our VARGAN architecture.

3.3 Datasets

We consider three particular datasets in our numerical study, which are described below.

  • Synthetic data: We experiment with a number of synthetic datasets that have been considered in previous studies. The 2D ring dataset is a mixture of eight 2D Gaussians with a standard deviation of 0.01 in each dimension. We also experiment with 2D grid datasets, which are mixtures of 25 and 36 2D Gaussians with a standard deviation of 0.05 in each dimension, arranged on a square grid whose side length is the square root of the number of mixture modes. Examples of synthetic data are illustrated in Figure 13.

    Figure 13: Synthetic data
  • Stacked MNIST: Stacked MNIST data contains three channels, each presenting one digit from the MNIST dataset. Therefore, the stacked MNIST dataset covers 1000 different modes illustrated in Figure 14.

    Figure 14: Stacked MNIST data
  • EES dataset: The original EES dataset consists of full-wave electromagnetic simulations with the element shape represented by a binary image and its corresponding transmission response over frequency. The EES element shapes are 2D binary images with 8-fold symmetry, where a single octant of the image contains sufficient information to fully describe the topology. The exhaustive search space includes 32,768 unique designs, which are all the possible combinations of the 15 binary bits representing one eighth of the image. A similar dataset is generated for 19×19 designs, though unlike the 9×9 design dataset, it only contains a small subset of the corresponding search space.

    A sample set of EES elements is shown in Figure 17. The high pass (HP) and low pass (LP) designs are the two main categories of EES designs. In the EES dataset, the high pass designs, unlike low pass designs, have an edge-to-edge connection, which is illustrated in Figure 17(a). As mentioned above, the whole search space is available for 9×9 designs; however, for 19×19 designs, 100,000 samples form the respective dataset. The dataset used in this paper is synthetically generated based on the structural configurations of the original dataset considered in (Mohammadjafari et al., 2021).

    (a) High pass
    (b) Low pass
    Figure 17: EES dataset
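The synthetic 2D ring and grid datasets described in the list above can be generated with a few lines of code. The sketch below is hedged: placing the ring modes on the unit circle and spacing the grid modes two units apart, centered at the origin, are assumptions made for illustration, and the helper names (`ring_centers`, `grid_centers`, `sample_mixture`) are hypothetical. Only the number of modes and the standard deviations (0.01 and 0.05) come from the text.

```python
import math
import random


def ring_centers(num_modes=8, radius=1.0):
    """Mode centers spread evenly on a circle (assumed radius)."""
    return [
        (radius * math.cos(2 * math.pi * i / num_modes),
         radius * math.sin(2 * math.pi * i / num_modes))
        for i in range(num_modes)
    ]


def grid_centers(num_modes=25, spacing=2.0):
    """Mode centers on a square grid with sqrt(num_modes) points per side."""
    side = math.isqrt(num_modes)
    if side * side != num_modes:
        raise ValueError("num_modes must be a perfect square")
    offset = spacing * (side - 1) / 2.0  # center the grid at the origin
    return [
        (spacing * i - offset, spacing * j - offset)
        for i in range(side)
        for j in range(side)
    ]


def sample_mixture(centers, std, num_samples, rng):
    """Draw samples from an equally weighted mixture of isotropic Gaussians."""
    samples = []
    for _ in range(num_samples):
        cx, cy = rng.choice(centers)
        samples.append((rng.gauss(cx, std), rng.gauss(cy, std)))
    return samples


rng = random.Random(0)
ring = sample_mixture(ring_centers(8), std=0.01, num_samples=1000, rng=rng)
grid = sample_mixture(grid_centers(36), std=0.05, num_samples=1000, rng=rng)
```

Because the modes are explicitly defined, these datasets make it straightforward to count which modes a generator covers.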

3.4 Experimental setup

We evaluate our proposed VARGAN architecture on three different datasets and compare the performance with vanilla GAN and other popular GAN variants from the literature. Models include GANs with mini-batch discrimination (MB) (Salimans et al., 2016), PacGAN (Lin et al., 2018) with a packing factor of four, MADGAN (Ghosh et al., 2018) with four generators, GDPP (Elfeki et al., 2019) and PacVARGAN. PacVARGAN is a new model with a discriminator network incorporating the PacGAN structure with a packing factor of four and an additional VarNet network. Although the GAN structures in the first part of our experiments are not conditional, in the second part we investigate the effect of conditioning on mode collapse. We add the categories as labels to the best performing unconditioned model and compare the performance with baseline vanilla GAN models. Only the EES dataset is used for conditional GAN experiments.

The feed-forward (FF) GAN architecture is similar for all the datasets, whereas the convolutional GAN structure is dataset dependent, and reported in Appendix B for stacked MNIST and EES datasets. For the synthetic 2D data, a feed-forward architecture is selected, since data does not benefit from a convolutional structure. On the other hand, for stacked MNIST and EES, both feed-forward and convolutional neural networks are employed.

For synthetic and stacked MNIST data, each model is trained on 100,000 training samples and tested on 26,000 samples. The Adam optimizer is used with the same learning rate for all the networks. For the EES data, each model is trained on 100,000 samples for 19×19 designs, and on all the designs in the search space for 9×9 designs. The models are evaluated on 20,000 samples per category. The models are trained for 50, 30 and 400 epochs for synthetic, stacked MNIST and EES data, respectively. In VARGAN, the four MCR-related parameter values are selected as 1, 10, 5, and 1 across all datasets based on our hyperparameter tuning experiments (see Appendix A). In our numerical study, to minimize the random initialization effect, each experiment is repeated five times and average performance values are reported. All the models were implemented using Python and the PyTorch library, and experiments were performed on an NVIDIA GeForce RTX 2070 GPU with 8 GB of GPU RAM.

4 Numerical results

In this section, we first present performance metrics used to evaluate GAN architectures. Then, we report the performance of GAN architectures over a variety of datasets. Finally, we investigate the impact of conditioning on mode collapse in GANs.

4.1 Performance metrics

Evaluating mode collapse is a challenging task, except in cases such as synthetic data where the modes are explicitly defined. Therefore, in this work, each GAN architecture is evaluated based on a set of popular performance metrics relevant to the dataset specifications. Since the labels are available for each dataset, a pretrained classifier is used to determine the number of modes generated by the models. The number of modes and the Kullback-Leibler (KL) divergence, which measures the distance between the generated and real distributions, are reported for each model over all the datasets. The percentage of high-quality samples is another metric, introduced in (Lin et al., 2018) for evaluating model performance over a synthetic dataset. It measures the proportion of generated samples that are closer than 3 standard deviations to the center of a mode. Finally, for the stacked MNIST data, we use the inception score to evaluate the quality and diversity of the generated samples (Salimans et al., 2016). It is important to note that the run time performance in each experiment is reported based on the model’s training time in seconds, and does not reflect the theoretical complexity.
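For synthetic 2D data, the three metrics above can be computed directly from the known mode centers. The following is a minimal sketch under stated assumptions: samples are assigned to their nearest center instead of using a pretrained classifier, only high-quality samples count towards a mode, the real distribution is taken to be uniform over modes, and the KL divergence is computed from the generated to the real mode distribution. The function name `evaluate_modes` is hypothetical.

```python
import math
from collections import Counter


def evaluate_modes(samples, centers, std):
    """Return (number of modes, high-quality fraction, KL divergence)."""
    counts = Counter()
    high_quality = 0
    for x, y in samples:
        distances = [math.hypot(x - cx, y - cy) for cx, cy in centers]
        nearest = min(range(len(centers)), key=distances.__getitem__)
        # High quality: within 3 standard deviations of the nearest center.
        if distances[nearest] <= 3 * std:
            high_quality += 1
            counts[nearest] += 1
    total = sum(counts.values())
    if total == 0:
        return 0, 0.0, float("inf")
    # KL(generated || uniform) = sum_k p_k * log(p_k * K); empty modes add 0.
    kl = sum((c / total) * math.log((c / total) * len(centers))
             for c in counts.values())
    return len(counts), high_quality / len(samples), kl


centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# One sample exactly on each mode: full coverage, zero divergence.
balanced_result = evaluate_modes(centers, centers, std=0.05)
# Three samples on one mode plus one far-away outlier: mode collapse.
collapsed_result = evaluate_modes(
    [(0.0, 0.0), (0.01, 0.0), (0.0, -0.01), (5.0, 5.0)], centers, std=0.05
)
```

In the collapsed case the sketch reports a single mode, a high-quality fraction of 0.75 and a divergence of log 4, which shows how the three metrics capture different aspects of the same failure.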

4.2 Comparison of GAN architectures

In this section, we report performance evaluation results for different GAN architectures over synthetic and real-world datasets. Along with our extensive comparative analysis, we also investigate the convergence behaviours of the models over training epochs in the training process.

4.2.1 Results with synthetic dataset

We report the number of generated modes, KL divergence and percentage of high-quality samples for each synthetic dataset variant separately. As shown in Table 2, for 2D grids with 25 and 36 modes, VARGAN captures all the modes with a low divergence metric while generating high quality samples and requiring less training time.


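| Dataset | Model | Modes | High quality | KL divergence | Time (sec) |
| --- | --- | --- | --- | --- | --- |
| Ring 8 modes | GAN | 6.20 ± 1.72 | 0.87 ± 0.03 | 0.69 ± 0.24 | 2,731 ± 56 |
| | GAN + MB | 4.00 ± 2.60 | 0.56 ± 0.30 | 0.80 ± 0.45 | 693 ± 26 |
| | PacGAN | 8.00 ± 0.00 | 0.71 ± 0.09 | 0.03 ± 0.02 | 2,790 ± 6 |
| | MAD-GAN | 3.66 ± 2.42 | 0.25 ± 0.36 | 1.25 ± 0.57 | 1,393 ± 73 |
| | GDPP | 5.00 ± 1.26 | 0.60 ± 0.20 | 1.18 ± 0.21 | 1,962 ± 182 |
| | VARGAN | 4.80 ± 0.40 | 0.71 ± 0.23 | 1.05 ± 0.09 | 1,480 ± 31 |
| | PacVARGAN | 8.00 ± 0.00 | 0.32 ± 0.03 | 0.17 ± 0.07 | 1,477 ± 16 |
| Grid 25 modes | GAN | 25.00 ± 0.00 | 0.85 ± 0.04 | 0.45 ± 0.14 | 3,608 ± 124 |
| | GAN + MB | 5.60 ± 9.77 | 0.76 ± 0.38 | 1.94 ± 1.55 | 962 ± 2 |
| | PacGAN | 25.00 ± 0.00 | 0.66 ± 0.07 | 0.15 ± 0.04 | 3,642 ± 89 |
| | MAD-GAN | 24.80 ± 0.40 | 0.63 ± 0.23 | 0.57 ± 0.89 | 1,679 ± 81 |
| | GDPP | 25.00 ± 0.00 | 0.67 ± 0.02 | 0.26 ± 0.04 | 2,463 ± 85 |
| | VARGAN | 25.00 ± 0.00 | 0.80 ± 0.04 | 0.16 ± 0.06 | 814 ± 23 |
| | PacVARGAN | 25.00 ± 0.00 | 0.59 ± 0.02 | 0.15 ± 0.05 | 833 ± 3 |
| Grid 36 modes | GAN | 36.00 ± 0.00 | 0.84 ± 0.06 | 0.18 ± 0.06 | 6,154 ± 201 |
| | GAN + MB | 21.80 ± 17.39 | 0.66 ± 0.34 | 0.75 ± 1.41 | 1,136 ± 4 |
| | PacGAN | 36.00 ± 0.00 | 0.72 ± 0.03 | 0.09 ± 0.03 | 6,183 ± 228 |
| | MAD-GAN | 32.00 ± 6.92 | 0.57 ± 0.28 | 0.39 ± 0.40 | 1,804 ± 3 |
| | GDPP | 35.80 ± 0.40 | 0.70 ± 0.02 | 0.13 ± 0.11 | 2,712 ± 306 |
| | VARGAN | 36.00 ± 0.00 | 0.73 ± 0.02 | 0.05 ± 0.01 | 761 ± 6 |
| | PacVARGAN | 36.00 ± 0.00 | 0.66 ± 0.01 | 0.11 ± 0.04 | 774 ± 4 |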

Table 2: Performance metrics for synthetic datasets averaged over 5 repeats (all the GANs have FF structure, best performing model is bolded)

We note that VARGAN does not perform well for the 2D ring with 8 modes, whereas PacGAN generates a high number of modes as well as high quality samples. Overall, it is observed that PacGAN consistently generates more unique samples with better quality for all three groups of synthetic data. However, the training time for PacGAN is much higher than for other models, and is almost equal to that of the vanilla GAN model. Our results for the PacGAN model on the 2D ring and the 2D grid with 25 modes are consistent with the findings in (Lin et al., 2018). By combining the PacGAN properties with VARGAN, PacVARGAN shows both a high number of generated modes and low run times. We also find that mini-batch discrimination does not perform well across the different datasets. The reason may be the network’s feed-forward architecture, which cannot benefit from the mini-batch discrimination approach in addressing mode collapse. MADGAN outperforms mini-batch discrimination, but still does not perform as well as other models. Moreover, GDPP generates a high number of unique designs, which matches the findings of Elfeki et al. (2019). We find that VARGAN outperforms GDPP in terms of percentage of high quality samples and KL divergence.

Further analysis over the 50-epoch training process for the 2D ring data shows that VARGAN converges more quickly and performs better in the initial epochs than the other models. The result for the 2D grid data with 36 modes is illustrated in Figure 21 and shows strong performance for VARGAN in the number of generated modes as well as in the percentage of high quality samples. Although all the models except GAN+MB converge to the target number of modes, the variance over the repeats points to a better performance for VARGAN. Note that we repeated this analysis with the other two synthetic data variants, which led to similar observations (see Appendix C for more details). Overall, these results largely confirm the fast convergence of VARGAN in the earlier epochs of training. Furthermore, they point to the methodological and architectural differences between the GAN variants.

(a) Number of modes
(b) KL divergence
(c) Percentage of high quality samples
Figure 21: Comparison of different GAN models on synthetic 2D grid data with 36 modes (results are averaged over 5 repeats)
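The three metrics reported for the synthetic data can be sketched in a few lines. The snippet below is a minimal illustration, not the evaluation code used in the experiments: it assumes the convention common in the mode-collapse literature that a sample is "high quality" if it lies within three standard deviations of its nearest mode centre, that a mode is covered if it receives at least one such sample, and that the KL divergence compares the generated mode distribution with a uniform target. The function names, the grid spacing and the threshold `n_std` are hypothetical choices.

```python
import numpy as np

def grid_modes(side, spacing=2.0):
    """Centres of a side x side grid of Gaussian modes (assumed layout)."""
    ticks = (np.arange(side) - (side - 1) / 2) * spacing
    return np.array([(x, y) for x in ticks for y in ticks])

def mode_metrics(samples, centres, std, n_std=3.0):
    """Return (#covered modes, fraction of high quality samples, KL to uniform)."""
    # Distance from every sample to every mode centre: shape (N, K).
    dists = np.linalg.norm(samples[:, None, :] - centres[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    # Assumed quality rule: within n_std standard deviations of the nearest centre.
    high_quality = dists.min(axis=1) <= n_std * std
    # Count high quality samples per mode.
    counts = np.bincount(nearest[high_quality], minlength=len(centres))
    n_modes = int((counts > 0).sum())
    hq_fraction = float(high_quality.mean())
    if counts.sum() == 0:
        return n_modes, hq_fraction, float("inf")
    # KL(generated mode distribution || uniform target), summed over covered modes.
    p = counts / counts.sum()
    q = 1.0 / len(centres)
    nz = p > 0
    kl = float(np.sum(p[nz] * np.log(p[nz] / q)))
    return n_modes, hq_fraction, kl

rng = np.random.default_rng(0)
centres = grid_modes(6)  # 36 modes
diverse = centres[rng.integers(0, 36, 5000)] + rng.normal(0, 0.05, (5000, 2))
collapsed = centres[0] + rng.normal(0, 0.05, (1000, 2))
print(mode_metrics(diverse, centres, std=0.05))
print(mode_metrics(collapsed, centres, std=0.05))
```

Under these assumptions, a generator that collapses onto a single mode of the 36-mode grid covers one mode and has a KL divergence of log(36), whereas a generator that samples all modes uniformly drives the KL divergence towards zero.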

4.2.2 Results with stacked MNIST dataset

Table 3 summarizes the performance of convolutional (Conv) and feed-forward (FF) structures for different GAN models over stacked MNIST data. The results show that, overall, all the models perform well for both structures. The FF (vanilla) GAN captures all the modes, and attains a high inception score and a low KL divergence. GDPP and VARGAN with FF structures also generate a high number of modes and attain a low KL divergence. Results with convolutional structures show VARGAN as the best performing model, with a high number of modes, a low KL divergence and a high inception score. Compared to its FF counterpart, VARGAN performance drops slightly with the convolutional structure; however, its run time improves significantly. In contrast, PacVARGAN performance benefits significantly from the convolutional structure. The results also confirm that the MB method fails to perform well with the FF structure (similar to the synthetic data), and that its performance improves with the convolutional structure. It is worth mentioning that the FF PacGAN model performs slightly worse than its convolutional counterpart. One possible reason for this behaviour is the PacGAN discriminator structure, which combines samples and therefore changes the image data input in the FF case. On the other hand, in convolutional models, the number of input channels is multiplied by the packing degree and the image structure remains the same.
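The difference between packing in the FF and convolutional cases can be illustrated with array shapes alone. The sketch below assumes a packing degree `m` and uses hypothetical helper names; it is meant only to show that flat packing concatenates whole samples into one long vector, while channel packing leaves each image's spatial layout untouched.

```python
import numpy as np

def pack_flat(batch, m):
    """Pack m flattened samples side by side: (N, D) -> (N // m, m * D)."""
    n, d = batch.shape
    return batch[: n - n % m].reshape(-1, m * d)

def pack_channels(batch, m):
    """Pack m images along the channel axis: (N, C, H, W) -> (N // m, m * C, H, W)."""
    n, c, h, w = batch.shape
    return batch[: n - n % m].reshape(-1, m * c, h, w)

flat = np.zeros((8, 3 * 28 * 28))    # flattened 3-channel images (sizes are illustrative)
images = np.zeros((8, 3, 28, 28))
print(pack_flat(flat, 2).shape)       # (4, 4704)
print(pack_channels(images, 2).shape) # (4, 6, 28, 28)
```

In the flat case the discriminator input doubles in length and pixel positions from different samples become indistinguishable from positions within one sample; in the channel case only the channel count changes.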


Structure  Model       Modes        Inception score  KL divergence  Time (sec)
FF         GAN         1000 ± 0     1.74 ± 0.01      0.13 ± 0.00      981 ± 0
FF         GAN + MB     790 ± 257   1.61 ± 0.35      1.10 ± 1.18    1,447 ± 2
FF         PacGAN       920 ± 10    1.82 ± 0.05      0.79 ± 0.05    1,094 ± 9
FF         MAD-GAN      994 ± 11    1.56 ± 0.02      0.72 ± 0.12    1,589 ± 30
FF         GDPP         995 ± 2     1.74 ± 0.04      0.40 ± 0.03    2,467 ± 14
FF         VARGAN       999 ± 0     1.73 ± 0.02      0.24 ± 0.00    4,686 ± 17
FF         PacVARGAN    794 ± 42    1.95 ± 0.10      1.19 ± 0.14    4,862 ± 1
Conv       GAN          956 ± 23    1.88 ± 0.06      0.39 ± 0.11    1,203 ± 2
Conv       GAN + MB     916 ± 54    2.09 ± 0.09      0.76 ± 0.19    1,978 ± 0
Conv       PacGAN       935 ± 20    2.09 ± 0.11      0.84 ± 0.13    1,981 ± 1
Conv       MAD-GAN      959 ± 11    1.99 ± 0.05      0.87 ± 0.11    5,532 ± 7
Conv       GDPP         943 ± 52    2.18 ± 0.07      0.69 ± 0.23    3,334 ± 20
Conv       VARGAN       997 ± 1     2.16 ± 0.02      0.34 ± 0.06    2,336 ± 11
Conv       PacVARGAN    996 ± 2     2.14 ± 0.04      0.30 ± 0.03    2,243 ± 3

Table 3: Performance metrics for stacked MNIST dataset (with 1000 modes) averaged over 5 repeats (best performing model is bolded)

We note that the number of uniquely generated modes for the convolutional PacGAN and GDPP models matches the results reported by Lin et al. (2018) and Elfeki et al. (2019). We also examine the models' performance over the training epochs using GANs with convolutional structures (see Appendix C). We find that VARGAN and PacVARGAN show fast early convergence in terms of all three metrics, which makes them strong candidates for artificial data generation. In addition, earlier convergence can be beneficial in terms of the speed of GAN training.
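For stacked MNIST, the 1000 modes correspond to the possible triples of digits, one per colour channel. A minimal sketch of how mode coverage and KL divergence can be computed from per-channel digit predictions is given below. It assumes that a pretrained digit classifier (not shown, and hypothetical here) has already labelled each channel of every generated image, and that the target distribution over the 1000 modes is uniform; the function name is illustrative.

```python
import numpy as np

def stacked_mnist_metrics(digit_preds, n_modes=1000):
    """digit_preds: (N, 3) integer array of predicted digits, one per channel."""
    digit_preds = np.asarray(digit_preds)
    # Map each digit triple (a, b, c) to a mode id in [0, 999].
    mode_ids = digit_preds @ np.array([100, 10, 1])
    counts = np.bincount(mode_ids, minlength=n_modes)
    covered = int((counts > 0).sum())
    # KL(generated mode distribution || uniform target), assumed uniform over 1000 modes.
    p = counts / counts.sum()
    nz = p > 0
    kl = float(np.sum(p[nz] * np.log(p[nz] * n_modes)))
    return covered, kl

# A generator that only ever produces the triple (7, 7, 7) covers a single mode.
print(stacked_mnist_metrics(np.full((50, 3), 7)))
```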

4.2.3 Results with EES dataset

Table 4 shows the model performance for feed-forward and convolutional structures over 9×9 and 19×19 EES designs. Overall, the results indicate that convolutional models generally perform better than feed-forward models. For 9×9 designs with FF structures, VARGAN, GDPP and GAN create a high number of unique modes and achieve reasonable accuracy, with GDPP achieving the best KL divergence. On the other hand, training time is significantly higher for GDPP than for GAN and VARGAN. PacGAN performs poorly with FF structures for both 9×9 and 19×19 designs, similar to stacked MNIST. MAD-GAN with five generators does not perform well with the FF structure: it collapses completely and generates a single high quality sample.



Model      Modes (HP)        Modes (LP)       Accuracy (HP)     Accuracy (LP)     KL divergence       Time (sec)

9×9 designs, FF structure
GAN        14,572 ± 94       6,653 ± 112      0.3121 ± 0.0150   0.6774 ± 0.0164   0.00047 ± 0.00067      677 ± 9
GAN + MB   11,654 ± 5,828    5,071 ± 2,534    0.4284 ± 0.2480   0.5707 ± 0.2505   0.17226 ± 0.34417      672 ± 12
PacGAN      2,603 ± 757      1,148 ± 303      0.2970 ± 0.0181   0.7078 ± 0.0180   0.00066 ± 0.00026      663 ± 1
MAD-GAN    -                 -                -                 -                 -                   -
GDPP       14,580 ± 175      6,577 ± 191      0.3118 ± 0.0131   0.6862 ± 0.0132   0.00026 ± 0.00042    2,969 ± 19
VARGAN     14,793 ± 137      6,388 ± 259      0.2945 ± 0.0128   0.7015 ± 0.0166   0.00060 ± 0.00054      996 ± 13
PacVARGAN   1,939 ± 1,070      826 ± 431      0.2716 ± 0.0615   0.7264 ± 0.0497   0.01053 ± 0.01516      929 ± 4

9×9 designs, Conv structure
GAN        14,928 ± 107      6,592 ± 159      0.1822 ± 0.0150   0.4035 ± 0.0177   0.00040 ± 0.00020    3,704 ± 1,083
GAN + MB   15,009 ± 118      6,716 ± 95       0.1727 ± 0.0011   0.3739 ± 0.0148   0.00010 ± 0.00008   18,133 ± 1,805
PacGAN     14,881 ± 146      6,714 ± 210      0.2315 ± 0.0168   0.4976 ± 0.0264   0.00062 ± 0.00029    1,384 ± 32
MAD-GAN       359 ± 469        144 ± 135      0.0431 ± 0.0773   0.0301 ± 0.0496   0.43074 ± 0.34093    2,414 ± 20
GDPP       11,899 ± 5,892    5,303 ± 2,602    0.1487 ± 0.0267   0.2916 ± 0.1264   0.00292 ± 0.00579    3,036 ± 66
VARGAN     14,962 ± 164      6,659 ± 123      0.1608 ± 0.0218   0.3537 ± 0.0448   0.00014 ± 0.00015    1,841 ± 487
PacVARGAN  14,829 ± 224      6,729 ± 179      0.2131 ± 0.0065   0.4567 ± 0.0207   0.00054 ± 0.00059    3,072 ± 1,130

19×19 designs, FF structure
GAN        13,148 ± 7,349    6,351 ± 3,135    0.3374 ± 0.0911   0.6576 ± 0.0942   0.07343 ± 0.05317      439 ± 12
GAN + MB    9,698 ± 5,824    4,443 ± 2,139    0.3621 ± 0.1208   0.6260 ± 0.1191   0.06655 ± 0.03573      448 ± 29
PacGAN      2,232 ± 1,271    1,078 ± 741      0.4046 ± 0.3167   0.5928 ± 0.3152   0.26998 ± 0.25194      369 ± 30
MAD-GAN    -                 -                -                 -                 -                   -
GDPP        5,232 ± 5,206    2,625 ± 1,196    0.4412 ± 0.2078   0.5669 ± 0.2069   0.09441 ± 0.07688    1,241 ± 38
VARGAN     13,534 ± 6,718    5,204 ± 2,954    0.2784 ± 0.0917   0.7155 ± 0.0895   0.11176 ± 0.06120      673 ± 52
PacVARGAN  11,266 ± 3,077    4,476 ± 1,696    0.2810 ± 0.0426   0.7153 ± 0.0443   0.10882 ± 0.03999      637 ± 21

19×19 designs, Conv structure
GAN           237 ± 229        258 ± 346      0.6799 ± 0.2050   0.3216 ± 0.2050   0.17950 ± 0.22395      975 ± 60
GAN + MB      521 ± 611        550 ± 670      0.6138 ± 0.2615   0.3839 ± 0.2600   0.20468 ± 0.25432    3,766 ± 60
PacGAN     15,343 ± 1,180    7,260 ± 1,714    0.6805 ± 0.0442   0.3193 ± 0.0471   0.07148 ± 0.03324      908 ± 11
MAD-GAN     3,166 ± 2,037      689 ± 860      0.0275 ± 0.0477   0.1431 ± 0.1108   0.26012 ± 0.08118    2,256 ± 39
GDPP          501 ± 451        271 ± 329      0.7499 ± 0.3189   0.2509 ± 0.3192   0.42096 ± 0.24309    1,562 ± 73
VARGAN      1,848 ± 1,316      896 ± 415      0.6114 ± 0.2360   0.3878 ± 0.2379   0.15896 ± 0.18456      578 ± 50
PacVARGAN  13,965 ± 1,039    5,318 ± 1,606    0.7399 ± 0.0672   0.2574 ± 0.0696   0.13430 ± 0.08211      467 ± 50

Table 4: Performance metrics for feed-forward and convolutional structures on 9×9 and 19×19 EES designs with 2 categories averaged over 5 repeats (best performing model is bolded)

In convolutional models, GAN, GAN+MB, VARGAN and PacVARGAN generate the highest number of high pass designs. The lowest KL divergence values are also achieved by GAN+MB. However, the training time for GAN+MB is excessively high, approximately 10 times that of the VARGAN model for the 9×9 designs.

Intuitively, we expect the performance of the convolutional models to deteriorate for 19×19 designs compared to 9×9 designs. The input in this case is a two dimensional matrix with two symmetrical parts, and the model must discover this symmetry to generate acceptable designs. This is a more difficult task than the one faced by the feed-forward models and by 9×9 design generation. It is important to note that the entire design search space is not available in the 19×19 case. We observe that all the models outperform the vanilla GAN, with a high number of uniquely generated modes. Moreover, PacGAN, VARGAN and PacVARGAN report the lowest KL divergence among the models. Overall, for 19×19 designs, the best performance is achieved by PacGAN with convolutional structures, followed by PacVARGAN. In general, we observe more robustness for the convolutional structures, as evidenced by lower standard deviations over a variety of performance metrics. We also find that the convergence of the models over training epochs does not show any significant superiority for VARGAN compared to the other GAN models on the EES dataset.

4.3 Impact of conditioning on mode collapse

Mao et al. (2019) argue that conditioning distracts the focus of the generator from the input noise, which is responsible for creating variant designs. As a result, the generator is encouraged to create deterministic outputs based on the conditioned vector, causing mode collapse. To investigate the effect of conditioning on mode collapse, we have selected the best performing models from Section 4.2.3 to experiment with their conditional versions. Results for this experiment are reported in Table 5.





Model      Conditioned  Modes (HP)        Modes (LP)       Accuracy (HP)     Accuracy (LP)     KL divergence

9×9 designs, FF structure
GDPP       No           14,580 ± 175      6,577 ± 191      0.3118 ± 0.0131   0.6862 ± 0.0132   0.00026 ± 0.00042
GDPP       Yes          11,441 ± 198      6,146 ± 122      0.9366 ± 0.0072   0.9778 ± 0.0073   0.06611 ± 0.00292
MSGAN      Yes          11,513 ± 252      6,701 ± 200      0.9556 ± 0.0091   0.9820 ± 0.0053   0.07120 ± 0.00339
VARGAN     No           14,793 ± 137      6,388 ± 259      0.2945 ± 0.0128   0.7015 ± 0.0166   0.00060 ± 0.00054
VARGAN     Yes          11,338 ± 364      6,083 ± 321      0.9576 ± 0.0062   0.9789 ± 0.0030   0.07266 ± 0.00307

9×9 designs, Conv structure
PacGAN     No           14,881 ± 146      6,714 ± 210      0.2315 ± 0.0168   0.4976 ± 0.0264   0.00062 ± 0.00029
PacGAN     Yes          11,572 ± 3,856    5,585 ± 1,917    0.5229 ± 0.0361   0.7368 ± 0.0893   0.01101 ± 0.00637
MSGAN      Yes          14,962 ± 135      7,280 ± 72       0.5672 ± 0.0185   0.8410 ± 0.0168   0.00660 ± 0.00225
VARGAN     No           14,962 ± 164      6,659 ± 123      0.1608 ± 0.0218   0.3537 ± 0.0448   0.00014 ± 0.00015
VARGAN     Yes           4,927 ± 6,264    2,217 ± 2,851    0.4067 ± 0.3674   0.4236 ± 0.3543   0.10876 ± 0.13333
PacVARGAN  No           14,829 ± 224      6,729 ± 179      0.2131 ± 0.0065   0.4567 ± 0.0207   0.00054 ± 0.00059
PacVARGAN  Yes           9,207 ± 5,348    4,191 ± 2,581    0.4111 ± 0.1828   0.7138 ± 0.1223   0.00792 ± 0.00647

19×19 designs, FF structure
GAN        No           13,148 ± 7,349    6,351 ± 3,135    0.3374 ± 0.0911   0.6576 ± 0.0942   0.07343 ± 0.05317
GAN        Yes           4,712 ± 6,440    1,356 ± 1,472    0.1264 ± 0.0649   0.8762 ± 0.0641   0.34576 ± 0.17531
MSGAN      Yes           6,634 ± 8,488    3,639 ± 4,469    0.2830 ± 0.1837   0.7371 ± 0.1644   0.20766 ± 0.26400
VARGAN     No           13,534 ± 6,718    5,204 ± 2,954    0.2784 ± 0.0917   0.7155 ± 0.0895   0.11176 ± 0.06120
VARGAN     Yes           7,628 ± 9,566    2,896 ± 3,541    0.1296 ± 0.1506   0.6807 ± 0.3634   0.43210 ± 0.28316

19×19 designs, Conv structure
PacGAN     No           15,343 ± 1,180    7,260 ± 1,714    0.6805 ± 0.0442   0.3193 ± 0.0471   0.07148 ± 0.03324
PacGAN     Yes          13,393 ± 8,004    4,361 ± 2,720    0.6765 ± 0.1081   0.1165 ± 0.0771   0.19064 ± 0.12417
MSGAN      Yes           2,450 ± 2,747    1,353 ± 1,462    0.7131 ± 0.2232   0.2592 ± 0.2164   0.24830 ± 0.25369
VARGAN     No            1,848 ± 1,316      896 ± 415      0.6114 ± 0.2360   0.3878 ± 0.2379   0.15896 ± 0.18456
VARGAN     Yes           1,775 ± 637        751 ± 407      0.6184 ± 0.1575   0.2080 ± 0.1580   0.12574 ± 0.11315
PacVARGAN  No           13,965 ± 1,039    5,318 ± 1,606    0.7399 ± 0.0672   0.2574 ± 0.0696   0.13430 ± 0.08211
PacVARGAN  Yes           1,126 ± 776        675 ± 561      0.7705 ± 0.2262   0.1176 ± 0.1161   0.32746 ± 0.24215

Table 5: Comparison of conditioned and unconditioned models on EES designs (bolded models present the best performing model among the conditioned ones)

We observe that, for all the models, conditioning reduces the number of uniquely generated modes for both design sizes. The MSGAN of Mao et al. (2019) for conditional GANs outperforms all the conditioned versions of the models, except the convolutional PacGAN for 19×19 designs. We note that the conditional version of VARGAN for the feed-forward models outperforms MSGAN in both uniquely generated modes and KL divergence. Moreover, the conditional versions of VARGAN and PacVARGAN consistently perform worse than their unconditional versions, indicating that VARGAN variants do not need a conditioned model to create a reasonable number of samples for each category/mode.
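The mode seeking idea behind MSGAN can be summarised as rewarding the generator when two outputs drawn under the same condition differ by more than their latent codes do. The sketch below follows that description of Mao et al. (2019) in a simplified form: the use of a mean absolute distance for both outputs and latent codes, the inverse-ratio loss and the constant `eps` are assumptions made for illustration, and the function name is hypothetical.

```python
import numpy as np

def mode_seeking_loss(out1, out2, z1, z2, eps=1e-5):
    """Loss that shrinks as outputs move apart faster than their latent codes."""
    d_out = np.mean(np.abs(out1 - out2))  # distance between the two generated samples
    d_z = np.mean(np.abs(z1 - z2))        # distance between the two latent codes
    ratio = d_out / d_z
    # Minimising 1 / ratio is equivalent to maximising the ratio itself.
    return 1.0 / (ratio + eps)

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
# A collapsed generator ignores z and returns the same output for both codes.
print(mode_seeking_loss(np.ones((4, 8)), np.ones((4, 8)), z1, z2))
# A generator whose output varies with z receives a much smaller penalty.
print(mode_seeking_loss(2 * z1, 2 * z2, z1, z2))
```

A generator that produces deterministic outputs for a given condition is therefore penalized heavily, which is the behaviour that conditioning is argued to encourage.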

EES data with 8 categories

EES designs of size 9×9 can be further categorized into eight classes (Mohammadjafari et al., 2021). We next examine the effect of increasing the number of conditioned categories on the diversity of the generated samples. Specifically, we take the best performing unconditioned models and condition them on two and eight categories. Table 6 presents the results for the convolutional GAN, PacGAN, MSGAN and VARGAN variants for eight categories.



Category  GAN                PacGAN             MSGAN              VARGAN             PacVARGAN          Total samples
HP 1      2,521 ± 238        2,664 ± 35         2,740 ± 15         1,543 ± 1,170      2,605 ± 279        2,804
HP 2      2,410 ± 116        2,531 ± 20         2,566 ± 12         1,452 ± 1,116      2,479 ± 166        2,614
HP 3      2,945 ± 169        3,035 ± 45         3,130 ± 14         1,730 ± 1,323      2,972 ± 317        3,206
HP 4      7,819 ± 446        8,322 ± 27         8,359 ± 25         4,702 ± 3,665      8,011 ± 613        8,556
HP 5      5,093 ± 311        5,344 ± 28         5,399 ± 23         2,933 ± 2,350      5,101 ± 622        5,537
HP SUM    20,788             21,896             22,194             12,364             21,168             22,717
LP 1      3,788 ± 231        3,994 ± 45         4,026 ± 32         2,216 ± 1,758      3,816 ± 373        4,193
LP 2      4,083 ± 266        4,353 ± 17         4,370 ± 24         2,465 ± 1,959      4,090 ± 537        4,467
LP 3      1,278 ± 73         1,342 ± 5          1,366 ± 6            761 ± 581        1,265 ± 182        1,391
LP SUM    9,149              9,689              9,762              5,442              9,171              10,051
KL        0.00407 ± 0.00283  0.00337 ± 0.00183  0.00156 ± 0.00100  0.08914 ± 0.10644  0.00394 ± 0.00431  -
Table 6: Performance metrics reported for EES designs using convolutional models for 8 categories averaged over 5 repeats (HP 1: Band Pass, HP 2: Band Pass Stop, HP 3: High Pass, HP 4: True High Pass, HP 5: Very High Pass, LP 1: Band Stop, LP 2: Low Pass, LP 3: Stop Pass)

As expected, PacGAN, MSGAN and PacVARGAN lead to a high performance in this case. Figure 24 provides a comparison of the number of unique designs generated for the 2-category and 8-category EES datasets. The results show an increase in the number of covered modes as the number of conditioned categories increases. A possible reason is the one-hot encoding of the labels: a longer label vector increases the randomness and diversity of the samples and encourages the generator to avoid creating deterministic results. It is worth mentioning that VARGAN performance does not depend on the length of the condition vector. The variation in the samples generated by VARGAN is encouraged by another factor (i.e., the MCR values), which removes the need to control the length and number of categories.

(a) High pass designs unique modes (right: 8 category data, left: 2 category data)
(b) Low pass designs unique modes (right: 8 category data, left: 2 category data)
Figure 24: Comparison of total generated unique modes for high pass and low pass designs for 2 and 8 categories over 9×9 designs
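The role of the label length can be made concrete by looking at the generator input in a conditional model. The sketch below assumes that the condition is supplied by concatenating a one-hot label to the noise vector, which is one common way of conditioning a GAN; the noise dimension of 100 and the function name are hypothetical.

```python
import numpy as np

def conditional_input(noise, labels, n_classes):
    """Concatenate noise with one-hot labels: (N, Z) and (N,) -> (N, Z + n_classes)."""
    one_hot = np.eye(n_classes)[labels]
    return np.concatenate([noise, one_hot], axis=1)

rng = np.random.default_rng(0)
noise = rng.normal(size=(4, 100))  # assumed noise dimension
two_cat = conditional_input(noise, np.array([0, 1, 1, 0]), n_classes=2)
eight_cat = conditional_input(noise, np.array([0, 3, 7, 5]), n_classes=8)
print(two_cat.shape, eight_cat.shape)  # (4, 102) (4, 108)
```

With two categories, the condition contributes only two distinct input patterns, so a generator that ignores the noise can produce at most two distinct outputs; with eight categories, the same failure still yields eight.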

5 Conclusions

In this paper, we proposed a new GAN architecture called VARGAN to alleviate the mode collapse issue. VARGAN incorporates an additional network that measures the diversity of the samples. The new network's loss on generated samples is used to penalize the generator and to introduce diversity in the generated samples. We compared the performance of our method with state-of-the-art GAN architectures on three different datasets. The results show high performance, fast convergence and low training times for the proposed VARGAN architecture. We also investigated the effect of conditioning on mode collapse in GANs. Our experiments indicate that conditioning reduces the number of generated modes and induces mode collapse, which can result from deterministic sample generation based on the auxiliary information. We also experimented with MSGAN, which was proposed for reducing mode collapse in conditional GANs. Our results show that, among the conditional models, conditional VARGAN performs slightly better than MSGAN for feed-forward structures, whereas MSGAN outperforms conditional VARGAN for convolutional structures. Furthermore, we examined the effect of the condition vector length on mode collapse. Our analysis with the EES dataset shows that mode collapse happens when the generator ignores the noise randomness and creates deterministic results based on the conditions. We observe that VARGAN is robust to the condition vector length, which points to the architectural difference between VARGAN and the existing models.

Our study contributes to a better understanding of methodologies to address the mode collapse issue; however, evaluating mode collapse remains a challenging task. In this study, we limit the evaluation of the experiments to the performance metrics discussed in previous studies on mode collapse in GANs. In our analysis, we focus solely on one network architecture taken from other studies in the literature, which may affect the models' performance. In future research, we plan to design more extensive hyperparameter tuning experiments to check the model's capabilities across different architectures. We have also selected one architecture for our VarNet model, inspired by Lin et al. (2018), which can be further improved to better serve the purpose of diversity evaluation. Specifically, we can integrate a computational loss based on the designs' differences into our generator's loss function.


References

  • M. Arjovsky and L. Bottou (2017) Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862.
  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875.
  • H. Bin, C. Weihai, W. Xingming, and L. Chun-Liang (2017) High-quality face image sr using conditional generative adversarial networks. arXiv preprint arXiv:1707.00737.
  • A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
  • T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li (2016) Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136.
  • M. Elfeki, C. Couprie, M. Riviere, and M. Elhoseiny (2019) GDPP: learning diverse generations using determinantal point processes. In International Conference on Machine Learning, pp. 1774–1783.
  • A. Ghosh, V. Kulharia, V. P. Namboodiri, P. H. Torr, and P. K. Dokania (2018) Multi-agent diverse generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8513–8521.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777.
  • S. Gurumurthy, R. Kiran Sarvadevabhatla, and R. Venkatesh Babu (2017) Deligan: generative adversarial networks for diverse and limited data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 166–174.
  • Q. Hoang, T. D. Nguyen, T. Le, and D. Phung (2018) MGAN: training generative adversarial nets with multiple generators. In International conference on learning representations.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134.
  • T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
  • C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690.
  • W. Li, L. Xu, Z. Liang, S. Wang, J. Cao, T. C. Lam, and X. Cui (2021) JDGAN: enhancing generator on extremely limited data via joint distribution. Neurocomputing 431, pp. 148–162.
  • Z. Lin, A. Khetan, G. Fanti, and S. Oh (2018) Pacgan: the power of two samples in generative adversarial networks. In Advances in neural information processing systems, pp. 1498–1507.
  • Q. Mao, H. Lee, H. Tseng, S. Ma, and M. Yang (2019) Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1429–1437.
  • L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein (2016) Unrolled generative adversarial networks. CoRR abs/1611.02163.
  • M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
  • S. Mohammadjafari, O. Ozyegen, M. Cevik, E. Kavurmacioglu, J. Ethier, and A. Basar (2021) Designing mm-wave electromagnetic engineered surfaces using generative adversarial networks. Neural Computing and Applications, pp. 1–15.
  • T. C. Mok and A. C. Chung (2018) Learning data augmentation for brain tumor segmentation with coarse-to-fine generative adversarial networks. In International MICCAI Brainlesion Workshop, pp. 70–80.
  • D. K. Park, S. Yoo, H. Bahng, J. Choo, and N. Park (2018) Megan: mixture of experts of generative adversarial networks for multimodal image generation. arXiv preprint arXiv:1805.02481.
  • A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242.
  • A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton (2017) Veegan: reducing mode collapse in gans using implicit variational learning. In Advances in Neural Information Processing Systems, pp. 3308–3318.
  • I. O. Tolstikhin, S. Gelly, O. Bousquet, C. Simon-Gabriel, and B. Schölkopf (2017) Adagan: boosting generative models. In Advances in Neural Information Processing Systems, pp. 5424–5433.
  • P. Zhong, Y. Mo, C. Xiao, P. Chen, and C. Zheng (2019) Rethinking generative mode coverage: a pointwise guaranteed approach. arXiv preprint arXiv:1902.04697.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017a) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232.
  • J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017b) Multimodal image-to-image translation by enforcing bi-cycle consistency. In Advances in neural information processing systems, pp. 465–476.
  • Z. Zuo, L. Zhao, H. Zhang, Q. Mo, H. Chen, Z. Wang, A. Li, L. Qiu, W. Xing, and D. Lu (2019) LDMGAN: reducing mode collapse in gans with latent distribution matching.

Appendix A Selection of MCR formulation and its parameters

We have explored three different formulations for the MCR values. Equation (5) presents a linear relationship between MCR value and percentage of covered modes.


Equation (6) models a relationship that achieves early convergence to reward the model with low generator error.


Equation (7) models a relationship where it avoids rewarding the generator too early to keep it motivated.


Figure 25 shows the trajectory of different formulations based on the percentage of covered modes.

Figure 25: MCR value based on the number of generated modes divided by the number of target modes for different equations. The left, middle and right panels present Equations (5), (6) and (7), respectively.
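The three qualitative trajectories described above (linear, early rewarding and late rewarding) can be visualised with simple stand-in curves. The functional forms and the steepness parameter `k` below are illustrative assumptions chosen only to reproduce these three shapes as a function of the fraction of covered modes `p`; they are not Equations (5) to (7), and the function names are hypothetical.

```python
import numpy as np

def mcr_linear(p):
    """Reward grows in proportion to the fraction of covered modes."""
    return np.asarray(p, dtype=float)

def mcr_early(p, k=5.0):
    """Saturating curve: most of the reward is granted at low coverage."""
    p = np.asarray(p, dtype=float)
    return (1.0 - np.exp(-k * p)) / (1.0 - np.exp(-k))

def mcr_late(p, k=5.0):
    """Convex curve: the reward stays low until coverage is nearly complete."""
    p = np.asarray(p, dtype=float)
    return (np.exp(k * p) - 1.0) / (np.exp(k) - 1.0)

coverage = np.linspace(0.0, 1.0, 5)
for name, fn in [("linear", mcr_linear), ("early", mcr_early), ("late", mcr_late)]:
    print(name, np.round(fn(coverage), 3))
```

All three stand-ins start at zero and reach one at full coverage; they differ only in how quickly the value rises in between, which is the property the comparison of formulations is concerned with.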

Firstly, we have experimented with constant hyperparameter values for the different formulations on synthetic data with 36 modes. Our results, illustrated in Figure 29, indicate that Equation (6) presents a suitable trajectory for the MCR values over the percentage of covered modes.

(a) Number of modes
(b) KL divergence
(c) Percentage of high quality samples
Figure 29: Comparison of different MCR formulations on synthetic 2D grid data with 36 modes averaged over 5 repeats

Finally, we have used different hyperparameter values for Equation (6) on synthetic data with 36 modes. The hyperparameter optimization results are illustrated in Figure 33.

(a) Number of modes
(b) KL divergence
(c) Percentage of high quality samples
Figure 33: Comparison of different hyperparameters for Equation (6) on synthetic 2D grid data with 36 modes averaged over 5 repeats

Equation (6) with hyperparameter values of 1, 10 and 5 shows an early convergence for all the metrics compared to the other formulations and hyperparameter settings. As shown in Figure 25, the values of 1, 10 and 5 enforce a low initial MCR value with a moderate speed of convergence to an MCR value of one. In other words, VARGAN performance does not improve by defining a large MCR value for initial low mode coverage or a fast convergence trajectory.

Appendix B Architecture of models

In this section, a detailed summary of GAN architectures is presented. Table 7 presents the feed-forward model architecture used for all the datasets.

Network        Number of units  Activation function  Regularization
Generator      512              LeakyReLU            -
               1,024            LeakyReLU            -
               2,048            LeakyReLU            -
               4,096            LeakyReLU            -
               Output size      Sigmoid              -
Discriminator  2,048            LeakyReLU            -
               1,024            LeakyReLU            Dropout(0.3)
               512              LeakyReLU            Dropout(0.3)
               1                Sigmoid              -

Table 7: Structure of FF GAN model
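To give a sense of the size of the feed-forward model in Table 7, the layer widths can be turned into a parameter count. The sketch below takes the hidden widths directly from the table; the noise dimension of 100, the flattened 9×9 output size of 81 and the presence of a bias term in every layer are assumptions made for illustration, and the function names are hypothetical.

```python
GENERATOR_HIDDEN = [512, 1024, 2048, 4096]  # Table 7, LeakyReLU layers
DISCRIMINATOR_HIDDEN = [2048, 1024, 512]    # Table 7, LeakyReLU layers

def dense_params(sizes):
    """Weights plus biases of a fully connected stack with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

def ff_gan_params(noise_dim, output_size):
    """Parameter counts of the generator and discriminator for a given input/output size."""
    gen = dense_params([noise_dim] + GENERATOR_HIDDEN + [output_size])
    disc = dense_params([output_size] + DISCRIMINATOR_HIDDEN + [1])
    return gen, disc

# Assumed sizes: 100-dimensional noise, flattened 9x9 design as output.
gen, disc = ff_gan_params(noise_dim=100, output_size=81)
print(f"generator: {gen:,} parameters, discriminator: {disc:,} parameters")
```

Under these assumptions the generator is roughly four times larger than the discriminator, and most of its parameters sit in the 2,048 to 4,096 layer.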

Table 8 shows the convolutional model architecture used for stacked MNIST dataset. We have modified the architecture for other GAN variants to implement the specific details of their methodology.

Network        Layer   # of channels  Kernel size  Activation function
Generator      Linear  -              -            ReLU
               Conv    256            4            ReLU
               Conv    128            4            ReLU
               Conv    64             4            ReLU
               Conv    3              4            Tanh
Discriminator  Conv    64             4            LeakyReLU
               Conv    128            4            LeakyReLU
               Conv    256            4            LeakyReLU
               Conv    512            4            LeakyReLU
               Linear  -              -            Sigmoid

Table 8: Structure of convolutional GAN model for stacked MNIST dataset

The convolutional model architecture used for the EES dataset is reported in Table 9. The number of convolutional layers is changed to implement both 9×9 and 19×19 designs.

Network        Layer   # of channels  Kernel size  Activation function
Generator      Linear  -              -            LeakyReLU
               Conv    256            3            LeakyReLU
               Conv    128            3            LeakyReLU
               Conv    1              3            Sigmoid
Discriminator  Conv    64             3            LeakyReLU
               Conv    128            3            LeakyReLU
               Conv    256            3            LeakyReLU
               Linear  -              -            Sigmoid

Table 9: Structure of convolutional GAN model for EES dataset

Appendix C Detailed results

In this section, a performance comparison of the GAN models over the training epochs is presented. Figures 37 and 41 illustrate the convergence of the models for synthetic data with 8 and 25 modes based on different performance metrics. The VARGAN and GDPP models show early convergence in the initial epochs for synthetic data with 8 modes. VARGAN shows strong early convergence on synthetic data with 25 modes as well, and the rest of the models trail it by a large gap.

(a) Number of modes
(b) KL divergence
(c) Percentage of high quality samples
Figure 37: Comparison of different GAN models on synthetic 2D ring data with 8 modes averaged over 5 repeats
(a) Number of modes
(b) KL divergence
(c) Percentage of high quality samples
Figure 41: Comparison of different GAN models on synthetic 2D grid data with 25 modes averaged over 5 repeats

Figure 45 presents the convergence of performance metrics for convolutional GAN models on stacked MNIST data. Both VARGAN and PacVARGAN show great early convergence on all metrics.

(a) Number of modes
(b) KL divergence
(c) Inception Score
Figure 45: Performance comparison of different GAN models with convolutional GAN structure for stacked MNIST dataset averaged over 5 repeats