1 Introduction
Generative adversarial networks (GANs) are an innovative technique for training generative models to produce realistic examples from a data distribution [14]. Suppose we are given i.i.d. samples from an unknown probability distribution P over some high-dimensional space (e.g., images). The goal of generative modeling is to learn a model that enables us to produce samples from P that are not in the training data. Classical approaches to this problem typically search over a parametric family (e.g., a Gaussian mixture), and fit parameters to maximize the likelihood of the observed data. Such likelihood-based methods suffer from the curse of dimensionality in real-world datasets, such as images. Deep neural network-based generative models were proposed to cope with this problem
[22, 15, 14]. However, these modern generative models can be difficult to train, in large part because it is challenging to evaluate their likelihoods. Generative adversarial networks made a breakthrough in training such models, with an innovative training method that uses a minimax formulation whose solution is approximated by iteratively training two competing neural networks—hence the name “adversarial networks”.

GANs have attracted a great deal of interest recently. They are able to generate realistic, crisp, and original examples of images [14, 9] and text [47]. This is useful in image and video processing (e.g. frame prediction [45], image super-resolution [25], and image-to-image translation [17]), as well as dialogue systems or chatbots—applications where one may need realistic but artificially generated data. Further, they implicitly learn a latent, low-dimensional representation of arbitrary high-dimensional data. Such embeddings have been hugely successful in the area of natural language processing (e.g. word2vec [31]). GANs have the potential to provide such an unsupervised solution to learning representations that capture the semantics of the domain for arbitrary data structures and applications.

Primer on GANs.
Neural-network-based generative models are trained to map a (typically lower-dimensional) random variable Z from a standard distribution (e.g. spherical Gaussian) to a domain of interest, like images. In this context, a generator is a function G mapping the latent space to the sample space, which is chosen from a rich class of parametric functions like deep neural networks. In unsupervised generative modeling, one of the goals is to train the parameters of such a generator from unlabelled training data drawn independently from some real-world dataset (such as celebrity faces in CelebA [29] or natural images from CIFAR-100 [23]), in order to produce examples that are realistic but different from the training data.

A breakthrough in training such generative models was achieved by the innovative idea of GANs [14]. GANs train two neural networks: one for the generator G and the other for a discriminator D. These two neural networks play a dynamic minimax game against each other. An analogy provides the intuition behind this idea. The generator is acting as a forger trying to make fake coins (i.e., samples), and the discriminator is trying to detect which coins are fake and which are real. If these two parties are allowed to play against each other long enough, eventually both will become good. In particular, the generator will learn to produce coins that are indistinguishable from real coins (but preferably different from the training coins it was given).
Concretely, we search for (the parameters of) neural networks G and D that optimize the following type of minimax objective:

min_G max_D  E_{X∼P}[ log D(X) ] + E_{Z}[ log(1 − D(G(Z))) ] ,   (1)

where P is the distribution of the real data, and Z is the input code vector, drawn from a standard distribution such as a spherical Gaussian. Here D is a function that tries to distinguish between real data and generated samples, whereas G is the mapping from the latent space to the data space. Critically, [14] shows that the global optimum of (1) is achieved if and only if P = Q, where Q is the generated distribution of G(Z). We refer to Section 4 for a detailed discussion of this minimax formulation. The solution to the minimax problem (1) can be approximated by iteratively training two “competing” neural networks, the generator and the discriminator. Each model can be updated individually by backpropagating the gradient of the loss function to each model’s parameters.
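The fixed-point claim can be checked numerically for discrete distributions: with the optimal discriminator D*(x) = P(x)/(P(x)+Q(x)), the inner maximum of the objective equals 2·d_JS(P, Q) − log 4, which attains its global minimum −log 4 exactly when Q = P. A minimal sketch (numpy; the toy three-symbol distributions are our own choice for illustration):

```python
import numpy as np

def inner_value(p, q):
    """Value of the inner maximization for discrete P, Q, evaluated at
    the optimal discriminator D*(x) = p(x) / (p(x) + q(x))."""
    d_star = p / (p + q)
    return np.sum(p * np.log(d_star)) + np.sum(q * np.log(1.0 - d_star))

def jsd(p, q):
    """Jensen-Shannon divergence between discrete distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# The inner maximum equals 2 * JSD(P, Q) - log 4 ...
assert np.isclose(inner_value(p, q), 2 * jsd(p, q) - np.log(4))
# ... and reaches its global minimum of -log 4 exactly when Q = P.
assert np.isclose(inner_value(p, p), -np.log(4))
```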
Mode Collapse in GANs.
One major challenge in training GANs is a phenomenon known as mode collapse, which collectively refers to a lack of diversity in generated samples. One manifestation of mode collapse is the observation that GANs commonly miss some of the modes when trained on multimodal distributions. For instance, when trained on handwritten digits with ten modes, the generator might fail to produce some of the digits [37]. Similarly, in tasks that translate a caption into an image, generators have been shown to generate series of nearly identical images [35]. Mode collapse is believed to be related to the training instability of GANs—another major challenge in GANs.
Several approaches have been proposed to fight mode collapse, e.g. [10, 11, 40, 37, 30, 6, 36, 33]. We discuss prior work on mode collapse in detail in Section 6. Proposed solutions rely on modified architectures [10, 11, 40, 37], loss functions [6, 1], and optimization algorithms [30]. Although each of these proposed methods is empirically shown to help mitigate mode collapse, it is not well understood how the proposed changes relate to mode collapse. Previously proposed heuristics fall short of providing rigorous explanations of why they achieve empirical gains, especially when those gains are sensitive to architecture hyperparameters.
Our Contributions.
In this work, we examine GANs through the lens of binary hypothesis testing. By viewing the discriminator as performing a binary hypothesis test on samples (i.e., whether they were drawn from the real distribution P or the generated distribution Q), we can apply insights from the classical hypothesis testing literature to the analysis of GANs. In particular, this hypothesis-testing viewpoint provides a fresh perspective and understanding of GANs that leads to the following contributions:

The first contribution is conceptual: we propose a formal mathematical definition of mode collapse that abstracts away the geometric properties of the underlying data distributions (see Section 4.1). This definition is closely related to the notions of false alarm and missed detection in binary hypothesis testing (see Section 4.3). Given this definition, we provide a new interpretation of the pair of distributions (P, Q) as a two-dimensional region called the mode collapse region, where P is the true data distribution and Q the generated one. The mode collapse region provides new insights on how to reason about the relationship between those two distributions (see Section 4.1).

The second contribution is analytical: through the lens of hypothesis testing and mode collapse regions, we show that if the discriminator is allowed to see samples from the m-th order product distributions P^m and Q^m instead of the usual target distribution P and generator distribution Q, then the corresponding loss when training the generator naturally penalizes generator distributions with strong mode collapse (see Section 4.2). Hence, a generator trained with this type of discriminator will be encouraged to choose a distribution that exhibits less mode collapse. The region interpretation of mode collapse and the corresponding data processing inequalities provide the analysis tools that allow us to prove strong and sharp results with simple proofs (see Section 5). This follows a long tradition in the information theory literature (e.g. [41, 8, 7, 48, 44, 28, 18, 19, 20]) where operational interpretations of mutual information and corresponding data processing inequalities have given rise to simple proofs of strong technical results.

The third contribution is algorithmic: based on the insights from the region interpretation of mode collapse, we propose a new GAN framework to mitigate mode collapse, which we call PacGAN. PacGAN can be applied to any existing GAN, and it requires only a small modification to the discriminator architecture (see Section 2). The key idea is to pass m “packed” or concatenated samples to the discriminator, which are jointly classified as either real or generated. This allows the discriminator to do binary hypothesis testing based on the product distributions P^m and Q^m, which naturally penalizes mode collapse (as we show in Section 4.2). We demonstrate on benchmark datasets that PacGAN significantly improves upon competing approaches in mitigating mode collapse (see Section 3). Further, unlike existing approaches that jointly use multiple samples, e.g. [37], PacGAN requires no hyperparameter tuning and incurs only a slight overhead in the architecture.
Outline.
This paper is structured as follows: we present the PacGAN framework in Section 2, and evaluate it empirically according to the metrics and experiments proposed in prior work (Section 3). In Section 4, we propose a new definition of mode collapse, and provide analyses showing that PacGAN mitigates mode collapse. The proofs of the main results are provided in Section 5. Finally, we describe in greater detail the related work on GANs in general and mode collapse in particular in Section 6.
2 PacGAN: A novel framework for mitigating mode collapse
We propose a new framework for mitigating mode collapse in GANs. We start with an arbitrary existing GAN (for a list of some popular GANs, we refer to the GAN zoo: https://github.com/hindupuravinash/the-gan-zoo), which is typically defined by a generator architecture, a discriminator architecture, and a loss function. Let us call this triplet the mother architecture.
The PacGAN framework maintains the same generator architecture and loss function as the mother architecture, and makes a slight change only to the discriminator. That is, instead of using a discriminator that maps a single sample (either from the real data or from the generator) to a (soft) label, we use an augmented discriminator that maps m samples, jointly coming from either the real data or the generator, to a single (soft) label. These m samples are drawn independently from the same distribution—either real (jointly labelled as Y = 1) or generated (jointly labelled as Y = 0). We refer to the concatenation of samples with the same label as packing, the resulting concatenated discriminator as a packed discriminator, and the number m of concatenated samples as the degree of packing. We call this approach a framework instead of an architecture, because the proposed approach of packing can be applied to any existing GAN, using any architecture and any loss function, as long as it uses a discriminator that classifies a single input sample.
We propose the nomenclature “Pac(X)m”, where (X) is the name of the mother architecture, and m is an integer that refers to how many samples are packed together as an input to the discriminator. For example, if we take an original GAN and feed the discriminator three packed samples as input, we call this “PacGAN3”. If we take the celebrated DCGAN [34] and feed the discriminator four packed samples as input, we call this “PacDCGAN4”. When we refer to the generic principle of packing, we use PacGAN without a subsequent integer.
How to pack a discriminator. Note that there are many ways to change the discriminator architecture to accept packed input samples. We propose to keep all hidden layers of the discriminator exactly the same as in the mother architecture, and only increase the number of nodes in the input layer by a factor of m. For example, in Figure 1, suppose we start with a mother architecture in which the discriminator is a fully-connected feedforward network. Here, each sample lies in a two-dimensional space, so the input layer has two nodes. Now, under PacGAN2, we would multiply the size of the input layer by the packing degree (in this case, two), and the connections to the first hidden layer would be adjusted so that the first two layers remain fully connected, as in the mother architecture. The grid-patterned nodes in Figure 1 represent the input nodes for the second sample.
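For a fully-connected discriminator, packing amounts to reshaping groups of m same-label samples into single, wider input vectors. A minimal numpy sketch (the helper name `pack` and the toy shapes are our own):

```python
import numpy as np

def pack(batch, m):
    """Group consecutive same-label samples into packed inputs.

    batch: array of shape (n, d), with all rows drawn from the same
    source (all real, or all generated). Returns an array of shape
    (n // m, m * d); each row is fed to a packed discriminator whose
    input layer is m times wider than in the mother architecture.
    """
    n, d = batch.shape
    n = (n // m) * m               # drop any incomplete final group
    return batch[:n].reshape(n // m, m * d)

# Six 2-dimensional samples packed with degree m = 2 become three
# 4-dimensional inputs, matching the Figure 1 example.
samples = np.arange(12.0).reshape(6, 2)
assert pack(samples, m=2).shape == (3, 4)
```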
Similarly, when packing a DCGAN, which uses convolutional neural networks for both the generator and the discriminator, we simply stack the m images into a single tensor whose depth is m times larger. For instance, the discriminator for PacDCGAN5 on the MNIST dataset of handwritten digits [24] would take an input of size 28 × 28 × 5, since each individual black-and-white MNIST image is 28 × 28 pixels. Only the input layer and the number of weights in the corresponding first convolutional layer will increase in depth by a factor of five. By modifying only the input dimension and fixing the number of hidden and output nodes in the discriminator, we can focus purely on the effects of packing in our numerical experiments in Section 3.

How to train a packed discriminator.
Just as in standard GANs, we train the packed discriminator with a bag of samples from the real data and the generator. However, each minibatch in the stochastic gradient descent now consists of packed samples. Each packed sample is of the form (X_1, …, X_m, Y), where the label Y is 1 for real data and 0 for generated data, and the m independent samples from either class are jointly treated as a single, higher-dimensional feature (X_1, …, X_m). The discriminator learns to classify packed samples jointly. Intuitively, packing helps the discriminator detect mode collapse because a lack of diversity is more obvious in a set of samples than in a single sample. Fundamentally, packing allows the discriminator to observe samples from product distributions, which highlight mode collapse more clearly than the unmodified data and generator distributions. We make this statement precise in Section 4.

Notice that the computational overhead of PacGAN training is marginal, since only the input layer of the discriminator gains new parameters. Furthermore, we keep all training hyperparameters identical to the mother architecture, including the stochastic gradient descent minibatch size, weight decay, learning rate, and the number of training epochs. This is in contrast with other approaches for mitigating mode collapse that require significant computational overhead and/or delicate hyperparameter selection
[11, 10, 37, 40, 30].

Computational complexity.
The exact computational complexity overhead of PacGAN (compared to GANs) is architecture-dependent, but can be computed in a straightforward manner. For example, consider a discriminator with d fully-connected hidden layers, each containing w nodes. Since the discriminator has a binary output, the output layer has a single node, and is fully connected to the previous layer. We seek the computational complexity of a single minibatch parameter update, where each minibatch contains r samples. Backpropagation in such a network is dominated by the matrix-vector multiplication in each hidden layer, which has complexity O(w^2) per input sample, assuming a naive implementation. Hence the overall minibatch update complexity is O(r d w^2). Now suppose the input layer is expanded by a factor of m. If we keep the same number of minibatch elements, the per-minibatch cost grows to O(r (d + m) w^2). We find that in practice, even small degrees of packing (we use m ≤ 4 in our experiments) give good results.
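The accounting above can be sketched with a toy multiply-count. Like the paragraph it illustrates, this rough model assumes the (packed) input width is comparable to the hidden width w; the helper function is our own and counts only the dominant matrix-vector products:

```python
def disc_flops_per_sample(d, w, m=1):
    """Rough multiply-count for one forward pass of a fully-connected
    discriminator with d hidden layers of width w, whose input layer
    has been widened by the packing degree m (naive implementation,
    matrix-vector products only)."""
    cost = (m * w) * w           # widened input layer -> first hidden layer
    cost += (d - 1) * w * w      # remaining hidden-to-hidden layers
    cost += w                    # final w -> 1 output node
    return cost

# Packing changes only the first-layer term: O((d + m) w^2) per sample
# instead of O(d w^2); the extra cost is exactly (m - 1) * w^2.
base = disc_flops_per_sample(d=4, w=128, m=1)
packed = disc_flops_per_sample(d=4, w=128, m=3)
assert packed - base == 2 * 128 * 128
```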
3 Experiments
On standard benchmark datasets, we compare PacGAN to several baseline GAN architectures, some of which are explicitly proposed to mitigate mode collapse: GAN [14], DCGAN [34], VEEGAN [40], Unrolled GANs [30], and ALI [11]. We also implicitly compare against BiGAN [10], which is conceptually identical to ALI. To isolate the effects of packing, we make minimal choices in the architecture and hyperparameters of our packing implementation. For each experiment, we evaluate packing by taking a standard, baseline GAN implementation that was not designed to prevent mode collapse, and adding packing in the discriminator. In particular, our goal for this section is to reproduce experiments from the existing literature, apply the packing framework to the simplest GAN among the baselines, and showcase how packing affects performance.
Metrics.
For consistency with prior work, we measure several previously-used metrics. On datasets with clear, known modes (e.g., Gaussian mixtures, labelled datasets), prior papers have counted the number of modes that are produced by a generator [10, 30, 40]. In labelled datasets, this number can be evaluated using a third-party trained classifier that classifies the generated samples [37]. In Gaussian mixture models (GMMs), for example in [40], a mode is considered lost if there is no sample in the generated test data within x standard deviations of the center of that mode. In [40], x is set to three for the 2D-ring and 2D-grid, and ten for the 1200D-synthetic dataset. A second metric used in [40] is the number of high-quality samples, which is the proportion of samples that are within x standard deviations of the center of some mode. Finally, the reverse Kullback-Leibler divergence over the modes has been used to measure the quality of mode collapse as follows. Each of the generated test samples is assigned to its closest mode; this induces an empirical, discrete distribution with an alphabet size equal to the number of observed modes in the generated samples. A similar induced discrete distribution is computed from the real data samples. The reverse KL divergence between the induced distribution from the generated samples and the induced distribution from the real samples is used as a metric. Each of these three metrics has shortcomings—for example, the number of observed modes does not account for class imbalance among generated modes, and all of these metrics only work for datasets with known modes. Defining an appropriate metric for evaluating GANs is an active research topic [42, 46, 38].

Datasets.
We use a number of synthetic and real datasets for our experiments, all of which have been studied or proposed in prior work. The 2D-ring [40] is a mixture of eight two-dimensional spherical Gaussians with means spaced uniformly around a circle and small variances in each dimension. The 2D-grid [40] is a mixture of 25 two-dimensional spherical Gaussians with means arranged on a square grid and small variances in each dimension.

To examine real data, we use the MNIST dataset [24], which consists of 70,000 images of handwritten digits, each 28 × 28 pixels. Unmodified, this dataset has 10 modes, one for each digit. As done in mode-regularized GANs [6], Unrolled GANs [30] and VEEGAN [40], we augment the number of modes by stacking the images. That is, we generate a new dataset of 128,000 images, in which each image consists of three randomly-selected MNIST images that are stacked into a 28 × 28 × 3 image in RGB. This new dataset has (with high probability) 1,000 modes. We refer to this as the stacked MNIST dataset.
3.1 Synthetic data experiments from VEEGAN [40]
Our first experiment evaluates the number of modes and the number of high-quality samples for the 2D-ring and the 2D-grid. Results are reported in Table 1. The first four rows are copied directly from Table 1 in [40]. The last three rows contain our own implementation of PacGANs. Other than packing, we make no additional choices in the hyperparameters, the generator architecture, the discriminator architecture, or the loss. Our implementation attempts to reproduce the VEEGAN architecture to the best of our knowledge, as described below.
Architecture and hyperparameters. All of the GANs we implemented in this experiment use the same overall architecture, which is chosen to match the architecture in VEEGAN’s code [40]. The generators have two hidden layers with 128 units per layer and ReLU activations, trained with batch normalization [16]. The input noise is a two-dimensional spherical Gaussian with zero mean and unit variance. The discriminator has one hidden layer with 128 units; the hidden layer uses LinearMaxout with 5 maxout pieces, and no batch normalization is used in the discriminator.

We train each GAN with 100,000 total samples and a minibatch size of 100 samples; training is run for 200 epochs. The discriminator and generator are trained with the standard cross-entropy losses of (1), except for VEEGAN, whose discriminator has an additional regularization term. Adam [21] stochastic gradient descent is applied, with the generator weights and the discriminator weights each updated once per minibatch. At test time, we use 2,500 samples from the learned generator for evaluation. Each metric is evaluated and averaged over 10 trials.
                            2D-ring                          2D-grid
                            Modes     high-quality           Modes      high-quality
                            (Max 8)   samples                (Max 25)   samples
GAN [14]                    1.0       99.30 %                3.3        0.5 %
ALI [11]                    2.8       0.13 %                 15.8       1.6 %
Unrolled GAN [30]           7.6       35.60 %                23.6       16.0 %
VEEGAN [40]                 8.0       52.90 %                24.6       40.0 %
PacGAN2 (ours)              8.0±0.0   78.5±7.7 %             24.6±0.9   65.8±13.4 %
PacGAN3 (ours)              8.0±0.0   84.0±6.1 %             24.9±0.3   71.4±13.8 %
PacGAN4 (ours)              8.0±0.0   82.7±11.3 %            25.0±0.0   76.0±7.1 %

Table 1: Results for two synthetic mixtures of Gaussians: number of modes captured by the generator and percentage of high-quality samples. Our results are averaged over 10 trials and shown with the standard error.
Results.
Table 1 shows that PacGAN outperforms or matches the baseline schemes, both in the number of modes captured and in the percentage of high-quality samples. As expected, increasing the degree of packing seems to increase the average number of modes found, though the increases are marginal for easy tasks. On the 2D-grid and 2D-ring, we find that PacGAN slightly outperforms VEEGAN, but the proposed datasets seem not to be challenging enough to highlight meaningful differences. However, one can clearly see the gain of packing by comparing the GAN in the first row (which is the mother architecture) and the PacGANs in the last rows. The simple change we make to the mother architecture according to the principle of packing makes a significant difference in performance, and the overhead of the changes made to the mother architecture is minimal compared to the baselines [11, 30, 40].
Note that maximizing the number of high-quality samples is not necessarily indicative of a good generative model. First, we expect some fraction of the probability mass to lie outside the “high-quality” boundary, and that fraction increases with the dimensionality of the dataset. For reference, we find empirically that the expected fraction of high-quality samples in the true data distribution for the 2D ring and grid are both 98.9%, which corresponds to the theoretical ratio 1 − e^(−9/2) ≈ 98.9% for a single 2D Gaussian with a three-standard-deviation threshold. These values are higher than the fractions found by PacGAN, indicating room for improvement. However, a generative model could output 100% high-quality points by learning very few modes (as is the case for GANs on the 2D ring in Table 1).
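The 98.9% figure can be reproduced directly: for an isotropic two-dimensional Gaussian, the probability mass within x standard deviations of the mean is 1 − exp(−x²/2):

```python
import math

# Probability that a sample from an isotropic 2D Gaussian falls within
# x = 3 standard deviations of the mean: 1 - exp(-x^2 / 2).
frac = 1 - math.exp(-3 ** 2 / 2)
assert abs(frac - 0.9889) < 1e-4   # matches the 98.9% figure above
```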
Note that our goal is not to compete with the baselines of ALI, Unrolled GAN, and VEEGAN, but to showcase the improvement that can be obtained with packing. In this spirit, we can easily apply our framework to other baselines and test “PacALI”, “PacUnrolledGAN”, and “PacVEEGAN”. In fact, we expect that most GAN architectures can be packed to improve sample quality. However, for these benchmark tests, we see that packing the simplest GAN is sufficient.
3.2 Stacked MNIST experiments
In our next experiments, we evaluate mode collapse on the stacked MNIST dataset (described at the beginning of Section 3). These experiments are direct comparisons to analogous experiments in VEEGAN [40] and Unrolled GANs [30]. For these evaluations, we generate 26,000 samples from the generator. Each of the three channels in each sample is classified by a pre-trained third-party MNIST classifier, and the resulting three digits determine which of the 1,000 modes the sample belongs to. We measure the number of modes captured, as well as the KL divergence between the generated distribution over modes and the expected true one (i.e., a uniform distribution over the 1,000 modes).
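This evaluation pipeline can be sketched as follows (numpy; the classifier outputs are simulated with random digits, and the helper name is our own):

```python
import numpy as np

rng = np.random.default_rng(0)

def stacked_mode_metrics(digit_preds, n_modes=1000):
    """Map per-channel digit predictions to modes and score coverage.

    digit_preds: integer array of shape (n, 3); row (d1, d2, d3) holds
    the classifier outputs for the three channels of one sample.
    Returns (number of observed modes, KL(empirical || uniform)).
    """
    modes = digit_preds[:, 0] * 100 + digit_preds[:, 1] * 10 + digit_preds[:, 2]
    counts = np.bincount(modes, minlength=n_modes)
    p = counts / counts.sum()
    nz = p > 0
    kl = float(np.sum(p[nz] * np.log(p[nz] * n_modes)))
    return int(nz.sum()), kl

# A generator covering all 1,000 modes uniformly gives KL close to 0.
preds = rng.integers(0, 10, size=(26000, 3))
n_modes, kl = stacked_mode_metrics(preds)
assert 0 < n_modes <= 1000 and kl >= 0
```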
Hyperparameters.
For these experiments, we train each GAN on 128,000 samples, with a minibatch size of 64. The generator’s loss function is −log(D(generated data)), and the discriminator’s loss function is −log(D(real data)) − log(1 − D(generated data)). We update the generator parameters twice and the discriminator parameters once per minibatch, and train the generators over 50 epochs. For testing, we generate 26,000 samples, and evaluate the empirical KL divergence and number of modes covered. Finally, we average these values over 10 runs of the entire pipeline.
3.2.1 VEEGAN [40] Experiment
In this experiment, we replicate Table 2 from [40], which measured the number of observed modes in a generator trained on the stacked MNIST dataset, as well as the KL divergence of the generated mode distribution.
Architecture.
In line with prior work [40], we used a DCGAN-like architecture for these experiments, which is based on the code at https://github.com/carpedm20/DCGAN-tensorflow. In particular, the generator and discriminator architectures are as follows:
Generator:
layer                    number of outputs   kernel size   stride   activation function

Input: z                 100
Fully connected          2*2*512                                    ReLU
Transposed convolution   4*4*256             5*5           2        ReLU
Transposed convolution   7*7*128             5*5           2        ReLU
Transposed convolution   14*14*64            5*5           2        ReLU
Transposed convolution   28*28*3             5*5           2        Tanh
Discriminator (for PacDCGANm):
layer             number of outputs   kernel size   stride   BN    activation function

Input: m samples  28*28*(3*m)
Convolution       14*14*64            5*5           2              LeakyReLU
Convolution       7*7*128             5*5           2        Yes   LeakyReLU
Convolution       4*4*256             5*5           2        Yes   LeakyReLU
Convolution       2*2*512             5*5           2        Yes   LeakyReLU
Fully connected   1                                                Sigmoid
Results.
Results are shown in Table 2. Again, the first four rows are copied directly from [40]. The last three rows are computed using a basic DCGAN, with packing in the discriminator. We find that packing gives good mode coverage, reaching all 1,000 modes in every trial. Given a DCGAN that captures at most 99 modes on average (our mother architecture), the principle of packing, which is a small change in the architecture, is able to improve performance to capture all 1,000 modes. Again we see that packing the simplest DCGAN is sufficient to fully capture all the modes in this benchmark test, and we do not pursue packing more complex baseline architectures. Existing approaches to mitigate mode collapse, such as ALI, Unrolled GANs, and VEEGAN, are not able to capture as many modes.
                       Stacked MNIST
                       Modes (Max 1000)   KL
DCGAN [34]             99.0               3.40
ALI [11]               16.0               5.40
Unrolled GAN [30]      48.7               4.32
VEEGAN [40]            150.0              2.95
PacDCGAN2 (ours)       1000.0±0.0         0.06±0.01
PacDCGAN3 (ours)       1000.0±0.0         0.06±0.01
PacDCGAN4 (ours)       1000.0±0.0         0.07±0.01

Table 2: Number of modes captured and reverse KL divergence on stacked MNIST.
Note that other classes of GANs may also be able to learn most or all of the modes if tuned properly. For example, [30] reports that regular GANs can learn all 1,000 modes even without unrolling if the discriminator is large enough, and if the discriminator is half the size of the generator, unrolled GANs recover up to 82% of the modes when the unrolling parameter is increased to 10. To explore this effect, we conduct further experiments on unrolled GANs in Section 3.2.2.
3.2.2 Unrolled GAN [30] Experiment
This experiment is designed to replicate Table 1 from Unrolled GANs [30]. Unrolled GANs exploit the observation that iteratively updating discriminator and generator model parameters can contribute to training instability. To mitigate this, they update model parameters by computing the loss function’s gradient with respect to k sequential discriminator updates, where k is called the unrolling parameter. [30] reports that unrolling improves mode collapse as k increases, at the expense of greater training complexity.
Unlike Section 3.2.1, which reported a single metric for unrolled GANs, this experiment studies the effect of the unrolling parameter and the discriminator size on the number of modes learned by a generator. The key differences between these trials and the unrolled GAN row in Table 2 are threefold: (1) the unrolling parameters are different, (2) the discriminator sizes are different, and (3) the generator and discriminator architectures are chosen according to Appendix E in [30].
Results.
Our results are reported in Table 3. The first four rows are copied from [30]. As before, we find that packing seems to increase the number of modes covered. Additionally, in both experiments, PacDCGAN finds more modes on average than Unrolled GANs with 10 unrolling steps, with lower reverse KL divergences between the mode distributions. This suggests that packing has a more pronounced effect than unrolling. However, note that the standard error for PacDCGANs is larger than that reported in [30]; this may be due to our relatively small sample size of 10 trials.
                              D is 1/4 size of G               D is 1/2 size of G
                              Modes (Max 1000)   KL            Modes (Max 1000)   KL
DCGAN [34]                    30.6±20.73         5.99±0.42     628.0±140.9        2.58±0.75
Unrolled GAN, 1 step [30]     65.4±34.75         5.91±0.14     523.6±55.77        2.44±0.26
Unrolled GAN, 5 steps [30]    236.4±63.30        4.67±0.43     732.0±44.98        1.66±0.09
Unrolled GAN, 10 steps [30]   327.2±74.67        4.66±0.46     817.4±37.91        1.43±0.12
PacDCGAN2 (ours)              370.8±244.34       3.33±1.02     877.1±51.96        0.99±0.13
PacDCGAN3 (ours)              534.3±103.68       2.11±0.52     851.6±98.60        1.02±0.34
PacDCGAN4 (ours)              557.7±101.37       2.06±0.61     896.0±72.83        0.82±0.25

Table 3: Number of modes captured and reverse KL divergence on stacked MNIST, for two discriminator sizes.
4 Theoretical analyses of PacGAN
In this section, we propose a formal and natural mathematical definition of mode collapse, which abstracts away domain-specific details (e.g. images vs. time series). For a target distribution P and a generator distribution Q, this definition describes mode collapse through a two-dimensional representation of the pair (P, Q) as a region.

Mode collapse is a phenomenon commonly reported in the GAN literature [13, 35, 43, 32, 3], and it can refer to two distinct concepts: (i) the generative model loses some modes that are present in the samples of the target distribution; for example, despite being trained on a dataset of animal pictures that includes lizards, the model never generates images of lizards. (ii) Two distant points in the code vector space are mapped to the same or similar points in the sample space; for instance, two distant latent vectors map to the same picture of a lizard [13]. Although these phenomena are different, and either one can occur without the other, they are generally not explicitly distinguished in the literature, and it has been suggested that the latter may cause the former [13]. In this paper, we focus on the former notion, as it does not depend on how the generator maps a code vector to a sample, and only focuses on the quality of the samples generated. In other words, we assume here that two generative models with the same marginal distribution over the generated samples should not be treated differently based on how random code vectors are mapped to the data sample space. The second notion of mode collapse would differentiate two such architectures, and is beyond the scope of this work. The proposed region representation relies purely on the properties of the generated samples, and not on the generator’s mapping between the latent and sample spaces.
We analyze how the proposed idea of packing changes the training of the generator. We view the discriminator’s role as providing a surrogate for a desired loss to be minimized—surrogate in the sense that the actual desired losses, such as Jensen-Shannon divergence or total variation distance, cannot be computed exactly and need to be estimated. Consider the standard GAN discriminator with a cross-entropy loss:

min_G max_D  E_{X∼P}[ log D(X) ] + E_{Z}[ log(1 − D(G(Z))) ] ,   (2)

where the maximization is over the family of discriminators (or the discriminator weights, if the family is a neural network of a fixed architecture), the minimization is over the family of generators, X is drawn from the distribution P of the real data, Z is drawn from the distribution of the code vector, typically a low-dimensional Gaussian, and we denote the resulting generator distribution of G(Z) as Q. The role of the discriminator under this GAN scenario is to provide the generator with an approximation (or a surrogate) of a loss, which in the case of the cross-entropy loss turns out to be the Jensen-Shannon divergence, defined as d_JS(P, Q) = (1/2) d_KL(P ‖ (P+Q)/2) + (1/2) d_KL(Q ‖ (P+Q)/2), where d_KL is the Kullback-Leibler divergence. This follows from the fact that, if we search for the maximizing discriminator over the space of all functions, the maximizer turns out to be D(x) = P(x) / (P(x) + Q(x)) [14]. In practice, we search over some parametric family of discriminators, and we can only compute sample averages of the losses. This provides an approximation of the Jensen-Shannon divergence between P and Q. The outer minimization over the generator tries to generate samples that are close to the real data in this (approximate) Jensen-Shannon divergence, which is one measure of how close the true distribution P and the generator distribution Q are.
In this section, we show a fundamental connection between the principle of packing and mode collapse in GANs. We provide a complete understanding of how packing changes the loss as seen by the generator, by focusing on (as we did to derive the Jensen-Shannon divergence above) the optimal discriminator over the family of all measurable functions; the population expectation; and the 0-1 loss function of the form:
$\min_{G} \max_{D} \; \mathbb{E}_{x \sim P}[D(x)] - \mathbb{E}_{x \sim Q}[D(x)]$ subject to $D(x) \in \{0, 1\}$
The first assumption allows us to bypass the specific architecture of the discriminator used, which is common when analyzing neural-network-based discriminators (e.g. [5]). The second assumption can potentially be relaxed, and the standard finite-sample analysis can be applied to provide bounds similar to those in our main results in Theorems 3, 4, and 5. The last assumption makes the loss equal to the total variation distance over the domain. This follows from the fact that (e.g. [13]) $\max_{D:\, D(x) \in \{0,1\}} \; \mathbb{E}_{x \sim P}[D(x)] - \mathbb{E}_{x \sim Q}[D(x)] = d_{\rm TV}(P, Q)$.
This discriminator provides (an approximation of) the total variation distance, and the generator tries to minimize the total variation distance $d_{\rm TV}(P, Q)$. The reason we make this assumption is primarily for clarity and analytical tractability: the total variation distance highlights the effect of packing in a way that is cleaner and easier to understand than if we were to analyze the Jensen-Shannon divergence. We discuss this point in more detail in Section 4.2. In sum, these three assumptions allow us to focus purely on the impact of packing on mode collapse.
We want to understand how this 0-1 loss, as provided by such a discriminator, changes with the degree of packing $m$. As a packed discriminator sees $m$ packed samples, each drawn i.i.d. from one joint class (i.e. either all real or all generated), we can consider these $m$ packed samples as a single sample drawn from a product distribution: $P^m$ for real and $Q^m$ for generated. The resulting loss provided by the packed discriminator is therefore $d_{\rm TV}(P^m, Q^m)$.
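Operationally, a packed discriminator receives $m$ samples as a single input, so each input is one draw from the product distribution. A minimal sketch of the reshaping involved, under the assumption that the $m$ samples are concatenated along the feature dimension (the helper `pack` and the shapes are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def pack(samples, m):
    # Group m i.i.d. samples into one packed discriminator input: a batch of
    # shape (batch * m, d) becomes (batch, m * d), so each row is a single
    # draw from the m-fold product distribution.
    batch_m, d = samples.shape
    assert batch_m % m == 0
    return samples.reshape(batch_m // m, m * d)

real = rng.normal(size=(8, 3))  # 8 samples from P, each of dimension 3
packed = pack(real, m=2)        # 4 draws from the product distribution P^2
print(packed.shape)             # (4, 6)
```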
We first provide a formal mathematical definition of mode collapse in Section 4.1, which leads to a two-dimensional representation of any pair of distributions as a mode collapse region. This region representation provides not only conceptual clarity regarding mode collapse, but also proof techniques that are essential to proving our main results on the fundamental connections between the strength of mode collapse in a pair and the loss seen by a packed discriminator (Section 4.2). The proofs of these results are provided in Section 5. In Section 4.3, we show that the proposed mode collapse region is equivalent to what is known as the hypothesis testing region
for type I and type II errors in binary hypothesis testing. This allows us to use strong mathematical techniques from binary hypothesis testing including the data processing inequality and the reverse data processing inequalities.
4.1 Mathematical definition of mode collapse as a two-dimensional region
Although no formal and agreed-upon definition of mode collapse exists in the GAN literature, mode collapse is declared for a multimodal target distribution $P$ if the generator assigns a significantly smaller probability density in the regions surrounding a particular subset of modes. One major challenge in addressing such mode collapse is that it involves the geometry of $P$: there is no standard partitioning of the domain that respects the mode structure of $P$, and even heuristic partitions are typically computationally intractable in high dimensions. Hence, we drop this geometric constraint and introduce a purely analytical definition.
Definition 1.
A target distribution $P$ and a generator $Q$ exhibit $(\epsilon, \delta)$-mode collapse for some $0 \le \epsilon < \delta \le 1$ if there exists a set $S$ such that $P(S) \ge \delta$ and $Q(S) \le \epsilon$.
This definition provides a formal measure of mode collapse for a target $P$ and a generator $Q$; intuitively, a larger $\delta$ and a smaller $\epsilon$ indicate more severe mode collapse. That is, if a large portion ($\delta$) of the target $P$ on some set $S$ of the domain is missing in the generator $Q$ (which assigns it only mass $\epsilon$), then we declare $(\epsilon, \delta)$-mode collapse.
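On a small finite support, Definition 1 can be verified by brute force over candidate witness sets $S$. The sketch below is illustrative only; the distributions and the helper name are our own choices, not the paper's code:

```python
import itertools

def has_mode_collapse(p, q, eps, delta):
    # Definition 1 on a finite support: does a set S of atoms exist with
    # P(S) >= delta and Q(S) <= eps?
    n = len(p)
    for r in range(1, n + 1):
        for s in itertools.combinations(range(n), r):
            if sum(p[i] for i in s) >= delta and sum(q[i] for i in s) <= eps:
                return True
    return False

p = [0.2] * 5                        # target: uniform over five atoms
q1 = [0.0, 0.25, 0.25, 0.25, 0.25]   # drops the first mode entirely
q2 = [0.1, 0.1, 0.2, 0.3, 0.3]       # balanced deviation, no lost mode

print(has_mode_collapse(p, q1, eps=0.0, delta=0.2))  # True: S = {atom 0}
print(has_mode_collapse(p, q2, eps=0.0, delta=0.2))  # False
```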
A key observation is that two pairs of distributions can have the same total variation distance while exhibiting very different mode collapse patterns. To see this, consider the toy example in Figure 4, with a uniform target distribution $P$. Now consider all generators at a fixed total variation distance $\tau$ from $P$. We compare the intensity of mode collapse for two extreme cases of such generators: $Q_1$ is uniform over a strict subset of the support of $P$, and $Q_2$ is a mixture of two uniform distributions, as shown in Figure 4. They are designed to have the same total variation distance, i.e. $d_{\rm TV}(P, Q_1) = d_{\rm TV}(P, Q_2) = \tau$, but $Q_1$ exhibits an extreme mode collapse, as the whole probability mass of $P$ on the omitted region is lost, whereas $Q_2$ captures a more balanced deviation from $P$.
Definition 1 captures the fact that $Q_1$ has more mode collapse than $Q_2$: the pair $(P, Q_1)$ exhibits $(0, \tau)$-mode collapse (the omitted region has $Q_1$-mass zero and $P$-mass $\tau$), whereas the pair $(P, Q_2)$ exhibits only a much weaker mode collapse for the same value of $\epsilon$. However, the appropriate way to precisely represent mode collapse (as we define it) is to visualize it through a two-dimensional region we call the mode collapse region. For a given pair $(P, Q)$, the corresponding mode collapse region $\mathcal{R}(P, Q)$ is defined as the convex hull of the set of points $(\epsilon, \delta)$ such that $(P, Q)$ exhibits $(\epsilon, \delta)$-mode collapse, as shown in Figure 4.
$\mathcal{R}(P, Q) \;\triangleq\; \mathrm{conv}\big(\{(\epsilon, \delta) : (P, Q) \text{ exhibits } (\epsilon, \delta)\text{-mode collapse}\}\big)$  (3)
where $\mathrm{conv}(\cdot)$ denotes the convex hull. This definition of the region is fundamental in the sense that it is a sufficient statistic that captures the relation between $P$ and $Q$. This assertion is made precise in Section 4.3 by establishing a strong connection between the mode collapse region and the type I and type II errors in binary hypothesis testing. That connection allows us to prove a sharp result on how the loss, as seen by the discriminator, evolves under PacGAN in Section 5. For now, we can use this region representation of a given target-generator pair to detect the strength of mode collapse occurring for a given generator.
Typically, we are interested in the presence of mode collapse with a small $\epsilon$ and a much larger $\delta$; this corresponds to a sharply increasing slope near the origin of the mode collapse region. For example, the middle panel in Figure 4 depicts the mode collapse region (shaded in gray) for a pair of distributions that exhibits significant mode collapse; notice the sharply increasing slope at the origin. The right panel in Figure 4 illustrates the same region for a pair of distributions that does not exhibit strong mode collapse, resulting in a region with a much gentler slope at the origin.
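For discrete pairs, the upper boundary of the region can be traced directly: add atoms in decreasing order of the likelihood ratio $P(x)/Q(x)$ and record the running sums $(Q(S), P(S))$, exactly as one traces a Neyman-Pearson curve. An illustrative sketch (hypothetical helper name and distributions of our own choosing):

```python
import numpy as np

def region_boundary(p, q):
    # Extreme (epsilon, delta) points of the mode collapse region: grow S
    # greedily by atoms of decreasing p/q ratio; record (Q(S), P(S)).
    p, q = np.asarray(p, float), np.asarray(q, float)
    order = np.argsort(-(p / np.maximum(q, 1e-300)))
    eps = np.concatenate([[0.0], np.cumsum(q[order])])
    delta = np.concatenate([[0.0], np.cumsum(p[order])])
    return eps, delta

# A generator that drops one of five equally likely modes: the boundary
# jumps to delta = 0.2 at eps = 0, a sharply increasing slope at the origin.
eps, delta = region_boundary([0.2] * 5, [0.0, 0.25, 0.25, 0.25, 0.25])
print(list(zip(eps.round(2), delta.round(2))))
```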
Similarly, if the generator assigns a large probability mass to a subset, compared to the target distribution, we call it mode augmentation, and give a formal definition below.
Definition 2.
A pair of a target distribution $P$ and a generator $Q$ exhibits $(\epsilon, \delta)$-mode augmentation for some $0 \le \epsilon < \delta \le 1$ if there exists a set $S$ such that $Q(S) \ge \delta$ and $P(S) \le \epsilon$.
Note that we distinguish mode collapse and augmentation strictly here, for analytical purposes. In GAN literature, both collapse and augmentation contribute to the observed “mode collapse” phenomenon, which loosely refers to the lack of diversity in the generated samples.
4.2 Evolution of the region under product distributions
The toy-example generators $Q_1$ and $Q_2$ from Figure 4 could not be distinguished using only their total variation distances from $P$, despite exhibiting very different mode collapse properties. This suggests that the original GAN (with 0-1 loss) may be vulnerable to mode collapse, as it has no way to distinguish distributions in which mode collapse does or does not happen. We prove in Theorem 4 that a discriminator that packs multiple samples together can distinguish mode-collapsing generators. Intuitively, $m$ packed samples are effectively drawn from the product distributions $P^m$ and $Q^m$. We show in this section that there is a fundamental connection between the strength of mode collapse of $(P, Q)$ and the loss $d_{\rm TV}(P^m, Q^m)$ as seen by the packed discriminator.
Intuition via toy examples. Concretely, consider the example from the previous section and recall that $P^m$ and $Q^m$ denote the product distributions resulting from packing together $m$ independent samples from $P$ and $Q$, respectively. Figure 5 illustrates how the mode collapse region evolves with $m$, the degree of packing. This evolution highlights a key insight: the region of the mode-collapsing generator $Q_1$ expands much faster as $m$ increases compared to the region of the non-mode-collapsing generator $Q_2$. This implies that the total variation distance $d_{\rm TV}(P^m, Q_1^m)$ increases more rapidly as we pack more samples, compared to $d_{\rm TV}(P^m, Q_2^m)$. This follows from the fact that the total variation distance between $P^m$ and the generator's product distribution can be determined directly from the upper boundary of the mode collapse region (see Section 4.3.2 for the precise relation). In particular, a larger mode collapse region implies a larger total variation distance between $P^m$ and the generator. The total variation distances $d_{\rm TV}(P^m, Q_1^m)$ and $d_{\rm TV}(P^m, Q_2^m)$, which were explicitly chosen to be equal at $m = 1$ in our example, grow farther apart with increasing $m$, as illustrated in the right figure below. This implies that if we use a packed discriminator, the mode-collapsing generator $Q_1$ will be heavily penalized with a larger loss, compared to the non-mode-collapsing $Q_2$.
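This widening gap can be checked numerically on a discrete analogue of the toy example (the atoms below are our own illustrative choice, not the distributions of Figure 4): two generators at the same total variation distance 0.2 from a uniform target, one dropping a mode and one not, pull apart once samples are packed.

```python
import itertools

def tv_packed(p, q, m):
    # d_TV(P^m, Q^m), by enumerating every m-tuple over the support.
    total = 0.0
    for tup in itertools.product(range(len(p)), repeat=m):
        pp = qq = 1.0
        for i in tup:
            pp *= p[i]
            qq *= q[i]
        total += abs(pp - qq)
    return 0.5 * total

p = [0.2] * 5                        # uniform target
q1 = [0.0, 0.25, 0.25, 0.25, 0.25]   # mode-collapsing generator
q2 = [0.1, 0.1, 0.2, 0.3, 0.3]       # non-collapsing generator

for m in (1, 2, 3):
    print(m, round(tv_packed(p, q1, m), 4), round(tv_packed(p, q2, m), 4))
# At m = 1 both distances are 0.2; at m = 2 they are 0.36 vs 0.28, and the
# gap keeps widening, so a packed discriminator penalizes q1 more heavily.
```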
Evolution of total variation distances. In order to generalize the intuition from the above toy examples, we first analyze how the total variation evolves for the set of all pairs $(P, Q)$ that have the same total variation distance $\tau$ when unpacked (i.e., when $m = 1$). The solutions to the following optimization problems give the desired upper and lower bounds, respectively, on the total variation distance for any distribution pair in this set under a packing degree of $m$:
$\max_{P, Q} \; d_{\rm TV}(P^m, Q^m)$ and $\min_{P, Q} \; d_{\rm TV}(P^m, Q^m)$, subject to $d_{\rm TV}(P, Q) = \tau$  (4)
where the maximization and minimization are over all probability measures $P$ and $Q$. We give the exact solution in Theorem 3, which is illustrated pictorially in Figure 6 (left).
Theorem 3.
Although this is a simple statement that can be proved in several different ways, we introduce in Section 5 a novel geometric proof technique that critically relies on the proposed mode collapse region. This particular technique will allow us to generalize the proof to more complex problems involving mode collapse in Theorem 4, for which other techniques do not generalize. Note that the claim in Theorem 3 has nothing to do with mode collapse. Still, we use the mode collapse region definition purely as a proof technique for this claim.
For any given values of $m$ and $\tau$, the bounds in Theorem 3 are easy to evaluate numerically, as shown in the left panel of Figure 6. Within this achievable range, some pairs have rapidly increasing total variation, occupying the upper part of the region (shown in red in the middle panel of Figure 6), and some pairs have slowly increasing total variation, occupying the lower part (shown in blue in the right panel of Figure 6). In particular, the evolution of the mode collapse region of the pair of $m$-th-power distributions $(P^m, Q^m)$ is fundamentally connected to the strength of mode collapse in the original pair $(P, Q)$. This means that for a mode-collapsed pair, the $m$-th-power distributions will exhibit a different total variation evolution than those of a non-mode-collapsed pair. As such, these two pairs can be distinguished by a packed discriminator. Making such a claim precise for a broad class of mode-collapsing and non-mode-collapsing generators is challenging, as it depends on the target $P$ and the generator $Q$, each of which can be a complex high-dimensional distribution, such as a distribution over natural images. The proposed region interpretation, endowed with the hypothesis testing interpretation and the data processing inequalities that come with it, is critical: it enables the abstraction of technical details and provides a simple and tight proof based on geometric techniques on two-dimensional regions.
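One piece of this picture that is easy to verify independently is the maximal-coupling upper bound $d_{\rm TV}(P^m, Q^m) \le 1 - (1 - \tau)^m$ with $\tau = d_{\rm TV}(P, Q)$; we state it here as a sanity check derived from a standard coupling argument, not as the full statement of Theorem 3. A brute-force check on random discrete pairs:

```python
import itertools
import numpy as np

def tv(p, q):
    # total variation distance between discrete distributions
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def tv_packed(p, q, m):
    # d_TV(P^m, Q^m) by enumerating all m-tuples over the support
    total = 0.0
    for tup in itertools.product(range(len(p)), repeat=m):
        pp = float(np.prod([p[i] for i in tup]))
        qq = float(np.prod([q[i] for i in tup]))
        total += abs(pp - qq)
    return 0.5 * total

rng = np.random.default_rng(1)
for _ in range(20):
    p = rng.dirichlet(np.ones(4))  # random pair on four atoms
    q = rng.dirichlet(np.ones(4))
    tau = tv(p, q)
    for m in (2, 3):
        # each coordinate can be coupled to agree with probability 1 - tau
        assert tv_packed(p, q, m) <= 1.0 - (1.0 - tau) ** m + 1e-9
print("coupling bound holds on all trials")
```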
Evolution of total variation distances with mode collapse. We analyze how the total variation evolves for the set of all pairs $(P, Q)$ that have the same total variation distance when unpacked, with $d_{\rm TV}(P, Q) = \tau$, and that exhibit $(\epsilon, \delta)$-mode collapse. The solution of the following optimization problem gives the desired range of total variation distances:
$\max_{P, Q} \; d_{\rm TV}(P^m, Q^m)$ and $\min_{P, Q} \; d_{\rm TV}(P^m, Q^m)$, subject to $d_{\rm TV}(P, Q) = \tau$ and $(P, Q)$ has $(\epsilon, \delta)$-mode collapse  (8)
where the maximization and minimization are over all probability measures $P$ and $Q$, and the mode collapse constraint is defined in Definition 1. $(\epsilon, \delta)$-mode-collapsing pairs have total variation at least $\delta - \epsilon$ by definition, and when $\tau < \delta - \epsilon$, the feasible set of the above optimization is empty. Otherwise, the next theorem establishes that mode-collapsing pairs occupy the upper part of the total variation region; that is, their total variation increases rapidly as we pack more samples together (Figure 6, middle panel). This follows from the fact that any pair with total variation distance $\tau$ inherently exhibits $(\epsilon', \epsilon' + \tau)$-mode collapse for some $\epsilon' \ge 0$. One implication is that distribution pairs at the top of the total variation evolution region are those with the strongest mode collapse. Another implication is that a pair with strong mode collapse (i.e., with a larger $\delta$ and a smaller $\epsilon$ in the constraint) will be penalized more under packing, and hence a generator minimizing an approximation of $d_{\rm TV}(P^m, Q^m)$ will be unlikely to select a distribution that exhibits such strong mode collapse.
Theorem 4.
For all $0 \le \epsilon < \delta \le 1$ and any positive integer $m$, if $\tau \ge \delta - \epsilon$, then the solution to the maximization in (8) is $1 - (1 - \tau)^m$, and the solution to the minimization in (8) is
(9)  
where $P_1^m$, $Q_1^m$, $P_2^m$, and $Q_2^m$ are the $m$-th-order product distributions of discrete random variables distributed as
(10)  
(11)  
(12)  
(13) 
If $\tau < \delta - \epsilon$, then the optimization in (8) has no solution, as the feasible set is empty.
A proof of this theorem is provided in Section 5.2; it critically relies on the proposed mode collapse region representation of the pair $(P, Q)$ and on the celebrated result by Blackwell from [4]. The solutions in Theorem 4 can be numerically evaluated for any given choice of the parameters $(\epsilon, \delta, \tau, m)$, as we show in Figure 7.
Analogous results to the above theorem can be shown for pairs that exhibit mode augmentation (as opposed to mode collapse). These results are omitted for brevity, but their statements and analyses are straightforward extensions of the proofs for mode collapse. This holds because the total variation distance is a metric, and is therefore symmetric.
Evolution of total variation distances without mode collapse. We next analyze how the total variation evolves for the set of all pairs $(P, Q)$ that have the same total variation distance when unpacked, with $d_{\rm TV}(P, Q) = \tau$, and that do not have $(\epsilon, \delta)$-mode collapse. Because of the symmetry of the total variation distance, mode augmentation in Definition 2 is equally damaging as mode collapse when it comes to how fast total variation distances evolve. Hence, we characterize this evolution for the family of pairs of distributions that have neither mode collapse nor mode augmentation. The solution of the following optimization problem gives the desired range of total variation distances:
$\max_{P, Q} \; d_{\rm TV}(P^m, Q^m)$ and $\min_{P, Q} \; d_{\rm TV}(P^m, Q^m)$, subject to $d_{\rm TV}(P, Q) = \tau$, $(P, Q)$ does not have $(\epsilon, \delta)$-mode collapse, and $(P, Q)$ does not have $(\epsilon, \delta)$-mode augmentation  (14)
where the maximization and minimization are over all probability measures $P$ and $Q$, and the mode collapse and augmentation constraints are defined in Definitions 1 and 2, respectively.
For sufficiently large $\tau$, it is not possible to satisfy the no-mode-collapse and no-mode-augmentation constraints while keeping $d_{\rm TV}(P, Q) = \tau$ (see Section 5.3 for a proof), so the feasible set is empty in that regime. On the other hand, when $\tau < \delta - \epsilon$, no pair with total variation distance $\tau$ can have $(\epsilon, \delta)$-mode collapse; in this case, the optimization reduces to the simpler one in (4) with no mode collapse constraints. A nontrivial solution exists in the middle regime. The lower bound for this regime, given in equation (18), is the same as the lower bound in (5), except that it optimizes over a different range of values. For a wide range of the parameters $\epsilon$, $\delta$, and $\tau$, those lower bounds coincide, and even when they differ, they differ only slightly. This implies that pairs with weak mode collapse occupy the bottom part of the evolution of the total variation distances (see Figure 6, right panel), and are penalized less under packing. Hence a generator minimizing the (approximate) $d_{\rm TV}(P^m, Q^m)$ is likely to generate distributions with weak mode collapse.
Theorem 5.
For all $0 \le \epsilon < \delta \le 1$ and any positive integer $m$, if $\tau < \delta - \epsilon$, then the maximum and the minimum of (14) are the same as those of the optimization (4), provided in Theorem 3.
If and then the solution to the maximization in (14) is
(15) 
where the maximizing pair consists of the $m$-th-order product distributions of discrete random variables distributed as
(16)  
(17) 
The solution to the minimization in (14) is
(18) 
where the relevant distributions are defined as in Theorem 3.
If and then the solution to the maximization in (14) is
(19) 
where the maximizing pair consists of the $m$-th-order product distributions of discrete random variables distributed as
(20)  
(21) 
The solution to the minimization in (14) is
(22) 
where the relevant distributions are defined as in Theorem 3.
If $\tau$ is larger still, then the optimization in (14) has no solution, as the feasible set is empty.
A proof of this theorem is provided in Section 5.3; it also critically relies on the proposed mode collapse region representation of the pair $(P, Q)$ and on the celebrated result by Blackwell from [4]. The solutions in Theorem 5 can be numerically evaluated for any given choice of the parameters, as we show in Figure 8.
The benefit of the packing degree $m$. We give a practitioner the choice of the degree of packing, namely how many samples to pack together. There is a natural tradeoff between computational complexity (which increases gracefully with $m$) and the additional distinguishability, which we illustrate via an example. Consider the goal of differentiating two families of target-generator pairs, one with mode collapse and one without:
(23) 
As both families have the same total variation distance at $m = 1$, they cannot be distinguished by an unpacked discriminator. However, a packed discriminator that uses $m$ samples jointly can differentiate the two classes, and can even separate them entirely for certain choices of parameters, as illustrated in Figure 9. In red, we show the achievable $d_{\rm TV}(P^m, Q^m)$ for the mode-collapsing family (the bounds in Theorem 4). In blue, we show the corresponding region for the family without mode collapse (the bounds in Theorem 5). Although the two families are strictly separated (one exhibits mode collapse and the other does not), a non-packed discriminator cannot differentiate them, as the total variation is the same for both. However, as we pack more samples, the packed discriminator becomes more powerful in differentiating the two hypothesized families. For a sufficiently large $m$, the total variation distance completely separates the two families.
In general, the overlap between those regions depends on the specific choice of parameters, but the overall trend is universal: packing separates generators with mode collapse from those without. Further, as the degree of packing increases, a packed discriminator increasingly penalizes generators with mode collapse and rewards generators that exhibit less mode collapse. Even if we consider complementary families, with and without mode collapse, whose union covers the whole space of pairs $(P, Q)$ with the same total variation distance, the least penalized pairs will be those with the least mode collapse, which fall within the blue region of the bottom-right panel in Figure 8. This is consistent with the empirical observations in Tables 1 and 3, where increasing the degree of packing captures more modes.
Jensen-Shannon divergence. In this theoretical analysis, we have focused on the 0-1 loss, as our current analysis technique gives exact solutions to the optimization problems (4), (8), and (14) when the metric is the total variation distance. This follows from the fact that we can provide tight inner and outer regions for the family of mode collapse regions that share the same total variation distance, as shown in Section 5.
In practice, the 0-1 loss is never used, as it is not differentiable. The most popular choice of loss function is the cross-entropy loss, which corresponds to the Jensen-Shannon (JS) divergence, as shown at the beginning of Section 4. However, the same proof techniques used to show Theorems 4 and 5 give loose bounds on the JS divergence. In particular, this gap prevents us from sharply characterizing the full effect of the packing degree on the JS divergence of a pair of distributions. Nonetheless, we find that, empirically, packing reduces mode collapse even under a cross-entropy loss. Hence, we leave it as a future research direction to find solutions to the optimization problems (4), (8), and (14) when the metric is the (more common) Jensen-Shannon divergence.
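Although our bounds for the JS divergence are loose, the packed JS divergence can still be evaluated exactly on small discrete examples. The sketch below (illustrative distributions and helper names of our own choosing) shows that it, too, grows with the packing degree:

```python
import itertools
import numpy as np

def product_dist(p, m):
    # Flatten the m-fold product distribution of a discrete p into a vector.
    return np.array([float(np.prod([p[i] for i in tup]))
                     for tup in itertools.product(range(len(p)), repeat=m)])

def js(p, q):
    # Jensen-Shannon divergence; zero-probability terms contribute zero.
    mid = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

p = np.array([0.2] * 5)
q1 = np.array([0.0, 0.25, 0.25, 0.25, 0.25])  # mode-collapsing generator
q2 = np.array([0.1, 0.1, 0.2, 0.3, 0.3])      # balanced generator

for m in (1, 2, 3):
    print(m, js(product_dist(p, m), product_dist(q1, m)),
             js(product_dist(p, m), product_dist(q2, m)))
```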
4.3 Operational interpretation of mode collapse via hypothesis testing region
So far, all the definitions and theoretical results have been explained without explicitly using the mode collapse region. The main contribution of introducing the region definition is that it provides a new proof technique based on the geometric properties of these twodimensional regions. Concretely, we show that the proposed mode collapse region is equivalent to a similar notion in binary hypothesis testing. This allows us to bring powerful mathematical tools from this mature area in statistics and information theory—in particular, the data processing inequalities originating from the seminal work of Blackwell [4]. We make this connection precise, which gives insights on how to interpret the mode collapse region, and list the properties and techniques which dramatically simplify the proof, while providing the tight results in Section 5.
4.3.1 Equivalence between the mode collapse region and the hypothesis testing region
There is a simple one-to-one correspondence between the mode collapse region as we define it in Section 4.1 (e.g. Figure 4) and the hypothesis testing region studied in binary hypothesis testing. In the classical testing context, there are two hypotheses, $H_0$ or $H_1$, and we make observations via some stochastic experiment in which our observations depend on the hypothesis. Let $X$ denote this observation. One way to visualize such an experiment is via a two-dimensional region defined by the corresponding type I and type II errors. This was, for example, used to prove strong composition theorems in the context of differential privacy in [20], and subsequently to identify the optimal differentially private mechanisms under local privacy [18] and multiparty communications [19]. We refer to [20] for the precise definition of the hypothesis testing region and its properties.
We can map this binary hypothesis testing setup directly to the GAN context. Suppose the null hypothesis $H_0$ denotes observations being drawn from the true distribution $P$, and the alternative hypothesis $H_1$ denotes observations being drawn from the generated distribution $Q$. Given a sample $X$ from this experiment, suppose we decide whether the sample came from $P$ or $Q$ based on a rejection region $S$, such that we reject the null hypothesis if $X \in S$. A type I error occurs when the null hypothesis is true but rejected, which happens with probability $\alpha = P(S)$, and a type II error occurs when the null hypothesis is false but accepted, which happens with probability $\beta = 1 - Q(S)$. Sweeping through the achievable pairs $(\alpha, \beta)$ for all possible rejection sets defines a two-dimensional convex region called the hypothesis testing region. Examples of hypothesis testing regions for the two toy pairs $(P, Q_1)$ and $(P, Q_2)$ from Figure 4 are shown in Figure 10. In defining the region, we allow stochastic decisions: if two points of type I and type II errors are achievable, then any convex combination of those points is also achievable, by randomly choosing between the two rejection sets. Hence, the resulting hypothesis testing region is always a convex set by definition. We also show only the region below the 45-degree line connecting $(0, 1)$ and $(1, 0)$, as the other part is symmetric and redundant. For a given pair $(P, Q)$, there is a very simple relation between its mode collapse region and its hypothesis testing region.
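For discrete pairs, the hypothesis testing region can be traced by enumerating deterministic rejection sets, and the classical identity $\min_S (\alpha + \beta) = 1 - d_{\rm TV}(P, Q)$ then falls out. An illustrative sketch (hypothetical names; the distributions are a discrete analogue of the toy example, chosen by us):

```python
import itertools

def error_pairs(p, q):
    # H0: sample ~ P; H1: sample ~ Q. Rejecting H0 on the set S gives
    # type I error alpha = P(S) and type II error beta = 1 - Q(S).
    n = len(p)
    pairs = []
    for r in range(n + 1):
        for s in itertools.combinations(range(n), r):
            alpha = sum(p[i] for i in s)
            beta = 1.0 - sum(q[i] for i in s)
            pairs.append((alpha, beta))
    return pairs

p = [0.2] * 5
q = [0.0, 0.25, 0.25, 0.25, 0.25]
pairs = error_pairs(p, q)
best = min(a + b for a, b in pairs)
print(round(best, 6))  # 0.8, i.e. 1 - d_TV(P, Q)
```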
Remark 6 (Equivalence).
For a pair of a target $P$ and a generator $Q$, the hypothesis testing region is the mirror image of the mode collapse region with respect to the horizontal axis at $0.5$.
For example, the hypothesis testing regions of the toy examples from Figure 4 are shown in Figure 10. This simple relation allows us to tap into the rich set of analysis tools known for hypothesis testing regions. We list properties of mode collapse regions derived from this relation in the next section. The proofs of all the remarks follow from the equivalence to binary hypothesis testing and the corresponding existing results from [4] and [20].
4.3.2 Properties of the mode collapse region
Given the equivalence between the mode collapse region and the binary hypothesis testing region, several important properties follow as corollaries. First, the mode collapse region is a convex set, by definition. Second, the hypothesis testing region is a sufficient statistic for the purpose of binary hypothesis testing on a pair of distributions $(P, Q)$. This implies, among other things, that all divergences between $P$ and $Q$ can be computed from the region. In particular, for the purpose of GANs with 0-1 loss, we can define the total variation distance as a geometric property of the region, which is crucial to proving our main results.
Remark 7 (Total variation distance).
The total variation distance between $P$ and $Q$ is the intersection of the vertical axis and the tangent line of slope one to the upper boundary of the mode collapse region $\mathcal{R}(P, Q)$, as shown in Figure 11.
This follows from the equivalence between the mode collapse region (Remark 6) and the hypothesis testing region. This geometric definition of the total variation distance allows us, in our proofs, to enumerate over all pairs $(P, Q)$ with the same total variation distance $\tau$, by enumerating over all regions that touch the line of unit slope and intercept $\tau$ (see Figure 12).
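Remark 7 has a simple set-theoretic counterpart on finite supports: the intercept of the slope-one tangent equals $\max_S (P(S) - Q(S))$, which is exactly the total variation distance. A brute-force check (an illustrative sketch with hypothetical names):

```python
import itertools

def tv(p, q):
    # total variation distance between discrete distributions
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def max_gap(p, q):
    # max over sets S of P(S) - Q(S): the vertical-axis intercept of the
    # slope-one tangent to the upper boundary of the mode collapse region
    n = len(p)
    best = 0.0
    for r in range(n + 1):
        for s in itertools.combinations(range(n), r):
            best = max(best, sum(p[i] for i in s) - sum(q[i] for i in s))
    return best

p = [0.2] * 5
q = [0.1, 0.1, 0.2, 0.3, 0.3]
print(round(max_gap(p, q), 6), round(tv(p, q), 6))  # both 0.2
```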
The major strength of the region perspective, as originally studied by Blackwell [4], is in providing a comparison of stochastic experiments. In our GAN context, consider comparing two pairs of target distributions and generators, $(P_1, Q_1)$ and $(P_2, Q_2)$, as follows. First, a hypothesis $h$ is drawn, choosing whether to produce samples from the true distribution, in which case we say $h = 0$, or to produce samples from the generator, in which case we say $h = 1$. Conditioned on this hypothesis $h$, we use $X$ to denote a random variable drawn from the first pair, such that $X \mid h = 0 \sim P_1$ and $X \mid h = 1 \sim Q_1$. Similarly, we use $Y$ to denote a random sample from the second pair, where $Y \mid h = 0 \sim P_2$ and $Y \mid h = 1 \sim Q_2$. Note that the conditional distributions are well-defined for both $X$ and $Y$, but there is no coupling defined between them. We can, without loss of generality, assume $h$ to be drawn uniformly at random.
Definition 8.
The data processing inequality in the following remark shows that if we further process the output samples from the pair $(P_1, Q_1)$, then the processed samples can only exhibit less mode collapse. Processing the output of a stochastic experiment has the effect of smoothing out the distributions, and mode collapse, which corresponds to a peak in the pair of distributions, is smoothed out by processing down the Markov chain.
Remark 9 (Data processing inequality).
The following data processing inequality holds for the mode collapse region. For two coupled target-generator pairs $(P_1, Q_1)$ and $(P_2, Q_2)$, if $(P_1, Q_1)$ dominates $(P_2, Q_2)$, then $\mathcal{R}(P_2, Q_2) \subseteq \mathcal{R}(P_1, Q_1)$.
This is expected, and follows directly from the equivalence of the mode collapse region (Remark 6) and the hypothesis testing region. What is perhaps surprising is that the reverse is also true.
Remark 10 (Reverse data processing inequality).
The following reverse data processing inequality holds for the mode collapse region. For two pairs of marginal distributions $(P_1, Q_1)$ and $(P_2, Q_2)$, if $\mathcal{R}(P_2, Q_2) \subseteq \mathcal{R}(P_1, Q_1)$, then there exists a coupling of the random samples from $(P_1, Q_1)$ and $(P_2, Q_2)$ such that $(P_1, Q_1)$ dominates $(P_2, Q_2)$, i.e. they form a Markov chain $h$-$X$-$Y$.
This follows from the equivalence between the mode collapse region and the hypothesis testing region (Remark 6) and Blackwell's celebrated result on comparisons of stochastic experiments [4]. This region interpretation, and the accompanying (reverse) data processing inequalities, abstract away all the details about $P$ and $Q$.