An empirical study on evaluation metrics of generative adversarial networks

06/19/2018 ∙ by Qiantong Xu, et al. ∙ Cornell University ∙ Berkeley College

Evaluating generative adversarial networks (GANs) is inherently challenging. In this paper, we revisit several representative sample-based evaluation metrics for GANs, and address the problem of how to evaluate the evaluation metrics. We start with a few necessary conditions for metrics to produce meaningful scores, such as distinguishing real from generated samples, identifying mode dropping and mode collapsing, and detecting overfitting. With a series of carefully designed experiments, we comprehensively investigate existing sample-based metrics and identify their strengths and limitations in practical settings. Based on these results, we observe that kernel Maximum Mean Discrepancy (MMD) and the 1-Nearest-Neighbor (1-NN) two-sample test seem to satisfy most of the desirable properties, provided that the distances between samples are computed in a suitable feature space. Our experiments also unveil interesting properties about the behavior of several popular GAN models, such as whether they are memorizing training samples, and how far they are from learning the target distribution.




1 Introduction

Generative adversarial networks (GANs) (Goodfellow et al., 2014) have been studied extensively in recent years. Besides producing surprisingly plausible images (Radford et al., 2015; Larsen et al., 2015; Karras et al., 2017; Arjovsky et al., 2017; Gulrajani et al., 2017), they have also been innovatively applied in, for example, semi-supervised learning (Odena, 2016; Makhzani et al., 2015), image-to-image translation (Isola et al., 2016; Zhu et al., 2017), and simulated image refinement (Shrivastava et al., 2016). However, despite the availability of a plethora of GAN models (Arjovsky et al., 2017; Qi, 2017; Zhao et al., 2016), their evaluation is still predominantly qualitative, very often resorting to manual inspection of the visual fidelity of generated images. Such evaluation is time-consuming, subjective, and possibly misleading. Given the inherent limitations of qualitative evaluations, proper quantitative metrics are crucial for the development of GANs to guide the design of better models.

Possibly the most popular metric is the Inception Score (Salimans et al., 2016), which measures the quality and diversity of the generated images using an external model, the Google Inception network (Szegedy et al., 2014), trained on the large scale ImageNet dataset (Deng et al., 2009). Some other metrics are less widely used but still very valuable. Wu et al. (2016) proposed a sampling method to estimate the log-likelihood of generative models, by assuming a Gaussian observation model with a fixed variance. Bounliphone et al. (2015) propose to use maximum mean discrepancies (MMDs) for model selection in generative models. Lopez-Paz & Oquab (2016) apply the classifier two-sample test, a well-studied tool in statistics, to assess the difference between the generated and target distribution.

Although these evaluation metrics are shown to be effective on various tasks, it is unclear in which scenarios their scores are meaningful, and in which scenarios they are prone to misinterpretation. Given that evaluating GANs is already challenging, it can only be more difficult to evaluate the evaluation metrics themselves. Most existing works attempt to justify their proposed metrics by showing a strong correlation with human evaluation (Salimans et al., 2016; Lopez-Paz & Oquab, 2016). However, human evaluation tends to be biased towards the visual quality of generated samples and to neglect the overall distributional characteristics, which are important for unsupervised learning.

In this paper we comprehensively examine the existing literature on sample-based quantitative evaluation of GANs. We address the challenge of evaluating the metrics themselves by carefully designing a series of experiments through which we hope to answer the following questions: 1) What are reasonable characterizations of the behavior of existing sample-based metrics for GANs? 2) What are the strengths and limitations of these metrics, and which metrics should be preferred accordingly? Our empirical observations suggest that MMD and the 1-NN two-sample test are best suited as evaluation metrics, on the basis of satisfying useful properties such as discriminating real versus fake images, sensitivity to mode dropping and collapse, and computational efficiency.

Ultimately, we hope that this paper will establish good principles on choosing, interpreting, and designing evaluation metrics for GANs in practical settings. We will also release the source code for all experiments and metrics examined, providing the community with off-the-shelf tools to debug and improve their GAN algorithms.

2 Background

We briefly review the original GAN framework proposed by Goodfellow et al. (2014). The description of the GAN variants used in our experiments is deferred to Appendix A.

2.1 Generative adversarial networks

Let $\mathcal{X}$ be the space of natural images. Given i.i.d. samples $S_r = \{x_1^r, \dots, x_n^r\}$ drawn from a real distribution $P_r$ over $\mathcal{X}$, we would like to learn a parameterized distribution $P_g$ that approximates the distribution $P_r$.

A generative adversarial network has two components, the discriminator $D: \mathcal{X} \to [0, 1]$ and the generator $G: \mathcal{Z} \to \mathcal{X}$, where $\mathcal{Z}$ is some latent space. Given a distribution $P_z$ over $\mathcal{Z}$ (usually an isotropic Gaussian), the distribution $P_g$ is defined as $G(P_z)$. Optimization is performed with respect to a joint loss for $D$ and $G$:

$$\min_G \max_D \; \ell = \mathbb{E}_{x \sim P_r} \log[D(x)] + \mathbb{E}_{z \sim P_z} \log[1 - D(G(z))].$$

Intuitively, the discriminator $D$ outputs a probability for every $x \in \mathcal{X}$ that corresponds to its likelihood of being drawn from $P_r$, and the loss function encourages the generator $G$ to produce samples that maximize this probability. Practically, the loss is approximated with finite samples from $P_r$ and $P_g$, and optimized with alternating steps for $D$ and $G$ using gradient descent.

To evaluate the generator, we would like to design a metric $\rho$ that measures the “dissimilarity” between $P_g$ and $P_r$. (Footnote 1: Note that $\rho$ need not satisfy symmetry or the triangle inequality, so it is not, mathematically speaking, a distance metric between $P_r$ and $P_g$. We still call it a metric throughout this paper for simplicity.) In theory, with both distributions known, common choices of $\rho$ include the Kullback-Leibler divergence (KLD), Jensen-Shannon divergence (JSD) and total variation. However, in practical scenarios, $P_r$ is unknown and only the finite samples in $S_r$ are observed. Furthermore, it is almost always intractable to compute the exact density of $P_g$, but much easier to draw a sample set $S_g$ from it (especially so for GANs). Given these limitations, we focus on empirical measures of “dissimilarity” between samples from two distributions.

Figure 1: Typical sample based GAN evaluation methods.

2.2 Sample based metrics

We mainly focus on sample-based evaluation metrics that follow a common setup illustrated in Figure 1. The metric calculator is the key element, for which we briefly introduce six representative methods: Inception Score (Salimans et al., 2016), Mode Score (Che et al., 2016), Kernel MMD (Gretton et al., 2007), Wasserstein distance, Fréchet Inception Distance (FID) (Heusel et al., 2017), and the 1-nearest neighbor (1-NN)-based two-sample test (Lopez-Paz & Oquab, 2016). All of them are model agnostic and require only finite samples from the generator.

The Inception Score is arguably the most widely adopted metric in the literature. It uses an image classification model $\mathcal{M}$, the Inception network (Szegedy et al., 2016), pre-trained on the ImageNet (Deng et al., 2009) dataset, to compute

$$\mathrm{IS}(P_g) = \exp\left(\mathbb{E}_{x \sim P_g}\left[\mathrm{KL}\left(p_{\mathcal{M}}(y|x) \,\|\, p_{\mathcal{M}}(y)\right)\right]\right),$$

where $p_{\mathcal{M}}(y|x)$ denotes the label distribution of $x$ as predicted by $\mathcal{M}$, and $p_{\mathcal{M}}(y) = \int_x p_{\mathcal{M}}(y|x)\, dP_g$, i.e. the marginal of $p_{\mathcal{M}}(y|x)$ over the probability measure $P_g$. The expectation and the integral in $p_{\mathcal{M}}(y)$ can be approximated with i.i.d. samples from $P_g$. A higher IS has $p_{\mathcal{M}}(y|x)$ close to a point mass, which happens when the Inception network is very confident that the image belongs to a particular ImageNet category, and has $p_{\mathcal{M}}(y)$ close to uniform, i.e. all categories are equally represented. This suggests that the generative model has both high quality and diversity. Salimans et al. (2016) show that the Inception Score has a reasonable correlation with human judgment of image quality. We would like to highlight two specific properties: 1) the distributions on both sides of the KL are dependent on $\mathcal{M}$, and 2) the distribution of the real data $P_r$, or even samples thereof, is not used anywhere.
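As a rough sketch of how the Inception Score can be estimated from a finite sample, the snippet below assumes that the classifier's softmax outputs for the generated images are already available as plain Python lists. The function names `kl_divergence` and `inception_score` are illustrative and are not taken from the authors' code.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def inception_score(probs):
    """Estimate the Inception Score from per-sample label distributions.

    probs: list of softmax vectors, one per generated sample (assumed input).
    """
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal label distribution, approximated by averaging over the samples.
    marginal = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    # Average KL divergence between each conditional and the marginal.
    mean_kl = sum(kl_divergence(p, marginal) for p in probs) / n
    return math.exp(mean_kl)


# Confident and diverse predictions give a high score; uninformative ones give 1.
confident_diverse = [[1.0, 0.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
print(inception_score(confident_diverse))
print(inception_score(uninformative))
```

Note that only generated samples enter the computation, which mirrors the second property highlighted above.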

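The Mode Score (Footnote 2: We use a modified version here, as the original one reduces to the Inception Score.) is an improved version of the Inception Score. Formally, it is given by

$$\mathrm{MS}(P_g) = \exp\left(\mathbb{E}_{x \sim P_g}\left[\mathrm{KL}\left(p_{\mathcal{M}}(y|x) \,\|\, p_{\mathcal{M}}(y)\right)\right] - \mathrm{KL}\left(p_{\mathcal{M}}(y) \,\|\, p_{\mathcal{M}}(y^*)\right)\right),$$

where $p_{\mathcal{M}}(y^*) = \int_x p_{\mathcal{M}}(y|x)\, dP_r$ is the marginal label distribution for the samples from the real data distribution. Unlike the Inception Score, it is able to measure the dissimilarity between the real distribution $P_r$ and the generated distribution $P_g$ through the term $\mathrm{KL}(p_{\mathcal{M}}(y) \,\|\, p_{\mathcal{M}}(y^*))$.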
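The following sketch illustrates the Mode Score under the same assumption as before, namely that softmax outputs for both generated and real samples are given as lists. The helper names are hypothetical, and the toy inputs are chosen only to show how the extra KL term penalizes a missing class.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def marginal(probs):
    """Average the per-sample label distributions into one marginal distribution."""
    n = len(probs)
    return [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]


def mode_score(gen_probs, real_probs):
    """Mode Score estimate from softmax outputs on generated and real samples."""
    p_y = marginal(gen_probs)        # marginal over generated samples
    p_y_star = marginal(real_probs)  # marginal over real samples
    mean_kl = sum(kl_divergence(p, p_y) for p in gen_probs) / len(gen_probs)
    return math.exp(mean_kl - kl_divergence(p_y, p_y_star))


real = [[1.0, 0.0], [0.0, 1.0]]
matching = [[1.0, 0.0], [0.0, 1.0]]   # covers both classes
collapsed = [[1.0, 0.0], [1.0, 0.0]]  # misses the second class
print(mode_score(matching, real))
print(mode_score(collapsed, real))
```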

Kernel MMD (Maximum Mean Discrepancy), defined as

$$\mathrm{MMD}^2(P_r, P_g) = \mathbb{E}_{x_r, x_r' \sim P_r,\; x_g, x_g' \sim P_g}\left[k(x_r, x_r') - 2k(x_r, x_g) + k(x_g, x_g')\right],$$

measures the dissimilarity between $P_r$ and $P_g$ for some fixed kernel function $k$. Given two sets of samples from $P_r$ and $P_g$, the empirical MMD between the two distributions can be computed with a finite sample approximation of the expectation. A lower MMD means that $P_g$ is closer to $P_r$. The Parzen window estimate (Gretton et al., 2007) can be viewed as a specialization of Kernel MMD.
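A minimal sketch of the finite sample approximation is given below. It uses the simple biased estimator (averaging the kernel over all pairs, including identical ones) and a Gaussian kernel with a bandwidth `sigma` that is an assumed free parameter; both choices are illustrative rather than a statement of the exact estimator used in the experiments.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Isotropic Gaussian kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2(xs, ys, sigma=1.0):
    """Biased empirical estimate of squared MMD between two sample sets."""
    def mean_kernel(a, b):
        return sum(gaussian_kernel(u, v, sigma) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) - 2.0 * mean_kernel(xs, ys) + mean_kernel(ys, ys)


reference = [[0.0], [1.0]]
near = [[0.1], [1.1]]
far = [[5.0], [6.0]]
print(mmd2(reference, reference))  # identical sets give zero
print(mmd2(reference, near))
print(mmd2(reference, far))
```

In practice the vectors would be features of images rather than raw one-dimensional points.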

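The Wasserstein distance between $P_r$ and $P_g$ is defined as

$$\mathrm{WD}(P_r, P_g) = \inf_{\gamma \in \Gamma(P_r, P_g)} \mathbb{E}_{(x^r, x^g) \sim \gamma}\left[d(x^r, x^g)\right],$$

where $\Gamma(P_r, P_g)$ denotes the set of all joint distributions (i.e. probabilistic couplings) whose marginals are respectively $P_r$ and $P_g$, and $d(x^r, x^g)$ denotes the base distance between the two samples. For discrete distributions with densities $p_r$ and $p_g$, the Wasserstein distance is often referred to as the Earth Mover’s Distance (EMD), and corresponds to the solution to the optimal transport problem

$$\mathrm{WD}(p_r, p_g) = \min_{w \in \mathbb{R}_{\ge 0}^{n \times m}} \sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij}\, d(x_i^r, x_j^g) \quad \text{s.t.} \quad \sum_{j=1}^{m} w_{ij} = p_r(x_i^r) \;\; \forall i, \quad \sum_{i=1}^{n} w_{ij} = p_g(x_j^g) \;\; \forall j.$$

This is the finite sample approximation of $\mathrm{WD}(P_r, P_g)$ used in practice. Similar to MMD, the Wasserstein distance is lower when two distributions are more similar.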
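To make the optimal transport problem concrete, the sketch below handles the special case of two equally sized sample sets with uniform weights, where an optimal transport plan can be taken to be a one-to-one matching. It searches all matchings by brute force, which is only feasible for a handful of points and is meant purely as an illustration; the function names are hypothetical, and practical implementations use dedicated assignment or linear programming solvers.

```python
import itertools
import math


def euclidean(x, y):
    """Euclidean base distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def emd_equal_size(xs, ys):
    """Exact EMD between two equally sized, uniformly weighted sample sets.

    Brute force over all matchings: factorial cost, for tiny inputs only.
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("this sketch assumes both sets have the same size")
    cost = [[euclidean(x, y) for y in ys] for x in xs]
    best_total = min(
        sum(cost[i][perm[i]] for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    return best_total / n


points = [[0.0, 0.0], [1.0, 0.0]]
print(emd_equal_size(points, [[1.0, 0.0], [0.0, 0.0]]))  # same set, reordered
print(emd_equal_size(points, [[0.0, 1.0], [1.0, 1.0]]))  # shifted by one unit
```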

The Fréchet Inception Distance (FID) was recently introduced by Heusel et al. (2017) to evaluate GANs. For a suitable feature function $\phi$ (by default, the Inception network’s convolutional feature), FID models $\phi(P_r)$ and $\phi(P_g)$ as Gaussian random variables with empirical means $\mu_r, \mu_g$ and empirical covariances $C_r, C_g$, and computes

$$\mathrm{FID}(P_r, P_g) = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\left(C_r + C_g - 2\,(C_r C_g)^{1/2}\right),$$

which is the Fréchet distance (or equivalently, the Wasserstein-2 distance) between the two Gaussian distributions (Heusel et al., 2017).
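The FID formula can be sketched with NumPy as follows, assuming the features have already been extracted into arrays of shape (number of samples, feature dimension). The trace of the matrix square root is computed from the eigenvalues of the product of the two covariance matrices, which is one of several possible numerical routes; the function name `fid` and the synthetic data are illustrative assumptions.

```python
import numpy as np


def fid(feats_r, feats_g):
    """Frechet distance between Gaussians fitted to two feature arrays."""
    mu_r, mu_g = feats_r.mean(axis=0), feats_g.mean(axis=0)
    c_r = np.atleast_2d(np.cov(feats_r, rowvar=False))
    c_g = np.atleast_2d(np.cov(feats_g, rowvar=False))
    # Tr((C_r C_g)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of C_r C_g, which are real and non-negative in exact arithmetic.
    eigvals = np.linalg.eigvals(c_r @ c_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = float(np.sum((mu_r - mu_g) ** 2))
    return mean_term + float(np.trace(c_r) + np.trace(c_g) - 2.0 * tr_sqrt)


rng = np.random.default_rng(0)
features = rng.normal(size=(500, 3))
print(fid(features, features))        # close to zero for identical sets
print(fid(features, features + 2.0))  # a pure mean shift of 2 in each of 3 dims
```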

The 1-Nearest Neighbor classifier is used in two-sample tests to assess whether two distributions are identical. Given two sets of samples $S_r \sim P_r^n$ and $S_g \sim P_g^m$, with $|S_r| = |S_g|$, one can compute the leave-one-out (LOO) accuracy of a 1-NN classifier trained on $S_r$ and $S_g$ with positive labels for $S_r$ and negative labels for $S_g$. Different from the most common use of accuracy, here the 1-NN classifier should yield a LOO accuracy of about 50% when $|S_r| = |S_g|$ is large. This is achieved when the two distributions match. The LOO accuracy can be lower than 50%, which happens when the GAN overfits $P_g$ to $S_r$. In the (hypothetical) extreme case, if the GAN were to memorize every sample in $S_r$ and re-generate it exactly, i.e. $S_g = S_r$, the accuracy would be 0%, as every sample from $S_r$ would have its nearest neighbor from $S_g$ with zero distance. The 1-NN classifier belongs to the two-sample test family, for which any binary classifier can be adopted in principle. We will only consider the 1-NN classifier because it requires no special training and little hyperparameter tuning.

Lopez-Paz & Oquab (2016) considered the 1-NN accuracy primarily as a statistic for two-sample testing. In fact, it is more informative to analyze it for the two classes separately. For example, a typical outcome of GANs is that for both real and generated images, the majority of their nearest neighbors are generated images due to mode collapse. In this case, the LOO 1-NN accuracy of the real images would be relatively low (desired): the mode(s) of the real distribution are usually well captured by the generative model, so a majority of real samples from $S_r$ are surrounded by generated samples from $S_g$, leading to low LOO accuracy; whereas the LOO accuracy of the generated images is high (not desired): generated samples tend to collapse to a few mode centers, thus they are surrounded by samples from the same class, leading to high LOO accuracy. For the rest of the paper, we distinguish these two cases as 1-NN accuracy (real) and 1-NN accuracy (fake).
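The leave-one-out procedure and its split into real and fake accuracy can be sketched as below. This is a plain quadratic-time illustration using squared Euclidean distance on small lists of vectors; the function name and the toy data are assumptions made for the example.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def loo_1nn_accuracy(real, fake):
    """Leave-one-out 1-NN accuracy: returns (overall, real, fake)."""
    samples = [(x, 1) for x in real] + [(x, 0) for x in fake]
    correct = {1: 0, 0: 0}
    for i, (x, label) in enumerate(samples):
        # Nearest neighbor among all other samples (the sample itself is left out).
        j = min(
            (k for k in range(len(samples)) if k != i),
            key=lambda k: squared_distance(x, samples[k][0]),
        )
        if samples[j][1] == label:
            correct[label] += 1
    overall = (correct[1] + correct[0]) / len(samples)
    return overall, correct[1] / len(real), correct[0] / len(fake)


# Well-separated sets are trivially distinguishable: accuracy 1.0 everywhere.
print(loo_1nn_accuracy([[0.0], [0.1], [0.2]], [[10.0], [10.1], [10.2]]))
# Exact memorization of the real set drives the accuracy to 0.0.
print(loo_1nn_accuracy([[0.0], [1.0], [2.0]], [[0.0], [1.0], [2.0]]))
```

Reporting the three numbers separately makes it possible to see the asymmetric pattern described above, where the real accuracy is low while the fake accuracy is high.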

2.3 Other metrics

All of the metrics above are what we refer to as “model agnostic”: they use the generator as a black box to sample the generated images $S_g$. Model agnostic metrics should not require a density estimation from the model. We choose to only experiment with model agnostic metrics, which allow us to support as many generative models as possible for evaluation without modification to their structure. We will briefly mention some other evaluation metrics not included in our experiments.

Kernel density estimation (KDE, or Parzen window estimation) is a well-studied method for estimating the density function of a distribution from samples. For a probability kernel $K$ (most often an isotropic Gaussian) and i.i.d. samples $x_1, \dots, x_n$, we can define the density function at $x$ as $p(x) \approx \frac{1}{Z} \sum_{i=1}^{n} K(x - x_i)$, where $Z$ is a normalizing constant. This allows the use of classical metrics such as KLD and JSD. However, despite the widespread adoption of this technique in various applications, its suitability for estimating the density of $P_r$ or $P_g$ for GANs has been questioned by Theis et al. (2015), since the probability kernel depends on the Euclidean distance between images.
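For illustration, a Parzen window estimate with an isotropic Gaussian kernel can be written as follows. The log-density is computed with the log-sum-exp trick for numerical stability; the bandwidth `sigma` and the function name are assumptions of this sketch.

```python
import math


def kde_log_density(x, samples, sigma=1.0):
    """Log of the Gaussian Parzen window density estimate at point x."""
    d = len(x)
    # Log of the normalizing constant: one Gaussian per sample, averaged over samples.
    log_norm = -0.5 * d * math.log(2.0 * math.pi * sigma ** 2) - math.log(len(samples))
    exponents = [
        -sum((a - b) ** 2 for a, b in zip(x, s)) / (2.0 * sigma ** 2)
        for s in samples
    ]
    # Log-sum-exp: factor out the largest exponent to avoid underflow.
    m = max(exponents)
    return log_norm + m + math.log(sum(math.exp(e - m) for e in exponents))


samples = [[0.0], [0.5], [1.0]]
print(kde_log_density([0.5], samples))   # near the samples: higher log-density
print(kde_log_density([10.0], samples))  # far from the samples: much lower
```

Because the kernel depends only on the Euclidean distance between `x` and each sample, the estimate inherits whatever weaknesses that distance has in the chosen representation.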

More recently, Wu et al. (2016) applied annealed importance sampling (AIS) to estimate the marginal distribution $p(x)$ of a generative model. This method is most natural for models that define a conditional distribution $p(x|z)$, where $z$ is the latent code, which is not satisfied by most GAN models. Nevertheless, AIS has been applied to GAN evaluation by assuming a Gaussian observation model. We exclude this method from our experiments as it needs access to the generative model to compute the likelihood, instead of only depending on a finite sample set $S_g$.

3 Experiments with GAN evaluation metrics

3.1 Feature space

All the metrics introduced in the previous section, except for the Inception Score and Mode Score, access the samples only through pair-wise distances. The Kernel MMD requires a fixed kernel function $k$, typically set to an isotropic Gaussian; the Wasserstein distance and 1-NN accuracy use the underlying distance metric $d$ directly. All of these methods are highly sensitive to the choice of that distance.

It is well-established that pixel representations of images do not induce meaningful Euclidean distances (Forsyth & Ponce, 2011). Small translations, rotations, or changes in illumination can increase distances dramatically with little effect on the image content. To quantify the similarity between distributions of images, it is therefore desirable to use distances invariant to such transformations. The choice of distance function can be re-interpreted as a choice of representation, by defining the distance in a more general form as $d(x, x') = \|\phi(x) - \phi(x')\|_2$, where $\phi$ is some general mapping of the input into a semantically meaningful feature space. For Kernel MMD, this corresponds to computing the usual inner product in the feature space $\phi(\mathcal{X})$.

Inspired by the Inception/Mode Score, and recent works from Upchurch et al. (2017); Larsen et al. (2015) which show that convolutional networks may linearize the image manifold, we propose to operate in the feature space of a pre-trained model on the ImageNet dataset. For efficiency, we use a 34-layer ResNet as the feature extractor. Our experiments show that other models such as VGG or Inception give very similar results.

To illustrate our point, we show failure examples of the pixel space distance for evaluating GANs in this section, and highlight that using a proper feature space is key to obtaining meaningful results when applying the distance-based metrics. The usage of a well-suited feature space enables us to draw more optimistic conclusions on GAN evaluation metrics than in Theis et al. (2015).

3.2 Setup

For the rest of this major section, we introduce what in our opinion are necessary conditions for good metrics for GANs. After the introduction of each condition, we use it as a criterion to judge the effectiveness of the metrics presented in Section 2, through carefully designed empirical experiments.

The experiments are performed on two commonly used datasets for generative models, CelebA (Footnote 3: CelebA is a large-scale high resolution face dataset with more than 200,000 centered celebrity images.) and LSUN bedrooms (Footnote 4: LSUN consists of around one million images for each of 10 scene classes. Following standard practice, we only take the bedroom scene.). To remove the degree of freedom induced by feature representation, we use the Inception Score (IS) and Mode Score (MS) computed from the softmax probabilities of the same ResNet-34 model as the other metrics, instead of the Inception model. We also compute the Inception Score over the real training data $S_r$ as an upper bound, which we denote as $\mathrm{IS}_0$. Moreover, to be consistent with other metrics where lower values correspond to better models, we report the relative inverse Inception Score $\mathrm{RIS} = 1 - \mathrm{IS}/\mathrm{IS}_0$ here, after computing the Inception Score IS. We similarly report the relative inverse Mode Score (RMS). Although RIS and RMS operate in the softmax space, we always compare them together with other metrics in the convolutional space for simplicity. For all the plots in this paper, shaded areas denote the standard deviations, computed by repeating the same experiment with different random seeds.

Figure 2: Distinguishing a set of real images from a mixed set of real images and GAN generated images. For the metric to be discriminative, its score should increase as the fraction of generated samples in the mix increases. RIS and RMS fail as they decrease with the fraction of generated samples in the mix on LSUN. Wasserstein and 1-NN accuracy (real) fail in pixel space as they do not increase.

3.3 Discriminability

Mixing of generated images.

Arguably, the most important property of a metric for measuring GANs is the ability to distinguish generated images from real images. To test this property, we sample a set $S_r$ consisting of $n$ ($n = 2000$ if not otherwise specified) real images uniformly from the training set, and a set $S_g(t)$ of the same size consisting of a mix of real samples and generated images from a DCGAN (Radford et al., 2015) trained on the same training set, where $t \in [0, 1]$ denotes the ratio of generated images.
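The construction of the mixed set can be sketched as follows, assuming two pools of already available samples (real and generated) from which to draw. The function name `build_mixed_set`, the rounding of the generated count, and the tagged tuples used as stand-ins for images are all illustrative choices.

```python
import random


def build_mixed_set(real_pool, generated_pool, n, t, seed=0):
    """Draw n samples, a fraction t of which come from the generated pool."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    rng = random.Random(seed)
    n_generated = round(t * n)
    # Sample without replacement from each pool, then shuffle the mix.
    mixed = rng.sample(generated_pool, n_generated) + rng.sample(real_pool, n - n_generated)
    rng.shuffle(mixed)
    return mixed


# Tagged placeholders stand in for images in this toy example.
real_pool = [("real", i) for i in range(10)]
generated_pool = [("generated", i) for i in range(10)]
mixed = build_mixed_set(real_pool, generated_pool, n=8, t=0.25)
print(sum(1 for tag, _ in mixed if tag == "generated"), "of", len(mixed), "are generated")
```

Sweeping `t` from 0 to 1 and evaluating a metric against a fixed real set at each step reproduces the shape of the experiment.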

The computed values of various metrics between $S_r$ and $S_g(t)$ for $t \in [0, 1]$ are shown in Figure 2. Since $\rho(S_r, S_r)$ should serve as a lower bound for any metric, we expect that any reasonable $\rho$ should increase as the ratio of generated images increases. This is indeed satisfied for all the metrics except: 1) RIS and RMS (red and green curves) on LSUN, which decrease as more fake samples are in the mix; 2) 1-NN accuracy of real samples (dotted magenta curve) computed in pixel space, which also appears to be a decreasing function; and 3) Wasserstein Distance (cyan curve), which almost remains unchanged when $t$ varies.

The reason that RIS and RMS do not work well here is likely that they are not suitable for images beyond the ImageNet categories. Although the other metrics, which operate in the convolutional feature space, also depend on a network pretrained on ImageNet, the convolutional features are much more general than the specific softmax representation. The failure of Wasserstein Distance is possibly due to an insufficient number of samples, which we will discuss in more detail when we analyze the sample efficiency of various metrics in a later subsection. The last paragraph of Section 2.2 explains why the 1-NN accuracy for real samples (dotted magenta curve) is always lower than that for generated samples (dashed magenta curve). In the pixel space, more than half of the samples from $S_r$ have their nearest neighbor from $S_g(t)$, indicating that the DCGAN is able to represent the modes in the pixel space quite well.

We also conducted the same experiment using 1) random noise images and 2) images from an entirely different distribution (e.g. CIFAR-10), instead of DCGAN generated images, to construct $S_g(t)$. We call these injected samples out-of-domain violations since they are not in $\mathcal{X}$, the domain of the real images. These settings yield results similar to those in Figure 2, thus we omit their plots.

Figure 3: Experiment on simulated mode collapsing. A metric score should increase to reflect the mismatch between true distribution and generated distribution as more modes are collapsed towards their cluster center. All metrics respond correctly in convolutional space. In pixel space, both Wasserstein distance and 1-NN accuracy (real) fail as they decrease in response to more collapsed clusters.

Mode collapsing and mode dropping.

In realistic settings, $P_r$ is usually very diverse since natural images are inherently multimodal. Many have conjectured that $P_g$ differs from $P_r$ by reducing diversity, possibly due to the lack of model capacity or inadequate optimization (Arora et al., 2017). This often manifests itself for generative models in a mix of two ways: mode dropping, where some hard-to-represent modes of $P_r$ are simply “ignored” by $P_g$; and mode collapsing, where several modes of $P_r$ are “averaged” by $P_g$ into a single mode, possibly located at a midpoint. An ideal metric should be sensitive to these two phenomena.

To test for mode collapsing, we first randomly sample both $S_r$ and $S_r'$ as two disjoint sets of 2000 real images. Next, we find 50 clusters in the whole training set with $k$-means and progressively replace each cluster by its respective cluster center to simulate mode collapse. Figure 3 shows computed values of $\rho(S_r, S_r')$ as the number of replaced (collapsed) clusters, denoted as $C$, increases. Ideally, we expect the scores to increase as $C$ grows. We first observe that all the metrics are able to respond correctly when distances are computed in the convolutional feature space. However, the Wasserstein metric (cyan curve) breaks down in pixel space, as it considers a collapsed sample set (with $C > 0$) to be closer to the real sample set than another set of real images (with $C = 0$). Moreover, although the overall 1-NN accuracy (solid magenta curve) follows the desired trend, the real and fake parts follow opposite trends: 1-NN real accuracy (dotted magenta curve) decreases while 1-NN fake accuracy (dashed magenta curve) increases. Again, this is in line with our explanation given in the last paragraph of Section 2.2.
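The collapse simulation can be illustrated with a small helper that replaces every member of the chosen clusters by the cluster mean. The sketch assumes that cluster assignments (for example, from a clustering run done beforehand) are already given as a list of labels; the function name and the toy data are hypothetical.

```python
def collapse_clusters(samples, labels, collapsed):
    """Replace each sample in a collapsed cluster by that cluster's mean."""
    dim = len(samples[0])
    sums, counts = {}, {}
    # Accumulate per-cluster sums and counts to obtain the cluster centers.
    for x, c in zip(samples, labels):
        acc = sums.setdefault(c, [0.0] * dim)
        for k in range(dim):
            acc[k] += x[k]
        counts[c] = counts.get(c, 0) + 1
    centers = {c: [v / counts[c] for v in sums[c]] for c in sums}
    # Samples in untouched clusters are copied unchanged.
    return [list(centers[c]) if c in collapsed else list(x) for x, c in zip(samples, labels)]


samples = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
labels = [0, 0, 1, 1]
print(collapse_clusters(samples, labels, collapsed={0}))
```

Growing the `collapsed` set one cluster at a time and re-evaluating a metric at each step mirrors the progressive replacement described above.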

To test for mode dropping, we take $S_r$ as above and construct $S_r'$ by randomly removing clusters. To keep the size of $S_r'$ constant, we replace images from the removed clusters with images randomly selected from the remaining clusters. Figure 4 shows how different metrics react to the number of removed clusters, also denoted as $C$. All scores effectively discriminate against mode dropping except RIS and RMS, which remain almost indifferent when some modes are dropped. Again, this is perhaps caused by the fact that the Inception and Mode Scores were originally designed for datasets with classes overlapping with the ImageNet dataset, and they do not generalize well to other datasets.
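A possible sketch of the mode dropping construction is shown below, again assuming that cluster labels are already available. Members of dropped clusters are replaced by randomly chosen members of the surviving clusters so that the set size stays constant; the function name, seed handling, and toy data are illustrative.

```python
import random


def drop_clusters(samples, labels, dropped, seed=0):
    """Replace samples from dropped clusters with random samples from the rest."""
    rng = random.Random(seed)
    survivors = [x for x, c in zip(samples, labels) if c not in dropped]
    if not survivors:
        raise ValueError("at least one cluster must remain")
    # Keep surviving samples in place; fill dropped slots from the survivors.
    return [x if c not in dropped else rng.choice(survivors) for x, c in zip(samples, labels)]


samples = [[0.0], [2.0], [10.0], [12.0]]
labels = [0, 0, 1, 1]
result = drop_clusters(samples, labels, dropped={1})
print(result)
```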

Figure 4: Experiment on simulated mode dropping. A metric score should increase to reflect the mismatch between true distribution and generated distribution as more modes are dropped. All metrics respond correctly except RIS and RMS, which only increase slightly in value even when almost all modes are dropped.

Figure 5: Experiment on robustness of each metric to small transformations (rotations and translations). All metrics should remain constant across all mixes of real and transformed real samples, since the transformations do not alter image semantics. All metrics respond correctly in convolutional space, but not in pixel space. This experiment illustrates the unsuitability of distances in pixel space.

Figure 6: The score of various metrics as a function of the number of samples. An ideal metric should result in a large gap between the real-real (R-R; $\rho(S_r, S_r')$) and real-fake (R-G; $\rho(S_r, S_g)$) curves in order to distinguish between real and fake distributions using as few samples as possible. Compared with Wasserstein distance, MMD and 1-NN accuracy require far fewer samples to discriminate real and generated images, while RIS totally fails on LSUN as it scores generated images even better (lower) than real images.

3.4 Robustness to transformations

GANs are widely used for image datasets, which have the property that certain transformations to the input do not change its semantic meaning. Thus an ideal evaluation metric should be invariant to such transformations to some extent. For example, a generator trained on CelebA should not be penalized by a metric if its generated faces are shifted by a few pixels or rotated by a small angle.

Figure 5 shows how the various metrics react to such small transformations of the images. In this experiment, $S_r$ and $S_r'$ are two disjoint sets of 2000 real images sampled from the training data. However, a proportion of images from $S_r'$ are randomly shifted (up to 4 pixels) or rotated (up to 15 degrees). We can observe from the results that metrics operating in the convolutional space (or softmax space for RIS and RMS) are robust to these transformations, as all the scores are almost constant as the ratio of transformed samples increases. This is not that surprising, as convolutional networks are well known for being invariant to certain transformations (Mallat, 2016). In comparison, in the pixel space all the metrics consider the shifted/rotated images as drawn from a different distribution, highlighting the importance of computing distances in a proper feature space.

3.5 Efficiency

A practical GAN evaluation metric should be able to compute “accurate” scores from a reasonable number of samples and within an affordable computation cost, such that it can be computed, for example, after each training epoch to monitor the training process.

Sample efficiency.

Here we measure the sample efficiency of various metrics by investigating how many samples are needed for each of them in order to discriminate a set of generated samples $S_g$ (from DCGAN) from a set of real samples $S_r'$. To do this, we introduce a reference set $S_r$, which is also uniformly sampled from the real training data, but is disjoint with $S_r'$. All three sample sets have the same size, i.e., $|S_r| = |S_r'| = |S_g| = n$. We expect that an ideal metric $\rho$ should correctly score $\rho(S_r, S_r')$ lower than $\rho(S_r, S_g)$ with a relatively small $n$. In other words, the number of samples $n$ needed for the metric to distinguish $S_r'$ and $S_g$ can be viewed as its sample complexity.

Figure 7: Measurement of wall-clock time for computing various metrics as a function of the number of samples. All metrics are practical to compute for a sample of size 2000, but Wasserstein distance does not scale to large sample sizes.

In Figure 6 we show the individual scores as a function of $n$. We can observe that MMD, FID and 1-NN accuracy computed in the convolutional feature space are able to distinguish the two sets of images $S_g$ (solid blue and magenta curves) and $S_r'$ (dotted blue and magenta curves) with relatively few samples. The Wasserstein distance (cyan curves) is not discriminative with sample sizes of less than 1000, while the RIS even considers the generated samples to be more “real” than the real samples on the LSUN dataset (the red curves in the third panel). The dotted lines in Figure 6 also quantify how fast the scores converge to their expectations as we increase the sample size. Note that MMD for $\rho(S_r, S_r')$ converges very quickly to zero and gives discriminative scores with few samples, making it a practical metric for comparing GAN models.

Computational efficiency.

Fast computation of the metric is of practical concern as it helps researchers monitor the training process and diagnose problems early on, or perform early stopping. In Figure 7 we investigate the computational efficiency of the above metrics by showing the wall-clock time (in log scale) to compute them as a function of the number of samples. For a typical number of 2000 samples, it only takes about 8 seconds to compute each of these metrics on an NVIDIA TitanX. In fact, the majority of time is spent on extracting features from the ResNet model. Only the Wasserstein distance becomes prohibitively slow for large sample sizes.

Figure 8: Experiment on detecting overfitting of generated samples. As more generated samples overlap with real samples from the training set, the gap between validation and training score should increase to signal overfitting. All metrics behave correctly except for RIS and RMS, as these two metrics do not increase when the fraction of overlapping samples increases.

3.6 Detecting overfitting

Overfitting is an artifact of training with finite samples. If a GAN successfully memorizes the training images, i.e., $P_g$ is a uniform distribution over the training sample set $S_r^{tr}$, then the generated sample set $S_g$ becomes a uniformly drawn set of $n$ samples from $S_r^{tr}$, and any reasonable $\rho(S_g, S_r^{tr})$ should be close to 0. The Wasserstein distance, MMD and 1-NN two-sample test are able to detect overfitting in the following sense: if we hold out a validation set $S_r^{val}$, then $\rho(S_g, S_r^{val})$ should be significantly higher than $\rho(S_g, S_r^{tr})$ when $P_g$ memorizes a part of $S_r^{tr}$. The difference between them can informally be viewed as a form of “generalization gap”.

We simulate the overfitting process by defining $S_r'$ as a mix of samples from the training set $S_r^{tr}$ and a second holdout set, disjoint from both $S_r^{tr}$ and $S_r^{val}$. Figure 8 shows the gap $\rho(S_r', S_r^{val}) - \rho(S_r', S_r^{tr})$ of the various metrics as a function of the overlapping ratio between $S_r'$ and $S_r^{tr}$. The leftmost point of each curve can be viewed as the score computed on a validation set, since the overlap ratio is 0. For better visualization, we normalize the Wasserstein distance and MMD by dividing by their corresponding score when $S_r'$ and $S_r^{tr}$ have no overlap. As shown in Figure 8, all the metrics except RIS and RMS reflect that the “generalization gap” increases as $S_r'$ overfits more to $S_r^{tr}$. The failure of RIS is not surprising: it totally ignores the real data distribution, as we discussed in Section 2.2. The reason that RMS also fails to detect overfitting may again be its lack of generalization to datasets with classes not contained in the ImageNet dataset. In addition, RMS operates in the softmax space, the features of which might be too specific compared to the features in the convolutional space.
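The gap computation can be sketched as follows, using a biased squared MMD with a Gaussian kernel as a stand-in metric and one-dimensional synthetic points in place of image features. All names, the sample sizes, and the choice of metric here are assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Isotropic Gaussian kernel between two feature vectors."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * sigma ** 2))


def mmd2(xs, ys, sigma=1.0):
    """Biased empirical estimate of squared MMD between two sample sets."""
    def mean_kernel(a, b):
        return sum(gaussian_kernel(u, v, sigma) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) - 2.0 * mean_kernel(xs, ys) + mean_kernel(ys, ys)


def generalization_gap(candidate, train, val, metric=mmd2):
    """Score against the validation set minus score against the training set."""
    return metric(candidate, val) - metric(candidate, train)


def mix_with_overlap(train, holdout, n, overlap, rng):
    """Build a candidate set in which a fraction `overlap` comes from the training set."""
    n_train = round(overlap * n)
    return rng.sample(train, n_train) + rng.sample(holdout, n - n_train)


rng = random.Random(0)
train = [[rng.gauss(0.0, 1.0)] for _ in range(30)]
val = [[rng.gauss(0.0, 1.0)] for _ in range(30)]
holdout = [[rng.gauss(0.0, 1.0)] for _ in range(30)]
for overlap in (0.0, 0.5, 1.0):
    candidate = mix_with_overlap(train, holdout, 30, overlap, rng)
    print(overlap, generalization_gap(candidate, train, val))
```

When the candidate set coincides with the training set, the score against the training set vanishes and the gap reduces to the score against the validation set, which is positive.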

4 Discussions and Conclusion

Based on the above analysis, we can summarize the advantages and inherent limitations of the six evaluation metrics, and conditions under which they produce meaningful results. With some of the metrics, we are able to study the problem of overfitting (see Appendix C), perform model selection on GAN models and compare them without resorting to human evaluation based on cherry-picked samples (see Appendix D).

The Inception Score does show a reasonable correlation with the quality and diversity of generated images, which explains its wide usage in practice. However, it is ill-posed, mostly because it only evaluates $P_g$ as an image generation model rather than its similarity to $P_r$. Blunt violations like mixing in natural images from an entirely different distribution completely deceive the Inception Score. As a result, it may encourage the models to simply learn sharp and diversified images (or even some adversarial noise), instead of $P_r$. This also applies to the Mode Score. Moreover, the Inception Score is unable to detect overfitting since it cannot make use of a holdout validation set.

Kernel MMD works surprisingly well when it operates in the feature space of a pre-trained ResNet. It is always able to distinguish generated/noise images from real images, and both its sample complexity and computational complexity are low. Given these advantages, even though MMD is biased, we recommend its use in practice.

Wasserstein distance works well when the distance is computed in a suitable feature space. However, it has a high sample complexity, a fact that has also been observed by Arora et al. (2017). Another key weakness is that computing the exact Wasserstein distance has a time complexity of $O(n^3)$, which is prohibitively expensive as the sample size increases. Compared to other methods, Wasserstein distance is less appealing as a practical evaluation metric.
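To make the cost concrete: for two samples of equal size with uniform weights, the exact Wasserstein distance reduces to finding the best one-to-one matching between them. The sketch below finds that matching by exhaustive search over permutations, which is even slower than the cubic-time assignment solvers used in practice and is feasible only for a handful of points. It is meant purely as an illustration; the function names and the Euclidean ground distance are assumptions of this example.

```python
import itertools
import math

def euclidean(x, y):
    """Euclidean distance between two vectors given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def wasserstein_exact(xs, ys):
    """Exact Wasserstein distance between two equal-size, uniformly
    weighted samples, found by brute force over all matchings.

    Only practical for very small n; real implementations solve the
    same assignment problem with a dedicated solver.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    n = len(xs)
    best_total = min(
        sum(euclidean(xs[i], ys[j]) for i, j in enumerate(perm))
        for perm in itertools.permutations(range(n))
    )
    return best_total / n

# Hypothetical toy samples.
print(wasserstein_exact([(0.0, 0.0), (2.0, 0.0)], [(0.0, 1.0), (2.0, 1.0)]))
```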

Fréchet Inception Distance performs well in terms of discriminability, robustness and efficiency. It serves as a good metric for GANs, despite only modeling the first two moments of the distributions in feature space.
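For reference, the remark about the first two moments can be read off the standard closed form of the Fréchet distance between two Gaussians. The notation below is assumed for this note: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of the real and generated features, respectively.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Only the means and covariances enter the expression, so two feature distributions that agree in these moments receive a distance of zero even if they differ in higher-order structure.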

The 1-NN classifier appears to be an ideal metric for evaluating GANs. Not only does it enjoy all the advantages of the other metrics, it also outputs a score in the interval [0, 1], similar to the accuracy/error in classification problems. When the generative distribution perfectly matches the true distribution, the perfect score (i.e., 50% accuracy) is attainable. From Figure 2, we find that typical GAN models tend to achieve lower LOO accuracy for real samples (1-NN accuracy (real)), but higher LOO accuracy for generated samples (1-NN accuracy (fake)). This suggests that GANs are able to capture modes from the training distribution, such that the majority of training samples distributed around the mode centers have their nearest neighbor among the generated images, yet most of the generated images are still surrounded by generated images as they are collapsed together. This observation indicates that the mode collapse problem is prevalent for typical GAN models. We also note that this problem cannot be effectively detected by human evaluation or the widely used Inception Score.
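The leave-one-out (LOO) procedure and the real/fake split of its accuracy can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the squared Euclidean distance, and the toy one-dimensional data are assumptions. The toy data are built so that generated points sit tightly together between the real points of each mode, an exaggerated version of the collapse pattern described above.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two vectors given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def loo_1nn_accuracy(real, fake):
    """Leave-one-out 1-NN accuracy, overall and split by true label.

    Each point is classified by the label of its nearest neighbor among
    all other points. Ties go to the earliest point in the pooled list.
    """
    points = [(x, "real") for x in real] + [(x, "fake") for x in fake]
    correct = {"real": 0, "fake": 0}
    for i, (x, label) in enumerate(points):
        others = (p for j, p in enumerate(points) if j != i)
        _, nn_label = min(others, key=lambda p: squared_distance(x, p[0]))
        if nn_label == label:
            correct[label] += 1
    return {
        "overall": (correct["real"] + correct["fake"]) / len(points),
        "real": correct["real"] / len(real),
        "fake": correct["fake"] / len(fake),
    }

# Hypothetical data: two modes, with generated points collapsed inside each.
real = [(-1.0,), (1.0,), (99.0,), (101.0,)]
fake = [(0.0,), (0.1,), (100.0,), (100.1,)]
print(loo_1nn_accuracy(real, fake))
```

In this toy case every real point has a generated nearest neighbor while every generated point has a generated nearest neighbor, so the real accuracy is 0 and the fake accuracy is 1. The overall accuracy alone (0.5 here) would hide this, which is why the two components are worth reporting separately.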

Overall, our empirical study suggests that the choice of feature space in which to compute various metrics is crucial. In the convolutional space of a ResNet pretrained on ImageNet, both MMD and 1-NN accuracy appear to be good metrics in terms of discriminability, robustness and efficiency. Wasserstein distance has very poor sample efficiency, while Inception Score and Mode Score appear to be unsuitable for datasets that are very different from ImageNet. We will release our source code for all these metrics, providing researchers with an off-the-shelf tool to compare and improve GAN algorithms.

Based on the two most prominent metrics, MMD and 1-NN accuracy, we study the overfitting problem of DCGAN and WGAN (in Appendix C). Despite the widespread belief that GANs are overfitting to the training data, we find that this does not occur unless there are very few training samples. This raises an interesting question regarding the generalization of GANs in comparison to the supervised setting. We hope that future work can contribute to explaining this phenomenon.


The authors are supported in part by grants from the National Science Foundation (III-1525919, IIS-1550179, IIS-1618134, S&AS 1724282, and CCF-1740822), the Office of Naval Research DOD (N00014-17-1-2175), and the Bill and Melinda Gates Foundation. We are thankful for generous support by SAP America Inc.


Appendix A GAN Variants used in our experiments

Many GAN variants have been proposed recently. In this paper we consider several of them, which we briefly review in this section.


DCGAN (Radford et al., 2015). The generator of a DCGAN takes a lower-dimensional input from a uniform noise distribution, then projects and reshapes it to a small convolutional representation with many feature maps. After applying a series of four fractionally-strided convolutions, the generator converts this representation into a 64 × 64 pixel image. DCGAN is optimized by minimizing the Jensen-Shannon divergence between the real and generated images.


WGAN (Arjovsky et al., 2017). A critic network that outputs unconstrained real values is used in place of the discriminator. When the critic is Lipschitz, this network approximates the Wasserstein distance between the real and generated distributions. A Lipschitz condition is enforced by clipping the critic network's parameters to stay within a predefined bounding box.

WGAN with gradient penalty (Gulrajani et al., 2017) improves upon WGAN by enforcing the Lipschitz condition with a gradient penalty term. This method significantly improves the convergence speed and the quality of the images generated by a WGAN.


LSGAN (Mao et al., 2016). Least Squares GAN adopts the least squares loss function instead of the commonly used sigmoid cross entropy loss for the discriminator, essentially minimizing the Pearson divergence between the real distribution and the generative distribution.
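As a small illustration of the least squares objective, the sketch below computes the discriminator and generator losses from lists of raw discriminator outputs. The function names and the default target labels (0 for generated, 1 for real) are assumptions chosen for this example; other label codings are possible, and in practice these losses would be computed on tensors inside a training loop.

```python
def lsgan_discriminator_loss(d_real, d_fake, fake_label=0.0, real_label=1.0):
    """Least squares discriminator loss.

    d_real / d_fake are discriminator outputs on real / generated samples.
    Outputs are pushed toward real_label and fake_label respectively.
    """
    real_term = sum((d - real_label) ** 2 for d in d_real) / len(d_real)
    fake_term = sum((d - fake_label) ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * real_term + 0.5 * fake_term

def lsgan_generator_loss(d_fake, target_label=1.0):
    """Least squares generator loss: generated samples should be scored
    as close as possible to target_label by the discriminator."""
    return 0.5 * sum((d - target_label) ** 2 for d in d_fake) / len(d_fake)

# Hypothetical discriminator outputs.
print(lsgan_discriminator_loss([0.9, 0.8], [0.2, 0.1]))
print(lsgan_generator_loss([0.2, 0.1]))
```

Unlike the sigmoid cross entropy loss, the penalty here keeps growing quadratically with the distance from the target label, on either side of it.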

Figure 9: Comparison of all metrics in different feature spaces. When using different trained networks, the trends of all metrics are very similar. Most metrics work well even in a random network, but Wasserstein distance has very high variance and the magnitude of increase for 1-NN accuracy is small.

Appendix B The choice of feature space

The choice of feature space is crucial for all these metrics. Here we consider several alternatives to the convolutional features from the 34-layer ResNet trained on ImageNet. In particular, we compute various metrics using the features extracted by (1) the VGG and Inception networks; (2) a 34-layer ResNet with random weights; (3) a ResNet classifier trained on the same dataset as the GAN models. We use the features extracted from these models to test all metrics in the discriminative experiments we performed in Section 3. All experimental settings are identical except for the third setting, which is evaluated on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009) instead, as we need class labels to train the classifier. Note that we consider setting (3) only for analytical purposes. It is not a practical choice, as GANs are mainly designed for unsupervised learning and we should not assume the existence of ground truth labels.

Figure 10: Using features extracted from a ResNet trained on CIFAR-10 (right plot) to evaluate a GAN model trained on the same dataset. Compared to using an extractor trained on ImageNet, the metrics appear to have lower variance. However, this may be due to the feature dimensionality being smaller for CIFAR-10.

The results are shown in Figure 9 and Figure 10, from which several observations can be made: (1) switching from ResNet-34 to VGG or Inception has little effect on the metric scores; (2) the features from a random network still work for MMD, but they make the Wasserstein distance unstable and the 1-NN accuracy less discriminative. Not surprisingly, the Inception Score and Mode Score become meaningless if we use the softmax values from the random network; (3) features extracted from the classifier trained on the same dataset as the GAN model also offer high discriminability for these metrics, especially for the Wasserstein distance. However, this may simply be due to the fact that the feature dimensionality of the ResNet trained on CIFAR-10 is much smaller than that of the ResNet-34 trained on ImageNet (64 vs. 512).

Figure 11: Training curves of DCGAN and WGAN on large (left two panels), small (middle two panels) and tiny (right two panels) subsets of CelebA. Note that for the first four plots, the blue (yellow) curves almost overlap with the red (green) curves, indicating that no overfitting is detected by the two metrics. Overfitting is only observed on the tiny training set, with the MMD score and 1-NN accuracy significantly worse (higher) on the validation set.

Appendix C Are GANs overfitting to the training data?

We trained two representative GAN models, DCGAN (Radford et al., 2015) and WGAN (Arjovsky et al., 2017), on the CelebA dataset. Out of the 200,000 images in total, we hold out 20,000 images for validation and use the rest for training. As the training set is sufficiently large, which makes overfitting unlikely to occur, we also create a small training set and a tiny training set, with only 2000 and 10 images respectively, sampled from the full training set.

The training settings for DCGAN and WGAN strictly follow their original implementations, except that we change the default number of training iterations such that both models are sufficiently updated. For each metric, we compute its score on 2000 real samples and 2000 generated samples, where the real samples are drawn from either the training set or the validation set, giving rise to training and validation scores. The results are shown in Figure 11, from which we can make several observations:


  • The training and validation scores almost overlap with each other with 2000 or 180k training samples, showing that neither DCGAN nor WGAN overfits to the training data under either of these metrics. Even when using only 2000 training samples, there is still no significant difference between the training score and the validation score. This shows that the training process of GANs behaves quite differently from that of supervised deep learning models, where a model can easily achieve 0 training error while behaving like random guessing on the validation set (Zhang et al., 2016). [Footnote 5: We observed that even memorizing images is difficult for GAN models.]

  • DCGAN outperforms WGAN on the full training set under both metrics, and converges faster. However, WGAN is much more stable on the small training set, and converges to better positions.

Metric                  Real    DCGAN   WGAN    WGAN-GP  LSGAN
Conv Space MMD          0.019   0.205   0.270   0.194    0.232
1-NN Accuracy           0.499   0.825   0.920   0.812    0.871
1-NN Accuracy (real)    0.495   0.759   0.880   0.765    0.804
1-NN Accuracy (fake)    0.503   0.892   0.961   0.860    0.938
Table 1: Comparison of several GAN models on the LSUN dataset

Appendix D Comparison of popular GAN models based on quantitative evaluation metrics

Based on our analysis, we chose MMD and 1-NN accuracy in the feature space of a 34-layer ResNet trained on ImageNet to compare several state-of-the-art GAN models. All scores are computed using 2000 samples from the holdout set and 2000 generated samples. The GAN models evaluated include DCGAN (Radford et al., 2015), WGAN (Arjovsky et al., 2017), WGAN with gradient penalty (WGAN-GP) (Gulrajani et al., 2017), and LSGAN (Mao et al., 2016), all trained on the CelebA dataset. The results are reported in Table 1, from which we highlight three observations:

  • WGAN-GP performs the best under most of the metrics.

  • DCGAN achieves a 1-NN accuracy of 0.759 on real samples, slightly better than the 0.765 achieved by WGAN-GP, while the 1-NN accuracy on generated (fake) samples achieved by DCGAN is higher than that of WGAN-GP (0.892 vs. 0.860). This seems to suggest that DCGAN is better at capturing modes in the training data distribution, while its generated samples are more collapsed compared to those of WGAN-GP. Such a subtle difference is unlikely to be discovered by the Inception Score or human evaluation.

  • The 1-NN accuracies of all evaluated GAN models are higher than 0.8, far above the ground truth of 0.5. The MMD scores of the four GAN models are also much larger than that of the ground truth (0.019). This indicates that even state-of-the-art GAN models are far from learning the true distribution.