Pros and Cons of GAN Evaluation Measures

02/09/2018 · Ali Borji

Generative models, in particular generative adversarial networks (GANs), have received a lot of attention recently. A number of GAN variants have been proposed and have been utilized in many applications. Despite large strides in terms of theoretical progress, evaluating and comparing GANs remains a daunting task. While several measures have been introduced, as of yet, there is no consensus as to which measure best captures strengths and limitations of models and should be used for fair model comparison. As in other areas of computer vision and machine learning, it is critical to settle on one or a few good measures to steer the progress in this field. In this paper, I review and critically discuss more than 19 quantitative and 4 qualitative measures for evaluating generative models, with a particular emphasis on GAN-derived models.


1 Introduction

Generative models are a fundamental component of a variety of important machine learning and computer vision algorithms. They are increasingly used to estimate the underlying statistical structure of high-dimensional signals and to artificially generate various kinds of data, including high-quality images, videos, and audio. They can be utilized for purposes such as representation learning and semi-supervised learning radford2015unsupervised ; odena2016conditional ; salimans2016improved , domain adaptation ganin2016domain ; tzeng2017adversarial , text-to-image synthesis reed2016generative , compression theis2017lossy , super resolution ledig2016photo , inpainting pathak2016context ; yeh2017semantic , saliency prediction pan2017salgan , image enhancement zhang2017image , style transfer and texture synthesis gatys2016image ; johnson2016perceptual , image-to-image translation isola2017image ; zhu2017unpaired , and video generation and prediction vondrick2016generating . A recent class of generative models known as Generative Adversarial Networks (GANs), introduced by Goodfellow et al. goodfellow2014generative , has attracted much attention. A sizable volume of follow-up papers has been published since the introduction of GANs in 2014. There has been substantial progress in terms of theory and applications, and a large number of GAN variants have been introduced. However, comparatively little effort has been spent on evaluating GANs, and grounded ways to quantitatively and qualitatively assess them are still missing.

Generative models can be classified into two broad categories: explicit and implicit approaches. The former class assumes access to the model likelihood function, whereas the latter uses a sampling mechanism to generate data. Examples of explicit models are variational auto-encoders (VAEs) kingma2013auto ; kingma2014semi and PixelCNN van2016conditional . Examples of implicit generative models are GANs. Explicit models are typically trained by maximizing the likelihood or its lower bound. GANs aim to approximate a data distribution $p_{data}$ using a parameterized model distribution $p_g$. They achieve this by jointly optimizing two adversarial networks: a generator and a discriminator. The generator $G$ is trained to synthesize, from a noise vector $\mathbf{z}$, images that are close to the true data distribution. The discriminator $D$ is optimized to accurately distinguish between the synthesized images coming from the generator and the real images from the data distribution. GANs have shown a dramatic ability to generate realistic high-resolution images.

Several evaluation measures have surfaced with the emergence of new models. Some of them attempt to quantitatively evaluate models, while others emphasize qualitative ways such as user studies or analyzing the internals of models. Both of these approaches have strengths and limitations. For example, one may think that fooling a person into mistaking generated images for real ones is the ultimate test. Such a measure, however, may favor models that concentrate on limited sections of the data (i.e. overfitting or memorizing; low diversity; mode dropping). Quantitative measures, while being less subjective, may not directly correspond to how humans perceive and judge generated images. These, along with other issues such as the variety of probability criteria and the lack of a perceptually meaningful image similarity measure, have made evaluating generative models notoriously difficult theis2015note . In spite of there being no agreement regarding the best GAN evaluation measure, a few works have already started to benchmark GANs (e.g. lucic2017gans ; kurach2018gan ; shmelkov2018good ). While such studies are indeed helpful, further research is needed to understand GAN evaluation measures and assess their strengths and limitations (e.g. theis2015note ; huang2018an ; arora2017gans ; chen2017metrics ; wu2016quantitative ; anonymous2018an ).

My main goal in this paper is to critically review the available GAN measures and help researchers objectively assess them. At the end, I will offer some suggestions for designing more efficient measures for fair GAN evaluation and comparison.

2 GAN Evaluation Measures

I will enumerate the GAN evaluation measures while discussing their pros and cons. They will be organized in two categories: quantitative and qualitative. Notice that some of these measures (e.g.  Wasserstein distance, reconstruction error or SSIM) can also be used for model optimization during training. In the next subsection, I will first provide a set of desired properties for GAN measures (a.k.a meta measures or desiderata) followed by an evaluation of whether a given measure or a family of measures is compatible with them.

Table 1 shows the list of measures. The majority of the measures return a single value, while a few (GAM im2016generating and NRDS zhang2018decoupled ) perform relative comparisons. The rationale behind the latter is that if it is difficult to obtain a perfect measure, we can at least evaluate which model generates better images than the others.

2.1 Desiderata

Before delving into the explanation of the evaluation measures, I first list a number of desired properties that an efficient GAN evaluation measure should fulfill. These properties can serve as meta measures to evaluate and compare the GAN evaluation measures. Here, I emphasize the qualitative aspects of these measures. As will be discussed in Section 3, some recent works have attempted to compare the meta measures quantitatively (e.g. the computational complexity of a measure). An efficient GAN evaluation measure should:

  1. favor models that generate high fidelity samples (i.e. ability to distinguish generated samples from real ones; discriminability),

  2. favor models that generate diverse samples (and thus is sensitive to overfitting, mode collapse and mode drop, and can undermine trivial models such as the memory GAN),

  3. favor models with disentangled latent spaces as well as space continuity (a.k.a controllable sampling),

  4. have well-defined bounds (lower, upper, and chance),

  5. be sensitive to image distortions but invariant to transformations that preserve semantic meaning. GANs are often applied to image datasets where certain transformations of the input do not change semantic content, and an ideal measure should be invariant to such transformations. For instance, the score of a generator trained on the CelebA face dataset should not change much if its generated faces are shifted by a few pixels or rotated by a small angle.

  6. agree with human perceptual judgments and human rankings of models, and

  7. have low sample and computational complexity.

In what follows, the GAN measures will be discussed and assessed with respect to the above desiderata; a summary will be presented in Section 3 (see Table 2).

Figure 1: A schematic layout of the typical approach to sample-based GAN evaluation, in which a set of real samples and a set of generated samples are embedded and compared. Figure from huang2018an .

2.2 Quantitative Measures

A schematic layout for sample-based GAN evaluation measures is shown in Fig. 1. Some measures discussed in the following are “model agnostic”, in that the generator is used as a black box to sample images and no density estimate from the model is required. In contrast, other measures, such as the average log-likelihood, demand estimating a probability distribution from samples.

Measure Description
Quantitative 1. Average Log-likelihood goodfellow2014generative ; theis2015note Log-likelihood of explaining real-world held-out/test data using a density estimated from the generated data (e.g. using KDE or Parzen window estimation).
2. Coverage Metric tolstikhin2017adagan The probability mass of the true data “covered” by the model distribution: $C := P_{data}(dP_{model} > t)$, with $t$ chosen such that $P_{model}(dP_{model} > t) = 0.95$.
3. Inception Score (IS) salimans2016improved KLD between conditional and marginal label distributions over generated data.
4. Modified Inception Score (m-IS) gurumurthy2017deligan Encourages diversity within images sampled from a particular category.
5. Mode Score (MS) che2016mode Similar to IS but also takes into account the prior distribution of the labels over real data.
6. AM Score zhou2018activation Takes into account the KLD between distributions of training labels vs. predicted labels, as well as the entropy of predictions.

7. Fréchet Inception Distance (FID) heusel2017gans Wasserstein-2 distance between multi-variate Gaussians fitted to data embedded into a feature space
8. Maximum Mean Discrepancy (MMD) gretton2012kernel Measures the dissimilarity between two probability distributions $P_r$ and $P_g$ using samples drawn independently from each distribution.
9. The Wasserstein Critic arjovsky2017wasserstein The critic (e.g.  an NN) is trained to produce high values at real samples and low values at generated samples
10. Birthday Paradox Test arora2017gans Measures the support size of a discrete (continuous) distribution by counting the duplicates (near duplicates)
11. Classifier Two Sample Test (C2ST) lehmann2006testing Answers whether two samples are drawn from the same distribution (e.g.  by training a binary classifier)
12. Classification Performance radford2015unsupervised ; isola2017image An indirect technique for evaluating the quality of unsupervised representations (e.g. feature extraction; FCN score). See also the GAN Quality Index (GQI) ye2018gan .
13. Boundary Distortion santurkar2018classification Measures diversity of generated samples and covariate shift using classification methods.
14. Number of Statistically-Different Bins (NDB) Richardson2018 Given two sets of samples from the same distribution, the number of samples that fall into a given bin should be the same up to sampling noise
15. Image Retrieval Performance wang2016ensembles Measures the distributions of distances to the nearest neighbors of some query images (i.e.  diversity)
16. Generative Adversarial Metric (GAM) im2016generating Compares two GANs by engaging them in a battle against each other, swapping their discriminators or generators.
17. Tournament Win Rate and Skill Rating olsson2018skill Implements a tournament in which a player is either a discriminator that attempts to distinguish between real and fake data or a generator that attempts to fool the discriminators into accepting fake data as real.
18. Normalized Relative Discriminative Score (NRDS) zhang2018decoupled Compares GANs based on the idea that if the generated samples are closer to real ones, more epochs are needed to distinguish them from real samples.

19. Adversarial Accuracy and Divergence yang2017lr Adversarial Accuracy: computes the classification accuracies achieved by two classifiers, one trained on real data and another on generated data, on a labeled validation set to approximate $p_{data}(y|\mathbf{x})$ and $p_g(y|\mathbf{x})$. Adversarial Divergence: computes the KL divergence between the two.
20. Geometry Score khrulkov2018geometry Compares geometrical properties of the underlying data manifold between real and generated data.
21. Reconstruction Error xiang2017effects Measures the reconstruction error (e.g. the $L_2$ norm) between a test image and its closest generated image, obtained by optimizing over the latent code (i.e. $\min_{\mathbf{z}} \|G(\mathbf{z}) - \mathbf{x}^{(test)}\|$).
22. Image Quality Measures wang2004image ; ridgeway2015learning ; juefei2017gang Evaluates the quality of generated images using measures such as SSIM, PSNR, and sharpness difference
23. Low-level Image Statistics zeng2017statistics ; karras2017progressive Evaluates how similar low-level statistics of generated images are to those of natural scenes in terms of mean power spectrum, distribution of random filter responses, contrast distribution, etc.
24. Precision, Recall and F1 Score lucic2017gans These measures are used to quantify the degree of overfitting in GANs, often over toy datasets.

Qualitative
1. Nearest Neighbors To detect overfitting, generated samples are shown next to their nearest neighbors in the training set
2. Rapid Scene Categorization goodfellow2014generative In these experiments, participants are asked to distinguish generated samples from real images within a short presentation time (e.g. 100 ms); i.e. real vs. fake
3. Preference Judgment huang2017stacked ; zhang2017stackgan ; xiao2018generating ; yi2017dualgan Participants are asked to rank models in terms of the fidelity of their generated images (e.g.  pairs, triples)
4. Mode Drop and Collapse srivastava2017veegan ; lin2017pacgan Over datasets with known modes (e.g. a GMM or a labeled dataset), the modes captured are counted by measuring the distances of generated data to the mode centers
5. Network Internals radford2015unsupervised ; chen2016infogan ; higgins2016beta ; mathieu2016disentangling ; zeiler2014visualizing ; bau2017network Regards exploring and illustrating the internal representation and dynamics of models (e.g.  space continuity) as well as visualizing learned features
Table 1: A summary of common GAN evaluation measures.
  1. Average Log-likelihood. Kernel density estimation (KDE, or Parzen window estimation) is a well-established method for estimating the density function of a distribution from samples (each sample is a vector, shown in boldface, e.g. $\mathbf{x}$). For a probability kernel $K$ (most often an isotropic Gaussian) and i.i.d. samples $\mathbf{x}_1,\dots,\mathbf{x}_n$, the density estimate at a point $\mathbf{x}$ is defined as $\hat{p}(\mathbf{x})=\frac{1}{Z}\sum_{i=1}^{n}K(\mathbf{x}-\mathbf{x}_i)$, where $Z$ is a normalizing constant. This allows the use of classical measures such as KLD and JSD (Jensen-Shannon divergence). However, despite its widespread use, its suitability for estimating the density of GANs has been questioned by Theis et al. theis2015note .

    Log-likelihood (or equivalently Kullback-Leibler divergence) has been the de-facto standard for training and evaluating generative models tolstikhin2017adagan . It measures the likelihood of held-out real data under the model distribution, i.e. $L=\frac{1}{N}\sum_{i=1}^{N}\log p_{model}(\mathbf{x}^{(i)})$ over test samples $\mathbf{x}^{(i)}$. Since estimating the likelihood in high dimensions is not feasible, generated samples can be used to infer something about a model’s log-likelihood. The intuition is that a model with maximum likelihood (zero KL divergence) will produce perfect samples.

    The Parzen window approach to density estimation works by taking a finite set of samples generated by a model and then using those as the centroids of a Gaussian mixture. The constructed Parzen window mixture is then used to compute a log-likelihood score on a set of test examples. Wu et al. wu2016quantitative proposed to use annealed importance sampling (AIS) neal2001annealed to estimate log-likelihoods using a Gaussian observation model with a fixed variance. The key drawback of this approach is the assumption of the Gaussian observation model, which may not work well in high-dimensional spaces. They found that AIS is two orders of magnitude more accurate than KDE, and is accurate enough for comparing generative models.

    While likelihood is very intuitive, it suffers from several drawbacks theis2015note :

    1. For a large number of samples, Parzen window estimates fall short in approximating a model’s true log-likelihood when the data dimensionality is high. Even for the fairly low-dimensional space of image patches, they require a very large number of samples to come close to the true log-likelihood of a model. See Fig. 14.B.

    2. Theis et al. theis2015note showed that the likelihood is generally uninformative about the quality of samples and vice versa. In other words, log-likelihood and sample quality are largely unrelated. A model can have poor log-likelihood and produce great samples, or have great log-likelihood and produce poor samples. An example of the former is a mixture of Gaussian distributions whose means are the training images (i.e. akin to a look-up table). Such a model will generate great samples but will still have very poor log-likelihood. An example of the latter is a mixture of a good model with a very low weight and a bad model with a high weight; since the log-likelihood penalty for down-weighting the good model is only the logarithm of its mixture weight, such a model can have a large average log-likelihood yet generate very poor samples (see theis2015note for the proof).

    3. Parzen window estimates of the likelihood produce rankings different from other measures (See Fig. 14.C).

    Due to the above issues, it becomes difficult to answer basic questions such as whether GANs are simply memorizing training examples, or whether they are missing important modes of the data distribution. For further discussions on other drawbacks of average likelihood measures consult huszar2015not .
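    As a concrete illustration of this family of estimates, the following is a minimal sketch (my own, not code from the cited papers) of a Parzen-window log-likelihood evaluation using scikit-learn's KernelDensity with an isotropic Gaussian kernel; the bandwidth and the toy arrays are arbitrary stand-ins.

```python
# Sketch: average log-likelihood of held-out data under a Parzen window
# (Gaussian KDE) fitted to generated samples. Bandwidth and data are
# illustrative placeholders, not values from the paper.
import numpy as np
from sklearn.neighbors import KernelDensity

def parzen_log_likelihood(generated, held_out, bandwidth=0.5):
    """Fit an isotropic-Gaussian Parzen window on generated samples and
    return the mean log-likelihood of held-out samples (nats/sample)."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(generated)                        # generated samples act as mixture centroids
    return float(kde.score_samples(held_out).mean())

rng = np.random.default_rng(0)
generated = rng.normal(size=(2000, 10))       # stand-in for model samples
held_out = rng.normal(size=(500, 10))         # stand-in for real test data
print(parzen_log_likelihood(generated, held_out))
```

    In practice the bandwidth is chosen by cross-validation on a validation split and, as discussed above, the resulting estimates can be far from the true log-likelihood in high dimensions.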

  2. Coverage Metric. Tolstikhin et al. tolstikhin2017adagan proposed to use the probability mass of the real data “covered” by the model distribution as a metric. They compute $C := P_{data}(dP_{model} > t)$ with $t$ chosen such that $P_{model}(dP_{model} > t) = 0.95$. A kernel density estimation method was used to approximate the density of $P_{model}$. They claim that this metric is more interpretable than the likelihood, making it easier to assess the difference in performance of the algorithms.

  3. Inception Score (IS). Proposed by Salimans et al. salimans2016improved , it is perhaps the most widely adopted score for GAN evaluation (e.g. in fedus2017many ). It uses a pre-trained neural network (the Inception Net szegedy2016rethinking trained on ImageNet deng2009imagenet ) to capture the desirable properties of generated samples: they should be highly classifiable and diverse with respect to class labels. It measures the average KL divergence between the conditional label distribution $p(y|\mathbf{x})$ of samples (expected to have low entropy for easily classifiable samples; better sample quality) and the marginal distribution $p(y)$ obtained from all the samples (expected to have high entropy if all classes are equally represented in the set of samples; high diversity). It favors a low entropy of $p(y|\mathbf{x})$ but a large entropy of $p(y)$:

    $\mathrm{IS}(G)=\exp\Big(\mathbb{E}_{\mathbf{x}\sim p_g}\big[\mathrm{KL}\big(p(y|\mathbf{x})\,\|\,p(y)\big)\big]\Big)=\exp\Big(H(y)-\mathbb{E}_{\mathbf{x}\sim p_g}\big[H(y|\mathbf{x})\big]\Big)$    (1)

    where $p(y|\mathbf{x})$ is the conditional label distribution for image $\mathbf{x}$, estimated using a pretrained Inception model szegedy2016rethinking , $p(y)=\frac{1}{N}\sum_{i=1}^{N}p(y|\mathbf{x}^{(i)})$ is the marginal label distribution over the generated samples, and $H(\cdot)$ denotes entropy.

    The Inception score shows a reasonable correlation with the quality and diversity of generated images salimans2016improved . IS over real images can serve as the upper bound. Despite these appealing properties, IS has several limitations:

    1. First, similar to log-likelihood, it favors a “memory GAN” that stores all training samples, thus is unable to detect overfitting (i.e.  can be fooled by generating centers of data modes yang2017lr ). This is aggravated by the fact that it does not make use of a holdout validation set.

    2. Second, it fails to detect whether a model has been trapped in one bad mode (i.e. it is agnostic to mode collapse). Zhou et al. zhou2018activation , however, report results to the contrary.

    3. Third, since IS uses an Inception model trained on ImageNet with many object classes, it may favor models that generate good objects rather than realistic images.

    4. Fourth, IS only considers the generated distribution $p_g$ and ignores the real data distribution. Manipulations such as mixing in natural images from an entirely different distribution could deceive this score. As a result, it may favor models that simply learn sharp and diversified images, rather than models that are close to the real distribution huang2018an (this also applies to the Mode Score).

    5. Fifth, it is an asymmetric measure.

    6. Finally, it is affected by image resolution. See Fig. 2.

    Figure 2: Sensitivity of the Inception score to image resolution. Top: training data and synthesized images from the zebra class, resized to a lower spatial resolution and subsequently resized back to the original resolution. Bottom left: IS across varying spatial resolutions for training data and for model samples; error bars show the standard deviation across 10 subsets of images, and dashed lines highlight the accuracy at the output spatial resolution of each model. Bottom right: comparison of accuracy scores at two spatial resolutions; each point represents an ImageNet class, 84.4% of the classes are below the diagonal, and the green dot corresponds to the zebra class. Figure from odena2016conditional .

    Zhou et al. zhou2018activation provide an interesting analysis of the Inception score. They experimentally measured the two entropy terms of the IS in Eq. 1 during training and showed that $\mathbb{E}_{\mathbf{x}}[H(y|\mathbf{x})]$ behaves as expected (i.e. it decreases) while $H(y)$ does not. See Fig. 3 (top row). They found that CIFAR-10 data are not evenly distributed over the classes under the Inception model trained on ImageNet. See Fig. 3(d). Using the Inception model trained on ImageNet versus one trained on CIFAR-10 results in two different values of $H(y)$. Also, the value of $H(y|\mathbf{x})$ varies for each specific sample in the training data (i.e. some images are deemed less real than others). Further, a mode-collapsed generator usually gets a low Inception score (see Fig. 5 in zhou2018activation ), which is a good sign. Theoretically, in the extreme case when all generated samples collapse to a single point (so that $p(y)=p(y|\mathbf{x})$), the minimal Inception score of 1.0 is achieved. Despite this, it is believed that the Inception score cannot reliably measure whether a model has collapsed. For example, a class-conditional model that simply memorizes one example per ImageNet class will achieve high IS values. Please refer to barratt2018note for further analysis of the Inception score.

    Figure 3: Top: training curves of the Inception Score and its decomposed terms. A) IS during training. B) The first term on the rhs of Eq. 1, $H(y)$, goes down during training, although it is supposed to go up. C) The second term, $\mathbb{E}_{\mathbf{x}}[H(y|\mathbf{x})]$, decreases during training, as expected. Bottom: statistics of the CIFAR-10 training images. D) $p(y)$ over ImageNet classes, E) $H(y|\mathbf{x})$ distribution of each class under an ImageNet classifier, and F) $H(y|\mathbf{x})$ distribution of each class under a CIFAR-10 classifier. Figure compiled from zhou2018activation .
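    To make the computation concrete, here is a minimal sketch (my own, not code from salimans2016improved ) of the Inception score computed from pre-computed softmax outputs $p(y|\mathbf{x})$; obtaining those probabilities from a pretrained Inception network is assumed to happen elsewhere.

```python
# Sketch: Inception Score from an (N, C) array of class probabilities
# p(y|x) produced by a pretrained classifier on N generated images.
import numpy as np

def inception_score(probs, splits=10, eps=1e-12):
    """Returns mean and std of exp(E_x[KL(p(y|x) || p(y))]) over disjoint
    chunks, following the common practice of reporting the score on splits."""
    scores = []
    for chunk in np.array_split(probs, splits):
        p_y = chunk.mean(axis=0, keepdims=True)                       # marginal p(y)
        kl = (chunk * (np.log(chunk + eps) - np.log(p_y + eps))).sum(axis=1)
        scores.append(np.exp(kl.mean()))
    return float(np.mean(scores)), float(np.std(scores))

# Toy usage with random probabilities (real use: softmax outputs of Inception).
probs = np.random.default_rng(0).dirichlet(np.ones(1000), size=5000)
print(inception_score(probs))
```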
  4. Modified Inception Score (m-IS). The Inception score assigns a higher value to models with a low-entropy class-conditional distribution $p(y|\mathbf{x})$ over the generated data. However, it is also desirable to have diversity among the samples within a particular category. To characterize this diversity, Gurumurthy et al. gurumurthy2017deligan suggested using a cross-entropy style score $-p(y|\mathbf{x}_i)\log p(y|\mathbf{x}_j)$, where the $\mathbf{x}_j$ are samples of the same class as $\mathbf{x}_i$ according to the Inception model’s output. Incorporating this term into the original Inception score results in:

    $\text{m-IS}=\exp\Big(\mathbb{E}_{\mathbf{x}_i}\big[\mathbb{E}_{\mathbf{x}_j}\big[\mathrm{KL}\big(p(y|\mathbf{x}_i)\,\|\,p(y|\mathbf{x}_j)\big)\big]\big]\Big)$    (2)

    which is calculated on a per-class basis and is then averaged over all classes. Essentially, m-IS can be viewed as a proxy for measuring both intra-class sample diversity as well as sample quality.

  5. Mode Score. Introduced in che2016mode , this score addresses an important drawback of the Inception score: ignoring the prior distribution of the ground-truth labels (i.e. disregarding the dataset):

    $\mathrm{MS}=\exp\Big(\mathbb{E}_{\mathbf{x}\sim p_g}\big[\mathrm{KL}\big(p(y|\mathbf{x})\,\|\,p(y^{train})\big)\big]-\mathrm{KL}\big(p(y)\,\|\,p(y^{train})\big)\Big)$    (3)

    where $p(y^{train})$ is the empirical distribution of labels computed from the training data and $p(y)$ is the marginal label distribution over the generated samples. The Mode score adequately reflects the variety and visual quality of generated images che2016mode . It has, however, been proved that the Inception and Mode scores are in fact equivalent; see zhou2017inception for the proof.

  6. AM Score. Zhou et al. zhou2018activation argue that the entropy term on $p(y)$ in the Inception score is not suitable when the data are not evenly distributed over classes. To take this into account, they proposed replacing it with the KL divergence between the training label distribution $p(y^{train})$ and $p(y)$. The AM score is then defined as

    $\mathrm{AM}=\mathrm{KL}\big(p(y^{train})\,\|\,p(y)\big)+\mathbb{E}_{\mathbf{x}\sim p_g}\big[H(y|\mathbf{x})\big]$    (4)

    The AM score consists of two terms. The first is minimized when $p(y)$ is close to $p(y^{train})$. The second is minimized when the predicted label distribution for each sample $\mathbf{x}$ (i.e. $p(y|\mathbf{x})$) has low entropy. Thus, the smaller the AM score, the better.

    It has been shown that the Inception score, with the classifier being the Inception model trained on ImageNet, correlates with human evaluation on CIFAR-10. CIFAR-10 data, however, are not evenly distributed over the classes of the ImageNet-trained Inception model, so the entropy term on the average (marginal) distribution in the Inception score may not work well (see Fig. 3). With a pre-trained CIFAR-10 classifier, the AM score can capture the statistics of the average distribution well. The classifier used in the AM score should thus be pre-trained on the dataset at hand.
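    A minimal sketch of the AM score as described above (assuming Eq. 4 and classifier outputs from a model pretrained on the target dataset; the arrays here are placeholders):

```python
# Sketch: AM score = KL(p(y_train) || p(y)) + E_x[H(y|x)], lower is better.
import numpy as np

def am_score(probs, train_label_dist, eps=1e-12):
    """probs: (N, C) class probabilities of generated images under a classifier
    pretrained on the target dataset; train_label_dist: (C,) empirical label
    distribution of the real training data."""
    p_y = probs.mean(axis=0)                                          # marginal over generated data
    kl = float(np.sum(train_label_dist *
                      (np.log(train_label_dist + eps) - np.log(p_y + eps))))
    mean_entropy = float(-(probs * np.log(probs + eps)).sum(axis=1).mean())
    return kl + mean_entropy
```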

  7. Fréchet Inception Distance (FID). Introduced by Heusel et al. heusel2017gans , FID embeds a set of generated samples into a feature space given by a specific layer of the Inception Net (or any CNN). Viewing the embedding layer as a continuous multivariate Gaussian, the mean and covariance are estimated for both the generated data and the real data. The Fréchet distance between these two Gaussians (a.k.a the Wasserstein-2 distance) is then used to quantify the quality of generated samples:

    $\mathrm{FID}(r,g)=\|\mu_r-\mu_g\|_2^2+\mathrm{Tr}\big(\Sigma_r+\Sigma_g-2(\Sigma_r\Sigma_g)^{1/2}\big)$    (5)

    where $(\mu_r,\Sigma_r)$ and $(\mu_g,\Sigma_g)$ are the means and covariances of the real-data and model feature distributions, respectively. A lower FID means a smaller distance between the synthetic and real data distributions.

    FID performs well in terms of discriminability, robustness, and computational efficiency. It appears to be a good measure, even though it only takes into account the first two moments of the distributions. However, it assumes that the features follow a Gaussian distribution, which is often not guaranteed. It has been shown that FID is consistent with human judgments and is more robust to noise than IS heusel2017gans (e.g. there is a negative correlation between FID and the visual quality of generated samples). Unlike IS, however, it is able to detect intra-class mode dropping (on the contrary, Sajjadi et al. sajjadi2018assessing show that FID is sensitive both to the addition of spurious modes and to mode dropping); i.e. a model that generates only one image per class can score a high IS but will have a bad FID. Also, unlike IS, FID worsens as various types of artifacts are added to the images (see Fig. 4). IS and AM scores measure the diversity and quality of generated samples, while FID measures the distance between the generated and real distributions. An empirical analysis of FID can be found in lucic2017gans . See also liu2018improved for a class-aware version of FID.

    Figure 4: FID measure is sensitive to image distortions. From upper left to lower right: Gaussian noise, Gaussian blur, implanted black rectangles, swirled images, salt and pepper noise, and CelebA dataset contaminated by ImageNet images. Figure from heusel2017gans .
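    A minimal sketch of Eq. 5 computed from feature embeddings (e.g. Inception pool3 activations); extracting the features is assumed to happen elsewhere, and the matrix square root uses the standard scipy routine:

```python
# Sketch: Frechet Inception Distance between Gaussians fitted to real and
# generated feature embeddings (Eq. 5).
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)   # (Sigma_r Sigma_g)^(1/2)
    if np.iscomplexobj(covmean):                           # discard tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```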
  8. Maximum Mean Discrepancy (MMD). This measure computes the dissimilarity between two probability distributions $P_r$ and $P_g$ using samples drawn independently from each fortet1953convergence (distinguishing two distributions from finite samples is known as a two-sample test in statistics). A lower MMD hence means that $P_g$ is closer to $P_r$. MMD can be regarded as two-sample testing since, as in the classifier two-sample test, it tests whether one model or another is closer to the true data distribution muandet2017kernel ; sutherland2016generative ; bounliphone2015test . Such hypothesis tests allow choosing one evaluation measure over another.

    The kernel MMD gretton2012kernel measures the (squared) MMD between $P_r$ and $P_g$ for some fixed characteristic kernel function $k$ (e.g. the Gaussian kernel $k(\mathbf{x},\mathbf{x}')=e^{-\|\mathbf{x}-\mathbf{x}'\|^2/2\sigma^2}$) as follows (note that here $\mathbf{y}$ denotes a generated sample, not a class label):

    $M_k(P_r,P_g)=\mathbb{E}_{\mathbf{x},\mathbf{x}'\sim P_r}\big[k(\mathbf{x},\mathbf{x}')\big]-2\,\mathbb{E}_{\mathbf{x}\sim P_r,\,\mathbf{y}\sim P_g}\big[k(\mathbf{x},\mathbf{y})\big]+\mathbb{E}_{\mathbf{y},\mathbf{y}'\sim P_g}\big[k(\mathbf{y},\mathbf{y}')\big]$    (6)

    In practice, finite samples from the two distributions are used to estimate the MMD. Given $\mathbf{x}_1,\dots,\mathbf{x}_n\sim P_r$ and $\mathbf{y}_1,\dots,\mathbf{y}_m\sim P_g$, one estimator of $M_k(P_r,P_g)$ is:

    $\hat{M}_k=\frac{1}{n(n-1)}\sum_{i\ne i'}k(\mathbf{x}_i,\mathbf{x}_{i'})-\frac{2}{nm}\sum_{i,j}k(\mathbf{x}_i,\mathbf{y}_j)+\frac{1}{m(m-1)}\sum_{j\ne j'}k(\mathbf{y}_j,\mathbf{y}_{j'})$    (7)

    Because of sampling variance, $\hat{M}_k$ may not be zero even when $P_r=P_g$. Li et al. li2017mmd put forth a remedy to address this. Kernel MMD works surprisingly well when it operates in the feature space of a pre-trained CNN. It is able to distinguish generated images from real images, and both its sample complexity and computational complexity are low huang2018an .

    Kernel MMD has also been used for training GANs. For example, the Generative Moment Matching Network (GMMN) li2015generative ; dziugaite2015training ; li2017mmd replaces the discriminator in GAN with a two-sample test based on kernel MMD. See also binkowski2018demystifying for more analyses on MMD and its use in GAN training.
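    A minimal sketch of the unbiased estimator in Eq. 7 with a Gaussian kernel (the bandwidth is an arbitrary placeholder; in practice the kernel typically operates on CNN features rather than raw pixels):

```python
# Sketch: unbiased squared kernel MMD between two sample sets (Eq. 7).
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    sq_a = (a ** 2).sum(axis=1)
    sq_b = (b ** 2).sum(axis=1)
    d2 = np.maximum(sq_a[:, None] + sq_b[None, :] - 2.0 * a @ b.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    n, m = len(x), len(y)
    k_xx = gaussian_kernel(x, x, sigma)
    k_yy = gaussian_kernel(y, y, sigma)
    k_xy = gaussian_kernel(x, y, sigma)
    term_x = (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))   # drop i = i' terms
    term_y = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    return float(term_x + term_y - 2.0 * k_xy.mean())
```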

  9. The Wasserstein Critic. The Wasserstein critic arjovsky2017wasserstein provides an approximation of the Wasserstein distance between the real data distribution $P_r$ and the generator distribution $P_g$:

    $W(P_r,P_g)=\sup_{\|f\|_L\le 1}\ \mathbb{E}_{\mathbf{x}\sim P_r}\big[f(\mathbf{x})\big]-\mathbb{E}_{\mathbf{x}\sim P_g}\big[f(\mathbf{x})\big]$    (8)

    where $f$ is a 1-Lipschitz continuous function. In practice, the critic is a neural network with clipped weights so that it has bounded derivatives. It is trained to produce high values at real samples and low values at generated samples (i.e. $\hat{W}$ below is an approximation):

    $\hat{W}(\mathbf{x}_{test},\mathbf{x}_g)=\frac{1}{N}\sum_{i=1}^{N}\hat{f}\big(\mathbf{x}_{test}[i]\big)-\frac{1}{N}\sum_{i=1}^{N}\hat{f}\big(\mathbf{x}_g[i]\big)$    (9)

    where $\mathbf{x}_{test}$ is a batch of samples from a test set, $\mathbf{x}_g$ is a batch of generated samples, and $\hat{f}$ is the independent critic. For discrete distributions, the Wasserstein distance is often referred to as the Earth Mover’s Distance (EMD), which intuitively is the minimum mass displacement required to transform one distribution into the other. A variant of this score known as the sliced Wasserstein distance (SWD) approximates the Wasserstein-1 distance between real and generated images, and is computed as the statistical similarity between local image patches extracted from Laplacian pyramid representations of these images karras2017progressive . We will discuss SWD in more detail later, under scores that utilize low-level image statistics.

    This measure addresses both overfitting and mode collapse. If the generator memorizes the training set, the critic trained on test data can distinguish between samples and data. If mode collapse occurs, the critic will have an easy job in distinguishing between data and samples. Further, it does not saturate when the two distributions do not overlap. The magnitude of the distance indicates how easy it is for the critic to distinguish between samples and data.

    The Wasserstein distance works well when the base distance is computed in a suitable feature space. A key limitation of this distance is its high sample and time complexity. These make Wasserstein distance less appealing as a practical evaluation measure, compared to other ones (See arora2017gans ).
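    As a sketch of how the final score in Eq. 9 is obtained once an independent critic has been trained (the critic itself, the expensive part, is assumed given; the toy linear critic below is purely illustrative):

```python
# Sketch: Wasserstein-critic score (Eq. 9) given a trained, approximately
# 1-Lipschitz critic function mapping a batch of samples to scalar scores.
import numpy as np

def wasserstein_critic_score(critic_fn, x_test, x_gen):
    return float(np.mean(critic_fn(x_test)) - np.mean(critic_fn(x_gen)))

# Purely illustrative linear "critic" on 2-D data.
w = np.array([1.0, -0.5])
critic_fn = lambda batch: batch @ w
rng = np.random.default_rng(0)
x_test = rng.normal(loc=1.0, size=(64, 2))   # held-out real batch
x_gen = rng.normal(loc=0.0, size=(64, 2))    # generated batch
print(wasserstein_critic_score(critic_fn, x_test, x_gen))
```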

  10. Birthday Paradox Test. This test approximates the support size of a discrete distribution (the support of a real-valued function is the subset of the domain containing those elements which are not mapped to zero). Arora and Zhang arora2017gans proposed to use the birthday paradox to evaluate GANs (the “birthday theorem” states that, with probability at least 50%, a uniform sample drawn with replacement from a set of $N$ elements will contain a duplicate once the sample size is on the order of $\sqrt{N}$). The test proceeds as follows:

    1. Pick a sample of size $s$ from the generated distribution

    2. Use an automated measure of image similarity to flag the most similar pairs (e.g. the 20 closest) in the sample

    3. Visually inspect the flagged pairs and check for duplicates

    4. Repeat.

    The suggested plan is to manually check for duplicates in a sample of size $s$. If a duplicate exists, then the estimated support size is about $s^2$. It is not possible to find exact duplicates, as the distribution of generated images is continuous. Instead, a distance measure can be used to find near-duplicates (e.g. using the Euclidean norm). In practice, they first created a candidate pool of potential near-duplicates by choosing the 20 closest pairs according to some heuristic measure, and then visually identified the near-duplicates. Following this procedure and using Euclidean distance in pixel space, Arora and Zhang arora2017gans found that, with probability of at least 50%, a modestly sized batch of samples generated from the CelebA dataset liu2015faceattributes contains at least one pair of near-duplicates for both DCGAN and MIX+DCGAN (thus giving a support-size estimate of roughly the square of that batch size). The birthday theorem assumes uniform sampling. Arora and Zhang arora2017gans , however, claim that the birthday paradox holds even if data are distributed in a highly nonuniform way. This test can be used to detect mode collapse in GANs.
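    A minimal sketch of steps 1-2, flagging candidate near-duplicate pairs by Euclidean distance in pixel space for subsequent visual inspection (the batch size and the number of flagged pairs are arbitrary placeholders):

```python
# Sketch: flag the closest pairs in a batch of generated images so that a
# human can visually check them for near-duplicates (birthday paradox test).
import numpy as np

def near_duplicate_candidates(samples, top_k=20):
    flat = samples.reshape(len(samples), -1).astype(np.float64)
    sq = (flat ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * flat @ flat.T, 0.0)
    i, j = np.triu_indices(len(samples), k=1)          # unique unordered pairs
    order = np.argsort(d2[i, j])[:top_k]
    return [(int(i[o]), int(j[o]), float(np.sqrt(d2[i[o], j[o]]))) for o in order]

batch = np.random.default_rng(0).random((256, 32, 32, 3))   # stand-in for samples
print(near_duplicate_candidates(batch, top_k=5))
```

    If visual inspection of a batch of size $s$ reveals a near-duplicate with probability around 50%, the support size is estimated to be on the order of $s^2$.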

  11. Classifier Two-sample Tests (C2ST). The goal of two-sample tests is to assess whether two samples are drawn from the same distribution lehmann2006testing ; in other words, to decide whether two probability distributions, denoted by P and Q, are equal. The generator is evaluated on a held-out test set. This set is split into a test-train and a test-test subset. The test-train set is used to train a fresh discriminator, which tries to distinguish generated images from real images. Afterwards, the final score is computed as the performance of this new discriminator on the test-test set and freshly generated images. More formally, assume we have access to two samples $S_P=\{\mathbf{x}_1,\dots,\mathbf{x}_n\}\sim P$ and $S_Q=\{\mathbf{y}_1,\dots,\mathbf{y}_n\}\sim Q$. To test whether the null hypothesis $P=Q$ is true, these five steps need to be completed:

    1. Construct the dataset $D=\{(\mathbf{x}_i,0)\}_{i=1}^{n}\cup\{(\mathbf{y}_i,1)\}_{i=1}^{n}=\{(\mathbf{z}_i,l_i)\}_{i=1}^{2n}$

    2. Randomly shuffle $D$, and split it into two disjoint training and testing subsets $D_{tr}$ and $D_{te}$, where $D=D_{tr}\cup D_{te}$ and $n_{te}:=|D_{te}|$.

    3. Train a binary classifier $f$ on $D_{tr}$. In the following, assume that $f(\mathbf{z}_i)$ is an estimate of the conditional probability $p(l_i=1\,|\,\mathbf{z}_i)$.

    4. Calculate the classification accuracy on $D_{te}$:

      $\hat{t}=\frac{1}{n_{te}}\sum_{(\mathbf{z}_i,l_i)\in D_{te}}\mathbb{I}\Big[\mathbb{I}\big(f(\mathbf{z}_i)>\tfrac{1}{2}\big)=l_i\Big]$    (10)

      This is the C2ST statistic, where $\mathbb{I}$ is the indicator function. The intuition here is that if $P=Q$, the test accuracy in Eq. 10 should remain near chance level. If, in contrast, the binary classifier performs better than chance, this implies that $P\ne Q$.

    5. To accept or reject the null hypothesis, compute a $p$-value using the null distribution of the C2ST statistic.

    In principle, any binary classifier can be adopted for computing the C2ST. Huang et al. huang2018an introduce a variation of this measure based on the 1-Nearest Neighbor (1-NN) classifier. The advantage of using 1-NN over other classifiers is that it requires no special training and little hyperparameter tuning. Given a set of real samples $S_r$ and a set of generated samples $S_g$ of the same size ($|S_r|=|S_g|$), one can compute the leave-one-out (LOO) accuracy of a 1-NN classifier trained on $S_r\cup S_g$ with positive labels for $S_r$ and negative labels for $S_g$. The LOO accuracy can vary from 0% to 100%. If the GAN memorizes the samples in $S_r$ and regenerates them exactly, i.e. $S_g=S_r$, then the accuracy would be 0%, because every sample from $S_g$ would have its nearest neighbor in $S_r$ at zero distance (and vice versa). If it generates samples that are widely different from the real images (and thus completely separable), then the accuracy would be 100%. Notice that chance level here is 50%, which happens when labels are randomly assigned to images. Lopez-Paz and Oquab revisit classifier two-sample tests in lopez2016revisiting .

    Classifier two-sample tests can be considered an alternative form of two-sample test to MMD. MMD has the advantage of a U-statistic estimator with a Gaussian asymptotic distribution, while the classifier two-sample test has a different form. MMD can be preferable when the U-statistic convergence outweighs the potentially more powerful classifier (e.g. a deep network), while a classifier-based test can be preferable if the classifier is better than the choice of kernel.
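    A minimal sketch of the 1-NN variant, computing the leave-one-out accuracy with scikit-learn (the feature arrays are placeholders; in practice images or their CNN embeddings are used):

```python
# Sketch: 1-NN classifier two-sample test. ~50% LOO accuracy is ideal,
# ~0% suggests memorization of the real set, ~100% suggests the generated
# samples are easily separable from the real ones.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def nn_two_sample_accuracy(real, gen):
    x = np.vstack([real, gen])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(gen))])
    clf = KNeighborsClassifier(n_neighbors=1)
    return float(cross_val_score(clf, x, y, cv=LeaveOneOut()).mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 16))
gen = rng.normal(size=(200, 16)) + 0.5        # hypothetical generated features
print(nn_two_sample_accuracy(real, gen))
```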

  12. Classification Performance. One common indirect technique for evaluating the quality of unsupervised representation learning algorithms is to apply them as feature extractors on labeled datasets and evaluate the performance of linear models fitted on top of the learned features. For example, to evaluate the quality of the representations learned by DCGANs, Radford et al. radford2015unsupervised trained their model on ImageNet and then used the discriminator’s convolutional features from all layers to train a regularized linear L2-SVM to classify CIFAR-10 images. They achieved 82.8% accuracy, on par with or better than several baselines trained directly on CIFAR-10 data.

    A similar strategy has also been followed in evaluating conditional GANs (e.g.  the ones proposed for style transfer). For example, an off-the-shelf classifier is utilized by Zhang et al.  zhang2016colorful to assess the realism of synthesized images. They fed their fake colorized images to a VGG network that was trained on real color photos. If the classifier performs well, this indicates that the colorizations are accurate enough to be informative about object class. They call this “semantic interpretability”. Similarly, Isola et al.  isola2017image proposed the “FCN score” to measure the quality of the generated images conditioned on an input segmentation map. They fed the generated images to the fully-convolutional semantic segmentation network (FCN) long2015fully and then measured the error between the output segmentation map and the ground truth segmentation mask.

    Ye et al. ye2018gan proposed an objective measure known as the GAN Quality Index (GQI) to evaluate GANs. First, a generator is trained on a labeled real dataset. Next, a classifier $C_{real}$ is trained on the real dataset. The generated images are then fed to this classifier to obtain labels. A second classifier, called the GAN-induced classifier $C_{GAN}$, is trained on the generated data with these labels. Finally, the GQI is defined as the ratio of the accuracies of the two classifiers:

    $\mathrm{GQI}=\frac{\mathrm{ACC}(C_{GAN})}{\mathrm{ACC}(C_{real})}\times 100$    (11)

    GQI is an integer in the range of 0 to 100. Higher GQI means that the GAN distribution better matches the real data distribution.

    Data Augmentation Utility: Some works measure the utility of GANs for generating additional training samples. This can be interpreted as a measure of the diversity of the generated images. Similar to Ye et al. ye2018gan , Lesort et al. lesort2018evaluation proposed to use a mixture of real and generated data to train a classifier and then test it on a labeled test dataset. The result is then compared with the score of the same classifier trained on the real training data mixed with noise. Along the same line, recently, Shmelkov et al. shmelkov2018good proposed to compare class-conditional GANs with GAN-train and GAN-test scores using a neural net classifier. GAN-train is a network trained on GAN generated images and is evaluated on real-world images. GAN-test, on the other hand, is the accuracy of a network trained on real images and evaluated on the generated images. They analyzed the diversity of the generated images by evaluating GAN-train accuracy with varying amounts of generated data. The intuition is that a model with low diversity generates redundant samples, and thus increasing the quantity of data generated in this case does not result in better GAN-train accuracy. In contrast, generating more samples from a model with high diversity produces a better GAN-train score.

    The above-mentioned measures are indirect and rely heavily on the choice of the classifier. Nonetheless, they are useful for evaluating generative models, based on the notion that a better generative model should yield better representations for surrogate tasks (e.g. supervised classification). This, however, does not necessarily imply that the generated images have high diversity.
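    A minimal sketch of the GAN-train / GAN-test idea of shmelkov2018good , with a simple logistic-regression classifier standing in for the CNN used in the original work (all data arrays are placeholders):

```python
# Sketch: GAN-train = accuracy of a classifier trained on generated data and
# evaluated on real data; GAN-test = accuracy of a classifier trained on real
# data and evaluated on generated data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def gan_train_gan_test(x_real_tr, y_real_tr, x_real_te, y_real_te, x_gen, y_gen):
    clf_gen = LogisticRegression(max_iter=1000).fit(x_gen, y_gen)
    clf_real = LogisticRegression(max_iter=1000).fit(x_real_tr, y_real_tr)
    gan_train = clf_gen.score(x_real_te, y_real_te)   # generated -> real
    gan_test = clf_real.score(x_gen, y_gen)           # real -> generated
    return float(gan_train), float(gan_test)
```

    A diverse generator should keep improving GAN-train accuracy as more generated data are used, whereas a low-diversity generator produces redundant samples and its GAN-train accuracy saturates.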

  13. Boundary Distortion. Santurkar et al. santurkar2018classification aimed to measure the diversity of generated samples using classification methods. This phenomenon can be viewed as a form of covariate shift in GANs, wherein the generator concentrates a large probability mass on a few modes of the true distribution. It is illustrated using two toy examples in Fig. 5. The first example regards learning a unimodal spherical Gaussian distribution using a vanilla GAN goodfellow2014generative . As can be seen in Fig. 5.A, the spectrum (eigenvalues of the covariance matrix) of the GAN data shows a decaying behavior (unlike the true data). The second example considers binary classification using logistic regression, where the true distribution for each class is a unimodal spherical Gaussian. The synthetic distribution for one of the classes undergoes boundary distortion, which causes a skew between the classifiers trained on true and synthetic data (Fig. 5.B). Naturally, such errors would lead to poor generalization performance on true data as well. Taken together, these examples show that a) boundary distortion is a form of covariate shift that GANs can realistically introduce, and b) this form of diversity loss can be detected and quantified even using classification.

    Specifically, Santurkar et al. proposed the following method to measure boundary distortion introduced by a GAN:

    1. Train two separate instances of the given unconditional GAN, one for each class in the true dataset D (assume two classes).

    2. Generate a balanced dataset by drawing N/2 samples from each of these GANs.

    3. Train a binary classifier based on the labeled GAN dataset obtained in Step 2 above.

    4. Train an identical, in terms of architecture and hyperparameters, classifier on the true data D for comparison.

    Afterwards, the performance of both classifiers is measured on a hold-out set of true data. Performance of the classifier trained on synthetic data on this set acts as a proxy measure for diversity loss through covariate shift. Notice that this measure is akin to the classification performance discussed above.

    Figure 5: A) Spectrum of the learned distribution of a vanilla GAN (with a 200-D latent space), compared to that of the true distribution (a 75-D spherical unimodal Gaussian). B) An example of covariate shift between synthetic and true distributions leading to a distortion in the learned decision boundary of a (linear) logistic regression classifier. Here the synthetic distribution for one class suffers from boundary distortion. Figure compiled from santurkar2018classification .
  14. Number of Statistically-Different Bins (NDB). To measure the diversity of generated samples and mode collapse, Richardson and Weiss Richardson2018 propose an evaluation method based on the following observation: given two sets of samples from the same distribution, the number of samples that fall into a given bin should be the same up to sampling noise. More formally, let $I_B(\mathbf{x})$ be the indicator function for bin $B$: $I_B(\mathbf{x})=1$ if the sample falls into the bin and zero otherwise. Let $\{\mathbf{x}^p_i\}_{i=1}^{N_p}$ be samples from distribution $P$ (e.g. training samples) and $\{\mathbf{x}^q_j\}_{j=1}^{N_q}$ be samples from distribution $Q$ (e.g. test samples); then, if $P=Q$, the bin proportions $\hat{p}=\frac{1}{N_p}\sum_i I_B(\mathbf{x}^p_i)$ and $\hat{q}=\frac{1}{N_q}\sum_j I_B(\mathbf{x}^q_j)$ are expected to be similar. The pooled sample proportion $\hat{\pi}=\frac{N_p\hat{p}+N_q\hat{q}}{N_p+N_q}$ (the proportion of the joined sets that falls into $B$) and its standard error $SE=\sqrt{\hat{\pi}(1-\hat{\pi})\big(\tfrac{1}{N_p}+\tfrac{1}{N_q}\big)}$ are calculated. The test statistic is the $z$-score $z=(\hat{p}-\hat{q})/SE$, where $\hat{p}$ and $\hat{q}$ are the proportions from each sample that fall into bin $B$. If the corresponding $p$-value is smaller than a threshold (i.e. the significance level), then the bin is declared statistically different. This test is performed on all bins, and the number of statistically-different bins (NDB) is then reported.

    To perform binning, one option is to use a uniform grid. The drawback is that in high dimensions a randomly chosen bin in a uniform grid is very likely to be empty. Richardson and Weiss proposed to use Voronoi cells to guarantee that each bin will contain some samples. Fig. 6 demonstrates this procedure on a two-dimensional toy example. To define the Voronoi cells, the training samples are clustered into K clusters using K-means, so that each training sample is assigned to one of the cells (bins). Each generated sample is then assigned to the nearest of the K centroids.

    Unlike IS and FID, the NDB measure is applied directly to image pixels rather than to pre-learned deep representations. This makes NDB domain agnostic and sensitive to different image artifacts (as opposed to measures using pre-trained deep models). One advantage of NDB over MS-SSIM and the Birthday Paradox Test is that NDB offers a measure between the data and generated distributions, rather than just measuring the general diversity of the generated samples. One concern regarding NDB is that using distance in pixel space as a measure of similarity may not be meaningful.

    Figure 6: Illustration of the NDB evaluation method on a two-dimensional toy example. Top-left: the training data (blue) and binning result, i.e. Voronoi cells (numbered by bin size). Bottom-left: samples (red) drawn from a GAN trained on the data. Right: comparison of bin proportions between the training data and the GAN samples. Black lines = standard error (SE) values. Figure from Richardson2018 .
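    A minimal sketch of the NDB procedure described above, with K-means Voronoi cells as bins and a two-proportion z-test per bin (the number of bins and the significance level are placeholders):

```python
# Sketch: Number of statistically-different bins (NDB) between a training set
# and a set of generated samples, using K-means cells of the training data as bins.
import numpy as np
from scipy.stats import norm
from sklearn.cluster import KMeans

def ndb(train, generated, k=100, alpha=0.05):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train)
    p = np.bincount(km.labels_, minlength=k) / len(train)            # train bin proportions
    q = np.bincount(km.predict(generated), minlength=k) / len(generated)
    n, m = len(train), len(generated)
    pooled = (p * n + q * m) / (n + m)
    se = np.sqrt(pooled * (1.0 - pooled) * (1.0 / n + 1.0 / m))
    z = np.abs(p - q) / np.maximum(se, 1e-12)
    return int(np.sum(z > norm.ppf(1.0 - alpha / 2.0)))              # count of differing bins
```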
  15. Image Retrieval Performance. Wang et al.  wang2016ensembles proposed an image retrieval measure to evaluate GANs. The main idea is to investigate images in the dataset that are badly modeled by a network. Images from a held-out test set as well as generated images are represented using a discriminatively trained CNN lecun1998gradient . The nearest neighbors of generated images in the test dataset are then retrieved. To evaluate the quality of the retrieval results, they proposed two measures:

    1. Measure 1: Consider $d^M_i$ to be the distance of the nearest image generated by method $M$ to test image $i$, and $D^M$ the set of $k$-nearest distances to all test images ($k$ is often set to 1). The Wilcoxon signed-rank test is then used to test the hypothesis that the median of the difference between the nearest-distance distributions of two generators is zero, in which case the generators are equally good (i.e. the median of the distribution $D^{M_1}-D^{M_2}$ is zero). If they are not equal, the test can be used to assess which method is statistically better.

    2. Measure 2: Consider $D^{train}$ to be the distribution of the nearest distances of the train images to the test dataset. Since the train and test sets are drawn from the same dataset, this distribution can be considered the optimal distribution that a generator could attain (assuming it generates as many images as are present in the train set). To model the difference from this ideal distribution, the relative increase in mean nearest-neighbor distance is computed as:

      $R^M=\frac{\frac{1}{N}\sum_{i=1}^{N}d^M_i-\frac{1}{N}\sum_{i=1}^{N}d^{train}_i}{\frac{1}{N}\sum_{i=1}^{N}d^{train}_i}$    (12)

      where $N$ is the size of the test dataset. As an example, $R^M=0.1$ for a model means that the average distance to the nearest neighbor of a query image is 10% higher than for data drawn from the real distribution.

  16. Generative Adversarial Metric (GAM). Im et al. im2016generating proposed to compare two GANs by having them engage in a battle against each other, swapping discriminators or generators across the two models (see Fig. 7). GAM measures the relative performance of two GANs by measuring the likelihood ratio of the two models. Consider two GANs with their respective trained partners, $M_1=(G_1,D_1)$ and $M_2=(G_2,D_2)$, where $G_1$ and $G_2$ are the generators and $D_1$ and $D_2$ are the discriminators. The hypothesis is that $M_1$ is better than $M_2$ if $G_1$ fools $D_2$ more than $G_2$ fools $D_1$, and vice versa for the opposite hypothesis. The likelihood ratio is defined as:

    $r=\frac{p\big(\mathbf{x}\,|\,y=1;\,M_1^{swap}\big)}{p\big(\mathbf{x}\,|\,y=1;\,M_2^{swap}\big)}$    (13)

    where $M_1^{swap}=(G_1,D_2)$ and $M_2^{swap}=(G_2,D_1)$ are the swapped pairs, $p(\mathbf{x}\,|\,y=1;\,M)$ is the likelihood of $\mathbf{x}$ being generated from the data distribution under model $M$, and $y=1$ indicates that the discriminator judges $\mathbf{x}$ to be a real sample.

    Then, one can measure which generator fools the opponent’s discriminator more by comparing the error rates of the swapped pairs. To do so, Im et al. proposed a sample ratio test to declare a winner or a tie.

    A variation of GAM known as generative multi-adversarial metric (GMAM), that is amenable to training with multiple discriminators, has been proposed in durugkar2016generative .

    GAM suffers from two main caveats: a) it requires the two discriminators to have approximately similar performance on a calibration dataset, which can be difficult to satisfy in practice, and b) it is expensive to compute, because it has to be evaluated for all pairs of models (i.e. pairwise comparisons between independently trained GANs).

    Figure 7: Illustration of the Generative Adversarial Metric (GAM). During the training phase, $G_1$ and $G_2$ compete with $D_1$ and $D_2$, respectively. At test time, model $M_1$ plays against $M_2$ by having $G_1$ try to fool $D_2$ and $G_2$ try to fool $D_1$. $M_1$ is better than $M_2$ if $G_1$ fools $D_2$ more than $G_2$ fools $D_1$ (and vice versa). Figure from im2016generating .
  17. Tournament Win Rate and Skill Rating. Inspired by GAM and GMAM scores (mentioned above) as well as skill rating systems in games such as chess or tennis, Olsson et al. olsson2018skill utilized tournaments between generators and discriminators for GAN evaluation. They introduced two methods for summarizing tournament outcomes: tournament win rate and skill rating. Evaluations are useful in different contexts, including a) monitoring the progress of a single model as it learns during the training process, and b) comparing the capabilities of two different fully trained models. The former regards a single model playing against past and future versions of itself producing a useful measure of training progress (a.k.a within trajectory tournament). The latter regards multiple separate models (using different seeds, hyperparameters, and architectures) and provides a useful relative comparison between different trained GANs (a.k.a multiple trajectory tournament). Each player in a tournament is either a discriminator that attempts to distinguish between real and fake data or a generator that attempts to fool the discriminators into accepting fake data as real.

    Tournament Win Rate: To determine the outcome of a match between a discriminator and a generator, the discriminator judges two batches: one batch of samples from the generator, and one batch of real data. Every sample that is not judged correctly by the discriminator (e.g. a generated sample judged as real, or a real sample judged as fake) counts as a win for the generator and is used to compute its win rate. A match win rate of 0.5 means that the discriminator’s performance against that generator is no better than chance. The tournament win rate for a generator is computed as its average win rate over all discriminators in the tournament. Tournament win rates are interpretable only within the context of the tournament they were produced from, and cannot be directly compared with those from other tournaments.

    Olsson et al. ran a tournament between 20 saved checkpoints of discriminators and generators from the same training run of a DCGAN trained on SVHN netzer2011reading , using an evaluation batch size of 64. Fig. 8.A shows the raw tournament outcomes from the within-trajectory tournament, alongside the same tournament outcomes summarized using tournament win rate and skill rating, as well as the SVHN classifier score and SVHN Fréchet distance computed from 10,000 samples for comparison (to compute these scores, a pre-trained SVHN classifier is used rather than an ImageNet classifier). It shows that tournament win rate and skill rating both provide a measure of training progress comparable to the SVHN classifier score.

    Skill Rating: Here the idea is to use a skill rating system to summarize tournament outcomes in a way that takes into account the amount of new information each match provides. Olsson et al. used the Glicko2 system glickman1995comprehensive . In a nutshell, a player’s skill rating is represented as a Gaussian distribution, with a mean and standard deviation, representing the current state of the evidence about their “true” skill rating. See olsson2018skill for details of the algorithm.

    Olsson et al. constructed a tournament from saved snapshots from six SVHN GANs that differ slightly from one another, including different loss functions and architectures. They included 20 saved checkpoints of discriminators and generators from each GAN experiment, a single snapshot of 6-auto, and a generator player that produces batches of real data as a benchmark. Fig. 8.B shows the results compared to Inception and Fréchet distances.

    One advantage of these scores is that they are not limited to fixed feature sets; players can learn to attend to any features that are useful to win. Another advantage is that human judges are eligible to play as discriminators and could participate to receive a skill rating. This allows a principled method for incorporating human perceptual judgments into model evaluation. The downside is that they provide a relative rather than an absolute score of a model’s ability, making results challenging and expensive to reproduce.

    Figure 8: A) A within-trajectory tournament. The left panel shows raw tournament outcomes: each pixel represents the average win rate between one generator and one discriminator from different iterations, and brighter pixel values represent stronger generator performance. The right panel compares the tournament summary measures to the SVHN classifier score; tournament win rate in this figure is the column-wise average of the pixel values in the heatmap. B) Multiple-trajectory tournament outcomes among six models and real data. The tournament contains SVHN generator and discriminator snapshots from models with different seeds, hyperparameters, and architectures. Models are evaluated using the SVHN classifier score (left), SVHN Fréchet distance (center), and the skill rating method (right). Each point represents the score of one iteration of one model. The overall trajectories show the improvement of each model with increasing training. Note the inverted y-axis on the Fréchet distance plot, such that lower distance (better quality) is plotted higher. The learning curves produced by skill rating broadly agree with those produced by Fréchet distance, and disagree with the classifier score only in the case of the conditional models 4-cond and 5-cond. Figure compiled from olsson2018skill . Please see text for more details on these experiments.
  18. Normalized Relative Discriminative Score (NRDS). The main idea behind this measure proposed by Zhang et al.  zhang2018decoupled is that more epochs would be needed to distinguish good generated samples from real samples (compared to separating poor ones from real samples). They used a binary classifier (discriminator) to separate the real samples from fake ones generated by all the models in comparison. In each training epoch, the discriminator’s output for each sample is recorded. The average discriminator output of real samples will increase with epoch (approaching 1), while that of generated samples from each model will decrease (approaching 0). However, the decrement rate of each model varies based on how close the generated samples are to the real ones. The samples closer to real ones show slower decrement rate whereas poor samples will show a faster decrement rate. Therefore, comparing the “decrement rate” of each model can be an indication of how well it performs relative to other models.

    There are three steps to compute the NRDS:

    1. Obtain the curve $C_i$ of the discriminator’s average output versus epoch (or mini-batch) for each model $i$ (assuming $n$ models in comparison) during training,

    2. Compute the area under each curve, $A(C_i)$ (reflecting the decrement rate), and

    3. Compute the NRDS of the $i$-th model as

      $\mathrm{NRDS}_i=\frac{A(C_i)}{\sum_{j=1}^{n}A(C_j)}$    (14)

    The higher the NRDS, the better. Fig. 9 illustrates the computation of NRDS over a toy example.

    Figure 9: A) Illustration of NRDS. $G_i$ indicates the $i$-th generative model, and its corresponding fake samples, Fake$_i$, are sampled randomly. The fake samples from all $n$ models, as well as the real samples, are used to train the binary classifier (bottom left). Testing uses only fake samples and alternates with the training process. The bottom right shows an example of the average discriminator output for the fake samples of each model. B) A toy example of computing NRDS. Left: real and fake samples randomly sampled from 2D normal distributions with different means but the same (identity) covariance. The real samples (blue circles) have zero mean. The red “x” and yellow “+” denote fake samples with means of [0.5, 0] and [1.5, 0], respectively. The notation fake-close (fake-far) indicates that the mean of the corresponding fake samples is close to (far from) that of the real samples. Right: curves of epoch vs. the averaged discriminator output on the corresponding sets (colors) of samples. In this example, the area under the curve for fake-close is larger than that for fake-far, so by Eq. 14 the model generating fake-close obtains the higher NRDS and is relatively better. Figure compiled from zhang2018decoupled .
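    A minimal sketch of Eq. 14, assuming the per-model curves of average discriminator output versus epoch have already been recorded during training (the toy exponential curves below are placeholders):

```python
# Sketch: Normalized Relative Discriminative Score from recorded curves of the
# discriminator's average output on each model's fake samples (Eq. 14).
import numpy as np

def nrds(curves, epochs=None):
    areas = np.array([np.trapz(c, x=epochs) for c in curves])   # area under each curve
    return areas / areas.sum()

# Toy example: a "close" model whose outputs decay slowly vs. a "far" model.
t = np.arange(20)
fake_close = 0.5 * np.exp(-0.05 * t)
fake_far = 0.5 * np.exp(-0.5 * t)
print(nrds([fake_close, fake_far], epochs=t))   # the larger value indicates the better model
```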
  19. Adversarial Accuracy and Adversarial Divergence. Yang et al. yang2017lr proposed two measures based on the intuition that a sufficient, but not necessary, condition for closeness of the generated data distribution $p_g(\mathbf{x})$ and the real data distribution $p_{data}(\mathbf{x})$ is the closeness of $p_g(\mathbf{x}|y)$ and $p_{data}(\mathbf{x}|y)$, i.e. the distributions of generated and real data conditioned on all possible variables of interest $y$, e.g. category labels. One way to obtain the variable of interest is to ask human participants to annotate images sampled from $p_g$ and $p_{data}$.

    Since it is not feasible to directly compare $p_g(\mathbf{x}|y)$ and $p_r(\mathbf{x}|y)$, they proposed to compare $p_g(y|\mathbf{x})$ and $p_r(y|\mathbf{x})$ instead (following Bayes' rule), which is a much easier task. Two classifiers are then trained on the human annotations to approximate $p_g(y|\mathbf{x})$ and $p_r(y|\mathbf{x})$ for different categories. These classifiers are used to compute the following evaluation measures:

    1. Adversarial Accuracy: Computes the classification accuracies achieved by the two classifiers on a validation set (i.e.  another set of real images). If $p_g(y|\mathbf{x})$ is close to $p_r(y|\mathbf{x})$, then similar accuracies are expected.

    2. Adversarial Divergence: Computes the KL divergence between $p_g(y|\mathbf{x})$ and $p_r(y|\mathbf{x})$. The lower the adversarial divergence, the closer the two distributions. The lower bound of this measure is exactly zero, which is attained when $p_g(y|\mathbf{x}) = p_r(y|\mathbf{x})$ for all samples in the validation set.

    One drawback of these measures is that a lot of human effort is needed to label the real and generated samples. To mitigate this, Yang et al.  yang2017lr first trained one generator per category using a labeled training set and then generated samples from all categories. Notice that these measures overlap with classification performance discussed above.
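    As a rough sketch of the two measures, assume two label classifiers have already been trained, one on annotated real images and one on annotated generated images; the scikit-learn classifiers and synthetic features below are illustrative stand-ins for the classifiers trained on human annotations in yang2017lr , and the KL direction shown is one plausible choice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def adversarial_accuracy_and_divergence(clf_real, clf_gen, X_val, y_val):
    """clf_real / clf_gen approximate p_r(y|x) and p_g(y|x); (X_val, y_val) is
    a held-out set of real images with labels."""
    acc_real = clf_real.score(X_val, y_val)
    acc_gen = clf_gen.score(X_val, y_val)          # adversarial accuracy pair

    p_r = np.clip(clf_real.predict_proba(X_val), 1e-8, 1.0)
    p_g = np.clip(clf_gen.predict_proba(X_val), 1e-8, 1.0)
    # Per-sample KL( p_g(y|x) || p_r(y|x) ), averaged over the validation set.
    adv_div = float(np.mean(np.sum(p_g * np.log(p_g / p_r), axis=1)))
    return acc_real, acc_gen, adv_div

# Tiny synthetic illustration with 2D features and 3 classes.
rng = np.random.default_rng(0)
X_real, y_real = rng.normal(size=(300, 2)), rng.integers(0, 3, 300)
X_gen,  y_gen  = rng.normal(size=(300, 2)), rng.integers(0, 3, 300)
clf_real = LogisticRegression(max_iter=500).fit(X_real, y_real)
clf_gen  = LogisticRegression(max_iter=500).fit(X_gen, y_gen)
X_val, y_val = rng.normal(size=(100, 2)), rng.integers(0, 3, 100)
print(adversarial_accuracy_and_divergence(clf_real, clf_gen, X_val, y_val))
```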

  20. Geometry Score. Khrulkov and Oseledets khrulkov2018geometry proposed to compare geometrical properties of the underlying data manifold between real and generated data. This score, however, involves many technical details, which makes it harder to understand and compute. Here, I provide an intuitive description.

    The core idea is to build a simplicial complex from the data using proximity information (e.g.  pairwise distances between samples). To investigate the structure of the manifold, a proximity threshold $\epsilon$ is varied and the resulting simplices are added to the approximation. An example is shown in Fig. 10.A. For each value of $\epsilon$, topological properties of the corresponding simplicial complex, namely homologies, are computed. A homology encodes the number of holes of various dimensions in a space. Eventually, a barcode (signature) is constructed reflecting how long the holes (homologies) persist across simplicial complexes (Fig. 10.B). In general, to find the rank of the $k$-homology (i.e.  the number of $k$-dimensional holes) at some fixed value $\epsilon_0$, one counts the intersections of the vertical line $\epsilon = \epsilon_0$ with the intervals in the corresponding block of the barcode.

    Since computing the barcode using all data is intractable, in practice subsets of the data (e.g.  obtained by randomly selecting points) are often used. For each subset, the Relative Living Time (RLT) of each number of holes is computed, defined as the ratio of the total range of $\epsilon$ during which exactly that number of holes was present to the value of $\epsilon$ at which all points connect into a single blob. The RLTs over random subsets are then averaged to give the Mean Relative Living Times (MRLT). By construction, they add up to 1. To quantitatively evaluate the topological difference between two datasets, the distance between their MRLT distributions is computed.

    Fig. 10.C shows an example over synthetic data. Intuitively, the value at location $i$ on the horizontal axis of the bar chart indicates for what fraction of the time exactly $i$ one-dimensional holes existed as the threshold was varied. For example, in the leftmost histogram, zero, two, or three 1D holes were almost never observed, and most of the time exactly one hole appeared. Similarly, for the fourth pattern from the left, most of the time one 1D hole is observed. Comparing the MRLTs of the patterns with that of the ground-truth pattern (the leftmost one) reveals which dataset is closest to the ground truth.

    Fig. 10.D shows a comparison of two GANs, WGAN arjovsky2017wasserstein and WGAN-GP gulrajani2017improved , on the MNIST dataset using the method above, both over single digits and over the entire dataset. It shows that both models produce distributions that are very close to the ground truth, but that for almost all classes WGAN-GP performs better.

    The geometry score does not use auxiliary networks and is not limited to visual data. However, since it only takes topological properties into account (which do not change if, for example, the entire dataset is shifted by 1), assessing the visual quality of samples based only on this score may be difficult. For this reason, the authors propose to use it in conjunction with other measures, such as FID, when dealing with natural images.

    Figure 10: A) A simplicial complex constructed on a set of sample points. First, a fixed proximity parameter $\epsilon$ is chosen. Then, all balls of radius $\epsilon$ centered at each point are considered, and if for some subset of the sample points of size $k$ all the pairwise intersections of the corresponding balls are nonempty, then the $(k-1)$-dimensional simplex spanning this subset is added to the simplicial complex. B) Using different values of $\epsilon$, different simplicial complexes are obtained (a). For small $\epsilon$ the balls do not intersect and there are just 10 isolated components (b, [left]). For an intermediate $\epsilon$ several components are merged and one loop appears (b, [middle]). The filled triangle corresponding to the triple pairwise intersection is topologically trivial and does not affect the topology (and similarly the darker tetrahedron on the right). For larger $\epsilon$ all the components are merged into one and the same hole still exists (b, [right]). In an intermediate interval of $\epsilon$, a smaller hole (as in A) appears and quickly disappears. This information is summarized in the persistence barcode (c). The number of connected components (and holes) in the simplicial complex at some value $\epsilon_0$ is given by the number of intervals in the barcode intersecting the vertical line $\epsilon = \epsilon_0$. C) Mean Relative Living Times (MRLT) for various synthetic 2D datasets. The number of 1D holes is correctly identified in all cases. Comparing MRLTs reveals that the second dataset from the left is closest to the ‘ground truth’ (the noisy circle on the left). D) Comparison of the MRLT of the MNIST dataset and of samples generated by WGAN and WGAN-GP trained on MNIST. The MRLTs match almost perfectly; however, WGAN-GP shows slightly better performance on most of the classes. Figure compiled from khrulkov2018geometry .
  21. Reconstruction Error. For many generative models, the reconstruction error on the training set is often explicitly optimized (e.g.  in Variational Autoencoders ledig2016photo ). It is therefore natural to evaluate generative models using a reconstruction error measure (e.g.  the $L_2$ norm) computed on a test set. In the case of GANs, given a generator $G$ and a set of test samples $X_{test}$, the reconstruction error of $G$ on $X_{test}$ is defined as:

    $L_{recon}(G, X_{test}) = \sum_{\mathbf{x} \in X_{test}} \min_{\mathbf{z}} \left\| G(\mathbf{z}) - \mathbf{x} \right\|$   (15)

    Since it is not possible to directly infer the optimal latent code $\mathbf{z}$ for a given sample $\mathbf{x}$, Xiang and Li xiang2017effects used the following alternative. Starting from an all-zero vector, they performed gradient descent on the latent code to find the one that minimizes the $L_2$ distance between the sample generated from that code and the target sample. Since the code is optimized rather than computed by a feed-forward network, the evaluation process is time-consuming. Thus, when monitoring the training process they avoided performing this evaluation at every training iteration and used only a reduced number of samples and gradient-descent steps; only for the final trained model did they perform an extensive evaluation on a larger test set with a larger number of steps.
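    A minimal PyTorch sketch of this latent-code optimization, assuming a pretrained generator G that maps a latent vector to an image; the optimizer, step count, and learning rate are illustrative and not the settings used by xiang2017effects :

```python
import torch

def reconstruction_error(G, x_target, latent_dim, steps=200, lr=0.05):
    """Approximate min_z ||G(z) - x||^2 for one target sample by gradient
    descent on the latent code, starting from the all-zero vector (Eq. 15)."""
    z = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((G(z) - x_target) ** 2).sum()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return float(((G(z) - x_target) ** 2).sum())

# Usage with a stand-in "generator" (a frozen linear map) and a random target.
latent_dim, img_dim = 16, 64
G = torch.nn.Linear(latent_dim, img_dim)
for p in G.parameters():
    p.requires_grad_(False)            # only the latent code is optimized
x = torch.randn(1, img_dim)
print(reconstruction_error(G, x, latent_dim))
```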

  22. Image Quality Measures (SSIM, PSNR and Sharpness Difference). Some researchers have proposed to use measures from the image quality assessment literature for training and evaluating GANs. They are explained next.

    1. The single-scale SSIM measure wang2004image is a well-characterized perceptual similarity measure that aims to discount aspects of an image that are not important for human perception. It compares corresponding pixels and their neighborhoods in two images, denoted by $\mathbf{x}$ and $\mathbf{y}$, using three quantities: luminance ($I$), contrast ($C$), and structure ($S$):

      $I(\mathbf{x}, \mathbf{y}) = \frac{2 \mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \quad C(\mathbf{x}, \mathbf{y}) = \frac{2 \sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \quad S(\mathbf{x}, \mathbf{y}) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}$

      The variables $\mu_x$, $\mu_y$, $\sigma_x$, and $\sigma_y$ denote the means and standard deviations of pixel intensity in a local image patch centered at either $\mathbf{x}$ or $\mathbf{y}$ (typically a square neighborhood of 5 pixels). The variable $\sigma_{xy}$ denotes the sample correlation coefficient between corresponding pixels in the patches centered at $\mathbf{x}$ and $\mathbf{y}$. The constants $C_1$, $C_2$, and $C_3$ are small values added for numerical stability. The three quantities are combined to form the SSIM score:

      $\mathrm{SSIM}(\mathbf{x}, \mathbf{y}) = I(\mathbf{x}, \mathbf{y})^{\alpha}\, C(\mathbf{x}, \mathbf{y})^{\beta}\, S(\mathbf{x}, \mathbf{y})^{\gamma}$

      SSIM assumes a fixed image sampling density and viewing distance. A variant of SSIM, MS-SSIM, operates at multiple scales. The input images $\mathbf{x}$ and $\mathbf{y}$ are iteratively downsampled by a factor of 2 with a low-pass filter, with scale $j$ denoting the original images downsampled by a factor of $2^{j-1}$. The contrast and structure components are applied at all scales. The luminance component is applied only at the coarsest scale, denoted $M$. Further, the contrast and structure components can be weighted differently at each scale. The final measure is:

      $\mathrm{MS\text{-}SSIM}(\mathbf{x}, \mathbf{y}) = I_M(\mathbf{x}, \mathbf{y})^{\alpha_M} \prod_{j=1}^{M} C_j(\mathbf{x}, \mathbf{y})^{\beta_j}\, S_j(\mathbf{x}, \mathbf{y})^{\gamma_j}$

      MS-SSIM ranges between 0 (low similarity) and 1 (high similarity). Snell et al.  ridgeway2015learning defined a loss function for training GANs which is the sum of structural-similarity scores over all image pixels,

      $\mathcal{L}(\mathbf{X}, \mathbf{Y}) = \sum_{i} \mathrm{SSIM}(x_i, y_i),$

      where $\mathbf{X}$ and $\mathbf{Y}$ are the original and reconstructed images, and $i$ is an index over image pixels. This loss function has a simple analytical derivative wang2008maximum , which allows gradient descent to be performed. See Fig. 17 for more details.

    2. PSNR measures the peak signal-to-noise ratio between two monochrome images, e.g.  a generated image and its corresponding real image, to assess the quality of the generated image (for instance when evaluating conditional GANs krishnaGAN ). The higher the PSNR (in dB), the better the quality of the generated image. It is computed as:

      $\mathrm{PSNR}(\mathbf{x}, \mathbf{y}) = 10 \log_{10}\!\left( \frac{MAX_{\mathbf{x}}^2}{\mathrm{MSE}(\mathbf{x}, \mathbf{y})} \right)$   (16)
      $= 20 \log_{10}\left( MAX_{\mathbf{x}} \right) - 10 \log_{10}\left( \mathrm{MSE}(\mathbf{x}, \mathbf{y}) \right)$   (17)

      where

      $\mathrm{MSE}(\mathbf{x}, \mathbf{y}) = \frac{1}{N} \sum_{i=1}^{N} \left( x_i - y_i \right)^2$   (18)

      and $MAX_{\mathbf{x}}$ is the maximum possible pixel value of the image (e.g.  255 for an 8-bit representation). This score can be used when a reference image is available, for example in training conditional GANs using paired data (e.g.  isola2017image ; krishnaGAN ).
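      A small NumPy sketch of Eqs. 16–18 for 8-bit images (the helper below is hypothetical and not tied to any particular codebase):

```python
import numpy as np

def psnr(x, y, max_val=255.0):
    """Peak signal-to-noise ratio in dB between a reference image x and a
    generated image y of the same shape (Eqs. 16-18)."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:                       # identical images: PSNR is infinite
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a clean image versus a noisy copy.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
noisy = np.clip(img + rng.normal(0, 5, size=img.shape), 0, 255)
print(psnr(img, noisy))                # roughly 34 dB for sigma = 5 noise
```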

    3. Sharpness Difference (SD) measures the loss of sharpness during image generation. It is computed as:

      $\mathrm{SharpDiff}(\mathbf{x}, \mathbf{y}) = 10 \log_{10}\!\left( \frac{MAX_{\mathbf{x}}^2}{\mathrm{GRADS}(\mathbf{x}, \mathbf{y})} \right)$   (19)

      where

      $\mathrm{GRADS}(\mathbf{x}, \mathbf{y}) = \frac{1}{N} \sum_{i} \sum_{j} \left| \left( \nabla_i \mathbf{x} + \nabla_j \mathbf{x} \right) - \left( \nabla_i \mathbf{y} + \nabla_j \mathbf{y} \right) \right|$   (20)

      and

      $\nabla_i \mathbf{x} = \left| x_{i,j} - x_{i-1,j} \right|, \qquad \nabla_j \mathbf{x} = \left| x_{i,j} - x_{i,j-1} \right|$   (21)

    Odena et al. odena2016conditional used (or ‘abused’, since the original MS-SSIM measure is intended to measure the similarity of an image with respect to a reference image) MS-SSIM to evaluate the diversity of generated images. The intuition is that image pairs with higher MS-SSIM appear more similar than pairs with lower MS-SSIM. They measured the MS-SSIM scores of 100 randomly chosen pairs of images within a given class. The higher (lower) the diversity within a class, the lower (higher) the mean MS-SSIM score (see Fig. 11.A). Training images from the ImageNet training data span a variety of mean MS-SSIM scores across classes, indicating the variability of image diversity among ImageNet classes. Fig. 11.B plots the mean MS-SSIM values for image samples versus training data for each class (after training was completed). It shows that 847 classes, out of 1000, have mean sample MS-SSIM scores below the maximum MS-SSIM of the training data. To identify whether the generator in AC-GAN odena2016conditional collapses during training, Odena et al. tracked the mean MS-SSIM score for all 1000 ImageNet classes (Fig. 11.C). Fig. 11.D shows the joint distribution of Inception accuracies versus MS-SSIM across all 1000 classes. It shows that Inception score and MS-SSIM are anti-correlated (a correlation of −0.16).
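    A sketch of this intra-class diversity measurement, using scikit-image's single-scale SSIM as a stand-in for MS-SSIM (MS-SSIM itself is not provided by scikit-image); sampling 100 random pairs follows odena2016conditional , everything else is illustrative:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def mean_pairwise_ssim(images, n_pairs=100, seed=0):
    """Average SSIM over randomly chosen image pairs within one class.
    Lower values indicate higher sample diversity."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_pairs):
        i, j = rng.choice(len(images), size=2, replace=False)
        scores.append(ssim(images[i], images[j], data_range=1.0))
    return float(np.mean(scores))

# Toy check: near-duplicate samples (mode collapse) vs. diverse samples.
rng = np.random.default_rng(1)
base = rng.random((32, 32))
collapsed = [np.clip(base + rng.normal(0, 0.01, base.shape), 0, 1) for _ in range(20)]
diverse   = [rng.random((32, 32)) for _ in range(20)]
print(mean_pairwise_ssim(collapsed))   # close to 1: low diversity
print(mean_pairwise_ssim(diverse))     # much lower: high diversity
```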

    Juefei-Xu et al.  juefei2017gang used the SSIM and PSNR measures to evaluate GANs on image completion tasks. The advantage here is that having a 1-vs-1 comparison between the ground truth and the completed image allows very straightforward visual examination of GAN quality. It also allows head-to-head comparison between various GANs. In addition to the above-mentioned image quality measures, some other measures such as the Universal Quality Index (UQI) WangBovik and Visual Information Fidelity (VIF) sheikh2006image have also been adopted for assessing the quality of synthesized images. It has been reported that MS-SSIM finds large-scale mode collapses reliably but fails to diagnose smaller effects such as loss of variation in colors or textures. Its drawback is that it does not directly assess image quality in terms of similarity to the training set odena2016conditional .

    Figure 11: MS-SSIM score used for measuring image diversity. A) MS-SSIM scores for samples generated by the AC-GAN odena2016conditional model (top row) and training samples (bottom row). B) The mean MS-SSIM scores between pairs of images within a given class of the ImageNet dataset versus AC-GAN samples. The horizontal red line marks the maximum MS-SSIM across all ImageNet classes (over training data). Each data point represents the mean MS-SSIM value for samples from one class. C) Intra-class MS-SSIM for five ImageNet classes throughout a training run. A decreasing trend means successful training (black lines), whereas an increasing trend indicates that the generator ‘collapses’ (red line). D) Comparison of Inception score vs. MS-SSIM for all 1000 ImageNet classes (correlation of −0.16). AC-GAN samples do not achieve variability at the expense of discriminability. Figure compiled from odena2016conditional .
  23. Low-level Image Statistics. Natural scenes make up only a tiny fraction of the space of all possible images and have characteristic statistics (e.g.  geisler2008visual ; simoncelli2001natural ; torralba2003statistics ; ruderman1994statistics ). It has been shown that the statistics of natural images remain the same when the images are scaled (i.e.  scale invariance) srivastava2003advances ; zhu2003statistical . The average power spectrum magnitude over natural images falls with spatial frequency $f$ approximately as $1/f^2$ deriugin1956power ; cohen1975image ; burton1987color ; field1987relations . Another important property of natural image statistics is non-Gaussianity srivastava2003advances ; zhu2003statistical ; wainwright1999scale . This means that the marginal distribution of almost any zero-mean linear filter response on virtually any dataset of images is sharply peaked at zero, with heavy tails and high kurtosis (greater than 3) lee2001occlusion . Recent studies have shown that the contrast statistics of the majority of natural images follow a Weibull distribution ghebreab2009biologically .

    Zeng et al.  zeng2017statistics proposed to evaluate generative models in terms of the low-level statistics of their generated images with respect to natural scenes. They considered four statistics: 1) the mean power spectrum, 2) the number of connected components in a given image area, 3) the distribution of random filter responses, and 4) the contrast distribution. Their results show that although images generated by DCGAN radford2015unsupervised , WGAN arjovsky2017wasserstein , and VAE kingma2013auto resemble natural scenes in terms of low-level statistics, there are still significant differences. For example, generated images do not have a scale-invariant mean power spectrum magnitude, which indicates the existence of extra structures in these images caused by deconvolution operations.
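    A sketch of one such low-level check, the radially averaged power spectrum, which can be averaged over real and generated image sets and compared (a minimal NumPy version; the binning scheme is an illustrative choice rather than the exact procedure of zeng2017statistics ):

```python
import numpy as np

def radial_power_spectrum(img, n_bins=30):
    """Radially averaged power spectrum of a grayscale image. For natural
    images the log-log curve falls off roughly linearly (~1/f^2)."""
    f = np.fft.fftshift(np.fft.fft2(img - img.mean()))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h / 2, xx - w / 2)
    bins = np.linspace(1, r.max(), n_bins + 1)
    idx = np.digitize(r.ravel(), bins)
    return np.array([power.ravel()[idx == k].mean() for k in range(1, n_bins + 1)])

# Compare average spectra of two image sets (e.g. real vs. generated crops).
rng = np.random.default_rng(0)
smooth_set = [rng.normal(size=(64, 64)).cumsum(0).cumsum(1) for _ in range(10)]  # spectrum decays
noise_set  = [rng.normal(size=(64, 64)) for _ in range(10)]                       # roughly flat
avg = lambda imgs: np.mean([radial_power_spectrum(i) for i in imgs], axis=0)
print(np.log10(avg(smooth_set))[:5], np.log10(avg(noise_set))[:5])
```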

    Low-level image statistics can also be used for regularizing GANs, by optimizing the discriminator to check whether the generator’s output matches the expected statistics of the real samples (a.k.a. feature matching salimans2016improved ), using a loss of the form $\| \mathbb{E}_{\mathbf{x} \sim p_r} f(\mathbf{x}) - \mathbb{E}_{\mathbf{z} \sim p_z} f(G(\mathbf{z})) \|_2^2$, where $f(\cdot)$ represents the statistics of features (e.g.  activations on an intermediate layer of the discriminator). Karras et al.  karras2017progressive investigated the multi-scale statistical similarities between distributions of local image patches drawn from the Laplacian pyramid burt1987laplacian representations of generated and real images. They used the Wasserstein distance to compare the distributions of patches (this measure is known as the sliced Wasserstein distance, SWD). The multi-scale pyramid allows a detailed comparison of statistics: the distance between the patch sets extracted from the lowest resolution indicates similarity in large-scale image structures, while the finest-level patches encode information about pixel-level attributes such as sharpness of edges and noise.
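    A compact sketch of the sliced Wasserstein distance between two sets of flattened image patches, the core computation behind the comparison in karras2017progressive (the number of projections and patches is illustrative; the full procedure additionally extracts patches from every Laplacian-pyramid level):

```python
import numpy as np

def sliced_wasserstein(patches_a, patches_b, n_projections=128, seed=0):
    """Approximate Wasserstein-1 distance between two equally sized patch sets
    by averaging 1D Wasserstein distances along random projection directions."""
    rng = np.random.default_rng(seed)
    dim = patches_a.shape[1]
    dirs = rng.normal(size=(n_projections, dim))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj_a = np.sort(patches_a @ dirs.T, axis=0)   # sort each 1D projection
    proj_b = np.sort(patches_b @ dirs.T, axis=0)
    return float(np.mean(np.abs(proj_a - proj_b)))

# Toy usage: 7x7 patches (flattened to 49-D) from "real" and "generated" sets.
rng = np.random.default_rng(1)
real_patches = rng.normal(0.0, 1.0, size=(2048, 49))
fake_patches = rng.normal(0.3, 1.2, size=(2048, 49))
print(sliced_wasserstein(real_patches, real_patches))  # ~0 for identical sets
print(sliced_wasserstein(real_patches, fake_patches))  # larger for mismatched sets
```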

  24. Precision, Recall and F1 Score. Lucic et al.  lucic2017gans proposed to compute precision, recall, and the F1 score to quantify the degree of overfitting in GANs. Intuitively, precision measures the quality of the generated samples, whereas recall measures the proportion of the reference distribution covered by the learned distribution. They argue that IS only captures precision, since it does not penalize a model for failing to produce all modes of the data distribution, only for failing to produce all classes. The FID score, on the other hand, captures both precision and recall.

    To approximate these scores for a model, Lucic et al. proposed to use toy datasets for which the data manifold is known and distances of generated samples to the manifold can be computed. An example of such a dataset is the manifold of convex shapes (see Fig. 12). To compute these scores, the latent representation of each test sample is first estimated, through gradient descent, by inverting the generator $G$. Precision is defined as the fraction of generated samples whose distance to the manifold is below a certain threshold. Recall, on the other hand, is given by the fraction of test samples whose distance to their reconstruction (obtained by inverting $G$) is below the threshold. If the samples from the model distribution are, on average, close to the manifold (see lucic2017gans for details), its precision is high. Similarly, high recall implies that the generator can recover (i.e.  generate something close to) any sample from the manifold, thus capturing most of the manifold.

    The major drawback of these scores is that they are impractical for real images where the data manifold is unknown, and their use is limited to evaluations on synthetic data. In a recent effort, Sajjadi et al. sajjadi2018assessing introduced a novel definition of precision and recall to address this limitation.
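    A schematic sketch of these definitions, assuming the data manifold is represented by a finite set of reference points so that distances to it can be approximated by nearest-neighbor distances; this is a simplification of the setup in lucic2017gans , which additionally recovers latent codes by inverting the generator:

```python
import numpy as np
from scipy.spatial.distance import cdist

def precision_recall(generated, manifold_points, threshold):
    """Precision: fraction of generated samples within `threshold` of the
    manifold. Recall: fraction of manifold (test) samples within `threshold`
    of some generated sample."""
    d = cdist(generated, manifold_points)            # pairwise distances
    precision = float(np.mean(d.min(axis=1) <= threshold))
    recall = float(np.mean(d.min(axis=0) <= threshold))
    return precision, recall

# Toy example: the "manifold" is a unit circle in 2D.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 500)
manifold = np.stack([np.cos(theta), np.sin(theta)], axis=1)
good = manifold + rng.normal(0, 0.02, manifold.shape)   # close to manifold
collapsed = np.tile(manifold[0], (300, 1))              # one point only
print(precision_recall(good, manifold, 0.1))       # high precision, high recall
print(precision_recall(collapsed, manifold, 0.1))  # high precision, low recall
```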

    Figure 12: Samples from a model trained on gray-scale triangles. These triangles belong to a low-dimensional manifold embedded in the high-dimensional pixel space. A good generative model should be able to capture the factors of variation in this manifold (e.g.  rotation, translation, minimum angle size). a) high recall and precision, b) high precision but low recall (lacking in diversity), c) low precision but high recall (can decently reproduce triangles, but fails to capture convexity), and d) low precision and low recall. Figure from lucic2017gans .

2.3 Qualitative Measures

Visual examination of samples by human raters is one of the most common and intuitive ways to evaluate GANs (e.g.  denton2015deep ; salimans2016improved ; millerhuman ). While it greatly helps to inspect and tune models, it suffers from the following drawbacks. First, evaluating the quality of generated images with human vision is expensive and cumbersome, biased (e.g.  in crowd-sourcing setups it depends on the structure and pay of the task, the community reputation of the experimenter, etc.  silberman2015stop ), difficult to reproduce, and does not fully reflect the capacity of models. Second, human inspectors may have high variance, which makes it necessary to average over a large number of subjects. Third, an evaluation based on samples could be biased towards models that overfit and is therefore a poor indicator of a good density model in a log-likelihood sense theis2015note . For instance, it fails to tell whether a model drops modes. In fact, mode dropping generally helps visual sample quality, as the model can choose to focus on only a few common modes that correspond to typical samples.

In what follows, I discuss the approaches used in the literature to qualitatively inspect the quality of the images generated by a model and to explore its learned latent space.

  1. Nearest Neighbors. To detect overfitting, traditionally some samples are shown next to their nearest neighbors in the training set (e.g.  Fig. 13). There are, however, two concerns regarding this manner of evaluation:

    1. Nearest neighbors are typically determined based on the Euclidean distance, which is very sensitive to minor perceptual perturbations. This is a well-known phenomenon in the psychophysics literature (see Wang and Bovik wang2009mean ). It is trivial to generate samples that are visually almost identical to a training image but have a large Euclidean distance from it theis2015note . See Fig. 14.A for some examples.

    2. A model that stores (transformed) training images (i.e.  memory GAN) can trivially pass the nearest-neighbor overfitting test theis2015note . This problem can be alleviated by choosing nearest neighbors based on perceptual measures, and by showing more than one nearest neighbor.

      Figure 13: Generated samples nearest to real images from CIFAR-10. In each of the two panels, the first column shows real images, followed by the nearest image generated by DCGAN radford2015unsupervised , ALI dumoulin2016adversarially , Unrolled GAN metz2016unrolled , and VEEGAN srivastava2017veegan , respectively. Figure compiled from srivastava2017veegan .
    Figure 14: A) Small changes to an image can lead to large changes in Euclidean distance affecting the choice of the nearest neighbor. The left column shows a query image shifted 1 to 4 pixels (top to bottom). The right column shows the corresponding nearest neighbor from the training set. The gray lines indicate Euclidean distance of the query image to 100 randomly picked images from the training set. B) Parzen window estimates for a Gaussian evaluated on 6 by 6 pixel image patches from the CIFAR-10 dataset. Even for small patches and a very large number of samples, the Parzen window estimate is far from the true log-likelihood. C) Using Parzen window estimates to evaluate various models trained on MNIST, samples from the true distribution perform worse than samples from a simple model trained with k-means. Figure compiled from theis2015note .
  2. Rapid Scene Categorization. These measures are inspired by prior studies showing that humans are capable of reporting certain characteristics of scenes (e.g.  scene category, visual layout  oliva2005gist ; serre2007feedforward ) from a short glance. To obtain a quantitative measure of sample quality, Denton et al.  denton2015deep asked volunteers to distinguish their generated samples from real images. The subjects were presented with the user interface shown in Fig. 15 (right) and were asked to click the appropriate button to indicate whether they believed the image was real or generated. The viewing time was varied from 50 ms to 2000 ms (11 durations). Fig. 15 (left) shows the results over samples generated by three GAN models. They concluded that their model was better than the original GAN goodfellow2014generative since it did better at fooling the subjects (the lower bound here is 0% and the upper bound is 100%). See also Fig. 16 for another example of a fake-vs-real experiment, but without time constraints (conducted by Salimans et al.  salimans2016improved ).

    This “Turing-like” test is very intuitive and seems inevitable if we are ultimately to answer the question of whether generative models are as good as nature at generating images. However, there are several concerns with conducting such a test in practice (especially when dealing with models that are far from perfect; see Fig. 15 (left)). Aside from experimental conditions that are hard to control on crowd-sourced platforms (e.g.  presentation time, screen size, the subject’s distance to the screen, subjects’ motivation, age, mood, and feedback) and the high cost, these tests fall short in evaluating models in terms of the diversity of generated samples and may be biased towards models that overfit to the training data.

    Figure 15: Left: Human evaluation of real CIFAR10 images (red) and samples from Goodfellow et al.  goodfellow2014generative (magenta), and LAPGAN denton2015deep and a class conditional LAPGAN (green). Around 40% of the samples generated by the class conditional LAPGAN model are realistic enough to fool a human into thinking they are real images. This compares with 10% of images from the standard GAN model, but is still a lot lower than the > 90% rate for real images. Right: The user-interface presented to the subjects. Figure from denton2015deep .
    Figure 16: Web interface given to annotators in the experiments conducted by Salimans et al.  salimans2016improved . Annotators are asked to distinguish computer generated images from real ones (left) and are provided with feedback (right). Figure compiled from salimans2016improved .
  3. Rating and Preference Judgment. These experiments ask subjects to rate models in terms of the fidelity of their generated images. For example, Snell et al.  snell2015learning studied whether observers prefer reconstructions produced by perceptually-optimized networks or by pixelwise-loss-optimized networks. Participants were shown image triplets with the original (reference) image in the center and the SSIM- and MSE-optimized reconstructions on either side, with the locations counterbalanced. Participants were instructed to select which of the two reconstructed images they preferred (see Fig. 17). Similar approaches have been followed in huang2017stacked ; zhang2017stackgan ; xiao2018generating ; yi2017dualgan ; zhang2016colorful ; upchurch2016deep ; donahue2017semantically ; liu2017auto ; lu2017sketch . Often the first few trials in these experiments are reserved for practice.

    Figure 17: An example of a user judgment study by Snell et al.  snell2015learning . Left) Human judgments of generated images (a) Fully connected network: Proportion of participants preferring SSIM to MSE for each of 100 image triplets. (b) Deterministic conv. network: Distribution of image quality ranking for MS-SSIM, MSE, and MAE for 1000 images from the STL-10 hold-out set. Right) Image triplets consisting of—from left to right—the MSE reconstruction, the original image, and the SSIM reconstruction. Image triplets are ordered, from top to bottom and left to right, by the percentage of participants preferring SSIM. (c) Eight images for which participants strongly preferred SSIM over MSE. (d) Eight images for which the smallest proportion of participants preferred SSIM. Figure compiled from snell2015learning .
  4. Evaluating Mode Drop and Mode Collapse. GANs have been repeatedly criticized for failing to model the entire data distribution while being able to generate realistic-looking images. Mode collapse, a.k.a. the Helvetica scenario, is the phenomenon in which the generator learns to map several different input vectors to the same output (possibly due to low model capacity or inadequate optimization arora2017gans ). It causes a lack of diversity in the generated samples, as the generator assigns low probability mass to significant subsets of the data distribution’s support. Mode drop occurs when some hard-to-represent modes of $p_r$ are simply “ignored” by $G$. This is different from mode collapse, where several modes of $p_r$ are “averaged” by $G$ into a single mode, possibly located at a midpoint between them. An ideal GAN evaluation measure should be sensitive to both phenomena.

    Detecting mode collapse in GANs trained on large-scale image datasets is very challenging (see srivastava2017veegan ; huang2018an for analyses of mode drop and mode collapse on real datasets). However, it can be accurately measured on synthetic datasets where the true distribution and its modes are known (e.g.  Gaussian mixtures). Srivastava et al.  srivastava2017veegan proposed to quantify mode-collapse behavior as follows:

    1. First, some points are sampled from the generator. A sample is counted as high quality if it is within a certain distance of its nearest mode center (with the threshold, expressed in standard deviations of the mode, chosen separately for the 2D and the 1200D datasets used).

    2. Then, the number of modes captured is the number of mixture components whose mean is nearest to at least one high-quality sample. Accordingly, a mode is considered lost if there is no sample in the generated test data within a certain number of standard deviations of the center of that mode. This is illustrated in Fig. 19.
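    A small sketch of this counting procedure for a 2D Gaussian-mixture ring, with the distance threshold expressed in standard deviations (the threshold of three standard deviations is an illustrative choice; see srivastava2017veegan for the exact settings):

```python
import numpy as np

def modes_and_quality(samples, mode_centers, sigma, n_std=3.0):
    """Fraction of high-quality samples (within n_std * sigma of the nearest
    mode) and the number of modes captured by at least one such sample."""
    d = np.linalg.norm(samples[:, None, :] - mode_centers[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    high_quality = d.min(axis=1) <= n_std * sigma
    captured = np.unique(nearest[high_quality])
    return float(high_quality.mean()), len(captured)

# 2D ring of 8 Gaussians; a collapsed generator hits only 2 of them.
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
centers = 2.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
rng = np.random.default_rng(0)
sigma = 0.05
collapsed = np.concatenate([rng.normal(centers[k], sigma, (500, 2)) for k in (0, 1)])
print(modes_and_quality(collapsed, centers, sigma))   # ~1.0 quality, 2 modes
```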

    To investigate mode distribution and collapse over natural datasets, Santurkar et al. santurkar2018classification propose to train GANs on a well-balanced dataset (i.e.  one that contains an equal number of samples from each class) and then test whether the generated data is also well balanced. The steps are as follows:

    1. Train the GAN unconditionally (without class labels) on the chosen balanced multi-class dataset D.

    2. Train a multi-class classifier on the same dataset D (to be used as an annotator).

    3. Generate a synthetic dataset by sampling N images from the GAN. Then use the classifier trained in Step 2 above to obtain labels for this synthetic dataset.

    An example is shown in Fig. 18. It reveals that GANs often exhibit mode collapse.

    Figure 18: Illustration of mode collapse in GANs trained on select subsets of the CelebA and LSUN datasets using the technique in santurkar2018classification . The left panel shows the relative distribution of modes in samples drawn from the GANs and compares it to the true data distribution (leftmost plots). The right panel shows the evolution of class distributions in different GANs over the course of training. It can be seen that these GANs introduce covariate shift through mode collapse. Figure compiled from santurkar2018classification .

    The reverse KL divergence over the modes has been used in lin2017pacgan to measure the degree of mode collapse as follows. Each generated sample is assigned to its closest mode. This induces an empirical, discrete distribution with an alphabet size equal to the number of observed modes in the generated samples. A similar induced discrete distribution is computed from the real data samples. The reverse KL divergence between the distribution induced by the generated samples and the distribution induced by the real samples is used as the measure.
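    A minimal sketch of this reverse-KL computation, given per-sample nearest-mode assignments for real and generated samples (the assignments can be obtained as in the previous sketch; the smoothing constant is an illustrative detail):

```python
import numpy as np

def reverse_kl_over_modes(gen_mode_ids, real_mode_ids, n_modes, eps=1e-8):
    """KL( induced distribution of generated modes || induced distribution of
    real modes ), computed from per-sample nearest-mode assignments."""
    p_gen = np.bincount(gen_mode_ids, minlength=n_modes) / len(gen_mode_ids) + eps
    p_real = np.bincount(real_mode_ids, minlength=n_modes) / len(real_mode_ids) + eps
    return float(np.sum(p_gen * np.log(p_gen / p_real)))

# Real samples cover 8 modes uniformly; a collapsed generator covers only 2.
rng = np.random.default_rng(0)
real_ids = rng.integers(0, 8, 4000)
gen_ids = rng.integers(0, 2, 4000)
print(reverse_kl_over_modes(gen_ids, real_ids, n_modes=8))   # clearly above zero
```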

    The shortcoming of the described measures is that they only work for datasets with known modes (e.g.  synthetic or labeled datasets). Overall, it is hard to quantitatively measure mode collapse and mode drop since they are poorly understood. Further, finding nearest neighbors and nearest mode centers is non-trivial in high-dimensional spaces. Active research is ongoing in this direction.

    Figure 19: Density plots of the true data and generator distributions from different GAN methods trained on mixtures of Gaussians arranged in a ring (top) or a grid (bottom). Figure from srivastava2017veegan .
  5. Investigating and Visualizing the Internals of Networks. Other ways of evaluating generative models are studying how and what they learn, exploring their internal dynamics, and understanding the landscape of their latent spaces. While this is a broad topic and many papers fall under it, here I give a few examples to provide the reader with some insight.

    1. Disentangled representations. “Disentanglement” concerns the alignment of “semantic” visual concepts with axes in the latent space. Some tests check for the existence of semantically meaningful directions in the latent space, meaning that varying the seed along those directions leads to predictable changes (e.g.  changes in facial hair or pose). Others (e.g.  chen2016infogan ; higgins2016beta ; mathieu2016disentangling ; lipton2017precise ) assess the quality of internal representations by checking whether they satisfy certain properties, such as being “disentangled”. A measure of disentanglement proposed in higgins2016beta checks whether the latent space captures the true factors of variation in a simulated dataset where the parameters are known by construction (e.g.  using a graphics engine). Radford et al.  radford2015unsupervised investigated their trained generators and discriminators in a variety of ways. They proposed that walking on the learned manifold can reveal signs of memorization (if there are sharp transitions) and the way in which the space is hierarchically collapsed. If walking in this latent space results in semantic changes to the generated images (such as objects being added and removed), one can reason that the model has learned relevant and interesting representations. They also showed interesting results of performing vector arithmetic on the latent vectors of sets of exemplar samples for visual concepts (e.g.  smiling woman − neutral woman + neutral man = smiling man, using latent vectors averaged over several samples).

    2. Space continuity. Related to the above, the goal here is to study the level of detail a model is capable of capturing. For example, given two random seed vectors $\mathbf{z}_1$ and $\mathbf{z}_2$ that generated two realistic images, we can check the images produced using seeds lying on the line joining $\mathbf{z}_1$ and $\mathbf{z}_2$. If such “interpolated” images are reasonable and visually appealing, this may be taken as a sign that the model can produce novel images rather than simply memorizing them (e.g.  berthelot2017began ; see Fig. 20). Some other examples include donahue2017semantically ; nguyen2016synthesizing . White white2016sampling suggests that replacing linear interpolation with spherical linear interpolation prevents diverging from the model’s prior distribution and produces sharper samples. Vedantam et al.  vedantam2017generative studied “visually grounded semantic imagination” and proposed several ways to evaluate their models in terms of the quality of the learned semantic latent space.
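      A short sketch of spherical linear interpolation between two latent vectors, as suggested in white2016sampling (this is the standard slerp formula; the fallback for nearly parallel vectors is a robustness detail I add):

```python
import numpy as np

def slerp(z1, z2, t):
    """Spherical linear interpolation between latent vectors z1 and z2,
    t in [0, 1]. Keeps interpolants at a norm comparable to the endpoints."""
    z1_n = z1 / np.linalg.norm(z1)
    z2_n = z2 / np.linalg.norm(z2)
    omega = np.arccos(np.clip(np.dot(z1_n, z2_n), -1.0, 1.0))
    if np.isclose(omega, 0.0):               # nearly parallel: fall back to lerp
        return (1.0 - t) * z1 + t * z2
    return (np.sin((1.0 - t) * omega) * z1 + np.sin(t * omega) * z2) / np.sin(omega)

# Generate a sequence of interpolated codes to feed through a generator G.
rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=128), rng.normal(size=128)
codes = [slerp(z1, z2, t) for t in np.linspace(0.0, 1.0, 9)]
print([round(float(np.linalg.norm(c)), 2) for c in codes])   # norms stay comparable
```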

      Figure 20: Top: Interpolations in the latent space between real images (from BEGAN berthelot2017began ). These images were not part of the training data. The first and last columns contain the real images to be represented and interpolated. The images immediately next to them are their corresponding approximations, while the images in between are the results of linear interpolation in the latent space. Middle: Latent-space interpolations for three ImageNet classes. The left-most and right-most columns show three pairs of image samples, each pair from a distinct class. Intermediate columns show linear interpolations in the latent space between these three pairs of images (from odena2016conditional ). Bottom: Class-independent information contains global structure about the synthesized image. Each column is a distinct bird class while each row corresponds to a fixed latent code (from odena2016conditional ).
    3. Visualizing the discriminator features. Motivated by previous studies on investigating the representations and features learned by convolutional neural networks trained for scene classification (e.g.  zeiler2014visualizing ; bau2017network ; zhou2014object ), some works have attempted to visualize the internal parts of generators and discriminators in GANs. For example, Radford et al.  radford2015unsupervised showed that a DCGAN trained on a large image dataset can also learn a hierarchy of interesting features. Using guided backpropagation springenberg2014striving , they showed that the features learned by the discriminator fire on typical parts of a bedroom, such as beds and windows (see Fig. 5 in radford2015unsupervised ). The t-SNE method maaten2008visualizing has also been frequently used to project learned latent spaces into 2D.

3 Discussion

3.1 Other Evaluation Measures

In addition to the measures discussed above, there exist some other non-trivial or task-specific ways to evaluate GANs. Vedantam et al.  vedantam2017generative proposed a model for visually grounded imagination to create images of novel semantic concepts. To evaluate the quality of the generated images, they proposed three measures: a) correctness: the fraction of attributes of each generated image that match those specified in the concept’s description, b) coverage: the diversity of values for the unspecified or missing attributes, measured as the difference between the empirical distribution of attribute values in the generated set and the true distribution of that attribute induced by the training set, and c) compositionality: the correctness of generated images in response to test concepts that differ in at least one attribute from the training concepts. To measure the diversity of generated samples, Zhu et al.  zhu2017toward randomly sampled from their model, computed the average pairwise distance in a deep feature space using cosine distance, and compared it with the same measure calculated from ground-truth real images. This is akin to the image retrieval performance measure described above. Im et al.  jiwoong2018quantitatively proposed to evaluate GANs by exploring the divergence and distance measures that were used during GAN training. They showed that the rankings produced by four measures, 1) Jensen-Shannon Divergence, 2) Constrained Pearson $\chi^2$, 3) Maximum Mean Discrepancy, and 4) Wasserstein Distance, are consistent and robust across measures.

Measure | Discriminability | Detecting Overfitting | Disentangled Latent Spaces | Well-defined Bounds | Perceptual Judgments | Sensitivity to Distortions | Comp. & Sample Efficiency
1. Average Log-likelihood goodfellow2014generative ; theis2015note | low | low | - | (−∞, ∞) | low | low | low
2. Coverage Metric tolstikhin2017adagan | low | low | - | [0, 1] | low | low | -
3. Inception Score (IS) salimans2016improved | high | moderate | - | [1, ∞) | high | moderate | high
4. Modified Inception Score (m-IS) gurumurthy2017deligan | high | moderate | - | [1, ∞) | high | moderate | high
5. Mode Score (MS) che2016mode | high | moderate | - | [0, ∞) | high | moderate | high
6. AM Score zhou2018activation | high | moderate | - | [0, ∞) | high | moderate | high
7. Fréchet Inception Distance (FID) heusel2017gans | high | moderate | - | [0, ∞) | high | high | high
8. Maximum Mean Discrepancy (MMD) gretton2012kernel | high | low | - | [0, ∞) | - | - | -
9. The Wasserstein Critic arjovsky2017wasserstein | high | moderate | - | [0, ∞) | - | - | low
10. Birthday Paradox Test arora2017gans | low | high | - | [1, ∞) | low | low | -
11. Classifier Two Sample Test (C2ST) lehmann2006testing | high | low | - | [0, 1] | - | - | -
12. Classification Performance radford2015unsupervised ; isola2017image | high | low | - | [0, 1] | low | - | -
13. Boundary Distortion santurkar2018classification | low | low | - | [0, 1] | - | - | -
14. NDB Richardson2018 | low | high | - | [0, ∞) | - | low | -
15. Image Retrieval Performance wang2016ensembles | moderate | low | - | * | low | - | -
16. Generative Adversarial Metric (GAM) im2016generating | high | low | - | * | - | - | moderate
17. Tournament Win Rate and Skill Rating olsson2018skill | high | high | - | * | - | - | low
18. NRDS zhang2018decoupled | high | low | - | [0, 1] | - | - | poor
19. Adversarial Accuracy & Divergence yang2017lr | high | low | - | [0, 1], [0, ∞) | - | - | -
20. Geometry Score khrulkov2018geometry | low | low | - | [0, ∞) | - | low | low
21. Reconstruction Error xiang2017effects | low | low | - | [0, ∞) | - | moderate | moderate
22. Image Quality Measures wang2004image ; ridgeway2015learning ; juefei2017gang | low | moderate | - | * | high | high | high
23. Low-level Image Statistics zeng2017statistics ; karras2017progressive | low | low | - | * | low | low | -
24. Precision, Recall and F1 Score lucic2017gans | low | high | - | [0, 1] | - | - | -


Table 2: Meta measures of GAN quantitative evaluation scores. Columns list the desiderata and rows list the measures; note that the ratings are relative. “-” means unknown (hence warranting further research). “*” indicates that different bounds apply to the different scores within that measure family. Also note that tighter bounds for some of the measures might be possible. It seems that most of the measures do not systematically evaluate disentanglement in the latent space.

3.2 Sample and Computational Efficiencies

Here, I provide more details on two items in the list of desired properties of GAN evaluation measures. They will be used in the next subsection to assess the measures. Huang et al.  huang2018an argue that a practical GAN evaluation measure should be computable with a reasonable number of samples and at an affordable computational cost. This is particularly important when monitoring the training process of models. They proposed the following ways to assess evaluation measures:

  1. Sample efficiency: This concerns the number of samples needed for a measure to discriminate a set of generated samples from a set of real samples. To do this, a reference set is uniformly sampled from the real training data (but disjoint from the real evaluation set). All three sets have the same size. An ideal measure is expected to correctly score the reference real set as closer to the real evaluation set than the generated set is, using a relatively small number of samples. In other words, the number of samples needed for a measure to distinguish the generated set from the reference real set can be viewed as its sample complexity.

  2. Computational efficiency: Fast computation of the empirical measure is of practical concern, as it helps researchers monitor the training process and diagnose problems early on (e.g.  for early stopping). It can be measured in terms of wall-clock time for a given number of evaluated samples.

    3.3 What is the Best GAN Evaluation Measure?

    To answer this question, let us first take a look at how well the measures perform with respect to the desired properties mentioned in Section 2.1. Results are shown in Table 2. I find that:

    1. only two measures are designed to explicitly address overfitting,

    2. the majority of the measures do not consider disentangled representations,

    3. few measures have both lower and upper bounds,

    4. the agreement between the measures and human perceptual judgments is less clear,

    5. several highly regarded measures have high sample and computational efficiencies, and

    6. the sensitivity of measures to image distortions is less explored.

    A detailed discussion and comparison of GAN evaluation measures comes next.

    As of yet, there is no consensus regarding the best score. Different scores assess various aspects of the image generation process, and it is unlikely that a single score can cover all aspects. Nevertheless, some measures seem more plausible than others (e.g.  FID score). Detailed analyses by Theis et al.  theis2015note showed that average likelihood is not a good measure. Parzen windows estimation of likelihood favors trivial models and is irrelevant to visual fidelity of samples. Further, it fails to approximate the true likelihood in high dimensional spaces or to rank models (Fig. 14). Similarly, the Wasserstein distance between generated samples and the training data is also intractable in high dimensions karras2017progressive . Two widely accepted scores, Inception Score and Fréchet Inception Distance, rely on pre-trained deep networks to represent and statistically compare original and generated samples. This brings along two significant drawbacks. First, the deep network is trained to be invariant to image transformations and artifacts making the evaluation method also insensitive to those distortions. Second, since the deep network is often trained on large scale natural scene datasets (e.g.  ImageNet), applying them to other domains (e.g.  faces, digits) is questionable. Some evaluation methods (e.g.  MS-SSIM odena2016conditional , Birthday Paradox Test) aim to assess the diversity of the generated samples, regardless of the data distribution. While being able to detect severe cases of mode collapse, these methods fall short in measuring how well a generator captures the true data distribution karras2017progressive .

    Quality measures such as nearest neighbor visualizations or rapid categorization tasks may favor models that overfit. Overall, it seems that the main challenge is to have a measure that evaluates both diversity and visual fidelity simultaneously. The former implies that all modes are covered while the latter implies that the generated samples should have high likelihood. Perhaps due to these challenges, Theis et al.  theis2015note argued against evaluating models for task-independent image generation and proposed to evaluate GANs with respect to specific applications. For different applications then, different measures might be more appropriate. For example, the likelihood is good for measuring compression methods theis2017lossy while psychophysics and user ratings are fit for evaluating image reconstruction and synthesis methods ledig2016photo ; gerhard2013sensitive . Some measures are suitable for evaluating generic GANs (when input is a noise vector), while some others are suitable for evaluating conditional GANs (e.g.  FCN score) where correspondences are available (e.g.  generating an image corresponding to a segmentation map).

    Despite having different formulations, several scores are based on similar concepts. C2ST, adversarial accuracy, and classification performance employ classifiers to determine how separable generated images are from real images (on a validation dataset). The FID, Wasserstein, and MMD measures compute the distance between two distributions. The Inception score and its variants, including m-IS and the Mode and AM scores, use conditional and marginal label distributions over generated or real data to evaluate the diversity and fidelity of samples. Average log-likelihood and the coverage metric estimate probability distributions. Reconstruction error and some image quality measures determine how dissimilar generated images are from their corresponding (or closest) images in the training set. Some measures use individual samples (e.g.  IS) while others need pairs of samples (e.g.  MMD). One important concern regarding many measures is that they are sensitive to the choice of the feature space (e.g.  different CNNs) as well as to the type of distance computed in that space.

    Fidelity, diversity, and controllable sampling are the main aspects of a model that a measure should capture. A good score should have well-defined bounds and should also be sensitive to image distortions and transformations (see Figs. 21 and 4). One major problem with image quality measures such as SSIM and PSNR is that they only tap visual fidelity and not the diversity of samples. Humans are also often biased towards the visual quality of generated images and are less affected by a lack of image diversity. On the other hand, some quantitative measures concentrate mostly on evaluating diversity (e.g.  the Birthday Paradox Test) and neglect fidelity. Ideally, a good measure should take both into account.

    Figure 21: Robustness analysis of different GAN evaluation measures to small image transformations (rotations and translations). A good measure is expected to remain constant across all mixes of real and transformed real samples, since the transformations do not alter semantics of the image. Some measures are more susceptible to changes in the pixel space than the convolutional space. Figure from huang2018an .

    Fig. 22 shows a comparison of GAN evaluation measures in terms of sample and computational efficiency. While some measures are practical to compute for a small sample size (about 2000 images), some others (e.g. Wasserstein distance) do not scale to large sample sizes. Please see huang2018an for further details.

Figure 22: Measurement of wall-clock time for computing various measures as a function of the number of samples. As it shows, all measures are practical to compute for a sample of size 2000, but Wasserstein distance does not scale to large sample sizes. Figure from huang2018an .

4 Summary and Future Work

In this work, I provided a critical review of the strengths and limitations of 24 quantitative and 5 qualitative measures that have been introduced so far for evaluating GANs. Seeking appropriate measures for this purpose continues to be an important open problem, not only for fair model comparison but also for understanding, improving, and developing generative models. The lack of a universal, powerful measure can hinder progress. In a recent benchmark study, Lucic et al.  lucic2017gans found no empirical evidence in favor of GAN models that claimed superiority over the original GAN. In this regard, borrowing from other fields such as natural scene statistics and cognitive vision can be rewarding. For example, understanding how humans perceive symmetry driver1992preserved ; funk2017beyond or image clutter rosenholtz2007measuring in generated images versus natural scenes can give clues regarding the plausibility of the generated images.

Ultimately, I suggest the following directions for future research in this area:

  1. creating a code repository of evaluation measures,

  2. conducting detailed comparative empirical and analytical studies of available measures, and

  3. benchmarking models under the same conditions (e.g.  architectures, optimization, hyperparameters, computational budget) using more than one measure.

References

  • (1) A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434, 2015.