Generative models are a fundamental component of a variety of important machine learning and computer vision algorithms. They are increasingly used to estimate the underlying statistical structure of high-dimensional signals and to artificially generate various kinds of data, including high-quality images, videos, and audio. They can be utilized for purposes such as representation learning and semi-supervised learning radford2015unsupervised ; odena2016conditional ; salimans2016improved , domain adaptation ganin2016domain ; tzeng2017adversarial , text-to-image synthesis reed2016generative , compression theis2017lossy ; ledig2016photo , inpainting pathak2016context ; yeh2017semantic , saliency prediction pan2017salgan , image enhancement zhang2017image , style transfer and texture synthesis gatys2016image ; johnson2016perceptual ; isola2017image ; zhu2017unpaired , and video generation and prediction vondrick2016generating . A recent class of generative models known as Generative Adversarial Networks (GANs), introduced by Goodfellow et al. goodfellow2014generative , has attracted much attention. A sizable volume of follow-up papers has been published since the introduction of GANs in 2014. There has been substantial progress in terms of theory and applications, and a large number of GAN variants have been introduced. However, relatively little effort has been spent on evaluating GANs, and grounded ways to quantitatively and qualitatively assess them are still missing.
Generative models can be classified into two broad categories: explicit and implicit approaches. The former class assumes access to the model likelihood function, whereas the latter uses a sampling mechanism to generate data. Examples of explicit models are variational auto-encoders (VAEs) kingma2013auto ; kingma2014semi and PixelCNN van2016conditional . Examples of implicit generative models are GANs. Explicit models are typically trained by maximizing the likelihood or its lower bound. GANs aim to approximate a data distribution $p_{data}$ using a parameterized model distribution $p_g$. They achieve this by jointly optimizing two adversarial networks: a generator and a discriminator. The generator
is trained to synthesize, from a noise vector, an image that is close to the true data distribution. The discriminator is optimized to accurately distinguish between synthesized images coming from the generator and real images from the data distribution. GANs have shown a dramatic ability to generate realistic, high-resolution images.
Several evaluation measures have surfaced with the emergence of new models. Some of them attempt to evaluate models quantitatively, while others emphasize qualitative ways such as user studies or analyzing the internals of models. Both of these approaches have strengths and limitations. For example, one may think that fooling a person into confusing generated images with real ones would be the ultimate test. Such a measure, however, may favor models that concentrate on limited sections of the data (i.e. overfitting or memorizing; low diversity; mode dropping). Quantitative measures, while being less subjective, may not directly correspond to how humans perceive and judge generated images. These, along with other issues such as the variety of probability criteria and the lack of a perceptually meaningful image similarity measure, have made evaluating generative models notoriously difficult theis2015note . In spite of no agreement regarding the best GAN evaluation measure, a few works have already started to benchmark GANs (e.g. lucic2017gans ; kurach2018gan ; shmelkov2018good ). While such studies are indeed helpful, further research is needed to understand GAN evaluation measures and assess their strengths and limitations (e.g. theis2015note ; huang2018an ; arora2017gans ; chen2017metrics ; wu2016quantitative ; anonymous2018an ).
My main goal in this paper is to critically review available GAN measures and help the researchers objectively assess them. At the end, I will offer some suggestions for designing more efficient measures for fair GAN evaluation and comparison.
2 GAN Evaluation Measures
I will enumerate the GAN evaluation measures while discussing their pros and cons. They are organized into two categories: quantitative and qualitative. Notice that some of these measures (e.g. Wasserstein distance, reconstruction error, or SSIM) can also be used for model optimization during training. In the next subsection, I will first provide a set of desired properties for GAN measures (a.k.a. meta-measures or desiderata), followed by an evaluation of whether a given measure, or a family of measures, is compatible with them.
Table 1 shows the list of measures. The majority of the measures return a single value, while a few (GAM im2016generating and NRDS zhang2018decoupled ) perform relative comparison. The rationale behind the latter is that if it is difficult to obtain the perfect measure, we can at least evaluate which model generates better images than the others.
2.1 Desired Properties
Before delving into the explanation of evaluation measures, I first list a number of desired properties that an efficient GAN evaluation measure should fulfill. These properties can serve as meta-measures to evaluate and compare the GAN evaluation measures. Here, I emphasize the qualitative aspects of these measures. As will be discussed in Section 3, some recent works have attempted to compare the meta-measures quantitatively (e.g. the computational complexity of a measure). An efficient GAN evaluation measure should:
favor models that generate high fidelity samples (i.e. ability to distinguish generated samples from real ones; discriminability),
favor models that generate diverse samples (and thus is sensitive to overfitting, mode collapse and mode drop, and can undermine trivial models such as the memory GAN),
favor models with disentangled latent spaces as well as space continuity (a.k.a controllable sampling),
have well-defined bounds (lower, upper, and chance),
be sensitive to image distortions and transformations. GANs are often applied to image datasets where certain transformations of the input do not change semantic meaning. Thus, an ideal measure should be invariant to such transformations. For instance, the score of a generator trained on the CelebA face dataset should not change much if its generated faces are shifted by a few pixels or rotated by a small angle.
agree with human perceptual judgments and human rankings of models, and
have low sample and computational complexity.
2.2 Quantitative Measures
A schematic layout for sample-based GAN evaluation measures is shown in Fig. 1. Some measures discussed in the following are “model agnostic” in that the generator is used as a black box to sample images, and they do not require a density estimate from the model. In contrast, other measures, such as average log-likelihood, demand estimating a probability distribution from samples.
|Quantitative|1. Average Log-likelihood goodfellow2014generative ; theis2015note |Log-likelihood of held-out test data under a density estimated from the generated data (e.g. using KDE or Parzen window estimation).|
| |2. Coverage Metric tolstikhin2017adagan |The probability mass of the true data “covered” by the model distribution.|
| |3. Inception Score (IS) salimans2016improved |KLD between conditional and marginal label distributions over generated data.|
| |4. Modified Inception Score (m-IS) gurumurthy2017deligan |Encourages diversity within images sampled from a particular category.|
| |5. Mode Score (MS) che2016mode |Similar to IS but also takes into account the prior distribution of the labels over real data.|
| |6. AM Score zhou2018activation |Takes into account the KLD between the distributions of training labels vs. predicted labels, as well as the entropy of predictions.|
| |7. Fréchet Inception Distance (FID) heusel2017gans |Wasserstein-2 distance between multivariate Gaussians fitted to data embedded into a feature space.|
| |8. Maximum Mean Discrepancy (MMD) gretton2012kernel |Measures the dissimilarity between two probability distributions using samples drawn independently from each distribution.|
| |9. The Wasserstein Critic arjovsky2017wasserstein |The critic (e.g. a neural network) is trained to produce high values at real samples and low values at generated samples.|
| |10. Birthday Paradox Test arora2017gans |Measures the support size of a discrete (continuous) distribution by counting duplicates (near-duplicates).|
| |11. Classifier Two-sample Test (C2ST) lehmann2006testing |Answers whether two samples are drawn from the same distribution (e.g. by training a binary classifier).|
| |12. Classification Performance radford2015unsupervised ; isola2017image |An indirect technique for evaluating the quality of unsupervised representations (e.g. feature extraction; FCN score). See also the GAN Quality Index (GQI) ye2018gan .|
| |13. Boundary Distortion santurkar2018classification |Measures diversity of generated samples and covariate shift using classification methods.|
| |14. Number of Statistically-Different Bins (NDB) Richardson2018 |Given two sets of samples from the same distribution, the number of samples that fall into a given bin should be the same up to sampling noise.|
| |15. Image Retrieval Performance wang2016ensembles |Measures the distributions of distances to the nearest neighbors of some query images (i.e. diversity).|
| |16. Generative Adversarial Metric (GAM) im2016generating |Compares two GANs by having them engage in a battle against each other, swapping discriminators or generators.|
| |17. Tournament Win Rate and Skill Rating olsson2018skill |Implements a tournament in which a player is either a discriminator that attempts to distinguish between real and fake data or a generator that attempts to fool the discriminators into accepting fake data as real.|
| |18. Normalized Relative Discriminative Score (NRDS) zhang2018decoupled |Compares GANs based on the idea that if the generated samples are closer to real ones, more epochs would be needed to distinguish them from real samples.|
| |19. Adversarial Accuracy and Divergence yang2017lr |Adversarial Accuracy: computes the classification accuracies achieved by two classifiers, one trained on real data and another on generated data, on a labeled validation set. Adversarial Divergence: computes the divergence between the predictive distributions of the two classifiers.|
| |20. Geometry Score khrulkov2018geometry |Compares geometrical properties of the underlying data manifold between real and generated data.|
| |21. Reconstruction Error xiang2017effects |Measures the reconstruction error (e.g. an L2 norm) between a test image and its closest generated image, obtained by optimizing over the latent vector.|
| |22. Image Quality Measures wang2004image ; ridgeway2015learning ; juefei2017gang |Evaluates the quality of generated images using measures such as SSIM, PSNR, and sharpness difference.|
| |23. Low-level Image Statistics zeng2017statistics ; karras2017progressive |Evaluates how similar low-level statistics of generated images are to those of natural scenes in terms of mean power spectrum, distribution of random filter responses, contrast distribution, etc.|
| |24. Precision, Recall and F1 Score lucic2017gans |These measures are used to quantify the degree of overfitting in GANs, often over toy datasets.|
|Qualitative|1. Nearest Neighbors|To detect overfitting, generated samples are shown next to their nearest neighbors in the training set.|
| |2. Rapid Scene Categorization goodfellow2014generative |Participants are asked to distinguish generated samples from real images in a short presentation time (e.g. 100 ms); i.e. real vs. fake.|
| |3. Preference Judgment huang2017stacked ; zhang2017stackgan ; xiao2018generating ; yi2017dualgan |Participants are asked to rank models in terms of the fidelity of their generated images (e.g. in pairs or triples).|
| |4. Mode Drop and Collapse srivastava2017veegan ; lin2017pacgan |Over datasets with known modes (e.g. a GMM or a labeled dataset), the number of covered modes is computed by measuring the distances of generated data to mode centers.|
| |5. Network Internals radford2015unsupervised ; chen2016infogan ; higgins2016beta ; mathieu2016disentangling ; zeiler2014visualizing ; bau2017network |Explores and illustrates the internal representation and dynamics of models (e.g. space continuity), as well as visualizing learned features.|
Average Log-likelihood. Kernel density estimation (KDE, or Parzen window estimation) is a well-established method for estimating the density function of a distribution from samples (each sample is a vector, shown in boldface). For a probability kernel $K$ (most often an isotropic Gaussian) and i.i.d. samples $\mathbf{x}_1, \dots, \mathbf{x}_n$, a density function at $\mathbf{x}$ is defined as $\hat{p}(\mathbf{x}) = \frac{1}{Z} \sum_{i=1}^{n} K(\mathbf{x} - \mathbf{x}_i)$, where $Z$ is a normalizing constant. This allows the use of classical measures such as KLD and JSD (Jensen-Shannon divergence). However, despite its widespread use, its suitability for estimating the density of GANs has been questioned by Theis et al. theis2015note .
Log-likelihood (or, equivalently, Kullback-Leibler divergence) has been the de-facto standard for training and evaluating generative models tolstikhin2017adagan . It measures the likelihood of the true data under the generated distribution, i.e. $\mathbb{E}_{\mathbf{x} \sim p_{data}}[\log p_g(\mathbf{x})]$. Since estimating the likelihood in high dimensions is not feasible, generated samples can be used to infer something about a model’s log-likelihood. The intuition is that a model with maximum likelihood (zero KL divergence) will produce perfect samples.
The Parzen window approach to density estimation works by taking a finite set of samples generated by a model and using them as the centroids of a Gaussian mixture. The constructed Parzen window mixture is then used to compute a log-likelihood score on a set of test examples. Wu et al. wu2016quantitative proposed to use annealed importance sampling (AIS) neal2001annealed to estimate log-likelihoods using a Gaussian observation model with a fixed variance. The key drawback of this approach is the assumption of a Gaussian observation model, which may not work well in high-dimensional spaces. They found that AIS is two orders of magnitude more accurate than KDE, and is accurate enough for comparing generative models.
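As a concrete sketch, the Parzen-window log-likelihood of held-out data can be computed directly from generated samples. The isotropic-Gaussian bandwidth `h` below is an illustrative choice, not a recommended setting; in practice it is tuned on a validation set:

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(test_data, generated, h=0.2):
    """Average log-likelihood of test points under an isotropic-Gaussian
    Parzen window centered on the generated samples."""
    n, d = generated.shape
    # squared distances between every test point and every centroid
    d2 = ((test_data[:, None, :] - generated[None, :, :]) ** 2).sum(-1)
    log_kernel = -d2 / (2 * h ** 2)
    # Gaussian normalizer: n centroids, each with (2*pi*h^2)^(d/2)
    log_norm = np.log(n) + 0.5 * d * np.log(2 * np.pi * h ** 2)
    return float(np.mean(logsumexp(log_kernel, axis=1) - log_norm))
```

Test data drawn from the same distribution as the generated samples should score higher than shifted data, which is exactly the behavior (and the sample-hunger) that theis2015note critiques in high dimensions.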
While likelihood is very intuitive, it suffers from several drawbacks theis2015note :
For a large number of samples, Parzen window estimates fall short in approximating a model’s true log-likelihood when the data dimensionality is high. Even for the fairly low dimensional space of image patches, it requires a very large number of samples to come close to the true log-likelihood of a model. See Fig. 14.B.
Theis et al. showed that the likelihood is generally uninformative about the quality of samples and vice versa; in other words, log-likelihood and sample quality are largely unrelated. A model can have poor log-likelihood and produce great samples, or have great log-likelihood and produce poor samples. An example of the former is a mixture of Gaussian distributions whose means are the training images (i.e. akin to a look-up table). Such a model will generate great samples but will still have very poor log-likelihood. An example of the latter is a mixture composed of a good model with a very low weight and a bad model with a high weight. Such a model has a large average log-likelihood but generates very poor samples (see theis2015note for the proof).
Parzen window estimates of the likelihood produce rankings different from other measures (See Fig. 14.C).
Due to the above issues, it becomes difficult to answer basic questions such as whether GANs are simply memorizing training examples, or whether they are missing important modes of the data distribution. For further discussions on other drawbacks of average likelihood measures consult huszar2015not .
Coverage Metric. Tolstikhin et al. tolstikhin2017adagan proposed to use the probability mass of the real data “covered” by the model distribution $p_{model}$ as a metric. They compute $C := p_{data}(dp_{model} > t)$ with $t$ chosen such that $p_{model}(dp_{model} > t) = 0.95$. A kernel density estimation method was used to approximate the density of $p_{model}$. They claim that this metric is more interpretable than the likelihood, making it easier to assess the difference in performance between algorithms.
Inception Score (IS). Proposed by Salimans et al. salimans2016improved , it is perhaps the most widely adopted score for GAN evaluation (e.g. in fedus2017many ). It uses a neural network pre-trained on ImageNet deng2009imagenet (the Inception Net szegedy2016rethinking ) to capture the desirable properties of generated samples: they should be highly classifiable and diverse with respect to class labels. It measures the average KL divergence between the conditional label distribution $p(y|\mathbf{x})$ of samples (expected to have low entropy for easily classifiable samples; better sample quality) and the marginal distribution $p(y)$ obtained from all samples (expected to have high entropy if all classes are equally represented in the set of samples; high diversity). It favors a low entropy of $p(y|\mathbf{x})$ but a large entropy of $p(y)$.
More precisely,
$$ IS(G) = \exp\big(\mathbb{E}_{\mathbf{x} \sim p_g}\big[KL\big(p(y|\mathbf{x}) \,\|\, p(y)\big)\big]\big) = \exp\big(H(y) - \mathbb{E}_{\mathbf{x} \sim p_g}[H(y|\mathbf{x})]\big), \quad (1) $$
where $p(y|\mathbf{x})$ is the conditional label distribution for image $\mathbf{x}$, estimated using a pretrained Inception model szegedy2016rethinking , and $p(y)$ is the marginal distribution: $p(y) \approx \frac{1}{N} \sum_{i=1}^{N} p(y|\mathbf{x}_i)$. $H(\cdot)$ represents entropy.
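The score is straightforward to compute once the class posteriors are available. A minimal numpy sketch, assuming the softmax outputs of a pretrained classifier have already been collected into an `(N, C)` matrix (in practice these come from the Inception Net; here any classifier's outputs would do):

```python
import numpy as np

def inception_score(pyx, eps=1e-12):
    """pyx: (N, C) array of per-sample class probabilities p(y|x).
    Returns exp(E_x[KL(p(y|x) || p(y))])."""
    py = pyx.mean(axis=0, keepdims=True)  # marginal p(y) over the sample set
    kl = (pyx * (np.log(pyx + eps) - np.log(py + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

Confident and diverse predictions drive the score toward the number of classes, while collapsed predictions (all samples assigned to one class) drive it toward its minimum of 1.0, matching the extreme cases discussed in the text.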
The Inception score shows a reasonable correlation with the quality and diversity of generated images salimans2016improved . IS over real images can serve as the upper bound. Despite these appealing properties, IS has several limitations:
First, similar to log-likelihood, it favors a “memory GAN” that stores all training samples, thus is unable to detect overfitting (i.e. can be fooled by generating centers of data modes yang2017lr ). This is aggravated by the fact that it does not make use of a holdout validation set.
Second, it fails to detect whether a model has been trapped in one bad mode (i.e. it is agnostic to mode collapse). Zhou et al. zhou2018activation , however, show results to the contrary.
Third, since IS uses an Inception model trained on ImageNet, with its many object classes, it may favor models that generate good objects rather than realistic images.
Fourth, IS only considers the generated distribution $p_g$ and ignores the real data distribution. Manipulations such as mixing in natural images from an entirely different distribution could deceive this score. As a result, it may favor models that simply learn sharp and diversified images, instead of models that capture the true data distribution huang2018an (this also applies to the Mode Score).
Fifth, it is an asymmetric measure.
Finally, it is affected by image resolution. See Fig. 2.
Zhou et al. zhou2018activation provide an interesting analysis of the Inception score. They experimentally measured the two components of the IS, the entropy terms in Eq. 1, during training, and showed that $H(y|\mathbf{x})$ behaves as expected (i.e. decreasing) while $H(y)$ does not. See Fig. 3 (top row). They found that CIFAR-10 data are not evenly distributed over the classes under the Inception model trained on ImageNet. See Fig. 3(d). Using the Inception model trained on ImageNet or on CIFAR-10 results in two different values of $H(y)$. Also, the value of $H(y|\mathbf{x})$ varies for each specific sample in the training data (i.e. some images are deemed less real than others). Further, a mode-collapsed generator usually gets a low Inception score (see Fig. 5 in zhou2018activation ), which is a good sign. Theoretically, in an extreme case where all generated samples collapse into a single point (thus $p(y|\mathbf{x}) = p(y)$), the minimal Inception score of 1.0 will be achieved. Despite this, it is believed that the Inception score cannot reliably measure whether a model has collapsed. For example, a class-conditional model that simply memorizes one example per ImageNet class will achieve high IS values. Please refer to barratt2018note for further analysis of the Inception score.
Modified Inception Score (m-IS). The Inception score assigns a higher value to models with a low-entropy class-conditional distribution $p(y|\mathbf{x})$ over all generated data. However, it is also desirable to have diversity within samples of a particular category. To characterize this diversity, Gurumurthy et al. gurumurthy2017deligan suggested a cross-entropy style score $-p(y|\mathbf{x}_i) \log p(y|\mathbf{x}_j)$, where the $\mathbf{x}_j$s are samples of the same class as $\mathbf{x}_i$ according to the Inception model’s outputs. Incorporating this term into the original Inception score results in
$$ \text{m-IS} = \exp\big(\mathbb{E}_{\mathbf{x}_i}\big[\mathbb{E}_{\mathbf{x}_j}\big[KL\big(p(y|\mathbf{x}_i) \,\|\, p(y|\mathbf{x}_j)\big)\big]\big]\big), $$
which is calculated on a per-class basis and then averaged over all classes. Essentially, m-IS can be viewed as a proxy for measuring both intra-class sample diversity and sample quality.
Mode Score. Introduced in che2016mode , this score addresses an important drawback of the Inception score: ignoring the prior distribution of the ground-truth labels (i.e. disregarding the dataset):
$$ \text{MS} = \exp\big(\mathbb{E}_{\mathbf{x}}\big[KL\big(p(y|\mathbf{x}) \,\|\, p(y^{train})\big)\big] - KL\big(p(y) \,\|\, p(y^{train})\big)\big), $$
where $p(y^{train})$ is the empirical label distribution computed from the real training data.
AM Score. Zhou et al. zhou2018activation argue that the entropy term on $p(y)$ in the Inception score is not suitable when the data is not evenly distributed over classes. To take this into account, they proposed to replace it with the KL divergence between $p(y)$ and the empirical label distribution of the training data, $p(y^{train})$. The AM score is then defined as
$$ \text{AM} = KL\big(p(y^{train}) \,\|\, p(y)\big) + \mathbb{E}_{\mathbf{x}}\big[H\big(p(y|\mathbf{x})\big)\big]. $$
The AM score consists of two terms. The first one is minimized when $p(y)$ is close to $p(y^{train})$. The second term is minimized when the predicted class distribution for each sample $\mathbf{x}$ (i.e. $p(y|\mathbf{x})$) has low entropy. Thus, the smaller the AM score, the better.
It has been shown that the Inception score, computed with the Inception model trained on ImageNet, correlates with human evaluation on CIFAR-10. CIFAR-10 data, however, are not evenly distributed under the ImageNet-trained Inception model, so the entropy term on the average distribution in the Inception score may not work well (see Fig. 3). With a pre-trained CIFAR-10 classifier, the AM score can capture the statistics of the average distribution well. Thus, the classifier used in the AM score should be pre-trained on the dataset at hand.
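Under the same assumption of precomputed class posteriors as in the Inception score sketch, the AM score reduces to a few lines; `py_train` is the empirical label distribution of the training data:

```python
import numpy as np

def am_score(pyx_gen, py_train, eps=1e-12):
    """AM score: KL(p(y_train) || p(y)) + E_x[H(p(y|x))].
    pyx_gen: (N, C) class probabilities for generated samples.
    py_train: (C,) empirical label distribution of real training data."""
    py_gen = pyx_gen.mean(axis=0)  # marginal label distribution over samples
    kl = float((py_train * (np.log(py_train + eps) - np.log(py_gen + eps))).sum())
    entropy = float(-(pyx_gen * np.log(pyx_gen + eps)).sum(axis=1).mean())
    return kl + entropy
```

A generator whose confident predictions reproduce the training label distribution gets a score near zero, while one collapsed onto a single class is penalized heavily by the KL term.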
Fréchet Inception Distance (FID). Introduced by Heusel et al. heusel2017gans , FID embeds a set of generated samples into a feature space given by a specific layer of the Inception Net (or any CNN). Viewing the embedding layer as a continuous multivariate Gaussian, the mean and covariance are estimated for both the generated data and the real data. The Fréchet distance between these two Gaussians (a.k.a. Wasserstein-2 distance) is then used to quantify the quality of generated samples:
$$ \text{FID} = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big), $$
where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the means and covariances of the real data and model distributions, respectively. A lower FID means a smaller distance between the synthetic and real data distributions.
FID performs well in terms of discriminability, robustness, and computational efficiency. It appears to be a good measure, even though it only takes into account the first two moments of the distributions. However, it assumes that the features follow a Gaussian distribution, which is often not guaranteed. It has been shown that FID is consistent with human judgments and is more robust to noise than IS heusel2017gans (e.g. there is a negative correlation between FID and the visual quality of generated samples). Unlike IS, however, it is able to detect intra-class mode dropping (on the contrary, Sajjadi et al. sajjadi2018assessing show that FID is sensitive both to the addition of spurious modes and to mode dropping): a model that generates only one image per class can score a high IS but will have a bad FID. Also, unlike IS, FID worsens as various types of artifacts are added to images (see Fig. 4). IS and AM scores measure the diversity and quality of generated samples, while FID measures the distance between the generated and real distributions. An empirical analysis of FID can be found in lucic2017gans . See also liu2018improved for a class-aware version of FID.
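Given precomputed feature embeddings for real and generated images (in practice Inception activations; here any feature matrices), the Fréchet distance itself is a short computation. A sketch assuming the features fit in memory:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two (N, D) feature sets."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary numerical residue
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum()
                 + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical feature sets give a distance of (numerically) zero, and a mean shift between the two sets increases it, reflecting the "lower is better" reading above.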
Maximum Mean Discrepancy (MMD). This measure computes the dissimilarity between two probability distributions $p_r$ and $p_g$ using samples drawn independently from each fortet1953convergence (distinguishing two distributions from finite samples is known as a two-sample test in statistics). A lower MMD hence means that $p_g$ is closer to $p_r$. MMD can be regarded as a two-sample test since, as in the classifier two-sample test, it tests whether one model or another is closer to the true data distribution muandet2017kernel ; sutherland2016generative ; bounliphone2015test . Such hypothesis tests allow choosing one evaluation measure over another.
The kernel MMD gretton2012kernel measures the (squared) MMD between $p_r$ and $p_g$ for some fixed characteristic kernel function $k$ (e.g. the Gaussian kernel $k(\mathbf{x}, \mathbf{x}') = e^{-\|\mathbf{x} - \mathbf{x}'\|^2 / 2\sigma^2}$) as follows (beware that here $\mathbf{y}$ represents a generated sample, not a class label):
$$ M_k(p_r, p_g) = \mathbb{E}_{\mathbf{x}, \mathbf{x}' \sim p_r}\big[k(\mathbf{x}, \mathbf{x}')\big] - 2\,\mathbb{E}_{\mathbf{x} \sim p_r,\, \mathbf{y} \sim p_g}\big[k(\mathbf{x}, \mathbf{y})\big] + \mathbb{E}_{\mathbf{y}, \mathbf{y}' \sim p_g}\big[k(\mathbf{y}, \mathbf{y}')\big]. $$
In practice, finite samples from the two distributions are used to estimate the MMD distance. Given $X = \{\mathbf{x}_1, \dots, \mathbf{x}_m\} \sim p_r$ and $Y = \{\mathbf{y}_1, \dots, \mathbf{y}_n\} \sim p_g$, one estimator of $M_k(p_r, p_g)$ is:
$$ \hat{M}_k(X, Y) = \frac{1}{m(m-1)} \sum_{i \neq i'} k(\mathbf{x}_i, \mathbf{x}_{i'}) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(\mathbf{x}_i, \mathbf{y}_j) + \frac{1}{n(n-1)} \sum_{j \neq j'} k(\mathbf{y}_j, \mathbf{y}_{j'}). $$
Because of the sampling variance, $\hat{M}_k(X, Y)$ may not be zero even when $p_r = p_g$. Li et al. li2017mmd put forth a remedy to address this. Kernel MMD works surprisingly well when it operates in the feature space of a pre-trained CNN. It is able to distinguish generated images from real images, and both its sample complexity and computational complexity are low huang2018an .
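A minimal sketch of the unbiased squared-MMD estimator with a Gaussian kernel (the bandwidth `sigma` is an illustrative choice; in practice it is often set by a median heuristic, and the kernel is applied to CNN features rather than raw pixels):

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of squared MMD between samples X and Y
    under a Gaussian kernel with bandwidth sigma."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    m, n = len(X), len(Y)
    kxx, kyy, kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    # drop the diagonal self-similarity terms for the unbiased within-set sums
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return float(term_x + term_y - 2.0 * kxy.mean())
```

As the text notes, the estimate hovers around zero (possibly slightly negative) for two samples from the same distribution, and grows when the distributions differ.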
Kernel MMD has also been used for training GANs. For example, the Generative Moment Matching Network (GMMN) li2015generative ; dziugaite2015training ; li2017mmd replaces the discriminator in GAN with a two-sample test based on kernel MMD. See also binkowski2018demystifying for more analyses on MMD and its use in GAN training.
The Wasserstein Critic. The Wasserstein critic arjovsky2017wasserstein provides an approximation of the Wasserstein distance between the real data distribution $p_r$ and the generator distribution $p_g$:
$$ W(p_r, p_g) \propto \sup_{\|f\|_L \le 1} \; \mathbb{E}_{\mathbf{x} \sim p_r}\big[f(\mathbf{x})\big] - \mathbb{E}_{\mathbf{x} \sim p_g}\big[f(\mathbf{x})\big], $$
where $f$ is a Lipschitz-continuous function. In practice, the critic is a neural network with clipped weights so that its derivatives are bounded. It is trained to produce high values at real samples and low values at generated samples (i.e. $\hat{W}$ is an approximation):
$$ \hat{W}(X_{test}, X_g) = \frac{1}{N} \sum_{i=1}^{N} \hat{f}(\mathbf{x}_i) - \frac{1}{N} \sum_{i=1}^{N} \hat{f}(\mathbf{x}'_i), $$
where $X_{test} = \{\mathbf{x}_i\}$ is a batch of samples from a test set, $X_g = \{\mathbf{x}'_i\}$ is a batch of generated samples, and $\hat{f}$ is the independent critic. For discrete distributions, the Wasserstein distance is often referred to as the Earth Mover’s Distance (EMD), which intuitively is the minimum mass displacement required to transform one distribution into the other. A variant of this score, known as the sliced Wasserstein distance (SWD), approximates the Wasserstein-1 distance between real and generated images, and is computed as the statistical similarity between local image patches extracted from Laplacian pyramid representations of these images karras2017progressive . We will discuss SWD in more detail later, under scores that utilize low-level image statistics.
This measure addresses both overfitting and mode collapse. If the generator memorizes the training set, the critic trained on test data can distinguish between samples and data. If mode collapse occurs, the critic will have an easy job in distinguishing between data and samples. Further, it does not saturate when the two distributions do not overlap. The magnitude of the distance indicates how easy it is for the critic to distinguish between samples and data.
The Wasserstein distance works well when the base distance is computed in a suitable feature space. A key limitation of this distance is its high sample and time complexity. These make Wasserstein distance less appealing as a practical evaluation measure, compared to other ones (See arora2017gans ).
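Although the critic itself is a trained network, the underlying distance is easy to illustrate in one dimension, where scipy computes it exactly from empirical samples (a toy illustration, not the neural critic):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# 1-D toy example: the EMD grows smoothly with the shift between
# distributions, even when their supports barely overlap (no saturation).
rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=5000)
close_fake = rng.normal(loc=0.1, scale=1.0, size=5000)
far_fake = rng.normal(loc=3.0, scale=1.0, size=5000)

d_close = wasserstein_distance(real, close_fake)
d_far = wasserstein_distance(real, far_fake)
```

For equal-variance 1-D Gaussians the distance is approximately the mean shift, so `d_close` is small and `d_far` is large; this smooth behavior is what makes the distance informative even for non-overlapping distributions.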
Birthday Paradox Test. This test approximates the support size of a discrete distribution (the support of a real-valued function is the subset of the domain containing those elements which are not mapped to zero). Arora and Zhang arora2017gans proposed to use the birthday paradox (with probability at least 50%, a uniform sample with replacement of size about $\sqrt{N}$ from a set of $N$ elements will contain a duplicate) to evaluate GANs as follows:
Pick a sample of size $s$ from the generated distribution
Use an automated measure of image similarity to flag the most similar pairs in the sample (e.g. the 20 closest)
Visually inspect the flagged pairs and check for duplicates
The suggested plan is to manually check for duplicates in a sample of size $s$; if a duplicate exists, the support size is estimated to be about $s^2$. It is not possible to find exact duplicates, as the distribution of generated images is continuous. Instead, a distance measure can be used to find near-duplicates (e.g. using the Euclidean norm). In practice, they first created a candidate pool of potential near-duplicates by choosing the 20 closest pairs according to some heuristic measure, and then visually identified the near-duplicates. Following this procedure with Euclidean distance in pixel space, Arora and Zhang arora2017gans found that, with high probability, a relatively small batch of samples generated from the CelebA dataset liu2015faceattributes contains at least one pair of duplicates for both DCGAN and MIX+DCGAN, indicating a limited support size. The birthday theorem assumes uniform sampling. Arora and Zhang arora2017gans , however, claim that the birthday paradox holds even if data are distributed in a highly nonuniform way. This test can be used to detect mode collapse in GANs.
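The automated near-duplicate step of the test can be sketched as a brute-force pairwise search in pixel space; `k=20` mirrors the candidate-pool size mentioned above, and the returned pairs would then be inspected visually:

```python
import numpy as np

def closest_pairs(batch, k=20):
    """Return the k most similar pairs (i, j, squared Euclidean distance)
    in a batch of generated images, for visual duplicate inspection."""
    flat = batch.reshape(len(batch), -1).astype(float)
    d2 = ((flat[:, None, :] - flat[None, :, :]) ** 2).sum(-1)
    ii, jj = np.triu_indices(len(batch), k=1)  # each unordered pair once
    pair_d2 = d2[ii, jj]
    order = np.argsort(pair_d2)[:k]
    return [(int(ii[t]), int(jj[t]), float(pair_d2[t])) for t in order]
```

A planted exact duplicate surfaces as the top pair with distance zero; in a real run one looks instead for near-zero distances among otherwise dissimilar samples.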
Classifier Two-sample Tests (C2ST). The goal of two-sample tests is to assess whether two samples are drawn from the same distribution lehmann2006testing ; in other words, to decide whether two probability distributions, denoted by $P$ and $Q$, are equal. The generator is evaluated on a held-out test set. This set is split into test-train and test-test subsets. The test-train set is used to train a fresh discriminator, which tries to distinguish generated images from real images. Afterwards, the final score is computed as the performance of this new discriminator on the test-test set and freshly generated images. More formally, assume we have access to two samples $X = \{\mathbf{x}_1, \dots, \mathbf{x}_n\} \sim P$ and $Y = \{\mathbf{y}_1, \dots, \mathbf{y}_m\} \sim Q$. To test whether the null hypothesis $H_0: P = Q$ is true, these five steps need to be completed:
Construct the following dataset: $D = \{(\mathbf{x}_i, 0)\}_{i=1}^{n} \cup \{(\mathbf{y}_j, 1)\}_{j=1}^{m}$.
Randomly shuffle $D$, and split it into two disjoint training and testing subsets $D_{tr}$ and $D_{te}$, where $D = D_{tr} \cup D_{te}$ and $n_{te} := |D_{te}|$.
Train a binary classifier $f$ on $D_{tr}$. In the following, assume that $f(\mathbf{z})$ is an estimate of the conditional probability $p(l = 1 \,|\, \mathbf{z})$.
Calculate the classification accuracy on $D_{te}$,
$$ \hat{t} = \frac{1}{n_{te}} \sum_{(\mathbf{z}, l) \in D_{te}} \mathbb{I}\Big[\mathbb{I}\big(f(\mathbf{z}) > \tfrac{1}{2}\big) = l\Big], $$
as the C2ST statistic, where $\mathbb{I}$ is the indicator function. The intuition here is that if $P = Q$, the test accuracy should remain near chance level. In contrast, if the binary classifier performs better than chance, this implies that $P \neq Q$.
To accept or reject the null hypothesis, compute a $p$-value using the null distribution of the C2ST statistic.
In principle, any binary classifier can be adopted for computing C2ST. Huang et al. huang2018an introduce a variation of this measure based on the 1-Nearest Neighbor (1-NN) classifier. The advantage of using 1-NN over other classifiers is that it requires no special training and little hyperparameter tuning. Given a set of real samples $S_r$ and a set of generated samples $S_g$ of the same size (i.e. $|S_r| = |S_g|$), one can compute the leave-one-out (LOO) accuracy of a 1-NN classifier trained on $S_r \cup S_g$ with positive labels for $S_r$ and negative labels for $S_g$. The LOO accuracy can vary from 0% to 100%. If the GAN memorizes the samples in $S_r$ and regenerates them exactly (i.e. $S_g = S_r$), the accuracy would be 0%, because every sample from $S_r$ would have its nearest neighbor in $S_g$ at zero distance (and vice versa). If the model generates samples that are widely different from the real images (and thus completely separable), the accuracy would be 100%. Notice that chance level here is 50%, which happens when labels are randomly assigned to images. Lopez-Paz and Oquab revisit classifier two-sample tests in lopez2016revisiting .
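The 1-NN two-sample test can be sketched in a few lines of numpy. Note that feeding the real set back in as the "generated" set reproduces the 0% memorization accuracy discussed above:

```python
import numpy as np

def nn_two_sample_accuracy(real, gen):
    """Leave-one-out 1-NN accuracy on real (label 1) vs generated (label 0).
    ~0.5 means the two sets are indistinguishable to a 1-NN classifier."""
    X = np.concatenate([real, gen])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(gen))])
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)  # leave-one-out: a point cannot be its own NN
    pred = y[d2.argmin(axis=1)]   # label of each point's nearest neighbor
    return float((pred == y).mean())
```

Completely separable sets score near 100%, two samples from the same distribution hover around the 50% chance level, and an exact memorizer scores 0%, matching the three regimes described in the text.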
Classifier two-sample tests can be considered a different form of two-sample test from MMD. MMD has the advantage of a U-statistic estimator with a Gaussian asymptotic distribution, while the classifier two-sample test takes a different form. MMD can be better when the U-statistic convergence outweighs the potentially more powerful classifier (e.g. a deep network), while a classifier-based test could be better if the classifier is stronger than the choice of kernel.
Classification Performance. One common indirect technique for evaluating the quality of unsupervised representation learning algorithms is to apply them as feature extractors on labeled datasets and evaluate the performance of linear models fitted on top of the learned features. For example, to evaluate the quality of the representations learned by DCGANs, Radford et al. radford2015unsupervised trained their model on the ImageNet dataset and then used the discriminator’s convolutional features from all layers to train a regularized linear L2-SVM to classify CIFAR-10 images. They achieved 82.8% accuracy, on par with or better than several baselines trained directly on CIFAR-10 data.
A similar strategy has also been followed in evaluating conditional GANs (e.g. the ones proposed for style transfer). For example, an off-the-shelf classifier is utilized by Zhang et al. zhang2016colorful to assess the realism of synthesized images. They fed their fake colorized images to a VGG network that was trained on real color photos. If the classifier performs well, this indicates that the colorizations are accurate enough to be informative about object class. They call this “semantic interpretability”. Similarly, Isola et al. isola2017image proposed the “FCN score” to measure the quality of the generated images conditioned on an input segmentation map. They fed the generated images to the fully-convolutional semantic segmentation network (FCN) long2015fully and then measured the error between the output segmentation map and the ground truth segmentation mask.
Ye et al. ye2018gan proposed an objective measure known as the GAN Quality Index (GQI) to evaluate GANs. First, a generator is trained on a labeled real dataset. Next, a classifier is trained on the real dataset. The generated images are then fed to this classifier to obtain labels. A second classifier, called the GAN-induced classifier, is trained on the generated data. Finally, the GQI is defined as the ratio of the accuracies of the two classifiers: $\text{GQI} = \frac{\text{ACC}(C_{\text{GAN}})}{\text{ACC}(C_{\text{real}})} \times 100$.
GQI is an integer in the range of 0 to 100. Higher GQI means that the GAN distribution better matches the real data distribution.
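As a small illustration of this definition (the accuracies and the helper name are hypothetical):

```python
def gan_quality_index(acc_induced, acc_real):
    """GQI: ratio of the GAN-induced classifier's accuracy to the
    real-data classifier's accuracy, scaled to [0, 100]. Accuracies
    are assumed to lie in (0, 1]."""
    return int(round(100 * acc_induced / acc_real))

# Hypothetical accuracies: 72% on generated data vs. 90% on real data.
print(gan_quality_index(0.72, 0.90))   # -> 80
```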
Data Augmentation Utility: Some works measure the utility of GANs for generating additional training samples. This can be interpreted as a measure of the diversity of the generated images. Similar to Ye et al. ye2018gan , Lesort et al. lesort2018evaluation proposed to use a mixture of real and generated data to train a classifier and then test it on a labeled test dataset. The result is then compared with the score of the same classifier trained on the real training data mixed with noise. Along the same line, recently, Shmelkov et al. shmelkov2018good proposed to compare class-conditional GANs with GAN-train and GAN-test scores using a neural net classifier. GAN-train is a network trained on GAN generated images and is evaluated on real-world images. GAN-test, on the other hand, is the accuracy of a network trained on real images and evaluated on the generated images. They analyzed the diversity of the generated images by evaluating GAN-train accuracy with varying amounts of generated data. The intuition is that a model with low diversity generates redundant samples, and thus increasing the quantity of data generated in this case does not result in better GAN-train accuracy. In contrast, generating more samples from a model with high diversity produces a better GAN-train score.
The above-mentioned measures are indirect and rely heavily on the choice of the classifier. Nonetheless, they are useful for evaluating generative models based on the notion that a better generative model should result in better representations for surrogate tasks (e.g. supervised classification). This, however, does not necessarily imply that the generated images have high diversity.
Boundary Distortion. Santurkar et al. santurkar2018classification aimed to measure the diversity of generated samples using classification methods, focusing on a phenomenon they term boundary distortion. It can be viewed as a form of covariate shift in GANs wherein the generator concentrates a large probability mass on a few modes of the true distribution. It is illustrated using two toy examples in Fig. 5. The first example regards learning a unimodal spherical Gaussian distribution using a vanilla GAN goodfellow2014generative . As can be seen in Fig. 5.A, the spectrum (eigenvalues of the covariance matrix) of GAN data shows a decaying behavior (unlike true data). The second example considers binary classification using logistic regression where the true distribution for each class is a unimodal spherical Gaussian. The synthetic distribution for one of the classes undergoes boundary distortion, which causes a skew between the classifiers trained on true and synthetic data (Fig. 5.B). Naturally, such errors would lead to poor generalization performance on true data as well. Taken together, these examples show that a) boundary distortion is a form of covariate shift that GANs can realistically introduce, and b) this form of diversity loss can be detected and quantified even using classification.
Specifically, Santurkar et al. proposed the following method to measure boundary distortion introduced by a GAN:
Train two separate instances of the given unconditional GAN, one for each class in the true dataset $D$ (assume two classes).
Generate a balanced dataset of size $N$ by drawing $N/2$ samples from each of these GANs.
Train a binary classifier based on the labeled GAN dataset obtained in Step 2 above.
Train an identical classifier (in terms of architecture and hyperparameters) on the true data $D$ for comparison.
Afterwards, the performance of both classifiers is measured on a hold-out set of true data. Performance of the classifier trained on synthetic data on this set acts as a proxy measure for diversity loss through covariate shift. Notice that this measure is akin to the classification performance discussed above.
Number of Statistically-Different Bins (NDB). To measure the diversity of generated samples and mode collapse, Richardson and Weiss Richardson2018 proposed an evaluation method based on the following observation: Given two sets of samples from the same distribution, the number of samples that fall into a given bin should be the same up to sampling noise. More formally, let $I_j(x)$ be the indicator function for bin $j$, i.e. $I_j(x) = 1$ if the sample $x$ falls into bin $j$ and zero otherwise. Let $\{x_i\}$ be $n_1$ samples from distribution $P$ (e.g. training samples) and $\{y_i\}$ be $n_2$ samples from distribution $Q$ (e.g. testing samples); then if $P = Q$, it is expected that $\mathbb{E}[I_j(x)] = \mathbb{E}[I_j(y)]$ for every bin $j$. The pooled sample proportion for bin $j$ over the joined sets, $\hat{p} = (n_1 \hat{p}_1 + n_2 \hat{p}_2)/(n_1 + n_2)$, and its standard error, $\mathrm{SE} = \sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}$,
are calculated. The test statistic is the $z$-score $z = (\hat{p}_1 - \hat{p}_2)/\mathrm{SE}$, where $\hat{p}_1$ and $\hat{p}_2$ are the proportions from each sample that fall into bin $j$. If the resulting significance value is smaller than a threshold (i.e. the significance level), the bin is counted as statistically different. This test is performed on all bins and then the number of statistically-different bins (NDB) is reported.
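The NDB procedure, including the Voronoi-cell binning described below, can be sketched as follows (a self-contained toy version with a crude k-means step; all names and parameter choices are illustrative, not the authors' code):

```python
import numpy as np

def ndb(train, test, k=5, z_thresh=1.96, seed=0):
    """Number of statistically-different bins (sketch). Bins are the
    Voronoi cells of k-means centroids fit on `train`; per-bin
    proportions are compared with a two-sample z-test on the pooled
    proportion (z_thresh = 1.96 ~ two-sided alpha of 0.05)."""
    rng = np.random.default_rng(seed)
    # Crude Lloyd's k-means to keep the sketch dependency-free.
    centroids = train[rng.choice(len(train), k, replace=False)]
    for _ in range(20):
        d = ((train[:, None] - centroids[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        centroids = np.array([train[labels == j].mean(0) if (labels == j).any()
                              else centroids[j] for j in range(k)])

    def proportions(x):
        d = ((x[:, None] - centroids[None]) ** 2).sum(-1)
        return np.bincount(d.argmin(1), minlength=k) / len(x)

    p1, p2 = proportions(train), proportions(test)
    n1, n2 = len(train), len(test)
    p = (p1 * n1 + p2 * n2) / (n1 + n2)            # pooled proportion
    se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # its standard error
    z = np.abs(p1 - p2) / np.maximum(se, 1e-12)
    return int((z > z_thresh).sum())

rng = np.random.default_rng(1)
a = rng.normal(0, 1, size=(500, 2))
b = rng.normal(0, 1, size=(500, 2))      # same distribution as a
c = rng.normal(3, 1, size=(500, 2))      # shifted distribution
print(ndb(a, b), ndb(a, c))              # few vs. many different bins
```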
To perform binning, one option is to use a uniform grid. The drawback here is that in high dimensions, a randomly chosen bin in a uniform grid is very likely to be empty. Richardson and Weiss proposed to use Voronoi cells to guarantee that each bin will contain some samples. Fig. 6 demonstrates this procedure using a toy example. To define the Voronoi cells, training samples are clustered into $K$ clusters using K-means. Each training sample is assigned to one of the cells (bins). Each generated sample is then assigned to the nearest of the $K$ centroids.
Unlike IS and FID, the NDB measure is applied directly to the image pixels rather than pre-learned deep representations. This makes NDB domain-agnostic and sensitive to different image artifacts (as opposed to using pre-trained deep models). One advantage of NDB over MS-SSIM and the Birthday Paradox Test is that NDB offers a measure between the data and generated distributions, rather than just measuring the general diversity of the generated samples. One concern regarding NDB is that using distance in pixel space as a measure of similarity may not be meaningful.
Image Retrieval Performance. Wang et al. wang2016ensembles proposed an image retrieval measure to evaluate GANs. The main idea is to investigate images in the dataset that are badly modeled by a network. Images from a held-out test set as well as generated images are represented using a discriminatively trained CNN lecun1998gradient . The nearest neighbors of generated images in the test dataset are then retrieved. To evaluate the quality of the retrieval results, they proposed two measures:
Measure 1: Consider $d_m(i)$ to be the distance of the nearest image generated by method $m$ to test image $i$, and $D_m$ the set of $k$-nearest distances over all test images ($k$ is often set to 1). The Wilcoxon signed-rank test is then used to test the hypothesis that the median of the difference between the two nearest-distance distributions produced by two generators is zero, in which case the generators are equally good. If they are not equal, the test can be used to assess which method is statistically better.
Measure 2: Consider the distribution of the nearest distances of the train images to the test dataset. Since train and test sets are drawn from the same dataset, this distribution can be considered the optimal distribution that a generator could attain (assuming it generates as many images as are present in the train set). To model the difference with this ideal distribution, the relative increase in mean nearest-neighbor distance is computed, where the mean is taken over the test dataset. As an example, a value of 0.1 for a model means that the average distance to the nearest neighbor of a query image is 10% higher than for data drawn from the real distribution.
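A toy version of this nearest-neighbor comparison might look as follows (the helper names and the exact normalization are assumptions for illustration, not the authors' formula):

```python
import numpy as np

def mean_nn_distance(queries, reference):
    """Mean Euclidean distance from each query to its nearest
    neighbor in `reference`."""
    d = np.sqrt(((queries[:, None] - reference[None]) ** 2).sum(-1))
    return d.min(axis=1).mean()

def relative_nn_increase(test, generated, train):
    """Relative increase of the mean nearest-neighbor distance from
    test images to generated images, over the train-to-test baseline
    (sketch of Measure 2; e.g. 0.10 = 10% above the ideal)."""
    return mean_nn_distance(test, generated) / mean_nn_distance(test, train) - 1.0

rng = np.random.default_rng(0)
train = rng.normal(0, 1, size=(200, 2))
test = rng.normal(0, 1, size=(100, 2))
good = rng.normal(0, 1, size=(200, 2))   # matches the real distribution
bad = rng.normal(0, 2, size=(200, 2))    # mismatched spread
print(relative_nn_increase(test, good, train))  # near zero
print(relative_nn_increase(test, bad, train))   # clearly larger
```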
Generative Adversarial Metric (GAM). Im et al. im2016generating proposed to compare two GANs by having them engage in a battle against each other by swapping discriminators or generators across the two models (See Fig. 7). GAM measures the relative performance of two GANs via a likelihood ratio. Consider two GANs with their respective trained partners, $M_1 = (G_1, D_1)$ and $M_2 = (G_2, D_2)$, where $G_1$ and $G_2$ are the generators and $D_1$ and $D_2$ are the discriminators. The hypothesis is that $M_1$ is better than $M_2$ if $G_1$ fools $D_2$ more than $G_2$ fools $D_1$, and vice versa. The likelihood ratio is defined over the swapped pairs $(G_1, D_2)$ and $(G_2, D_1)$, in terms of how likely each discriminator is to judge the opponent generator's samples as real.
One can thus measure which generator fools the opponent's discriminator more. To declare a winner or a tie, Im et al. proposed a sample ratio test.
A variation of GAM known as generative multi-adversarial metric (GMAM), that is amenable to training with multiple discriminators, has been proposed in durugkar2016generative .
GAM suffers from two main caveats: a) it has a constraint that the two discriminators must have approximately similar performance on a calibration dataset, which can be difficult to satisfy in practice, and b) it is expensive to compute because it has to be computed for all pairs of models (i.e. pairwise comparisons between independently trained GANs).
Tournament Win Rate and Skill Rating. Inspired by GAM and GMAM scores (mentioned above) as well as skill rating systems in games such as chess or tennis, Olsson et al. olsson2018skill utilized tournaments between generators and discriminators for GAN evaluation. They introduced two methods for summarizing tournament outcomes: tournament win rate and skill rating. Evaluations are useful in different contexts, including a) monitoring the progress of a single model as it learns during the training process, and b) comparing the capabilities of two different fully trained models. The former regards a single model playing against past and future versions of itself producing a useful measure of training progress (a.k.a within trajectory tournament). The latter regards multiple separate models (using different seeds, hyperparameters, and architectures) and provides a useful relative comparison between different trained GANs (a.k.a multiple trajectory tournament). Each player in a tournament is either a discriminator that attempts to distinguish between real and fake data or a generator that attempts to fool the discriminators into accepting fake data as real.
Tournament Win Rate: To determine the outcome of a match between discriminator $D$ and generator $G$, the discriminator judges two batches: one batch of samples from generator $G$, and one batch of real data. Every sample that is not judged correctly by the discriminator (a generated sample judged real, or a real sample judged fake) counts as a win for the generator and is used to compute its win rate. A match win rate of 50% for $G$ means that $D$'s performance against $G$ is no better than chance. The tournament win rate for generator $G$ is computed as its average win rate over all discriminators in the tournament. Tournament win rates are interpretable only within the context of the tournament they were produced from, and cannot be directly compared with those from other tournaments.
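A single match outcome can be scored as follows (a sketch; the batch contents are hypothetical):

```python
import numpy as np

def match_win_rate(d_says_real_on_fake, d_says_real_on_real):
    """Generator win rate in one match: the fraction of samples the
    discriminator misjudges (fake judged real, or real judged fake).
    Inputs are boolean arrays: True = 'discriminator says real'."""
    errors = np.concatenate([d_says_real_on_fake, ~d_says_real_on_real])
    return errors.mean()

# Hypothetical match with batch size 64: D says "real" for 40/64 fake
# samples and for 60/64 real samples.
fake = np.array([True] * 40 + [False] * 24)
real = np.array([True] * 60 + [False] * 4)
print(match_win_rate(fake, real))   # (40 + 4) / 128 misjudged samples
```

The tournament win rate is then the average of such match win rates over all discriminators in the tournament.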
Olsson et al. ran a tournament between 20 saved checkpoints of discriminators and generators from the same training run of a DCGAN trained on SVHN netzer2011reading using an evaluation batch size of 64. Fig. 8.A shows the raw tournament outcomes from the within-trajectory tournament, alongside the same tournament outcomes summarized using tournament win rate and skill rating, as well as the SVHN classifier score and SVHN Fréchet distance computed from 10,000 samples, for comparison (to compute these scores, a pre-trained SVHN classifier is used rather than an ImageNet classifier). It shows that tournament win rate and skill rating both provide a measure of training progress comparable to the SVHN classifier score.
Skill Rating: Here the idea is to use a skill rating system to summarize tournament outcomes in a way that takes into account the amount of new information each match provides. Olsson et al. used the Glicko2 system glickman1995comprehensive . In a nutshell, a player’s skill rating is represented as a Gaussian distribution, with a mean and standard deviation, representing the current state of the evidence about their “true” skill rating. See olsson2018skill for details of the algorithm.
Olsson et al. constructed a tournament from saved snapshots from six SVHN GANs that differ slightly from one another, including different loss functions and architectures. They included 20 saved checkpoints of discriminators and generators from each GAN experiment, a single snapshot of 6-auto, and a generator player that produces batches of real data as a benchmark. Fig. 8.B shows the results compared to Inception and Fréchet distances.
One advantage of these scores is that they are not limited to fixed feature sets: players can learn to attend to any features that are useful to win. Another advantage is that human judges are eligible to play as discriminators and could participate to receive a skill rating. This allows a principled method to incorporate human perceptual judgments in model evaluation. The downside is that they provide a relative rather than an absolute score of a model's ability, which makes reproducing results challenging and expensive.
Normalized Relative Discriminative Score (NRDS). The main idea behind this measure, proposed by Zhang et al. zhang2018decoupled , is that more epochs would be needed to distinguish good generated samples from real samples (compared to separating poor ones from real samples). They used a binary classifier (discriminator) to separate the real samples from the fake ones generated by all the models in comparison. In each training epoch, the discriminator’s output for each sample is recorded. The average discriminator output on real samples will increase with epochs (approaching 1), while that on generated samples from each model will decrease (approaching 0). However, the decrement rate of each model varies based on how close the generated samples are to the real ones. Samples closer to the real ones show a slower decrement rate, whereas poor samples show a faster one. Therefore, comparing the decrement rates of the models indicates how well each performs relative to the others.
There are three steps to compute the NRDS:
Obtain the curve $C_i$ of the discriminator's average output versus epoch (or mini-batch) for each model $i$ (assuming $n$ models in comparison) during training,
Compute the area under each curve (as the decrement rate), and
Compute the NRDS of the $i$-th model by normalizing its area by the sum over all models: $\mathrm{NRDS}_i = A(C_i) / \sum_{j=1}^{n} A(C_j)$.
The higher the NRDS, the better. Fig. 9 illustrates the computation of NRDS over a toy example.
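The three steps can be sketched as follows (the exponential curves are synthetic stand-ins for recorded discriminator outputs):

```python
import numpy as np

def nrds(curves):
    """Normalized Relative Discriminative Score (sketch): the area under
    each model's average-discriminator-output-vs-epoch curve, normalized
    by the sum of the areas over all models in the comparison."""
    areas = np.array([np.asarray(c).sum() for c in curves])  # Riemann sum
    return areas / areas.sum()

epochs = np.arange(50)
slow = np.exp(-0.02 * epochs)   # decays slowly: samples close to real
fast = np.exp(-0.20 * epochs)   # decays fast: easily separated samples
scores = nrds([slow, fast])
print(scores)                   # the slowly-decaying model scores higher
```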
Adversarial Accuracy and Adversarial Divergence. Yang et al. yang2017lr proposed two measures based on the intuition that a sufficient, but not necessary, condition for the closeness of the generated data distribution and the real data distribution is the closeness of $p_g(x \mid y)$ and $p_r(x \mid y)$, i.e. the distributions of generated data and real data conditioned on all possible values of a variable of interest $y$, e.g. category labels. One way to obtain the variable of interest is by asking human participants to annotate the images (sampled from both distributions).
Since it is not feasible to directly compare $p_g(x \mid y)$ and $p_r(x \mid y)$, they proposed to compare $p_g(y \mid x)$ and $p_r(y \mid x)$ instead (following Bayes' rule), which is a much easier task. Two classifiers are then trained from human annotations to approximate $p_g(y \mid x)$ and $p_r(y \mid x)$ for different categories. These classifiers are used to compute the following evaluation measures:
Adversarial Accuracy: Computes the classification accuracies achieved by the two classifiers on a validation set (i.e. another set of real images). If $p_g(y \mid x)$ is close to $p_r(y \mid x)$, then similar accuracies are expected.
Adversarial Divergence: Computes the KL divergence between $p_g(y \mid x)$ and $p_r(y \mid x)$. The lower the adversarial divergence, the closer the two distributions. The lower bound for this measure is exactly zero, which means $p_g(y \mid x) = p_r(y \mid x)$ for all samples in the validation set.
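Given per-sample class posteriors from the two classifiers, the divergence reduces to an average KL term (a sketch with made-up posteriors):

```python
import numpy as np

def adversarial_divergence(p_real, p_gen):
    """Mean KL divergence KL(p_r(y|x) || p_g(y|x)) over a validation
    set, given per-sample class posteriors (rows sum to 1) from the
    two classifiers. A sketch, not the authors' implementation."""
    p_real, p_gen = np.asarray(p_real), np.asarray(p_gen)
    kl = (p_real * np.log(p_real / p_gen)).sum(axis=1)  # per-sample KL
    return kl.mean()

# Two validation samples, two classes; identical posteriors give 0.
same = np.array([[0.7, 0.3], [0.2, 0.8]])
print(adversarial_divergence(same, same))        # -> 0.0
print(adversarial_divergence(same, same[::-1]))  # mismatched -> positive
```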
One drawback of these measures is that a lot of human effort is needed to label the real and generated samples. To mitigate this, Yang et al. yang2017lr first trained one generator per category using a labeled training set and then generated samples from all categories. Notice that these measures overlap with classification performance discussed above.
Geometry Score. Khrulkov and Oseledets khrulkov2018geometry proposed to compare geometrical properties of the underlying data manifold between real and generated data. This score, however, involves many technical details, making it hard to understand and compute. Here, we provide an intuitive description.
The core idea is to build a simplicial complex from data using proximity information (e.g. pairwise distances between samples). To investigate the structure of the manifold, a proximity threshold is varied and the generated simplices are added into the approximation. An example is shown in Fig. 10.A. For each value of the threshold, topological properties of the corresponding simplicial complex, namely homologies, are computed. A homology encodes the number of holes of various dimensions in a space. Eventually, a barcode (signature) is constructed reflecting how long the generated holes (homologies) persist across simplicial complexes (Fig. 10.B). In general, to find the rank of a $k$-homology (i.e. the number of $k$-dimensional holes) at some fixed threshold value, one counts the intersections of the vertical line at that value with the intervals in the corresponding block of the barcode.
Since computing the barcode using all data is intractable, in practice subsets of the data (e.g. obtained by randomly selecting points) are often used. For each subset, the Relative Living Time (RLT) of each number of holes is computed, defined as the fraction of the threshold range over which that number of holes was present, up to the value at which the points connect into a single blob. The RLTs over random subsets are then averaged to give the Mean Relative Living Times (MRLT). By construction, they add up to 1. To quantitatively evaluate the topological difference between two datasets, the distance between these distributions is computed.
Fig. 10.C shows an example over synthetic data. Intuitively, the value at location $i$ in the bar chart indicates the fraction of time during which exactly $i$ one-dimensional holes existed as the threshold was varied. For example, in the leftmost histogram, zero, two, or three 1D holes were almost never observed, and most of the time only one hole appeared. Similarly, for the 4th pattern from the left, most of the time one 1D hole is observed. Comparing the MRLTs of the patterns with that of the ground-truth pattern (the leftmost one) reveals that this pattern is indeed the closest to the ground truth.
Fig. 10.D shows a comparison of two GANs, WGAN arjovsky2017wasserstein and WGAN-GP gulrajani2017improved , over the MNIST dataset using the method above, over single digits and the entire dataset. It shows that both models produce distributions that are very close to the ground truth, but for almost all classes WGAN-GP shows better performance.
The geometry score does not use auxiliary networks and is not limited to visual data. However, since it only takes topological properties into account (which do not change if, for example, the entire dataset is shifted by 1), assessing the visual quality of samples may be difficult based only on this score. Due to this, the authors propose to use this score in conjunction with other measures such as FID when dealing with natural images.
Reconstruction Error. For many generative models, the reconstruction error on the training set is often explicitly optimized (e.g.
Variational Autoencoders ledig2016photo ). It is therefore natural to evaluate generative models using a reconstruction error measure (e.g. the $L_2$ norm) computed on a test set. In the case of GANs, given a generator $G$ and a set of test samples $X_{\text{test}}$, the reconstruction error of $G$ on $X_{\text{test}}$ is defined as the error of the best latent code for each sample, i.e. $E(G, X_{\text{test}}) = \sum_{x \in X_{\text{test}}} \min_{z} \| G(z) - x \|^2$.
Since it is not possible to directly infer the optimal $z$ from $x$, Xiang and Li xiang2017effects used the following alternative method. Starting from an all-zero vector, they performed gradient descent on the latent code to find the one that minimizes the $L_2$ norm between the sample generated from the code and the target. Since the code is optimized instead of being computed by a feed-forward network, the evaluation process is time-consuming. Thus, when monitoring the training process, they avoided performing this evaluation at every training iteration and used only a reduced number of samples and gradient descent steps. Only for the final trained model did they perform an extensive evaluation on a larger test set, with a larger number of steps.
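The latent-code optimization can be illustrated with a toy linear generator, for which the gradient is available in closed form (everything here is a stand-in for a trained GAN, not the cited implementation):

```python
import numpy as np

def reconstruct_error(W, x, steps=5000, lr=0.02):
    """Reconstruction error of target x under a toy *linear* generator
    G(z) = W @ z: starting from an all-zero latent code, run gradient
    descent on z to minimize ||G(z) - x||^2, then report the residual
    L2 error."""
    z = np.zeros(W.shape[1])
    for _ in range(steps):
        residual = W @ z - x
        z -= lr * 2 * W.T @ residual     # gradient of the squared error
    return np.linalg.norm(W @ z - x)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))              # 3-D latent code, 8-D "images"
x_in_range = W @ rng.normal(size=3)      # lies exactly in G's range
x_generic = rng.normal(size=8)           # generic target off the range
print(reconstruct_error(W, x_in_range))  # close to zero
print(reconstruct_error(W, x_generic))   # leaves a nonzero residual
```

For a real GAN the gradient with respect to z is obtained by backpropagation through the generator, and the number of steps trades off accuracy against evaluation time, as discussed above.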
Image Quality Measures (SSIM, PSNR and Sharpness Difference). Some researchers have proposed to use measures from the image quality assessment literature for training and evaluating GANs. They are explained next.
The single-scale SSIM measure wang2004image is a well-characterized perceptual similarity measure that aims to discount aspects of an image that are not important for human perception. It compares corresponding pixels and their neighborhoods in two images, denoted by $x$ and $y$, using three quantities: luminance ($l$), contrast ($c$), and structure ($s$):
$$l(x,y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \quad c(x,y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \quad s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}.$$
The variables $\mu_x$, $\mu_y$, $\sigma_x$, and $\sigma_y$ denote the means and standard deviations of pixel intensity in a local image patch centered at either $x$ or $y$ (typically a square neighborhood of 5 pixels). The variable $\sigma_{xy}$ denotes the sample correlation coefficient between corresponding pixels in the patches centered at $x$ and $y$. The constants $C_1$, $C_2$, and $C_3$ are small values added for numerical stability. The three quantities are combined to form the SSIM score: $\text{SSIM}(x,y) = l(x,y)\, c(x,y)\, s(x,y)$.
SSIM assumes a fixed image sampling density and viewing distance. A variant of SSIM operates at multiple scales. The input images $x$ and $y$ are iteratively downsampled by a factor of 2 with a low-pass filter, with scale $j$ denoting the original images downsampled by a factor of $2^{j-1}$. The contrast and structure components are applied at all scales. The luminance component is applied only to the coarsest scale, denoted $M$. Further, the contrast and structure components can be weighted at each scale. The final measure is: $\text{MS-SSIM}(x,y) = l_M(x,y)^{\alpha_M} \prod_{j=1}^{M} c_j(x,y)^{\beta_j}\, s_j(x,y)^{\gamma_j}$.
MS-SSIM ranges between 0 (low similarity) and 1 (high similarity). Snell et al. ridgeway2015learning defined a loss function for training GANs based on the sum of structural-similarity scores over all image pixels,
where $X$ and $Y$ are the original and reconstructed images and the sum runs over an index $i$ of image pixels. This loss function has a simple analytical derivative wang2008maximum , which allows performing gradient descent. See Fig. 17 for more details.
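The single-patch SSIM computation can be sketched directly from the definitions above (using the common simplification that folds the structure term into the contrast term when $C_3 = C_2/2$; intensities are assumed scaled to [0, 1]):

```python
import numpy as np

def ssim_patch(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-patch SSIM between two equally sized patches with
    intensities in [0, 1]; the structure term is folded into the
    contrast term (the standard C3 = C2/2 simplification)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    lum = (2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)   # luminance
    cs = (2 * cov + c2) / (vx + vy + c2)                  # contrast-structure
    return lum * cs

rng = np.random.default_rng(0)
patch = rng.random((5, 5))
print(ssim_patch(patch, patch))          # identical patches -> 1.0
print(ssim_patch(patch, 1.0 - patch))    # inverted patch -> negative
```

Library implementations (e.g. in image-processing toolkits) additionally slide a window over the full image and average the per-window scores.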
PSNR measures the peak signal-to-noise ratio between two monochrome images to assess the quality of a generated image compared to its corresponding real image (e.g. for evaluating conditional GANs krishnaGAN ). The higher the PSNR (in dB), the better the quality of the generated image. It is computed as: $\text{PSNR} = 10 \log_{10}\!\left( \frac{\text{MAX}_I^2}{\text{MSE}} \right)$,
where $\text{MSE}$ is the mean squared error between the two images and $\text{MAX}_I$ is the maximum possible pixel value of the image (e.g. 255 for an 8-bit representation). This score can be used when a reference image is available, for example in training conditional GANs using paired data (e.g. isola2017image ; krishnaGAN ).
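A minimal PSNR implementation following this definition:

```python
import numpy as np

def psnr(ref, img, max_val=255.0):
    """PSNR in dB between a reference image and a generated one;
    identical images give infinite PSNR."""
    mse = ((ref.astype(float) - img.astype(float)) ** 2).mean()
    return np.inf if mse == 0 else 10 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(32, 32))              # 8-bit "image"
noisy = np.clip(ref + rng.normal(0, 5, ref.shape), 0, 255)
print(psnr(ref, noisy))   # around 34 dB for sigma = 5 Gaussian noise
```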
Sharpness Difference (SD) measures the loss of sharpness during image generation. It is computed similarly to PSNR, but over differences between the gradients of the generated and reference images rather than raw intensities.
Odena et al. odena2016conditional used (or "abused", since the original MS-SSIM measure is intended to measure the similarity of an image with respect to a reference image) MS-SSIM to evaluate the diversity of generated images. The intuition is that image pairs with a higher MS-SSIM seem more similar than pairs with a lower MS-SSIM. They measured the MS-SSIM scores of 100 randomly chosen pairs of images within a given class. The higher (lower) the diversity within a class, the lower (higher) the mean MS-SSIM score (See Fig. 11.A). Training images from the ImageNet training data contain a variety of mean MS-SSIM scores across the classes, indicating the variability of image diversity in ImageNet classes. Fig. 11.B plots the mean MS-SSIM values for image samples versus training data for each class (after training was completed). It shows that 847 classes, out of 1000, have mean sample MS-SSIM scores below that of the maximum MS-SSIM for the training data. To identify whether the generator in AC-GAN odena2016conditional collapses during training, Odena et al. tracked the mean MS-SSIM score for all 1000 ImageNet classes (Fig. 11.C). Fig. 11.D shows the joint distribution of Inception accuracies versus MS-SSIM across all 1000 classes. It shows that Inception score and MS-SSIM are anti-correlated (correlation of −0.16).
Juefei-Xu et al. juefei2017gang used the SSIM and PSNR measures to evaluate GANs on image completion tasks. The advantage here is that a 1-vs-1 comparison between the ground truth and the completed image allows very straightforward visual examination of GAN quality. It also allows head-to-head comparison between various GANs. In addition to the above-mentioned image quality measures, some other measures such as the Universal Quality Index (UQI) WangBovik and Visual Information Fidelity (VIF) sheikh2006image have also been adopted for assessing the quality of synthesized images. It has been reported that MS-SSIM finds large-scale mode collapses reliably but fails to diagnose smaller effects such as loss of variation in colors or textures. Its drawback is that it does not directly assess image quality in terms of similarity to the training set odena2016conditional .
Low-level Image Statistics. Natural scenes make up only a tiny fraction of the space of all possible images and have certain characteristic statistics (e.g. geisler2008visual ; simoncelli2001natural ; torralba2003statistics ; ruderman1994statistics ). It has been shown that statistics of natural images remain the same when the images are scaled (i.e. scale invariance) srivastava2003advances ; zhu2003statistical . The average power spectrum magnitude over natural images falls with spatial frequency $f$ approximately as $1/f^2$ deriugin1956power ; cohen1975image ; burton1987color ; field1987relations . Another important property of natural image statistics is non-Gaussianity srivastava2003advances ; zhu2003statistical ; wainwright1999scale . This means that the marginal distribution of almost any zero-mean linear filter response on virtually any dataset of images is sharply peaked at zero, with heavy tails and high kurtosis (greater than 3) lee2001occlusion . Recent studies have shown that the contrast statistics of the majority of natural images follow a Weibull distribution ghebreab2009biologically .
Zeng et al. zeng2017statistics proposed to evaluate generative models in terms of the low-level statistics of their generated images with respect to natural scenes. They considered four statistics: 1) the mean power spectrum, 2) the number of connected components in a given image area, 3) the distribution of random filter responses, and 4) the contrast distribution. Their results show that although images generated by DCGAN radford2015unsupervised , WGAN arjovsky2017wasserstein , and VAE kingma2013auto resemble natural scenes in terms of low-level statistics, there are still significant differences. For example, generated images do not have a scale-invariant mean power spectrum magnitude, which indicates the existence of extra structure in these images caused by deconvolution operations.
Low-level image statistics can be used for regularizing GANs by optimizing the discriminator to inspect whether the generator's output matches the expected statistics of the real samples (a.k.a. feature matching salimans2016improved ) using the loss function $\| \mathbb{E}_{x} f(x) - \mathbb{E}_{z} f(G(z)) \|_2^2$, where $f(x)$ represents the statistics of features. Karras et al. karras2017progressive investigated the multi-scale statistical similarities between distributions of local image patches drawn from the Laplacian pyramid burt1987laplacian representations of generated and real images. They used the Wasserstein distance to compare the distributions of patches (this measure is known as the sliced Wasserstein distance, SWD). The multi-scale pyramid allows a detailed comparison of statistics. The distance between the patch sets extracted from the lowest resolution indicates similarity in large-scale image structures, while the finest-level patches encode information about pixel-level attributes such as sharpness of edges and noise.
Precision, Recall and $F_1$ Score. Lucic et al. lucic2017gans proposed to compute precision, recall, and $F_1$ score to quantify the degree of overfitting in GANs. Intuitively, precision measures the quality of the generated samples, whereas recall measures the proportion of the reference distribution covered by the learned distribution. They argue that IS only captures precision, as it does not penalize a model for not producing all modes of the data distribution; rather, it only penalizes the model for not producing all classes. The FID score, on the other hand, captures both precision and recall.
To approximate these scores for a model, Lucic et al. proposed to use toy datasets for which the data manifold is known and distances of generated samples to the manifold can be computed. An example of such a dataset is the manifold of convex shapes (See Fig. 12). To compute these scores, first the latent representation of each test sample is estimated, through gradient descent, by inverting the generator $G$. Precision is defined as the fraction of the generated samples whose distance to the manifold is below a certain threshold. Recall, on the other hand, is given by the fraction of test samples whose distance to the model's manifold is below the threshold. If the samples from the model distribution are (on average) close to the data manifold (see lucic2017gans for details), its precision is high. Similarly, high recall implies that the generator can recover (i.e. generate something close to) any sample from the data manifold, thus capturing most of the manifold.
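The thresholding step can be sketched as follows (the distance values are made up; in the actual method they come from projecting samples onto the known toy manifold):

```python
import numpy as np

def precision_recall(dist_gen_to_manifold, dist_test_to_model, threshold):
    """Manifold-based precision/recall (sketch): precision is the fraction
    of generated samples within `threshold` of the data manifold; recall
    is the fraction of test samples within `threshold` of the model."""
    precision = (np.asarray(dist_gen_to_manifold) <= threshold).mean()
    recall = (np.asarray(dist_test_to_model) <= threshold).mean()
    return precision, recall

# Hypothetical distances: generated samples sit close to the data manifold
# (high precision) but many test samples are far from the model's outputs
# (low recall) -- the signature of mode dropping.
p, r = precision_recall([0.01, 0.02, 0.03, 0.5], [0.01, 0.9, 0.8, 0.7], 0.1)
print(p, r)   # 0.75 0.25
```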
The major drawback of these scores is that they are impractical for real images where the data manifold is unknown, and their use is limited to evaluations on synthetic data. In a recent effort, Sajjadi et al. sajjadi2018assessing introduced a novel definition of precision and recall to address this limitation.
2.3 Qualitative Measures
Visual examination of samples by human raters is one of the most common and intuitive ways to evaluate GANs (e.g. denton2015deep ; salimans2016improved ; millerhuman ). While it greatly helps inspect and tune models, it suffers from the following drawbacks. First, evaluating the quality of generated images with human vision is expensive and cumbersome, biased (e.g. it depends on the structure and pay of the task and the community reputation of the experimenter, etc. in crowdsourcing setups silberman2015stop ), difficult to reproduce, and does not fully reflect the capacity of models. Second, human inspectors may have high variance, which makes it necessary to average over a large number of subjects. Third, an evaluation based on samples could be biased towards models that overfit, and is therefore a poor indicator of a good density model in a log-likelihood sense theis2015note . For instance, it fails to tell whether a model drops modes. In fact, mode dropping generally helps visual sample quality, as the model can choose to focus on only a few common modes that correspond to typical samples.
In what follows, I discuss the approaches that have been used in the literature to qualitatively inspect the quality of images generated by a model and to explore its learned latent space.
Nearest Neighbors. To detect overfitting, traditionally some samples are shown next to their nearest neighbors in the training set (e.g. Fig. 13). There are, however, two concerns regarding this manner of evaluation:
Nearest neighbors are typically determined based on the Euclidean distance, which is very sensitive to minor perceptual perturbations. This is a well-known phenomenon in the psychophysics literature (see Wang and Bovik wang2009mean ). It is trivial to generate samples that are visually almost identical to a training image yet have a large Euclidean distance from it theis2015note . See Fig. 14.A for some examples.
A model that stores (transformed) training images (i.e. memory GAN) can trivially pass the nearest-neighbor overfitting test theis2015note . This problem can be alleviated by choosing nearest neighbors based on perceptual measures, and by showing more than one nearest neighbor.
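The first concern is easy to demonstrate numerically: for an image with high-frequency content, a one-pixel shift (perceptually negligible) can yield a larger Euclidean distance than a completely flat gray image. A minimal sketch with a random-texture stand-in image:

```python
import numpy as np

rng = np.random.default_rng(1)
# A 64x64 "image" with strong high-frequency content.
img = rng.random((64, 64))
# Perceptually near-identical copy: the same image shifted by one pixel.
shifted = np.roll(img, shift=1, axis=1)
# A plainly different image: a flat mid-gray.
gray = np.full_like(img, 0.5)

d_shift = float(np.linalg.norm(img - shifted))
d_gray = float(np.linalg.norm(img - gray))
# Euclidean distance judges the near-duplicate to be *farther* from the
# original than the featureless gray image.
```

This is why perceptual distances (e.g. in a deep feature space) are preferable to raw pixel-space Euclidean distance when picking nearest neighbors.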
Figure 13: Generated samples nearest to real images from CIFAR-10. In each of the two panels, the first column shows real images, followed by the nearest image generated by DCGAN radford2015unsupervised , ALI dumoulin2016adversarially , Unrolled GAN metz2016unrolled , and VEEGAN srivastava2017veegan , respectively. Figure compiled from srivastava2017veegan .
Rapid Scene Categorization. These measures are inspired by prior studies showing that humans can report certain characteristics of scenes in a short glance (e.g. scene category, visual layout oliva2005gist ; serre2007feedforward ). To obtain a quantitative measure of the quality of samples, Denton et al. denton2015deep asked volunteers to distinguish their generated samples from real images. The subjects were presented with the user interface shown in Fig. 15(right) and were asked to click the appropriate button to indicate whether they believed the image was real or generated. They varied the viewing time from 50ms to 2000ms (11 durations). Fig. 15(left) shows the results over samples generated by three GAN models. They concluded that their model was better than the original GAN goodfellow2014generative since it did better at fooling the subjects (the lower bound here is 0% and the upper bound is 100%). See also Fig. 16 for another example of a fake vs. real experiment, but without time constraints (conducted by Salimans et al. salimans2016improved ).
This “Turing-like” test is very intuitive and seems inevitable for ultimately answering the question of whether generative models are as good as nature at generating images. However, there are several concerns in conducting such a test in practice (especially when dealing with models that are far from perfect; see Fig. 15(left)). Aside from experimental conditions that are hard to control on crowd-sourced platforms (e.g. presentation time, screen size, subject’s distance to the screen, subjects’ motivation, age, mood, feedback, etc.) and the high cost, these tests fall short in evaluating models in terms of the diversity of generated samples and may be biased towards models that overfit to the training data.
Rating and Preference Judgment. These types of experiments ask subjects to rate models in terms of the fidelity of their generated images. For example, Snell et al. snell2015learning studied whether observers prefer reconstructions produced by perceptually-optimized networks or by pixelwise-loss optimized networks. Participants were shown image triplets with the original (reference) image in the center and the SSIM- and MSE-optimized reconstructions on either side, with the locations counterbalanced. Participants were instructed to select which of the two reconstructed images they preferred (see Fig. 17). Similar approaches have been followed in huang2017stacked ; zhang2017stackgan ; xiao2018generating ; yi2017dualgan ; zhang2016colorful ; upchurch2016deep ; donahue2017semantically ; liu2017auto ; lu2017sketch . Often the first few trials in these experiments are reserved for practice.
Figure 17: An example of a user judgment study by Snell et al. snell2015learning . Left) Human judgments of generated images (a) Fully connected network: Proportion of participants preferring SSIM to MSE for each of 100 image triplets. (b) Deterministic conv. network: Distribution of image quality ranking for MS-SSIM, MSE, and MAE for 1000 images from the STL-10 hold-out set. Right) Image triplets consisting of—from left to right—the MSE reconstruction, the original image, and the SSIM reconstruction. Image triplets are ordered, from top to bottom and left to right, by the percentage of participants preferring SSIM. (c) Eight images for which participants strongly preferred SSIM over MSE. (d) Eight images for which the smallest proportion of participants preferred SSIM. Figure compiled from snell2015learning .
Evaluating Mode Drop and Mode Collapse. GANs have been repeatedly criticized for failing to model the entire data distribution, even while generating realistic-looking images. Mode collapse, a.k.a. the Helvetica scenario, is the phenomenon in which the generator learns to map several different input vectors to the same output (possibly due to low model capacity or inadequate optimization arora2017gans ). It causes a lack of diversity in the generated samples, as the generator assigns low probability mass to significant subsets of the data distribution’s support. Mode drop occurs when some hard-to-represent modes of the data distribution are simply “ignored” by the generator. This is different from mode collapse, where several modes of the data distribution are “averaged” by the generator into a single mode, possibly located at a midpoint. An ideal GAN evaluation measure should be sensitive to these two phenomena.
Detecting mode collapse in GANs trained on large-scale image datasets is very challenging (see srivastava2017veegan ; huang2018an for analyses of mode drop and mode collapse on real datasets). However, it can be accurately measured on synthetic datasets where the true distribution and its modes are known (e.g. Gaussian mixtures). Srivastava et al. srivastava2017veegan proposed a measure to quantify mode collapse behavior as follows:
First, some points are sampled from the generator. A sample is counted as high quality if it is within a certain distance of its nearest mode center (the threshold is chosen per dataset, e.g. differently for a 2D dataset and a 1200D dataset).
Then, the number of modes captured is the number of mixture components whose mean is nearest to at least one high-quality sample. Accordingly, a mode is considered lost if there is no sample in the generated test data within a certain number of standard deviations from the center of that mode. This is illustrated in Fig. 19.
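This counting procedure can be sketched on a synthetic Gaussian mixture as follows (the ring of modes, the threshold of three standard deviations, and the sample counts are illustrative choices, not those of srivastava2017veegan ):

```python
import numpy as np

def modes_captured(samples, mode_centers, std, k=3.0):
    # Assign each sample to its nearest mode center.
    dists = np.linalg.norm(samples[:, None, :] - mode_centers[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    nearest_dist = dists.min(axis=1)
    # A sample is "high quality" if it lies within k standard
    # deviations of its nearest mode center.
    hq = nearest_dist <= k * std
    # A mode counts as captured if at least one high-quality sample is
    # nearest to it; otherwise it is considered lost.
    captured = np.unique(nearest[hq])
    return len(captured), float(hq.mean())

rng = np.random.default_rng(0)
# Eight mode centers on a ring, but the "generator" only hits four.
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
centers = 5.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
samples = centers[rng.integers(0, 4, size=2000)]
samples += rng.normal(scale=0.1, size=(2000, 2))
n_modes, hq_frac = modes_captured(samples, centers, std=0.1)
```

Here the collapsed sampler yields a high fraction of high-quality samples yet captures only half of the modes, which is exactly the failure pattern the measure is meant to expose.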
Santurkar et al. santurkar2018classification , to investigate mode distribution/collapse on natural datasets, propose to train GANs on a well-balanced dataset (i.e. a dataset that contains an equal number of samples from each class) and then test whether the generated data is also well-balanced. The steps are as follows:
Train the GAN unconditionally (without class labels) on the chosen balanced multi-class dataset D.
Train a multi-class classifier on the same dataset D (to be used as an annotator).
Generate a synthetic dataset by sampling N images from the GAN. Then use the classifier trained in Step 2 above to obtain labels for this synthetic dataset.
An example is shown in Fig. 18. It reveals that GANs often exhibit mode collapse.
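The three steps above can be sketched with stand-ins: a collapsed sampler takes the place of a trained GAN, and a nearest-centroid annotator takes the place of a trained classifier (all names and parameters below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_samples = 10, 5000

# Step 1 (stand-in): an unconditional "generator" that has collapsed
# onto three of the ten classes -- its samples cluster around only the
# centroids of classes 0, 3 and 7.
class_centroids = rng.normal(size=(n_classes, 16))

def sample_generator(n):
    picked = rng.choice([0, 3, 7], size=n)  # the collapsed modes
    return class_centroids[picked] + rng.normal(scale=0.05, size=(n, 16))

# Step 2 (stand-in): a nearest-centroid "classifier" trained on the
# balanced dataset, used purely as an annotator.
def classify(x):
    d = np.linalg.norm(x[:, None, :] - class_centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# Step 3: annotate N generated samples and inspect the label histogram.
labels = classify(sample_generator(n_samples))
hist = np.bincount(labels, minlength=n_classes) / n_samples
# A mode-covering generator would put ~1/n_classes mass on each class;
# here seven classes receive (almost) no mass -- mode collapse.
n_dead_classes = int(np.sum(hist < 0.01))
```

A strongly non-uniform label histogram over a balanced training set is the signature of mode collapse that this procedure reveals.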
The reverse KL divergence over the modes has been used in lin2017pacgan to measure the quality of mode collapse as follows. Each generated sample is assigned to its closest mode. This induces an empirical, discrete distribution with an alphabet size equal to the number of observed modes in the generated samples. A similar induced discrete distribution is computed from the real data samples. The reverse KL divergence between the induced distribution from generated samples and the induced distribution from the real samples is used as a measure.
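The induced-distribution construction can be sketched on synthetic modes as follows (the smoothing constant is an implementation convenience, not part of lin2017pacgan ):

```python
import numpy as np

def induced_distribution(samples, mode_centers):
    # Assign each sample to its closest mode and normalize the counts
    # into an empirical discrete distribution over modes.
    d = np.linalg.norm(samples[:, None, :] - mode_centers[None, :, :], axis=2)
    counts = np.bincount(d.argmin(axis=1), minlength=len(mode_centers))
    return counts / counts.sum()

def reverse_kl(p_gen, p_real, eps=1e-12):
    # KL between the distribution induced by generated samples and the
    # one induced by real samples (eps avoids log-of-zero).
    return float(np.sum(p_gen * np.log((p_gen + eps) / (p_real + eps))))

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
# Real data covers all four modes; the "generator" covers only two.
real = centers[rng.integers(0, 4, 4000)] + rng.normal(scale=0.1, size=(4000, 2))
gen = centers[rng.integers(0, 2, 4000)] + rng.normal(scale=0.1, size=(4000, 2))
rkl = reverse_kl(induced_distribution(gen, centers),
                 induced_distribution(real, centers))
```

With half the modes missing, the divergence lands near log 2; a generator covering all modes uniformly would score near zero.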
The shortcoming of the described measures is that they only work for datasets with known modes (e.g. synthetic or labeled datasets). Overall, it is hard to quantitatively measure mode collapse and mode drop since they are poorly understood. Further, finding nearest neighbors and nearest mode centers in high-dimensional spaces is non-trivial. Active research is ongoing in this direction.
Investigating and Visualizing the Internals of Networks. Other ways of evaluating generative models include studying how and what they learn, exploring their internal dynamics, and understanding the landscape of their latent spaces. While this is a broad topic under which many papers fall, here I give a few examples to provide the reader with some insight.
Disentangled representations. “Disentanglement” regards the alignment of “semantic” visual concepts with axes in the latent space. Some tests check for the existence of semantically meaningful directions in the latent space, meaning that varying the seed along those directions leads to predictable changes (e.g. changes in facial hair or pose). Others (e.g. chen2016infogan ; higgins2016beta ; mathieu2016disentangling ; lipton2017precise ) assess the quality of internal representations by checking whether they satisfy certain properties, such as being “disentangled”. A measure of disentanglement proposed in higgins2016beta checks whether the latent space captures the true factors of variation in a simulated dataset where the parameters are known by construction (e.g. using a graphics engine). Radford et al. radford2015unsupervised investigated their trained generators and discriminators in a variety of ways. They proposed that walking on the learned manifold can tell us about signs of memorization (if there are sharp transitions) and about the way in which the space is hierarchically collapsed. If walking in this latent space results in semantic changes to the generated images (such as objects being added and removed), one can reason that the model has learned relevant and interesting representations. They also showed interesting results of performing vector arithmetic on the latent vectors of sets of exemplar samples for visual concepts (e.g. smiling woman - neutral woman + neutral man = smiling man; using latent vectors averaged over several samples).
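The vector-arithmetic probe can be sketched as follows (the generator call is only indicated, and the latent dimensionality and the choice of five exemplars per concept are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, n_exemplars = 100, 5

# Average the latent codes behind several exemplars of each visual
# concept (stand-in random codes here; in practice these are the z's
# that produced exemplar images of each concept).
z_smiling_woman = rng.normal(size=(n_exemplars, latent_dim)).mean(axis=0)
z_neutral_woman = rng.normal(size=(n_exemplars, latent_dim)).mean(axis=0)
z_neutral_man = rng.normal(size=(n_exemplars, latent_dim)).mean(axis=0)

# "smiling woman" - "neutral woman" + "neutral man" ~ "smiling man"
z_result = z_smiling_woman - z_neutral_woman + z_neutral_man
# image = generator(z_result)  # decode with the trained generator
```

Averaging over several exemplars is what makes the arithmetic stable; single-sample arithmetic tends to produce noisy, off-concept results.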
Space continuity. Related to the above, the goal here is to study the level of detail a model is capable of extracting. For example, given two random seed vectors that generated two realistic images, we can check the images produced using seeds lying on the line joining them. If such “interpolated” images are reasonable and visually appealing, this may be taken as a sign that the model can produce novel images rather than simply memorizing them (e.g. berthelot2017began ; see Fig. 20). Some other examples include donahue2017semantically ; nguyen2016synthesizing . White white2016sampling suggests that replacing linear interpolation with spherical linear interpolation prevents diverging from the model’s prior distribution and produces sharper samples. Vedantam et al. vedantam2017generative studied “visually grounded semantic imagination” and proposed several ways to evaluate their models in terms of the quality of the learned semantic latent space.
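A common formulation of spherical linear interpolation (slerp), compared against plain linear interpolation for Gaussian seeds, illustrates why it stays within the prior's typical region:

```python
import numpy as np

def lerp(z0, z1, t):
    # Linear interpolation: points on the chord between the two seeds.
    return (1 - t) * z0 + t * z1

def slerp(z0, z1, t):
    # Spherical linear interpolation: follow the great circle instead
    # of the chord, so intermediate seeds keep a norm typical of the
    # prior (the motivation in white2016sampling).
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    so = np.sin(omega)
    if so < 1e-8:  # nearly parallel seeds: fall back to lerp
        return lerp(z0, z1, t)
    return np.sin((1 - t) * omega) / so * z0 + np.sin(t * omega) / so * z1

rng = np.random.default_rng(0)
z0, z1 = rng.normal(size=100), rng.normal(size=100)  # seeds from N(0, I)
# In high dimensions the lerp midpoint has an atypically small norm,
# while the slerp midpoint keeps a norm close to the prior's.
mid_lerp = np.linalg.norm(lerp(z0, z1, 0.5))
mid_slerp = np.linalg.norm(slerp(z0, z1, 0.5))
```

Because two high-dimensional Gaussian seeds are nearly orthogonal, the chord midpoint shrinks towards the origin, a region the prior rarely visits; slerp avoids this.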
Figure 20: Top: Interpolations in the latent space between real images (from BEGAN berthelot2017began ). These images were not part of the training data. The first and last columns contain the real images to be represented and interpolated. The images immediately next to them are their corresponding approximations, while the images in between are the results of linear interpolation in the latent space. Middle: Latent space interpolations for three ImageNet classes. The left-most and right-most columns show three pairs of image samples, each pair from a distinct class. Intermediate columns show linear interpolations in the latent space between these three pairs of images (from odena2016conditional ). Bottom: Class-independent information contains global structure about the synthesized image. Each column is a distinct bird class while each row corresponds to a fixed latent code (from odena2016conditional ).
Visualizing the discriminator features. Inspired by approaches for visualizing CNN features (e.g. zeiler2014visualizing ; bau2017network ; zhou2014object ), some works have attempted to visualize the internal parts of generators and discriminators in GANs. For example, Radford et al. radford2015unsupervised showed that a DCGAN trained on a large image dataset can learn a hierarchy of interesting features. Using guided backpropagation springenberg2014striving , they showed that the features learned by the discriminator fire on typical parts of a bedroom, such as beds and windows (see Fig. 5 in radford2015unsupervised ). The t-SNE method maaten2008visualizing has also frequently been used to project learned latent spaces into 2D.
3.1 Other Evaluation Measures
In addition to the measures discussed above, there exist some other non-trivial or task-specific ways to evaluate GANs. Vedantam et al. vedantam2017generative proposed a model for visually grounded imagination to create images of novel semantic concepts. To evaluate the quality of the generated images, they proposed three measures: a) correctness: the fraction of attributes in each generated image that match those specified in the concept’s description; b) coverage: the diversity of values for the unspecified or missing attributes, measured as the difference between the empirical distribution of attribute values in the generated set and the true distribution for the attribute induced by the training set; and c) compositionality: the correctness of generated images in response to test concepts that differ in at least one attribute from the training concepts. To measure the diversity of generated samples, Zhu et al. zhu2017toward randomly sampled from their model and computed the average pairwise distance in a deep feature space using the cosine distance, and compared it with the same measure calculated on ground-truth real images. This is akin to the image retrieval performance measure described above. Im et al. jiwoong2018quantitatively proposed to evaluate GANs by exploring the divergence and distance measures that were used during GAN training. They showed that the rankings produced by four measures, 1) Jensen-Shannon divergence, 2) constrained Pearson χ², 3) maximum mean discrepancy, and 4) Wasserstein distance, are consistent and robust across measures.
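The average pairwise cosine-distance diversity measure of Zhu et al. can be sketched as follows, with random vectors standing in for deep features:

```python
import numpy as np

def avg_pairwise_cosine_distance(features):
    # Mean (1 - cosine similarity) over all distinct pairs of feature
    # vectors; higher values indicate more diverse samples.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    iu = np.triu_indices(len(f), k=1)  # distinct pairs only
    return float(np.mean(1.0 - sim[iu]))

rng = np.random.default_rng(0)
# Spread-out "deep features" vs. features from a collapsed generator
# (one base vector plus tiny perturbations).
diverse = rng.normal(size=(200, 64))
collapsed = rng.normal(size=(1, 64)) + 0.01 * rng.normal(size=(200, 64))
d_diverse = avg_pairwise_cosine_distance(diverse)
d_collapsed = avg_pairwise_cosine_distance(collapsed)
```

Comparing the generated-set score against the same statistic computed on real images (as Zhu et al. do) turns this into a relative diversity check rather than an absolute one.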
| Measure | Discriminability | Detecting Overfitting | Disentangled Latent Spaces | Bounds | Perceptual Judgment | Sensitivity to Distortions | Comp. & Sample Efficiency |
|---|---|---|---|---|---|---|---|
| 1. Average Log-likelihood goodfellow2014generative ; theis2015note | low | low | - | (-∞, ∞) | low | low | low |
| 2. Coverage Metric tolstikhin2017adagan | low | low | - | [0, 1] | low | low | - |
| 3. Inception Score (IS) salimans2016improved | high | moderate | - | [1, ∞) | high | moderate | high |
| 4. Modified Inception Score (m-IS) gurumurthy2017deligan | high | moderate | - | [1, ∞) | high | moderate | high |
| 5. Mode Score (MS) che2016mode | high | moderate | - | [0, ∞) | high | moderate | high |
| 6. AM Score zhou2018activation | high | moderate | - | [0, ∞) | high | moderate | high |
| 7. Fréchet Inception Distance (FID) heusel2017gans | high | moderate | - | [0, ∞) | high | high | high |
| 8. Maximum Mean Discrepancy (MMD) gretton2012kernel | high | low | - | [0, ∞) | - | - | - |
| 9. The Wasserstein Critic arjovsky2017wasserstein | high | moderate | - | [0, ∞) | - | - | low |
| 10. Birthday Paradox Test arora2017gans | low | high | - | [1, ∞) | low | low | - |
| 11. Classifier Two-sample Test (C2ST) lehmann2006testing | high | low | - | [0, 1] | - | - | - |
| 12. Classification Performance radford2015unsupervised ; isola2017image | high | low | - | [0, 1] | low | - | - |
| 13. Boundary Distortion santurkar2018classification | low | low | - | [0, 1] | - | - | - |
| 14. NDB Richardson2018 | low | high | - | [0, ∞) | - | low | - |
| 15. Image Retrieval Performance wang2016ensembles | moderate | low | - | * | low | - | - |
| 16. Generative Adversarial Metric (GAM) im2016generating | high | low | - | * | - | - | moderate |
| 17. Tournament Win Rate and Skill Rating olsson2018skill | high | high | - | * | - | - | low |
| 18. NRDS zhang2018decoupled | high | low | - | [0, 1] | - | - | poor |
| 19. Adversarial Accuracy & Divergence yang2017lr | high | low | - | [0, 1], [0, ∞) | - | - | - |
| 20. Geometry Score khrulkov2018geometry | low | low | - | [0, ∞) | - | low | low |
| 21. Reconstruction Error xiang2017effects | low | low | - | [0, ∞) | - | moderate | moderate |
| 22. Image Quality Measures wang2004image ; ridgeway2015learning ; juefei2017gang | low | moderate | - | * | high | high | high |
| 23. Low-level Image Statistics zeng2017statistics ; karras2017progressive | low | low | - | * | low | low | - |
| 24. Precision, Recall and F1 Score lucic2017gans | low | high | ✓ | [0, 1] | - | - | - |
3.2 Sample and Computational Efficiencies
Here, I provide more details on two items in the list of desired properties of GAN evaluation measures. They will be used in the next subsection for assessing the measures. Huang et al. huang2018an argue that a practical GAN evaluation measure should be computed using a reasonable number of samples and within an affordable computation cost. This is particularly important during monitoring the training process of models. They proposed the following ways to assess evaluation measures:
Sample efficiency: This concerns the number of samples needed for a measure to discriminate a set of generated samples from a set of real samples. To assess it, a reference set is sampled uniformly from the real training data (but disjoint from the real sample set). All three sets have the same size. An ideal measure is expected to correctly score the generated set lower (i.e. farther from the reference) than the real sample set using a relatively small number of samples. In other words, the number of samples a measure needs to distinguish generated from real samples can be viewed as its sample complexity.
Computational efficiency: Fast computation of the empirical measure is of practical concern as it helps researchers monitor the training process and diagnose problems early on (e.g. for early stopping). This can be measured in terms of seconds per number of evaluated samples.
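The sample-efficiency protocol can be sketched as follows (the "measure" here is a crude mean-difference stand-in, and the 95% win criterion and grid of sample sizes are illustrative choices, not taken from huang2018an ):

```python
import numpy as np

def measure(a, b):
    # Hypothetical stand-in evaluation measure: absolute difference of
    # sample means (a real study would plug in FID, MMD, etc.).
    return abs(a.mean() - b.mean())

def sample_efficiency(real_pool, gen_pool, n_grid, trials=50, rng=None):
    # Smallest sample size n at which the measure scores the generated
    # set as farther from real than a disjoint real reference set does,
    # in the vast majority of random trials.
    rng = rng or np.random.default_rng(0)
    for n in n_grid:
        wins = 0
        for _ in range(trials):
            idx = rng.permutation(len(real_pool))
            s_r, s_r_prime = real_pool[idx[:n]], real_pool[idx[n:2 * n]]
            s_g = gen_pool[rng.choice(len(gen_pool), n, replace=False)]
            wins += measure(s_r, s_g) > measure(s_r, s_r_prime)
        if wins / trials > 0.95:
            return n
    return None  # measure never reliably separated the two sets

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=20000)
gen = rng.normal(0.5, 1.0, size=20000)  # slightly shifted "generated" data
n_needed = sample_efficiency(real, gen, n_grid=[10, 50, 250, 1250], rng=rng)
```

A more sample-efficient measure returns a smaller `n_needed` for the same pair of distributions, which is what makes it usable for monitoring training.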
3.3 What is the Best GAN Evaluation Measure?
A glance at the comparison table above reveals that:
- only two measures are designed to explicitly address overfitting,
- the majority of the measures do not consider disentangled representations,
- few measures have both lower and upper bounds,
- the agreement between the measures and human perceptual judgments is less clear,
- several highly regarded measures have high sample and computational efficiencies, and
- the sensitivity of measures to image distortions is less explored.
A detailed discussion and comparison of GAN evaluation measures comes next.
As of yet, there is no consensus regarding the best score. Different scores assess various aspects of the image generation process, and it is unlikely that a single score can cover all aspects. Nevertheless, some measures seem more plausible than others (e.g. FID score). Detailed analyses by Theis et al. theis2015note showed that average likelihood is not a good measure. Parzen windows estimation of likelihood favors trivial models and is irrelevant to visual fidelity of samples. Further, it fails to approximate the true likelihood in high dimensional spaces or to rank models (Fig. 14). Similarly, the Wasserstein distance between generated samples and the training data is also intractable in high dimensions karras2017progressive . Two widely accepted scores, Inception Score and Fréchet Inception Distance, rely on pre-trained deep networks to represent and statistically compare original and generated samples. This brings along two significant drawbacks. First, the deep network is trained to be invariant to image transformations and artifacts making the evaluation method also insensitive to those distortions. Second, since the deep network is often trained on large scale natural scene datasets (e.g. ImageNet), applying them to other domains (e.g. faces, digits) is questionable. Some evaluation methods (e.g. MS-SSIM odena2016conditional , Birthday Paradox Test) aim to assess the diversity of the generated samples, regardless of the data distribution. While being able to detect severe cases of mode collapse, these methods fall short in measuring how well a generator captures the true data distribution karras2017progressive .
Quality measures such as nearest neighbor visualizations or rapid categorization tasks may favor models that overfit. Overall, it seems that the main challenge is to have a measure that evaluates both diversity and visual fidelity simultaneously. The former implies that all modes are covered while the latter implies that the generated samples should have high likelihood. Perhaps due to these challenges, Theis et al. theis2015note argued against evaluating models for task-independent image generation and proposed to evaluate GANs with respect to specific applications. For different applications then, different measures might be more appropriate. For example, the likelihood is good for measuring compression methods theis2017lossy while psychophysics and user ratings are fit for evaluating image reconstruction and synthesis methods ledig2016photo ; gerhard2013sensitive . Some measures are suitable for evaluating generic GANs (when input is a noise vector), while some others are suitable for evaluating conditional GANs (e.g. FCN score) where correspondences are available (e.g. generating an image corresponding to a segmentation map).
Despite having different formulations, several scores are based on similar concepts. C2ST, adversarial accuracy, and classification performance employ classifiers to determine how separable generated images are from real images (on a validation dataset). The FID, Wasserstein, and MMD measures compute the distance between two distributions. The Inception Score and its variants, including m-IS, Mode, and AM scores, use conditional and marginal distributions over generated or real data to evaluate the diversity and fidelity of samples. Average log-likelihood and the coverage metric estimate probability distributions. Reconstruction error and some quality measures determine how dissimilar generated images are from their corresponding (or closest) images in the training set. Some measures use individual samples (e.g. IS) while others need pairs of samples (e.g. MMD). One important concern regarding many measures is that they are sensitive to the choice of the feature space (e.g. different CNNs) as well as the type of distance used in that space.
Fidelity, diversity, and controllable sampling are the main aspects of a model that a measure should capture. A good score should have well-defined bounds and also be sensitive to image distortions and transformations (See Fig. 21 and 4). One major problem with image quality measures such as SSIM and PSNR is that they only tap visual fidelity and not the diversity of samples. Humans are also often biased towards the visual quality of generated images and are less affected by a lack of image diversity. On the other hand, some quantitative measures mostly concentrate on evaluating diversity (e.g. Birthday Paradox Test) and discard fidelity. Ideally, a good measure should take both into account.
Fig. 22 shows a comparison of GAN evaluation measures in terms of sample and computational efficiency. While some measures are practical to compute for a small sample size (about 2000 images), some others (e.g. Wasserstein distance) do not scale to large sample sizes. Please see huang2018an for further details.
4 Summary and Future Work
In this work, I provided a critical review of the strengths and limitations of 24 quantitative and 5 qualitative measures that have been introduced so far for evaluating GANs. Seeking appropriate measures for this purpose continues to be an important open problem, not only for fair model comparison but also for understanding, improving, and developing generative models. The lack of a universally powerful measure can hinder progress. In a recent benchmark study, Lucic et al. lucic2017gans found no empirical evidence in favor of GAN models that claimed superiority over the original GAN. In this regard, borrowing from other fields such as natural scene statistics and cognitive vision can be rewarding. For example, understanding how humans perceive symmetry driver1992preserved ; funk2017beyond or image clutter rosenholtz2007measuring in generated images versus natural scenes can give clues regarding the plausibility of the generated images.
Ultimately, I suggest the following directions for future research in this area:
- creating a code repository of evaluation measures,
- conducting detailed comparative empirical and analytical studies of available measures, and
- benchmarking models under the same conditions (e.g. architectures, optimization, hyperparameters, computational budget) using more than one measure.
- (1) A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434 (2015).