Pros and Cons of GAN Evaluation Measures: New Developments

03/17/2021 ∙ by Ali Borji, et al. ∙ 0

This work is an update of a previous paper on the same topic published a few years ago. With the dramatic progress in generative modeling, a suite of new quantitative and qualitative techniques to evaluate models has emerged. Although some measures such as Inception Score, Fréchet Inception Distance, Precision-Recall, and Perceptual Path Length are relatively more popular, GAN evaluation is not a settled issue and there is still room for improvement. For example, in addition to quality and diversity of synthesized images, generative models should be evaluated in terms of bias and fairness. I describe new dimensions that are becoming important in assessing models, and discuss the connection between GAN evaluation and deepfakes.




1 Introduction

Generative models have revolutionized AI and are now widely used (Footnote 1: a list of GAN applications is linked in the original paper). They are remarkably effective at synthesizing strikingly realistic and diverse images, and learn to do so in an unsupervised manner wang2019generative ; liu2021generative ; bond2021deep . Models such as generative adversarial networks (GANs) goodfellow2014generative , autoregressive models such as PixelCNNs, variational autoencoders (VAEs) kingma2013auto , and recently transformer GANs hudson2021generative ; jiang2021transgan have been constantly improving the state of the art in image generation (see for example brock2018large ; karras2019style ; ramesh2021zero ; razavi2019generating ).

A key difficulty in generative modeling is evaluating performance, i.e. how well does a model approximate a data distribution? Quantitative evaluation of generative models of images and video is an open problem (Footnote 2: Generative models are usually evaluated in terms of the fidelity (how realistic a generated image is) and the diversity (how well generated samples capture the variations in real data) of the learned distribution). Here, I give a summary and discussion of the latest progress (since 2018 borji2019pros ) in this field, covering newly proposed measures, benchmarks, and GAN visualization and diagnosis techniques.

1.1 Popular Scores and Their Shortcomings

The two most common GAN evaluation measures are Inception Score (IS) and Fréchet Inception Distance (FID). IS salimans2016improved computes the KL divergence between the conditional class distribution and the marginal class distribution over the generated data. FID heusel2017gans calculates the Wasserstein-2 (a.k.a. Fréchet) distance between multivariate Gaussians fitted to the Inception-v3 embeddings of generated and real images (Footnote 3: a tutorial on implementing FID from scratch is linked in the original paper).
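As a rough illustration of the two definitions, the sketch below computes both scores from precomputed arrays. It assumes the Inception-v3 outputs are already available: `probs` (softmax class probabilities of generated images) and `feats_real` / `feats_fake` (embedding vectors) are hypothetical inputs, and the function names are chosen here for illustration. The trace of the matrix square root is obtained from eigenvalues, which is one of several possible choices.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y).

    probs: array of shape (num_images, num_classes), rows sum to 1.
    """
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y) over generated data
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))


def frechet_distance(feats_real, feats_fake):
    """Frechet distance between Gaussians fitted to two sets of embeddings.

    ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})
    """
    feats_real = np.asarray(feats_real, dtype=np.float64)
    feats_fake = np.asarray(feats_fake, dtype=np.float64)
    mu_r, mu_g = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_fake, rowvar=False)
    # Tr((S_r S_g)^{1/2}) equals the sum of square roots of the eigenvalues
    # of S_r S_g, which are real and non-negative up to numerical noise.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)
```

With perfectly confident predictions spread evenly over K classes the score approaches K, and with identical feature sets the distance is zero up to rounding, which makes both functions easy to sanity-check.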

IS does not capture intra-class diversity, is insensitive to the prior distribution over labels, is very sensitive to the model parameters and implementations barratt2018note , and requires a large sample size to be reliable. FID has been widely adopted because of its consistency with human inspection and sensitivity to small changes in the real distribution (e.g. slight blurring or small artifacts in synthesized images). Unlike IS, FID can detect intra-class mode collapse (Footnote 4: Mode collapse happens when a generator produces data from only a few modes of the target distribution, thus failing to generate diverse enough outputs). The Gaussian assumption in FID calculation, however, might not hold in practice. A major drawback of FID is its high bias: the sample size used to calculate FID has to be large enough (usually above 50K), and smaller sample sizes can lead to over-estimating the actual FID chong2020effectively . While single-value metrics like IS and FID successfully capture many aspects of the generator, they lump quality and diversity evaluation together and are therefore not ideal for diagnostic purposes. Further, they continue to have a blind spot for image quality karras2019style .

Another popular approach to evaluate GANs is manual inspection. While it can directly tell us about the image quality, it is limited in assessing sample diversity. It is also subjective as the reviewer may incorporate her own biases and opinions. Further, it requires knowledge of what is realistic and what is not for the target domain (which might be hard to learn in some domains) and is constrained by the number of images that can be reviewed in a reasonable time.

2 New Quantitative GAN Evaluation Measures

2.1 New Variants of Fréchet Inception Distance and Inception Score

2.1.1 Unbiased FID and IS

Chong and Forsyth chong2020effectively show that FID and IS are biased in the sense that their expected value, computed using a finite sample set, is not their true value. They find that the bias depends on the number of images used for calculating the score and on the generators themselves, making objective comparisons between models difficult. To mitigate the issue, they propose an extrapolation approach to obtain bias-free estimates of the scores, called $\mathrm{FID}_\infty$ and $\mathrm{IS}_\infty$, which correspond to scores computed with an infinite number of samples. These extrapolated scores are simple, drop-in replacements for the finite-sample scores (Footnote 5: their code is available online).
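The extrapolation idea can be sketched as follows, under the assumption that the score is approximately linear in 1/N: evaluate the score at several sample sizes, fit a line against 1/N, and read off the intercept as the estimate for N going to infinity. The function name and inputs are hypothetical, and this sketch omits any variance-reduction steps the authors may use.

```python
import numpy as np


def extrapolate_to_infinite_samples(sample_sizes, scores):
    """Fit score ~ intercept + slope / N and return the intercept.

    sample_sizes: the N values at which the score was measured.
    scores: the corresponding finite-sample scores (e.g. FID values).
    """
    inv_n = 1.0 / np.asarray(sample_sizes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    slope, intercept = np.polyfit(inv_n, scores, deg=1)
    return float(intercept)
```

In practice the scores passed in would come from repeated evaluations on random subsets of the generated images; more sample sizes give a more stable fit.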

2.1.2 Fast Fréchet Inception Distance

Commonly, FID is used to compare generative models after training. It can also be used during training to monitor model improvement, but the problem is the high computational cost at typical sample sizes (about 50K). Mathiasen et al. mathiasen2020fast propose a method to speed up the FID computation. The idea is based on the fact that the real data samples do not change during training, so their Inception encodings need to be computed only once. Reducing the number of fake samples therefore lowers the time to compute Inception encodings. The bottleneck, however, is computing $\mathrm{tr}(\sqrt{\Sigma_1 \Sigma_2})$ in the FID formula, where $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of the Gaussians fitted to the real and generated data. FastFID circumvents this issue by obtaining the trace without explicitly computing $\sqrt{\Sigma_1 \Sigma_2}$. Depending on the number of fake samples, FastFID can speed up FID computation 25 to 500 times.
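One way to avoid the full matrix square root when only a few fake samples are used is sketched below. It relies on a standard linear-algebra fact: if the fake covariance is written as $C^\top C$ for an m x d matrix $C$ of centered samples, the nonzero eigenvalues of $\Sigma_1 C^\top C$ (d x d) coincide with those of $C \Sigma_1 C^\top$ (m x m). This is an illustration of the general idea, not a claim about the authors' exact implementation; the function names are hypothetical.

```python
import numpy as np


def trace_sqrt_full(sigma_real, sigma_fake):
    """Reference: Tr(sqrt(S_real S_fake)) from the d x d product."""
    eigvals = np.linalg.eigvals(sigma_real @ sigma_fake)
    return float(np.sqrt(np.clip(eigvals.real, 0.0, None)).sum())


def trace_sqrt_small(sigma_real, fake_feats):
    """Same trace, computed from an m x m matrix (m = number of fake samples)."""
    fake_feats = np.asarray(fake_feats, dtype=np.float64)
    m = fake_feats.shape[0]
    # Rows of `centered` satisfy centered.T @ centered == np.cov(fake_feats.T).
    centered = (fake_feats - fake_feats.mean(axis=0)) / np.sqrt(m - 1)
    small = centered @ sigma_real @ centered.T  # symmetric PSD, shape (m, m)
    eigvals = np.linalg.eigvalsh(small)
    return float(np.sqrt(np.clip(eigvals, 0.0, None)).sum())
```

When m is much smaller than the embedding dimension (2048 for Inception-v3 pool features), the eigenvalue problem shrinks accordingly, which is where the saving comes from.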

2.1.3 Fréchet Video Distance (FVD)

Unterthiner et al. unterthiner2019fvd propose an extension of the FID, called FVD, to evaluate generative models of video. In addition to the quality of each frame, FVD captures the temporal coherence in a video. To obtain a suitable video feature representation, they used a pre-trained network (the Inflated 3D Convnet carreira2017quo pre-trained on the Kinetics-400 and Kinetics-600 datasets) that considers the temporal coherence of the visual content across a sequence of frames. The I3D network generalizes the Inception architecture to sequential data, and is trained to perform action recognition on the Kinetics dataset consisting of human-centered YouTube videos. FVD considers a distribution over videos, thereby avoiding the drawbacks of frame-level measures.

Some other measures for video synthesis evaluation include Average Content Distance tulyakov2018mocogan (the average L2 distance among all consecutive frames in a video), Cumulative Probability Blur Detection (CPBD), Frequency Domain Blurriness Measure (FDBM) de2013image , Pose Error yang2018pose , and Landmark Distance (LMD) chen2018lip . For more details on video generation and evaluation, please consult oprea2020review .

2.2 Methods based on Analysing Data Manifold

These methods attempt to measure disentanglement in the representations learned by generative models, and are useful for improving the generalization, robustness, and interpretability of models.

2.2.1 Local Intrinsic Dimensionality (LID)

Barua et al. barua2019quality introduce a new evaluation measure called “CrossLID” to assess the local intrinsic dimensionality (LID) of real-world data with respect to neighborhoods found in GAN-generated samples. Intuitively, CrossLID measures the degree to which the manifolds of two data distributions coincide with each other. The idea behind CrossLID is depicted in Fig. 1. Barua et al. compare CrossLID with other measures and show that it a) is strongly correlated with the progress of GAN training, b) is sensitive to mode collapse, c) is robust to small-scale noise and image transformations, and d) is robust to sample size.

Figure 1: Four 2D examples illustrating how generated samples by a GAN (triangles) relate to real data samples from a bimodal Gaussian (circles), together with CrossLID scores. (a) generated data distributed uniformly, spatially far from the real data. (b) generated data with two modes, spatially far from the real data (c) generated data associated with only one mode of the real data. (d) generated data associated with both modes of the real data (the desired situation). The lower the CrossLID, the better. Figure compiled from barua2019quality .

2.2.2 Intrinsic Multi-scale Distance (IMD)

Tsitsulin et al. tsitsulin2019shape argue that current evaluation measures only reflect the first two or three moments of distributions (mean and covariance). As illustrated in Fig. 2, FID and KID (Kernel Inception Distance binkowski2018demystifying ) are insensitive to the global structure of the data distribution. They propose a measure, called IMD, to take all data moments into account, and claim that IMD is intrinsic (Footnote 6: it is invariant to isometric transformations of the manifold such as translation or rotation) and multi-scale (Footnote 7: it captures both local and global information). Tsitsulin et al. experimentally show that their method is effective in discerning the structure of data manifolds even on unaligned data. IMD compares data distributions based on their geometry and is similar to the Geometry Score khrulkov2018geometry . These approaches, however, are somewhat complicated.

Figure 2: The motivation behind the Intrinsic Multi-scale Distance (IMD) score. Here, the two distributions have the same first 3 moments. Since FID and KID are insensitive to higher moments, they cannot distinguish the two distributions, whereas the IMD score can. Figure from tsitsulin2019shape .

2.2.3 Perceptual Path Length (PPL)

First introduced in StyleGAN karras2019style , PPL measures whether and how much the latent space of a generator is entangled, or conversely whether it is smooth and the factors of variation are properly separated (Footnote 8: For example, in an entangled space, features that are absent in either endpoint may appear in the middle of a linear interpolation path). Intuitively, a less curved latent space should result in perceptually smoother transitions than a highly curved latent space. Formally, PPL is the empirical mean of the perceptual difference between consecutive images in the latent space $\mathcal{Z}$, over all possible endpoints:

$$ l_{\mathcal{Z}} = \mathbb{E}\left[ \frac{1}{\epsilon^2}\, d\big( G(\mathrm{slerp}(z_1, z_2; t)),\; G(\mathrm{slerp}(z_1, z_2; t + \epsilon)) \big) \right] $$

where $z_1, z_2 \sim P(z)$, $t \sim U(0, 1)$, $G$ is the generator, and $d(\cdot, \cdot)$ is the perceptual distance between the resulting images. $\mathrm{slerp}$ denotes the spherical interpolation shoemake1985animating , and $\epsilon$ is the step size. The LPIPS (learned perceptual image patch similarity) zhang2018unreasonable can be used to measure the perceptual distance between two images. LPIPS is a weighted L2 difference between two VGG16 simonyan2014very embeddings, where the weights are learned to make the metric agree with human perceptual similarity judgments. PPL captures semantics and image quality as shown in Fig. 3. Fig. 4 illustrates the superiority of PPL over FID and Precision-Recall scores in comparing two models.
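A Monte Carlo estimate of this quantity might look like the sketch below. The `generator` and `distance` callables are placeholders: a real evaluation would plug in a trained generator and LPIPS, whereas any toy function works for checking the mechanics. Function names, the default number of pairs, and the seed are assumptions made for illustration.

```python
import numpy as np


def slerp(z1, z2, t):
    """Spherical interpolation between the directions of z1 and z2."""
    z1n = z1 / np.linalg.norm(z1)
    z2n = z2 / np.linalg.norm(z2)
    omega = np.arccos(np.clip(np.dot(z1n, z2n), -1.0, 1.0))
    if omega < 1e-8:  # endpoints (almost) coincide
        return z1n
    return (np.sin((1.0 - t) * omega) * z1n + np.sin(t * omega) * z2n) / np.sin(omega)


def perceptual_path_length(generator, distance, latent_dim,
                           n_pairs=1000, eps=1e-4, seed=0):
    """Average of d(G(slerp(t)), G(slerp(t + eps))) / eps^2 over random paths."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_pairs):
        z1 = rng.standard_normal(latent_dim)
        z2 = rng.standard_normal(latent_dim)
        t = rng.uniform(0.0, 1.0)
        img_a = generator(slerp(z1, z2, t))
        img_b = generator(slerp(z1, z2, t + eps))
        total += distance(img_a, img_b) / eps ** 2
    return total / n_pairs
```

Because the step is tiny, the estimate behaves like a squared local stretch factor of the generator along the path, so a generator that changes its output abruptly for small latent moves receives a high (worse) score.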

Figure 3: Correlation between perceptual path length and image quality (generated by StyleGAN). Panels (a) and (b) show random examples with low and high (per-image) PPL. PPL score is capable of capturing the consistency of the images. The lower the PPL, the better. Figure from karras2020analyzing .
Figure 4: Synthesized samples from two generative models trained on LSUN yu2015lsun CAT without truncation (truncation means drawing latent vectors from a truncated or shrunk sampling space to improve average image quality, at the expense of losing diversity kingma2013auto ; brock2018large ; karras2020analyzing ). FID, precision (P), and recall (R) are similar for the two models, even though the latter produces cat-shaped objects more often. Perceptual path length (PPL) shows a clear preference for model 2. Figure from karras2020analyzing .

2.2.4 Linear Separability in Latent Space

Karras et al. karras2019style propose another measure in StyleGAN to quantify latent space disentanglement by measuring how well the latent-space points can be separated into two distinct sets via a linear hyperplane (e.g. separating a binary attribute of the image such as male and female for faces). To measure the separability of an attribute, they generate some images with $z \sim P(z)$ and classify them using an auxiliary binary classifier trained on real data. They then fit a linear SVM to predict the label based on the latent-space points and classify the points by this plane. The conditional entropy $H(Y|X)$ is then computed, where $X$ denotes the classes predicted by the SVM and $Y$ denotes the classes determined by the pre-trained classifier. A low value suggests consistent latent space directions for the corresponding factor(s) of variation. The final separability score is $\exp\left(\sum_i H(Y_i|X_i)\right)$, where $i$ enumerates over a set of attributes.
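The entropy step can be illustrated with a small sketch. It assumes the SVM predictions and the auxiliary-classifier labels are already available as integer arrays (fitting the SVM and the classifier is omitted), and it measures entropy in bits, which is an assumption made here. The function names are hypothetical.

```python
import numpy as np


def conditional_entropy_bits(svm_pred, clf_label):
    """H(Y|X) in bits, with X = SVM predictions and Y = classifier labels."""
    x = np.asarray(svm_pred)
    y = np.asarray(clf_label)
    entropy = 0.0
    for x_val in np.unique(x):
        mask = x == x_val
        p_x = mask.mean()
        for y_val in np.unique(y[mask]):
            p_y_given_x = (y[mask] == y_val).mean()
            entropy -= p_x * p_y_given_x * np.log2(p_y_given_x)
    return float(entropy)


def separability_score(per_attribute_pairs):
    """exp of the summed conditional entropies over all attributes.

    per_attribute_pairs: iterable of (svm_pred, clf_label) array pairs.
    """
    total = sum(conditional_entropy_bits(x, y) for x, y in per_attribute_pairs)
    return float(np.exp(total))
```

If the hyperplane reproduces the classifier labels exactly for every attribute, each entropy is zero and the score takes its minimum value of 1; lower is better.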

2.3 Classification Accuracy Score (CAS)

Ravuri and Vinyals ravuri2019classification propose a measure that is in essence similar to the classifier two-sample tests (C2ST) lopez2016revisiting . They argue that if a generative model is learning the data distribution in a perceptually meaningful space, then it should perform well in downstream tasks. Using class-conditional generative models such as GANs and VAEs, they train an image classifier on synthetic data only and use it to predict the labels of real images in the test set. They find that a) when using a state-of-the-art GAN (BigGAN-deep brock2018large ), Top-1 and Top-5 accuracy decrease by 27.9% and 41.6%, respectively, compared to training on the original data, b) CAS automatically identifies particular classes for which generative models fail to capture the data distribution (Fig. 5), and c) IS and FID are neither predictive of CAS, nor useful when evaluating non-GAN models.
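The train-on-synthetic, test-on-real protocol can be sketched with a deliberately simple classifier. The nearest-centroid model below is a stand-in chosen to keep the example self-contained; the paper trains deep image classifiers, and all names here are hypothetical.

```python
import numpy as np


def fit_centroids(features, labels):
    """Compute one mean feature vector per class."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(classes, centroids, features):
    """Assign each sample to the class of its nearest centroid."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]


def classification_accuracy_score(synth_x, synth_y, real_x, real_y):
    """Train on labeled synthetic samples only, report accuracy on real data."""
    classes, centroids = fit_centroids(synth_x, synth_y)
    return float((predict(classes, centroids, real_x) == real_y).mean())
```

A conditional generator that ignores its class label yields nearly identical centroids for all classes, so the accuracy on real data drops, which is the failure CAS is designed to expose.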

Figure 5: CAS can identify classes for which a generative model (here BigGAN-deep brock2018large ) fails to capture the data distribution (top row: real images, bottom two rows: generated samples). Figure compiled from ravuri2019classification .

2.4 Non-Parametric Tests to Detect Data-Copying

Figure 6: Illustration of the data-copying concept introduced in meehan2020non . Each panel depicts a single instance space partitioned into two regions. Panel (a) shows an over-represented region (top) and an under-represented region (bottom). This is the kind of overfitting evaluated by measures such as FID and Precision-Recall. Panel (b) shows a data-copied region (top) and an underfitted region (bottom). Panel (c) shows VAE-generated and training samples from a data-copied (top) and underfitted (bottom) region over MNIST. In each 10-image strip, the bottom row provides random generated samples from the region and the top row shows their training nearest neighbors. Samples in the bottom region are on average farther from their training nearest neighbor than held-out test samples in the region, and samples in the top region are closer, and thus “data-copying”. Figure from meehan2020non .

Meehan et al. meehan2020non formalize a notion of overfitting called data-copying, where a generative model memorizes the training samples or small variations of them (Fig. 6). They provide a three-sample test for detecting data-copying that uses the training set, a separate held-out sample from the target distribution, and a generated sample from the model. The key insight is that an overfitted GAN generates samples that are on average closer to the training samples; the average distance of generated samples to training samples is therefore smaller than the corresponding distance between held-out test samples and the training samples. They also divide the instance space into cells and conduct their test separately in each cell. This is because, as they argue, generative models tend to behave differently in different regions of space.
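The core comparison can be sketched as follows. This is a simplified global variant that reports the fraction of (generated, held-out) pairs in which the generated sample lies closer to its nearest training point; it omits the per-cell partitioning and the formal test statistic of the paper, and the function names are hypothetical.

```python
import numpy as np


def nearest_train_distance(queries, train):
    """Euclidean distance from each query to its nearest training sample."""
    dists = np.linalg.norm(queries[:, None, :] - train[None, :, :], axis=-1)
    return dists.min(axis=1)


def copying_statistic(train, heldout, generated):
    """Fraction of pairs where the generated sample is closer to the training
    set than the held-out sample. Values near 0.5 are expected without
    copying; values near 1 suggest data-copying.
    """
    d_gen = nearest_train_distance(generated, train)
    d_test = nearest_train_distance(heldout, train)
    return float((d_gen[:, None] < d_test[None, :]).mean())
```

The held-out sample acts as the reference for how close a genuinely new draw from the target distribution should be to the training set.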

2.5 Measures that Probe Generalization in GANs

Zhao et al. zhao2018bias utilize carefully designed training datasets to characterize how existing models generate novel attributes and their combinations. Some of their findings are as follows. When presented with a training set with all images having exactly 3 objects, both GANs and VAEs typically generate 2-5 objects (Fig. 7.B). Over a multi-modal training distribution (e.g. images with either 2 or 10 objects), the model acts as if it is trained separately on each mode, and it averages the two modes. When the modes are close to each other (e.g. 2 and 4 objects), the learned distribution assigns higher probability to the mean number of objects (3 in this example), even though there was no image with 3 objects in the training set (Fig. 7.C). When the training set contains certain combinations (e.g. red cubes but not yellow cubes; Fig. 7.D), the model memorizes the combinations in the training set when it contains a small number of them (e.g. 20), and generates novel combinations when there is more variety (e.g. 80). Xuan et al. xuan2019anomalous made a similar observation shown in Fig. 8. On a geometric-object training dataset where the number of objects for each training image is fixed, they found a variable number of objects in generated images. Some other studies that have investigated GAN generalization include o2018evaluating ; van2020investigating .

Figure 7: A) A generative model can be probed with carefully designed training data. Examining the learned distribution when training data B) takes a single value for a feature (e.g. all training images have 3 objects), C) has multiple modes for a feature (e.g. all training images have 2, 4 or 10 objects), or D) has multiple modes over multiple features. Figure compiled from zhao2018bias .
Figure 8: Xuan et al. xuan2019anomalous show that training a GAN over images with exactly two rectangles results in a model that generates one, two, or three rectangles (anomalous ones shown in red). They also propose a model that generates a high fraction of correct images (the right-most panel).

2.6 New Ideas based on Precision and Recall (P&R)

Sajjadi et al. sajjadi2018assessing proposed to use precision and recall to explicitly quantify the trade-off between quality (precision) and coverage (recall). Precision measures the similarity of generated instances to the real ones and recall measures the ability of a generator to synthesize all instances found in the training set (Fig. 9). P&R curves can distinguish mode collapse (poor recall) from bad quality (poor precision). Some new ideas based on P&R are summarized below.

Figure 9: (a) Illustration of precision-recall for the distribution of real images $P_r$ (blue) and the distribution of generated images $P_g$ (red). (b) Precision is the probability that a random image from $P_g$ falls within the support of $P_r$. (c) Recall is the probability that a random image from $P_r$ falls within the support of $P_g$. Figure compiled from kynkaanniemi2019improved .

2.6.1 Density and Coverage

Naeem et al. naeem2020reliable argue that even the latest versions of the precision and recall metrics are still not reliable, as they a) fail to detect the match between two identical distributions, b) are not robust against outliers, and c) have arbitrarily selected evaluation hyperparameters. To solve these issues, they propose density and coverage metrics. Precision makes a binary decision on whether a fake sample $Y_j$ is contained in any real-sample neighbourhood sphere. Density, instead, counts how many real-sample neighbourhood spheres contain $Y_j$. See Fig. 10. They analytically and experimentally show that density and coverage provide more interpretable and reliable signals for practitioners than the existing measures (Footnote 9: their code is available online).
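The difference between the binary and the counting view can be made concrete with a short sketch. It assumes that each real sample's neighbourhood sphere has a radius equal to the distance to its k-th nearest real neighbour, and that coverage is the fraction of real samples whose sphere contains at least one fake sample. The function name and the default `k` are choices made here for illustration.

```python
import numpy as np


def precision_density_coverage(real, fake, k=5):
    """k-NN based precision, density and coverage on feature arrays.

    real: (N, d) real features; fake: (M, d) generated features.
    """
    real = np.asarray(real, dtype=np.float64)
    fake = np.asarray(fake, dtype=np.float64)
    d_rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    # Column 0 of each sorted row is the sample itself (distance 0),
    # so column k is the distance to the k-th nearest real neighbour.
    radii = np.sort(d_rr, axis=1)[:, k]
    d_rf = np.linalg.norm(real[:, None, :] - fake[None, :, :], axis=-1)
    inside = d_rf < radii[:, None]  # inside[i, j]: fake j lies in sphere i
    precision = inside.any(axis=0).mean()  # fake in at least one sphere
    density = inside.sum() / (k * fake.shape[0])  # how many spheres, normalized
    coverage = inside.any(axis=1).mean()  # real spheres hit by some fake
    return float(precision), float(density), float(coverage)
```

A single real outlier with a huge sphere can raise precision to 1 for every fake sample that falls inside it, while it contributes only one count per fake sample to density, which is the robustness argument made above.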

Figure 10: Pictorial depiction of density and coverage measures naeem2020reliable (See text for definitions). Note that in the recall versus coverage figure (panel b), the real and fake samples are identical across left and right. a) Here, the real manifold is overestimated due to the real outlier sample. Generating many fake samples around the real outlier increases the precision measure. b) Here, although the fake samples are far from the modes in real samples, the recall is perfect because of the overestimated fake manifold. Figure compiled from naeem2020reliable .

2.6.2 Alpha Precision and Recall

Alaa et al. alaa2021faithful introduce a 3-dimensional evaluation metric, ($\alpha$-Precision, $\beta$-Recall, Authenticity), to characterize the fidelity, diversity, and generalization power of generative models (Fig. 11). The first two assume that a fraction $1 - \alpha$ (or $1 - \beta$) of the real (or synthetic) data are “outliers”, and a fraction $\alpha$ (or $\beta$) are “typical”. $\alpha$-Precision is the fraction of synthetic samples that resemble the “most typical” real samples, whereas $\beta$-Recall is the fraction of real samples covered by the most typical synthetic samples. $\alpha$-Precision and $\beta$-Recall are evaluated for all $\alpha, \beta \in [0, 1]$, providing entire precision and recall curves instead of a single number. To compute both metrics, the (real and synthetic) data are embedded into hyperspheres with most samples concentrated around the centers. Typical samples are located near the centers whereas outliers are close to the boundaries. To quantify generalization, they introduce the Authenticity metric, which is based on the probability that a synthetic sample is copied from the training data.

Some other works that have proposed extensions to P&R include djolonga2020precision ; simon2019revisiting ; kynkaanniemi2019improved .

Figure 11: Illustration of the $\alpha$-Precision, $\beta$-Recall and Authenticity metrics alaa2021faithful . Blue and red spheres correspond to the $\alpha$- and $\beta$-supports of the real and generative distributions, respectively. Blue and red points correspond to real and synthetic data. (a) Generated samples falling outside the blue sphere look unrealistic or noisy. (b) Overfitted models can generate high-quality samples that are “unauthentic” because they are copied from training data. (c) High-quality samples should reside in the blue sphere. (d) Outliers do not count in the $\beta$-Recall metric. (Here, $\alpha = \beta = 0.9$, $\alpha$-Precision = 8/9, $\beta$-Recall = 4/9, Authenticity = 9/10). Figure from alaa2021faithful .

2.7 Neuroscore

Figure 12: Left) An example of RSVP experimental protocol in which a rapid image stream (4 images per second) containing target and non-target images is presented to participants. Participants are instructed to search for real face images. The idea is that GAN generated faces will elicit a different response than real faces. Right) Schematic diagram of neuro-AI interface for computing the Neuroscore. Generated images by a GAN are shown to the participants and the corresponding recorded neural responses are used to evaluate the GAN performance. Figure compiled from wang2020use .

Wang et al. wang2020use outline a method called Neuroscore using neural signals and rapid serial visual presentation (RSVP) to directly measure human perceptual response to generated stimuli. Participants are instructed to attend to target images (real and generated) amongst a larger set of non-target images (left panel in Fig. 12). This paradigm, known as the oddball paradigm, is commonly used to elicit the P300 event-related potential (ERP), which is a positive voltage deflection typically occurring between 300 ms and 600 ms after the appearance of a rare visual target. Fig. 12 shows a depiction of this approach. Wang et al. show that Neuroscore is more consistent with human judgments than the conventional metrics. They also trained a convolutional neural network to predict Neuroscore from GAN-generated images directly, without the need for neural responses.

2.8 Duality Gap Metric

Grnarova et al. grnarova2019domain leverage the notion of duality gap from game theory to propose a domain-agnostic and computationally efficient measure that can be used to assess different models as well as to monitor the progress of a single model throughout training. Intuitively, the duality gap measures the sub-optimality (w.r.t. an equilibrium) of a given solution (G, D), where G and D are the generator and the discriminator, respectively. They also show that their measure highly correlates with FID on natural image datasets, and can be used in other modalities such as text and sound. Further, their measure requires no labels or pretrained classifier, making it domain agnostic. This approach is, however, complicated and it is not clear how it measures the fidelity and diversity of samples.
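For intuition, the duality gap of a candidate solution in a two-player min-max game with payoff M, where the first player minimizes and the second maximizes, is the worst-case payoff against the first player minus the worst-case payoff against the second. The sketch below evaluates it by brute force on a toy scalar game; the payoff function and grids are illustrative assumptions, whereas in the GAN setting the inner max and min would be approximated by a few optimization steps on the networks.

```python
import numpy as np


def duality_gap(payoff, g, d, g_grid, d_grid):
    """DG(g, d) = max_{d'} M(g, d') - min_{g'} M(g', d).

    payoff(g, d): game objective, minimized by g and maximized by d.
    g_grid, d_grid: candidate strategies searched exhaustively.
    """
    best_response_d = max(payoff(g, d_alt) for d_alt in d_grid)
    best_response_g = min(payoff(g_alt, d) for g_alt in g_grid)
    return float(best_response_d - best_response_g)


def bilinear(g, d):
    """Toy payoff with a unique equilibrium at (0, 0) on [-1, 1]^2."""
    return g * d


GRID = np.linspace(-1.0, 1.0, 201)
```

The gap is zero at the equilibrium and positive elsewhere, so it can be tracked during training without reference to labels or a pretrained feature extractor.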

3 New Qualitative GAN Evaluation Measures

A number of new qualitative measures have also been proposed. These measures typically focus on how convincing a generated image is from a human perception perspective.

3.1 Human Eye Perceptual Evaluation (HYPE)

Zhou et al. zhou2019hype propose an evaluation approach that is grounded in psychophysics research on visual perception (Fig. 13). They introduce two variants. The first one measures visual perception under adaptive time constraints to determine the threshold at which a model’s outputs appear real (e.g. 250 ms). The second variant is less expensive and measures human error rate on fake and real images without time constraints. Kolchinski et al. kolchinski2019approximating propose an approach to approximate HYPE and report 66% accuracy in predicting human scores of image realism, matching the human inter-rater agreement rate. A major drawback of human evaluation approaches such as HYPE is that they are hard to scale. One way to remedy this problem is to train a model on human judgments and to consult a person only when the model is not certain.

Figure 13: HYPE tests generative models for how realistic their images look to the human eye zhou2019hype . Top) HYPE scores of different models over CelebA and FFHQ datasets. A score of 50% represents indistinguishable results from real, while a score above 50% represents hyper-realism. Bottom) Example images sampled with the truncation trick from StyleGAN trained on FFHQ dataset. Images on the right have the highest HYPE scores (i.e. exhibit the highest perceptual fidelity). Figure compiled from zhou2019hype .

3.2 Seeing What a GAN Can Not Generate

In bau2019seeing , Bau et al. visualize mode collapse at both the distribution level and the instance level. They employ a semantic segmentation network to compare the distribution of segmented objects in the generated images versus the real images. Differences in statistics reveal object classes that are omitted by a GAN. Their approach makes it possible to visualize the GAN’s omissions for an omitted class and to compare differences between individual photos and their approximate inversions by a GAN. Fig. 14 illustrates this approach.

Figure 14: Seeing what a GAN cannot generate bau2019seeing . (a) Distribution of object segmentations in the training set of LSUN churches vs. the corresponding distribution over the generated images. Objects such as people, cars, and fences are dropped by the generator. (b) Pairs of real images and their reconstructions in which individual instances of a person and a fence cannot be generated. Figure compiled from bau2019seeing .

3.3 Measuring GAN Steerability

Jahanian et al. jahanian2019steerability proposed a method to quantify the degree to which basic visual transformations are achieved by navigating the latent space of a GAN (see Fig. 15). They first learn an $N$-dimensional vector representing the optimal path in the latent space for a given transformation. Formally, the task is to learn the walk $w$ by minimizing the following objective function:

$$ w^* = \arg\min_{w} \; \mathbb{E}_{z, \alpha}\left[ \mathcal{L}\big( G(z + \alpha w),\; \mathrm{edit}(G(z), \alpha) \big) \right] $$

where $\alpha$ is the step size, and $\mathcal{L}$ is the distance between the generated image after taking an $\alpha$-step in the latent direction, $G(z + \alpha w)$, and the target image $\mathrm{edit}(G(z), \alpha)$. The latter is the new image derived from the source image $G(z)$. To quantify how well a desired image manipulation under each transformation is achieved, the distributions of a given attribute (e.g. “luminance”) in the real data and generated images (after walking in latent space) are compared.
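The objective can be exercised on a toy problem. The sketch below assumes a linear "generator" with fixed weights and an edit that adds a scaled offset to the image, both of which are hypothetical stand-ins for a trained GAN and a real image transformation, and it uses a squared L2 loss with plain stochastic gradient descent.

```python
import numpy as np

# Toy linear generator weights: 4-dim latent -> 8-dim "image" (an assumption).
TOY_WEIGHTS = np.vstack([np.eye(4), np.eye(4)])


def generator(z):
    """Map a batch of latent codes (batch, 4) to images (batch, 8)."""
    return z @ TOY_WEIGHTS.T


def edit(images, alpha, offset):
    """Hypothetical target edit: shift each image by alpha * offset."""
    return images + alpha[:, None] * offset


def learn_walk(offset, steps=500, lr=0.1, batch=64, seed=0):
    """Minimize E ||G(z + alpha * w) - edit(G(z), alpha)||^2 over w."""
    rng = np.random.default_rng(seed)
    latent_dim = TOY_WEIGHTS.shape[1]
    w = np.zeros(latent_dim)
    for _ in range(steps):
        z = rng.standard_normal((batch, latent_dim))
        alpha = rng.uniform(-1.0, 1.0, size=batch)
        diff = generator(z + alpha[:, None] * w) - edit(generator(z), alpha, offset)
        # Gradient of the squared error w.r.t. w for a linear generator.
        grad = 2.0 * np.mean(alpha[:, None] * (diff @ TOY_WEIGHTS), axis=0)
        w -= lr * grad
    return w
```

With a real GAN the gradient would come from automatic differentiation through the generator, and the learned walk would then be used to compare attribute distributions before and after the edit.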

Figure 15: Top) A walk in the latent space of a GAN corresponds to visual transformations such as zoom and camera shift. Bottom) The goal is to find a path in $z$ space (linear or non-linear) to transform the generated image $G(z)$ to its edited version $\mathrm{edit}(G(z), \alpha)$, e.g. an $\alpha\times$ zoom. To measure steerability, the distributions of a given attribute in real images and generated images (after walking in the latent space) are compared. Figure from jahanian2019steerability .

3.4 GAN Dissection

Figure 16: An overview of the GAN dissection approach bau2018gan . Top) Measuring the relationship between representation units and trees in the output using (a) dissection and (b) intervention. Dissection measures agreement between a unit and a concept by comparing its thresholded upsampled heatmap with a semantic segmentation of the generated image. Intervention measures the causal effect of a set of units on a concept by comparing the effect of forcing these units on (unit insertion) or off (unit ablation). The segmentation reveals that trees increase after insertion and decrease after ablation. The average difference in the tree pixels measures the average causal effect. Bottom) Applying the dissection method to a generated outdoor church image. The dissection method can also be used to diagnose and improve GANs by identifying and ablating the artifact-causing units (panels e to g). Figure compiled from bau2018gan .

Bau et al. bau2018gan proposed a technique to dissect and visualize the inner workings of an image generator (Footnote 10: a demo of this work is available online). The main idea is to identify GAN units (i.e. generator neurons) that are responsible for semantic concepts and objects (such as tree, sky, and clouds) in the generated images. Having this level of granularity into the neurons allows editing existing images (e.g. to add or remove trees as shown in Fig. 16) by forcefully activating and deactivating (ablating) the corresponding units for the desired objects. Their technique also allows finding artifacts in the generated images and hence can be used to evaluate and improve GANs. A similar approach has been proposed in park2019semantic for semantic manipulation and editing of GAN-generated images (Footnote 11: an illustration is linked in the original paper).

3.5 A Universal Fake vs. Real Detector

Wang et al. wang2020cnn ask whether it is possible to create a “universal” detector to distinguish real images from synthetic ones using a dataset of synthetic images generated by 11 CNN-based generative models. See Fig. 17. With careful pre- and post-processing and data augmentation, they show that an image classifier trained on only one specific CNN generator is able to generalize well to unseen architectures, datasets, and training methods. They highlight that today’s CNN-generated images share common systematic flaws, preventing them from achieving realistic image generation. Similar works have been reported in chai2020makes ; yu2019attributing .

Figure 17: Wang et al. wang2020cnn show that a classifier trained to distinguish images generated by only one GAN (ProGAN, the left-most column) from real ones can detect the images generated by other generative models (remaining columns) as well.

4 Discussion

4.1 GAN Benchmarks and Analysis Studies

Following previous works (e.g. lucic2017gans ; shmelkov2018good ), new studies have investigated how to formulate good criteria for GAN evaluation, or have conducted systematic GAN benchmarks. Gulrajani et al. gulrajani2020towards argue that a good evaluation measure should not have a trivial solution (e.g. memorizing the dataset) and show that many scores such as IS and FID can be won by simply memorizing the training data (Fig. 18). They suggest that a necessary condition for a metric not to behave this way is to have a large number of samples. They also propose a measure based on neural network divergences (NND). NND works by training a discriminative model to discriminate between samples of the generative model and samples from a held-out test set. The worse the discriminative model performs, the better the generative model is. Through experimental validation, they show that NND can effectively measure diversity, sample quality, and generalization. In lee2020mimicry , the authors introduce Mimicry, a lightweight PyTorch library that provides implementations of popular GANs and evaluation metrics to closely reproduce reported scores in the literature. They also compare several GANs on seven widely-used datasets by training them under the same conditions, and evaluating them using three popular GAN metrics. Some tutorials and interactive tools have also been developed to understand, visualize, and evaluate GANs.
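The idea behind a critic-based measure such as NND can be illustrated with a toy sketch. This is not the implementation of gulrajani2020towards : it assumes 2-D Gaussian data in place of images, a logistic-regression critic in place of a CNN, and held-out accuracy in place of the divergence value. The names `fit_logistic` and `critic_accuracy` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def fit_logistic(X, y, steps=1000, lr=0.1):
    """Fit a logistic-regression critic with plain gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def critic_accuracy(generated, real_heldout, seed=0):
    """Train a critic on half of the pooled samples and return its
    accuracy on the other half (about 0.5 means indistinguishable)."""
    rng = np.random.default_rng(seed)
    X = np.vstack([generated, real_heldout])
    y = np.concatenate([np.ones(len(generated)), np.zeros(len(real_heldout))])
    idx = rng.permutation(len(X))
    split = len(X) // 2
    train_idx, test_idx = idx[:split], idx[split:]
    w = fit_logistic(X[train_idx], y[train_idx])
    X_test = np.hstack([X[test_idx], np.ones((len(test_idx), 1))])
    pred = (X_test @ w > 0).astype(float)
    return float(np.mean(pred == y[test_idx]))

# Toy data (assumption): the "true" distribution is a standard 2-D Gaussian.
rng = np.random.default_rng(0)
real_heldout = rng.normal(0.0, 1.0, size=(2000, 2))
good_gen = rng.normal(0.0, 1.0, size=(2000, 2))  # matches the true distribution
bad_gen = rng.normal(1.5, 1.0, size=(2000, 2))   # shifted, hence easy to spot

acc_good = critic_accuracy(good_gen, real_heldout)
acc_bad = critic_accuracy(bad_gen, real_heldout)
print(f"critic accuracy, good generator: {acc_good:.3f}")
print(f"critic accuracy, bad generator:  {acc_bad:.3f}")
```

In this toy setting the critic should stay near chance on the well-matched generator and do clearly better on the shifted one, which is the sense in which a poorly performing critic indicates a better generative model. A linear critic cannot detect memorization of a finite training set; that requires a higher-capacity critic and many samples, as the authors point out.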

Figure 18: Left) Gulrajani et al. gulrajani2020towards show that common evaluation measures such as IS and FID prefer a model that memorizes the dataset (red) to a model (green) which imperfectly fits the true distribution (blue) but covers more of its support. Right) Neural network divergence, D_CNN, discounts memorization and prefers GANs that generalize beyond the training set. Figure compiled from gulrajani2020towards .
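The memorization issue can be reproduced in a few lines with the Fréchet distance that underlies FID. The sketch below makes several assumptions: raw 8-dimensional toy vectors stand in for Inception features, the reference statistics come from the training set itself, and `frechet_distance` is a hypothetical helper rather than any library's API.

```python
import numpy as np

def frechet_distance(a, b):
    """Frechet distance between Gaussians fitted to two sample sets
    (rows are samples, columns are feature dimensions)."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a = np.cov(a, rowvar=False)
    cov_b = np.cov(b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative
    # for positive semi-definite inputs.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 8))      # toy training set
memorizer = train.copy()                          # model that replays the training set
imperfect = rng.normal(0.1, 1.1, size=(500, 8))   # slightly-off model with new samples

fd_memorizer = frechet_distance(memorizer, train)
fd_imperfect = frechet_distance(imperfect, train)
print(f"Frechet distance, memorizer: {fd_memorizer:.6f}")
print(f"Frechet distance, imperfect: {fd_imperfect:.6f}")
```

Under these assumptions the memorizer obtains an essentially zero distance and therefore "wins", even though it produces no novel samples, while the imperfect but generalizing model is penalized.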

4.2 Assessing Fairness and Bias of Generative Models

Fairness and bias in ML algorithms, datasets, and commercial products have become growing concerns recently (e.g. buolamwini2018gender ) and have attracted widespread attention from the public and the media. Even without a single precise definition of fairness verma2018fairness , it is still possible to observe the lack of fairness across many domains and models, with GANs being no exception. Some recent works (e.g. xu2018fairgan ; yu2020inclusive ) have tried to mitigate the bias in GANs, or have used GANs to reduce bias in other ML algorithms (e.g. sattigeri2019fairness ; mcduff2019characterizing ). Bias can enter a model during training (through data, labeling, architecture), evaluation (such as who created the evaluation measure), as well as deployment (how the model is being distributed and for what purposes). Therefore it is critical to address these aspects when evaluating and comparing generative models.

4.3 Connection to Deepfakes

An alarming application of generative models is fabricating fake content. It is thus crucial to develop tools and techniques to detect such content and limit its use (see tolosana2020deepfakes for a survey on deepfakes). There is a natural connection between deepfake detection and GAN evaluation. As generative models improve, it becomes increasingly hard for a human to distinguish between what is real and what is fake (manipulated images, videos, text, and audio). Measuring the degree of fakeness of an image or video directly tells us about the performance of a generator. Even over faces, where generative models excel, it is still possible to detect fake images (Fig. 19), although in some cases it can be very daunting (Fig. 20). The difficulty of deepfake detection for humans is also category dependent, with some categories such as faces, cats, and dogs being easier than bedrooms or cluttered scenes (Fig. 21). Fortunately, or unfortunately, it is still possible to build deep networks that can detect the subtle artifacts in doctored images (e.g. using the universal detectors mentioned above wang2020cnn ; chai2020makes ). Moving forward, it is important to study whether and how GAN evaluation measures can help us mitigate the threat from deepfakes.

Figure 19: Sample generated/fake faces and cues to tell them apart from real ones. Image courtesy of Twitter.
Figure 20: Can you determine which face in each pair is real? Key (L for left and R for right) row wise: L, L, R, L, R, R, L, L, L, R, R, R, R, L.
Figure 21: Some sample generated cats (top) and beds (bottom). Although the images look realistic at first glance, a closer examination reveals the artifacts (see it for yourself!).

5 Summary and Conclusion

Here, I reviewed a number of recently proposed GAN evaluation measures. Similar to generative models, evaluation measures are also evolving and improving over time. Although some measures such as IS, FID, P&R, and PPL are relatively more popular than others, objective and comprehensive evaluation of generative models is still an open problem. Some directions for future research in GAN evaluation are as follows.

  1. Prior research has focused on examining generative models of faces (or scenes containing one or a few objects), and relatively less effort has been devoted to assessing how good GANs are at generating more complex scenes such as bedrooms, street scenes, and nature scenes. casanova2020generating is an early effort in this direction.

  2. Generative models have been primarily evaluated in terms of the quality and diversity of their generated images. Other dimensions such as generalization and fairness have been less explored. Generalization assessment can provide a deeper look into what generative models learn. For example, it can tell how and to what degree models capture compositionality (e.g. does a model generate the right number of paws, noses, eyes, etc. for dogs?) and logic (does a model properly capture physical properties such as gravity, light direction, reflection, and shadow?). Evaluating models in terms of fairness is critical for mitigating the potential risks that may arise at deployment time and for ensuring that a model has the right societal impact.

  3. An important matter in GAN evaluation is task dependency. In other words, how well a generative model works depends on its intended use. In some tasks (e.g. graphics applications such as image synthesis, image translation, image inpainting, and attribute manipulation) image quality is more important, whereas in some other tasks (e.g. generating synthetic data for data augmentation) diversity may weigh more. Thus, evaluation metrics should be tailored to the target task.

  4. Having good evaluation measures is important not only for ranking models but also for diagnosing their errors. Reliable GAN evaluation is particularly important in domains where humans are less attuned to discerning the quality of samples (e.g. medical images). An important question thus is whether current evaluation measures generalize across different domains (i.e. are domain-agnostic).

  5. Ultimately, future works should also study how research in evaluating performance of generative models can help mitigate the threat from fabricated content.