1 Introduction
Generative models have revolutionized AI and are now widely used [Footnote 1: For a list of GAN applications, please refer to https://machinelearningmastery.com/impressiveapplicationsofgenerativeadversarialnetworks/ ]. They are remarkably effective at synthesizing strikingly realistic and diverse images, and learn to do so in an unsupervised manner wang2019generative ; liu2021generative ; bond2021deep . Models such as generative adversarial networks (GANs) goodfellow2014generative , autoregressive models such as PixelCNNs oord2016conditional , variational autoencoders (VAEs) kingma2013auto , and recently transformer GANs hudson2021generative ; jiang2021transgan have been constantly improving the state of the art in image generation (see for example brock2018large ; karras2019style ; ramesh2021zero ; razavi2019generating ).
A key difficulty in generative modeling is evaluating performance, i.e. how good is a model at approximating a data distribution? Quantitative evaluation [Footnote 2: Generative models are usually evaluated in terms of fidelity (how realistic a generated image is) and diversity (how well generated samples capture the variations in real data) of the learned distribution.] of generative models of images and video is an open problem. Here, I give a summary and discussion of the latest progress (since 2018 borji2019pros ) in this field covering newly proposed measures, benchmarks, and GAN visualization and diagnosis techniques.
1.1 Popular Scores and Their Shortcomings
The two most common GAN evaluation measures are Inception Score (IS) and Fréchet Inception Distance (FID). IS salimans2016improved computes the KL divergence between the conditional class distribution and the marginal class distribution over the generated data. FID heusel2017gans calculates the Wasserstein-2 (a.k.a. Fréchet) distance between multivariate Gaussians fitted to the embedding space of the Inception-v3 network of generated and real images [Footnote 3: To learn how to implement FID, please see https://machinelearningmastery.com/howtoimplementtheFréchetinceptiondistancefidfromscratch/ ].
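As a minimal sketch of the IS definition above, the snippet below computes the exponential of the average KL divergence between each conditional class distribution p(y|x) and the marginal p(y). The function name `inception_score` and the input format (one class-probability list per generated image) are assumptions made for illustration; in practice the probabilities would come from the Inception network rather than being written by hand.

```python
import math

def inception_score(probs, eps=1e-12):
    """probs: list of per-image class-probability lists p(y|x), each summing to 1."""
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal class distribution p(y), averaged over all generated images.
    marginal = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    # Average KL(p(y|x) || p(y)) over images; eps guards against log(0).
    mean_kl = 0.0
    for p in probs:
        kl = sum(pc * (math.log(pc + eps) - math.log(marginal[c] + eps))
                 for c, pc in enumerate(p) if pc > 0)
        mean_kl += kl / n
    return math.exp(mean_kl)

# Confident predictions spread evenly over 3 classes vs. uninformative predictions.
confident_and_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uninformative = [[1 / 3, 1 / 3, 1 / 3]] * 3
```

In this toy setting the first input reaches the maximum score (the number of classes) and the second the minimum of 1. Because only class probabilities enter the computation, variation among images of the same class cannot affect the score.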
IS does not capture intra-class diversity, is insensitive to the prior distribution over labels, is very sensitive to the model parameters and implementations barratt2018note , and requires a large sample size to be reliable. FID has been widely adopted because of its consistency with human inspection and sensitivity to small changes in the real distribution (e.g. slight blurring or small artifacts in synthesized images). Unlike IS, FID can detect intra-class mode collapse [Footnote 4: Mode collapse happens when a generator produces data from only a few modes of the target distribution, thus failing to generate diverse enough outputs. Please see https://wandb.ai/authors/DCGANndbtest/reports/MeasuringModeCollapseinGANsVmlldzoxNzg5MDk ]. The Gaussian assumption in FID calculation, however, might not hold in practice. A major drawback of FID is its high bias. The sample size used to calculate FID has to be large enough (usually above 50K). Smaller sample sizes can lead to overestimating the actual FID chong2020effectively . While single-value metrics like IS and FID successfully capture many aspects of the generator, they lump quality and diversity evaluation together and are therefore not ideal for diagnostic purposes. Further, they continue to have a blind spot for image quality karras2019style .
Another popular approach to evaluate GANs is manual inspection. While it can directly tell us about the image quality, it is limited in assessing sample diversity. It is also subjective, as reviewers may incorporate their own biases and opinions. Further, it requires knowledge of what is realistic and what is not for the target domain (which might be hard to learn in some domains) and is constrained by the number of images that can be reviewed in a reasonable time.
2 New Quantitative GAN Evaluation Measures
2.1 New Variants of Fréchet Inception Distance and Inception Score
2.1.1 Unbiased FID and IS
Chong and Forsyth chong2020effectively show that the FID and IS are biased in the sense that their expected value, computed using a finite sample set, is not their true value. They find that the bias depends on the number of images used for calculating the score and on the generators themselves, making objective comparisons between models difficult. To mitigate the issue, they propose an extrapolation approach to obtain bias-free estimates of the scores, called $\mathrm{FID}_\infty$ and $\mathrm{IS}_\infty$, computed with an infinite number of samples. These extrapolated scores are simple, drop-in replacements for the finite sample scores [Footnote 5: Their code is available at https://github.com/mchong6/FID_IS_infinity ].
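The extrapolation idea can be illustrated with a short sketch. It assumes that the score is approximately linear in 1/N, so that fitting a line to scores computed at several sample sizes N and reading off the intercept gives the value for N → ∞. The function name `extrapolate_to_infinity` and the synthetic numbers are hypothetical; the authors' repository remains the reference implementation.

```python
def extrapolate_to_infinity(sample_sizes, scores):
    """Least-squares fit score ~ a + b * (1/N); returns the intercept a (N -> infinity)."""
    xs = [1.0 / n for n in sample_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    return mean_y - slope * mean_x

# Synthetic example (made-up numbers): a score that shrinks towards 10 as N grows.
sizes = [5000, 10000, 20000, 50000]
fake_fid = [10.0 + 4000.0 / n for n in sizes]
fid_inf = extrapolate_to_infinity(sizes, fake_fid)
```

In the synthetic example the finite-sample scores all overestimate the limiting value, and the intercept of the fitted line recovers it.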
2.1.2 Fast Fréchet Inception Distance
Commonly, FID is used to compare generative models after training. It can also be used during training to monitor model improvement, but the problem is the high computational cost when using typical sample sizes (about 50K). Mathiasen et al. mathiasen2020fast propose a method to speed up the FID computation. The idea is based on the fact that the real data samples do not change during training, so their Inception encodings need to be computed only once. Therefore, reducing the number of fake samples lowers the time to compute Inception encodings. The bottleneck, however, is computing $\operatorname{tr}(\sqrt{\Sigma_1 \Sigma_2})$ in the FID formula, where $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of the Gaussians fitted to the real and generated data. FastFID circumvents this issue without explicitly computing $\sqrt{\Sigma_1 \Sigma_2}$. Depending on the number of fake samples, FastFID can speed up FID computation 25 to 500 times.
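For reference, the FID between a Gaussian fitted to real features (mean $\mu_1$, covariance $\Sigma_1$) and a Gaussian fitted to generated features (mean $\mu_2$, covariance $\Sigma_2$) is commonly written as follows. The subscript convention is an assumption made here for readability.

```latex
\mathrm{FID} = \lVert \mu_1 - \mu_2 \rVert_2^2
  + \operatorname{tr}\!\left( \Sigma_1 + \Sigma_2 - 2\,(\Sigma_1 \Sigma_2)^{1/2} \right)
```

One standard identity that helps here is that $\operatorname{tr}\big((\Sigma_1 \Sigma_2)^{1/2}\big)$ equals the sum of the square roots of the eigenvalues of $\Sigma_1 \Sigma_2$, so the trace term can be obtained without forming the full matrix square root.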
2.1.3 Fréchet Video Distance (FVD)
Unterthiner et al. unterthiner2019fvd propose an extension of the FID, called FVD, to evaluate generative models of video. In addition to the quality of each frame, FVD captures the temporal coherence in a video. To obtain a suitable video feature representation, they used a pretrained network (the Inflated 3D Convnet carreira2017quo pretrained on the Kinetics-400 and Kinetics-600 datasets) that considers the temporal coherence of the visual content across a sequence of frames. The I3D network generalizes the Inception architecture to sequential data, and is trained to perform action recognition on the Kinetics dataset consisting of human-centered YouTube videos. FVD considers a distribution over videos, thereby avoiding the drawbacks of frame-level measures.
Some other measures for video synthesis evaluation include Average Content Distance tulyakov2018mocogan (the average L2 distance among all consecutive frames in a video), Cumulative Probability Blur Detection (CPBD) narvekar2009no , Frequency Domain Blurriness Measure (FDBM) de2013image , Pose Error yang2018pose , and Landmark Distance (LMD) chen2018lip . For more details on video generation and evaluation, please consult oprea2020review .
2.2 Methods based on Analysing Data Manifold
These methods attempt to measure disentanglement in the representations learned by generative models, and are useful for improving the generalization, robustness, and interpretability of models.
2.2.1 Local Intrinsic Dimensionality (LID)
Barua et al. barua2019quality introduce a new evaluation measure called “CrossLID” to assess the local intrinsic dimensionality (LID) of real-world data with respect to neighborhoods found in GAN-generated samples. Intuitively, CrossLID measures the degree to which the manifolds of two data distributions coincide with each other. The idea behind CrossLID is depicted in Fig. 1. Barua et al. compare CrossLID with other measures and show that it a) is strongly correlated with the progress of GAN training, b) is sensitive to mode collapse, c) is robust to small-scale noise and image transformations, and d) is robust to sample size.
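A rough sketch of the idea follows. It assumes the widely used maximum-likelihood LID estimator based on k-nearest-neighbour distances (the estimator is not specified in the text above, so this choice is an assumption), and it assumes all neighbour distances are positive and not all equal. The function names `lid_mle` and `cross_lid` are hypothetical.

```python
import math

def lid_mle(neighbor_distances):
    """Maximum-likelihood LID estimate from sorted, positive k-NN distances."""
    r_max = neighbor_distances[-1]
    mean_log_ratio = sum(math.log(r / r_max) for r in neighbor_distances) / len(neighbor_distances)
    return -1.0 / mean_log_ratio

def cross_lid(real, generated, k=3):
    """Average LID of real points w.r.t. their k nearest neighbours among generated points."""
    estimates = []
    for x in real:
        # Distances from one real sample to its k closest generated samples.
        dists = sorted(math.dist(x, g) for g in generated)[:k]
        estimates.append(lid_mle(dists))
    return sum(estimates) / len(estimates)

# Toy 1-D example with a single real sample and three generated samples.
toy_value = cross_lid([(0.0,)], [(0.5,), (1.0,), (5.0,)], k=2)
```

Real embeddings would replace the toy tuples; the neighbourhoods are always taken in the generated set, which is what makes the measure a "cross" LID.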
2.2.2 Intrinsic Multiscale Distance (IMD)
Tsitsulin et al. tsitsulin2019shape argue that current evaluation measures only reflect the first two or three moments of distributions (mean and covariance). As illustrated in Fig. 2, FID and KID (Kernel Inception Distance binkowski2018demystifying ) are insensitive to the global structure of the data distribution. They propose a measure, called IMD, to take all data moments into account, and claim that IMD is intrinsic [Footnote 6: It is invariant to isometric transformations of the manifold such as translation or rotation.] and multiscale [Footnote 7: It captures both local and global information.]. Tsitsulin et al. experimentally show that their method is effective in discerning the structure of data manifolds even on unaligned data. IMD compares data distributions based on their geometry and is similar to the Geometry Score khrulkov2018geometry . These approaches, however, are somewhat complicated.
2.2.3 Perceptual Path Length (PPL)
First introduced in StyleGAN karras2019style , PPL measures whether and how much the latent space of a generator is entangled (or if it is smooth and the factors of variation are properly separated [Footnote 8: For example, features that are absent in either endpoint may appear in the middle of a linear interpolation path.]). Intuitively, a less curved latent space should result in perceptually smoother transitions than a highly curved latent space. Formally, PPL is the empirical mean of the perceptual difference between consecutive images in the latent space $\mathcal{Z}$, over all possible endpoints:
$$ l_{\mathcal{Z}} = \mathbb{E}\left[ \frac{1}{\epsilon^2}\, d\big( G(\mathrm{slerp}(z_1, z_2; t)),\; G(\mathrm{slerp}(z_1, z_2; t+\epsilon)) \big) \right] \qquad (1) $$
where $z_1, z_2 \sim P(z)$, $t \sim U(0,1)$, $G$ is the generator, and $d(\cdot,\cdot)$ is the perceptual distance between the resulting images. $\mathrm{slerp}$ denotes the spherical interpolation shoemake1985animating , and $\epsilon$ is the step size. The LPIPS (learned perceptual image patch similarity) zhang2018unreasonable can be used to measure the perceptual distance between two images. LPIPS is a weighted L2 difference between two VGG16 simonyan2014very embeddings, where the weights are learned to make the metric agree with human perceptual similarity judgments. PPL captures semantics and image quality as shown in Fig. 3. Fig. 4 illustrates the superiority of PPL over FID and Precision-Recall scores in comparing two models.
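The computation can be sketched as a Monte Carlo average in which each term is d(G(slerp(z1, z2; t)), G(slerp(z1, z2; t + ε))) / ε². The sketch below is a toy under stated assumptions: latent endpoints are normalized Gaussian samples, and `identity_generator` and `squared_l2` are hypothetical stand-ins for a trained generator and a perceptual distance such as LPIPS.

```python
import math
import random

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def slerp(a, b, t):
    """Spherical interpolation between unit vectors a and b."""
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)
    if omega < 1e-12:  # nearly identical endpoints: nothing to interpolate
        return list(a)
    wa = math.sin((1.0 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]

def path_term(z1, z2, t, generator, distance, eps=1e-4):
    """One Monte Carlo term: d(G(slerp(t)), G(slerp(t + eps))) / eps^2."""
    img_a = generator(slerp(z1, z2, t))
    img_b = generator(slerp(z1, z2, t + eps))
    return distance(img_a, img_b) / eps ** 2

def perceptual_path_length(generator, distance, dim, n_samples=1000, eps=1e-4, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z1 = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
        z2 = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
        total += path_term(z1, z2, rng.random(), generator, distance, eps)
    return total / n_samples

# Hypothetical stand-ins: identity "generator" and squared Euclidean "perceptual" distance.
def identity_generator(z):
    return z

def squared_l2(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))
```

With these stand-ins, a path between two orthogonal unit vectors contributes approximately (π/2)², the squared angle between the endpoints; a real generator and perceptual distance would replace both placeholder functions.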
2.2.4 Linear Separability in Latent Space
Karras et al. karras2019style propose another measure in StyleGAN to quantify latent space disentanglement by measuring how well the latent-space points can be separated into two distinct sets via a linear hyperplane (e.g. separating a binary attribute of the image such as male and female for faces). To measure the separability of an attribute, they generate some images with $z \sim P(z)$ and classify them using an auxiliary binary classifier trained on real data. They then fit a linear SVM to predict the label based on the latent-space points and classify the points by this plane. The conditional entropy $H(Y|X)$ is then computed, where $X$ is the classes predicted by the SVM and $Y$ is the classes determined by the pretrained classifier. A low value suggests consistent latent space directions for the corresponding factor(s) of variation. The final separability score is $\exp(\sum_i H(Y_i|X_i))$, where $i$ enumerates over a set of attributes.
2.3 Classification Accuracy Score (CAS)
Ravuri and Vinyals ravuri2019classification propose a measure that is in essence similar to the classifier two-sample tests (C2ST) lopez2016revisiting . They argue that if a generative model is learning the data distribution in a perceptually meaningful space then it should perform well in downstream tasks. They use class-conditional generative models, drawn from several families such as GANs and VAEs, to infer the class labels of real data. They then train an image classifier using only synthetic data and use it to predict labels of real images in the test set. They find that a) when using a state-of-the-art GAN (BigGAN-deep brock2018large ), Top-1 and Top-5 accuracy decrease by 27.9% and 41.6%, respectively, compared to the original data, b) CAS automatically identifies particular classes for which generative models fail to capture the data distribution (Fig. 5), and c) IS and FID are neither predictive of CAS, nor useful when evaluating non-GAN models.
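The protocol (train on synthetic data only, test on real data) can be sketched as below. A nearest-centroid classifier is used purely as a hypothetical stand-in for the image classifier, and the one-dimensional toy data are made up for illustration.

```python
def train_nearest_centroid(samples, labels):
    """Hypothetical stand-in classifier: one mean vector per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    # Assign x to the class whose centroid is closest in squared Euclidean distance.
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def classification_accuracy_score(synthetic_x, synthetic_y, real_x, real_y):
    """Train on synthetic (class-conditional) samples only, evaluate on real test data."""
    model = train_nearest_centroid(synthetic_x, synthetic_y)
    correct = sum(predict(model, x) == y for x, y in zip(real_x, real_y))
    return correct / len(real_x)

# Toy data: the "generator" places class 1 too far to the right of the real class 1.
synthetic_x = [[0.0], [0.2], [0.9], [1.1]]
synthetic_y = [0, 0, 1, 1]
real_x = [[0.1], [0.0], [3.0], [0.5]]
real_y = [0, 0, 1, 1]
cas = classification_accuracy_score(synthetic_x, synthetic_y, real_x, real_y)
```

Computing the accuracy separately per class, rather than as one number, is what allows failing classes to be identified.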
2.4 Non-Parametric Tests to Detect Data-Copying
Meehan et al. meehan2020non formalize a notion of overfitting called data-copying, where a generative model memorizes the training samples or their small variations (Fig. 6). They provide a three-sample test for detecting data-copying that uses the training set, a separate held-out sample from the target distribution, and a generated sample from the model. The key insight is that an overfitted GAN generates samples that are on average closer to the training samples, so the average distance of generated samples to training samples is smaller than the corresponding distance between held-out test samples and the training samples. They also divide the instance space into cells and conduct their test separately in each cell. This is because, as they argue, generative models tend to behave differently in different regions of space.
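The key insight can be illustrated with a simplified stand-in for the published test. The statistic below is an assumption made for illustration (a normalized rank comparison of nearest-neighbour distances), it is not the authors' exact test, and it omits the division of the instance space into cells.

```python
import math

def nn_distance(point, reference):
    """Distance from point to its nearest neighbour in the reference set."""
    return min(math.dist(point, r) for r in reference)

def copying_statistic(train, held_out, generated):
    """Fraction of (generated, held-out) pairs in which the generated sample is
    closer to the training set (ties count half). Values well above 0.5 hint
    at data-copying."""
    gen_d = [nn_distance(g, train) for g in generated]
    test_d = [nn_distance(h, train) for h in held_out]
    wins = sum(1.0 if g < h else 0.5 if g == h else 0.0
               for g in gen_d for h in test_d)
    return wins / (len(gen_d) * len(test_d))

# Toy 1-D data: one generator nearly copies the training points, the other does not.
train = [(0.0,), (1.0,), (2.0,)]
held_out = [(0.5,), (1.4,)]
copied = [(0.01,), (1.01,), (1.99,)]
novel = [(0.5,), (1.5,)]
stat_copied = copying_statistic(train, held_out, copied)
stat_novel = copying_statistic(train, held_out, novel)
```

Here the near-copies are closer to the training set than every held-out sample, whereas the other generated set is not.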
2.5 Measures that Probe Generalization in GANs
Zhao et al. zhao2018bias utilize carefully designed training datasets to characterize how existing models generate novel attributes and their combinations. Some of their findings are as follows. When presented with a training set in which all images have exactly 3 objects, both GANs and VAEs typically generate 2-5 objects (Fig. 7.B). Over a multimodal training distribution (e.g. images with either 2 or 10 objects), the model acts as if it is trained separately on each mode, and it averages the two modes. When the modes are close to each other (e.g. 2 and 4 objects), the learned distribution assigns higher probability to the mean number of objects (3 in this example), even though there was no image with 3 objects in the training set (Fig. 7.C). When the training set contains certain combinations (e.g. red cubes but not yellow cubes; Fig. 7.D), the model memorizes the combinations in the training set when it contains a small number of them (e.g. 20), and generates novel combinations when there is more variety (e.g. 80). Xuan et al. xuan2019anomalous made a similar observation (Fig. 8). On a geometric-object training dataset where the number of objects in each training image is fixed, they found a variable number of objects in generated images. Some other studies that have investigated GAN generalization include o2018evaluating ; van2020investigating .
2.6 New Ideas based on Precision and Recall (P&R)
Sajjadi et al. sajjadi2018assessing proposed to use precision and recall to explicitly quantify the trade-off between quality (precision) and coverage (recall). Precision measures the similarity of generated instances to the real ones and recall measures the ability of a generator to synthesize all instances found in the training set (Fig. 9). P&R curves can distinguish mode collapse (poor recall) and bad quality (poor precision). Some new ideas based on P&R are summarized below.
2.6.1 Density and Coverage
Naeem et al. naeem2020reliable argue that even the latest versions of the precision and recall metrics are still not reliable as they a) fail to detect the match between two identical distributions, b) are not robust against outliers, and c) have arbitrarily selected evaluation hyperparameters. To solve these issues, they propose density and coverage metrics. Precision counts the binary decision of whether a given fake sample is contained in any real-sample neighbourhood sphere. Density, instead, counts how many real-sample neighbourhood spheres contain that fake sample. See Fig. 10. They analytically and experimentally show that density and coverage provide more interpretable and reliable signals for practitioners than the existing measures [Footnote 9: Their code is available at https://github.com/clovaai/generativeevaluationprdc ].
2.6.2 Alpha Precision and Recall
Alaa et al. alaa2021faithful introduce a 3-dimensional evaluation metric, ($\alpha$-Precision, $\beta$-Recall, Authenticity), to characterize the fidelity, diversity and generalization power of generative models (Fig. 11). The first two assume that a fraction $1-\alpha$ (or $1-\beta$) of the real (and synthetic) data are “outliers”, and $\alpha$ (or $\beta$) are “typical”. $\alpha$-Precision is the fraction of synthetic samples that resemble the “most typical” $\alpha$ real samples, whereas $\beta$-Recall is the fraction of real samples covered by the most typical $\beta$ synthetic samples. $\alpha$-Precision and $\beta$-Recall are evaluated for all $\alpha, \beta \in [0,1]$, providing entire precision and recall curves instead of a single number. To compute both metrics, the (real and synthetic) data are embedded into hyperspheres with most samples concentrated around the centers. Typical samples are located near the centers whereas outliers are close to the boundaries. To quantify generalization, they introduce the Authenticity metric to measure the probability that a synthetic sample is copied from the training data.
Some other works that have proposed extensions to P&R include djolonga2020precision ; simon2019revisiting ; kynkaanniemi2019improved .
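The neighbourhood-based quantities discussed in this section can be sketched in a few lines. The sketch assumes that each real sample's neighbourhood sphere has a radius equal to the distance to its k-th nearest real neighbour, and that coverage is the fraction of real samples whose sphere contains at least one fake sample; both are assumptions about details not spelled out above. The function names are hypothetical, and the toy points stand in for feature embeddings.

```python
import math

def knn_radius(points, index, k):
    """Distance from points[index] to its k-th nearest neighbour among the other points."""
    dists = sorted(math.dist(points[index], p) for j, p in enumerate(points) if j != index)
    return dists[k - 1]

def precision_density_coverage(real, fake, k=1):
    radii = [knn_radius(real, i, k) for i in range(len(real))]
    # inside[j][i] is True when fake sample j lies in the k-NN sphere of real sample i.
    inside = [[math.dist(y, x) <= r for x, r in zip(real, radii)] for y in fake]
    # Precision: binary decision per fake sample (is it in any sphere?).
    precision = sum(any(row) for row in inside) / len(fake)
    # Density: how many spheres contain each fake sample, normalized by k.
    density = sum(sum(row) for row in inside) / (k * len(fake))
    # Coverage: fraction of real spheres that contain at least one fake sample.
    coverage = sum(any(row[i] for row in inside) for i in range(len(real))) / len(real)
    return precision, density, coverage

real = [(0.0,), (1.0,), (2.0,), (3.0,)]
fake = [(0.5,), (10.0,)]
precision, density, coverage = precision_density_coverage(real, fake, k=1)
```

In the toy example one fake sample falls inside two real spheres and the other inside none, so precision is 0.5 while density is 1.0, which shows how density rewards samples in well-populated regions rather than recording a yes/no decision.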
2.7 Neuroscore
Wang et al. wang2020use outline a method called Neuroscore using neural signals and rapid serial visual presentation (RSVP) to directly measure human perceptual response to generated stimuli. Participants are instructed to attend to target images (real and generated) amongst a larger set of non-target images (left panel in Fig. 12). This paradigm, known as the oddball paradigm, is commonly used to elicit the P300 event-related potential (ERP), which is a positive voltage deflection typically occurring between 300 ms and 600 ms after the appearance of a rare visual target. Fig. 12 shows a depiction of this approach. Wang et al. show that Neuroscore is more consistent with human judgments compared to the conventional metrics. They also trained a convolutional neural network to predict Neuroscore from GAN-generated images directly without the need for neural responses.
2.8 Duality Gap Metric
Grnarova et al. grnarova2019domain leverage the notion of duality gap from game theory to propose a domain-agnostic and computationally efficient measure that can be used to assess different models as well as to monitor the progress of a single model throughout training. Intuitively, duality gap measures the sub-optimality (w.r.t. an equilibrium) of a given solution (G,D), where G and D are the generator and the discriminator, respectively. They also show that their measure highly correlates with FID on natural image datasets, and can also be used in other modalities such as text and sound. Further, their measure requires no labels or a pretrained classifier, making it domain agnostic. This approach is, however, complicated and it is not clear how it measures fidelity and diversity of samples.
3 New Qualitative GAN Evaluation Measures
A number of new qualitative measures have also been proposed. These measures typically focus on how convincing a generated image is from the perspective of human perception.
3.1 Human Eye Perceptual Evaluation (HYPE)
Zhou et al. zhou2019hype propose an evaluation approach that is grounded in psychophysics research on visual perception (Fig. 13). They introduce two variants. The first one measures visual perception under adaptive time constraints to determine the threshold at which a model’s outputs appear real (e.g. 250 ms). The second variant is less expensive and measures human error rate on fake and real images without time constraints. Kolchinski et al. kolchinski2019approximating propose an approach to approximate HYPE and report 66% accuracy in predicting human scores of image realism, matching the human inter-rater agreement rate. A major drawback of human evaluation approaches such as HYPE is that they are hard to scale. One way to remedy this problem is to train a model on human judgments and consult a person only when the model is not certain.
3.2 Seeing What a GAN Can Not Generate
In bau2019seeing , Bau et al. visualize mode collapse at both the distribution level and the instance level. They employ a semantic segmentation network to compare the distribution of segmented objects in the generated images versus the real images. Differences in statistics reveal object classes that are omitted by a GAN. Their approach makes it possible to visualize the GAN’s omissions for an omitted class and to compare differences between individual photos and their approximate inversions by a GAN. Fig. 14 illustrates this approach.
3.3 Measuring GAN Steerability
Jahanian et al. jahanian2019steerability proposed a method to quantify the degree to which basic visual transformations are achieved by navigating the latent space of a GAN (see Fig. 15). They first learn an $N$-dimensional vector representing the optimal path in the latent space for a given transformation. Formally, the task is to learn the walk $w$ by minimizing the following objective function:
$$ w^* = \arg\min_{w} \mathbb{E}_{z,\alpha}\left[ \mathcal{L}\big( G(z + \alpha w),\; \mathrm{edit}(G(z), \alpha) \big) \right] \qquad (2) $$
where $\alpha$ is the step size, and $\mathcal{L}$ is the distance between the generated image after taking an $\alpha$-step in the latent direction, $G(z + \alpha w)$, and the target image $\mathrm{edit}(G(z), \alpha)$. The latter is the new image derived from the source image $G(z)$. To quantify how well a desired image manipulation under each transformation is achieved, the distributions of a given attribute (e.g. “luminance”) in the real data and generated images (after walking in latent space) are compared.
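A toy sketch of learning such a walk is given below. It minimizes the mean squared distance between G(z + αw) and edit(G(z), α) over a fixed batch of (z, α) pairs. Everything here is hypothetical and chosen for illustration: the two-pixel linear `generator`, the brightness-shift `edit`, and the finite-difference gradient descent, which stands in for the backpropagation a real implementation would use.

```python
import random

def generator(z):
    """Hypothetical linear toy 'generator' mapping a 2-D latent to a 2-pixel image."""
    return [z[0] + z[1], z[0] - z[1]]

def edit(image, alpha):
    """Target transformation: shift the brightness of every pixel by alpha."""
    return [p + alpha for p in image]

def walk_loss(w, batch):
    """Mean squared distance between G(z + alpha*w) and edit(G(z), alpha)."""
    total = 0.0
    for z, alpha in batch:
        walked = generator([zi + alpha * wi for zi, wi in zip(z, w)])
        target = edit(generator(z), alpha)
        total += sum((a - b) ** 2 for a, b in zip(walked, target))
    return total / len(batch)

def learn_walk(batch, dim=2, lr=0.1, steps=500, h=1e-5):
    w = [0.0] * dim
    for _ in range(steps):
        grad = []
        for i in range(dim):  # central finite-difference gradient
            w_hi = w[:i] + [w[i] + h] + w[i + 1:]
            w_lo = w[:i] + [w[i] - h] + w[i + 1:]
            grad.append((walk_loss(w_hi, batch) - walk_loss(w_lo, batch)) / (2 * h))
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

rng = random.Random(0)
batch = [([rng.gauss(0, 1), rng.gauss(0, 1)], rng.uniform(-1, 1)) for _ in range(64)]
w_star = learn_walk(batch)
```

For this toy generator the brightness shift is achieved exactly by moving along the first latent coordinate, so the learned walk approaches (1, 0) and the loss approaches zero. With a real GAN the transformation is generally only partly achievable, which is what the attribute-distribution comparison quantifies.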
3.4 GAN Dissection
Bau et al. bau2018gan proposed a technique to dissect and visualize the inner workings of an image generator [Footnote 10: A demo of this work is available at https://www.google.com/search?q=GAN+dissection&oq=GAN+dissection&aqs=chrome..69i57.7356j0j1&sourceid=chrome&ie=UTF8 ]. The main idea is to identify GAN units (i.e. generator neurons) that are responsible for semantic concepts and objects (such as tree, sky, and clouds) in the generated images. Having this level of granularity into the neurons allows editing existing images (e.g. to add or remove trees as shown in Fig. 16) by forcefully activating and deactivating (ablating) the corresponding units for the desired objects. Their technique also allows finding artifacts in the generated images and hence can be used to evaluate and improve GANs. A similar approach has been proposed in park2019semantic for semantic manipulation and editing of GAN-generated images [Footnote 11: See here for an illustration https://blogs.nvidia.com/blog/2019/03/18/gauganphotorealisticlandscapesnvidiaresearch/ ].
3.5 A Universal Fake vs. Real Detector
Wang et al. wang2020cnn ask whether it is possible to create a “universal” detector to distinguish real images from synthetic ones using a dataset of synthetic images generated by 11 CNN-based generative models. See Fig. 17. With careful pre- and post-processing and data augmentation, they show that an image classifier trained on only one specific CNN generator is able to generalize well to unseen architectures, datasets, and training methods. They highlight that today’s CNN-generated images share common systematic flaws, preventing them from achieving realistic image generation. Similar works have been reported in chai2020makes ; yu2019attributing .
4 Discussion
4.1 GAN Benchmarks and Analysis Studies
Following previous works (e.g. lucic2017gans ; shmelkov2018good ), new studies have investigated formulating good criteria for GAN evaluation, or have conducted systematic GAN benchmarks. Gulrajani et al. gulrajani2020towards argue that a good evaluation measure should not have a trivial solution (e.g. memorizing the dataset) and show that many scores such as IS and FID can be won by simply memorizing the training data (Fig. 18). They suggest that a necessary condition for a metric not to behave this way is to have a large number of samples. They also propose a measure based on neural network divergences (NND). NND works by training a discriminative model to discriminate between samples of the generative model and samples from a held-out test set. The poorer the discriminative model performs, the better the generative model is. Through experimental validation, they show that NND can effectively measure diversity, sample quality, and generalization. In lee2020mimicry , the authors introduce Mimicry [Footnote 12: https://github.com/kwotsin/mimicry ], a lightweight PyTorch library that provides implementations of popular GANs and evaluation metrics to closely reproduce reported scores in the literature. They also compare several GANs on seven widely-used datasets by training them under the same conditions, and evaluating them using three popular GAN metrics. Some tutorials and interactive tools have also been developed to understand, visualize, and evaluate GANs [Footnote 13: See for example https://poloclub.github.io/ganlab/ and https://www.google.com/search?q=grauating+in+gans&oq=grauating+in+gans&aqs=chrome..69i57.2327j0j1&sourceid=chrome&ie=UTF8 ].
4.2 Assessing Fairness and Bias of Generative Models
Fairness and bias in ML algorithms, datasets, and commercial products have become growing concerns recently (e.g. buolamwini2018gender ) and have attracted widespread attention from the public and the media [Footnote 14: https://thegradient.pub/pulselessons/ ]. Even without a single precise definition of fairness verma2018fairness , it is still possible to observe the lack of fairness across many domains and models, with GANs being no exception. Some recent works (e.g. xu2018fairgan ; yu2020inclusive ) have tried to mitigate the bias in GANs or use GANs to reduce bias in other ML algorithms (e.g. sattigeri2019fairness ; mcduff2019characterizing ). Bias can enter a model during training (through data, labeling, architecture), evaluation (such as who created the evaluation measure), as well as deployment (how the model is being distributed and for what purposes). Therefore it is critical to address these aspects when evaluating and comparing generative models.
4.3 Connection to Deepfakes
An alarming application of generative models is fabricating fake content [Footnote 15: Some concerns can be found at https://www.skynettoday.com/overviews/stateofdeepfakes2020, https://www.cnn.com/interactive/2020/10/us/manipulatedmediatechfakenewstrnd/, https://www.cnn.com/2019/02/28/tech/aifakefaces/index.html, and https://www.theguardian.com/technology/2020/jan/13/whataredeepfakesandhowcanyouspotthem ]. It is thus crucial to develop tools and techniques to detect and limit their use (see tolosana2020deepfakes for a survey on deepfakes). There is a natural connection between deepfake detection and GAN evaluation. As generative models improve, it becomes increasingly harder for a human to distinguish between what is real and what is fake (manipulated images, videos, text, and audio). Telling the degree of fakeness of an image or video directly tells us about the performance of a generator. Even over faces, where generative models excel, it is still possible to detect fake images (Fig. 19), although in some cases it can be very daunting (Fig. 20). The difficulty of deepfake detection for humans is also category dependent, with some categories such as faces, cats, and dogs being easier than bedrooms or cluttered scenes (Fig. 21). Fortunately, or unfortunately, it is still possible to build deep networks that can detect the subtle artifacts in doctored images (e.g. using the universal detectors mentioned above wang2020cnn ; chai2020makes ). Moving forward, it is important to study whether and how GAN evaluation measures can help us mitigate the threat from deepfakes.
5 Summary and Conclusion
Here, I reviewed a number of recently proposed GAN evaluation measures. Similar to generative models, evaluation measures are also evolving and improving over time. Although some measures such as IS, FID, P&R, and PPL are relatively more popular than others, objective and comprehensive evaluation of generative models is still an open problem. Some directions for future research in GAN evaluation are as follows.

Prior research has focused on examining generative models of faces (or scenes containing one or a few objects), and relatively less effort has been devoted to assessing how good GANs are at generating more complex scenes such as bedrooms, street scenes, and nature scenes. casanova2020generating is an early effort in this direction.

Generative models have been primarily evaluated in terms of the quality and diversity of their generated images. Other dimensions such as generalization and fairness have been less explored. Generalization assessment can provide a deeper look into what generative models learn. For example, it can tell how and to what degree models capture compositionality (e.g. Does a model generate the right number of paws, nose, eyes, etc for dogs?) and logic (Does a model properly capture physical properties such as gravity, light direction, reflection, and shadow?). Evaluating models in terms of fairness is critical for mitigating the potential risks that may arise at the deployment time and to ensure that a model has the right societal impact.

An important matter in GAN evaluation is task dependency. In other words, how well a generative model works depends on its intended use. In some tasks (e.g. graphics applications such as image synthesis, image translation, image inpainting, and attribute manipulation) image quality is more important, whereas in some other tasks (e.g. generating synthetic data for data augmentation) diversity may weigh more. Thus, evaluation metrics should be tailored to the target task.
Having good evaluation measures is important not only for ranking models but also for diagnosing their errors. Reliable GAN evaluation is in particular important in domains where humans are less attuned to discern the quality of samples (e.g. medical images). An important question thus is whether current evaluation measures generalize across different domains (i.e. are domainagnostic).

Ultimately, future works should also study how research in evaluating performance of generative models can help mitigate the threat from fabricated content.
References

(1)
A. Borji, Pros and cons of gan evaluation measures, Computer Vision and Image Understanding 179 (2019) 41–65.
 (2) Z. Wang, Q. She, T. E. Ward, Generative adversarial networks in computer vision: A survey and taxonomy, arXiv preprint arXiv:1906.01529.
 (3) M.Y. Liu, X. Huang, J. Yu, T.C. Wang, A. Mallya, Generative adversarial networks for image and video synthesis: Algorithms and applications, Proceedings of the IEEE.
 (4) S. BondTaylor, A. Leach, Y. Long, C. G. Willcocks, Deep generative modelling: A comparative review of vaes, gans, normalizing flows, energybased and autoregressive models, arXiv preprint arXiv:2103.04922.
 (5) I. J. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, arXiv preprint arXiv:1406.2661.
 (6) A. v. d. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, K. Kavukcuoglu, Conditional image generation with pixelcnn decoders, arXiv preprint arXiv:1606.05328.
 (7) D. P. Kingma, M. Welling, Autoencoding variational bayes, arXiv preprint arXiv:1312.6114.
 (8) D. A. Hudson, C. L. Zitnick, Generative adversarial transformers, arXiv preprint arXiv:2103.01209.
 (9) Y. Jiang, S. Chang, Z. Wang, Transgan: Two transformers can make one strong gan, arXiv preprint arXiv:2102.07074.
 (10) A. Brock, J. Donahue, K. Simonyan, Large scale gan training for high fidelity natural image synthesis, arXiv preprint arXiv:1809.11096.

(11)
T. Karras, S. Laine, T. Aila, A stylebased generator architecture for generative adversarial networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 4401–4410.
 (12) A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, I. Sutskever, Zeroshot texttoimage generation, arXiv preprint arXiv:2102.12092.
 (13) A. Razavi, A. v. d. Oord, O. Vinyals, Generating diverse highfidelity images with vqvae2, arXiv preprint arXiv:1906.00446.
 (14) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training gans, arXiv preprint arXiv:1606.03498.
 (15) M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, Gans trained by a two timescale update rule converge to a local nash equilibrium, arXiv preprint arXiv:1706.08500.
 (16) S. Barratt, R. Sharma, A note on the inception score, arXiv preprint arXiv:1801.01973.
 (17) M. J. Chong, D. Forsyth, Effectively unbiased fid and inception score and where to find them, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6070–6079.
 (18) A. Mathiasen, F. Hvilshøj, Fast Fréchet inception distance, arXiv preprint arXiv:2009.14075.
 (19) T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, S. Gelly, Fvd: A new metric for video generation.
 (20) J. Carreira, A. Zisserman, Quo vadis, action recognition? a new model and the kinetics dataset, in: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
 (21) S. Tulyakov, M.-Y. Liu, X. Yang, J. Kautz, Mocogan: Decomposing motion and content for video generation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1526–1535.
 (22) N. D. Narvekar, L. J. Karam, A no-reference perceptual image sharpness metric based on a cumulative probability of blur detection, in: 2009 International Workshop on Quality of Multimedia Experience, IEEE, 2009, pp. 87–91.
 (23) K. De, V. Masilamani, Image sharpness measure for blurred images in frequency domain, Procedia Engineering 64 (2013) 149–158.
 (24) C. Yang, Z. Wang, X. Zhu, C. Huang, J. Shi, D. Lin, Pose guided human video generation, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 201–216.
 (25) L. Chen, Z. Li, R. K. Maddox, Z. Duan, C. Xu, Lip movements generation at a glance, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 520–535.
 (26) S. Oprea, P. Martinez-Gonzalez, A. Garcia-Garcia, J. A. Castro-Vargas, S. Orts-Escolano, J. Garcia-Rodriguez, A. Argyros, A review on deep learning techniques for video prediction, IEEE Transactions on Pattern Analysis and Machine Intelligence.
 (27) S. Barua, X. Ma, S. M. Erfani, M. E. Houle, J. Bailey, Quality evaluation of gans using cross local intrinsic dimensionality, arXiv preprint arXiv:1905.00643.
 (28) A. Tsitsulin, M. Munkhoeva, D. Mottin, P. Karras, A. Bronstein, I. Oseledets, E. Müller, The shape of data: Intrinsic distance for data distributions, arXiv preprint arXiv:1905.11141.
 (29) M. Bińkowski, D. J. Sutherland, M. Arbel, A. Gretton, Demystifying mmd gans, arXiv preprint arXiv:1801.01401.

 (30) V. Khrulkov, I. Oseledets, Geometry score: A method for comparing generative adversarial networks, in: International Conference on Machine Learning, PMLR, 2018, pp. 2621–2629.
 (31) K. Shoemake, Animating rotation with quaternion curves, in: Proceedings of the 12th annual conference on Computer graphics and interactive techniques, 1985, pp. 245–254.

 (32) R. Zhang, P. Isola, A. A. Efros, E. Shechtman, O. Wang, The unreasonable effectiveness of deep features as a perceptual metric, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.
 (33) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
 (34) T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, T. Aila, Analyzing and improving the image quality of stylegan, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8110–8119.
 (35) F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, J. Xiao, Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop, arXiv preprint arXiv:1506.03365.
 (36) S. Ravuri, O. Vinyals, Classification accuracy score for conditional generative models, in: Advances in Neural Information Processing Systems, 2019, pp. 12268–12279.
 (37) D. Lopez-Paz, M. Oquab, Revisiting classifier two-sample tests, arXiv preprint arXiv:1610.06545.
 (38) C. Meehan, K. Chaudhuri, S. Dasgupta, A non-parametric test to detect data-copying in generative models, arXiv preprint arXiv:2004.05675.
 (39) S. Zhao, H. Ren, A. Yuan, J. Song, N. Goodman, S. Ermon, Bias and generalization in deep generative models: An empirical study, in: Advances in Neural Information Processing Systems, 2018, pp. 10792–10801.
 (40) J. Xuan, Y. Yang, Z. Yang, D. He, L. Wang, On the anomalous generalization of gans, arXiv preprint arXiv:1909.12638.
 (41) S. O’Brien, M. Groh, A. Dubey, Evaluating generative adversarial networks on explicitly parameterized distributions, arXiv preprint arXiv:1812.10782.
 (42) S. van Steenkiste, K. Kurach, J. Schmidhuber, S. Gelly, Investigating object compositionality in generative adversarial networks, Neural Networks 130 (2020) 309–325.
 (43) M. S. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, S. Gelly, Assessing generative models via precision and recall, in: Advances in Neural Information Processing Systems, 2018, pp. 5228–5237.
 (44) T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen, T. Aila, Improved precision and recall metric for assessing generative models, arXiv preprint arXiv:1904.06991.
 (45) M. F. Naeem, S. J. Oh, Y. Uh, Y. Choi, J. Yoo, Reliable fidelity and diversity metrics for generative models, arXiv preprint arXiv:2002.09797.
 (46) A. M. Alaa, B. van Breugel, E. Saveliev, M. van der Schaar, How faithful is your synthetic data? sample-level metrics for evaluating and auditing generative models, arXiv preprint arXiv:2102.08921.

 (47) J. Djolonga, M. Lucic, M. Cuturi, O. Bachem, O. Bousquet, S. Gelly, Precision-recall curves using information divergence frontiers, in: International Conference on Artificial Intelligence and Statistics, 2020, pp. 2550–2559.
 (48) L. Simon, R. Webster, J. Rabin, Revisiting precision and recall definition for generative model evaluation, arXiv preprint arXiv:1905.05441.
 (49) Z. Wang, G. Healy, A. F. Smeaton, T. E. Ward, Use of neural signals to evaluate the quality of generative adversarial network performance in facial image generation, Cognitive Computation 12 (1) (2020) 13–24.
 (50) P. Grnarova, K. Y. Levy, A. Lucchi, N. Perraudin, I. Goodfellow, T. Hofmann, A. Krause, A domain agnostic measure for monitoring and evaluating gans, in: Advances in Neural Information Processing Systems, 2019, pp. 12092–12102.
 (51) S. Zhou, M. Gordon, R. Krishna, A. Narcomey, L. F. Fei-Fei, M. Bernstein, Hype: A benchmark for human eye perceptual evaluation of generative models, in: Advances in Neural Information Processing Systems, 2019, pp. 3449–3461.
 (52) Y. A. Kolchinski, S. Zhou, S. Zhao, M. Gordon, S. Ermon, Approximating human judgment of generated image quality, arXiv preprint arXiv:1912.12121.
 (53) D. Bau, J.-Y. Zhu, J. Wulff, W. Peebles, H. Strobelt, B. Zhou, A. Torralba, Seeing what a gan cannot generate, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4502–4511.
 (54) A. Jahanian, L. Chai, P. Isola, On the "steerability" of generative adversarial networks, arXiv preprint arXiv:1907.07171.
 (55) D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, A. Torralba, Gan dissection: Visualizing and understanding generative adversarial networks, arXiv preprint arXiv:1811.10597.
 (56) T. Park, M.-Y. Liu, T.-C. Wang, J.-Y. Zhu, Semantic image synthesis with spatially-adaptive normalization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2337–2346.
 (57) S.-Y. Wang, O. Wang, R. Zhang, A. Owens, A. A. Efros, Cnn-generated images are surprisingly easy to spot… for now, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 7, 2020.
 (58) L. Chai, D. Bau, S.-N. Lim, P. Isola, What makes fake images detectable? understanding properties that generalize, in: European Conference on Computer Vision, Springer, 2020, pp. 103–120.
 (59) N. Yu, L. S. Davis, M. Fritz, Attributing fake images to gans: Learning and analyzing gan fingerprints, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7556–7566.
 (60) M. Lucic, K. Kurach, M. Michalski, S. Gelly, O. Bousquet, Are gans created equal? a large-scale study, arXiv preprint arXiv:1711.10337.
 (61) K. Shmelkov, C. Schmid, K. Alahari, How good is my gan?, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 213–229.
 (62) I. Gulrajani, C. Raffel, L. Metz, Towards gan benchmarks which require generalization, arXiv preprint arXiv:2001.03653.
 (63) K. S. Lee, C. Town, Mimicry: Towards the reproducibility of gan research, arXiv preprint arXiv:2005.02494.
 (64) J. Buolamwini, T. Gebru, Gender shades: Intersectional accuracy disparities in commercial gender classification, in: Conference on fairness, accountability and transparency, PMLR, 2018, pp. 77–91.
 (65) S. Verma, J. Rubin, Fairness definitions explained, in: 2018 ieee/acm international workshop on software fairness (fairware), IEEE, 2018, pp. 1–7.
 (66) D. Xu, S. Yuan, L. Zhang, X. Wu, Fairgan: Fairness-aware generative adversarial networks, in: 2018 IEEE International Conference on Big Data (Big Data), IEEE, 2018, pp. 570–575.
 (67) N. Yu, K. Li, P. Zhou, J. Malik, L. Davis, M. Fritz, Inclusive gan: Improving data and minority coverage in generative models, in: European Conference on Computer Vision, Springer, 2020, pp. 377–393.
 (68) P. Sattigeri, S. C. Hoffman, V. Chenthamarakshan, K. R. Varshney, Fairness gan: Generating datasets with fairness properties using a generative adversarial network, IBM Journal of Research and Development 63 (4/5) (2019) 3–1.
 (69) D. McDuff, S. Ma, Y. Song, A. Kapoor, Characterizing bias in classifiers using generative models, arXiv preprint arXiv:1906.11891.
 (70) R. Tolosana, R. Vera-Rodriguez, J. Fierrez, A. Morales, J. Ortega-Garcia, Deepfakes and beyond: A survey of face manipulation and fake detection, Information Fusion 64 (2020) 131–148.
 (71) A. Casanova, M. Drozdzal, A. Romero-Soriano, Generating unseen complex scenes: are we there yet?, arXiv preprint arXiv:2012.04027.