1 Introduction
Evaluating generative models of high-dimensional data remains an open problem. Despite a number of subtleties in generative model assessment [1], researchers looking to make practical progress in generative models of images, and particularly those who have focused on Generative Adversarial Networks [2], have identified desirable properties such as "sample quality" and "diversity" and proposed automatic metrics to measure these desiderata. As a result, recent years have witnessed a rapid improvement in the quality of deep generative models. While ultimately the utility of these models lies in their performance on downstream tasks, the focus on these metrics has led to models whose samplers now generate nearly photorealistic images [3, 4, 5]. For one model in particular, BigGAN-deep [3], results on standard GAN metrics such as Inception Score (IS) [6] and Frechet Inception Distance (FID) [7] approach those of the data distribution. The results on FID, which purports to be the Wasserstein-2 metric in a perceptual feature space, in particular suggest that BigGANs are capturing the data distribution.
A similar, though less heralded, improvement has occurred for models whose objectives are (bounds on) likelihood, with the result that many of these models now also produce photorealistic samples. Examples include Subscale Pixel Networks [8], unconditional autoregressive models of 128×128 ImageNet that achieve state-of-the-art test set log-likelihoods; Hierarchical Autoregressive Models (HAMs) [9], class-conditional autoregressive models of 128×128 and 256×256 ImageNet; and the recently introduced high-resolution VQ-VAE [10], a variational autoencoder that uses vector quantization and an autoregressive prior to produce high-quality samples. Notably, these models measure diversity using test set likelihood and assess sample quality through visual inspection, eschewing the metrics typically used in GAN research.
As these models increasingly seem "to learn the distribution" according to these metrics, it is natural to consider their use in downstream tasks. Such a view certainly has a precedent: improved test set likelihoods in language models, which are unconditional models of text, also improve performance in tasks such as speech recognition [11]. While a generative model need not learn the data distribution to perform well on a downstream task, poor performance on such tasks allows us to diagnose specific problems with both our generative models and the task-agnostic metrics we use to evaluate them. To that end, we introduce a general framework in which we use conditional generative models to perform approximate inference and measure the quality of that inference. The idea is simple: for any generative model of the form $p(x, y) = p(y)\,p_\theta(x \mid y)$, we learn an inference network using only samples from the conditional generative model $p_\theta(x \mid y)$, and measure the performance of the inference network on a downstream task. We then compare that performance to the performance of an inference network trained on real data.
We apply this framework to conditional image models, where $y$ is the image label, $x$ is the image, and the task is image classification. The performance measures we use are Top-1 and Top-5 accuracy, which we denote the Classification Accuracy Score (CAS); the gap in performance between networks trained on real and synthetic data allows us to understand specific deficiencies in the generative model. Although a simple metric, CAS uncovers some surprising results:

When using a state-of-the-art GAN (BigGAN-deep) and an off-the-shelf ResNet-50 classifier as the inference network, Top-5 and Top-1 accuracy decreased by 27.9% and 41.6%, respectively, compared to training on real data.

Conditional generative models based on likelihood, such as the high-resolution VQ-VAE and Hierarchical Autoregressive Models, perform relatively well compared to BigGAN-deep, despite achieving relatively poor Inception Scores and Frechet Inception Distances. Since these models produce visually appealing samples, this result suggests that FID and IS are poor measures for non-GAN models.

Classification Accuracy Score (CAS) automatically surfaces particular classes for which BigGAN-deep and VQ-VAE fail to capture the data distribution, and which were previously unknown in the literature. Figure 1 shows four such classes for BigGAN-deep.

We find that neither Inception Score, nor Frechet Inception Distance, nor combinations thereof are predictive of CAS. These results suggest that as generative models begin to be deployed in downstream tasks, we should create metrics that better measure task performance.

We introduce the Naive Augmentation Score (NAS), a variant of CAS in which the image classifier is trained on both real and synthetic images, to demonstrate that classification performance improves only in limited circumstances. Augmenting the ImageNet training set with low-diversity BigGAN-deep images improves Top-5 accuracy by 0.2%, while augmenting the dataset with any other synthetic images degrades classification performance.
In Section 2 we provide a few definitions, desiderata for metrics, and shortcomings of the most popular metrics in relation to the different research directions in generative modeling. Section 3 introduces CAS. Finally, Section 4 provides a large-scale study of current state-of-the-art generative models using FID, IS, and CAS on both the ImageNet and CIFAR-10 datasets.
2 Metrics for Generative Models
Much of the difficulty in evaluation comes from not knowing the task for which the model will be used. Understanding how it is deployed, however, has important implications for its desired properties. For example, consider the seemingly similar tasks of automatic speech recognition and speech synthesis. While both may share the same generative model of speech, such as a hidden Markov Model with the latent and observed variables being a word sequence $y$ and a waveform $x$, respectively, the implications of model misspecification are vastly different. In speech recognition, the model should be able to infer words for all possible speech waveforms, even if the waveforms themselves are degraded, while in speech synthesis, the model should produce the most realistic-sounding samples, even if it cannot produce all possible speech waveforms. In particular, for automatic speech recognition we care about $p(y \mid x)$, while for speech synthesis we care about $p(x \mid y)$.
In the absence of a downstream task, we assess to what extent the model distribution $p_\theta$ matches the data distribution $p$, a less specific, and often more difficult, goal. Two consequences of the trivial observation that $p_\theta = p$ are: 1) each sample "comes" from the data distribution (i.e., it is a "plausible" sample from the data distribution), and 2) all possible examples from the data distribution are represented by the model. Different metrics that evaluate the degree of model mismatch weigh these criteria differently.
Furthermore, we expect our metrics to be relatively fast to calculate. This last desideratum often depends on the model class, of which there are five broad categories:

(Inexact) likelihood models using variational inference (e.g., VAEs [12, 13])

Likelihood models using fully observed inputs (e.g., PixelCNN [14])

Likelihood models based on bijections (e.g., Glow [15], real NVP [16])

(Possibly inexact) likelihood models that are energy-based (e.g., RBMs [17])
Implicit generative models (e.g., GANs)
For the first four of these classes, the likelihood objective provides scaled estimates of the KL divergence between the data and the model. Furthermore, test set likelihood is also an implicit measure of diversity. The likelihood, however, is a fairly poor measure of sample quality [1] and often scores out-of-domain data more highly than in-domain data [18].
For implicit models, the objective provides neither an accurate estimate of a statistical divergence or distance, nor a natural evaluation metric. The lack of any such metric likely forced researchers to propose heuristics that measure versions of both 1) and 2) simultaneously. Inception Score [6] measures 1) by how confidently a classifier assigns an image to a particular class (low entropy of $p(y \mid x)$), and 2) by penalizing the model if too many images are classified to the same class (i.e., if the marginal $p(y)$ has low entropy). More principled versions of this procedure are Frechet Inception Distance [7] and Kernel Inception Distance [19], which both use variants of two-sample tests in a learned "perceptual" feature space, the Inception pool3 space, to assess distribution matching. Even though this space was an ad-hoc choice, recent work [20] suggests that deep features correlate with human perception of similarity. Even more recent work [21, 22] calculates 1) and 2) (sample quality and diversity) independently by computing precision and recall.
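For concreteness, the following is a minimal sketch (not a reference implementation) of the Inception Score computed from a matrix of classifier probabilities $p(y \mid x)$; standard implementations use Inception-v3 softmax outputs and additionally split the samples into groups to report a mean and standard deviation.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from predicted class probabilities p(y|x).

    probs: array of shape (num_samples, num_classes).
    IS = exp( E_x[ KL( p(y|x) || p(y) ) ] ), where p(y) is the marginal
    obtained by averaging p(y|x) over all samples.
    """
    marginal = probs.mean(axis=0, keepdims=True)                 # p(y)
    kl = probs * (np.log(probs + eps) - np.log(marginal + eps))  # elementwise KL terms
    return float(np.exp(kl.sum(axis=1).mean()))

# Toy usage: random probabilities stand in for Inception-v3 softmax outputs.
rng = np.random.default_rng(0)
print(inception_score(rng.dirichlet(np.ones(1000), size=5000)))
```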
Reliance on Inception Score and Frechet Inception Distance in particular has led to improvements in GAN models, but these metrics have certain deficiencies. Inception Score does not penalize a lack of intra-class diversity, and certain out-of-distribution samples can produce Inception Scores three times higher than that of the data [23]. Frechet Inception Distance, on the other hand, suffers from a high degree of bias [19]. Moreover, the pool3 feature layer may not even correlate well with human judgment of sample quality [24]. In this work, we also find that non-GAN models have rather poor Inception Scores and Frechet Inception Distances, even though their samples are visually appealing.
Rather than creating ad-hoc heuristics aimed at broadly measuring sample quality and diversity, we instead evaluate generative models by assessing their performance on a downstream task. This is akin to evaluating a generative model of speech on automatic speech recognition. Since the models considered here are either implicit or do not admit exact likelihoods, exact inference is difficult. To circumvent this issue, we train an inference network on samples from the model. If the generative model is indeed capturing the data distribution, then we could replace the original distribution with a model-generated one, perform any downstream task, and obtain the same result. In this work, we study perhaps the simplest downstream task: image classification.
2.1 Other Related Work
The metrics mentioned above are by no means the only ones, and researchers have proposed methods to evaluate other properties of generative models. [25] construct approximate manifolds from data and samples, and apply the method to GAN samples to determine whether mode collapse has occurred. [26] attempt to determine the support size of GANs using a Birthday Paradox test, though the procedure requires a human to identify two nearly identical samples. Maximum Mean Discrepancy [27] is a two-sample test with many nice theoretical properties, but it seems to be less used because the choice of kernel does not necessarily coincide with human evaluation. Finally, procedurally similar to our proposal, [28] propose a "reverse LM score", which trains a language model on GAN-generated data and tests it on a real held-out set.
3 Classification Accuracy Score
At the heart of the proposed score lies a very simple idea: if the model captures the data distribution, performance on any downstream task should be similar whether we use the original data or the model data. To make this intuition more precise, suppose that data comes from a distribution $p(x, y)$, the task is to infer $y$ from $x$, and we suffer a loss $L(\hat{y}, y)$ for predicting $\hat{y}$ when the true label is $y$. The risk associated with a classifier $\hat{y}(x)$ is:

$$\mathcal{R} = \mathbb{E}_{p(x, y)}\big[L(\hat{y}(x), y)\big] = \mathbb{E}_{p(x)}\Big[\mathbb{E}_{p(y \mid x)}\big[L(\hat{y}(x), y)\big]\Big] \qquad (1)$$

As we only have samples from $p(x, y)$, we measure the empirical risk $\hat{\mathcal{R}}$. From the right-hand side of Equation 1, of the set of possible predictions, the optimal one minimizes the expected posterior loss:

$$\hat{y}^{*}(x) = \operatorname*{arg\,min}_{\hat{y}}\; \mathbb{E}_{p(y \mid x)}\big[L(\hat{y}, y)\big] \qquad (2)$$
Assuming we know the label distribution $p(y)$, a generative modeling approach to this problem is to model the conditional distribution $p_\theta(x \mid y)$ and to infer labels using Bayes rule: $p_\theta(y \mid x) \propto p_\theta(x \mid y)\,p(y)$. If $p_\theta(x \mid y) = p(x \mid y)$, then we can make predictions that minimize the risk for any loss function. If the risk is not minimized, however, then we can conclude that the distributions are not matched, and we can interrogate $p_\theta(x \mid y)$ to better understand how our generative models failed.

For most modern deep generative models, however, we have access to neither $p_\theta(x \mid y)$, the model probability of the data given the label, nor $p_\theta(y \mid x)$, the model conditional distribution over labels, nor $p(y \mid x)$, the true conditional distribution. Instead, from samples $(x, y)$ with $y \sim p(y)$ and $x \sim p_\theta(x \mid y)$, we train a discriminative model to learn $\hat{p}_\theta(y \mid x)$, and use it to estimate the expected posterior loss. We define the generative risk as:

$$\mathcal{R}_\theta = \mathbb{E}_{p(x)}\Big[\mathbb{E}_{p(y \mid x)}\big[L(\hat{y}_\theta(x), y)\big]\Big] \qquad (3)$$

where $\hat{y}_\theta$ is the classifier that minimizes the expected posterior loss under $\hat{p}_\theta(y \mid x)$. We then compare the performance of this classifier to that of a classifier trained on samples from $p(x, y)$.
In the case of conditional generative models of images, $y$ is the class label for image $x$, and the model of $\hat{p}_\theta(y \mid x)$ is an image classifier. We use ResNets [29] in this work.
The loss functions we explore are the standard ones for image classification. One is the 0-1 loss, which yields Top-1 accuracy, and the other is the 0-1 loss applied to the top five predictions, which yields Top-5 accuracy.^{1}
^{1} It is more correct to state that these losses yield errors, but we present results as accuracies, as is standard in the computer vision literature.
Procedurally, we train a classifier on synthetic data and evaluate its performance on real data. We call the resulting accuracy the Classification Accuracy Score (CAS).
Note that a CAS close to that of the real data does not imply that the generative model accurately modeled the data distribution. This may happen for a few reasons. First, CAS matches that of the data for any generative model whose implied conditional satisfies $p_\theta(y \mid x) = p(y \mid x)$ for all $x$ in the support of the data distribution. An example is a generative model that samples from the true distribution with probability $1 - \epsilon$ and from a noise distribution whose support is disjoint from the true distribution with probability $\epsilon$. In this case, our inference model is good, but the underlying generative model is poor.
Second, since the losses considered here are not proper scoring rules [30], one can obtain a reasonable CAS from suboptimal inference networks. For example, suppose that $p(y \mid x) = 1$ for the correct class, while $\hat{p}_\theta(y \mid x)$ is only slightly above 0.5 for the correct class due to poor synthetic data. The CAS for both is 100%. Using a proper scoring rule, such as the Brier score, eliminates this issue, but experimentally we found limited practical benefit from using one.
Finally, a generative model that memorizes the training set will achieve the same CAS as the original data.^{2} In general, however, we hope that generative models produce samples disjoint from the set on which they are trained. If the samples are sufficiently different, we can train a classifier on both the original data and the model data and expect improved accuracy. We denote the performance of classifiers trained on this "naive augmentation" the Naive Augmentation Score (NAS). Our CAS results, however, indicate that current models still significantly underfit, rendering these conclusions less compelling. For completeness, we include those results in Section 4.4.
^{2} N.B. Inception Score and Frechet Inception Distance suffer from the same failure mode.
Despite these theoretical issues, we find that generative models have Classification Accuracy Scores lower than that of the original data, indicating that they fail to capture the data distribution.
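As a concrete summary of the procedure, the sketch below assumes two hypothetical helpers, sample_fn (the conditional sampler) and train_classifier (the classifier training routine); it is an illustration of the metric, not the released implementation.

```python
import numpy as np

def classification_accuracy_score(sample_fn, train_labels,
                                  val_images, val_labels,
                                  train_classifier, top_k=(1, 5)):
    """Sketch of the CAS procedure.

    sample_fn:        hypothetical sampler drawing one image per requested
                      class label from the conditional model p_theta(x | y).
    train_labels:     labels of the original training set; replacing each real
                      example with a model sample of the same class preserves
                      the label distribution p(y).
    train_classifier: hypothetical routine that trains an image classifier
                      (e.g., a ResNet-50) and returns a function mapping
                      images to class scores.
    Returns Top-k accuracies measured on the *real* validation set.
    """
    synthetic_images = sample_fn(train_labels)        # one model sample per real label
    classifier = train_classifier(synthetic_images, train_labels)
    scores = classifier(val_images)                   # shape (N, num_classes)
    ranked = np.argsort(-scores, axis=1)              # classes sorted by score
    return {f"top{k}": float((ranked[:, :k] == val_labels[:, None]).any(axis=1).mean())
            for k in top_k}
```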
3.1 Computation and Open-Sourcing the Metric
Computationally, training classifiers is significantly more demanding than calculating Frechet Inception Distance or Inception Score over 50,000 samples. The time is ripe, however, for such a metric due to a few key advances in training classifiers: 1) training ImageNet classifiers has been reduced to minutes [31]; 2) with cloud services, the variance due to implementation details of such a metric is largely mitigated; and 3) thanks to cloud services, the price and time cost are reasonable and will only improve in the coming years.
Moreover, many class-conditional generative models are computationally expensive to train, and as a result, even a relatively expensive metric such as CAS comprises a small percentage of the training budget.
We open-source our metric on Google Cloud for others to use. The instructions are given in Appendix B. At the time of writing, one can compute the metric in 10 hours for roughly $15, or in 45 minutes for roughly $85, using TPUs. Moreover, depending on affiliation, one may be able to access TPUs for free through the TensorFlow Research Cloud (TFRC) (https://www.tensorflow.org/tfrc/).

4 Experiments
Our experiments are simple. On ImageNet, we use three generative models: BigGAN-deep at 256×256 and 128×128 resolutions, Hierarchical Autoregressive Models (HAM) with a masked self-prediction auxiliary decoder at 128×128 resolution [9], and the high-resolution Vector-Quantized Variational Autoencoder (high-res VQ-VAE) at 256×256 resolution [10]. For each, we replace the ImageNet training set with a model-generated one, train an image classifier, and evaluate its performance on the ImageNet validation set. To calculate CAS, each example from the original training set is replaced with a model sample from the same class.
In addition, we compare CAS to two traditional GAN metrics, Inception Score and Frechet Inception Distance (FID), as these metrics are the current gold standard for GAN comparison and have been used to compare non-GAN models to GAN models. Both rely on a feature space from a classifier trained on ImageNet, suggesting that if these metrics are useful for predicting performance on a downstream task, it would be this one.
Further details about the experiment can be found in Appendix A.1.
4.1 Model Comparison on ImageNet
Table 1: Classification Accuracy Scores (Top-5 and Top-1 accuracy on the real validation set), Inception Score, and FID-50K for classifiers trained on real and model-generated ImageNet datasets at 128×128 and 256×256 resolution.

Training Set | Resolution | Top-5 Accuracy | Top-1 Accuracy | Inception Score | FID-50K
Real | 128×128 | 88.79% | 68.82% | – | 1.61
BigGAN-deep | 128×128 | 64.44% | 40.64% | – | 4.22
Hierarchical Autoregressive | 128×128 | 77.33% | 54.05% | 17.02 ± 0.79 | 46.05
Real | 256×256 | 91.47% | 73.09% | – | 2.47
BigGAN-deep | 256×256 | 65.92% | 42.65% | – | 11.78
High-Res VQ-VAE | 256×256 | 77.59% | 54.83% | – | 38.05
Table 1 shows the performance of classifiers trained on model-generated datasets compared to those trained on the real dataset at 256×256 and 128×128 resolution. At 256×256 resolution, BigGAN-deep achieves a CAS Top-5 accuracy of 65.92%, suggesting that BigGANs are learning nontrivial distributions. Perhaps surprisingly, the high-resolution VQ-VAE, though performing quite poorly compared to BigGAN-deep on both Frechet Inception Distance and Inception Score, obtains a CAS Top-5 accuracy of 77.59%. Both models, however, lag the original 256×256 dataset, which achieves a CAS Top-5 accuracy of 91.47%.
We find nearly identical results for the 128×128 models. BigGAN-deep achieves CAS Top-5 and Top-1 accuracies similar to the 256×256 model (note that the Inception Score and Frechet Inception Distance results for 128×128 and 256×256 BigGAN-deep are vastly different). Hierarchical Autoregressive Models, similar to the high-resolution VQ-VAE, perform poorly on FID and Inception Score, but outperform BigGAN-deep on CAS. Moreover, both models underperform relative to the original 128×128 dataset.
4.2 Uncovering Model Deficiencies
To better understand what accounts for the gap between generative model and dataset CAS, we broke down the performance by class (Figure 2). As shown in the left pane, nearly every class suffers a drop in performance for BigGAN-deep compared to the original dataset, though six classes (partridge, red fox, jaguar/panther, squirrel monkey, African elephant, and strawberry) show marginal improvement over the original dataset. The left pane of Figure 3 shows the two best and two worst performing categories, as measured by the difference in classification performance. Notably, for the two worst performing categories and two others (balloon, paddlewheel, pencil sharpener, and spatula), classification accuracy was 0% on the validation set.
The per-class breakdown of the high-resolution VQ-VAE (middle pane of Figure 2) shows that this model also underperforms the real data in most classes (31 classes performed better than with the original data), though the gap is not as large as for BigGAN-deep. Furthermore, the high-resolution VQ-VAE has better generalization performance than BigGAN-deep in 87.6% of classes, and suffers 0% classification accuracy in no class. The middle pane of Figure 3 shows its two best and two worst performing categories.
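The per-class analysis above amounts to comparing per-class validation accuracy of the classifier trained on real data against the classifier trained on model samples; a minimal sketch follows (predictions are assumed to be Top-1 predicted class indices on the validation set).

```python
import numpy as np

def per_class_accuracy(predictions, labels, num_classes=1000):
    """Top-1 accuracy computed separately for each class on the validation set."""
    acc = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            acc[c] = (predictions[mask] == c).mean()
    return acc

def most_degraded_classes(real_preds, synthetic_preds, labels, n=4):
    """Classes whose accuracy drops most when the classifier is trained on
    model samples instead of real data (the kind of classes shown in Figure 3)."""
    gap = per_class_accuracy(real_preds, labels) - per_class_accuracy(synthetic_preds, labels)
    return np.argsort(-gap)[:n]
```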
4.3 A Note on FID and a Second Note on IS
Table 2: CAS, Inception Score, and FID-50K for BigGAN-deep at 256×256 resolution across truncation values, along with high-resolution VQ-VAE reconstructions and the real training set.

Training Set | Truncation | Top-5 Accuracy | Top-1 Accuracy | Inception Score | FID-50K
BigGAN-deep | 0.20 | 13.24% | 5.11% | – | 20.75
BigGAN-deep | 0.42 | 28.68% | 13.30% | – | 15.93
BigGAN-deep | 0.50 | 32.88% | 15.66% | – | 14.37
BigGAN-deep | 0.60 | 45.01% | 25.51% | – | 12.41
BigGAN-deep | 0.80 | 56.68% | 32.88% | – | 9.24
BigGAN-deep | 1.00 | 62.97% | 39.07% | – | 7.42
BigGAN-deep | 1.50 | 65.92% | 42.65% | – | 11.78
BigGAN-deep | 2.00 | 64.37% | 40.98% | – | 28.67
High-Res VQ-VAE Recon | – | 89.46% | 69.90% | – | 8.69
Real | – | 91.47% | 73.09% | – | 2.47
We note that Inception Score and FID have very little correlation with CAS, suggesting that alternative metrics are needed if we are to use our models in downstream tasks. As a controlled experiment, we calculated CAS, IS, and FID for BigGAN-deep models with input noise distributions truncated at different values (the "truncation trick"). As noted in [3], lower truncation values seem to improve sample quality at the expense of diversity. The correlation coefficient between Top-1 accuracy and FID is -0.16, and between Top-1 accuracy and Inception Score it is -0.86, the latter result incorrectly suggesting that improved Inception Score is highly correlated with poorer classification performance. Moreover, the best-performing truncation values (1.5 and 2.0) have rather poor Inception Scores and FIDs. That poor IS and FID sometimes coincide with poor performance on this metric is no surprise; that other models, with well-performing Inception Scores and FIDs, yield poor performance on CAS suggests that alternative metrics are needed.
One can easily diagnose the issue with Inception Score: as noted in [23], Inception Score does not account for intra-class diversity, and a training set with little intra-class diversity may cause the classifier to fail to generalize to a more diverse test set. FID should account for this lack of diversity at least grossly, as the metric, calculated as $\lVert \mu_1 - \mu_2 \rVert_2^2 + \mathrm{Tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\big)$, compares the covariance matrices of the data and model distributions. By comparison, CAS offers a finer measure of model performance, as it provides a per-class metric to identify which classes have better or worse performance. While in theory one could calculate a per-class FID, FID is known to suffer from high bias [19] for small numbers of samples, likely making per-class estimates unreliable.^{3}
^{3} [19] proposed Kernel Inception Distance, an unbiased alternative to FID, but this metric suffers from variance too large to be reliable when using the number of per-class samples in the ImageNet training set (roughly 1,000 per class), much less the 50 per class in the validation set.
Perhaps a larger issue is that Inception Score and Frechet Inception Distance heavily penalize non-GAN models, suggesting that these heuristics are not suitable for comparisons across model classes. A particularly egregious failure case is that IS and FID aggressively penalize certain types of samples that look nearly identical to the original dataset. For example, we computed CAS, Inception Score, and FID on the ImageNet training set at 256×256 resolution and on reconstructions from a high-resolution VQ-VAE. As shown in Figure 4, the samples look nearly identical. As noted in Table 2, however, Inception Score decreases by 128 points and Frechet Inception Distance increases roughly 3.5-fold (from 2.47 to 8.69). The drop is so large that BigGAN-deep at 1.00 truncation achieves better Inception Scores and FIDs than the nearly identical reconstructions. CAS Top-5 and Top-1, however, drop by only 2.2% and 4.4% relative to the original dataset. The BigGAN-deep model at 1.00 truncation, on the other hand, drops by 31.1% and 46.5%.
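For reference, a minimal sketch of the FID computation from precomputed pool3 feature matrices; production implementations add extra numerical safeguards, for example adding a small offset to the covariances when the matrix square root is ill-conditioned.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_model):
    """Frechet distance between Gaussians fitted to two sets of Inception pool3
    features, each of shape (num_samples, feature_dim):
        ||mu_r - mu_m||^2 + Tr(S_r + S_m - 2 (S_r S_m)^{1/2})
    """
    mu_r, mu_m = feats_real.mean(axis=0), feats_model.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_m = np.cov(feats_model, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_m, disp=False)
    covmean = covmean.real  # discard small imaginary parts from numerical error
    return float(np.sum((mu_r - mu_m) ** 2)
                 + np.trace(sigma_r + sigma_m - 2.0 * covmean))
```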
4.4 Naive Augmentation Score
To calculate the Naive Augmentation Score (NAS), we add 25%, 50%, or 100% more data from each of our models to the original ImageNet training set. The original ImageNet training set alone achieves a Top-5 accuracy of 92.97%.
Although the BigGAN-deep results on CAS, and to a lesser extent those of the high-resolution VQ-VAE, suggest that augmenting the original training set with model samples will not improve classification performance, we wanted to study whether the relative ordering from the CAS experiments would hold for the NAS ones. Figure 5 illustrates the performance of the classifiers as we increase the amount of synthetic training data. Perhaps somewhat surprisingly, BigGAN-deep models that sample from lower truncation values, and therefore have lower sample diversity, perform better for data augmentation than the models that performed well on CAS. In fact, for some of the lowest truncation values, we found a modest improvement in classification performance: roughly a 0.2% improvement in Top-5 accuracy. Moreover, the high-resolution VQ-VAE underperforms relative to the BigGAN-deep models. Of course, the caveat is that the former model does not yet have a mechanism to trade off sample quality against sample diversity.
The results on augmentation highlight different desiderata for samples that are added to the dataset rather than replacing it. Clearly, the added samples should be sufficiently different from the data to allow the classifier to generalize better, and yet poorer sample quality may lead to poorer generalization compared to using the original dataset alone. This may be why extending the dataset with samples generated from lower-truncation noise, which are higher quality but less diverse, works better on NAS than on CAS. Furthermore, this may also explain why Inception Score, Frechet Inception Distance, and CAS are not predictive of NAS.
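The naive augmentation used for NAS can be summarized as follows; this is a minimal sketch with a hypothetical sample_fn standing in for the conditional model's sampler.

```python
import numpy as np

def naive_augmentation_set(real_images, real_labels, sample_fn, fraction=0.5, seed=0):
    """Build a naively augmented training set for NAS: keep every real example
    and add `fraction` (0.25, 0.5, or 1.0) as many synthetic examples, sampled
    class-conditionally so the label distribution is preserved.
    `sample_fn` is a hypothetical sampler drawing one model sample per label."""
    rng = np.random.default_rng(seed)
    n_extra = int(fraction * len(real_labels))
    extra_labels = rng.choice(real_labels, size=n_extra, replace=False)
    extra_images = sample_fn(extra_labels)
    images = np.concatenate([real_images, extra_images], axis=0)
    labels = np.concatenate([real_labels, extra_labels], axis=0)
    return images, labels
```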
4.5 Model Comparison on CIFAR-10
Table 3: CAS on CIFAR-10 for classifiers trained on real data and on model samples.

Training Set | Real | BigGAN | cGAN | PixelCNN | PixelIQN
Accuracy | 92.58% | 71.87% | 76.35% | 64.02% | 74.26%
Finally, we also compare CAS across different model classes on CIFAR-10. We compare four models: BigGAN, cGAN with Projection Discriminator [32], PixelCNN [14], and PixelIQN [33]. We train a ResNet-56 following the training procedure of [29]. More details can be found in Appendix A.2. Similar to the ImageNet experiments, we find that both GANs produce samples with a certain degree of generalization. GANs also significantly outperform PixelCNN on this benchmark. Perhaps surprisingly, PixelIQN performs similarly to the newer GANs.
5 Conclusion
Good metrics have long been an important, and perhaps underappreciated, component in driving improvements in models. They may be particularly important now, as generative models have reached a level of maturity at which they may be deployed in downstream tasks. We proposed one such metric, the Classification Accuracy Score, for conditional generative models of images, and found it practically useful in uncovering model deficiencies. Furthermore, we find that GAN models of ImageNet, despite high sample quality, tend to underperform models based on likelihood. Finally, we find that Inception Score and Frechet Inception Distance unfairly penalize non-GAN models.
An open question in this work is understanding to what extent these models generalize beyond the training set. While current results suggest that even stateoftheart models currently underfit, recent progress indicates that underfitting may be a temporary issue. Measuring generalization will then be of primary importance, especially if we plan to deploy our models on downstream tasks.
References
 (1) L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. In International Conference on Learning Representations, 2016. arXiv:1511.01844.
 (2) Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 (3) Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
 (4) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
 (5) Tero Karras, Samuli Laine, and Timo Aila. A stylebased generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
 (6) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
 (7) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two timescale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
 (8) Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. arXiv preprint arXiv:1812.01608, 2018.
 (9) Jeffrey De Fauw, Sander Dieleman, and Karen Simonyan. Hierarchical autoregressive image models with auxiliary decoders. arXiv preprint arXiv:1903.04933, 2019.
 (10) A. Razavi, A. van den Oord, and O. Vinyals. Generating diverse highresolution images with vqvae. In International Conference on Learning Representations Workshop on Deep Generative Models for Highly Structured Data, 2019.
 (11) Stanley F Chen, Douglas Beeferman, and Roni Rosenfeld. Evaluation metrics for language models. 1998.
 (12) Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 (13) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 (14) Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, pages 4790–4798, 2016.
 (15) Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
 (16) Laurent Dinh, Jascha SohlDickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
 (17) Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, Colorado Univ at Boulder Dept of Computer Science, 1986.
 (18) Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don’t know? arXiv preprint arXiv:1810.09136, 2018.
 (19) Mikołaj Bińkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.

 (20) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
 (21) Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, pages 5228–5237, 2018.
 (22) Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. arXiv preprint arXiv:1904.06991, 2019.
 (23) Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.
 (24) Sharon Zhou, Mitchell Gordon, Ranjay Krishna, Austin Narcomey, Durim Morina, and Michael S Bernstein. Hype: Human eye perceptual evaluation of generative models. arXiv preprint arXiv:1904.01121, 2019.
 (25) Valentin Khrulkov and Ivan Oseledets. Geometry score: A method for comparing generative adversarial networks. arXiv preprint arXiv:1802.02664, 2018.
 (26) Sanjeev Arora and Yi Zhang. Do gans actually learn the distribution? an empirical study. arXiv preprint arXiv:1706.08224, 2017.

 (27) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
 (28) Stanislau Semeniuta, Aliaksei Severyn, and Sylvain Gelly. On accurate evaluation of gans for language generation. arXiv preprint arXiv:1806.04936, 2018.
 (29) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
 (30) Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
 (31) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 (32) Takeru Miyato and Masanori Koyama. cgans with projection discriminator. arXiv preprint arXiv:1802.05637, 2018.
 (33) Georg Ostrovski, Will Dabney, and Rémi Munos. Autoregressive quantile networks for generative modeling. arXiv preprint arXiv:1806.05575, 2018.
Appendix A Experimental Setup
A.1 ImageNet
We use a ResNet-50 classifier for our models on ImageNet, with single-crop evaluation. The classifier is trained for 90 epochs using TensorFlow's momentum optimizer, with a learning rate schedule that increases linearly from 0.0 to 0.4 over the first 5 epochs and is then decreased by a factor of 10 at epochs 30, 60, and 80. It mirrors the 8,192-batch setup of [31] with gradual warmup. These classifiers were trained on 128 TPU chips, and training completed in roughly 45 minutes. We also compared results to classifiers trained on 8 TPU chips with batch size 1,024 (which completed in roughly 10 hours), and found that Top-1 and Top-5 accuracy were within 0.4%.
For BigGAN-deep models, since the truncation trick, which resamples noise dimensions that fall too far from the mean of the distribution, seems to trade off quality for diversity, we perform experiments for a sweep of truncation parameters: 0.2, 0.42, 0.5, 0.6, 0.8, 1.0, 1.5, and 2.0.^{4}
^{4} Dimensions of the noise vector whose values fall outside the interval bounded by plus and minus the truncation parameter are resampled. Lower truncation values lead to less diverse datasets.
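For illustration, a minimal sketch of the resampling form of the truncation trick described in the footnote; the released BigGAN code may implement truncation differently, for example via a truncated-normal sampler.

```python
import numpy as np

def truncated_noise(batch_size, dim, truncation, rng=None):
    """Sample z ~ N(0, I) and resample any dimension whose value falls outside
    [-truncation, truncation], as in the resampling form of the truncation trick.
    Smaller truncation values concentrate z near the mode, trading diversity
    for sample quality."""
    rng = rng or np.random.default_rng()
    z = rng.standard_normal((batch_size, dim))
    out_of_range = np.abs(z) > truncation
    while out_of_range.any():
        z[out_of_range] = rng.standard_normal(out_of_range.sum())
        out_of_range = np.abs(z) > truncation
    return z
```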
A.2 CIFAR-10
We use a ResNet-56 classifier for our models on CIFAR-10, using 45,000 samples for the training set and 5,000 for validation. We train for 182 epochs, starting at learning rate 1.0 and decaying by a factor of 10 at epochs 91 and 136. We use a batch size of 128. Note that this setup mirrors that of [29].
Appendix B Instructions for Operating on Google Cloud
(These broadly follow https://cloud.google.com/tpu/docs/tutorials/resnet, with some changes for the metric.)
For first-time users:
1. Create a project: https://console.cloud.google.com/cloud-resource-manager
2. Enable billing: https://cloud.google.com/billing/docs/how-to/modify-project
3. Create a storage bucket (used for storing data and models): https://console.cloud.google.com/storage/browser. Make sure this is in the zone us-central1, as this has the cheapest pricing.
For 10-hour training:
1. Launch the Google Cloud shell (https://cloud.google.com/shell/).
2. ctpu up --machine-type n1-standard-8 --tpu-size=v2-8 --preemptible --zone us-central1-<x>
(where <x> is a or b for paying customers, and f for those in the TFRC program)
3. You will now be in a virtual machine. Run tmux to keep a persistent ssh session.
4. Run export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
5. cd /usr/share/tpu/models/official/resnet/
6. Set TRAIN_DIR=gs://<BUCKETNAME>/<synthetic data> to the TFRecords of your synthetic data.
7. Set EVAL_DIR=gs://<BUCKETNAME>/<real data> to the TFRecords of the validation data.
8. Set MODEL_DIR=gs://<BUCKETNAME>/<MODEL_DIR>
9. python resnet_main.py --tpu=${TPU_NAME} --data_dir=${TRAIN_DIR} --model_dir=$MODEL_DIR --hparams_file=configs/cloud/v2-8.yaml --mode=train
10. python resnet_main.py --tpu=${TPU_NAME} --data_dir=${EVAL_DIR} --model_dir=$MODEL_DIR --hparams_file=configs/cloud/v2-8.yaml --mode=eval
11. Exit the shell.
12. Run ctpu delete --zone <ZONE> to turn off the TPU.
For 45-minute training:
1. Launch the Google Cloud shell (https://cloud.google.com/shell/).
2. ctpu up --machine-type n1-standard-8 --tpu-size=v2-128 --preemptible --zone us-central1-<x>
(where <x> is a or b for paying customers, and f for those in the TFRC program)
3. You will now be in a virtual machine. Run tmux to keep a persistent ssh session.
4. Run export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
5. cd /usr/share/tpu/models/official/resnet/
6. Set TRAIN_DIR=gs://<BUCKETNAME>/<synthetic data> to the TFRecords of your synthetic data.
7. Set EVAL_DIR=gs://<BUCKETNAME>/<real data> to the TFRecords of the validation data.
8. Set MODEL_DIR=gs://<BUCKETNAME>/<MODEL_DIR>
9. python resnet_main.py --tpu=${TPU_NAME} --data_dir=${TRAIN_DIR} --model_dir=$MODEL_DIR --hparams_file=configs/cloud/v2-128.yaml --mode=train
10. Exit the shell.
11. Run ctpu delete --zone <ZONE> to turn off the TPU.
12. To evaluate, follow the steps of the 10-hour training, except for steps 6 and 9.