Classification Accuracy Score for Conditional Generative Models

05/26/2019 ∙ by Suman Ravuri, et al. ∙ Google

Deep generative models (DGMs) of images are now sufficiently mature that they produce nearly photorealistic samples and obtain scores similar to the data distribution on heuristics such as Frechet Inception Distance. These results, especially on large-scale datasets such as ImageNet, suggest that DGMs are learning the data distribution in a perceptually meaningful space, and can be used in downstream tasks. To test this latter hypothesis, we use class-conditional generative models from a number of model classes (variational autoencoders, autoregressive models, and generative adversarial networks) to infer the class labels of real data. We perform this inference by training an image classifier using only synthetic data, and using the classifier to predict labels on real data. The performance on this task, which we call Classification Accuracy Score (CAS), highlights some surprising results that are not captured by traditional metrics and that comprise our contributions. First, when using a state-of-the-art GAN (BigGAN), Top-1 accuracy decreases by 41.6% compared to the original data, and other model classes, such as high-resolution VQ-VAE and Hierarchical Autoregressive Models, substantially outperform GANs on this benchmark. Second, CAS automatically surfaces particular classes for which generative models failed to capture the data distribution, and which were previously unknown in the literature. Third, we find traditional GAN metrics such as Frechet Inception Distance neither predictive of CAS nor useful when evaluating non-GAN models. Finally, we introduce Naive Augmentation Score, a variant of CAS where the image classifier is trained on both real and synthetic data, to demonstrate that naive augmentation improves classification performance only in limited circumstances. In order to facilitate better diagnoses of generative models, we open-source the proposed metric.




1 Introduction

Evaluating generative models of high-dimensional data remains an open problem. Despite a number of subtleties in generative model assessment Theis2015d, researchers looking to make practical progress in generative models of images, and particularly those who have focused on Generative Adversarial Networks goodfellow2014generative, have identified desirable properties such as “sample quality” and “diversity” and proposed automatic metrics to measure these desiderata. As a result, recent years have witnessed a rapid improvement in the quality of deep generative models. While ultimately the utility of these models is their performance in downstream tasks, the focus on these metrics has led to models whose samplers now generate nearly photorealistic images brock2018large; karras2017progressive; karras2018style. For one model in particular, BigGAN-deep brock2018large, results on standard GAN metrics such as Inception Score (IS) salimans2016improved and Frechet Inception Distance (FID) heusel2017gans approach those of the data distribution. The results on FID, which purports to be the Wasserstein-2 metric in a perceptual feature space, in particular suggest that BigGANs are capturing the data distribution.

A similar, though less heralded, improvement has occurred for models whose objectives are (bounds of) likelihood, with the result that many of these models now also produce photorealistic samples. Examples include the Subscale Pixel Networks menick2018generating, unconditional autoregressive models of 128×128 ImageNet that achieve state-of-the-art test set log-likelihoods; Hierarchical Autoregressive Models (HAMs) de2019hierarchical, class-conditional autoregressive models of 128×128 and 256×256 ImageNet; and the recently introduced high-resolution VQ-VAE RazaviNote2019, a variational autoencoder that uses vector quantization and an autoregressive prior to produce high-quality samples. Notably, these models measure diversity using test set likelihood and assess sample quality through visual inspection, eschewing the metrics typically used in GAN research.

As these models increasingly seem “to learn the distribution” according to these metrics, it is natural to consider their use in downstream tasks. Such a view certainly has a precedent: improved test set likelihoods in language models, which are unconditional models of text, also improve performance in tasks such as speech recognition chen1998evaluation. While a generative model need not learn the data distribution to perform well on a downstream task, poor performance on such tasks allows us to diagnose specific problems with both our generative models and the task-agnostic metrics we use to evaluate them. To that end, we introduce a general framework in which we use conditional generative models to perform approximate inference and measure the quality of that inference. The idea is simple: for any generative model of the form $p_\theta(x \mid y)$, we learn an inference network $\hat{p}(y \mid x)$ using only samples from the conditional generative model, and measure the performance of the inference network on a downstream task. We then compare performance to that of an inference network trained on real data.

We apply this framework to conditional image models, where $y$ is the image label, $x$ is the image, and the task is image classification. We call the performance measures we use, Top-1 and Top-5 accuracy, the Classification Accuracy Score (CAS), and the gap in performance between networks trained on real and synthetic data allows us to understand specific deficiencies in the generative model. Although a simple metric, CAS uncovers some surprising results:

Figure 1: CAS uncovers classes for which BigGAN-deep fails to capture the data distribution. The top row shows real images, and the bottom two rows show samples from BigGAN-deep. Columns, left to right: balloon, paddlewheel, pencil sharpener, spatula.
  • When using a state-of-the-art GAN (BigGAN-deep) and an off-the-shelf ResNet-50 classifier as the inference network, Top-5 and Top-1 accuracy decreased by 27.9% and 41.6%, respectively, compared to real data.

  • Conditional generative models based on likelihood, such as high-resolution VQ-VAE and Hierarchical Autoregressive Models, perform relatively well compared to BigGAN-deep, despite achieving relatively poor Inception Scores and Frechet Inception Distances. Since these models produce visually appealing samples, the result suggests that FID/IS are poor measures of non-GAN models.

  • Classification Accuracy Score (CAS) automatically surfaces particular classes for which BigGAN-deep and VQ-VAE fail to capture the data distribution, and which were previously unknown in the literature. Figure 1 shows four such classes for BigGAN-deep.

  • We find that neither Inception Score, nor Frechet Inception Distance, nor combinations thereof are predictive of CAS. These results suggest that as generative models are beginning to be deployed in downstream tasks, we should create metrics that better measure task performance.

  • We introduce Naive Augmentation Score (NAS), a variant of CAS where the image classifier is trained on both real and synthetic images, to demonstrate that classification performance improves in limited circumstances. Augmenting the ImageNet training set with low-diversity BigGAN-deep images improves Top-5 accuracy by 0.2%, while augmenting the dataset with any other synthetic images degrades classification performance.

In Section 2 we provide a few definitions, desiderata of metrics, and shortcomings of the most popular metrics in relation to the different research directions for generative modeling. Section 3 introduces CAS. Finally, Section 4 provides a large scale study of current state-of-the-art generative models using FID, IS, and CAS on both the ImageNet and CIFAR-10 datasets.

2 Metrics for Generative Models

Much of the difficulty in evaluation is not knowing the task for which the model is used. Understanding how it is deployed, however, has important implications for its desired properties. For example, consider the seemingly similar tasks of automatic speech recognition and speech synthesis. While both may share the same generative model of speech (such as a hidden Markov model $p_\theta(o, w)$, with the latent and observed variables being a word sequence $w$ and waveform $o$, respectively), the implications of model misspecification are vastly different. In speech recognition, the model should be able to infer words for all possible speech waveforms, even if the waveforms themselves are degraded, while in speech synthesis, the model should produce the most realistic-sounding samples, even if it cannot produce all possible speech waveforms. In particular, for automatic speech recognition, we care about $p_\theta(w \mid o)$, while for speech synthesis, we care about samples $o \sim p_\theta(o \mid w)$.

In the absence of a downstream task, we assess to what extent the model distribution $p_\theta(x)$ matches the data distribution $p(x)$, a less specific, and often more difficult, goal. Two consequences of the trivial observation that $p_\theta(x) = p(x)$ are: 1) each sample “comes” from the data distribution (i.e., it is a “plausible” sample from the data distribution), and 2) all possible examples from the data distribution are represented by the model. Different metrics that evaluate the degree of model mismatch weigh these criteria differently.

Furthermore, we expect our metrics to be relatively fast to calculate. This last desideratum often depends on the model class, of which there are five categories:

For the first four of these classes, the likelihood objective provides us scaled estimates of the KL-divergence between the data and model. Furthermore, test set likelihood is also an implicit measure of diversity. The likelihood, however, is a fairly poor measure of sample quality Theis2015d and often scores out-of-domain data more highly than in-domain data nalisnick2018deep.

For implicit models, the objective provides neither an accurate estimate of a statistical divergence or distance, nor a natural evaluation metric. The lack of any such metrics likely forced researchers to propose heuristics that measure versions of both 1) and 2) simultaneously. Inception Score salimans2016improved, defined as $\exp\big(\mathbb{E}_{x}[\mathrm{KL}(p(y \mid x) \,\|\, p(y))]\big)$, measures 1) by how confidently a classifier assigns an image to a particular class ($p(y \mid x)$), and 2) by penalizing the model if too many images are classified to the same class ($p(y)$). More principled versions of this procedure are Frechet Inception Distance heusel2017gans and Kernel Inception Distance binkowski2018demystifying, which both use variants of two-sample tests in a learned “perceptual” feature space, the Inception pool3 space, to assess distribution matching. Even though this space was an ad-hoc proposition, recent work zhang2018unreasonable suggests that deep features correlate with human perception of similarity. Even more recent work sajjadi2018assessing; kynkaanniemi2019improved calculates 1) and 2) (sample quality and diversity) independently by calculating precision and recall.
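As a rough illustration of how Inception Score combines the two criteria, the sketch below computes $\exp(\mathbb{E}_x[\mathrm{KL}(p(y \mid x) \,\|\, p(y))])$ from a list of per-image class-probability vectors. The function name `inception_score` and the toy inputs are illustrative assumptions; in practice the probabilities would come from an Inception classifier, which is not reproduced here.

```python
import math

def inception_score(probs):
    """Inception Score from per-image class probabilities p(y|x).

    probs: list of lists; each inner list sums to 1 (one row per image).
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal p(y): average of the conditionals over all images.
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    # Mean KL(p(y|x) || p(y)); terms with p(y|x) = 0 contribute nothing.
    mean_kl = 0.0
    for row in probs:
        mean_kl += sum(p * math.log(p / marginal[j])
                       for j, p in enumerate(row) if p > 0.0)
    mean_kl /= n
    return math.exp(mean_kl)

# Toy inputs (hypothetical): three images, three classes.
confident_and_balanced = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
unsure = [[1 / 3, 1 / 3, 1 / 3]] * 3
collapsed = [[1.0, 0.0, 0.0]] * 3
```

Confident predictions spread evenly over the classes give the maximum score (the number of classes), while uncertain predictions and predictions collapsed onto a single class both give the minimum score of 1. Nothing in the computation looks at variation among images assigned to the same class.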

Reliance on Inception Score and Frechet Inception Distance in particular has led to improvement in GAN models, but these metrics have certain deficiencies. Inception Score does not penalize a lack of intra-class diversity, and certain out-of-distribution samples produce Inception Scores three times higher than that of the data barratt2018note. Frechet Inception Distance, on the other hand, suffers from a high degree of bias binkowski2018demystifying. Moreover, the pool3 feature layer may not even correlate well with human judgment of sample quality zhou2019hype. In this work, we also find that non-GAN models have rather poor Inception Scores and Frechet Inception Distances, even though the samples are visually appealing.

Rather than creating ad-hoc heuristics aimed at broadly measuring sample quality and diversity, we instead evaluate generative models by assessing their performance on a downstream task. This is akin to measuring a generative model of speech by evaluating it on automatic speech recognition. Since models considered here are implicit or do not admit exact likelihoods, exact inference is difficult. To circumvent this issue, we train an inference network on samples from the model. If the generative model is indeed capturing the data distribution, then we could replace the original distribution with a model-generated one, and perform any downstream task, and obtain the same result. In this work, we study perhaps the simplest downstream task: image classification.

2.1 Other Related Work

The metrics mentioned above are by no means the only ones, and researchers have proposed methods to evaluate other generative model properties. khrulkov2018geometry constructs approximate manifolds from data and samples, and applies the method to GAN samples to determine whether mode collapse occurred. arora2017gans attempts to determine the support size of GANs by using a Birthday Paradox test, though the procedure requires a human to identify two nearly-identical samples. Maximum Mean Discrepancy gretton2012kernel is a two-sample test that has many nice theoretical properties, but seems to be less used because the choice of kernel does not necessarily coincide with human evaluation. Finally, procedurally similar to our proposal, semeniuta2018accurate proposes a “reverse LM score”, which trains a language model on GAN data and tests on a real held-out set.

3 Classification Accuracy Score

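At the heart of the proposed score lies a very simple idea: if the model captures the data distribution, performance on any downstream task should be similar whether using the original or model data. To make this intuition more precise, suppose that data comes from a distribution $p(x, y)$, the task is to infer $y$ from $x$, and we suffer a loss $L(y, \hat{y})$ for predicting $\hat{y}$ when the true label is $y$. The risk associated with a classifier $\hat{y} = f(x)$ is:

$$\mathbb{E}_{p(x, y)}[L(y, \hat{y})] = \mathbb{E}_{p(x)}\big[\mathbb{E}_{p(y \mid x)}[L(y, \hat{y}) \mid x]\big] \tag{1}$$

As we only have samples from $p(x, y)$, we measure the empirical risk $\frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i))$. From the right hand side of Equation 1, of the set of predictions $\hat{\mathcal{Y}}$, the optimal one $\hat{y}$ minimizes the expected posterior loss:

$$\hat{y} = \arg\min_{y' \in \hat{\mathcal{Y}}} \mathbb{E}_{p(y \mid x)}[L(y, y') \mid x] \tag{2}$$

Assuming we know the label distribution $p(y)$, a generative modeling approach to this problem is to model the conditional distribution $p_\theta(x \mid y)$, and infer labels using Bayes rule: $p_\theta(y \mid x) = p_\theta(x \mid y)\, p(y) / p_\theta(x)$. If $p_\theta(y \mid x) = p(y \mid x)$, then we can make predictions that minimize the risk for any loss function. If the risk is not minimized, however, then we can conclude that the distributions are not matched, and we can interrogate $\mathbb{E}_{p_\theta(y \mid x)}[L(y, \hat{y}) \mid x]$ to better understand how our generative models failed.

For most modern deep generative models, however, we have access to neither $p_\theta(x \mid y)$, the probability of the data given the label, nor $p_\theta(y \mid x)$, the model conditional distribution, nor $p(y \mid x)$, the true conditional distribution. Instead, from samples $(x, y) \sim p(y)\, p_\theta(x \mid y)$, we train a discriminative model $\hat{p}(y \mid x)$ to learn $p_\theta(y \mid x)$, and use it to estimate the expected posterior loss $\mathbb{E}_{\hat{p}(y \mid x)}[L(y, \hat{y}) \mid x]$. We define the generative risk as:

$$\mathbb{E}_{p(x, y)}[L(y, \hat{y}_g)] \tag{3}$$

where $\hat{y}_g$ is the classifier that minimizes the expected posterior loss under $\hat{p}(y \mid x)$. Then, we compare the performance of this classifier to the performance of the classifier trained on samples from $p(x, y)$.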

In the case of conditional generative models of images, $y$ is the class label for image $x$, and the model of $\hat{p}(y \mid x)$ is an image classifier. We use ResNets he2015deep in this work.

The loss functions we explore are the standard ones for image classification. One is 0-1, which yields Top-1 accuracy, and the other is 0-1 in the Top-5, which yields Top-5 accuracy. (It is more correct to state that the losses yield errors, but we present results as accuracies instead, as they are standard in the computer vision literature.)

Procedurally, we train a classifier on synthetic data, and evaluate the performance of the classifier on real data. We call the accuracy the Classification Accuracy Score (CAS).
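The procedure can be sketched in a few lines. This is only an illustration under stated assumptions: `train_fn` is a hypothetical stand-in for classifier training (a ResNet in our experiments), the nearest-centroid classifier and the one-dimensional toy data are invented for the example, and class labels are assumed to be the integers 0 to K-1 so that they index the score vector.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of examples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

def classification_accuracy_score(train_fn, synthetic, real, k=1):
    """Train on synthetic (x, y) pairs only, then score on real (x, y) pairs."""
    score_fn = train_fn(synthetic)          # the classifier never sees real data
    xs, ys = zip(*real)
    return top_k_accuracy([score_fn(x) for x in xs], ys, k)

def train_nearest_centroid(pairs):
    """Toy classifier (hypothetical stand-in): one centroid per class."""
    sums, counts = {}, {}
    for x, y in pairs:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    centroids = [sums[c] / counts[c] for c in sorted(sums)]
    # Score each class by the negative distance to its centroid.
    return lambda x: [-abs(x - m) for m in centroids]

# Invented one-dimensional data: (feature, label).
synthetic = [(0.1, 0), (0.2, 0), (1.1, 1), (0.9, 1), (2.0, 2), (2.2, 2)]
real = [(0.0, 0), (1.0, 1), (2.1, 2), (1.4, 2)]
cas_top1 = classification_accuracy_score(train_nearest_centroid, synthetic, real, k=1)
cas_top2 = classification_accuracy_score(train_nearest_centroid, synthetic, real, k=2)
```

The same two functions, with a classifier trained on real data passed through `train_fn`, give the reference accuracy against which CAS is compared.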

Note that a CAS close to that for the data does not imply that the generative model accurately modeled the data distribution. This may happen for a few reasons. First, the inference network can be accurate even when the generative model is not: this holds for any generative model that satisfies $p_\theta(y \mid x) = p(y \mid x)$ for all $x$ in the support of the data distribution. An example is a generative model that samples from the true distribution with probability $p$, and from a noise distribution with a support disjoint from the true distribution with probability $1 - p$. In this case, our inference model is good, but the underlying generative model is poor.

Second, since the losses considered here are not proper scoring rules gneiting2007strictly, one could obtain reasonable CAS from suboptimal inference networks. For example, suppose that $p(y \mid x)$ places nearly all of its mass on the correct class, while $\hat{p}(y \mid x)$, due to poor synthetic data, assigns the correct class a probability only marginally higher than that of any other class. CAS for both is 100%. Using a proper scoring rule, such as the Brier Score, eliminates this issue, but experimentally, we found limited practical benefit from using one.
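To make the distinction concrete, the following sketch compares 0-1 loss with the Brier score on two predictive distributions that both rank the correct class first. The probability values are invented for illustration and are not taken from our experiments.

```python
def zero_one_loss(probs, label):
    """0 if the most probable class is the true label, else 1."""
    predicted = max(range(len(probs)), key=lambda c: probs[c])
    return 0 if predicted == label else 1

def brier_score(probs, label):
    """Squared error between the predicted distribution and the one-hot label."""
    return sum((p - (1.0 if c == label else 0.0)) ** 2
               for c, p in enumerate(probs))

sharp = [0.98, 0.01, 0.01]   # confident in the correct class (hypothetical)
vague = [0.40, 0.30, 0.30]   # barely prefers the correct class (hypothetical)
label = 0

for name, probs in (("sharp", sharp), ("vague", vague)):
    print(name, zero_one_loss(probs, label), round(brier_score(probs, label), 4))
```

Both distributions incur zero 0-1 loss, so they are indistinguishable under Top-1 accuracy, whereas the Brier score penalizes the vague prediction far more heavily.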

Finally, a generative model that memorizes the training set will achieve the same CAS as the original data. (N.B. Inception Score and Frechet Inception Distance also suffer from the same failure mode.) In general, however, we hope that generative models produce samples disjoint from the set on which they are trained. If the samples are sufficiently different, we can train a classifier on both the original data and model data and expect improved accuracy. We call the performance of classifiers trained with this “naive augmentation” the Naive Augmentation Score (NAS). Our CAS results, however, indicate that the current models still significantly underfit, rendering the conclusions less compelling. For completeness, we include those results in Section 4.4.

Despite these theoretical issues, we find that generative models have Classification Accuracy Scores lower than the original data, indicating that they fail to capture the data distribution.

3.1 Computation and Open-Sourcing Metric

Computationally, training classifiers is significantly more demanding than calculating Frechet Inception Distance or Inception Score over 50,000 samples. The time is ripe, however, for such a metric due to a few key advances in training classifiers: 1) training of ImageNet classifiers has been reduced to minutes goyal2017accurate, 2) with cloud services, the variance due to implementation details of such a metric is largely mitigated, and 3) thanks to cloud services, the price and time cost is reasonable, and will only improve in the coming years.

Moreover, many class-conditional generative models are computationally expensive to train, and as a result, even a relatively expensive metric such as CAS comprises a small percentage of the training budget.

We open-source our metric on Google Cloud for others to use. The instructions are given in Appendix B. At the time of writing, one can compute the metric in 10 hours for roughly $15, or in 45 minutes for roughly $85 using TPUs. Moreover, depending on affiliation, one may be able to access TPUs for free using the TensorFlow Research Cloud (TFRC).

4 Experiments

Our experiments are simple: on ImageNet, we use three generative models (BigGAN-deep at 256×256 and 128×128 resolutions, Hierarchical Autoregressive Models (HAM) with masked self-prediction auxiliary decoder at 128×128 resolution, and high-resolution Vector-Quantized Variational Autoencoder (high-res VQ-VAE) at 256×256 resolution) to replace the ImageNet training set with a model-generated one, train an image classifier, and evaluate performance on the ImageNet validation set. To calculate CAS, we replace the ImageNet training set with one sampled from the model, and each example from the original training set is replaced with a model sample from the same class.
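The replacement step amounts to drawing one model sample per original training label. A minimal sketch, assuming a hypothetical `sample_fn(label)` that returns one generated example from a class-conditional model; the scalar toy sampler below is a stand-in and not one of the models studied here.

```python
import random

def build_synthetic_training_set(real_labels, sample_fn):
    """Replace each real training example with a model sample of the same class.

    real_labels: labels of the original training set, one per example.
    sample_fn:   hypothetical callable, label -> one generated example.
    """
    return [(sample_fn(label), label) for label in real_labels]

# Toy stand-in for a conditional generator (assumption, for illustration only):
# class c produces a scalar "image" near c.
rng = random.Random(0)
toy_sampler = lambda label: label + rng.gauss(0.0, 0.1)

real_labels = [0, 0, 1, 2, 2, 2]
synthetic_set = build_synthetic_training_set(real_labels, toy_sampler)
```

By construction, the synthetic set has the same size and the same per-class counts as the original training set, so any accuracy gap is attributable to the samples rather than to class balance.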

In addition, we compare CAS to two traditional GAN metrics: Inception Score and Frechet Inception Distance (FID), as these metrics are the current gold standard for GAN comparison and have been used to compare non-GAN to GAN models. Both rely on a feature space from a classifier trained on ImageNet, suggesting that if metrics are useful at predicting performance on a downstream task, it would indeed be this one.

Further details about the experiment can be found in Appendix A.1.

4.1 Model Comparison on ImageNet

Training Set                  Resolution   Top-5 Accuracy   Top-1 Accuracy   Inception Score   FID-50K
Real                          128×128      88.79%           68.82%                             1.61
BigGAN-deep                   128×128      64.44%           40.64%                             4.22
Hierarchical Autoregressive   128×128      77.33%           54.05%           17.02 ± 0.79      46.05
Real                          256×256      91.47%           73.09%                             2.47
BigGAN-deep                   256×256      65.92%           42.65%                             11.78
High-Res VQ-VAE               256×256      77.59%           54.83%                             38.05
Table 1: CAS for different models at 128×128 and 256×256 resolutions. BigGAN-deep samples are taken at the best truncation parameter of 1.5.

Table 1 shows the performance of classifiers trained on model-generated datasets compared to those trained on the real dataset at 128×128 and 256×256 resolutions. At 256×256 resolution, BigGAN-deep achieves a CAS Top-5 of 65.92%, suggesting that BigGANs are learning nontrivial distributions. Perhaps surprisingly, high-resolution VQ-VAE, though performing quite poorly compared to BigGAN-deep on both Frechet Inception Distance and Inception Score, obtains a CAS Top-5 accuracy of 77.59%. Both models, however, lag the original 256×256 dataset, which achieves a CAS Top-5 accuracy of 91.47%.

We find nearly identical results for the 128×128 models. BigGAN-deep achieves CAS Top-5 and Top-1 similar to the 256×256 model (note that Inception Score and Frechet Inception Distance results for 128×128 and 256×256 BigGAN-deep are vastly different). Hierarchical Autoregressive Models, similar to high-resolution VQ-VAE, perform poorly on FID and Inception Score, but outperform BigGAN-deep on CAS. Moreover, both models underperform relative to the original 128×128 dataset.
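The gap to real data can also be expressed as a relative drop in accuracy. The short sketch below computes it from the Table 1 accuracies; the helper name `relative_drop` and the dictionary layout are ours, introduced only for this illustration.

```python
def relative_drop(real_acc, model_acc):
    """Relative decrease (in percent) of model accuracy versus real-data accuracy."""
    return 100.0 * (real_acc - model_acc) / real_acc

# Accuracies (percent) copied from Table 1, stored as (Top-5, Top-1).
table1 = {
    "256x256": {"Real": (91.47, 73.09),
                "BigGAN-deep": (65.92, 42.65),
                "High-Res VQ-VAE": (77.59, 54.83)},
    "128x128": {"Real": (88.79, 68.82),
                "BigGAN-deep": (64.44, 40.64),
                "Hierarchical Autoregressive": (77.33, 54.05)},
}

drops = {}
for resolution, rows in table1.items():
    real_top5, real_top1 = rows["Real"]
    for name, (top5, top1) in rows.items():
        if name == "Real":
            continue
        drops[(resolution, name)] = (relative_drop(real_top5, top5),
                                     relative_drop(real_top1, top1))

for key, (d5, d1) in drops.items():
    print(key, f"Top-5 drop {d5:.1f}%", f"Top-1 drop {d1:.1f}%")
```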

4.2 Uncovering Model Deficiencies

Figure 2: Comparison of per-class accuracy of data (blue) vs. model (red). Left: BigGAN-deep 256×256 at 1.5 truncation level. Middle: High-resolution VQ-VAE 256×256. Right: HAM 128×128.

Figure 3: The top two rows are samples from classes that achieved the best test set performance relative to the original dataset. The bottom two rows are those from classes that achieved the worst. Left: BigGAN-deep top two (squirrel monkey and red fox) and bottom two (hot air balloon and paddlewheel). Middle: High-Res VQ-VAE top two (red fox and African elephant) and bottom two (guillotine and fur coat). Right: Hierarchical Autoregressive Model top two (husky and gong/tim-tam) and bottom two (hermit crab and harmonica).

To better understand what accounts for the gap between generative model and dataset CAS, we broke down the performance by class (Figure 2). As shown in the left pane, nearly every class of BigGAN-deep suffers a drop in performance compared to the original dataset, though six classes (partridge, red fox, jaguar/panther, squirrel monkey, African elephant, and strawberry) show marginal improvement over the original dataset. The left pane of Figure 3 shows the two best and two worst performing categories, as measured by the difference in classification performance. Notably, for the two worst performing categories and two others (balloon, paddlewheel, pencil sharpener, and spatula), classification accuracy was 0% on the validation set.

The per-class breakdown of high-resolution VQ-VAE (middle pane of Figure 2) shows that this model also underperforms the real data in most classes (only 31 classes performed better than the original data), though the gap is not as large as for BigGAN-deep. Furthermore, high-resolution VQ-VAE has better generalization performance than BigGAN-deep in 87.6% of classes, and no class has 0% classification accuracy. The middle pane of Figure 3 shows the two best and two worst performing categories.

The right panes of Figures 2 and 3 show the per-class breakdown and top and bottom two classes, respectively, for Hierarchical Autoregressive Models. Results broadly mirror those of high-resolution VQ-VAE.

4.3 A Note on FID and a Second Note on IS

Figure 4: Top: originals. Bottom: reconstructions using high-resolution VQ-VAE.

Training Set            Truncation   Top-5 Accuracy   Top-1 Accuracy   Inception Score   FID-50K
BigGAN-deep             0.20         13.24%           5.11%                              20.75
BigGAN-deep             0.42         28.68%           13.30%                             15.93
BigGAN-deep             0.50         32.88%           15.66%                             14.37
BigGAN-deep             0.60         45.01%           25.51%                             12.41
BigGAN-deep             0.80         56.68%           32.88%                             9.24
BigGAN-deep             1.00         62.97%           39.07%                             7.42
BigGAN-deep             1.50         65.92%           42.65%                             11.78
BigGAN-deep             2.00         64.37%           40.98%                             28.67
High-Res VQ-VAE Recon   -            89.46%           69.90%                             8.69
Real                    -            91.47%           73.09%                             2.47
Table 2: Classification Accuracy Score (CAS) for high-resolution VQ-VAE model reconstructions and BigGAN-deep models at different truncation levels at 256×256 resolution.

We note that Inception Score and FID have very little correlation with CAS, suggesting that alternative metrics are needed if we use our models on downstream tasks. As a controlled experiment, we calculated CAS, IS, and FID for BigGAN-deep models with input noise distributions truncated at different values (known as the “truncation trick”). As noted in brock2018large, lower truncation values seem to improve sample quality at the expense of diversity. For CAS, the correlation coefficient between Top-1 accuracy and FID is 0.16, and between Top-1 accuracy and Inception Score -0.86, the latter result incorrectly suggesting that improved Inception Score is highly correlated with poorer performance. Moreover, the best-performing truncation values (1.5 and 2.0) have rather poor Inception Scores and FIDs. That these poor IS/FID also seem to indicate poor performance on this metric is no surprise; that other models, with well-performing Inception Scores and FIDs, yield poor performance on CAS suggests that alternative metrics are needed. One can easily diagnose the issue with Inception Score: as noted in barratt2018note, Inception Score does not account for intra-class diversity, and a training set with little intra-class diversity may make the classifier fail to generalize to a more diverse test set. FID should better account for this lack of diversity, at least grossly, as the metric, calculated as $\|\mu_m - \mu_d\|_2^2 + \mathrm{Tr}\big(\Sigma_m + \Sigma_d - 2(\Sigma_m \Sigma_d)^{1/2}\big)$ for model and data feature means $\mu_m, \mu_d$ and covariances $\Sigma_m, \Sigma_d$, compares the covariance matrices of the data and model distribution. By comparison, CAS offers a finer measure of model performance, as it provides us a per-class metric to identify which classes have better or worse performance. While in theory one could calculate a per-class FID, FID is known to suffer from high bias binkowski2018demystifying for a low number of samples, likely making the per-class estimates unreliable. (binkowski2018demystifying proposed Kernel Inception Distance, an unbiased alternative to FID, but this metric suffers from variance too large to be reliable when using the number of per-class samples in the ImageNet training set (roughly 1,000 per class), much less when using the 50 in the validation set.)
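The weak relationship between FID and CAS can be checked from the BigGAN-deep rows of Table 2. The sketch below computes a Pearson correlation between CAS Top-1 accuracy and FID over the eight truncation settings. Pearson's coefficient is assumed here, since the text does not state which correlation measure or sign convention was used, so only the magnitude should be compared with the value quoted above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# BigGAN-deep rows of Table 2, truncation 0.20 through 2.00.
top1 = [5.11, 13.30, 15.66, 25.51, 32.88, 39.07, 42.65, 40.98]
fid = [20.75, 15.93, 14.37, 12.41, 9.24, 7.42, 11.78, 28.67]

r = pearson(top1, fid)
print(f"|r| between Top-1 accuracy and FID: {abs(r):.2f}")
```

The single high-FID point at truncation 2.00, which nevertheless has one of the best accuracies, is what keeps the coefficient small: without it, FID and accuracy move together much more closely over this sweep.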

Perhaps a larger issue is that Inception Score and Frechet Inception Distance heavily penalize non-GAN models, suggesting that these heuristics are not suitable for inter-model-class comparisons. A particularly egregious failure case is that IS and FID aggressively penalize certain types of samples that look nearly identical to the original dataset. For example, we computed CAS, Inception Score, and FID on the ImageNet training set at 256×256 resolution, and on reconstructions from a high-resolution VQ-VAE. As shown in Figure 4, the samples look nearly identical. As noted in Table 2, however, Inception Score decreases by 128 points and Frechet Inception Distance increases by a factor of 3.5. The drop in performance is so great that BigGAN-deep at 1.00 truncation achieves better Inception Scores and FIDs than nearly-identical reconstructions. CAS Top-5 and Top-1, however, drop by only 2.2% and 4.4%, respectively, relative to the original dataset. The BigGAN-deep model at 1.00 truncation, on the other hand, drops by 31.1% and 46.5%, respectively.

4.4 Naive Augmentation Score

Figure 5: Top-5 accuracy as training data is augmented by examples from BigGAN-deep for different truncation levels. Lower truncation generates datasets with less sample diversity.

To calculate Naive Augmentation Score (NAS), we add 25%, 50%, or 100% more data from each of our models to the original ImageNet training set. The original ImageNet training set achieves a Top-5 accuracy of 92.97%.
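The augmentation step can be sketched as follows. This is an illustration only: `sample_fn` is a hypothetical class-conditional sampler, and drawing the extra labels from the empirical label distribution of the real set is an assumption, since the class balance of the added data is not specified here.

```python
import random

def augment_with_synthetic(real_set, sample_fn, fraction, seed=0):
    """Return the real training set plus `fraction` times as many synthetic examples.

    real_set:  list of (example, label) pairs.
    sample_fn: hypothetical callable, label -> one generated example.
    fraction:  e.g. 0.25, 0.5, or 1.0.
    """
    rng = random.Random(seed)
    n_extra = int(round(fraction * len(real_set)))
    # Draw labels from the empirical label distribution of the real set
    # (an assumption; other class-balancing schemes are possible).
    extra_labels = [rng.choice(real_set)[1] for _ in range(n_extra)]
    synthetic = [(sample_fn(y), y) for y in extra_labels]
    return real_set + synthetic

# Toy data and a stand-in generator (both invented for the example).
real_set = [(float(i % 4), i % 4) for i in range(8)]
toy_sampler = lambda label: label + 0.05
sizes = {f: len(augment_with_synthetic(real_set, toy_sampler, f))
         for f in (0.25, 0.5, 1.0)}
```

Unlike the CAS setup, the real examples are kept, so the classifier sees every original image plus the generated ones.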

Although the results of BigGAN-deep, and to a lesser extent high-resolution VQ-VAE, on CAS suggest that augmenting the original training set with model samples will not result in improved classification performance, we wanted to study whether the relative ordering on the CAS experiments would hold for the NAS ones. Figure 5 illustrates the performance of the classifiers as we increase the amount of synthetic training data. Perhaps somewhat surprisingly, BigGAN-deep models that sample from lower truncation values, and have lower sample diversity, perform better for data augmentation than those models that performed well on CAS. In fact, for some of the lowest truncation values, we found a modest improvement in classification performance: roughly a 0.2% improvement in Top-5 accuracy. Moreover, high-resolution VQ-VAE underperforms relative to BigGAN-deep models. Of course, the caveat is that the former model does not yet have a mechanism to trade off sample quality against sample diversity.

The results on augmentation highlight different desiderata for samples that are added to the dataset rather than used to replace it. Clearly, the samples added should be sufficiently different from the data to allow the classifier to better generalize, and yet, poorer sample quality may lead to poorer generalization compared to the original dataset. This may be the reason why extending the dataset with samples generated from lower truncation values, which are higher-quality but less diverse, performs better on NAS than on CAS. Furthermore, this may also explain why Inception Score, Frechet Inception Distance, and CAS are not predictive of NAS.

4.5 Model Comparison on CIFAR-10

            Real     BigGAN   cGAN     PixelCNN   PixelIQN
Accuracy    92.58%   71.87%   76.35%   64.02%     74.26%
Table 3: Classification Accuracy Score for different models of CIFAR-10.

Finally, we also compare CAS for different model classes on CIFAR-10. We compare four models: BigGAN, cGAN with Projection Discriminator miyato2018cgans , PixelCNN van2016conditional , and PixelIQN ostrovski2018autoregressive . We train a ResNet-56 following the training procedure of he2015deep . More details can be found in Appendix A.2. Similar to the ImageNet experiments, we find that both GANs produce samples with a certain degree of generalization. GANs also significantly outperform PixelCNN on this benchmark. Perhaps surprisingly, PixelIQN has similar performance to the newer GANs.

5 Conclusion

Good metrics have long been an important, and perhaps underappreciated, component in driving improvements in models. They may be particularly important now, as generative models have matured to the point where they may be deployed in downstream tasks. We proposed one, Classification Accuracy Score, for conditional generative models of images, and found the metric practically useful in uncovering model deficiencies. Furthermore, we find that GAN models of ImageNet, despite high sample quality, tend to underperform models based on likelihood. Finally, we find that Inception Score and Frechet Inception Distance unfairly penalize non-GAN models.

An open question in this work is understanding to what extent these models generalize beyond the training set. While current results suggest that even state-of-the-art models currently underfit, recent progress indicates that underfitting may be a temporary issue. Measuring generalization will then be of primary importance, especially if we plan to deploy our models on downstream tasks.


Appendix A Experimental Setup

A.1 ImageNet

We use a ResNet-50 classifier for our models on ImageNet, with single-crop evaluation. The classifier is trained for 90 epochs using TensorFlow’s momentum optimizer, with a learning rate schedule that increases linearly from 0.0 to 0.4 over the first 5 epochs and is then decreased by a factor of 10 at epochs 30, 60, and 80. It mirrors the 8,192 batch setup of [31] with gradual warmup. These classifiers were trained on 128 TPU chips, and training completed in roughly 45 minutes. We also compared results to those of classifiers trained on 8 TPU chips with batch size 1,024 (which completed in roughly 10 hours), and found that Top-1 and Top-5 accuracy were within 0.4%.
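The learning rate schedule described above can be written as a small function. This is a sketch rather than the training code itself: whether the rate is updated per step or per epoch is not stated here, so the function accepts fractional epochs as an assumption, and the name `learning_rate` is ours.

```python
def learning_rate(epoch, peak=0.4, warmup_epochs=5, decay_epochs=(30, 60, 80)):
    """Learning rate at a (possibly fractional) epoch.

    Linear warmup from 0.0 to `peak` over the first `warmup_epochs`, then a
    10x reduction at each epoch listed in `decay_epochs`.
    """
    if epoch < warmup_epochs:
        return peak * epoch / warmup_epochs
    drops = sum(1 for boundary in decay_epochs if epoch >= boundary)
    return peak * (0.1 ** drops)

for e in (0, 2.5, 5, 29, 30, 60, 80, 89):
    print(e, learning_rate(e))
```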

For BigGAN-deep models, since the truncation trick (which resamples dimensions of the noise vector whose values fall outside a range set by the truncation parameter) seems to trade off quality for diversity, we perform experiments for a sweep of truncation parameters: 0.2, 0.42, 0.5, 0.6, 0.8, 1.0, 1.5, and 2.0. Lower values of the truncation parameter lead to less diverse datasets.
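A minimal sketch of the truncation trick, assuming the convention of the publicly released BigGAN demo code, in which each dimension of a standard normal is resampled when it falls outside a fixed bound and the result is then scaled by the truncation parameter. The `bound=2.0` default and the function name `truncated_noise` are assumptions for illustration, not a description of the exact sampler used in our experiments.

```python
import random

def truncated_noise(dim, truncation, bound=2.0, seed=None):
    """Sample a noise vector with out-of-range dimensions resampled.

    Each dimension is drawn from a standard normal; values outside
    [-bound, bound] are redrawn, and the result is scaled by `truncation`.
    """
    rng = random.Random(seed)
    z = []
    for _ in range(dim):
        value = rng.gauss(0.0, 1.0)
        while abs(value) > bound:        # resample until inside the range
            value = rng.gauss(0.0, 1.0)
        z.append(truncation * value)
    return z

narrow = truncated_noise(128, truncation=0.2, seed=0)
wide = truncated_noise(128, truncation=2.0, seed=0)
```

With a small truncation value every coordinate stays close to zero, so the generator only sees inputs near the mode of the noise distribution, which is consistent with the reduced sample diversity noted above.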

A.2 CIFAR-10

We use a ResNet-56 classifier for our models on CIFAR-10, using 45,000 samples for the training set, and 5,000 for validation. We train for 182 epochs, starting at learning rate 1.0, and decaying by a factor of 10 at epochs 91 and 136. We use batch size 128. Note that this setup mirrors that from [29].

Appendix B Instructions for Operating on Google Cloud

(these broadly follow, with some changes for the metric)

For first-time users: 1. Create a project using:

3. Create a storage bucket (used for storing data and models): Make sure this is in the zone us-central, as this has the cheapest pricing.

For 10-hour training:

1. Launch google cloud shell (


 ctpu up --machine-type n1-standard-8 \
   --tpu-size=v2-8 --preemptible --zone us-central-<x>

(where <x> is a,b for paying customers, and f for those in the TFRC program)

3. You will now be in a virtual machine. Run tmux to keep a persistent ssh.

4. run

 export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models" 


 cd /usr/share/tpu/models/official/resnet/ 

6. Set

TRAIN_DIR=gs://<BUCKET-NAME>/<synthetic data>

to TFRecords of your synthetic data.

7. Set

 EVAL_DIR=gs://<BUCKET-NAME>/<real data>

to TFRecords of the validation data

8. Set



 python --tpu=${TPU_NAME} --data_dir=${TRAIN_DIR} \
  --model_dir=$MODEL_DIR \
  --hparams_file=configs/cloud/v2-8.yaml \


 python --tpu=${TPU_NAME} --data_dir=${EVAL_DIR} \
  --model_dir=$MODEL_DIR \
  --hparams_file=configs/cloud/v2-8.yaml \

11. exit shell


 ctpu delete --zone <ZONE>

to turn off the tpu

For 45-minute training:

1. Launch google cloud shell (


 ctpu up --machine-type n1-standard-8 \
   --tpu-size=v2-128 --preemptible --zone us-central-<x>

(where <x> is a,b for paying customers, and f for those in the TFRC program)

3. You will now be in a virtual machine. Run tmux to keep a persistent ssh.

4. run

 export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models" 


 cd /usr/share/tpu/models/official/resnet/ 

6. Set

TRAIN_DIR=gs://<BUCKET-NAME>/<synthetic data>

to TFRecords of your synthetic data.

7. Set

 EVAL_DIR=gs://<BUCKET-NAME>/<real data>

to TFRecords of the validation data

8. Set



 python --tpu=${TPU_NAME} --data_dir=${TRAIN_DIR} \
  --model_dir=$MODEL_DIR \
  --hparams_file=configs/cloud/v2-128.yaml \

11. exit shell


 ctpu delete --zone <ZONE>

to turn off the tpu

13. follow steps of 10-hour training, except for steps 6 and 9.