1 Introduction
The advent of new deep learning techniques for generative modeling has led to a resurgence of interest in the topic within the artificial intelligence community. Most notably, recent advances have allowed for the generation of hyper-realistic natural images (Karras et al., 2017), in addition to applications in style transfer (Zhu et al., 2017; Isola et al., 2016), image super-resolution (Ledig et al., 2016), text generation (Guo et al., 2017), music generation (Mogren, 2016), medical data generation (Esteban et al., 2017), and physical modeling (Farimani et al., 2017). In sum, these applications represent a major advance in the capabilities of machine intelligence and will have significant and immediate practical consequences. Even more promisingly, in the long run, deep generative models are a potential method for developing rich representations of the world from unlabeled data, similar to how humans develop complex mental models, in an unsupervised way, directly from sensory experience. The human ability to imagine and consider potential future scenarios with rich clarity is a crucial feature of our intelligence, and deep generative models may bring us a small step closer to replicating that ability in silico.
Despite a widespread recognition that high-dimensional generative models lie at the frontier of artificial intelligence research, it remains notoriously difficult to evaluate them. In the absence of meaningful evaluation metrics, it becomes challenging to rigorously make progress towards improved models. As a result, the generative modeling community has developed various ad-hoc evaluative criteria. The Inception Score is one of these ad-hoc metrics and has gained popularity for evaluating the quality of generative models for images.
In this paper, we rigorously investigate the most widely used metric for evaluating image-generating models, the Inception Score, and discover several shortcomings within the underlying premise of the score and its application. This metric, while of importance in and of itself, also serves as a paradigm that illustrates many of the difficulties faced when designing an effective method for the evaluation of black-box generative models. In Section 2, we briefly review generative models and discuss why evaluating them is often difficult. In Section 3, we review the Inception Score and discuss some of its characteristics. In Section 4, we describe what we have identified as the five major shortcomings of the Inception Score, both within the mechanics of the score itself and in the popular usage thereof. We propose some alterations to the metric and its usage to make it more appropriate, but some of the shortcomings are systemic and difficult to eliminate without altering the basic premise of the score.
2 Evaluating (Black-Box) Generative Models
In generative modeling, we are given a dataset of samples $x$ drawn from some unknown probability distribution $p_r(x)$. The samples could be images, text, video, audio, GPS traces, etc. We want to use the samples to derive the unknown real data distribution $p_r(x)$. Our generative model $G$ encodes a distribution over new samples, $p_g(x)$. The aim is that we find a generative distribution such that $p_g(x) \approx p_r(x)$ according to some metric.
If we are able to directly evaluate $p_g(x)$, then it is common to calculate the likelihood of a held-out dataset under $p_g$ and choose the model that maximizes this likelihood. For most applications, this approach is effective [1]. Unfortunately, in many state-of-the-art generative models, we do not have the luxury of an explicit $p_g(x)$. For example, latent variable models like Generative Adversarial Networks (GANs) do not have an explicit representation of the distribution $p_g(x)$, but rather implicitly map random noise vectors $z$ to samples through a parameterized neural network $G_\theta(z)$ (Goodfellow et al., 2014a).
[1] It has been shown that log-likelihood evaluation can be misled by simple mixture distributions (Theis et al., 2015; van den Oord & Dambre, 2015), but this is only relevant in some applications.
Some metrics have been devised that use the structure within an individual class of generative models to compare them (Im et al., 2016). However, this makes it impossible to make global comparisons between different classes of generative models. In this paper, we focus on the evaluation of black-box generative models where we assume that we can sample from $p_g$ and assume nothing further about the structure of the model.
Many metrics have been proposed for the evaluation of black-box generative models. One way is to approximate a density function over generated samples and then calculate the likelihood of held-out samples. This can be achieved using Parzen Window Estimates as a method for approximating the likelihood when the data consists of images, but other non-parametric density estimation techniques exist for other data types (Breuleux et al., 2010). A more indirect method for evaluation is to apply a pre-trained neural network to generated images and calculate statistics of its output or at a particular hidden layer. This is the approach taken by the Inception Score (Salimans et al., 2016), Mode Score (Che et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017). These scores are often motivated by demonstrating that they prefer models that generate realistic and varied images and that they are correlated with visual quality. Most of the aforementioned metrics can be fooled by algorithms that memorize the training data. Since the Inception Score is the most widely used metric in generative modeling for images, we focus on this metric.
Further, there are several works concerned with the evaluation of evaluation metrics themselves. One study examined several common evaluation metrics and found that the metrics do not correlate with each other. The authors further argue that generative models need to be directly evaluated for the application they are intended for (Theis et al., 2015). As generative models become integrated into more complex systems, it will be harder to discern their exact application aside from effectively capturing high-dimensional probability distributions, thus necessitating high-quality evaluation metrics that are not specific to applications. A recent study investigated several sample-based evaluation metrics and argued that Maximum Mean Discrepancy (MMD) and the 1-Nearest-Neighbour (1-NN) two-sample test satisfied most of the desirable properties of a metric (Qiantong et al., 2018). Further, a recent study found that over several different datasets and metrics, there is no clear evidence to suggest that any model is better than the others, if enough computation is used for hyperparameter search (Lucic et al., 2017). This result comes despite the claims of different generative models to demonstrate clear improvements on earlier work (e.g. WGAN as an improvement on the original GAN). In light of the results and discussion in this paper, which cast doubt on the most popular metric used, we do not find the results of this study surprising.
3 The Inception Score for Image Generation
Suppose we are trying to evaluate a trained generative model $G$ that encodes a distribution $p_g$ over images $x$. We can sample from $p_g$ as many times as we would like, but do not assume that we can directly evaluate $p_g$. The Inception Score is one way to evaluate such a model (Salimans et al., 2016). In this section, we reintroduce and motivate the Inception Score as a metric for generative models over images and point out several of its interesting properties.
3.1 Inception v3
The Inception v3 Network (Szegedy et al., 2016) is a deep convolutional architecture designed for classification tasks on ImageNet (Deng et al., 2009), a dataset consisting of 1.2 million RGB images from 1000 classes. Given an image $x$, the task of the network is to output a class label $y$ in the form of a vector of probabilities $p(y|x) \in [0, 1]^{1000}$, indicating the probability the network assigns to each of the class labels. The Inception v3 network is one of the most widely used networks for transfer learning and pre-trained models are available in most deep learning software libraries.
3.2 Inception Score
The Inception Score is a metric for automatically evaluating the quality of image generative models (Salimans et al., 2016). This metric was shown to correlate well with human scoring of the realism of generated images from the CIFAR-10 dataset. The IS uses an Inception v3 Network pre-trained on ImageNet and calculates a statistic of the network’s outputs when applied to generated images.
$$\mathrm{IS}(G) = \exp\left(\mathbb{E}_{x \sim p_g}\, D_{KL}\big(p(y|x) \,\|\, p(y)\big)\right) \tag{1}$$
where $x \sim p_g$ indicates that $x$ is an image sampled from $p_g$, $D_{KL}(p \,\|\, q)$ is the KL-divergence between the distributions $p$ and $q$, $p(y|x)$ is the conditional class distribution, and $p(y) = \int_x p(y|x)\, p_g(x)\, dx$ is the marginal class distribution. The $\exp$ in the expression is there to make the values easier to compare, so it will be ignored and we will use $\ln(\mathrm{IS}(G))$ without loss of generality.
The authors who proposed the IS aimed to codify two desirable qualities of a generative model into a metric:

The images generated should contain clear objects (i.e. the images are sharp rather than blurry), or $p(y|x)$ should be low entropy. In other words, the Inception Network should be highly confident there is a single object in the image.

The generative algorithm should output a high diversity of images from all the different classes in ImageNet, or $p(y)$ should be high entropy.
If both of these traits are satisfied by a generative model, then we expect a large KL-divergence between the distributions $p(y)$ and $p(y|x)$, resulting in a large IS.
3.3 Digging Deeper into the Inception Score
Let’s see why the proposed score codifies these qualities. The expected KL-divergence between the conditional and marginal distributions of two random variables is equal to their Mutual Information (for proof see Appendix A):
$$\ln(\mathrm{IS}(G)) = I(y; x) \tag{2}$$
In other words, the IS can be interpreted as the measure of dependence between the images generated by $G$ and the marginal class distribution over $y$. The Mutual Information of two random variables is further related to their entropies:
$$I(y; x) = H(y) - H(y|x) \tag{3}$$
This confirms the connection between the IS and our desire for $p(y|x)$ to be low entropy and $p(y)$ to be high entropy. As a consequence of simple properties of entropy we can bound the Inception Score (for proof see Appendix B):
$$1 \le \mathrm{IS}(G) \le 1000 \tag{4}$$
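The relations in Equations 2 to 4 can be checked numerically. The following is a minimal sketch, assuming the class probabilities $p(y|x)$ are already available as rows of an array; the helper names `entropy` and `score_terms` and the synthetic Dirichlet predictions are illustrative assumptions, not part of the original method.

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    safe = np.where(p > 0, p, 1.0)  # log(1) = 0 wherever p is exactly 0
    return -np.sum(p * np.log(safe), axis=axis)

def score_terms(p_yx):
    """Return (mean KL, H(y), mean H(y|x)) for an (N, K) array of rows p(y | x_i)."""
    p_yx = np.asarray(p_yx, dtype=float)
    p_y = p_yx.mean(axis=0)  # marginal under equally weighted samples
    log_cond = np.log(np.where(p_yx > 0, p_yx, 1.0))
    log_marg = np.log(np.where(p_y > 0, p_y, 1.0))
    mean_kl = np.sum(p_yx * (log_cond - log_marg), axis=1).mean()
    return float(mean_kl), float(entropy(p_y)), float(entropy(p_yx).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = 1000
    # Synthetic stand-in for classifier outputs (an assumption for illustration).
    p_yx = rng.dirichlet(np.ones(K), size=500)
    mean_kl, h_y, h_y_given_x = score_terms(p_yx)
    print(mean_kl, h_y - h_y_given_x)  # the two sides of Equations 2 and 3
    print(np.exp(mean_kl))             # lies between 1 and K, as in Equation 4
```

The two extremes of the bound correspond to rows that are all identical (score 1) and to one-hot rows spread evenly over the classes (score equal to the number of classes).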
3.4 Calculating the Inception Score
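We can construct an estimator of the Inception Score from samples $x^{(i)}$ by first constructing an empirical marginal class distribution,
$$\hat{p}(y) = \frac{1}{N} \sum_{i=1}^{N} p(y | x^{(i)}) \tag{5}$$
where $N$ is the number of sample images taken from the model. Then an approximation to the expected KL-divergence can be computed by
$$\mathrm{IS}(G) \approx \exp\left(\frac{1}{N} \sum_{i=1}^{N} D_{KL}\big(p(y | x^{(i)}) \,\|\, \hat{p}(y)\big)\right) \tag{6}$$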
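The estimator in Equations 5 and 6 can be sketched in a few lines. This is a hedged illustration that assumes the conditional class probabilities have already been computed by a classifier and stored as an (N, K) array; the function names and the `eps` guard against `log(0)` are assumptions of this sketch, and it is not the reference implementation.

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """Equations 5 and 6 for an (N, K) array whose rows are p(y | x_i)."""
    p_yx = np.asarray(p_yx, dtype=float)
    p_hat = p_yx.mean(axis=0, keepdims=True)  # Equation 5: empirical marginal
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_hat + eps)), axis=1)
    return float(np.exp(kl.mean()))           # Equation 6

def inception_score_over_splits(p_yx, n_splits=10):
    """Apply the estimator to n_splits chunks; return (mean, std) of the scores."""
    chunks = np.array_split(np.asarray(p_yx, dtype=float), n_splits)
    scores = np.array([inception_score(chunk) for chunk in chunks])
    return float(scores.mean()), float(scores.std())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic predictions standing in for real classifier outputs.
    p_yx = rng.dirichlet(np.full(1000, 0.05), size=5000)
    print(inception_score(p_yx))
    print(inception_score_over_splits(p_yx, n_splits=10))
```

Note that `inception_score_over_splits` recomputes the marginal inside every chunk, so each chunk sees only a fraction of the samples when estimating $\hat{p}(y)$.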
The original proposal of the IS recommended applying the above estimator $n_{\text{splits}} = 10$ times with $N = 5{,}000$ and then taking the mean and standard deviation of the resulting scores. At first glance, this procedure seems troubling and in Section 4.1.2 we lay out our critique.
4 Issues With the Inception Score
As mentioned earlier, Salimans et al. (2016) introduced the Inception Score because, in their experiments, it correlated well with human judgment of image quality. Though we don’t dispute that this is the case within a significant regime of its usage, there are several problems with the Inception Score that make it an undesirable metric for the evaluation and comparison of generative models.
Before illustrating in greater detail the problems with the Inception Score, we offer a simple one-dimensional example that illustrates some of its troubles. Suppose our true data comes with equal probability from two classes whose class-conditional densities $p(x|y=0)$ and $p(x|y=1)$ are normal distributions placed symmetrically about zero. The Bayes optimal classifier is $p(y=1|x) = \frac{p(x|y=1)}{p(x|y=0) + p(x|y=1)}$. We can then use this $p(y|x)$ to calculate an analog to the Inception Score in this setting. The optimal generator according to the Inception Score outputs $x = -\infty$ and $x = +\infty$ with equal probability, as it achieves $H(y|x) = 0$ and $H(y) = 1$ bit and thus an Inception Score of 2. Furthermore, many other distributions will also achieve high scores, e.g. a wide uniform distribution centered at zero and a wide centered normal distribution, because they will result in $H(y) = 1$ bit and reasonably small $H(y|x)$. However, the true underlying distribution $p(x)$ will achieve a lower score than the aforementioned distributions.
In the general setting, the problems with the Inception Score fall into two categories [2]:

Suboptimalities of the Inception Score itself

Problems with the popular usage of the Inception Score
[2] A third issue with the usage of the Inception Score is that the code most commonly used to calculate the score has a number of errors, including using an esoteric version of the Inception Network with 1008 classes, rather than the actual 1000. See our GitHub issue for more details: https://github.com/openai/improved-gan/issues/29.
In this section we enumerate both types of issues. In describing the problems with popular usage of the Inception Score, we omit citations so as to not call attention to individual papers for their practices. However, it is not difficult to find many examples of each of the issues we discuss.
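The one-dimensional example given earlier in this section can be reproduced numerically. The sketch below is an illustration under explicitly assumed parameters: class means of -1 and +1 with variance 2, a uniform generator on (-100, 100), a centered normal generator with variance 20, and the points -50 and +50 as a stand-in for outputs at infinity. These values and all function names are choices made for this sketch.

```python
import numpy as np

def bayes_posterior(x, mu=1.0, var=2.0):
    """p(y=1 | x) for equiprobable classes N(-mu, var) and N(+mu, var)."""
    log_odds = 2.0 * mu * np.asarray(x, dtype=float) / var
    return 1.0 / (1.0 + np.exp(-log_odds))

def score_analog(x):
    """exp(mean KL(p(y|x) || p(y))) over generated samples x."""
    p1 = bayes_posterior(x)
    p_yx = np.stack([1.0 - p1, p1], axis=1)
    p_y = p_yx.mean(axis=0)
    log_cond = np.log(np.where(p_yx > 0, p_yx, 1.0))  # 0 * log(0) treated as 0
    log_marg = np.log(np.where(p_y > 0, p_y, 1.0))
    kl = np.sum(p_yx * (log_cond - log_marg), axis=1)
    return float(np.exp(kl.mean()))

def sample_generators(n=200_000, seed=0):
    """Draw samples from the true data distribution and three alternative generators."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    true_data = rng.normal(np.where(labels == 1, 1.0, -1.0), np.sqrt(2.0))
    return {
        "true mixture": true_data,
        "two far points": np.tile([-50.0, 50.0], n // 2),
        "uniform(-100, 100)": rng.uniform(-100.0, 100.0, size=n),
        "normal(0, var=20)": rng.normal(0.0, np.sqrt(20.0), size=n),
    }

if __name__ == "__main__":
    for name, x in sample_generators().items():
        print(f"{name:>20s}: {score_analog(x):.3f}")
```

Under these assumptions the true data distribution receives the lowest score of the four, while generators that place most of their mass far from the decision boundary score close to the maximum of 2.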
4.1 Suboptimalities of the Inception Score Itself
4.1.1 Sensitivity to Weights
Network | IV2 TF | IV3 Torch | IV3 Keras |

CIFAR-10 | | | |
ImageNet Validation | | | |
Top-1 Accuracy | 0.756 | 0.772 | 0.777 |
Table 1: Inception Scores on 50k CIFAR-10 training images, 50k ImageNet validation images and ImageNet Validation top-1 accuracy. IV2 TF is the TensorFlow implementation of the Inception Score using the Inception V2 network. IV3 Torch is the PyTorch implementation of the Inception V3 network (Paszke et al., 2017). IV3 Keras is the Keras implementation of the Inception V3 network (Chollet et al., 2015). Scores were calculated using 10 splits of N=5,000 as in the original proposal.
Different training runs of the Inception network on a classification task for ImageNet result in different network weights due to randomness inherent in the training procedure. These differences in network weights typically have minimal effect on the classification accuracy of the network, which speaks to the robustness of the deep convolutional neural network paradigm for classifying images. Although these networks have virtually the same classification accuracy, slight weight changes result in drastically different scores for the exact same set of sampled images. This is illustrated in Table 1, where we calculate the Inception Score for 50k CIFAR-10 training images and 50k ImageNet Validation images using 3 versions of the Inception network, each of which achieves similar ImageNet validation classification accuracy.
The table shows that the mean Inception Score differs for ImageNet validation images, and for CIFAR images, depending on whether a Keras or Torch implementation of the Inception Network is used, both of which have almost identical classification accuracy. The discrepancies are even more pronounced when using the Inception V2 architecture, which is often the network used when calculating the Inception Score in recent papers.
This shows that the Inception Score is sensitive to small changes in network weights that do not affect the final classification accuracy of the network. We would hope that a good metric for evaluating generative models would not be so sensitive to changes that bear no relation to the quality of the images generated. Furthermore, such discrepancies in the Inception Score can easily account for the advances that differentiate “state-of-the-art” performance from other work, casting doubt on claims of model superiority.
4.1.2 Score Calculation and Exponentiation
In Section 3.4, we described that the Inception Score is taken by applying the estimator in Equation 6 for large $N$ ($\approx 50\text{k}$). However, the score is not calculated directly for $N = 50\text{k}$, but instead the generated images are broken up into chunks of size $N / n_{\text{splits}}$ and the estimator is applied repeatedly on these chunks to compute a mean and standard deviation of the Inception Score. Typically, $n_{\text{splits}} = 10$. For datasets like ImageNet, where there are 1000 classes in the original dataset, $N / n_{\text{splits}} = 5\text{k}$ samples are not enough to get good statistics on the marginal class distribution of generated images $\hat{p}(y)$ through the method described in Equation 5 [3]. Furthermore, by introducing the parameter $n_{\text{splits}}$ we unnecessarily introduce an extra parameter that can change the final score, as shown in Table 2.
[3] ImageNet also has a skew in its class distribution, so we should be careful to train on a subset of ImageNet that has a uniform distribution over classes when applying this metric or account for it in the calculation of the metric.
$n_{\text{splits}}$ | 1 | 2 | 5 | 10 | 20 | 50 | 100 | 200 |

mean score | 9.9147 | 9.9091 | 9.8927 | 9.8669 | 9.8144 | 9.6653 | 9.4523 | 9.0884 |
standard deviation | 0 | 0.00214 | 0.1010 | 0.1863 | 0.2220 | 0.3075 | 0.3815 | 0.4950 |
This dependency on $n_{\text{splits}}$ can be removed by computing $\hat{p}(y)$ over the entire generated dataset and by removing the exponential from the calculation of the Inception Score, such that the average value will be the same no matter how you choose to batch the generated images. Also, by removing the exponential (which the original authors included only for aesthetic purposes), the Inception Score is now interpretable, in terms of mutual information, as the reduction in uncertainty of an image’s ImageNet class given that the image is emitted by the generator $G$.
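The difference between the two ways of batching can be sketched as follows. This is an illustration on synthetic predictions; the function names, the `eps` guard, and the Dirichlet stand-in data are assumptions of this sketch. `split_scores` recomputes the marginal inside each chunk and exponentiates, whereas `improved_score` averages the KL-divergence against the marginal of the whole dataset and leaves out the exponential.

```python
import numpy as np

def mean_kl(p_yx, p_ref, eps=1e-12):
    """Average KL(p(y|x_i) || p_ref) in nats over the rows of p_yx."""
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_ref + eps)), axis=1)
    return float(kl.mean())

def split_scores(p_yx, n_splits):
    """Exponentiated score per chunk, each chunk using its own marginal."""
    chunks = np.array_split(p_yx, n_splits)
    return np.array([np.exp(mean_kl(c, c.mean(axis=0))) for c in chunks])

def improved_score(p_yx, n_splits=1):
    """Log-scale score with the marginal taken over the whole dataset."""
    p_hat = p_yx.mean(axis=0)
    chunks = np.array_split(p_yx, n_splits)
    # Weight each chunk by its size so unequal chunks leave the mean unchanged.
    return sum(len(c) * mean_kl(c, p_hat) for c in chunks) / len(p_yx)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p_yx = rng.dirichlet(np.full(100, 0.1), size=2000)  # synthetic stand-in
    for n in (1, 2, 5, 10, 20):
        s = split_scores(p_yx, n)
        print(n, s.mean(), s.std(), improved_score(p_yx, n))
```

The last column is the same for every number of splits, while the first two columns change with it. Because the chunk-local marginal is the distribution that minimizes the summed KL-divergence within a chunk, the per-chunk log scores can only be at or below their whole-dataset counterparts.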
The new Improved Inception Score is as follows
$$S(G) = \frac{1}{N} \sum_{i=1}^{N} D_{KL}\big(p(y | x^{(i)}) \,\|\, \hat{p}(y)\big) \tag{7}$$
and it improves both calculation and interpretability of the Inception Score. To calculate the average value, the dataset can be batched into any number of splits without changing the answer, and the variance should be calculated over the entire dataset (i.e. $n_{\text{splits}} = 1$).
4.2 Problems with Popular Usage of Inception Score
4.2.1 Usage beyond ImageNet dataset
Though this has been pointed out elsewhere (Rosca et al., 2017), it is worth restating: applying the Inception Score to generative models trained on datasets other than ImageNet gives misleading results. The most common use of the Inception Score on non-ImageNet datasets is for generative models trained on CIFAR-10, because it is quite a bit smaller and more manageable to train on than ImageNet. We have also seen the score used on datasets of bedrooms, flowers, celebrity faces, and more. The original proposal of the Inception Score was for the evaluation of models trained on CIFAR-10.
As discussed in Section 3.2, the intuition behind the usefulness of the Inception Score lies in its ability to recover good estimates of $p(y)$, the marginal class distribution across the set of generated images $x$, and of $p(y|x)$, the conditional class distribution for generated images $x$. As shown in Table 3, several of the top 10 predicted classes for CIFAR images are obscure and confusing, suggesting that the predicted marginal distribution is far from correct and casting doubt on the first assumption underlying the score.
Top 10 Inception Score Classes  CIFAR10 Classes 

Moving Van  Airplane 
Sorrel (garden herb)  Automobile 
Container Ship  Bird 
Airliner  Cat 
Threshing Machine  Deer 
Hartebeest (antelope)  Dog 
Amphibian  Frog 
Japanese Spaniel (dog breed)  Horse 
Fox Squirrel  Ship 
Milk Can  Truck 
Since the classes in ImageNet and CIFAR-10 do not line up identically, we cannot expect perfect alignment between the classes predicted by the Inception Network and the actual classes within CIFAR-10. Nevertheless, there are many classes in ImageNet that align more appropriately with classes in CIFAR than some of those chosen by the Inception Network. One of the reasons for the promotion of bizarre classes (e.g. milk can, fox squirrel) is also that ImageNet contains many more specific categories than CIFAR, and thus the probability of Cat is spread out over the many different breeds of cat, leading to a higher entropy in the conditional distribution. This is another reason that testing on a network trained on a wholly separate dataset is a poor choice.
The second assumption, that the distribution over classes $p(y|x)$ will be low entropy, also does not hold to the degree that we would hope. The average entropy of the conditional distribution $p(y|x)$ conditioned on an image from the training set of CIFAR is 4.664 bits, whereas the average entropy conditioned on a uniformly random image (pixel values uniform between 0 and 255) is 6.512 bits, a modest increase relative to the $\log_2 1000 \approx 9.97$ bits of entropy possible. For comparison, the average entropy of $p(y|x)$ conditioned on images in the ImageNet validation set is 1.97 bits. As such, the entropy of the conditional class distribution on CIFAR is closer to that of random images than to the actual images in ImageNet, casting doubt on the second assumption underlying the Inception Score.
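The entropy comparison above is measured in bits. A minimal sketch of how such an average conditional entropy could be computed from an array of predicted class probabilities is given below; the function name is an assumption of this sketch, and the three entropy values in the usage section are simply the figures quoted in the text.

```python
import numpy as np

def mean_conditional_entropy_bits(p_yx):
    """Average entropy, in bits, of the rows p(y | x_i) of an (N, K) array."""
    p_yx = np.asarray(p_yx, dtype=float)
    safe = np.where(p_yx > 0, p_yx, 1.0)  # 0 * log2(0) treated as 0
    return float(np.mean(-np.sum(p_yx * np.log2(safe), axis=1)))

if __name__ == "__main__":
    max_bits = np.log2(1000)  # entropy of a uniform distribution over 1000 classes
    cifar_bits, random_bits, imagenet_bits = 4.664, 6.512, 1.97  # values quoted in the text
    print(max_bits)
    # Gap between random images and CIFAR images as a fraction of the maximum.
    print((random_bits - cifar_bits) / max_bits)
    # Distance of CIFAR from random images versus from ImageNet validation images.
    print(random_bits - cifar_bits, cifar_bits - imagenet_bits)
```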
Given the premise of the score, it makes quite a bit more sense to use the Inception Score only when the Inception Network has been trained on the same dataset as the generative model. Thus the original Inception Score should be used only for ImageNet generators, and its variants should use models trained on the specific dataset in question.
4.2.2 Optimizing the Inception Score (indirectly & implicitly)
As mentioned in the original proposal, the Inception Score should only be used as a “rough guide” to evaluating generative models, and directly optimizing the score will lead to the generation of adversarial examples (Szegedy et al., 2013). It should also be noted that optimizing the metric indirectly by using it for model selection will similarly tend to produce models that, though they may achieve a higher Inception Score, tend toward adversarial examples. It is not uncommon in the literature to see algorithms use the Inception Score as a metric to optimize early stopping, hyperparameter tuning, or even model architecture. Furthermore, by promoting models that achieve high Inception Scores, the generative modeling community similarly optimizes implicitly towards adversarial examples, though this effect will likely only be significant if the Inception Score continues to be optimized for within the community over a long time scale.
In the appendix (Achieving High Inception Scores), we show how to achieve high Inception Scores by gently altering the output of a WGAN to create examples that achieve a nearly perfect Inception Score, despite looking no more like natural images than the original WGAN output. A few such images are shown in Figure 1, which achieve an Inception Score of 900.15.
4.2.3 Not Reporting Overfitting
It is clear that a generative algorithm that memorized an appropriate subset of the training data would perform extremely well in terms of Inception Score, and in some sense we can treat the score of a validation set as an upper bound on the possible performance of a generative algorithm. Thus, it is extremely important when reporting the Inception Score of an algorithm to include some alternative score demonstrating that the model is not overfitting to training data, validating that the high score achieved is not simply replaying the training data. Nevertheless, in many works the Inception Score is treated as a holistic metric that can summarize the performance of the algorithm in a single number. In the generative modeling community, we should not use the existence of a metric that correlates with human judgment as an excuse to exclude more thorough analysis of the generative technique in question.
5 Conclusion
Deep learning is an empirical subject. In an empirical subject, success is determined by using evaluation metrics, developed and accepted by researchers within the community, to measure performance on tasks that capture the essential difficulty of the problem at hand. Thus, it is crucial to have meaningful evaluation metrics in order to make scientific progress in deep learning. An outstanding example of successful empirical research within machine learning is the Large Scale Visual Recognition Challenge benchmark for computer vision tasks that has arguably produced most of the greatest computer vision advances of the last decade (Russakovsky et al., 2015). This competition has served and continues to serve as a perfect sandbox to develop, test, and verify hypotheses about visual recognition systems. Developing common tasks and evaluative criteria can be more difficult outside such narrow domains as visual recognition, but we think it is worthwhile for generative modeling researchers to devote more time to rigorous and consistent evaluative methodologies. This paper marks an attempt to better understand popular evaluative methodologies and make the evaluation of generative models more consistent and thorough.
In this note, we highlighted a number of suboptimalities of the Inception Score and explicated some of the difficulties in designing a good metric for evaluating generative models. Given that our metrics to evaluate generative models are far from perfect, it is important that generative modeling researchers continue to devote significant energy to the evaluation and validation of new techniques and methods.
Acknowledgements
This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1656518.
References
Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
Breuleux et al. (2010) Breuleux, O., Bengio, Y., and Vincent, P. Unlearning for better mixing. Universite de Montreal/DIRO, 2010.
Che et al. (2016) Che, T., Li, Y., Jacob, A. P., Bengio, Y., and Li, W. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.
Chollet et al. (2015) Chollet, F. et al. Keras. https://github.com/fchollet/keras, 2015.
Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.
Esteban et al. (2017) Esteban, C., Hyland, S. L., and Rätsch, G. Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs. ArXiv e-prints, June 2017.
Farimani et al. (2017) Farimani, A. B., Gomes, J., and Pande, V. S. Deep learning the physics of transport phenomena. CoRR, abs/1709.02432, 2017. URL http://arxiv.org/abs/1709.02432.
Goodfellow et al. (2014a) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014a.
Goodfellow et al. (2014b) Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014b.
Guo et al. (2017) Guo, J., Lu, S., Cai, H., Zhang, W., Yu, Y., and Wang, J. Long text generation via adversarial training with leaked information. CoRR, abs/1709.08624, 2017. URL http://arxiv.org/abs/1709.08624.
Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6629–6640, 2017.
Im et al. (2016) Im, D. J., Kim, C. D., Jiang, H., and Memisevic, R. Generative adversarial metric. 2016.
Isola et al. (2016) Isola, P., Zhu, J., Zhou, T., and Efros, A. A. Image-to-image translation with conditional adversarial networks. CoRR, abs/1611.07004, 2016. URL http://arxiv.org/abs/1611.07004.
Karras et al. (2017) Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
Ledig et al. (2016) Ledig, C., Theis, L., Huszar, F., Caballero, J., Aitken, A. P., Tejani, A., Totz, J., Wang, Z., and Shi, W. Photo-realistic single image super-resolution using a generative adversarial network. CoRR, abs/1609.04802, 2016. URL http://arxiv.org/abs/1609.04802.
Lucic et al. (2017) Lucic, M., Kurach, K., Michalski, M., Gelly, S., and Bousquet, O. Are GANs created equal? A large-scale study. arXiv preprint arXiv:1711.10337, 2017.
Mogren (2016) Mogren, O. C-RNN-GAN: continuous recurrent neural networks with adversarial training. CoRR, abs/1611.09904, 2016. URL http://arxiv.org/abs/1611.09904.
Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.
Qiantong et al. (2018) Qiantong, X., Gao, H., Yang, Y., Chuan, G., Yu, S., Felix, W., and Weinberger, K. An empirical study on evaluation metrics of generative adversarial networks. arXiv preprint arXiv:1806.07755, 2018.
Rosca et al. (2017) Rosca, M., Lakshminarayanan, B., Warde-Farley, D., and Mohamed, S. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.
Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.
Szegedy et al. (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013. URL http://arxiv.org/abs/1312.6199.
Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
Theis et al. (2015) Theis, L., Oord, A. v. d., and Bethge, M. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
van den Oord & Dambre (2015) van den Oord, A. and Dambre, J. Locally-connected transformations for deep GMMs. In International Conference on Machine Learning (ICML): Deep Learning Workshop, pp. 1–8, 2015.
Zhu et al. (2017) Zhu, J., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017. URL http://arxiv.org/abs/1703.10593.
Proof of Equation 2
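$$\ln(\mathrm{IS}(G)) = \mathbb{E}_{x \sim p_g}\, D_{KL}\big(p(y|x) \,\|\, p(y)\big) \tag{8}$$
$$= \sum_{x} p_g(x)\, D_{KL}\big(p(y|x) \,\|\, p(y)\big) \tag{9}$$
$$= \sum_{x} p_g(x) \sum_{i} p(y=i|x) \ln \frac{p(y=i|x)}{p(y=i)} \tag{10}$$
$$= \sum_{x} \sum_{i} p(x, y=i) \ln \frac{p(x, y=i)}{p_g(x)\, p(y=i)} \tag{11}$$
$$= I(y; x) \tag{12}$$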
Proof of Equation 4
We can derive an upper bound of Equation 3,
$$\ln(\mathrm{IS}(G)) = H(y) - H(y|x) \le H(y) \le \ln(1000) \tag{13}$$
The first inequality is because entropy is always non-negative and the second inequality is because the highest entropy discrete distribution is the uniform distribution, which has entropy $\ln(1000)$ as there are 1000 classes in ImageNet. Taking the exponential of our upper bound on the log IS, we find that the maximum possible IS is 1000. We can also find a lower bound
$$\ln(\mathrm{IS}(G)) = H(y) - H(y|x) \ge 0 \tag{14}$$
because the conditional entropy $H(y|x)$ is never greater than the unconditional entropy $H(y)$. Again, taking the exponential of our lower bound, we find that the minimum possible IS is 1. We can combine our two inequalities to get the final expression,
$$1 \le \mathrm{IS}(G) \le 1000 \tag{15}$$
Achieving High Inception Scores
We repeat Equation 13 here for the convenience of the reader
$$\ln(\mathrm{IS}(G)) = H(y) - H(y|x) \le H(y) \le \ln(1000)$$
It should be relatively clear now how we can achieve an Inception Score of 1000. We require the following:

$H(y) = \ln(1000)$. We can achieve this by making $p(y)$ the uniform distribution.

$H(y|x) = 0$. We can achieve this by making $p(y=i|x) = 1$ for one $i$ and $0$ for all of the others.
Since the Inception Network is differentiable, we have access to the gradient of the output $p(y=i|x)$ with respect to the input $x$. We can then use this gradient to repeatedly update our image to force $p(y=i|x) = 1$.
Let’s make this more concrete. Given a class $i$, we can sample an image $x$ from some initial distribution, then repeatedly update $x$ to maximize $p(y=i|x)$ for some number of iterations. Our resulting generator cycles from $i = 1$ to $i = 1000$ repeatedly, outputting the image that is the result of the above optimization procedure. This procedure is identical to the Fast Gradient Sign Method (FGSM) for adversarial attacks against neural networks (Goodfellow et al., 2014b). In the original proposal of the Inception Score, the authors noted that directly optimizing it would lead to adversarial examples (Salimans et al., 2016).
In theory, it should achieve a near-perfect Inception Score as long as the number of iterations is suitably large. The full generative algorithm is summarized in Algorithm 1. We note that the replay attack is equivalent to the initial distribution being the empirical distribution of the training data and the number of iterations or the step size being equal to zero.
We can realize this algorithm by fixing the number of iterations and the step size, and taking the initial distribution to be a uniform distribution over images. The resulting generator produces the images shown in the left of Figure 2.
We can make the images more realistic by making the initial distribution a pre-trained Wasserstein GAN (WGAN) (Arjovsky et al., 2017) trained on CIFAR-10. This method produces realistic-looking examples that achieve a near-perfect Inception Score, shown in the right of Figure 2, with the Inception Score of 900.15 reported in Section 4.2.2.
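The procedure described in this appendix can be sketched on a toy problem. The code below is an illustration only: it replaces the Inception Network with a hypothetical linear softmax classifier with random weights `W`, uses 10 classes instead of 1000, starts from uniformly random inputs, and does not clip the inputs to a valid pixel range. The step count, step size, and all function names are assumptions of this sketch.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def adversarial_generator(W, steps=100, step_size=0.1, seed=0):
    """For each class i, push a random input toward p(y=i|x) = 1 by sign-gradient ascent."""
    rng = np.random.default_rng(seed)
    K, D = W.shape
    outputs = []
    for i in range(K):                     # cycle through the classes
        x = rng.uniform(0.0, 1.0, size=D)  # start from a uniformly random input
        for _ in range(steps):
            p = softmax(W @ x)
            grad = W[i] - p @ W            # gradient of log p(y=i|x) for a linear softmax
            x = x + step_size * np.sign(grad)
        outputs.append(softmax(W @ x))
    return np.array(outputs)               # (K, K) array of p(y | x_i)

def inception_score(p_yx, eps=1e-12):
    """exp(mean KL(p(y|x_i) || p_hat(y))) for an (N, K) array of probabilities."""
    p_hat = p_yx.mean(axis=0, keepdims=True)
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_hat + eps)), axis=1)
    return float(np.exp(kl.mean()))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(10, 32))          # hypothetical stand-in classifier weights
    p_yx = adversarial_generator(W)
    print(inception_score(p_yx))           # approaches the number of classes
```

Because each class is targeted exactly once, the marginal is close to uniform while every conditional distribution is close to one-hot, which are the two requirements listed above.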