Recent developments in Generative Adversarial Networks (GANs) have shown rapid improvements in generated image quality and variety. GANs can now generate ImageNet samples at reasonable quality and human faces at a resolution of 1024 by 1024 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. Despite these impressive results, GAN-generated images are difficult to analyze. Current benchmarks for GAN performance consist of the Inception Score (IS) and the Fréchet Inception Distance (FID). FID and IS are adequate for evaluating overall GAN performance on a large dataset. However, they cannot evaluate the quality of a single image. Furthermore, FID and IS are difficult to interpret; we do not know where unrealistic elements occur.
When staring at a GAN-generated image, one cannot help but try to find its flaws. Despite researchers' best attempts, state-of-the-art models still generate strange backgrounds, weird hats, and other artifacts. Given the shortcomings of FID and IS, we propose a novel supervised approach for detecting errors in and ranking GAN-generated images. With our approach, we can detect quality and mode collapse errors at the pixel level, as shown in Figure 1. To train without manually labeled data, we collage generated and real images together with some artifacts to mimic errors in the generated distribution. This defines a third image distribution. Our method detects whether a given pixel is drawn from the real pixel distribution or the generated pixel distribution. Doing so allows us to visualize GAN-generated errors. To evaluate our approach, we propose a performance metric for quality and mode collapse using the detected errors. We show our metric correlates well with FID and with qualitative tests.
The major contributions of this paper can be summarized as follows:
We explore a novel approach for detecting quality and mode collapse errors in GAN-generated images without manually labeled data. Our error detection provides information for individual images at the pixel level. Furthermore, the detected errors can be visualized.
We propose a new performance metric for GANs. Unlike FID and IS, our metric can be used on an individual image and is useful for evaluating a small class of images.
We show our model’s performance on samples from the state-of-the-art approaches BigGAN and StyleGAN, for ImageNet and Flickr respectively. Furthermore, we test our approach on poorly generated samples from an Improved Wasserstein GAN trained on dogs and cats.
We provide detailed analysis on the performance of BigGAN with our proposed performance metric.
2 Related Works
2.1 Generative Adversarial Network
A generative adversarial network (GAN) consists of a discriminator and a generator. The goal of the discriminator is to differentiate between real and generated images. In contrast, the generator’s goal is to create realistic images which can fool the discriminator. The GAN loss proposed by Goodfellow et al. [11], which minimizes the Jensen-Shannon divergence, is as follows:
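The equation referenced here is the standard minimax objective from Goodfellow et al., reconstructed below in its usual form:

```latex
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```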
As the generator is trained with guidance from the discriminator, GAN research has focused on improving the discriminator. Early GANs were unstable during training [11, 12, 13]. As an improvement, Arjovsky et al. proposed using the earth mover distance instead of the Jensen-Shannon divergence [14]. This led to significant improvements in training stability and greater sample variety. The Improved Wasserstein GAN replaced hard weight clipping with a softer gradient penalty, which further improved training [15]. Despite stable training, the Improved Wasserstein GAN has trouble generating realistic samples for datasets such as ImageNet. The next milestone for GANs was SAGAN, which can generate ImageNet at high quality [16]. SAGAN leveraged self-attention layers, which allow connections between distant pixels and improve global image quality. Brock et al. took SAGAN to its limits by greatly increasing model capacity and training batch size [1]. The resulting work, aptly named BigGAN, showed large performance improvements over SAGAN.
Besides generating ImageNet, another challenge for GANs is generating high resolution images. Progressive GAN led to a major breakthrough, generating images of size 1024 by 1024 [18]. As the name implies, Progressive GAN trains from low resolution to high resolution, akin to curriculum learning. Its successor, StyleGAN, increases variety by leveraging style transfer and adding noise between layers [2].
2.2 GAN Evaluation Metrics
Given that GANs synthesize novel data, it remains difficult to measure their performance without resorting to human judgment. To overcome this problem, Salimans et al. proposed the Inception Score (IS) as a performance metric [19]. As a metric, IS rates image quality and the diversity of the classes generated for a large dataset. To do so, IS leverages class probabilities computed by an Inception model and calculates the difference between the conditional label probability and the marginal probability. However, IS has weaknesses; it does not perform well on datasets other than ImageNet [20]. In contrast, the Fréchet Inception Distance (FID) measures the distance between the real distribution and the fake distribution [21]. As with IS, FID measures both quality and sample diversity. FID computes features from a pretrained Inception model and aggregates statistics for comparison. FID removes the weakness of IS on datasets beyond ImageNet. However, the weakness of both approaches is that they cannot evaluate the performance of an individual sample. Furthermore, neither FID nor IS provides information on why a model did well or poorly. Finally, as a metric for GAN sample diversity, Arora and Zhang proposed the birthday paradox test [22]. This approach consists of analyzing generated images and finding the size of the dataset needed before duplicate samples appear. However, this approach is qualitative and requires human judgment.
3 Error Detection for GAN
To discover errors generated by GANs, we must learn the difference between the real distribution p_r and the generated distribution p_g, akin to a GAN discriminator. In contrast to GAN discriminators, which are trained to classify whole images as real or fake, our model is trained to detect salient features of the generated distribution at the pixel level, which we refer to as errors. The advantage of our approach is that it provides richer information for evaluating and understanding GANs. Furthermore, a pixel-level loss can guide the model to learn finer details. Let x be a sampled image. Let E equal 0 when no error exists and let E equal 1 when an error exists. Furthermore, let P(E | x, i, j) be the probability of error conditioned on the image x and the pixel location (i, j). Thus, an idealized loss would be:
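A plausible reconstruction of the idealized loss, written as a per-pixel cross-entropy (notation: image x of width W and height H, ground-truth error label E_ij at pixel (i, j), and predicted error probability P(E=1 | x, i, j)):

```latex
L = -\frac{1}{WH}\sum_{i=1}^{W}\sum_{j=1}^{H}
    \Big[ E_{ij}\,\log P(E{=}1 \mid x, i, j)
        + (1 - E_{ij})\,\log\big(1 - P(E{=}1 \mid x, i, j)\big) \Big]
```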
where W and H are the dimensions of the image x. However, we do not know where errors are located, so we cannot train directly using p_r or p_g. Instead, we create a distribution p_c which consists of collaged images drawn from both p_r and p_g. p_c is meant to mimic p_r and p_g, but with knowledge of where errors occur. Here, errors refer to the pixels in the collaged image drawn from p_g. As such, we use the following approximation:
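A plausible reconstruction of the approximate loss, splitting the per-pixel cross-entropy into its two labeled terms with weights α and β (the weight symbols are assumptions):

```latex
L \approx -\frac{1}{WH}\sum_{i=1}^{W}\sum_{j=1}^{H}
    \Big[ \alpha\,(1 - E_{ij})\,\log\big(1 - P(E{=}1 \mid x, i, j)\big)
        + \beta\,E_{ij}\,\log P(E{=}1 \mid x, i, j) \Big]
```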
The two terms in the equation can be explained as penalizing predicted errors where there are none, weighted by α, and penalizing predicted non-errors where errors exist, weighted by β. We split our equation into two terms so that we can separately weight the loss for predicting a real pixel to be an error. Predicting a realistic generated sample as real is acceptable; however, predicting a real sample as generated should not be. Furthermore, we add regularization to help the model generalize. With P(E | x, i, j) as the computed result for a generated image, we can define a new performance metric for GANs, the averaged Pixel Distance (PD).
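The PD formula itself does not survive in the text above; a minimal sketch, assuming PD is simply the predicted per-pixel error probability averaged over pixels (so it can be computed for a whole image or, as noted below, any sub-region):

```python
import numpy as np

def averaged_pixel_distance(error_map):
    """Averaged Pixel Distance (PD): the model's predicted per-pixel error
    probabilities averaged over an image. The exact aggregation in the
    paper is not shown; a plain mean is assumed here."""
    return float(np.asarray(error_map, dtype=np.float64).mean())

def region_pd(error_map, top, left, height, width):
    """PD restricted to a rectangular part of the image; per-pixel scores
    make PD over image parts possible."""
    crop = np.asarray(error_map)[top:top + height, left:left + width]
    return averaged_pixel_distance(crop)
```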
Averaged Pixel Distance is similar to FID in that it measures both the quality and mode collapse of GAN-generated samples. The difference is that it can compute a value for an individual image rather than only for a dataset. Furthermore, as we have a score for each pixel, we can even compute PD for parts of an image.
3.1 Training Data and Error Labels
Given we have no knowledge of where errors occur in GAN output, we approximate the GAN-generated image distribution by combining generated and real images. Our method is different from the denoising and image augmentation approaches of past literature [23, 24, 25]. Our goal is not to add noise to improve model robustness but to learn errors. Instead, we collage patches of a generated image onto a real image. To randomly combine images, one approach, from Liu et al., is to create irregular masks [7]. However, their approach is binary: a pixel is either masked or not. Instead, we leverage Perlin noise from computer graphics [26]. Perlin noise is a gradient noise used to generate textures and cloud-like patterns. As seen in Figure 2(a), this method can create continuous random noise. Continuity is beneficial as it improves the model’s ability to learn a difficult problem.
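A minimal sketch of this collaging step, using bilinearly interpolated value noise as a stand-in for true gradient Perlin noise (the grid size, threshold, and function names are illustrative assumptions):

```python
import numpy as np

def smooth_noise_mask(h, w, grid=4, rng=None):
    """Smooth random mask in [0, 1]: bilinearly upsample a coarse random
    grid (value noise; a stand-in for true gradient Perlin noise)."""
    rng = np.random.default_rng() if rng is None else rng
    coarse = rng.random((grid + 1, grid + 1))
    ys = np.linspace(0, grid, h)
    xs = np.linspace(0, grid, w)
    y0 = np.floor(ys).astype(int).clip(0, grid - 1)
    x0 = np.floor(xs).astype(int).clip(0, grid - 1)
    ty = (ys - y0)[:, None]          # vertical interpolation weight
    tx = (xs - x0)[None, :]          # horizontal interpolation weight
    c00 = coarse[np.ix_(y0, x0)]
    c01 = coarse[np.ix_(y0, x0 + 1)]
    c10 = coarse[np.ix_(y0 + 1, x0)]
    c11 = coarse[np.ix_(y0 + 1, x0 + 1)]
    top = c00 * (1 - tx) + c01 * tx
    bot = c10 * (1 - tx) + c11 * tx
    return top * (1 - ty) + bot * ty

def collage(real, fake, mask, threshold=0.5):
    """Paste generated pixels where the smooth mask is high; the binary
    mask doubles as the per-pixel error label E for training."""
    m = (mask > threshold)[..., None]
    return np.where(m, fake, real), m[..., 0].astype(np.float32)
```

Thresholding the smooth noise yields connected, irregular regions rather than isolated pixels, which keeps the collage boundaries continuous.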
Besides merging real and generated images, we further add artifacts using real pixels. To do so, we copy a circular area of pixels from a sampled location on the real image. We then replicate these pixels, with rotations, at a uniformly sampled location on the same image. The purpose is to create artifacts such as multiple heads or eyes. Such artifacts help the model detect realistic features at abnormal locations. The circular artifact can be seen in Figure 2(b).
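This copy-and-rotate artifact can be sketched as follows (the sampling scheme and the restriction to 90-degree rotations are assumptions; the paper does not specify them):

```python
import numpy as np

def circular_copy_artifact(img, radius, rng=None):
    """Copy a circular patch of real pixels and paste a rotated copy at a
    random location, creating duplicated-feature artifacts."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    # boolean disk mask of diameter 2*radius + 1
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disk = (yy ** 2 + xx ** 2) <= radius ** 2
    # sample source and destination centers away from the border
    sy = rng.integers(radius, h - radius)
    sx = rng.integers(radius, w - radius)
    dy = rng.integers(radius, h - radius)
    dx = rng.integers(radius, w - radius)
    patch = img[sy - radius:sy + radius + 1, sx - radius:sx + radius + 1].copy()
    # rotate by a random multiple of 90 degrees (a disk maps onto itself)
    patch = np.rot90(patch, k=int(rng.integers(4)))
    out = img.copy()
    region = out[dy - radius:dy + radius + 1, dx - radius:dx + radius + 1]
    region[disk] = patch[disk]
    return out, (dy, dx)
```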
3.2 Network Design
Our network takes an image as input and generates a value for each pixel, so the output has the same spatial shape as the input. As the output dimensions match the input, we choose an auto-encoder architecture. Our architecture is based on the U-Net, which has skip links connecting the encoder and decoder layers [27]. Our deep auto-encoder consists of 5 convolution layers per pooling stage, with skip connections within pooling blocks inspired by ResNet [28]. We find a deeper network necessary for learning variegated datasets such as ImageNet. In addition, we concatenate features from a pretrained ImageNet model into our layers; we add features from after pool2, pool3, and pool4 of a VGG model. Finally, to improve the model’s ability to learn relationships between distant pixels, we add the self-attention layers proposed by SAGAN in both the encoder and decoder [16].
4 Experiments
As part of our evaluation, we select three GAN models with different datasets. As the baseline, we select an Improved Wasserstein GAN trained on the Dogs vs. Cats dataset to showcase our model’s performance on samples of poor quality. We further select StyleGAN and BigGAN as they are the cutting edge of GAN models and can demonstrate our approach’s strengths and weaknesses. StyleGAN is trained on the Flickr-Faces-HQ dataset, consisting of 70 000 human faces at 1024 by 1024, while BigGAN is trained on the 1000 classes of ImageNet. Furthermore, ImageNet contains class labels which we can use to analyze performance per class. Note that for the Improved Wasserstein GAN, we train our own model. In contrast, we use the pretrained models provided by Nvidia and DeepMind for StyleGAN and BigGAN respectively. Given that we are limited by computing resources, we resize training data to 128 by 128 for Flickr, 64 by 64 for ImageNet, and 64 by 64 for Dogs vs. Cats. Training our model takes on average 48 hours for the best results. Both StyleGAN and BigGAN use a truncation hyper-parameter to control sample quality at the expense of diversity [2, 1]. For truncation, we choose 0.7 for StyleGAN for better sample quality. For BigGAN, we use a truncation ranging from 0.1 to 1, sampled uniformly per image. Finally, we use 16 000 samples to compute FID, which is sufficient for comparison but might lead to worse scores than larger sample counts. Due to these factors, our FID scores are worse than the best scores reported in both papers. We emphasize that this paper is not meant to improve the FID of the best models but to show our approach can better evaluate images generated by GANs.
4.1 Evaluation Metrics
To evaluate our approach, we rely on both qualitative and quantitative metrics. First, our model can rank individual images, as seen in Figure 3. Furthermore, we can see the difference between classes as ranked by our approach in Figure 4. Second, we show the PD score for each pixel alongside the corresponding images for comparison in Figure 1. For mode collapse evaluation, we can use the birthday paradox test on generated images sampled from a single class; see the images in the Appendix.
For quantitative analysis, we show PD correlates well with FID. We sort the validation dataset by PD and divide it into splits. Split one contains the worst-ranked images, and each subsequent split improves in quality. Each individual split’s FID is computed against the same real dataset of size 16 000. For ImageNet, we generate the same number of samples per class and sort within each class. The reasoning is that each class differs in quality and mode collapse; sorting directly would create uneven datasets and lead to misleading FID. Finally, we test our model trained on ImageNet against StyleGAN and Improved Wasserstein samples. For splits of 4, 8, and 32, we use validation datasets of size 64 000, 128 000, and 512 000 respectively.
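The splitting procedure above can be sketched as follows (function name assumed; the FID computation per split is omitted):

```python
import numpy as np

def pd_splits(pd_scores, n_splits):
    """Sort image indices by PD, worst (highest PD) first, and divide
    them into equal splits; FID is then computed per split against
    the same real dataset."""
    order = np.argsort(pd_scores)[::-1]  # highest PD = worst-ranked first
    return np.array_split(order, n_splits)
```

For ImageNet, the same procedure would be applied within each class before regrouping the splits, matching the per-class sorting described above.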
4.2 Training details
For training, we use the Adam optimizer with a learning rate of 0.0002. For each dataset, we must tune hyper-parameters for the best results. The loss weights α and β are set to 5 and 1, and the regularization weight to 0.03. However, for Flickr, an α of 2 and a β of 0.3 led to better mode collapse detection. For Dogs vs. Cats, we further add image segmentation and increase the weight of segmented areas in the loss. As shown in Table 1, this approach did not improve results significantly for Dogs vs. Cats, and thus we choose not to use segmentation for Flickr and ImageNet. When training on ImageNet, we combine generated images with real images of the same class; otherwise, with 1000 classes, too much variance in the collaged image can prevent the model from learning. This is not necessary for Flickr and Dogs vs. Cats.
| Improved Wass, Dogs and Cats | S1 | S2 | S3 | S4 | Random Dataset |
|---|---|---|---|---|---|
| 3 Layers per Pool | 117.7 | 101.9 | 96.4 | 91.1 | 100.16 |
| 5 Layers per Pool | 123.8 | 105.5 | 95.0 | 86.2 | 101.7 |
| 5 Layers per Pool + Segmentation | 123.1 | 105.2 | 94.5 | 84.1 | 100.52 |

| StyleGAN, Trunc=0.7 | S1 | S2 | S3 | S4 | Random Dataset |
|---|---|---|---|---|---|
| Mode Collapse Focused | 30.50 | 17.82 | 12.97 | 11.04 | 14.06 |

| BigGAN, 1000 Classes | S1 | S2 | S3 | S4 | Random Dataset |
|---|---|---|---|---|---|
| Truncation Uniform [0.5, 1] | 20.06 | 17.72 | 15.91 | 14.20 | 17.18 |
| Truncation Uniform [0.1, 1] | 27.08 | 24.89 | 20.96 | 16.91 | 21.9 |
4.3 FID Analysis
FID as a metric considers both quality and mode collapse. However, when the FID score is poor, quality is weighted heavily compared to mode collapse. In contrast, when the FID score is good, mode collapse is weighted heavily over quality. In many cases, we see datasets of "higher quality" images rated lower than images with clear artifacts because of mode collapse. When we split a dataset into 4 splits, the overall FID score of the whole dataset is not the average FID of the individual splits. This is due to mode collapse considerations: if we split ImageNet into two groups of 500 classes each, the average FID score of the groups will be much higher than the FID of the two groups combined.
4.4 Improved Wasserstein Analysis
For the Improved Wasserstein GAN, our model performs well at detecting errors, as shown in Table 1. Despite improved image quality, there is significant mode collapse among the best-rated samples. However, because overall sample quality is low, quality dominates the metric and our best samples still score best on FID. To test the shape of the PD curve and its correlation with FID, we generate 512 000 images and split them into 32 groups. As we see in Figure 5(a), the FID score of the worst group (0) is worse by a large margin. This is due in part to mode collapse and bad sample quality in the worst samples. Furthermore, the consistent decrease in FID demonstrates that our model is not assigning PD randomly but with purpose. Qualitatively, errors occur on the object while backgrounds tend to be well generated. Samples are shown in Figure 5(b).
4.5 Style GAN Analysis
The samples from StyleGAN seem to have no errors at first glance. We find instead that StyleGAN still suffers from some quality issues and mode collapse. By adjusting the loss weights used to compute PD, we can better account for image quality. However, the model prefers bright images, which results in a worse FID score for our best-ranked split, as we see in Table 2. We find it impressive that our model can find badly generated samples in a dataset of such high quality; see Figures 1 and 6. In contrast, by weighting PD toward mode collapse, we can detect mode collapse and show impressive FID performance in Table 2 and the appendix. Overall, StyleGAN’s sample quality is high; there are few samples with severe quality issues.
4.6 BigGAN Analysis
As seen in Table 3 and Figure 7(b), our results correlate well with FID. Qualitatively, we observe that the worst samples suffer from heavy mode collapse and quality issues, as seen in the appendix. Interestingly, as we add more truncation (more mode collapse), the difference in FID narrows due to reduced differences between images with more mode collapse. Up until now, we have only shown how our metric correlates with FID. However, with PD and labels from ImageNet, we can dig deeper into BigGAN’s performance. First, with truncation, we show PD correlates with mode collapse, as seen in Figure 7(a). Furthermore, we observe that the PD score can vary significantly between classes. Samples for different classes can be seen in Figure 3 and the appendix. To compare PD scores between classes, we offset each generated class’s PD by its real class PD. Qualitatively, we observe BigGAN is heavily mode collapsed within each class compared to its real data counterpart, as shown in the appendix. From Figure 7(a), we see around 200 of BigGAN’s classes suffer from high PD. Indeed, a dozen classes are completely mode collapsed. Observing the best scoring classes, we discover three patterns: comparatively reduced mode collapse within the class, high image quality, and limited diversity relative to the real class. When we visualize PD for different classes, the locations which suffer from high PD vary. For instance, many classes receive high PD in background areas, while others, such as dog classes, often receive high PD on the face. For certain dog classes, BigGAN seems to learn the same face, while for snakes the background is often the same. Finally, we test our model pretrained on ImageNet against StyleGAN and Improved Wasserstein results. Results in Table 4 show that our model can evaluate other GAN models. However, the results are not as impressive as the scores from directly trained models; the model has trouble detecting mode collapse given that it was not trained on that data.
5 Conclusion and Future Work
In this paper, we show a new approach to evaluating GAN performance in terms of mode collapse and image quality. By combining real and generated images at the pixel level, our approach enables detailed analysis of GANs and can be easily visualized. Furthermore, our approach correlates well with FID and can be used for ranking generated samples. However, our model must learn directly from the generated and real data, which requires some training and tuning. The next step would be improving its ability to use trained models on other datasets. Furthermore, we are looking into using our model directly as a GAN discriminator. Although the stability of such a model is unknown, such a discriminator could provide more information and a better loss to the generator.
References

-  Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
-  Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
-  Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
-  Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
-  Raymond A Yeh, Chen Chen, Teck Yian Lim, Alexander G Schwing, Mark Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5485–5493, 2017.
-  Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li. High-resolution image inpainting using multi-scale neural patch synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6721–6729, 2017.
-  Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. In The European Conference on Computer Vision (ECCV), September 2018.
-  Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
-  Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised map inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.
-  Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
-  Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.
-  Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
-  Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
-  Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
-  Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915, 2017.
-  Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
-  Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
-  Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.
-  Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
-  Sanjeev Arora and Yi Zhang. Do gans actually learn the distribution? an empirical study. arXiv preprint arXiv:1706.08224, 2017.
-  Viren Jain and Sebastian Seung. Natural image denoising with convolutional networks. In Advances in neural information processing systems, pages 769–776, 2009.
-  Junyuan Xie, Linli Xu, and Enhong Chen. Image denoising and inpainting with deep neural networks. In Advances in neural information processing systems, pages 341–349, 2012.
-  David Warde-Farley and Yoshua Bengio. Improving generative adversarial networks with denoising feature matching. 2016.
-  Ken Perlin. An image synthesizer. 1985.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.