Generative Adversarial Networks (GANs) are a class of deep generative models which aim to learn a target distribution in an unsupervised fashion. While they have been successfully applied to many problems, training a GAN is a notoriously challenging task and requires a significant amount of hyperparameter tuning, neural architecture engineering, and a nontrivial amount of "tricks". The success in many practical applications coupled with the lack of a measure to quantify the failure modes of GANs resulted in a plethora of proposed losses, regularization and normalization schemes, and neural architectures. In this work we take a sober view of the current state of GANs from a practical perspective. We reproduce the current state of the art and go beyond fairly exploring the GAN landscape. We discuss common pitfalls and reproducibility issues, open-source our code on GitHub, and provide pretrained models on TensorFlow Hub.
Deep generative models are a powerful class of (mostly) unsupervised machine learning models. These models were recently applied to great effect in a variety of applications, including image generation, learned compression, and domain adaptation (Brock et al., 2019; Menick & Kalchbrenner, 2019; Karras et al., 2019; Lucic et al., 2019; Isola et al., 2017; Tschannen et al., 2018).
Generative adversarial networks (GANs) (Goodfellow et al., 2014) are one of the main approaches to learning such models in a fully unsupervised fashion. The GAN framework can be viewed as a two-player game where the first player, the generator, is learning to transform some simple input distribution to a complex high-dimensional distribution (e.g. over natural images), such that the second player, the discriminator, cannot tell whether the samples were drawn from the true distribution or were synthesized by the generator. The solution to the classic minimax formulation (Goodfellow et al., 2014) is the Nash equilibrium where neither player can improve unilaterally. As the generator and discriminator are usually parameterized as deep neural networks, this minimax problem is notoriously hard to solve.
In practice, the training is performed using stochastic gradient-based optimization methods. Apart from inheriting the optimization challenges associated with training deep neural networks, GAN training is also sensitive to the choice of the loss function optimized by each player, neural network architectures, and the specifics of regularization and normalization schemes applied. This has resulted in a flurry of research focused on addressing these challenges (Goodfellow et al., 2014; Salimans et al., 2016; Miyato et al., 2018; Gulrajani et al., 2017; Arjovsky et al., 2017; Mao et al., 2017).
Our Contributions In this work we provide a thorough empirical analysis of these competing approaches, and help researchers and practitioners navigate this space. We first define the GAN landscape – the set of loss functions, normalization and regularization schemes, and the most commonly used architectures. We explore this search space on several modern large-scale datasets by means of hyperparameter optimization, considering both “good” sets of hyperparameters reported in the literature, as well as those obtained by sequential Bayesian optimization.
We first decompose the effect of various normalization and regularization schemes. We show that both gradient penalty (Gulrajani et al., 2017) as well as spectral normalization (Miyato et al., 2018) are useful in the context of high-capacity architectures. Then, by analyzing the impact of the loss function, we conclude that the nonsaturating loss (Goodfellow et al., 2014) is sufficiently stable across datasets and hyperparameters. Finally, we show that similar conclusions hold for both popular types of neural architectures used in state-of-the-art models. We then discuss some common pitfalls, reproducibility issues, and practical considerations. We provide reference implementations, including training and evaluation code on GitHub (www.github.com/google/compare_gan), and provide pretrained models on TensorFlow Hub (www.tensorflow.org/hub).
The main design choices in GANs are the loss function, regularization and/or normalization approaches, and the neural architectures. At this point GANs are extremely sensitive to these design choices. This fact coupled with optimization issues and hyperparameter sensitivity makes GANs hard to apply to new datasets. Here we detail the main design choices which are investigated in this work.
Let $P$ denote the target (true) distribution and $Q$ the model distribution. Goodfellow et al. (2014) suggest two loss functions: the minimax GAN and the nonsaturating (NS) GAN. In the former the discriminator minimizes the negative log-likelihood for the binary classification task. In the latter the generator maximizes the probability of generated samples being real. In this work we consider the nonsaturating loss as it is known to outperform the minimax variant empirically. The corresponding discriminator and generator loss functions are
$$\mathcal{L}_D = -\mathbb{E}_{x \sim P}[\log D(x)] - \mathbb{E}_{\hat{x} \sim Q}[\log(1 - D(\hat{x}))], \qquad \mathcal{L}_G = -\mathbb{E}_{\hat{x} \sim Q}[\log D(\hat{x})],$$
where $D(x)$ denotes the probability of $x$ being sampled from $P$. In Wasserstein GAN (WGAN) (Arjovsky et al., 2017) the authors propose to consider the Wasserstein distance instead of the Jensen-Shannon (JS) divergence. The corresponding loss functions are
$$\mathcal{L}_D = -\mathbb{E}_{x \sim P}[D(x)] + \mathbb{E}_{\hat{x} \sim Q}[D(\hat{x})], \qquad \mathcal{L}_G = -\mathbb{E}_{\hat{x} \sim Q}[D(\hat{x})],$$
where the discriminator output $D(x) \in \mathbb{R}$ and $D$ is required to be 1-Lipschitz. Under the optimal discriminator, minimizing the proposed loss function with respect to the generator minimizes the Wasserstein distance between $P$ and $Q$. A key challenge is to ensure the Lipschitzness of $D$. Finally, we consider the least-squares loss (LS) which corresponds to minimizing the Pearson $\chi^2$ divergence between $P$ and $Q$ (Mao et al., 2017). The corresponding loss functions are
$$\mathcal{L}_D = \mathbb{E}_{x \sim P}[(D(x) - 1)^2] + \mathbb{E}_{\hat{x} \sim Q}[D(\hat{x})^2], \qquad \mathcal{L}_G = \mathbb{E}_{\hat{x} \sim Q}[(D(\hat{x}) - 1)^2],$$
where $D(\hat{x}) \in \mathbb{R}$ is the output of the discriminator. Intuitively, this smooth loss function saturates more slowly than the cross-entropy loss.
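As a minimal sketch of the three loss pairs discussed above, the following plain-Python functions compute the discriminator and generator losses from lists of discriminator outputs on real and generated samples. The function names and the batch-averaging convention are illustrative assumptions, and the least-squares pair is written in the standard form of Mao et al. (2017) with target 1 for real and 0 for generated samples; this is not the reference implementation.

```python
import math


def mean(values):
    return sum(values) / len(values)


def ns_losses(d_real, d_fake):
    """Non-saturating GAN; inputs are probabilities D(x) in (0, 1)."""
    loss_d = -mean([math.log(p) for p in d_real]) \
             - mean([math.log(1.0 - p) for p in d_fake])
    loss_g = -mean([math.log(p) for p in d_fake])
    return loss_d, loss_g


def wgan_losses(d_real, d_fake):
    """Wasserstein GAN; inputs are unbounded real-valued scores."""
    loss_d = -mean(d_real) + mean(d_fake)
    loss_g = -mean(d_fake)
    return loss_d, loss_g


def ls_losses(d_real, d_fake):
    """Least-squares GAN; real samples are pushed to 1, generated to 0."""
    loss_d = mean([(s - 1.0) ** 2 for s in d_real]) + mean([s ** 2 for s in d_fake])
    loss_g = mean([(s - 1.0) ** 2 for s in d_fake])
    return loss_d, loss_g
```

In all three cases both players minimize their own loss, so the only thing that changes between variants is how the raw discriminator output is scored.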
Gradient Norm Penalty The idea is to regularize the discriminator $D$ by constraining the norms of its gradients (e.g. $L^2$). In the context of Wasserstein GANs and optimal transport this regularizer arises naturally and the gradient norm is evaluated on the points from the optimal coupling between samples from the target distribution $P$ and the model distribution $Q$ (GP) (Gulrajani et al., 2017). Computing this coupling during GAN training is computationally intensive, and a linear interpolation between these samples is used instead. The gradient norm can also be penalized close to the data manifold which encourages the discriminator to be piecewise linear in that region (Dragan) (Kodali et al., 2017). A drawback of the gradient penalty (GP) regularization scheme is that it can depend on the model distribution $Q$, which changes during training. For Dragan it is unclear to which extent the Gaussian assumption for the manifold holds. In both cases, computing the gradient norms implies a nontrivial running time overhead.
Notwithstanding these natural interpretations for specific losses, one may also consider the gradient norm penalty as a classic regularizer for the complexity of the discriminator (Fedus et al., 2018). To this end we also investigate the impact of an $L^2$ regularization on $D$, which is ubiquitous in supervised learning.
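The interpolation-based gradient penalty described above can be sketched in a few lines. To stay free of autodiff dependencies, the example assumes a toy critic $D(x) = \sum_j v_j \tanh(W_j \cdot x)$ whose input gradient is available in closed form; the critic, the function names, and the two-sided penalty $(\lVert \nabla D \rVert_2 - 1)^2$ in the style of Gulrajani et al. (2017) are illustrative choices, not the code used in the study.

```python
import math
import random


def critic_grad(x, W, v):
    """Gradient w.r.t. x of the toy critic D(x) = sum_j v[j] * tanh(W[j] . x)."""
    grad = [0.0] * len(x)
    for w_row, v_j in zip(W, v):
        a = sum(w * xi for w, xi in zip(w_row, x))
        scale = v_j * (1.0 - math.tanh(a) ** 2)
        for i, w in enumerate(w_row):
            grad[i] += scale * w
    return grad


def gradient_penalty(real, fake, W, v, rng):
    """Mean of (||grad D(x_tilde)||_2 - 1)^2 on random linear interpolates."""
    total = 0.0
    for x, x_hat in zip(real, fake):
        alpha = rng.random()  # one interpolation coefficient per sample pair
        x_tilde = [alpha * a + (1.0 - alpha) * b for a, b in zip(x, x_hat)]
        norm = math.sqrt(sum(g * g for g in critic_grad(x_tilde, W, v)))
        total += (norm - 1.0) ** 2
    return total / len(real)


def l2_penalty(W, v):
    """Classic L2 regularizer: sum of squared critic weights."""
    return sum(w * w for row in W for w in row) + sum(x * x for x in v)
```

Either penalty would be multiplied by the regularization strength and added to the discriminator loss; the extra gradient evaluation per sample is where the running time overhead comes from.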
Discriminator Normalization Normalizing the discriminator can be useful from both the optimization perspective (more efficient gradient flow, more stable optimization), as well as from the representation perspective – the representation richness of the layers in a neural network depends on the spectral structure of the corresponding weight matrices (Miyato et al., 2018).
From the optimization point of view, several normalization techniques commonly applied to deep neural network training have been applied to GANs, namely batch normalization (BN) (Ioffe & Szegedy, 2015) and layer normalization (LN) (Ba et al., 2016). The former was explored in Denton et al. (2015) and further popularized by Radford et al. (2016), while the latter was investigated in Gulrajani et al. (2017). These techniques are used to normalize the activations, either across the batch (BN), or across features (LN), both of which were observed to improve the empirical performance.
From the representation point of view, one may consider the neural network as a composition of (possibly nonlinear) mappings and analyze their spectral properties. In particular, for the discriminator to be a bounded operator it suffices to control the operator norm of each mapping. This approach is followed in Miyato et al. (2018) where the authors suggest dividing each weight matrix, including the matrices representing convolutional kernels, by their spectral norm. It is argued that spectral normalization results in discriminators of higher rank with respect to the competing approaches.
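Dividing a weight matrix by its spectral norm, as described above, requires an estimate of the largest singular value. A minimal sketch using power iteration on a plain list-of-lists matrix is shown below; the helper names, the deterministic start vector, and the fixed iteration count are assumptions made for illustration. Practical implementations typically keep a persistent vector and run a single iteration per training step.

```python
import math


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def normalize(x):
    n = math.sqrt(sum(xi * xi for xi in x))
    return [xi / n for xi in x]


def spectral_norm(W, num_iters=50):
    """Estimate the largest singular value of W by power iteration."""
    Wt = transpose(W)
    u = normalize([1.0] * len(W))  # assumed start vector, not orthogonal to the top direction
    for _ in range(num_iters):
        v = normalize(matvec(Wt, u))
        u = normalize(matvec(W, v))
    return sum(ui * wi for ui, wi in zip(u, matvec(W, v)))


def spectrally_normalize(W, num_iters=50):
    """Return W divided by its estimated spectral norm."""
    sigma = spectral_norm(W, num_iters)
    return [[w / sigma for w in row] for row in W]
```

After normalization the linear map has operator norm close to one, which is what bounds the contribution of each layer to the Lipschitz constant of the discriminator.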
We explore two classes of architectures in this study: deep convolutional generative adversarial networks (DCGAN) (Radford et al., 2016) and residual networks (ResNet) (He et al., 2016), both of which are ubiquitous in GAN research. Recently, Miyato et al. (2018) defined a variation of DCGAN, so-called SNDCGAN. Apart from minor updates (cf. Section 4) the main difference to DCGAN is the use of an eight-layer discriminator network. The details of both networks are summarized in Table 4. The other architecture, ResNet19, is an architecture with five ResNet blocks in the generator and six ResNet blocks in the discriminator, that can operate on $128 \times 128$ images. We follow the ResNet setup from Miyato et al. (2018), with the small difference that we simplified the design of the discriminator.
We focus on several recently proposed metrics well suited to the image domain. For an in-depth overview of quantitative metrics we refer the reader to Borji (2019).
Inception Score (IS) Proposed by Salimans et al. (2016), the IS offers a way to quantitatively evaluate the quality of generated samples. Intuitively, the conditional label distribution of samples containing meaningful objects should have low entropy, and the variability of the samples should be high, which can be expressed as $\mathrm{IS} = \exp\big(\mathbb{E}_{x \sim Q}[d_{\mathrm{KL}}(p(y \mid x), p(y))]\big)$, where $Q$ is the model distribution. The authors found that this score is well-correlated with scores from human annotators. Drawbacks include insensitivity to the prior distribution over labels and not being a proper distance.
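To make the definition concrete, the sketch below computes the Inception Score from a list of per-sample class-probability vectors $p(y \mid x)$. It assumes these vectors have already been produced by a classifier; the function name and input layout are illustrative, and the marginal $p(y)$ is estimated as the average of the conditional distributions.

```python
import math


def inception_score(cond_probs):
    """cond_probs: list of p(y|x) vectors, one per generated sample."""
    n = len(cond_probs)
    num_classes = len(cond_probs[0])
    # Marginal label distribution p(y), estimated over the generated samples.
    marginal = [sum(p[k] for p in cond_probs) / n for k in range(num_classes)]
    kl_sum = 0.0
    for p in cond_probs:
        # KL(p(y|x) || p(y)); terms with zero probability contribute nothing.
        kl_sum += sum(pk * math.log(pk / marginal[k])
                      for k, pk in enumerate(p) if pk > 0.0)
    return math.exp(kl_sum / n)
```

Confident and diverse predictions give a score close to the number of classes, while identical predictions for every sample give a score of one, regardless of how many distinct images per class the model produces.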
Fréchet Inception Distance (FID) In this approach proposed by Heusel et al. (2017) samples from the target distribution $P$ and the model distribution $Q$ are first embedded into a feature space (a specific layer of InceptionNet). Then, assuming that the embedded data follows a multivariate Gaussian distribution, the mean and covariance are estimated. Finally, the Fréchet distance between these two Gaussians is computed, i.e.
$$\mathrm{FID} = \|\mu_x - \mu_y\|_2^2 + \mathrm{Tr}\big(\Sigma_x + \Sigma_y - 2(\Sigma_x \Sigma_y)^{1/2}\big),$$
where $(\mu_x, \Sigma_x)$ and $(\mu_y, \Sigma_y)$ are the mean and covariance of the embedded samples from $P$ and $Q$, respectively. The authors argue that FID is consistent with human judgment and more robust to noise than IS. Furthermore, the score is sensitive to the visual quality of generated samples – introducing noise or artifacts in the generated samples will increase (i.e. worsen) the FID. In contrast to IS, FID can detect intra-class mode dropping – a model that generates only one image per class will have a good IS, but a bad FID (Lucic et al., 2018).
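The Fréchet distance between the two fitted Gaussians can be sketched with NumPy as follows. The function name and the `(num_samples, dim)` input layout are assumptions, the inputs stand for already-computed embeddings with at least two feature dimensions, and the trace of the matrix square root is obtained from the eigenvalues of the covariance product rather than from an explicit matrix square root.

```python
import numpy as np


def frechet_distance(feat_x, feat_y):
    """Frechet distance between Gaussians fitted to two embedding sets."""
    mu_x, mu_y = feat_x.mean(axis=0), feat_y.mean(axis=0)
    sigma_x = np.cov(feat_x, rowvar=False)
    sigma_y = np.cov(feat_y, rowvar=False)
    # Tr((Sigma_x Sigma_y)^{1/2}) is the sum of the square roots of the
    # eigenvalues of Sigma_x Sigma_y, which are real and non-negative.
    eigvals = np.linalg.eigvals(sigma_x @ sigma_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_x - mu_y) ** 2)
    return float(mean_term + np.trace(sigma_x) + np.trace(sigma_y) - 2.0 * tr_sqrt)
```

A pure shift of the embeddings leaves the covariance term at zero and contributes the squared length of the shift, which is a convenient sanity check for any implementation.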
Kernel Inception Distance (KID) Bińkowski et al. (2018) argue that FID has no unbiased estimator and suggest KID as an unbiased alternative. In Appendix B we empirically compare KID to FID and observe that both metrics are very strongly correlated, as measured by the Spearman rank-order correlation coefficient on the lsunbedroom and celebahq128 datasets. As a result we focus on FID as it is likely to result in the same ranking.
We consider three datasets, namely cifar10, celebahq128, and lsunbedroom. The lsunbedroom dataset contains slightly more than 3 million images (Yu et al., 2015); the images are preprocessed to $128 \times 128 \times 3$ using TensorFlow resize_image_with_crop_or_pad. We randomly partition the images into a train and test set whereby we use 30588 images as the test set. Secondly, we use the celebahq dataset (Karras et al., 2018), in the version obtained by running the code provided by the authors (github.com/tkarras/progressive_growing_of_gans). We hold out part of the examples as the test set and use the remaining examples as the training set. Finally, we also include the cifar10 dataset, partitioned into training and testing instances. The baseline FID scores are 12.6 for celebahq128, 3.8 for lsunbedroom, and 5.19 for cifar10. Details on FID computation are presented in Section 4.
The search space for GANs is prohibitively large: exploring all combinations of all losses, normalization and regularization schemes, and architectures is outside of the practical realm. Instead, in this study we analyze several slices of this search space for each dataset. In particular, to ensure that we can reproduce existing results, we perform a study over the subset of this search space on cifar10. We then proceed to analyze the performance of these models across celebahq128 and lsunbedroom. In Section 3.1 we fix everything but the regularization and normalization scheme. In Section 3.2 we fix everything but the loss. Finally, in Section 3.3 we fix everything but the architecture. This allows us to decouple some of these design choices and provide some insight on what matters most in practice.
Table 1: Fixed hyperparameter settings (columns: Parameter, Discrete Value; rows include the learning rate and the regularization strength).
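Table 2: Parameter ranges for sequential Bayesian optimization (columns: Parameter, Range, Log; rows include the learning rate, which is sampled on a log scale).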
As noted by Lucic et al. (2018), one major issue preventing further progress is the hyperparameter tuning – currently, the community has converged to a small set of parameter values which work on some datasets, and may completely fail on others. In this study we combine the best hyperparameter settings found in the literature (Miyato et al., 2018), and perform sequential Bayesian optimization (Srinivas et al., 2010) to possibly uncover better hyperparameter settings. In a nutshell, in sequential Bayesian optimization one starts by evaluating a set of hyperparameter settings (possibly chosen randomly). Then, based on the obtained scores for these hyperparameters, the next set of hyperparameter combinations is chosen so as to balance exploration (finding new hyperparameter settings which might perform well) and exploitation (selecting settings close to the best-performing settings). We then consider the top performing models and discuss the impact of the computational budget.
We summarize the fixed hyperparameter settings in Table 1 which contains the “good” parameters reported in recent publications (Fedus et al., 2018; Miyato et al., 2018; Gulrajani et al., 2017). In particular, we consider the Cartesian product of these parameters to obtain 24 hyperparameter settings to reduce the survivorship bias. Finally, to provide a fair comparison, we perform sequential Bayesian optimization (Srinivas et al., 2010) on the parameter ranges provided in Table 2. We run 12 rounds (i.e. we communicate with the oracle 12 times) of sequential optimization, each with a batch of hyperparameter sets selected based on the FID scores from the results of the previous iterations. As we explore the number of discriminator updates per generator update (1 or 5), this leads to additional hyperparameter settings which in some cases outperform the previously known hyperparameter settings. The batch size is set to 64 for all the experiments. We use a fixed number of discriminator update steps of 100K for the lsunbedroom and celebahq128 datasets, and 200K for the cifar10 dataset. We apply the Adam optimizer (Kingma & Ba, 2015).
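The Cartesian-product construction of the fixed settings can be sketched as follows. All values below are hypothetical placeholders chosen only so that the product has 24 elements; the split into three learning rates, two regularization strengths, and four optimizer settings is likewise an assumption, and the actual grid is the one given in Table 1.

```python
import itertools

# Hypothetical placeholder values, not the grid from Table 1.
learning_rates = [1e-4, 2e-4, 1e-3]
reg_strengths = [1.0, 10.0]
# Each tuple: (Adam beta1, Adam beta2, discriminator updates per generator update).
optimizer_settings = [
    (0.5, 0.9, 5),
    (0.5, 0.999, 1),
    (0.5, 0.999, 5),
    (0.9, 0.999, 5),
]

grid = [
    {"learning_rate": lr, "reg_strength": lam,
     "beta1": b1, "beta2": b2, "disc_iters": n_dis}
    for lr, lam, (b1, b2, n_dis)
    in itertools.product(learning_rates, reg_strengths, optimizer_settings)
]
```

Evaluating every element of such a grid, rather than only the single best-known setting, is what guards against survivorship bias in the reported configurations.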
Given that there are 4 major components (loss, architecture, regularization, normalization) to analyze for each dataset, it is infeasible to explore the whole landscape. Hence, we opt for a more pragmatic solution – we keep some dimensions fixed, and vary the others. We highlight two aspects:
(1) We train the models using various hyperparameter settings, both predefined and ones obtained by sequential Bayesian optimization. Then we compute the FID distribution of the top-performing trained models. The lower the median FID, the better the model. The lower the variance, the more stable the model is from the optimization point of view.
(2) The tradeoff between the computational budget (for training) and model quality in terms of FID. Intuitively, given a limited computational budget (being able to train only a limited number of different models), which model should one choose? Clearly, models which achieve better performance using the same computational budget should be preferred in practice. To compute the minimum attainable FID for a fixed budget we simulate a practitioner attempting to find a good hyperparameter setting for their model: we spend a part of the budget on the “good” hyperparameter settings reported in recent publications, followed by exploring new settings (i.e. using Bayesian optimization). As this is a random process, we repeat it 1000 times and report the average of the minimum attainable FID.
Due to the fact that the training is sensitive to the initial weights, we train the models 5 times, each time with a different random initialization, and report the median FID. The variance in FID for models obtained by sequential Bayesian optimization is handled implicitly by the applied exploration-exploitation strategy.
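The budget simulation described above can be approximated by a simple resampling loop. The sketch below makes a simplifying assumption: instead of first spending part of the budget on the published settings and then on Bayesian optimization, it draws `budget` models uniformly without replacement from a pool of already-computed FID scores. The function name and arguments are illustrative.

```python
import random


def expected_min_fid(fid_scores, budget, num_trials=1000, seed=0):
    """Average, over random trials, of the best FID among `budget` models."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_trials):
        # One simulated practitioner: train `budget` models, keep the best.
        total += min(rng.sample(fid_scores, budget))
    return total / num_trials
```

Plotting this quantity against the budget for each model family gives a curve that shows which family reaches a given FID with the fewest training runs.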
The goal of this study is to compare the relative performance of various regularization and normalization methods presented in the literature, namely: batch normalization (BN) (Ioffe & Szegedy, 2015), layer normalization (LN) (Ba et al., 2016), spectral normalization (SN), gradient penalty (GP) (Gulrajani et al., 2017), Dragan penalty (DR) (Kodali et al., 2017), and $L^2$ regularization. We fix the loss to the nonsaturating loss (Goodfellow et al., 2014) and use the ResNet19 with generator and discriminator architectures described in Table 5(a). We analyze the impact of the loss function in Section 3.2 and of the architecture in Section 3.3. We consider both celebahq128 and lsunbedroom with the hyperparameter settings shown in Tables 1 and 2.
The results are presented in Figure 1. We observe that adding batch norm to the discriminator hurts the performance. Secondly, gradient penalty can help, but it does not stabilize the training. In fact, it is nontrivial to strike a balance between the loss and the regularization strength. Spectral normalization helps improve the model quality and is more computationally efficient than gradient penalty. This is consistent with recent results in Zhang et al. (2019). Similarly to the loss study, models using the gradient penalty may benefit from a 5:1 ratio of discriminator to generator updates. Furthermore, in a separate ablation study we observed that running the optimization procedure for an additional 100K steps is likely to increase the performance of the models with gradient penalty.
Here we investigate whether the above findings also hold when the loss functions are varied. In addition to the nonsaturating loss (NS), we also consider the least-squares loss (LS) (Mao et al., 2017) and the Wasserstein loss (WGAN) (Arjovsky et al., 2017). We use the ResNet19 with generator and discriminator architectures detailed in Table 5(a). We consider the most prominent normalization and regularization approaches: gradient penalty (Gulrajani et al., 2017), and spectral normalization (Miyato et al., 2018). Other parameters are detailed in Table 1. We also performed a study on the recently popularized hinge loss (Lim & Ye, 2017; Miyato et al., 2018; Brock et al., 2019) and present it in the Appendix.
The results are presented in Figure 2. Spectral normalization improves the model quality on both datasets. Similarly, the gradient penalty can help, but finding a good regularization tradeoff is nontrivial and requires a large computational budget. Models using the gradient penalty benefit from a 5:1 ratio of discriminator to generator updates (Gulrajani et al., 2017).
An interesting practical question is whether our findings also hold for different neural architectures. To this end, we also perform a study on SNDCGAN from Miyato et al. (2018). We consider the nonsaturating GAN loss, gradient penalty and spectral normalization. While for smaller architectures regularization is not essential (Lucic et al., 2018), the regularization and normalization effects might become more relevant due to deeper architectures and optimization considerations.
The results are presented in Figure 3. We observe that both architectures achieve comparable results and benefit from regularization and normalization. Spectral normalization strongly outperforms the baseline for both architectures.
Simultaneous Regularization and Normalization Since the Lipschitz constant of the discriminator is commonly observed to be critical for the performance, one may expect that simultaneous regularization and normalization could improve model quality. To quantify this effect, we fix the loss to the nonsaturating loss (Goodfellow et al., 2014), use the ResNet19 architecture (as above), and combine several normalization and regularization schemes, with hyperparameter settings shown in Table 1 coupled with 24 randomly selected parameters. The results are presented in Figure 4. We observe that one may benefit from additional regularization and normalization. However, a lot of computational effort has to be invested for somewhat marginal gains in FID. Nevertheless, given enough computational budget we advocate simultaneous regularization and normalization – spectral normalization and layer normalization seem to perform well in practice.
In this section we focus on several pitfalls we encountered while trying to reproduce existing results and provide a fair and accurate comparison.
Metrics There already seems to be a divergence in how the FID score is computed. Some authors report the score on training data, i.e. the FID between training samples and generated samples (Unterthiner et al., 2018). Some opt to report the FID based on test samples and generated samples and use a custom implementation (Miyato et al., 2018). Finally, Lucic et al. (2018) report the score with respect to the test data, i.e. the FID between test samples and generated samples. These subtle differences, including the number of samples used on each side, will result in a mismatch between the reported FIDs, which in some cases is considerable. We argue that FID should be computed with respect to the test dataset. Furthermore, whenever possible, one should use the same number of instances as previously reported results. Towards this end we follow the numbers of test and generated samples used in Lucic et al. (2018) on cifar10, lsunbedroom, and celebahq128.
Details of Neural Architectures Even in popular architectures, like ResNet, there are still a number of design decisions one needs to make, that are often omitted from the reported results. Those include the exact design of the ResNet block (order of layers, when ReLU is applied, when to upsample and downsample, how many filters to use). Some of these differences might lead to potentially unfair comparison. As a result, we suggest using the architectures presented within this work as a solid baseline. An ablation study on various ResNet modifications is available in the Appendix.
Datasets A common issue is related to dataset processing – does lsunbedroom always correspond to the same dataset? In most cases the precise algorithm for upscaling or cropping is not clear which introduces inconsistencies between results on the “same” dataset.
Implementation Details and NonDeterminism One major issue is the mismatch between the algorithm presented in a paper and the code provided online. We are aware that there is an embarrassingly large gap between a good implementation and a bad implementation of a given model. Hence, when no code is available, one is forced to guess which modifications were done. Another particularly tricky issue is removing randomness from the training process. After one fixes the data ordering and the initial weights, obtaining the same score by training the same model twice is nontrivial due to randomness present in certain GPU operations (Chetlur et al., 2014). Disabling the optimizations causing the nondeterminism often results in an order of magnitude running time penalty.
While each of these issues taken in isolation seems minor, they compound to create a mist which introduces friction in practical applications and the research process (Sculley et al., 2018).
A recent large-scale study on GANs and Variational Autoencoders was presented in Lucic et al. (2018). The authors consider several loss functions and regularizers, and study the effect of the loss function on the FID score, with low-to-medium complexity datasets (MNIST, cifar10, CelebA), and a single neural network architecture. In this limited setting, the authors found that there is no statistically significant difference between recently introduced models and the original nonsaturating GAN. A study of the effects of gradient-norm regularization in GANs was recently presented in Fedus et al. (2018). The authors posit that the gradient penalty can also be applied to the nonsaturating GAN, and that, to a limited extent, it reduces the sensitivity to hyperparameter selection. In a recent work on spectral normalization, the authors perform a small study of the competing regularization and normalization approaches (Miyato et al., 2018). We are happy to report that we could reproduce these results and we present them in the Appendix.
Inspired by these works and building on the available open-source code from Lucic et al. (2018), we take one additional step in all dimensions considered therein: more complex neural architectures, more complex datasets, and more involved regularization and normalization schemes.
In this work we study the impact of regularization and normalization schemes on GAN training. We consider the state-of-the-art approaches and vary the loss functions and neural architectures. We study the impact of these design choices on the quality of generated samples which we assess by recently introduced quantitative metrics.
Our fair and thorough empirical evaluation suggests that when the computational budget is limited one should consider nonsaturating GAN loss and spectral normalization as default choices when applying GANs to a new dataset. Given additional computational budget, we suggest adding the gradient penalty from Gulrajani et al. (2017) and training the model until convergence. Furthermore, we observe that both classes of popular neural architectures can perform well across the considered datasets. A separate ablation study uncovered that most of the variations applied in the ResNet style architectures lead to marginal improvements in the sample quality.
As a result of this large-scale study we identify the common pitfalls standing in the way of accurate and fair comparison and propose concrete actions to demystify the future results – issues with metrics, dataset preprocessing, nondeterminism, and missing implementation details are particularly striking. We hope that this work, together with the open-sourced reference implementations and trained models, will serve as a solid baseline for future GAN research.
Future work should carefully evaluate models which necessitate large-scale training such as BigGAN (Brock et al., 2019), models with custom architectures (Chen et al., 2019; Karras et al., 2019; Zhang et al., 2019), recently proposed regularization techniques (Roth et al., 2017; Mescheder et al., 2018), and other proposals for stabilizing the training (Chen et al., 2018). In addition, given the popularity of conditional GANs, one should explore whether these insights transfer to the conditional settings. Finally, given the drawbacks of FID and IS, additional quantitative evaluation using recently proposed metrics could bring novel insights (Sajjadi et al., 2018; Kynkäänniemi et al., 2019).
We are grateful to Michael Tschannen for detailed comments on this manuscript.
We present an empirical study with SNDCGAN and ResNet CIFAR architectures on cifar10 in Figure 5 and Figure 6. In addition to the nonsaturating loss (NS) and the Wasserstein loss (WGAN) presented in Section 3.2, we evaluate hinge loss (HG) on cifar10. We observe that its performance is similar to the nonsaturating loss.
The KID metric introduced by Bińkowski et al. (2018) is an alternative to FID. We use models from our Regularization and Normalization study (see Section 3.1) to compare both metrics. Here, by model we denote everything that needs to be specified for the training – including all hyperparameters, like the learning rate, the regularization strength, Adam’s $\beta$ parameters, etc. The Spearman rank-order correlation coefficient between KID and FID scores is very high for both the lsunbedroom and celebahq128 datasets.
To evaluate a practical setting of selecting several best models, we compare the intersection between the set of “best $K$ models by FID” and the set of “best $K$ models by KID” for $K \in \{5, 10, 20, 50, 100\}$. The results are summarized in Table 3.
This experiment suggests that FID and KID metrics are very strongly correlated, and for the practical applications one can choose either of them. Also, the conclusions from our studies based on FID should transfer to studies based on KID.
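Both comparisons used in this appendix, the rank correlation between the two metrics and the overlap between their top-$K$ sets, are easy to reproduce given per-model scores. The sketch below assumes two equal-length lists of FID and KID values with no ties; the function names are illustrative and the inputs are whatever scores one has computed.

```python
def ranks(values):
    """Rank of each element (0 = smallest), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(a, b):
    """Spearman rank-order correlation for tie-free data."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def top_k_overlap(fid, kid, k):
    """Number of models that are among the k best under both metrics."""
    best_fid = set(sorted(range(len(fid)), key=lambda i: fid[i])[:k])
    best_kid = set(sorted(range(len(kid)), key=lambda i: kid[i])[:k])
    return len(best_fid & best_kid)
```

A correlation near one together with a large top-$K$ overlap is what justifies treating the two metrics as interchangeable for model selection.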
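Table 3: Intersection between the set of best $K$ models by FID and the set of best $K$ models by KID on lsunbedroom and celebahq128, for $K$ = 5, 10, 20, 50, 100.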
We used the same architecture as Miyato et al. (2018), with the parameters copied from the GitHub page (github.com/pfnetresearch/chainerganlib). In Table 4(a) and Table 4(b), the layer column lists the operations in order, together with the kernel size, the input shape, and the output shape of each layer. The slopes of all lReLU functions are set to 0.1. The input shape is $128 \times 128$ for celebahq128 and lsunbedroom, and $32 \times 32$ for cifar10.


The ResNet19 architecture is described in Table 5. The RS column stands for the resample of the residual block, with downscale (D) / upscale (U) / none (-) setting. MP stands for mean pooling and BN for batch normalization. ResBlock is defined in Table 6. The addition layer merges two paths by adding them. The first path is a shortcut layer with exactly one convolution operation, while the second path consists of two convolution operations. The downscale layer and upscale layer are marked in Table 6. We used average pooling for downscaling, after the convolution operation. We used unpool from github.com/tensorflow/tensorflow/issues/2169 for upscaling, before the convolution operation. The output shape of a ResNet block depends on its input shape and on the RS parameter, and each block is further specified by its numbers of input and output channels. Table 7 describes the ResNet CIFAR architecture we used in Figure 5 for reproducing the existing results. Note that RS is set to none for the third and fourth ResBlock in the discriminator. In this case, we used the same ResNet block defined in Table 6 without resampling.






We have noticed six minor differences in the ResNet architecture compared to the implementation from github.com/pfnetresearch/chainerganlib/blob/master/common/net.py (Miyato et al., 2018). We performed an ablation study to verify the impact of these differences. Figure 7 shows the results of the ablation study, with details described in the following.
DEFAULT: ResNet CIFAR architecture with spectral normalization and nonsaturating GAN loss.
SKIP: Use input as output for the shortcut connection in the discriminator ResBlock. By default it was a convolutional layer with 3x3 kernel.
CIN: Use for the discriminator ResBlock hidden layer output channels. By default it was in our setup, while Miyato et al. (2018) used for first ResBlock and for the rest.
OPT: Use an optimized setup for the first discriminator ResBlock, which includes: (1) no ReLU, (2) a convolutional layer for the shortcut connections, (3) use instead of in ResBlock.
CIN OPT: Use CIN and OPT together. It means the first ResBlock is optimized while the remaining ResBlocks use for the hidden output channels.
SUM: Use reduce sum to pool the discriminator output. By default it was reduce mean.
TAN: Use tanh for the generator output, as well as range [-1, 1] for the discriminator input. By default it was sigmoid and discriminator input range [0, 1].
EPS: Use a bigger epsilon for generator batch normalization. By default it was in TensorFlow.
ALL: Apply all the above differences together.
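The experiments listed above can be viewed as sets of switches on top of the DEFAULT setup. The sketch below encodes them that way and shows how one switch (SUM) changes behaviour. The names `BASE_FLAGS`, `ablation_configs` and `pool_discriminator_output` are hypothetical and do not come from the released code.

```python
BASE_FLAGS = ("SKIP", "CIN", "OPT", "SUM", "TAN", "EPS")


def ablation_configs():
    """Map each experiment name to the set of differences it enables."""
    configs = {"DEFAULT": frozenset()}
    for flag in BASE_FLAGS:
        configs[flag] = frozenset({flag})
    # CIN OPT combines two of the single differences.
    configs["CIN OPT"] = frozenset({"CIN", "OPT"})
    # ALL applies every difference at once.
    configs["ALL"] = frozenset(BASE_FLAGS)
    return configs


def pool_discriminator_output(values, enabled):
    """Pool discriminator features: reduce sum if SUM is on, else reduce mean."""
    total = sum(values)
    return total if "SUM" in enabled else total / len(values)
```

For example, `pool_discriminator_output([1.0, 2.0, 3.0], ablation_configs()["SUM"])` sums the values, while the DEFAULT configuration averages them.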
In the ablation study, the CIN experiment obtained the worst FID score. Combined with OPT, the CIN results improved to the same level as the others. This is reasonable because the first block has three input channels, which becomes a bottleneck for the optimization. Overall, the impact of these differences is minor according to the study on cifar10.
To make future GAN training simpler, we propose a set of best parameters for three setups: (1) Best parameters without any regularizer. (2) Best parameters with only one regularizer. (3) Best parameters with at most two regularizers. Table 8, Table 9 and Table 10 summarize the top 2 parameters for the SNDCGAN architecture, the ResNet19 architecture and the ResNet CIFAR architecture, respectively. Models are ranked according to the median FID score of five different random seeds with fixed hyperparameters in Table 1. Note that ranking models according to the best FID score across seeds would yield better but less stable results. Sequential Bayesian optimization hyperparameters are not included in these tables. For the ResNet19 architecture with at most two regularizers, we have run it only once due to computational overhead. To show the model stability, we list the best FID score out of five seeds with the same parameters in the column Best. Spectral normalization clearly outperforms the other normalizers on the SNDCGAN and ResNet CIFAR architectures, while on ResNet19 both layer normalization and spectral normalization work well.
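The difference between ranking by median and ranking by best FID can be illustrated with a small sketch. The function name `rank_by_median_fid` and the FID values below are made up for illustration and are not results from this study.

```python
from statistics import median


def rank_by_median_fid(fid_by_config):
    """Return (name, median FID, best FID) tuples sorted by median FID."""
    rows = [
        (name, median(fids), min(fids))
        for name, fids in fid_by_config.items()
    ]
    return sorted(rows, key=lambda row: row[1])


# Made-up FID scores for five seeds per configuration (lower is better).
example = {
    "config_a": [30.0, 31.0, 29.0, 45.0, 30.5],
    "config_b": [33.0, 27.0, 34.0, 35.0, 36.0],
}
ranking = rank_by_median_fid(example)
```

Here `config_a` ranks first by median even though `config_b` has the lower best score, which comes from a single favourable seed.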
To visualize what a given FID score looks like on each dataset, Figure 8, Figure 9 and Figure 10 show examples generated by GANs. We select the examples from the run with the best FID, and then show two more plots with increasing FID score.
Table 8: Top two parameter settings for the SNDCGAN architecture.
Dataset      Median  Best    LR     β1     β2     n_dis  λ   Norm
cifar10      29.75   28.66   0.100  0.500  0.999  1      -   -
cifar10      36.12   33.23   0.200  0.500  0.999  1      -   -
celebahq128  66.42   63.13   0.100  0.500  0.999  1      -   -
celebahq128  67.39   64.59   0.200  0.500  0.999  1      -   -
lsunbedroom  180.36  160.12  0.200  0.500  0.999  1      -   -
lsunbedroom  188.99  162.00  0.100  0.500  0.999  1      -   -
cifar10      26.66   25.27   0.200  0.500  0.999  1      -   SN
cifar10      27.32   26.97   0.100  0.500  0.999  1      -   SN
celebahq128  31.14   29.05   0.200  0.500  0.999  1      -   SN
celebahq128  33.52   31.92   0.100  0.500  0.999  1      -   SN
lsunbedroom  63.46   58.13   0.200  0.500  0.999  1      -   SN
lsunbedroom  74.66   59.94   1.000  0.500  0.999  1      -   SN
cifar10      26.23   26.01   0.200  0.500  0.999  1      1   SN+GP
cifar10      26.66   25.27   0.200  0.500  0.999  1      -   SN
celebahq128  31.13   30.80   0.100  0.500  0.999  1      10  GP
celebahq128  31.14   29.05   0.200  0.500  0.999  1      -   SN
lsunbedroom  63.46   58.13   0.200  0.500  0.999  1      -   SN
lsunbedroom  66.58   65.75   0.200  0.500  0.999  1      10  GP
Table 9: Top two parameter settings for the ResNet19 architecture.
Dataset      Median  Best    LR     β1     β2     n_dis  λ   Norm
celebahq128  43.73   39.10   0.100  0.500  0.999  5      -   -
celebahq128  43.77   39.60   0.100  0.500  0.999  1      -   -
lsunbedroom  160.97  119.58  0.100  0.500  0.900  5      -   -
lsunbedroom  161.70  125.55  0.100  0.500  0.900  5      -   -
celebahq128  32.46   28.52   0.100  0.500  0.999  1      -   LN
celebahq128  40.58   36.37   0.200  0.500  0.900  1      -   LN
lsunbedroom  70.30   48.88   1.000  0.500  0.999  1      -   SN
lsunbedroom  73.84   60.54   0.100  0.500  0.900  5      -   SN
celebahq128  29.13   -       0.100  0.500  0.900  5      1   LN+DR
celebahq128  29.65   -       0.200  0.500  0.900  5      1   GP
lsunbedroom  55.72   -       0.200  0.500  0.900  5      1   LN+GP
lsunbedroom  57.81   -       0.100  0.500  0.999  1      10  SN+GP
Table 10: Top two parameter settings for the ResNet CIFAR architecture.
Dataset  Median  Best   LR     β1     β2     n_dis  λ  Norm
cifar10  31.40   28.12  0.200  0.500  0.999  5      -  -
cifar10  33.79   30.08  0.100  0.500  0.999  5      -  -
cifar10  23.57   22.91  0.200  0.500  0.999  5      -  SN
cifar10  25.50   24.21  0.100  0.500  0.999  5      -  SN
cifar10  22.98   22.73  0.200  0.500  0.999  1      1  SN+GP
cifar10  23.57   22.91  0.200  0.500  0.999  5      -  SN
For each architecture and hyperparameter we estimate its impact on the final FID. Figure 11 presents heatmaps for the hyperparameters, namely the learning rate, β1, β2, n_dis, and λ, for each combination of neural architecture and dataset.
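One simple way to summarize the effect of a single hyperparameter is to group runs by its value and aggregate the FID within each group. The sketch below uses the median as the aggregate. This is an assumed aggregation for illustration, not necessarily the one behind Figure 11, and the function name `median_fid_per_value`, the dictionary keys, and the run data are all made up.

```python
from collections import defaultdict
from statistics import median


def median_fid_per_value(runs, hyperparameter):
    """Group runs by one hyperparameter and return the median FID per value."""
    groups = defaultdict(list)
    for run in runs:
        groups[run[hyperparameter]].append(run["fid"])
    return {value: median(fids) for value, fids in groups.items()}


# Made-up runs; each dict holds hyperparameter values and the final FID.
runs = [
    {"lr": 0.1, "n_dis": 1, "fid": 30.0},
    {"lr": 0.1, "n_dis": 5, "fid": 40.0},
    {"lr": 0.2, "n_dis": 1, "fid": 28.0},
    {"lr": 0.2, "n_dis": 5, "fid": 32.0},
]
by_lr = median_fid_per_value(runs, "lr")
by_n_dis = median_fid_per_value(runs, "n_dis")
```

Each resulting dictionary corresponds to one row of a heatmap for a fixed architecture and dataset, with one cell per hyperparameter value.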