Improved Consistency Regularization for GANs

02/11/2020 · Zhengli Zhao et al.

Recent work has increased the performance of Generative Adversarial Networks (GANs) by enforcing a consistency cost on the discriminator. We improve on this technique in several ways. We first show that consistency regularization can introduce artifacts into the GAN samples and explain how to fix this issue. We then propose several modifications to the consistency regularization procedure designed to improve its performance. We carry out extensive experiments quantifying the benefit of our improvements. For unconditional image synthesis on CIFAR-10 and CelebA, our modifications yield the best known FID scores on various GAN architectures. For conditional image synthesis on CIFAR-10, we improve the state-of-the-art FID score from 11.48 to 9.21. Finally, on ImageNet-2012, we apply our technique to the original BigGAN model and improve the FID from 6.66 to 5.38, which is the best score at that model size.




1 Introduction

Generative Adversarial Networks (GANs; Goodfellow et al., 2014) are a powerful class of deep generative models, but are known for training difficulties (Salimans et al., 2016). Many approaches have been introduced to improve GAN performance (Arjovsky et al., 2017; Gulrajani et al., 2017; Miyato et al., 2018a; Brock et al., 2019). Recent work (Wei et al., 2018; Zhang et al., 2020) suggests that the performance of generative models can be improved by introducing consistency regularization techniques, which are popular in the semi-supervised learning literature (Oliver et al., 2018). In particular, Zhang et al. (2020) show that GANs augmented with consistency regularization can achieve state-of-the-art image-synthesis results. In CR-GAN, real images and their corresponding augmented counterparts are fed into the discriminator, which is then encouraged, via an auxiliary loss term, to produce similar outputs for an image and its corresponding augmentation.

Though the consistency regularization in CR-GAN is effective, the augmentations are applied only to the real images and not to generated samples, making the whole procedure somewhat imbalanced. In particular, the generator can learn these artificial augmentation features and introduce them into generated samples as undesirable artifacts (we show examples in Fig. 15 and discuss this further in Section 5.2). Further, by regularizing only the discriminator, and by using augmentations only in image space, the regularizations in Wei et al. (2018) and Zhang et al. (2020) do not act directly on the generator. By also constraining the mapping from the prior to the generated samples, we can achieve further performance gains on top of those yielded by consistency regularization on the discriminator alone.

In this work, we introduce Improved Consistency Regularization (ICR), which applies forms of consistency regularization to the generated images, the latent vector space, and the generator. First, we address the lack of regularization on the generated samples by introducing balanced consistency regularization (bCR), in which a consistency term on the discriminator is applied to both real images and samples coming from the generator. Second, we introduce latent consistency regularization (zCR), which incorporates regularization terms modulating the sensitivity of both the generator and the discriminator to changes in the prior. In particular, given augmented/perturbed latent vectors, we show it is helpful to encourage the generator to be sensitive to the perturbations and the discriminator to be insensitive to them.

When both bCR and zCR are combined into ICR, it yields state-of-the-art image synthesis results. For unconditional image synthesis on CIFAR-10 and CelebA, our method yields the best known FID scores on various GAN architectures. For conditional image synthesis on CIFAR-10, we improve the state-of-the-art FID score from 11.48 to 9.21. Finally, on ImageNet-2012, we apply our technique to the original BigGAN (Brock et al., 2019) model and improve the FID from 6.66 to 5.38, which is the best score at that model size.

Figure 1: Illustrations comparing our methods to the baseline. (1) CR-GAN (Zhang et al., 2020) is the baseline, with consistency regularization applied only between real images and their augmentations. (2) In Balanced Consistency Regularization (bCR-GAN), we also introduce consistency regularization between generated fake images and their augmentations. With consistency regularization on both real and fake images, the discriminator is trained in a balanced way and fewer augmentation artifacts are generated. (3) Furthermore, we propose Latent Consistency Regularization (zCR-GAN), where the latent vector is augmented with noise of small magnitude. Then, for the discriminator, we regularize the consistency between corresponding pairs, while for the generator we encourage the corresponding generated images to be more diverse. In the figure, one type of connector indicates a loss term encouraging a pair to be closer together, while the other indicates a loss term pushing a pair apart.

2 Background

2.1 Generative Adversarial Networks

A Generative Adversarial Network (GAN) (Goodfellow et al., 2014) is composed of a Generator model, $G$, and a Discriminator model, $D$, which are parameterized by deep neural networks. The generator is trained to take a latent vector $z$ drawn from a prior distribution $p(z)$ and generate samples $G(z)$. The discriminator is trained to distinguish samples $x$ from the target distribution $p_{\mathrm{data}}(x)$ and samples $G(z)$, which encourages the generator to reduce the discrepancy between the target distribution and the generated distribution. Both models have respective losses defined as:

$$L_D = -\mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] - \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$
$$L_G = -\mathbb{E}_{z \sim p(z)}[\log D(G(z))]$$
This original formulation (Goodfellow et al., 2014) is known as the non-saturating (NS) GAN. Extensive research has demonstrated that appropriate re-design of these loss functions plays an important role in training stability and generation quality. For example, the hinge loss on the discriminator (Lim and Ye, 2017; Tran et al., 2017) is defined as:

$$L_D = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\max(0, 1 - D(x))] + \mathbb{E}_{z \sim p(z)}[\max(0, 1 + D(G(z)))]$$
$$L_G = -\mathbb{E}_{z \sim p(z)}[D(G(z))]$$

The Wasserstein GAN (WGAN) (Arjovsky et al., 2017) is another successful reformulation, which measures the Wasserstein distance (Villani, 2008), under a 1-Lipschitz constraint on the discriminator, between the target distribution and the generated distribution in the discriminator output space. The loss functions of WGAN can be written as:

$$L_D = -\mathbb{E}_{x \sim p_{\mathrm{data}}}[D(x)] + \mathbb{E}_{z \sim p(z)}[D(G(z))]$$
$$L_G = -\mathbb{E}_{z \sim p(z)}[D(G(z))]$$
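To make these loss pairs concrete, here is a minimal NumPy sketch of the non-saturating, hinge, and WGAN losses over raw discriminator outputs. This is a toy reference, not any particular library's API; `d_real` and `d_fake` are hypothetical arrays of discriminator logits for real samples $x$ and generated samples $G(z)$.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def ns_gan_losses(d_real, d_fake):
    """Non-saturating GAN (Goodfellow et al., 2014): L_D and L_G."""
    loss_d = -np.mean(np.log(sigmoid(d_real))) - np.mean(np.log(1.0 - sigmoid(d_fake)))
    loss_g = -np.mean(np.log(sigmoid(d_fake)))
    return loss_d, loss_g

def hinge_gan_losses(d_real, d_fake):
    """Hinge loss (Lim and Ye, 2017; Tran et al., 2017)."""
    loss_d = np.mean(np.maximum(0.0, 1.0 - d_real)) + np.mean(np.maximum(0.0, 1.0 + d_fake))
    loss_g = -np.mean(d_fake)
    return loss_d, loss_g

def wgan_losses(d_real, d_fake):
    """WGAN (Arjovsky et al., 2017); D must be kept 1-Lipschitz separately."""
    loss_d = -np.mean(d_real) + np.mean(d_fake)
    loss_g = -np.mean(d_fake)
    return loss_d, loss_g
```

In a real training loop these would be computed from the discriminator's pre-activation outputs on each minibatch.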

Follow-up work improves WGAN in multiple ways (Gulrajani et al., 2017; Wei et al., 2018). For instance, Miyato et al. (2018a) propose spectral normalization to stabilize the training, which is widely used (Zhang et al., 2019; Brock et al., 2019) and has become the de-facto weight normalization technique for GANs.

2.2 Consistency Regularization

For semi-supervised or unsupervised learning, consistency regularization techniques are effective and have recently become broadly used (Sajjadi et al., 2016; Laine and Aila, 2016; Zhai et al., 2019; Xie et al., 2019; Berthelot et al., 2019). The intuition behind these techniques is to encode into model training some prior knowledge: the model should produce consistent predictions given input instances and their semantics-preserving augmentations. The augmentations (or transformations) can take many forms, such as image flipping and rotation, sentence back-translation, or even adversarial attacks. Penalizing the inconsistency can be easily achieved by minimizing an $L_2$ loss (Sajjadi et al., 2016; Laine and Aila, 2016) between instance pairs, or a KL-divergence loss (Xie et al., 2019; Miyato et al., 2018b) between distributions. In the GAN literature, Wei et al. (2018) propose a consistency term derived from Lipschitz continuity considerations to improve the training of WGAN. Recently, CR-GAN (Zhang et al., 2020) applies consistency regularization to the discriminator and achieves substantial improvements.
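As a toy illustration of this idea, the following sketch computes the L2 consistency penalty between a model's predictions for an image and its horizontal flip. The "model" here is a hypothetical fixed random linear map standing in for a real classifier; only the shape of the penalty matters.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 32 * 32))   # hypothetical "model" weights: image -> 8 logits

def model(x):
    """Stand-in classifier: logits for one 32x32 image."""
    return W @ x.ravel()

def l2_consistency(x, x_aug):
    """L2 penalty between predictions for an image and its augmentation."""
    return float(np.sum((model(x) - model(x_aug)) ** 2))

x = rng.standard_normal((32, 32))
x_flip = x[:, ::-1]                      # horizontal flip: a semantics-preserving augmentation
penalty = l2_consistency(x, x_flip)      # added to the training loss during optimization
```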

3 Improved Consistency Regularization

This section starts by introducing two new techniques, abbreviated as bCR and zCR, to improve and generalize consistency regularization for GANs. We denote the combination of both of these techniques as ICR, and we will later show that ICR yields state-of-the-art image synthesis results in a variety of settings. Figure 1 shows illustrations comparing our methods to the baseline CR-GAN Zhang et al. (2020).

3.1 Balanced Consistency Regularization (bCR)

  Input: parameters of generator $\theta_G$ and discriminator $\theta_D$, consistency regularization coefficients $\lambda_{\mathrm{real}}$ for real images and $\lambda_{\mathrm{fake}}$ for fake images, number of discriminator iterations per generator iteration $N_D$, augmentation transform $T$ (for images, e.g. shift, flip, cutout, etc.).
  for number of training iterations do
     for $t = 1$ to $N_D$ do
        Sample batch $z \sim p(z)$, $x \sim p_{\mathrm{data}}(x)$
        Augment both real images $x \to T(x)$ and fake images $G(z) \to T(G(z))$
        $L_D \leftarrow D_{\mathrm{loss}} + \lambda_{\mathrm{real}} \|D(x) - D(T(x))\|^2 + \lambda_{\mathrm{fake}} \|D(G(z)) - D(T(G(z)))\|^2$
        Update $\theta_D$ by gradient descent on $L_D$
     end for
     Sample batch $z \sim p(z)$
     $L_G \leftarrow G_{\mathrm{loss}}$; update $\theta_G$ by gradient descent on $L_G$
  end for
Algorithm 1 Balanced Consistency Regularization (bCR)

Figure 1(1) illustrates the baseline CR-GAN, in which a term is added to the discriminator loss function that penalizes its sensitivity to the difference between the original image $x$ and the augmented image $T(x)$. One key problem with the original CR-GAN is that the discriminator might ‘mistakenly believe’ that the augmentations are actual features of the target data set, since these augmentations are performed only on the real images. This phenomenon, which we refer to as consistency imbalance, is not easy to notice for certain types of augmentation (e.g. image shifting and flipping). However, it can result in generated samples with explicit augmentation artifacts when augmented samples contain visual artifacts not belonging to real images. For example, we can easily observe this effect for CR-GAN with cutout augmentation: see the second column in Figure 15. This undesirable effect greatly limits the choice of advanced augmentations we could use.

In order to correct this issue, we propose to also augment generated samples before they are fed into the discriminator, so that the discriminator will be evenly regularized with respect to both real and fake augmentations and thereby be encouraged to focus on meaningful visual information.

Specifically, a gradient update step will involve four batches: a batch of real images $x$, the augmentations of these real images $T(x)$, a batch of generated samples $G(z)$, and that same batch with augmentations $T(G(z))$. The discriminator loss will have terms that penalize its sensitivity to the difference between corresponding $x$ and $T(x)$, and also between $G(z)$ and $T(G(z))$, while the generator cost remains unmodified.

This technique is described in more detail in Algorithm 1 and visualized in Figure 1(2). We abuse notation slightly in that $D(x)$ denotes the output vector before the activation of the last layer of the discriminator given input $x$. $T(x)$ denotes an augmentation transform, here for images (e.g. shift, flip, cutout, etc.). The consistency regularization can be balanced by adjusting the strengths of $\lambda_{\mathrm{real}}$ and $\lambda_{\mathrm{fake}}$. This proposed bCR technique not only removes augmentation artifacts (see the third column of Figure 15), but also brings substantial performance improvements (see Sections 4 and 5).
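A minimal NumPy sketch may help make the bCR penalty concrete. Here `disc` and `augment` are hypothetical stand-ins for the real discriminator and image transform (a linear map, and a flip plus a zero-padded shift), and `bcr_penalty` follows the two consistency terms described above.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal(32 * 32)

def disc(x):
    """Stand-in D(x): the last-layer output before activation."""
    return float(W @ x.ravel())

def augment(x, shift=2):
    """Stand-in T: horizontal flip plus a small zero-padded pixel shift."""
    x = x[:, ::-1]
    out = np.zeros_like(x)
    out[:, shift:] = x[:, :-shift]
    return out

def bcr_penalty(x_real, x_fake, lam_real=10.0, lam_fake=10.0):
    """Consistency terms added to the discriminator loss (Algorithm 1)."""
    cost_real = (disc(x_real) - disc(augment(x_real))) ** 2   # x vs T(x)
    cost_fake = (disc(x_fake) - disc(augment(x_fake))) ** 2   # G(z) vs T(G(z))
    return lam_real * cost_real + lam_fake * cost_fake

x_real = rng.standard_normal((32, 32))   # a real image
x_fake = rng.standard_normal((32, 32))   # a generated sample G(z)
penalty = bcr_penalty(x_real, x_fake)    # added to the usual GAN loss for D
```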

3.2 Latent Consistency Regularization (zCR)

  Input: parameters of generator $\theta_G$ and discriminator $\theta_D$, consistency regularization coefficients $\lambda_{\mathrm{gen}}$ for the generator and $\lambda_{\mathrm{dis}}$ for the discriminator, number of discriminator iterations per generator iteration $N_D$, augmentation transform $T$ (for latent vectors, e.g. adding small perturbation noise).
  for number of training iterations do
     for $t = 1$ to $N_D$ do
        Sample batch $z \sim p(z)$, $x \sim p_{\mathrm{data}}(x)$
        Sample perturbation noise $\Delta z \sim \mathcal{N}(0, \sigma_{\mathrm{noise}}I)$
        Augment latent vectors $z \to T(z) = z + \Delta z$
        $L_D \leftarrow D_{\mathrm{loss}} + \lambda_{\mathrm{dis}} \|D(G(z)) - D(G(T(z)))\|^2$
        Update $\theta_D$ by gradient descent on $L_D$
     end for
     Sample batch $z \sim p(z)$, sample $\Delta z$, and augment $z \to T(z)$
     $L_G \leftarrow G_{\mathrm{loss}} - \lambda_{\mathrm{gen}} \|G(z) - G(T(z))\|^2$; update $\theta_G$ by gradient descent on $L_G$
  end for
Algorithm 2 Latent Consistency Regularization (zCR)

In Section 3.1, we focused on consistency regularization with respect to augmentations in image space on the inputs to the discriminator. In this section, we consider a different question: would it help to enforce consistency regularization with respect to augmentations in latent space (Zhao et al., 2018)? Given that a GAN model consists of both a generator and a discriminator, it seems reasonable to ask whether techniques that can be applied to the discriminator can also be applied effectively to the generator in an analogous way.

Towards this end, we propose to augment inputs to the generator by slightly perturbing draws $z$ from the prior to yield $T(z) = z + \Delta z$. Assuming the added perturbations are small enough, we expect the output of the discriminator not to change much with respect to this perturbation, and we modify the discriminator loss by enforcing that $\|D(G(z)) - D(G(T(z)))\|^2$ is small.

However, with only this new consistency regularization term added to the GAN loss, the generator would be prone to collapse to generating specific samples for any latent $z$, since that would trivially satisfy the constraint above. To avoid this, we also modify the loss function of the generator with a term that maximizes the difference between $G(z)$ and $G(T(z))$, which encourages generations from similar latent vectors to be diverse. Though motivated differently, this can be seen as related to the Jacobian Clamping technique of Odena et al. (2018) and the diversity-increasing technique of Yang et al. (2019).

This method is described in more detail in Algorithm 2 and visualized in Figure 1(3). $G(z)$ denotes the output image of the generator given latent input $z$. $T(z)$ denotes an augmentation transform, here for latent vectors (e.g. adding small perturbation noise). The strength of the consistency regularization for the discriminator can be adjusted via $\lambda_{\mathrm{dis}}$. From the view of the generator, intuitively, the term weighted by $\lambda_{\mathrm{gen}}$ encourages generations from nearby latent vectors to be diverse. We analyze the effect of $\lambda_{\mathrm{gen}}$ with experiments in Section 5.3. This technique substantially improves the performance of GANs, as measured by FID. We present experimental results in Sections 4 and 5.
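The two zCR terms can likewise be sketched in a few lines of NumPy. `gen` and `disc` below are hypothetical linear stand-ins for the real networks; the discriminator term penalizes output changes between $G(z)$ and $G(z + \Delta z)$, while the generator term is subtracted from the loss to reward diversity.

```python
import numpy as np

rng = np.random.default_rng(2)
Wg = rng.standard_normal((64, 16))   # stand-in "generator" weights: latent -> image vector
Wd = rng.standard_normal(64)         # stand-in "discriminator" weights

def gen(z):
    return Wg @ z

def disc(x):
    return float(Wd @ x)

def zcr_terms(z, sigma_noise=0.05, lam_dis=5.0, lam_gen=0.5):
    """Extra loss terms from Algorithm 2 (illustrative coefficient values)."""
    dz = sigma_noise * rng.standard_normal(z.shape)    # small latent perturbation
    x, x_pert = gen(z), gen(z + dz)
    d_term = lam_dis * (disc(x) - disc(x_pert)) ** 2   # added to the discriminator loss
    g_term = -lam_gen * np.sum((x - x_pert) ** 2)      # subtracted diversity term in the generator loss
    return d_term, g_term

z = rng.standard_normal(16)
d_term, g_term = zcr_terms(z)
```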

3.3 Putting it All Together

Though both Balanced Consistency Regularization and Latent Consistency Regularization improve GAN performance (see Section 4), it is not obvious that they would work when ‘stacked on top’ of each other; perhaps they accomplish the same thing in different ways, so that their benefits do not add up. However, as validated by extensive experiments, we achieve the best results when combining Algorithm 1 and Algorithm 2. We call this combination Improved Consistency Regularization (ICR). Note that in ICR, we augment inputs in both image and latent space, and add regularization terms to both the discriminator and the generator. We regularize the discriminator’s consistency between the corresponding pairs $x \leftrightarrow T(x)$, $G(z) \leftrightarrow T(G(z))$, and $G(z) \leftrightarrow G(T(z))$; for the generator, we encourage diversity between $G(z)$ and $G(T(z))$.

4 Experiments

In this section, we validate our methods on different data sets, model architectures, and GAN loss functions. We compare both Balanced Consistency Regularization (Algorithm 1) and Latent Consistency Regularization (Algorithm 2) with several baseline methods. We also combine both techniques (we abbreviate this combination as ICR) and show that this yields state-of-the-art FID numbers. We follow the best experimental practices established in Kurach et al. (2019), aggregating all runs and reporting the FID distribution of the top 15% of trained models. We provide both quantitative and qualitative results (with more in the appendix).

4.1 Baseline Methods

We compare our methods with four GAN regularization techniques: Gradient Penalty (GP) (Gulrajani et al., 2017), DRAGAN (DR) (Kodali et al., 2017), Jensen-Shannon Regularizer (JSR) (Roth et al., 2017), and vanilla Consistency Regularization (CR) (Zhang et al., 2020). The regularization strength is set to 0.1 for JSR, and 10 for all others.

Following the procedures from Lucic et al. (2018); Kurach et al. (2019), we evaluate these methods across different data sets, neural architectures, and loss functions. For optimization, we use the Adam optimizer with batch size of 64 for all experiments. By default, spectral normalization (SN) (Miyato et al., 2018a) is used in the discriminator, as it is the most effective normalization method for GANs (Kurach et al., 2019) and is becoming the standard for recent GANs (Brock et al., 2019; Wu et al., 2019).

4.2 Data Sets and Evaluation

We carry out extensive experiments comparing our methods against the above baselines on three commonly used data sets in the GAN literature: CIFAR-10 (Krizhevsky et al., 2009), CelebA-HQ-128 (Karras et al., 2018), and ImageNet-2012 (Russakovsky et al., 2015).

For data set preparation, we follow the detailed procedures in Kurach et al. (2019). CIFAR-10 contains 60K images with 10 labels, out of which 50K are used for training and 10K are used for testing. CelebA-HQ-128 (CelebA) consists of 30K facial images, out of which we use 3K images for testing and train models with the rest. ImageNet-2012 has approximately 1.2M images with 1000 labels, and we down-sample the images to $128 \times 128$. We stop training after 200k generator update steps for CIFAR-10, 100k steps for CelebA, and 250k steps for ImageNet.

We use the Fréchet Inception Distance (FID) (Heusel et al., 2017) as the primary metric for quantitative evaluation. FID has been shown to correlate well with human evaluation of image quality and to be helpful in detecting intra-class mode collapse. We calculate FID between generated samples and real test images, using 10K images on CIFAR-10, 3K on CelebA, and 50K on ImageNet. We also report Inception Scores (Salimans et al., 2016) in the appendix.
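For reference, FID has a closed form over Gaussian fits to two sets of feature vectors: $\mathrm{FID} = \|\mu_1 - \mu_2\|^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1\Sigma_2)^{1/2})$. The sketch below is a minimal NumPy illustration over arbitrary feature arrays, not the official implementation; in practice the features are Inception activations and the TensorFlow reference code is used. The eigenvalue route to the matrix-square-root trace is one common shortcut.

```python
import numpy as np

def fid(feats1, feats2):
    """Fréchet distance between Gaussian fits to two feature sets (n_samples x dim)."""
    mu1, mu2 = feats1.mean(axis=0), feats2.mean(axis=0)
    s1 = np.cov(feats1, rowvar=False)
    s2 = np.cov(feats2, rowvar=False)
    # Tr((S1 S2)^{1/2}) via eigenvalues of S1 @ S2, which are real and
    # non-negative (up to numerical error) for covariance matrices.
    eigvals = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sum(np.sqrt(np.maximum(eigvals.real, 0.0)))
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1) + np.trace(s2) - 2.0 * tr_sqrt)

rng = np.random.default_rng(3)
a = rng.standard_normal((500, 8))         # features from one "distribution"
b = rng.standard_normal((500, 8)) + 1.0   # mean-shifted features -> larger FID
```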

By default, the augmentation transform on latent vectors is the addition of Gaussian noise $\Delta z \sim \mathcal{N}(0, \sigma_{\mathrm{noise}}I)$. The augmentation transform on images is a combination of randomly flipping horizontally and shifting by multiple pixels (up to 4 for CIFAR-10 and CelebA, and up to 16 for ImageNet). This transform combination results in better performance than alternatives (see Zhang et al. (2020)).
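The default image transform described above (random horizontal flip plus a small random shift) can be sketched as follows. `augment_image` and the pad-then-crop implementation of the shift are illustrative choices, not the paper's exact code.

```python
import numpy as np

def augment_image(x, max_shift=4, rng=np.random.default_rng()):
    """Randomly flip an HxWxC image horizontally, then shift it by up to
    max_shift pixels in each direction (zero-pad then crop)."""
    if rng.random() < 0.5:
        x = x[:, ::-1, :]                                  # horizontal flip
    h, w, c = x.shape
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    padded = np.pad(x, ((max_shift,) * 2, (max_shift,) * 2, (0, 0)))
    y0, x0 = max_shift + dy, max_shift + dx
    return padded[y0:y0 + h, x0:x0 + w, :]                 # same shape as input

img = np.random.default_rng(4).random((32, 32, 3))         # a CIFAR-10-sized image
out = augment_image(img)
```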

There are many different GAN loss functions and we elaborate on several of them in Section 2. Following Zhang et al. (2020), for each data set and model architecture combination, we conduct experiments using the loss function that achieves the best performance on baselines.

4.3 Unconditional GAN Models

We first test our techniques on the unconditional image generation task, which is to model images from an object-recognition data set without any reference to the underlying classes. We conduct experiments on the CIFAR-10 and CelebA data sets, using both DCGAN (Radford et al., 2015) and ResNet (He et al., 2016) GAN architectures.

4.3.1 DCGAN on CIFAR-10

Figure 2 presents the results of DCGAN on CIFAR-10 with the hinge loss. Vanilla Consistency Regularization (CR) (Zhang et al., 2020) outperforms all other baselines. Our Balanced Consistency Regularization (bCR) technique further improves on CR by several FID points. Our Latent Consistency Regularization (zCR) technique improves scores less than bCR does, but the improvement is still significant compared to the measurement variance. We set the bCR coefficients $\lambda_{\mathrm{real}}$ and $\lambda_{\mathrm{fake}}$, and the zCR hyper-parameters $\sigma_{\mathrm{noise}}$, $\lambda_{\mathrm{gen}}$, and $\lambda_{\mathrm{dis}}$, to their best-performing values.

Figure 2: FID scores for DCGAN trained on CIFAR-10 with the hinge loss, for a variety of regularization techniques. Consistency regularization significantly outperforms non-consistency regularizations. Adding Balanced Consistency Regularization causes a larger improvement than Latent Consistency Regularization, but both yield improvements much larger than measurement variances.

Figure 3: FID scores for a ResNet-style GAN trained on CIFAR-10 with the non-saturating loss, for a variety of regularization techniques. Contrary to the results in Figure 2, Latent Consistency Regularization outperforms Balanced Consistency Regularization, though they both substantially surpass all baselines.

4.3.2 ResNet on CIFAR-10

DCGAN-type models are well known, and it is encouraging that our techniques increase performance for those models, but they have been substantially surpassed in performance by newer architectures. We therefore also validate our methods on more recent architectures that use residual connections (He et al., 2016). Figure 3 shows unconditional image synthesis results on CIFAR-10 using a GAN model with residual connections and the non-saturating loss. Though both of our proposed modifications still outperform all baselines, Latent Consistency Regularization works better here than Balanced Consistency Regularization, contrary to the results in Figure 2. For hyper-parameters, we set the bCR coefficients $\lambda_{\mathrm{real}}$ and $\lambda_{\mathrm{fake}}$, and the zCR hyper-parameters $\sigma_{\mathrm{noise}}$, $\lambda_{\mathrm{gen}}$, and $\lambda_{\mathrm{dis}}$, to their best-performing values for this architecture.

4.3.3 DCGAN on CelebA

We also conduct experiments on the CelebA data set. The baseline model in this case is a DCGAN with the non-saturating loss. We again set the bCR coefficients $\lambda_{\mathrm{real}}$ and $\lambda_{\mathrm{fake}}$, and the zCR hyper-parameters $\sigma_{\mathrm{noise}}$, $\lambda_{\mathrm{gen}}$, and $\lambda_{\mathrm{dis}}$, to their best-performing values. The results are shown in Figure 4 and are overall similar to those in Figure 2. The improvements in performance for CelebA are not as large as those for CIFAR-10, but they are still substantial, suggesting that our methods generalize across data sets.

Figure 4: FID scores for DCGAN trained on CelebA with the non-saturating loss, for a variety of regularization techniques. Consistency regularization significantly outperforms all other baselines. Balanced Consistency Regularization further improves on Consistency Regularization by more than 2.0 in terms of FID, while Latent Consistency Regularization improves by around 1.0.

4.3.4 Improved Consistency Regularization

As alluded to above, we observe experimentally that combining Balanced Consistency regularization (bCR) and Latent Consistency Regularization (zCR) (into Improved Consistency Regularization (ICR)) yields results that are better than those given by either method alone. Using the above experimental results, we choose the best-performing hyper-parameters to carry out experiments for ICR, regularizing with both bCR and zCR. Table 1 shows that ICR yields the best results for all three unconditional synthesis settings we study. Moreover, the results of the ResNet model on CIFAR-10 are, to the best of our knowledge, the best reported results for unconditional CIFAR-10 synthesis.

CIFAR-10 CIFAR-10 CelebA
Methods (DCGAN) (ResNet) (DCGAN)
W/O 24.73 19.00 25.95
GP 25.83 19.74 22.57
DR 25.08 18.94 21.91
JSR 25.17 19.59 22.17
CR 18.72 14.56 16.97
ICR (ours) 15.87 13.36 15.43
Table 1: FID scores for Unconditional Image Synthesis with ICR. Our ICR achieves the best performance overall. Baselines are: not using regularization (W/O), Gradient Penalty (GP) (Gulrajani et al., 2017), DRAGAN (DR) (Kodali et al., 2017), Jensen-Shannon Regularizer (JSR) (Roth et al., 2017), and vanilla Consistency Regularization (CR) (Zhang et al., 2020).

4.4 Conditional GAN Models

Models CIFAR-10 ImageNet
SNGAN 17.50 27.62
BigGAN 14.73 8.73
CR-BigGAN 11.48 6.66
ICR-BigGAN (ours) 9.21 5.38
Table 2: FID scores for class-conditional image generation on CIFAR-10 and ImageNet. We compare our ICR technique with state-of-the-art GAN models including SNGAN (Miyato et al., 2018a), BigGAN (Brock et al., 2019), and CR-GAN (Zhang et al., 2020). The BigGAN implementation we use is from Kurach et al. (2019). The *-BigGAN variants have exactly the same architecture as the publicly available BigGAN and are trained with the same settings, but with our consistency regularization techniques added to the GAN losses. On CIFAR-10 and ImageNet, we improve the FID to 9.21 and 5.38 respectively, which are the best known scores at that model size.

In this section, we apply our consistency regularization techniques to the publicly available implementation of BigGAN (Brock et al., 2019) from Kurach et al. (2019). We compare it to baselines from Brock et al. (2019); Miyato et al. (2018a); Zhang et al. (2020). Note that the FID numbers from Wu et al. (2019) are based on a larger version of BigGAN, called BigGAN-Deep, with substantially more parameters than the original BigGAN, and are thus not comparable to the numbers we report here. On CIFAR-10, our techniques yield the best known FID score for conditional synthesis, 9.21. (A few papers report lower scores using the PyTorch implementation of FID; that implementation outputs numbers that are much lower and are not comparable to numbers from the official TensorFlow implementation.) On conditional image synthesis on the ImageNet data set, our technique yields an FID of 5.38. This is the best known score using the same number of parameters as the original BigGAN model, though the much larger model from Wu et al. (2019) achieves a better score. For both setups, we set the hyper-parameters $\lambda_{\mathrm{real}}$, $\lambda_{\mathrm{fake}}$, $\sigma_{\mathrm{noise}}$, $\lambda_{\mathrm{gen}}$, and $\lambda_{\mathrm{dis}}$ to their best-performing values.

5 Ablation Studies

5.1 How do the Hyper-Parameters for Balanced Consistency Regularization Affect Performance?

In Balanced Consistency Regularization (Algorithm 1), the cost associated with sensitivity to augmentations of the real images is weighted by $\lambda_{\mathrm{real}}$, and the cost associated with sensitivity to augmentations of the generated samples is weighted by $\lambda_{\mathrm{fake}}$. In order to better understand the interplay between these parameters, we train a DCGAN-type model with spectral normalization on the CIFAR-10 data set with the hinge loss, for many different values of $(\lambda_{\mathrm{real}}, \lambda_{\mathrm{fake}})$. The heat map in Figure 5 shows that it never pays to set either of the parameters to zero: this means that Balanced Consistency Regularization always outperforms vanilla consistency regularization (the baseline CR-GAN). Generally speaking, setting $\lambda_{\mathrm{real}}$ and $\lambda_{\mathrm{fake}}$ to similar magnitudes works well. This is encouraging, since it means that the performance of bCR is relatively insensitive to hyper-parameters.

Figure 5: Analysis of the effects of the $\lambda_{\mathrm{real}}$ and $\lambda_{\mathrm{fake}}$ hyper-parameters for Balanced Consistency Regularization. We train DCGAN on CIFAR-10 with the hinge loss, for many different values of $(\lambda_{\mathrm{real}}, \lambda_{\mathrm{fake}})$. The results show that Balanced Consistency Regularization essentially always outperforms vanilla consistency regularization. Generally speaking, Balanced Consistency Regularization performs best with $\lambda_{\mathrm{real}}$ and $\lambda_{\mathrm{fake}}$ of similar magnitudes.

5.2 Examining Artifacts Resulting from ‘Vanilla’ Consistency Regularization

Figure 15: Illustration of resolving generation artifacts with Balanced Consistency Regularization. Panels, one row per cutout size: (a, d, g) cutout-augmented training images, (b, e, h) CR samples, (c, f, i) bCR samples. The first column shows CIFAR-10 training images augmented with cutout of different sizes. The second column demonstrates that the vanilla CR-GAN (Zhang et al., 2020) can cause augmentation artifacts to appear in generated samples. This is because CR-GAN only applies consistency regularization to real images passed into the discriminator. In the last column (our Balanced Consistency Regularization: bCR in Algorithm 1), this issue is fixed by augmenting both real and generated fake images before they are fed into the discriminator.

To understand the augmentation artifacts resulting from using vanilla CR-GAN (Zhang et al., 2020), and to validate that Balanced Consistency Regularization removes those artifacts, we carry out a series of qualitative experiments using varying sizes for the cutout (DeVries and Taylor, 2017) augmentation. We experiment with cutouts of three different sizes, training both vanilla CR-GANs and GANs with Balanced Consistency Regularization.

The results are shown in Figure 15. Broadly speaking, we observe more substantial cutout artifacts (black rectangles) in samples from CR-GANs with larger cutout augmentations, and essentially no such artifacts for GANs trained with Balanced Consistency Regularization at the smaller cutout sizes (we examined several hundred samples from bCR-GANs manually in order to make this observation). We do observe a few artifacts at the largest cutout size, but far fewer than those from the vanilla CR-GAN. We believe that this phenomenon of introducing augmentation artifacts into generations likely holds for other types of augmentation, but it is much more difficult to confirm for less visible transforms, and in some cases it may not actually be harmful (e.g. flipping of images in most contexts).

5.3 How do the Hyper-Parameters for Latent Consistency Regularization Affect Performance?

Latent Consistency Regularization (Algorithm 2) has three hyper-parameters: $\sigma_{\mathrm{noise}}$, $\lambda_{\mathrm{gen}}$, and $\lambda_{\mathrm{dis}}$, which respectively govern the magnitude of the perturbation made to the draw from the prior, the weight of the sensitivity of the generator to that perturbation, and the weight of the sensitivity of the discriminator to that perturbation. From the view of the generator, intuitively, the extra loss term encourages $G(z)$ and $G(T(z))$ to be far away from each other.

We conduct experiments using a ResNet-style GAN on the CIFAR-10 data set with the non-saturating loss in order to better understand the interplay between these hyper-parameters. The results in Figure 16 show that a moderate value of the generator coefficient $\lambda_{\mathrm{gen}}$ works best (as measured by FID). This corresponds to encouraging the generator to be sensitive to perturbations of samples from the prior. For this experimental setup, perturbations with a small standard deviation $\sigma_{\mathrm{noise}}$ work best, and higher (but not extremely high) values of the discriminator coefficient $\lambda_{\mathrm{dis}}$ also perform better.

Figure 16: Analysis of the hyper-parameters of Latent Consistency Regularization. We conduct experiments using a ResNet-style GAN on CIFAR-10 with the non-saturating loss in order to better understand the interplay between $\sigma_{\mathrm{noise}}$, $\lambda_{\mathrm{gen}}$, and $\lambda_{\mathrm{dis}}$. The results show that a moderate value of the generator coefficient $\lambda_{\mathrm{gen}}$ works best. With the added term $-\lambda_{\mathrm{gen}} \|G(z) - G(T(z))\|^2$, the generator is encouraged to be sensitive to perturbations in latent space. For this set of experiments, we observe the best performance when adding perturbations with a small standard deviation, and higher (but not extremely high) values of the discriminator coefficient further improve performance.

6 Related Work

There is so much related work on GANs (Goodfellow et al., 2014) that it is impossible to do it justice here (see Odena (2019); Kurach et al. (2019) for different overviews of the field), but we sketch out a few threads. There is a several-year-long thread of work on scaling GANs up to do conditional image synthesis on the ImageNet-2012 data set, beginning with Odena et al. (2017), extending through Miyato et al. (2018a); Zhang et al. (2019); Brock et al. (2019); Daras et al. (2019), and most recently culminating in Wu et al. (2019) and Zhang et al. (2020), which presently represent the state-of-the-art models at this task (Wu et al. (2019) uses a larger model size than Zhang et al. (2020) and correspondingly reports better scores). There is a separate thread of more ‘graphics-focused’ work on GANs that tends not to use the same benchmarks and is hard to compare with directly (Karras et al., 2018, 2019a, 2019b), but nevertheless produces interesting and impressive results. Finally, as GANs are known to be hard to train for a variety of reasons, there is a substantial amount of work (Metz et al., 2016; Arjovsky et al., 2017; Gulrajani et al., 2017; Olsson et al., 2018; Sinha et al., 2019) dedicated to fixing these issues, understanding them better, or more accurately measuring the quality of GAN outputs.

Most related work on consistency regularization is from the semi-supervised learning literature, and focuses on regularizing model predictions to be invariant to small perturbations (Bachman et al., 2014; Sajjadi et al., 2016; Laine and Aila, 2016; Miyato et al., 2018b; Xie et al., 2019) for the purpose of learning from limited labeled data. Wei et al. (2018); Zhang et al. (2020) apply related ideas to training GAN models and observe initial gains, which motivates this work.

7 Conclusion

Extending the recent success of consistency regularization in GANs (Wei et al., 2018; Zhang et al., 2020), we present two novel improvements: Balanced Consistency Regularization, in which generator samples are also augmented along with training data, and Latent Consistency Regularization, in which draws from the prior are perturbed, and the sensitivity to those perturbations is discouraged and encouraged for the discriminator and the generator, respectively.

In addition to fixing a new issue we observe with the vanilla Consistency Regularization (augmentation artifacts in samples), our techniques yield the best known FID numbers for both unconditional and conditional image synthesis on the CIFAR-10 data set. They also achieve the best FID numbers (with the fixed number of parameters used in the original BigGAN (Brock et al., 2019) model) for conditional image synthesis on ImageNet.

These techniques are simple to implement, not particularly computationally burdensome, and relatively insensitive to hyper-parameters. We hope they become a standard part of the GAN training toolkit and enable more interesting applications of GANs across many domains.

Acknowledgments

We would like to thank Colin Raffel and Pouya Pezeshkpour for helpful discussions.


  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875. Cited by: §1, §2.1, §6.
  • P. Bachman, O. Alsharif, and D. Precup (2014) Learning with pseudo-ensembles. In Advances in neural information processing systems, pp. 3365–3373. Cited by: §6.
  • D. Berthelot, N. Carlini, I. J. Goodfellow, N. Papernot, A. Oliver, and C. Raffel (2019) MixMatch: A holistic approach to semi-supervised learning. NeurIPS. Cited by: §2.2.
  • A. Brock, J. Donahue, and K. Simonyan (2019) Large scale GAN training for high fidelity natural image synthesis. In ICLR, Cited by: Improved Consistency Regularization for GANs, §1, §1, §2.1, §4.1, §4.4, Table 2, §6, §7.
  • G. Daras, A. Odena, H. Zhang, and A. G. Dimakis (2019) Your local GAN: designing two dimensional local attention mechanisms for generative models. arXiv preprint arXiv:1911.12287. Cited by: §6.
  • T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §5.2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §2.1, §6.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of Wasserstein GANs. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §1, §2.1, §4.1, Table 1, §6.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.3.2, §4.3.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, Cited by: §4.2.
  • T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive growing of GANs for improved quality, stability, and variation. In ICLR, Cited by: §4.2, §6.
  • T. Karras, S. Laine, and T. Aila (2019a) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §6.
  • T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2019b) Analyzing and improving the image quality of StyleGAN. arXiv preprint arXiv:1912.04958. Cited by: §6.
  • N. Kodali, J. Abernethy, J. Hays, and Z. Kira (2017) On convergence and stability of GANs. arXiv preprint arXiv:1705.07215. Cited by: §4.1, Table 1.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Citeseer. Cited by: §4.2.
  • K. Kurach, M. Lucic, X. Zhai, M. Michalski, and S. Gelly (2019) A large-scale study on regularization and normalization in GANs. In ICML, Cited by: §4.1, §4.2, §4.4, Table 2, §4, §6.
  • S. Laine and T. Aila (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242. Cited by: §2.2, §6.
  • J. H. Lim and J. C. Ye (2017) Geometric GAN. arXiv preprint arXiv:1705.02894. Cited by: §2.1.
  • M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet (2018) Are GANs created equal? a large-scale study. In Advances in neural information processing systems, pp. 700–709. Cited by: §4.1.
  • L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein (2016) Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163. Cited by: §6.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018a) Spectral normalization for generative adversarial networks. In ICLR, Cited by: §1, §2.1, §4.1, §4.4, Table 2, §6.
  • T. Miyato, S. Maeda, S. Ishii, and M. Koyama (2018b) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.2, §6.
  • A. Odena, J. Buckman, C. Olsson, T. B. Brown, C. Olah, C. Raffel, and I. Goodfellow (2018) Is generator conditioning causally related to GAN performance?. arXiv preprint arXiv:1802.08768. Cited by: §3.2.
  • A. Odena, C. Olah, and J. Shlens (2017) Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2642–2651. Cited by: §6.
  • A. Odena (2019) Open questions about generative adversarial networks. Distill 4 (4), pp. e18. Cited by: §6.
  • A. Oliver, A. Odena, C. A. Raffel, E. D. Cubuk, and I. Goodfellow (2018) Realistic evaluation of deep semi-supervised learning algorithms. In NeurIPS, pp. 3235–3246. Cited by: §1.
  • C. Olsson, S. Bhupatiraju, T. Brown, A. Odena, and I. Goodfellow (2018) Skill rating for generative models. arXiv preprint arXiv:1808.04888. Cited by: §6.
  • A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §4.3.
  • K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann (2017) Stabilizing training of generative adversarial networks through regularization. In Advances in neural information processing systems, pp. 2018–2028. Cited by: §4.1, Table 1.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §4.2.
  • M. Sajjadi, M. Javanmardi, and T. Tasdizen (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NeurIPS, Cited by: §2.2, §6.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training GANs. In Advances in neural information processing systems, pp. 2234–2242. Cited by: Appendix A, §1, §4.2.
  • S. Sinha, H. Zhang, A. Goyal, Y. Bengio, H. Larochelle, and A. Odena (2019) Small-GAN: speeding up GAN training using core-sets. arXiv preprint arXiv:1910.13540. Cited by: §6.
  • D. Tran, R. Ranganath, and D. M. Blei (2017) Deep and hierarchical implicit models. arXiv preprint arXiv:1702.08896 7, pp. 3. Cited by: §2.1.
  • C. Villani (2008) Optimal transport: old and new. Vol. 338, Springer Science & Business Media. Cited by: §2.1.
  • X. Wei, B. Gong, Z. Liu, W. Lu, and L. Wang (2018) Improving the improved training of Wasserstein GANs: a consistency term and its dual effect. arXiv preprint arXiv:1803.01541. Cited by: §1, §1, §2.1, §2.2, §6, §7.
  • Y. Wu, J. Donahue, D. Balduzzi, K. Simonyan, and T. Lillicrap (2019) LOGAN: latent optimisation for generative adversarial networks. arXiv preprint arXiv:1912.00953. Cited by: §4.1, §4.4, §6.
  • Q. Xie, Z. Dai, E. Hovy, M. Luong, and Q. V. Le (2019) Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848. Cited by: §2.2, §6.
  • D. Yang, S. Hong, Y. Jang, T. Zhao, and H. Lee (2019) Diversity-sensitive conditional generative adversarial networks. In ICLR, Cited by: §3.2.
  • X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer (2019) S4L: self-supervised semi-supervised learning. arXiv preprint arXiv:1905.03670. Cited by: §2.2.
  • H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In ICML, Cited by: §2.1, §6.
  • H. Zhang, Z. Zhang, A. Odena, and H. Lee (2020) Consistency regularization for generative adversarial networks. In ICLR, Cited by: Figure 24, Figure 28, Appendix B, Improved Consistency Regularization for GANs, Figure 1, §1, §1, §2.2, §3, §4.1, §4.2, §4.2, §4.3.1, §4.4, Table 1, Table 2, Figure 15, §5.2, §6, §6, §7.
  • Z. Zhao, D. Dua, and S. Singh (2018) Generating natural adversarial examples. In ICLR, Cited by: §3.2.

Appendix A Evaluation with Inception Score

Inception Score (IS) is another GAN evaluation metric, introduced by Salimans et al. (2016). Here, we compare the Inception Score of the unconditional generated samples on CIFAR-10 and CelebA for the experiments in Section 4.3. As shown in Table 3, our Improved Consistency Regularization achieves the best IS result with both SNDCGAN and ResNet architectures.
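For reference, IS is the exponentiated average KL divergence between the classifier's conditional label distribution p(y|x) and its marginal p(y); in the actual metric the probabilities come from an Inception network. A minimal sketch given precomputed class probabilities (the function name and eps smoothing are illustrative) might look like:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: (N, C) array of class probabilities p(y|x) for N samples.
    # IS = exp( E_x[ KL(p(y|x) || p(y)) ] ), where p(y) is the marginal
    # obtained by averaging p(y|x) over the sample set.
    p_y = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Sanity checks: uniform predictions give the minimum score of 1, while
# confident predictions spread evenly over classes give the class count.
uniform = np.full((8, 4), 0.25)
one_hot = np.tile(np.eye(4), (2, 1))
```

Higher IS thus rewards samples that are individually classified confidently while collectively covering many classes.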

Figure 20: Inception Scores. (a) SNDCGAN on CIFAR-10 with hinge loss; (b) ResNet on CIFAR-10 with non-saturating loss; (c) SNDCGAN on CelebA with non-saturating loss.
Methods      CIFAR-10 (SNDCGAN)   CIFAR-10 (ResNet)   CelebA (SNDCGAN)
W/O          7.54                 8.20                2.23
GP           7.54                 8.04                2.38
DR           7.54                 8.09                2.38
JSR          7.52                 8.03                2.17
CR           7.93                 8.40                2.48
ICR (ours)   8.14                 8.55                2.64
Table 3: Best Inception Scores of unconditional image generation.

Appendix B Qualitative Examples

We randomly sample from our ICR-BigGAN model on ImageNet (FID=5.38, Section 4.4) as qualitative examples for different class labels. We obtained permission from the authors of CR-GAN (Zhang et al., 2020) to directly use the visualization of random samples from their CR-BigGAN model (FID=6.66) for comparison. In the following figures, the left column shows random samples from our ICR-BigGAN, while the right column shows those from the baseline CR-BigGAN.

(a) Monarch Butterfly (our ICR vs baseline CR)
(b) Cock (our ICR vs baseline CR)
(c) Blenheim Spaniel (our ICR vs baseline CR)
Figure 24: Random ImageNet samples from our ICR-BigGAN (Section 4.4, FID 5.38) vs CR-BigGAN (Zhang et al. (2020), FID 6.66).
(a) Cheeseburger (our ICR vs baseline CR)
(b) Ambulance (our ICR vs baseline CR)
(c) Beer Bottle (our ICR vs baseline CR)
Figure 28: Random ImageNet samples from our ICR-BigGAN (Section 4.4, FID 5.38) vs CR-BigGAN (Zhang et al. (2020), FID 6.66).