1 Introduction
Generative Adversarial Networks (GANs; Goodfellow et al., 2014) have reshaped the state of machine learning in tasks that involve generating data. A GAN is an unsupervised method that consists of two neural networks, a generator and a discriminator, with opposing (or adversarial) objectives. The typical goal of the generator is to transform noise (e.g., drawn from a normal distribution) into samples whose statistical and structural characteristics closely match those of an empirical target dataset (such as a collection of images). The discriminator, which acts as an adversary to the generator, must classify samples as coming from either the real data or the generator.
While GANs can achieve impressive qualitative performance (most notably with image data; see, e.g., Roth et al., 2017; Miyato et al., 2018; Karras et al., 2017), the most successful methods depart from the original formulation to address various instabilities and other optimization difficulties (Arjovsky and Bottou, 2017; Arjovsky, Chintala, and Bottou, 2017). One such difficulty in training GANs occurs when the generator produces samples from only a small subset of the target distribution, a phenomenon known as missing modes (a.k.a. mode dropping; see, e.g., Che et al., 2016). Numerous works try to address the problem by modifying the original objective, such as unrolling (Metz et al., 2016), aggregating samples (Lin et al., 2017), stacked architectures (Huang et al., 2016; Karras et al., 2017), mutual information / entropy maximization (Belghazi et al., 2018), multiple discriminators (Neyshabur, Bhojanapalli, and Chakrabarti, 2017; Juefei-Xu, Boddeti, and Savvides, 2017), or multiple generators (Tolstikhin et al., 2017; Hoang et al., 2017; Kwak and Zhang, 2016).
In our work, we follow the intuition that missing modes in GANs are due in part to mode-specific vanishing gradients. As a simple illustrative example, which we explore in detail in our experiments below (Fig. 1), consider a discriminator that represents the target distribution well and a generator that generates only a subset of the modes in the data. If any of the missing modes are disjoint from those represented in the generator (i.e., are composed of sets of features with low intersection), there is no way for the generator to receive gradient signal on missing modes from the discriminator.
However, if the discriminator only represents the data approximately (in the sense that it also cannot fully distinguish between these modes), it may be possible to recover the missing mode gradient signal. Although this can be achieved by using a low-capacity discriminator (throughout the paper, capacity refers to the architecture size of a given neural network in terms of number of parameters), doing so alone is ultimately undesirable, given that the end goal is to generate samples that closely resemble the target dataset. From now on, we will refer to such low-capacity discriminators as weak and to high-capacity discriminators as strong. In order to ensure both high quality and mode coverage, we consider multiple discriminators (as in Durugkar, Gemp, and Mahadevan, 2016) with different strengths to train the generator. We propose to train the generator using a curriculum based on an online multi-armed bandit algorithm (Matiisen et al., 2017; Graves et al., 2017), dynamically changing the weight/resources allocated to each discriminator, which we show is crucial for achieving good results. Our primary contributions are:

We provide important insights into the missing mode problem as demonstrated by the gradient signal available to the generator from the discriminator.

As a potential solution to the missing modes problem, we introduce a new framework based on resource allocation with adversarial bandits (Littlestone and Warmuth, 1994; Auer et al., 1995; Freund and Schapire, 1997), where the generator gets its training signal from a set of teacher networks with increasing capacity.

We show that the proposed approach leads to a curriculum learning characterized by successive phases of the generator prioritizing different discriminators.
The remainder of this paper is organized as follows. Previous literature relevant to this work is briefly reviewed in Section 2. The proposed approach is formally introduced in Section 3, and an empirical analysis is reported in Section 4. Conclusions and future directions are presented in Section 5.
2 Related Work
Mode coverage and data / model augmentation
The intuition that missing modes are due to vanishing gradients resonates with some successful approaches to stabilizing and improving GAN training through data and model augmentation. Instance noise (Arjovsky and Bottou, 2017) has been shown to improve stability (see also Roth et al., 2017), which can be understood as smoothing the data modes in the pixel space. Progressively reducing the downsampling through training (either by copying parameters or feeding low-resolution samples into a larger generator) has also been considered previously (Huang et al., 2016; Karras et al., 2017) as a solution to increase mode overlap. This is akin to a handcrafted curriculum, progressively increasing the difficulty of the problem at a priori chosen points in the complete training procedure.
Multiple discriminators and generators
Several works have also incorporated multiple generators or discriminators in order to improve learning. Multiple-generator methods (Tolstikhin et al., 2017; Hoang et al., 2017; Kwak and Zhang, 2016) typically work by encouraging the generators to divide the task of generating by modes in the target dataset (without additional supervision). Using multiple discriminators (Neyshabur, Bhojanapalli, and Chakrabarti, 2017; Juefei-Xu, Boddeti, and Savvides, 2017), on the other hand, is known to provide a better learning signal for the generator if said discriminators compositionally represent the target dataset well. Closest to our work, Durugkar, Gemp, and Mahadevan (2016) consider discriminators of different complexity to provide varied signal. We will show that carefully designing the reward allows us to track the progress made by the generator and encourages a learning curriculum.
Multi-armed bandit as a curriculum learning method for GANs
Curriculum learning (Bengio et al., 2009) phrases a given machine learning problem as a set of tasks of increasing difficulty. GANs can also be said to share aspects with curriculum learning: the discriminator defines an objective of progressive difficulty, thus allowing the generator to gradually learn to mimic the target distribution more faithfully. However, there is no explicit mechanism to encourage a sensible curriculum for either model. For example, if the discriminator learns to represent disjoint modes faster than the generator learns to cover them, this can lead to the generator missing modes with no gradient signal to recover.
In this paper, we propose an algorithm which gives rise to a curriculum in a direct manner. Our approach borrows from curriculum learning in the multi-armed bandit setting (Matiisen et al., 2017; Graves et al., 2017), where learning is typically done by measuring the change in a performance criterion of a given agent (e.g., a loss function, score, or gradient norm can be used) that appears to affect the form of the optimal policy. In our method, given a set of discriminators, the goal is to weight the feedback received by the generator proportionally to the information contained in the gradients from each discriminator.
3 Adaptative Curriculum GAN
Here we formulate the problem and approach for training a single generator on a target dataset using a curriculum over multiple discriminators, which we call Adaptative Curriculum GAN (acGAN). First, define a generator function, $G: \mathcal{Z} \to \mathcal{X}$, which maps noise from a domain $\mathcal{Z}$ to the domain of a target dataset, $\mathcal{X}$ (such as the space of images). Let $p_{\text{data}}$ denote the target density (here, we assume for the sake of notation that the target data admits a density), and let $p_z$ denote the prior density defined on $\mathcal{Z}$ used to draw noise samples for input into the generator. We wish to train this generator function using $N$ discriminators, $D_1, \dots, D_N$, such that on each episode $t$, we select the mixture of discriminators that provides the best learning signal.
3.1 Mixing discriminators
This mixture-of-experts problem, where each discriminator plays the role of a teacher, can be tackled under the full-information adversarial bandit setting (Littlestone and Warmuth, 1994; Freund and Schapire, 1997; Auer et al., 1995). On each episode $t$, a bandit player associates normalized weights $\pi(t) = (\pi_1(t), \dots, \pi_N(t))$ with discriminators $D_1, \dots, D_N$. The generator is then trained based on the mixture described by $\pi(t)$, and a reward $r_i(t)$ is observed for each discriminator $D_i$, characterizing the generator’s improvement with respect to $D_i$. Let $R(T)$ denote the total observed reward at time $T$. The goal of the player is to learn the optimal policy $\pi^* \in \Delta_N$ that maximizes the expected total reward ($\Delta_N$ denotes the standard simplex on $\mathbb{R}^N$).
The Hedge algorithm (Freund and Schapire, 1997), also known as the Boltzmann or Gibbs distribution, addresses this full-information game by maintaining probabilities
$$\pi_i(t) = \frac{\exp\big(\beta\, Q_i(t)\big)}{\sum_{j=1}^{N} \exp\big(\beta\, Q_j(t)\big)} \tag{1}$$
for each discriminator $D_i$, where $Q_i(t)$ estimates the gain of $D_i$ at episode $t$. In this case, $\beta$ is a parameter of the distribution: $\beta = 0$ corresponds to a uniform distribution over all models. We found experimentally that using a moving average on previous rewards (which also featured in Matiisen et al., 2017) stabilizes the training:
$$Q_i(t) = \alpha\, r_i(t) + (1 - \alpha)\, Q_i(t-1) \tag{2}$$
where $\alpha$ is the smoothing parameter.
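The weight update described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the reward values, and the exact form of the moving average (an exponential moving average with smoothing parameter `alpha`) are assumptions made for the example.

```python
import numpy as np

def update_gains(q_prev, rewards, alpha):
    """Exponential moving average of rewards (assumed form of the smoothing step)."""
    q_prev = np.asarray(q_prev, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    return alpha * rewards + (1.0 - alpha) * q_prev

def hedge_weights(q, beta):
    """Boltzmann (softmax) distribution over discriminators with coefficient beta."""
    logits = beta * np.asarray(q, dtype=float)
    logits = logits - logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Hypothetical rewards for 3 discriminators over two episodes.
q = np.zeros(3)
for rewards in ([0.2, 0.1, 0.0], [0.3, 0.1, -0.1]):
    q = update_gains(q, rewards, alpha=0.5)
    pi = hedge_weights(q, beta=5.0)

uniform = hedge_weights(q, beta=0.0)  # beta = 0 ignores the gains entirely
```

A larger `beta` concentrates the mixture on the discriminator with the highest estimated gain, while a smaller `alpha` makes the gains react more slowly to new rewards.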
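To demonstrate how this can be used to train GANs, consider the usual value function (Goodfellow et al., 2014):
$$V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big] \tag{3}$$
On each episode $t$, given the mixture of discriminators $\pi(t)$, each discriminator $D_i$ is trained by taking a gradient step to increase the expected value function
$$\mathbb{E}_{i \sim \pi(t)}\big[V(D_i, G)\big] = \sum_{i=1}^{N} \pi_i(t)\, V(D_i, G) \tag{4}$$
and the generator is trained by taking a gradient step to increase
$$\sum_{i=1}^{N} \pi_i(t)\, \mathbb{E}_{z \sim p_z}\big[\log D_i(G(z))\big] \tag{5}$$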
The latter corresponds to the non-saturating version of Eq. 4 for the generator. The intuition is that training the generator with all the discriminators simultaneously (as a mixture) should force the generator to fool all discriminators at the same time (Durugkar, Gemp, and Mahadevan, 2016). Since the discriminators view the distribution of modes at increasing levels of detail, they should play complementary roles. While the weaker discriminators focus on mode coverage, the stronger discriminators ensure sample quality (shown in Section 4.1). This should result in better overall coverage of the modes in the input distribution.
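As a rough illustration of the mixture objective for the generator, the sketch below computes a weighted, non-saturating objective from discriminator outputs on a batch of generated samples. The function name, the score values, and the weights are hypothetical, and the weighted sum is an assumed reading of the generator objective described above.

```python
import numpy as np

def mixture_generator_objective(disc_scores, pi, eps=1e-12):
    """Weighted non-saturating objective: sum_i pi_i * mean_z log D_i(G(z)).

    disc_scores has shape (N, batch): row i holds D_i(G(z)) for a batch of noise z.
    Returns the mixture objective and the per-discriminator terms.
    """
    disc_scores = np.asarray(disc_scores, dtype=float)
    pi = np.asarray(pi, dtype=float)
    per_disc = np.log(disc_scores + eps).mean(axis=1)
    return float(pi @ per_disc), per_disc

# Hypothetical outputs of 3 discriminators (weak to strong) on 4 generated samples.
scores = np.array([
    [0.60, 0.50, 0.70, 0.40],   # weak discriminator: easier to fool
    [0.30, 0.20, 0.40, 0.30],
    [0.10, 0.05, 0.20, 0.10],   # strong discriminator: harder to fool
])
pi = np.array([0.5, 0.3, 0.2])
objective, per_disc = mixture_generator_objective(scores, pi)
only_weak, _ = mixture_generator_objective(scores, [1.0, 0.0, 0.0])
```

The generator would take a gradient step to increase `objective`; putting all the weight on one discriminator recovers the usual single-discriminator non-saturating objective.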
Algorithm 1 describes our proposed acGAN procedure. We denote and parameterize this algorithm as where .
Remark 1.
At the beginning of the training, we define a warm-up period $T_w$, prior to which we train $G$ and $D_1, \dots, D_N$ with a uniform probability, i.e., $\pi_i(t) = 1/N$. In other words, we consider $\beta = 0$ for $t < T_w$. This guarantees that each discriminator is updated a minimum number of times (or provides feedback a minimum number of times to the generator) and prevents one from dominating the others (i.e., $\pi_i(t) \approx 1$) at the beginning of the training. Without this safeguard, the remaining weights would hardly recover a significant probability and the generator might never get an informative gradient from the corresponding discriminators. Note that warm-ups are not uncommon in bandit algorithms either, e.g., for adding robustness to the tails of reward distributions (Baransi, Maillard, and Mannor, 2014).
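The warm-up rule in this remark can be sketched as a small wrapper around the softmax weights. The names `warmup` and `mixture_weights`, and the gain values, are illustrative assumptions and not part of the original algorithm listing.

```python
import numpy as np

def mixture_weights(q, beta, t, warmup):
    """Uniform weights during the warm-up period, Boltzmann weights afterwards."""
    q = np.asarray(q, dtype=float)
    n = q.shape[0]
    if t < warmup:
        # Equivalent to using beta = 0: every discriminator gets probability 1/N.
        return np.full(n, 1.0 / n)
    logits = beta * q
    logits = logits - logits.max()  # numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Hypothetical gains that strongly favor the first discriminator.
q = np.array([2.0, 0.1, 0.0])
during_warmup = mixture_weights(q, beta=5.0, t=10, warmup=100)
after_warmup = mixture_weights(q, beta=5.0, t=100, warmup=100)
```

During the warm-up the large gain of the first discriminator is ignored, so the other discriminators still receive updates and provide feedback.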
3.2 Reward shaping
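In order to provide meaningful feedback for learning efficient mixtures of discriminators, we consider different reward functions to generate $r_i(t)$. We argue that progress (i.e., the learning slope; Matiisen et al., 2017; Graves et al., 2017) of the generator is a more sensible way to evaluate our policy. Let $\theta_t$ be the generator parameters at episode $t$. We define the two following quantities for measuring generator progress:
$$r_i(t) = \mathbb{E}_{z \sim p_z}\big[D_i(G_{\theta_{t+1}}(z))\big] - \mathbb{E}_{z \sim p_z}\big[D_i(G_{\theta_t}(z))\big] \tag{6}$$
$$r_i(t) = V(D_i, G_{\theta_t}) - V(D_i, G_{\theta_{t+1}}) \tag{7}$$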
The former measures the progress of the generator with respect to the discriminator score, while the latter assesses the change in the loss function (Eq. 3). Since the change in sample quality (Eq. 6) led to better performance than the change in the loss function (Eq. 7), all our experiments (see Section 4) use Eq. 6.
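A sample-quality reward of this kind can be estimated from discriminator outputs before and after a generator update. The sketch below is an assumption-laden illustration: the function name, the use of batch means as estimates of the expectations, and the numbers are all hypothetical.

```python
import numpy as np

def score_progress_reward(scores_before, scores_after):
    """Per-discriminator change in the mean score assigned to generated samples.

    Both inputs have shape (N, batch): row i holds D_i(G(z)) on a batch of noise,
    evaluated with the generator parameters before and after one update.
    """
    before = np.asarray(scores_before, dtype=float).mean(axis=1)
    after = np.asarray(scores_after, dtype=float).mean(axis=1)
    return after - before

# Hypothetical scores from 2 discriminators on 3 generated samples.
before = np.array([[0.40, 0.50, 0.60],
                   [0.10, 0.20, 0.30]])
after = np.array([[0.50, 0.60, 0.70],
                  [0.10, 0.20, 0.30]])
rewards = score_progress_reward(before, after)
```

Here the first discriminator reports progress (a positive reward) while the second reports none, so a bandit driven by these rewards would shift weight toward the first.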
3.3 Connection to existing methods
Interestingly, some existing methods in the GAN literature can be seen as a specific case of acGAN:
GMAN:
The original GMAN (Durugkar, Gemp, and Mahadevan, 2016) algorithm can be recovered by setting and taking the loss function to be the reward . Note how the authors of GMAN call their algorithm GMAN, where is also the Boltzmann coefficient.
Uniform:
The uniform case is defined by assigning a fixed uniform probability to each discriminator $D_i$: $\pi_i(t) = 1/N$. This corresponds to Eq. 1 with $\beta = 0$.
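To see why the uniform case follows from the Boltzmann form, one can substitute a zero coefficient into the softmax weights. The symbols $Q_i(t)$ for the estimated gains, $\beta$ for the coefficient, and $N$ for the number of discriminators are notational assumptions used only for this short derivation.

```latex
\pi_i(t)
  = \frac{\exp\big(0 \cdot Q_i(t)\big)}{\sum_{j=1}^{N} \exp\big(0 \cdot Q_j(t)\big)}
  = \frac{1}{\sum_{j=1}^{N} 1}
  = \frac{1}{N}
```

The weights therefore no longer depend on the rewards, which is exactly the fixed uniform mixture.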
To support the results of our theoretical work, we conducted a set of experiments which we describe below.
4 Experiments
In this section, we first give an understanding of how each discriminator provides informative feedback to the generator. We then compare our proposed approach (acGAN) against existing methods from the literature.
4.1 Retaining mode information through weaker capacity discriminators and smoothness
We begin by analyzing the gradient norm of the discriminator networks, and we show that weak-capacity discriminators are smoother than strong discriminators. This property corresponds to a "coarse-grained" representation of the distribution, which allows the generator to recover missing modes. We further show that we can increase the smoothness of a weak discriminator by corrupting its inputs with white noise. This results in an increase of the discriminator’s entropy (see Supplementary Material for more details) and hence a smoother gradient signal.
Weak Discriminators: a way to retain modes
We now highlight the role of weaker-capacity discriminators. To this end, we performed the following experiments on the 8-Gaussian synthetic dataset:

We pretrained the generator (with dense layers of units with ReLU activation layers except for the last layer) with one discriminator on only of the original modes.
We trained a (vanilla) GAN on all Gaussian components, initializing with the mode generator above. The discriminator had dense layers of units (ReLU hidden activation layers).

We trained acGAN with the generator initialized with the mode generator (as with vanilla GAN). We considered discriminators, with , and dense layers respectively (same activation scheme as previously applies here).
Results (Fig. 3) show the Vanilla GAN could only retrieve additional modes, while acGAN recovered all 8 modes. We examined the gradients provided by the discriminators using a density plot (Fig. 4) of the gradient norm of each discriminator with respect to the input, i.e., $\lVert \nabla_x D_i(x) \rVert$ for each discriminator $D_i$. Observe that there is a clear progression from the stronger discriminator, with more distinct, higher gradients, to the weaker discriminator, with smoother gradients. Additionally, note that the discriminator from the vanilla GAN, which has very high gradient norm values, has gradients for modes not present in the generator: the discriminator has information useful for learning about these missing modes, but the generator does not learn these modes due to vanishing gradients. Our results support both our original hypothesis that missing modes are due to vanishing gradients and the claim that a coarse-grained discriminator can be used to recover missing modes. To provide further insight, we show the evolution of the gradient norm of each discriminator at training time in the Supplementary Material. We also note that the discontinuities in the gradients are due to the ReLU activation partitioning the subspace through overlapping half-planes, which contrasts with the smooth decay of hyperbolic tangent and sigmoid nonlinearities, and we further explore the effect of different nonlinear activation layers on the gradient norm of the weak discriminator in the Supplementary Material.
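The quantity plotted in such a density map, the norm of the discriminator's gradient with respect to its input, can be computed in closed form for a small network. The sketch below uses a hypothetical one-hidden-layer ReLU discriminator with random weights and a sigmoid output; the architecture, sizes, and grid are assumptions for illustration and do not reproduce the experimental setup.

```python
import numpy as np

def discriminator(x, W1, b1, w2, b2):
    """One-hidden-layer ReLU discriminator with a sigmoid output."""
    h = np.maximum(0.0, W1 @ x + b1)
    logit = w2 @ h + b2
    return 1.0 / (1.0 + np.exp(-logit))

def input_gradient_norm(x, W1, b1, w2, b2):
    """Norm of dD/dx, using the analytic gradient of the network above."""
    pre = W1 @ x + b1
    d = discriminator(x, W1, b1, w2, b2)
    active = (pre > 0).astype(float)            # ReLU derivative (0 or 1 per unit)
    grad = d * (1.0 - d) * (W1.T @ (w2 * active))
    return float(np.linalg.norm(grad))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)
w2, b2 = rng.normal(size=16), 0.0

# Evaluate the gradient norm on a coarse 2D grid, as one would for a density plot.
xs = np.linspace(-2.0, 2.0, 5)
grid = np.array([[input_gradient_norm(np.array([a, b]), W1, b1, w2, b2)
                  for a in xs] for b in xs])
```

Because the ReLU derivative switches between 0 and 1 as units turn on and off, the resulting map is piecewise smooth, which is consistent with the discontinuities noted above.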
4.2 Performance of acGAN against existing baselines
In this section, we evaluate the performance of our proposed method (acGAN) on various datasets. All experiments consider the reward shown in Eq. 6. We first conducted a sanity check on 2 mode-dropping datasets: synthetic data consisting of a mixture of 25 Gaussians and StackedMNIST with 1000 modes. We then tested it on CIFAR-10 and finally show generated samples on the CelebA dataset (see Supplementary Material).
We aim to analyze specific properties such as the diversity of generated samples and their quality in terms of the Fréchet Inception Distance (FID; Heusel et al., 2017) when available, along with the convergence of the method (how fast it reaches its minimum FID score). Additionally, our results hint at the emergence of a curriculum during the training process.
All parameters used to obtain the results can be found in the Supplementary Material.
We split the batch of inputs between discriminators. With a slight abuse of language, we use the term epoch to mean that the generator has been trained for a number of iterations equivalent to an epoch. For example, CIFAR-10 has 50,000 training images and, assuming a batch size of 64, one epoch represents roughly 781 iterations for the generator.
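The epoch convention above amounts to a simple integer division, sketched here with a hypothetical helper name:

```python
def generator_iterations_per_epoch(num_train_images, batch_size):
    """Number of full batches the generator sees in one pass over the training set."""
    return num_train_images // batch_size

# CIFAR-10 training set with a batch size of 64.
iters = generator_iterations_per_epoch(50_000, 64)  # 781 (50,000 / 64 = 781.25)
```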
Method  |  FD  |  Modes  |  Quality samples
Vanilla GAN  |  |  |
Uniform (3D)  |  |  |
acGAN (3D)  |  |  |
Synthetic Gaussian mixture dataset
The synthetic dataset is a mixture of 25 bivariate Gaussians arranged in a two-dimensional grid. We launched a single run of 15 epochs for all methods with 3 discriminators. We report 3 measures in Table 1: the Fréchet Distance (FD), the number of recovered modes, and the proportion of high-quality samples (the proportion of samples covering a mode). More details on these metrics can be found in the Supplementary Material.
We compared the performance of our proposed methods to that of the Uniform algorithm and of the vanilla GAN (Goodfellow et al., 2014). Our proposed methods could cover the 25 modes. KDE plots for the 3 discriminators case are shown in Fig. 5.
Method  |  Modes (max 1000)  |  KL
DCGAN (Radford, Metz, and Chintala, 2015)  |  |
ALI (Dumoulin et al., 2016)  |  |
Unrolled GAN (Metz et al., 2016)  |  |
VEEGAN (Srivastava et al., 2017a)  |  |
PacGAN (Lin et al., 2017)  |  |
GAN+MINE (Belghazi et al., 2018)  |  |
acGAN (3D)  |  |
acGAN (5D)  |  |
StackedMNIST
We use the StackedMNIST dataset (Srivastava et al., 2017b) to measure the mode coverage of our proposed approach. The dataset is generated by stacking 3 randomly selected digits from the MNIST dataset, one on each RGB channel, to produce a final RGB tensor. The dataset has 128,000 training images and is assumed to have 1000 modes. We report our results (averaged over 10 runs) in Table 2 and compare them with other existing baselines in the literature. Our method could recover all 1000 modes, like PacGAN (Lin et al., 2017) and MINE (Belghazi et al., 2018); these two approaches increase the dimensionality of the generator inputs, either by packing multiple samples or by adding a latent code vector, which helps overcome mode collapse. Generated samples are shown in Fig. 6. Our results further support our hypothesis that acGAN is a sensible approach to ensuring good mode coverage and sample quality.
CIFAR-10
We conducted an in-depth study of acGAN’s performance on CIFAR-10 by running experiments on 5 independent seeds for 50 epochs each.
We found a particular pattern in acGAN’s learning process: it consists of distinct regimes where one discriminator’s weight dominates over the others. To illustrate this, we averaged the sampling probability of each discriminator over every 200 iterations and plotted the results in Fig. 7 for 3 and 5 discriminators, respectively. The reported curves suggest that, for 3 discriminators, the weakest discriminator network is often sampled at the beginning until the generator learns enough from it, at which point it begins to use the stronger discriminator more often. Note how the strong discriminator is sampled more frequently than the weak one. In fact, because the generator needs to produce samples of higher quality to fool the strong discriminator, training with the latter might take longer than with weaker discriminators (which are more lenient). By the end of training, all discriminators are being used in equal proportions, meaning that every discriminator plays a complementary role, from mode coverage to sample quality. A similar pattern is observed for the 5-discriminator case.
To assess the quality of produced results, we report the minimum Fréchet Inception Distance (FID; Heusel et al., 2017) reached, and the corresponding epoch, in Table 3. The squared FID was computed every epoch with 1,000 held-out samples at training time. As in Fedus et al. (2017), a ResNet pretrained on CIFAR-10 was employed to obtain representations for FID computation rather than Inception V3. Proceeding this way yields a more informative score, given that our classifier was trained on the same data as the generative models. Details on the FID score can be found in the Supplementary Material. We compared our results to Durugkar, Gemp, and Mahadevan (2016). Since the authors reported that GMAN-1 ($\lambda = 1$) had an overall better performance, we used this version in our experiments and refer to it as GMAN. Previously, we observed that the feedback provided to the generator is shared between all the discriminators. In particular, not all of the gradient comes from the strong discriminator (unlike for the Vanilla GAN). One might be concerned about a degradation of sample quality. We show that having more discriminators leads to better mode coverage and sample quality (see the FID curves for an increasing number of discriminators in the Supplementary Material). Overall, we noticed that acGAN achieved the best FID score when compared to the baselines, as presented in Figs. 8 and 9 (plots are shown in a larger format in the Supplementary Material). GMAN performed worse than expected, and increasing the number of discriminators did not significantly improve its FID score. We suspect that the original loss function of the GAN (which is equivalent to Jensen-Shannon divergence minimization) is not a good signal to assess the progress of the generator.
Indeed, Arjovsky, Chintala, and Bottou (2017) argued, and introduced a toy example showing, that this version of adversarial nets is not informative when there is little overlap between the supports of the true and approximate distributions, as commonly seen at the beginning of the training process. Finally, not keeping a moving average of the rewards can lead to high variance.
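For reference, the FID relies on the closed-form Fréchet distance between two Gaussians (Dowson and Landau, 1982), applied to the mean and covariance of feature representations of real and generated samples. The sketch below is a minimal NumPy version under that Gaussian assumption; the function name and the toy inputs are hypothetical, and it omits the feature extractor entirely.

```python
import numpy as np

def frechet_distance_sq(mu1, cov1, mu2, cov2):
    """Squared Fréchet distance between N(mu1, cov1) and N(mu2, cov2).

    d^2 = ||mu1 - mu2||^2 + Tr(cov1) + Tr(cov2) - 2 Tr((cov1 cov2)^{1/2})
    """
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    cov1, cov2 = np.asarray(cov1, dtype=float), np.asarray(cov2, dtype=float)
    diff = mu1 - mu2
    # For positive semi-definite covariances, the trace of the matrix square root
    # equals the sum of the square roots of the eigenvalues of cov1 @ cov2.
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

# Toy 2D example: same covariance, means 5 units apart.
same = frechet_distance_sq([0, 0], np.eye(2), [0, 0], np.eye(2))
shifted = frechet_distance_sq([0, 0], np.eye(2), [3, 4], np.eye(2))
```

With identical covariances the covariance terms cancel, so only the squared distance between the means remains.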
Method  |  Best FID (epoch): seed 1  |  seed 2  |  seed 3  |  seed 4  |  seed 5  |  Mean Best FID
Vanilla GAN  |  5.02 (20)  |  5.28 (27)  |  4.27 (30)  |  4.80 (34)  |  4.63 (41)  |  4.80
WGAN-GP (batch norm layers replaced with instance norm)  |  4.29 (43)  |  4.24 (28)  |  3.98 (47)  |  3.99 (37)  |  3.93 (50)  |  4.08
Uniform (3 Disc)  |  4.18 (20)  |  4.07 (39)  |  4.35 (45)  |  5.07 (30)  |  4.39 (47)  |  4.41
GMAN (3 Disc)  |  3.87 (43)  |  4.05 (46)  |  5.24 (42)  |  5.71 (42)  |  4.10 (22)  |  4.59
acGAN (3 Disc)  |  3.93 (39)  |  3.57 (38)  |  4.25 (42)  |  3.43 (40)  |  3.11 (43)  |  3.66
Uniform (5 Disc)  |  3.42 (47)  |  3.69 (49)  |  4.37 (37)  |  3.64 (37)  |  3.47 (40)  |  3.72
GMAN (5 Disc)  |  4.58 (44)  |  4.40 (20)  |  3.91 (47)  |  4.81 (25)  |  4.42 (38)  |  4.42
acGAN (5 Disc)  |  3.62 (35)  |  2.62 (49)  |  4.14 (35)  |  2.66 (42)  |  3.67 (34)  |  3.34
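The mean column of Table 3 can be recomputed from the per-seed best FID values. The snippet below transcribes those values from the table (the dictionary keys are labels chosen for this example) and averages them; the results agree with the reported means up to the last reported digit.

```python
# Best FID per seed, transcribed from Table 3 (epochs omitted).
best_fid = {
    "Vanilla GAN":      [5.02, 5.28, 4.27, 4.80, 4.63],
    "WGAN-GP":          [4.29, 4.24, 3.98, 3.99, 3.93],
    "Uniform (3 Disc)": [4.18, 4.07, 4.35, 5.07, 4.39],
    "GMAN (3 Disc)":    [3.87, 4.05, 5.24, 5.71, 4.10],
    "acGAN (3 Disc)":   [3.93, 3.57, 4.25, 3.43, 3.11],
    "Uniform (5 Disc)": [3.42, 3.69, 4.37, 3.64, 3.47],
    "GMAN (5 Disc)":    [4.58, 4.40, 3.91, 4.81, 4.42],
    "acGAN (5 Disc)":   [3.62, 2.62, 4.14, 2.66, 3.67],
}

# Mean best FID per method and the method with the lowest mean.
mean_best = {name: sum(vals) / len(vals) for name, vals in best_fid.items()}
best_method = min(mean_best, key=mean_best.get)
```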
5 Conclusion
In this work, we model the training of the generator against discriminators of increasing complexity within a one-student/multiple-teachers paradigm. We address this mixture-of-experts problem under the adversarial bandit setting with full information, where we rely on the Hedge algorithm to learn the weights assigned to each discriminator in the mixture. Since designing a suitable reward function is a key ingredient in controlling the shape of the learned policy, we examined two sensible reward functions, which relied on sample quality and on the GAN loss function, respectively. We empirically found the sample-quality reward (Eq. 6) to yield the best results. Keeping a moving average of the rewards helped smooth the weights put on discriminators and resulted in a more stable mixture.
Then, we demonstrated a complementary regulation mechanism between weak and strong discriminators. While weaker discriminators are smoother and provide more informative feedback to the generator, stronger discriminators focus on finer-grained detail to ensure sample quality.
Finally, we conducted a series of experiments to show the emergence of a curriculum during the training process. That is, lower-capacity discriminators have higher weights at the beginning but, as the training progresses, higher weights are allocated to higher-capacity discriminators. We showed how existing algorithms could be recovered from our model through its parameter values. The experiments showed that our proposed approach leads to earlier convergence and a better FID score compared to existing baselines in the field, i.e., Uniform and GMAN.
As a direction for future investigation, approaches not relying on the adversarial framework could be explored to model the non-stationarity of the reward distributions. For example, finding a meaningful representation for the state of the generator could allow the use of contextual bandit algorithms.
References
 Arjovsky and Bottou (2017) Arjovsky, M., and Bottou, L. 2017. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations.
 Arjovsky, Chintala, and Bottou (2017) Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein GAN. CoRR abs/1701.07875.
 Auer et al. (1995) Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 1995. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In IEEE Annual Symposium on Foundations of Computer Science (FOCS), 322.
 Baransi, Maillard, and Mannor (2014) Baransi, A.; Maillard, O.-A.; and Mannor, S. 2014. Sub-sampling for multi-armed bandits. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 115–131. Springer.
 Belghazi et al. (2018) Belghazi, I.; Rajeswar, S.; Baratin, A.; Hjelm, R. D.; and Courville, A. C. 2018. MINE: mutual information neural estimation. CoRR abs/1801.04062.
 Bengio et al. (2009) Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, 41–48. New York, NY, USA: ACM.
 Che et al. (2016) Che, T.; Li, Y.; Jacob, A. P.; Bengio, Y.; and Li, W. 2016. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136.

 Dowson and Landau (1982) Dowson, D. C., and Landau, B. V. 1982. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis 12(3):450–455.
 Dumoulin et al. (2016) Dumoulin, V.; Belghazi, I.; Poole, B.; Mastropietro, O.; Lamb, A.; Arjovsky, M.; and Courville, A. 2016. Adversarially learned inference. ArXiv e-prints.
 Durugkar, Gemp, and Mahadevan (2016) Durugkar, I. P.; Gemp, I.; and Mahadevan, S. 2016. Generative multi-adversarial networks. CoRR abs/1611.01673.
 Fedus et al. (2017) Fedus, W.; Rosca, M.; Lakshminarayanan, B.; Dai, A. M.; Mohamed, S.; and Goodfellow, I. 2017. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. arXiv preprint arXiv:1710.08446.
 Freund and Schapire (1997) Freund, Y., and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1):119–139.
 Goodfellow et al. (2014) Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A. C.; and Bengio, Y. 2014. Generative adversarial networks. CoRR abs/1406.2661.
 Graves et al. (2017) Graves, A.; Bellemare, M. G.; Menick, J.; Munos, R.; and Kavukcuoglu, K. 2017. Automated curriculum learning for neural networks. CoRR abs/1704.03003.
 Gu (2008) Gu, C. 2008. Smoothing noisy data via regularization: statistical perspectives. Inverse Problems 24(3):034002.
 Gulrajani et al. (2017) Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, 5769–5779.
 Heusel et al. (2017) Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Klambauer, G.; and Hochreiter, S. 2017. Gans trained by a two timescale update rule converge to a nash equilibrium. CoRR abs/1706.08500.
 Hoang et al. (2017) Hoang, Q.; Nguyen, T. D.; Le, T.; and Phung, D. Q. 2017. Multigenerator generative adversarial nets. CoRR abs/1708.02556.
 Huang et al. (2016) Huang, X.; Li, Y.; Poursaeed, O.; Hopcroft, J. E.; and Belongie, S. J. 2016. Stacked generative adversarial networks. CoRR abs/1612.04357.
 Juefei-Xu, Boddeti, and Savvides (2017) Juefei-Xu, F.; Boddeti, V. N.; and Savvides, M. 2017. Gang of GANs: Generative adversarial networks with maximum margin ranking. CoRR abs/1704.04865.
 Karras et al. (2017) Karras, T.; Aila, T.; Laine, S.; and Lehtinen, J. 2017. Progressive growing of gans for improved quality, stability, and variation. CoRR abs/1710.10196.
 Kwak and Zhang (2016) Kwak, H., and Zhang, B. 2016. Generating images part by part with composite generative adversarial networks. CoRR abs/1607.05387.
 Lin et al. (2017) Lin, Z.; Khetan, A.; Fanti, G. C.; and Oh, S. 2017. Pacgan: The power of two samples in generative adversarial networks. CoRR abs/1712.04086.
 Littlestone and Warmuth (1994) Littlestone, N., and Warmuth, M. K. 1994. The weighted majority algorithm. Information and computation 108(2):212–261.

 Liu et al. (2015) Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
 Matiisen et al. (2017) Matiisen, T.; Oliver, A.; Cohen, T.; and Schulman, J. 2017. Teacher-student curriculum learning. CoRR abs/1707.00183.
 Metz et al. (2016) Metz, L.; Poole, B.; Pfau, D.; and Sohl-Dickstein, J. 2016. Unrolled generative adversarial networks. CoRR abs/1611.02163.

 Mitchell and Beauchamp (1988) Mitchell, T. J., and Beauchamp, J. J. 1988. Bayesian variable selection in linear regression. Journal of the American Statistical Association 83(404):1023–1032.
 Miyato et al. (2018) Miyato, T.; Kataoka, T.; Koyama, M.; and Yoshida, Y. 2018. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
 Neyshabur, Bhojanapalli, and Chakrabarti (2017) Neyshabur, B.; Bhojanapalli, S.; and Chakrabarti, A. 2017. Stabilizing GAN training with multiple random projections. CoRR abs/1705.07831.
 Radford, Metz, and Chintala (2015) Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434.
 Roth et al. (2017) Roth, K.; Lucchi, A.; Nowozin, S.; and Hofmann, T. 2017. Stabilizing training of generative adversarial networks through regularization. CoRR abs/1705.09367.
 Srivastava et al. (2017a) Srivastava, A.; Valkov, L.; Russell, C.; Gutmann, M. U.; and Sutton, C. 2017a. Veegan: Reducing mode collapse in gans using implicit variational learning. ArXiv e-prints.
 Srivastava et al. (2017b) Srivastava, A.; Valkov, L.; Russell, C.; Gutmann, M. U.; and Sutton, C. 2017b. Veegan: Reducing mode collapse in gans using implicit variational learning. In Advances in Neural Information Processing Systems, 3310–3320.
 Tolstikhin et al. (2017) Tolstikhin, I. O.; Gelly, S.; Bousquet, O.; Simon-Gabriel, C.-J.; and Schölkopf, B. 2017. Adagan: Boosting generative models. In Advances in Neural Information Processing Systems, 5424–5433.
 Xu, Caramanis, and Mannor (2009) Xu, H.; Caramanis, C.; and Mannor, S. 2009. Robustness and regularization of support vector machines. Journal of Machine Learning Research 10(Jul):1485–1510.
Supplementary Material
Regularizing the discriminator through additive white noise
As was explored in Arjovsky and Bottou (2017), one way to stabilize GAN training is corrupting the input of the network with additive white Gaussian noise of the form . Here, we explore smoothing the discriminator by using the noise. We ran the following experiment in order to illustrate the mechanism. We (once again) train on the 8-Gaussian synthetic dataset with 3 discriminators (1, 2 and 3 hidden layers of units), both with and without adding independent Gaussian noise to the discriminator’s input. A noticeable downside of feeding corrupted inputs to the network is the degradation of sample quality: the so-called salt-and-pepper effect becomes more visible as the discriminators train. To solve this issue, we decay the noise at time step by a multiplicative coefficient: , where is a real constant controlling the noise reduction speed. Initial Gaussian noise was picked to be of the form , with variances of , , , for being the weakest discriminator and the strongest. Adding white noise increases the entropy (i.e., uncertainty) of the discriminator (a proof is shown in the Supplementary Material) and tends to smooth its decision boundary (see the probability and gradient norm values in Figs. 11 and 10). When the number of parameters is fixed, fitting a discriminator to uncorrupted input is prone to faster overfitting than training on noisy data, a good illustration of which is provided by Gu (2008). Empirical results are shown in Fig. 10. We see that by corrupting the real data we manage to cover all 8 modes, and that the sample quality is preserved by decaying the variance of the noise. The evolution of generated data points is shown in Fig. 14.
5.1 Effect of different nonlinear activation layers on the weak discriminator’s smoothness
In this section, we illustrate the effect of 3 nonlinear activation layers (Tanh, Leaky ReLU and ReLU) on the gradient norm of the discriminator. We trained the generator with 3 discriminators (using SoftacGAN) on the 8 Gaussian dataset. For performance reasons, we replace only the activation layer of the weak discriminator with each of the three activations above in turn, and keep ReLU in the other discriminators. Fig. 12 shows the gradient norm of the weak discriminator over the whole space . Tanh yields a very uniform gradient norm across the space, ReLU yields the most discontinuous one, and Leaky ReLU shows an intermediate pattern. However, the gradient norm under Tanh is also flat, with very small magnitude; this may be because Tanh saturates and carries very little gradient signal at its extremes (indeed, we observe very poor performance with that activation layer). Leaky ReLU is less discontinuous than ReLU, although it partitions the space in the same way.
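To make the quantity plotted in this section concrete, the sketch below computes the norm of the input gradient of a toy discriminator on a 2-D grid for the three activations. It is not the network used in the experiments: the one-hidden-layer architecture, the random weights, the grid range, and the Leaky ReLU slope of 0.2 are all assumptions chosen for illustration.

```python
import numpy as np

# Each entry maps a name to (activation, derivative of the activation).
ACTIVATIONS = {
    "tanh": (np.tanh, lambda h: 1.0 - np.tanh(h) ** 2),
    "relu": (lambda h: np.maximum(h, 0.0), lambda h: (h > 0).astype(float)),
    "leaky_relu": (lambda h: np.where(h > 0, h, 0.2 * h),
                   lambda h: np.where(h > 0, 1.0, 0.2)),
}

def input_grad_norm(x, W1, b1, w2, b2, name):
    """Norm of dD/dx for D(x) = sigmoid(w2 . act(W1 x + b1) + b2), per row of x."""
    act, dact = ACTIVATIONS[name]
    h = x @ W1.T + b1                 # pre-activations, shape (n, hidden)
    z = act(h) @ w2 + b2              # logits, shape (n,)
    s = 1.0 / (1.0 + np.exp(-z))      # discriminator output
    # Chain rule: dD/dx = s (1 - s) * W1^T (act'(h) * w2)
    grad = (s * (1.0 - s))[:, None] * ((dact(h) * w2) @ W1)
    return np.linalg.norm(grad, axis=1)

rng = np.random.default_rng(0)
hidden = 16
W1, b1 = rng.standard_normal((hidden, 2)), rng.standard_normal(hidden)
w2, b2 = rng.standard_normal(hidden), 0.0

xs = np.linspace(-2.0, 2.0, 50)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
norms = {name: input_grad_norm(grid, W1, b1, w2, b2, name) for name in ACTIVATIONS}
```

Reshaping each entry of `norms` to `(50, 50)` gives a heat map of the same kind as the gradient norm plots. With ReLU and Leaky ReLU the pre-sigmoid gradient is piecewise constant over the linear regions of the network, which is one way to read the partitioning of the space noted above.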
5.2 Evolution of the gradient norm during the training
In this section, we show the evolution of the gradient norm of each discriminator throughout training (results shown in Fig. 13). At the beginning of training (first row), when the generator has only learned the top-left modes, the discriminator is flat on the bottom-right part of the space and has a higher gradient norm on the top-left part. As training progresses, the missing modes show a high gradient norm (second row, third column). Finally, at the end of training, when the generator has learned all the modes, the weak discriminator has a more uniform gradient norm over the space, while the strong discriminator has an equal gradient norm at each mode location.
5.3 Regularizing the discriminator through additive noise
Increasing entropy with additive Gaussian noise.
In order to obtain discriminator networks with varying degrees of strength, we first resorted to nested architectures: for instance, the stronger discriminator should have a more complex architecture than the weaker one. In addition, we corrupted the inputs with additive white Gaussian noise. Formally, to the input matrix of the discriminator we added , thus creating a new input which was then fed to the discriminator. For practical purposes, noise for image data should have bounded support in order to obtain meaningful RGB values.
Letting the weaker discriminators train on inputs corrupted with Gaussian noise of larger variance allows the network to learn a high-level representation of the dataset, while feeding uncorrupted inputs lets the corresponding discriminator specialize. This tradeoff between sample space coverage and estimation accuracy is known as the spike-and-slab prior and is frequently used in Bayesian variable selection methods similar to the one proposed by Mitchell and Beauchamp (1988).
Consider the following relation, known in information theory as the entropy power inequality (EPI). Let be a continuous, real-valued and independent random vector on a bounded support and , both of dimension :
(8) 
Taking the logarithm of both sides and using , we get an expression for the entropy of the sum and :
(9) 
from which it follows that adding i.i.d. Gaussian noise to the inputs increases the total entropy of the data. Here we used the fact that for all and that the uniform distribution has the maximal entropy over all continuous random variables with bounded support . The quantity controls the tightness of the bound. Because Eq. 9 is valid for all , it is necessarily valid for . Maximizing the expression shows that picking increases the overall entropy and approaches the uniform distribution. Finally, recall that the entropy where . That is, maximizing the entropy of a distribution is equivalent, up to an additive constant, to minimizing the Kullback-Leibler divergence between the distribution and a uniform random variable with identical support (provided adequate restrictions on the support).
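For reference, the textbook form of the entropy power inequality and two of its consequences are written out below. This is the standard statement, not a reconstruction of Eqs. 8 and 9, whose exact form (including the tightness parameter mentioned above) is not recoverable here. The symbols $X$, $Z$, $d$, $\sigma$, $S$ and $p_X$ are notation assumed for this sketch.

```latex
% Entropy power of a random vector X in R^d with differential entropy h(X)
N(X) = \frac{1}{2\pi e}\, e^{2h(X)/d}

% Entropy power inequality, for independent X and Z with densities
N(X + Z) \;\ge\; N(X) + N(Z)

% Gaussian noise Z ~ N(0, \sigma^2 I_d): h(Z) = (d/2) \log(2\pi e \sigma^2), so N(Z) = \sigma^2
h(X + Z) \;\ge\; \frac{d}{2} \log\!\left( e^{2h(X)/d} + 2\pi e\, \sigma^2 \right) \;>\; h(X)
\quad \text{for } \sigma > 0

% Entropy versus KL divergence to the uniform distribution on a support S of finite volume |S|
h(X) = \log |S| \;-\; \mathrm{KL}\!\left( p_X \,\middle\|\, \mathcal{U}_S \right)
```

The third line shows that the lower bound on the entropy of the corrupted input grows with the noise variance, and the last line makes explicit the additive constant $\log |S|$ relating entropy maximization to KL minimization against the uniform distribution.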
Fitting a weak discriminator to the corrupted data should increase its capacity to generalize more than that of the stronger discriminator by acting as a regularization technique and preventing the network from overfitting.
Analogous mechanisms are widely used in conjunction with other learning algorithms, such as support vector machines, where adding noise to the data is equivalent to increasing the classification margin, as shown by Xu, Caramanis, and Mannor (2009).
As a final remark, it is important to select a proper noise distribution in order to avoid introducing bias and to respect the original structure of the data.
5.4 Experimental parameters
Algorithm parameters  

acGAN  
number of discriminators  
Optimizer parameters  
StackedMNIST  RMSprop () 
CIFAR10  Adam () 
Synthetic (25 Gaussians)  Adam () 
Synthetic (8 Gaussians)  Adam () 
CelebA  Adam () 
General experimental hyperparameters.
5.5 Synthetic data
We use the 2D-ring with 8 Gaussians and the 2D-grid with 25 Gaussians (Gulrajani et al., 2017). Three metrics were employed to evaluate the results:
High Quality samples
Number of Covered modes
Fréchet Distance (FD)
The percentage of "High Quality" samples is defined as the proportion of generated samples which are within standard deviation of the closest mode. The next metric reported is the number of modes covered, i.e. the count of modes that have generated samples close enough (). The Fréchet Distance, originally from Dowson and Landau (1982), is defined as:
(10) 
where and are the first- and second-order moments of the real data distribution and their estimates from generated data, respectively.
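The three metrics can be sketched in a few lines of NumPy. Several details are assumptions, because the corresponding values and Eq. 10 are not recoverable from the text: the threshold `n_std`, the rule that a mode counts as covered once `min_count` high-quality samples fall nearest to it, the ring radius and mode standard deviation, and the use of the squared Gaussian Fréchet distance in its standard Dowson and Landau form. All function names are hypothetical.

```python
import numpy as np

def ring_modes(n_modes=8, radius=1.0):
    """Mode centres equally spaced on a circle (radius is an assumption)."""
    angles = 2.0 * np.pi * np.arange(n_modes) / n_modes
    return radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)

def mode_metrics(samples, modes, mode_std, n_std, min_count=1):
    """Return (fraction of high-quality samples, number of covered modes)."""
    dists = np.linalg.norm(samples[:, None, :] - modes[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    high_quality = dists.min(axis=1) <= n_std * mode_std
    counts = np.bincount(nearest[high_quality], minlength=len(modes))
    return float(high_quality.mean()), int((counts >= min_count).sum())

def frechet_distance_sq(x, y):
    """Squared Frechet distance between Gaussians fitted to x and y:
    |mu_x - mu_y|^2 + Tr(C_x + C_y - 2 (C_x C_y)^{1/2})."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x, cov_y = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    # Tr((C_x C_y)^{1/2}) equals the sum of square roots of the eigenvalues of C_x C_y.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((mu_x - mu_y) ** 2)
                 + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
modes, std = ring_modes(), 0.05
real = modes[rng.integers(0, 8, 2000)] + std * rng.standard_normal((2000, 2))
# A generator that has dropped half of the modes.
fake = modes[rng.integers(0, 4, 2000)] + std * rng.standard_normal((2000, 2))

hq, covered = mode_metrics(fake, modes, mode_std=std, n_std=3.0)
fd = frechet_distance_sq(real, fake)
```

In this toy setting the generator that drops half of the modes still scores well on sample quality, while the covered-mode count and the Fréchet distance expose the missing modes, which is why the three metrics are reported together.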
Architecture.
The generator network’s architecture comprises 4 dense layers of units each. We used 3 discriminators with 2, 3 and 4 dense layers of units, respectively. ReLU activations were used in all layers except the last one, where a linear activation function was used for the generator and a sigmoid for the discriminator.
StackedMNIST
Architecture.
We used DCGAN’s architecture (Radford, Metz, and Chintala, 2015) to create lower capacity discriminators (in terms of feature representation power). For the case with 3 discriminators (3Ds), we used discriminators 3, 4, 5 (described in the following tables). For the case with 5 discriminators (5Ds), we used discriminators 1, 2, 3, 4, 5.
Layer  Outputs  Kernel size  Stride  BN  Activation 

Input:  
Fully connected  2*2*512  4, 4  2, 2  Yes  ReLU 
Transposed convolution  4*4*256  4, 4  2, 2  Yes  ReLU 
Transposed convolution  8*8*128  4, 4  2, 2  Yes  ReLU 
Transposed convolution  14*14*64  4, 4  2, 2  Yes  ReLU 
Transposed convolution  28*28*3  4, 4  2, 2  No  Tanh 
Layer  Outputs  Kernel size  Stride  BN  Activation 

Input  28*28*3  
Convolution  14*14*64  4, 4  2, 2  No  LeakyReLU 
Convolution  7*7*128  4, 4  2, 2  Yes  LeakyReLU 
Convolution  4*4*256  4, 4  2, 2  Yes  LeakyReLU 
Convolution  2*2*512  4, 4  2, 2  Yes  LeakyReLU 
Convolution  1  4, 4  2, 2  No  Sigmoid 
Layer  Outputs  Kernel size  Stride  BN  Activation 

Input  28*28*3  
Convolution  14*14*64  4, 4  2, 2  No  LeakyReLU 
Convolution  7*7*128  4, 4  2, 2  Yes  LeakyReLU 
Convolution  4*4*256  4, 4  2, 2  Yes  LeakyReLU 
Convolution  2*2*512  4, 4  2, 2  Yes  LeakyReLU 
Fully connected  1  4, 4  2, 2  No  Sigmoid 
Layer  Outputs  Kernel size  Stride  BN  Activation 

Input  28*28*3  
Convolution  13*13*64  6, 6  2, 2  No  LeakyReLU 
Convolution  6*6*128  6, 6  2, 2  Yes  LeakyReLU 
Convolution  2*2*256  6, 6  2, 2  Yes  LeakyReLU 
Convolution  1  6, 6  2, 2  No  Sigmoid 
Layer  Outputs  Kernel size  Stride  BN  Activation 

Input  28*28*3  
Convolution  13*13*64  6, 6  2, 2  No  LeakyReLU 
Convolution  6*6*128  6, 6  2, 2  Yes  LeakyReLU 
Convolution  2*2*256  6, 6  2, 2  Yes  LeakyReLU 
Fully connected  1  6, 6  2, 2  No  Sigmoid 
Layer  Outputs  Kernel size  Stride  BN  Activation 

Input  28*28*3  
Convolution  12*12*64  8, 8  2, 2  No  LeakyReLU 
Convolution  4*4*128  8, 8  2, 2  Yes  LeakyReLU 
Convolution  1  8, 8  2, 2  No  Sigmoid 
5.6 CIFAR10
FID score. FID scores, as introduced in Heusel et al. (2017), were computed for CIFAR10. The FID is defined as the squared Fréchet distance between two Gaussians whose first- and second-order statistics match those estimated from features of real and generated images, respectively. The late layers of a pretrained classifier are used as a low-dimensional representation of images for estimating these statistics.
Architecture. For our strongest discriminator, we use the DCGAN architecture but with half the number of filters, i.e. . For the 3D (3-discriminator) case, we introduced two extra discriminators with kernel sizes of 6 and 8. For the 5D (5-discriminator) case, we add two discriminators with kernel sizes 4 and 6, respectively, to the set of 3D discriminator networks. In both the 3D and 5D cases, we replaced the last layer of the DCGAN model with a fully connected dense layer. The generator network was taken from the original DCGAN architecture, also with the number of filters halved, i.e. . ReLU activation units were used for the generator network, while LeakyReLU is used for the discriminators with a coefficient of .
Influence of the number of discriminators. An important assumption in the current paper is that increasing the number of discriminator networks helps the model converge faster. To assess this, we conducted experiments with the acGAN algorithm while varying the number of discriminators for ( being the Vanilla GAN) and averaging results over 5 seeds. Fig. 16 shows that a higher number of discriminators indeed leads to earlier convergence of the FID score curve.
CelebA
For the CelebA (Liu et al., 2015) experiments, we conducted single-run experiments of 50,000 iterations each, counted in generator steps ( epochs). We downscaled the original images to pixels for practical reasons.
Results.
Similarly to CIFAR10, we observe the emergence of a curriculum in Fig. 23. In particular, we note the presence of alternating phases during which a specific discriminator dominates in the 3D and 5D cases. In the end, all discriminator probabilities converge to a stationary (i.e. long-term) uniform distribution, as for the previously mentioned datasets.
Architecture.
We used the same architecture as for the CIFAR10 experiments except that the original numbers of filters were set to: for the discriminators and for the generator.
Generating 128x128 images.
In this experiment, we generated high resolution images with 3 and 5 discriminators. A convolutional layer with 2048 feature maps was added to both the generator and discriminator architectures. The 3-discriminator setting used kernel sizes of 4, 6 and 8. For the 5-discriminator case, we added discriminators with kernel sizes 4 and 6 whose last layer was replaced with a dense layer. The same parameters were employed as for CelebA 64x64.

