Online Adaptative Curriculum Learning for GANs

by   Thang Doan, et al.

Generative Adversarial Networks (GANs) can successfully learn a probability distribution and produce realistic samples. However, open questions such as sufficient convergence conditions and mode collapse still persist. In this paper, we build on existing work in the area by proposing a novel framework for training the generator against an ensemble of discriminator networks, which can be seen as a one-student/multiple-teachers setting. We formalize this problem within the non-stationary Multi-Armed Bandit (MAB) framework, where we evaluate the capability of a bandit algorithm to select discriminators for providing the generator with feedback during learning. To this end, we propose a reward function which reflects the amount of knowledge learned by the generator and dynamically selects the optimal discriminator network. Finally, we connect our algorithm to stochastic optimization methods and show that existing methods using multiple discriminators in literature can be recovered from our parametric model. Experimental results based on the Fréchet Inception Distance (FID) demonstrates faster convergence than existing baselines and show that our method learns a curriculum.





1 Introduction

Generative Adversarial Networks (GANs, Goodfellow et al., 2014) have reshaped the state of machine learning in tasks that involve generating data. A GAN is an unsupervised method that consists of two neural networks, a generator and a discriminator, with opposing (or adversarial) objectives. The typical goal of the generator is to transform noise (e.g., drawn from a normal distribution) into samples whose statistical and structural characteristics closely match those of an empirical target dataset (such as a collection of images). The discriminator, which acts as an adversary to the generator, needs to classify samples as coming from either the real data or the generator.

While GANs can achieve impressive qualitative performance (most notably with image data, e.g., see Roth et al., 2017; Miyato et al., 2018; Karras et al., 2017), the most successful methods depart from the original formulation to address various instabilities and other optimization difficulties (Arjovsky and Bottou, 2017; Arjovsky, Chintala, and Bottou, 2017). One such difficulty in training GANs occurs when the generator produces samples only from a small subset of the target distribution, a phenomenon known as missing modes (a.k.a. mode dropping; see, e.g., Che et al., 2016). Numerous works try to address the problem by modifying the original objective, through approaches such as unrolling (Metz et al., 2016), aggregating samples (Lin et al., 2017), stacked architectures (Huang et al., 2016; Karras et al., 2017), mutual information / entropy maximization (Belghazi et al., 2018), multiple discriminators (Neyshabur, Bhojanapalli, and Chakrabarti, 2017; Juefei-Xu, Boddeti, and Savvides, 2017), or multiple generators (Tolstikhin et al., 2017; Hoang et al., 2017; Kwak and Zhang, 2016).

In our work, we follow the intuition that missing modes in GANs are due in part to mode-specific vanishing gradients. As a simple illustrative example, which we explore in detail in our experiments below (Fig. 1), consider a discriminator that represents the target distribution well and a generator that generates only a subset of the modes in the data. If any of the missing modes are disjoint from those represented in the generator (i.e., are composed of sets of features with low intersection), there is no way for the generator to receive a gradient signal on the missing modes from the discriminator.

Figure 1: Recovering dropped modes via multiple discriminators. The weak discriminator provides feedback, allowing the generator to recover forgotten modes. The strong discriminator experiences vanishing gradient and cannot help the generator to recover modes.

However, if the discriminator only represents the data approximately (in the sense that it also cannot fully distinguish between these modes), it may be possible to recover the missing mode gradient signal. Although this can be achieved by using a low-capacity discriminator (throughout the paper, capacity refers to the architecture size of a given neural network in terms of its number of parameters), relying on such a discriminator alone is ultimately undesirable, given that the end goal is to generate samples that closely resemble the target dataset. From now on, we will refer to such low-capacity discriminators as weak and to high-capacity discriminators as strong. In order to ensure both high quality and mode coverage, we consider multiple discriminators (as in Durugkar, Gemp, and Mahadevan, 2016) with different strengths to train the generator. We propose to train the generator using a curriculum based on an on-line multi-armed bandit algorithm (Matiisen et al., 2017; Graves et al., 2017), dynamically changing the weight/resources allocated to each discriminator, which we show is crucial for achieving good results. Our primary contributions are:

  1. We provide important insights into the missing mode problem, as demonstrated by the gradient signal available to the generator from the discriminator.

  2. As a potential solution to the missing modes problem, we introduce a new framework based on resource allocation with adversarial bandits (Littlestone and Warmuth, 1994; Auer et al., 1995; Freund and Schapire, 1997), where the generator gets its training signal from a set of teacher networks with increasing capacity.

  3. We show that the proposed approach leads to a curriculum characterized by successive phases in which the generator prioritizes different discriminators.

The remainder of this paper is organized as follows. Previous literature relevant to this work is briefly reviewed in Section 2. The proposed approach is formally introduced in Section 3, and an empirical analysis is reported in Section 4. Conclusions and future directions are presented in Section 5.

2 Related Work

Mode coverage and data / model augmentation

The intuition that missing modes are due to vanishing gradients resonates with some successful approaches to stabilizing and improving GAN training through data and model augmentation. Instance noise (Arjovsky and Bottou, 2017) has been shown to improve stability (see also Roth et al., 2017), which can be understood as smoothing the data modes in pixel space. Progressively reducing the downsampling through training (either by copying parameters or by feeding low-resolution samples into a larger generator) has also been considered previously (Huang et al., 2016; Karras et al., 2017) as a way to increase mode overlap. This is akin to a hand-crafted curriculum, progressively increasing the difficulty of the problem at a-priori chosen points in the complete training procedure.

Multiple discriminators and generators

Several works have also incorporated multiple generators or discriminators in order to improve learning. Multiple-generator methods (Tolstikhin et al., 2017; Hoang et al., 2017; Kwak and Zhang, 2016) typically work by encouraging the generators to divide the task of generating by modes in the target dataset (without additional supervision). Using multiple discriminators (Neyshabur, Bhojanapalli, and Chakrabarti, 2017; Juefei-Xu, Boddeti, and Savvides, 2017), on the other hand, is known to provide a better learning signal for the generator if said discriminators, taken together, represent the target dataset well. Closest to our work, Durugkar, Gemp, and Mahadevan (2016) consider discriminators of different complexity to provide varied signal. We will show that carefully designing the reward makes it possible to track the progress made by the generator and encourages curriculum learning.

Multi-armed bandit as a curriculum learning method for GANs

Curriculum learning (Bengio et al., 2009) phrases a given machine learning problem as a set of tasks of increasing difficulty. GANs can also be said to share aspects with curriculum learning: the discriminator defines an objective of progressive difficulty, thus allowing the generator to gradually learn to more faithfully mimic the target distribution. However, there is no explicit mechanism to encourage a sensible curriculum for either model. For example, if the discriminator learns to represent disjoint modes faster than the generator learns to cover them, this can lead to the generator missing modes with no gradient signal to recover.

In this paper, we propose an algorithm which gives rise to a curriculum in a direct manner. Our approach borrows from curriculum learning in the multi-armed bandit setting (Matiisen et al., 2017; Graves et al., 2017), where learning is typically done by measuring the change in a performance criterion of a given agent (e.g., a loss function, score or gradient norm), a choice that appears to affect the form of the optimal policy. In our method, given a set of discriminators, the goal is to weight the feedback received by the generator proportionally to the information contained in the gradients from each discriminator.

3 Adaptative Curriculum GAN

Here we formulate the problem and approach for training a single generator on a target dataset using a curriculum over multiple discriminators, which we call Adaptative Curriculum GAN (acGAN). First, define a generator function, $G: \mathcal{Z} \to \mathcal{X}$, which maps noise from a domain $\mathcal{Z}$ to the domain of a target dataset, $\mathcal{X}$ (such as the space of images). Let $p$ denote the target density (we assume, for the sake of notation, that the target data admits a density), and let $q$ denote the prior density defined on $\mathcal{Z}$ used to draw noise samples for input into the generator. We wish to train this generator function using $N$ discriminators, $D_1, \dots, D_N$, such that on each episode $t$, we select the mixture of discriminators that provides the best learning signal.

3.1 Mixing discriminators

This mixture-of-experts problem, where each discriminator plays the role of a teacher, can be tackled under the full-information adversarial bandit setting (Littlestone and Warmuth, 1994; Freund and Schapire, 1997; Auer et al., 1995). On each episode $t$, a bandit player associates normalized weights $\pi_t = (\pi_{t,1}, \dots, \pi_{t,N})$ with discriminators $D_1, \dots, D_N$. The generator is then trained based on the mixture described by $\pi_t$, and a reward $r_{t,i}$ is observed for each discriminator $D_i$, characterizing the generator’s improvement with respect to $D_i$. Let $R_T$ denote the total observed reward at time $T$. The goal of the player is to learn the optimal policy $\pi^* \in \Delta_N$ that maximizes the expected total reward, where $\Delta_N$ denotes the standard simplex on $\mathbb{R}^N$.

The Hedge algorithm (Freund and Schapire, 1997), whose weights follow what is also known as the Boltzmann or Gibbs distribution, addresses this full-information game by maintaining probabilities

$$\pi_{t,i} = \frac{\exp(\beta Q_{t,i})}{\sum_{j=1}^{N} \exp(\beta Q_{t,j})} \qquad (1)$$

for each discriminator $D_i$, where $Q_{t,i}$ estimates the gain of $D_i$ at episode $t$. In this case, $\beta \geq 0$ is a parameter of the distribution: $\beta = 0$ corresponds to a uniform distribution over all models. We found experimentally that using a moving average on previous rewards (which also featured in Matiisen et al., 2017) stabilizes the training:

$$Q_{t+1,i} = \alpha \, r_{t,i} + (1 - \alpha) \, Q_{t,i} \qquad (2)$$

where $\alpha$ is the smoothing parameter.
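The weighting scheme described above can be illustrated with a minimal NumPy sketch. It assumes that the Hedge probabilities are a softmax of the gain estimates scaled by the Boltzmann parameter, and that the moving average places the smoothing parameter on the newest reward; the function names `hedge_weights` and `update_q` are hypothetical and not taken from the paper.

```python
import numpy as np

def hedge_weights(q_values, beta):
    """Boltzmann (softmax) weights over discriminators; beta = 0 gives uniform weights."""
    logits = beta * np.asarray(q_values, dtype=float)
    logits = logits - logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def update_q(q_values, rewards, alpha):
    """Moving average of rewards (assumed convention: alpha weights the newest reward)."""
    q = np.asarray(q_values, dtype=float)
    r = np.asarray(rewards, dtype=float)
    return alpha * r + (1.0 - alpha) * q

# Toy usage with 3 discriminators and made-up rewards.
q = np.zeros(3)
q = update_q(q, rewards=[0.2, 0.1, -0.05], alpha=0.5)
pi = hedge_weights(q, beta=10.0)
uniform = hedge_weights(q, beta=0.0)
```

A larger Boltzmann parameter concentrates the mixture on the discriminator with the highest estimated gain, while a value of zero ignores the gains entirely.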


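Figure 2: Proposed procedure for training the generator. The generator produces fake samples; the discriminators are updated with Eq. 4 and G with Eq. 5; the performance of G is evaluated by observing rewards; the Q-values are updated with Eq. 2 and the mixture weights are computed with Eq. 1.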

To demonstrate how this can be used to train GANs, consider the usual value function (Goodfellow et al., 2014), where $p$ is the target density and $q$ the noise prior:

$$V(D, G) = \mathbb{E}_{x \sim p}[\log D(x)] + \mathbb{E}_{z \sim q}[\log(1 - D(G(z)))] \qquad (3)$$

On each episode $t$, given the mixture of discriminators with weights $\pi_{t,i}$, each discriminator $D_i$ is trained by taking a gradient step to increase the expected value function

$$\sum_{i=1}^{N} \pi_{t,i} \, V(D_i, G) \qquad (4)$$

and the generator is trained by taking a gradient step to increase

$$\sum_{i=1}^{N} \pi_{t,i} \, \mathbb{E}_{z \sim q}[\log D_i(G(z))] \qquad (5)$$

The latter corresponds to the non-saturated version of Eq. 4 for the generator. The intuition is that training the generator with all the discriminators simultaneously (as a mixture) should force the generator to fool all discriminators at the same time (Durugkar, Gemp, and Mahadevan, 2016). Since the discriminators view the mode distribution at increasing levels of detail, they should play complementary roles. While the weaker discriminator focuses on mode coverage, the stronger discriminator ensures sample quality (shown in Section 4.1). This should result in better overall coverage of the modes in the input distribution.
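As a rough illustration of the mixture objectives, the sketch below estimates the weighted discriminator objective and the non-saturating generator objective from batches of discriminator outputs. It assumes the standard GAN value function and a weighted sum over discriminators; `value_fn` and `mixture_objectives` are hypothetical helper names, and the inputs are plain arrays of probabilities rather than real network outputs.

```python
import numpy as np

def value_fn(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

def mixture_objectives(weights, d_real_list, d_fake_list, eps=1e-12):
    """Weighted objectives for the discriminators and the generator.

    weights: mixture weights over discriminators (assumed to sum to 1).
    d_real_list[i], d_fake_list[i]: outputs of discriminator i on real / generated batches.
    """
    disc_obj = sum(w * value_fn(dr, df, eps)
                   for w, dr, df in zip(weights, d_real_list, d_fake_list))
    # Non-saturating generator objective: increase log D_i(G(z)).
    gen_obj = sum(w * np.mean(np.log(np.asarray(df, dtype=float) + eps))
                  for w, df in zip(weights, d_fake_list))
    return disc_obj, gen_obj

# Toy usage: two discriminators that are maximally unsure (output 0.5 everywhere).
half = np.full(8, 0.5)
disc_obj, gen_obj = mixture_objectives([0.3, 0.7], [half, half], [half, half])
```

In an actual training loop these quantities would be computed with an automatic differentiation framework so that gradients can flow to the network parameters.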

Algorithm 1 describes our proposed acGAN procedure. We denote and parameterize this algorithm as where .

  Given: $N$: number of discriminators, $T$: time steps, $T_w$: warm-up time, $\alpha$: moving average coefficient, $\beta$: Boltzmann constant
  for $t = 1, \dots, T$ do
     Update all discriminators $D_i$ using Eq. 4
     Update the generator $G$ using Eq. 5
     if $t > T_w$ then
        Evaluate the performance of $G$ and observe a reward $r_{t,i}$ for each discriminator
        Update all $Q_{t,i}$ values according to Eq. 2
        Compute the weights $\pi_{t,i}$ with Eq. 1
     end if
  end for
Algorithm 1 Generic acGAN algorithm
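The control flow of Algorithm 1 can be sketched as a small Python loop. This is only a skeleton under stated assumptions: the network updates and the reward evaluation are passed in as hypothetical callbacks (`update_discriminators`, `update_generator`, `observe_rewards`), the weights stay uniform during warm-up, and the moving average places the coefficient on the newest reward.

```python
import numpy as np

def run_acgan(num_disc, steps, warmup, alpha, beta,
              update_discriminators, update_generator, observe_rewards):
    """Skeleton of the bandit-weighted training loop; returns the weight history."""
    q = np.zeros(num_disc)                      # gain estimates, one per discriminator
    pi = np.full(num_disc, 1.0 / num_disc)      # uniform weights during warm-up
    history = []
    for t in range(steps):
        update_discriminators(pi)               # placeholder for the discriminator step
        update_generator(pi)                    # placeholder for the generator step
        if t >= warmup:                         # assumed warm-up test (0-indexed steps)
            r = np.asarray(observe_rewards(), dtype=float)
            q = alpha * r + (1.0 - alpha) * q   # moving average of rewards
            logits = beta * q
            pi = np.exp(logits - logits.max())  # Boltzmann weights
            pi = pi / pi.sum()
        history.append(pi.copy())
    return np.array(history)

# Toy usage with dummy callbacks and a constant, made-up reward vector.
calls = {"disc": 0, "gen": 0}

def dummy_disc_step(pi):
    calls["disc"] += 1

def dummy_gen_step(pi):
    calls["gen"] += 1

history = run_acgan(num_disc=2, steps=6, warmup=3, alpha=0.5, beta=5.0,
                    update_discriminators=dummy_disc_step,
                    update_generator=dummy_gen_step,
                    observe_rewards=lambda: [0.0, 1.0])
```

Passing the updates as callbacks keeps the bandit logic independent of any particular deep learning framework.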
Remark 1.

At the beginning of training, we define a warm-up period $T_w$, prior to which we train $G$ and the discriminators $D_i$ with a uniform probability, i.e., $\pi_{t,i} = 1/N$. In other words, we consider $\beta = 0$ for $t \leq T_w$. This guarantees that each discriminator is updated a minimum number of times (or provides feedback to the generator a minimum number of times) and prevents one from dominating the others (i.e., $\pi_{t,i} \approx 1$) at the beginning of training. Without this safeguard, the remaining weights would hardly recover a significant probability, and the generator might never get an informative gradient from the corresponding discriminators. Note that warm-ups are not uncommon in bandit algorithms either, e.g., for adding robustness to the tails of reward distributions (Baransi, Maillard, and Mannor, 2014).

3.2 Reward shaping

In order to provide meaningful feedback for learning efficient mixtures of discriminators, we consider different reward functions to generate $r_{t,i}$. We argue that the progress (i.e., the learning slope; Matiisen et al., 2017; Graves et al., 2017) of the generator is a more sensible way to evaluate our policy. Let $\theta_t$ be the generator parameters at episode $t$. We consider two quantities for measuring generator progress between consecutive parameter values. The first measures the progress of the generator through the change in the discriminator score $D_i(G_{\theta}(z))$ of its samples, while the second assesses the change in the loss function (Eq. 3). Since the change in sample quality led to better performance than the change in the loss function, all our experiments (see Section 4) use the former (Eq. 3.2).
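To make the sample-quality reward concrete, here is a minimal sketch under an explicit assumption: the reward for a discriminator is taken to be the mean score it assigns to samples from the updated generator minus the mean score it assigns to samples from the previous generator, on the same noise batch. The exact form used in the paper is not recoverable from the text, and `score_reward`, `toy_disc`, `gen_before` and `gen_after` are hypothetical stand-ins.

```python
import numpy as np

def score_reward(disc, gen_before, gen_after, z):
    """Assumed reward: change in the mean discriminator score after a generator update."""
    return float(np.mean(disc(gen_after(z))) - np.mean(disc(gen_before(z))))

# Toy stand-ins: a sigmoid "discriminator" that scores larger values as more real,
# and two "generators" that differ by a shift toward higher-scoring samples.
def toy_disc(x):
    return 1.0 / (1.0 + np.exp(-x))

def gen_before(z):
    return z

def gen_after(z):
    return z + 1.0

rng = np.random.default_rng(0)
z = rng.normal(size=256)
reward = score_reward(toy_disc, gen_before, gen_after, z)
no_change = score_reward(toy_disc, gen_before, gen_before, z)
```

A positive value indicates that the update moved the samples toward regions the discriminator rates as more realistic.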

3.3 Connection to existing methods

Interestingly, some existing methods in the GAN literature can be seen as specific cases of acGAN:


The original GMAN (Durugkar, Gemp, and Mahadevan, 2016) algorithm can be recovered by setting $\alpha = 1$ (i.e., no moving average) and taking the loss function to be the reward. Note how the authors of GMAN call their algorithm GMAN-$\lambda$, where $\lambda$ is also the Boltzmann coefficient.


The uniform case is defined by assigning a fixed uniform probability to each discriminator $D_i$, i.e., $\pi_{t,i} = 1/N$. This corresponds to Eq. 1 with $\beta = 0$.

To support the results of our theoretical work, we conducted a set of experiments which we describe below.

4 Experiments

In this section, we first give an understanding of how each discriminator provides informative feedback to the generator. We then compare our proposed approach (acGAN) against existing methods from the literature.

4.1 Retaining mode information through weaker capacity discriminators and smoothness

We begin by analyzing the gradient norm of the discriminator networks and we show that weak capacity discriminators are smoother than strong discriminators. This property corresponds to a "coarse-grained" representation of the distribution, which allows the generator to recover missing modes. We further show we can increase the smoothness of a weak discriminator by corrupting its inputs with white noise. This results in an increase of the discriminator’s entropy (see Supplementary Material for more details) and hence a smoother gradient signal.

Weak Discriminators: a way to retain modes

We now highlight the role of weaker capacity discriminators. To this end, we performed the following experiments on the 8 Gaussian synthetic dataset:

  • We pretrained the generator (dense layers with ReLU activations, except for the last layer) with one discriminator on only a subset of the original modes.

  • We trained a (vanilla) GAN on all Gaussian components, initializing with the pretrained generator above. The discriminator had dense layers with ReLU hidden activations.

  • We trained acGAN with the generator initialized with the same pretrained generator (as with the vanilla GAN). We considered several discriminators with an increasing number of dense layers (the same activation scheme applies).

Figure 3: Modes used for pretraining the generator (left) and modes recovered by Vanilla GAN (middle) and acGAN (right). The more modes the better.
Figure 4: Gradient norm of each discriminator with respect to the input. We clipped the magnitude with respect to the weaker discriminator range. Since weaker discriminators are smoother by construction, they help the generator to recover missing modes. On the other hand, vanilla GAN can hardly recover modes due to its vanishing gradient.

Results (Fig. 3) show that the vanilla GAN could recover only a limited number of additional modes, while acGAN recovered all 8 modes. We examined the gradients provided by the discriminators using a density plot (Fig. 4) of the gradient norm of each discriminator with respect to the input, i.e., $\|\nabla_x D_i(x)\|$ for each discriminator $D_i$. Observe that there is a clear progression from the stronger discriminator, with more distinct and higher gradients, to the weaker discriminator, with smoother gradients. Additionally, note that the discriminator from the vanilla GAN, which has very high gradient norm values, has gradients for modes not present in the generator: the discriminator has information useful for learning about these missing modes, but the generator does not learn these modes due to vanishing gradients. Our results support both our original hypothesis that missing modes are due to vanishing gradients and the claim that a coarse-grained discriminator can be used to recover missing modes. To provide further insight, we show the evolution of the gradient norm of each discriminator at training time in the Supplementary Material. We also note that the discontinuities in the gradients are due to the ReLU activation partitioning the subspace through overlapping half-planes, which contrasts with the smooth decay of the hyperbolic tangent and sigmoid nonlinearities; we further explore the effect of different nonlinear activation layers on the gradient norm of the weak discriminator in the Supplementary Material.
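The gradient norm of a discriminator with respect to its input can be computed in closed form for a small network. The sketch below does this for a hypothetical one-hidden-layer ReLU discriminator with a sigmoid output, using random weights rather than the trained networks from the experiments; `grad_norm_wrt_input` and all parameter names are assumptions made for illustration.

```python
import numpy as np

def grad_norm_wrt_input(x, W1, b1, w2, b2):
    """Norm of dD/dx for D(x) = sigmoid(w2 . relu(W1 x + b1) + b2), for a batch x of shape (n, d)."""
    pre = x @ W1.T + b1                      # hidden pre-activations, shape (n, h)
    hidden = np.maximum(pre, 0.0)            # ReLU
    logit = hidden @ w2 + b2                 # shape (n,)
    d = 1.0 / (1.0 + np.exp(-logit))         # discriminator output in (0, 1)
    active = (pre > 0).astype(float)         # ReLU derivative: 1 on active units, 0 otherwise
    grad_logit = (active * w2) @ W1          # d logit / d x, shape (n, d)
    grad = (d * (1.0 - d))[:, None] * grad_logit  # chain rule through the sigmoid
    return np.linalg.norm(grad, axis=1)

# Evaluate on a 2-D grid, as one would for a density plot of the gradient norm.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)
w2, b2 = rng.normal(size=16), 0.0
xs, ys = np.meshgrid(np.linspace(-2, 2, 50), np.linspace(-2, 2, 50))
grid = np.stack([xs.ravel(), ys.ravel()], axis=1)
norms = grad_norm_wrt_input(grid, W1, b1, w2, b2).reshape(50, 50)
```

Because the ReLU derivative is piecewise constant, the resulting map changes abruptly across the boundaries where hidden units switch on or off, which is the kind of discontinuity discussed above.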

4.2 Performance of acGAN against existing baselines

In this section, we evaluate the performance of our proposed method (acGAN) on various datasets. All experiments use the reward shown in Eq. 3.2. We first conducted a sanity check on 2 mode-dropping datasets: synthetic data consisting of a mixture of 25 Gaussians and Stacked-MNIST with 1000 modes. We then tested it on CIFAR-10 and finally show generated samples on the CelebA dataset (see Supplementary Material). We aim to analyze specific properties such as the diversity of generated samples and their quality in terms of the Fréchet Inception Distance (FID, Heusel et al., 2017) score when available, along with the convergence of the method (how fast it reaches its minimum FID score). Additionally, our results hint at the emergence of a curriculum during the training process.
All parameters used to obtain the results can be found in the Supplementary Material. We split the batch of inputs between discriminators. We use the term epoch loosely: in the context of this paper, it means that the generator has been trained for a number of iterations equivalent to an epoch. For example, CIFAR-10 has 50,000 training images and, assuming a batch size of 64, one epoch represents roughly 781 iterations for the generator.
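The iteration count quoted for CIFAR-10 follows from integer division of the dataset size by the batch size. The short check below reproduces it; the variable names are illustrative, and the 50-epoch total assumes the run length used for the CIFAR-10 experiments.

```python
train_images = 50_000   # CIFAR-10 training set size
batch_size = 64         # batch size assumed in the text

iters_per_epoch = train_images // batch_size   # number of full batches per epoch
leftover_images = train_images % batch_size    # images that do not fill a final batch
iters_for_50_epochs = 50 * iters_per_epoch     # generator iterations for a 50-epoch run
```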

FD Modes Quality samples
Vanilla GAN
Uniform (3D)
acGAN (3D)
Table 1: Results on the Gaussian mixture synthetic data. Our method acGAN could cover all 25 modes.
Figure 5: KDE plots of the modes recovered by each examined approach with 3 discriminators.

Synthetic Gaussian mixture dataset

The synthetic dataset is composed of a mixture of 25 bivariate Gaussians arranged in a two-dimensional grid. We launched a single run of 15 epochs for all methods with 3 discriminators. We report 3 measures in Table 1: the Fréchet Distance (FD), the number of recovered modes and the proportion of high-quality samples (i.e., the proportion of samples covering a mode). More details on those metrics can be found in the Supplementary Material.

We compared the performance of our proposed methods to that of the Uniform algorithm and of the vanilla GAN (Goodfellow et al., 2014). Our proposed methods could cover the 25 modes. KDE plots for the 3 discriminators case are shown in Fig. 5.

Modes (max 1000) KL
DCGAN (Radford, Metz, and Chintala, 2015)
ALI (Dumoulin et al., 2016)
Unrolled GAN (Metz et al., 2016)
VEEGAN (Srivastava et al., 2017a)
PacGAN (Lin et al., 2017)
GAN+MINE (Belghazi et al., 2018)
acGAN (3D)
acGAN (5D)
Table 2: Number of modes covered and Kullback-Leibler divergence between the real and generated distributions on Stacked-MNIST. acGAN could recover the 1000 modes.
Figure 6: Stacked-MNIST generated samples for acGAN with 3 (left) and 5 (right) discriminators.


We use the Stacked-MNIST dataset (Srivastava et al., 2017b) to measure the mode coverage of our proposed approach. The dataset is generated by stacking 3 randomly selected digits from the MNIST dataset, one on each RGB channel, to produce a final RGB tensor. The dataset has 128,000 training images and is assumed to have 1000 modes. Results of our experiments are shown in Table 2.
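The construction of a stacked sample, and the reason the dataset is assumed to have 1000 modes, can be sketched as follows. Random arrays stand in for real MNIST digits, the 28 by 28 size is MNIST's native resolution (no resizing is assumed), and `stack_digits` is a hypothetical helper; each channel carries one of 10 digit classes, giving 10 × 10 × 10 = 1000 possible combinations.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for MNIST: random 28x28 "digits" with random labels in 0..9.
digits = rng.random((100, 28, 28))
labels = rng.integers(0, 10, size=100)

def stack_digits(digits, labels, rng):
    """Stack 3 randomly selected digits as RGB channels and return the image and its mode id."""
    idx = rng.integers(0, len(digits), size=3)
    image = np.stack([digits[i] for i in idx], axis=-1)          # shape (28, 28, 3)
    # Encode the three channel labels as a single number in 0..999.
    mode = int(labels[idx[0]]) * 100 + int(labels[idx[1]]) * 10 + int(labels[idx[2]])
    return image, mode

image, mode = stack_digits(digits, labels, rng)
num_modes = 10 ** 3
```

Counting the distinct mode ids among generated samples (after classifying each channel) is one way to measure mode coverage on this dataset.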

We report our results (averaged over 10 runs) in Table 2 and compare them with other existing baselines in the literature. Our method could recover all 1000 modes, like PacGAN (Lin et al., 2017) and MINE (Belghazi et al., 2018); these two approaches increase the dimensionality of the generator inputs, either by packing multiple samples or by adding a latent code vector, which helps overcome mode collapse. Generated samples are shown in Fig. 6. Our results further support our hypothesis that acGAN is a sensible approach to ensuring good mode coverage and sample quality.

Figure 7: Weight of each discriminator over the training epochs. We can see phase switching at the beginning where each discriminator’s weight is dominating before eventually converging to a uniform distribution.


We conducted an in-depth study of acGAN’s performance on CIFAR-10 by running experiments on 5 independent seeds for 50 epochs each.

We found a particular pattern in acGAN’s learning process: it consists of distinct regimes where one discriminator’s weight dominates the others. To illustrate this, we averaged the sampling probability of each discriminator over every 200 iterations and plotted the results in Fig. 7 for 3 and 5 discriminators, respectively. The reported curves suggest that, for 3 discriminators, the weakest discriminator network is often sampled at the beginning until the generator learns enough from it, at which point it begins to use the stronger discriminator more often. Note how the strong discriminator is sampled more frequently than the weak one. In fact, because the generator needs to produce samples of higher quality to fool the strong discriminator, training with the latter might take longer than training with weaker discriminators (which are more lenient). By the end of training, all discriminators are used in equal proportions, meaning that every discriminator plays a complementary role, from mode coverage to sample quality. A similar pattern is observed for the 5-discriminator case.
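The smoothing used for the weight curves, averaging each discriminator's sampling probability over consecutive blocks of 200 iterations, can be sketched as below. The array of per-iteration probabilities is synthetic, and `window_average` is a hypothetical helper that drops any incomplete trailing block.

```python
import numpy as np

def window_average(probs, window=200):
    """Average per-iteration probabilities over consecutive, non-overlapping windows."""
    probs = np.asarray(probs, dtype=float)       # shape (iterations, num_discriminators)
    n = (len(probs) // window) * window          # ignore an incomplete final window
    return probs[:n].reshape(-1, window, probs.shape[1]).mean(axis=1)

# Synthetic example: 1,000 iterations, 3 discriminators, random normalized weights.
rng = np.random.default_rng(0)
raw = rng.random((1000, 3))
raw = raw / raw.sum(axis=1, keepdims=True)
curves = window_average(raw, window=200)         # one row per block of 200 iterations
```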

To assess the quality of the produced results, we report the minimum Fréchet Inception Distance (FID, Heusel et al., 2017) reached, along with the corresponding epoch, in Table 3. The squared FID was computed every epoch with 1,000 held-out samples at training time. As in Fedus et al. (2017), a ResNet pre-trained on CIFAR-10 was employed to obtain representations for FID computation rather than Inception V3. Proceeding this way yields a more informative score, given that our classifier was trained on the same data as the generative models. Details on the FID score can be found in the Supplementary Material. We compared our results to Durugkar, Gemp, and Mahadevan (2016). Since the authors reported that GMAN-1 had an overall better performance, we used this version in our experiments and refer to it as GMAN. Previously, we observed that the feedback provided to the generator is shared between all the discriminators. In particular, not all of the gradient comes from the strong discriminator (unlike for the vanilla GAN). One might therefore be concerned about a degradation of sample quality. We show that having more discriminators leads to better mode coverage and sample quality (see the FID curves for an increasing number of discriminators in the Supplementary Material). Overall, we noticed that acGAN achieved the best FID score when compared to the baselines, as presented in Figs. 8 and 9 (plots are shown in a larger format in the Supplementary Material). GMAN performed worse than expected, and increasing the number of discriminators did not significantly improve its FID score. We suspect that the original loss function of the GAN (which is equivalent to Jensen-Shannon divergence minimization) is not a good signal to assess the progress of the generator.
Indeed, Arjovsky, Chintala, and Bottou (2017) argued, and introduced a toy example showing, that this version of adversarial nets is not informative when there is little overlap between the supports of the true and approximate distributions, as commonly seen at the beginning of the training process. Finally, not keeping a moving average via a $Q$-value can lead to high variance.
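For reference, the Fréchet distance between two sets of feature vectors, each summarized by a Gaussian, can be computed with NumPy alone. This is a generic sketch of the standard formula, not the evaluation code used for the reported scores: the features here are synthetic rather than ResNet or Inception activations, and `frechet_distance` is a hypothetical helper. The trace of the matrix square root is obtained from the eigenvalues of the product of the two covariance matrices.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets of shape (n, d)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues of
    # cov_a @ cov_b, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

# Synthetic check: shifting every feature by 2 in 4 dimensions leaves the covariance
# unchanged, so the distance reduces to the squared mean shift.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(1000, 4))
fake_feats = real_feats + 2.0
fd_same = frechet_distance(real_feats, real_feats)
fd_shift = frechet_distance(real_feats, fake_feats)
```

With only 1,000 samples and high-dimensional features, the covariance estimates are noisy, which is one reason scores computed this way are best compared under identical settings.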

Figure 8: FID scores computed with 1,000 samples at the end of each epoch for different methods with 3 discriminators. acGAN outperforms the baselines Uniform and GMAN.
Figure 9: FID curves with 5 discriminators. acGAN presented earlier convergence and reached lower FID values.
Method        Best FID (epoch), one entry per seed                          Mean Best FID
Vanilla GAN   5.02 (20) - 5.28 (27) - 4.27 (30) - 4.80 (34) - 4.63 (41)     4.80
WGAN-GP (*)   4.29 (43) - 4.24 (28) - 3.98 (47) - 3.99 (37) - 3.93 (50)     4.08

3 Disc

Uniform       4.18 (20) - 4.07 (39) - 4.35 (45) - 5.07 (30) - 4.39 (47)     4.41
GMAN          3.87 (43) - 4.05 (46) - 5.24 (42) - 5.71 (42) - 4.10 (22)     4.59
acGAN         3.93 (39) - 3.57 (38) - 4.25 (42) - 3.43 (40) - 3.11 (43)     3.66

5 Disc

Uniform       3.42 (47) - 3.69 (49) - 4.37 (37) - 3.64 (37) - 3.47 (40)     3.72
GMAN          4.58 (44) - 4.40 (20) - 3.91 (47) - 4.81 (25) - 4.42 (38)     4.42
acGAN         3.62 (35) - 2.62 (49) - 4.14 (35) - 2.66 (42) - 3.67 (34)     3.34
Table 3: Best FID scores on CIFAR-10 computed on 1,000 samples during training time (lower is better). (*) We replaced the batch norm layer with instance norm.

5 Conclusion

In this work, we model the training of the generator against discriminators of increasing complexity within a one-student/multiple-teachers paradigm. We address this mixture-of-experts problem under the full-information adversarial bandit setting, where we rely on the Hedge algorithm to learn the weights assigned to each discriminator in the mixture. Since designing a suitable reward function is key to controlling the shape of the learned policy, we examined two sensible reward functions, which relied on sample quality and the GAN loss function, respectively. We empirically found the sample quality reward (Eq. 3.2) to yield the best results. Keeping a moving average of the rewards helped smooth the weights put on discriminators and resulted in a more stable mixture.

Then, we demonstrated a complementary regulation mechanism between weak and strong discriminators. While weaker discriminators are smoother and provide more informative feedback to the generator, stronger discriminators focus on finer-grained detail to ensure sample quality.

Finally, we conducted a series of experiments to show the emergence of a curriculum during the training process. That is, lower-capacity discriminators have higher weights at the beginning but, as the training progresses, higher weights are allocated to higher-capacity discriminators. We showed how existing algorithms could be recovered from our model via the -value. The performed experiments showed that our proposed approach leads to an earlier convergence and a better FID score compared to existing baselines in the field, i.e. Uniform and GMAN.

As a direction for future investigation, approaches not relying on the adversarial framework could be investigated to model the non-stationarity of the reward distributions. For example, finding a meaningful representation for the state of the generator could allow the use of contextual bandits algorithms.


Supplementary Material

Figure 10: Adding noise (bottom row) reduces gradient norm magnitude of each discriminator. This increases their smoothness properties and helps recovering modes of the distribution. We clipped the gradient magnitude with respect to the corresponding discriminator corrupted with noise.
Figure 11: Probability for each discriminator without (top) and with (bottom) white Gaussian noise. Noise tends to smooth their decision boundary and increase their entropy. That helps to provide more informative gradient to the generator.

Regularizing the discriminator through additive white noise

As explored in Arjovsky and Bottou (2017), one way to stabilize GAN training is to corrupt the input of the network with additive white Gaussian noise of the form $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. Here, we explore smoothing the discriminator by using such noise. We ran the following experiment in order to illustrate the mechanism. We (once again) train on the 8 Gaussian synthetic dataset with 3 discriminators (1, 2 and 3 hidden layers), both with and without adding independent Gaussian noise to the discriminator’s input. A noticeable downside of feeding corrupted inputs to the network is the degradation of sample quality: the so-called salt-and-pepper effect becomes more visible as the discriminators train. To solve this issue, we decay the noise at time step $t$ by a multiplicative coefficient governed by a real constant that controls the noise reduction speed. The initial Gaussian noise was zero-mean, with a separate variance for each discriminator, ordered from the weakest discriminator to the strongest. Adding white noise increases the entropy (read uncertainty) of the discriminator (a proof is shown in the Supplementary Material) and tends to smooth its decision boundary (see the probability and gradient norm values in Figs. 11 and 10). For a fixed number of parameters, fitting a discriminator to uncorrupted input is prone to faster overfitting than training on noisy data, a great illustration of which is provided by Gu (2008). Empirical results are shown in Fig. 10. We see that by corrupting the real data we manage to cover all 8 modes, and that sample quality is preserved by decaying the variance of the noise. The evolution of generated data points is shown in Fig. 14.
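The idea of corrupting discriminator inputs with decaying noise can be sketched as follows. The exact decay rule is not recoverable from the text, so this example assumes a geometric schedule in which the noise standard deviation is multiplied by a constant factor at every step; `noise_std`, `corrupt`, `sigma0` and `gamma` are hypothetical names introduced for illustration.

```python
import numpy as np

def noise_std(sigma0, gamma, t):
    """Assumed schedule: standard deviation sigma0 * gamma**t, with 0 < gamma < 1."""
    return sigma0 * gamma ** t

def corrupt(x, sigma0, gamma, t, rng):
    """Add zero-mean white Gaussian noise with the decayed standard deviation to a batch x."""
    return x + rng.normal(0.0, noise_std(sigma0, gamma, t), size=x.shape)

# Toy usage: the same batch corrupted early and late in training.
rng = np.random.default_rng(0)
batch = np.zeros((512, 2))
early = corrupt(batch, sigma0=1.0, gamma=0.5, t=0, rng=rng)
late = corrupt(batch, sigma0=1.0, gamma=0.5, t=10, rng=rng)
```

Under such a schedule the discriminators see heavily smoothed data at the start and nearly clean data at the end, which matches the stated goal of preserving sample quality.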

5.1 Effect of different nonlinear activation layer on the weak discriminator’s smoothness

In this section, we illustrate the effect of 3 nonlinear activation layers (Tanh, Leaky ReLU and ReLU) on the gradient norm of the discriminator. We trained the generator with 3 discriminators (using Soft-acGAN) on the 8-Gaussian dataset. For performance reasons, we only replace the activation layer of the weak discriminator with each of the 3 activation layers mentioned above in turn, and keep ReLU in the other discriminators. Fig. 12 shows the gradient norm of the weak discriminator over the whole space. Tanh has a very uniform gradient norm across the space, ReLU is the most discontinuous, and Leaky ReLU shows an intermediate pattern: it is less discontinuous than ReLU, although it partitions the subspace in the same way. However, Tanh exhibits flat behaviour (very small magnitude), which may be due to the very low gradient signal of the Tanh function at its extremities (indeed, we observe very poor performance with that activation layer).

Figure 12: Although Tanh (left) presents a smoother partition of the subspace than LeakyReLU (middle) and ReLU (right), it seems to have a weak gradient signal (small gradient norm magnitude).
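A gradient-norm map of this kind can be computed as in the sketch below. It is only an illustration under stated assumptions: the discriminator is a hypothetical one-hidden-layer network with random, untrained weights, the Leaky ReLU slope of 0.2 is an assumed value, and the function name `input_gradient_norm` is not taken from the paper.

```python
import numpy as np

ACTIVATIONS = {
    # name: (function, derivative)
    "tanh": (np.tanh, lambda z: 1.0 - np.tanh(z) ** 2),
    "relu": (lambda z: np.maximum(z, 0.0), lambda z: (z > 0).astype(float)),
    "leaky_relu": (lambda z: np.where(z > 0, z, 0.2 * z),   # slope 0.2 is assumed
                   lambda z: np.where(z > 0, 1.0, 0.2)),
}

def input_gradient_norm(points, W1, b1, w2, b2, activation):
    """Norm of dD/dx for D(x) = sigmoid(w2 . act(W1 x + b1) + b2)."""
    act, act_grad = ACTIVATIONS[activation]
    z = points @ W1.T + b1                     # pre-activations, shape (n, hidden)
    logit = act(z) @ w2 + b2                   # shape (n,)
    d = 1.0 / (1.0 + np.exp(-logit))           # discriminator output in (0, 1)
    dlogit_dx = (act_grad(z) * w2) @ W1        # chain rule through the hidden layer
    grad = (d * (1.0 - d))[:, None] * dlogit_dx
    return np.linalg.norm(grad, axis=1)

rng = np.random.default_rng(0)
hidden = 32  # hypothetical width
W1, b1 = rng.normal(size=(hidden, 2)), rng.normal(size=hidden)
w2, b2 = rng.normal(size=hidden) / np.sqrt(hidden), 0.0

# Evaluate the gradient norm on a regular 50 x 50 grid of the 2D input space.
axis = np.linspace(-2.0, 2.0, 50)
grid = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, 2)
norms = {name: input_gradient_norm(grid, W1, b1, w2, b2, name).reshape(50, 50)
         for name in ACTIVATIONS}
```

With ReLU, the gradient of the logit is piecewise constant and jumps across the hyperplanes where a hidden unit switches on, which is one way to read the partitioning visible in the figure.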

5.2 Evolution of the gradient norm during the training

In this section, we show the evolution of the gradient norm of each discriminator throughout the training process (results shown in Fig. 13). At the beginning of training (first row), when the generator has just learned the top-left modes, the discriminator is flat on the bottom-right part of the subspace and has a higher gradient norm on the top-left part. As training progresses, the missing modes have a high gradient norm (second row, third column). Finally, at the end, when the generator has learned all the modes, the weak discriminator has a more uniform gradient norm over the space, while the strong discriminator has an equal gradient norm value at each mode location.

Figure 13: Evolution of the gradient norm for each discriminator and samples generated (last column). The generator recovers modes thanks to the gradients provided by the weak and intermediate discriminators. Each discriminator in turn evolves to learn its coarse to fine-grained representation of the data. Note also that the strong discriminator has a good representation of all the modes before the generator has learned them, indicating that mode dropping in this setting is not due to those modes being absent in the discriminator. We have clipped the gradient range with respect to the weak discriminator of the corresponding row.

5.3 Regularizing the discriminator through additive noise

Figure 14: As we exponentially decay the noise, sample quality increases (2500 samples are plotted).
Increasing entropy with additive Gaussian noise.

In order to have discriminator networks with varying degrees of strength, we first resorted to nested architectures: for instance, the stronger discriminator should have a more complex architecture than the weaker one. Moreover, we corrupted the inputs with additive white Gaussian noise. Formally, we added noise to the input matrix of the discriminator, creating a new input that was then fed to the discriminator. For practical purposes, noise for image data should have bounded support in order to obtain meaningful RGB values.
Letting the weaker discriminators train on inputs corrupted with Gaussian noise of larger variance allows the network to learn a high-level representation of the dataset, while feeding uncorrupted inputs lets the corresponding discriminator specialize. This tradeoff between sample space coverage and estimation accuracy is known as the spike-and-slab prior and is frequently used in Bayesian variable selection methods similar to the one proposed by Mitchell and Beauchamp (1988).
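For image data, one simple way to keep corrupted inputs inside a valid pixel range is to clip them after adding the noise. The sketch below assumes images scaled to [-1, 1] (matching a Tanh generator output) and is only one possible reading of the bounded-support remark; the function name `corrupt_images` and the noise level are hypothetical.

```python
import numpy as np

def corrupt_images(images, sigma, rng, low=-1.0, high=1.0):
    """Add Gaussian noise, then clip back to the assumed valid pixel range."""
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, low, high)

rng = np.random.default_rng(0)
images = rng.uniform(-1.0, 1.0, size=(16, 28, 28, 3))  # stand-in batch in [-1, 1]
noisy_images = corrupt_images(images, sigma=0.1, rng=rng)
```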
Consider the following relation, known in information theory as the entropy power inequality (EPI). Let be a continuous, real-valued and independent random vector on a bounded support and , both of dimension :


Applying logarithm on both sides and using , we get an expression for the entropy of the sum and :


from which it follows that adding i.i.d. Gaussian noise to the inputs increases the total entropy of the data. Here we used the fact that for all and that the uniform distribution has the maximal entropy over all continuous random variables with bounded support. The quantity controls the tightness of the bound. Because Eq. 9 is valid for all , it is necessarily valid for . Maximizing the expression shows that picking increases the overall entropy and approaches the uniform distribution.
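For reference, the textbook form of the entropy power inequality and its consequence for Gaussian noise are written out below. The symbols $X$, $Z$, $d$ and $\sigma^2$ are assumed notation, and this standard statement may differ from the parametrized form that the derivation above relies on.

```latex
% Textbook EPI for independent random vectors X, Z in R^d with densities
e^{\frac{2}{d} h(X+Z)} \;\ge\; e^{\frac{2}{d} h(X)} + e^{\frac{2}{d} h(Z)}

% For Z ~ N(0, \sigma^2 I_d) we have h(Z) = \frac{d}{2}\log(2\pi e\,\sigma^2), hence
h(X+Z) \;\ge\; \frac{d}{2}\log\!\left( e^{\frac{2}{d} h(X)} + 2\pi e\,\sigma^{2} \right) \;>\; h(X)
```

The strict inequality on the right holds for any $\sigma^2 > 0$, which is the sense in which adding independent Gaussian noise increases the entropy of the input.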
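Finally, recall that the entropy satisfies $h(X) = \log|\mathcal{S}| - \mathrm{KL}(p_X \,\|\, u_{\mathcal{S}})$, where $p_X$ is the density of $X$ and $u_{\mathcal{S}}$ is the uniform density on its support $\mathcal{S}$ of finite volume $|\mathcal{S}|$. That is, maximizing the entropy of a distribution is equivalent up to an additive constant to minimizing the Kullback-Leibler divergence between the distribution and a uniform random variable with identical support (provided adequate restrictions on the support).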

Fitting a weak discriminator to the corrupted data should increase its capacity to generalize more than that of the stronger discriminator by acting as a regularization technique and preventing the network from overfitting.
Analogous mechanisms are widely used in conjunction with other learning algorithms, such as support vector machines, where adding noise to the data is equivalent to increasing the classification margin, as shown by Xu, Caramanis, and Mannor (2009).
As a final remark, it is important to select a proper noise distribution in order to avoid introducing bias and respect the original structure of the data.

5.4 Experimental parameters

Algorithm parameters
number of discriminators
Optimizer parameters
Stacked-MNIST RMSprop ()
CIFAR-10 Adam ()
Synthetic (25 Gaussians) Adam ()
Synthetic (8 Gaussians) Adam ()
CelebA Adam ()
Table 4: General experimental hyperparameters.

5.5 Synthetic data

We utilize the 2D ring with 8 Gaussians and the 2D grid with 25 Gaussians (Gulrajani et al., 2017). Three metrics were employed to evaluate the results:

  1. High Quality samples

  2. Number of Covered modes

  3. Fréchet Distance (FD)

The percentage of "High Quality" samples is defined as the proportion of generated samples that lie within a fixed number of standard deviations of the closest mode. The next metric reported is the number of modes covered, i.e. the count of modes that have generated samples close enough to them. The Fréchet Distance, originally from Dowson and Landau (1982), is defined as:

$$d^2\big((\mu_x, \Sigma_x), (\mu_y, \Sigma_y)\big) = \|\mu_x - \mu_y\|_2^2 + \mathrm{Tr}\big(\Sigma_x + \Sigma_y - 2(\Sigma_x \Sigma_y)^{1/2}\big),$$

where $(\mu_x, \Sigma_x)$ and $(\mu_y, \Sigma_y)$ are the first- and second-order moments of the real data distribution and the estimates from generated data, respectively.
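The three metrics can be sketched in NumPy as follows. The ring radius, the per-mode standard deviation, the 3-standard-deviation threshold and the rule that a mode counts as covered once it receives at least one high-quality sample are all assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np

def ring_modes(n_modes=8, radius=2.0):
    """Centres of a 2D ring of Gaussians (radius is an assumed value)."""
    angles = 2 * np.pi * np.arange(n_modes) / n_modes
    return radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)

def mode_metrics(samples, modes, mode_std, n_std=3.0):
    """Fraction of high-quality samples and number of covered modes."""
    dists = np.linalg.norm(samples[:, None, :] - modes[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    high_quality = dists.min(axis=1) <= n_std * mode_std
    covered = np.unique(nearest[high_quality]).size
    return high_quality.mean(), covered

def frechet_distance_sq(x, y):
    """Squared Frechet distance between Gaussians fitted to x and y."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x, cov_y = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    # Tr((cov_x cov_y)^{1/2}) is the sum of square roots of the eigenvalues of
    # cov_x cov_y, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((mu_x - mu_y) ** 2)
                 + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
modes, mode_std = ring_modes(), 0.05
real = modes[rng.integers(0, 8, size=5000)] + rng.normal(0, mode_std, size=(5000, 2))
# Simulated mode collapse: the "generator" only produces 4 of the 8 modes.
fake = modes[rng.integers(0, 4, size=5000)] + rng.normal(0, mode_std, size=(5000, 2))

hq, covered = mode_metrics(fake, modes, mode_std)
fd = frechet_distance_sq(real, fake)
```

In this toy setting the collapsed generator still scores well on sample quality while covering only half of the modes, which is why the metrics are reported together.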


The generator network’s architecture comprises 4 dense layers of equal width. We used 3 discriminators with 2, 3 and 4 dense layers, respectively. ReLU activations were used in all layers except the last one, where a linear activation function was used for the generator and a sigmoid for the discriminator.



We used DCGAN’s architecture (Radford, Metz, and Chintala, 2015) to create lower-capacity discriminators (in terms of feature representation power). For the 3Ds case, we used discriminators 3, 4, 5 (described in the following tables). For the 5Ds case, we used discriminators 1, 2, 3, 4, 5.

Layer Outputs Kernel size Stride BN Activation
Fully connected 2*2*512 4, 4 2, 2 Yes ReLU
Transposed convolution 4*4*256 4, 4 2, 2 Yes ReLU
Transposed convolution 8*8*128 4, 4 2, 2 Yes ReLU
Transposed convolution 14*14*64 4, 4 2, 2 Yes ReLU
Transposed convolution 28*28*3 4, 4 2, 2 No Tanh
Table 5: Generator’s architecture.
Layer Outputs Kernel size Stride BN Activation
Input 28*28*3
Convolution 14*14*64 4, 4 2, 2 No LeakyReLU
Convolution 7*7*128 4, 4 2, 2 Yes LeakyReLU
Convolution 4*4*256 4, 4 2, 2 Yes LeakyReLU
Convolution 2*2*512 4, 4 2, 2 Yes LeakyReLU
Convolution 1 4, 4 2, 2 No Sigmoid
Table 6: Discriminator 5.
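The spatial sizes listed in Table 6 are consistent with stride-2 convolutions that use "same" padding, for which the output size is the input size divided by the stride, rounded up. The padding scheme is an assumption, since the table does not state it; the short check below reproduces the sequence of spatial sizes.

```python
import math

def same_padding_output(size, stride):
    """Spatial output size of a strided convolution with 'same' padding."""
    return math.ceil(size / stride)

sizes = [28]  # input resolution from Table 6
for _ in range(5):  # five stride-2 convolutions
    sizes.append(same_padding_output(sizes[-1], stride=2))

print(sizes)  # [28, 14, 7, 4, 2, 1]
```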
Layer Outputs Kernel size Stride BN Activation
Input 28*28*3
Convolution 14*14*64 4, 4 2, 2 No LeakyReLU
Convolution 7*7*128 4, 4 2, 2 Yes LeakyReLU
Convolution 4*4*256 4, 4 2, 2 Yes LeakyReLU
Convolution 2*2*512 4, 4 2, 2 Yes LeakyReLU
Fully connected 1 4, 4 2, 2 No Sigmoid
Table 7: Discriminator 4.
Layer Outputs Kernel size Stride BN Activation
Input 28*28*3
Convolution 13*13*64 6, 6 2, 2 No LeakyReLU
Convolution 6*6*128 6, 6 2, 2 Yes LeakyReLU
Convolution 2*2*256 6, 6 2, 2 Yes LeakyReLU
Convolution 1 6, 6 2, 2 No Sigmoid
Table 8: Discriminator 3.
Layer Outputs Kernel size Stride BN Activation
Input 28*28*3
Convolution 13*13*64 6, 6 2, 2 No LeakyReLU
Convolution 6*6*128 6, 6 2, 2 Yes LeakyReLU
Convolution 2*2*256 6, 6 2, 2 Yes LeakyReLU
Fully connected 1 6, 6 2, 2 No Sigmoid
Table 9: Discriminator 2.
Layer Outputs Kernel size Stride BN Activation
Input 28*28*3
Convolution 12*12*64 8, 8 2, 2 No LeakyReLU
Convolution 4*4*128 8, 8 2, 2 Yes LeakyReLU
Convolution 1 8, 8 2, 2 No Sigmoid
Table 10: Discriminator 1.
(a) acGAN - 3 disc.
(b) acGAN - 5 disc.
Figure 15: Stacked-MNIST generated samples.

5.6 CIFAR-10

FID score. FID scores, as introduced in Heusel et al. (2017), were computed for CIFAR-10. The FID is defined as the squared Fréchet distance between the two Gaussians whose first- and second-order statistics match those obtained from the image features of real and generated images. The late layers of a pretrained classifier are used as a low-dimensional representation of images for estimating these statistics.

Architecture. For our strongest discriminator we use the DCGAN architecture but with half the number of filters. For the 3D case, we introduced two extra discriminators with kernel sizes of 6 and 8. For the 5D case, we added two discriminators with kernel sizes 4 and 6, respectively, to the set of 3D discriminator networks. In both the 3D and 5D cases, we replaced the last layer from the DCGAN model with a fully connected dense layer. The generator network was taken from the original DCGAN architecture, also with halved filter sizes. ReLU activation units were used for the generator network, while LeakyReLU (with a fixed slope coefficient) was used for the discriminators.

Influence of the number of discriminators. An important assumption in the current paper is that increasing the number of discriminator networks helps the model converge faster. To assess this, we conducted experiments with the acGAN algorithm while varying the number of discriminators (a single discriminator being the Vanilla GAN) and averaging results over 5 seeds. As Fig. 16 shows, a higher number of discriminators indeed leads to earlier convergence of the FID score curve.

Figure 16: Increasing the number of discriminators induces an earlier convergence of FID. Moreover, lower FID values are reached.
(a) 3 Discriminators
(b) 5 Discriminators
Figure 17: Average FID score of each method for different numbers of discriminators. In both plots, the acGAN algorithm converges faster than the other methods.
Figure 18: acGAN with 5 discriminators shows earlier convergence and better performance than Vanilla GAN (1 Disc) and WGAN-GP.
(a) Real Images
(b) Vanilla GAN
(c) Uniform - 3 disc.
(d) Uniform - 5 disc.
(e) GMAN - 3 disc.
(f) GMAN - 5 disc.
Figure 19: CIFAR-10 generated samples (1).
(a) acGAN - 3 disc.
(b) acGAN - 5 disc.
Figure 20: CIFAR-10 generated samples (2).


For the CelebA dataset (Liu et al., 2015), we conducted single-run experiments of 50,000 iterations each, counted in generator steps. We downscaled the original images out of practical concerns.


Similarly to CIFAR-10, we observe the emergence of a curriculum in Fig. 23. In particular, we note the presence of alternating phases during which a specific discriminator is dominating in the 3D and 5D cases. In the end, all discriminator probabilities converge to a stationary (i.e. long term) uniform distribution just like for previously mentioned datasets.


We used the same architecture as for the CIFAR-10 experiments, except for the numbers of filters, which were set separately for the discriminators and for the generator.

(a) Real Images
(b) Vanilla GAN
(c) Uniform - 3 disc.
(d) Uniform - 5 disc.
Figure 21: CelebA generated samples (1).
(a) GMAN - 3 disc.
(b) GMAN - 5 disc.
(c) acGAN - 3 disc.
(d) acGAN - 5 disc.
Figure 22: CelebA generated samples (2).
Figure 23: Weight of each discriminator over the training epochs. We can see switching phases, in which one discriminator’s weight is dominant with respect to the rest. After some epochs, all weights converge to a uniform regime.
Generating 128x128 images.

In this experiment, we generated high-resolution images with 3 and 5 discriminators. A convolutional layer with 2048 feature maps was added to both the generator and discriminator architectures. The 3-discriminator setting used kernel sizes of 4, 6 and 8. For the 5-discriminator case, we added discriminators of kernel sizes 4 and 6 whose last layer was replaced with a dense layer. The same parameters were employed as for CelebA 64x64.

(a) acGAN - 3 discriminators
(b) acGAN - 5 discriminators
Figure 24: Interpolating in latent space with 3 and 5 Discriminators.
(a) acGAN - 3 discriminators
(b) acGAN - 5 discriminators
Figure 25: 128x128 CelebA samples for acGAN trained for 50 epochs with 3 and 5 discriminators.