1 Introduction
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) offer a new approach to generative modeling, using game-theoretic training schemes to implicitly learn a given probability density. Prior to the emergence of GAN architectures, realistic generative modeling remained elusive. While offering unprecedented realism, GAN training remains fraught with stability issues. Commonly reported shortcomings involve the lack of useful gradient signal provided by the discriminator, and mode collapse, i.e. lack of diversity in the generator’s samples.
Considerable research effort has been devoted in recent literature to overcoming training instability within the GAN framework. Instability is meant here in the sense commonly used in the GAN literature, i.e. the discriminator is able to easily distinguish between real and fake samples during the training phase (Neyshabur et al., 2017; Arjovsky et al., 2017; Berthelot et al., 2017). Some architectures such as BEGAN (Berthelot et al., 2017) have applied autoencoders as discriminators and proposed a new loss function to help stabilize training. Methods such as TTUR (Heusel et al., 2017), in turn, have attempted to define separate schedules for updating the generator and discriminator. The PacGAN algorithm (Lin et al., 2017) proposes to modify the discriminator’s architecture to accept $m$ concatenated samples as input. These samples are jointly classified as either real or generated, and the authors show that such an approach can help enforce sample diversity. Furthermore, spectral normalization was introduced to the discriminator’s parameters in SNGAN (Miyato et al., 2018), aiming to ensure Lipschitz continuity, which is empirically shown to yield high-quality samples across several sets of hyperparameters.
Alternatively, recent works have proposed to tackle GAN instability issues with multiple discriminators. Neyshabur et al. (2017) propose a GAN variation in which one generator is trained against a set of discriminators, where each one sees a fixed random projection of the inputs. Prior work, including Durugkar et al. (2016) and Doan et al. (2018), has also explored training with multiple discriminators.
In this paper, we build upon the framework introduced by Neyshabur et al. (2017) and propose reformulating the average loss minimization to further stabilize GAN training. Specifically, we propose treating the loss signal provided by each discriminator as an independent objective function. To achieve this, we simultaneously minimize the losses using multiobjective optimization techniques. Namely, we exploit previously introduced methods in the literature such as the multiple gradient descent (MGD) algorithm (Désidéri, 2012). However, due to MGD’s prohibitively high cost in the case of large neural networks, we propose to use more efficient alternatives such as maximization of the hypervolume in the region defined between a fixed, shared upper bound on the losses, which we will refer to as the nadir point, and each of the component losses.
In contrast to the approach of Neyshabur et al. (2017), where the average loss is minimized when training the generator, hypervolume maximization (HV) optimizes a weighted loss, and the generator’s training will adaptively assign greater importance to feedback from discriminators against which it performs poorly.
Experiments performed on MNIST show that HV presents a good compromise in the computational cost vs. sample quality trade-off when compared to average loss minimization or GMAN’s approach (low quality and cost), and MGD (high quality and cost). The sensitivity to the introduced hyperparameters is also studied, and results indicate that increasing the number of discriminators increases the generator’s robustness along with sample quality and diversity. Experiments on CIFAR10 indicate that the described method produces higher quality generator samples in terms of quantitative evaluation. Moreover, image quality and sample diversity are once more shown to consistently improve as we increase the number of discriminators.
In summary, our main contributions are the following:

We offer a new perspective on multiple-discriminator GAN training by framing it in the context of multiobjective optimization, and draw similarities between previous research on GAN variations and MGD, commonly employed as a general solver for multiobjective optimization.

We propose a new method for training multiple-discriminator GANs: hypervolume maximization, which weighs the gradient contributions of each discriminator by its loss.
The remainder of this document is organized as follows: Section 2 introduces definitions on multiobjective optimization and MGD. In Section 3 we describe prior relevant literature. Hypervolume maximization is detailed in Section 4, with experiments and results presented in Section 5. Conclusions and directions for future work are drawn in Section 6.
2 Preliminaries
In this section we provide some definitions regarding multiobjective optimization from prior literature which will be useful in the following sections. Boldface notation is used to denote vector-valued variables.
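Multiobjective optimization. A multiobjective optimization problem is defined as (Deb, 2001):
$$\min \mathbf{F}(\mathbf{x}) = [f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_K(\mathbf{x})]^T, \quad \mathbf{x} \in \Omega, \tag{1}$$
where $K$ is the number of objectives, $\Omega$ is the variables space and $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T \in \Omega$ is a decision vector or possible solution to the problem. $\mathbf{F}: \Omega \rightarrow \mathbb{R}^K$ is a set of $K$ objective functions that maps the $n$-dimensional variables space to the $K$-dimensional objective space.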
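Pareto-dominance. Let $\mathbf{x}_1$ and $\mathbf{x}_2$ be two decision vectors. $\mathbf{x}_1$ is said to dominate $\mathbf{x}_2$ (denoted by $\mathbf{x}_1 \prec \mathbf{x}_2$) if and only if $f_i(\mathbf{x}_1) \leq f_i(\mathbf{x}_2)$ for all $i \in \{1, 2, \ldots, K\}$ and $f_j(\mathbf{x}_1) < f_j(\mathbf{x}_2)$ for some $j \in \{1, 2, \ldots, K\}$. If a decision vector $\mathbf{x}$ is dominated by no other vector in $\Omega$, $\mathbf{x}$ is called a non-dominated solution.
Pareto-optimality. A decision vector $\mathbf{x}^* \in \Omega$ is said to be Pareto-optimal if and only if there is no $\mathbf{x} \in \Omega$ such that $\mathbf{x} \prec \mathbf{x}^*$, i.e. $\mathbf{x}^*$ is a non-dominated solution. The Pareto-optimal Set (PS) is defined as the set of all Pareto-optimal solutions $\mathbf{x} \in \Omega$, i.e., $PS = \{\mathbf{x} \in \Omega \mid \mathbf{x} \text{ is Pareto-optimal}\}$. The set of all objective vectors $\mathbf{F}(\mathbf{x})$ such that $\mathbf{x}$ is Pareto-optimal is called the Pareto front (PF), that is $PF = \{\mathbf{F}(\mathbf{x}) \in \mathbb{R}^K \mid \mathbf{x} \in PS\}$.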
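To make the dominance relation concrete, the following minimal sketch filters a finite list of objective vectors down to its non-dominated subset, assuming all objectives are minimized. The function names and the candidate values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def dominates(fa: Sequence[float], fb: Sequence[float]) -> bool:
    """True if objective vector fa Pareto-dominates fb (all objectives minimized)."""
    no_worse = all(a <= b for a, b in zip(fa, fb))
    strictly_better = any(a < b for a, b in zip(fa, fb))
    return no_worse and strictly_better


def non_dominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep the objective vectors that no other vector in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical objective vectors (f_1, f_2) of four candidate solutions.
    candidates = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
    # (3.0, 3.0) is dominated by (2.0, 2.0); the other three are mutually non-dominated.
    print(non_dominated(candidates))
```

Note that a vector never dominates itself, since the strict inequality fails for identical vectors.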
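Pareto-stationarity. Pareto-stationarity is a necessary condition for Pareto-optimality. For $f_k$ differentiable everywhere for all $k$, $\mathbf{F}$ is Pareto-stationary at $\mathbf{x}$ if there exists a set of scalars $\alpha_k$, $k \in \{1, \ldots, K\}$, such that:
$$\sum_{k=1}^{K} \alpha_k \nabla f_k(\mathbf{x}) = \mathbf{0}, \quad \sum_{k=1}^{K} \alpha_k = 1, \quad \alpha_k \geq 0 \;\; \forall k. \tag{2}$$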
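Multiple Gradient Descent. Multiple gradient descent (Désidéri, 2012; Schäffler et al., 2002; Peitz & Dellnitz, 2018) was proposed for the unconstrained case of multiobjective optimization of $\mathbf{F}(\mathbf{x})$, assuming a convex, continuously differentiable and smooth $f_k(\mathbf{x})$ for all $k$. MGD finds a common descent direction for all $f_k$ by defining the convex hull of all $\nabla f_k(\mathbf{x})$ and finding the minimum norm element within it. Consider $\mathbf{w}^*$ given by:
$$\mathbf{w}^* = \arg\min \|\mathbf{w}\|, \quad \mathbf{w} = \sum_{k=1}^{K} \alpha_k \nabla f_k(\mathbf{x}), \quad \text{s.t.} \;\; \sum_{k=1}^{K} \alpha_k = 1, \;\; \alpha_k \geq 0 \;\; \forall k. \tag{3}$$
$\mathbf{w}^*$ will be either $\mathbf{0}$, in which case $\mathbf{x}$ is a Pareto-stationary point, or $\mathbf{w}^* \neq \mathbf{0}$, and then $-\mathbf{w}^*$ is a descent direction for all $f_k(\mathbf{x})$. Similar to gradient descent, MGD consists in finding the common steepest descent direction $\mathbf{w}_t^*$ at each iteration $t$, and then updating parameters with a learning rate $\lambda$ according to $\mathbf{x}_{t+1} = \mathbf{x}_t - \lambda \frac{\mathbf{w}_t^*}{\|\mathbf{w}_t^*\|}$.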
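For intuition, the minimum-norm problem has a closed form when there are only two objectives, since the convex hull of two gradients is a line segment. The sketch below is an illustration under that two-objective assumption, not the general quadratic program solved in the experiments; the helper names and the normalized update rule $\mathbf{x} \leftarrow \mathbf{x} - \lambda \mathbf{w}/\|\mathbf{w}\|$ are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def min_norm_element(g1: Sequence[float], g2: Sequence[float]) -> Tuple[float, List[float]]:
    """Minimum-norm point of the segment {a*g1 + (1-a)*g2 : 0 <= a <= 1}."""
    diff = [a - b for a, b in zip(g1, g2)]  # g1 - g2
    denom = dot(diff, diff)
    if denom == 0.0:
        alpha = 0.5  # identical gradients: every convex combination is the same vector
    else:
        # Unconstrained minimizer of ||g2 + a*(g1 - g2)||^2, then clipped to [0, 1].
        alpha = min(1.0, max(0.0, -dot(g2, diff) / denom))
    w = [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]
    return alpha, w


def mgd_step(x: Sequence[float], g1: Sequence[float], g2: Sequence[float],
             lr: float, tol: float = 1e-12) -> List[float]:
    """One normalized step along -w; x is returned unchanged if w is numerically zero."""
    _, w = min_norm_element(g1, g2)
    norm = math.sqrt(dot(w, w))
    if norm < tol:
        return list(x)  # Pareto-stationary for these two objectives
    return [xi - lr * wi / norm for xi, wi in zip(x, w)]


if __name__ == "__main__":
    # Orthogonal gradients: the common direction is their midpoint.
    print(min_norm_element([1.0, 0.0], [0.0, 1.0]))
    # Opposing gradients: the zero vector lies in the hull, so no common descent direction exists.
    print(min_norm_element([1.0, 0.0], [-1.0, 0.0]))
```

With more than two objectives no such closed form is available in general, which is where the per-iteration quadratic program and its cost come from.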
3 Related work
3.1 Training GANs with multiple discriminators
While we would prefer to always have strong gradients from the discriminator during training, the vanilla GAN makes this difficult to ensure, as the discriminator quickly learns to distinguish real and generated samples (Goodfellow, 2016), thus providing no meaningful error signal to improve the generator thereafter. Durugkar et al. (2016) proposed the Generative Multi-Adversarial Networks (GMAN), which consists of training the generator against a softmax-weighted arithmetic average of $K$ different discriminators:
$$\mathcal{L}_G = \sum_{k=1}^{K} \alpha_k \mathcal{L}_{D_k}, \tag{4}$$
where $\alpha_k = \frac{e^{\beta \mathcal{L}_{D_k}}}{\sum_{j=1}^{K} e^{\beta \mathcal{L}_{D_j}}}$, $\beta \geq 0$, and $\mathcal{L}_{D_k}$ is the loss of discriminator $k$ and is defined as
$$\mathcal{L}_{D_k} = -\mathbb{E}_{\mathbf{x} \sim p_{data}} \log D_k(\mathbf{x}) - \mathbb{E}_{\mathbf{z} \sim p_z} \log\left(1 - D_k(G(\mathbf{z}))\right), \tag{5}$$
where $D_k(\mathbf{x})$ and $G(\mathbf{z})$ are the outputs of the $k$-th discriminator and the generator, respectively. The goal of using the proposed averaging scheme is to favor worse discriminators, thus providing more useful gradients to the generator during training. Experiments were performed with $\beta = 0$ (equal weights), $\beta \rightarrow \infty$ (only the worst discriminator is taken into account), a fixed intermediate $\beta$, and $\beta$ learned by the generator. Models with different numbers of discriminators were tested and evaluated using a proposed metric and the Inception score (Salimans et al., 2016). Results showed that the simple average of the discriminators’ losses provided the best values for both metrics in most of the considered cases.
Neyshabur et al. (2017) proposed training a GAN with $K$ discriminators using the same architecture. Each discriminator $D_k$ sees a different randomly projected lower-dimensional version of the input image. Random projections are defined by a randomly initialized matrix $W_k$, which remains fixed during training. Theoretical results provided show that the distribution induced by the generator $p_G$ will converge to the real data distribution $p_{data}$, as long as there is a sufficient number of discriminators. Moreover, discriminative tasks in the projected space are harder, i.e. real and fake examples are more alike, thus avoiding early convergence of discriminators, which leads to common stability issues in GAN training such as mode collapse (Goodfellow, 2016). Essentially, the authors trade one hard problem for $K$ easier subproblems. The losses of each discriminator $\mathcal{L}_{D_k}$ are the same as shown in Eq. 5. However, the generator loss $\mathcal{L}_G$ is defined as the sum of the losses provided by each discriminator, as shown in Eq. 6. This choice of $\mathcal{L}_G$ does not exploit available information such as the performance of the generator with respect to each discriminator.
$$\mathcal{L}_G = -\sum_{k=1}^{K} \mathbb{E}_{\mathbf{z} \sim p_z} \log D_k(G(\mathbf{z})). \tag{6}$$
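The fixed random projections can be sketched as follows. This is only an illustration under stated assumptions: a dense Gaussian matrix per discriminator, seeded once so that it stays fixed throughout training. The names `make_projection` and `project`, the scaling by $1/\sqrt{\text{in\_dim}}$ and the toy dimensions are hypothetical and are not taken from the cited work.

```python
import math
import random
from typing import List, Sequence


def make_projection(in_dim: int, out_dim: int, seed: int) -> List[List[float]]:
    """Draw a fixed out_dim x in_dim Gaussian matrix; the seed makes it reproducible."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix: List[List[float]], x: Sequence[float]) -> List[float]:
    """Map a flattened input x to its lower-dimensional projection W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


if __name__ == "__main__":
    in_dim, out_dim, num_discriminators = 16, 4, 3
    # One projection per discriminator, created once and never updated.
    projections = [make_projection(in_dim, out_dim, seed=k) for k in range(num_discriminators)]
    flattened_input = [i / in_dim for i in range(in_dim)]
    for k, w_k in enumerate(projections):
        print(k, [round(v, 3) for v in project(w_k, flattened_input)])
```

Each discriminator would then receive only its own projected view of a real or generated sample.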
3.2 Hypervolume maximization
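Let $S$ be the solutions for a multiobjective optimization problem. The hypervolume $\mathcal{H}$ of $S$ is defined as (Fleischer, 2003): $\mathcal{H}(S) = \mu\left(\cup_{\mathbf{x} \in S} [\mathbf{F}(\mathbf{x}), \boldsymbol{\eta}^*]\right)$, where $\mu$ is the Lebesgue measure and $\boldsymbol{\eta}^*$ is a point dominated by all $\mathbf{x} \in S$ (i.e. each $f_i(\mathbf{x})$ is upper-bounded by the corresponding coordinate of $\boldsymbol{\eta}^*$), referred to as the nadir point. $\mathcal{H}(S)$ can be understood as the size of the space covered by $\{\mathbf{F}(\mathbf{x}) \mid \mathbf{x} \in S\}$ (Bader & Zitzler, 2011).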
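As a worked illustration of this definition, the sketch below computes the hypervolume of a finite set of two-dimensional objective vectors with respect to a nadir point, i.e. the area of the union of the boxes spanned by each vector and the nadir point. It assumes minimization and exactly two objectives; the function name and the example values are hypothetical.

```python
from typing import Iterable, Tuple


def hypervolume_2d(points: Iterable[Tuple[float, float]], nadir: Tuple[float, float]) -> float:
    """Area of the union of boxes [f, nadir] over all objective vectors f (two objectives)."""
    # Only points strictly better than the nadir in both coordinates span a box.
    inside = sorted(p for p in points if p[0] < nadir[0] and p[1] < nadir[1])
    area = 0.0
    best_f2 = nadir[1]
    for f1, f2 in inside:  # increasing f1
        if f2 < best_f2:
            # New strip below everything already covered by points with smaller f1.
            area += (nadir[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


if __name__ == "__main__":
    front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
    print(hypervolume_2d(front, nadir=(4.0, 4.0)))         # three overlapping boxes
    print(hypervolume_2d([(2.0, 2.0)], nadir=(4.0, 4.0)))  # single solution: one box
```

For a single solution the hypervolume reduces to the product of the distances to the nadir point along each objective, which is the single-solution case used for the generator loss later in the paper.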
The hypervolume was originally introduced as a quantitative metric for coverage and convergence of Pareto-optimal fronts obtained through population-based algorithms (Beume et al., 2007). Methods based on direct maximization of $\mathcal{H}$ exhibit favorable convergence even in challenging scenarios, such as simultaneous minimization of 50 objectives (Bader & Zitzler, 2011). In the context of Machine Learning, single-solution hypervolume maximization has been applied to neural networks as a surrogate loss for mean squared error (Miranda & Zuben, 2016), i.e. the loss provided by each example in a training batch is treated as a single cost and the multiobjective approach aims to minimize costs over all examples. The authors show that such a method provides an inexpensive boosting-like training.
4 Multiobjective training of GANs with multiple discriminators
We introduce a variation of the GAN game in which the generator solves the following multiobjective problem:
$$\min \mathcal{L}_G(\mathbf{z}) = [l_1(\mathbf{z}), l_2(\mathbf{z}), \ldots, l_K(\mathbf{z})]^T, \tag{7}$$
where each $l_k = -\mathbb{E}_{\mathbf{z} \sim p_z} \log D_k(G(\mathbf{z}))$, $k \in \{1, \ldots, K\}$, is the loss provided by the $k$-th discriminator. Training proceeds in the usual fashion (Goodfellow et al., 2014), i.e. with alternate updates between the discriminators and the generator. Updates of each discriminator are performed to minimize the loss described in Eq. 5.
A natural choice for our generator’s updates is the MGD algorithm, described in Section 2. However, computing the direction of steepest descent before every parameter update step, as required in MGD, can be prohibitively expensive for large neural networks. Therefore, we propose an alternative scheme for multiobjective optimization and argue that both our proposal and previously published methods can all be viewed as performing a computationally more efficient version of the MGD update rule, without the burden of needing to solve a quadratic program, i.e. computing $\mathbf{w}^*$, at every iteration.
4.1 Hypervolume maximization for training GANs
Fleischer (2003) has shown that maximizing $\mathcal{H}$ yields Pareto-optimal solutions. Since MGD converges to a set of Pareto-stationary points, i.e. a superset of the Pareto-optimal solutions, hypervolume maximization yields a subset of the solutions obtained using MGD. We exploit this property and define the generator loss as the negative log-hypervolume, as defined in Eq. 8:
$$\mathcal{L}_G = -\mathcal{V} = -\sum_{k=1}^{K} \log(\eta - l_k), \tag{8}$$
where $l_k$ is the loss provided by the $k$-th discriminator and the nadir point coordinate $\eta$ is an upper bound for all $l_k$. In Fig. 1 we provide an illustrative example for the case where $K = 2$. The highlighted region corresponds to $e^{\mathcal{V}}$. Since the nadir point is fixed, $\mathcal{V}$ will be maximized, and consequently $\mathcal{L}_G$ minimized, if and only if each $l_k$ is minimized.
Moreover, by adapting the results shown in (Miranda & Zuben, 2016), the gradient of $\mathcal{L}_G$ with respect to any generator’s parameter $\theta$ is given by:
$$\frac{\partial \mathcal{L}_G}{\partial \theta} = \sum_{k=1}^{K} \frac{1}{\eta - l_k} \frac{\partial l_k}{\partial \theta}. \tag{9}$$
In other words, the gradient can be obtained by computing a weighted sum of the gradients of the losses provided by each discriminator, whose weights are defined as the inverse distance to the nadir point components. This formulation will naturally assign more importance to higher losses in the final gradient, which is another useful property of hypervolume maximization.
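The weighting can be checked numerically. The sketch below assumes the generator loss takes the form $-\sum_k \log(\eta - l_k)$ with a single shared nadir coordinate $\eta$, computes the weights $1/(\eta - l_k)$, and compares them with a central finite difference of the loss with respect to each $l_k$. The function names and the loss values are hypothetical.

```python
import math
from typing import List, Sequence


def neg_log_hypervolume(losses: Sequence[float], nadir: float) -> float:
    """Generator loss -sum_k log(nadir - l_k); nadir must exceed every loss."""
    if any(l >= nadir for l in losses):
        raise ValueError("nadir must be strictly greater than every loss")
    return -sum(math.log(nadir - l) for l in losses)


def gradient_weights(losses: Sequence[float], nadir: float) -> List[float]:
    """Derivative of the loss above with respect to each l_k: 1 / (nadir - l_k)."""
    return [1.0 / (nadir - l) for l in losses]


def finite_difference(losses: Sequence[float], nadir: float, k: int, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of the derivative with respect to l_k."""
    up, down = list(losses), list(losses)
    up[k] += eps
    down[k] -= eps
    return (neg_log_hypervolume(up, nadir) - neg_log_hypervolume(down, nadir)) / (2.0 * eps)


if __name__ == "__main__":
    losses = [0.4, 0.9, 1.6]  # hypothetical per-discriminator generator losses
    nadir = 2.0
    for k, weight in enumerate(gradient_weights(losses, nadir)):
        print(k, round(weight, 4), round(finite_difference(losses, nadir, k), 4))
```

The loss closest to the nadir point receives the largest weight, which is the sense in which higher losses dominate the final gradient.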
Nadir point selection. It is evident from Eq. 9 that the selection of $\eta$ directly affects the importance assignment of gradients provided by different discriminators. Particularly, as the quantity $\min_k \{\eta - l_k\}$ grows, the multiobjective GAN game approaches the one defined by the simple average of $l_k$. Previous literature has discussed in depth the effects of the selection of $\eta$ in the case of population-based methods (Auger et al., 2009, 2012). However, those results are not readily applicable for the single-solution case. As will be shown in Section 5, our experiments indicate that the choice of $\eta$ plays an important role in the final quality of samples. Nevertheless, this effect becomes less relevant as the number of discriminators increases.
Nadir point adaptation. Similarly to (Miranda & Zuben, 2016), we propose an adaptive scheme for $\eta$ such that at iteration $t$: $\eta^t = \delta \max_k \{l_k^t\}$, where $\delta > 1$ is a user-defined parameter which will be referred to as slack. This enforces $\min_k \{\eta - l_k\}$ to be higher when $\max_k \{l_k^t\}$ is high and low otherwise, which induces a behavior similar to an average loss when training begins and automatically places more importance on the discriminators in which performance is worse as training progresses.
We further illustrate the proposed adaptation scheme in Fig. 2. Consider a two-objective problem with $l_1^t$ and $l_2^t$ corresponding to $l_1$ and $l_2$ at iteration $t$, respectively. If no adaptation is performed and $\eta$ is left unchanged throughout training, as represented by the red dashed lines in Fig. 2, $\eta - l_1^t \approx \eta - l_2^t$ for a large enough $t$. This will assign similar weights to gradients provided by the different losses, which defeats the purpose of employing hypervolume maximization rather than average loss minimization. Assuming that losses decrease with time, after $T$ updates, $\eta^T = \delta \max\{l_1^T, l_2^T\} < \eta^0$, since losses are now closer to $0$. The employed adaptation scheme thus keeps the gradient weighting relevant even when losses become low. This effect will become more aggressive as training progresses, assigning more gradient importance to higher losses, as $\eta^T - \max\{l_1^T, l_2^T\} < \eta^0 - \max\{l_1^0, l_2^0\}$.
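The effect of the adaptation can be seen in a small numeric sketch. It assumes the update $\eta^t = \delta \max_k l_k^t$ with slack $\delta > 1$ and strictly positive losses, and compares gradient weights late in training under a nadir point fixed at its initial value against an adapted one. The weights are normalized to sum to one purely for readability; that normalization, the function names and all numbers are hypothetical.

```python
from typing import List, Sequence


def adapt_nadir(losses: Sequence[float], slack: float) -> float:
    """Adaptive nadir point: slack * max_k l_k, with slack > 1 and positive losses."""
    if slack <= 1.0:
        raise ValueError("slack must be greater than 1")
    if max(losses) <= 0.0:
        raise ValueError("this sketch assumes the largest loss is strictly positive")
    return slack * max(losses)


def normalized_weights(losses: Sequence[float], nadir: float) -> List[float]:
    """Weights proportional to 1 / (nadir - l_k), rescaled to sum to one."""
    raw = [1.0 / (nadir - l) for l in losses]
    total = sum(raw)
    return [r / total for r in raw]


if __name__ == "__main__":
    slack = 1.1
    early_losses = [2.0, 2.5, 3.0]   # hypothetical losses at the start of training
    late_losses = [0.2, 0.25, 0.3]   # hypothetical losses after many updates

    fixed_nadir = adapt_nadir(early_losses, slack)    # chosen once, never updated
    adapted_nadir = adapt_nadir(late_losses, slack)   # recomputed from current losses

    # Fixed nadir: all distances to the nadir are similar, so weights are almost uniform.
    print([round(w, 3) for w in normalized_weights(late_losses, fixed_nadir)])
    # Adapted nadir: the highest loss still receives most of the weight.
    print([round(w, 3) for w in normalized_weights(late_losses, adapted_nadir)])
```

Smaller values of the slack bring the nadir point closer to the largest loss and therefore concentrate more weight on it.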
Comparison to average loss minimization. The upper bound proven by Neyshabur et al. (2017) assumes that the marginals of the real and generated distributions are identical along all random projections. However, average loss minimization does not ensure equally good approximation between the marginals along all directions. In the case of competing discriminators, i.e. if decreasing the loss on a given projection increases the loss with respect to another one, the distribution of losses can be uneven. With HV, on the other hand, especially when $\eta$ is reduced throughout training, the overall loss will be kept high as long as there are discriminators with high loss. This objective tends to prefer central regions, in which all discriminators present a roughly equally low loss.
4.2 Relationship between multiple discriminator GANs and MGD
All methods described previously for the solution of GANs with multiple discriminators, i.e. average loss minimization (Neyshabur et al., 2017), GMAN’s weighted average (Durugkar et al., 2016) and hypervolume maximization, can be defined as MGD-like two-step algorithms consisting of: Step 1: consolidate all gradients into a single update direction (compute the set $\alpha_{1:K}$); Step 2: update parameters in the direction returned in Step 1. The definition of Step 1 for the different methods studied here can be summarized as follows:
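One way to read Step 1 is as a choice of coefficients used to combine the per-discriminator gradients into a single direction. The sketch below is an interpretation under stated assumptions rather than a transcription of the original summary: uniform coefficients for average loss minimization, softmax coefficients with a hypothetical temperature `beta` for GMAN, and coefficients proportional to $1/(\eta - l_k)$ for hypervolume maximization, normalized here so that each set sums to one like the MGD coefficients. All names and values are illustrative.

```python
import math
from typing import List, Sequence


def avg_coefficients(losses: Sequence[float]) -> List[float]:
    """Average loss minimization: every discriminator gets the same coefficient."""
    return [1.0 / len(losses)] * len(losses)


def gman_coefficients(losses: Sequence[float], beta: float = 1.0) -> List[float]:
    """Softmax over losses with temperature beta (beta = 0 recovers the average)."""
    shift = max(beta * l for l in losses)  # subtract the max for numerical stability
    exps = [math.exp(beta * l - shift) for l in losses]
    total = sum(exps)
    return [e / total for e in exps]


def hv_coefficients(losses: Sequence[float], nadir: float) -> List[float]:
    """Hypervolume maximization: proportional to 1 / (nadir - l_k), normalized."""
    raw = [1.0 / (nadir - l) for l in losses]
    total = sum(raw)
    return [r / total for r in raw]


def update_direction(grads: Sequence[Sequence[float]], coeffs: Sequence[float]) -> List[float]:
    """Step 1 output: the coefficient-weighted sum of per-discriminator gradients."""
    dim = len(grads[0])
    return [sum(c * g[i] for c, g in zip(coeffs, grads)) for i in range(dim)]


if __name__ == "__main__":
    grads = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical gradients of l_1 and l_2
    losses = [0.5, 1.5]
    print(update_direction(grads, avg_coefficients(losses)))
    print(update_direction(grads, gman_coefficients(losses, beta=1.0)))
    print(update_direction(grads, hv_coefficients(losses, nadir=2.0)))
```

Step 2 is then the same for every method: move the generator parameters against the returned direction. Only MGD requires an optimization problem to obtain its coefficients; the schemes above are closed-form functions of the current losses.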
5 Experiments
We performed four sets of experiments aiming to understand the following phenomena: (i) How alternative methods for training GANs with multiple discriminators perform in comparison to MGD; (ii) How alternative methods perform in comparison to each other in terms of sample quality and coverage; (iii) How the varying number of discriminators impacts performance given the studied methods; and (iv) Whether the multiple-discriminator setting is practical given the added cost involved in training a set of discriminators.
Firstly, we exploited the relatively low dimensionality of MNIST and used it as testbed for comparing MGD with the other approaches, i.e. average loss minimization (AVG), GMAN’s weighted average loss, and HV, proposed in this work. Moreover, multiple initializations and slack combinations were evaluated in order to investigate how varying the number of discriminators affects robustness to those factors.
Then, experiments were performed with an upscaled version of CIFAR10 at the resolution of 64×64 pixels while increasing the number of discriminators. Upscaling was performed with the aim of running experiments utilizing the same architecture described in (Neyshabur et al., 2017). We evaluated HV’s performance compared to baseline methods in terms of its resulting sample quality. Additional experiments were carried out with CIFAR10 at its original resolution in order to provide a clear comparison with well-known single-discriminator settings. We further analyzed HV’s impact on the diversity of generated samples using the stacked MNIST dataset (Srivastava et al., 2017). Finally, the computational cost and performance are compared for the single- vs. multiple-discriminator cases. Samples of generators trained on stacked MNIST are shown in the Appendix, along with samples from CelebA at a resolution as well as the Cats dataset at a resolution.
In all experiments performed, the same architecture, set of hyperparameters and initialization were used for AVG, GMAN and our proposed method, the only variation being the generator loss. Unless stated otherwise, Adam (Kingma & Ba, 2014) was used to train all the models with learning rate, and set to , and , respectively. Minibatch size was set to . The Fréchet Inception Distance (FID) (Heusel et al., 2017) was used for comparison. Details on FID computation can be found in Appendix A.
5.1 MGD compared with alternative methods
We employed MGD in our experiments with MNIST and, in order to do so, a quadratic program has to be solved prior to every parameter update. For this, we used SciPy’s implementation of the Sequential Least Squares Programming (SLSQP) solver (https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html). Three and four fully connected layers with LeakyReLU activations were used for the generator and discriminator, respectively. Dropout was also employed in the discriminator and the random projection layer was implemented as a randomly initialized fully connected layer, reducing the vectorized dimensionality of MNIST from to . The output layer of a pretrained LeNet (LeCun et al., 1998) was used for FID computation.
Experiments over epochs with discriminators are reported in Fig. 3 and Fig. 4. In Fig. 3, boxplots refer to independent computations of FID over images sampled from the generator which achieved the minimum FID at train time. FID results are measured at training time with over images and the best values are reported in Fig. 4 along with the necessary time to achieve it.
MGD outperforms all tested methods. However, its cost per iteration does not allow its use in more relevant datasets outside MNIST. Hypervolume maximization, on the other hand, performs closer to MGD than the considered baselines, while introducing no relevant extra cost.
In Fig. 5, we analyze convergence in the Pareto-stationarity sense, by plotting the norm of the update direction for each method, given by $\|\sum_{k=1}^{K} \alpha_k \nabla l_k\|_2$. All methods converged to similar norms, leading to the conclusion that different Pareto-stationary solutions will perform differently in terms of quality of samples. Best FID as a function of wall-clock time is shown in Fig. 13 in the Appendix.
HV sensitivity to initialization and choice of $\delta$. Analysis of the performance sensitivity to the choice of the slack parameter $\delta$ and initialization was performed under the following setting: models were trained for epochs on MNIST with hypervolume maximization using 8, 16, 24 discriminators. Three independent runs (different initializations) were executed with each $\delta$ and number of discriminators, totaling 36 final models. Fig. 6 reports the boxplots obtained for FID independent computations using images, for each of the models obtained under the setting described. Results clearly indicate that increasing the number of discriminators yields much smaller variation in the FID obtained by the final model.
5.2 HV as an alternative for MGD
5.2.1 Upscaled CIFAR10
We evaluate the performance of HV compared to baseline methods using the upscaled CIFAR10 dataset. FID was computed with a pretrained ResNet (He et al., 2016). ResNet was trained on the 10-class classification task of CIFAR10 up to approximately test accuracy. DCGAN (Radford et al., 2015) and WGAN-GP (Gulrajani et al., 2017) were included in the experiments for FID reference. The same architectures as in (Neyshabur et al., 2017) were employed for all multiple-discriminator settings. An increasing number of discriminators was used. Inception score (Salimans et al., 2016) as well as FID computed with other models are included in the Appendix, Table 7.
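| Model | FID-ResNet | FID (5k) | IS (5k) | FID (10k) | IS (10k) |
|---|---|---|---|---|---|
| SNGAN (Miyato et al., 2018) | | 25.5 | | | |
| WGAN-GP (Miyato et al., 2018) | | 40.2 | | | |
| DCGAN (Miyato et al., 2018) | | | | | |
| SNGAN (our implementation) | 1.55 | 27.93 | | 25.29 | |
| DCGAN + 24 Ds and HV | 1.21 | 27.74 | | 24.90 | |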
In Fig. 7, we report the boxplots of independent evaluations of FID on images for the best model obtained with each method across independent runs. Results once more indicate that HV outperforms other methods in terms of quality of the generated samples. Moreover, performance clearly improves as the number of discriminators grows. Fig. 8 shows the FID at train time, i.e. measured with generated images after each epoch, for the best models across runs. Models trained against more discriminators clearly converge to smaller values. We report the norm of the update direction for each method in Fig. 10(a) in the Appendix.
5.2.2 CIFAR10
We ran experiments with CIFAR10 in its original resolution aiming to contextualize our proposed approach with respect to previously introduced methods. We thus repeated experiments similar to those reported in (Miyato et al., 2018), Table 2, for the model referred to as standard CNN. The same architecture is employed and spectral normalization is removed from the discriminators, while a random projection input layer is added.
Results in terms of both FID and Inception score using their original implementations, evaluated on top of 5000 generated images as in (Miyato et al., 2018) as well as with 10000 images, are reported in Table 1 for our proposed approach and our implementation of (Miyato et al., 2018), along with the FID measured using a ResNet classifier trained in advance on the CIFAR10 dataset.
As can be seen, the addition of the multiple-discriminator setting along with hypervolume maximization yields a relevant shift in performance for the DCGAN-like generator, taking all evaluated metrics to within a competitive margin of recently proposed GANs, and also outperforms our own implementation of SNGAN (using the best performing setup for this architecture as reported by Miyato et al. (2018)).
5.3 Computational cost
In Table 2 we present a comparison of minimum FID (measured with a pretrained ResNet) obtained during training, along with computational cost in terms of time and space for different GANs, with both 1 and 24 discriminators. The computational cost of training GANs under a multiple-discriminator setting is higher by design, in terms of both FLOPS and memory, if compared with single-discriminator settings. However, the additional cost results in a corresponding improvement in performance. This effect was consistently observed using 3 different well-known approaches, namely DCGAN (Radford et al., 2015), Least-squares GAN (LSGAN) (Mao et al., 2017), and HingeGAN (Miyato et al., 2018). The architectures of all single-discriminator models follow that of DCGAN, described in (Radford et al., 2015). For the 24-discriminator models, we used the setting described in Section 5.2.1. All models were trained with minibatches of size 64 over 150 epochs.
We further highlight that even though training with multiple discriminators may be more computationally expensive when compared to conventional approaches, such a framework supports fully parallel training of the discriminators, a feature which is not trivially possible in other GAN settings. For example, in WGAN the discriminator is serially updated multiple times for each generator update. In Fig. 10(b) in the Appendix, we provide a comparison of wall-clock time per iteration across all methods evaluated. Serial implementations of discriminator updates with 8 and 16 discriminators were observed to run faster than WGAN-GP. Moreover, all experiments performed within this work were executed on single-GPU hardware, which indicates that the multiple-discriminator setting is a practical approach.
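| Model | # Disc. | FID-ResNet | FLOPS | Memory |
|---|---|---|---|---|
| DCGAN | 1 | 4.22 | 8e10 | 1292 |
| DCGAN | 24 | 1.89 | 5e11 | 5671 |
| LSGAN | 1 | 4.55 | 8e10 | 1303 |
| LSGAN | 24 | 1.91 | 5e11 | 5682 |
| HingeGAN | 1 | 6.17 | 8e10 | 1303 |
| HingeGAN | 24 | 2.25 | 5e11 | 5682 |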
5.4 Effect of the number of discriminators on sample diversity
We repeat the experiments in (Srivastava et al., 2017) aiming to analyze how the number of discriminators affects the sample diversity of the corresponding generator when trained using hypervolume maximization. The stacked MNIST dataset is employed and results reported in (Lin et al., 2017) are used for comparison. HV results for 8, 16, and 24 discriminators were obtained with 10k and 26k generator images, averaged over 10 runs. The number of covered modes along with the KL divergence between the generated mode distribution and test data are reported in Table 3.
| Model | Modes (Max 1000) | KL |
|---|---|---|
| DCGAN (Radford et al., 2015) | | |
| ALI (Dumoulin et al., 2016) | | |
| Unrolled GAN (Metz et al., 2016) | | |
| VEEGAN (Srivastava et al., 2017) | | |
| PacDCGAN2 (Lin et al., 2017) | | |
| HV, 8 disc. (10k) | | |
| HV, 16 disc. (10k) | | |
| HV, 24 disc. (10k) | | |
| HV, 8 disc. (26k) | | |
| HV, 16 disc. (26k) | | |
| HV, 24 disc. (26k) | | |
As in previous experiments, results consistently improved as we increased the number of discriminators. All evaluated models using HV outperformed DCGAN, ALI, Unrolled GAN and VEEGAN. Moreover, HV with 16 and 24 discriminators achieved state-of-the-art coverage values. Thus, increasing each model’s capacity by using more discriminators directly resulted in an improvement in the corresponding generator coverage. Training details as well as architecture information are presented in the Appendix.
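The two diversity measures can be sketched as follows, assuming each generated image has already been assigned a mode label by a pretrained classifier (for stacked MNIST, an integer in 0..999). The function name, the direction of the KL divergence (generated relative to reference) and the toy labels are assumptions made for illustration and may differ from the exact evaluation protocol.

```python
import math
from collections import Counter
from typing import Hashable, Sequence, Tuple


def mode_coverage_and_kl(generated: Sequence[Hashable],
                         reference: Sequence[Hashable]) -> Tuple[int, float]:
    """Return (number of distinct generated modes, KL(generated || reference))."""
    gen_counts = Counter(generated)
    ref_counts = Counter(reference)
    kl = 0.0
    for mode, count in gen_counts.items():
        p = count / len(generated)
        q = ref_counts.get(mode, 0) / len(reference)
        if q == 0.0:
            # A generated mode absent from the reference makes the divergence infinite.
            return len(gen_counts), math.inf
        kl += p * math.log(p / q)
    return len(gen_counts), kl


if __name__ == "__main__":
    reference_labels = [0, 1, 2, 3]   # hypothetical: four equally likely modes
    collapsed_labels = [0, 0, 1, 1]   # generator covering only two of them
    print(mode_coverage_and_kl(collapsed_labels, reference_labels))
    print(mode_coverage_and_kl(reference_labels, reference_labels))
```

A generator that covers every mode with the reference frequencies reaches the maximum mode count and a KL divergence of zero, while dropped modes lower the count and raise the divergence.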
6 Conclusion
In this work we show that employing multiple discriminators in GAN training is a practical approach for directly trading extra capacity (and thereby extra computational cost) for higher quality and diversity of generated samples. Such an approach is complementary to other advances in GAN training and can be easily used together with other methods. We introduce a multiobjective optimization framework for studying multiple-discriminator GANs, and show strong similarities between previous work and the multiple gradient descent algorithm. The proposed approach was observed to consistently yield higher quality samples in terms of FID, and increasing the number of discriminators was shown to increase sample diversity and generator robustness.
Deeper analysis of the quantity is a subject of future investigation. We hypothesize that using it as a penalty term might reduce the necessity of a high number of discriminators.
References
 Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

 Auger et al. (2009) Auger, A., Bader, J., Brockhoff, D., and Zitzler, E. Theory of the hypervolume indicator: optimal distributions and the choice of the reference point. In Proceedings of the tenth ACM SIGEVO workshop on Foundations of genetic algorithms, pp. 87–102. ACM, 2009.
 Auger et al. (2012) Auger, A., Bader, J., Brockhoff, D., and Zitzler, E. Hypervolume-based multiobjective optimization: Theoretical foundations and practical implications. Theoretical Computer Science, 425:75–103, 2012.
 Bader & Zitzler (2011) Bader, J. and Zitzler, E. HypE: An algorithm for fast hypervolume-based many-objective optimization. Evolutionary computation, 19(1):45–76, 2011.
 Berthelot et al. (2017) Berthelot, D., Schumm, T., and Metz, L. BEGAN: boundary equilibrium generative adversarial networks. CoRR, abs/1703.10717, 2017. URL http://arxiv.org/abs/1703.10717.
 Beume et al. (2007) Beume, N., Naujoks, B., and Emmerich, M. SMS-EMOA: Multiobjective selection based on dominated hypervolume. European Journal of Operational Research, 181(3):1653–1669, 2007.

 Deb (2001) Deb, K. Multi-objective optimization using evolutionary algorithms, volume 16. John Wiley & Sons, 2001.
 Désidéri (2012) Désidéri, J.-A. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique, 350(5-6):313–318, 2012.
 Doan et al. (2018) Doan, T., Monteiro, J., Albuquerque, I., Mazoure, B., Durand, A., Pineau, J., and Hjelm, R. D. Online adaptative curriculum learning for gans. CoRR, abs/1808.00020, 2018. URL http://arxiv.org/abs/1808.00020.
 Dumoulin et al. (2016) Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., and Courville, A. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
 Durugkar et al. (2016) Durugkar, I., Gemp, I., and Mahadevan, S. Generative multi-adversarial networks. arXiv preprint arXiv:1611.01673, 2016.
 Fleischer (2003) Fleischer, M. The measure of Pareto optima applications to multi-objective metaheuristics. In International Conference on Evolutionary Multi-Criterion Optimization, pp. 519–533. Springer, 2003.
 Fréchet (1957) Fréchet, M. Sur la distance de deux lois de probabilité. Comptes Rendus Hebdomadaires des Séances de l’Académie des Sciences, 244(6):689–692, 1957.
 Goodfellow (2016) Goodfellow, I. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
 Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
 Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5769–5779, 2017.

 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6629–6640, 2017.
 Jolicoeur-Martineau (2018) Jolicoeur-Martineau, A. The relativistic discriminator: a key element missing from standard GAN. arXiv preprint arXiv:1807.00734, 2018.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lin et al. (2017) Lin, Z., Khetan, A., Fanti, G., and Oh, S. PacGAN: The power of two samples in generative adversarial networks. arXiv preprint arXiv:1712.04086, 2017.
 Mao et al. (2017) Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Smolley, S. P. Least squares generative adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2813–2821. IEEE, 2017.
 Metz et al. (2016) Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
 Miranda & Zuben (2016) Miranda, C. S. and Zuben, F. J. V. Single-solution hypervolume maximization and its use for improving generalization of neural networks. CoRR, abs/1602.01164, 2016. URL http://arxiv.org/abs/1602.01164.
 Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
 Neyshabur et al. (2017) Neyshabur, B., Bhojanapalli, S., and Chakrabarti, A. Stabilizing GAN training with multiple random projections. arXiv preprint arXiv:1705.07831, 2017.
 Peitz & Dellnitz (2018) Peitz, S. and Dellnitz, M. Gradient-based multiobjective optimization with uncertainties. In NEO 2016, pp. 159–182. Springer, 2018.
 Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.
 Schäffler et al. (2002) Schäffler, S., Schultz, R., and Weinzierl, K. Stochastic method for the solution of unconstrained vector optimization problems. Journal of Optimization Theory and Applications, 114(1):209–222, 2002.
 Srivastava et al. (2017) Srivastava, A., Valkov, L., Russell, C., Gutmann, M. U., and Sutton, C. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In Advances in Neural Information Processing Systems, pp. 3310–3320, 2017.
 Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
Appendix
A  Objective evaluation metric.
In (Heusel et al., 2017), the authors proposed to use as a quality metric the squared Fréchet distance (Fréchet, 1957) between Gaussians defined by estimates of the first and second order moments of the outputs obtained through a forward pass of both real and generated data in a pretrained classifier. They proposed the use of Inception V3 (Szegedy et al., 2016) for computation of the data representation and called the metric Fréchet Inception Distance (FID), which is defined as:
$$\mathrm{FID} = \lVert m_d - m_g \rVert^2 + \mathrm{Tr}\left(\Sigma_d + \Sigma_g - 2\,(\Sigma_d \Sigma_g)^{1/2}\right), \tag{10}$$
where $m_d$, $\Sigma_d$ and $m_g$, $\Sigma_g$ are estimates of the first and second order moments from the representations of real data distributions and generated data, respectively.
We employ FID throughout our experiments for comparison of different approaches. However, for every dataset other than CIFAR-10 at its original resolution, the output layer of a classifier pretrained on that particular dataset was used instead of Inception. The real-data moments $m_d$ and $\Sigma_d$ were estimated on the complete test partitions, which are not used during training.
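As an illustration of the squared Fréchet distance between two Gaussians, the following is a minimal NumPy sketch rather than the implementation used in the experiments. The function names, the random placeholder features, and the eigenvalue-based evaluation of the trace of the matrix square root are assumptions made for this example. In practice the features would be the classifier outputs for real and generated images.

```python
import numpy as np


def moments(features):
    """Return the mean vector and covariance matrix of an (N, d) feature array."""
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    return mean, cov


def frechet_distance(m_d, cov_d, m_g, cov_g):
    """Squared Frechet distance between N(m_d, cov_d) and N(m_g, cov_g)."""
    mean_term = np.sum((m_d - m_g) ** 2)
    # Tr((cov_d cov_g)^{1/2}) is the sum of the square roots of the
    # eigenvalues of cov_d @ cov_g; for symmetric positive semi-definite
    # inputs these are real and non-negative up to round-off.
    eigvals = np.linalg.eigvals(cov_d @ cov_g)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(mean_term + np.trace(cov_d) + np.trace(cov_g) - 2.0 * trace_sqrt)


# Placeholder features standing in for classifier outputs (assumption).
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(2000, 8))
fake_feats = rng.normal(loc=0.5, size=(2000, 8))

fid_same = frechet_distance(*moments(real_feats), *moments(real_feats))
fid_diff = frechet_distance(*moments(real_feats), *moments(fake_feats))
```

Comparing a feature set with itself gives a distance of zero up to round-off, while shifting the mean of the second set increases the distance.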
B  Experimental setup for stacked MNIST experiments and generator’s samples
Architectures of the generator and discriminator are detailed in Tables 4 and 5, respectively. Batch normalization was used in all intermediate convolutional and fully connected layers of both models. We employed RMSprop to train all the models with learning rate and set to and , respectively. Minibatch size was set to . The setup in (Lin et al., 2017) is employed and we build 128000 and 26000 samples for train and test sets, respectively.

Table 4: Generator architecture.

| Layer | Outputs | Kernel size | Stride | Activation |
|---|---|---|---|---|
| Input | | | | |
| Fully connected | 2*2*512 | 4, 4 | 2, 2 | ReLU |
| Transposed convolution | 4*4*256 | 4, 4 | 2, 2 | ReLU |
| Transposed convolution | 8*8*128 | 4, 4 | 2, 2 | ReLU |
| Transposed convolution | 14*14*64 | 4, 4 | 2, 2 | ReLU |
| Transposed convolution | 28*28*3 | 4, 4 | 2, 2 | Tanh |
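The spatial sizes listed for the generator can be checked with the standard output-size formula for transposed convolutions. The sketch below is only an illustration. The padding values are assumptions, since only kernel size and stride are listed. They were chosen so that the formula reproduces the listed sizes 4, 8, 14, and 28 starting from the 2*2 output of the fully connected layer.

```python
def tconv_out(size, kernel=4, stride=2, padding=1):
    """Output size of a transposed convolution (no output padding, no dilation)."""
    return (size - 1) * stride - 2 * padding + kernel


# Assumed paddings, one per transposed convolution; not given in the source.
assumed_paddings = [1, 1, 2, 1]

sizes = [2]  # spatial size after the fully connected layer (2*2*512)
for p in assumed_paddings:
    sizes.append(tconv_out(sizes[-1], padding=p))
```

Under these assumptions, only the step from 8 to 14 needs a padding other than 1.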
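Table 5: Discriminator architecture.

| Layer | Outputs | Kernel size | Stride | Activation |
|---|---|---|---|---|
| Input | 28*28*3 | | | |
| Projection | 14*14*3 | 8, 8 | 2, 2 | |
| Convolution | 7*7*64 | 4, 4 | 2, 2 | LeakyReLU |
| Convolution | 5*5*128 | 4, 4 | 2, 2 | LeakyReLU |
| Convolution | 2*2*256 | 4, 4 | 2, 2 | LeakyReLU |
| Convolution | 1 | 4, 4 | 2, 2 | Sigmoid |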
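For readers unfamiliar with stacked MNIST, the sketch below illustrates how such a dataset is commonly built: three randomly drawn digits are stacked along the channel axis, so each 28*28*3 image belongs to one of 1000 possible modes. This is a hedged illustration and not the code used in the experiments. The function name, the random placeholder arrays standing in for MNIST, and the small sample count are assumptions made so that the example runs on its own.

```python
import numpy as np


def stack_digits(images, labels, n_samples, rng):
    """Stack three randomly drawn grayscale digits into one 3-channel image.

    images: (N, 28, 28) array, labels: (N,) integer array.
    Returns (n_samples, 28, 28, 3) images and mode ids in [0, 999].
    """
    idx = rng.integers(0, len(images), size=(n_samples, 3))
    stacked = np.stack([images[idx[:, c]] for c in range(3)], axis=-1)
    # Each triple of digit classes identifies one of the 10**3 = 1000 modes.
    modes = 100 * labels[idx[:, 0]] + 10 * labels[idx[:, 1]] + labels[idx[:, 2]]
    return stacked, modes


# Random placeholders standing in for MNIST images and labels (assumption).
rng = np.random.default_rng(0)
fake_images = rng.random((500, 28, 28)).astype(np.float32)
fake_labels = rng.integers(0, 10, size=500)

x, modes = stack_digits(fake_images, fake_labels, n_samples=1000, rng=rng)
```

With real data, `n_samples` would be set to 128000 for the training set and 26000 for the test set, drawing from the corresponding MNIST partitions.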
C  Extra results on upscaled CIFAR-10
C.1  Multiple discriminators across different initializations and other scores
Table 6 presents the best FID (computed with a pretrained ResNet) achieved by each approach at train time, along with the epoch in which it was achieved, for each of 3 independent runs. Train-time FIDs are computed using 1000 generated images.
Table 6: Best train-time FID (epoch) for each of 3 independent runs.

| #D | Method | Best FID (epoch) |
|---|---|---|
| 1 | DCGAN | 7.09 (68), 9.09 (21), 4.22 (101) |
| 1 | WGAN-GP | 5.09 (117), 5.69 (101), 7.13 (71) |
| 8 | AVG | 3.35 (105), 4.64 (141), 3.00 (76) |
| 8 | GMAN | 4.28 (123), 4.24 (129), 3.80 (133) |
| 8 | HV | 3.87 (102), 4.54 (82), 3.20 (98) |
| 16 | AVG | 3.16 (96), 2.50 (91), 2.77 (116) |
| 16 | GMAN | 2.69 (129), 2.36 (144), 2.48 (120) |
| 16 | HV | 2.56 (85), 2.70 (97), 2.68 (133) |
| 24 | AVG | 2.10 (94), 2.44 (132), 2.43 (129) |
| 24 | GMAN | 2.16 (120), 2.02 (98), 2.13 (130) |
| 24 | HV | 2.05 (83), 1.89 (97), 2.23 (130) |
In Fig. 10(a), we report the norm of the update direction of the best model obtained for each method. Interestingly, different methods present similar behavior in terms of convergence in the Pareto-stationarity sense, i.e. the norm upon convergence is lower for models trained against more discriminators, regardless of the employed method.
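To make the reported quantity concrete, the sketch below computes the norm of an update direction formed as a weighted sum of per-discriminator generator gradients. It is an illustration under stated assumptions, not the code behind Fig. 10(a). The toy gradients, the losses, the bound `eta`, and the hypervolume-style weights 1 / (eta - l_k) are all placeholders chosen for this example. A zero norm for non-negative weights that are not all zero means that a convex combination of the gradients vanishes, which is the Pareto-stationarity condition.

```python
import numpy as np


def update_direction_norm(grads, weights):
    """L2 norm of the weighted sum of per-discriminator gradients.

    grads: (K, P) array, one flattened generator gradient per discriminator.
    weights: (K,) array of non-negative coefficients.
    """
    direction = weights @ grads
    return float(np.linalg.norm(direction))


# Toy gradients for K = 3 discriminators and P = 2 parameters (placeholders).
grads = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 0.0]])
losses = np.array([0.7, 0.7, 0.7])

# AVG: uniform weights, i.e. the gradient of the average loss.
avg_weights = np.full(len(grads), 1.0 / len(grads))

# HV-style weights 1 / (eta - l_k); eta is a hypothetical upper bound on the losses.
eta = 1.0
hv_weights = 1.0 / (eta - losses)

avg_norm = update_direction_norm(grads, avg_weights)
hv_norm = update_direction_norm(grads, hv_weights)
```

In this toy case the gradients cancel, so both weightings give a vanishing update direction.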
We computed extra scores using 10000 images generated by the best model reported in Table 6, i.e. the same models utilized to generate the results shown in Fig. 7. Both Inception score and FID were computed with their original implementations, while FID-VGG and FID-ResNet were computed using a VGG and a ResNet we pretrained. Results are reported relative to DCGAN’s scores to avoid direct comparison with results reported elsewhere for CIFAR-10 at its usual resolution (32x32).
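| Score | WGAN-GP | AVG-8 | AVG-16 | AVG-24 | GMAN-8 | GMAN-16 | GMAN-24 | HV-8 | HV-16 | HV-24 |
|---|---|---|---|---|---|---|---|---|---|---|
| Inception Score | 1.08 | 1.02 | 1.26 | 1.36 | 0.95 | 1.32 | 1.42 | 1.00 | 1.30 | 1.44 |
| FID | 0.80 | 0.98 | 0.76 | 0.73 | 0.92 | 0.79 | 0.65 | 0.89 | 0.77 | 0.72 |
| FID-VGG | 1.29 | 0.91 | 1.03 | 0.85 | 0.87 | 0.78 | 0.73 | 0.78 | 0.75 | 0.64 |
| FID-ResNet | 1.64 | 0.88 | 0.90 | 0.62 | 0.80 | 0.72 | 0.73 | 0.75 | 0.73 | 0.51 |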
D  CelebA dataset 128x128
In this experiment, we verify whether the proposed multiple-discriminator setting is capable of generating higher-resolution images. To that end, we employed the CelebA dataset at a resolution of 128x128. For both the generator and the discriminators, we used an architecture similar to the one described in the previous experiments. A convolutional layer with 2048 feature maps was added to both the generator and the discriminator architectures due to the increase in image size. The Adam optimizer was employed with the same set of hyperparameters as for CIFAR-10 and CelebA 64x64. We trained models with 6, 8, and 10 discriminators for 24 epochs. Samples from each generator are shown in Figure 11.



E  Generating 256x256 Cats
We show that the proposed multiple-discriminator setting scales to higher resolution even in the small-dataset regime by reproducing the experiments presented in (Jolicoeur-Martineau, 2018). We used the same architecture for the generator. For the discriminator, we removed batch normalization from all layers and used a stride of 1 at the last convolutional layer, after adding the initial projection step. The Cats dataset (https://www.kaggle.com/crawford/catdataset) was employed, and we followed the same preprocessing steps, which, in our case, yielded 1740 training samples with a resolution of 256x256. Our model is trained using 24 discriminators and the Adam optimizer with the same hyperparameters as in the previously described CIFAR-10 and CelebA experiments. In Figure 12 we show the generator’s samples after 288 training epochs. One epoch corresponds to updating over 27 minibatches of size 64.
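The epoch length quoted above follows from simple arithmetic on the dataset and minibatch sizes, sketched below. The variable names are illustrative, and the assumption that the last incomplete minibatch of each epoch is dropped is ours; it is merely consistent with the stated count of 27.

```python
n_train = 1740     # training samples after preprocessing (from the text)
batch_size = 64    # minibatch size (from the text)
n_epochs = 288     # number of training epochs (from the text)

# Assumption: the last incomplete minibatch of each epoch is dropped.
batches_per_epoch, leftover = divmod(n_train, batch_size)
total_iterations = n_epochs * batches_per_epoch
```

Here `batches_per_epoch` evaluates to 27 with 12 samples left over, and 288 epochs correspond to 7776 minibatch iterations.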