Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) offer a new approach to generative modeling, using game-theoretic training schemes to implicitly learn a given probability density. Prior to the emergence of GAN architectures, realistic generative modeling remained elusive. While offering unprecedented realism, GAN training remains fraught with stability issues. Commonly reported shortcomings include the lack of useful gradient signal provided by the discriminator, and mode collapse, i.e. lack of diversity in the generator’s samples.
Considerable research effort has been devoted in recent literature to overcoming training instability within the GAN framework, where instability is understood in the sense commonly used in the GAN literature, i.e. the discriminator is able to easily distinguish between real and fake samples during the training phase (Neyshabur et al., 2017; Arjovsky et al., 2017; Berthelot et al., 2017). Some architectures such as BEGAN (Berthelot et al., 2017) have applied auto-encoders as discriminators and proposed a new loss function to help stabilize training. Methods such as TTUR (Heusel et al., 2017), in turn, have attempted to define separate schedules for updating the generator and discriminator. The PacGAN algorithm (Lin et al., 2017) proposes to modify the discriminator’s architecture to accept m concatenated samples as input. These samples are jointly classified as either real or generated, and the authors show that such an approach can help enforce sample diversity. Furthermore, spectral normalization was introduced to the discriminator’s parameters in SNGAN (Miyato et al., 2018), aiming to ensure Lipschitz continuity, which is empirically shown to yield high quality samples across several sets of hyperparameters. Alternatively, recent works have proposed to tackle GAN instability issues with multiple discriminators. Neyshabur et al. (2017) propose a GAN variation in which one generator is trained against a set of discriminators, where each one sees a fixed random projection of the inputs. Prior work, including Durugkar et al. (2016) and Doan et al. (2018), has also explored training with multiple discriminators.
In this paper, we build upon the framework introduced by Neyshabur et al. (2017) and propose reformulating the average loss minimization to further stabilize GAN training. Specifically, we propose treating the loss signal provided by each discriminator as an independent objective function. To achieve this, we simultaneously minimize the losses using multi-objective optimization techniques. Namely, we exploit methods previously introduced in the literature, such as the multiple gradient descent (MGD) algorithm (Désidéri, 2012). However, due to MGD’s prohibitively high cost in the case of large neural networks, we propose to use more efficient alternatives such as maximization of the hypervolume in the region defined between a fixed, shared upper bound on the losses, which we will refer to as the nadir point, and each of the component losses.
In contrast to Neyshabur et al. (2017)’s approach, where the average loss is minimized when training the generator, hypervolume maximization (HV) optimizes a weighted loss, and the generator’s training will adaptively assign greater importance to feedback from discriminators against which it performs poorly.
Experiments performed on MNIST show that HV presents a good compromise in the computational cost vs. sample quality trade-off when compared to average loss minimization or GMAN’s approach (low quality and cost), and MGD (high quality and cost). The sensitivity to the introduced hyperparameters is also studied, and results indicate that increasing the number of discriminators increases the generator’s robustness along with sample quality and diversity. Experiments on CIFAR-10 indicate that the described method produces higher quality generator samples in terms of quantitative evaluation. Moreover, image quality and sample diversity are once more shown to consistently improve as we increase the number of discriminators.
In summary, our main contributions are the following:
- We offer a new perspective on multiple-discriminator GAN training by framing it in the context of multi-objective optimization, and draw similarities between previous research on GAN variations and MGD, commonly employed as a general solver for multi-objective optimization.
- We propose a new method for training multiple-discriminator GANs: hypervolume maximization, which weighs the gradient contributions of each discriminator by its loss.
The remainder of this document is organized as follows: Section 2 introduces definitions on multi-objective optimization and MGD. In Section 3 we describe prior relevant literature. Hypervolume maximization is detailed in Section 4, with experiments and results presented in Section 5. Conclusions and directions for future work are drawn in Section 6.
In this section we provide some definitions regarding multi-objective optimization from prior literature which will be useful in the following sections. Boldface notation is used to denote vector-valued variables.
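Multi-objective optimization. A multi-objective optimization problem is defined as (Deb, 2001):
$$\min \mathbf{F}(\mathbf{x}) = [f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_K(\mathbf{x})]^T, \quad \mathbf{x} \in \Omega, \tag{1}$$
where $K$ is the number of objectives, $\Omega$ is the variables space and $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T \in \Omega$ is a decision vector or possible solution to the problem. $\mathbf{F}: \Omega \rightarrow \mathbb{R}^K$ is a set of $K$ objective functions that maps the $n$-dimensional variables space to the $K$-dimensional objective space.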
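Pareto-dominance. Let $\mathbf{x}_1$ and $\mathbf{x}_2$ be two decision vectors. $\mathbf{x}_1$ is said to dominate $\mathbf{x}_2$ (denoted by $\mathbf{x}_1 \prec \mathbf{x}_2$) if and only if $f_i(\mathbf{x}_1) \leq f_i(\mathbf{x}_2)$ for all $i \in \{1, 2, \ldots, K\}$ and $f_j(\mathbf{x}_1) < f_j(\mathbf{x}_2)$ for some $j \in \{1, 2, \ldots, K\}$. If a decision vector $\mathbf{x}$ is dominated by no other vector in $\Omega$, $\mathbf{x}$ is called a non-dominated solution.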
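Pareto-optimality. A decision vector $\mathbf{x}^* \in \Omega$ is said to be Pareto-optimal if and only if there is no $\mathbf{x} \in \Omega$ such that $\mathbf{x} \prec \mathbf{x}^*$, i.e. $\mathbf{x}^*$ is a non-dominated solution. The Pareto-optimal Set (PS) is defined as the set of all Pareto-optimal solutions $\mathbf{x} \in \Omega$, i.e., $PS = \{\mathbf{x} \in \Omega \mid \mathbf{x} \text{ is Pareto-optimal}\}$. The set of all objective vectors $\mathbf{F}(\mathbf{x})$ such that $\mathbf{x}$ is Pareto-optimal is called the Pareto front (PF), that is, $PF = \{\mathbf{F}(\mathbf{x}) \in \mathbb{R}^K \mid \mathbf{x} \in PS\}$.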
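The dominance and non-dominance definitions above can be illustrated with a short sketch for a finite set of candidate objective vectors (minimization is assumed). The function names `dominates` and `non_dominated` and the example points are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def dominates(f_a: Sequence[float], f_b: Sequence[float]) -> bool:
    """True if objective vector f_a Pareto-dominates f_b (minimization)."""
    no_worse = all(a <= b for a, b in zip(f_a, f_b))
    strictly_better = any(a < b for a, b in zip(f_a, f_b))
    return no_worse and strictly_better


def non_dominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep the objective vectors that no other point in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Four hypothetical two-objective evaluations F(x) = (f_1(x), f_2(x)).
candidates = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = non_dominated(candidates)
print(front)
```

Here (3.0, 3.0) is dominated by (2.0, 2.0), while the remaining three points are mutually non-dominated and form the Pareto front of this finite set.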
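Pareto-stationarity. Pareto-stationarity is a necessary condition for Pareto-optimality. For $f_k$ differentiable everywhere for all $k$, $\mathbf{F}$ is Pareto-stationary at $\mathbf{x}$ if there exists a set of scalars $\alpha_k$, $k \in \{1, \ldots, K\}$, such that:
$$\sum_{k=1}^{K} \alpha_k \nabla f_k(\mathbf{x}) = \mathbf{0}, \quad \sum_{k=1}^{K} \alpha_k = 1, \quad \alpha_k \geq 0 \;\; \forall k. \tag{2}$$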
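Multiple Gradient Descent. Multiple gradient descent (Désidéri, 2012; Schäffler et al., 2002; Peitz & Dellnitz, 2018) was proposed for the unconstrained case of multi-objective optimization of $\mathbf{F}(\mathbf{x})$, assuming a convex, continuously differentiable and smooth $f_k(\mathbf{x})$ for all $k$. MGD finds a common descent direction for all $f_k$ by defining the convex hull of all $\nabla f_k(\mathbf{x})$ and finding the minimum norm element within it. Consider $\mathbf{w}^*$ given by:
$$\mathbf{w}^* = \operatorname*{argmin} \|\mathbf{w}\|, \quad \mathbf{w} = \sum_{k=1}^{K} \alpha_k \nabla f_k(\mathbf{x}), \quad \text{s.t.} \;\; \sum_{k=1}^{K} \alpha_k = 1, \;\; \alpha_k \geq 0 \;\; \forall k. \tag{3}$$
$\mathbf{w}^*$ will be either $\mathbf{0}$, in which case $\mathbf{x}$ is a Pareto-stationary point, or $\mathbf{w}^* \neq \mathbf{0}$, and then $-\mathbf{w}^*$ is a descent direction for all $f_k(\mathbf{x})$. Similar to gradient descent, MGD consists in finding the common steepest descent direction $\mathbf{w}^*_t$ at each iteration $t$, and then updating parameters with a learning rate $\lambda$ according to $\mathbf{x}_{t+1} = \mathbf{x}_t - \lambda \frac{\mathbf{w}^*_t}{\|\mathbf{w}^*_t\|}$.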
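For intuition, the minimum-norm element has a closed form when there are only two objectives, which avoids a general quadratic program solver. The sketch below covers that special case only and is not the general MGD solver. The helper names `min_norm_two` and `mgd_step` are hypothetical, and the update uses a normalized step, which is an assumption about the exact form of the rule.

```python
import math
from typing import List, Sequence


def min_norm_two(g1: Sequence[float], g2: Sequence[float]) -> List[float]:
    """Minimum-norm point of the segment {a*g1 + (1-a)*g2 : 0 <= a <= 1}."""
    diff = [x - y for x, y in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:  # identical gradients: the convex hull is a single point
        return list(g1)
    # Unconstrained minimizer of ||a*g1 + (1-a)*g2||^2, clipped to [0, 1].
    a = sum((y - x) * y for x, y in zip(g1, g2)) / denom
    a = min(1.0, max(0.0, a))
    return [a * x + (1.0 - a) * y for x, y in zip(g1, g2)]


def mgd_step(x: Sequence[float], g1: Sequence[float], g2: Sequence[float],
             lr: float) -> List[float]:
    """One normalized two-objective MGD update; no move at a stationary point."""
    w = min_norm_two(g1, g2)
    norm = math.sqrt(sum(c * c for c in w))
    if norm < 1e-12:  # w* = 0: x is Pareto-stationary
        return list(x)
    return [xi - lr * wi / norm for xi, wi in zip(x, w)]


# Orthogonal gradients give the midpoint; opposing gradients give w* = 0.
print(min_norm_two([1.0, 0.0], [0.0, 1.0]))
print(mgd_step([0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], lr=0.1))
```

With more than two objectives the same problem becomes a quadratic program over the simplex, which is the per-iteration cost discussed in the rest of the paper.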
3 Related work
3.1 Training GANs with multiple discriminators
While we would prefer to always have strong gradients from the discriminator during training, the vanilla GAN makes this difficult to ensure, as the discriminator quickly learns to distinguish real and generated samples (Goodfellow, 2016), thus providing no meaningful error signal to improve the generator thereafter. Durugkar et al. (2016) proposed the Generative Multi-Adversarial Networks (GMAN), which consist of training the generator against a softmax weighted arithmetic average of $K$ different discriminators:
$$\mathcal{L}_G = \sum_{k=1}^{K} \alpha_k \mathcal{L}_{D_k}, \tag{4}$$
where $\alpha_k = \frac{e^{\beta \mathcal{L}_{D_k}}}{\sum_{j=1}^{K} e^{\beta \mathcal{L}_{D_j}}}$, $\beta \geq 0$, and $\mathcal{L}_{D_k}$ is the loss of discriminator $k$ and is defined as
$$\mathcal{L}_{D_k} = -\mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}}} \log D_k(\mathbf{x}) - \mathbb{E}_{\mathbf{z} \sim p_z} \log(1 - D_k(G(\mathbf{z}))), \tag{5}$$
where $D_k(\mathbf{x})$ and $G(\mathbf{z})$ are the outputs of the $k$-th discriminator and the generator, respectively. The goal of using the proposed averaging scheme is to favor worse discriminators, thus providing more useful gradients to the generator during training. Experiments were performed with $\beta = 0$ (equal weights), $\beta \rightarrow \infty$ (only the worst discriminator is taken into account), $\beta = 1$, and $\beta$ learned by the generator. Models with $K \in \{2, 5\}$ were tested and evaluated using a proposed metric and the Inception score (Salimans et al., 2016). Results showed that the simple average of the discriminators’ losses provided the best values for both metrics in most of the considered cases.
Neyshabur et al. (2017) proposed training a GAN with $K$ discriminators using the same architecture. Each discriminator $D_k$ sees a different randomly projected lower-dimensional version of the input image. Random projections are defined by a randomly initialized matrix $W_k$, which remains fixed during training. Theoretical results provided show the distribution induced by the generator $p_G$ will converge to the real data distribution $p_{\mathrm{data}}$, as long as there is a sufficient number of discriminators. Moreover, discriminative tasks in the projected space are harder, i.e. real and fake examples are more alike, thus avoiding early convergence of discriminators, which leads to common stability issues in GAN training such as mode-collapse (Goodfellow, 2016). Essentially, the authors trade one hard problem for $K$ easier subproblems. The losses of each discriminator $\mathcal{L}_{D_k}$ are the same as shown in Eq. 5. However, the generator loss $\mathcal{L}_G$ is defined as the sum of the losses provided by each discriminator, as shown in Eq. 6. This choice of $\mathcal{L}_G$ does not exploit available information such as the performance of the generator with respect to each discriminator.
3.2 Hypervolume maximization
Let $S$ be the solutions for a multi-objective optimization problem. The hypervolume $\mathcal{H}$ of $S$ is defined as (Fleischer, 2003): $\mathcal{H}(S) = \mu\left(\bigcup_{\mathbf{x} \in S} [\mathbf{F}(\mathbf{x}), \eta^*]\right)$, where $\mu$ is the Lebesgue measure and $\eta^*$ is a point dominated by all $\mathbf{x} \in S$ (i.e. $f_i(\mathbf{x})$ is upper-bounded by $\eta$), referred to as the nadir point. $\mathcal{H}(S)$ can be understood as the size of the space covered by $\{\mathbf{F}(\mathbf{x}) \mid \mathbf{x} \in S\}$ (Bader & Zitzler, 2011).
The hypervolume was originally introduced as a quantitative metric for coverage and convergence of Pareto-optimal fronts obtained through population-based algorithms (Beume et al., 2007). Methods based on direct maximization of $\mathcal{H}$ exhibit favorable convergence even in challenging scenarios, such as simultaneous minimization of 50 objectives (Bader & Zitzler, 2011). In the context of Machine Learning, single-solution hypervolume maximization has been applied to neural networks as a surrogate loss for mean squared error (Miranda & Zuben, 2016), i.e. the loss provided by each example in a training batch is treated as a single cost and the multi-objective approach aims to minimize costs over all examples. The authors show that such a method provides an inexpensive boosting-like training.
4 Multi-objective training of GANs with multiple discriminators
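We introduce a variation of the GAN game in which the generator solves the following multi-objective problem:
$$\min \mathcal{L}_G(\mathbf{z}) = [l_1(\mathbf{z}), l_2(\mathbf{z}), \ldots, l_K(\mathbf{z})]^T, \tag{7}$$
where each $l_k = -\mathbb{E}_{\mathbf{z} \sim p_z} \log D_k(G(\mathbf{z}))$, $k \in \{1, \ldots, K\}$, is the loss provided by the $k$-th discriminator. Training proceeds in the usual fashion (Goodfellow et al., 2014), i.e. with alternate updates between the discriminators and the generator. Updates of each discriminator are performed to minimize the loss described in Eq. 5.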
A natural choice for our generator’s updates is the MGD algorithm, described in Section 2. However, computing the direction of steepest descent before every parameter update step, as required in MGD, can be prohibitively expensive for large neural networks. Therefore, we propose an alternative scheme for multi-objective optimization and argue that both our proposal and previously published methods can all be viewed as performing a computationally more efficient version of the MGD update rule, without the burden of needing to solve a quadratic program, i.e. computing $\mathbf{w}^*$, every iteration.
4.1 Hypervolume maximization for training GANs
Fleischer (2003) has shown that maximizing $\mathcal{H}$ yields Pareto-optimal solutions. Since MGD converges to a set of Pareto-stationary points, i.e. a superset of the Pareto-optimal solutions, hypervolume maximization yields a subset of the solutions obtained using MGD. We exploit this property and define the generator loss as the negative log-hypervolume, as defined in Eq. 8:
$$\mathcal{L}_G = -\mathcal{V} = -\sum_{k=1}^{K} \log(\eta - l_k), \tag{8}$$
where the nadir point coordinate $\eta$ is an upper bound for all $l_k$. In Fig. 1 we provide an illustrative example for the case where $K = 2$. The highlighted region corresponds to $e^{\mathcal{V}}$. Since the nadir point $\eta^*$ is fixed, $\mathcal{V}$ will be maximized, and consequently $\mathcal{L}_G$ minimized, if and only if each $l_k$ is minimized.
Moreover, by adapting the results shown in (Miranda & Zuben, 2016), the gradient of $\mathcal{L}_G$ with respect to any generator’s parameter $\theta$ is given by:
$$\frac{\partial \mathcal{L}_G}{\partial \theta} = \sum_{k=1}^{K} \frac{1}{\eta - l_k} \frac{\partial l_k}{\partial \theta}. \tag{9}$$
In other words, the gradient can be obtained by computing a weighted sum of the gradients of the losses provided by each discriminator, whose weights are defined as the inverse distance to the nadir point components. This formulation will naturally assign more importance to higher losses in the final gradient, which is another useful property of hypervolume maximization.
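The weighting described above can be checked numerically. The sketch below assumes the generator loss takes the negative log-hypervolume form $-\sum_k \log(\eta - l_k)$ with a single shared nadir coordinate, and compares the analytic per-loss weight $1/(\eta - l_k)$ against a central finite difference. The function names and loss values are hypothetical.

```python
import math
from typing import List, Sequence


def hv_loss(losses: Sequence[float], nadir: float) -> float:
    """Negative log-hypervolume: -sum_k log(nadir - l_k)."""
    if any(l >= nadir for l in losses):
        raise ValueError("the nadir coordinate must upper-bound every loss")
    return -sum(math.log(nadir - l) for l in losses)


def hv_weights(losses: Sequence[float], nadir: float) -> List[float]:
    """Analytic derivative of hv_loss with respect to each l_k."""
    return [1.0 / (nadir - l) for l in losses]


def finite_diff(losses: Sequence[float], nadir: float, k: int,
                eps: float = 1e-6) -> float:
    """Central finite-difference estimate of d hv_loss / d l_k."""
    up, down = list(losses), list(losses)
    up[k] += eps
    down[k] -= eps
    return (hv_loss(up, nadir) - hv_loss(down, nadir)) / (2.0 * eps)


losses = [0.5, 1.5]  # hypothetical per-discriminator generator losses
weights = hv_weights(losses, nadir=2.0)
print(weights)  # the larger loss receives the larger weight
```

By the chain rule, multiplying each weight by the gradient of the corresponding $l_k$ with respect to the generator parameters and summing gives the full gradient, so an automatic differentiation framework only needs the scalar loss.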
Nadir point selection. It is evident from Eq. 9 that the selection of $\eta$ directly affects the importance assignment of gradients provided by different discriminators. Particularly, as the quantity $\min_k \{\eta - l_k\}$ grows, the multi-objective GAN game approaches the one defined by the simple average of $l_k$. Previous literature has discussed in depth the effects of the selection of $\eta$ in the case of population-based methods (Auger et al., 2009, 2012). However, those results are not readily applicable to the single-solution case. As will be shown in Section 5, our experiments indicate that the choice of $\eta$ plays an important role in the final quality of samples. Nevertheless, this effect becomes less relevant as the number of discriminators increases.
Nadir point adaptation. Similarly to (Miranda & Zuben, 2016), we propose an adaptive scheme for $\eta$ such that at iteration $t$: $\eta^t = \delta \max_k \{l_k^t\}$, where $\delta > 1$ is a user-defined parameter which will be referred to as slack. This enforces $\min_k \{\eta - l_k\}$ to be higher when $\max_k \{l_k^t\}$ is high and low otherwise, which induces a behavior similar to an average loss when training begins and automatically places more importance on the discriminators against which performance is worse as training progresses.
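The effect of the slack on the gradient weighting can be illustrated with a small sketch. It assumes the adaptive rule sets the nadir coordinate to the slack times the current largest loss, with slack greater than 1 and positive losses. The weights are normalized to sum to one purely to make the comparison readable. Function names and loss values are hypothetical.

```python
from typing import List, Sequence


def adaptive_nadir(losses: Sequence[float], slack: float) -> float:
    """Nadir coordinate set to slack * max_k l_k (assumes positive losses)."""
    if slack <= 1.0:
        raise ValueError("slack must be greater than 1")
    worst = max(losses)
    if worst <= 0.0:
        raise ValueError("this sketch assumes the largest loss is positive")
    return slack * worst


def normalized_weights(losses: Sequence[float], slack: float) -> List[float]:
    """Inverse-distance-to-nadir weights, rescaled to sum to one."""
    eta = adaptive_nadir(losses, slack)
    raw = [1.0 / (eta - l) for l in losses]
    total = sum(raw)
    return [r / total for r in raw]


losses = [0.2, 0.4, 1.0]  # hypothetical losses; the third discriminator is "winning"
tight = normalized_weights(losses, slack=1.1)    # nadir close to the worst loss
loose = normalized_weights(losses, slack=100.0)  # nadir far from all losses
print(tight)
print(loose)
```

A slack close to 1 concentrates most of the weight on the discriminator with the highest loss, while a very large slack yields nearly uniform weights, recovering the behavior of the simple average.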
We further illustrate the proposed adaptation scheme in Fig. 2. Consider a two-objective problem with and corresponding to and at iteration , respectively. If no adaptation is performed and is left unchanged throughout training, as represented by the red dashed lines in Fig. 2, for a large enough . This will assign similar weights to gradients provided by the different losses, which defeats the purpose of employing hypervolume maximization rather than average loss minimization. Assuming that losses decrease with time, after updates, , since losses are now closer to . The employed adaptation scheme thus keeps the gradient weighting relevant even when losses become low. This effect will become more aggressive as training progresses, assigning more gradient importance to higher losses, as .
Comparison to average loss minimization. The upper bound proven by Neyshabur et al. (2017) assumes that the marginals of the real and generated distributions are identical along all random projections. However, average loss minimization does not ensure equally good approximation between the marginals along all directions. In the case of competing discriminators, i.e. if decreasing the loss on a given projection increases the loss with respect to another one, the distribution of losses can be uneven. With HV, on the other hand, especially when $\eta$ is reduced throughout training, the overall loss will be kept high as long as there are discriminators with high loss. This objective tends to prefer central regions, in which all discriminators present a roughly equally low loss.
4.2 Relationship between multiple discriminator GANs and MGD
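All methods described previously for the solution of GANs with multiple discriminators, i.e. average loss minimization (Neyshabur et al., 2017), GMAN’s weighted average (Durugkar et al., 2016) and hypervolume maximization, can be defined as MGD-like two-step algorithms consisting of: Step 1 - consolidate all gradients into a single update direction (compute the set $\alpha_{1:K}$); Step 2 - update parameters in the direction returned in Step 1. The definition of Step 1 for the different methods studied here can be summarized as follows:
- MGD: $\alpha_{1:K}$ is the solution of the minimum-norm quadratic program described in Section 2.
- Average loss minimization: $\alpha_k = 1/K$.
- GMAN: $\alpha_k$ is the softmax weight of the $k$-th loss.
- Hypervolume maximization: $\alpha_k$ is proportional to $1/(\eta - l_k)$.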
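The Step 1 weightings of the inexpensive methods discussed above can be compared side by side. The sketch below is an illustration under stated assumptions rather than the reference implementation: GMAN weights are taken as a softmax over the losses with temperature `beta`, hypervolume weights as normalized inverse distances to a shared nadir coordinate, and all function names and numeric values are hypothetical.

```python
import math
from typing import List, Sequence


def avg_weights(losses: Sequence[float]) -> List[float]:
    """Average loss minimization: equal weight for every discriminator."""
    return [1.0 / len(losses)] * len(losses)


def gman_weights(losses: Sequence[float], beta: float) -> List[float]:
    """Softmax over beta * loss (shifted by the max for numerical stability)."""
    top = max(losses)
    exps = [math.exp(beta * (l - top)) for l in losses]
    total = sum(exps)
    return [e / total for e in exps]


def hv_weights(losses: Sequence[float], nadir: float) -> List[float]:
    """Inverse distance to the nadir coordinate, normalized to sum to one."""
    raw = [1.0 / (nadir - l) for l in losses]
    total = sum(raw)
    return [r / total for r in raw]


def combine(grads: Sequence[Sequence[float]],
            weights: Sequence[float]) -> List[float]:
    """Step 1: weighted sum of per-discriminator gradient vectors."""
    dim = len(grads[0])
    return [sum(w * g[i] for w, g in zip(weights, grads)) for i in range(dim)]


losses = [1.0, 3.0]               # hypothetical generator losses
grads = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical gradients of each loss
for name, alpha in [("AVG", avg_weights(losses)),
                    ("GMAN", gman_weights(losses, beta=1.0)),
                    ("HV", hv_weights(losses, nadir=4.0))]:
    print(name, alpha, combine(grads, alpha))
```

Step 2 is then identical for every method: the generator parameters are updated along the combined direction, which is why these schemes add little cost over plain averaging.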
We performed four sets of experiments aiming to understand the following phenomena: (i) How alternative methods for training GANs with multiple discriminators perform in comparison to MGD; (ii) How alternative methods perform in comparison to each other in terms of sample quality and coverage; (iii) How the varying number of discriminators impacts performance given the studied methods; and (iv) Whether the multiple-discriminator setting is practical given the added cost involved in training a set of discriminators.
Firstly, we exploited the relatively low dimensionality of MNIST and used it as testbed for comparing MGD with the other approaches, i.e. average loss minimization (AVG), GMAN’s weighted average loss, and HV, proposed in this work. Moreover, multiple initializations and slack combinations were evaluated in order to investigate how varying the number of discriminators affects robustness to those factors.
Then, experiments were performed with an upscaled version of CIFAR-10 at the resolution of 64x64 pixels while increasing the number of discriminators. Upscaling was performed with the aim of running experiments utilizing the same architecture described in (Neyshabur et al., 2017). We evaluated HV’s performance compared to baseline methods in terms of its resulting sample quality. Additional experiments were carried out with CIFAR-10 at its original resolution in order to provide a clear comparison with well known single-discriminator settings. We further analyzed HV’s impact on the diversity of generated samples using the stacked MNIST dataset (Srivastava et al., 2017). Finally, the computational cost and performance are compared for the single- vs. multiple-discriminator cases. Samples of generators trained on stacked MNIST in the Appendix along with samples from CelebA at a resolution as well as the Cats dataset at a resolution.
In all experiments performed, the same architecture, set of hyperparameters and initialization were used for both AVG, GMAN and our proposed method, the only variation being the generator loss. Unless stated otherwise, Adam (Kingma & Ba, 2014) was used to train all the models with learning rate, and set to , and , respectively. Mini-batch size was set to . The Fréchet Inception Distance (FID) (Heusel et al., 2017) was used for comparison. Details on FID computation can be found in Appendix A.
5.1 MGD compared with alternative methods
We employed MGD in our experiments with MNIST and, in order to do so, a quadratic program has to be solved prior to every parameter update. For this, we used SciPy’s implementation of the Sequential Least Squares Programming (SLSQP) solver (https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html). Three and four fully connected layers with LeakyReLU activations were used for the generator and discriminator, respectively. Dropout was also employed in the discriminator and the random projection layer was implemented as a randomly initialized fully connected layer, reducing the vectorized dimensionality of MNIST from to . The output layer of a pretrained LeNet (LeCun et al., 1998) was used for FID computation.
Experiments over epochs with discriminators are reported in Fig. 3 and Fig. 4. In Fig. 3, box-plots refer to independent computations of FID over images sampled from the generator which achieved the minimum FID at train time. FID results are measured at training time with over images and the best values are reported in Fig. 4 along with the necessary time to achieve it.
MGD outperforms all tested methods. However, its cost per iteration does not allow its use in more relevant datasets outside MNIST. Hypervolume maximization, on the other hand, performs closer to MGD than the considered baselines, while introducing no relevant extra cost.
In Fig. 5, we analyze convergence in the Pareto-stationarity sense, by plotting the norm of the update direction for each method, given by . All methods converged to similar norms, leading to the conclusion that different Pareto-stationary solutions will perform differently in terms of quality of samples. Best FID as a function of wall-clock time is shown in Fig. 13 at the Appendix.
HV sensitivity to initialization and choice of . Analysis of the performance sensitivity with the choice of the slack parameter and initialization was performed under the following setting: models were trained for epochs on MNIST with hypervolume maximization using 8, 16, 24 discriminators. Three independent runs (different initializations) were executed with each and number of discriminators, totaling 36 final models. Fig. 6 reports the box-plots obtained for FID independent computations using images, for each of the models obtained under the setting described. Results clearly indicate that increasing the number of discriminators yields much smaller variation in the FID obtained by the final model.
5.2 HV as an alternative for MGD
5.2.1 Upscaled CIFAR-10
We evaluate the performance of HV compared to baseline methods using the upscaled CIFAR-10 dataset. FID was computed with a pretrained ResNet (He et al., 2016). ResNet was trained on the 10-class classification task of CIFAR-10 up to approximately test accuracy. DCGAN (Radford et al., 2015) and WGAN-GP (Gulrajani et al., 2017) were included in the experiments for FID reference. Same architectures as in (Neyshabur et al., 2017) were employed for all multi-discriminators settings. An increasing number of discriminators was used. Inception score (Salimans et al., 2016) as well as FID computed with other models are included in the Appendix-Table 7.
|Model||FID-ResNet||FID (5k)||IS (5k)||FID (10k)||IS (10k)|
|SNGAN (Miyato et al., 2018)||-||25.5|| ||-||-|
|WGAN-GP (Miyato et al., 2018)||-||40.2|| ||-||-|
|DCGAN (Miyato et al., 2018)||-||-|| ||-||-|
|SNGAN (our implementation)||1.55||27.93|| ||25.29|| |
|DCGAN + 24 Ds and HV||1.21||27.74|| ||24.90|| |
In Fig. 7, we report the box-plots of independent evaluations of FID on images for the best model obtained with each method across independent runs. Results once more indicate that HV outperforms other methods in terms of quality of the generated samples. Moreover, performance clearly improves as the number of discriminators grows. Fig. 8 shows the FID at train time, i.e. measured with generated images after each epoch, for the best models across runs. Models trained against more discriminators clearly converge to smaller values. We report the norm of the update direction for each method in Fig. 10-(a) in the Appendix.
We run experiments with CIFAR-10 in its original resolution aiming to contextualize our proposed approach with respect to previously introduced methods. We thus repeated similar experiments as reported in (Miyato et al., 2018)-Table 2, for the model referred to as standard CNN. The same architecture is employed and spectral normalization is removed from the discriminators, while a random projection input layer is added.
Results in terms of both FID and Inception score using their original implementations, evaluated on top of 5000 generated images as in (Miyato et al., 2018) as well as with 10000 images, are reported in Table 1 for our proposed approach and our implementation of (Miyato et al., 2018), along with the FID measured using a ResNet classifier trained in advance on the CIFAR-10 dataset.
As can be seen, the addition of the multiple discriminators setting along with hypervolume maximization yields a relevant shift in performance for the DCGAN-like generator, taking all evaluated metrics to within a competitive margin of recently proposed GANs, as well as outperforming our own implementation of SNGAN (using the best performing setup for this architecture as reported by Miyato et al. (2018)).
5.3 Computational cost
In Table 2 we present a comparison of minimum FID (measured with a pretrained ResNet) obtained during training, along with computation cost in terms of time and space for different GANs, with both 1 and 24 discriminators. The computational cost of training GANs under a multiple-discriminator setting is higher by design, in terms of both FLOPS and memory, if compared with single-discriminators settings. However, a corresponding improvement in performance is the result of the additional cost. This effect was consistently observed using 3 different well-known approaches, namely DCGAN (Radford et al., 2015), Least-square GAN (LSGAN) (Mao et al., 2017), and HingeGAN (Miyato et al., 2018). The architectures of all single discriminator models follow that of DCGAN, described in (Radford et al., 2015). For the 24 discriminators models, we used the setting described in Section 5.2.1. All models were trained with minibatch of size 64 over 150 epochs.
We further highlight that even though training with multiple discriminators may be more computationally expensive when compared to conventional approaches, such a framework supports fully parallel training of the discriminators, a feature which is not trivially possible in other GAN settings. For example in WGAN, the discriminator is serially updated multiple times for each generator update. In Fig. 10-(b) in the Appendix, we provide a comparison between wall-clock time per iteration between all methods evaluated. Serial implementations of discriminator updates with 8 and 16 discriminators were observed to run faster than WGAN-GP. Moreover, all experiments performed within this work were executed in single GPU hardware, which indicates the multiple discriminator setting is a practical approach.
5.4 Effect of the number of discriminators on sample diversity
We repeat the experiments in (Srivastava et al., 2017) aiming to analyze how the number of discriminators affects the sample diversity of the corresponding generator when trained using hypervolume maximization. The stacked MNIST dataset is employed and results reported in (Lin et al., 2017) are used for comparison. HV results for 8, 16, and 24 discriminators were obtained with 10k and 26k generator images, averaged over 10 runs. The number of covered modes along with the KL divergence between the generated mode distribution and test data are reported in Table 3.
|Model||Modes (Max 1000)||KL|
|DCGAN (Radford et al., 2015)|
|ALI (Dumoulin et al., 2016)|
|Unrolled GAN (Metz et al., 2016)|
|VEEGAN (Srivastava et al., 2017)|
|PacDCGAN2 (Lin et al., 2017)|
|HV - 8 disc. (10k)|
|HV - 16 disc. (10k)|
|HV - 24 disc. (10k)|
|HV - 8 disc. (26k)|
|HV - 16 disc. (26k)|
|HV - 24 disc. (26k)|
As in previous experiments, results consistently improved as we increased the number of discriminators. All evaluated models using HV outperformed DCGAN, ALI, Unrolled GAN and VEEGAN. Moreover, HV with 16 and 24 discriminators achieved state-of-the-art coverage values. Thus, increasing each model’s capacity by using more discriminators directly resulted in an improvement in the corresponding generator coverage. Training details as well as architecture information are presented in the Appendix.
In this work we show that employing multiple discriminators in GAN training is a practical approach for directly trading extra capacity - and thereby extra computational cost - for higher quality and diversity of generated samples. Such an approach is complementary to other advances in GAN training and can be easily used together with other methods. We introduce a multi-objective optimization framework for studying multiple discriminator GANs, and show strong similarities between previous work and the multiple gradient descent algorithm. The proposed approach was observed to consistently yield higher quality samples in terms of FID, and increasing the number of discriminators was shown to increase sample diversity and generator robustness.
Deeper analysis of the quantity is a subject of future investigation. We hypothesize that using it as a penalty term might reduce the necessity of a high number of discriminators.
- Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
- Auger et al. (2009) Auger, A., Bader, J., Brockhoff, D., and Zitzler, E. Theory of the hypervolume indicator: optimal μ-distributions and the choice of the reference point. In Proceedings of the tenth ACM SIGEVO workshop on Foundations of genetic algorithms, pp. 87–102. ACM, 2009.
- Auger et al. (2012) Auger, A., Bader, J., Brockhoff, D., and Zitzler, E. Hypervolume-based multiobjective optimization: Theoretical foundations and practical implications. Theoretical Computer Science, 425:75–103, 2012.
- Bader & Zitzler (2011) Bader, J. and Zitzler, E. HypE: An algorithm for fast hypervolume-based many-objective optimization. Evolutionary computation, 19(1):45–76, 2011.
- Berthelot et al. (2017) Berthelot, D., Schumm, T., and Metz, L. BEGAN: boundary equilibrium generative adversarial networks. CoRR, abs/1703.10717, 2017. URL http://arxiv.org/abs/1703.10717.
- Beume et al. (2007) Beume, N., Naujoks, B., and Emmerich, M. SMS-EMOA: Multiobjective selection based on dominated hypervolume. European Journal of Operational Research, 181(3):1653–1669, 2007.
- Deb (2001) Deb, K. Multi-objective optimization using evolutionary algorithms, volume 16. John Wiley & Sons, 2001.
- Désidéri (2012) Désidéri, J.-A. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique, 350(5-6):313–318, 2012.
- Doan et al. (2018) Doan, T., Monteiro, J., Albuquerque, I., Mazoure, B., Durand, A., Pineau, J., and Hjelm, R. D. Online adaptative curriculum learning for gans. CoRR, abs/1808.00020, 2018. URL http://arxiv.org/abs/1808.00020.
- Dumoulin et al. (2016) Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., and Courville, A. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
- Durugkar et al. (2016) Durugkar, I., Gemp, I., and Mahadevan, S. Generative multi-adversarial networks. arXiv preprint arXiv:1611.01673, 2016.
- Fleischer (2003) Fleischer, M. The measure of pareto optima applications to multi-objective metaheuristics. In International Conference on Evolutionary Multi-Criterion Optimization, pp. 519–533. Springer, 2003.
- Fréchet (1957) Fréchet, M. Sur la distance de deux lois de probabilité. COMPTES RENDUS HEBDOMADAIRES DES SEANCES DE L ACADEMIE DES SCIENCES, 244(6):689–692, 1957.
- Goodfellow (2016) Goodfellow, I. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
- Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
- Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5769–5779, 2017.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In
- Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6629–6640, 2017.
- Jolicoeur-Martineau (2018) Jolicoeur-Martineau, A. The relativistic discriminator: a key element missing from standard gan. arXiv preprint arXiv:1807.00734, 2018.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Lin et al. (2017) Lin, Z., Khetan, A., Fanti, G., and Oh, S. PacGAN: The power of two samples in generative adversarial networks. arXiv preprint arXiv:1712.04086, 2017.
- Mao et al. (2017) Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Smolley, S. P. Least squares generative adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2813–2821. IEEE, 2017.
- Metz et al. (2016) Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
- Miranda & Zuben (2016) Miranda, C. S. and Zuben, F. J. V. Single-solution hypervolume maximization and its use for improving generalization of neural networks. CoRR, abs/1602.01164, 2016. URL http://arxiv.org/abs/1602.01164.
- Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
- Neyshabur et al. (2017) Neyshabur, B., Bhojanapalli, S., and Chakrabarti, A. Stabilizing GAN training with multiple random projections. arXiv preprint arXiv:1705.07831, 2017.
- Peitz & Dellnitz (2018) Peitz, S. and Dellnitz, M. Gradient-based multiobjective optimization with uncertainties. In NEO 2016, pp. 159–182. Springer, 2018.
- Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.
- Schäffler et al. (2002) Schäffler, S., Schultz, R., and Weinzierl, K. Stochastic method for the solution of unconstrained vector optimization problems. Journal of Optimization Theory and Applications, 114(1):209–222, 2002.
- Srivastava et al. (2017) Srivastava, A., Valkov, L., Russell, C., Gutmann, M. U., and Sutton, C. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In Advances in Neural Information Processing Systems, pp. 3310–3320, 2017.
- Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
A - Objective evaluation metric.
Heusel et al. (2017) proposed to use as a quality metric the squared Fréchet distance (Fréchet, 1957) between Gaussians defined by estimates of the first- and second-order moments of the outputs obtained through a forward pass of both real and generated data in a pretrained classifier. They proposed the use of Inception V3 (Szegedy et al., 2016) for computation of the data representation and called the metric Fréchet Inception Distance (FID), which is defined as:

$$F^2 = \lVert m_d - m_g \rVert^2 + \mathrm{Tr}\left(\Sigma_d + \Sigma_g - 2\,(\Sigma_d \Sigma_g)^{1/2}\right),$$

where $m_d, \Sigma_d$ and $m_g, \Sigma_g$ are estimates of the first- and second-order moments from the representations of real data and generated data, respectively.
We employ FID throughout our experiments to compare different approaches. However, for every dataset other than CIFAR-10 at its original resolution, the output layer of a classifier pretrained on that particular dataset was used instead of Inception. The moments $m_d$ and $\Sigma_d$ were estimated on the complete test partitions, which are not used during training.
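As an illustration of how the squared Fréchet distance between two Gaussians can be evaluated on classifier representations, the following is a minimal NumPy sketch. It is not the original FID implementation: the function name `frechet_distance` and its argument names are hypothetical, and the trace of the matrix square root is obtained from the eigenvalues of the covariance product rather than from an explicit matrix square root.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    Each argument is an (n_samples, n_features) array of classifier outputs.
    The function and argument names are illustrative assumptions.
    """
    # First-order moments (means) of each set of representations.
    m_d = feats_real.mean(axis=0)
    m_g = feats_gen.mean(axis=0)

    # Second-order moments (covariances) of each set of representations.
    sigma_d = np.atleast_2d(np.cov(feats_real, rowvar=False))
    sigma_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Tr((sigma_d sigma_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma_d @ sigma_g. For positive semi-definite inputs
    # these are real and non-negative up to numerical noise, clipped here.
    eigvals = np.linalg.eigvals(sigma_d @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = m_d - m_g
    return float(diff @ diff + np.trace(sigma_d) + np.trace(sigma_g) - 2.0 * tr_sqrt)
```

With identical inputs the distance is zero up to floating-point error, and shifting every generated representation by a constant vector changes only the mean term, so the result equals the squared norm of the shift.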
B - Experimental setup for stacked MNIST experiments and generator’s samples
Generator architecture:
|Layer||Output size||Kernel size||Stride||Activation|
|Fully connected||2*2*512||4, 4||2, 2||ReLU|
|Transposed convolution||4*4*256||4, 4||2, 2||ReLU|
|Transposed convolution||8*8*128||4, 4||2, 2||ReLU|
|Transposed convolution||14*14*64||4, 4||2, 2||ReLU|
|Transposed convolution||28*28*3||4, 4||2, 2||Tanh|
Discriminator architecture:
|Layer||Output size||Kernel size||Stride||Activation|
|Projection||14*14*3||8, 8||2, 2|
|Convolution||5*5*128||4, 4||2, 2||LeakyReLU|
|Convolution||2*2*256||4, 4||2, 2||LeakyReLU|
|Convolution||1||4, 4||2, 2||Sigmoid|
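The spatial sizes listed for the transposed-convolution layers (2, 4, 8, 14, 28) can be checked against the standard output-size formula for a transposed convolution with unit dilation and no output padding. The sketch below does so in plain Python. The padding values are assumptions chosen to reproduce the listed sizes, since padding is not stated in the tables, and the helper name `transposed_conv_out` is hypothetical.

```python
def transposed_conv_out(size_in, kernel, stride, padding):
    """Output size of a transposed convolution (dilation 1, no output padding)."""
    return (size_in - 1) * stride - 2 * padding + kernel


# Paddings are assumptions (not given in the source), picked so that a
# 4x4 kernel with stride 2 reproduces the listed sizes 2 -> 4 -> 8 -> 14 -> 28.
assumed_paddings = [1, 1, 2, 1]

sizes = [2]  # spatial size after the fully connected layer (2*2*512)
for padding in assumed_paddings:
    sizes.append(transposed_conv_out(sizes[-1], kernel=4, stride=2, padding=padding))

print(sizes)  # [2, 4, 8, 14, 28]
```

Under these assumptions only the 8-to-14 step needs a padding of 2; the remaining layers double the spatial size with a padding of 1.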
C - Extra results on upscaled CIFAR-10
C.1 - Multiple discriminators across different initializations and other scores
Table 6 presents the best FID (computed with a pretrained ResNet) achieved by each approach at train time, along with the epoch in which it was achieved, for each of 3 independent runs. Train time FIDs are computed using 1000 generated images.
|#D||Method||Best FID (epoch)|
|1||DCGAN||7.09 (68), 9.09 (21), 4.22 (101)|
|1||WGAN-GP||5.09 (117), 5.69 (101), 7.13 (71)|
|8||AVG||3.35 (105), 4.64 (141), 3.00 (76)|
|8||GMAN||4.28 (123), 4.24 (129), 3.80 (133)|
|8||HV||3.87 (102), 4.54 (82), 3.20 (98)|
|16||AVG||3.16 (96), 2.50 (91), 2.77 (116)|
|16||GMAN||2.69 (129), 2.36 (144), 2.48 (120)|
|16||HV||2.56 (85), 2.70 (97), 2.68 (133)|
|24||AVG||2.10 (94), 2.44 (132), 2.43 (129)|
|24||GMAN||2.16 (120), 2.02 (98), 2.13 (130)|
|24||HV||2.05 (83), 1.89 (97), 2.23 (130)|
In Fig. 10-(a), we report the norm of the update direction of the best model obtained for each method. Interestingly, different methods present similar behavior in terms of convergence in the Pareto-stationarity sense, i.e. the norm upon convergence is lower for models trained against more discriminators, regardless of the employed method.
We computed extra scores using 10000 images generated by the best model reported in Table 6, i.e. the same models used to generate the results shown in Fig. 7. Both Inception score and FID were computed with their original implementations, while FID-VGG and FID-ResNet were computed using a VGG and a ResNet that we pretrained. Results are reported relative to DCGAN's scores to avoid direct comparison with results reported elsewhere for CIFAR-10 at its usual resolution (32x32).
D - CelebA dataset 128x128
In this experiment, we verify whether the proposed multiple-discriminators setting is capable of generating higher-resolution images. To that end, we employed CelebA at a resolution of 128x128. We used an architecture similar to the one described in the previous experiments for both the generator and the discriminator networks. A convolutional layer with 2048 feature maps was added to both the generator and the discriminator architectures due to the increase in image size. The Adam optimizer was employed with the same set of hyperparameters as for CIFAR-10 and CelebA 64x64. We trained models with 6, 8, and 10 discriminators for 24 epochs. Samples from each generator are shown in Figure 11.
E - Generating 256x256 Cats
We show that the proposed multiple-discriminators setting scales to higher resolution even in the small-dataset regime by reproducing the experiments presented in (Jolicoeur-Martineau, 2018). We used the same architecture for the generator. For the discriminator, we removed batch normalization from all layers and used a stride of 1 at the last convolutional layer, after adding the initial projection step. The Cats dataset (https://www.kaggle.com/crawford/cat-dataset) was employed, and we followed the same pre-processing steps, which, in our case, yielded 1740 training samples with a resolution of 256x256. Our model is trained using 24 discriminators and the Adam optimizer, with the same hyperparameters as in the previously described CIFAR-10 and CelebA experiments. In Figure 12 we show the generator's samples after 288 training epochs. One epoch corresponds to updating over 27 minibatches of size 64.
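The epoch length quoted above follows from the dataset and minibatch sizes. The short Python sketch below reproduces the arithmetic, assuming that an incomplete final minibatch is dropped (the source does not state how the remainder is handled). The variable names are illustrative.

```python
n_samples = 1740   # training samples after pre-processing
batch_size = 64    # minibatch size
n_epochs = 288     # epochs after which samples are shown

# Assumption: the incomplete final minibatch of each epoch is discarded.
batches_per_epoch = n_samples // batch_size
unused_per_epoch = n_samples - batches_per_epoch * batch_size
total_minibatches = n_epochs * batches_per_epoch

print(batches_per_epoch, unused_per_epoch, total_minibatches)  # 27 12 7776
```

Under this assumption, 12 samples are left out of each pass over the data, and 288 epochs amount to 7776 minibatch updates.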