1 Introduction
A fully data-driven paradigm for conducting science has emerged in recent years with the advent of GANs [1]. A GAN offers a new methodology for drawing samples from an unknown distribution when only samples from this distribution are available, which has made GANs one of the hottest areas in machine learning and artificial intelligence research. Indicatively, GANs have been successfully utilized in (conditional) image creation [2, 3, 4], generating very realistic samples [5, 6], speech signal processing [7, 8, 9] and astronomy [10], to name a few.
A GAN is a two-player zero-sum game [1, 11] between a Discriminator and a Generator, both being powerful neural networks. They are simultaneously trained to achieve a Nash equilibrium of the game, where the Discriminator cannot distinguish the real from the fake samples while the Generator has learned the unknown distribution. It is well known that the training procedure of GANs often fails, and several specific heuristics and hacks have been devised [12] along with general-purpose acceleration techniques such as batch normalization [13]. To alleviate the difficulties of training, extensions and generalizations stemming from the utilization of a different loss function have been proposed. For instance, f-GAN [14] is a generalization where an f-divergence is used instead of the Jensen-Shannon divergence of the original GAN. Another widely applied extension is Wasserstein GAN [15], which has been further improved in [16]. On the other hand, there are relatively few studies that aim directly to improve the convergence speed of training of an existing GAN.
In this paper, instead of proposing a new GAN architecture or a new GAN loss function, we propose a new training algorithm inspired by the multiplicative weight update method (MWUM) [17]. Our goal is to improve the training of the Generator by transferring ideas from Game Theory. Intuitively, the new algorithm puts more weight on fake samples that are more likely to fool the Discriminator and simultaneously reduces the weight of samples that are confidently discriminated as fake. Our contributions are summarized as follows: (i) By adding weights to the training of GANs, we manage to improve the training performance with minor additional computational cost. The new approach is called Weighted Generative Adversarial Network (WeGAN). (ii) We provide rigorous arguments that the weights of WeGAN locally reduce the loss function more than, or at least as much as, the equally weighted stochastic gradient descent for the Generator. (iii) The proposed algorithm is not specific to vanilla GAN [1], but it is directly transferable to other extensions such as conditional GANs, Wasserstein GAN and f-GAN. This is an important generalization property of WeGAN.
Before proceeding, it is worth noting that training methods utilizing weights for the Generator have been recently proposed [18, 9, 19]. These methods are essentially equivalent since they assign importance weights to the generated samples in order to obtain a tighter lower bound for their variational formula. However, importance weights GAN (IWGAN) cannot be applied to arbitrary objective functions and, additionally, it might diverge due to the unboundedness of its weights. We implemented IWGAN and present its performance in the Results section, comparing it to our algorithm.
2 Preliminaries
GAN formulation. Let $G$ be the Generator and $D$ be the Discriminator of the GAN [1]. $D(x)$ is the probability estimate that sample $x$ is real, while $G(z)$ is the sample output of the Generator given a noise sample $z$. In order to be trained, the following objective function of the two-player zero-sum game has to be optimized:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right],$$
where $p_{\mathrm{data}}$ is the distribution to be learned, while $p_z$ is the noise input distribution. Typically, an optimum (Nash equilibrium) of this zero-sum game, which is a saddle point, is estimated using stochastic gradient descent. As was proved in [1], the global optimum of this zero-sum game is the point where $D(x) = 1/2$ for any sample $x$ and the Generator generates samples according to the real distribution.
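As a small illustration of the objective from [1], the sketch below estimates it on a minibatch as the average of log D(x) over real samples plus the average of log(1 - D(G(z))) over generated samples. The function name `value_estimate` and the toy inputs are assumptions made for this example only; the Discriminator outputs are passed in directly as plain numbers rather than computed by a network.

```python
import math

def value_estimate(d_real, d_fake):
    """Minibatch estimate of the GAN objective from [1].

    d_real: Discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: Discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# At the global optimum the Discriminator outputs 1/2 for every sample,
# so the objective equals log(1/2) + log(1/2) = -log 4.
v_opt = value_estimate([0.5] * 4, [0.5] * 4)
```

The value at the optimum, -log 4, is the one reported in [1] and gives a simple sanity check for any implementation of the objective.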
MWUM basics. MWUM is a classic algorithmic technique with numerous applications. The main idea behind this method is the existence of a number of "experts" that give some kind of advice to a decision maker. A specific weight is assigned to each "expert", and initially all weights are equal. The decision maker then takes a decision according to the advice of the "experts", taking into account the weight of each of them. Afterwards, the weights are multiplicatively updated according to the performance of the advice of each individual "expert": the weights of the "experts" with good performance are increased and the rest are decreased, and the process repeats. We continue with the description of our algorithm and its connection to this method.
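The expert-weighting loop described above can be sketched as follows. This is a minimal version of the textbook update in which each weight is multiplied by (1 - eta * cost), with costs in [-1, 1] and eta at most 1/2, following the general scheme surveyed in [17]. The names `mwu_step` and `normalize`, the value of `eta` and the toy cost sequence are assumptions introduced for illustration.

```python
def mwu_step(weights, costs, eta=0.5):
    """One multiplicative weights update.

    weights: current positive weights, one per expert.
    costs:   cost of each expert's advice in this round, each in [-1, 1]
             (a low cost means good advice).
    eta:     learning rate in (0, 1/2]; the default is a hypothetical choice.
    """
    return [w * (1.0 - eta * c) for w, c in zip(weights, costs)]

def normalize(weights):
    """Turn positive weights into a probability distribution over experts."""
    total = sum(weights)
    return [w / total for w in weights]

weights = [1.0, 1.0, 1.0]                    # all experts start with equal weight
rounds = [[1.0, 0.0, 0.0],                   # toy per-round costs of the three experts
          [1.0, 1.0, 0.0],
          [0.0, 1.0, 0.0]]
for costs in rounds:
    probs = normalize(weights)               # decision maker follows expert i with probability probs[i]
    weights = mwu_step(weights, costs)
final = normalize(weights)                   # the third expert, which never incurred a cost, dominates
```

Note that the weights here accumulate over rounds, since the same experts are consulted in every round.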
Algorithm 1
for number of iterations do
Sample from the input distribution .
Update the Discriminator by ascending its stochastic gradient:
Compute the unnormalized weights:
3 Weighted GAN algorithm
The proposed algorithm presented in Fig. 1 is a modification of the original GAN training algorithm. Inspired by the MWUM, instead of equally weighted 'fake' samples, we assign a weight to each sample (the "expert" in MWUM) which multiplies the respective gradient term of the Generator. The weighting aims to put more emphasis on samples that fool the Discriminator and thus are closer to the real data. Indeed, when and the Discriminator understands that the sample is fake the weight decreases by a factor . On the other hand, when the weight remains the same and after the normalization step it has a value greater than or equal to the previous one. Notice also that the weights in Algorithm 1 depend only on the current value of the Discriminator, while in the standard MWUM the weights are updated cumulatively. This modification was necessary because the input samples are different at each iteration: new samples are generated and there is no obvious map between the current samples and the samples from the previous iteration.
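As a rough illustration of the weighting step, the sketch below computes normalized per-sample weights from the Discriminator outputs and uses them in a weighted version of the vanilla GAN Generator term from [1]. The exact weight formula of Algorithm 1 is not reproduced here. The rule `eta ** (1 - d)` with `eta` in (0, 1) is a hypothetical stand-in, chosen only because it matches the qualitative description above: a sample recognized as fake is down-weighted by a factor, while a sample that fools the Discriminator keeps its weight. The function names and the value of `eta` are likewise assumptions.

```python
import math

def sample_weights(d_fake, eta=0.1):
    """Normalized weights from Discriminator outputs D(G(z_i)).

    Hypothetical rule (not taken from Algorithm 1): the unnormalized
    weight is eta ** (1 - d) with eta in (0, 1).
    d close to 0 (confidently fake)       -> weight close to eta (down-weighted)
    d close to 1 (fools the Discriminator) -> weight close to 1 (unchanged)
    """
    raw = [eta ** (1.0 - d) for d in d_fake]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_generator_loss(d_fake, weights):
    """Weighted Generator term: sum_i w_i * log(1 - D(G(z_i)))."""
    return sum(w * math.log(1.0 - d) for w, d in zip(weights, d_fake))

d_fake = [0.05, 0.30, 0.70]                       # illustrative Discriminator outputs
w = sample_weights(d_fake)
uniform = [1.0 / len(d_fake)] * len(d_fake)       # equally weighted baseline
loss_weighted = weighted_generator_loss(d_fake, w)
loss_uniform = weighted_generator_loss(d_fake, uniform)
```

Because the weights are recomputed from the current Discriminator outputs in every call, nothing is carried over between iterations, in line with the non-cumulative update described above.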
3.1 Theoretical properties of WeGAN algorithm
A key assumption of our algorithm, as well as of other weighting algorithms, is that the Discriminator is faithful in the sense that it produces sound decisions for both real and fake samples. Quantitatively, this means that the Discriminator should return, on average, values above 0.5 when the sample comes from the real distribution and below 0.5 when fake samples are fed to it. Next, we show that for a fixed Discriminator, the optimal Generator with weights as in Algorithm 1 achieves a lower or equal loss value than the optimal Generator with equally weighted samples. Hence, we expect the inferred Generator to be stronger, which favorably affects the speed of convergence.
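The faithfulness assumption can be checked empirically on a minibatch. The following is a minimal sketch in which the function name `is_faithful` and the toy Discriminator outputs are assumptions made for illustration.

```python
def is_faithful(d_real, d_fake):
    """True if D averages above 0.5 on real samples and below 0.5 on fake ones."""
    mean_real = sum(d_real) / len(d_real)
    mean_fake = sum(d_fake) / len(d_fake)
    return mean_real > 0.5 and mean_fake < 0.5

faithful = is_faithful([0.8, 0.6], [0.2, 0.4])      # sound decisions on both kinds of samples
unfaithful = is_faithful([0.4, 0.5], [0.2, 0.4])    # real samples are scored too low
```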
Theorem 1.
Fix Discriminator and let and be the respective optimum Generators under the weighted and equally weighted loss functions defined by
Let the weight vector,
, be defined according to Algorithm 1, then (1)
Proof. By definition, it holds for the optimum Generator that
If we prove that for any , it holds that when is defined as in Algorithm 1 we are done because we get the desired result for . Without loss of generality, we prove the case with samples. Using a more elaborate but similar argument we can prove it for the general case.
Assuming that , it is easy to show that and . Next, let be positive integers such that and , where are arbitrarily small constants for . This is possible because the set of rational numbers is a dense subset of the real numbers. Since implies , it then holds that
for arbitrarily small positive . Thus, we prove for that
At equilibrium. It is straightforward to show that at the Nash equilibrium the weights of WeGAN are uniform. Indeed, it holds that $D(G(z_i)) = 1/2$ for all $i$ and thus $w_i = w_j$ for all $i, j$.
This observation can serve either as a criterion to stop the training process or as an evaluation metric to assess whether or not the training process has converged to an optimum. The variance of the weights is the simplest statistic to monitor for both tasks.
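A minimal sketch of such a monitor is given below. It computes the variance of the normalized weights of one minibatch and compares it to a tolerance. The names `weight_variance` and `looks_converged` and the threshold `tol` are hypothetical choices for illustration, not values taken from the experiments.

```python
def weight_variance(weights):
    """Population variance of the normalized weights of one minibatch."""
    m = len(weights)
    mean = sum(weights) / m
    return sum((w - mean) ** 2 for w in weights) / m

def looks_converged(weights, tol=1e-6):
    """Hypothetical stopping test: the weights are (nearly) uniform."""
    return weight_variance(weights) < tol

uniform = [0.25, 0.25, 0.25, 0.25]     # weights at equilibrium
skewed = [0.70, 0.10, 0.10, 0.10]      # weights far from equilibrium
```

In practice one would probably average this statistic over several minibatches before acting on it, since a single batch can be noisy.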
WeGAN generalization. The proposed algorithm is not exclusive to vanilla GAN; it can be easily extended and applied to any variation of GANs that incorporates a Discriminator mechanism. Therefore, we do not propose just an extension of vanilla GAN but rather a novel training algorithm for general GANs. For instance, for Wasserstein GAN we could use the same weight formula as in vanilla GAN. The presented theoretical analysis still holds in this case.
4 Results
For a fair comparison, we evaluate the performance of the various training algorithms without changing the architecture of the networks. Moreover, with the exception of CIFAR, the presented results are averaged over 1000 iterations.
4.1 An illustrative example
We present a benchmark example where the new algorithm converges to the data distribution faster than vanilla GAN. The 'real' data are drawn from a mixture of 8 normal distributions, with each of the 8 components being equally probable. The mean values are equally spaced on a circle with radius 3 and covariance matrix . Moreover, both Generator and Discriminator are fully connected neural networks with 2 hidden layers and 32 units per layer. The input random variable has a dimensional standard normal distribution, while the output layer of the Discriminator is a sigmoid function.
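The 'real' data of this example can be generated along the following lines. This is a sketch under stated assumptions: the function name `sample_ring_mixture`, the seed, and in particular the isotropic standard deviation `std` are hypothetical, since the covariance matrix of the components is not specified here.

```python
import math
import random

def sample_ring_mixture(n, radius=3.0, n_modes=8, std=0.1, seed=0):
    """Draw n 2-D points from an equally weighted mixture of Gaussians
    whose means are equally spaced on a circle.

    std is a hypothetical isotropic standard deviation; the covariance
    used in the experiment is not specified here.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        k = rng.randrange(n_modes)                  # each component is equally probable
        angle = 2.0 * math.pi * k / n_modes         # component means sit on the circle
        cx, cy = radius * math.cos(angle), radius * math.sin(angle)
        points.append((rng.gauss(cx, std), rng.gauss(cy, std)))
    return points

data = sample_ring_mixture(1000)
```

Such well-separated modes make the example a common stress test for mode coverage, because a Generator that collapses onto a few components is easy to spot.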
The upper and middle plots of Fig. 2 show the relative improvement of WeGAN with respect to vanilla GAN for various values of (circle, square & star lines) as a function of the number of epochs. The chosen performance metric is the maximum mean discrepancy (MMD) [20], which measures the closeness between the real data and the generated ones. The relative improvement is higher at the early stage when only iteration in the training of the Discriminator is performed (upper plot of Fig. 2). In contrast, the highest relative improvement occurs closer to the convergence regime when iterations in the Discriminator's training are performed (middle plot). For comparison purposes we added IWGAN (dashed line), which also outperforms vanilla GAN but is slightly worse than WeGAN with . Moreover, there were cases where IWGAN diverged because it produced a weight with infinite value. In the lower plot of Fig. 2, we present the relative performance improvement between the baseline training algorithm for the Wasserstein GAN and the respective weighted variation. We observe that improvements happen but they are less prominent. Additionally, higher values of result in better performance, which is the opposite of what is observed with the vanilla GAN.
4.2 MNIST
We extend our experiments to the common benchmark MNIST database of handwritten digits [21, 22]. In this experiment, a fully connected neural network with a single hidden layer of 128 units is used for both Generator and Discriminator, while the input to the Generator is a 100-dimensional standard normal random variable. Two popular evaluation metrics, namely Inception Score (IS) [12] and Fréchet Inception Distance (FID) [23], are used to quantitatively assess the performance of GANs. Both metrics assume access to a pretrained classifier and provide an objective score based on the distribution of the sample that is to be evaluated. The overall relative performance of IWGAN and various versions of WeGAN with respect to vanilla GAN, in terms of the IS (upper plot) and FID (lower plot) metrics, is presented in Fig. 3. Evidently, the WeGAN algorithm outperforms standard vanilla GAN, with a relative improvement of almost 10% in IS and 30% in FID. The results reveal that WeGAN with has the best improvement when compared to other variations of values, which is consistent with the earlier reported results. By examining Fig. 3, we also observe that IWGAN achieves higher relative improvement in the early epochs but fails to maintain the performance, as opposed to WeGAN at , which procures the best performance.
4.3 CIFAR
CIFAR-10 is a well-studied dataset of natural images [24]. We use this dataset to examine the performance of GANs. For the Generator, we use a deep convolutional network with a single linear layer followed by convolutional layers. The Discriminator has convolutional layers and a linear layer at the end. Batch normalization is applied to both networks. The input noise with dimensionality of is drawn from a uniform distribution. Fig. 4 shows IS (upper plot) and FID (lower plot) scores for the CIFAR-10 dataset in terms of relative improvement with reference to vanilla GAN. It can be observed that the proposed WeGAN with is preferred over all respective weighted variations in IS score, with 5–10% improvement, whereas WeGAN with & both perform comparatively well in FID score. Unfortunately, the performance metrics produce conflicting outcomes, making it hard to draw a clear conclusion for this dataset. We also evaluate IWGAN; however, its performance remains approximately the same as that of the baseline vanilla GAN.
5 Conclusions and Future Directions
Inspired by the multiplicative weight update method, we proposed a novel algorithm to train GANs. Results indicated that the performance is improved when compared to the baseline training procedure. Moreover, WeGAN is not restricted to a particular type of GAN but it can be easily applied to any type. As future directions we list a more extensive study in terms of applications and network architectures, a systematic evaluation of the hyperparameter’s behavior as well as extensions towards adding suitable weights to the Discriminator, too.
6 Acknowledgements
We would like to thank Yannis Sfakianakis for his help in the implementation of some experiments.
References
 [1] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, “Generative Adversarial Nets,” in Proceedings of Annual Conference on Neural Information Processing Systems (NIPS ’14), 2014, pp. 2672–2680.
 [2] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
 [3] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
 [4] A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier GANs,” arXiv preprint arXiv:1610.09585, 2016.
 [5] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of GANs for improved quality, stability, and variation,” arXiv preprint arXiv:1710.10196, 2017.
 [6] M. Brundage, S. Avin, J. Clark, H. Toner, P. Eckersley, B. Garfinkel, A. Dafoe, P. Scharre, T. Zeitzoff, B. Filar, et al., “The malicious use of artificial intelligence: Forecasting, prevention, and mitigation,” arXiv preprint arXiv:1802.07228, 2018.
 [7] S. Pascual, A. Bonafonte, and J. Serra, “SEGAN: Speech enhancement generative adversarial network,” arXiv preprint arXiv:1703.09452, 2017.
 [8] Y. Saito, S. Takamichi, and H. Saruwatari, “Statistical parametric speech synthesis incorporating generative adversarial networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 1, pp. 84–96, 2018.
 [9] T. Che, Y. Li, R. Zhang, R. D. Hjelm, W. Li, Y. Song, and Y. Bengio, “Maximum-likelihood augmented discrete generative adversarial networks,” arXiv preprint arXiv:1702.07983, 2017.
 [10] K. Schawinski, C. Zhang, H. Zhang, L. Fowler, and G. K. Santhanam, “Generative adversarial networks recover features in astrophysical images of galaxies beyond the deconvolution limit,” Monthly Notices of the Royal Astronomical Society: Letters, vol. 467, no. 1, pp. L110–L114, 2017.
 [11] M. J. Osborne and A. Rubinstein, A course in game theory, MIT press, 1994.
 [12] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
 [13] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
 [14] S. Nowozin, B. Cseke, and R. Tomioka, “f-GAN: Training generative neural samplers using variational divergence minimization,” in Advances in Neural Information Processing Systems, 2016, pp. 271–279.
 [15] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” arXiv preprint arXiv:1701.07875, 2017.
 [16] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of Wasserstein GANs,” in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
 [17] S. Arora, E. Hazan, and S. Kale, “The multiplicative weights update method: a meta-algorithm and applications,” Theory of Computing, vol. 8, no. 1, pp. 121–164, 2012.
 [18] R. D. Hjelm, A. P. Jacob, T. Che, A. Trischler, K. Cho, and Y. Bengio, “Boundary-seeking generative adversarial networks,” arXiv preprint arXiv:1702.08431, 2017.
 [19] Z. Hu, Z. Yang, R. Salakhutdinov, and E. P. Xing, “On unifying deep generative models,” arXiv preprint arXiv:1706.00550, 2017.
 [20] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” Journal of Machine Learning Research, vol. 13, no. Mar, pp. 723–773, 2012.

 [21] Y. LeCun, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.
 [22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [23] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Advances in Neural Information Processing Systems, 2017, pp. 6626–6637.
 [24] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Tech. Rep., Citeseer, 2009.