A recent trend in generative models is to use a deep neural network as a generator. Two notable approaches are variational auto-encoders (VAE) Kingma and Welling (2013); Rezende et al. (2014) and Generative Adversarial Networks (GAN) Goodfellow et al. (2014). Unlike VAEs, the GAN approach offers a way to circumvent log-likelihood-based estimation, and it also typically produces visually sharper samples Goodfellow et al. (2014). The goal of the generator network is to generate samples that are indistinguishable from real samples, where indistinguishability is measured by an additional discriminative model. This creates an adversarial game setting where one pits a generator against a discriminator.
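Let us denote the data distribution by $p_{\text{data}}(x)$ and the model distribution by $p_u(x)$. A probabilistic discriminator is denoted by $h_v : x \mapsto [0, 1]$ and a generator by $G_u : z \mapsto x$. The GAN objective is:
$$\min_u \max_v \; M(u, v) = \tfrac{1}{2}\, \mathbb{E}_{x \sim p_{\text{data}}} \log h_v(x) + \tfrac{1}{2}\, \mathbb{E}_{z \sim p_z} \log\big(1 - h_v(G_u(z))\big). \tag{1}$$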
Each of the two players (generator/discriminator) tries to optimize their own objective, which is exactly balanced by the loss of the other player, thus yielding a two-player zero-sum minimax game. Standard GAN approaches aim at finding a pure Nash Equilibrium by using traditional gradient-based techniques to minimize each player’s cost in an alternating fashion. However, an update made by one player can repeatedly undo the progress made by the other one, without ever converging.
In general, alternating gradient descent fails to converge even for very simple games Salimans et al. (2016). In the setting of GANs, one of the central open issues is this non-convergence problem, which in practice leads to oscillations between different kinds of generated samples Metz et al. (2016).
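The non-convergence described above can already be observed on a toy example. The following is a minimal sketch, assuming the bilinear game $M(u, v) = u \cdot v$ with scalar players (this game, the step size and the starting point are illustrative choices and are not taken from the experiments in this paper). Alternating gradient steps keep circling the equilibrium $(0, 0)$ at a roughly constant distance, while simultaneous steps spiral away from it.

```python
import math

def alternating_gda(u, v, lr=0.1, steps=1000):
    """Alternating gradient steps on M(u, v) = u * v; returns the distance to (0, 0) after each step."""
    radii = []
    for _ in range(steps):
        u = u - lr * v          # min player: dM/du = v
        v = v + lr * u          # max player, using the updated u: dM/dv = u
        radii.append(math.hypot(u, v))
    return radii

def simultaneous_gda(u, v, lr=0.1, steps=1000):
    """Simultaneous gradient steps on the same game."""
    radii = []
    for _ in range(steps):
        u, v = u - lr * v, v + lr * u
        radii.append(math.hypot(u, v))
    return radii

alt = alternating_gda(1.0, 1.0)
sim = simultaneous_gda(1.0, 1.0)
print(f"alternating: min radius {min(alt):.3f}, max radius {max(alt):.3f}")
print(f"simultaneous: final radius {sim[-1]:.1f}")
```

In both cases the iterates never approach the unique equilibrium, although each individual update is a valid descent (or ascent) step for the player making it.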
While standard GAN methods seek to find pure minimax strategies, we propose to consider mixed strategies, which allows us to leverage online learning algorithms for mixed strategies in large games. Building on the approach of Freund and Schapire (1999), we propose a novel training algorithm for GANs that we call Chekhov GAN.
On the theory side, we focus on simpler GAN architectures. The most elementary architecture that one might consider is a shallow one, e.g. a GAN architecture which consists of a single-layer network as a discriminator, and a generator with one hidden layer (see Fig. 1). However, one typically requires a powerful generator that can model complex data distributions. This leads us to consider a semi-shallow architecture where the generator is an arbitrary network (Fig. 1). In this paper, we address the following questions: 1) Can we efficiently find an equilibrium for semi-shallow GAN architectures? 2) Can we extend this result to more complex architectures?
We answer the first question in the affirmative, and provide a method that provably finds an equilibrium in the setting of a semi-shallow architecture. This is done in spite of the fact that the game induced by such architectures is not convex-concave. Our proof relies on analyzing semi-concave games, i.e., games which are concave with respect to the max player, but need not have a special structure with respect to the min player. We prove that in such games, players may efficiently invoke regret minimization procedures in order to find an equilibrium. To the best of our knowledge, this result is novel in the context of GANs, and might also find use in other scenarios where such structure may arise.
On the practical side, we develop an efficient heuristic guided by our theoretical results, which we apply to commonly used deep GAN architectures shown in Fig. 1. We provide experimental results demonstrating that our approach exhibits better empirical stability compared to GANs and generates more diverse samples, while retaining the visual quality.
In Section 2, we briefly review necessary notions from online learning and zero-sum games. We then present our approach and its theoretical guarantees in Section 3. Lastly, we present empirical results on standard benchmark datasets in Section 4.
2 Background & Related Work
The classical way to learn a generative model consists of minimizing a divergence function between a parametrized model distribution $p_u$ and the true data distribution $p_{\text{data}}$. The original GAN approach Goodfellow et al. (2014) was shown to be related to the Jensen-Shannon divergence. This was later generalized by Nowozin et al. (2016), who described a broader family of GAN objectives stemming from $f$-divergences. A different popular type of GAN objective is the family of Integral Probability Metrics Müller (1997), such as the kernel MMD Gretton et al. (2012); Li et al. (2015) or the Wasserstein metric Arjovsky and Bottou (2017). All of these divergence measures yield a minimax objective.
To optimize such a minimax objective, Goodfellow et al. (2014) alternate updates of the two players using mini-batch stochastic gradient descent and show convergence when the updates are made in function space. In practice, this condition is not met (since this procedure works in the parameter space) and many issues arise during training Arjovsky and Bottou (2017); Radford et al. (2015), thus requiring careful initialization and proper regularization as well as other tricks Metz et al. (2016); Pfau and Vinyals (2016); Radford et al. (2015); Salimans et al. (2016). Even so, several problems are still commonly observed, including a phenomenon where the generator oscillates without ever converging to a fixed point, and mode collapse, where the generator maps many latent codes to the same point, thus failing to produce diverse samples.
The closest work related to our approach is Arora et al. (2017) that showed the existence of an approximate mixed equilibrium with certain generalization properties; yet without providing a constructive way to find such equilibria. Instead, they advocate the use of mixed strategies, and suggest to do so by using the exponentiated gradient algorithm Kivinen and Warmuth (1997). The work of Tolstikhin et al. (2017) also uses a similar mixture approach based on boosting. Other works have studied the problem of equilibrium and stabilization of GANs, often relying on the use of an auto-encoder as discriminator Berthelot et al. (2017) or jointly with the GAN models Che et al. (2016). In this work, we focus on providing convergence guarantees to a mixed equilibrium (definition in Section 3.2) using a technique from online optimization that relies on the players’ past actions.
2.2 Online Learning
Online learning is a sequential decision making framework in which a player aims at minimizing a cumulative loss function revealed to her sequentially. The source of the loss functions may be arbitrary or even adversarial, and the player seeks to provide worst case guarantees on her performance. Formally, this framework can be described as a repeated game of $T$ rounds between a player $P_1$ and an adversary $P_2$. At each round $t \in [T]$: (1) $P_1$ chooses a point $u_t \in \mathcal{K}$ according to some algorithm $\mathcal{A}$, (2) $P_2$ chooses a loss function $f_t \in \mathcal{F}$, (3) $P_1$ suffers a loss $f_t(u_t)$, and the loss function $f_t(\cdot)$ is revealed to her. The adversary is usually limited to choosing losses from a structured class of objectives $\mathcal{F}$, most commonly linear/convex losses. Also, the decision set $\mathcal{K}$ is often assumed to be convex. The performance of the player’s strategy is measured by the regret, defined as
$$\text{Regret}_T^{\mathcal{A}}(f_1, \ldots, f_T) = \sum_{t=1}^{T} f_t(u_t) - \min_{u^* \in \mathcal{K}} \sum_{t=1}^{T} f_t(u^*). \tag{2}$$
Thus, the regret measures the cumulative loss of the player compared to the loss of the best fixed decision in hindsight. A player aims at minimizing her regret, and we are interested in no-regret strategies for which players ensure an $o(T)$ regret for any loss sequence (a regret which depends linearly on $T$ is ensured by any strategy and is therefore trivial).
While there are several no-regret strategies, many of them may be seen as instantiations of the Follow-the-Regularized-Leader (FTRL) algorithm, where
$$u_t = \arg\min_{u \in \mathcal{K}} \; \sum_{\tau=1}^{t-1} f_\tau(u) + \eta_t^{-1} R(u). \tag{3}$$
FTRL takes the accumulated loss observed up to time $t$ and then chooses the point in $\mathcal{K}$ that minimizes the accumulated loss plus a regularization term $\eta_t^{-1} R(u)$. The regularization term prevents the player from abruptly changing her decisions between consecutive rounds (Tikhonov regularization is one of the most popular regularizers). This property is often crucial to obtaining no-regret guarantees. Note that FTRL is not always guaranteed to yield no-regret, and is mainly known to provide such guarantees in the setting where losses are linear/convex Hazan et al. (2016); Shalev-Shwartz et al. (2012).
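To make the FTRL update concrete, the following is a minimal sketch for the special case of linear losses $f_t(u) = \langle g_t, u \rangle$ on a Euclidean ball with a squared-norm regularizer. In this case the FTRL minimizer has a closed form: it is the projection of $-\eta \sum_{\tau < t} g_\tau$ onto the ball. The loss vectors, the constant step size `eta` and the ball radius are illustrative assumptions rather than quantities used elsewhere in this paper.

```python
import numpy as np

def ftrl_linear(grads, eta, radius=1.0):
    """FTRL for linear losses f_t(u) = <g_t, u> with R(u) = 0.5 * ||u||^2 on a Euclidean ball."""
    cum = np.zeros(grads.shape[1])
    plays = []
    for g in grads:
        u = -eta * cum                       # unconstrained minimizer of <cum, u> + ||u||^2 / (2 * eta)
        norm = np.linalg.norm(u)
        if norm > radius:
            u *= radius / norm               # Euclidean projection onto the ball
        plays.append(u)
        cum = cum + g                        # the loss is revealed only after playing u
    return np.array(plays)

def regret(plays, grads, radius=1.0):
    """Cumulative loss of the plays minus the loss of the best fixed point in the ball."""
    incurred = float(np.sum(plays * grads))
    best_fixed = -radius * float(np.linalg.norm(grads.sum(axis=0)))
    return incurred - best_fixed

rng = np.random.default_rng(0)
T = 1000
grads = rng.normal(size=(T, 5))
grads /= np.linalg.norm(grads, axis=1, keepdims=True)   # unit-norm loss vectors
plays = ftrl_linear(grads, eta=1.0 / np.sqrt(T))
print("regret:", regret(plays, grads), "over T =", T, "rounds")
```

With a step size proportional to $1/\sqrt{T}$, the standard analysis for this setting gives a regret of order $\sqrt{T}$, which is sublinear in $T$ as required of a no-regret strategy.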
2.3 Zero-sum Games
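Consider two players, $P_1, P_2$, which may choose pure decisions among continuous sets $\mathcal{K}_1$ and $\mathcal{K}_2$, respectively. A zero-sum game is defined by a function $M : \mathcal{K}_1 \times \mathcal{K}_2 \to \mathbb{R}$ which sets the utilities of the players. Concretely, upon choosing a pure strategy $(u, v) \in \mathcal{K}_1 \times \mathcal{K}_2$ the utility of $P_1$ is $-M(u, v)$, while the utility of $P_2$ is $M(u, v)$. The goal of either $P_1$/$P_2$ is to maximize their worst case utilities; thus,
$$\min_{u \in \mathcal{K}_1} \max_{v \in \mathcal{K}_2} M(u, v) \;\; (\text{goal of } P_1), \qquad \max_{v \in \mathcal{K}_2} \min_{u \in \mathcal{K}_1} M(u, v) \;\; (\text{goal of } P_2). \tag{4}$$
This definition of a game makes sense if there exists a point $(u^*, v^*)$, such that neither $P_1$ nor $P_2$ may increase their utility by unilateral deviation. Such a point is called a Pure Nash Equilibrium, which is formally defined as a point $(u^*, v^*)$ which satisfies the following conditions:
$$M(u^*, v^*) \le \min_{u \in \mathcal{K}_1} M(u, v^*), \qquad M(u^*, v^*) \ge \max_{v \in \mathcal{K}_2} M(u^*, v).$$
While a pure Nash equilibrium does not always exist, the pioneering work of Nash et al. (1950) established that there always exists a Mixed Nash Equilibrium (MNE or simply equilibrium), i.e., there always exist two distributions $\mathcal{D}_1, \mathcal{D}_2$ such that,
$$\mathbb{E}_{(u,v) \sim \mathcal{D}_1 \times \mathcal{D}_2}[M(u, v)] \le \min_{u \in \mathcal{K}_1} \mathbb{E}_{v \sim \mathcal{D}_2}[M(u, v)], \qquad \mathbb{E}_{(u,v) \sim \mathcal{D}_1 \times \mathcal{D}_2}[M(u, v)] \ge \max_{v \in \mathcal{K}_2} \mathbb{E}_{u \sim \mathcal{D}_1}[M(u, v)].$$
Finding an exact MNE might be computationally hard, and we are usually satisfied with finding an approximate MNE. This is defined below.
Let $\varepsilon > 0$. Two distributions $\mathcal{D}_1, \mathcal{D}_2$ are called $\varepsilon$-MNE if the following holds,
$$\mathbb{E}_{(u,v) \sim \mathcal{D}_1 \times \mathcal{D}_2}[M(u, v)] \le \min_{u \in \mathcal{K}_1} \mathbb{E}_{v \sim \mathcal{D}_2}[M(u, v)] + \varepsilon, \qquad \mathbb{E}_{(u,v) \sim \mathcal{D}_1 \times \mathcal{D}_2}[M(u, v)] \ge \max_{v \in \mathcal{K}_2} \mathbb{E}_{u \sim \mathcal{D}_1}[M(u, v)] - \varepsilon.$$
Terminology: In the sequel when we discuss zero-sum games, we shall sometimes use the GAN terminology, referring to the min player $P_1$ as the generator, and to the max player $P_2$ as the discriminator.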
No-Regret & Zero-sum Games:
In zero-sum games, no-regret algorithms may be used to find an approximate MNE. Unfortunately, computationally tractable no-regret algorithms do not always exist. An exception is the setting when the game objective $M$ is convex-concave. In this case, the players may invoke the powerful no-regret methods from online convex optimization to (approximately) solve the game. This seminal idea was introduced in Freund and Schapire (1999), where it was demonstrated how to invoke no-regret algorithms during $T$ rounds to obtain an approximation guarantee of $O(1/\sqrt{T})$ in zero-sum matrix games. This was later improved by Daskalakis et al. (2015); Rakhlin and Sridharan (2013), demonstrating a guarantee of $O(\log T / T)$. The result that we are about to present builds on the scheme of Freund and Schapire (1999).
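The scheme of Freund and Schapire (1999) can be illustrated on a small matrix game. The sketch below lets both players run multiplicative weights (Hedge) against each other and averages their strategies over time. The payoff matrix (rock-paper-scissors with entries in $\{-1, 0, 1\}$), the non-uniform initial weights and the step size are assumptions made for illustration only.

```python
import numpy as np

# Assumed payoff matrix: the row player minimizes p^T A q, the column player maximizes it.
A = np.array([[0.0, 1.0, -1.0],
              [-1.0, 0.0, 1.0],
              [1.0, -1.0, 0.0]])

def hedge_self_play(A, T=2000, p0=(3.0, 1.0, 1.0), q0=(1.0, 1.0, 2.0)):
    """Both players run multiplicative weights; returns their time-averaged mixed strategies."""
    n, m = A.shape
    eta = np.sqrt(2.0 * np.log(max(n, m)) / T)
    log_wp, log_wq = np.log(p0), np.log(q0)
    p_sum, q_sum = np.zeros(n), np.zeros(m)
    for _ in range(T):
        p = np.exp(log_wp - log_wp.max()); p /= p.sum()
        q = np.exp(log_wq - log_wq.max()); q /= q.sum()
        p_sum += p
        q_sum += q
        log_wp -= eta * (A @ q)      # min player: loss of each pure row against q
        log_wq += eta * (A.T @ p)    # max player: reward of each pure column against p
    return p_sum / T, q_sum / T

def duality_gap(A, p, q):
    """Best-response value against p minus best-response value against q (zero at an exact MNE)."""
    return float(np.max(p @ A) - np.min(A @ q))

p_bar, q_bar = hedge_self_play(A)
print("averaged strategies:", p_bar, q_bar)
print("duality gap:", duality_gap(A, p_bar, q_bar))
```

The duality gap of the averaged strategies is bounded by the sum of the two regrets divided by the number of rounds, which is why sublinear regret for both players yields an approximate MNE.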
3 Finding an Equilibrium in GANs
Why Mixed Equilibrium? In this work our ultimate goal is to efficiently find an approximate MNE for the game. However, in GANs, we are usually interested in designing good generators, and one might ask whether finding an equilibrium serves this cause better than solving the minimax problem, i.e., finding $\arg\min_{u \in \mathcal{K}_1} \max_{v \in \mathcal{K}_2} M(u, v)$. Interestingly, the minimax value of a pure strategy for the generator is never lower than the minimax value of the equilibrium strategy of the generator. The benefit of finding an equilibrium can be demonstrated on a simple zero-sum game. Consider the following paper-rock-scissors game, i.e. a zero-sum game with the minimax objective
$$\min_u \max_v \; u^\top A v, \qquad A = \begin{pmatrix} 0 & 1 & -1 \\ -1 & 0 & 1 \\ 1 & -1 & 0 \end{pmatrix}.$$
Solving for the minimax objective yields a pure strategy with a minimax value of $1$; conversely, the equilibrium strategy of the min player is a uniform distribution over actions, and its minimax value is $0$. Thus, finding an equilibrium by allowing mixed strategies implies a smaller minimax value (as we show in Section 3.3, this is true in general).
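The gap between pure and mixed strategies in this example can be checked numerically. The sketch below assumes the standard paper-rock-scissors payoff matrix with entries in $\{-1, 0, 1\}$, where rows index the actions of the min player; the helper `worst_case_value` is a name introduced here for illustration.

```python
import numpy as np

# Assumed payoff matrix: rows are actions of the min player, columns of the max player.
A = np.array([[0.0, 1.0, -1.0],
              [-1.0, 0.0, 1.0],
              [1.0, -1.0, 0.0]])

def worst_case_value(p, A):
    """Value max_v p^T A v of a (possibly mixed) strategy p of the min player."""
    return float(np.max(p @ A))

# Best pure strategy: every single action can be exploited by the opponent.
pure_minimax = min(worst_case_value(np.eye(3)[i], A) for i in range(3))
# Uniform mixed strategy: no response of the opponent gains anything in expectation.
mixed_value = worst_case_value(np.full(3, 1.0 / 3.0), A)
print("pure minimax value:", pure_minimax)
print("minimax value of the uniform mixed strategy:", mixed_value)
```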
This section presents a method that efficiently finds an equilibrium for semi-shallow GANs as depicted in Fig. 1. Such architectures do not induce a convex-concave game, and therefore the result of Freund and Schapire (1999) does not directly apply. Nevertheless, we show that semi-shallow GANs imply an interesting game structure which gives rise to an efficient procedure for finding an equilibrium. In Sec. 3.1 we show that semi-shallow GANs define games with a property that we denote as semi-concave. Later, Sec. 3.2 provides an algorithm with provable guarantees for such games. Finally, in Section 3.3 we show that the minimax objective of the generator’s equilibrium strategy is optimal with respect to the minimax objective.
3.1 Semi-shallow GANs
Semi-shallow GANs do not lead to a convex-concave game. Nonetheless, here we show that for an appropriate choice of the activation function they induce a game which is concave with respect to the discriminator. As we present in Sec. 3.2, this property alone allows us to efficiently find an equilibrium.
Consider the GAN objective in Eq. (1) and assume that the adversary is a single-layer network with a sigmoid activation function, meaning $h_v(x) = 1 / (1 + \exp(-v^\top x))$, where $v \in \mathbb{R}^n$. Then the GAN objective is concave in $v$.
Note that the above is not restricted to the sigmoid activation function; it also holds for other choices of activation, e.g. the cumulative Gaussian distribution, i.e. $h_v(x) = \Phi(v^\top x)$. Note that the logarithm of $h_v(x)$ for the sigmoid and cumulative Gaussian activations corresponds to the well known logit and probit models, McCullagh and Nelder (1989).
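The concavity claim for the sigmoid case can be sanity-checked numerically. The sketch below evaluates an empirical GAN objective for a single-layer sigmoid discriminator on fixed real and generated samples, and tests the midpoint inequality $M(\tfrac{a+b}{2}) \ge \tfrac{1}{2}(M(a) + M(b))$ on random pairs of discriminator weights. The synthetic data, the equal weighting of the two terms and the function names are assumptions made for illustration; a numerical check of this kind does not replace the proof.

```python
import numpy as np

def gan_objective(v, x_real, x_fake):
    """Empirical objective for the discriminator h_v(x) = sigmoid(v^T x), as a function of v."""
    log_h_real = -np.logaddexp(0.0, -(x_real @ v))          # log sigmoid(v^T x)
    log_one_minus_h_fake = -np.logaddexp(0.0, x_fake @ v)   # log(1 - sigmoid(v^T x))
    return 0.5 * log_h_real.mean() + 0.5 * log_one_minus_h_fake.mean()

def count_midpoint_violations(x_real, x_fake, trials=1000, seed=0, tol=1e-9):
    """Counts random pairs (a, b) for which the midpoint inequality of concavity fails."""
    rng = np.random.default_rng(seed)
    dim = x_real.shape[1]
    violations = 0
    for _ in range(trials):
        a = 3.0 * rng.normal(size=dim)
        b = 3.0 * rng.normal(size=dim)
        mid = gan_objective(0.5 * (a + b), x_real, x_fake)
        avg = 0.5 * (gan_objective(a, x_real, x_fake) + gan_objective(b, x_real, x_fake))
        if mid < avg - tol:
            violations += 1
    return violations

rng = np.random.default_rng(1)
x_real = rng.normal(loc=1.0, size=(200, 4))    # stand-in for real samples
x_fake = rng.normal(loc=-1.0, size=(200, 4))   # stand-in for generated samples
print("midpoint violations:", count_midpoint_violations(x_real, x_fake))
```

The check holds for any fixed set of generated samples, which reflects the fact that concavity in $v$ does not depend on the structure of the generator.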
3.2 Semi-concave Zero-sum Games
Here we discuss the setting of zero-sum games (see Eq. (4)) which are semi-concave. Formally, a game $M$ is semi-concave if for any fixed $u_0 \in \mathcal{K}_1$ the function $g(v) := M(u_0, v)$ is concave in $v$. Algorithm 1 presents our method for semi-concave games. This algorithm is an instantiation of the scheme derived by Freund and Schapire (1999), with specific choices of the online algorithms $\mathcal{A}_1, \mathcal{A}_2$ used by the players. Note that $\mathcal{A}_1$ and $\mathcal{A}_2$ are two different instances of the FTRL approach presented in Eq. (3).
Let us discuss Algorithm 1 and then present its guarantees. First note that each player calculates a sequence of points based on an online algorithm $\mathcal{A}_1$/$\mathcal{A}_2$. Interestingly, the sequence of (loss/reward) functions given to the online algorithm is based on the game objective $M$, and also on the decisions made by the other player. For example, the loss sequence that $P_1$ receives is $\{f_t(\cdot) := M(\cdot, v_t)\}_{t \in [T]}$. After $T$ rounds we end up with two mixed strategies $\mathcal{D}_1, \mathcal{D}_2$, each being a uniform distribution over the respective online decisions $\{u_t\}_{t \in [T]}$, $\{v_t\}_{t \in [T]}$. Note that the first decision points $u_1, v_1$ are set by $\mathcal{A}_1, \mathcal{A}_2$ before encountering any (loss/reward) function, and the dummy functions $f_0, g_0$ are only introduced in order to simplify the exposition. Since $P_1$’s goal is to minimize, it is natural to think of the $f_t$’s as loss functions, and measure the guarantees of $\mathcal{A}_1$ according to the regret as defined in Equation (2). Analogously, since $P_2$’s goal is to maximize, it is natural to think of the $g_t(\cdot) := M(u_t, \cdot)$’s as reward functions, and measure the guarantees of $\mathcal{A}_2$ according to the following appropriate definition of regret,
$$\text{Regret}_T^{\mathcal{A}_2}(g_1, \ldots, g_T) = \max_{v^* \in \mathcal{K}_2} \sum_{t=1}^{T} g_t(v^*) - \sum_{t=1}^{T} g_t(v_t).$$
The following theorem presents our guarantees for semi-concave games:
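Let $\mathcal{K}_2$ be a convex set. Also, let $M$ be a semi-concave zero-sum game, and assume $M$ is $L$-Lipschitz continuous. Then upon invoking Alg. 1 for $T$ steps, using the FTRL versions $\mathcal{A}_1, \mathcal{A}_2$ appearing below, it outputs mixed strategies $(\mathcal{D}_1, \mathcal{D}_2)$ that are $\varepsilon$-MNE, where $\varepsilon = O(1/\sqrt{T})$.
$$\mathcal{A}_1: \; u_t \leftarrow \arg\min_{u \in \mathcal{K}_1} \sum_{\tau=0}^{t-1} f_\tau(u), \qquad \mathcal{A}_2: \; v_t \leftarrow \arg\max_{v \in \mathcal{K}_2} \sum_{\tau=0}^{t-1} \nabla g_\tau(v_\tau)^\top v - \frac{\sqrt{T}}{2 \eta_0} \|v\|^2.$$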
The most important point to note is that the accuracy of the approximation $\varepsilon$ improves as the number of iterations $T$ grows. This enables obtaining an arbitrarily good approximation for a large enough $T$. Note that $\mathcal{A}_1$ is in fact follow-the-leader, i.e., FTRL without regularization. The discriminator, $\mathcal{A}_2$, also uses the FTRL scheme. Yet, instead of the original reward functions, $g_t(\cdot)$, it utilizes linear approximations $\tilde{g}_t(v) = \nabla g_t(v_t)^\top v$. Also note the use of the (minus) square norm as regularization (the minus sign appears because the discriminator’s goal is to maximize, so the $g_t$’s may be thought of as reward functions). The regularization parameter $\eta_0$ depends on the Lipschitz constant of $M$ as well as on the diameter of $\mathcal{K}_2$, defined as $d := \max_{v_1, v_2 \in \mathcal{K}_2} \|v_1 - v_2\|$. Concretely, .
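The interplay of the two online algorithms can be illustrated on a toy semi-concave game. The sketch below is not the implementation used in this paper: it assumes the scalar game $M(u, v) = v \sin(3u) - v^2/2$ on $[-1, 1] \times [-1, 1]$, replaces the generator's optimization oracle by exhaustive search over a finite grid, and picks the regularization constant `eta0` by hand. The generator plays follow-the-leader on the exact past losses, while the discriminator plays a regularized leader on linearized rewards, which for a squared-norm regularizer on an interval reduces to clipping a scaled sum of past gradients.

```python
import numpy as np

def phi(u):
    return np.sin(3.0 * u)                 # non-convex dependence on the generator parameter

def M(u, v):
    return v * phi(u) - 0.5 * v ** 2       # concave in v for every fixed u

def run_semi_concave_scheme(T=400, eta0=0.35, grid_size=201):
    u_grid = np.linspace(-1.0, 1.0, grid_size)   # finite stand-in for the generator's decision set
    cum_loss = np.zeros(grid_size)               # sum of past losses f_tau(u) = M(u, v_tau) on the grid
    cum_grad = 0.0                               # sum of past gradients d/dv M(u_tau, v) at v = v_tau
    us, vs = [], []
    for _ in range(T):
        u = u_grid[np.argmin(cum_loss)]                                # follow-the-leader via exhaustive search
        v = float(np.clip(eta0 / np.sqrt(T) * cum_grad, -1.0, 1.0))    # regularized leader on linearized rewards
        us.append(u)
        vs.append(v)
        cum_loss += M(u_grid, v)
        cum_grad += phi(u) - v
    return np.array(us), np.array(vs), u_grid

def equilibrium_gap(us, vs, u_grid):
    """max_v E_{u~D1} M(u, v) - min_u E_{v~D2} M(u, v) for the uniform mixtures over the iterates."""
    m = phi(us).mean()
    v_best = np.clip(m, -1.0, 1.0)
    best_for_discriminator = v_best * m - 0.5 * v_best ** 2
    best_for_generator = np.min(phi(u_grid) * vs.mean() - 0.5 * np.mean(vs ** 2))
    return float(best_for_discriminator - best_for_generator)

us, vs, u_grid = run_semi_concave_scheme()
print("gap of the averaged strategies:", equilibrium_gap(us, vs, u_grid))
```

The reported gap is an upper bound on how much either player could gain by deviating from the uniform mixtures over the iterates, so a small value indicates an approximate MNE in the sense defined above.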
The proof makes use of a theorem due to Freund and Schapire (1999) which shows that if both $\mathcal{A}_1$ and $\mathcal{A}_2$ ensure no-regret then it implies convergence to an approximate MNE. Since the game is concave with respect to $P_2$, it is well known that the FTRL version $\mathcal{A}_2$ appearing in Thm. 1 is a no-regret strategy (see e.g. Hazan et al. (2016)). The challenge is therefore to show that $\mathcal{A}_1$ is also a no-regret strategy. This is non-trivial, especially for semi-concave games that do not necessarily have any special structure with respect to the generator (the result of Hazan and Koren (2016) shows that there does not exist any efficient no-regret algorithm, $\mathcal{A}_1$, in the general case where the loss sequence received by $\mathcal{A}_1$ is arbitrary). However, the loss sequence received by the generator is not arbitrary but rather it follows a special sequence based on the choices of the discriminator, $\{f_t(\cdot) = M(\cdot, v_t)\}_t$. In the case of semi-concave games, the sequence of discriminator decisions, $\{v_t\}_t$, has a special property which “stabilizes” the loss sequence $\{f_t\}_t$, which in turn enables us to establish no-regret for $\mathcal{A}_1$. ∎
Remark: Note that Alg. $\mathcal{A}_1$ in Thm. 1 assumes the availability of an oracle that can efficiently find a global minimum for the FTL objective, $\sum_{\tau=0}^{t-1} f_\tau(u)$. This involves a minimization over a sum of generative networks. Therefore, our result may be seen as a reduction from the problem of finding an equilibrium to an offline optimization problem. This reduction is not trivial, especially in light of the negative results of Hazan and Koren (2016), which imply that in the general case finding an equilibrium is hard, even with such an efficient offline optimization oracle at hand. Thus, our result enables us to take advantage of progress made in supervised deep learning in order to efficiently find an equilibrium for GANs.
3.3 Minimax value of Equilibrium Strategy
In GANs we are mainly interested in ensuring the performance of the generator (resp. discriminator) with respect to the minimax (resp. maximin) objective. Let $(\mathcal{D}_1, \mathcal{D}_2)$ be the pair of mixed strategies that Algorithm 1 outputs. Note that the minimax value of $\mathcal{D}_1$ might be considerably smaller than the pure minimax value, as is shown in the example regarding the paper-rock-scissors game (see Sec. 3). The next lemma shows that the mixed strategy $\mathcal{D}_1$ is always (approximately) better with respect to the pure minimax value (see proof in Appendix A.2).
An analogous result holds for $\mathcal{D}_2$ with respect to the pure maximin objective.
3.4 Practical Chekhov GAN Algorithm
In this section we describe a practical application of the general approach described in Alg. 1 to common deep GAN architectures. There are several differences compared to the theoretical algorithm that we have presented: (i) We use the FTRL objective (Eq. (3)) for both players. Note that Alg. $\mathcal{A}_2$ appearing in Thm. 1 uses FTRL with linear approximations, which is only appropriate for semi-concave games. (ii) As calculating the global minimizer of the FTRL objective is impractical, we update the weights based on the gradients of the FTRL objective, using traditional optimization techniques such as SGD or Adam. This differs from standard GAN training, which only employs the gradient of the last loss/reward function. (iii) The full FTRL algorithm requires saving the entire history of past generators/discriminators, which is computationally intractable. We find it sufficient to maintain a summary of the history, using a small number of representatives. In order to capture a diverse subset of the history, we keep a queue containing $K$ models whose spacing between each other is determined by the following heuristic. Every $m$ update steps, we remove the oldest model in the queue and add the current one. The number of steps between switches, $m$, can be set as a constant, but we find it more effective to keep it small at the beginning and increase its value as the number of rounds increases. We hypothesize that as the training progresses and the individual models become more powerful, we should switch the models at a lower rate, keeping them more spaced out. The pseudo-code and a detailed description of the algorithm appear in the Appendix.
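The history mechanism in (iii) can be sketched as a bounded queue whose switching interval grows over time. The following is an illustrative sketch only: the class `HistoryQueue`, the multiplicative `growth` schedule and the uniform averaging in `history_averaged_loss` are assumptions introduced here, and the exact procedure is the one described in the Appendix.

```python
from collections import deque

class HistoryQueue:
    """Keeps at most `max_size` past models; the spacing between snapshots can grow over time."""

    def __init__(self, max_size, initial_interval, growth=1.0):
        self.models = deque(maxlen=max_size)   # appending to a full deque drops the oldest entry
        self.interval = initial_interval       # number of update steps between switches
        self.growth = growth                   # assumed schedule: multiply the interval after each switch
        self._since_switch = 0

    def step(self, current_model):
        """Call once per update step; returns True when the current model was added to the queue."""
        self._since_switch += 1
        if self._since_switch < self.interval:
            return False
        self.models.append(current_model)
        self._since_switch = 0
        self.interval = max(1, round(self.interval * self.growth))
        return True

def history_averaged_loss(loss_against, current_opponent, queue):
    """Average of a player's loss against the stored past opponents and the current one."""
    opponents = list(queue.models) + [current_opponent]
    return sum(loss_against(o) for o in opponents) / len(opponents)

queue = HistoryQueue(max_size=3, initial_interval=2, growth=2.0)
for step in range(1, 31):
    queue.step(f"discriminator@{step}")
print(list(queue.models))
```

In a training loop, the generator's gradient step would be taken on `history_averaged_loss` evaluated against the stored discriminators, and symmetrically for the discriminator, instead of on the loss against the latest opponent only.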
4 Experimental results
We now compare Chekhov GAN to various baselines and demonstrate improved stability and sample diversity. We test our method on models where the traditional GAN training has difficulties converging and engages in a behavior of mode collapse. We also perform experiments on harder tasks using the DCGAN architecture Radford et al. (2015), for which we show that Chekhov GAN reduces mode dropping while retaining high visual sample quality. For all of the experiments, we generate from the newest generator only. Experimental details and comparisons to additional baselines, as well as a set of recommended hyperparameters, are available in Appendix C and Appendix B, respectively.
4.1 Non-convergence and Mode Dropping
4.1.1 Toy Dataset: Mixture of Gaussians
We first train a simple architecture using the standard GAN approach as well as Chekhov GAN on a synthesized 2D dataset following Metz et al. (2016); Che et al. (2016). The data consists of a mixture of 7 Gaussians whose centers are aligned in a circle. On this dataset, it can directly be seen how the traditional GAN updates lead to mode dropping, where one observes a collapse of large volumes of probability mass onto a few modes. This is hypothesised to be due to the differences between the minimax and maximin solutions of the game Goodfellow (2016). If the order of the min and max operations is switched, the minimization with respect to the generator’s parameters is performed in the inner loop. This causes the generator to map every latent code to one or very few points that the discriminator believes are likely. As simultaneous gradient descent updates do not clearly prioritize any specific ordering of minimax or maximin, in practice we often obtain results that resemble the latter. This phenomenon is clearly observed in Fig. 2. In contrast, Chekhov GAN takes advantage of the history of the players’ actions, which yields better gradient information. Intuitively, the generator is updated such that it fools the past discriminators. In order to do so, the generator has to spread its mass more fairly according to the true data distribution.
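A dataset of this kind, together with a simple check of how many modes a set of samples covers, can be sketched as follows. The circle radius, the standard deviation of each component and the distance threshold `tol` are assumptions chosen for illustration; they are not the values used in our experiments.

```python
import numpy as np

def sample_ring_mixture(n, modes=7, radius=1.0, std=0.05, seed=0):
    """Draws n points from a uniform mixture of `modes` Gaussians whose centers lie on a circle."""
    rng = np.random.default_rng(seed)
    angles = 2.0 * np.pi * np.arange(modes) / modes
    centers = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    labels = rng.integers(0, modes, size=n)
    samples = centers[labels] + std * rng.normal(size=(n, 2))
    return samples, labels, centers

def count_covered_modes(samples, centers, tol=0.2):
    """Number of centers that have at least one sample within distance `tol`."""
    dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    close = dists.min(axis=1) < tol
    return int(np.unique(nearest[close]).size)

samples, labels, centers = sample_ring_mixture(2000)
collapsed = np.tile(centers[0], (100, 1))        # a generator that maps every latent code to one mode
print("modes covered by the data:", count_covered_modes(samples, centers))
print("modes covered by a collapsed generator:", count_covered_modes(collapsed, centers))
```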
4.1.2 Augmented MNIST
We now evaluate the ability of our approach to avoid mode collapse on real image data coming from an augmented version of the MNIST dataset. Similarly to Metz et al. (2016); Che et al. (2016), we combine three randomly selected MNIST digits to form 3-channel images, resulting in a dataset with 1000 different classes, one for each of the possible combinations of the ten MNIST digits.
We train a simplified DCGAN architecture (see details in Appendix C) with both GAN and Chekhov GAN with a different number of saved past states. The evaluation of each model is done as follows. We generate a fixed amount of samples (25,600) from each model and classify them using a pre-trained MNIST classifier with an accuracy of. The models that exhibit less mode collapse are expected to generate samples from most of the 1000 modes.
We report two different evaluation metrics in Table 1: i) the number of classes for which a model generated at least one sample, and ii) the reverse KL divergence. The reverse KL divergence between the model and the target data distribution is computed by considering that the data distribution is a uniform distribution over all classes.
| Models | 0 states (GAN) | 5 states | 10 states |
| Generated Classes | 629 ± 121.08 | 743 ± 64.31 | 795 ± 37 |
| Reverse KL | 1.96 ± 0.64 | 1.40 ± 0.21 | 1.24 ± 0.17 |
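The two metrics reported above can be computed from the classifier's predictions as sketched below. The sketch assumes that each generated image is mapped to a class index in $\{0, \ldots, 999\}$ by combining the three predicted digits, and that the reverse KL divergence is taken as $\mathrm{KL}(q_{\text{model}} \,\|\, p_{\text{data}})$ with a uniform $p_{\text{data}}$; the function names are introduced here for illustration.

```python
import numpy as np

def stacked_class_ids(digit_predictions):
    """Maps per-channel digit predictions of shape (n, 3) to class indices in {0, ..., 999}."""
    d = np.asarray(digit_predictions)
    return 100 * d[:, 0] + 10 * d[:, 1] + d[:, 2]

def mode_coverage_metrics(class_ids, num_classes=1000):
    """Returns (number of classes with at least one sample, KL(model || uniform))."""
    counts = np.bincount(class_ids, minlength=num_classes).astype(float)
    q = counts / counts.sum()                  # empirical class distribution of the model
    covered = int(np.count_nonzero(counts))
    nonzero = q > 0                            # terms with q = 0 contribute nothing to the sum
    reverse_kl = float(np.sum(q[nonzero] * np.log(q[nonzero] * num_classes)))
    return covered, reverse_kl

ids = stacked_class_ids([[1, 2, 3], [0, 0, 7], [1, 2, 3]])
print(ids, mode_coverage_metrics(ids))
```

A model that covers all classes uniformly attains a reverse KL of zero, while a model that collapses onto a single class attains $\log 1000$.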
4.2 Image Modeling
We turn to the evaluation of our model for the task of generating rich image data for which the modes of the data distribution are unknown. In the following, we perform experiments that indirectly measure mode coverage through metrics based on the sample diversity and quality.
4.2.1 Inference via Optimization on CIFAR10
We train a DCGAN architecture on CIFAR10 Krizhevsky and Hinton (2009) and evaluate the performance of each model using the inference via optimization technique introduced in Metz et al. (2016) and explained in Appendix C.3.3.
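The idea behind inference via optimization is to search for the latent code whose generated sample is closest to a given target, and to report the remaining reconstruction error. The sketch below illustrates this on a toy linear generator with plain gradient descent; the generator, the optimizer and the step size are stand-ins chosen for illustration and are not the setup described in Appendix C.3.3.

```python
import numpy as np

def infer_latent(generator, grad_wrt_z, target, z0, lr, steps=200):
    """Gradient descent on 0.5 * ||G(z) - target||^2; returns the final z and the reconstruction MSE."""
    z = np.array(z0, dtype=float)
    for _ in range(steps):
        z -= lr * grad_wrt_z(z, target)
    return z, float(np.mean((generator(z) - target) ** 2))

# Toy linear generator G(z) = W z mapping a 2D latent code to a 3D sample.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

def generator(z):
    return W @ z

def grad_wrt_z(z, target):
    return W.T @ (W @ z - target)          # gradient of 0.5 * ||W z - target||^2 with respect to z

reachable = W @ np.array([0.5, -1.0])      # a target the generator can produce exactly
unreachable = np.array([1.0, 1.0, 0.0])    # a target outside the range of the generator
_, mse_reachable = infer_latent(generator, grad_wrt_z, reachable, np.zeros(2), lr=1.0 / 3.0)
_, mse_unreachable = infer_latent(generator, grad_wrt_z, unreachable, np.zeros(2), lr=1.0 / 3.0)
print("MSE (reachable target):", mse_reachable)
print("MSE (unreachable target):", mse_unreachable)
```

A generator that has dropped modes leaves more real images without a close reconstruction, which shows up as a larger average MSE over the target set.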
The average MSE over 10 rounds using different seeds is reported in Table 2. Using Chekhov GAN with as few as 5 past states results in a significant gain which can be further improved by increasing the number of past states to 10 and 25. In addition, the training procedure becomes more stable as indicated by the decrease in the standard deviation. The percentage of minibatches that achieve the lowest reconstruction loss with the different models is given in Table 2. This can also be visualized by comparing the closest images from each model to real target images as shown in Figure 3. The images are randomly selected images from the batch which has the largest absolute difference in MSE between GAN and Chekhov GAN with 25 states. The samples obtained by the original GAN are often blurry while samples from Chekhov GAN are both sharper and exhibit more variety, suggesting a better coverage of the true data distribution.
| Target | Past States | 0 (GAN) | 5 states | 10 states | 25 states |
| | MSE | 61.13 ± 3.99 | 58.84 ± 3.67 | 56.99 ± 3.49 | 48.42 ± 2.99 |
| | Best Rank (%) | 0 % | 0 % | 18.66 % | 81.33 % |
| | MSE | 59.5 ± 3.65 | 56.66 ± 3.60 | 53.75 ± 3.47 | 46.82 ± 2.96 |
| | Best Rank (%) | 0 % | 0 % | 17.57 % | 82.43 % |
4.2.2 Estimation of Missing Modes on CelebA
We estimate the number of missing modes on the CelebA dataset Liu et al. (2015) by using an auxiliary discriminator as performed in Che et al. (2016). The experiment consists of two phases. In the first phase we train GAN and Chekhov GAN models and generate a fixed number of images. In the second phase we independently train a noisy discriminator using the DCGAN architecture, where the training data is the previously generated data from each of the models, respectively. The noisy discriminator is then used as a mode estimator. Test images from CelebA are provided to the mode estimator, and the images that are classified as fake can be viewed as images on a missing mode. Table 4 showcases the number of missed modes for the two models. Generated samples from each model are given in the Appendix.
| σ | 0 states (GAN) | 5 states (Chekhov GAN) |
| 0.25 | 3004 ± 4154 | 1407 ± 1848 |
| 0.5 | 2568.25 ± 4148 | 1007 ± 1805 |
CelebA: Number of images from the test set that the auxiliary discriminator classifies as not real. Gaussian noise with variance $\sigma^2$ is added to the input of the auxiliary discriminator, with the standard deviation $\sigma$ shown in the first column. The test set consists of 50,000 images.
Interestingly, even with a small number of past states ($K = 5$), Chekhov GAN manages to stabilize the training and generate more diverse samples on all the datasets. In terms of computational complexity, our algorithm scales linearly with $K$. However, all the elements in the sum are independent and can be computed efficiently in a parallel manner.
Here we provide the proof of Thm. 1.
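The mixed strategies $(\mathcal{D}_1, \mathcal{D}_2)$ that Algorithm 1 outputs are $\varepsilon$-MNE, where
$$\varepsilon := \frac{B_T^{\mathcal{A}_1} + B_T^{\mathcal{A}_2}}{T};$$
here $B_T^{\mathcal{A}_1}, B_T^{\mathcal{A}_2}$ are bounds on the regret of $\mathcal{A}_1, \mathcal{A}_2$.
According to Thm. 2, it is sufficient to show that both $\mathcal{A}_1$ and $\mathcal{A}_2$ ensure a regret bound of $O(\sqrt{T})$.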
Guarantees for : this FTRL version is well known in online learning, and its regret guarantees can be found in the literature, (e.g, Theorem 5.1 in Hazan et al. (2016)). The following lemma provides its guarantees,
Let be the diameter of . Invoking with , ensures the following regret bound over the sequence of concave functions ,
Moreover, the following applies for the sequence generated by ,
Note that the proof heavily relies on the concavity of the ’s, which is due to the concavity of the game with respect to the discriminator. For completeness we provide a proof of the second part of the lemma in Sec. A.3.
Guarantees for : By Lemma 2, the sequence generated by the discriminator not only ensures low regret but is also stable in the sense that consecutive decision points are close by. This is the key property which will enable us to establish a regret bound for algorithm . Next we state the guarantees of ,
Let . Consider the loss sequence appearing in Alg 1, . Then algorithm ensures the following regret bound over this sequence,
5.1 Proof of Lemma 3
For any sequence of loss functions , the regret of FTL is bounded as follows,
Since is FTL, the above bound applies. Thus, using the above bound together with the stability of the sequence we obtain,
where the fourth line uses , the fifth line uses the Lipschitz continuity of . And the sixth line used the stability of the ’s due to Lemma 2. Finally, we use . ∎
5.2 Proof of Theorem 2
Writing explicitly and , and plugging these into the regret guarantees of , we have,
By definition, . Using this together with Equation (5), we get,
Recalling that , and denoting , we conclude that,
We can similarly show that,
which concludes the proof. ∎
We have presented a principled approach to training GANs, which is guaranteed to reach convergence to a mixed equilibrium for semi-shallow architectures. Empirically, our approach presents several advantages when applied to commonly used GAN architectures, such as improved stability or reduction in mode dropping. Our results open an avenue for the use of online-learning and game-theoretic techniques in the context of training GANs. One question that remains open is whether the theoretical guarantees can be extended to more complex architectures.
The authors would like to thank Hoda Heidari and Johannes Kirschner for helpful comments and suggestions. This research was partially supported by ERC StG 307036. This work was done in part while Andreas Krause was visiting the Simons Institute for the Theory of Computing. K.Y.L. is supported by the ETH Zürich Postdoctoral Fellowship and Marie Curie Actions for People COFUND program.
- Arjovsky and Bottou (2017) M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In NIPS 2016 Workshop on Adversarial Training. In review for ICLR, volume 2016, 2017.
- Arora et al. (2017) S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (gans). arXiv preprint arXiv:1703.00573, 2017.
- Berthelot et al. (2017) D. Berthelot, T. Schumm, and L. Metz. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
- Che et al. (2016) T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.
- Daskalakis et al. (2015) C. Daskalakis, A. Deckelbaum, and A. Kim. Near-optimal no-regret algorithms for zero-sum games. Games and Economic Behavior, 92:327–348, 2015.
- Freund and Schapire (1999) Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.
- Glorot and Bengio (2010) X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Aistats, volume 9, pages 249–256, 2010.
- Goodfellow (2016) I. Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
- Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Nets. pages 2672–2680, 2014.
- Gretton et al. (2012) A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
- Hazan and Koren (2016) E. Hazan and T. Koren. The computational power of optimization in online learning. In Proc. STOC, pages 128–141. ACM, 2016.
- Hazan et al. (2016) E. Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
- Kalai and Vempala (2005) A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
- Kingma and Ba (2014) D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kingma and Welling (2013) D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. arXiv.org, Dec. 2013.
- Kivinen and Warmuth (1997) J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
- Krizhevsky and Hinton (2009) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
- Li et al. (2015) Y. Li, K. Swersky, and R. S. Zemel. Generative moment matching networks. In ICML, pages 1718–1727, 2015.
- Liu et al. (2015) Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
- McCullagh and Nelder (1989) P. McCullagh and J. A. Nelder. Generalized linear models, no. 37 in monograph on statistics and applied probability, 1989.
- Metz et al. (2016) L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
- Müller (1997) A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(02):429–443, 1997.
- Nash et al. (1950) J. F. Nash et al. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1):48–49, 1950.
- Nowozin et al. (2016) S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
- Pfau and Vinyals (2016) D. Pfau and O. Vinyals. Connecting generative adversarial networks and actor-critic methods. arXiv preprint arXiv:1610.01945, 2016.
- Radford et al. (2015) A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Rakhlin and Sridharan (2013) S. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pages 3066–3074, 2013.
- Rezende et al. (2014) D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv.org, 2014.
- Salimans et al. (2016) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2226–2234, 2016.
- Shalev-Shwartz et al. (2012) S. Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
- Tolstikhin et al. (2017) I. Tolstikhin, S. Gelly, O. Bousquet, C.-J. Simon-Gabriel, and B. Schölkopf. AdaGAN: Boosting generative models. arXiv preprint arXiv:1701.02386, 2017.
Appendix A Remaining Proofs
A.1 Proof of Proposition 1
Look at the first term in the GAN objective, . For a fixed we have,
and it can be shown that the above expression is always concave in . (Footnote 6: For , the -dimensional function is concave. Note that is a composition of over a linear function in , and is therefore concave.) Since an expectation over concave functions is also concave, this implies the concavity of the first term in .
Similarly, look at the second term in the GAN objective, . For a fixed we have,
and it can be shown that the above expression is always concave in . Since an expectation over concave functions is also concave, this implies the concavity of the second term in .
Thus is a sum of two concave terms and is therefore concave in . ∎
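The concavity argument can be sanity-checked numerically. The sketch below assumes the single-layer discriminator has the form D_w(x) = sigmoid(<w, x>), which is our reading of the semi-shallow setting; the function names, dimensions, and sampling scales are illustrative and not taken from the paper. It tests midpoint concavity of both per-sample terms in w on random inputs.

```python
import math
import random

def log_sigmoid(a):
    """Numerically stable log(sigmoid(a))."""
    if a >= 0:
        return -math.log1p(math.exp(-a))
    return a - math.log1p(math.exp(a))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def real_term(w, x):
    # log D_w(x), with the ASSUMED form D_w(x) = sigmoid(<w, x>)
    return log_sigmoid(dot(w, x))

def fake_term(w, g):
    # log(1 - D_w(g)) = log sigmoid(-<w, g>), where g stands for a generated sample
    return log_sigmoid(-dot(w, g))

def midpoint_gap(f, w1, w2, x):
    """f(midpoint) minus the average of f at the endpoints; >= 0 for concave f."""
    mid = [(a + b) / 2 for a, b in zip(w1, w2)]
    return f(mid, x) - 0.5 * (f(w1, x) + f(w2, x))

def smallest_gap(trials=1000, dim=5, seed=0):
    """Smallest midpoint gap observed over random weight pairs and inputs."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        w1 = [rng.gauss(0, 3) for _ in range(dim)]
        w2 = [rng.gauss(0, 3) for _ in range(dim)]
        x = [rng.gauss(0, 1) for _ in range(dim)]
        for f in (real_term, fake_term):
            worst = min(worst, midpoint_gap(f, w1, w2, x))
    return worst

min_gap = smallest_gap()
```

A value of `min_gap` that is non-negative up to floating-point error is consistent with the proposition; a random check of this kind is of course not a proof.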
A.2 Proof of Lemma 1
Writing explicitly and , and plugging these into the regret guarantees of , we have,
Summing the above equations and dividing by , we get,
Next we show that the second term above is always smaller than the minimax value,
Plugging the above into Equation (9), and recalling , we get,
which concludes the proof. ∎
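The link between regret guarantees and the minimax value can be illustrated on a small finite game. The sketch below is not the GAN setting of the lemma: it runs multiplicative weights (Hedge, in the spirit of Freund and Schapire (1999)) for both players of a 2x2 zero-sum matrix game and measures the duality gap of the time-averaged mixed strategies, which is bounded by the sum of the two average regrets. The payoff matrix, step size, and function names are illustrative assumptions.

```python
import math

def normalize(w):
    s = sum(w)
    return [wi / s for wi in w]

def hedge_self_play(A, T):
    """Both players run Hedge against each other on payoff matrix A.

    The row player minimizes p^T A q, the column player maximizes it.
    Returns the time-averaged mixed strategies of both players.
    """
    n, m = len(A), len(A[0])
    eta = math.sqrt(math.log(max(n, m)) / T)  # assumed step size
    p, q = [1.0 / n] * n, [1.0 / m] * m
    p_avg, q_avg = [0.0] * n, [0.0] * m
    for _ in range(T):
        # Expected loss of each row action and gain of each column action.
        row_loss = [sum(A[i][j] * q[j] for j in range(m)) for i in range(n)]
        col_gain = [sum(A[i][j] * p[i] for i in range(n)) for j in range(m)]
        p_avg = [a + pi / T for a, pi in zip(p_avg, p)]
        q_avg = [a + qj / T for a, qj in zip(q_avg, q)]
        # Simultaneous multiplicative-weights updates.
        p = normalize([pi * math.exp(-eta * l) for pi, l in zip(p, row_loss)])
        q = normalize([qj * math.exp(eta * g) for qj, g in zip(q, col_gain)])
    return p_avg, q_avg

def payoff(A, p, q):
    return sum(A[i][j] * p[i] * q[j] for i in range(len(p)) for j in range(len(q)))

def duality_gap(A, p, q):
    """Best-response value against p minus best-response value against q."""
    n, m = len(A), len(A[0])
    best_col = max(sum(A[i][j] * p[i] for i in range(n)) for j in range(m))
    best_row = min(sum(A[i][j] * q[j] for j in range(m)) for i in range(n))
    return best_col - best_row

A = [[2.0, -1.0], [-1.0, 1.0]]  # illustrative game; its minimax value is 1/5
p_bar, q_bar = hedge_self_play(A, T=5000)
gap = duality_gap(A, p_bar, q_bar)
value = payoff(A, p_bar, q_bar)
```

A small `gap` means the averaged strategies form an approximate mixed equilibrium, and `value` is then within `gap` of the minimax value of the game.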
A.3 Proof of the second part of Lemma 2 (Stability of FTRL sequence in concave case)
Here we establish the stability of the FTRL decision rule, , depicted in Theorem 1.
Note that the following applies to this FTRL objective,
Where is a constant independent of .
Let us denote by the projection operator onto , meaning,
By Equation (10) the FTRL rule, , can be written as follows,
The projection operator is a contraction (see, e.g., Hazan et al. (2016)); using this together with the above implies,
where we used which is due to the Lipschitz continuity of . We also used . ∎
Appendix B Practical Chekhov GAN Algorithm
The pseudo-code of algorithms and is given in Algorithm 3. The algorithm is symmetric for both players and proceeds as follows. At every step, if we are currently in the switching mode (i.e. ) and the queue is full, we remove the model at the end of the queue, which is the oldest one. Otherwise, we do not remove any model from the queue, but instead just overwrite the head (first) element with the current update.
We set the initial spacing, , to , where is the number of update steps per epoch, and is the number of past states we keep. The number of updates per epoch is just the number of data points divided by the minibatch size. The default value of is 10. Depending on the dataset and number of total update steps, for higher values of , this is the only parameter that needs to be tuned. We find that our model is not sensitive to the regularization hyperparameters. For symmetric architectures of the generator and the discriminator (such as DCGAN), we recommend using the same regularization for both players. For our experiments we set the default regularization to 0.1.
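The queue bookkeeping described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 3. The class name `SnapshotQueue` is hypothetical. The switching rule `step % spacing == 0` is an assumption, since the exact condition is not reproduced here. The sketch also assumes that a new head is pushed in switching mode even when the queue is not yet full, and it keeps the spacing constant, so the increase parameter inc is not modeled.

```python
from collections import deque

class SnapshotQueue:
    """Bounded queue of past model states; index 0 is the head (newest)."""

    def __init__(self, capacity, spacing):
        self.capacity = capacity  # number of past states to keep
        self.spacing = spacing    # steps between switches (kept constant here)
        self.queue = deque()

    def update(self, step, model):
        # ASSUMPTION: we are in switching mode every `spacing` steps.
        switching = step % self.spacing == 0
        if switching or not self.queue:
            if len(self.queue) == self.capacity:
                self.queue.pop()          # drop the oldest model (end of queue)
            self.queue.appendleft(model)  # current update becomes the new head
        else:
            self.queue[0] = model         # overwrite the head in place

# Usage: integers stand in for model parameters.
snapshots = SnapshotQueue(capacity=3, spacing=2)
for step in range(7):
    snapshots.update(step, model=step)
history = list(snapshots.queue)
```

Between switches only the head tracks the latest parameters, so the older entries stay spread out in time instead of being near-copies of the current model.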
Appendix C Experiments
C.1 Toy Dataset: Mixture of Gaussians
The toy dataset consists of a mixture of 7 Gaussians with a standard deviation of 0.01 and means equally spaced around a unit circle.
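A minimal sampler for such a dataset might look as follows. The function name, the seed handling, and the optional `weights` argument (for sampling modes with unequal probabilities) are illustrative choices, not taken from the paper's code.

```python
import math
import random

def sample_ring_mixture(n_samples, n_modes=7, std=0.01, radius=1.0,
                        weights=None, seed=0):
    """Draw 2D points from Gaussians whose means are equally spaced on a circle.

    weights=None samples every mode with equal probability.
    Returns the points and the index of the mode each point came from.
    """
    rng = random.Random(seed)
    means = [(radius * math.cos(2 * math.pi * k / n_modes),
              radius * math.sin(2 * math.pi * k / n_modes))
             for k in range(n_modes)]
    modes = rng.choices(range(n_modes), weights=weights, k=n_samples)
    points = [(rng.gauss(means[k][0], std), rng.gauss(means[k][1], std))
              for k in modes]
    return points, modes

points, modes = sample_ring_mixture(1000)
```

Counting how many of the mode indices a trained generator's samples fall close to gives a simple check for missing modes.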
The generator consists of two fully connected layers (of size 128) and a linear projection to the dimensionality of the data (i.e. 2). The activation functions for the fully connected layers are tanh. The discriminator is symmetric and hence composed of two fully connected layers (of size 128) followed by a linear layer of size 1. The activation functions for the fully connected layers are tanh, whereas the final layer uses a sigmoid activation function.
Following Metz et al. (2016), we initialize the weights for both networks to be orthogonal with scaling of 0.8. Adam Kingma and Ba (2014) was used as an optimizer for both the discriminator and the generator, with a learning rate of and . The discriminator and generator respectively minimize and maximize the objective
The setup is the same for both models. For Chekhov GAN we use past states with L2 regularization on the network weights using an initial regularization parameter of 0.01.
Effect of the latent dimension.
We find that for the case where , GANs with the traditional updates fail to cover all modes, either rotating around the modes (as shown in Metz et al. (2016)) or converging to only a subset of the modes. However, if we sample the latent code from a lower dimensional space, e.g. , such that it matches the data dimensionality, the generator needs to learn a simpler mapping. We then observe that both GAN and Chekhov GAN are able to recover the true data distribution in this case (see Figure 3).
Figure 3: Both GAN and Chekhov GAN converge to the true data distribution when the dimensionality of the noise vector is 2.
We run an additional experiment directly targeted at testing for mode collapse. We sample points from the data distribution with different probabilities for each mode. Using the same architectures, we perform an experiment with a mixture of 5 Gaussians, again with a standard deviation of 0.01 and arranged in a circle. The probabilities to sample points from each of the modes are [0.35, 0.35, 0.1, 0.1, 0.1]. In this case two modes have higher probability and could potentially attract the gradients towards them and cause mode collapse. Chekhov GAN manages to recover the true data distribution in this case as well, unlike vanilla GANs (Figure 4).
C.2 Augmented MNIST
We here detail the experiment on the Stacked MNIST dataset. The dataset is created by stacking three randomly selected MNIST images in the color channels, resulting in a 3-channel image that belongs to one out of 1000 possible classes. The architectures of the generator and discriminator are given in Table 5 and Table 6, respectively.
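The stacking step can be sketched in a few lines. The function name and the encoding of the class as a three-digit number (100·d0 + 10·d1 + d2) are assumptions made for illustration; the text only states that there are 1000 possible classes. Tiny 2x2 dummy arrays stand in for real MNIST images.

```python
import random

def stack_three(images, labels, rng):
    """Stack three randomly chosen grayscale images into one H x W x 3 image.

    Returns the stacked image and a class id in [0, 999].
    ASSUMPTION: the class id encodes the three digits as 100*d0 + 10*d1 + d2.
    """
    idx = [rng.randrange(len(images)) for _ in range(3)]
    height, width = len(images[0]), len(images[0][0])
    stacked = [[[images[i][r][c] for i in idx] for c in range(width)]
               for r in range(height)]
    d0, d1, d2 = (labels[i] for i in idx)
    return stacked, 100 * d0 + 10 * d1 + d2

# Dummy data: each 2x2 "image" is filled with the value of its digit label.
rng = random.Random(0)
labels = list(range(10))
images = [[[d] * 2 for _ in range(2)] for d in labels]
stacked, class_id = stack_three(images, labels, rng)
```

With 1000 classes, counting how many distinct class ids appear among generated samples gives a direct measure of mode coverage.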
Table 5: Generator architecture.
| Layer | Number of outputs |
| Fully Connected | 512 (reshape to [-1, 4, 4, 64]) |

Table 6: Discriminator architecture.
| Layer | Number of outputs |
| Flatten and Fully Connected | 1 |
We use a simplified version of the DCGAN architecture as suggested by Metz et al. (2016). It contains "deconvolutional layers", which are implemented as transposed convolutions. All convolutions and deconvolutions use a kernel size of with a stride of 2. The weights are initialized using the Xavier initialization Glorot and Bengio (2010). The activation units for the discriminator are leaky ReLUs with a leak of 0.3, whereas the generator uses standard ReLUs. We train all models for 20 epochs with a batch size of 32, using the RMSProp optimizer with batch normalization. The optimal learning rate for GAN is 0.001, and for Chekhov GAN it is 0.01. For all Chekhov GAN models we use regularization of 0.1 for the discriminator and 0.0001 for the generator. The regularization is L2 regularization only on the fully connected layers. For , the increase parameter inc is set to 50. For K=10, inc is 120.
C.3 CIFAR10 / CelebA
Generator architecture:
| Layer | Number of outputs |
| Fully Connected | 32,768 (reshape to [-1, 4, 4, 512]) |

Discriminator architecture:
| Layer | Number of outputs |
| Flatten and Fully Connected | 1 |
As for MNIST, we apply batch normalization. The activation functions for the generator are ReLUs, whereas the discriminator uses leaky ReLUs with a leak of 0.3. The learning rate for all the models is 0.0002 for both the generator and the discriminator and the updates are performed using the Adam optimizer. The regularization for Chekhov GAN is 0.1 and the increase parameter inc is 10.
C.3.1 Results on CIFAR10
We train for 30 epochs, which we find to be the optimal number of training steps for vanilla GAN in terms of MSE on images from the validation set. Table 9 includes a comparison to other baselines. The first set of baselines (shown in purple) consists of GANs where the updates in the inner loop (for the discriminator), the outer loop (for the generator), or both are performed 25 times. The baselines shown in green are regularized versions of GANs, where we apply the same regularization as in our Chekhov GAN in order to show that the gain is not due to the regularization alone. Figure 5 presents two randomly sampled batches from the generator trained with GAN and Chekhov GAN.
C.3.2 Results on CelebA
All models are trained for 10 epochs. Randomly generated batches of images are shown in Figure 6.
C.3.3 Details about Inference via Optimization on CIFAR10
This approach consists of finding a noise vector that, when used as input to the generator, produces the image closest to a target image in terms of mean squared error (MSE):
We report the MSE in image space between and . This measures the ability of the generator to generate samples that look like real images. A model engaging in mode collapse would fail to generate (approximate) images from the real data. Conversely, if a model covers the true data distribution it should be able to generate any specific image from it.
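The procedure can be illustrated with a toy stand-in. In assumed notation, it searches for z minimizing the MSE between G(z) and a target x by gradient descent on z. The sketch below replaces the trained network with a linear "generator" G(z) = Wz so that the gradient can be written by hand; the matrix, step size, and function names are illustrative assumptions, and with a real generator the gradient would come from automatic differentiation.

```python
def generate(W, z):
    """Toy linear generator G(z) = W z (a stand-in for a trained network)."""
    return [sum(W[i][j] * z[j] for j in range(len(z))) for i in range(len(W))]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def infer_latent(W, target, steps=500, lr=0.1):
    """Gradient descent on z to minimize the MSE between G(z) and the target."""
    z = [0.0] * len(W[0])
    for _ in range(steps):
        residual = [g - t for g, t in zip(generate(W, z), target)]
        # Gradient of the MSE with respect to z for the linear generator.
        grad = [2.0 / len(target) * sum(W[i][j] * residual[i]
                                        for i in range(len(W)))
                for j in range(len(z))]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z, mse(generate(W, z), target)

W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reachable = generate(W, [0.5, -0.3])   # a target the generator can produce
unreachable = [1.0, 1.0, 0.0]          # a target outside the generator's range
_, err_reachable = infer_latent(W, reachable)
_, err_unreachable = infer_latent(W, unreachable)
```

The reachable target is reconstructed with an error near zero, whereas the unreachable one leaves a residual error. This mirrors the interpretation above: a generator that misses part of the data distribution cannot reproduce images from that part, which shows up as a higher MSE.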