When minimizing a function $f(x)$ over $x$, it is well known that $f$ decreases fastest if $x$ moves in the direction $-\nabla f(x)$. In addition, any direction orthogonal to $\nabla f(x)$ leaves $f$ unchanged to first order. In this work, we show that these orthogonal directions, which are ignored by gradient descent, can be critical in equilibrium problems, which are central to game theory. If each player in a game updates its parameters with gradient descent, the joint system can follow a cyclical trajectory, similar to a person riding a merry-go-round (see Figure 1). This toy scenario closely reflects an aspect of training for a particular machine learning model discussed below, and is depicted more technically in Figure 2. To reach the equilibrium point, a person riding the merry-go-round should walk perpendicularly to their direction of travel, which takes them directly to the center.
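The merry-go-round dynamics can be sketched numerically. Below is a minimal illustration, assuming the classic bilinear game min_x max_y xy as a hypothetical stand-in (this specific game is not taken from the text above): simultaneous gradient descent spirals outward, while stepping perpendicular to the field heads straight to the center.

```python
import numpy as np

# Toy bilinear game min_x max_y x*y (a hypothetical stand-in for the merry-go-round).
# The simultaneous-gradient field F(x, y) = (y, -x) is pure rotation about (0, 0).
def F(v):
    x, y = v
    return np.array([y, -x])

eta = 0.05
v_gd = np.array([1.0, 0.0])    # rider following the gradient field
v_perp = np.array([1.0, 0.0])  # rider walking perpendicular to the field
for _ in range(200):
    v_gd = v_gd - eta * F(v_gd)  # simultaneous gradient descent: spirals outward
    g = F(v_perp)
    v_perp = v_perp - eta * np.array([-g[1], g[0]])  # 90-degree rotation of F: points away from the center here, so stepping against it contracts

print(np.linalg.norm(v_gd) > 1.0)     # True: drifted beyond the starting radius
print(np.linalg.norm(v_perp) < 1e-3)  # True: walked straight to the equilibrium
```

The step sizes and iteration counts here are arbitrary illustrative choices.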
Equilibrium problems have drawn heightened attention in machine learning due to the emergence of the Generative Adversarial Network (GAN) Goodfellow et al. (2014). GANs have served a variety of applications including generating novel images Karras et al. (2017), simulating particle physics de Oliveira et al. (2017), and imitating expert policies in reinforcement learning Ho and Ermon (2016). Despite this plethora of successes, GAN training remains heuristic.
Deep learning has benefited from an understanding of simpler, more fundamental techniques. For example, multinomial logistic regression formulates learning a multiclass classifier as minimizing the cross-entropy of a log-linear model where class probabilities are recovered via a softmax. The minimization problem is convex and is solved efficiently with guarantees using stochastic gradient descent (SGD). Unsurprisingly, the majority of deep classifiers incorporate a softmax at the final layer, minimize a cross-entropy loss, and train with a variant of SGD. This progression from logistic regression to classification with deep neural nets is not mirrored in GANs. In contrast, from their inception, GANs were architected with deep nets. Only recently has the Wasserstein Linear-Quadratic GAN (LQ-GAN) Feizi et al. (2017); Nagarajan and Kolter (2017) been proposed as a minimal model for understanding GANs.
In this work, we analyze the convergence of several GAN training algorithms in the LQ-GAN setting. We survey several candidate theories for understanding convergence in GANs, naturally leading us to select Variational Inequalities, an intuitive generalization of the widely relied-upon theories from Convex Optimization. According to our analyses, none of the current GAN training algorithms is globally convergent in this setting. We propose a new technique, Crossing-the-Curl, for training GANs that converges with high probability in the N-dimensional (N-d) LQ-GAN setting.
This work makes the following contributions (proofs can be found in the supplementary material):
The first global convergence analysis of several GAN training methods for the N-d LQ-GAN,
Crossing-the-Curl, the first technique with stochastic convergence for the N-d LQ-GAN,
2 Generative Adversarial Networks
The Generative Adversarial Network (GAN) Goodfellow et al. (2014) formulates learning a generative model of data as finding a Nash equilibrium of a minimax game. The generator (min player) aims to synthesize realistic data samples by transforming vectors drawn from a fixed source distribution, e.g., $z \sim \mathcal{N}(0, I)$. The discriminator (max player) attempts to learn a scoring function that assigns low scores to synthetic data and high scores to samples drawn from the true dataset. The generator’s transformation function, $G_\theta$, and discriminator’s scoring function, $D_w$, are typically chosen to be neural networks parameterized by weights $\theta$ and $w$ respectively. The minimax objective of the original GAN Goodfellow et al. (2014) is

$$\min_\theta \max_w \; V(\theta, w) = \mathbb{E}_{x \sim p_d}[\log D_w(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D_w(G_\theta(z)))], \tag{1}$$

where $p_z$ is the source distribution and $p_d$ is the true data distribution.
In practice, finding the solution to (1) consists of local updates, e.g., SGD, to $\theta$ and $w$. This continues until 1) the objective has stabilized, 2) the generated data is judged qualitatively accurate, or 3) training has destabilized and appears irrecoverable, at which point training is restarted. The difficulty of training GANs has spurred research that includes reformulating the minimax objective Arjovsky et al. (2017); Mao et al. (2017); Mroueh and Sercu (2017); Mroueh et al. (2017); Nowozin et al. (2016); Uehara et al. (2016); Zhao et al. (2016), devising training heuristics Gulrajani et al. (2017); Karras et al. (2017); Salimans et al. (2016); Roth et al. (2017), proving the existence of equilibria Arora et al. (2017), and conducting local stability analyses Gidel et al. (2018); Mescheder et al. (2017, 2018); Nagarajan and Kolter (2017).
We acknowledge here that our algorithm, Crossing-the-Curl, was independently proposed in Balduzzi et al. (2018) as Symplectic Gradient Adjustment (SGA). In contrast to that work, this paper specifies a non-trivial application of this algorithm to LQ-GAN which obtains global convergence with high probability.
Recent work has studied a simplified setting, the Wasserstein LQ-GAN, where the generator is restricted to a linear function, the discriminator to a quadratic function, and the data distribution is Gaussian Feizi et al. (2017); Nagarajan and Kolter (2017). Follow-up research has shown that, in this setting, the optimal generator distribution is a rank-$k$ Gaussian containing the top-$k$ principal components of the data Feizi et al. (2017). Furthermore, it is shown that if the dimensionality of the latent variable matches that of the data, LQ-GAN is equivalent to maximum likelihood estimation of the generator’s resulting Gaussian distribution. To our knowledge, no GAN training algorithm with guaranteed convergence is currently known for this setting. We revisit the LQ-GAN in more detail in Section 4.
3 Convergence of Equilibrium Dynamics
In this section, we review Variational Inequalities (VIs) and compare them to the ODE Method leveraged in recent work Nagarajan and Kolter (2017). See A.1.2 and A.1.1 for a discussion of two additional theories. Throughout the paper, $\mathcal{X}$ refers to a convex set and $F$ refers to a vector field operator (or map) from $\mathcal{X}$ to $\mathbb{R}^n$, although many of the results for VIs apply to set-valued maps, e.g., subdifferentials, as well. Here, we will cover the basics of the theories and introduce select theorems when necessary later on.
3.1 Variational Inequalities
Variational Inequalities (VIs) are used to study equilibrium problems in a number of domains including mechanics, traffic networks, economics, and game theory Dafermos (1980); Facchinei and J. (2003); Hartman and Stampacchia (1966); Nagurney and Zhang (1996). The Variational Inequality problem, VI$(F, \mathcal{X})$, is to find an $x^*$ such that $\langle F(x^*), x - x^* \rangle \ge 0$ for all $x$ in the feasible set $\mathcal{X}$. Under mild conditions (see Appendix A.2), $x^*$ constitutes a Nash equilibrium point. For readers familiar with convex optimization, note the consistent similarity throughout this subsection for the case $F = \nabla f$. In game theory, $F$ often maps to the concatenation of the player gradients. For example, the map corresponding to the minimax game in Equation (1) is $F(\theta, w) = [\nabla_\theta V, -\nabla_w V]$.
A map, $F$, is monotone Aslam Noor (1998) if $\langle F(x) - F(y), x - y \rangle \ge 0$ for all $x, y \in \mathcal{X}$. Alternatively, if the Jacobian matrix of $F$ is positive semidefinite (PSD), then $F$ is monotone Nagurney and Zhang (1996); Schaible and Luc (1996). A matrix, $J$, is PSD if $z^\top J z \ge 0$ for all $z$, or equivalently, $J$ is PSD if its symmetric part, $(J + J^\top)/2$, has nonnegative eigenvalues.
As in convex optimization, a hierarchy of monotonicity exists. For all $x, y \in \mathcal{X}$, $F$ is

monotone iff $\langle F(x) - F(y), x - y \rangle \ge 0$, (2)

pseudomonotone iff $\langle F(y), x - y \rangle \ge 0 \implies \langle F(x), x - y \rangle \ge 0$,

and quasimonotone iff $\langle F(y), x - y \rangle > 0 \implies \langle F(x), x - y \rangle \ge 0$. (3)

If, in Equation (2), “$\ge 0$” is replaced by “$> 0$” (for $x \neq y$), then $F$ is strictly-monotone; if “$0$” is replaced by “$\mu \|x - y\|^2$”, then $F$ is $\mu$-strongly-monotone. If $F$ is a gradient, then replace monotone with convex.
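As a quick numerical illustration of the Jacobian criterion, the sketch below checks whether the symmetric part of a constant Jacobian is PSD; the matrices are hypothetical examples, not taken from the paper.

```python
import numpy as np

# F is monotone iff the symmetric part of its Jacobian is PSD (checked here for
# constant, hypothetical Jacobians).
def sym_part_psd(J):
    S = (J + J.T) / 2
    return bool(np.all(np.linalg.eigvalsh(S) >= -1e-12))

J_rotation = np.array([[0.0, 1.0], [-1.0, 0.0]])  # skew-symmetric: a bilinear game
J_gradient = np.array([[2.0, 0.0], [0.0, 3.0]])   # Hessian of a convex quadratic

print(sym_part_psd(J_rotation))   # True: pure rotation is (non-strictly) monotone
print(sym_part_psd(J_gradient))   # True: strongly monotone
print(sym_part_psd(-J_gradient))  # False: ascent field of a convex function
```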
Table 1 cites algorithms with convergence rates for several settings. Whereas gradient descent achieves optimal convergence rates for various convex optimization settings, extragradient Korpelevich (1977) achieves optimal rates for VIs. Results have been extended to the online learning setting as well Gemp and Mahadevan (2016, 2017).
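To make the extragradient idea concrete, here is a minimal sketch on a hypothetical bilinear field F(x, y) = (y, -x): a prediction step followed by a correction step that uses the field evaluated at the lookahead point.

```python
import numpy as np

# Extragradient (Korpelevich) on the hypothetical bilinear field F(x, y) = (y, -x).
def F(v):
    x, y = v
    return np.array([y, -x])

eta = 0.1
v = np.array([1.0, 1.0])
for _ in range(500):
    v_look = v - eta * F(v)   # prediction: a plain gradient step
    v = v - eta * F(v_look)   # correction: step along the lookahead field
print(np.linalg.norm(v) < 0.5)  # True: converges, unlike plain simultaneous descent
```

The lookahead is what lets the method "see" the rotation: for this field, plain simultaneous descent multiplies the distance to the equilibrium by a factor greater than one at every step.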
|Deterministic|Cavazzuti et al. (2002)|Nemirovski (2004); Cai et al. (2014); Juditsky et al. (2011)|Dang and Lan (2015)|
|Stochastic|Kannan and Shanbhag (2017)|Yousefian et al. (2014); Kannan and Shanbhag (2017); Juditsky et al. (2011)|Iusem et al. (2017)|
3.2 The ODE Method & Hurwitz Jacobians
Recently, Nagarajan and Kolter (2017) performed a local stability analysis of the gradient dynamics of Equation (1), proving that the Jacobian of $F$ evaluated at the equilibrium is Hurwitz¹, i.e., the real parts of its eigenvalues are strictly positive. This means that if simultaneous gradient descent using a “square-summable, not summable” step sequence enters an $\epsilon$-ball around the equilibrium with a low enough step size, it will converge to the equilibrium. This applies only in the deterministic setting because stochastic gradients can cause the iterates to exit this ball and diverge. Note that while the real parts of the eigenvalues reveal exponential growth or decay of trajectories, the imaginary parts reflect any rotation in the system².

¹Our definition of Hurwitz is equivalent to the more standard one stated for $-J$, whose eigenvalues must have strictly negative real parts Borkar (2008); Borkar and Meyn (2000); Khalil (1996).

²Linearized dynamical system: $\dot{x} = -Jx \implies x(t) = e^{-Jt}x(0)$; Euler’s formula: $e^{(a+bi)t} = e^{at}(\cos(bt) + i\sin(bt))$.
The Hurwitz and monotonicity properties are complementary (see A.8). To summarize, Hurwitz encompasses dynamics with exponentially stable trajectories and with arbitrary rotation, while monotonicity includes cycles (Jacobians with zero eigenvalues) and is similar to convex optimization.
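The eigenvalue distinction between the two properties can be seen on toy 2x2 Jacobians (hypothetical examples, not from the analysis above): a skew-symmetric Jacobian is monotone but not Hurwitz, while adding a small positive diagonal makes it Hurwitz.

```python
import numpy as np

# Toy 2x2 Jacobians (hypothetical): real parts of eigenvalues govern growth/decay,
# imaginary parts indicate rotation.
J_cycle = np.array([[0.0, 1.0], [-1.0, 0.0]])    # monotone but not Hurwitz: pure cycle
J_spiral = np.array([[0.1, 1.0], [-1.0, 0.1]])   # Hurwitz: rotation plus contraction

print(np.linalg.eigvals(J_cycle))        # purely imaginary eigenvalues: +/- 1j
print(np.linalg.eigvals(J_spiral).real)  # strictly positive real parts
```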
Given the preceding discussion, we believe VIs and monotone operator theory will serve as a strong foundation for deriving fundamental convergence results for GANs; this theory is
Similar to convexity, suggesting its adoption by the GAN community should be smooth,
Mature, with natural mechanisms for handling constraints, subdifferentials, and online scenarios,
Rich, with algorithms offering finite-sample convergence for a hierarchy of monotone operators.
Finally, we suggest Scutari et al. (2010) for a lucid comparison of convex optimization, game theory, and VIs.
4 The Wasserstein Linear Quadratic GAN
In the Wasserstein Linear-Quadratic GAN, the generator and discriminator are restricted to be linear and quadratic respectively: $G_\theta(z) = Az + b$ and $D_w(x) = x^\top W_2 x + w_1^\top x$. Equation (1) becomes the Wasserstein objective

$$\min_{A, b} \max_{W_2, w_1} \; \mathbb{E}_{x \sim p_d}[D_w(x)] - \mathbb{E}_{z \sim p_z}[D_w(G_\theta(z))]. \tag{4}$$

If $A$ is constrained to be lower triangular with positive diagonal, i.e., of Cholesky form, then the minimax solution is unique (see Proposition 9). The majority of this work focuses on the case where the latent variable and the data are 1-d distributions, in which case Equation (4) simplifies to its scalar form, Equation (5).
The map associated with this zero-sum game is constructed by concatenating the gradients of the two players’ losses.

In this section, we will derive our proposed technique, Crossing-the-Curl, motivated by an examination of a two-parameter subsystem of LQ-GAN, with the remaining parameters held fixed at their optimal values. The results discussed here hold for the N-dimensional case as well. The map associated with this subsystem is plotted in Figure 2 and formally stated in Equation (6).

The Jacobian of this map is not Hurwitz, and simultaneous gradient descent, defined in Equation (7), will diverge for this problem (see A.5). However, the map is monotone and Lipschitz. Table 1 offers an extragradient method (see Figure 2) with a convergence rate that is optimal for worst-case monotone maps.
Nevertheless, an algorithm that travels perpendicularly to the vector field will proceed directly to the equilibrium. The intuition is to travel in the direction that is perpendicular to both the field $F$ and the axis of rotation. For a 2-d system, the axis of rotation can be obtained by taking the curl of the vector field. To derive a direction perpendicular to both $F$ and the axis of rotation, we can take their cross product:
where the subscripted gradient is Feynman notation for differentiating with respect to the subscripted factor only, and the evaluation bar means the expression is evaluated at the given point. The scaling factor ensures the algorithm moves toward regions of “tighter cycles” and simplifies notation. It may be sensible to perform some linear combination of simultaneous gradient descent and Crossing-the-Curl, so we will also refer to this combined map in what follows.
Note that the fixed point of the new map remains the same as that of the original field. Furthermore, the reader may recognize the new map as the gradient of the function $\frac{1}{2}\|F\|^2$, which is strongly convex here, allowing a linear convergence rate in the deterministic setting. Crossing-the-Curl is derived from intuition in 2-d; however, we discuss reasons in the next subsection for why this approach generalizes to higher dimensions.
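A minimal sketch of this observation, assuming the simplest skew-symmetric affine field F(v) = Av as a hypothetical stand-in for the subsystem's map: preconditioning by the Jacobian transpose turns the rotational field into plain gradient descent on a strongly convex potential.

```python
import numpy as np

# Skew-symmetric affine field F(v) = A v (hypothetical stand-in for the subsystem).
# Preconditioning by J^T = A^T yields A^T A v, which is exactly the gradient of
# the strongly convex potential 0.5 * ||A v||^2, so plain descent on it converges.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])

eta = 0.1
v = np.array([1.0, 1.0])
for _ in range(200):
    v = v - eta * (A.T @ (A @ v))  # Crossing-the-Curl-style update for a skew field
print(np.linalg.norm(v) < 1e-6)  # True: geometric contraction to the equilibrium at 0
```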
5.1 Discussion & Relation to Other Methods
These equivalences occur because the Jacobian is skew-symmetric ($J = -J^\top$) for this subsystem. In the more general case, where the Jacobian is not necessarily skew-symmetric, Crossing-the-Curl represents a combination of the two techniques. Extragradient (EG) is key to solving VIs and the consensus algorithm has delivered impressive results for GANs, so this is promising for Crossing-the-Curl. To our knowledge, this map is novel and has not appeared in the Variational Inequality literature.
Crossing-the-Curl stands out in many ways though. Observe that in higher dimensions, the subspace orthogonal to $F$ is $(n-1)$-dimensional, which means Crossing-the-Curl is no longer the unique direction orthogonal to $F$. However, every matrix can be decomposed into a symmetric part with real eigenvalues, $S = (J + J^\top)/2$, and a skew-symmetric part with purely imaginary eigenvalues, $A = (J - J^\top)/2$. Notice that $A = 0$ for an optimization problem, where $J$ is the Hessian.³ It is the imaginary eigenvalues, i.e., rotation, that set equilibrium problems apart from optimization and necessitate the development of new algorithms like extragradient. It is reassuring that this matrix appears explicitly in Crossing-the-Curl. In addition, Crossing-the-Curl reduces to gradient descent when applied to an optimization problem, making the map agnostic to the type of problem at hand: optimization or equilibration.

³Assuming the objective function has continuous second partial derivatives; see Schwarz’s theorem.
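The symmetric/skew-symmetric decomposition is easy to verify numerically; the matrix below is a hypothetical example.

```python
import numpy as np

# Any square matrix splits into a symmetric part (real spectrum) and a
# skew-symmetric part (purely imaginary spectrum).
J = np.array([[1.0, 2.0], [0.0, 3.0]])
S = (J + J.T) / 2   # symmetric part
A = (J - J.T) / 2   # skew-symmetric part

print(np.allclose(S + A, J))                      # True: the split is exact
print(np.linalg.eigvalsh(S))                      # real eigenvalues
print(np.allclose(np.linalg.eigvals(A).real, 0))  # True: purely imaginary spectrum
```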
The curl also bears a close relation to the gradient: the gradient is applied to a scalar function, while the curl is crossed with a vector function. Furthermore, under mild conditions, every vector field $F$ admits a Helmholtz decomposition, $F = \nabla \phi + \nabla \times \Psi$, where $\phi$ is a scalar function and $\Psi$ is a vector function, suggesting the gradient and curl are both fundamental components.
Consider the perspective of Crossing-the-Curl as preconditioning the field by a skew-symmetric matrix. Preconditioning with a positive definite matrix dates back to Newton’s method and has reappeared in machine learning with the natural gradient Amari (1998). Dafermos (1983) considered asymmetric positive definite preconditioning matrices for VIs. Thomas (2014) extended the analysis of natural gradient to PSD matrices. We are not aware of any work using skew-symmetric matrices for preconditioning. The scalar $z^\top A z = 0$ for any skew-symmetric matrix $A$, so calling such a matrix PSD is not adequately descriptive.
Note that Crossing-the-Curl does not always improve convergence; this technique can transform a strongly-monotone field into a saddle and an unstable fixed point (non-monotone) into a strongly-monotone field (see A.9 for examples), so this technique should generally be used with caution.
Lastly, Crossing-the-Curl is inexpensive to compute. The Jacobian-vector product, $JF$, can be approximated accurately and efficiently with finite differences. Likewise, $J^\top F$ can be computed efficiently with double backprop Drucker and Le Cun (1992) by taking the gradient of $\frac{1}{2}\|F\|^2$. In total, three backprops are required: one for $F$, one for $JF$, and one for $J^\top F$.
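The finite-difference approximation of a Jacobian-vector product can be sketched as follows, using a hypothetical 2-d field (double backprop itself requires an autodiff framework, so only the finite-difference piece is shown):

```python
import numpy as np

# Finite-difference Jacobian-vector product: (J u) ~ (F(v + eps*u) - F(v)) / eps,
# computed without ever forming J (hypothetical 2-d field for illustration).
def F(v):
    x, y = v
    return np.array([y + 0.5 * x, -x])

def jvp_fd(F, v, u, eps=1e-6):
    return (F(v + eps * u) - F(v)) / eps

v = np.array([0.3, -0.7])
u = F(v)
J = np.array([[0.5, 1.0], [-1.0, 0.0]])  # analytic Jacobian of F, for checking only
print(np.allclose(jvp_fd(F, v, u), J @ u, atol=1e-4))  # True
```

This costs one extra evaluation of F per product, matching the claim that the overall procedure needs only a constant number of passes.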
In our analysis, we also consider the gradient regularization proposed in Nagarajan and Kolter (2017), the Unrolled GAN proposed in Metz et al. (2016), and alternating gradient descent, as well as any linear combination of the simultaneous-gradient, Crossing-the-Curl, and regularization maps, which forms a family of maps that includes each of the above as special cases.
Keep in mind that we are proposing this family as a generalization of Crossing-the-Curl. We state our main results here for the subsystem introduced above.
Any combination with at least one of the two leading coefficients positive and both non-negative is strongly-monotone. Also, its Jacobian is Hurwitz. See Proposition 13.
Several of the other maps considered are also strongly-monotone with Hurwitz Jacobians. See Proposition 1.
6 Analysis of the Full System
Here, we analyze the maps for each of the algorithms discussed above, testing for quasimonotonicity (the weakest monotone property) and whether the Jacobian is Hurwitz for the full LQ-GAN system.
Deciding quasiconvexity of 4th-degree polynomials has been proven strongly NP-hard Ahmadi et al. (2013). This implies that deciding monotonicity of 3rd-degree maps is strongly NP-hard. The original map contains quadratic terms, suggesting it may welcome a quasimonotone analysis; however, the remaining maps all contain 3rd-degree terms. Unsurprisingly, analyzing quasimonotonicity for these maps represents the most involved of our proofs, given in Appendix A.11.
The definition stated in (3) suggests checking the truth of an expression depending on four separate variables. While we used this definition for certain cases, the following alternate requirements proposed in Crouzeix and Ferland (1996) made the complete analysis of the system tractable. We restate simplified versions of the requirements we leveraged for convenience.
Consider the following conditions:
(A) For all $x \in \mathcal{X}$ and $v$ such that $\langle v, F(x) \rangle = 0$, we have $\langle v, J(x)\, v \rangle \ge 0$.
(B) For all and such that , we have that .
Theorem 1 (Crouzeix and Ferland (1996), Theorem 3).
Let be differentiable on the open convex set .
is quasimonotone on only if (A) holds, i.e. (A) is necessary but not sufficient.
is pseudomonotone on if (A) and (B) hold, i.e. (A) and (B) are sufficient but not necessary.
Condition (A) says that for a map to be quasimonotone, the map must be monotone along directions orthogonal to the vector field. In addition to this, condition (B) says that for a map to be pseudomonotone, the dynamics, $-F$, must not lead away from the equilibrium anywhere.
Equipped with these definitions, we can conclude the following:
6.1 Learning the Variance: The ()-Subsystem
Results from the previous section suggest that we cannot solve the full LQ-GAN directly, but given that we can solve the mean-learning subsystem, we shift focus to the variance-learning subsystem assuming the mean has already been learned exactly. We will revisit this assumption later.
We can conclude the following for the ()-subsystem:
No monotone map in this family exists for this subsystem. See Proposition 26.
These results are not purely theoretical. Figure 4 displays trajectories resulting from each of the maps.
We can further improve upon these maps by rescaling (see Equations (12)–(13) and (14)–(15) respectively). This results in strongly-monotone and strongly-convex systems respectively, improving the stochastic convergence rate. In deriving these results, we assumed the mean was given. We can relax this assumption and analyze the variance-learning subsystem under the assumption that the mean is “close enough”. Using a Hoeffding bound, we derive the number of iterations of the mean-learning stage required to ensure, with high probability, that the mean is accurate enough for the variance-learning subsystem to be strongly-monotone. Note that this approach of first learning the mean, then the variance retains the overall stochastic rate. We summarize the main points here.
A nonlinear scaling of the maps results in strictly-monotone and strongly-monotone subsystems respectively. See Proposition 29.
The candidate maps are not quasimonotone for the 2-d LQ-GAN system (with and without scaling). See Proposition 32.
Several takeaways emerge. One is that the stability of the system is highly dependent on the mean first being learned. In other words, batch norm is required for the monotonicity of LQ-GAN, so it is not surprising that GANs typically fail without these specialized layers.
Second is that stability is achieved by first learning a simple subsystem, (), then learning the more complex, ()-subsystem. This theoretically confirms the intuition behind progressive training of GANs Karras et al. (2017), which have generated the highest quality images to date.
Thirdly, because the Jacobian of the rescaled map is symmetric, we can integrate the map to discover the convex function it is implicitly descending via gradient descent. Compare this function to the KL-divergence: in contrast to the KL-divergence, it is convex in the variance parameter and may be a desirable alternative due to less extreme gradients near zero variance.
6.2 Learning the Covariance: The ()-Off-Diagonal Subsystem
After learning both the mean and variance of each dimension, the covariance of separate dimensions can be learned. Proposition A.14 in the Appendix states that the subsystem relevant to learning each row of the generator matrix is strictly monotone when all other rows are held fixed. In fact, the maps for these subsystems are affine and skew-symmetric just like the mean-learning subsystem. This implies that Crossing-the-Curl applied successively to each row of the generator matrix can solve for the full matrix; pseudocode for the resulting algorithm is presented in Algorithm 1 in Appendix A.15. Note that this procedure is reminiscent of the Cholesky–Banachiewicz algorithm, which computes the Cholesky factor row by row, beginning with the first row.
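For reference, the Cholesky–Banachiewicz recurrence mentioned above can be sketched as follows (a textbook implementation of the factorization, not the paper's Algorithm 1):

```python
import numpy as np

# Cholesky-Banachiewicz: solve L row by row; each row uses only rows already solved,
# mirroring the stagewise structure of the covariance-learning procedure.
def cholesky_rowwise(S):
    n = S.shape[0]
    L = np.zeros_like(S)
    for i in range(n):
        for j in range(i + 1):
            s = S[i, j] - L[i, :j] @ L[j, :j]  # subtract contribution of solved rows
            L[i, j] = np.sqrt(s) if i == j else s / L[j, j]
    return L

S = np.array([[4.0, 2.0], [2.0, 3.0]])
L = cholesky_rowwise(S)
print(np.allclose(L @ L.T, S))                # True: valid factorization
print(np.allclose(L, np.linalg.cholesky(S)))  # True: matches NumPy's Cholesky
```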
Our theoretical analysis proves convergence of the stagewise procedure using Crossing-the-Curl for the N-d LQ-GAN. Experiments solving the variance-learning subsystem alone for randomly generated problem instances support the analysis of Subsection 6.1; see the first row of Table 3. Two additional maps, not listed in the first row of the table, also converge on average with a constant step size. Our novel maps converge in a quarter of the iterations of the next best method, and their rescaled variants in nearly a quarter of the iterations of their parent counterparts. These experiments used analytical results of the expectations, i.e., the systems are deterministic.
The second and third rows of the table reveal that convergence slows considerably for higher dimensions. However, the stagewise procedure discussed in Subsection 6.2 is guaranteed to converge given the mean has been learned to a given accuracy. This procedure solves the 4-d deterministic LQ-GAN. For the 4-d stochastic LQ-GAN using two-sample minibatch estimates, this procedure converges within 100,000 iterations with a 0.75 success rate.
In this work, we performed the first global convergence analysis for a variety of GAN training algorithms. According to Variational Inequality theory, none of the current GAN training algorithms is globally convergent for the LQ-GAN. We proposed an intuitive technique, Crossing-the-Curl, with the first global convergence guarantees for any generative adversarial network. As a by-product of our analysis, we extract high-level explanations for why the use of batch norm and progressive training schedules for GANs are critical to training. In experiments with the multivariate LQ-GAN, Crossing-the-Curl achieves performance superior to any existing GAN training algorithm.
For future work, we will investigate alternate parameterizations of the discriminator. We will also work on devising heuristics for setting the coefficients of the combined map.
Crossing-the-Curl was independently proposed in Balduzzi et al. (2018) as Symplectic Gradient Adjustment (SGA). Like Crossing-the-Curl, this algorithm is motivated by attacking the challenges of rotation in differentiable games; however, it is derived by performing gradient descent on the Hamiltonian as opposed to generalizing a particular perpendicular direction selected from intuition in 2-d. Given the equivalence between SGA and Crossing-the-Curl, our work can also be viewed as proving that a non-trivial application of this algorithm can be used to solve the LQ-GAN. On the other hand, we have also proven in Proposition 7 that a naive application of this algorithm is insufficient for solving LQ-GAN, suggesting more research is required to understand and more efficiently solve this complex problem.
- Ahmadi et al. (2013) A. A. Ahmadi, A. Olshevsky, P. A. Parrilo, and J. N. Tsitsiklis. Np-hardness of deciding convexity of quartic polynomials and related problems. Mathematical Programming, 2013.
- Amari (1998) S. I. Amari. Natural gradient works efficiently in learning. Neural Computation, 1998.
- Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
- Arora et al. (2017) S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (gans). arXiv preprint arXiv:1703.00573, 2017.
- Aslam Noor (1998) M. Aslam Noor. Generalized set-valued variational inequalities. Le Matematiche, 1998.
- Balduzzi et al. (2018) David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. arXiv preprint arXiv:1802.05642, 2018.
- Basar and Olsder (1999) T. Basar and G. J. Olsder. Dynamic noncooperative game theory. SIAM, 1999.
- Borkar (2008) V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
- Borkar and Meyn (2000) V. S. Borkar and S. P. Meyn. The ode method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 2000.
- Cai et al. (2014) X. Cai, G. Gu, and B. He. On the o(1/t) convergence rate of the projection and contraction methods for variational inequalities with lipschitz continuous monotone operators. Computational Optimization and Applications, 2014.
- Cavazzuti et al. (2002) E. Cavazzuti, M. Pappalardo, and M. Passacantando. Nash equilibria, variational inequalities, and dynamical systems. Journal of Optimization Theory and Applications, 2002.
- Crouzeix and Ferland (1996) J. P. Crouzeix and J. A. Ferland. Criteria for differentiable generalized monotone maps. Mathematical Programming, 1996.
- Dafermos (1980) S. Dafermos. Traffic equilibria and variational inequalities. Transportation Science, 1980.
- Dafermos (1983) S. Dafermos. An iterative scheme for variational inequalities. Mathematical Programming, 1983.
- Dang and Lan (2015) C. D. Dang and G. Lan. On the convergence properties of non-euclidean extragradient methods for variational inequalities with generalized monotone operators. Computational Optimization and Applications, 2015.
- de Oliveira et al. (2017) L. de Oliveira, M. Paganini, and B. Nachman. Learning particle physics by example: location-aware generative adversarial networks for physics synthesis. Computing and Software for Big Science, 2017.
- Descartes (1886) René Descartes. La géométrie de René Descartes. A. Hermann, 1886.
- Drucker and Le Cun (1992) H. Drucker and Y. Le Cun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 1992.
- Even-Dar et al. (2009) E. Even-Dar, Y. Mansour, and U. Nadav. On the convergence of regret minimization dynamics in concave games. Proceedings of the 41st Annual ACM Symposium on Theory of Computing, 2009.
- Facchinei and J. (2003) F. Facchinei and J. Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer, 2003.
- Feizi et al. (2017) S. Feizi, C. Suh, F. Xia, and D. Tse. Understanding gans: the lqg setting. arXiv preprint arXiv:1710.10793, 2017.
- Friesz (2010) T. L. Friesz. Dynamic optimization and differential games. Springer Science & Business Media, 2010.
- Gemp and Mahadevan (2016) I. Gemp and S. Mahadevan. Online monotone optimization. arXiv preprint arXiv:1608.07888, 2016.
- Gemp and Mahadevan (2017) I. Gemp and S. Mahadevan. Online monotone games. arXiv preprint arXiv:1710.07328, 2017.
- Gidel et al. (2018) Gauthier Gidel, Hugo Berard, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial nets. arXiv preprint arXiv:1802.10551, 2018.
- Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
- Gordon et al. (2008) G. J. Gordon, A. Greenwald, and C. Marks. No-regret learning in convex games. In Proceedings of the 25th International Conference on Machine learning, 2008.
- Gulrajani et al. (2017) I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, 2017.
- Hartman and Stampacchia (1966) P. Hartman and G. Stampacchia. On some nonlinear elliptic differential functional equations. Acta Mathematica, 1966.
- Higham et al. (2000) D. J. Higham, A. R. Humphries, and R. J. Wain. Phase space error control for dynamical systems. SIAM Journal on Scientific Computing, 2000.
- Ho and Ermon (2016) J. Ho and S. Ermon. Generative adversarial imitation learning. arXiv preprint arXiv:1606.03476, 2016.
- Iusem et al. (2017) A. N. Iusem, A. Jofré, R. I. Oliveira, and P. Thompson. Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization, 2017.
- Juditsky et al. (2011) A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 2011.
- Kannan and Shanbhag (2017) A. Kannan and U. V. Shanbhag. Optimal stochastic extragradient schemes for pseudomonotone stochastic variational inequality problems and their variants. arXiv preprint arXiv:1410.1628, 2017.
- Karras et al. (2017) T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
- Khalil (1996) H. K. Khalil. Nonlinear Systems. Prentice-Hall, New Jersey, 1996.
- Korpelevich (1977) G. Korpelevich. The extragradient method for finding saddle points and other problems. 1977.
- Mao et al. (2017) X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. IEEE International Conference on Computer Vision (ICCV), 2017.
- Mescheder et al. (2017) L. Mescheder, S. Nowozin, and A. Geiger. The numerics of gans. In Advances in Neural Information Processing Systems, 2017.
- Mescheder et al. (2018) Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? In International Conference on Machine Learning, pages 3478–3487, 2018.
- Metz et al. (2016) L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
- Mroueh and Sercu (2017) Y. Mroueh and T. Sercu. Fisher gan. In Advances in Neural Information Processing Systems, 2017.
- Mroueh et al. (2017) Y. Mroueh, T. Sercu, and V. Goel. Mcgan: Mean and covariance feature matching gan. arXiv preprint arXiv:1702.08398, 2017.
- Nagarajan and Kolter (2017) V. Nagarajan and J. Z. Kolter. Gradient descent gan optimization is locally stable. In Advances in Neural Information Processing Systems, 2017.
- Nagurney and Zhang (1996) A. Nagurney and D. Zhang. Projected Dynamical Systems and Variational Inequalities with Applications. Kluwer Academic Press, 1996.
- Nemirovski (2004) A. Nemirovski. Prox-method with rate of convergence o(1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 2004.
- Nowozin et al. (2016) S. Nowozin, B. Cseke, and R. Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. arXiv preprint arXiv:1606.00709, 2016.
- Roth et al. (2017) Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, pages 2018–2028, 2017.
- Roughgarden (2009) T. Roughgarden. Intrinsic robustness of the price of anarchy. In Proceedings of the 41st annual ACM Symposium on Theory of Computing, 2009.
- Salimans et al. (2016) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. arXiv preprint arXiv:1606.03498, 2016.
- Schaible and Luc (1996) S. Schaible and D. Luc. Generalized monotone nonsmooth maps. Journal of Convex Analysis, 1996.
- Scutari et al. (2010) G. Scutari, D. P. Palomar, F. Facchinei, and J. Pang. Convex optimization, game theory, and variational inequality theory. IEEE Signal Processing Magazine, 2010.
- Thomas (2014) P. Thomas. Genga: A generalization of natural gradient ascent with positive and negative convergence results. In International Conference on Machine Learning, 2014.
- Uehara et al. (2016) M. Uehara, I. Sato, M. Suzuki, K. Nakayama, and Y. Matsuo. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920, 2016.
- Yousefian et al. (2014) F. Yousefian, A. Nedić, and U. V. Shanbhag. Optimal robust smoothing extragradient algorithms for stochastic variational inequality problems. In IEEE 53rd Annual Conference on Decision and Control (CDC), 2014.
- Zhao et al. (2016) J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
Appendix A Appendix
A.1 A Survey of Candidate Theories Continued
A.1.1 Algorithmic Game Theory
Algorithmic Game Theory (AGT) offers results on convergence to equilibria when a game, possibly online, is convex Gordon et al. (2008), socially-convex Even-Dar et al. (2009), or smooth Roughgarden (2009). A convex game is one in which each player’s loss is convex in that player’s own variables. A socially-convex game adds the additional requirements that 1) there exists a strict convex combination of the player losses that is convex and 2) each player’s loss is concave in the variables of each of the other players. In other words, the players as a whole are cooperative, yet individually competitive. Lastly, smoothness ensures that “the externality imposed on any one player by the actions of the others is bounded” Roughgarden (2009). In a zero-sum game such as (1), one player’s gain is exactly the other player’s loss, making smoothness an unlikely fit for studying GANs. See Gemp and Mahadevan (2017) for examples where the three properties above overlap with monotonicity in VIs.
A.1.2 Differential Games
Differential games Basar and Olsder (1999); Friesz (2010) consider dynamics more general than first-order ODEs; however, the focus is on systems that separate control and state variables. More specific to our interests, Differential Nash Games can be expressed as Differential VIs, a specific class of infinite dimensional VIs with explicit state dynamics and explicit controls; these, in turn, can be framed as infinite dimensional VIs without an explicit state.
A.2 Nash Equilibrium vs VI Solution
Repeated from Cavazzuti et al. (2002). Consider a cost minimization game with player cost functions and a feasible set, and let $x^*$ be a Nash equilibrium. Then $x^*$ satisfies a variational inequality posed over the internal cone at $x^*$. When each player’s cost is pseudoconvex in that player’s own variable, this condition is also sufficient. Note that this is implied if the game map is pseudomonotone, i.e., pseudomonotonicity of the map is a stronger condition.
A.3 Table of Maps Considered in Analysis
All maps corresponding to the mean-learning subsystem in Table 4 maintain the desired unique fixed point.
For the variance-learning subsystem, all maps except certain coefficient settings of the combined family maintain the desired unique fixed point; those settings introduce an additional spurious fixed point.
One of the maps above is a special case of the combined family for a particular setting of the coefficients.
A.4 Minimax Solution to Constrained Multivariate LQ-GAN is Unique
If the discriminator’s quadratic term is constrained to be symmetric and the generator matrix is constrained to be of Cholesky form, i.e., lower triangular with positive diagonal, then the minimax solution to Equation (5) is unique: the generator matrix equals the unique non-negative (Cholesky) square root of the data covariance.
Taking derivatives and setting them equal to zero, we find that the fixed point in the interior is unique.