1 Introduction
When minimizing a function f(x) over x, it is well known that f decreases fastest if x moves in the direction of the negative gradient, −∇f(x). In addition, any direction orthogonal to ∇f(x) leaves f unchanged. In this work, we show that these orthogonal directions, which are ignored by gradient descent, can be critical in equilibrium problems, which are central to game theory. If each player in a game updates its parameters by descending its own gradient, the joint state can follow a cyclical trajectory, similar to a person riding a merry-go-round (see Figure 1). This toy scenario closely reflects an aspect of training for a particular machine learning model discussed below, and is depicted more technically later on in Figure 2. To arrive at the equilibrium point, a person riding the merry-go-round should walk perpendicularly to their direction of travel, taking them directly to the center.

Equilibrium problems have drawn heightened attention in machine learning due to the emergence of the Generative Adversarial Network (GAN) Goodfellow et al. (2014). GANs have served a variety of applications including generating novel images Karras et al. (2017), simulating particle physics de Oliveira et al. (2017), and imitating expert policies in reinforcement learning Ho and Ermon (2016). Despite this plethora of successes, GAN training remains heuristic.
Deep learning has benefited from an understanding of simpler, more fundamental techniques. For example, multinomial logistic regression formulates learning a multi-class classifier as minimizing the cross-entropy of a log-linear model where class probabilities are recovered via a softmax. The minimization problem is convex and is solved efficiently with guarantees using stochastic gradient descent (SGD). Unsurprisingly, the majority of deep classifiers incorporate a softmax at the final layer, minimize a cross-entropy loss, and train with a variant of SGD. This progression from logistic regression to classification with deep neural nets is not mirrored in GANs. In contrast, from their inception, GANs were architected with deep nets. Only recently has the Wasserstein Linear Quadratic GAN (LQGAN) Feizi et al. (2017); Nagarajan and Kolter (2017) been proposed as a minimal model for understanding GANs.

In this work, we analyze the convergence of several GAN training algorithms in the LQGAN setting. We survey several candidate theories for understanding convergence in GANs, naturally leading us to select Variational Inequalities, an intuitive generalization of the widely relied-upon theories from convex optimization. According to our analyses, none of the current GAN training algorithms is globally convergent in this setting. We propose a new technique, Crossing-the-Curl, for training GANs that converges with high probability in the N-dimensional (N-d) LQGAN setting.
This work makes the following contributions (proofs can be found in the supplementary material):

- The first global convergence analysis of several GAN training methods for the N-d LQGAN,
- Crossing-the-Curl, the first technique with stochastic convergence for the N-d LQGAN,
- An empirical demonstration of Crossing-the-Curl in the multivariate LQGAN setting as well as some common neural-network-driven settings in Appendix A.16.
2 Generative Adversarial Networks
The Generative Adversarial Network (GAN) Goodfellow et al. (2014) formulates learning a generative model of data as finding a Nash equilibrium of a minimax game. The generator (the min player) aims to synthesize realistic data samples by transforming vectors drawn from a fixed source distribution, e.g., a standard Gaussian. The discriminator (the max player) attempts to learn a scoring function that assigns low scores to synthetic data and high scores to samples drawn from the true dataset. The generator's transformation function and the discriminator's scoring function are typically chosen to be neural networks parameterized by their respective weights. The minimax objective of the original GAN Goodfellow et al. (2014) is

min_G max_D E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))],  (1)

where p_z is the source distribution, p_data is the true data distribution, and G and D denote the generator and discriminator functions.

In practice, finding the solution to (1) consists of local updates, e.g., SGD, to the generator and discriminator parameters. This continues until 1) the objective has stabilized, 2) the generated data is judged qualitatively accurate, or 3) training has destabilized and appears irrecoverable, at which point training is restarted. The difficulty of training GANs has spurred research that includes reformulating the minimax objective Arjovsky et al. (2017); Mao et al. (2017); Mroueh and Sercu (2017); Mroueh et al. (2017); Nowozin et al. (2016); Uehara et al. (2016); Zhao et al. (2016), devising training heuristics Gulrajani et al. (2017); Karras et al. (2017); Salimans et al. (2016); Roth et al. (2017), proving the existence of equilibria Arora et al. (2017), and conducting local stability analyses Gidel et al. (2018); Mescheder et al. (2017, 2018); Nagarajan and Kolter (2017).
We acknowledge here that our algorithm, Crossing-the-Curl, was independently proposed in Balduzzi et al. (2018) as Symplectic Gradient Adjustment (SGA). In contrast to that work, this paper specifies a nontrivial application of this algorithm to the LQGAN which obtains global convergence with high probability.
Recent work has studied a simplified setting, the Wasserstein LQGAN, where the generator is a linear function, the discriminator is a quadratic function, and the data distribution is Gaussian Feizi et al. (2017); Nagarajan and Kolter (2017). Follow-up research has shown that, in this setting, the optimal generator distribution is a low-rank Gaussian containing the top principal components of the data Feizi et al. (2017). Furthermore, it is shown that if the dimensionality of the latent source matches that of the data, LQGAN is equivalent to maximum likelihood estimation of the generator's resulting Gaussian distribution. To our knowledge, no GAN training algorithm with guaranteed convergence is currently known for this setting. We revisit the LQGAN in more detail in Section 4.

3 Convergence of Equilibrium Dynamics
In this section, we review Variational Inequalities (VIs) and compare them to the ODE Method leveraged in recent work Nagarajan and Kolter (2017). See A.1.2 and A.1.1 for a discussion of two additional theories. Throughout the paper, X refers to a convex set and F refers to a vector field operator (or map) from X to ℝⁿ, although many of the results for VIs apply to set-valued maps, e.g., subdifferentials, as well. Here, we will cover the basics of the theories and introduce select theorems when necessary later on.
3.1 Variational Inequalities
Variational Inequalities (VIs) are used to study equilibrium problems in a number of domains including mechanics, traffic networks, economics, and game theory Dafermos (1980); Facchinei and Pang (2003); Hartman and Stampacchia (1966); Nagurney and Zhang (1996). The Variational Inequality problem, VI(F, X), is to find an x* ∈ X such that ⟨F(x*), x − x*⟩ ≥ 0 for all x in the feasible set X. Under mild conditions (see Appendix A.2), x* constitutes a Nash equilibrium point. For readers familiar with convex optimization, note the consistent similarity throughout this subsection for when F = ∇f. In game theory, F often maps to the set of player gradients. For example, the map corresponding to the minimax game in Equation (1) is formed by stacking each player's gradient of its own loss with respect to its own parameters.

A map, F, is monotone Aslam Noor (1998) if ⟨F(x) − F(y), x − y⟩ ≥ 0 for all x and y in X. Alternatively, if the Jacobian matrix of F is positive semidefinite (PSD), then F is monotone Nagurney and Zhang (1996); Schaible and Luc (1996). A matrix, A, is PSD if x⊤Ax ≥ 0 for all x, or equivalently, A is PSD if its symmetric part, (A + A⊤)/2, has no negative eigenvalues.
As in convex optimization, a hierarchy of monotonicity exists. For all x and y in X, F is

monotone iff ⟨F(x) − F(y), x − y⟩ ≥ 0,  (2)
pseudomonotone iff ⟨F(y), x − y⟩ ≥ 0 implies ⟨F(x), x − y⟩ ≥ 0,
and quasimonotone iff ⟨F(y), x − y⟩ > 0 implies ⟨F(x), x − y⟩ ≥ 0.  (3)

If, in Equation (2), "≥" is replaced by ">" for x ≠ y, then F is strictly-monotone; if the right-hand side 0 is replaced by μ‖x − y‖² for some μ > 0, then F is strongly-monotone. If F is a gradient, then replace monotone with convex.
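To make the hierarchy concrete, monotonicity of a differentiable map can be checked numerically via the positive semidefiniteness of the symmetrized Jacobian mentioned above. The following sketch is ours, not the paper's; `jacobian` and `is_monotone_at` are illustrative helper names:

```python
import numpy as np

def jacobian(F, x, eps=1e-6):
    """Central-difference Jacobian of a map F: R^n -> R^n."""
    n = x.size
    J = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        J[:, i] = (F(x + e) - F(x - e)) / (2 * eps)
    return J

def is_monotone_at(F, x, tol=1e-6):
    """A differentiable map is monotone near x iff the symmetric part of
    its Jacobian, (J + J^T)/2, is positive semidefinite there."""
    J = jacobian(F, x)
    S = 0.5 * (J + J.T)
    return bool(np.all(np.linalg.eigvalsh(S) >= -tol))

# The cyclic field F(x, y) = (y, -x) of a bilinear game: its Jacobian is
# skew-symmetric, so its symmetric part is zero -- monotone, but only just.
F_cycle = lambda v: np.array([v[1], -v[0]])
# The gradient of the strongly convex function ||v||^2 / 2 is strongly monotone.
F_grad = lambda v: v
```

Note that the cyclic field passes the test with a zero symmetric part, matching the discussion above that monotonicity includes pure cycles.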
Table 1 cites algorithms with convergence rates for several settings. Whereas gradient descent achieves optimal convergence rates for various convex optimization settings, extragradient Korpelevich (1977) achieves optimal rates for VIs. Results have been extended to the online learning setting as well Gemp and Mahadevan (2016, 2017).
 | Strongly-Monotone | (Smooth/Sharp+)Monotone | Pseudomonotone
Deterministic | Cavazzuti et al. (2002) | Nemirovski (2004); Cai et al. (2014); Juditsky et al. (2011) | Dang and Lan (2015)
Stochastic | Kannan and Shanbhag (2017) | Yousefian et al. (2014); Kannan and Shanbhag (2017); Juditsky et al. (2011) | Iusem et al. (2017)
3.2 The ODE Method & Hurwitz Jacobians
Recently, Nagarajan and Kolter (2017) performed a local stability analysis of the gradient dynamics of Equation (1), proving that the Jacobian of the map evaluated at the equilibrium is Hurwitz^1, i.e., the real parts of its eigenvalues are strictly positive. This means that if simultaneous gradient descent using a "square-summable, not summable" step size sequence enters an ε-ball around the equilibrium with a low enough step size, it will converge to the equilibrium. This applies only in the deterministic setting because stochastic gradients can cause the iterates to exit this ball and diverge. Note that while the real parts of the eigenvalues reveal exponential growth or decay of trajectories, the imaginary parts reflect any rotation in the system^2.

The Hurwitz and monotonicity properties are complementary (see A.8). To summarize, Hurwitz encompasses dynamics with exponentially stable trajectories and with arbitrary rotation, while monotonicity includes cycles (Jacobians with zero eigenvalues) and is similar to convex optimization.

^1 Our definition of Hurwitz is equivalent to the more standard one in which the eigenvalues have strictly negative real parts Borkar (2008); Borkar and Meyn (2000); Khalil (1996).
^2 Linearized dynamical system solutions scale as e^{λt}; by Euler's formula, e^{(a+bi)t} = e^{at}(cos bt + i sin bt).
Given the preceding discussion, we believe VIs and monotone operator theory will serve as a strong foundation for deriving fundamental convergence results for GANs; this theory is

- Similar to convexity, suggesting its adoption by the GAN community should be smooth,
- Mature, with natural mechanisms for handling constraints, subdifferentials, and online scenarios,
- Rich with algorithms with finite-sample convergence for a hierarchy of monotone operators.

Finally, we suggest Scutari et al. (2010) for a lucid comparison of convex optimization, game theory, and VIs.
4 The Wasserstein Linear Quadratic GAN
In the Wasserstein Linear Quadratic GAN, the generator and discriminator are restricted to be linear and quadratic functions respectively. Equation (1) becomes

(4)

If the generator's weight matrix is constrained to be lower triangular with positive diagonal, i.e., of Cholesky form, then the minimax solution is unique (see Proposition 9). The majority of this work focuses on the case where the source and data are 1-d distributions. Equation (4) simplifies to

(5)

The map associated with this zero-sum game is constructed by concatenating the gradients of the two players' losses:
5 Crossing-the-Curl
In this section, we will derive our proposed technique, Crossing-the-Curl, motivated by an examination of the ()-subsystem of LQGAN, i.e., with the remaining parameters held fixed at the solution. The results discussed here hold for the N-dimensional case as well. The map associated with this subsystem is plotted in Figure 2 and formally stated in Equation (6).

The Jacobian of this map is not Hurwitz, and simultaneous gradient descent, defined in Equation (7), will diverge for this problem (see A.5). However, the map is monotone and Lipschitz. Table 1 offers an extragradient method (see Figure 2) with a convergence rate that is optimal for worst-case monotone maps.
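To illustrate the contrast between simultaneous gradient descent and the extragradient lookahead, here is a minimal numerical sketch on a generic bilinear game with cyclic field F(x, y) = (y, −x); the example and names are ours, not the paper's:

```python
import numpy as np

F = lambda v: np.array([v[1], -v[0]])  # cyclic field of the bilinear game

def step_gd(v, lr):
    """Simultaneous gradient descent: spirals outward on this field."""
    return v - lr * F(v)

def step_eg(v, lr):
    """Extragradient (Korpelevich, 1977): evaluate F at a lookahead point."""
    v_half = v - lr * F(v)     # extrapolation step
    return v - lr * F(v_half)  # update using the lookahead gradient

v0 = np.array([1.0, 1.0])
v_gd, v_eg = v0, v0
for _ in range(200):
    v_gd = step_gd(v_gd, 0.1)
    v_eg = step_eg(v_eg, 0.1)
```

On this field, each gradient descent step multiplies the distance to the equilibrium by √(1 + lr²) > 1, while the extragradient step contracts it, which is the behavior the divergence discussion above describes.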
Nevertheless, an algorithm that travels perpendicularly to the vector field will proceed directly to the equilibrium. The intuition is to travel in the direction that is perpendicular to both the field and the axis of rotation. For a 2-d system, the axis of rotation can be obtained by taking the curl of the vector field. To derive a direction perpendicular to both the field and the axis of rotation, we can take their cross product, where the subscripted gradient is Feynman notation for differentiation with respect to the indicated argument only, and the bar means the expression is evaluated at the current point. The leading factor ensures the algorithm moves toward regions of "tighter cycles" and simplifies notation. It may be sensible to perform some linear combination of simultaneous gradient descent and Crossing-the-Curl, so we will also consider such combinations below.

Note that the fixed point of the adjusted field remains the same as that of the original field. Furthermore, the reader may recognize the adjusted map as the gradient of a strongly convex function, allowing a linear convergence rate in the deterministic setting. Crossing-the-Curl is derived from intuition in 2-d; however, we discuss reasons in the next subsection for why this approach generalizes to higher dimensions.
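The 2-d intuition can be sketched numerically. The form below follows the SGA view noted in Section 2 (preconditioning the field by the transpose of the antisymmetric part of its Jacobian, which coincides with Crossing-the-Curl when the Jacobian is skew-symmetric); the helper name `crossing_the_curl` and the bilinear example are ours:

```python
import numpy as np

# Bilinear game min_x max_y xy: the simultaneous-descent field is F(v) = (y, -x).
R = np.array([[0.0, 1.0], [-1.0, 0.0]])
F = lambda v: R @ v

def crossing_the_curl(v):
    """Adjusted direction A^T F(v), with A the antisymmetric Jacobian part.

    For this field the Jacobian J = R is already skew-symmetric, so A = J and
    A^T F(v) = R^T R v = v, which points along v; descending it moves the
    iterate straight toward the equilibrium at the origin.
    """
    J = R                 # Jacobian of F (constant for a linear field)
    A = 0.5 * (J - J.T)   # antisymmetric (rotational) part
    return A.T @ F(v)

v = np.array([1.0, 1.0])
for _ in range(100):
    v = v - 0.1 * crossing_the_curl(v)  # contracts: v <- 0.9 v
```

Whereas descending F alone circles the origin, descending the crossed direction proceeds directly to the equilibrium, matching the merry-go-round picture from the introduction.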
5.1 Discussion & Relation to Other Methods
These equivalences occur because the Jacobian, J, is skew-symmetric (J⊤ = −J) for the ()-subsystem. In the more general case, where J is not necessarily skew-symmetric, Crossing-the-Curl represents a combination of the two techniques. Extragradient (EG) is key to solving VIs and the consensus algorithm has delivered impressive results for GANs, so this is promising for Crossing-the-Curl. To our knowledge, this map is novel and has not appeared in the Variational Inequality literature.

Crossing-the-Curl stands out in many ways though. Observe that in higher dimensions, the subspace orthogonal to the field is (n − 1)-dimensional, which means the crossed direction is no longer the unique direction orthogonal to the field. However, every matrix J can be decomposed into a symmetric part with real eigenvalues, S = (J + J⊤)/2, and a skew-symmetric part with purely imaginary eigenvalues, A = (J − J⊤)/2. Notice that A = 0 for an optimization problem, where J is the Hessian.^3 It is the imaginary eigenvalues, i.e., rotation, that set equilibrium problems apart from optimization and necessitate the development of new algorithms like extragradient. It is reassuring that this matrix appears explicitly in Crossing-the-Curl. In addition, the adjusted map reduces to gradient descent when applied to an optimization problem, making it agnostic to the type of problem at hand: optimization or equilibration.

^3 Assuming the objective function has continuous second partial derivatives; see Schwarz's theorem.

The curl also shares a close relation to the gradient. The gradient is applied to a scalar function and the curl is crossed with a vector function. Furthermore, under mild conditions, every vector field admits a Helmholtz decomposition into the gradient of a scalar function plus the curl of a vector function, suggesting the gradient and curl are both fundamental components.

Consider the perspective of Crossing-the-Curl as preconditioning by a skew-symmetric matrix. Preconditioning with a positive definite matrix dates back to Newton's method and has reappeared in machine learning with natural gradient Amari (1998). Dafermos (1983) considered asymmetric positive definite preconditioning matrices for VIs. Thomas (2014) extended the analysis of natural gradient to PSD matrices. We are not aware of any work using skew-symmetric matrices for preconditioning. The scalar x⊤Ax = 0 for any skew-symmetric matrix A, so calling A a PSD matrix is not adequately descriptive.
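The quadratic-form identity behind this last remark is quickly verified numerically (a generic check, ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((4, 4))  # arbitrary square matrix
A = 0.5 * (J - J.T)              # its skew-symmetric part: A^T = -A
x = rng.standard_normal(4)

# x^T A x = 0 for every skew-symmetric A, because the scalar equals its own
# transpose: x^T A x = (x^T A x)^T = x^T A^T x = -x^T A x.
quad = x @ A @ x
```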
Note that Crossing-the-Curl does not always improve convergence; this technique can transform a strongly-monotone field into a saddle and an unstable fixed point (non-monotone) into a strongly-monotone field (see A.9 for examples), so this technique should generally be used with caution.
Lastly, Crossing-the-Curl is inexpensive to compute. The Jacobian-vector product can be approximated accurately and efficiently with finite differences. Likewise, it can be computed efficiently with double backprop Drucker and Le Cun (1992) by taking the gradient of an inner product with the map. In total, three backprops are required: one for the player gradients and one for each Jacobian-vector term.
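The finite-difference approximation of the Jacobian-vector product can be sketched as follows (a generic illustration with a hypothetical helper `jvp_fd`, not the paper's implementation):

```python
import numpy as np

def jvp_fd(F, w, v, eps=1e-5):
    """Approximate the Jacobian-vector product J(w) v of a map F using
    central differences: (F(w + eps v) - F(w - eps v)) / (2 eps)."""
    return (F(w + eps * v) - F(w - eps * v)) / (2 * eps)

# Sanity check on a field with a known Jacobian: F(w) = M w  =>  J(w) v = M v.
M = np.array([[2.0, 1.0], [-1.0, 3.0]])
F = lambda w: M @ w
w0 = np.array([0.5, -0.2])
v0 = np.array([1.0, 2.0])
approx = jvp_fd(F, w0, v0)
```

Only two extra evaluations of the map are needed per product, which is what makes the adjustment cheap relative to forming the full Jacobian.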
In our analysis, we also consider the gradient regularization proposed in Nagarajan and Kolter (2017), the Unrolled GAN proposed in Metz et al. (2016), and alternating gradient descent, as well as any linear combination of these maps with simultaneous gradient descent and Crossing-the-Curl, which forms a family of maps that includes the above as special cases.

Keep in mind that we are proposing this family as a generalization of Crossing-the-Curl. We state our main results here for the ()-subsystem.
Proposition 1.
For any combination with at least one of the two coefficients positive and both nonnegative, the resulting map is strongly monotone. Also, its Jacobian is Hurwitz. See Proposition 13.
Corollary 1.
Each of the maps above, with appropriate coefficients, is strongly-monotone with a Hurwitz Jacobian. See Proposition 1.
6 Analysis of the Full System
Here, we analyze the maps for each of the algorithms discussed above, testing for quasimonotonicity (the weakest monotone property) and whether the Jacobian is Hurwitz for the full LQGAN system.
Deciding quasiconvexity of 4th-degree polynomials has been proven strongly NP-hard Ahmadi et al. (2013). This implies that deciding monotonicity of 3rd-degree maps is strongly NP-hard. The original map contains quadratic terms, suggesting it may welcome a quasimonotone analysis; however, the remaining maps all contain 3rd-degree terms. Unsurprisingly, analyzing quasimonotonicity for these maps represents the most involved of our proofs, given in Appendix A.11.

The definition stated in (3) suggests checking the truth of an expression depending on four separate variables. While we used this definition for certain cases, the following alternate requirements proposed in Crouzeix and Ferland (1996) made the complete analysis of the system tractable. We restate simplified versions of the requirements we leveraged for convenience.
Consider the following conditions:

(A) For all x and v such that ⟨F(x), v⟩ = 0, we have ⟨v, ∇F(x) v⟩ ≥ 0.

(B) For all x and x* such that F(x*) = 0, we have ⟨F(x), x − x*⟩ ≥ 0.

Theorem 1 (Crouzeix and Ferland (1996), Theorem 3).
Let F be differentiable on the open convex set X.

(a) F is quasimonotone on X only if (A) holds, i.e., (A) is necessary but not sufficient.

(b) F is pseudomonotone on X if (A) and (B) hold, i.e., (A) and (B) are sufficient but not necessary.

Condition (A) says that for a map to be quasimonotone, the map must be monotone along directions orthogonal to the vector field. In addition to this, condition (B) says that for a map to be pseudomonotone, the dynamics must not lead away from the equilibrium anywhere.
Equipped with these definitions, we can conclude the following:
Proposition 3.
Proposition 4.
6.1 Learning the Variance: The ()-Subsystem
Results from the previous section suggest that we cannot solve the full LQGAN, but given that we can solve the ()-subsystem, we shift focus to the ()-subsystem assuming the mean has already been learned exactly. We will revisit this assumption later.

We can conclude the following for the ()-subsystem:
Proposition 5.
Proposition 6.
Proposition 7.
No monotone member of the family exists. See Proposition 26.
These results are not purely theoretical. Figure 4 displays trajectories resulting from each of the maps.
We can further improve upon the two convergent maps by rescaling them, as in Equations (12)–(13) and (14)–(15) respectively. This results in strongly-monotone and strongly-convex systems respectively, improving the stochastic convergence rate. In deriving these results, we assumed the mean was given. We can relax this assumption and analyze the ()-subsystem under the assumption that the mean is "close enough". Using a Hoeffding bound, we find a bound on the number of iterations required to achieve a desired probability of the mean being accurate enough to ensure the ()-subsystem is strongly-monotone. Note that this approach of first learning the mean, then the variance, retains the overall stochastic rate. We summarize the main points here.

Claim 1.
A nonlinear scaling of the two maps results in strictly monotone and strongly monotone subsystems respectively. See Proposition 29.
Claim 2.
Proposition 8.
Neither map is quasimonotone for the 2-d LQGAN system (with and without scaling). See Proposition 32.
Several takeaways emerge. One is that the stability of the system is highly dependent on the mean first being learned. In other words, batch norm is required for the monotonicity of LQGAN, so it is not surprising that GANs typically fail without these specialized layers.

Second is that stability is achieved by first learning a simple subsystem, then learning the more complex ()-subsystem. This theoretically confirms the intuition behind progressive training of GANs Karras et al. (2017), which has generated the highest quality images to date.
Thirdly, because the rescaled map is symmetric, we can integrate it to discover the convex function it is implicitly descending via gradient descent. Compare this to the KL-divergence. In contrast to the KL, the recovered function is convex and may be a desirable alternative due to less extreme gradients near the boundary of the domain.
Subsystem (M = monotone, PM = pseudomonotone, sM = strictly-monotone, SC = strongly-convex, H = Hurwitz, NA = not applicable)
() | M, | M, | M, | M,H | SC,H | SC,H | SC,H | NA | NA
() | , | , | , | , | , | PM, | PM, | sM,H | SC,H
6.2 Learning the Covariance: The ()-Off-Diagonal Subsystem
After learning both the mean and variance of each dimension, the covariance of separate dimensions can be learned. Proposition A.14 in the Appendix states that the subsystem relevant to learning each row of the generator's Cholesky factor is strictly monotone when all other rows are held fixed. In fact, the maps for these subsystems are affine and skew-symmetric just like the ()-subsystem. This implies that Crossing-the-Curl applied successively to each row can solve for the full factor; pseudocode is presented in Algorithm 1 in Appendix A.15. Note that this procedure is reminiscent of the Cholesky–Banachiewicz algorithm, which computes the Cholesky factor row by row, beginning with the first row.
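For reference, the Cholesky–Banachiewicz recurrence that the procedure is compared to computes L row by row from previously finished rows; a minimal sketch of that classical algorithm (not the paper's Algorithm 1):

```python
import numpy as np

def cholesky_banachiewicz(S):
    """Row-by-row Cholesky factorization: returns lower-triangular L with
    L L^T = S, where each row i depends only on rows 0..i already computed."""
    n = S.shape[0]
    L = np.zeros_like(S, dtype=float)
    for i in range(n):
        for j in range(i + 1):
            s = S[i, j] - L[i, :j] @ L[j, :j]
            L[i, j] = np.sqrt(s) if i == j else s / L[j, j]
    return L

S = np.array([[4.0, 2.0], [2.0, 3.0]])  # a symmetric positive definite matrix
L = cholesky_banachiewicz(S)
```

The row-by-row dependency structure mirrors the stagewise scheme above: solving each row's subsystem only requires the earlier rows to be fixed.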
7 Experiments
Our theoretical analysis proves convergence of the stagewise procedure using Crossing-the-Curl for the N-d LQGAN. Experiments solving the ()-subsystem alone for randomly generated problem instances support the analysis of Subsection 6.1; see the first row of Table 3. Not listed in the first row of the table are two additional maps, which also converge on average with a constant step size. Our novel maps converge in a quarter of the iterations of the next best method, and the rescaled variants in nearly a quarter of their parent counterparts. These experiments used analytical results of the expectations, i.e., the systems are deterministic.
Dim (success rates in parentheses)
1 (2) | (0) | (0.4) | (0.94) | (1) | (1) | (1)
2 (6) | (0) | (0.05) | (0.68) | (1) | (1) | (1)
4 (10) | (0) | (0.01) | (0.23) | (0.7) | (0.67) | (0.68)
The second and third rows of the table reveal that convergence slows considerably in higher dimensions. However, the stagewise procedure discussed in Subsection 6.2 is guaranteed to converge given the mean has been learned to a given accuracy. This procedure solves the 4-d deterministic LQGAN with a high success rate. For the 4-d stochastic LQGAN using two-sample minibatch estimates, this procedure achieves a low error in 100,000 iterations with a 0.75 success rate.
8 Conclusion
In this work, we performed the first global convergence analysis for a variety of GAN training algorithms. According to Variational Inequality theory, none of the current GAN training algorithms is globally convergent for the LQGAN. We proposed an intuitive technique, Crossing-the-Curl, with the first global convergence guarantees for any generative adversarial network. As a byproduct of our analysis, we extract high-level explanations for why the use of batch norm and progressive training schedules for GANs is critical to training. In experiments with the multivariate LQGAN, Crossing-the-Curl achieves performance superior to any existing GAN training algorithm.

For future work, we will investigate alternate parameterizations of the discriminator. We will also work on devising heuristics for setting the coefficients of the combined map.
9 Acknowledgments
Crossing-the-Curl was independently proposed in Balduzzi et al. (2018) as Symplectic Gradient Adjustment (SGA). Like Crossing-the-Curl, that algorithm is motivated by attacking the challenges of rotation in differentiable games; however, it is derived by performing gradient descent on the Hamiltonian as opposed to generalizing a particular perpendicular direction selected from intuition in 2-d. Given the equivalence between SGA and Crossing-the-Curl, our work can also be viewed as proving that a nontrivial application of this algorithm can be used to solve the LQGAN. On the other hand, we have also proven in Proposition 7 that a naive application of this algorithm is insufficient for solving the LQGAN, suggesting more research is required to understand and more efficiently solve this complex problem.
References
 Ahmadi et al. (2013) A. A. Ahmadi, A. Olshevsky, P. A. Parrilo, and J. N. Tsitsiklis. NP-hardness of deciding convexity of quartic polynomials and related problems. Mathematical Programming, 2013.
 Amari (1998) S. I. Amari. Natural gradient works efficiently in learning. Neural Computation, 1998.
 Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
 Arora et al. (2017) S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (gans). arXiv preprint arXiv:1703.00573, 2017.
 Aslam Noor (1998) M. Aslam Noor. Generalized setvalued variational inequalities. Le Matematiche, 1998.
 Balduzzi et al. (2018) David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of nplayer differentiable games. arXiv preprint arXiv:1802.05642, 2018.
 Basar and Olsder (1999) T. Basar and G. J. Olsder. Dynamic noncooperative game theory. SIAM, 1999.
 Borkar (2008) V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
 Borkar and Meyn (2000) V. S. Borkar and S. P. Meyn. The ode method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 2000.
 Cai et al. (2014) X. Cai, G. Gu, and B. He. On the o(1/t) convergence rate of the projection and contraction methods for variational inequalities with lipschitz continuous monotone operators. Computational Optimization and Applications, 2014.
 Cavazzuti et al. (2002) E. Cavazzuti, M. Pappalardo, and M. Passacantando. Nash equilibria, variational inequalities, and dynamical systems. Journal of Optimization Theory and Applications, 2002.
 Crouzeix and Ferland (1996) J. P. Crouzeix and J. A. Ferland. Criteria for differentiable generalized monotone maps. Mathematical Programming, 1996.
 Dafermos (1980) S. Dafermos. Traffic equilibria and variational inequalities. Transportation Science, 1980.
 Dafermos (1983) S. Dafermos. An iterative scheme for variational inequalities. Mathematical Programming, 1983.
 Dang and Lan (2015) C. D. Dang and G. Lan. On the convergence properties of noneuclidean extragradient methods for variational inequalities with generalized monotone operators. Computational Optimization and Applications, 2015.
 de Oliveira et al. (2017) L. de Oliveira, M. Paganini, and B. Nachman. Learning particle physics by example: locationaware generative adversarial networks for physics synthesis. Computing and Software for Big Science, 2017.
 Descartes (1886) René Descartes. La géométrie de René Descartes. A. Hermann, 1886.
 Drucker and Le Cun (1992) H. Drucker and Y. Le Cun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 1992.
 Even-Dar et al. (2009) E. Even-Dar, Y. Mansour, and U. Nadav. On the convergence of regret minimization dynamics in concave games. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, 2009.
 Facchinei and Pang (2003) F. Facchinei and J. Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer, 2003.
 Feizi et al. (2017) S. Feizi, C. Suh, F. Xia, and D. Tse. Understanding gans: the lqg setting. arXiv preprint arXiv:1710.10793, 2017.
 Friesz (2010) T. L. Friesz. Dynamic optimization and differential games. Springer Science & Business Media, 2010.
 Gemp and Mahadevan (2016) I. Gemp and S. Mahadevan. Online monotone optimization. arXiv preprint arXiv:1608.07888, 2016.
 Gemp and Mahadevan (2017) I. Gemp and S. Mahadevan. Online monotone games. arXiv preprint arXiv:1710.07328, 2017.
 Gidel et al. (2018) Gauthier Gidel, Hugo Berard, Pascal Vincent, and Simon LacosteJulien. A variational inequality perspective on generative adversarial nets. arXiv preprint arXiv:1802.10551, 2018.
 Goodfellow et al. (2014) I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
 Gordon et al. (2008) G. J. Gordon, A. Greenwald, and C. Marks. Noregret learning in convex games. In Proceedings of the 25th International Conference on Machine learning, 2008.
 Gulrajani et al. (2017) I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, 2017.
 Hartman and Stampacchia (1966) P. Hartman and G. Stampacchia. On some nonlinear elliptic differential functional equations. Acta Mathematica, 1966.
 Higham et al. (2000) D. J. Higham, A. R. Humphries, and R. J. Wain. Phase space error control for dynamical systems. SIAM Journal on Scientific Computing, 2000.
 Ho and Ermon (2016) J. Ho and S. Ermon. Generative adversarial imitation learning. arXiv preprint arXiv:1606.03476, 2016.
 Iusem et al. (2017) A. N. Iusem, A. Jofré, R. I. Oliveira, and P. Thompson. Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization, 2017.
 Juditsky et al. (2011) A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirrorprox algorithm. Stochastic Systems, 2011.
 Kannan and Shanbhag (2017) A. Kannan and U. V. Shanbhag. Optimal stochastic extragradient schemes for pseudomonotone stochastic variational inequality problems and their variants. arXiv preprint arXiv:1410.1628, 2017.
 Karras et al. (2017) T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
 Khalil (1996) H. K. Khalil. Nonlinear Systems. PrenticeHall, New Jersey, 1996.
 Korpelevich (1977) G. Korpelevich. The extragradient method for finding saddle points and other problems. 1977.

 Mao et al. (2017) X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
 Mescheder et al. (2017) L. Mescheder, S. Nowozin, and A. Geiger. The numerics of gans. In Advances in Neural Information Processing Systems, 2017.
 Mescheder et al. (2018) Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? In International Conference on Machine Learning, pages 3478–3487, 2018.
 Metz et al. (2016) L. Metz, B. Poole, D. Pfau, and J. SohlDickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
 Mroueh and Sercu (2017) Y. Mroueh and T. Sercu. Fisher gan. In Advances in Neural Information Processing Systems, 2017.
 Mroueh et al. (2017) Y. Mroueh, T. Sercu, and V. Goel. Mcgan: Mean and covariance feature matching gan. arXiv preprint arXiv:1702.08398, 2017.
 Nagarajan and Kolter (2017) V. Nagarajan and J. Z. Kolter. Gradient descent gan optimization is locally stable. In Advances in Neural Information Processing Systems, 2017.
 Nagurney and Zhang (1996) A. Nagurney and D. Zhang. Projected Dynamical Systems and Variational Inequalities with Applications. Kluwer Academic Press, 1996.
 Nemirovski (2004) A. Nemirovski. Proxmethod with rate of convergence o(1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convexconcave saddle point problems. SIAM Journal on Optimization, 2004.
 Nowozin et al. (2016) S. Nowozin, B. Cseke, and R. Tomioka. fgan: Training generative neural samplers using variational divergence minimization. arXiv preprint arXiv:1606.00709, 2016.
 Roth et al. (2017) Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, pages 2018–2028, 2017.
 Roughgarden (2009) T. Roughgarden. Intrinsic robustness of the price of anarchy. In Proceedings of the 41st annual ACM Symposium on Theory of Computing, 2009.
 Salimans et al. (2016) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. arXiv preprint arXiv:1606.03498, 2016.
 Schaible and Luc (1996) S. Schaible and D. Luc. Generalized monotone nonsmooth maps. Journal of Convex Analysis, 1996.
 Scutari et al. (2010) G. Scutari, D. P. Palomar, F. Facchinei, and J. Pang. Convex optimization, game theory, and variational inequality theory. IEEE Signal Processing Magazine, 2010.
 Thomas (2014) P. Thomas. GeNGA: A generalization of natural gradient ascent with positive and negative convergence results. In International Conference on Machine Learning, 2014.
 Uehara et al. (2016) M. Uehara, I. Sato, M. Suzuki, K. Nakayama, and Y. Matsuo. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920, 2016.
 Yousefian et al. (2014) F. Yousefian, A. Nedić, and U. V. Shanbhag. Optimal robust smoothing extragradient algorithms for stochastic variational inequality problems. In IEEE 53rd Annual Conference on Decision and Control (CDC), 2014.
 Zhao et al. (2016) J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
Appendix A
A.1 A Survey of Candidate Theories (Continued)
A.1.1 Algorithmic Game Theory
Algorithmic Game Theory (AGT) offers results on convergence to equilibria when a game, possibly online, is convex Gordon et al. [2008], socially-convex Even-Dar et al. [2009], or smooth Roughgarden [2009]. A convex game is one in which every player's loss is convex in that player's own variable. A socially-convex game adds the requirements that 1) some strict convex combination of the player losses is convex and 2) each player's loss is concave in the variables of each of the other players. In other words, the players as a whole are cooperative, yet individually competitive. Lastly, smoothness ensures that "the externality imposed on any one player by the actions of the others is bounded" Roughgarden [2009]. In a zero-sum game such as (1), one player's gain is exactly the other player's loss, making smoothness an unlikely fit for studying GANs. See Gemp and Mahadevan [2017] for examples where the three properties above overlap with monotonicity in VIs.
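To make the convex-game definition concrete, the following minimal sketch (with hypothetical quadratic losses, not taken from the paper) verifies numerically that each player's loss is convex in that player's own variable:

```python
# Minimal sketch of a convex game (hypothetical quadratic losses): each
# player's loss must be convex in that player's own variable.
def f1(x1, x2):
    # player 1's loss: convex in x1 for any fixed x2
    return x1**2 + x1 * x2

def f2(x1, x2):
    # player 2's loss: convex in x2 for any fixed x1
    return x2**2 - x1 * x2

def second_diff(f, x, h=1e-4):
    # central finite-difference estimate of f''(x)
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

# Convexity in each player's own variable holds at every point checked.
for x1 in (-1.0, 0.0, 2.0):
    for x2 in (-1.0, 0.0, 2.0):
        assert second_diff(lambda t: f1(t, x2), x1) > 0.0
        assert second_diff(lambda t: f2(x1, t), x2) > 0.0
```

Note that this game is not socially-convex: each loss is linear, not strictly concave, in the other player's variable, illustrating that the conditions are genuinely separate.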
A.1.2 Differential Games
Differential games Basar and Olsder [1999], Friesz [2010] consider dynamics more general than first-order ODEs; however, the focus is on systems that separate control from state, i.e., dynamics of the form $\dot{x} = f(x, u)$ with state $x$ and control $u$. More specific to our interests, Differential Nash Games can be expressed as Differential VIs, a specific class of infinite-dimensional VIs with explicit state dynamics and explicit controls; these, in turn, can be framed as infinite-dimensional VIs without an explicit state.
A.2 Nash Equilibrium vs. VI Solution
Theorem 2.
Repeated from Cavazzuti et al. [2002]. Let $(K, \{f_i\})$ be a cost minimization game with player cost functions $f_i$ and feasible set $K$. Let $x^*$ be a Nash equilibrium, and let $F(x) = [\nabla_{x_1} f_1(x); \ldots; \nabla_{x_n} f_n(x)]$ stack each player's gradient with respect to its own variable. Then
(16) $\langle F(x^*), x - x^* \rangle \ge 0 \quad \forall\, x \in K,$
(17) $\langle F(x^*), d \rangle \ge 0 \quad \forall\, d \in I_K(x^*),$
where $I_K(x^*)$ is the internal cone at $x^*$. When $f_i$ is pseudoconvex in $x_i$ for all $i$, this condition is also sufficient. Note that this is implied if $F$ is pseudomonotone, i.e., pseudomonotonicity of $F$ is a stronger condition.
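The necessary condition in Theorem 2 can be checked numerically. The sketch below uses a hypothetical constrained game (illustrative only, not the LQGAN; `numpy` assumed) in which each player's unconstrained optimum lies outside the feasible box, so the equilibrium sits on the boundary and the game map is nonzero there:

```python
import numpy as np

# Hypothetical two-player game on the box K = [-1, 1]^2:
#   f1(x) = (x1 - 2)^2, controlled by player 1,
#   f2(x) = (x2 + 2)^2, controlled by player 2.
# Both unconstrained optima lie outside K, so the Nash equilibrium
# x* = (1, -1) sits on the boundary and F(x*) != 0.
def F(x):
    # each player's gradient with respect to its own variable, stacked
    return np.array([2.0 * (x[0] - 2.0), 2.0 * (x[1] + 2.0)])

x_star = np.array([1.0, -1.0])

# Necessary condition (16): <F(x*), x - x*> >= 0 for all x in K.
rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.uniform(-1.0, 1.0, size=2)
    assert F(x_star) @ (x - x_star) >= 0.0
```

Since each $f_i$ here is convex (hence pseudoconvex) in its own variable, Theorem 2 also gives sufficiency: any point satisfying the inequality is a Nash equilibrium.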
A.3 Table of Maps Considered in Analysis
Name  Map 

The remaining scalars in each map are hyperparameters. Notice that all maps require an unbiased estimate of the mean ($\mu$) or the variance ($\Sigma$) of the data, which can be obtained with one (mean) or two (variance) samples. All maps corresponding to the first subsystem in Table 4 maintain the desired unique fixed point. For the second subsystem, all maps but one maintain the desired unique fixed point; under certain hyperparameter settings, the exception introduces an additional spurious fixed point. The simpler of the two related maps is a special case of the more general one under a particular choice of hyperparameters.
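The contrast between the plain simultaneous-gradient map and the extragradient map of Korpelevich (1977) can be seen already on a bilinear toy game (a sketch assuming `numpy`; the game is illustrative, not the LQGAN itself):

```python
import numpy as np

# Bilinear zero-sum toy game min_x max_y x*y. The game map F rotates
# around the equilibrium at the origin, so simultaneous gradient descent
# spirals outward while the extragradient map (Korpelevich, 1977)
# contracts toward the fixed point.
def F(z):
    x, y = z
    return np.array([y, -x])  # (df/dx, -df/dy) for f(x, y) = x * y

eta = 0.3
z_gd = np.array([1.0, 1.0])  # simultaneous gradient descent iterate
z_eg = np.array([1.0, 1.0])  # extragradient iterate
for _ in range(200):
    z_gd = z_gd - eta * F(z_gd)
    z_half = z_eg - eta * F(z_eg)   # extrapolation (lookahead) step
    z_eg = z_eg - eta * F(z_half)   # update using the lookahead map

assert np.linalg.norm(z_gd) > 10.0   # gradient descent spirals away
assert np.linalg.norm(z_eg) < 1e-2   # extragradient reaches the equilibrium
```

For this bilinear map, each extragradient step contracts the norm by the factor $\sqrt{(1-\eta^2)^2 + \eta^2} < 1$ for $\eta < 1$, while each simultaneous-gradient step expands it by $\sqrt{1 + \eta^2}$, which is the cycling behavior described in the introduction.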
A.4 Minimax Solution to Constrained Multivariate LQGAN is Unique
Proposition 9.
Assume the data mean $\mu$ and covariance $\Sigma$ are finite with $\Sigma \succ 0$. If $W_2$ is constrained to be symmetric and $A$ is constrained to be of Cholesky form, i.e., lower triangular with positive diagonal, then the unique minimax solution to Equation (5) is the point where $A^* = \Sigma^{1/2}$, the unique, nonnegative square root of $\Sigma$.
Proof.
Expanding the objective of Equation (5), taking derivatives with respect to each variable, and setting them equal to zero, we find that the fixed point at the interior is unique.
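The existence and uniqueness of the Cholesky-form square root can be illustrated numerically (a sketch with a hypothetical 3-dimensional covariance; `numpy` assumed):

```python
import numpy as np

# Numerical sketch of the uniqueness claim: for a positive definite
# Sigma, the Cholesky factor A (lower triangular with positive diagonal)
# satisfies A @ A.T == Sigma and is the unique factor of that form.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
Sigma = M @ M.T + 3.0 * np.eye(3)   # symmetric positive definite

A = np.linalg.cholesky(Sigma)       # lower triangular, positive diagonal
assert np.allclose(A @ A.T, Sigma)  # A is a square root of Sigma
assert np.all(np.diag(A) > 0)       # positive diagonal
assert np.allclose(A, np.tril(A))   # lower triangular
```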