# Global Convergence to the Equilibrium of GANs using Variational Inequalities

In optimization, the negative gradient of a function denotes the direction of steepest descent. Furthermore, traveling in any direction orthogonal to the gradient maintains the value of the function. In this work, we show that these orthogonal directions that are ignored by gradient descent can be critical in equilibrium problems. Equilibrium problems have drawn heightened attention in machine learning due to the emergence of the Generative Adversarial Network (GAN). We use the framework of Variational Inequalities to analyze popular training algorithms for a fundamental GAN variant: the Wasserstein Linear-Quadratic GAN. We show that the steepest descent direction causes divergence from the equilibrium, and guaranteed convergence to the equilibrium is achieved through following a particular orthogonal direction. We call this successful technique Crossing-the-Curl, named for its mathematical derivation as well as its intuition: identify the game's axis of rotation and move "across" space in the direction towards smaller "curling".


## 1 Introduction

When minimizing a function f over x, it is known that f decreases fastest if x moves in the direction of the negative gradient, −∇f(x). In addition, any direction orthogonal to ∇f(x) will leave f unchanged. In this work, we show that these orthogonal directions that are ignored by gradient descent can be critical in equilibrium problems, which are central to game theory. If each player i in a game updates its variables with its own gradient, −∇_{x_i} f_i, the joint iterate can follow a cyclical trajectory, similar to a person riding a merry-go-round (see Figure 1). This toy scenario actually perfectly reflects an aspect of training for a particular machine learning model mentioned below, and is depicted more technically later on in Figure 2. To arrive at the equilibrium point, a person riding the merry-go-round should walk perpendicularly to their direction of travel, taking them directly to the center.
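The merry-go-round effect is easy to reproduce numerically. Below is a minimal sketch (our illustration, not code from the paper) on the bilinear game min_x max_y xy, whose joint-gradient field only rotates around the equilibrium at the origin:

```python
import numpy as np

# Bilinear game: x minimizes f(x, y) = x*y while y maximizes it.
# Simultaneous gradient descent follows the joint field
# F(x, y) = (df/dx, -df/dy) = (y, -x), which is purely rotational
# around the equilibrium (0, 0): the iterates circle the origin and
# slowly spiral outward for any fixed positive step size.
def grad_step(v, lr=0.1):
    x, y = v
    return np.array([x - lr * y, y + lr * x])

# Walking perpendicular to the direction of travel instead points
# straight at the center of the merry-go-round; here that direction
# is simply -(x, y).
def perp_step(v, lr=0.1):
    return v - lr * v

v_grad = np.array([1.0, 0.0])
v_perp = np.array([1.0, 0.0])
for _ in range(100):
    v_grad = grad_step(v_grad)
    v_perp = perp_step(v_perp)

print(np.linalg.norm(v_grad))  # > 1: gradient play spirals away
print(np.linalg.norm(v_perp))  # ~ 0: the perpendicular walk reaches the center
```

The step sizes and iteration counts are arbitrary; the qualitative behavior (rotation versus direct convergence) is the point.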

Equilibrium problems have drawn heightened attention in machine learning due to the emergence of the Generative Adversarial Network (GAN) Goodfellow et al. (2014). GANs have served a variety of applications including generating novel images Karras et al. (2017), simulating particle physics de Oliveira et al. (2017), and imitating expert policies in reinforcement learning Ho and Ermon (2016). Despite this plethora of successes, GAN training remains heuristic.

Deep learning has benefited from an understanding of simpler, more fundamental techniques. For example, multinomial logistic regression formulates learning a multiclass classifier as minimizing the cross-entropy of a log-linear model where class probabilities are recovered via a softmax. The minimization problem is convex and is solved efficiently with guarantees using stochastic gradient descent (SGD). Unsurprisingly, the majority of deep classifiers incorporate a softmax at the final layer, minimize a cross-entropy loss, and train with a variant of SGD. This progression from logistic regression to classification with deep neural nets is not mirrored in GANs. In contrast, from their inception, GANs were architected with deep nets. Only recently has the Wasserstein Linear-Quadratic GAN (LQ-GAN) Feizi et al. (2017); Nagarajan and Kolter (2017) been proposed as a minimal model for understanding GANs.

In this work, we analyze the convergence of several GAN training algorithms in the LQ-GAN setting. We survey several candidate theories for understanding convergence in GANs, naturally leading us to select Variational Inequalities, an intuitive generalization of the widely relied-upon theories from Convex Optimization. According to our analyses, none of the current GAN training algorithms is globally convergent in this setting. We propose a new technique, Crossing-the-Curl, for training GANs that converges with high probability in the N-dimensional (N-d) LQ-GAN setting.

This work makes the following contributions (proofs can be found in the supplementary material):

• The first global convergence analysis of several GAN training methods for the N-d LQ-GAN,

• Crossing-the-Curl, the first technique with stochastic convergence for the N-d LQ-GAN,

• An empirical demonstration of Crossing-the-Curl in the multivariate LQ-GAN setting as well as some common neural-network-driven settings in Appendix A.16.

## 2 Generative Adversarial Networks

The Generative Adversarial Network (GAN) Goodfellow et al. (2014) formulates learning a generative model of data as finding a Nash equilibrium of a minimax game. The generator (the min player) aims to synthesize realistic data samples by transforming vectors drawn from a fixed source distribution, e.g., z ∼ N(0, I). The discriminator (the max player) attempts to learn a scoring function that assigns low scores to synthetic data and high scores to samples drawn from the true dataset. The generator's transformation function, G, and discriminator's scoring function, D, are typically chosen to be neural networks parameterized by weights θ and ω respectively. The minimax objective of the original GAN Goodfellow et al. (2014) is

 min_θ max_ω V(θ, ω) = E_{y∼p_d(y)}[log D_ω(y)] + E_{z∼p(z)}[log(1 − D_ω(G_θ(z)))]  (1)

where p(z) is the source distribution and p_d(y) is the true data distribution.

In practice, finding the solution to (1) consists of local updates, e.g., SGD, to θ and ω. This continues until 1) V has stabilized, 2) the generated data is judged qualitatively accurate, or 3) training has de-stabilized and appears irrecoverable, at which point training is restarted. The difficulty of training GANs has spurred research that includes reformulating the minimax objective Arjovsky et al. (2017); Mao et al. (2017); Mroueh and Sercu (2017); Mroueh et al. (2017); Nowozin et al. (2016); Uehara et al. (2016); Zhao et al. (2016), devising training heuristics Gulrajani et al. (2017); Karras et al. (2017); Salimans et al. (2016); Roth et al. (2017), proving the existence of equilibria Arora et al. (2017), and conducting local stability analyses Gidel et al. (2018); Mescheder et al. (2017, 2018); Nagarajan and Kolter (2017).

We acknowledge here that our algorithm, Crossing-the-Curl, was independently proposed in Balduzzi et al. (2018) as Symplectic Gradient Adjustment (SGA). In contrast to that work, this paper specifies a non-trivial application of this algorithm to LQ-GAN which obtains global convergence with high probability.

Recent work has studied a simplified setting, the Wasserstein LQ-GAN, where G is a linear function, D is a quadratic function, the objective is the (dual) Wasserstein objective, and p(z) is Gaussian Feizi et al. (2017); Nagarajan and Kolter (2017). Follow-up research has shown that, in this setting, the optimal generator distribution is a rank-k Gaussian containing the top-k principal components of the data Feizi et al. (2017). Furthermore, it is shown that if the dimensionality of z matches that of y, LQ-GAN is equivalent to maximum likelihood estimation of the generator's resulting Gaussian distribution. To our knowledge, no GAN training algorithm with guaranteed convergence is currently known for this setting. We revisit the LQ-GAN in more detail in Section 4.

## 3 Convergence of Equilibrium Dynamics

In this section, we review Variational Inequalities (VIs) and compare them to the ODE Method leveraged in recent work Nagarajan and Kolter (2017). See A.1.2 and A.1.1 for a discussion of two additional theories. Throughout the paper, 𝒳 refers to a convex set and F refers to a vector field operator (or map) from 𝒳 to ℝⁿ, although many of the results for VIs apply to set-valued maps, e.g., subdifferentials, as well. Here, we will cover the basics of the theories and introduce select theorems when necessary later on.

### 3.1 Variational Inequalities

Variational Inequalities (VIs) are used to study equilibrium problems in a number of domains including mechanics, traffic networks, economics, and game theory Dafermos (1980); Facchinei and J. (2003); Hartman and Stampacchia (1966); Nagurney and Zhang (1996). The Variational Inequality problem, VI(F, 𝒳), is to find an x* such that for all x in the feasible set 𝒳, ⟨F(x*), x − x*⟩ ≥ 0. Under mild conditions (see Appendix A.2), x* constitutes a Nash equilibrium point. For readers familiar with convex optimization, note the consistent similarity throughout this subsection for when F = ∇f. In game theory, F often maps x to the concatenation of the player gradients. For example, the map corresponding to the minimax game in Equation (1) is F = [∇_ω(−V); ∇_θ V].

A map, F, is monotone Aslam Noor (1998) if ⟨F(x) − F(x′), x − x′⟩ ≥ 0 for all x and x′ in 𝒳. Alternatively, if the Jacobian matrix of F is positive semidefinite (PSD), then F is monotone Nagurney and Zhang (1996); Schaible and Luc (1996). A matrix, J, is PSD if x⊤Jx ≥ 0 for all x, or equivalently, J is PSD if its symmetric part, (J + J⊤)/2, has no negative eigenvalues.
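This Jacobian test is easy to apply numerically. A small sketch (our own illustration, not code from the paper) checks the PSD condition on the symmetrized Jacobian for two simple constant-Jacobian maps:

```python
import numpy as np

def is_monotone(J):
    """A map whose Jacobian is J everywhere is monotone iff the symmetric
    part (J + J^T)/2 has no negative eigenvalues, i.e., J is PSD."""
    sym = (J + J.T) / 2.0
    return bool(np.all(np.linalg.eigvalsh(sym) >= -1e-12))

# Pure rotation field F(x, y) = (y, -x): skew-symmetric Jacobian, so its
# symmetric part is zero -- monotone, though not strictly so.
J_rotation = np.array([[0.0, 1.0], [-1.0, 0.0]])
# Gradient field of the concave function -(x^2 + y^2): Jacobian -2I is not PSD.
J_concave = -2.0 * np.eye(2)

print(is_monotone(J_rotation))  # True
print(is_monotone(J_concave))   # False
```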

As in convex optimization, a hierarchy of monotonicity exists. For all x and x′ in 𝒳, F is

 monotone iff ⟨F(x) − F(x′), x − x′⟩ ≥ 0,  (2)
 pseudomonotone iff ⟨F(x′), x − x′⟩ ≥ 0 ⟹ ⟨F(x), x − x′⟩ ≥ 0, and
 quasimonotone iff ⟨F(x′), x − x′⟩ > 0 ⟹ ⟨F(x), x − x′⟩ ≥ 0.  (3)

If, in Equation (2), "≥ 0" is replaced by "> 0" (for x ≠ x′), then F is strictly-monotone; if it is replaced by "≥ μ‖x − x′‖²", then F is μ-strongly-monotone. If F is a gradient, then replace monotone with convex in the terms above.

Table 1 cites algorithms with convergence rates for several settings. Whereas gradient descent achieves optimal convergence rates for various convex optimization settings, extragradient Korpelevich (1977) achieves optimal rates for VIs. Results have been extended to the online learning setting as well Gemp and Mahadevan (2016, 2017).
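A minimal sketch of the extragradient idea (our own illustration, with an arbitrary step size): evaluate the field at a lookahead point, then apply that lookahead field at the current point. On the rotational monotone map below, plain descent spirals outward while extragradient contracts toward the equilibrium:

```python
import numpy as np

def F(v):
    # Monotone (skew-symmetric Jacobian) map with equilibrium at the origin.
    x, y = v
    return np.array([y, -x])

def gradient_step(v, lr=0.1):
    return v - lr * F(v)

def extragradient_step(v, lr=0.1):
    v_look = v - lr * F(v)      # extrapolation (lookahead) step
    return v - lr * F(v_look)   # update using the field at the lookahead point

v_gd = np.array([1.0, 1.0])
v_eg = np.array([1.0, 1.0])
for _ in range(200):
    v_gd = gradient_step(v_gd)
    v_eg = extragradient_step(v_eg)

print(np.linalg.norm(v_gd))  # grows: plain descent spirals outward
print(np.linalg.norm(v_eg))  # shrinks: extragradient converges
```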

### 3.2 The ODE Method & Hurwitz Jacobians

Recently, Nagarajan and Kolter (2017) performed a local stability analysis of the gradient dynamics of Equation (1), proving that the Jacobian of F evaluated at the equilibrium x* is Hurwitz (our definition of Hurwitz is equivalent to the more standard one, real parts of all eigenvalues strictly negative, applied to −J; Borkar (2008); Borkar and Meyn (2000); Khalil (1996)), i.e., the real parts of its eigenvalues are strictly positive. This means that if simultaneous gradient descent using a "square-summable, not summable" step sequence enters an ϵ-ball around x* with a low enough step size, it will converge to the equilibrium. This applies only in the deterministic setting because stochastic gradients can cause the iterates to exit this ball and diverge. Note that while the real parts of eigenvalues reveal exponential growth or decay of trajectories, the imaginary parts reflect any rotation in the system (linearized dynamical system: ẋ = −Jx; Euler's formula: e^{(α+βi)t} = e^{αt}(cos(βt) + i·sin(βt))).

The Hurwitz and monotonicity properties are complementary (see A.8). To summarize, Hurwitz encompasses dynamics with exponentially stable trajectories and with arbitrary rotation, while monotonicity includes cycles (Jacobians whose eigenvalues have zero real part) and is most similar to convexity in optimization.

Given the preceding discussion, we believe VIs and monotone operator theory will serve as a strong foundation for deriving fundamental convergence results for GANs; this theory is

1. Similar to convexity, suggesting its adoption by the GAN community should be smooth,

2. Mature, with natural mechanisms for handling constraints, subdifferentials, and online scenarios,

3. Rich with algorithms with finite sample convergence for a hierarchy of monotone operators.

Finally, we suggest Scutari et al. (2010) for a lucid comparison of convex optimization, game theory, and VIs.

## 4 The Wasserstein Linear Quadratic GAN

In the Wasserstein Linear-Quadratic GAN, the generator and discriminator are restricted to be linear and quadratic respectively: G(z) = Az + b and D(y) = y⊤W2·y + w1⊤y. Equation (1) becomes

 min_{A,b} max_{W2,w1} { V(W2, w1, A, b) = E_{y∼p(y)}[D(y)] − E_{z∼p(z)}[D(G(z))] }.  (4)

Let μ = E[y], Σ = E[(y − μ)(y − μ)⊤], and, in the 1-d case, σ² = Var(y). If A is constrained to be lower triangular with positive diagonal, i.e., of Cholesky form, then the minimax solution is unique (see Proposition 9). The majority of this work focuses on the case where p(y) and p(z) are 1-d distributions. Equation (4) simplifies to

 min_{a>0, b} max_{w2, w1} { V(w2, w1, a, b) = w2(σ² + μ² − a² − b²) + w1(μ − b) }.  (5)

The map associated with this zero-sum game is constructed by concatenating the gradients of the two players' losses (f_D = −V for the discriminator and f_G = V for the generator):

 F = [∂f_D/∂w2, ∂f_D/∂w1, ∂f_G/∂a, ∂f_G/∂b]⊤ = [a² + b² − σ² − μ², b − μ, −2w2·a, −2w2·b − w1]⊤.
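The map is simple enough to write down directly. A sketch (the target mean and standard deviation below are arbitrary values chosen for illustration, not from the paper):

```python
import numpy as np

MU, SIGMA = 1.0, 2.0  # illustrative target mean and standard deviation

def F(v):
    """Concatenated player gradients for the 1-d LQ-GAN, v = (w2, w1, a, b),
    with f_D = -V and f_G = V."""
    w2, w1, a, b = v
    return np.array([
        a**2 + b**2 - SIGMA**2 - MU**2,  # df_D/dw2
        b - MU,                          # df_D/dw1
        -2.0 * w2 * a,                   # df_G/da
        -2.0 * w2 * b - w1,              # df_G/db
    ])

# At the equilibrium (w2, w1, a, b) = (0, 0, sigma, mu) the field vanishes.
equilibrium = np.array([0.0, 0.0, SIGMA, MU])
print(F(equilibrium))  # [0. 0. 0. 0.]
```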

## 5 Crossing-the-Curl

In this section, we will derive our proposed technique, Crossing-the-Curl, motivated by an examination of the (w1, b)-subsystem of LQ-GAN, i.e., with (w2, a) fixed at (0, a) for any a. The results discussed here hold for the N-dimensional case as well. The map associated with this subsystem is plotted in Figure 2 and formally stated in Equation (6).

The Jacobian of this map is not Hurwitz, and simultaneous gradient descent, defined in Equation (7), will diverge for this problem (see A.5). However, the map is monotone and Lipschitz in the sense that ‖F(x) − F(x′)‖ ≤ L‖x − x′‖. Table 1 offers an extragradient method (see Figure 2) with a convergence rate that is optimal for worst-case monotone maps.

Nevertheless, an algorithm that travels perpendicularly to the vector field will proceed directly to the equilibrium. The intuition is to travel in the direction that is perpendicular to both F and the axis of rotation. For a 2-d system, the axis of rotation can be obtained by taking the curl of the vector field. To derive a direction perpendicular to both F and the axis of rotation, we can take their cross product:

 F_cc = −(1/2)(∇×F)×F = −(1/2){(v·∇)F − ∇_F(v·F)}|_{v=F} = −((J − J⊤)/2)F = [w1, b − μ]⊤

where ∇_F is Feynman notation for the gradient acting on F only (v held constant) and |_{v=F} means evaluate the expression at v = F. The −1/2 factor ensures the algorithm moves toward regions of "tighter cycles" and simplifies notation. It may be sensible to perform some linear combination of simultaneous gradient descent and Crossing-the-Curl; we consider such combinations as part of the family F_lin defined in Section 5.1.

Note that the fixed point of F_cc remains the same as that of the original field F. Furthermore, the reader may recognize F_cc as the gradient of the function (1/2)w1² + (1/2)(b − μ)², which is strongly convex, allowing a linear convergence rate in the deterministic setting. F_cc is derived from intuition in 2-d; however, we discuss reasons in the next subsection for why this approach generalizes to higher dimensions.
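A numerical sketch of the derivation above on the (w1, b)-subsystem (μ chosen arbitrarily for illustration): F_cc computed as −((J − J⊤)/2)F matches the gradient of the strongly convex potential above, and descending F_cc converges where the original field only rotates.

```python
import numpy as np

MU = 1.0  # illustrative target mean

def F(v):
    # (w1, b)-subsystem map: F = [b - mu, -w1]
    w1, b = v
    return np.array([b - MU, -w1])

# Jacobian of F is constant and skew-symmetric, so (J - J^T)/2 = J.
J = np.array([[0.0, 1.0], [-1.0, 0.0]])

def F_cc(v):
    # Crossing-the-Curl: precondition F by the (negated) skew part of J.
    return -((J - J.T) / 2.0) @ F(v)

v = np.array([3.0, -2.0])
# F_cc equals the gradient of 0.5*w1^2 + 0.5*(b - mu)^2 ...
assert np.allclose(F_cc(v), np.array([v[0], v[1] - MU]))

# ... so gradient descent on F_cc converges linearly to (w1*, b*) = (0, mu).
for _ in range(200):
    v = v - 0.1 * F_cc(v)
print(v)  # ~ [0, 1] = (0, mu)
```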

### 5.1 Discussion & Relation to Other Methods

For the (w1, b)-subsystem, Crossing-the-Curl is equivalent to two other methods: the consensus algorithm Mescheder et al. (2017) and a Taylor series approximation to extragradient Korpelevich (1977). These equivalences occur because the Jacobian is skew-symmetric (J⊤ = −J) for the (w1, b)-subsystem. In the more general case, where J is not necessarily skew-symmetric, Crossing-the-Curl represents a combination of the two techniques. Extragradient (EG) is key to solving VIs and the consensus algorithm has delivered impressive results for GANs, so this is promising for Crossing-the-Curl. To our knowledge, F_cc is novel and has not appeared in the Variational Inequality literature.

Crossing-the-Curl stands out in many ways though. Observe that in higher dimensions, the subspace orthogonal to F is (n − 1)-dimensional, which means F_cc is no longer the unique direction orthogonal to F. However, every matrix can be decomposed into a symmetric part with real eigenvalues, (J + J⊤)/2, and a skew-symmetric part with purely imaginary eigenvalues, (J − J⊤)/2. Notice that for an optimization problem, J = (J + J⊤)/2, where J is the Hessian (assuming the objective function has continuous second partial derivatives; see Schwarz's theorem). It is the imaginary eigenvalues, i.e., rotation, that set equilibrium problems apart from optimization and necessitate the development of new algorithms like extragradient. It is reassuring that this skew-symmetric matrix appears explicitly in F_cc. In addition, F + F_cc reduces to gradient descent when applied to an optimization problem, making the map agnostic to the type of problem at hand: optimization or equilibration.
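The decomposition invoked here takes one line of numpy to verify (our own sketch): the symmetric part of a matrix has real eigenvalues, the skew-symmetric part has purely imaginary ones, and the two sum back to the original.

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((4, 4))  # an arbitrary (non-symmetric) Jacobian

S = (J + J.T) / 2.0  # symmetric part: real eigenvalues (the "optimization" part)
A = (J - J.T) / 2.0  # skew-symmetric part: imaginary eigenvalues (the rotation)

assert np.allclose(S + A, J)
assert np.allclose(np.linalg.eigvals(S).imag, 0.0, atol=1e-8)
assert np.allclose(np.linalg.eigvals(A).real, 0.0, atol=1e-8)

# For a gradient field (an optimization problem), J is a Hessian, i.e.,
# symmetric, so the skew part vanishes and there is no rotation to cross.
H = S  # any symmetric matrix stands in for a Hessian
assert np.allclose((H - H.T) / 2.0, 0.0)
```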

The curl also shares a close relation to the gradient. The gradient is applied to a scalar function, while here the curl is crossed with a vector function. Furthermore, under mild conditions, every vector field F admits a Helmholtz decomposition, F = ∇φ + ∇×A, where φ is a scalar function and A is a vector function, suggesting the gradient and curl are both fundamental components.

Consider the perspective of F_cc as preconditioning F by a skew-symmetric matrix. Preconditioning with a positive definite matrix dates back to Newton's method and has reappeared in machine learning with the natural gradient Amari (1998). Dafermos (1983) considered asymmetric positive definite preconditioning matrices for VIs. Thomas (2014) extended the analysis of natural gradient to PSD matrices. We are not aware of any work using skew-symmetric matrices for preconditioning. The scalar x⊤Ax = 0 for all x for any skew-symmetric matrix A, so calling such a matrix PSD is not adequately descriptive.

Note that Crossing-the-Curl does not always improve convergence; this technique can transform a strongly-monotone field into a saddle and an unstable fixed point (non-monotone) into a strongly-monotone field (see A.9 for examples), so this technique should generally be used with caution.

Lastly, Crossing-the-Curl is inexpensive to compute. The Jacobian-vector product, JF, can be approximated accurately and efficiently with finite differences. Likewise, J⊤F can be computed efficiently with double backprop Drucker and Le Cun (1992) by taking the gradient of (1/2)‖F‖². In total, three backprops are required: one for F, one for JF, and one for J⊤F.
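Both products can be checked numerically (a sketch, not the paper's implementation, using the 1-d LQ-GAN map with illustrative constants): JF via a forward finite difference of F along F, and J⊤F as the gradient of (1/2)‖F‖², here approximated by central differences in place of double backprop.

```python
import numpy as np

MU, SIGMA = 1.0, 2.0  # illustrative constants for the 1-d LQ-GAN map

def F(v):
    w2, w1, a, b = v
    return np.array([a**2 + b**2 - SIGMA**2 - MU**2, b - MU,
                     -2.0 * w2 * a, -2.0 * w2 * b - w1])

def jacobian(v):
    w2, w1, a, b = v
    return np.array([[0.0, 0.0, 2 * a, 2 * b],
                     [0.0, 0.0, 0.0, 1.0],
                     [-2 * a, 0.0, -2 * w2, 0.0],
                     [-2 * b, -1.0, 0.0, -2 * w2]])

v = np.array([0.5, -1.0, 1.5, 0.3])
Fv, J = F(v), jacobian(v)
eps = 1e-6

# J F via a forward finite difference along F: (F(v + eps*F(v)) - F(v)) / eps
JF_fd = (F(v + eps * Fv) - Fv) / eps
assert np.allclose(JF_fd, J @ Fv, atol=1e-4)

# J^T F as the gradient of h(v) = 0.5 * ||F(v)||^2 (double backprop in an
# autodiff framework; central differences here for a dependency-free check)
def h(v):
    return 0.5 * F(v) @ F(v)

grad_h = np.array([(h(v + eps * e) - h(v - eps * e)) / (2 * eps)
                   for e in np.eye(4)])
assert np.allclose(grad_h, J.T @ Fv, atol=1e-4)
```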

In our analysis, we also consider the gradient regularization proposed in Nagarajan and Kolter (2017), F_reg, the Unrolled GAN proposed in Metz et al. (2016), F_unr, and alternating gradient descent, F_alt, as well as any linear combination of F, J⊤F, and JF, deemed F_lin, which forms a family of maps that includes F, F_cc, and F_EG:

 F_reg = [F_D; F_G + η∇_G‖F_D‖²]⊤,  F_lin = (ρI + βJ⊤ − γJ)F.

Keep in mind that we are proposing F_lin as a generalization of Crossing-the-Curl. We state our main results here for the (w1, b)-subsystem.

###### Proposition 1.

For any ρ ≥ 0, F_lin with at least one of β and γ positive and both non-negative is strongly monotone. Also, its Jacobian is Hurwitz. See Proposition 13.

###### Corollary 1.

F_cc, F_con (consensus), F_EG, and more generally any F_lin with β + γ > 0 are strongly-monotone with Hurwitz Jacobians. See Proposition 1.

###### Proposition 2.

F, F_alt (alternating descent), F_unr (unrolled), and F_reg with any η are monotone, but not strictly monotone. Of these maps, only F_reg's Jacobian is Hurwitz. See Propositions 12 and 13.

## 6 Analysis of the Full System

Here, we analyze the maps for each of the algorithms discussed above, testing for quasimonotonicity (the weakest monotonicity property) and whether the Jacobian is Hurwitz for the full LQ-GAN system.

Proving quasiconvexity of 4th-degree polynomials has been shown to be strongly NP-hard Ahmadi et al. (2013). This implies that proving quasimonotonicity of 3rd-degree maps is strongly NP-hard as well. The original F contains only quadratic terms, suggesting it may welcome a quasimonotone analysis; however, the remaining maps all contain 3rd-degree terms. Unsurprisingly, analyzing quasimonotonicity for these maps represents the most involved of our proofs, given in Appendix A.11.

The definition stated in (3) suggests checking the truth of an expression depending on four separate quantities: x, x′, F(x), and F(x′). While we used this definition for certain cases, the following alternate requirements proposed in Crouzeix and Ferland (1996) made the complete analysis of the system tractable. We restate simplified versions of the requirements we leveraged for convenience.

Consider the following conditions:

(A) For all x and v such that ⟨F(x), v⟩ = 0, we have v⊤J(x)v ≥ 0.

(B) For all x and v such that ⟨F(x), v⟩ = 0 and v⊤J(x)v = 0, we have ⟨F(x + tv), v⟩ ≥ 0 for all sufficiently small t > 0.

###### Theorem 1 (Crouzeix and Ferland (1996), Theorem 3).

Let F be differentiable on the open convex set 𝒳.

(a) F is quasimonotone on 𝒳 only if (A) holds, i.e., (A) is necessary but not sufficient.

(b) F is pseudomonotone on 𝒳 if (A) and (B) hold, i.e., (A) and (B) are sufficient but not necessary.

Condition (A) says that for a map to be quasimonotone, the map must be monotone along directions orthogonal to the vector field. In addition to this, condition (B) says that for a map to be pseudomonotone, the dynamics, −F, must not lead away from the equilibrium anywhere.

Equipped with these definitions, we can conclude the following:

###### Proposition 3.

None of the maps, including F_lin with any setting of coefficients, is quasimonotone for the full LQ-GAN. See Corollary 5 and Propositions 15 through 17.

###### Proposition 4.

None of the maps, including F_lin with any setting of coefficients, has a Hurwitz Jacobian for the full LQ-GAN. See Propositions 27 and 15 through 17.

### 6.1 Learning the Variance: The (w2,a)-Subsystem

Results from the previous section suggest that we cannot solve the full LQ-GAN directly, but given that we can solve the (w1, b)-subsystem, we shift focus to the (w2, a)-subsystem assuming the mean has already been learned exactly, i.e., b = μ. We will revisit this assumption later.

We can conclude the following for the (w2, a)-subsystem:

###### Proposition 5.

F, F_alt (alternating descent), F_unr (unrolled), F_reg, and F_EG are not quasimonotone. Also, their Jacobians are not Hurwitz. See Propositions 14 through 19.

###### Proposition 6.

F_cc and F_con (consensus) are pseudomonotone, which implies a stochastic convergence rate. See Propositions 21 and 24. Their Jacobians are not Hurwitz. See Proposition 27.

###### Proposition 7.

No monotone F_lin exists. See Proposition 26.

These results are not purely theoretical. Figure 4 displays trajectories resulting from each of the maps.

We can further improve upon F_cc and F_con by a nonlinear rescaling, taking Equations (12)→(13) and (14)→(15) respectively. This results in strongly-monotone and strongly-convex systems respectively, improving the stochastic convergence rate. In deriving these results, we assumed the mean was given. We can relax this assumption and analyze the (w2, a)-subsystem under the assumption that the mean is "close enough". Using a Hoeffding bound, we find a bound on the number of iterations of the (w1, b)-subsystem required to achieve a high probability of the mean being accurate enough to ensure the (w2, a)-subsystem is strongly-monotone. Note that this approach of first learning the mean, then the variance, retains the overall stochastic convergence rate. We summarize the main points here.

###### Claim 1.

A nonlinear scaling of F_cc and F_con results in strictly monotone and μ-strongly monotone subsystems respectively. See Proposition 29.

###### Claim 2.

If the mean is first well approximated, i.e., |b − μ| ≤ ϵ, then the system remains 1) strongly-monotone if the (w1, b)-subsystem is "shut off" or 2) strictly-monotone if the (w1, b)-subsystem is re-weighted with a high coefficient. See Propositions 30 and 31.

###### Proposition 8.

F_cc and F_con are not quasimonotone for the 2-d LQ-GAN system (with and without scaling). See Proposition 32.

Several takeaways emerge. One is that the stability of the system is highly dependent on the mean first being learned. In other words, batch norm is required for the monotonicity of LQ-GAN, so it is not surprising that GANs typically fail without these specialized layers.

Second is that stability is achieved by first learning a simple subsystem, (w1, b), then learning the more complex (w2, a)-subsystem. This theoretically confirms the intuition behind progressive training of GANs Karras et al. (2017), which have generated the highest quality images to date.

Thirdly, because the Jacobian of the rescaled map is symmetric (and PSD), we can integrate the map to discover the convex function it is implicitly descending via gradient descent, and compare this function to the KL-divergence between the generator and data distributions. In contrast to the KL-divergence, this function is convex in a and may be a desirable alternative due to less extreme gradients near a = 0.

### 6.2 Learning the Covariance: The (W2,a)-Off-Diagonal Subsystem

After learning both the mean and variance of each dimension, the covariance of separate dimensions can be learned. Proposition A.14 in the Appendix states that the subsystem relevant to learning each row of A is strictly monotone when all other rows are held fixed. In fact, the maps for these subsystems are affine and skew-symmetric, just like the (w1, b)-subsystem. This implies that Crossing-the-Curl applied successively to each row of A can solve for the full covariance; pseudocode is presented in Algorithm 1 in Appendix A.15. Note that this procedure is reminiscent of the Cholesky–Banachiewicz algorithm, which computes the Cholesky decomposition row by row, beginning with the first row.
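For reference, the Cholesky–Banachiewicz recurrence mentioned here computes the lower-triangular factor L with L·L⊤ = Σ one row at a time. A dependency-light sketch (the test matrix is arbitrary):

```python
import numpy as np

def cholesky_banachiewicz(sigma):
    """Compute the lower-triangular L with L @ L.T == sigma,
    one row at a time, beginning with the first row."""
    n = sigma.shape[0]
    L = np.zeros_like(sigma, dtype=float)
    for i in range(n):
        for j in range(i + 1):
            s = sigma[i, j] - L[i, :j] @ L[j, :j]
            if i == j:
                L[i, j] = np.sqrt(s)   # positive diagonal (Cholesky form)
            else:
                L[i, j] = s / L[j, j]
    return L

# An arbitrary symmetric positive definite covariance for illustration.
sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 5.0, 1.0],
                  [0.0, 1.0, 3.0]])
L = cholesky_banachiewicz(sigma)
assert np.allclose(L @ L.T, sigma)
assert np.allclose(L, np.linalg.cholesky(sigma))
```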

## 7 Experiments

Our theoretical analysis proves convergence of the stagewise procedure using Crossing-the-Curl for the N-d LQ-GAN. Experiments solving the (w2, a)-subsystem alone for randomly generated σ² support the analysis of Subsection 6.1 (see the first row of Table 3). Not listed in the first row of the table are two additional maps, which also converge under a constant step size. Our novel maps converge in a quarter of the iterations of the next best method, and their rescaled variants in nearly a quarter of the iterations of their parent counterparts. These experiments used analytical results of the expectations, i.e., the systems are deterministic.

The second and third rows of the table reveal that convergence slows considerably for higher dimensions. However, the stagewise procedure discussed in Subsection 6.2 is guaranteed to converge given that the mean has been learned to a given accuracy. This procedure solves the 4-d deterministic LQ-GAN. For the 4-d stochastic LQ-GAN using two-sample minibatch estimates, this procedure achieves low error in 100,000 iterations with a 0.75 success rate.

## 8 Conclusion

In this work, we performed the first global convergence analysis for a variety of GAN training algorithms. According to Variational Inequality theory, none of the current GAN training algorithms is globally convergent for the LQ-GAN. We proposed an intuitive technique, Crossing-the-Curl, with the first global convergence guarantees for any generative adversarial network. As a by-product of our analysis, we extract high-level explanations for why the use of batch norm and progressive training schedules for GANs are critical to training. In experiments with the multivariate LQ-GAN, Crossing-the-Curl achieves performance superior to any existing GAN training algorithm.

For future work, we will investigate alternate parameterizations of the discriminator. We will also work on devising heuristics for setting the coefficients of F_lin.

## 9 Acknowledgments

Crossing-the-Curl was independently proposed in Balduzzi et al. (2018) as Symplectic Gradient Adjustment (SGA). Like Crossing-the-Curl, this algorithm is motivated by attacking the challenges of rotation in differentiable games; however, it is derived by performing gradient descent on the Hamiltonian as opposed to generalizing a particular perpendicular direction selected from intuition in 2-d. Given the equivalence between SGA and Crossing-the-Curl, our work can also be viewed as proving that a non-trivial application of this algorithm can be used to solve the LQ-GAN. On the other hand, we have also proven in Proposition 7 that a naive application of this algorithm is insufficient for solving LQ-GAN, suggesting more research is required to understand and more efficiently solve this complex problem.

## Appendix A Appendix

### a.1 A Survey of Candidate Theories Continued

#### a.1.1 Algorithmic Game Theory

Algorithmic Game Theory (AGT) offers results on convergence to equilibria when a game, possibly online, is convex Gordon et al. [2008], socially-convex Even-Dar et al. [2009], or smooth Roughgarden [2009]. A convex game is one in which all player losses are convex in their respective variables, i.e., each l_i is convex in x_i. A socially-convex game adds the additional requirements that 1) there exists a strict convex combination of the player losses that is convex and 2) each player's loss is concave in the variables of each of the other players. In other words, the players as a whole are cooperative, yet individually competitive. Lastly, smoothness ensures that "the externality imposed on any one player by the actions of the others is bounded" Roughgarden [2009]. In a zero-sum game such as (1), one player's gain is exactly the other player's loss, making smoothness an unlikely fit for studying GANs. See Gemp and Mahadevan [2017] for examples where the three properties above overlap with monotonicity in VIs.

#### a.1.2 Differential Games

Differential games Basar and Olsder [1999], Friesz [2010] consider more general dynamics than first-order ODEs; in addition, the focus is on systems that separate a control, u, from the state, x, i.e., ẋ = f(x, u, t). More specific to our interests, Differential Nash Games can be expressed as Differential VIs, a specific class of infinite dimensional VIs with explicit state dynamics and explicit controls; these, in turn, can be framed as infinite dimensional VIs without an explicit state.

### a.2 Nash Equilibrium vs VI Solution

###### Theorem 2.

Repeated from Cavazzuti et al. [2002]. Let G be a cost minimization game with player cost functions l_i and feasible set K. Let x* be a Nash equilibrium. Let F = [∇_{x_1}l_1, …, ∇_{x_n}l_n]⊤. Then

 ⟨F(x*), x − x*⟩ ≥ 0  (16)
 for all x ∈ ({x* + I_K(x*)} ∩ K) ⊆ K  (17)

where I_K(x*) is the internal cone of K at x*. When l_i is pseudoconvex in x_i for all i, this condition is also sufficient. Note that this is implied if F is pseudomonotone, i.e., pseudomonotonicity of F is a stronger condition.

### a.3 Table of Maps Considered in Analysis

All maps corresponding to the (w1, b)-subsystem in Table 4 maintain the desired unique fixed point, (w1*, b*) = (0, μ).

For the (w2, a)-subsystem, all maps except F_lin with certain settings of (β, γ) maintain the desired unique fixed point, (w2*, a*) = (0, σ). In those cases, F_lin introduces an additional spurious fixed point at

 a = √( (−3 + √(9 + 32σ²β²)) / (16β²) ),  (18)
 w2 = (σ² − a²) / (4βa²).  (19)

Note that F_cc is a special case of F_lin where ρ = 0, β = 1/2, and γ = 1/2.

### a.4 Minimax Solution to Constrained Multivariate LQ-GAN is Unique

###### Proposition 9.

Assume z and y are both in ℝⁿ. If W2 is constrained to be symmetric and A is constrained to be of Cholesky form, i.e., lower triangular with positive diagonal, then the unique minimax solution to Equation (4) sets b* = μ and A* to the unique, non-negative square root of Σ.

###### Proof.
 V(G, D) = E_{y∼N(μ,Σ)}[y⊤W2·y + w1⊤y] + E_{z∼N(0,I_n)}[−(Az + b)⊤W2(Az + b) − w1⊤(Az + b)]  (20)
 = E_{y∼N(μ,Σ)}[Σ_i Σ_j W2_{ij} y_i y_j + Σ_i w1_i y_i]  (21)
 − E_{z∼N(0,I_n)}[Σ_i Σ_j W2_{ij}(b_i + Σ_k A_{ik}z_k)(b_j + Σ_k A_{jk}z_k) + Σ_i w1_i(b_i + Σ_k A_{ik}z_k)]  (22)

Taking derivatives and setting equal to zero, we find that the fixed point at the interior is unique.

 Ẇ2 = E_{y∼N(μ,Σ)}[yy⊤] − E_{z∼N(0,I_n)}[(Az + b)(Az + b)⊤]  (23)
 ẇ1 = μ − b  (24)
 Ȧ = E_{z∼N(0,I_n)}[(W2 + W2⊤)Azz⊤ + (W2 + W2⊤)bz⊤ + w1z⊤]  (25)
 ḃ = E_{z∼N(0,I_n)}[(W2 + W2⊤)Az + (W2 + W2⊤)b + w1]  (26)

 ẇ1 = μ − b = 0 ⇒ b = μ  (27)
 Ẇ2 = E_{y∼N(μ,Σ)}[(y − μ)(y − μ)⊤] − …