Game Theoretic Optimization via Gradient-based Nikaido-Isoda Function

05/15/2019 ∙ by Arvind U. Raghunathan, et al.

Computing the Nash equilibrium (NE) of multi-player games has witnessed renewed interest due to recent advances in generative adversarial networks. However, computing equilibria efficiently is challenging. To this end, we introduce the Gradient-based Nikaido-Isoda (GNI) function, which (i) serves as a merit function, vanishing only at the first-order stationary points of each player's optimization problem, and (ii) provides error bounds to a stationary Nash point. Gradient descent is shown to converge sublinearly to a first-order stationary point of the GNI function. For the particular cases of bilinear min-max games and multi-player quadratic games, the GNI function is convex. Hence, the application of gradient descent in these cases yields linear convergence to an NE (when one exists). In our numerical experiments, we observe that the GNI formulation always converges to the first-order stationary point of each player's optimization problem.

1 Introduction

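In this work, we consider the general $N$-player game:

$$x_i^* \in \arg\min_{x_i \in \mathbb{R}^{n_i}} f_i(x_i, x_{-i}^*), \quad i = 1, \dots, N, \tag{1}$$

where $f_i : \mathbb{R}^n \to \mathbb{R}$, $x_i \in \mathbb{R}^{n_i}$, $n = \sum_{i=1}^{N} n_i$, $x$ denotes the collection of all $x_i$'s, while $x_{-i}$ denotes the collection of all $x_j$'s except for index $i$, i.e., $x_{-i} = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_N)$. Observe that the values of $x_{-i}$ are held fixed when performing the minimization in (1) for player $i$.

A point $x^*$ satisfying (1) is called a Nash Equilibrium (NE). We denote by $\mathcal{S}^{NE}$ the set of all NE points, i.e., $\mathcal{S}^{NE} = \{x^* \mid x^* \text{ satisfies (1)}\}$. In the absence of convexity for the functions $f_i$, we may not be able to obtain a minimizer in (1) and have to settle for a first-order stationary point. Accordingly, define $\mathcal{S}^{SNP}$ to be the set of all Stationary Nash Points, i.e., $\mathcal{S}^{SNP} = \{x \mid \nabla_i f_i(x) = 0 \text{ for all } i = 1, \dots, N\}$, where $\nabla_i f_i$ denotes the derivative of function $f_i$ w.r.t. $x_i$.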

There has been renewed interest in Nash equilibrium computation for games owing to the success of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). GANs have been successful in learning probability distributions and have found application in tasks including image-to-image translation (Isola et al., 2016), domain adaptation (Tzeng et al., 2017), and probabilistic inference (Dumoulin et al., 2016; Mescheder et al., 2017), among others. Despite their popularity, GANs are known to be difficult to train. In order to stabilize training, recent approaches have resorted to carefully designed models, either by adapting an architecture (Radford et al., 2015) or by selecting an easy-to-optimize objective function (Salimans et al., 2016; Arjovsky et al., 2017; Gulrajani et al., 2017).

The Nikaido-Isoda (NI) function (Nikaido & Isoda, 1955) (formally introduced in §3) is popular in equilibrium computation (Uryasev & Rubinstein, 1994; Contreras et al., 2004; Facchinei & Kanzow, 2007; von Heusinger & Kanzow, 2009a, b) and is often used as a merit function for NE. Evaluating the NI function requires optimizing each player's problem globally, which can be intractable for non-convex objectives.

In this paper, we introduce the Gradient-based Nikaido-Isoda (GNI) function, which computationally simplifies the original NI formulation. Instead of computing a globally optimal solution, every player locally improves its objective using the steepest descent direction, which relaxes the requirement of optimizing each player's objective globally. We prove that GNI is a valid merit function for multi-player games and vanishes only at the first-order stationary points of each player's optimization problem (§3). The GNI function is shown to be locally stable in a neighborhood of a stationary Nash point (§3.1) and convex when the players' objective functions are quadratic (§3.2). The gradient descent algorithm applied to the GNI function converges to a first-order stationary point of the GNI function, and to a stationary Nash point under a Polyak-Łojasiewicz condition (§4). In addition, if each player's objective is convex in that player's variables ($x_i$), then the algorithm converges to an NE point as long as one exists (§4). A secant approximation is provided to simplify the computation of the gradient of the GNI function, and the convergence of the modified algorithm is also analyzed (§5). Numerical experiments in §6 show that the proposed algorithm is effective in converging to stationary Nash points of the games.

We believe our proposed GNI formulation could be an effective approach for training GANs. However, we emphasize that the focus of this paper is to provide a rigorous analysis of the GNI formulation for games and explore its properties in a non-stochastic setting. The adaptation of our proposed formulations to a stochastic setting (which is the typical framework commonly used in GANs) will need additional results, which will be explored in a future paper.

2 Related Work

Nash Equilibrium (NE) computation, a key area in algorithmic game theory, has seen a number of developments since the pioneering work of John von Neumann (Basar & Olsder, 1999). It is well known that the Nash equilibrium problem can be reformulated as a variational inequality problem (VIP for short); see, for example, (Facchinei & Pang, 2003a). The VIP is a generalization of the first-order optimality conditions defining the set of stationary Nash points $\mathcal{S}^{SNP}$ to the case where the decision variables of each player $i$ are constrained to be in a convex set. Facchinei & Kanzow (2010) proposed penalty methods for the solution of generalized Nash equilibrium problems (Nash equilibrium problems with joint constraints). Iusem et al. (2017) provide a detailed analysis of the extragradient algorithm for stochastic pseudomonotone variational inequalities (corresponding to games with pseudoconvex costs).

Nash Equilibrium computation has found renewed interest due to the emergence of Generative Adversarial Networks (GANs). It has been observed that alternating stochastic gradient descent (SGD) is oscillatory when training GANs (Goodfellow, 2016). Several papers proposed to modify the GAN formulation in order to stabilize the convergence of the iterates. These include the non-saturating GAN formulation of (Goodfellow et al., 2014; Fedus et al., 2018), the DCGAN formulation (Radford et al., 2015), and the gradient penalty formulation for WGANs (Gulrajani et al., 2017). Yadav et al. (2017) proposed a momentum-based step on the generator in the alternating SGD for convex-concave saddle point problems. Daskalakis et al. (2018) proposed the optimistic mirror descent (OMD) algorithm, and showed convergence for bilinear games and divergence of the gradient descent iterates. In a subsequent work, Daskalakis & Panageas (2018) analyzed the limit points of gradient descent and OMD, and showed that the limit points of OMD are a superset of those of alternating gradient descent. Mertikopoulos et al. (2019) generalized and extended the work of Daskalakis et al. (2018) for bilinear games. Li et al. (2017) dualize the GAN objective to reformulate it as a maximization problem and Mescheder et al. (2017) add the norm of the gradient in the objective. The norm of the gradient is shown to locally stabilize the gradient descent iterations in Nagarajan & Kolter (2017). Gidel et al. (2018) formulate the GAN equilibrium as a VIP and propose an extrapolation technique to prevent oscillations. The authors show convergence of the stochastic algorithm under the assumption of monotonicity of the VIP, which is stronger than the convex-concave assumption in min-max games. Finally, the convergence of stochastic gradient descent in non-convex games has also been studied in Bervoets et al. (2018); Mertikopoulos & Zhou (2019).

In contrast to existing approaches, the GNI approach does not assume monotonicity in the game formulations. The GNI approach is also closely related to the idea of minimizing residuals (Facchinei & Pang, 2003a, b).

3 Gradient-based Nikaido-Isoda Function

The Nikaido-Isoda (NI) function introduced in (Nikaido & Isoda, 1955) is defined as

$$\psi(x) = \sum_{i=1}^{N} \Big( f_i(x_i, x_{-i}) - \inf_{y_i \in \mathbb{R}^{n_i}} f_i(y_i, x_{-i}) \Big).$$

From the definition of the NI function $\psi(x)$, it is easy to show that $\psi(x) \ge 0$ for all $x$. Further, $\psi(x^*) = 0$ is the global minimum, which is achieved only at NE points $x^*$, i.e., points whose components $x_i^*$ are global minimizers of the respective optimization problems in (1). A number of papers (Uryasev & Rubinstein, 1994; von Heusinger & Kanzow, 2009a, b) have proposed algorithms that minimize $\psi(x)$ to compute NE points. However, computing the infimum needed to evaluate $\psi(x)$ can be prohibitive for all but a handful of functions. For bilinear min-max games (i.e., $f_1(x) = x_1^\top Q x_2 = -f_2(x)$), the infimum is unbounded below and the approach of minimizing NI fails. To rectify this, recent papers have proposed regularized variants (von Heusinger & Kanzow, 2009b). However, the cost of globally minimizing the nonlinear function can still be prohibitive.
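To make the failure of the NI function on bilinear games concrete, the following short computation works through the simplest case. It assumes two scalar players and the payoffs $f_1(x) = x_1 x_2$, $f_2(x) = -x_1 x_2$; these choices are illustrative and not taken from the paper.

```latex
\begin{align*}
\inf_{y_1 \in \mathbb{R}} f_1(y_1, x_2) &= \inf_{y_1} \, y_1 x_2 =
  \begin{cases} 0 & \text{if } x_2 = 0, \\ -\infty & \text{otherwise,} \end{cases} \\
\inf_{y_2 \in \mathbb{R}} f_2(x_1, y_2) &= \inf_{y_2} \, (-x_1 y_2) =
  \begin{cases} 0 & \text{if } x_1 = 0, \\ -\infty & \text{otherwise.} \end{cases}
\end{align*}
% Hence the NI function equals 0 at x = (0, 0) and +\infty everywhere else,
% so it gives no descent information away from the equilibrium.
```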

To rectify the shortcomings of the NI function, we introduce the Gradient-based Nikaido-Isoda (GNI) function

$$V(x; \eta) = \sum_{i=1}^{N} \Big( f_i(x_i, x_{-i}) - f_i\big(x_i - \eta \nabla_i f_i(x),\, x_{-i}\big) \Big), \tag{2}$$

where $\eta > 0$ is a step size and $\nabla_i f_i$ denotes the derivative of function $f_i$ w.r.t. $x_i$.

The GNI function is obtained by replacing the infimum in the NI function for player $i$ with a point in the steepest descent direction. This provides a local measure of the decrease that can be obtained in the objective for player $i$. The point $x_i - \eta \nabla_i f_i(x)$ is similar in spirit to the Cauchy point that is used in trust-region methods (Nocedal & Wright, 2006). We will show that any point $x$ satisfying $V(x;\eta) = 0$ also satisfies $\nabla_i f_i(x) = 0$ for all $i$, i.e., it is a stationary Nash point. To show this, we first provide bounds on $V(x;\eta)$ in terms of the distance from the first-order optimality conditions for each of the players.
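The construction described above (each player's term is its payoff minus the payoff after one steepest-descent step of size eta in its own variable) can be sketched in a few lines. This is a minimal illustration, assuming scalar players and finite-difference derivatives; the names `partial_derivative`, `gni`, `f1`, and `f2` are hypothetical and not from the paper.

```python
from typing import Callable, List, Sequence

Payoff = Callable[[Sequence[float]], float]


def partial_derivative(f: Payoff, x: Sequence[float], i: int, h: float = 1e-6) -> float:
    """Central finite-difference estimate of the derivative of f w.r.t. x[i]."""
    x_plus, x_minus = list(x), list(x)
    x_plus[i] += h
    x_minus[i] -= h
    return (f(x_plus) - f(x_minus)) / (2.0 * h)


def gni(payoffs: List[Payoff], x: Sequence[float], eta: float) -> float:
    """Sum over players of f_i(x) minus f_i after player i's own gradient step."""
    total = 0.0
    for i, f in enumerate(payoffs):
        g_i = partial_derivative(f, x, i)
        y = list(x)
        y[i] -= eta * g_i  # only player i's variable moves
        total += f(x) - f(y)
    return total


# Illustrative bilinear min-max game with scalar players (an assumption).
def f1(x):
    return x[0] * x[1]


def f2(x):
    return -x[0] * x[1]


value_at_origin = gni([f1, f2], [0.0, 0.0], eta=0.5)
value_elsewhere = gni([f1, f2], [1.0, 2.0], eta=0.5)
```

For this particular game the function reduces to eta * (x1^2 + x2^2), so it is zero at the origin (the stationary Nash point) and positive elsewhere.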

We make the following standing assumption.

Assumption 1.

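The functions $f_i$ are at least twice continuously differentiable and the gradients of $f_i$ (i.e., $\nabla f_i$) are Lipschitz continuous with constant $L_f$.

Lemma 1.

Let $V_i(x;\eta) := f_i(x) - f_i(x_i - \eta \nabla_i f_i(x), x_{-i})$ denote the contribution of player $i$ to the GNI function. Then $\frac{\eta}{2} \|\nabla_i f_i(x)\|^2 \le V_i(x;\eta) \le \frac{3\eta}{2} \|\nabla_i f_i(x)\|^2$ for all $x$ and $\eta \le 1/L_f$.

Proof.

Using the Taylor series expansion of $f_i(x_i - \eta \nabla_i f_i(x), x_{-i})$ around $x$ and substituting in $V_i(x;\eta)$, we obtain

$$V_i(x;\eta) = \eta \|\nabla_i f_i\|^2 - \frac{\eta^2}{2} \nabla_i f_i^\top \, \nabla_i^2 f_i \, \nabla_i f_i,$$

where $\nabla_i f_i = \nabla_i f_i(x)$ and $\nabla_i^2 f_i = \nabla_i^2 f_i(x_i - t \eta \nabla_i f_i(x), x_{-i})$, for some $t \in [0, 1]$. From the Lipschitz continuity of the gradient of $f_i$, we have that $-L_f I \preceq \nabla_i^2 f_i \preceq L_f I$, where $I$ is the ($n_i \times n_i$) identity matrix. Substituting in the above gives $\eta (1 - \frac{\eta L_f}{2}) \|\nabla_i f_i\|^2 \le V_i(x;\eta) \le \eta (1 + \frac{\eta L_f}{2}) \|\nabla_i f_i\|^2$, and using $\eta \le 1/L_f$ yields the claim. ∎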
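The bounds of Lemma 1 can be checked numerically on a small non-convex example. The sketch below assumes the lemma takes the form eta/2 * g^2 <= V_i <= 3*eta/2 * g^2 for eta <= 1/L_f (which is what a second-order Taylor expansion gives), and uses a hypothetical payoff f1(x) = sin(x1) + x1*x2 whose second derivative in x1 is bounded by 1.

```python
import math

L_F = 1.0   # bound on |d^2 f1 / d x1^2| = |sin(x1)| (assumption of this example)
ETA = 0.8   # any step size with ETA <= 1 / L_F


def f1(x1: float, x2: float) -> float:
    return math.sin(x1) + x1 * x2


def grad1(x1: float, x2: float) -> float:
    """Derivative of f1 with respect to player 1's own variable x1."""
    return math.cos(x1) + x2


def lemma1_holds(x1: float, x2: float, eta: float = ETA) -> bool:
    g = grad1(x1, x2)
    v1 = f1(x1, x2) - f1(x1 - eta * g, x2)   # player 1's GNI term
    lower = 0.5 * eta * g * g
    upper = 1.5 * eta * g * g
    return lower - 1e-12 <= v1 <= upper + 1e-12


# Check the two-sided bound on a coarse grid of points.
grid = [(-3.0 + 0.5 * a, -2.0 + 0.5 * b) for a in range(13) for b in range(9)]
all_ok = all(lemma1_holds(x1, x2) for x1, x2 in grid)
```

The small tolerance only absorbs floating-point rounding at points where the gradient is exactly zero.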

We now state our main result relating the zeros of $V(x;\eta)$ and the first-order critical points of the players' optimization problems.

Theorem 1.

The global minimizers of $V(x;\eta)$ are all stationary Nash points, i.e., $V(x^*;\eta) = 0 \iff x^* \in \mathcal{S}^{SNP}$ (the set of stationary Nash points) for all $\eta \le 1/L_f$. If the individual functions $f_i$ are convex, then the global minimizers of $V(x;\eta)$ are precisely the set $\mathcal{S}^{NE}$ of NE points.

Proof.

The nonnegativity of $V(x;\eta)$ follows from Lemma 1. Further, $V(x;\eta) = 0$ if and only if $\nabla_i f_i(x) = 0$ for all $i$. This proves the first claim. The second claim follows by noting that $\mathcal{S}^{SNP} = \mathcal{S}^{NE}$ if the functions $f_i$ are convex. ∎

Theorem 1 shows that the function $V(x;\eta)$ can be employed as a merit function for obtaining a stationary Nash point. When the $f_i$ are non-convex, convergence to a first-order stationary point is possibly the best that one can hope for.

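We provide the expressions for the gradient and Hessian of $V(x;\eta)$ next. These expressions follow from the chain rule of differentiation. The gradient of $V(x;\eta)$ is

$$\nabla V(x;\eta) = \sum_{i=1}^{N} \Big( \nabla f_i(x) - \big(I_n - \eta \nabla^2 f_i(x) E_i\big) \nabla f_i(\hat{x}_i) \Big), \tag{3}$$

where $\hat{x}_i = x - \eta E_i \nabla f_i(x)$ with $E_i$ defined as $E_i = \mathrm{diag}(0, \dots, I_{n_i}, \dots, 0)$, the $n \times n$ block-diagonal matrix whose only nonzero block is the $i$-th, and $I_n$, $I_{n_i}$ are identity matrices. The Hessian of $V(x;\eta)$ is given by

$$\nabla^2 V(x;\eta) = \sum_{i=1}^{N} \Big( \nabla^2 f_i(x) + \eta \nabla^3 f_i(x)[E_i \nabla f_i(\hat{x}_i)] - \big(I_n - \eta \nabla^2 f_i(x) E_i\big) \nabla^2 f_i(\hat{x}_i) \big(I_n - \eta E_i \nabla^2 f_i(x)\big) \Big), \tag{4}$$

where $\nabla^3 f_i(x)[d]$ is the action of the third derivative along the direction $d$. These expressions will be useful in the analysis to follow.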

3.1 GNI is Locally Stable

GAN formulations typically result in objective functions that are not convex. Nagarajan & Kolter (2017) showed that gradient descent for min-max games is not stable for Wasserstein GANs. This is due to the concave-concave nature of the Wasserstein GAN around stationary Nash points (Nagarajan & Kolter, 2017). Daskalakis et al. (2018) showed that gradient descent diverges for simple bilinear min-max games, while the optimistic gradient descent algorithm of Rakhlin & Sridharan (2013) was shown to be convergent. Daskalakis & Panageas (2018) further analyzed the limit points of gradient descent and optimistic gradient descent using dynamical systems theory.

In this section, we show that at every stationary Nash point, the Hessian of $V(x;\eta)$ is positive semidefinite. This ensures that the points in the set of stationary Nash points $\mathcal{S}^{SNP}$ are all stable limit points for the gradient descent algorithm on $V(x;\eta)$.

Lemma 2.

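For $x^* \in \mathcal{S}^{SNP}$, $\nabla^2 V(x^*;\eta)$ is positive semidefinite for all $\eta \le 1/L_f$.

Proof.

Let $x^* \in \mathcal{S}^{SNP}$. Since $\nabla_i f_i(x^*) = 0$, we have that $\hat{x}_i = x^* - \eta E_i \nabla f_i(x^*) = x^*$ and $E_i \nabla f_i(\hat{x}_i) = 0$, where $E_i$ is the $n \times n$ block-diagonal matrix whose only nonzero block is the identity in the $i$-th position. Substituting in the expression for $\nabla^2 V(x;\eta)$ in (4) and simplifying, we obtain

$$\nabla^2 V(x^*;\eta) = \eta \sum_{i=1}^{N} \nabla^2 f_i(x^*) \big( 2 E_i - \eta E_i \nabla^2 f_i(x^*) E_i \big) \nabla^2 f_i(x^*). \tag{5}$$

From the Lipschitz continuity of $\nabla f_i$ we have that $E_i \nabla^2 f_i(x^*) E_i \preceq L_f E_i$. Substituting into (5), we obtain

$$\nabla^2 V(x^*;\eta) \succeq \eta \sum_{i=1}^{N} (2 - \eta L_f)\, \nabla^2 f_i(x^*) E_i \nabla^2 f_i(x^*) \succeq \eta \sum_{i=1}^{N} \nabla^2 f_i(x^*) E_i \nabla^2 f_i(x^*),$$

where the final simplification follows from $\eta \le 1/L_f$. The claim follows from the positive semidefiniteness of $\nabla^2 f_i(x^*) E_i \nabla^2 f_i(x^*) = (E_i \nabla^2 f_i(x^*))^\top (E_i \nabla^2 f_i(x^*))$, which uses $E_i = E_i^\top E_i$. Since $\nabla^2 V(x^*;\eta)$ is bounded below by a sum of positive semidefinite matrices, the claim holds. ∎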
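The local stability claim can be illustrated on a small non-quadratic game. The sketch below is only a numerical sanity check under stated assumptions: two scalar players with the hypothetical payoffs f1 = x2*sin(x1) and f2 = -f1 (whose stationary Nash point is the origin), and a finite-difference Hessian of the GNI function evaluated at that point.

```python
import math

ETA = 0.5  # step size of the GNI function (illustrative value)


def gni(x1: float, x2: float, eta: float = ETA) -> float:
    """GNI function for f1 = x2*sin(x1), f2 = -x2*sin(x1) with scalar players."""
    g1 = x2 * math.cos(x1)   # derivative of f1 w.r.t. x1
    g2 = -math.sin(x1)       # derivative of f2 w.r.t. x2
    v1 = x2 * math.sin(x1) - x2 * math.sin(x1 - eta * g1)
    v2 = -x2 * math.sin(x1) + (x2 - eta * g2) * math.sin(x1)
    return v1 + v2


def hessian_at(x1: float, x2: float, h: float = 1e-3):
    """Central finite-difference Hessian entries (h11, h12, h22) of gni."""
    v = gni
    h11 = (v(x1 + h, x2) - 2.0 * v(x1, x2) + v(x1 - h, x2)) / (h * h)
    h22 = (v(x1, x2 + h) - 2.0 * v(x1, x2) + v(x1, x2 - h)) / (h * h)
    h12 = (v(x1 + h, x2 + h) - v(x1 + h, x2 - h)
           - v(x1 - h, x2 + h) + v(x1 - h, x2 - h)) / (4.0 * h * h)
    return h11, h12, h22


def is_psd_2x2(h11: float, h12: float, h22: float, tol: float = 1e-6) -> bool:
    """A symmetric 2x2 matrix is PSD iff its diagonal and determinant are >= 0."""
    return h11 >= -tol and h22 >= -tol and h11 * h22 - h12 * h12 >= -tol


h11, h12, h22 = hessian_at(0.0, 0.0)
stable = is_psd_2x2(h11, h12, h22)
```

Near the origin this GNI function behaves like eta * (x1^2 + x2^2), so the estimated Hessian is close to 2 * eta times the identity.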

3.2 Convexity Properties of GNI: An Example

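In this section, we present an example NE reformulation of a (non-)convex game using the GNI setup. Suppose each player's objective is quadratic, i.e., $f_i(x) = \frac{1}{2} x^\top Q_i x + r_i^\top x$ with $Q_i$ symmetric. Then, the GNI function is

$$V(x;\eta) = \eta \sum_{i=1}^{N} (Q_i x + r_i)^\top \Big( E_i - \frac{\eta}{2} E_i Q_i E_i \Big) (Q_i x + r_i), \tag{6}$$

where $E_i$ is the $n \times n$ block-diagonal matrix whose only nonzero block is the identity $I_{n_i}$ in the $i$-th position. Suppose $E_i Q_i E_i \preceq L_f E_i$ for all $i$ and let $\eta \le 1/L_f$, then

$$\nabla^2 V(x;\eta) = 2\eta \sum_{i=1}^{N} Q_i \Big( E_i - \frac{\eta}{2} E_i Q_i E_i \Big) Q_i \succeq 0, \tag{7}$$

where the positive semidefiniteness holds since $E_i - \frac{\eta}{2} E_i Q_i E_i \succeq \frac{1}{2} E_i \succeq 0$ for all $\eta \le 1/L_f$. Hence, when $f_i$ is quadratic, the GNI function is a convex, quadratic function. Note that the convexity of the GNI function holds regardless of the convexity of the original functions $f_i$. However, for general nonlinear functions $f_i$, the GNI function does not preserve convexity.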
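The convexity claim for quadratic payoffs can be checked on a small instance in which neither player's problem is convex. This is a sketch under assumptions: two scalar players, hypothetical matrices and linear terms chosen for illustration, and a finite-difference Hessian (which is exact up to rounding because the GNI function is quadratic here).

```python
def make_player(q, r):
    """Return (f, grad) for f(x) = 0.5 * x^T Q x + r^T x with a symmetric 2x2 Q."""
    def f(x):
        quad = q[0][0] * x[0] * x[0] + 2.0 * q[0][1] * x[0] * x[1] + q[1][1] * x[1] * x[1]
        return 0.5 * quad + r[0] * x[0] + r[1] * x[1]

    def grad(x):
        return [q[0][0] * x[0] + q[0][1] * x[1] + r[0],
                q[0][1] * x[0] + q[1][1] * x[1] + r[1]]

    return f, grad


def gni(players, x, eta):
    """GNI function: each player steps along its own partial gradient."""
    total = 0.0
    for i, (f, grad) in enumerate(players):
        y = list(x)
        y[i] -= eta * grad(x)[i]
        total += f(x) - f(y)
    return total


def gni_hessian(players, x, eta, h=1e-2):
    """Central finite-difference Hessian of the GNI function at x."""
    def v(a, b):
        return gni(players, [a, b], eta)

    a, b = x
    h11 = (v(a + h, b) - 2.0 * v(a, b) + v(a - h, b)) / (h * h)
    h22 = (v(a, b + h) - 2.0 * v(a, b) + v(a, b - h)) / (h * h)
    h12 = (v(a + h, b + h) - v(a + h, b - h) - v(a - h, b + h) + v(a - h, b - h)) / (4.0 * h * h)
    return [[h11, h12], [h12, h22]]


# Player 1 is concave in x1 (Q1[0][0] < 0); player 2 is concave in x2 (Q2[1][1] < 0).
players = [make_player([[-1.0, 2.0], [2.0, 0.5]], [0.3, -0.2]),
           make_player([[0.3, -1.0], [-1.0, -1.0]], [0.1, 0.4])]
ETA = 0.4
H = gni_hessian(players, [0.7, -1.3], ETA)
convex = H[0][0] >= 0.0 and H[1][1] >= 0.0 and H[0][0] * H[1][1] - H[0][1] ** 2 >= 0.0
```

Even though both players face non-convex problems, the Hessian of the GNI function comes out positive semidefinite, in line with the discussion above.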

4 Descent Algorithm for GNI

Consider the gradient descent iteration minimizing $V(x;\eta)$:

$$x^{k+1} = x^k - \rho \nabla V(x^k;\eta), \tag{8}$$

where $\rho > 0$ is a stepsize. The restrictions on $\rho$, if any, are provided in subsequent discussions.

Theorem 2 proves sublinear convergence of $\{x^k\}$ to a stationary point of the GNI function based on standard analysis. Linear convergence to a stationary point is shown under the assumption of the Polyak-Łojasiewicz inequality (Łojasiewicz, 1963; Polyak, 1963; Karimi et al., 2018). Luo & Tseng (1993) employed similar error bound conditions in the context of descent algorithms for variational inequalities.

Theorem 2.

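Suppose $\nabla V(x;\eta)$ is $L_V$-Lipschitz continuous. Let $0 < \rho \le 1/L_V$. Then, the sequence $\{x^k\}$ generated by (8) converges sublinearly to a first-order stationary point of $V(x;\eta)$, i.e., $\min_{0 \le k \le K} \|\nabla V(x^k;\eta)\|^2 \le \frac{2 V(x^0;\eta)}{\rho (K+1)}$. If the Polyak-Łojasiewicz inequality $\frac{1}{2} \|\nabla V(x;\eta)\|^2 \ge \mu V(x;\eta)$ holds for some $\mu > 0$, then the sequence $\{V(x^k;\eta)\}$ converges linearly to 0, i.e., $\{x^k\}$ converges to a stationary Nash point $x^* \in \mathcal{S}^{SNP}$.

Proof.

From Lipschitz continuity of $\nabla V(x;\eta)$,

$$V(x^{k+1};\eta) \le V(x^k;\eta) + (\nabla V^k)^\top (x^{k+1} - x^k) + \frac{L_V}{2} \|x^{k+1} - x^k\|^2 = V(x^k;\eta) - \rho \Big(1 - \frac{\rho L_V}{2}\Big) \|\nabla V^k\|^2 \le V(x^k;\eta) - \frac{\rho}{2} \|\nabla V^k\|^2, \tag{9}$$

where $\nabla V^k := \nabla V(x^k;\eta)$. Telescoping the sum for $k = 0, \dots, K$, we obtain

$$\frac{\rho}{2} \sum_{k=0}^{K} \|\nabla V^k\|^2 \le V(x^0;\eta) - V(x^{K+1};\eta). \tag{10}$$

Since $V(x;\eta)$ is bounded below by $0$, we have that

$$\min_{0 \le k \le K} \|\nabla V^k\|^2 \le \frac{1}{K+1} \sum_{k=0}^{K} \|\nabla V^k\|^2 \le \frac{2 V(x^0;\eta)}{\rho (K+1)}.$$

This proves the claim on sublinear convergence to a first-order stationary point of $V(x;\eta)$. Suppose $\frac{1}{2} \|\nabla V(x;\eta)\|^2 \ge \mu V(x;\eta)$ holds. Substituting in (9), we obtain

$$V(x^{k+1};\eta) \le (1 - \rho \mu)\, V(x^k;\eta), \tag{11}$$

which proves the claim on linear convergence of $\{V(x^k;\eta)\}$ to 0. By Theorem 1, $\{x^k\}$ converges to $x^* \in \mathcal{S}^{SNP}$. ∎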
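As a small illustration of the descent iteration, the sketch below compares gradient descent on the GNI function with plain simultaneous gradient descent on a scalar bilinear game. It assumes f1 = x1*x2 and f2 = -x1*x2, for which the GNI function works out to eta * (x1^2 + x2^2); the function names and step sizes are illustrative choices, not the settings used in the paper's experiments.

```python
import math


def gni_descent(x0, eta, rho, steps):
    """Gradient descent on V(x; eta) = eta * (x1^2 + x2^2)."""
    x1, x2 = x0
    for _ in range(steps):
        g1, g2 = 2.0 * eta * x1, 2.0 * eta * x2   # gradient of V
        x1, x2 = x1 - rho * g1, x2 - rho * g2
    return x1, x2


def simultaneous_gd(x0, rho, steps):
    """Each player descends on its own payoff using the other's current value."""
    x1, x2 = x0
    for _ in range(steps):
        # d f1 / d x1 = x2 and d f2 / d x2 = -x1
        x1, x2 = x1 - rho * x2, x2 + rho * x1
    return x1, x2


def norm(p):
    return math.hypot(p[0], p[1])


ETA = 1.0
L_V = 2.0 * ETA            # Lipschitz constant of the gradient of V in this example
START = (1.0, 2.0)
gni_end = gni_descent(START, ETA, rho=0.5 / L_V, steps=20)
gd_end = simultaneous_gd(START, rho=0.25, steps=20)
```

With these values the GNI iterate halves its distance to the origin at every step, while the simultaneous gradient iterate spirals outward, since each step multiplies its norm by sqrt(1 + rho^2).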

4.1 Quadratic Objectives

In the following, we consider the popular setting of quadratic objective functions and explore the implications of Theorem 2. Note that the bilinear case is a special case of the quadratic objective. Consider the $f_i$'s to be quadratic. For this setting, §3.2 showed that the GNI function is a convex quadratic function, which implies that $V(x;\eta)$ has an $L_V$-Lipschitz continuous gradient. It is well known that the Polyak-Łojasiewicz inequality holds for the composition of a linear function with a strongly convex function (Luo & Tseng, 1993), i.e., there exists $\mu > 0$ such that $\frac{1}{2} \|\nabla V(x;\eta)\|^2 \ge \mu V(x;\eta)$ holds. Hence, we can state the following stronger result for quadratic objective functions.

Corollary 1.

Suppose the $f_i$ are quadratic and player convex, i.e., $f_i(x_i, x_{-i})$ is convex in $x_i$. Let $0 < \rho \le 1/L_V$. Then, the sequence $\{V(x^k;\eta)\}$ converges linearly to 0, i.e., $\{x^k\}$ converges to an NE point $x^* \in \mathcal{S}^{NE}$.
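For intuition, the Polyak-Łojasiewicz constant can be computed by hand in the simplest bilinear case. The derivation below assumes two scalar players with $f_1(x) = x_1 x_2$ and $f_2(x) = -x_1 x_2$ (an illustrative choice), a GNI step size $\eta$, and an outer step size $\rho$.

```latex
\begin{align*}
V(x;\eta) &= \eta \,(x_1^2 + x_2^2), \qquad \nabla V(x;\eta) = 2\eta\, x, \qquad L_V = 2\eta, \\
\tfrac{1}{2}\,\|\nabla V(x;\eta)\|^2 &= 2\eta^2 \|x\|^2 = 2\eta \, V(x;\eta)
  \quad\Longrightarrow\quad \mu = 2\eta, \\
V(x^{k+1};\eta) &= (1 - 2\eta\rho)^2 \, V(x^k;\eta) \;\le\; (1 - \rho\mu)\, V(x^k;\eta)
  \quad \text{for } 0 < \rho \le \tfrac{1}{2\eta}.
\end{align*}
```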

5 Modified Descent Algorithm for GNI

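The evaluation of the gradient $\nabla V(x;\eta)$ requires the computation of the Hessians of the functions $f_i$ (see (3)), which can be prohibitive. A close examination of the expression for $\nabla V(x;\eta)$ in (3) reveals that we only require the action of the Hessian in a particular direction, i.e., $\nabla^2 f_i(x) E_i \nabla f_i(\hat{x}_i)$, where $\hat{x}_i = x - \eta E_i \nabla f_i(x)$ and $E_i$ selects the coordinates of player $i$. This immediately suggests the use of an approximation for this term inspired by secant methods (Nocedal & Wright, 2006):

$$\eta \nabla^2 f_i(x) E_i \nabla f_i(\hat{x}_i) \approx \nabla f_i\big(x + \eta E_i \nabla f_i(\hat{x}_i)\big) - \nabla f_i(x). \tag{12}$$

Substituting (12) for the term involving the Hessian in $\nabla V(x;\eta)$ and simplifying, we obtain the direction $d(x;\eta)$:

$$d(x;\eta) = \sum_{i=1}^{N} \Big( \nabla f_i\big(x + \eta E_i \nabla f_i(\hat{x}_i)\big) - \nabla f_i(\hat{x}_i) \Big). \tag{13}$$

Substituting (12) in the gradient descent iteration (8), we obtain the modified iteration

$$x^{k+1} = x^k - \rho\, d^k, \tag{14}$$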

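where $d^k := d(x^k;\eta)$. We assume the following bound on the error in the approximation:

$$\|d^k - \nabla V(x^k;\eta)\| \le \tau \|\nabla V(x^k;\eta)\| \tag{15}$$

for some $\tau \in [0, 1)$. Such a bound on the error in the gradients has also been used in Luo & Tseng (1993).

Theorem 3.

Suppose $\nabla V(x;\eta)$ is $L_V$-Lipschitz continuous. Let $0 < \rho \le \frac{1 - \tau}{L_V (1 + \tau)^2}$ for $\tau \in [0, 1)$ and suppose (15) holds. Then, the sequence $\{x^k\}$ generated by (14) converges sublinearly to a first-order stationary point of $V(x;\eta)$, i.e., $\min_{0 \le k \le K} \|\nabla V(x^k;\eta)\|^2 \le \frac{2 V(x^0;\eta)}{\rho (1 - \tau)(K+1)}$. If $\frac{1}{2} \|\nabla V(x;\eta)\|^2 \ge \mu V(x;\eta)$ holds for some $\mu > 0$, then the sequence $\{V(x^k;\eta)\}$ converges linearly to 0, i.e., $\{x^k\}$ converges to a stationary Nash point $x^* \in \mathcal{S}^{SNP}$.

Proof.

Let $\nabla V^k := \nabla V(x^k;\eta)$ and $e^k := d^k - \nabla V^k$. From (15), $\|e^k\| \le \tau \|\nabla V^k\|$. Applying the triangle inequality to $\|d^k\| = \|\nabla V^k + e^k\|$ and using (15), we obtain

$$\|d^k\| \le (1 + \tau) \|\nabla V^k\|. \tag{16}$$

The term $-(\nabla V^k)^\top d^k$ can be upper bounded as

$$-(\nabla V^k)^\top d^k = -\|\nabla V^k\|^2 - (\nabla V^k)^\top e^k \le -\|\nabla V^k\|^2 + \|\nabla V^k\| \|e^k\| \le -(1 - \tau) \|\nabla V^k\|^2, \tag{17}$$

where the final inequality follows from (15). From Lipschitz continuity of $\nabla V(x;\eta)$,

$$V^{k+1} \le V^k + (\nabla V^k)^\top (x^{k+1} - x^k) + \frac{L_V}{2} \|x^{k+1} - x^k\|^2 = V^k - \rho (\nabla V^k)^\top d^k + \frac{L_V \rho^2}{2} \|d^k\|^2 \le V^k - \rho \Big( (1 - \tau) - \frac{L_V \rho}{2} (1 + \tau)^2 \Big) \|\nabla V^k\|^2 \le V^k - \frac{\rho (1 - \tau)}{2} \|\nabla V^k\|^2, \tag{18}$$

where $V^k := V(x^k;\eta)$, the third relation is obtained by substituting (16) and (17), and the final inequality follows from the bound on $\rho$ in the statement of the theorem. By similar arguments to those in Theorem 2, we obtain

$$\min_{0 \le k \le K} \|\nabla V^k\|^2 \le \frac{2 V(x^0;\eta)}{\rho (1 - \tau)(K+1)}.$$

This proves the claim on sublinear convergence to a first-order stationary point of $V(x;\eta)$. Suppose $\frac{1}{2} \|\nabla V(x;\eta)\|^2 \ge \mu V(x;\eta)$ holds. Substituting in (18), we obtain

$$V^{k+1} \le \big(1 - \rho (1 - \tau) \mu\big) V^k, \tag{19}$$

which proves the claim on linear convergence of $\{V(x^k;\eta)\}$ to 0. By Theorem 1, $\{x^k\}$ converges to $x^* \in \mathcal{S}^{SNP}$. ∎

The approximation in (12) is in fact exact when the functions $f_i$ are quadratic. Consequently, the claims on the convergence of the iterates continue to hold when the iterates are generated by (14).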
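The exactness of the secant approximation for quadratic payoffs can be checked directly. The sketch below assumes two scalar players, a secant step equal to the GNI step size eta, and the direction d = sum_i [grad f_i(x + eta*E_i*grad f_i(x_hat_i)) - grad f_i(x_hat_i)]; it compares this gradient-only direction with the Hessian-based gradient of the GNI function. All function names are hypothetical.

```python
def make_grad(q, r):
    """Gradient of f(x) = 0.5 * x^T Q x + r^T x for a symmetric 2x2 Q."""
    def grad(x):
        return [q[0][0] * x[0] + q[0][1] * x[1] + r[0],
                q[1][0] * x[0] + q[1][1] * x[1] + r[1]]
    return grad


def secant_direction(grads, x, eta):
    """Gradient-only direction: no Hessian of any f_i is formed."""
    d = [0.0, 0.0]
    for i, grad in enumerate(grads):
        x_hat = list(x)
        x_hat[i] -= eta * grad(x)[i]     # player i takes its own gradient step
        g_hat = grad(x_hat)
        x_probe = list(x)
        x_probe[i] += eta * g_hat[i]     # probe point of the secant difference
        g_probe = grad(x_probe)
        d[0] += g_probe[0] - g_hat[0]
        d[1] += g_probe[1] - g_hat[1]
    return d


def exact_gradient(qs, grads, x, eta):
    """Hessian-based gradient: sum_i [grad f_i(x) - (I - eta*Q_i*E_i) grad f_i(x_hat_i)]."""
    out = [0.0, 0.0]
    for i, (q, grad) in enumerate(zip(qs, grads)):
        g = grad(x)
        x_hat = list(x)
        x_hat[i] -= eta * g[i]
        g_hat = grad(x_hat)
        for j in range(2):
            # Q_i * E_i * g_hat picks column i of Q_i scaled by g_hat[i]
            out[j] += g[j] - (g_hat[j] - eta * q[j][i] * g_hat[i])
    return out


# Scalar bilinear game: f1 = x1*x2 and f2 = -x1*x2 (illustrative choice).
Q1 = [[0.0, 1.0], [1.0, 0.0]]
Q2 = [[0.0, -1.0], [-1.0, 0.0]]
grads = [make_grad(Q1, [0.0, 0.0]), make_grad(Q2, [0.0, 0.0])]
d = secant_direction(grads, [1.0, 2.0], eta=0.5)
g = exact_gradient([Q1, Q2], grads, [1.0, 2.0], eta=0.5)
```

For the bilinear game both directions equal 2 * eta * x, which is the gradient of eta * (x1^2 + x2^2).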

6 Experiments

In this section, we present several empirical results on simulated data demonstrating the effectiveness of the proposed GNI formulation. To demonstrate the correctness of our theoretical results, we show numerical results on several simple game settings with known equilibrium. Specifically, we consider the following payoff functions: i) bilinear two-player games, ii) quadratic games with convex and non-convex payoffs, iii) linear GAN using a Dirac delta generator, and iv) a more general linear GAN with linear generator and discriminator. We compare our descent algorithm against several popular choices such as (i) gradient descent, (ii) gradient descent with Adam-style updates (Kingma & Ba, 2014), (iii) optimistic mirror descent (Rakhlin & Sridharan, 2013; Daskalakis et al., 2018), (iv) the extrapolation scheme (Gidel et al., 2018), and (v) the extra-gradient method (Korpelevich, 1976). For all these methods, we either follow the standard hyperparameter settings (e.g., in Adam), or we find the hyperparameters that lead to the best convergence. For each of these games, we observe convergence of the proposed algorithm to stationary Nash points and contrast the quality of solutions against what can be theoretically guaranteed. As discussed in Section 3.2, the quadratic and bilinear cases lead to a convex GNI function and thus, the proposed algorithm converges to an NE (when one exists). Refer to the supplementary materials for extra experiments. Below, we detail each of the game settings.

6.1 Bi-Linear Two-player Game:

We consider the following two-player game:

(20)

where $f_1$ and $f_2$ are the players' payoff functions, a setting explored in (Gidel et al., 2018). The GNI for this game leads to a convex objective. For GNI, we use a step-size , where , and , while for other methods we use a stepsize of (other values did not seem to result in stable descent). The methods are initialized randomly; the initialization is seen to have little impact on the convergence of GNI, but it drastically changed the convergence of the other methods.

Figure 1: (a) shows GNI against other methods for bilinear min-max game. (b) shows convergence trajectories for 1-dimensional players. For (b), the initial point is shown in red diamond.

In Figure 1(a), we plot the gradient convergence (using 10-d data). In this plot (and all subsequent plots of gradient convergence), the norm of the gradient . We see that GNI converges linearly. However, other methods, such as gradient descent and mirror descent, diverge, while the extragradient method and Adam are seen to converge slowly. To understand the descent better, in Figure 1(b), we use , and plot the iterates at every 100-th iteration starting from the same initial point (shown by the red diamond). Interestingly, we find that the extragradient and mirror-descent methods show a circular trajectory, while Adam (with and ) takes a spiral convergence path. GNI takes a straighter trajectory, steadily decreasing to the optimum (shown by the blue straight line).

Figure 2: Convergence of GNI against other methods for quadratic games. (a) Non-convex QP with indefinite Q matrices for each player; (b) convex QP with semi-definite Q matrices.

6.2 Two-Player Quadratic Games:

We consider two-player games (multiplayer extensions are trivial) with the payoff functions:

(21)

where each $Q_i$ is symmetric. We consider the cases where each $Q_i$ is indefinite (i.e., non-convex QP) and where each is positive semi-definite (i.e., convex QP). As with the bilinear case, all the QP payoffs result in convex GNI reformulations. We used 20-d data, the same stepsizes and for GNI, while using for other methods. The players are initialized from .

In Figure 2, we compare the descent on these quadratic games. We find that the competing methods struggle on the non-convex QP and almost all of them diverge, except Adam, which converges slowly. GNI is found to converge to the stationary Nash point (as its reformulation is convex; see §3.2). For the convex case, all methods are found to converge. To gain insight, we plot the convergence trajectory for a 1-d convex quadratic game (i.e., ) in Figure 3. The initializations are random for both players and the parameters are equal. We see that all schemes follow similar trajectories, except for Adam and GNI, with all converging to the same point.

Figure 3: Convergence of GNI against other methods on a convex 1-d quadratic game. Left: the convergence achieved by different algorithms. Right: the trajectories of the two players to the NE.

6.3 Dirac Delta GAN

This is a one-dimensional GAN explored in (Gidel et al., 2018). In this case, the real data is assumed to follow a Dirac delta distribution (with a spike at say point -2). The payoff functions for the two players are:

(22)

where is the location of the delta spike. Unlike the other game settings described above, we do not have an analytical formula for the Lipschitz constant of the payoffs. To this end, we used an empirical estimate (more details to follow). We used , and initialized all players uniformly from .

Figure 4 shows the comparison of the convergence of the Dirac delta GAN game to a stationary Nash point. GNI achieves faster convergence than all other methods, despite having a non-convex reformulation, in contrast to the bilinear and QP cases discussed above. The game has multiple local solutions and the schemes may converge to different points depending on their initialization (see supplementary material for details).

Figure 4: Convergence of GNI against other methods on the Dirac-Delta GAN.
Figure 5: Study of the influence of the step sizes ( and ) on the convergence of GNI reformulations for the linear GAN game; each of the panels (a), (b), and (c) varies one setting.
Figure 6: Convergence of GNI against other methods on the linear GAN two-player game: (a) generator, (b) discriminator, (c) generator. The real-data distribution is sampled from for (a) and (b), while we use for (c), where . Note that, when the optimization converges, the discriminator is expected to be confused between the real and fake data distributions (i.e., classification accuracy is 0.5).

6.4 Linear GAN

We now introduce a more general GAN setup, a variant of the non-saturating GAN described in (Goodfellow, 2016) that uses a linear generator and discriminator. We designed this experiment to serve two key goals: (i) to exposit the influence of the GNI hyperparameters in a more general GAN setting, and (ii) to show the performance of GNI on a setting for which it is harder to estimate a Lipschitz constant. While our proposed setting is not a neural network, it allows us to understand the behavior of GNI when the non-linearities arising from the layers of a neural network are absent, and thereby to study GNI in isolation.

Experimental Setup:

The payoff functions are:

(23)

where and are the real and the noise data distributions, the latter being the standard normal distribution $\mathcal{N}(0, I)$. The operator $\mathrm{diag}(\cdot)$ returns a diagonal matrix with its argument as its diagonal. We consider two cases for : (i) for a mean and (ii) for a covariance matrix . In our experiments to follow, we use , being a -dimensional vector ( ) of all ones. We initialized for all the methods.

Evaluation Metrics:

To evaluate the performance on various hyper-parameters of GNI, we define two metrics: (i) discriminator-accuracy, and (ii) the distance-to-mean. The discriminator-accuracy measures how well the learned discriminator classifies the two distributions, defined as:

where is the indicator function, is the number of data points sampled from the respective distributions, and is a threshold for the indicator function. We use . While the discriminator-accuracy measures the quality of the discriminator learned, it does not tell us anything about the convergence of the generator. To this end, we present another measure to evaluate the generator; specifically, the distance-to-mean, which computes the distance of the generated distribution from the first moment of the true distribution, defined as:

(24)

Hyper-parameter Study:

The goal of this experiment is to analyze the descent trajectory of GNI-based gradient descent when the hyper-parameters are changed. To this end, we vary and separately in the range to in multiples of , while keeping the other parameter fixed (we use and as the base settings). In Figure 5, we plot the discriminator-accuracy and distance-to-mean against GNI iterations for the generator and discriminator separately. From Figures 5(a) and (b), it appears that a higher value of biases the descents on the generator and discriminator separately. For example, leads to a sharp descent to the optimal solution of the discriminator, but it also leads to a generator breakdown (Figure 5(a)). Similarly, a small value of , such as , shows high distance-to-mean, i.e., the generator is weak, while leads to good descents for both the generator and the discriminator. We found that a higher leads to unstable descent that skews the plots, and it is thus not shown. In short, we found that making the discriminator quickly converge to its optimum could lead to a better convergence trajectory for the generator for this linear GAN setup using the GNI scheme.

Comparisons to Other Algorithms:

In Figures 6(a) and (b), we plot the distance-to-mean and discriminator-accuracy of the linear GAN using and , and compare them to all other descent schemes. Interestingly, we found that Adam shows a different pattern of convergence, with the distance-to-mean steadily decreasing to zero; on close inspection (Figure 6(b)), we see that the discriminator-accuracy simultaneously goes to zero as well, suggesting the non-optimality of the descent. In contrast, GNI converges quickly. In Figure 6(c), we plot the convergence when using a real data distribution , where (a d-dimensional uniform distribution) and is a randomly-sampled diagonal covariance matrix. The descent in this general setting also looks similar to the one in Figure 6(a).

7 Conclusions

We presented a novel formulation for Nash equilibrium computation in multi-player games by introducing the Gradient-based Nikaido-Isoda (GNI) function. The GNI formulation for games allows individual players to locally improve their objectives using steepest descent while preserving local stability and convergence guarantees. We showed that the GNI function is a valid merit function for multi-player games and presented an approximate descent algorithm. We compared our method against several popular descent schemes on multiple game settings and found that it outperformed the other techniques in our experiments. Future research will explore the GNI method in stochastic settings, which may enable its application to GAN optimization.

References

Appendix A Residual Minimization

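Lemma 1 (in the main paper) also suggests another possible function for minimization, namely $R(x) = \frac{1}{2} \sum_{i=1}^{N} \|\nabla_i f_i(x)\|^2$. We can state a result that is analogous to Theorem 1.

Theorem 4.

The global minimizers of $R(x)$ are all first-order NE points, i.e., $R(x^*) = 0 \iff x^* \in \mathcal{S}^{SNP}$. If the individual functions $f_i$ are convex then the global minimizers of $R(x)$ are precisely the set $\mathcal{S}^{NE}$.

Denote by $F(x) = (\nabla_1 f_1(x); \dots; \nabla_N f_N(x))$ the vector function of the first-order stationary conditions for each of the players. So $R(x) = \frac{1}{2} \|F(x)\|^2$. The gradient of $R(x)$ is given by

$$\nabla R(x) = \nabla F(x)^\top F(x), \tag{25}$$

where $\nabla F(x)$ denotes the Jacobian of $F$. The Hessian of the function $R(x)$ is

$$\nabla^2 R(x) = \nabla F(x)^\top \nabla F(x) + \sum_{j=1}^{n} F_j(x) \nabla^2 F_j(x). \tag{26}$$

Consider the gradient descent iteration for minimizing $R(x)$ with stepsize $\rho > 0$:

$$x^{k+1} = x^k - \rho \nabla R(x^k). \tag{27}$$

We can state the following convergence result for the gradient descent iterations.

Theorem 5.

Suppose $\nabla R(x)$ is $L_R$-Lipschitz continuous. Let $0 < \rho \le 1/L_R$. Then, the sequence $\{x^k\}$ generated by (27) converges sublinearly to a first-order critical point of $R(x)$, i.e., $\min_{0 \le k \le K} \|\nabla R(x^k)\|^2 \le \frac{2 R(x^0)}{\rho (K+1)}$. If $\frac{1}{2} \|\nabla R(x)\|^2 \ge \mu R(x)$ holds for some $\mu > 0$, then the sequence $\{x^k\}$ converges linearly to a point $x^* \in \mathcal{S}^{SNP}$.

Proof.

From Lipschitz continuity of $\nabla R(x)$,

$$R(x^{k+1}) \le R(x^k) - \rho \Big(1 - \frac{\rho L_R}{2}\Big) \|\nabla R(x^k)\|^2 \le R(x^k) - \frac{\rho}{2} \|\nabla R(x^k)\|^2. \tag{28}$$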
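The residual-minimization alternative can also be sketched in a few lines. The example below is an illustration under assumptions: scalar players, the residual taken as half the sum of squared own-gradients, a finite-difference gradient of the residual, and the scalar bilinear game f1 = x1*x2, f2 = -x1*x2; none of the names come from the paper.

```python
def residual(own_grads, x):
    """R(x) = 0.5 * sum_i (d f_i / d x_i)^2 for scalar players."""
    return 0.5 * sum(g(x) ** 2 for g in own_grads)


def numeric_gradient(fun, x, h=1e-6):
    """Central finite-difference gradient of a scalar function of a list."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((fun(x_plus) - fun(x_minus)) / (2.0 * h))
    return grad


def residual_descent(own_grads, x0, rho, steps):
    """Plain gradient descent on the residual function."""
    x = list(x0)
    for _ in range(steps):
        g = numeric_gradient(lambda z: residual(own_grads, z), x)
        x = [xi - rho * gi for xi, gi in zip(x, g)]
    return x


# Own-gradients of the bilinear game: d f1 / d x1 = x2 and d f2 / d x2 = -x1.
own_grads = [lambda x: x[1], lambda x: -x[0]]
start = [1.0, 2.0]
end = residual_descent(own_grads, start, rho=0.5, steps=30)
r_start = residual(own_grads, start)
r_end = residual(own_grads, end)
```

Here the residual is 0.5 * (x1^2 + x2^2), so each step with rho = 0.5 halves the iterate and the residual shrinks geometrically toward zero at the stationary Nash point.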