1 Introduction
In this paper we consider methods for solving smooth unconstrained min-max optimization problems. In the most classical setting, a min-max objective has the form
$$\min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}^m} f(x, y),$$
where $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ is a smooth objective function with two inputs. The usual goal in such problems is to find a saddle point, also known as a min-max solution, which is a pair $(x^*, y^*)$ that satisfies
(1) $\quad f(x^*, y) \le f(x^*, y^*) \le f(x, y^*)$
for every $x$ and $y$. Min-max problems have a long history, going back at least as far as [Neu28]
, which formed the basis of much of modern game theory, and including a great deal of work in the 1950s when algorithms such as fictitious play were explored [Bro51, Rob51]. The convex-concave setting, where we assume $f(x, y)$ is convex in $x$ and concave in $y$, is a classic min-max problem with a number of different applications, such as solving constrained convex optimization problems. While a variety of tools have been developed for this setting, a very popular approach within the machine learning community has been the use of so-called no-regret algorithms [CBL06, Haz16]. This technique, which was originally developed by [Han57] and later emerged in the development of boosting [FS99], provides a simple computational method via repeated play: each of the inputs $x$ and $y$ is updated iteratively according to no-regret learning protocols, and one can prove that the average iterates converge to a min-max solution.

Recently, interest in min-max optimization has surged due to the enormous popularity of Generative Adversarial Networks (GANs), whose training involves solving a nonconvex min-max problem in which $x$ and $y$ correspond to the parameters of two different neural nets [GPAM14]. Due to the fundamentally nonconvex nature of this problem, it is infeasible to find a “global” solution of the min-max objective. Instead, the typical goal in GAN training is to find a local min-max, namely a pair $(x^*, y^*)$ that satisfies eq. (1) for all $(x, y)$ in some neighborhood of $(x^*, y^*)$. Moreover, iterate averaging is no longer desirable in the nonconvex setting: it lacks the theoretical guarantees present in the convex-concave setting, and in practice, averaging neural net parameters tends to hurt performance. Thus, even though GANs are often trained with no-regret algorithms such as gradient descent, the final neural net parameters used after training are the last iterates for $x$ and $y$, rather than the time-averaged iterates. As such, provable guarantees for GAN training must include last-iterate convergence properties.
In this paper, we focus on proving last-iterate convergence rates for min-max problems. Provable convergence rates are useful because they allow for quantitative comparison of different algorithms and can aid in choosing learning rates and architectures to ensure fast convergence in practice. Yet despite the extensive literature on convergence rates for convex optimization, very few last-iterate convergence rates have been proved for min-max problems. Standard analysis of no-regret algorithms says essentially nothing about last-iterate convergence. In fact, widely used no-regret algorithms, such as Simultaneous Gradient Descent/Ascent (SGDA), fail to converge even in the simple bilinear setting where $f(x, y) = x^\top A y$ for some arbitrary matrix $A$. SGDA provably cycles in continuous time and diverges in discrete time (see for example [DISZ18, MGN18]). In fact, the full range of Follow-The-Regularized-Leader (FTRL) algorithms provably do not converge in zero-sum games with interior equilibria [MPP18]. This occurs because the iterates of FTRL algorithms exhibit cyclic behavior, a phenomenon commonly observed when training GANs in practice as well.
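The divergence of SGDA in the bilinear setting is easy to reproduce. The following minimal sketch (the scalar objective $f(x, y) = xy$ and the parameter choices are our own illustrative assumptions, not from the paper) runs simultaneous descent/ascent and shows the iterates spiraling away from the unique saddle point at the origin:

```python
import numpy as np

# SGDA on the bilinear objective f(x, y) = x * y (scalar case).
# The unique saddle point is (0, 0), but the discrete-time iterates diverge.
def sgda(x0, y0, eta=0.1, steps=100):
    x, y = x0, y0
    for _ in range(steps):
        gx, gy = y, x                       # df/dx = y, df/dy = x
        x, y = x - eta * gx, y + eta * gy   # descent in x, ascent in y
    return x, y

x, y = sgda(1.0, 1.0)
print(np.hypot(x, y))  # larger than the initial distance sqrt(2): iterates spiral outward
```

Each step maps $(x, y)$ to $(x - \eta y,\, y + \eta x)$, which multiplies the distance to the origin by $\sqrt{1 + \eta^2} > 1$, so the divergence is geometric no matter how small the stepsize is.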
Existing work on last-iterate convergence rates has been limited to the bilinear or convex-strongly concave settings [Tse95, LS19, DH19, MOP19]. In particular, the following basic question is still open:
“What last-iterate convergence rates are achievable for convex-concave min-max problems?”
We give a partial answer to this question by proving linear last-iterate convergence rates for an algorithm called Hamiltonian Gradient Descent (HGD) under much weaker assumptions than previous results. HGD is gradient descent on the squared norm of the gradient, and it has been mentioned in [MNG17, BRM18]. Our results are the first to show nonasymptotic convergence of an efficient algorithm in settings that are not bilinear or strongly convex in either player. We show that HGD has linear convergence in settings where there are no spurious critical points, provided that a “sufficiently bilinear” condition on the second-order derivatives holds.¹ Our result implies that HGD achieves linear convergence in convex-concave settings that are “sufficiently bilinear,” which is surprising since general smooth convex-concave optimization cannot admit linear rates, due to lower bounds on smooth convex optimization [AH17, ASS17]. On the practical side, while vanilla HGD has issues training GANs in practice, [MNG17] show that a related algorithm known as Consensus Optimization (CO) can effectively train GANs in a variety of settings, including on CIFAR-10 and CelebA. We show that CO can be viewed as a perturbation of HGD, which implies that for some parameter settings, CO converges at the same rate as HGD.

¹The condition can be fulfilled by adding a large, well-conditioned bilinear term to any objective.
We begin in Section 2 with background material and notation, including some of our key assumptions. In Section 3, we discuss Hamiltonian Gradient Descent (HGD), and we present our linear convergence rates for HGD in various settings. In Section 4, we present some of the key technical components used to prove our results from Section 3. Finally, in Section 5, we present our results for Consensus Optimization. The details of our proofs are in Appendix F.
2 Background
2.1 Preliminaries
In this section, we discuss some key definitions and notation. We will use $\|\cdot\|$ to denote the Euclidean norm for vectors or the operator norm for matrices or tensors. For a symmetric matrix $M$, we will use $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ to denote the smallest and largest eigenvalues of $M$. For a general real matrix $A$, $\sigma_{\min}(A)$ and $\sigma_{\max}(A)$ denote the smallest and largest singular values of $A$.

Definition 2.1.
A critical point of $f$ is a point $z = (x, y)$ such that $\nabla f(z) = 0$.
Definition 2.2 (Convexity / Strong convexity).
Let $\mu \ge 0$. A function $g : \mathbb{R}^n \to \mathbb{R}$ is $\mu$-strongly convex if $g(w') \ge g(w) + \nabla g(w)^\top (w' - w) + \frac{\mu}{2}\|w' - w\|^2$ for any $w, w'$. When $g$ is twice-differentiable, $g$ is $\mu$-strongly convex iff $\nabla^2 g(w) \succeq \mu I$ for all $w$. If $\mu = 0$ in either of the above definitions, $g$ is called convex.
Definition 2.3 (Monotone / Strongly monotone).
Let $\mu \ge 0$. A vector field $v : \mathbb{R}^n \to \mathbb{R}^n$ is $\mu$-strongly monotone if $\langle v(w) - v(w'),\, w - w' \rangle \ge \mu \|w - w'\|^2$ for any $w, w'$. If $\mu = 0$, $v$ is called monotone.
Definition 2.4 (Smoothness).
A function $g$ is $L$-smooth if $g$ is differentiable everywhere and its gradient is $L$-Lipschitz, i.e. for all $w, w'$ it satisfies $\|\nabla g(w) - \nabla g(w')\| \le L \|w - w'\|$.
Notation
Since $f$ is a function of $x$ and $y$, we will often consider $x$ and $y$ to be components of one vector $z = (x, y)$. We will use superscripts to denote iterate indices. Following [BRM18], we use $\xi(z) = (\nabla_x f(x, y),\, -\nabla_y f(x, y))$ to denote the signed vector of partial derivatives. Under this notation, the Simultaneous Gradient Descent/Ascent (SGDA) algorithm with stepsize $\eta$ can be written as follows:
$$z^{t+1} = z^t - \eta\, \xi(z^t).$$
We will write the Jacobian of $\xi$ as:
$$J(z) = \begin{pmatrix} \nabla^2_{xx} f(z) & \nabla^2_{xy} f(z) \\ -\nabla^2_{yx} f(z) & -\nabla^2_{yy} f(z) \end{pmatrix}.$$
Note that unlike the Hessian in standard optimization, $J$ is not symmetric, due to the negative sign in the second block row. When clear from the context, we often omit dependence on $z$ when writing $\xi$, $J$, and other functions. Note that $\xi$ and $J$ are defined for a given objective $f$; we omit this dependence as well for notational clarity. We will always assume $f$ is sufficiently differentiable whenever we take derivatives. In particular, we assume second-order differentiability in Section 3.
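As a concrete illustration of this notation (the quadratic objective below is our own toy choice, not one from the paper), the following sketch computes $\xi$ for $f(x, y) = x^2 - y^2 + 3xy$ and recovers its Jacobian by finite differences, confirming that $J$ is not symmetric:

```python
import numpy as np

# Signed gradient xi = (df/dx, -df/dy) for the toy objective
# f(x, y) = x^2 - y^2 + 3xy, so xi(z) = (2x + 3y, 2y - 3x).
def xi(z):
    x, y = z
    return np.array([2 * x + 3 * y, 2 * y - 3 * x])

# Jacobian of xi by forward differences; analytically J = [[2, 3], [-3, 2]].
def jacobian(z, eps=1e-6):
    J = np.zeros((2, 2))
    for i in range(2):
        e = np.zeros(2)
        e[i] = eps
        J[:, i] = (xi(z + e) - xi(z)) / eps
    return J

J = jacobian(np.array([0.5, -1.0]))
print(J)  # approximately [[2, 3], [-3, 2]], which is not symmetric
```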
We will also use the following nonstandard definition for notational convenience:
Definition 2.5 (Higher-order Lipschitz).
A function $f$ is $(L_\xi, L_J)$-Lipschitz if $\|\xi(z) - \xi(z')\| \le L_\xi \|z - z'\|$ for all $z$ and $z'$, and $\|J(z) - J(z')\| \le L_J \|z - z'\|$ for all $z$ and $z'$.
We will consider a variety of settings for min-max optimization based on properties of the objective function $f$. In the convex-concave setting, $f$ is convex as a function of $x$ for any fixed $y$ and is concave as a function of $y$ for any fixed $x$. We can form analogous definitions by replacing the words “convex” and “concave” with words such as “strongly convex/concave”, “linear”, or “nonconvex”. The bilinear setting refers to the case when $f(x, y) = x^\top A y$ for some matrix $A$. The strongly monotone setting refers to the case when $\xi$ is a strongly monotone vector field, as in the case when $f$ is strongly convex-strongly concave.
2.2 Notions of convergence in min-max problems
The convergence rates in this paper will apply to min-max problems where $f$ satisfies the following assumption:
Assumption 2.6.
All critical points of the objective $f$ are global min-maxes (i.e. they satisfy eq. (1)).
In other words, we prove convergence rates to min-maxes in settings where convergence to critical points is necessary and sufficient for convergence to min-maxes. This assumption is true for convex-concave settings, but it also holds for some nonconvex-nonconcave settings, as we discuss in Appendix D.
We will measure the convergence of our algorithms to approximate critical points, defined as follows:
Definition 2.7.
Let $\epsilon > 0$. A point $z$ is an $\epsilon$-approximate critical point if $\|\xi(z)\| \le \epsilon$.
Convergence to approximate critical points is a common goal in standard convex and nonconvex optimization (see for example [AZH16, GL16, CHDS17]), as it is a necessary condition for convergence to local or global minima. For min-max optimization, it makes even more sense to use $\|\xi\|$ to measure how close we are to convergence, since the value of $f$ at a given point gives no information about how close we are to a min-max.
Our main convergence rate results focus on this first-order notion of convergence, which is sufficient given Assumption 2.6. We discuss notions of second-order convergence and ways to adapt our results to the general nonconvex setting in Appendix A.
2.3 Related work
Asymptotic and local convergence
Several recent papers have given asymptotic or local convergence results for min-max problems. [MLZ19] show that the extragradient (EG) algorithm converges asymptotically in a broad class of problems known as coherent saddle point problems, which include quasiconvex-quasiconcave problems. Quasiconvex functions are unimodal and have no local maxima, but they may have critical points that are not local minima. As such, coherent saddle point problems capture some settings for which Assumption 2.6 does not hold.
For more general smooth nonconvex min-max problems, a number of different papers have given local stability or local asymptotic convergence results for various algorithms [MNG17, DP18, BRM18, LFB19, MJS19]. We discuss this more in Appendix A.
Nonasymptotic convergence rates
Compared to the work on asymptotic convergence, the work on nonasymptotic last-iterate convergence rates has been limited to much more restrictive settings. A classic result by [Roc76] shows a linear convergence rate for the proximal point method in the bilinear and strongly convex-strongly concave cases. Another classic result, by [Tse95], shows a linear convergence rate for the extragradient algorithm in the bilinear case. [LS19] show that a number of algorithms achieve a linear convergence rate in the bilinear case, including Optimistic Mirror Descent (OMD) and Consensus Optimization (CO). They also show that SGDA obtains a linear convergence rate in the strongly convex-strongly concave case. [MOP19] show that OMD and EG obtain a linear rate for the strongly convex-strongly concave case, in addition to proving similar results for generalized versions of both algorithms. Finally, [DH19] show that SGDA achieves a linear convergence rate for a convex-strongly concave setting with a full column rank linear interaction term.² We make a brief comparison of our work to that of [DH19] for the convex-strongly concave setting in Appendix C.

²Specifically, they assume $f(x, y) = g(x) + x^\top A y - h(y)$, where $g$ is smooth and convex, $h$ is smooth and strongly convex, and $A$ has full column rank.
3 Hamiltonian Gradient Descent
Our main algorithm for finding saddle points of $f$ is called Hamiltonian Gradient Descent (HGD). HGD consists of performing gradient descent on a particular objective function that we refer to as the Hamiltonian, following the terminology of [BRM18].³ If we let $\xi(z) = (\nabla_x f,\, -\nabla_y f)$ be the vector of (appropriately signed) partial derivatives, then the Hamiltonian is:
$$\mathcal{H}(z) = \tfrac{1}{2}\|\xi(z)\|^2.$$

³We note that the function $\mathcal{H}$ is not the Hamiltonian in the sense of classical physics, as we do not use the symplectic structure in our analysis; rather, we only perform gradient descent on $\mathcal{H}$.
Since a critical point occurs when $\xi(z) = 0$, we can find an (approximate) critical point by finding an (approximate) minimizer of $\mathcal{H}$. Moreover, under Assumption 2.6, finding a critical point is equivalent to finding a saddle point. This motivates the HGD update procedure on $z$ with stepsize $\eta$:
(2) $\quad z^{t+1} = z^t - \eta\, \nabla\mathcal{H}(z^t)$
HGD has been mentioned in [MNG17, BRM18], and it strongly resembles the Consensus Optimization (CO) approach of [MNG17].
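To make the update concrete, here is a minimal sketch of HGD on the bilinear toy objective $f(x, y) = xy$ (our own illustrative example, with arbitrary parameter choices): here $\xi(z) = (y, -x)$, so $\mathcal{H}(z) = (x^2 + y^2)/2$ and $\nabla\mathcal{H}(z) = (x, y)$, and the iterates contract toward the saddle point at the origin, precisely where SGDA diverges.

```python
import numpy as np

# HGD on f(x, y) = x * y. Here xi(z) = (y, -x), so
# H(z) = ||xi(z)||^2 / 2 = (x^2 + y^2) / 2 and grad H(z) = J(z)^T xi(z) = (x, y).
def hgd(z0, eta=0.5, steps=100):
    z = np.array(z0, dtype=float)
    for _ in range(steps):
        x, y = z
        grad_H = np.array([x, y])
        z = z - eta * grad_H
    return z

z = hgd([1.0, 1.0])
print(np.linalg.norm(z))  # ~0: linear convergence to the saddle point (0, 0)
```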
The HGD update requires a Hessian-vector product because $\nabla\mathcal{H}(z) = J(z)^\top \xi(z)$, making HGD a second-order iterative scheme. However, Hessian-vector products are cheap to compute when the objective is defined by a neural net, taking only two gradient oracle calls [Pea94]. This makes the Hessian-vector product oracle a theoretically appealing primitive, and it has been used widely in the nonconvex optimization literature. Since Hessian-vector product oracles are feasible to compute for GANs, many recent algorithms for local min-max nonconvex optimization have also utilized Hessian-vector products [MNG17, BRM18, ADLH19, LFB19, MJS19].
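The “two gradient calls” claim can be illustrated with a finite-difference variant of the Hessian-vector product (the exact version in [Pea94] uses automatic differentiation; the forward-difference approximation below is a standard stand-in, and the quadratic test function is our own assumed example):

```python
import numpy as np

# Finite-difference Hessian-vector product:
#   Hv ~ (grad f(x + eps*v) - grad f(x)) / eps.
# Only two gradient evaluations; the Hessian is never formed explicitly.
def hvp(grad_f, x, v, eps=1e-6):
    return (grad_f(x + eps * v) - grad_f(x)) / eps

# Example: f(x) = 0.5 * x^T A x with symmetric A, so grad f(x) = A x and Hessian = A.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
grad_f = lambda x: A @ x

x = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])
print(hvp(grad_f, x, v))  # ~ A @ v = [3.0, 6.5]
```

Since the gradient in this example is linear, the forward difference is exact up to floating-point rounding; for general objectives the error is $O(\text{eps})$.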
To the best of our knowledge, previous work on last-iterate convergence rates has only focused on how algorithms perform in three particular cases: (a) when the objective is bilinear, (b) when $f$ is strongly convex-strongly concave, and (c) when $f$ is convex-strongly concave [Tse95, LS19, DH19, MOP19]. The existence of methods with provable finite-time guarantees for settings beyond the aforementioned has remained an open problem. This work is the first to show that an efficient algorithm, namely HGD, can achieve nonasymptotic convergence in settings that are not strongly convex or linear in either player.
3.1 Convergence Rates for HGD
We now state our main theorems for this paper, which show convergence to critical points. When Assumption 2.6 holds, we get convergence to min-maxes. All of our main results will use the following multi-part assumption:
Assumption 3.1.
Let $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$.

Assume a critical point $z^*$ for $f$ exists.

Assume $f$ is $(L_\xi, L_J)$-Lipschitz, and let $L_\mathcal{H}$ denote the resulting smoothness constant of the Hamiltonian $\mathcal{H}$ (see Lemma 4.4).
Our first theorem shows that HGD converges for the strongly convex-strongly concave case. Although simple, this result will help us demonstrate our analysis techniques.
Theorem 3.2.
Let Assumption 3.1 hold and let $f$ be $\mu$-strongly convex in $x$ and $\mu$-strongly concave in $y$. Then the HGD update procedure described in (2) with stepsize $\eta = 1/L_\mathcal{H}$ (where $L_\mathcal{H}$ is the smoothness constant of $\mathcal{H}$ from Lemma 4.4) starting from some $z^0$ will satisfy
$$\mathcal{H}(z^t) \le \left(1 - \frac{\mu^2}{L_\mathcal{H}}\right)^t \mathcal{H}(z^0).$$
Next, we show that HGD converges when $f$ is linear in one of its arguments and the cross derivative $\nabla^2_{xy} f$ is full rank. This setting allows a slightly tighter analysis compared to Theorem 3.4.
Theorem 3.3.
Let Assumption 3.1 hold, let $f$ be $L$-smooth in $x$ and linear in $y$, and assume the cross derivative $\nabla^2_{xy} f$ is full rank with all singular values at least $\gamma > 0$ for all $z$. Then the HGD update procedure described in (2) with stepsize $\eta = 1/L_\mathcal{H}$ starting from some $z^0$ will satisfy
$$\mathcal{H}(z^t) \le (1 - c\,\eta)^t\, \mathcal{H}(z^0),$$
where $c > 0$ is the PL constant of $\mathcal{H}$ from Lemma 4.9, which depends only on $\gamma$ and $L$.
Finally, we show our main result, which requires smoothness in both players and a large, well-conditioned cross-derivative.
Theorem 3.4.
Let Assumption 3.1 hold and let $f$ be $L$-smooth in $x$ and $L$-smooth in $y$. Let $\gamma > 0$ and $\Gamma \ge \gamma$, and assume the cross derivative $\nabla^2_{xy} f$ is full rank with all singular values lower bounded by $\gamma$ and upper bounded by $\Gamma$ for all $z$. Moreover, let the following “sufficiently bilinear” condition hold:
(3) 
Then the HGD update procedure described in (2) with stepsize $\eta = 1/L_\mathcal{H}$ starting from some $z^0$ will satisfy
(4) $\quad \mathcal{H}(z^t) \le (1 - c\,\eta)^t\, \mathcal{H}(z^0)$
where $c > 0$ is the PL constant of $\mathcal{H}$ from Lemma 4.10, which depends on $\gamma$, $\Gamma$, and $L$.
As discussed above, Theorem 3.4 provides the first last-iterate convergence rate that does not require strong convexity or linearity in either player. We even achieve linear convergence in convex-concave settings where eq. (3) holds, which is surprising since linear convergence is impossible in general for convex-concave settings due to lower bounds for convex optimization. Thus, the “sufficiently bilinear” condition eq. (3) is crucial for our linear rate. We give some explanations for this condition in the following section. In simple experiments for HGD on convex-concave and nonconvex-nonconcave objectives, the convergence rate speeds up when there is a larger bilinear component, as expected from our theoretical results. We show these experiments in Appendix H.
We can construct examples of objectives that satisfy the assumptions of Theorem 3.4 but are not strongly monotone or bilinear. One such example is $f(x, y) = g(x) + x^\top A y - h(y)$, where $g$ and $h$ are smooth convex functions and $A$ is a well-conditioned matrix with sufficiently large singular values. We discuss a simple example that is not convex-concave in Appendix D.
3.2 Explanation of “sufficiently bilinear” condition
In this section, we explain the “sufficiently bilinear” condition eq. (3). Suppose our objective is $f(x, y) = g(x, y) + c\, x^\top A y$ for a smooth function $g$. Then for sufficiently large values of $c$ (i.e. when $f$ has a large enough bilinear term), we see that $f$ satisfies eq. (3). To see this, note that in the worst case (i.e. when the eigenvalues of $\nabla^2_{xx} f$ and $\nabla^2_{yy} f$ are not bounded away from zero), condition eq. (3) requires the singular values of the cross derivative $\nabla^2_{xy} f$ to be large relative to the smoothness parameters. Let $\gamma$ and $\Gamma$ be lower and upper bounds on the singular values of $\nabla^2_{xy} f$. Scaling $c$ scales both $\gamma$ and $\Gamma$ proportionally, so taking $c$ large enough satisfies eq. (3).
We can understand this condition by appealing to an analogous min-max optimization setting. Suppose that we use SGDA on the objective $g(x, y) + \frac{\alpha}{2}\left(\|x\|^2 - \|y\|^2\right)$ for a smooth convex-concave $g$ with smoothness constant $L$. According to [LS19], SGDA will converge at a rate of roughly $(1 - \alpha^2/L^2)^t$.⁴ For $\alpha = 0$, SGDA will diverge in the worst case. For small $\alpha > 0$, we get linear convergence, but it will be slow because $L/\alpha$ is large (this can be thought of as a large condition number). Finally, for $\alpha$ close to $L$, we get fast linear convergence, since $1 - \alpha^2/L^2$ is close to 0. Thus, to get fast linear convergence it suffices to make the problem “sufficiently strongly convex-strongly concave” (or “sufficiently strongly monotone”).

⁴The actual rate is $(1 - \alpha^2/\beta^2)^t$, for some parameter $\beta$ that is at least $L$.
Theorem 3.4 and condition eq. (3) show that there exists another class of settings where we can achieve linear rates in the min-max setting. In our case, if we have an objective $f(x, y) = g(x, y) + c\, x^\top A y$ for a smooth function $g$, we will get linear convergence if $c$ is large enough and $A$ is well-conditioned, which ensures that the problem is “sufficiently bilinear.” Intuitively, it makes sense that the “sufficiently bilinear” setting allows a linear rate because the pure bilinear setting allows a linear rate.
Another way to understand condition eq. (3) is that it is a sufficient condition for the existence of a unique critical point in a general class of settings, as we show in the following lemma, which we prove in Appendix E.
Lemma 3.5.
Let $f(x, y) = g(x) + x^\top A y - h(y)$ where $g$ and $h$ are smooth. Moreover, assume that $\nabla^2 g(x)$ and $\nabla^2 h(y)$ each have a 0 eigenvalue for some $x$ and $y$. If eq. (3) holds, then $f$ has a unique critical point.
4 Proof sketches for HGD convergence rate results
In this section, we go over the key components of the proofs for our convergence rates from Section 3.1. Recall that the intuition behind HGD is that critical points (where $\xi(z) = 0$) are global minima of $\mathcal{H}$. On the other hand, there is no guarantee that $\mathcal{H}$ is a convex potential function, and a priori, one would not expect gradient descent on this potential to find a critical point. Nonetheless, we are able to show that in a variety of settings, $\mathcal{H}$ satisfies the PL condition, which allows HGD to have linear convergence. Proving this requires establishing properties of the singular values of $J$.
4.1 The Polyak-Łojasiewicz condition for the Hamiltonian
We begin by recalling the definition of the PL condition.
Definition 4.1 (Polyak-Łojasiewicz (PL) condition [Pol63, Loj63]).
A function $f$ with global minimum value $f^*$ satisfies the PL condition with constant $\alpha > 0$ if for all $x$, $\frac{1}{2}\|\nabla f(x)\|^2 \ge \alpha\,(f(x) - f^*)$.
The PL condition is well known to be one of the weakest conditions under which gradient methods achieve a linear convergence rate; see for example [KNS16]. We will show that $\mathcal{H}$ satisfies the PL condition, which allows us to use the following classic theorem.
Theorem 4.2 (Linear rate under PL [Pol63, Loj63]).
Let $f$ be $L$-smooth with global minimum value $f^*$, and suppose $f$ satisfies the PL condition with parameter $\alpha$. Then if we run gradient descent from $x^0$ with stepsize $\eta = 1/L$, we have: $f(x^t) - f^* \le \left(1 - \frac{\alpha}{L}\right)^t \left(f(x^0) - f^*\right)$.
For completeness, we provide the proof of Theorem 4.2 in Appendix B.
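As a quick numerical illustration of Theorem 4.2, gradient descent converges linearly on the standard nonconvex PL example $f(x) = x^2 + 3\sin^2(x)$ from [KNS16] (the starting point and iteration count below are arbitrary choices of ours):

```python
import numpy as np

# f(x) = x^2 + 3 sin^2(x) is nonconvex but satisfies the PL condition,
# so gradient descent converges linearly to the global minimum f* = 0 at x = 0.
f = lambda x: x**2 + 3 * np.sin(x)**2
grad = lambda x: 2 * x + 3 * np.sin(2 * x)

x = 3.0
eta = 1.0 / 8.0            # stepsize 1/L, since |f''(x)| = |2 + 6 cos(2x)| <= 8
vals = [f(x)]
for _ in range(60):
    x -= eta * grad(x)
    vals.append(f(x))
print(vals[-1])  # essentially 0, despite nonconvexity
```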
All of our results use Assumption 3.1, so we are guaranteed that $f$ has a critical point $z^*$. This implies that the global minimum of $\mathcal{H}$ is 0, which allows us to prove the following key lemma:
Lemma 4.3.
Assume we have a twice differentiable $f$ with associated $\xi$ and $J$. Let $\gamma > 0$. If $\sigma_{\min}(J(z)) \ge \gamma$ for every $z$, then $\mathcal{H}$ satisfies the PL condition with parameter $\gamma^2$.
Proof.
Consider the squared norm of the gradient of the Hamiltonian:
$$\|\nabla\mathcal{H}(z)\|^2 = \|J(z)^\top \xi(z)\|^2 \ge \sigma_{\min}(J(z))^2\, \|\xi(z)\|^2 \ge 2\gamma^2\, \mathcal{H}(z).$$
The proof is finished by noting that $\mathcal{H}(z^*) = 0$ when $z^*$ is a critical point. ∎
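The inequality above is easy to sanity-check numerically. In the quadratic game $f(x, y) = \frac{1}{2}x^2 - \frac{1}{2}y^2 + 2xy$ (a toy example of our own choosing), $\xi$ is linear with constant Jacobian $J$, and the PL bound $\|\nabla\mathcal{H}\|^2 \ge 2\gamma^2\mathcal{H}$ holds at every point:

```python
import numpy as np

# For f(x, y) = x^2/2 - y^2/2 + 2xy, we have xi(z) = (x + 2y, y - 2x),
# i.e. xi(z) = J z with the constant Jacobian J below, and sigma_min(J) = sqrt(5).
J = np.array([[1.0, 2.0], [-2.0, 1.0]])
gamma = np.linalg.svd(J, compute_uv=False).min()

rng = np.random.default_rng(0)
for _ in range(1000):
    z = rng.normal(size=2)
    xi = J @ z
    H = 0.5 * xi @ xi                # Hamiltonian H(z) = ||xi||^2 / 2
    grad_H = J.T @ xi                # grad H(z) = J^T xi
    assert grad_H @ grad_H >= 2 * gamma**2 * H - 1e-9  # PL bound of Lemma 4.3
print("PL bound verified at 1000 random points")
```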
To use Theorem 4.2, we will also need to show that $\mathcal{H}$ is smooth, which holds when $f$ is higher-order Lipschitz. The proof of Lemma 4.4 is in Appendix F.
Lemma 4.4.
Consider any $f$ which is $(L_\xi, L_J)$-Lipschitz for constants $L_\xi, L_J$. Then the Hamiltonian $\mathcal{H}$ is $L_\mathcal{H}$-smooth for a constant $L_\mathcal{H}$ depending on $L_\xi$ and $L_J$.
To use Lemma 4.3, we will need control over the eigenvalues of , which we achieve with the following linear algebra lemmas. We provide their proofs in Appendix F.
Lemma 4.5.
Let $J = \begin{pmatrix} A & B \\ -B^\top & -C \end{pmatrix}$ and let $\mu > 0$. If $A \succeq \mu I$ and $C \succeq \mu I$, then for all eigenvalues $\lambda$ of $J$, we have $\mathrm{Re}(\lambda) \ge \mu$.
Lemma 4.6.
Let $J = \begin{pmatrix} A & C \\ -C^\top & 0 \end{pmatrix}$, where $C$ is square and full rank. Then if $\lambda$ is an eigenvalue of $J$, we must have $\lambda \neq 0$.
4.2 Proof sketches for Theorems 3.2, 3.3, and 3.4
We now proceed to sketch the proofs of our main theorems using the techniques we have described. The following lemma shows that it suffices to prove the PL condition for $\mathcal{H}$ in the various settings of our theorems:
Lemma 4.7.
Given $f$, suppose $\mathcal{H}$ satisfies the PL condition with parameter $\alpha$ and is $L_\mathcal{H}$-smooth. Then if we update some $z^0$ using eq. (2) with stepsize $\eta = 1/L_\mathcal{H}$, then we have the following:
$$\mathcal{H}(z^t) \le \left(1 - \frac{\alpha}{L_\mathcal{H}}\right)^t \mathcal{H}(z^0).$$
Proof.
Since $\mathcal{H}$ satisfies the PL condition with parameter $\alpha$ and is $L_\mathcal{H}$-smooth, we know by Theorem 4.2 that gradient descent on $\mathcal{H}$ with stepsize $1/L_\mathcal{H}$ converges at a rate of $(1 - \alpha/L_\mathcal{H})^t$. Substituting in $\mathcal{H}^* = 0$ gives the lemma. ∎
It remains to show that $\mathcal{H}$ satisfies the PL condition in the settings of Theorems 3.2, 3.3, and 3.4. First, we show the result for the strongly convex-strongly concave setting of Theorem 3.2.
Lemma 4.8 (PL for the strongly convex-strongly concave setting).
Let $f$ be $\mu$-strongly convex in $x$ and $\mu$-strongly concave in $y$. Then $\mathcal{H}$ satisfies the PL condition with parameter $\mu^2$.
Proof.
The symmetric part of $J$ is $\begin{pmatrix} \nabla^2_{xx} f & 0 \\ 0 & -\nabla^2_{yy} f \end{pmatrix} \succeq \mu I$, so for any unit vector $v$ we have $\|J v\| \ge v^\top J v \ge \mu$, i.e. $\sigma_{\min}(J) \ge \mu$. Applying Lemma 4.3 gives the result. ∎
Next, we show that $\mathcal{H}$ satisfies the PL condition for the nonconvex-linear setting of Theorem 3.3. We prove this lemma in Section F.4 by using Lemma 4.6.
Lemma 4.9 (PL for the smooth nonconvex-linear setting).
Let $f$ be $L$-smooth in $x$ and linear in $y$. Moreover, for all $z$, let $\nabla^2_{xy} f$ be full rank and square with $\sigma_{\min}(\nabla^2_{xy} f) \ge \gamma > 0$. Then $\mathcal{H}$ satisfies the PL condition with a parameter depending only on $\gamma$ and $L$.
Finally, we prove that $\mathcal{H}$ satisfies the PL condition in the nonconvex-nonconvex setting of Theorem 3.4. The proof for Lemma 4.10 is in Section F.5, and it uses Lemma F.2, which is similar to Lemma 4.6.
Lemma 4.10 (PL for the smooth nonconvex-nonconvex setting).
Let $f$ be $L$-smooth in $x$ and $L$-smooth in $y$. Also, let $\nabla^2_{xy} f$ be full rank and let all of its singular values be lower bounded by $\gamma > 0$ and upper bounded by $\Gamma$ for all $z$. Assume the “sufficiently bilinear” condition eq. (3) holds. Then $\mathcal{H}$ satisfies the PL condition with a constant depending on $\gamma$, $\Gamma$, and $L$.
Combining Lemmas 4.8, 4.9, and 4.10 with Lemma 4.7 yields Theorems 3.2, 3.3, and 3.4.
5 Convergence rates for Consensus Optimization
The Consensus Optimization (CO) algorithm of [MNG17] is as follows:
(5) $\quad z^{t+1} = z^t - \eta\,\big(\xi(z^t) + \gamma\, \nabla\mathcal{H}(z^t)\big)$
where $\gamma > 0$. This is essentially a weighted combination of SGDA and HGD. [MNG17] remark that while HGD has poor performance on nonconvex problems in practice, CO can effectively train GANs in a variety of settings, including on CIFAR-10 and CelebA. While they frame CO as SGDA with a small modification, they actually set $\gamma$ to be fairly large for several of their experiments, which suggests that one can also view CO as a modified form of HGD.
Using this perspective, we prove Theorem 5.1, which implies that we get linear convergence of CO in the same settings as Theorems 3.2, 3.3, and 3.4, provided that $\gamma$ is sufficiently large (i.e. the HGD update is large compared to the SGDA update). The key technical component is showing that HGD still performs well even with a certain kind of small arbitrary perturbation. Previously, [LS19] proved that CO achieves linear convergence in the bilinear setting, so our result greatly expands the settings where CO has provable nonasymptotic convergence. We prove Theorem 5.1 in Appendix G.
Theorem 5.1.
Let Assumption 3.1 hold. Let $\mathcal{H}$ be $L_\mathcal{H}$-smooth and suppose $\mathcal{H}$ satisfies the PL condition with parameter $\alpha$. Then if we update some $z^0$ using the CO update eq. (5) with a suitable stepsize $\eta$ and a sufficiently large $\gamma$, we get the following convergence:
(6) $\quad \mathcal{H}(z^t) \le (1 - c)^t\, \mathcal{H}(z^0)$
for a constant $c > 0$ depending on $\alpha$, $L_\mathcal{H}$, and $\gamma$.
We also show that CO converges in practice on some simple examples in Appendix H.
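On the bilinear toy objective $f(x, y) = xy$ (our own illustrative example, with arbitrary parameter choices), the CO update converges even though its SGDA component alone would diverge, because the $\gamma\,\nabla\mathcal{H}$ term damps the cycling:

```python
import numpy as np

# Consensus Optimization on f(x, y) = x * y: a weighted combination of
# the SGDA direction xi = (y, -x) and the HGD direction grad H = (x, y).
def co(z0, eta=0.5, gamma=1.0, steps=100):
    z = np.array(z0, dtype=float)
    for _ in range(steps):
        x, y = z
        xi = np.array([y, -x])
        grad_H = np.array([x, y])
        z = z - eta * (xi + gamma * grad_H)
    return z

z = co([1.0, 1.0])
print(np.linalg.norm(z))  # ~0: converges to the saddle point (0, 0)
```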
References
 [ADLH19] Leonard Adolphs, Hadi Daneshmand, Aurelien Lucchi, and Thomas Hofmann. Local saddle point optimization: A curvature exploitation approach. In Artificial Intelligence and Statistics (AISTATS), 2019.
 [AH17] Naman Agarwal and Elad Hazan. Lower bounds for higherorder convex optimization. In Conference on Learning Theory (COLT), 2017.
 [ASS17] Yossi Arjevani, Ohad Shamir, and Ron Shiff. Oracle complexity of secondorder methods for smooth convex optimization. Mathematical Programming, pages 1–34, 2017.
 [AZH16] Zeyuan AllenZhu and Elad Hazan. Variance reduction for faster nonconvex optimization. In International Conference on Machine Learning (ICML), pages 699–707, 2016.
 [BRM18] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of nplayer differentiable games. In International Conference on Machine Learning (ICML), 2018.
 [Bro51] George W Brown. Iterative solution of games by fictitious play. Activity analysis of production and allocation, 13(1):374–376, 1951.
 [CBL06] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
 [CHDS17] Yair Carmon, Oliver Hinder, John C Duchi, and Aaron Sidford. “Convex until proven guilty”: Dimension-free acceleration of gradient descent on nonconvex functions. In International Conference on Machine Learning (ICML), 2017.
 [DH19] Simon S Du and Wei Hu. Linear convergence of the primaldual gradient method for convexconcave saddle point problems without strong convexity. In Artificial Intelligence and Statistics (AISTATS), 2019.
 [DISZ18] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training gans with optimism. In International Conference on Learning Representations (ICLR), 2018.
 [DP18] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in minmax optimization. In Advances in Neural Information Processing Systems (NeurIPS), pages 9255–9265, 2018.
 [FS99] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, October 1999.
 [GL16] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
 [GPAM14] Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), pages 2672–2680, 2014.
 [Han57] James Hannan. Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
 [Haz16] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
 [KNS16] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
 [LFB19] Alistair Letcher, Jakob Foerster, David Balduzzi, Tim Rocktäschel, and Shimon Whiteson. Stable opponent shaping in differentiable games. In International Conference on Learning Representations, 2019.
 [Loj63] S. Łojasiewicz. A topological property of real analytic subsets (in French). Coll. du CNRS, Les équations aux dérivées partielles, pages 87–89, 1963.
 [LS19] Tengyuan Liang and James Stokes. Interaction matters: A note on nonasymptotic local convergence of generative adversarial networks. Artificial Intelligence and Statistics (AISTATS), 2019.
 [MGN18] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? In International Conference on Machine Learning (ICML), pages 3478–3487, 2018.
 [MJS19] Eric V Mazumdar, Michael I Jordan, and S Shankar Sastry. On finding local nash equilibria (and only local nash equilibria) in zerosum games. arXiv preprint arXiv:1901.00838, 2019.
 [MLZ19] Panayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, ChuanSheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddlepoint problems: Going the extra(gradient) mile. In International Conference on Learning Representations (ICLR), 2019.
 [MNG17] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of GANs. In Advances in Neural Information Processing Systems (NeurIPS), pages 1825–1835, 2017.
 [MOP19] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extragradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprint arXiv:1901.08511, 2019.
 [MPP18] Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In Proceedings of the TwentyNinth Annual ACMSIAM Symposium on Discrete Algorithms (SODA), pages 2703–2717. SIAM, 2018.
 [Neu28] J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
 [Pea94] Barak A Pearlmutter. Fast exact multiplication by the hessian. Neural computation, 6(1):147–160, 1994.
 [Pol63] B. T. Polyak. Gradient methods for minimizing functionals (in Russian). Zh. Vychisl. Mat. Mat. Fiz., pages 643–653, 1963.
 [Rob51] Julia Robinson. An iterative method of solving a game. Annals of mathematics, pages 296–301, 1951.
 [Roc76] R Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM journal on control and optimization, 14(5):877–898, 1976.
 [Tse95] Paul Tseng. On linear convergence of iterative methods for the variational inequality problem. Journal of Computational and Applied Mathematics, 60(1-2):237–252, 1995.
Appendix A General nonconvex min-max optimization
In standard nonconvex optimization, a common goal is to find second-order local minima, which are approximate critical points where the Hessian $\nabla^2 f$ is approximately positive definite. Likewise, a common goal in nonconvex min-max optimization is to find approximate critical points where an analogous second-order condition holds, namely that $\nabla^2_{xx} f$ is approximately positive definite and $\nabla^2_{yy} f$ is approximately negative definite. Critical points where this second-order condition holds are called local min-maxes. When Assumption 2.6 holds, all critical points are global min-maxes, but in more general settings, we may encounter critical points that do not satisfy these conditions. Critical points may be local min-mins or max-mins or indefinite points. A number of recent papers have proposed dynamics for nonconvex min-max optimization, showing local stability or local asymptotic convergence results [MNG17, DP18, BRM18, LFB19, MJS19]. The key guarantee that these papers generally give is that their algorithms will be stable at local min-maxes and unstable at some set of undesirable critical points (such as local max-mins). This essentially amounts to a guarantee that in the convex-concave setting, their algorithms will converge asymptotically, and in the strictly concave-strictly convex setting (i.e. where there is only an undesirable max-min), their algorithms will diverge asymptotically. This type of local stability is essentially the best one can ask for in the general nonconvex setting, and we show how to give similar guarantees for our algorithm in Section A.1.
A.1 Nonconvex extensions for HGD
While the naive version of HGD will try to converge to all critical points, we can modify HGD slightly to achieve second-order stability guarantees as in various related work such as [BRM18, LFB19]. In particular, we consider modifying HGD so that there is some scalar $\lambda_t$ in front of the $\nabla\mathcal{H}$ term as follows:
(7) $z_{t+1} = z_t - \eta\,\lambda_t\,\nabla\mathcal{H}(z_t)$
We now present two ways to choose $\lambda_t$. Our first method is inspired by the Symplectic Gradient Adjustment algorithm of [BRM18], which is as follows:
(8) $\lambda_t = \operatorname{sign}\left(\langle \xi(z_t), \nabla\mathcal{H}(z_t)\rangle\right)$
where, writing $J$ for the Jacobian of $\xi$ with symmetric part $S$ and antisymmetric part $A$, we have $\nabla\mathcal{H} = J^\top \xi$ and hence $\langle \xi, \nabla\mathcal{H}\rangle = \langle \xi, S\xi\rangle$. [BRM18] show that this quantity is positive in a strictly convex-strictly concave region and negative in a strictly concave-strictly convex region. Thus, if we choose $\lambda_t$ in this way, we can ensure that the modified HGD will exhibit local stability around strict min-maxes and local instability around strict max-mins. This follows simply because we will do gradient descent on $\mathcal{H}$ in the first case and gradient ascent on $\mathcal{H}$ in the second case.
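A single step of this modified dynamic is simple to implement. The sketch below assumes the update $z_{t+1} = z_t - \eta\,\lambda_t\,\nabla\mathcal{H}(z_t)$ with $\lambda_t = \operatorname{sign}(\langle \xi, \nabla\mathcal{H}\rangle)$, one plausible reading of the scheme; the toy objective $f(x, y) = x^2 - y^2$ and all names are illustrative:

```python
import numpy as np

def xi(z):
    # xi(z) = (df/dx, -df/dy) for the toy objective f(x, y) = x**2 - y**2,
    # which is strictly convex-strictly concave with min-max at the origin.
    x, y = z
    return np.array([2.0 * x, 2.0 * y])

def grad_H(z, eps=1e-5):
    # Central finite differences on the Hamiltonian H(z) = 0.5 * ||xi(z)||^2.
    def H(w):
        v = xi(w)
        return 0.5 * (v @ v)
    g = np.zeros_like(z)
    for i in range(len(z)):
        e = np.zeros_like(z)
        e[i] = eps
        g[i] = (H(z + e) - H(z - e)) / (2.0 * eps)
    return g

def modified_hgd_step(z, eta=0.05):
    g = grad_H(z)
    lam = np.sign(xi(z) @ g)
    if lam == 0.0:
        lam = 1.0   # default to plain descent when the sign is undefined
    return z - eta * lam * g

z = np.array([1.0, -1.0])
for _ in range(200):
    z = modified_hgd_step(z)
# z is now very close to the min-max at the origin
```

On this strictly convex-strictly concave toy problem, $\lambda_t$ stays positive, so the dynamic reduces to plain HGD and contracts toward the origin.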
Another way to choose $\lambda_t$ involves using an approximate eigenvalue computation on $\nabla^2_{xx} f$ and $\nabla^2_{yy} f$ to detect whether $\nabla^2_{xx} f$ is positive semidefinite and $\nabla^2_{yy} f$ is negative semidefinite (which would mean we are in a convex-concave region). We set $\lambda_t$ positive if we are in a convex-concave region and negative otherwise, which will guarantee local stability around min-maxes and local instability around other critical points. This approximate eigenvalue computation can be done using a logarithmic number of Hessian-vector products.
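The semidefiniteness test only needs Hessian-vector products. The sketch below uses plain shifted power iteration rather than the Lanczos-type method that achieves the logarithmic product count mentioned above, so it needs more products but shows the idea; all names are illustrative:

```python
import numpy as np

def min_eigenvalue_estimate(hvp, dim, n_iters=300, shift=10.0, seed=0):
    """Estimate the smallest eigenvalue of a symmetric operator given only a
    matrix-vector-product oracle `hvp`. Power iteration on shift*I - A finds
    the eigenvalue of A farthest below `shift`, so `shift` should upper-bound
    the spectrum (e.g. a known smoothness constant)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        w = shift * v - hvp(v)
        v = w / np.linalg.norm(w)
    return v @ hvp(v)   # Rayleigh quotient at the converged eigenvector

# Example: a Hessian block with eigenvalues (3, 1, -2) is not PSD.
A = np.diag([3.0, 1.0, -2.0])
lam_min = min_eigenvalue_estimate(lambda v: A @ v, dim=3)
is_psd = lam_min >= -1e-6
# lam_min is close to -2.0, so is_psd is False
```

Running the same routine on $-\nabla^2_{yy} f$ tests negative semidefiniteness of the $yy$-block.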
Appendix B Proof of linear convergence rate under PL condition
Here we present a classic proof of Theorem 4.2.
Proof of Theorem 4.2.
Let $f$ be $L$-smooth and satisfy the $\mu$-PL condition $\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,(f(x) - f^*)$, and consider the gradient descent update $x_{t+1} = x_t - \frac{1}{L}\nabla f(x_t)$. Then
$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2 \le f(x_t) - \frac{\mu}{L}\left(f(x_t) - f^*\right),$
so that $f(x_{t+1}) - f^* \le \left(1 - \frac{\mu}{L}\right)(f(x_t) - f^*)$, where the first line comes from smoothness and the update rule for gradient descent, and the second inequality comes from the PL condition. Applying the last line recursively gives the result. ∎
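This contraction is easy to observe numerically. The sketch below runs gradient descent on the quadratic $f(x) = \frac{1}{2}x^\top A x$ with $A = \mathrm{diag}(1, 10)$, which is $L$-smooth with $L = 10$ and satisfies the PL condition with $\mu = 1$, and checks that the suboptimality gap shrinks by at least the factor $1 - \mu/L$ per step (the setup is an illustrative choice, not from the main text):

```python
import numpy as np

A = np.diag([1.0, 10.0])   # f(x) = 0.5 * x @ A @ x, minimized at 0 with f* = 0
L_smooth, mu = 10.0, 1.0   # smoothness and PL constants for this quadratic

def f(x):
    return 0.5 * (x @ A @ x)

x = np.array([1.0, 1.0])
gaps = [f(x)]
for _ in range(50):
    x = x - (1.0 / L_smooth) * (A @ x)   # gradient descent with step size 1/L
    gaps.append(f(x))

rate = 1.0 - mu / L_smooth
# Every step contracts the gap f(x_t) - f* by at least the factor (1 - mu/L).
contracts = all(g1 <= rate * g0 + 1e-12 for g0, g1 in zip(gaps, gaps[1:]))
```

For this PL (indeed strongly convex) objective the observed per-step contraction is even faster than the worst-case bound.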
Appendix C Comparison of Theorem 3.4 to [DH19]
In this section, we compare our results in Theorem 3.4 to those of [DH19]. [DH19] prove a rate for SGDA when $f$ is smooth and convex in $x$, smooth and strongly concave in $y$, and $\nabla^2_{xy} f$ is some fixed matrix $B$. The specific setting they consider is to find the unconstrained min-max for a function defined as $f(x, y) = g(x) + x^\top B y - h(y)$, where $g$ is convex and smooth, $h$ is strongly convex and smooth, and $B$ has full column rank.
Their rate uses the potential function $P_t$, where we have:
(9)  
(10)  
(11) 
where $(x^*, y^*)$ is the min-max for the objective. Their rate (Theorem 3.1 in [DH19]) is
(12) 
for some constant. To translate this rate into bounds on $\|\xi_t\|^2$, we can use the smoothness of $f$ in both of its arguments to note that $\|\nabla_x f(x_t, y_t)\| = \|\nabla_x f(x_t, y_t) - \nabla_x f(x^*, y^*)\| \le L\left(\|x_t - x^*\| + \|y_t - y^*\|\right)$ and likewise for $\nabla_y f$. So the rate on $P_t$ translates into a rate on $\|\xi_t\|^2$ with some additional factor in front.
Their rate and our rate are incomparable: neither is strictly better, and each dominates in a different regime of the problem parameters. While our convergence rate requires the sufficiently bilinear condition eq. 3 to hold, we do not require convexity in $x$ or concavity in $y$. Moreover, we allow $\nabla^2_{xy} f$ to change as long as the bounds on its singular values hold, whereas [DH19] require $\nabla^2_{xy} f$ to be a fixed matrix $B$.
Appendix D Nonconvex-nonconcave setting where Assumption 2.6 and the conditions for Theorem 3.4 hold
In this section we give a concrete example of a nonconvex-nonconcave setting where Assumption 2.6 and the conditions for Theorem 3.4 hold. We choose this example for simplicity, but one can easily come up with other, more complicated examples.
For our example, we define the following function:
(13) 
The first and second derivatives of $g$ are as follows:
(14) 
(15) 
From Figure 2, we can see that this function is neither convex nor concave.
Our objective will be . Note that because for all . Also, since .
First, we show that $f$ satisfies Assumption 3.1. We see that $f$ has a critical point at the origin. Moreover, the gradient and Hessian of $f$ are Lipschitz on any finite-sized region. Thus, if we assume our algorithm stays within a ball of some radius $R$, the Lipschitz assumption will be satisfied. Since our algorithm does not diverge and indeed converges at a linear rate to the min-max, this assumption is fairly mild.
Next, we show that $f$ satisfies the sufficiently bilinear condition eq. 3, which requires the cross term $\nabla^2_{xy} f$ to dominate the diagonal Hessian blocks. We see that this holds because of the bounds on the derivatives of $g$ computed above.
Therefore, the assumptions of Theorem 3.4 are satisfied.
We can also show that this objective satisfies Assumption 2.6, so we get convergence to the min-max of $f$. We will show that $f$ has only one critical point and that this critical point is a min-max. We first give a “proof by picture” below, showing a plot of $f$ in Figure 3, along with cross-sectional plots showing that the critical point is indeed a min-max.
We can also formally show that this is the unique critical point of $f$ and that it is a min-max. We prove this for completeness, although the calculations more or less amount to a simple case analysis. Let us look at the derivatives of $f$ with respect to $x$ and $y$:
(16) 
(17) 
Observe that the first-order conditions above constrain $y$ in terms of $x$, and $x$ in terms of $y$. We show that this implies that $f$ only has critical points where $x$ and $y$ are both in a bounded range.
Suppose $f$ had a critical point where $x$ lies above this range. Then the first-order condition for $x$ forces $y$ into the bounded range, and the observation above then forces $x$ back into the bounded range, a contradiction.
Next, suppose $f$ had a critical point where $x$ lies below this range. The same argument applies symmetrically and again yields a contradiction.
From this we see that any critical point of $f$ must have $x$ in the bounded range, and we can make analogous arguments to show that any critical point of $f$ must have $y$ in the bounded range as well.
From this, we can conclude that all critical points of $f$ must satisfy the following:
(18)  
(19) 
These equations imply the following:
(20)  
(21)  
(22)  
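The qualitative claims of this appendix can be reproduced numerically with a stand-in objective of the same flavor. In the sketch below, the scalar function $g(z) = z^2 + 2\sin^2 z$ and the coupling constant $10$ are hypothetical illustrative choices, not the paper's example: this $g$ is neither convex nor concave (its second derivative changes sign), yet the strong cross term in $f(x, y) = g(x) + 10xy - g(y)$ leaves a unique critical point at the origin, which is a local min-max.

```python
import numpy as np

# Hypothetical stand-in objective: f(x, y) = g(x) + 10*x*y - g(y)
# with g(z) = z**2 + 2*sin(z)**2 (illustrative, not the paper's g).

def gp(z):
    # g'(z) = 2z + 2*sin(2z)
    return 2.0 * z + 2.0 * np.sin(2.0 * z)

def gpp(z):
    # g''(z) = 2 + 4*cos(2z), which changes sign, so g is nonconvex
    return 2.0 + 4.0 * np.cos(2.0 * z)

def xi(x, y):
    # xi = (df/dx, -df/dy) = (g'(x) + 10y, -(10x - g'(y)))
    return np.array([gp(x) + 10.0 * y, -(10.0 * x - gp(y))])

# xi vanishes at the origin, and the Hessian blocks there have the right
# signs for a local min-max: d^2 f/dx^2 = g''(0) > 0, d^2 f/dy^2 = -g''(0) < 0.
hxx, hyy = gpp(0.0), -gpp(0.0)

# Grid search: away from a small box around the origin, ||xi|| stays bounded
# away from zero, supporting uniqueness of the critical point.
grid = np.linspace(-3.0, 3.0, 121)
min_norm_far = min(
    np.linalg.norm(xi(x, y))
    for x in grid for y in grid
    if max(abs(x), abs(y)) > 0.3
)
```

The grid search is a heuristic check rather than a proof; the contraction-style case analysis in the text plays that role.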