1 Introduction
We consider the following minimax optimization problem:
(1.1) $\min_{\mathbf{x}} \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y}),$
where $f$ is a smooth function, possibly nonconvex in $\mathbf{x}$, and $\mathcal{Y}$ is a convex set. Since von Neumann's pioneering work in 1928 [36], the problem of finding a solution to problem (1.1) has been a major endeavor in mathematics, economics and computer science [4, 37, 46]. In recent years minimax optimization theory has begun to see applications in machine learning, including adversarial learning [15, 27], statistical learning [6, 47, 1, 13], certification of robustness in deep learning [43] and distributed computing [42, 28]. On the other hand, real-world machine-learning systems are often embedded in larger economic markets and subject to game-theoretic constraints [21].

The most widely used, and seemingly the simplest, algorithm for solving problem (1.1) is a natural generalization of gradient descent (GD). Known as gradient descent ascent (GDA), it alternates between gradient descent on the variable $\mathbf{x}$ and gradient ascent on the variable $\mathbf{y}$. There is a vast literature that applies GDA, and stochastic variants of GDA (SGDA), to problems of the form (1.1) [15, 27, 43]. However, the theoretical understanding of these algorithms is still fairly limited. In particular, most of the asymptotic and nonasymptotic convergence results [22, 7, 33, 34, 11] are established for the special case of convex-concave problems (1.1), in which $f$ is convex in $\mathbf{x}$ and concave in $\mathbf{y}$. Unlike the convex-concave case, for which the behavior of GDA has been investigated quite thoroughly, the issue of the convergence of GDA remains largely open in the general setting. More specifically, there is no shortage of work highlighting that GDA can converge to limit cycles or even diverge in a game-theoretic setting [5, 19, 9, 31]. Despite recent progress on solving general minimax optimization problems via a range of techniques [32, 8, 18, 2, 24, 30, 29], it remains unclear why GDA and SGDA often work well in various applications in which the objective is not convex-concave.
The following general structure arises in many applications: $f(\mathbf{x}, \cdot)$ is concave for any $\mathbf{x}$ and $\mathcal{Y}$ is a bounded set. For example, consider the problem of certifying robustness in deep learning [43]. Training a model is basically a nonconvex minimization problem, $\min_{\mathbf{x}} \sum_i \ell(\mathbf{x}; \mathbf{z}_i)$, where the loss function $\ell$ refers to a neural network evaluated over data samples $\mathbf{z}_i$. Since neural networks are vulnerable to adversarial examples [16], it is necessary to develop efficient procedures with rigorous guarantees for small to moderate amounts of robustness. An example of such a scheme, involving the solution of a nonconvex-strongly-concave minimax problem, is presented in [43]. A second example is robust learning from multiple distributions [27]. Given multiple empirical distributions drawn from an underlying true distribution, the goal is to introduce robustness by minimizing the maximum of the expected losses over these distributions. This problem can also be posed as a nonconvex-concave minimax problem.

Despite the popularity of GDA and SGDA in practice, few results have been established on their efficiency beyond the convex-concave setting. Thus, a natural question arises:
Are GDA and SGDA provably efficient for solving nonconvex-concave minimax problems?
This paper presents an affirmative answer to this question. In particular, we first characterize stationarity conditions for nonconvex-strongly-concave and nonconvex-concave minimax problems, respectively. For nonconvex-strongly-concave problems, GDA and SGDA return an $\epsilon$-stationary point within $O(\kappa^2\epsilon^{-2})$ gradient evaluations and $O(\kappa^3\epsilon^{-4})$ stochastic gradient evaluations, where $\kappa$ is a condition number. For nonconvex-concave problems, GDA and SGDA return an $\epsilon$-stationary point within $\widetilde{O}(\epsilon^{-6})$ gradient evaluations and $\widetilde{O}(\epsilon^{-8})$ stochastic gradient evaluations.
Technically, the concavity of $f(\mathbf{x}, \cdot)$ makes it computationally feasible to find the corresponding global maximum, $\max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$, for any $\mathbf{x}$. A straightforward way to solve nonconvex-concave minimax problems is thus a nested-loop variant of GDA, which approximately maximizes $f(\mathbf{x}_t, \cdot)$ for every iterate $\mathbf{x}_t$. We refer to this as gradient descent with a max-oracle (GDmax), and realize the max-oracle by gradient ascent on $\mathbf{y}$. We use GDmax and stochastic GDmax (SGDmax) as the baseline approaches in this paper; see Table 1 for the details.
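To make the nested-loop structure concrete, the following is a minimal Python sketch of GDmax; the toy objective, projection radius, step sizes and iteration counts are illustrative assumptions for this example, not the paper's pseudocode or constants.

```python
import numpy as np

def gdmax(grad_x, grad_y, x0, y0, eta_x=0.05, eta_y=0.1,
          outer_iters=200, inner_iters=50, y_bound=10.0):
    """Gradient descent with a max-oracle (GDmax): for each iterate x_t,
    approximately solve max_y f(x_t, y) by inner (projected) gradient
    ascent, then take one descent step on x at the returned y."""
    x, y = float(x0), float(y0)
    for _ in range(outer_iters):
        # Max-oracle: inner gradient ascent on the strongly concave y-problem.
        for _ in range(inner_iters):
            y = np.clip(y + eta_y * grad_y(x, y), -y_bound, y_bound)
        # One descent step on x at the (approximate) maximizer y.
        x = x - eta_x * grad_x(x, y)
    return x, y

# Toy nonconvex-strongly-concave objective (illustrative only):
# f(x, y) = -cos(x) + x*y - y^2, nonconvex in x, 2-strongly concave in y.
grad_x = lambda x, y: np.sin(x) + y
grad_y = lambda x, y: x - 2.0 * y

x, y = gdmax(grad_x, grad_y, x0=1.0, y0=0.0)
```

For this toy objective $\mathbf{y}^\star(x) = x/2$ and $\Phi(x) = -\cos(x) + x^2/4$, so the outer loop drives $x$ toward the stationary point at $0$.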
The complexity analysis for GDmax and SGDmax can be decomposed into two parts: the number of gradient evaluations required by a max-oracle, and the number of iterations required to find a stationary point of $\Phi(\cdot) := \max_{\mathbf{y} \in \mathcal{Y}} f(\cdot, \mathbf{y})$. In contrast, the complexity analysis for GDA and SGDA is more challenging since the iterate $\mathbf{y}_t$ is not necessarily guaranteed to be close to $\mathbf{y}^\star(\mathbf{x}_t)$, so it becomes less clear why $\nabla_{\mathbf{x}} f(\mathbf{x}_t, \mathbf{y}_t)$ is a reasonable direction to follow. In response, we develop techniques for analyzing concave optimization with a slowly changing objective over the course of optimization, which may be of independent interest.
Table 1: Oracle complexity for finding an $\epsilon$-stationary point ($\widetilde{O}(\cdot)$ hides logarithmic factors).
Nonconvex-Strongly-Concave:  GDA: $O(\kappa^2\epsilon^{-2})$    GDmax: $\widetilde{O}(\kappa^2\epsilon^{-2})$
                             SGDA: $O(\kappa^3\epsilon^{-4})$   SGDmax: $\widetilde{O}(\kappa^3\epsilon^{-4})$
Nonconvex-Concave:           GDA: $\widetilde{O}(\epsilon^{-6})$    GDmax: $\widetilde{O}(\epsilon^{-6})$
                             SGDA: $\widetilde{O}(\epsilon^{-8})$   SGDmax: $\widetilde{O}(\epsilon^{-8})$
Related work: Historically, an early concrete instantiation of problem (1.1) involved computing a pair of probability vectors $(\mathbf{p}, \mathbf{q})$, or equivalently solving $\min_{\mathbf{p} \in \Delta_m} \max_{\mathbf{q} \in \Delta_n} \mathbf{p}^\top A \mathbf{q}$ for a matrix $A$ and probability simplices $\Delta_m$ and $\Delta_n$. This so-called bilinear minimax problem, together with von Neumann's minimax theorem [36], was a cornerstone in the development of game theory. A general algorithmic schema was developed for solving this problem in which the min and max players each run a simple learning procedure in tandem; e.g., fictitious play [39]. Later, Sion [44] generalized von Neumann's result from bilinear functions to general convex-concave functions, triggering a line of algorithmic research on convex-concave minimax optimization, in continuous time [23, 8] and discrete time [45, 14, 22, 34, 33]. Unfortunately, the techniques in these works rely heavily on the convex-concave structure and cannot be extended to nonconvex-concave minimax problems.

During the past decade, the study of general minimax problems has become a central topic in machine learning, inspired in part by the advent of adversarial learning [15, 27, 43]. Most recent work has focused on reducing oscillations and speeding up convergence of the gradient dynamics; see, e.g., consensus optimization [32], two-timescale GDA [18], the symplectic gradient [3], the linear-transformed gradient [2], optimistic mirror descent [30, 24], the inexact proximal point algorithm [25] and the two-timescale algorithm [29]. Despite empirical successes in real applications, the existing convergence analysis of these methods is still limited: all of the existing global convergence results are asymptotic and require strong conditions on the problem structure.

Nonconvex-concave minimax problems appear to be a tractable class of problems of the form (1.1) and have emerged as a focus in optimization and machine learning [43, 38, 41, 17, 26]. Grnarova et al. [17] proposed a variant of GDA for nonconvex-concave problems but did not provide theoretical guarantees for it. Rafique et al. [38] proposed a proximally guided stochastic mirror descent (PGSMD) method and proved that it finds an approximate stationary point of $\Phi$. However, PGSMD is a nested-loop algorithm, and thus relatively complex to implement; one would like to know whether the nested-loop structure is necessary or whether GDA, a single-loop algorithm, can also be shown to converge in the nonconvex-strongly-concave setting. Such a convergence result has been established in the special case where $f(\mathbf{x}, \cdot)$ is a linear function [38]. Lu et al. [26] analyzed a variant of GDA for nonconvex-concave problems and provided a theoretical guarantee under a slightly different setting and a different notion of optimality. A class of inexact nonconvex SGD algorithms [43, 41] can be categorized as variants of one of the algorithms that we analyze here (SGDmax in Algorithm 2). We provide a theoretical guarantee for such algorithms in the general nonconvex-concave case.
2 Preliminaries
Notation. We use bold lowercase letters to denote vectors, as in $\mathbf{x}$. We use $\|\cdot\|$ to denote the $\ell_2$-norm of vectors and the spectral norm of matrices. For a function $f$, $\partial f(\mathbf{x})$ denotes the subdifferential of $f$ at $\mathbf{x}$. If $f$ is differentiable, then $\partial f(\mathbf{x}) = \{\nabla f(\mathbf{x})\}$, where $\nabla f(\mathbf{x})$ denotes the gradient of $f$ at $\mathbf{x}$, and $\nabla_{\mathbf{x}} f(\mathbf{x}, \mathbf{y})$ denotes the partial gradient of $f$ with respect to $\mathbf{x}$ at $(\mathbf{x}, \mathbf{y})$. For a symmetric matrix $A$, we denote the largest and smallest eigenvalues of $A$ as $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$. We use calligraphic uppercase letters to denote sets, as in $\mathcal{Y}$.

Before presenting the objectives in nonconvex-concave minimax optimization, we first describe some standard definitions.
Definition 2.1
A function $f$ is $L$-Lipschitz if for all $\mathbf{x}, \mathbf{x}'$, we have $\|f(\mathbf{x}) - f(\mathbf{x}')\| \le L\|\mathbf{x} - \mathbf{x}'\|$.
Definition 2.2
A function $f$ is $\ell$-gradient Lipschitz if for all $\mathbf{x}, \mathbf{x}'$, we have $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{x}')\| \le \ell\|\mathbf{x} - \mathbf{x}'\|$.
Intuitively, a function being Lipschitz means that the function values at two nearby points must also be close; a function being gradient Lipschitz means that the gradients at two nearby points must also be close. Recall that the minimax problem (1.1) is equivalent to the following minimization problem:
(2.1) $\min_{\mathbf{x}} \Phi(\mathbf{x}), \quad \text{where} \quad \Phi(\mathbf{x}) := \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y}).$
In this paper, we study the special case where $f(\mathbf{x}, \cdot)$ is either concave or strongly concave, so that the maximization problem can be solved efficiently for any $\mathbf{x}$. However, since $\Phi$ is a nonconvex function, it is NP-hard to find its global minimum in general, even in the idealized setting in which the maximizer $\mathbf{y}^\star(\mathbf{x})$ can be computed for free for any $\mathbf{x}$.
Objectives in this paper.
We begin by specifying a notion of a local surrogate for a global minimum.
Definition 2.3
We call $\mathbf{x}$ an $\epsilon$-stationary point ($\epsilon \ge 0$) of a differentiable function $\Phi$ if $\|\nabla \Phi(\mathbf{x})\| \le \epsilon$. If $\epsilon = 0$, then $\mathbf{x}$ is called a stationary point.
Unfortunately, even if $f$ is Lipschitz and gradient Lipschitz, $\Phi$ need not be differentiable. A weaker condition that is sufficient for the purpose of our paper is the following notion of weak convexity.
Definition 2.4
A function $\Phi$ is $\ell$-weakly convex if the function $\Phi(\cdot) + \frac{\ell}{2}\|\cdot\|^2$ is convex.
In particular, when $\Phi$ is twice differentiable, $\Phi$ is $\ell$-gradient Lipschitz if and only if all the eigenvalues of its Hessian are upper and lower bounded by $\ell$ and $-\ell$, while $\Phi$ is $\ell$-weakly convex if and only if all the eigenvalues of its Hessian are lower bounded by $-\ell$.
For any $\ell$-weakly convex function $\Phi$, its subdifferential can be uniquely determined by the subdifferential of the convex function $\Phi(\cdot) + \frac{\ell}{2}\|\cdot\|^2$. A naive measure of approximate stationarity is a point $\mathbf{x}$ such that at least one subgradient is small: $\min_{\xi \in \partial \Phi(\mathbf{x})} \|\xi\| \le \epsilon$.
However, this criterion can be very restrictive when optimizing nonsmooth functions. For example, when $\Phi(x) = |x|$ is a one-dimensional function, an approximate stationary point under this notion must be exactly $x = 0$ for any $\epsilon < 1$. This means that finding an approximate stationary point under this notion is as difficult as solving the minimization problem exactly. An alternative criterion, based on the Moreau envelope of $\Phi$, has become standard when $\Phi$ is weakly convex [10].
Definition 2.5
A function $\Phi_\lambda$ is the Moreau envelope of $\Phi$ with parameter $\lambda > 0$ if $\Phi_\lambda(\mathbf{x}) := \min_{\mathbf{w}} \big\{ \Phi(\mathbf{w}) + \frac{1}{2\lambda}\|\mathbf{w} - \mathbf{x}\|^2 \big\}$ for any $\mathbf{x}$.
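As a concrete instance of Definition 2.5 (a standard computation, not taken from this paper), the Moreau envelope of the one-dimensional function $\Phi(x) = |x|$ is the Huber function:

```latex
\Phi_{\lambda}(x)
= \min_{w \in \mathbb{R}} \Big\{ |w| + \tfrac{1}{2\lambda}(w - x)^2 \Big\}
= \begin{cases}
\dfrac{x^2}{2\lambda}, & |x| \le \lambda,\\[1ex]
|x| - \dfrac{\lambda}{2}, & |x| > \lambda,
\end{cases}
\qquad
\nabla \Phi_{\lambda}(x)
= \begin{cases}
x/\lambda, & |x| \le \lambda,\\
\operatorname{sign}(x), & |x| > \lambda.
\end{cases}
```

In particular, every $x$ with $|x| \le \lambda\epsilon$ satisfies $\|\nabla \Phi_\lambda(x)\| \le \epsilon$, so the Moreau-envelope criterion accepts a small neighborhood of the minimizer rather than only the single point $x = 0$.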
Lemma 2.6
If $f$ is $\ell$-gradient Lipschitz and $\mathcal{Y}$ is bounded, then the Moreau envelope $\Phi_{1/2\ell}$ is differentiable and gradient Lipschitz, and the inner objective $\Phi(\mathbf{w}) + \ell\|\mathbf{w} - \mathbf{x}\|^2$ defining it is strongly convex in $\mathbf{w}$.
An $\epsilon$-stationary point of an $\ell$-weakly convex function can thus alternatively be defined as a point where the gradient of the Moreau envelope is small.
Definition 2.7
We call $\mathbf{x}$ an $\epsilon$-stationary point ($\epsilon \ge 0$) of an $\ell$-weakly convex function $\Phi$ if $\|\nabla \Phi_{1/2\ell}(\mathbf{x})\| \le \epsilon$. If $\epsilon = 0$, then $\mathbf{x}$ is called a stationary point.
We can also express Definition 2.7 in terms of the original function $\Phi$.
Lemma 2.8
If $\mathbf{x}$ is an $\epsilon$-stationary point of an $\ell$-weakly convex function $\Phi$ (Definition 2.7), then there exists $\hat{\mathbf{x}}$ such that $\min_{\xi \in \partial \Phi(\hat{\mathbf{x}})} \|\xi\| \le \epsilon$ and $\|\mathbf{x} - \hat{\mathbf{x}}\| \le \epsilon/2\ell$.
Lemma 2.8 shows that an $\epsilon$-stationary point defined via the Moreau envelope can be interpreted as a relaxation of the subgradient-based notion. More specifically, if $\mathbf{x}$ is an $\epsilon$-stationary point of an $\ell$-weakly convex function $\Phi$, then it is close to a point $\hat{\mathbf{x}}$ that has at least one small subgradient.
3 Main Results
In this section, we establish nonasymptotic convergence rates for GD with a max-oracle (GDmax), SGD with a max-oracle (SGDmax), GDA and SGDA, for nonconvex-strongly-concave and nonconvex-concave minimax problems.
We present pseudocode for GDmax and SGDmax in Algorithms 1 and 2. Given a target accuracy, the max-oracle approximately solves $\max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}_t, \mathbf{y})$ at each iteration. Although GDmax and SGDmax are easier to understand, they have two disadvantages relative to GDA and SGDA: (1) both are nested-loop algorithms, and since it is difficult to predetermine the number of iterations for the inner loop, they are complex to implement in practice; (2) in the general setting where $f(\mathbf{x}, \cdot)$ is nonconcave, GDmax and SGDmax are inapplicable since we cannot efficiently find a global maximizer. In contrast, GDA and SGDA are single-loop algorithms; see Algorithms 3 and 4.
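For concreteness, a minimal Python sketch of the single-loop (projected) GDA update with a two-timescale step-size pair follows; the toy objective, projection radius, and all constants are illustrative assumptions for this sketch, not the paper's Algorithm 3.

```python
import numpy as np

def gda(grad_x, grad_y, x0, y0, eta_x, eta_y, iters, y_bound=10.0):
    """Single-loop GDA: one descent step on x and one projected ascent
    step on y per iteration, with y moving on the faster timescale."""
    x, y = float(x0), float(y0)
    for _ in range(iters):
        gx, gy = grad_x(x, y), grad_y(x, y)  # gradients at the current pair (x_t, y_t)
        x = x - eta_x * gx                   # slow descent on x
        y = np.clip(y + eta_y * gy, -y_bound, y_bound)  # fast projected ascent on y
    return x, y

# Toy nonconvex-strongly-concave objective (illustrative only):
# f(x, y) = -cos(x) + x*y - y^2, with y*(x) = x/2 and Phi(x) = -cos(x) + x^2/4.
grad_x = lambda x, y: np.sin(x) + y
grad_y = lambda x, y: x - 2.0 * y

x, y = gda(grad_x, grad_y, x0=1.0, y0=0.0, eta_x=0.01, eta_y=0.2, iters=3000)
```

Because $\eta_{\mathbf{x}} \ll \eta_{\mathbf{y}}$, the inner variable $y$ tracks $y^\star(x) = x/2$ while $x$ drifts slowly toward the stationary point at $0$; note that, unlike GDmax, no inner loop is needed.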
For the stochastic gradient algorithms, we assume that the stochastic gradient oracle satisfies the following condition.
Assumption 3.1
The stochastic gradient $G(\mathbf{x}, \mathbf{y}, \xi)$ is unbiased with bounded variance: $\mathbb{E}[G(\mathbf{x}, \mathbf{y}, \xi)] = \nabla f(\mathbf{x}, \mathbf{y})$ and $\mathbb{E}\|G(\mathbf{x}, \mathbf{y}, \xi) - \nabla f(\mathbf{x}, \mathbf{y})\|^2 \le \sigma^2$.
3.1 Nonconvex-Strongly-Concave Minimax Problems
In this subsection, we present the results for the nonconvex-strongly-concave minimax problem. We make the following assumption.
Assumption 3.2
The objective function and constraint set pair $(f, \mathcal{Y})$ satisfy:
1. $f$ is $\ell$-gradient Lipschitz and $f(\mathbf{x}, \cdot)$ is $\mu$-strongly concave for each $\mathbf{x}$;
2. $\mathcal{Y}$ is a convex and bounded set with diameter $D > 0$.
While the gradient-Lipschitz assumption is standard in the optimization literature, strong concavity is crucial here, along with the boundedness of $\mathcal{Y}$, in allowing for an efficient solution of the inner maximization problem $\max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$. We let $\kappa := \ell/\mu$ denote the condition number of the problem throughout this section. The following structural lemma provides further information about $\Phi$ in the strongly-concave setting.
Lemma 3.3
Under Assumption 3.2, $\Phi(\cdot) = \max_{\mathbf{y} \in \mathcal{Y}} f(\cdot, \mathbf{y})$ is gradient Lipschitz and $\mathbf{y}^\star(\cdot) = \mathrm{argmax}_{\mathbf{y} \in \mathcal{Y}} f(\cdot, \mathbf{y})$ is $\kappa$-Lipschitz.
The target is to find an $\epsilon$-stationary point of $\Phi$ (cf. Definition 2.3) given only gradient (or stochastic gradient) access to $f$. Denoting $\Delta_\Phi := \Phi(\mathbf{x}_0) - \min_{\mathbf{x}} \Phi(\mathbf{x})$, we have the following complexity bound for GDmax.
Theorem 3.4 (Complexity Bound for GDmax)
Under Assumption 3.2, with appropriate choices of the step size $\eta_{\mathbf{x}} = \Theta(1/\kappa\ell)$ and of the tolerance for the max-oracle, the number of outer iterations required by Algorithm 1 to return an $\epsilon$-stationary point is bounded by $O(\kappa\ell\Delta_\Phi\epsilon^{-2})$. Furthermore, an accurate max-oracle can be realized by gradient ascent (GA) with step size $\Theta(1/\ell)$ run for $O(\kappa\log(\kappa/\epsilon))$ iterations, which gives the total gradient complexity of the algorithm: $\widetilde{O}(\kappa^2\ell\Delta_\Phi\epsilon^{-2})$.
Theorem 3.4 shows that if we alternate between one step of gradient descent on $\mathbf{x}$ and $\widetilde{O}(\kappa)$ steps of gradient ascent on $\mathbf{y}$, with a proper pair of learning rates $(\eta_{\mathbf{x}}, \eta_{\mathbf{y}})$, we can find an $\epsilon$-stationary point of $\Phi$ within $\widetilde{O}(\kappa^2\epsilon^{-2})$ gradient evaluations.
We present similar guarantees when only stochastic gradients are available in the following theorem.
Theorem 3.5 (Complexity Bound for SGDmax)
Under Assumptions 3.1 and 3.2, letting the step size and the tolerance for the max-oracle be the same as in Theorem 3.4, with a batch size $M = O(\sigma^2\epsilon^{-2})$, the number of outer iterations required by Algorithm 2 to return an $\epsilon$-stationary point is bounded by $O(\kappa\ell\Delta_\Phi\epsilon^{-2})$. Furthermore, an accurate max-oracle can be realized by minibatch stochastic gradient ascent (SGA), which gives the total stochastic gradient complexity of the algorithm: $\widetilde{O}(\kappa^3\epsilon^{-4})$.
The batch size $M = O(\sigma^2\epsilon^{-2})$ guarantees that the variance of the averaged gradient is at most $O(\epsilon^2)$, so that the average of the stochastic gradients over the batch is sufficiently close to the true gradients $\nabla_{\mathbf{x}} f$ and $\nabla_{\mathbf{y}} f$.
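The variance-reduction role of the batch size can be checked numerically. The sketch below (a standalone illustration, not the paper's experiment; the variance value and sizes are assumptions for the demo) verifies that averaging $M$ i.i.d. stochastic gradients shrinks the variance from $\sigma^2$ to $\sigma^2/M$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0        # per-sample gradient variance (assumed value for the demo)
M = 100             # minibatch size
trials = 20000

# Each stochastic gradient = true gradient (taken to be 0 here) + noise of variance sigma2.
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, M))
batch_means = samples.mean(axis=1)       # one averaged minibatch gradient per trial
empirical_var = batch_means.var()        # should be close to sigma2 / M = 0.04
```

Choosing $M \propto \sigma^2\epsilon^{-2}$ therefore drives the averaged-gradient variance down to $O(\epsilon^2)$, as used in Theorem 3.5.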
We proceed to provide a theoretical guarantee for the single-loop GDA and SGDA algorithms. Since an iterate $\mathbf{y}_t$ generated by GDA or SGDA can be far from the maximizer $\mathbf{y}^\star(\mathbf{x}_t)$, it is nontrivial to identify a Lyapunov function that decreases monotonically. We thus need to devise a new proof technique (see Section 4 for details). Based on this technique, we derive the following complexity results for the GDA and SGDA algorithms.
Theorem 3.6 (Complexity Bound for GDA)
Theorem 3.7 (Complexity Bound for SGDA)
Theorems 3.6 and 3.7 show that GDA and SGDA can find an $\epsilon$-stationary point with proper step sizes, and the convergence rate matches that of GDmax and SGDmax up to a logarithmic factor. Moreover, GDA and SGDA are simpler and more practical than GDmax and SGDmax, which in practice require accurately determining the tolerance for the max-oracle.
The gradient complexity of GDA can be improved with a proper initialization for $\mathbf{y}$. More specifically, the extra term in Theorem 3.6 can be removed by first performing gradient ascent to find $\mathbf{y}_0$ close to $\mathbf{y}^\star(\mathbf{x}_0)$. It is unclear whether the gradient complexity of GDmax could be improved by similar warm-start strategies. In fact, $\|\mathbf{x}_{t+1} - \mathbf{x}_t\|$ in GDmax is large, so even though $\mathbf{y}_t$ is close to $\mathbf{y}^\star(\mathbf{x}_t)$, the distance $\|\mathbf{y}_t - \mathbf{y}^\star(\mathbf{x}_{t+1})\|$ is possibly large. This means that gradient ascent in the max-oracle starts from a poor initial point. In contrast, $\|\mathbf{x}_{t+1} - \mathbf{x}_t\|$ is small in GDA, and so is $\|\mathbf{y}^\star(\mathbf{x}_{t+1}) - \mathbf{y}^\star(\mathbf{x}_t)\|$, leading to an automatically good initialization.
The ratio of learning rates $\eta_{\mathbf{y}}/\eta_{\mathbf{x}}$ for GDA is equivalent, up to a logarithmic factor, to the corresponding ratio for GDmax times the number of gradient ascent steps in the inner loop. More specifically, the ratio is $\Theta(\kappa^2)$ for GDA, while for GDmax the ratio is $\Theta(\kappa)$ and the number of gradient ascent steps in the max-oracle is $\widetilde{O}(\kappa)$. While [20] suggests that a diverging ratio of learning rates is necessary as $\epsilon \to 0$ in a general setting, our result is nonasymptotic with a ratio independent of $\epsilon$, obtained by carefully exploiting the structure of the nonconvex-strongly-concave minimax problem.
3.2 Nonconvex-Concave Minimax Problems
In this subsection, we present the results for the nonconvex-concave minimax problem. The main assumption is the following.
Assumption 3.8
The objective function and constraint set pair $(f, \mathcal{Y})$ satisfy:
1. $f$ is $\ell$-gradient Lipschitz, $f(\cdot, \mathbf{y})$ is $L$-Lipschitz for each $\mathbf{y} \in \mathcal{Y}$, and $f(\mathbf{x}, \cdot)$ is concave for each $\mathbf{x}$;
2. $\mathcal{Y}$ is a convex and bounded set with diameter $D > 0$.
Since $f(\mathbf{x}, \cdot)$ is only required to be concave for each $\mathbf{x}$, $\Phi$ is possibly not differentiable. Fortunately, the Lipschitz and gradient-Lipschitz assumptions guarantee that $\Phi$ is weakly convex and Lipschitz.
Lemma 3.9
Under Assumption 3.8, $\Phi(\cdot) = \max_{\mathbf{y} \in \mathcal{Y}} f(\cdot, \mathbf{y})$ is $\ell$-weakly convex and $L$-Lipschitz.
The target is to find an $\epsilon$-stationary point of an $\ell$-weakly convex function (Definition 2.7) given only gradient (or stochastic gradient) access to $f$. Denoting $\Delta_\Phi := \Phi(\mathbf{x}_0) - \min_{\mathbf{x}} \Phi(\mathbf{x})$, we present the gradient complexities of GDmax and SGDmax in the following two theorems.
Theorem 3.10 (Complexity Bound for GDmax)
Under Assumption 3.8, with appropriate choices of the step size $\eta_{\mathbf{x}}$ and of the tolerance for the max-oracle, the number of outer iterations required by Algorithm 1 to return an $\epsilon$-stationary point is bounded by $O(\epsilon^{-4})$ (up to problem-dependent constants). Furthermore, an accurate max-oracle can be realized by GA run for $O(\epsilon^{-2})$ iterations per outer step, which gives the total gradient complexity of the algorithm: $\widetilde{O}(\epsilon^{-6})$.
Theorem 3.11 (Complexity Bound for SGDmax)
Under Assumptions 3.1 and 3.8, letting the tolerance for the max-oracle be chosen as in Theorem 3.10, with appropriate choices of the step size and of the batch size, the number of outer iterations required by Algorithm 2 to return an $\epsilon$-stationary point is bounded by $O(\epsilon^{-4})$ (up to problem-dependent constants). Furthermore, an accurate max-oracle can be realized by SGA, which gives the total stochastic gradient complexity of the algorithm: $\widetilde{O}(\epsilon^{-8})$.
With this choice of batch size, the averaged stochastic gradients are sufficiently close to the true gradients $\nabla_{\mathbf{x}} f$ and $\nabla_{\mathbf{y}} f$, and the iteration complexity of SGDmax matches that of GDmax.
We now provide theoretical guarantees for the GDA and SGDA algorithms. While the complexity analysis for GDmax and SGDmax is nearly the same as that in Subsection 3.1, the proof techniques for GDA and SGDA are quite different; see Section 4 for details. In what follows, we present the gradient complexity of the GDA and SGDA algorithms.
Theorem 3.12 (Complexity Bound for GDA)
Theorem 3.13 (Complexity Bound for SGDA)
Theorem 3.12 shows that the ratio of learning rates for GDA equals that for GDmax times the number of gradient ascent steps in the max-oracle. The ratio depends on $\epsilon$ and tends to $\infty$ as $\epsilon \to 0$. In contrast to [20], we obtain a nonasymptotic result by exploiting the problem structure with a new technique. We note that our result does not contradict [12, Proposition 1], which shows that GDA diverges on a simple bilinear minimax problem. In fact, while $\mathcal{Y}$ is unbounded in their example, $\mathcal{Y}$ is assumed to be compact in our setting, which, together with a large ratio of learning rates $\eta_{\mathbf{y}}/\eta_{\mathbf{x}}$, prevents the divergence issue.
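The contrast with the bilinear counterexample can be reproduced in a few lines of Python; the step sizes, horizon, and bound on $\mathcal{Y}$ below are illustrative choices for this sketch. On $f(x, y) = xy$, unconstrained GDA with equal step sizes spirals outward, while projecting $y$ onto a bounded interval and taking $\eta_x$ much smaller than $\eta_y$ keeps the iterates bounded, with $x$ hovering near the stationary point at $0$.

```python
import numpy as np

def bilinear_gda(eta_x, eta_y, iters, y_bound=None, x0=1.0, y0=1.0):
    """GDA on f(x, y) = x * y, where grad_x f = y and grad_y f = x.
    If y_bound is set, y is projected onto [-y_bound, y_bound]."""
    x, y = x0, y0
    for _ in range(iters):
        x, y = x - eta_x * y, y + eta_y * x   # simultaneous update
        if y_bound is not None:
            y = float(np.clip(y, -y_bound, y_bound))
    return x, y

# Equal step sizes, unconstrained Y: the norm grows by sqrt(1 + eta^2) per step.
x_div, y_div = bilinear_gda(eta_x=0.1, eta_y=0.1, iters=500)

# Small eta_x, large eta_y, Y = [-1, 1]: here Phi(x) = max_{|y|<=1} x*y = |x|,
# and the slowly moving x remains in a small neighborhood of 0.
x_ok, y_ok = bilinear_gda(eta_x=0.01, eta_y=0.5, iters=2000, y_bound=1.0)
```

The first run diverges; the second stays bounded, which matches the role played by the compactness of $\mathcal{Y}$ and the large step-size ratio in the analysis.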
4 Overview of Proofs
In this section, we present the key ideas behind our theoretical results for GDA and SGDA. The main technical contribution is a set of new techniques for analyzing convex (or concave) optimization with a slowly changing objective over the iterations. In particular, we focus on the complexity analysis of GDA in the nonconvex-strongly-concave and nonconvex-concave settings (Theorems 3.6 and 3.12), and omit the proof overview for SGDA.
4.1 Nonconvex-Strongly-Concave Minimax Problems
In the nonconvex-strongly-concave setting, Lemma 3.3 implies that $\Phi$ is gradient Lipschitz and $\nabla \Phi(\mathbf{x}) = \nabla_{\mathbf{x}} f(\mathbf{x}, \mathbf{y}^\star(\mathbf{x}))$, where $\mathbf{y}^\star(\mathbf{x}) = \mathrm{argmax}_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$. This implies that, if we could find $\mathbf{y}^\star(\mathbf{x}_t)$ for each iterate $\mathbf{x}_t$, then we could simply use the standard technique from nonconvex smooth optimization and obtain an efficient guarantee for finding an $\epsilon$-stationary point (cf. Definition 2.3).
Unfortunately, this is not the case for GDA, where $\mathbf{y}_t \ne \mathbf{y}^\star(\mathbf{x}_t)$ in general. To overcome this difficulty, the high-level idea in our proof is to choose a pair of learning rates that forces $\mathbf{x}_t$ to move more slowly than $\mathbf{y}_t$. More specifically, Lemma 3.3 guarantees that $\mathbf{y}^\star(\cdot)$ is $\kappa$-Lipschitz:
$\|\mathbf{y}^\star(\mathbf{x}_1) - \mathbf{y}^\star(\mathbf{x}_2)\| \le \kappa \|\mathbf{x}_1 - \mathbf{x}_2\|.$
That is, if $\mathbf{x}_t$ changes slowly, then $\mathbf{y}^\star(\mathbf{x}_t)$ also changes slowly. This allows us to view the updates of $\mathbf{y}_t$ as gradient ascent on a slowly changing strongly concave function, guaranteeing that $\|\mathbf{y}_t - \mathbf{y}^\star(\mathbf{x}_t)\|$ is small in an amortized sense.
More precisely, letting the error be $\delta_t := \|\mathbf{y}^\star(\mathbf{x}_t) - \mathbf{y}_t\|^2$, Lemma B.3 shows that $\delta_t$ enters the standard analysis of nonconvex smooth optimization as an additional error term in the descent inequality for $\Phi$.
The remaining step is to show that this additional error term is always small compared to the leading descent term. This is done via a recursion for $\delta_t$ (cf. Lemma B.2) of the form
$\delta_t \le \gamma\,\delta_{t-1} + \beta\,\|\mathbf{x}_t - \mathbf{x}_{t-1}\|^2,$
where $\gamma < 1$ and the perturbation $\beta\,\|\mathbf{x}_t - \mathbf{x}_{t-1}\|^2$ is small. Therefore, $\delta_t$ undergoes a linear contraction and can be well controlled.
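The contraction step rests on a standard fact about linearly contracting recursions; the following is a generic version, with the problem-specific constants abstracted into $\gamma$ and $c$. If a nonnegative error sequence satisfies $\delta_t \le \gamma\,\delta_{t-1} + c$ with $\gamma \in (0,1)$, then unrolling the recursion gives

```latex
\delta_t \;\le\; \gamma^{t}\,\delta_0 \;+\; c\sum_{s=0}^{t-1}\gamma^{s}
\;\le\; \gamma^{t}\,\delta_0 \;+\; \frac{c}{1-\gamma},
```

so after a short transient the error is at most $c/(1-\gamma)$, which is small whenever the per-step perturbation $c$ (driven here by $\|\mathbf{x}_t - \mathbf{x}_{t-1}\|^2$, i.e., by the small learning rate $\eta_{\mathbf{x}}$) is small.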
4.2 Nonconvex-Concave Minimax Problems
In the nonconvex-concave case, the main idea is again to control a pair of learning rates that forces $\mathbf{x}_t$ to move more slowly than $\mathbf{y}_t$. Differently from the setting of the last subsection, $f(\mathbf{x}, \cdot)$ is only guaranteed to be concave, and $\mathbf{y}^\star(\cdot)$ is possibly not Lipschitz, or not even uniquely defined. This means that, even if $\mathbf{x}_1$ and $\mathbf{x}_2$ are extremely close, $\mathbf{y}^\star(\mathbf{x}_1)$ can be dramatically different from $\mathbf{y}^\star(\mathbf{x}_2)$. Therefore, $\|\mathbf{y}_t - \mathbf{y}^\star(\mathbf{x}_t)\|$ is no longer a viable error to control.
Fortunately, Lemma 3.9 implies that $\Phi$ is $L$-Lipschitz. This implies that, when the learning rate $\eta_{\mathbf{x}}$ is very small, the maximum function value changes slowly:
$|\Phi(\mathbf{x}_{t+1}) - \Phi(\mathbf{x}_t)| \le L\|\mathbf{x}_{t+1} - \mathbf{x}_t\|.$
Again, this allows us to treat the updates of $\mathbf{y}_t$ as gradient ascent on concave functions that change slowly in terms of the maximum function value, and to guarantee that $\Phi(\mathbf{x}_t) - f(\mathbf{x}_t, \mathbf{y}_t)$ is small in an amortized sense. Indeed, Lemma C.1 yields a descent-type inequality whose last term on the right-hand side is an error term additional to the standard analysis in nonconvex nonsmooth optimization. The goal of the analysis is again to show that this error term is small compared to the sum of the first two terms on the right-hand side.
To bound this error term, the standard analysis in convex optimization (where the optimal point does not change) uses an inequality of the following form together with a telescoping argument:
(4.1) $\Phi(\mathbf{x}_t) - f(\mathbf{x}_t, \mathbf{y}_t) \le \frac{\|\mathbf{y}_t - \mathbf{y}^\star(\mathbf{x}_t)\|^2 - \|\mathbf{y}_{t+1} - \mathbf{y}^\star(\mathbf{x}_t)\|^2}{2\eta_{\mathbf{y}}} + O(\eta_{\mathbf{y}}).$
The major challenge here is that the optimal points $\mathbf{y}^\star(\mathbf{x}_t)$ can change dramatically across iterations, so the telescoping argument does not go through. An important observation, however, is that (4.1) can also be proved if we replace $\mathbf{y}^\star(\mathbf{x}_t)$ on the right-hand side by a fixed point chosen within a block of iterations, while paying an additional cost that depends on the difference in function value between $\Phi(\mathbf{x}_t)$ and the value at that fixed point. More specifically, we pick a block of size $B$ and show in Lemma C.2 that such an inequality holds over any block of $B$ consecutive iterations.
We then perform the analysis over blocks on which the concave problems are similar, so that the telescoping argument goes through. By carefully choosing the block size $B$, the additional cost can also be well controlled.
5 Conclusions
We have presented a theoretical complexity analysis for GDA and SGDA in the settings of nonconvex-strongly-concave and nonconvex-concave minimax problems. We characterize the stationarity conditions in both settings and prove that GDA and SGDA return an $\epsilon$-stationary point within $O(\kappa^2\epsilon^{-2})$ gradient and $O(\kappa^3\epsilon^{-4})$ stochastic gradient evaluations for nonconvex-strongly-concave minimax problems, and $\widetilde{O}(\epsilon^{-6})$ gradient and $\widetilde{O}(\epsilon^{-8})$ stochastic gradient evaluations for nonconvex-concave minimax problems. Moreover, we analyze GDmax and SGDmax, which are based on a max-oracle at each iteration, providing a complete complexity analysis. Future directions include the investigation of lower bounds for solving minimax problems and obtaining theoretical guarantees for GDA in a still wider range of problems.
References

[1] S. S. Abadeh, P. M. M. Esfahani, and D. Kuhn. Distributionally robust logistic regression. In NeurIPS, pages 1576–1584, 2015.
 [2] L. Adolphs, H. Daneshmand, A. Lucchi, and T. Hofmann. Local saddle point optimization: A curvature exploitation approach. ArXiv Preprint: 1805.05751, 2018.
 [3] D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The mechanics of n-player differentiable games. ArXiv Preprint: 1802.05642, 2018.
 [4] T. Basar and G. J. Olsder. Dynamic Noncooperative Game Theory, volume 23. SIAM, 1999.
 [5] M. Benaïm and M. W. Hirsch. Mixed equilibria and dynamical systems arising from fictitious play in perturbed games. Games and Economic Behavior, 29(1-2):36–72, 1999.
 [6] N. CesaBianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
 [7] G. H. G. Chen and R. T. Rockafellar. Convergence rates in forward–backward splitting. SIAM Journal on Optimization, 7(2):421–444, 1997.
 [8] A. Cherukuri, B. Gharesifard, and J. Cortes. Saddle-point dynamics: conditions for asymptotic stability of saddle points. SIAM Journal on Control and Optimization, 55(1):486–511, 2017.
 [9] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. ArXiv Preprint: 1711.00141, 2017.
 [10] D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. ArXiv Preprint: 1803.06523, 2018.
 [11] S. S. Du and W. Hu. Linear convergence of the primal-dual gradient method for convex-concave saddle point problems without strong convexity. ArXiv Preprint: 1802.01504, 2018.
 [12] G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial networks. In ICLR, 2019.
 [13] R. Giordano, T. Broderick, and M. I. Jordan. Covariances, robustness, and variational Bayes. ArXiv Preprint: 1709.02536, 2017.
 [14] E. G. Golshtein. Generalized gradient method for finding saddle points. Matekon, 10(3):36–52, 1974.
 [15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014.
 [16] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
 [17] P. Grnarova, K. Y. Levy, A. Lucchi, T. Hofmann, and A. Krause. An online learning approach to generative adversarial networks. In ICLR, 2018.
 [18] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two timescale update rule converge to a local Nash equilibrium. In NeurIPS, pages 6626–6637, 2017.

[19] C. H. Hommes and M. I. Ochea. Multiple equilibria and limit cycles in evolutionary games with logit dynamics. Games and Economic Behavior, 74(1):434–441, 2012.
 [20] C. Jin, P. Netrapalli, and M. I. Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. ArXiv Preprint: 1902.00618, 2019.
 [21] M. I. Jordan. Artificial intelligence–the revolution hasn’t happened yet. Medium, 2018.
 [22] G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
 [23] T. Kose. Solutions of saddle value problems by differential equations. Econometrica, Journal of the Econometric Society, pages 59–70, 1956.
 [24] T. Liang and J. Stokes. Interaction matters: A note on nonasymptotic local convergence of generative adversarial networks. ArXiv Preprint: 1802.06132, 2018.
 [25] Q. Lin, M. Liu, H. Rafique, and T. Yang. Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality. ArXiv Preprint: 1810.10207, 2018.
 [26] S. Lu, I. Tsaknakis, M. Hong, and Y. Chen. Hybrid block successive approximation for one-sided nonconvex min-max problems: Algorithms and applications. ArXiv Preprint: 1902.08294, 2019.
 [27] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. ArXiv Preprint: 1706.06083, 2017.

[28] G. Mateos, J. A. Bazerque, and G. B. Giannakis. Distributed sparse linear regression. IEEE Transactions on Signal Processing, 58(10):5262–5276, 2010.
 [29] E. V. Mazumdar, M. I. Jordan, and S. S. Sastry. On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. ArXiv Preprint: 1901.00838, 2019.
 [30] P. Mertikopoulos, B. Lecouat, H. Zenati, C.-S. Foo, V. Chandrasekhar, and G. Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In ICLR, 2019.
 [31] P. Mertikopoulos, C. Papadimitriou, and G. Piliouras. Cycles in adversarial regularized learning. In SODA, pages 2703–2717. SIAM, 2018.
 [32] L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In NeurIPS, pages 1825–1835, 2017.
 [33] A. Nedić and A. Ozdaglar. Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 142(1):205–228, 2009.
 [34] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
 [35] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
 [36] J. V. Neumann. Zur theorie der gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
 [37] N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani. Algorithmic Game Theory. Cambridge University Press, 2007.
 [38] H. Rafique, M. Liu, Q. Lin, and T. Yang. Non-convex min-max optimization: Provable algorithms and applications in machine learning. ArXiv Preprint: 1810.02060, 2018.
 [39] J. Robinson. An iterative method of solving a game. Annals of Mathematics, pages 296–301, 1951.
 [40] R. T. Rockafellar. Convex Analysis. Princeton University Press, 2015.
 [41] M. Sanjabi, J. Ba, M. Razaviyayn, and J. D. Lee. Solving approximate Wasserstein GANs to stationarity. ArXiv Preprint: 1802.08249, 2018.
 [42] J. Shamma. Cooperative Control of Distributed Multiagent Systems. John Wiley & Sons, 2008.
 [43] A. Sinha, H. Namkoong, and J. Duchi. Certifiable distributional robustness with principled adversarial training. In ICLR, 2018.
 [44] M. Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.
 [45] H. Uzawa. Iterative methods for concave programming. Studies in Linear and Nonlinear Programming, 6:154–165, 1958.
 [46] J. Von Neumann and O. Morgenstern. Theory of Games and Economic Behavior (Commemorative Edition). Princeton University Press, 2007.

[47] H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(Jul):1485–1510, 2009.
Appendix A Proof of Technical Lemmas
A.1 Proof of Lemma 2.6
We provide a proof for an expanded version of Lemma 2.6.
Lemma A.1
If $f$ is $\ell$-gradient Lipschitz and $\mathcal{Y}$ is bounded, we have:

$\Phi_{1/2\ell}(\mathbf{x})$ and $\mathrm{prox}_{\Phi/2\ell}(\mathbf{x})$ are well-defined for every $\mathbf{x}$.

for .

is gradient Lipschitz with .

for .
Proof. By the definition of the Moreau envelope, we have
$\Phi_{1/2\ell}(\mathbf{x}) = \min_{\mathbf{w}} \Big\{ \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{w}, \mathbf{y}) + \ell \|\mathbf{w} - \mathbf{x}\|^2 \Big\}.$
Since $f(\cdot, \mathbf{y})$ is $\ell$-gradient Lipschitz, $f(\mathbf{w}, \mathbf{y}) + \frac{\ell}{2}\|\mathbf{w}\|^2$ is convex in $\mathbf{w}$ for each $\mathbf{y} \in \mathcal{Y}$. Since $\mathcal{Y}$ is bounded, Danskin's theorem [40] implies that $\Phi(\mathbf{w}) + \frac{\ell}{2}\|\mathbf{w}\|^2$ is convex. Putting these pieces together yields that $\Phi(\mathbf{w}) + \ell\|\mathbf{w} - \mathbf{x}\|^2$ is strongly convex in $\mathbf{w}$. This implies that $\Phi_{1/2\ell}(\mathbf{x})$ and $\mathrm{prox}_{\Phi/2\ell}(\mathbf{x})$ are well-defined.
Moreover, [10, Lemma 2.2] implies that $\Phi_{1/2\ell}$ is gradient Lipschitz.
Finally, the last inequality follows from [35, Theorem 2.1.5].
A.2 Proof of Lemma 2.8
Denote $\hat{\mathbf{x}} := \mathrm{prox}_{\Phi/2\ell}(\mathbf{x})$. We have $\nabla \Phi_{1/2\ell}(\mathbf{x}) = 2\ell(\mathbf{x} - \hat{\mathbf{x}})$ (cf. Lemma 2.6) and hence $\|\mathbf{x} - \hat{\mathbf{x}}\| = \|\nabla \Phi_{1/2\ell}(\mathbf{x})\|/2\ell \le \epsilon/2\ell$.
Furthermore, the optimality condition for $\hat{\mathbf{x}}$ implies that $2\ell(\mathbf{x} - \hat{\mathbf{x}}) \in \partial \Phi(\hat{\mathbf{x}})$. Putting these pieces together yields that $\min_{\xi \in \partial \Phi(\hat{\mathbf{x}})} \|\xi\| \le \epsilon$.
A.3 Proof of Lemma 3.3
Since $f(\mathbf{x}, \cdot)$ is $\mu$-strongly concave for each $\mathbf{x}$, the maximizer $\mathbf{y}^\star(\mathbf{x})$ is unique and well-defined. We claim that $\mathbf{y}^\star(\cdot)$ is $\kappa$-Lipschitz. Indeed, let $\mathbf{x}_1$ and $\mathbf{x}_2$ be given; the optimality of $\mathbf{y}^\star(\mathbf{x}_1)$ and $\mathbf{y}^\star(\mathbf{x}_2)$ implies that, for all $\mathbf{y} \in \mathcal{Y}$,
(A.1) $(\mathbf{y} - \mathbf{y}^\star(\mathbf{x}_1))^\top \nabla_{\mathbf{y}} f(\mathbf{x}_1, \mathbf{y}^\star(\mathbf{x}_1)) \le 0,$
(A.2) $(\mathbf{y} - \mathbf{y}^\star(\mathbf{x}_2))^\top \nabla_{\mathbf{y}} f(\mathbf{x}_2, \mathbf{y}^\star(\mathbf{x}_2)) \le 0.$
Letting $\mathbf{y} = \mathbf{y}^\star(\mathbf{x}_2)$ in (A.1) and $\mathbf{y} = \mathbf{y}^\star(\mathbf{x}_1)$ in (A.2) and summing the resulting two inequalities yields
(A.3) $(\mathbf{y}^\star(\mathbf{x}_2) - \mathbf{y}^\star(\mathbf{x}_1))^\top \big(\nabla_{\mathbf{y}} f(\mathbf{x}_1, \mathbf{y}^\star(\mathbf{x}_1)) - \nabla_{\mathbf{y}} f(\mathbf{x}_2, \mathbf{y}^\star(\mathbf{x}_2))\big) \le 0.$
Recalling that $f(\mathbf{x}_2, \cdot)$ is $\mu$-strongly concave, we have
(A.4) $(\mathbf{y}^\star(\mathbf{x}_2) - \mathbf{y}^\star(\mathbf{x}_1))^\top \big(\nabla_{\mathbf{y}} f(\mathbf{x}_2, \mathbf{y}^\star(\mathbf{x}_1)) - \nabla_{\mathbf{y}} f(\mathbf{x}_2, \mathbf{y}^\star(\mathbf{x}_2))\big) \ge \mu \|\mathbf{y}^\star(\mathbf{x}_1) - \mathbf{y}^\star(\mathbf{x}_2)\|^2.$
Then we conclude the desired result by combining (A.3), (A.4) and the fact that $f$ is $\ell$-gradient Lipschitz, i.e.,
$\mu \|\mathbf{y}^\star(\mathbf{x}_1) - \mathbf{y}^\star(\mathbf{x}_2)\|^2 \le (\mathbf{y}^\star(\mathbf{x}_2) - \mathbf{y}^\star(\mathbf{x}_1))^\top \big(\nabla_{\mathbf{y}} f(\mathbf{x}_2, \mathbf{y}^\star(\mathbf{x}_1)) - \nabla_{\mathbf{y}} f(\mathbf{x}_1, \mathbf{y}^\star(\mathbf{x}_1))\big) \le \ell \|\mathbf{x}_1 - \mathbf{x}_2\| \, \|\mathbf{y}^\star(\mathbf{x}_1) - \mathbf{y}^\star(\mathbf{x}_2)\|,$
so that $\|\mathbf{y}^\star(\mathbf{x}_1) - \mathbf{y}^\star(\mathbf{x}_2)\| \le \kappa \|\mathbf{x}_1 - \mathbf{x}_2\|$.
Finally, since $\mathbf{y}^\star(\mathbf{x})$ is unique and $\mathcal{Y}$ is convex and bounded, we conclude from Danskin's theorem [40] that $\Phi$ is differentiable with $\nabla \Phi(\mathbf{x}) = \nabla_{\mathbf{x}} f(\mathbf{x}, \mathbf{y}^\star(\mathbf{x}))$. Since $f$ is $\ell$-gradient Lipschitz, we have
$\|\nabla \Phi(\mathbf{x}_1) - \nabla \Phi(\mathbf{x}_2)\| \le \ell \|\mathbf{x}_1 - \mathbf{x}_2\| + \ell \|\mathbf{y}^\star(\mathbf{x}_1) - \mathbf{y}^\star(\mathbf{x}_2)\|.$
Since $\mathbf{y}^\star(\cdot)$ is $\kappa$-Lipschitz, we conclude that $\Phi$ is gradient Lipschitz. The last inequality follows from [35, Theorem 2.1.5].