We consider the following minimax optimization problem:

$$\min_{\mathbf{x} \in \mathbb{R}^m} \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y}), \tag{1.1}$$

where $f$ is a smooth function (possibly nonconvex in $\mathbf{x}$) and $\mathcal{Y}$ is a convex set. Since von Neumann's pioneering work in 1928, the problem of finding a solution to problem (1.1) has been a major endeavor in mathematics, economics and computer science [4, 37, 46]. In recent years minimax optimization theory has begun to see applications in machine learning, including adversarial learning [15, 27], statistical learning [6, 47, 1, 13], certification of robustness in deep learning, and distributed computing [42, 28]. Moreover, real-world machine-learning systems are often embedded in larger economic markets and subject to game-theoretic constraints.
The most widely used, and seemingly the simplest, algorithm for solving problem (1.1) is a natural generalization of gradient descent (GD). Known as gradient descent ascent (GDA), it alternates between gradient descent on the variable $\mathbf{x}$ and gradient ascent on the variable $\mathbf{y}$. There is a vast literature that applies GDA, and stochastic variants of GDA (SGDA), to problems of the form (1.1) [15, 27, 43]. However, the theoretical understanding of the algorithm is still fairly limited. In particular, most of the asymptotic and nonasymptotic convergence results [22, 7, 33, 34, 11] are established for the special convex-concave case of problem (1.1), in which $f$ is convex in $\mathbf{x}$ and concave in $\mathbf{y}$. Unlike the convex-concave case, for which the behavior of GDA has been investigated quite thoroughly, the convergence of GDA remains largely open in the general setting. More specifically, there is no shortage of work highlighting that GDA can converge to limit cycles or even diverge in a game-theoretic setting [5, 19, 9, 31]. Despite recent progress on solving general minimax optimization problems via a range of techniques [32, 8, 18, 2, 24, 30, 29], it remains unclear why GDA and SGDA often work well in applications in which the objective is not convex-concave.
The following general structure arises in many applications: $f(\mathbf{x}, \cdot)$ is concave for any $\mathbf{x}$ and $\mathcal{Y}$ is a bounded set. For example, consider the problem of certifying robustness in deep learning. Training a model is basically a nonconvex minimization problem in which the loss function is given by a neural network evaluated over data samples. Since neural networks are vulnerable to adversarial examples, it is necessary to develop efficient procedures with rigorous guarantees for small to moderate amounts of robustness. Such a scheme, which involves solving a nonconvex-strongly-concave minimax problem, has been proposed in this line of work. A second example is robust learning from multiple distributions. Given multiple empirical distributions drawn from an underlying true distribution, the goal is to introduce robustness by minimizing the maximum of the expected losses over these distributions. This problem can also be posed as a nonconvex-concave minimax problem.
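The multi-distribution formulation above can be sketched concretely. In the following minimal example (the three loss functions are hypothetical stand-ins for empirical risks on three data distributions), the inner maximization over the probability simplex is linear in the weights $\mathbf{y}$, hence concave, and its value is simply the largest individual loss:

```python
import numpy as np

# Hypothetical per-distribution losses L_i(x): stand-ins for the empirical
# risks of a model x on three different data distributions.
def losses(x):
    return np.array([np.sin(x[0]) ** 2, (x[0] - 1.0) ** 2, 0.5 * x[0] ** 2])

# Inner problem: max over y in the probability simplex of sum_i y_i * L_i(x).
# The objective is linear (hence concave) in y, so the maximum is attained at
# a vertex of the simplex and equals the largest individual loss.
def phi(x):
    return float(losses(x).max())
```

The outer minimization over $\mathbf{x}$ remains nonconvex, so the overall problem is nonconvex-concave.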
Despite the popularity of GDA and SGDA in practice, few results have been established on their efficiency beyond the convex-concave setting. Thus, a natural question arises:
Are GDA and SGDA provably efficient for solving nonconvex-concave minimax problems?
This paper presents an affirmative answer to this question. In particular, we first characterize stationarity conditions for nonconvex-strongly-concave and nonconvex-concave minimax problems, respectively. For nonconvex-strongly-concave problems, GDA and SGDA return an $\epsilon$-stationary point within $O(\kappa^2 \epsilon^{-2})$ gradient evaluations and $O(\kappa^3 \epsilon^{-4})$ stochastic gradient evaluations, where $\kappa$ is the condition number. For nonconvex-concave problems, GDA and SGDA return an $\epsilon$-stationary point within $O(\epsilon^{-6})$ gradient evaluations and $O(\epsilon^{-8})$ stochastic gradient evaluations.
Technically, the concavity of $f(\mathbf{x}, \cdot)$ makes it computationally feasible to find the corresponding global maximizer, $\mathbf{y}^\star(\mathbf{x}) \in \operatorname{argmax}_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$, for any $\mathbf{x}$. A straightforward way to solve nonconvex-concave minimax problems is thus a class of nested-loop variants of GDA, which find $\mathbf{y}^\star(\mathbf{x}_t)$ for every iterate $\mathbf{x}_t$. We refer to this scheme as gradient descent with max-oracle (GDmax), and realize the max-oracle by gradient ascent on $\mathbf{y}$. We use GDmax and stochastic GDmax (SGDmax) as the baseline approaches in this paper; see Table 1 for the details.
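As a minimal sketch (assuming the toy objective $f(x, y) = \sin(x)\, y - y^2/2$, which is 1-strongly concave in $y$ with maximizer $y^\star(x) = \sin(x)$ and $\Phi(x) = \sin^2(x)/2$; a hypothetical example, not from the paper), the nested-loop GDmax scheme can be written as:

```python
import math

# Toy nonconvex-strongly-concave objective (hypothetical, for illustration):
#   f(x, y) = sin(x) * y - y^2 / 2,
# 1-strongly concave in y with y*(x) = sin(x), and Phi(x) = sin(x)^2 / 2.

def gdmax(x, y, eta_x=0.1, eta_y=0.5, inner=50, outer=200):
    for _ in range(outer):
        # max-oracle: gradient ascent on y until (approximately) optimal
        for _ in range(inner):
            y += eta_y * (math.sin(x) - y)   # ascent along grad_y f(x, y)
        # one gradient descent step on x at the approximate maximizer
        x -= eta_x * math.cos(x) * y         # descent along grad_x f(x, y)
    return x, y

x, y = gdmax(1.0, 0.0)
# x approaches a stationary point of Phi, where Phi'(x) = sin(x) cos(x) = 0
```

Each outer iteration first runs the inner ascent loop (the max-oracle) to near-optimality and only then takes a single descent step on $x$.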
The complexity analysis for GDmax and SGDmax can be decomposed into two parts: the number of gradient evaluations required by the max-oracle, and the number of iterations required to find a stationary point of $\Phi(\cdot) := \max_{\mathbf{y} \in \mathcal{Y}} f(\cdot, \mathbf{y})$. In contrast, the complexity analysis for GDA and SGDA is more challenging: the iterate $\mathbf{y}_t$ is not guaranteed to be close to $\mathbf{y}^\star(\mathbf{x}_t)$, so it is less clear why $\nabla_{\mathbf{x}} f(\mathbf{x}_t, \mathbf{y}_t)$ is a reasonable direction to follow. In response, we develop techniques for analyzing concave optimization with a slowly changing objective over the course of optimization, which may be of independent interest.
Table 1: Oracle complexities of the algorithms studied in this paper (GDmax, SGDmax, GDA and SGDA); see the theorems in Section 3 for the precise bounds.
Related work: Historically, an early concrete instantiation of problem (1.1) is the bilinear problem of solving $\min_{\mathbf{x} \in \mathcal{X}} \max_{\mathbf{y} \in \mathcal{Y}} \mathbf{x}^\top \mathbf{A} \mathbf{y}$ for a matrix $\mathbf{A}$ and probability simplices $\mathcal{X}$ and $\mathcal{Y}$. This so-called bilinear minimax problem, together with von Neumann's minimax theorem, was a cornerstone in the development of game theory. A general algorithmic schema was developed for solving this problem in which the min and max players each run a simple learning procedure in tandem; e.g., fictitious play. Later, Sion generalized von Neumann's result from bilinear functions to general convex-concave functions, triggering a line of algorithmic research on convex-concave minimax optimization, in continuous time [23, 8] and discrete time [45, 14, 22, 34, 33]. Unfortunately, the techniques in these works rely heavily on the convex-concave structure and cannot be extended to nonconvex-concave minimax problems.
During the past decade, the study of general minimax problems has become a central topic in machine learning, inspired in part by the advent of adversarial learning [15, 27, 43]. Most recent work has focused on reducing oscillations and speeding up convergence of the gradient dynamics; see, e.g., consensus optimization, two-timescale GDA, symplectic gradient adjustment, the linearly transformed gradient, optimistic mirror descent [30, 24], the inexact proximal point algorithm, and the two-timescale update rule. Despite empirical successes in real applications, the convergence analysis of these methods is still limited: all of the existing global convergence results are asymptotic and require strong conditions on the problem structure.
Nonconvex-concave minimax problems appear to be a tractable class of problems of the form (1.1) and have emerged as a focus in optimization and machine learning [43, 38, 41, 17, 26]. Grnarova et al. proposed a variant of GDA for nonconvex-concave problems but did not provide theoretical guarantees for it. Rafique et al. proposed a proximally guided stochastic mirror descent (PG-SMD) and proved that it finds an approximate stationary point of $\Phi$. However, PG-SMD is a nested-loop algorithm, and thus relatively complex to implement; one would like to know whether the nested-loop structure is necessary, or whether GDA, a single-loop algorithm, can also be shown to converge in the nonconvex-strongly-concave setting. Such a convergence result has been established in the special case where $f$ is linear in $\mathbf{y}$. Lu et al. analyzed a variant of GDA for nonconvex-concave problems and provided a theoretical guarantee under a slightly different setting and a different notion of optimality. A class of inexact nonconvex SGD algorithms [43, 41] can be categorized as variants of one of the algorithms that we analyze here (SGDmax in Algorithm 2). We provide a theoretical guarantee for such algorithms in the general nonconvex-concave case.
Notation. We use bold lower-case letters to denote vectors, as in $\mathbf{x}$. We use $\|\cdot\|$ to denote the $\ell_2$-norm of vectors and the spectral norm of matrices. For a function $f$, $\partial f(\mathbf{x})$ denotes the subdifferential of $f$ at $\mathbf{x}$. If $f$ is differentiable, then $\partial f(\mathbf{x}) = \{\nabla f(\mathbf{x})\}$, where $\nabla f(\mathbf{x})$ denotes the gradient of $f$ at $\mathbf{x}$, and $\nabla_{\mathbf{x}} f(\mathbf{x}, \mathbf{y})$ denotes the partial gradient of $f$ with respect to $\mathbf{x}$ at $(\mathbf{x}, \mathbf{y})$. For a symmetric matrix $\mathbf{A}$, we denote the largest and smallest eigenvalues of $\mathbf{A}$ as $\lambda_{\max}(\mathbf{A})$ and $\lambda_{\min}(\mathbf{A})$. We use calligraphic upper-case letters to denote sets, as in $\mathcal{X}$.
Before presenting the objectives in nonconvex-concave minimax optimization, we first describe some standard definitions.
A function $f$ is $L$-Lipschitz if for all $\mathbf{x}_1, \mathbf{x}_2$, we have $\|f(\mathbf{x}_1) - f(\mathbf{x}_2)\| \le L \|\mathbf{x}_1 - \mathbf{x}_2\|$.
A function $f$ is $\ell$-gradient Lipschitz if for all $\mathbf{x}_1, \mathbf{x}_2$, we have $\|\nabla f(\mathbf{x}_1) - \nabla f(\mathbf{x}_2)\| \le \ell \|\mathbf{x}_1 - \mathbf{x}_2\|$.
Intuitively, a function being Lipschitz means that the function values at two nearby points must also be close; a function being gradient Lipschitz means that the gradients at two nearby points must also be close. Recall that the minimax problem (1.1) is equivalent to the following minimization problem: $\min_{\mathbf{x} \in \mathbb{R}^m} \Phi(\mathbf{x})$, where $\Phi(\mathbf{x}) := \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$.
In this paper, we study the special case where $f(\mathbf{x}, \cdot)$ is either concave or strongly concave, so that the inner maximization problem can be solved efficiently for any $\mathbf{x}$. However, since $\Phi$ is a nonconvex function, it is NP-hard to find its global minimum in general, even in the idealized setting in which the maximizer $\mathbf{y}^\star(\mathbf{x})$ can be computed for free for any $\mathbf{x}$.
Objectives in this paper.
We begin by specifying a notion of a local surrogate for a global minimum.
We call $\mathbf{x}$ an $\epsilon$-stationary point ($\epsilon \ge 0$) of a differentiable function $\Phi$ if $\|\nabla \Phi(\mathbf{x})\| \le \epsilon$. If $\epsilon = 0$, then $\mathbf{x}$ is called a stationary point.
Unfortunately, even if $f$ is Lipschitz and gradient Lipschitz, $\Phi$ need not be differentiable. A weaker condition that is sufficient for the purpose of our paper is the following notion of weak convexity.
A function $\Phi$ is $\ell$-weakly convex if the function $\Phi(\cdot) + \frac{\ell}{2} \|\cdot\|^2$ is convex.
In particular, when $\Phi$ is twice differentiable, $\Phi$ is $\ell$-gradient Lipschitz if and only if all the eigenvalues of its Hessian are upper and lower bounded by $\ell$ and $-\ell$, while $\Phi$ is $\ell$-weakly convex if and only if all the eigenvalues of its Hessian are lower bounded by $-\ell$.
For any $\ell$-weakly convex function $\Phi$, its subdifferential $\partial \Phi$ can be uniquely determined by the subdifferential of the convex function $\Phi(\cdot) + \frac{\ell}{2} \|\cdot\|^2$. A naive measure of approximate stationarity is a point $\mathbf{x}$ at which at least one subgradient is small: $\min_{\xi \in \partial \Phi(\mathbf{x})} \|\xi\| \le \epsilon$.
However, this criterion can be very restrictive when optimizing nonsmooth functions. For example, when $\Phi(x) = |x|$ is a one-dimensional function, an approximate stationary point under this criterion must be $x = 0$ for any $\epsilon < 1$. This means that finding an approximate stationary point under this notion is as difficult as solving the minimization problem exactly. An alternative criterion, based on the Moreau envelope of $\Phi$, has become standard when $\Phi$ is weakly convex.
A function $\Phi_\lambda$ is the Moreau envelope of $\Phi$ with parameter $\lambda > 0$ if $\Phi_\lambda(\mathbf{x}) := \min_{\mathbf{w}} \left\{ \Phi(\mathbf{w}) + \frac{1}{2\lambda} \|\mathbf{w} - \mathbf{x}\|^2 \right\}$ for any $\mathbf{x}$.
If $f$ is $\ell$-gradient Lipschitz and $\mathcal{Y}$ is bounded, then the Moreau envelope $\Phi_{1/2\ell}$ is well-defined, differentiable, and gradient Lipschitz, since the inner objective defining it is strongly convex.
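A small numeric sketch of the Moreau envelope for the convex (hence $0$-weakly convex) function $\Phi(x) = |x|$, whose proximal point is given by soft-thresholding:

```python
# For Phi(x) = |x|, the proximal point
#   prox(x) = argmin_w |w| + (1/(2*lam)) * (w - x)^2
# is soft-thresholding, and the envelope gradient is (x - prox(x)) / lam.
def moreau_grad(x, lam=0.5):
    w = max(abs(x) - lam, 0.0) * (1.0 if x >= 0 else -1.0)  # prox of |.|
    return (x - w) / lam

# Near the minimizer x = 0 the envelope gradient is small, even though every
# subgradient of |x| at any x != 0 has magnitude exactly 1.
```

For $|x| \ge \lambda$ the envelope gradient has magnitude 1, matching the subgradient of $|x|$, while near the minimizer it shrinks smoothly to zero, which is what makes it a usable stationarity measure.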
An $\epsilon$-stationary point of an $\ell$-weakly convex function can thus alternatively be defined as a point where the gradient of the Moreau envelope is small.
We call $\mathbf{x}$ an $\epsilon$-stationary point ($\epsilon \ge 0$) of an $\ell$-weakly convex function $\Phi$ if $\|\nabla \Phi_{1/2\ell}(\mathbf{x})\| \le \epsilon$. If $\epsilon = 0$, then $\mathbf{x}$ is called a stationary point.
We can also express Definition 2.7 in terms of the original function .
If $\mathbf{x}$ is an $\epsilon$-stationary point of an $\ell$-weakly convex function $\Phi$ (Definition 2.7), then there exists a point $\hat{\mathbf{x}}$ such that $\|\hat{\mathbf{x}} - \mathbf{x}\| \le \epsilon / 2\ell$ and $\min_{\xi \in \partial \Phi(\hat{\mathbf{x}})} \|\xi\| \le \epsilon$.
Lemma 2.8 shows that an $\epsilon$-stationary point defined via the Moreau envelope can be interpreted as a relaxation of the naive notion above. More specifically, if $\mathbf{x}$ is an $\epsilon$-stationary point of an $\ell$-weakly convex function $\Phi$, then it is close to a point $\hat{\mathbf{x}}$ that has at least one small subgradient.
3 Main Results
In this section, we establish the nonasymptotic convergence rate of GD with a max-oracle (GDmax), SGD with a max-oracle (SGDmax), GDA and SGDA for nonconvex-strongly-concave minimax problems and nonconvex-concave minimax problems.
We present pseudocode for GDmax and SGDmax in Algorithms 1 and 2. For fixed $\mathbf{x}_t$, the max-oracle approximately solves $\max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}_t, \mathbf{y})$ at each iteration. Although GDmax and SGDmax are easier to understand, they have two disadvantages over GDA and SGDA: (1) both GDmax and SGDmax are nested-loop algorithms, and since it is difficult to pre-determine the number of iterations for the inner loop, these algorithms are complex to implement in practice; (2) in the general setting where $f(\mathbf{x}, \cdot)$ is nonconcave, GDmax and SGDmax are inapplicable, as we cannot efficiently find a global optimum of the inner problem. In contrast, GDA and SGDA are single-loop algorithms; see Algorithms 3 and 4.
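A minimal single-loop GDA sketch on the toy objective $f(x, y) = \sin(x)\, y - y^2/2$ (a hypothetical example, not from the paper), with $x$ on a slower timescale than $y$:

```python
import math

# Single-loop GDA on f(x, y) = sin(x) * y - y^2 / 2: one ascent step on y and
# one descent step on x per iteration, with eta_x << eta_y so that x moves on
# a slower timescale than y.
def gda(x, y, eta_x=0.01, eta_y=0.5, iters=2000):
    for _ in range(iters):
        gx = math.cos(x) * y      # grad_x f(x, y)
        gy = math.sin(x) - y      # grad_y f(x, y)
        x, y = x - eta_x * gx, y + eta_y * gy
    return x, y

x, y = gda(1.0, 0.0)
# y tracks the maximizer sin(x) while x drifts to a stationary point of
# Phi(x) = sin(x)^2 / 2
```

Because $\eta_x \ll \eta_y$, the ascent variable tracks the maximizer $\sin(x)$ while $x$ slowly descends on $\Phi(x) = \sin^2(x)/2$, without any inner loop.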
For the stochastic gradient algorithms, we assume that the stochastic gradient oracle satisfies the following condition.
The stochastic gradient oracle $G(\mathbf{x}, \mathbf{y}, \xi)$ is unbiased and has bounded variance: for all $(\mathbf{x}, \mathbf{y})$, we have $\mathbb{E}_\xi[G(\mathbf{x}, \mathbf{y}, \xi)] = \nabla f(\mathbf{x}, \mathbf{y})$ and $\mathbb{E}_\xi \|G(\mathbf{x}, \mathbf{y}, \xi) - \nabla f(\mathbf{x}, \mathbf{y})\|^2 \le \sigma^2$.
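A hedged sketch of such an oracle, using additive Gaussian noise as a stand-in for sampling noise; averaging a mini-batch of $M$ independent calls keeps the estimator unbiased and reduces its variance to $\sigma^2 / M$:

```python
import random

rng = random.Random(0)

# Hypothetical stochastic oracle: the true gradient plus zero-mean Gaussian
# noise of standard deviation sigma, so the oracle is unbiased with variance
# sigma^2, matching the assumption above.
def oracle(g_true, sigma=1.0):
    return g_true + rng.gauss(0.0, sigma)

# Averaging a mini-batch of M independent oracle calls keeps the estimator
# unbiased and reduces its variance to sigma^2 / M.
def minibatch(g_true, sigma=1.0, M=64):
    return sum(oracle(g_true, sigma) for _ in range(M)) / M
```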
3.1 Nonconvex-Strongly-Concave Minimax Problem
In this subsection, we present the results for the nonconvex-strongly-concave minimax problem. We make the following assumption.
The objective function and constraint set pair $(f, \mathcal{Y})$ satisfy:
$f$ is $\ell$-gradient Lipschitz and $f(\mathbf{x}, \cdot)$ is $\mu$-strongly concave for each $\mathbf{x}$;
$\mathcal{Y}$ is a convex and bounded set with diameter $D > 0$.
While the gradient-Lipschitz assumption is standard in the optimization literature, strong concavity is crucial here: together with the boundedness of $\mathcal{Y}$, it allows for an efficient solution of the inner problem $\max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$. We let $\kappa := \ell / \mu$ denote the condition number throughout this section. The following structural lemma provides further information about $\Phi$ in the strongly-concave setting.
The target is to find an $\epsilon$-stationary point of $\Phi$ (cf. Definition 2.3) given only gradient (or stochastic gradient) access to $f$. Denoting the initial gap by $\Delta_\Phi := \Phi(\mathbf{x}_0) - \min_{\mathbf{x}} \Phi(\mathbf{x})$, we have the following complexity bound for GDmax.
Theorem 3.4 (Complexity Bound for GDmax)
Under Assumption 3.2, with an appropriate choice of the step size $\eta_{\mathbf{x}}$ and the tolerance $\zeta$ for the max-oracle, the number of iterations required by Algorithm 1 to return an $\epsilon$-stationary point is bounded by $O(\kappa \ell \Delta_\Phi \epsilon^{-2})$. Furthermore, the $\zeta$-accurate max-oracle can be realized by gradient ascent (GA) with step size $1/\ell$ run for $O(\kappa \log(1/\zeta))$ iterations, which gives a total gradient complexity of $\tilde{O}(\kappa^2 \epsilon^{-2})$.
Theorem 3.4 shows that if we alternate between one step of gradient descent over $\mathbf{x}$ and $O(\kappa \log(1/\zeta))$ steps of gradient ascent over $\mathbf{y}$, with a pair of proper learning rates, we can find at least one stationary point of $\Phi$ within $\tilde{O}(\kappa^2 \epsilon^{-2})$ gradient evaluations.
We present similar guarantees when only stochastic gradients are available in the following theorem.
Theorem 3.5 (Complexity Bound for SGDmax)
Under Assumptions 3.1 and 3.2, letting the step size and the tolerance for the max-oracle be the same as in Theorem 3.4, with a sufficiently large batch size, the number of iterations required by Algorithm 2 to return an $\epsilon$-stationary point is bounded by $O(\kappa \ell \Delta_\Phi \epsilon^{-2})$. Furthermore, the $\zeta$-accurate max-oracle can be realized by mini-batch stochastic gradient ascent (SGA), which gives a total stochastic gradient complexity of $\tilde{O}(\kappa^3 \epsilon^{-4})$.
The batch size is chosen so that the variance of the averaged stochastic gradient is at most $O(\epsilon^2)$, ensuring that the averaged stochastic gradients are sufficiently close to the true gradients $\nabla_{\mathbf{x}} f$ and $\nabla_{\mathbf{y}} f$.
We proceed to provide theoretical guarantees for the single-loop GDA and SGDA algorithms. Since an iterate $\mathbf{y}_t$ generated by GDA or SGDA can be far from the maximizer $\mathbf{y}^\star(\mathbf{x}_t)$, it is nontrivial to identify a Lyapunov function that decreases monotonically. We thus need to devise a new proof technique (see Section 4 for details). Based on this technique, we derive the following complexity results for the GDA and SGDA algorithms.
Theorem 3.6 (Complexity Bound for GDA)
Under Assumption 3.2, with appropriately chosen step sizes $(\eta_{\mathbf{x}}, \eta_{\mathbf{y}})$, GDA (Algorithm 3) returns an $\epsilon$-stationary point of $\Phi$ within $O(\kappa^2 \epsilon^{-2})$ gradient evaluations.
Theorem 3.7 (Complexity Bound for SGDA)
Under Assumptions 3.1 and 3.2, with appropriately chosen step sizes and batch size, SGDA (Algorithm 4) returns an $\epsilon$-stationary point of $\Phi$ within $O(\kappa^3 \epsilon^{-4})$ stochastic gradient evaluations.
Theorems 3.6 and 3.7 show that GDA and SGDA can find an $\epsilon$-stationary point with proper step sizes, and the convergence rates match those of GDmax and SGDmax up to a logarithmic factor. Moreover, GDA and SGDA are simpler and more practical than GDmax and SGDmax, which in practice require accurately determining the tolerance $\zeta$ for the max-oracle.
The gradient complexity of GDA can be further improved with a proper initialization for $\mathbf{y}_0$. More specifically, the extra term in Theorem 3.6 can be removed by first performing gradient ascent to find $\mathbf{y}_0$ such that $\|\mathbf{y}_0 - \mathbf{y}^\star(\mathbf{x}_0)\|$ is small. It is unclear whether the gradient complexity of GDmax could be improved by a similar warm-start strategy. In fact, the step size $\eta_{\mathbf{x}}$ in GDmax is large, so even though $\mathbf{y}_t$ is close to $\mathbf{y}^\star(\mathbf{x}_t)$, the distance $\|\mathbf{y}_t - \mathbf{y}^\star(\mathbf{x}_{t+1})\|$ can still be large. This means that gradient ascent in the max-oracle starts from a bad initial point. In contrast, $\eta_{\mathbf{x}}$ is small in GDA, and so is $\|\mathbf{x}_{t+1} - \mathbf{x}_t\|$, leading to an automatically good initialization.
The ratio of learning rates $\eta_{\mathbf{y}} / \eta_{\mathbf{x}}$ for GDA is equivalent, up to a logarithmic factor, to the corresponding ratio for GDmax multiplied by the number of gradient ascent steps in the inner loop. While earlier work suggests that a ratio growing without bound is necessary in a general setting, our result is nonasymptotic with a ratio independent of $\epsilon$, obtained by carefully exploiting the structure of the nonconvex-strongly-concave minimax problem.
3.2 Nonconvex-Concave Minimax Problems
In this subsection, we present the results for the nonconvex-concave minimax problem. The main assumption is the following.
The objective function and constraint set pair $(f, \mathcal{Y})$ satisfy:
$f$ is $\ell$-gradient Lipschitz, $f(\cdot, \mathbf{y})$ is $L$-Lipschitz for each $\mathbf{y} \in \mathcal{Y}$, and $f(\mathbf{x}, \cdot)$ is concave for each $\mathbf{x}$;
$\mathcal{Y}$ is a convex and bounded set with diameter $D > 0$.
Since $f(\mathbf{x}, \cdot)$ is only required to be concave for any $\mathbf{x}$, the function $\Phi$ is possibly not differentiable. Fortunately, the Lipschitz and gradient Lipschitz assumptions guarantee that $\Phi$ is weakly convex and Lipschitz.
Under Assumption 3.8, $\Phi(\cdot) = \max_{\mathbf{y} \in \mathcal{Y}} f(\cdot, \mathbf{y})$ is $\ell$-weakly convex and $L$-Lipschitz.
The target is to find an $\epsilon$-stationary point of the weakly convex function $\Phi$ (Definition 2.7) given only gradient (or stochastic gradient) access to $f$. Denoting the initial gap of the Moreau envelope by $\hat{\Delta}_\Phi := \Phi_{1/2\ell}(\mathbf{x}_0) - \min_{\mathbf{x}} \Phi_{1/2\ell}(\mathbf{x})$, we present the gradient complexity of GDmax and SGDmax in the following two theorems.
Theorem 3.10 (Complexity Bound for GDmax)
Under Assumption 3.8, with an appropriate choice of the step size $\eta_{\mathbf{x}}$ and the tolerance $\zeta$ for the max-oracle, Algorithm 1 returns an $\epsilon$-stationary point, where the $\zeta$-accurate max-oracle is realized by GA; the total gradient complexity of the algorithm is $\tilde{O}(\epsilon^{-6})$.
Theorem 3.11 (Complexity Bound for SGDmax)
Under Assumptions 3.1 and 3.8, with the tolerance for the max-oracle chosen as in Theorem 3.10 and appropriate choices of the step size and batch size, Algorithm 2 returns an $\epsilon$-stationary point, where the $\zeta$-accurate max-oracle is realized by mini-batch SGA; the total stochastic gradient complexity of the algorithm is $\tilde{O}(\epsilon^{-8})$.
When the batch size is sufficiently large, the averaged stochastic gradients are sufficiently close to the true gradients $\nabla_{\mathbf{x}} f$ and $\nabla_{\mathbf{y}} f$, and the iteration complexity of SGDmax matches that of GDmax.
We now provide theoretical guarantees for the GDA and SGDA algorithms. While the complexity analysis for GDmax and SGDmax is nearly the same as that in Subsection 3.1, the proof techniques for GDA and SGDA are quite different; see Section 4 for the details. In what follows, we present the gradient complexity of the GDA and SGDA algorithms.
Theorem 3.12 (Complexity Bound for GDA)
Under Assumption 3.8, with appropriately chosen step sizes $(\eta_{\mathbf{x}}, \eta_{\mathbf{y}})$, GDA (Algorithm 3) returns an $\epsilon$-stationary point of the weakly convex function $\Phi$ within $O(\epsilon^{-6})$ gradient evaluations.
Theorem 3.13 (Complexity Bound for SGDA)
Under Assumptions 3.1 and 3.8, with appropriately chosen step sizes and batch size, SGDA (Algorithm 4) returns an $\epsilon$-stationary point of the weakly convex function $\Phi$ within $O(\epsilon^{-8})$ stochastic gradient evaluations.
Theorem 3.12 shows that the ratio of learning rates $\eta_{\mathbf{y}} / \eta_{\mathbf{x}}$ for GDA equals that for GDmax multiplied by the number of gradient ascent steps in the max-oracle. The ratio depends on $\epsilon$ and tends to infinity as $\epsilon \to 0$. In contrast to purely asymptotic two-timescale analyses, we obtain a nonasymptotic result by exploiting the problem structure with our new technique. We note that our result does not contradict [12, Proposition 1], which shows that GDA diverges on a simple bilinear minimax problem: while $\mathcal{Y}$ is unbounded in their example, $\mathcal{Y}$ is assumed to be compact in our setting, which, together with a large ratio $\eta_{\mathbf{y}} / \eta_{\mathbf{x}}$, prevents the divergence issue.
4 Overview of Proofs
In this section, we present the key ideas behind our theoretical results for GDA and SGDA. The main technical contribution is a set of new techniques for analyzing convex (or concave) optimization with a slowly changing objective over the iterations. In particular, we focus on the complexity analysis of GDA in the nonconvex-strongly-concave and nonconvex-concave settings (Theorems 3.6 and 3.12), and omit the proof overview for SGDA.
4.1 Nonconvex-Strongly-Concave Minimax Problems
In the nonconvex-strongly-concave setting, Lemma 3.3 implies that $\Phi$ is gradient Lipschitz and that $\nabla \Phi(\mathbf{x}) = \nabla_{\mathbf{x}} f(\mathbf{x}, \mathbf{y}^\star(\mathbf{x}))$, where $\mathbf{y}^\star(\mathbf{x}) = \operatorname{argmax}_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$. This implies that, if we could find $\mathbf{y}^\star(\mathbf{x}_t)$ for each iterate $\mathbf{x}_t$, then we could simply use the standard technique in nonconvex smooth optimization and provide an efficient guarantee for finding an $\epsilon$-stationary point (cf. Definition 2.3).
Unfortunately, this is not the case for GDA, where $\mathbf{y}_t \ne \mathbf{y}^\star(\mathbf{x}_t)$ in general. To overcome this difficulty, the high-level idea in our proof is to control the pair of learning rates so as to force $\mathbf{x}_t$ to move more slowly than $\mathbf{y}_t$. More specifically, Lemma 3.3 guarantees that $\mathbf{y}^\star(\cdot)$ is $\kappa$-Lipschitz:
$$\|\mathbf{y}^\star(\mathbf{x}_1) - \mathbf{y}^\star(\mathbf{x}_2)\| \le \kappa \|\mathbf{x}_1 - \mathbf{x}_2\|.$$
That is, if $\mathbf{x}_t$ changes slowly, then $\mathbf{y}^\star(\mathbf{x}_t)$ also changes slowly. This allows us to perform gradient ascent on a slowly changing strongly-concave function $f(\mathbf{x}_t, \cdot)$, guaranteeing that the error $\|\mathbf{y}_t - \mathbf{y}^\star(\mathbf{x}_t)\|$ is small in an amortized sense.
More precisely, letting the error be $\delta_t := \|\mathbf{y}^\star(\mathbf{x}_t) - \mathbf{y}_t\|^2$, Lemma B.3 implies that $\delta_t$ enters the standard analysis of nonconvex smooth optimization via an additional error term; the resulting bound takes the form
$$\frac{1}{T} \sum_{t=0}^{T-1} \|\nabla \Phi(\mathbf{x}_t)\|^2 \le O\!\left(\frac{\Delta_\Phi}{\eta_{\mathbf{x}} T}\right) + O\!\left(\frac{\ell^2}{T} \sum_{t=0}^{T-1} \delta_t\right).$$
The remaining step is to show that the additional error term (the second term on the right-hand side) is always small compared to the first term on the right-hand side. This is done via a recursion for $\delta_t$ (cf. Lemma B.2):
$$\delta_t \le \left(1 - \tfrac{1}{2\kappa}\right) \delta_{t-1} + O\!\left(\kappa^2 \|\mathbf{x}_t - \mathbf{x}_{t-1}\|^2\right),$$
where the contraction factor comes from gradient ascent on a strongly concave function and the second term is small because $\mathbf{x}_t$ moves slowly. Therefore, $\delta_t$ undergoes a linear contraction and can be well controlled.
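The effect of such a recursion can be illustrated numerically (with hypothetical values for the contraction factor and the additive term): iterating $\delta_t \le q\, \delta_{t-1} + e$ with $q < 1$ drives the error down to roughly $e / (1 - q)$, which is small whenever the additive term is small:

```python
# Numerical illustration of the recursion delta_t <= q * delta_{t-1} + e,
# with contraction factor q < 1 and a small additive term e (both values are
# hypothetical): the error contracts linearly toward the fixed point
# e / (1 - q).
def run_recursion(q=0.9, e=1e-4, steps=500, delta=1.0):
    for _ in range(steps):
        delta = q * delta + e
    return delta
```

In the analysis, the contraction factor comes from gradient ascent on a strongly concave function and the additive term from the slow movement of $\mathbf{x}_t$.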
4.2 Nonconvex-Concave Minimax Problems
In the nonconvex-concave case, the main idea is again to control the pair of learning rates so as to force $\mathbf{x}$ to move more slowly than $\mathbf{y}$. Different from the setting in the last subsection, $f(\mathbf{x}, \cdot)$ is only guaranteed to be concave, so $\mathbf{y}^\star(\cdot)$ is possibly not Lipschitz, or even not uniquely defined. This means that, even if $\mathbf{x}_1$ and $\mathbf{x}_2$ are extremely close, $\mathbf{y}^\star(\mathbf{x}_1)$ can be dramatically different from $\mathbf{y}^\star(\mathbf{x}_2)$. Therefore, $\|\mathbf{y}_t - \mathbf{y}^\star(\mathbf{x}_t)\|$ is no longer a viable error to control.
Fortunately, Lemma 3.9 implies that $\Phi$ is $L$-Lipschitz. Thus, when the learning rate $\eta_{\mathbf{x}}$ is very small, the maximum function value changes slowly:
$$|\Phi(\mathbf{x}_{t+1}) - \Phi(\mathbf{x}_t)| \le L \|\mathbf{x}_{t+1} - \mathbf{x}_t\|.$$
Again, this allows us to perform gradient ascent on concave functions that change slowly in terms of the maximum function value, and guarantees that the gap $\Phi(\mathbf{x}_t) - f(\mathbf{x}_t, \mathbf{y}_t)$ is small in an amortized sense. Indeed, Lemma C.1 yields a descent inequality in which, beyond the two terms appearing in the standard analysis of nonconvex nonsmooth optimization, there is an additional error term on the right-hand side. The goal of the analysis is again to show that this error term is small compared to the sum of the first two terms.
To bound this error term, the standard analysis in convex optimization (where the optimal point does not change) uses a per-iteration inequality of the form
$$f(\mathbf{x}, \mathbf{y}^\star) - f(\mathbf{x}, \mathbf{y}_t) \le \frac{\|\mathbf{y}_t - \mathbf{y}^\star\|^2 - \|\mathbf{y}_{t+1} - \mathbf{y}^\star\|^2}{2 \eta_{\mathbf{y}}} + O(\eta_{\mathbf{y}} L^2)$$
together with a telescoping argument.
The major challenge here is that the optimal points $\mathbf{y}^\star(\mathbf{x}_t)$ can change dramatically, so the telescoping argument does not go through. An important observation, however, is that the inequality above can also be proved if the optimal point on the right-hand side is replaced by any fixed reference point, while paying an additional cost that depends on the difference in function value between the two. More specifically, we pick a block of size $B$ and show in Lemma C.2 that, for any starting index, an analogous bound holds over the block.
We then perform the analysis over blocks within which the concave problems are similar, so that the telescoping argument goes through. By carefully choosing the block size $B$, the additional error term can also be well controlled.
We have presented a complexity analysis of GDA and SGDA in the setting of nonconvex-strongly-concave and nonconvex-concave minimax problems. We characterize the stationarity conditions in both settings and prove that GDA and SGDA return an $\epsilon$-stationary point within $O(\kappa^2 \epsilon^{-2})$ gradient and $O(\kappa^3 \epsilon^{-4})$ stochastic gradient evaluations for nonconvex-strongly-concave minimax problems, and within $O(\epsilon^{-6})$ gradient and $O(\epsilon^{-8})$ stochastic gradient evaluations for nonconvex-concave minimax problems. Moreover, we analyze GDmax and SGDmax, which are based on a max-oracle at each iteration, providing a complete complexity comparison. Future directions include the investigation of lower bounds for solving minimax problems and obtaining theoretical guarantees for GDA in a still wider range of problems.
-  S. S. Abadeh, P. M. M. Esfahani, and D. Kuhn. Distributionally robust logistic regression. In NeurIPS, pages 1576–1584, 2015.
-  L. Adolphs, H. Daneshmand, A. Lucchi, and T. Hofmann. Local saddle point optimization: A curvature exploitation approach. ArXiv Preprint: 1805.05751, 2018.
-  D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The mechanics of n-player differentiable games. ArXiv Preprint: 1802.05642, 2018.
-  T. Basar and G. J. Olsder. Dynamic Noncooperative Game Theory, volume 23. SIAM, 1999.
-  M. Benaïm and M. W. Hirsch. Mixed equilibria and dynamical systems arising from fictitious play in perturbed games. Games and Economic Behavior, 29(1-2):36–72, 1999.
-  N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
-  G. H. G. Chen and R. T. Rockafellar. Convergence rates in forward–backward splitting. SIAM Journal on Optimization, 7(2):421–444, 1997.
-  A. Cherukuri, B. Gharesifard, and J. Cortes. Saddle-point dynamics: conditions for asymptotic stability of saddle points. SIAM Journal on Control and Optimization, 55(1):486–511, 2017.
-  C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. ArXiv Preprint: 1711.00141, 2017.
-  D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. ArXiv Preprint: 1803.06523, 2018.
-  S. S. Du and W. Hu. Linear convergence of the primal-dual gradient method for convex-concave saddle point problems without strong convexity. ArXiv Preprint: 1802.01504, 2018.
-  G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial networks. In ICLR, 2019.
-  R. Giordano, T. Broderick, and M. I. Jordan. Covariances, robustness, and variational bayes. ArXiv Preprint: 1709.02536, 2017.
-  E. G. Golshtein. Generalized gradient method for finding saddle points. Matekon, 10(3):36–52, 1974.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014.
-  I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
-  P. Grnarova, K. Y. Levy, A. Lucchi, T. Hofmann, and A. Krause. An online learning approach to generative adversarial networks. In ICLR, 2018.
-  M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, pages 6626–6637, 2017.
-  C. H. Hommes and M. I. Ochea. Multiple equilibria and limit cycles in evolutionary games with logit dynamics. Games and Economic Behavior, 74(1):434–441, 2012.
-  C. Jin, P. Netrapalli, and M. I. Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. ArXiv Preprint: 1902.00618, 2019.
-  M. I. Jordan. Artificial intelligence–the revolution hasn’t happened yet. Medium, 2018.
-  G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
-  T. Kose. Solutions of saddle value problems by differential equations. Econometrica, Journal of the Econometric Society, pages 59–70, 1956.
-  T. Liang and J. Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. ArXiv Preprint: 1802.06132, 2018.
-  Q. Lin, M. Liu, H. Rafique, and T. Yang. Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality. ArXiv Preprint: 1810.10207, 2018.
-  S. Lu, I. Tsaknakis, M. Hong, and Y. Chen. Hybrid block successive approximation for one-sided non-convex min-max problems: Algorithms and applications. ArXiv Preprint: 1902.08294, 2019.
-  A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. ArXiv Preprint: 1706.06083, 2017.
-  G. Mateos, J. A. Bazerque, and G. B. Giannakis. Distributed sparse linear regression. IEEE Transactions on Signal Processing, 58(10):5262–5276, 2010.
-  E. V. Mazumdar, M. I. Jordan, and S. S. Sastry. On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. ArXiv Preprint: 1901.00838, 2019.
-  P. Mertikopoulos, B. Lecouat, H. Zenati, C-S Foo, V. Chandrasekhar, and G. Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra(-gradient) mile. In ICLR, 2019.
-  P. Mertikopoulos, C. Papadimitriou, and G. Piliouras. Cycles in adversarial regularized learning. In SODA, pages 2703–2717. SIAM, 2018.
-  L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In NeurIPS, pages 1825–1835, 2017.
-  A. Nedić and A. Ozdaglar. Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 142(1):205–228, 2009.
-  A. Nemirovski. Prox-method with rate of convergence o(1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
-  Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
-  J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
-  N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani. Algorithmic Game Theory. Cambridge University Press, 2007.
-  H. Rafique, M. Liu, Q. Lin, and T. Yang. Non-convex min-max optimization: Provable algorithms and applications in machine learning. ArXiv Preprint: 1810.02060, 2018.
-  J. Robinson. An iterative method of solving a game. Annals of Mathematics, pages 296–301, 1951.
-  R. T. Rockafellar. Convex Analysis. Princeton University Press, 2015.
-  M. Sanjabi, J. Ba, M. Razaviyayn, and J. D. Lee. Solving approximate Wasserstein GANs to stationarity. ArXiv Preprint: 1802.08249, 2018.
-  J. Shamma. Cooperative Control of Distributed Multi-agent Systems. John Wiley & Sons, 2008.
-  A. Sinha, H. Namkoong, and J. Duchi. Certifiable distributional robustness with principled adversarial training. In ICLR, 2018.
-  M. Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.
-  H. Uzawa. Iterative methods for concave programming. Studies in Linear and Nonlinear Programming, 6:154–165, 1958.
-  J. Von Neumann and O. Morgenstern. Theory of Games and Economic Behavior (Commemorative Edition). Princeton University Press, 2007.
-  H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(Jul):1485–1510, 2009.
Appendix A Proof of Technical Lemmas
a.1 Proof of Lemma 2.6
We provide a proof of an expanded version of Lemma 2.6.
If $f$ is $\ell$-gradient Lipschitz and $\mathcal{Y}$ is bounded, we have:
the Moreau envelope $\Phi_{1/2\ell}$ and the associated proximal point are well-defined;
$\Phi_{1/2\ell}$ is gradient Lipschitz.
Proof. By the definition of $\Phi_{1/2\ell}$, we have
Since $f$ is $\ell$-gradient Lipschitz, $f(\mathbf{w}, \mathbf{y}) + \frac{\ell}{2} \|\mathbf{w}\|^2$ is convex in $\mathbf{w}$ for each $\mathbf{y} \in \mathcal{Y}$. Since $\mathcal{Y}$ is bounded, Danskin's theorem implies that $\Phi(\mathbf{w}) + \frac{\ell}{2} \|\mathbf{w}\|^2$ is convex. Putting these pieces together yields that the inner objective $\Phi(\mathbf{w}) + \ell \|\mathbf{w} - \mathbf{x}\|^2$ is $\ell$-strongly convex. This implies that $\Phi_{1/2\ell}$ and the associated proximal point are well-defined. Furthermore, by the definition of $\Phi_{1/2\ell}$, we have
Moreover, [10, Lemma 2.2] implies that $\Phi_{1/2\ell}$ is gradient Lipschitz.
Finally, it follows from [35, Theorem 2.1.5] that $\Phi_{1/2\ell}$ satisfies the last inequality.
a.2 Proof of Lemma 2.8
Denoting by $\hat{\mathbf{x}}$ the proximal point associated with $\mathbf{x}$, we have (cf. Lemma 2.6) $\nabla \Phi_{1/2\ell}(\mathbf{x}) = 2\ell(\mathbf{x} - \hat{\mathbf{x}})$, and hence $\|\hat{\mathbf{x}} - \mathbf{x}\| = \|\nabla \Phi_{1/2\ell}(\mathbf{x})\| / 2\ell \le \epsilon / 2\ell$.
Furthermore, the optimality condition for $\hat{\mathbf{x}}$ implies that $2\ell(\mathbf{x} - \hat{\mathbf{x}}) \in \partial \Phi(\hat{\mathbf{x}})$. Putting these pieces together yields $\min_{\xi \in \partial \Phi(\hat{\mathbf{x}})} \|\xi\| \le \epsilon$.
a.3 Proof of Lemma 3.3
Since $f(\mathbf{x}, \cdot)$ is $\mu$-strongly concave for each $\mathbf{x}$, the maximizer $\mathbf{y}^\star(\mathbf{x})$ is unique and well-defined. We first claim that $\mathbf{y}^\star(\cdot)$ is $\kappa$-Lipschitz. Indeed, for any $\mathbf{x}_1, \mathbf{x}_2$, the optimality of $\mathbf{y}^\star(\mathbf{x}_1)$ and $\mathbf{y}^\star(\mathbf{x}_2)$ implies that
Recalling that $f(\mathbf{x}, \cdot)$ is $\mu$-strongly concave, we have
Finally, since $\mathbf{y}^\star(\mathbf{x})$ is unique and $\mathcal{Y}$ is convex and bounded, we conclude from Danskin's theorem that $\Phi$ is differentiable with $\nabla \Phi(\mathbf{x}) = \nabla_{\mathbf{x}} f(\mathbf{x}, \mathbf{y}^\star(\mathbf{x}))$. Since $f$ is $\ell$-gradient Lipschitz, we have
Since $\mathbf{y}^\star(\cdot)$ is $\kappa$-Lipschitz, we conclude the desired result by plugging in this bound; in particular, $\Phi$ is gradient Lipschitz. The last inequality follows from [35, Theorem 2.1.5].