1 Introduction
In this paper we study smooth minimax problems of the form:
(1) $\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} g(x, y)$
The problem has applications in several domains such as machine learning [goodfellow2014generative, madry2017towards], optimization [bertsekas2014constrained], statistics [berger2013statistical], mathematics [kinderlehrer1980introduction], and game theory [myerson2013game]. Given the importance of these problems, there is an extensive body of work that studies various algorithms and their convergence properties. The vast majority of existing results for this problem focus on the convex-concave setting, where $g(\cdot, y)$ is convex for every $y$ and $g(x, \cdot)$ is concave for every $x$. The best known convergence rate in this setting is $O(1/k)$ for the primal-dual gap, achieved for example by Mirror-Prox [nemirovski2004prox]. This rate is also known to be optimal for the class of smooth convex-concave problems [ouyang2018lower]. A natural question is whether we can achieve faster convergence if we have strong convexity (as opposed to just convexity) of $g(\cdot, y)$. We answer this in the affirmative, by introducing an algorithm that achieves a convergence rate of $\tilde{O}(1/k^2)$ for the general smooth, strongly-convex–concave minimax problem. The algorithm we propose is a novel combination of Mirror-Prox and Nesterov's accelerated gradient descent. Its rate matches the known lower bound of $\Omega(1/k^2)$ from [ouyang2018lower], closing the gap up to a polylogarithmic factor. The only known upper bounds that obtain a rate of $O(1/k^2)$ in this context are for very special cases, where $x$ and $y$ are connected through a bilinear term or $g(x, y)$ is linear in $y$ [nesterov2005excessive, juditsky2011first, GOS14, chambolle2016ergodic, he2016accelerated, xu2017iteration, hamedani2018primal, xie2019accelerated].
While most theoretical results focus on the convex-concave setting, several real world problems fall outside this class. A slightly larger class, which captures several more applications, is the class of smooth nonconvex–concave minimax problems, where $g(x, \cdot)$ is concave for every $x$ but $g(\cdot, y)$ can be nonconvex. For example, finite minimax problems, i.e., $\min_x \max_{1 \le i \le m} f_i(x)$, belong to this class, and so do nonconvex constrained optimization problems [KomiyamaTHS18]. In addition, several machine learning problems with non-decomposable loss functions [kar2015surrogate] also belong to this class.
In this general nonconvex–concave setting, however, we cannot hope to find a global optimum efficiently, as even the special case of nonconvex optimization is NP-hard. Similar to nonconvex optimization, we might hope to find an approximate stationary point [nesterov1998introductory].
Our second contribution is a new algorithm and a faster rate for the general smooth nonconvex–concave minimax problem. Our algorithm is an inexact proximal point method for the nonconvex function $f(x) := \max_{y \in \mathcal{Y}} g(x, y)$. The key insight is that the proximal point problem in each iteration results in a strongly-convex–concave minimax problem, for which we use our improved algorithm to obtain the overall computation/iteration complexity of $\tilde{O}(1/k^{1/3})$, thus improving over the previous best known rate of $O(1/k^{1/5})$ [jin2019minmax]. (Footnote 1: While [jin2019minmax] gives a rate of $O(1/k^{1/4})$ with an approximate maximization oracle for $\max_{y} g(x, y)$, taking into account the cost of implementing such a maximization oracle gives a rate of $O(1/k^{1/5})$.)
Finally, we specialize our result to finite minimax problems, i.e., $\min_x \max_{1 \le i \le m} f_i(x)$, where each $f_i$ can be a nonconvex function but is smooth; nonconvex constrained optimization problems can be reduced to such finite minimax problems. For these, we obtain a rate of $O(m (\log m)^{3/2} / k^{1/3})$ total gradient computations, which improves upon the state-of-the-art rate in this setting as well.
Summary of contributions: See also Table 1.
1. $\tilde{O}(1/k^2)$ convergence rate for smooth, strongly-convex–concave problems, improving upon the previous best known rate of $O(1/k)$, and
2. $\tilde{O}(1/k^{1/3})$ convergence rate for smooth, nonconvex–concave problems, improving upon the previous best known rate of $O(1/k^{1/5})$.
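Setting | Optimality notion | Previous state-of-the-art | Our results | Lower bound
Convex | Primal-dual gap | $O(1/k)$ [nemirovski2004prox] | - | $\Omega(1/k)$ [ouyang2018lower]
Strongly convex | Primal-dual gap | $O(1/k)$ [nemirovski2004prox] | $\tilde{O}(1/k^2)$ | $\Omega(1/k^2)$ [ouyang2018lower]
Nonconvex | Approximate stationary point | $O(1/k^{1/5})$ [jin2019minmax] | $\tilde{O}(1/k^{1/3})$ | -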
Related works: For strongly-convex–concave minimax problems with special structures, several algorithms have been proposed. In increasing order of generality, [GOS14, xu2017iteration, xu2018accelerated] study optimizing a strongly convex function with linear constraints, which can be posed as a special case of minimax optimization; [nesterov2005excessive] studies a minimax problem where $x$ and $y$ are connected only through a bilinear term; and [hamedani2018primal] and [juditsky2011first] study a case where $g(x, y)$ is linear in $y$. In all these cases, it is shown that an $O(1/k^2)$ convergence rate is achievable if $g(\cdot, y)$ is strongly convex for every $y$. Recently, [zhao2019optimal] provides a unified approach that achieves an $O(1/k)$ convergence rate for the general convex-concave case and $O(1/k^2)$ for a special case with strongly convex $g(\cdot, y)$ and linear $g(x, \cdot)$. However, it has remained an open question whether the fast rate of $O(1/k^2)$ can be achieved for general strongly-convex–concave minimax problems.
For nonconvex-concave minimax problems, [rafique2018non] considers both deterministic and stochastic settings, and proposes inexact proximal point methods for solving smooth nonconvex–concave problems. In the deterministic setting, their result guarantees an error of . We note that there have also been other notions of stationarity proposed in the literature for nonconvex-concave minimax problems [lu2019hybrid, nouiehed2019solving]. These notions, however, are weaker than the one considered in this paper, in the sense that our notion of stationarity implies these other notions (without loss in parameters). For one such weaker notion, [nouiehed2019solving] proposes an algorithm with a convergence rate of . Since the notion they consider is weaker, it does not imply the same convergence rate in our setting.
We would also like to highlight the work on variational inequalities, which are a generalization of minimax optimization problems. In particular, monotone variational inequalities generalize the convex-concave minimax problems and have applications in solving differential equations [kinderlehrer1980introduction]. There have also been a large number of works designing efficient algorithms for finding solutions to monotone variational inequalities [bruck1977weak, nemirovsky1981, nemirovski2004prox].
Notations: $\mathbb{R}$ is the real line and for any natural number $p$, $\mathbb{R}^p$ is the real vector space of dimension $p$. $\|\cdot\|$ is a norm on some metric space which would be evident from the context. For a convex set $\mathcal{X}$ and a point $x$, $\mathcal{P}_{\mathcal{X}}(x)$ is the projection of $x$ onto $\mathcal{X}$. For a differentiable function $g(x, y)$, $\nabla_x g(x, y)$ is its gradient with respect to $x$ at $(x, y)$. We use the standard big-O notations. For functions such that , (a) means ; (b) means and ; and (c) means that for some polylogarithmic function .
Paper organization: In Section 2, we present preliminaries and all relevant background. In Section 3, we present our results for the strongly-convex–concave setting and in Section 4, results for the nonconvex–concave setting. In Section 5, we present an empirical evaluation of our algorithm for the nonconvex-concave setting and compare it to a state-of-the-art algorithm. We conclude in Section 6. Several technical details are presented in the appendix.
2 Preliminaries and background material
In this section, we will present some preliminaries, describing the setup and reviewing some background material that will be useful in the sequel.
2.1 Minimax problems
We are interested in minimax problems of the form (1) where $g$ is a smooth function.
Definition 1.
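A function $g(x, y)$ is said to be $L$-smooth if:
$\max\{\|\nabla_x g(x, y) - \nabla_x g(x', y')\|, \|\nabla_y g(x, y) - \nabla_y g(x', y')\|\} \le L\,(\|x - x'\| + \|y - y'\|)$.
Throughout, we assume that $g(x, \cdot)$ is concave for every $x$. For behavior in terms of $x$, there are broadly two settings: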
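2.1.1 Convex-concave setting
In this setting, $g(\cdot, y)$ is convex for every $y$. Given any $\hat{x} \in \mathcal{X}$ and $\hat{y} \in \mathcal{Y}$, the following holds trivially:
$\min_{x \in \mathcal{X}} g(x, \hat{y}) \le g(\hat{x}, \hat{y}) \le \max_{y \in \mathcal{Y}} g(\hat{x}, y)$,
which then implies that $\max_{y} \min_{x} g(x, y) \le \min_{x} \max_{y} g(x, y)$. The celebrated minimax theorem for the convex-concave setting [sion1958general] says that if $\mathcal{Y}$ is a compact set then the above inequality is in fact an equality, i.e., $\max_{y} \min_{x} g(x, y) = \min_{x} \max_{y} g(x, y)$. Furthermore, any point $(x^*, y^*)$ is an optimal solution to (1) if and only if:
(2) $\min_{x} g(x, y^*) = g(x^*, y^*) = \max_{y} g(x^*, y)$
Hence, our goal is to find a primal-dual pair $(\hat{x}, \hat{y})$ with small primal-dual gap: $\max_{y} g(\hat{x}, y) - \min_{x} g(x, \hat{y})$.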
Definition 2.
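For a convex-concave function $g$, $(\hat{x}, \hat{y})$ is an $\epsilon$-primal-dual pair of $g$ if the primal-dual gap is less than $\epsilon$: $\max_{y} g(\hat{x}, y) - \min_{x} g(x, \hat{y}) \le \epsilon$.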
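To make the primal-dual gap concrete, here is a minimal Python sketch. The objective `g`, the box domain $[-1, 1]$ for both variables, and the grid-search evaluation are all assumptions chosen for illustration; none of them come from the paper. For this toy objective the gap has the closed form $\hat{x}^2 + \hat{y}^2$, so it vanishes exactly at the saddle point $(0, 0)$.

```python
def g(x, y):
    """Toy objective (an assumption): strongly convex in x, strongly concave in y."""
    return 0.5 * x * x + x * y - 0.5 * y * y


def primal_dual_gap(x_hat, y_hat, grid_size=2001):
    """Approximate max_y g(x_hat, y) - min_x g(x, y_hat) over [-1, 1] by grid search."""
    grid = [-1.0 + 2.0 * i / (grid_size - 1) for i in range(grid_size)]
    primal_value = max(g(x_hat, y) for y in grid)  # value of the max over y at x_hat
    dual_value = min(g(x, y_hat) for x in grid)    # value of the min over x at y_hat
    return primal_value - dual_value


gap_at_saddle = primal_dual_gap(0.0, 0.0)    # the saddle point has zero gap
gap_off_saddle = primal_dual_gap(0.5, 0.5)   # closed form: 0.5**2 + 0.5**2
```

A pair $(\hat{x}, \hat{y})$ is an $\epsilon$-primal-dual pair in the sense of Definition 2 exactly when this quantity is at most $\epsilon$.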
2.1.2 Nonconvex-concave setting
In this setting the function $g(\cdot, y)$ need not be convex. One cannot hope to solve such problems in general, since the special case of nonconvex optimization is already NP-hard [nouiehed2018convergence]. Furthermore, the minimax theorem no longer holds, i.e., $\max_{y} \min_{x} g(x, y)$ can be strictly smaller than $\min_{x} \max_{y} g(x, y)$. Oftentimes the order of min and max might be important for a given application, i.e., we might be interested only in minimax but not maximin (or vice versa). So, the primal-dual gap may not be a meaningful quantity to measure convergence. One approach, inspired by nonconvex optimization, to measure convergence is to consider the function $f(x) = \max_{y} g(x, y)$ and consider the convergence rate to approximate first order stationary points (i.e., $\|\nabla f(x)\|$ is small) [rafique2018non, jin2019minmax]. But as $f$ could be nonsmooth, $\nabla f(x)$ might not even be defined. It turns out that whenever $g$ is smooth, $f$ is weakly convex (Definition 4), for which first order stationarity notions are well-studied and are discussed below.
Approximate first-order stationary point for weakly convex functions: We first need to generalize the notion of gradient for a nonsmooth function.
Definition 3.
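The Fréchet subdifferential of a function $f$ at $x$ is defined as the set $\partial f(x) = \{u \mid \liminf_{x' \to x} (f(x') - f(x) - \langle u, x' - x \rangle) / \|x' - x\| \ge 0\}$.
In order to define approximate stationary points, we also need the notions of a weakly convex function and the Moreau envelope.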
Definition 4.
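A function $f$ is $L$-weakly convex if,
(3) $f(x) + \langle u_x, x' - x \rangle - \frac{L}{2} \|x' - x\|^2 \le f(x')$
for all Fréchet subgradients $u_x \in \partial f(x)$ and all $x, x'$.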
Definition 5.
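For a proper lower semi-continuous (l.s.c.) function $f: \mathcal{X} \to \mathbb{R} \cup \{\infty\}$ and $\lambda > 0$ ($\mathcal{X} \subseteq \mathbb{R}^p$), the Moreau envelope function is given by
(4) $f_\lambda(x) = \min_{x' \in \mathcal{X}} \left\{ f(x') + \frac{1}{2\lambda} \|x - x'\|^2 \right\}$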
The following lemma provides some useful properties of the Moreau envelope for weakly convex functions. The proof can be found in Appendix B.2.
Lemma 1.
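For an $L$-weakly convex proper l.s.c. function $f: \mathcal{X} \to \mathbb{R} \cup \{\infty\}$ ($\mathcal{X} \subseteq \mathbb{R}^p$) such that $L < 1/\lambda$, the following hold true,
(a) The minimizer $\hat{x}_\lambda(x) = \arg\min_{x' \in \mathcal{X}} \left\{ f(x') + \frac{1}{2\lambda} \|x - x'\|^2 \right\}$ is unique and $f(\hat{x}_\lambda(x)) \le f_\lambda(x) \le f(x)$. Furthermore, $\arg\min_{x} f(x) = \arg\min_{x} f_\lambda(x)$.
(b) $f_\lambda$ is smooth and thus differentiable, and
(c) $\min_{u \in \partial f(\hat{x}_\lambda(x))} \|u\| \le \frac{1}{\lambda} \|\hat{x}_\lambda(x) - x\| = \|\nabla f_\lambda(x)\|$.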
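As a small worked instance of the Moreau envelope and the properties in Lemma 1, the Python sketch below uses $f(x) = |x|$ in one dimension. This choice of $f$ is an assumption made for illustration (it is convex, hence weakly convex for any parameter); its proximal minimizer is the well-known soft-thresholding map, and its envelope is the Huber function.

```python
def prox_abs(x, lam):
    """Minimizer of |z| + (z - x)**2 / (2 * lam) over z, i.e. soft-thresholding."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_env_abs(x, lam):
    """Moreau envelope of f(z) = |z| with parameter lam, evaluated at x."""
    z = prox_abs(x, lam)
    return abs(z) + (z - x) ** 2 / (2.0 * lam)


def moreau_grad_abs(x, lam):
    """Gradient of the envelope: (x - prox(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam


lam = 0.5
far_value = moreau_env_abs(2.0, lam)    # |x| - lam / 2 in the linear region
near_value = moreau_env_abs(0.2, lam)   # x**2 / (2 * lam) in the quadratic region
near_grad = moreau_grad_abs(0.2, lam)   # small gradient close to the minimizer 0
```

Although $|x|$ is not differentiable at $0$, the envelope is differentiable everywhere, and its gradient shrinks to zero as $x$ approaches the stationary point. This is the behaviour that the approximate stationarity notion built on the Moreau envelope relies on.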
Now, a first order stationary point of a nonsmooth nonconvex function is well-defined, i.e., $x^*$ is a first order stationary point (FOSP) of a function $f$ if $0 \in \partial f(x^*)$ (see Definition 3). However, unlike smooth functions, it is nontrivial to define an approximate FOSP. For example, if we define an $\epsilon$-FOSP as a point $x$ with $\min_{u \in \partial f(x)} \|u\| \le \epsilon$, there may never exist such a point for sufficiently small $\epsilon$, unless $x$ is exactly a FOSP. In contrast, by using the above properties of the Moreau envelope of a weakly convex function, its approximate FOSP can be defined as [davis2018stochastic]:
Definition 6.
Given an $L$-weakly convex function $f$, we say that $x^*$ is an $\epsilon$-first order stationary point ($\epsilon$-FOSP) if $\|\nabla f_{\frac{1}{2L}}(x^*)\| \le \epsilon$, where $f_{\frac{1}{2L}}$ is the Moreau envelope with parameter $1/2L$.
Using Lemma 1, we can show that for any $\epsilon$-FOSP $x^*$, there exists $\hat{x}$ such that $\|\hat{x} - x^*\| \le \epsilon / 2L$ and $\min_{u \in \partial f(\hat{x})} \|u\| \le \epsilon$. In other words, an $\epsilon$-FOSP is close to a point $\hat{x}$ which has a subgradient smaller than $\epsilon$. We note that other notions of FOSP have also been proposed recently, such as in [nouiehed2019solving]. However, it can be shown that an $\epsilon$-FOSP according to the above definition is also an $\epsilon$-FOSP under the definition of [nouiehed2019solving], but the reverse is not necessarily true.
2.2 Mirror-Prox
Mirror-Prox [nemirovski2004prox] is a popular algorithm proposed for solving convex-concave minimax problems (1). It achieves a convergence rate of $O(1/k)$ for the primal-dual gap. The original Mirror-Prox paper [nemirovski2004prox] motivates the algorithm through a conceptual Mirror-Prox (CMP) method, which brings out the main idea behind its convergence rate of $O(1/k)$. CMP does the following update:
(5) 
The main difference between CMP and standard gradient descent ascent (GDA) is that in the $k$-th step, while GDA uses gradients at $(x_k, y_k)$, CMP uses gradients at $(x_{k+1}, y_{k+1})$. The key observation of [nemirovski2004prox] is that if $g$ is smooth, the CMP update can be implemented efficiently. CMP is analyzed as follows:
Implementability of CMP:
Let . For , the iteration
(6) 
can be shown to be a contraction (when $g$ is smooth) and that its fixed point is $(x_{k+1}, y_{k+1})$. So, in $O(\log(1/\epsilon))$ iterations of (6), we can obtain an $\epsilon$-accurate version of the update required by CMP. In fact, [nemirovski2004prox] showed that just two iterations of (6) suffice.
Convergence rate of CMP: Using CMP update with simple manipulations leads to the following:
An $O(1/k)$ convergence rate follows easily using the above result.
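The contrast between GDA and the Mirror-Prox idea described above can be sketched in Python. Everything concrete here is an assumption for illustration: the bilinear toy objective $g(x, y) = xy$ on the unconstrained plane, the step size `eta`, and the unprojected Euclidean form of the updates. The implicit CMP update is approximated by two fixed-point iterations started at the current iterate, which for this setup coincides with the classical extragradient step.

```python
import math


def grad(x, y):
    """Partial derivatives (dg/dx, dg/dy) of the toy objective g(x, y) = x * y."""
    return y, x


def gda_step(x, y, eta):
    """Gradient descent ascent: uses gradients at the current point (x, y)."""
    gx, gy = grad(x, y)
    return x - eta * gx, y + eta * gy


def mirror_prox_step(x, y, eta, inner_iters=2):
    """Approximate the implicit update by fixed-point iterations from (x, y)."""
    u, v = x, y
    for _ in range(inner_iters):
        gx, gy = grad(u, v)                  # gradients at the current guess
        u, v = x - eta * gx, y + eta * gy    # re-apply the step from (x, y)
    return u, v


def distance_to_saddle(step, iters=200, eta=0.1):
    """Run `step` from (1, 1) and return the distance to the saddle point (0, 0)."""
    x, y = 1.0, 1.0
    for _ in range(iters):
        x, y = step(x, y, eta)
    return math.hypot(x, y)


gda_distance = distance_to_saddle(gda_step)          # grows: GDA spirals outward
mp_distance = distance_to_saddle(mirror_prox_step)   # shrinks toward (0, 0)
```

On this example each GDA step multiplies the squared distance to the saddle point by $1 + \eta^2$, while the two-iteration step multiplies it by $1 - \eta^2 + \eta^4 < 1$, which is why only the latter converges.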
3 Strongly-convex concave saddle point problem
We first study the minimax problem of the form:
(P1) 
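where $g(x, \cdot)$ is concave, $g(\cdot, y)$ is $\sigma$-strongly-convex, $g$ is $L$-smooth, and $\mathcal{Y}$ is a convex compact subset of $\mathbb{R}^q$, and let the function $f(x) = \max_{y \in \mathcal{Y}} g(x, y)$ take a minimum value $f^*$ ($> -\infty$). Let $D_{\mathcal{Y}}$ be the diameter of $\mathcal{Y}$.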
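Our objective here is to find an $\epsilon$-primal-dual pair $(\hat{x}, \hat{y})$ (see Definition 2). Now the fact that $f(\hat{x}) - f^* \le \max_{y} g(\hat{x}, y) - \min_{x} g(x, \hat{y})$ implies that if $(\hat{x}, \hat{y})$ is an $\epsilon$-primal-dual pair, then $\hat{x}$ is also an $\epsilon$-approximate minimum of $f$. Furthermore, by Sion's minimax theorem [komiya1988elementary], strong-convexity–concavity of $g$ ensures that: $\min_{x} \max_{y} g(x, y) = \max_{y} \min_{x} g(x, y)$. Hence, one approach to efficiently solving the problem is by optimizing the dual problem $\max_{y \in \mathcal{Y}} h(y)$, where $h(y) := \min_{x} g(x, y)$. By Lemma 2, $h$ is a smooth function. So we can use AGD to ensure that $h(y^*) - h(y_k) = O(1/k^2)$. Now, each step of AGD requires computing $\nabla h(y)$, which can be done efficiently (i.e., in a logarithmic number of steps) as $g(\cdot, y)$ is strongly-convex and smooth. So, the overall first-order oracle complexity is .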
Lemma 2.
For a $\sigma$-strongly-convex–concave $L$-smooth function $g$, $h(y) = \min_{x} g(x, y)$ is a smooth concave function.
So does this simple approach give us our desired result? Unfortunately that is not the case, as the above bound on the dual function $h$ does not translate to the same error rate for the primal function $f$, i.e., the solution need not be an $\epsilon$-primal-dual pair. E.g., consider , where , and . If , then and so is .
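The mismatch between dual and primal error can be seen on a small example. The Python sketch below uses a toy problem of our own choosing, not necessarily the example intended in the text: $g(x, y) = \frac{\sigma}{2} x^2 + xy$ with $x$ unconstrained and $y \in [-1, 1]$. Here the dual function is $-y^2 / (2\sigma)$, the primal function is $\frac{\sigma}{2} x^2 + |x|$, and both optimal values are $0$.

```python
SIGMA = 1.0  # strong convexity parameter of the toy problem (an assumption)


def best_response_x(y):
    """Exact minimizer over x of g(x, y) = 0.5 * SIGMA * x**2 + x * y."""
    return -y / SIGMA


def dual_value(y):
    """h(y) = min_x g(x, y), which equals -y**2 / (2 * SIGMA)."""
    x = best_response_x(y)
    return 0.5 * SIGMA * x * x + x * y


def primal_value(x):
    """f(x) = max over y in [-1, 1] of g(x, y) = 0.5 * SIGMA * x**2 + |x|."""
    return 0.5 * SIGMA * x * x + abs(x)


y_k = 1e-3                                                 # a near-optimal dual iterate
dual_error = 0.0 - dual_value(y_k)                         # h(y*) - h(y_k), with h(y*) = 0
primal_error = primal_value(best_response_x(y_k)) - 0.0    # f(x_k) - f*, with f* = 0
```

In this sketch a dual error of order $\delta^2$ corresponds to a primal error of order $\delta$, so an accelerated rate on the dual function alone does not carry over to the primal-dual gap.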
Instead of using AGD, we introduce a new method to solve the dual problem that we refer to as DIAG, which stands for Dual Implicit Accelerated Gradient. DIAG combines ideas from AGD [nesterov1983method] and Nemirovski's original derivation of the Mirror-Prox algorithm [nemirovski2004prox], and can ensure a fast convergence rate of $\tilde{O}(1/k^2)$ for the primal-dual gap. For better exposition, we first present a conceptual version of DIAG (C-DIAG), which is not implementable exactly, but brings out the main new ideas in our algorithm. We then present a detailed error analysis for the inexact version of this algorithm, which is implementable.
3.1 Conceptual version: C-DIAG
The pseudocode for the C-DIAG algorithm is presented in Algorithm 1. The main idea of the algorithm is in Step 4, where we simultaneously find and satisfying the following requirements:

is the minimizer of , and
Implementability: The first question is whether it is easy enough to implement such a step? It turns out that it is indeed possible to quickly find points and that approximately satisfy the above requirements. The reason is that:

Since $g(\cdot, y)$ is smooth and strongly convex for every $y$, we can find an $\epsilon$-approximate minimizer $x$ for a given $y$ in $O(\log(1/\epsilon))$ iterations.
Convergence rate: Since and correspond to an AGD update for , we can use the potential function decrease argument for AGD (Lemma 4 in Appendix A) to conclude that ,
where the last step follows from the fact that and so . Noting that we can further recursively bound as above, we obtain
Since for every , we have
where . Since and are arbitrary above, this gives an $O(1/k^2)$ convergence rate for the primal-dual gap.
3.2 Error analysis
The main issue with Algorithm 1 is that the update step is not exactly implementable. However, as we noted in the previous section, we can quickly find updates that almost satisfy the requirements. Algorithm 2 presents this inexact version. The following theorem states our formal result and a detailed proof is provided in Appendix B.4.
Theorem 1 (Convergence rate of DIAG).
Let be a smooth, strongly-convex–concave function on and a convex compact subset . Then, after iterations, DIAG (Algorithm 2) finds s.t.:
(8) 
In particular, setting we have: . Furthermore, for this setting the total first order oracle complexity is given by: .
Remark 1: Theorem 1 shows that DIAG needs gradient queries for finding a primal-dual pair, while the current best-known rate is achieved by Mirror-Prox. This dependence in and is optimal, as it is shown in [ouyang2018lower, Theorem 10] that gradient queries are necessary to achieve error in the primal-dual gap.
Remark 2: Unlike standard AGD for , which only updates in the outer loop, DIAG's outer step updates both and , thus allowing us to better track the primal-dual gap. However, DIAG's dependence on the condition number seems suboptimal and can perhaps be improved if we do not compute Imp-STEP nearly optimally, allowing for inexact updates; we leave further investigation into improved dependence on the condition number for future work.
4 Nonconvex concave saddle point problem
We study the nonconvex concave minimax problem (1) where $g(x, \cdot)$ is concave, $g(\cdot, y)$ is nonconvex, and $g$ is smooth, (such that ) and is a convex compact subset of . As mentioned in Section 2, we measure the convergence to an approximate FOSP of this problem (see Definition 6), but this requires weak convexity of $f$. The following lemma guarantees weak convexity of $f$ given smoothness of $g$.
Lemma 3.
Let $g(\cdot, y)$ be continuous and $\mathcal{Y}$ be compact. Then $f(x) = \max_{y \in \mathcal{Y}} g(x, y)$ is weakly convex if $g$ is weakly convex in $x$ (Definition 4), or if $g$ is smooth in $x$.
See Appendix B.3 for the proof. The arguments of [jin2019minmax] easily extend to show that applying the subgradient method on $f$ [davis2018stochastic] gives a convergence rate of . Instead, we exploit the smooth minimax form of $f$ to design a faster converging scheme. The main intuition comes from the proximal viewpoint that gradient descent can be viewed as iteratively forming and optimizing local quadratic upper bounds. As $f$ is weakly convex, adding enough quadratic regularization should ensure that the resulting sequence of problems are all strongly-convex–concave. We then exploit DIAG to efficiently solve such local quadratic problems to obtain improved convergence rates. Concretely, let
(9) 
By weak convexity of $f$, the regularized problem (9) is strongly-convex–concave (Lemma 5) and can be solved using DIAG up to a certain accuracy to obtain the next iterate. We refer to this algorithm as Prox-DIAG and provide pseudocode for it in Algorithm 3.
(10) 
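The proximal point idea behind Prox-DIAG can be illustrated with a one-dimensional Python sketch. All specifics are assumptions made for illustration: the toy objective $g(x, y) = y \sin(x)$ with $y \in [-1, 1]$, so that $f(x) = \max_{y} g(x, y) = |\sin(x)|$; the constant `L_SMOOTH = 2.0` as an upper bound on the smoothness of $g$; the regularizer $L (x - x_k)^2$; and a ternary search standing in for the inner solver. It is not the paper's Algorithm 3, which solves each regularized minimax subproblem with DIAG.

```python
import math

L_SMOOTH = 2.0  # assumed upper bound on the smoothness constant of the toy g


def f(x):
    """f(x) = max over y in [-1, 1] of y * sin(x) = |sin(x)| (nonconvex, nonsmooth)."""
    return abs(math.sin(x))


def prox_subproblem(x_k, iters=200):
    """Minimize f(x) + L_SMOOTH * (x - x_k)**2 by ternary search.

    The regularized objective is strongly convex here, so ternary search on an
    interval around x_k that contains the minimizer is valid.
    """
    def phi(x):
        return f(x) + L_SMOOTH * (x - x_k) ** 2

    lo, hi = x_k - 1.0, x_k + 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if phi(m1) < phi(m2):
            hi = m2   # the minimizer lies in [lo, m2]
        else:
            lo = m1   # the minimizer lies in [m1, hi]
    return 0.5 * (lo + hi)


def proximal_point(x0, outer_iters=20):
    """Outer loop: repeatedly solve the regularized subproblem around the iterate."""
    x = x0
    for _ in range(outer_iters):
        x = prox_subproblem(x)
    return x


x_final = proximal_point(1.0)  # approaches the stationary point x = 0 of |sin(x)|
```

Each outer step only needs to solve a strongly convex problem, which is the structural property the text exploits. In the actual method that subproblem is a strongly-convex–concave minimax problem rather than a scalar minimization.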
The following theorem gives convergence guarantees for Prox-DIAG.
Theorem 2 (Convergence rate of Prox-DIAG).
Let $g$ be $L$-smooth, $g(x, \cdot)$ be concave, $\mathcal{X}$ be $\mathbb{R}^p$, $\mathcal{Y}$ be a convex compact subset of $\mathbb{R}^q$, and the minimum value of the function $f$ be bounded below, i.e. $f^* > -\infty$. Then Prox-DIAG (Algorithm 3) after,
steps outputs an $\epsilon$-FOSP. The total first-order oracle complexity to output an $\epsilon$-FOSP is:
Note that Prox-DIAG solves the quadratic approximation problem to higher accuracy of , which then helps in bounding the gradient of the Moreau envelope. Also, due to the modular structure of the argument, a faster inner loop for special settings, e.g., when $g$ is a finite-sum, can ensure a more efficient algorithm. While our algorithm is able to significantly improve upon the existing state-of-the-art rate of $O(1/k^{1/5})$ in the general nonconvex-concave setting [jin2019minmax], it is unclear if the rate can be further improved. In fact, precise lower bounds for this setting are mostly unexplored and we leave further investigation into lower bounds as a topic of future research.
Proof.
We first note that by Lemma 5 and weak convexity of and strong convexity of , is strongly-convex. Similarly, is also strongly-convex.
We now divide the analysis of each iteration of our algorithm into two cases:
Case 1: . As every instance of Case 1 ensures , we can have only Case 1 steps before termination. This claim requires monotonic decrease in , which holds until , after which , which in turn implies that Prox-DIAG terminates (see the termination condition of Prox-DIAG).
Case 2: : In this case, we show that is already an FOSP and the algorithm returns .
(11) 
Define as the point satisfying . By strong convexity of