Efficient Algorithms for Smooth Minimax Optimization

07/02/2019 ∙ by Kiran Koshy Thekumparampil, et al. ∙ Microsoft ∙ University of Illinois at Urbana-Champaign ∙ University of Washington

This paper studies first order methods for solving smooth minimax optimization problems min_x max_y g(x,y) where g(·,·) is smooth and g(x,·) is concave for each x. In terms of g(·,y), we consider two settings -- strongly convex and nonconvex -- and improve upon the best known rates in both. For strongly-convex g(·, y), ∀ y, we propose a new algorithm combining Mirror-Prox and Nesterov's AGD, and show that it can find the global optimum at a rate of Õ(1/k^2), improving over the current state-of-the-art rate of O(1/k). We use this result along with an inexact proximal point method to provide an Õ(1/k^{1/3}) rate for finding stationary points in the nonconvex setting where g(·, y) can be nonconvex. This improves over the current best-known rate of O(1/k^{1/5}). Finally, we instantiate our result for finite nonconvex minimax problems, i.e., min_x max_{1≤i≤m} f_i(x), with nonconvex f_i(·), to obtain a convergence rate of O(m (log m)^{3/2} / k^{1/3}) total gradient evaluations for finding a stationary point.

1 Introduction

In this paper we study smooth minimax problems of the form:

min_{x ∈ 𝒳} max_{y ∈ 𝒴} g(x, y).    (1)

The problem has applications in several domains such as machine learning [goodfellow2014generative, madry2017towards], optimization [bertsekas2014constrained], statistics [berger2013statistical], mathematics [kinderlehrer1980introduction], and game theory [myerson2013game]. Given the importance of these problems, there is an extensive body of work that studies various algorithms and their convergence properties. The vast majority of existing results for this problem focus on the convex-concave setting, where g(·, y) is convex for every y and g(x, ·) is concave for every x. The best known convergence rate in this setting is O(1/k) for the primal-dual gap, achieved for example by Mirror-Prox [nemirovski2004prox]. This rate is also known to be optimal for the class of smooth convex-concave problems [ouyang2018lower]. A natural question is whether we can achieve faster convergence if we have strong convexity (as opposed to just convexity) of g(·, y). We answer this in the affirmative, by introducing an algorithm that achieves a convergence rate of Õ(1/k^2) for the general smooth, strongly-convex–concave minimax problem. The algorithm we propose is a novel combination of Mirror-Prox and Nesterov's accelerated gradient descent. Its rate matches the known lower bound of Ω(1/k^2) from [ouyang2018lower], closing the gap up to a poly-logarithmic factor. The only known upper bounds that obtain a rate of O(1/k^2) in this context are for very special cases, where x and y are connected through a bilinear term or g(x, ·) is linear [nesterov2005excessive, juditsky2011first, GOS14, chambolle2016ergodic, he2016accelerated, xu2017iteration, hamedani2018primal, xie2019accelerated].

While most theoretical results focus on the convex-concave setting, several real world problems fall outside this class. A slightly larger class, which captures several more applications, is the class of smooth nonconvex–concave minimax problems, where g(x, ·) is concave for every x but g(·, y) can be nonconvex. For example, finite minimax problems, i.e., min_x max_{1≤i≤m} f_i(x), belong to this class, and so do nonconvex constrained optimization problems [KomiyamaTHS18]. In addition, several machine learning problems with non-decomposable loss functions [kar2015surrogate] also belong to this class.
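One standard way to see why a finite minimax problem fits the nonconvex–concave template is to write max_i f_i(x) as a maximum over the probability simplex of a function that is linear (hence concave) in y. The text above does not spell out this reduction, so the sketch below is an illustration under that assumption; the three functions in `fs` are hypothetical examples.

```python
# Sketch: max_i f_i(x) equals max over simplex points y of sum_i y_i * f_i(x).
# The functions in `fs` are hypothetical and chosen only for illustration.

def g(x, y, fs):
    """g(x, y) = sum_i y_i * f_i(x); linear, hence concave, in y."""
    return sum(y_i * f(x) for y_i, f in zip(y, fs))


def finite_max(x, fs):
    """max_i f_i(x), the objective of the finite minimax problem."""
    return max(f(x) for f in fs)


def best_vertex(x, fs):
    """Simplex vertex that puts all weight on a largest f_i(x)."""
    values = [f(x) for f in fs]
    i_star = values.index(max(values))
    return [1.0 if i == i_star else 0.0 for i in range(len(fs))]


# f_2 is nonconvex in x, yet g stays linear in y.
fs = [lambda x: x**2 - 1.0, lambda x: 1.0 - x**2, lambda x: 0.5 * x]

for x in (-2.0, 0.3, 1.5):
    uniform = [1.0 / len(fs)] * len(fs)
    print(x, finite_max(x, fs), g(x, best_vertex(x, fs), fs), g(x, uniform, fs))
```

A linear function on the simplex attains its maximum at a vertex, which is why the vertex value coincides with max_i f_i(x) while any other simplex point gives a value that is no larger.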

In this general nonconvex–concave setting, however, we cannot hope to find a global optimum efficiently, as even the special case of nonconvex optimization is NP-hard. As in nonconvex optimization, we might instead hope to find an approximate stationary point [nesterov1998introductory].

Our second contribution is a new algorithm and a faster rate for the general smooth nonconvex–concave minimax problem. Our algorithm is an inexact proximal point method for the nonconvex function f(x) := max_y g(x, y). The key insight is that the proximal point problem in each iteration results in a strongly-convex–concave minimax problem, for which we use our improved algorithm to obtain an overall computation/iteration complexity of Õ(1/k^{1/3}), thus improving over the previous best known rate of O(1/k^{1/5}) [jin2019minmax]. (Footnote 1: While [jin2019minmax] states its rate assuming an approximate maximization oracle for max_y g(x, y), taking into account the cost of implementing such a maximization oracle gives a rate of O(1/k^{1/5}).)

Finally, we specialize our result to finite minimax problems, i.e., min_x max_{1≤i≤m} f_i(x), where each f_i can be a nonconvex function but is smooth; nonconvex constrained optimization problems can be reduced to such finite minimax problems. For these, we obtain a rate of O(m (log m)^{3/2} / k^{1/3}) in terms of total gradient computations, which improves upon the state-of-the-art rate in this setting as well.


Summary of contributions: See also Table 1.
1. Õ(1/k^2) convergence rate for smooth, strongly-convex–concave problems, improving upon the previous best known rate of O(1/k), and
2. Õ(1/k^{1/3}) convergence rate for smooth, nonconvex–concave problems, improving upon the previous best known rate of O(1/k^{1/5}).

Setting | Optimality notion | Previous state-of-the-art | Our results | Lower bound
Convex | Primal-dual gap | O(1/k) [nemirovski2004prox] | - | Ω(1/k) [ouyang2018lower]
Strongly convex | Primal-dual gap | O(1/k) [nemirovski2004prox] | Õ(1/k^2) | Ω(1/k^2) [ouyang2018lower]
Nonconvex | Approx. stat. point | O(1/k^{1/5}) [jin2019minmax] | Õ(1/k^{1/3}) | -

Table 1: Comparison of our results with previous state-of-the-art. We assume that g(·,·) is smooth (i.e., has Lipschitz gradients) and g(x, ·) is concave ∀ x. Convexity, strong convexity and nonconvexity in the first column refer to g(·, y) for fixed y.

Related works: For strongly-convex–concave minimax problems with special structures, several algorithms have been proposed. In increasing order of generality, [GOS14, xu2017iteration, xu2018accelerated] study optimizing a strongly convex function with linear constraints, which can be posed as a special case of minimax optimization; [nesterov2005excessive] studies a minimax problem where x and y are connected only through a bilinear term; and [hamedani2018primal] and [juditsky2011first] study a case where g(x, ·) is linear. In all these cases, it is shown that an O(1/k^2) convergence rate is achievable if g(·, y) is strongly convex ∀ y. Recently, [zhao2019optimal] provides a unified approach that achieves an O(1/k) convergence rate for the general convex-concave case and O(1/k^2) for a special case with strongly convex g(·, y) and linear g(x, ·). However, it has remained an open question whether the fast rate of O(1/k^2) can be achieved for general strongly-convex–concave minimax problems.

For nonconvex–concave minimax problems, [rafique2018non] considers both deterministic and stochastic settings, and proposes inexact proximal point methods for solving smooth nonconvex–concave problems, with an error guarantee in the deterministic setting. We note that other notions of stationarity have also been proposed in the literature for nonconvex–concave minimax problems [lu2019hybrid, nouiehed2019solving]. These notions, however, are weaker than the one considered in this paper, in the sense that our notion of stationarity implies these other notions (without loss in parameters). For one such weaker notion, [nouiehed2019solving] proposes an algorithm with a convergence rate guarantee. Since the notion they consider is weaker, their guarantee does not imply the same convergence rate in our setting.

We would also like to highlight the work on variational inequalities, which are a generalization of minimax optimization problems. In particular, monotone variational inequalities generalize the convex-concave minimax problems and have applications in solving differential equations [kinderlehrer1980introduction]. There have also been a large number of works designing efficient algorithms for finding solutions to monotone variational inequalities [bruck1977weak, nemirovsky1981, nemirovski2004prox].


Notations: ℝ is the real line and, for any natural number d, ℝ^d is the real vector space of dimension d. ‖·‖ is a norm on some metric space which would be evident from the context. For a convex set 𝒳 and a point x, P_𝒳(x) is the projection of x on to 𝒳. For a differentiable function g(x, y), ∇_x g(x, y) is its gradient with respect to x at (x, y). We use the standard big-O notations. For nonnegative functions f and h of k, (a) f = O(h) means f(k) ≤ C·h(k) for some constant C > 0 and all sufficiently large k; (b) f = Θ(h) means f = O(h) and h = O(f); and (c) f = Õ(h) means that f = O(h·p) for some poly-logarithmic function p.


Paper organization: In Section 2, we present preliminaries and all relevant background. In Section 3, we present our results for the strongly-convex–concave setting and, in Section 4, results for the nonconvex–concave setting. In Section 5, we present an empirical evaluation of our algorithm for the nonconvex–concave setting and compare it to a state-of-the-art algorithm. We conclude in Section 6. Several technical details are presented in the appendix.

2 Preliminaries and background material

In this section, we will present some preliminaries, describing the setup and reviewing some background material that will be useful in the sequel.

2.1 Minimax problems

We are interested in minimax problems of the form (1) where g(·,·) is a smooth function.

Definition 1.

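A function g(x, y) is said to be L-smooth if:

max{ ‖∇_x g(x, y) − ∇_x g(x′, y′)‖, ‖∇_y g(x, y) − ∇_y g(x′, y′)‖ } ≤ L (‖x − x′‖ + ‖y − y′‖).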

Throughout, we assume that g(x, ·) is concave for every x. For the behavior in terms of x, there are broadly two settings:

2.1.1 Convex-concave setting

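In this setting, g(·, y) is convex ∀ y. Given any x̂ and ŷ, the following holds trivially:

min_x g(x, ŷ) ≤ g(x̂, ŷ) ≤ max_y g(x̂, y),

which then implies that max_y min_x g(x, y) ≤ min_x max_y g(x, y). The celebrated minimax theorem for the convex-concave setting [sion1958general] says that if 𝒴, the domain of y, is a compact set then the above inequality is in fact an equality, i.e., max_y min_x g(x, y) = min_x max_y g(x, y). Furthermore, any point (x*, y*) is an optimal solution to (1) if and only if:

min_x g(x, y*) = g(x*, y*) = max_y g(x*, y).    (2)

Hence, our goal is to find an ε-primal-dual pair (x̂, ŷ) with small primal-dual gap: max_y g(x̂, y) − min_x g(x, ŷ).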

Definition 2.

For a convex-concave function g(·,·), (x̂, ŷ) is an ε-primal-dual-pair of g if the primal-dual gap is less than ε: max_y g(x̂, y) − min_x g(x, ŷ) ≤ ε.
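As a small numerical illustration of the primal-dual gap, max_y g(x̂, y) − min_x g(x, ŷ), the sketch below evaluates it by brute force on a grid. The function g(x, y) = x²/2 + xy and the domains [−1, 1] for both variables are hypothetical choices made only for this example; they are not taken from the paper.

```python
# Sketch: brute-force primal-dual gap for a toy problem.
# Assumed example: g(x, y) = x^2 / 2 + x * y with x, y in [-1, 1].

def g(x, y):
    """Strongly convex in x, linear (hence concave) in y."""
    return 0.5 * x * x + x * y


# Uniform grid on [-1, 1] with spacing 0.001.
GRID = [-1.0 + 0.001 * i for i in range(2001)]


def primal_dual_gap(x_hat, y_hat):
    """max_y g(x_hat, y) - min_x g(x, y_hat), approximated on the grid."""
    primal_value = max(g(x_hat, y) for y in GRID)
    dual_value = min(g(x, y_hat) for x in GRID)
    return primal_value - dual_value


print(primal_dual_gap(0.0, 0.0))  # gap at the saddle point
print(primal_dual_gap(0.5, 0.5))  # gap at a non-optimal pair
```

For this toy problem the gap has the closed form x̂²/2 + |x̂| + ŷ²/2, so it vanishes only at the saddle point (0, 0) and is positive everywhere else.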

2.1.2 Nonconvex-concave setting

In this setting the function g(·, y) need not be convex. One cannot hope to solve such problems in general, since the special case of nonconvex optimization is already NP-hard [nouiehed2018convergence]. Furthermore, the minimax theorem no longer holds, i.e., max_y min_x g(x, y) can be strictly smaller than min_x max_y g(x, y). Oftentimes the order of min and max might be important for a given application, i.e., we might be interested only in minimax but not maximin (or vice versa). So, the primal-dual gap may not be a meaningful quantity to measure convergence. One approach, inspired by nonconvex optimization, is to consider the function f(x) := max_y g(x, y) and measure the convergence rate to approximate first order stationary points (i.e., points where ‖∇f(x)‖ is small) [rafique2018non, jin2019minmax]. But as f could be non-smooth, ∇f(x) might not even be defined. It turns out that whenever g is smooth, f is weakly convex (Definition 4), a class for which first order stationarity notions are well-studied; these are discussed below.


Approximate first-order stationary point for weakly convex functions: We first need to generalize the notion of gradient for a non-smooth function.

Definition 3.

The Fréchet sub-differential of a function f(·) at x is defined as the set ∂f(x) = { u : liminf_{x′→x} (f(x′) − f(x) − ⟨u, x′ − x⟩) / ‖x′ − x‖ ≥ 0 }.

In order to define approximate stationary points, we also need the notions of a weakly convex function and the Moreau envelope.

Definition 4.

A function f is L-weakly convex if,

f(x) + ⟨u_x, x′ − x⟩ − (L/2) ‖x′ − x‖² ≤ f(x′),    (3)

for all Fréchet subgradients u_x ∈ ∂f(x) and for all x, x′.

Definition 5.

For a proper lower semi-continuous (l.s.c.) function f and λ > 0, the Moreau envelope function is given by

f_λ(x) = min_{x′} f(x′) + (1/(2λ)) ‖x − x′‖².    (4)

The following lemma provides some useful properties of the Moreau envelope for weakly convex functions. The proof can be found in Appendix B.2.

Lemma 1.

For an -weakly convex proper l.s.c. function () such that , the following hold true,

  1. The minimizer is unique and . Furthermore, .

  2. is -smooth and thus differentiable, and

  3. .

Now, a first order stationary point of a non-smooth nonconvex function is well-defined, i.e., x is a first order stationary point (FOSP) of a function f if 0 ∈ ∂f(x) (see Definition 3). However, unlike for smooth functions, it is nontrivial to define an approximate FOSP. For example, if we define an ε-FOSP as a point x with min_{u ∈ ∂f(x)} ‖u‖ ≤ ε, there may never exist such a point for sufficiently small ε, unless x is exactly a FOSP. In contrast, by using the above properties of the Moreau envelope of a weakly convex function, its approximate FOSP can be defined as [davis2018stochastic]:

Definition 6.

Given an L-weakly convex function f, we say that x* is an ε-first order stationary point (ε-FOSP) if ‖∇f_{1/(2L)}(x*)‖ ≤ ε, where f_{1/(2L)} is the Moreau envelope with parameter 1/(2L).

Using Lemma 1, we can show that for any ε-FOSP x*, there exists x̂ such that ‖x̂ − x*‖ ≤ ε/(2L) and min_{u ∈ ∂f(x̂)} ‖u‖ ≤ ε. In other words, an ε-FOSP is O(ε)-close to a point x̂ which has a subgradient smaller than ε. We note that other notions of FOSP have also been proposed recently, such as in [nouiehed2019solving]. However, it can be shown that an ε-FOSP according to the above definition is also an ε-FOSP under the definition of [nouiehed2019solving], but the reverse is not necessarily true.
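To make the Moreau-envelope notion of approximate stationarity concrete, the sketch below uses the hypothetical example f(x) = |x|, which is convex and therefore weakly convex. It assumes the standard envelope f_λ(x) = min_z f(z) + (z − x)²/(2λ) and the standard identity ∇f_λ(x) = (x − prox(x))/λ; the function names are illustrative.

```python
# Sketch: Moreau envelope of f(x) = |x| (assumed example, closed-form prox).

def prox_abs(x, lam):
    """argmin_z |z| + (z - x)^2 / (2 * lam), i.e. soft-thresholding."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_env_abs(x, lam):
    """Value of the Moreau envelope of |.| at x."""
    z = prox_abs(x, lam)
    return abs(z) + (z - x) ** 2 / (2.0 * lam)


def moreau_grad_abs(x, lam):
    """Gradient of the envelope: (x - prox(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam


x, lam = 0.001, 0.5
print(prox_abs(x, lam), moreau_env_abs(x, lam), moreau_grad_abs(x, lam))
```

At x = 0.001 the only subgradient of |x| is 1, so no subgradient-based ε-FOSP test with ε < 1 can pass, even though x is very close to the minimizer. The envelope gradient, by contrast, is small there, and the proximal point 0 is an exact stationary point.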

2.2 Mirror-Prox

Mirror-Prox [nemirovski2004prox] is a popular algorithm proposed for solving convex-concave minimax problems (1). It achieves a convergence rate of O(1/k) for the primal-dual gap. The original Mirror-Prox paper [nemirovski2004prox] motivates the algorithm through a conceptual Mirror-Prox (CMP) method, which brings out the main idea behind its convergence rate of O(1/k). CMP does the following update:

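(x_{k+1}, y_{k+1}) = (x_k, y_k) + η (−∇_x g(x_{k+1}, y_{k+1}), ∇_y g(x_{k+1}, y_{k+1})),    (5)

where η > 0 is a step size. The main difference between CMP and standard gradient descent ascent (GDA) is that in the k-th step, while GDA uses gradients at (x_k, y_k), CMP uses gradients at (x_{k+1}, y_{k+1}). The key observation of [nemirovski2004prox] is that if g is smooth, it can be implemented efficiently. CMP is analyzed as follows:
Implementability of CMP: Let (x^(0), y^(0)) = (x_k, y_k). For a sufficiently small step size η, the iteration

(x^(i+1), y^(i+1)) = (x_k, y_k) + η (−∇_x g(x^(i), y^(i)), ∇_y g(x^(i), y^(i)))    (6)

can be shown to be a contraction (when g is smooth), and its fixed point is (x_{k+1}, y_{k+1}). So, in O(log(1/ε)) iterations of (6), we can obtain an ε-accurate version of the update required by CMP. In fact, [nemirovski2004prox] showed that just two iterations of (6) suffice.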
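The difference between GDA and the implicit CMP update can be seen on a toy problem. The sketch below assumes the bilinear function g(x, y) = x·y, a step size η = 0.5, and an implicit update of the form "new point = old point + η × gradient field at the new point", approximated by fixed-point iteration. All of these are illustrative choices rather than the paper's exact setting.

```python
import math

# Sketch: GDA versus a conceptual (implicit) Mirror-Prox step on g(x, y) = x * y.
# The problem, step size, and iteration counts are assumptions for illustration.

def grad_field(x, y):
    """(-dg/dx, dg/dy) for g(x, y) = x * y."""
    return -y, x


def gda_step(x, y, eta):
    """Explicit step: gradients evaluated at the current point."""
    dx, dy = grad_field(x, y)
    return x + eta * dx, y + eta * dy


def cmp_step(x, y, eta, inner_iters=30):
    """Implicit step, approximated by fixed-point iteration started at (x, y)."""
    u, v = x, y
    for _ in range(inner_iters):
        dx, dy = grad_field(u, v)
        u, v = x + eta * dx, y + eta * dy
    return u, v


def run(step, steps=50, eta=0.5):
    """Distance to the saddle point (0, 0) after `steps` updates from (1, 1)."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        x, y = step(x, y, eta)
    return math.hypot(x, y)


print("GDA:", run(gda_step))
print("CMP:", run(cmp_step))
print("two inner iterations:", run(lambda x, y, eta: cmp_step(x, y, eta, 2)))
```

On this problem each GDA step multiplies the distance to the saddle point by √(1 + η²), so GDA spirals outward, whereas the implicit step divides it by the same factor. Truncating the fixed-point iteration after two inner iterations still contracts here, in line with the remark that two iterations suffice.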
Convergence rate of CMP: Using CMP update with simple manipulations leads to the following:

An O(1/k) convergence rate follows easily using the above result.

Finally, our method and analysis also require Nesterov's accelerated gradient descent method (see Algorithm 4 in Appendix A) and its per-step analysis by [bansal2017potential] (Lemma 4 in Appendix A).

3 Strongly-convex concave saddle point problem

We first study the minimax problem of the form:

(P1)

where is concave, is -strongly-convex, is -smooth, i.e., . and is a convex compact sub-set of and let the function take a minimum value (). Let be the diameter of .

Our objective here is to find an -primal-dual pair (see Definition 2). Now the fact that implies that if is an -primal-dual-pair, then is also an -approximate minima of . Furthermore, by Sion’s minimax theorem [komiya1988elementary], strong-convexity–concavity of ensures that: . Hence, one approach to efficiently solving the problem is by optimizing the dual problem . By Lemma 2, is an -smooth function. So we can use AGD to ensure that . Now, each step of AGD requires computing which can be done efficiently (i.e., logarithmic number of steps) as is strongly-convex and smooth. So, the overall first-order oracle complexity is .

Lemma 2.

For a -strongly-convex–concave -smooth function , is an -smooth concave function.

So does this simple approach give us our desired result? Unfortunately that is not the case, as the above bound on the dual function does not translate to the same error rate for primal function , i.e., the solution need not be -primal-dual pair. E.g., consider , where , and . If , then and so is .

Instead of using AGD, we introduce a new method to solve the dual problem that we refer to as DIAG, which stands for Dual Implicit Accelerated Gradient. DIAG combines ideas from AGD [nesterov1983method] and Nemirovski's original derivation of the Mirror-Prox algorithm [nemirovski2004prox], and can ensure a fast convergence rate of Õ(1/k^2) for the primal-dual gap. For better exposition, we first present a conceptual version of DIAG (C-DIAG), which is not implementable exactly, but brings out the main new ideas in our algorithm. We then present a detailed error analysis for the inexact version of this algorithm, which is implementable.

3.1 Conceptual version: C-DIAG

The pseudocode for the C-DIAG algorithm is presented in Algorithm 1. The main idea of the algorithm is in Step 4, where we simultaneously find x_{k+1} and y_{k+1} satisfying the following requirements:

  • x_{k+1} is the minimizer of g(·, y_{k+1}), and

  • y_{k+1} corresponds to an AGD step (see Algorithm 4 in Appendix A) for g(x_{k+1}, ·).

Input: , , , , ,
Output:
1 Set , for  do
2       , , Choose ensuring:
3       ,
return
Algorithm 1 Conceptual Dual Implicit Accelerated Gradient (C-DIAG) for strongly-convex–concave programming

Implementability: The first question is whether such a step is easy enough to implement. It turns out that it is indeed possible to quickly find points x_{k+1} and y_{k+1} that approximately satisfy the above requirements. The reason is that:

  • Since is smooth and strongly convex for every , we can find -approximate minimizer for a given in iterations.

  • Let . The iteration is a -contraction with a unique fixed point satisfying the update step requirements (i.e., Step of Algorithm 1). See Lemma 6 in Appendix B.4 for a proof. This means that only iterations again suffice to find an update that approximately satisfies the requirements.

Convergence rate: Since and correspond to an AGD update for , we can use the potential function decrease argument for AGD (Lemma 4 in Appendix A) to conclude that ,

where the last step follows from the fact that and so . Noting that we can further recursively bound as above, we obtain

Since for every , we have

where . Since and are arbitrary above, this gives a convergence rate for the primal dual gap.

3.2 Error analysis

The main issue with Algorithm 1 is that the update step is not exactly implementable. However, as we noted in the previous section, we can quickly find updates that almost satisfy the requirements. Algorithm 2 presents this inexact version. The following theorem states our formal result and a detailed proof is provided in Appendix B.4.

Input: , , , , , ,
Output:
1 Set , for  do
2       , , Imp-STEP(, , , , , , ), ensuring:
3       ,
4return Imp-STEP(, , , , , , ):
5       Set , , , for  do
6             Starting at use AGD (Algorithm 4 with ) to compute such that:
(7)
7            
8      return ,
Algorithm 2 Dual Implicit Accelerated Gradient (DIAG) for strongly-convex–concave programming
Theorem 1 (Convergence rate of DIAG).

Let be a -smooth, -strong-convex–concave function on and a convex compact sub-set . Then, after iterations, DIAG (Algorithm 2) finds s.t.:

(8)

In particular, setting we have: . Furthermore, for this setting the total first order oracle complexity is given by: .


Remark 1: Theorem 1 shows that DIAG needs gradient queries for finding a -primal-dual-pair, while current best-known rate is achieved by Mirror-Prox. This dependence in and is optimal, as it is shown in [ouyang2018lower, Theorem 10] that gradient queries are necessary to achieve error in the primal-dual gap.


Remark 2: Unlike standard AGD on the dual problem, which only updates y in the outer loop, DIAG's outer step updates both x and y, allowing us to better track the primal-dual gap. However, DIAG's dependence on the condition number seems sub-optimal and can perhaps be improved if we do not compute Imp-STEP nearly optimally, allowing for inexact updates; we leave further investigation into improved dependence on the condition number for future work.

4 Nonconvex concave saddle point problem

We study the nonconvex concave minimax problem (1) where is concave, is nonconvex, and is -smooth, (such that ) and is a convex compact sub-set of . As mentioned in Section 2, we measure the convergence to an approximate FOSP of this problem (see Definition 6) but it requires weak-convexity of . The following lemma guarantees weak convexity of given smoothness of .

Lemma 3.

Let g(·,·) be continuous and 𝒴 be compact. Then f(x) = max_{y ∈ 𝒴} g(x, y) is L-weakly convex, if g is L-weakly convex in x (Definition 4), or if g is L-smooth in x.

See Appendix B.3 for the proof. The arguments of [jin2019minmax] easily extend to show that applying subgradient method on [davis2018stochastic] gives a convergence rate of . Instead, we exploit the smooth minimax form of to design a faster converging scheme. The main intuition comes from the proximal viewpoint that gradient descent can be viewed as iteratively forming and optimizing local quadratic upper bounds. As is weakly convex, adding enough quadratic regularization should ensure that the resulting sequence of problems are all strongly-convex–concave. We then exploit DIAG to efficiently solve such local quadratic problems to obtain improved convergence rates. Concretely, let

(9)

By -weak-convexity of , is strongly-convex–concave (Lemma 5) that can be solved using DIAG up to certain accuracy to obtain . We refer to this algorithm as Prox-DIAG and provide a pseudo-code for the same in Algorithm 3.

Input: , , , ,
Output:
1 Set for  do
2       Using DIAG for strongly convex concave minimax problem, find such that,
(10)
if  then
3             return
4      
Algorithm 3 Proximal Dual Implicit Accelerated Gradient (Prox-DIAG) for nonconvex concave programming
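The proximal-point idea behind Prox-DIAG can be sketched on a one-dimensional toy problem. This is not the paper's algorithm: the example f(x) = max(x² − 1, 1 − x²), the regularizer L·(x − x_k)² with L = 2, the ternary-search inner solver (standing in for DIAG), and the stopping rule are all assumptions made for illustration.

```python
# Sketch: inexact proximal point method on a weakly convex finite max.
# Assumed example and inner solver; DIAG is replaced by a 1-D ternary search.

def f(x):
    """f(x) = max(x^2 - 1, 1 - x^2); the second piece is nonconvex."""
    return max(x * x - 1.0, 1.0 - x * x)


L = 2.0  # both pieces have 2-Lipschitz derivatives, so f(x) + x^2 is convex


def solve_prox_subproblem(x_k, lo=-3.0, hi=3.0, iters=200):
    """Minimize the strongly convex function f(x) + L * (x - x_k)^2 on [lo, hi]."""
    def phi(x):
        return f(x) + L * (x - x_k) ** 2

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if phi(m1) < phi(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def proximal_point(x0, eps=1e-6, max_iters=100):
    """Repeat proximal steps until the Moreau-envelope gradient estimate is small."""
    x = x0
    for _ in range(max_iters):
        x_next = solve_prox_subproblem(x)
        # With envelope parameter 1 / (2L), the gradient norm is 2L * |x - prox(x)|.
        if 2.0 * L * abs(x - x_next) <= eps:
            return x_next
        x = x_next
    return x


print(proximal_point(0.3))
print(proximal_point(-0.3))
```

Starting from 0.3, the iterates move to roughly 0.6 and then to the kink at x = 1, where the proximal step no longer moves and the method stops. The quadratic term is what makes each subproblem strongly convex even though f itself is not convex.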

The following theorem gives convergence guarantees for Prox-DIAG.

Theorem 2 (Convergence rate of Prox-DIAG).

Let be -smooth, be concave, be , be a convex compact subset of , and the minimum value of function be bounded below, i.e. . Then Prox-DIAG (Algorithm 3) after,

steps outputs an -FOSP. The total first-order oracle complexity to output -FOSP is:

Note that Prox-DIAG solves the quadratic approximation problem to higher accuracy of which then helps bounding the gradient of the Moreau envelope. Also due to the modular structure of the argument, a faster inner loop for special settings, e.g., when is a finite-sum, can ensure more efficient algorithm. While our algorithm is able to significantly improve upon existing state-of-the-art rate of in general nonconvex-concave setting [jin2019minmax], it is unclear if the rate can be further improved. In fact, precise lower-bounds for this setting are mostly unexplored and we leave further investigation into lower-bounds as a topic of future research.

Proof.

We first note that by Lemma 5 and -weak convexity of and -strong convexity of , is -strongly-convex. Similarly, is also -strongly-convex.

We now divide the analysis of each iteration of our algorithm into two cases:

Case 1: . As every instance of Case 1 ensures , we can have only Case 1 steps before termination. This claim requires monotonic decrease in which holds until , after which , which in-turn imply that Prox-DIAG terminates (see termination condition of Prox-DIAG).

Case 2: : In this case, we show that is already an -FOSP and the algorithm returns .

(11)

Define as the point satisfying . By -strong convexity of