 # Higher-order methods for convex-concave min-max optimization and monotone variational inequalities

We provide improved convergence rates for constrained convex-concave min-max problems and monotone variational inequalities with higher-order smoothness. In min-max settings where the $p^{\text{th}}$-order derivatives are Lipschitz continuous, we give an algorithm HigherOrderMirrorProx that achieves an iteration complexity of $O(1/T^{\frac{p+1}{2}})$ when given access to an oracle for finding a fixed point of a $p^{\text{th}}$-order equation. We give analogous rates for the weak monotone variational inequality problem. For $p>2$, our results improve upon the iteration complexity of the first-order Mirror Prox method of Nemirovski (2004) and the second-order method of Monteiro and Svaiter (2012). We further instantiate our entire algorithm in the unconstrained $p=2$ case.


## 1 Introduction

In this work, we focus on two well-studied classes of problems: monotone variational inequalities (MVIs) and convex-concave min-max problems (Minty et al., 1962; Kinderlehrer and Stampacchia, 1980; Nemirovski, 2004). In an MVI, we are given a monotone operator $F$ over a convex set $Z$, and the goal is to find a point $z^* \in Z$ such that

$$\forall z \in Z,\quad \langle F(z), z^* - z\rangle \le 0. \tag{1}$$

Such a point is called a solution to a weak (Minty) MVI (Komlósi, 1999). The MVI problem eq. 1 is closely related to the classic min-max optimization problem:

$$\min_{x \in X}\max_{y \in Y} g(x, y) \tag{2}$$

where $g$ is a convex-concave function over convex sets $X$ and $Y$. Such problems are ubiquitous in statistics, optimization, machine learning, and game theory. Solving eq. 2 is equivalent to finding the Nash equilibrium of a zero-sum game and is also sometimes called a saddle point problem.

The Mirror Prox (MP) algorithm of Nemirovski (2004) is a popular method for solving both eq. 1 (when $F$ is Lipschitz continuous) and eq. 2 (when $g$ is smooth). MP is a generalization of the extragradient algorithm of Korpelevich (1976), and it converges in $O(1/\varepsilon)$ iterations, which is tight for first-order methods (FOMs) (Nemirovski and Yudin, 1983). Given that MP achieves the optimal performance for FOMs, there is a natural question of whether one can improve the iteration complexity by using higher-order methods (HOMs), which tend to converge in fewer iterations but at the expense of higher cost per iteration. HOMs use higher-order derivatives of the objective function and generally require higher-order smoothness, namely that the higher-order derivatives of the objective be Lipschitz continuous.

In convex and nonconvex optimization, while FOMs such as gradient descent are the gold standard for optimization algorithms, HOMs are useful in a variety of different settings. Newton’s method is one of the most well-known HOMs, and it is a central component of path-following interior-point methods (Nesterov and Nemirovski, 1994). In cases when the higher-order update is efficiently computable, HOMs can achieve faster overall running times than FOMs. For example, HOMs have been used to find approximate local minima in nonconvex optimization faster than gradient descent (Agarwal et al., 2017; Carmon et al., 2018). While second-order methods are the most common type of HOM, there has also been significant recent work on HOMs beyond second-order methods (Agarwal and Hazan, 2018; Arjevani et al., 2018; Gasnikov et al., 2018; Jiang et al., 2018; Bubeck et al., 2018; Bullins, 2018).

HOMs have seen much less study in the context of MVIs and min-max problems. Monteiro and Svaiter (2012) use a second-order method with an implicit update that achieves an improved iteration complexity of $O(1/T^{3/2})$ for problems with second-order smoothness. Their method uses the Hybrid Proximal Extragradient (HPE) framework established in Monteiro and Svaiter (2010) and requires access to an oracle for finding a fixed point of a constrained second-order equation. However, it was unknown whether one could achieve further improved iteration complexity in the presence of third-order smoothness and beyond.

#### Contributions.

Our main contribution is a higher-order method HigherOrderMirrorProx for approximately solving MVIs and convex-concave min-max problems that achieves an iteration complexity of $O(1/T^{\frac{p+1}{2}})$ for problems with $p^{\text{th}}$-order smoothness. To our knowledge, this is the first work showing that improved convergence rates are possible for problems with third-order smoothness and beyond. Our algorithm requires access to an oracle for finding a fixed point of a $p^{\text{th}}$-order equation, using a higher-order implicit update that can be thought of as a generalization of Mirror Prox. Since the implicit update may be difficult to compute in the constrained case, we show how to instantiate our algorithm in the second-order unconstrained case, giving overall running time bounds in that setting.

We begin by reviewing definitions, notions of convergence, and related work in Section 2. Then we summarize our main results and our algorithm in Section 3. In Section 4, we present the proof of our main result. We then show how to fully instantiate our algorithm in the unconstrained case in Section 5.

## 2 Preliminaries

We will use MVI$(F, Z)$ to denote the MVI given in eq. 1 over a vector field $F$ and convex constraint set $Z$. Unless otherwise specified, we will use $z^*$ to signify a solution to MVI$(F, Z)$. Throughout the paper, we will use $\gamma_t$ to represent positive weights, and we let $\Gamma_T = \sum_{t=1}^T \gamma_t$. We use $\nabla$ to denote the Jacobian operator. We use $\|\cdot\|$ to denote an arbitrary norm and $\|\cdot\|_*$ to denote its dual norm. We use $\|\cdot\|_2$ to denote the Euclidean norm for vectors and the operator norm for matrices.

We use $D(\cdot,\cdot)$ to denote a Bregman divergence over a distance generating function $d$ that is 1-strongly convex with respect to some norm $\|\cdot\|$. Recall that the definition of a Bregman divergence is as follows:

$$D(u,v) = d(u) - d(v) - \langle \nabla d(v), u - v\rangle \tag{3}$$

for all $u, v \in Z$.
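As a concrete illustration of eq. 3, the following minimal sketch evaluates the Bregman divergence for two common distance generating functions. The function names and test vectors are hypothetical choices made for this example; the Euclidean and negative-entropy choices of $d$ are standard instances and are not prescribed by the text above.

```python
import math

def bregman(d, grad_d, u, v):
    """D(u, v) = d(u) - d(v) - <grad d(v), u - v>, with vectors given as lists."""
    inner = sum(g * (ui - vi) for g, ui, vi in zip(grad_d(v), u, v))
    return d(u) - d(v) - inner

# Assumed example 1: d(z) = 0.5 * ||z||_2^2, which yields D(u, v) = 0.5 * ||u - v||_2^2.
def half_sq_norm(z):
    return 0.5 * sum(zi * zi for zi in z)

def half_sq_norm_grad(z):
    return list(z)

# Assumed example 2: negative entropy d(z) = sum_i z_i log z_i on positive vectors.
# For two probability vectors, D(u, v) is the KL divergence.
def neg_entropy(z):
    return sum(zi * math.log(zi) for zi in z)

def neg_entropy_grad(z):
    return [math.log(zi) + 1.0 for zi in z]

u = [0.2, 0.3, 0.5]
v = [0.4, 0.4, 0.2]
d_euclid = bregman(half_sq_norm, half_sq_norm_grad, u, v)
d_kl = bregman(neg_entropy, neg_entropy_grad, u, v)
```

Both values can be compared against their closed forms, which is a quick way to sanity-check a custom distance generating function before using it inside a prox step.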

We now discuss several key definitions:

###### Definition 2.1.

A vector field $F$ is monotone if $\langle F(u) - F(v), u - v\rangle \ge 0$ for all $u, v \in Z$.

For notational convenience, we assume our algorithms have access to a monotone operator $F$. This is the usual assumption in MVIs, but it will also allow us to solve min-max problems, as we now show. For min-max problems eq. 2, one can consider the gradient descent-ascent field of $g$:

$$F_g(x,y) \stackrel{\text{def}}{=} \begin{pmatrix}\nabla_x g(x,y)\\ -\nabla_y g(x,y)\end{pmatrix} \tag{4}$$

Letting $z = (x,y)$ and $Z = X \times Y$, we can say $F_g$ maps $Z$ to $\mathbb{R}^n$ with only a slight abuse of notation. It is then easy to show that $F_g$ is monotone when $g$ is convex-concave. So to apply our algorithms to min-max settings, we simply apply them on $F_g$.
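To make the monotonicity claim concrete, here is a small sketch that builds the gradient descent-ascent field for one hypothetical convex-concave objective, $g(x,y) = \tfrac12 x^2 + xy - \tfrac12 y^2$ with scalar $x$ and $y$, and checks the monotonicity inequality on random pairs of points. The objective, the function names, and the sampling range are assumptions chosen for illustration only.

```python
import random

def gda_field(x, y):
    """F_g(x, y) = (dg/dx, -dg/dy) for the assumed g(x, y) = 0.5*x^2 + x*y - 0.5*y^2."""
    grad_x = x + y          # dg/dx
    grad_y = x - y          # dg/dy
    return (grad_x, -grad_y)

def monotonicity_gap(u, v):
    """<F(u) - F(v), u - v>, which should be nonnegative for a monotone field."""
    fu, fv = gda_field(*u), gda_field(*v)
    return sum((a - b) * (p - q) for a, b, p, q in zip(fu, fv, u, v))

rng = random.Random(0)
gaps = []
for _ in range(1000):
    u = (rng.uniform(-5, 5), rng.uniform(-5, 5))
    v = (rng.uniform(-5, 5), rng.uniform(-5, 5))
    gaps.append(monotonicity_gap(u, v))
min_gap = min(gaps)
```

For this particular $g$ the gap works out to $\|u - v\|_2^2$, so the field is in fact strongly monotone; a purely bilinear $g(x,y) = xy$ would give a gap of exactly zero.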

Our algorithms will require the following general notion of smoothness:

###### Definition 2.2.

A vector field $F$ is $p^{\text{th}}$-order smooth (with constant $L_p$) w.r.t. $\|\cdot\|$ if, for all $u, v \in Z$,

$$\|\nabla^{p-1}F(u) - \nabla^{p-1}F(v)\|_* \le L_p\|u - v\|,$$

where we define

###### Remark 2.3.

Our definition of $p^{\text{th}}$-order smoothness as a property of the $(p-1)^{\text{th}}$ derivative of $F$ is motivated by the min-max setting eq. 2, where $F$ is already expressed in terms of the gradient of $g$. If $F$ is $p^{\text{th}}$-order smooth, this is a statement about the Lipschitz continuity of the $p^{\text{th}}$-order derivatives of $g$.

Another key component of our algorithms is the $p^{\text{th}}$-order Taylor expansion of $F$ at $u$ evaluated at $v$:

$$T_p(v;u) = \sum_{i=0}^{p}\frac{1}{i!}\nabla^{(i)}F(u)[v-u]^i. \tag{5}$$

While $T_p$ depends on $F$, we leave this implicit to lighten notation, as the relevant $F$ will be clear from context.

###### Remark 2.4.

To be consistent with Remark 2.3, when we refer to “$p^{\text{th}}$-order methods,” we will be referring to methods that use a $(p-1)^{\text{th}}$-order Taylor expansion of $F$ and which typically require $p^{\text{th}}$-order smoothness. Again, this indexing makes sense in the context of min-max problems, where a $p^{\text{th}}$-order method uses a Taylor expansion involving $p^{\text{th}}$-order derivatives of $g$.

A well-studied consequence of Definition 2.2 is the following:

###### Fact 2.5.

Let $u, v \in Z$, and let $F$ be $p^{\text{th}}$-order $L_p$-smooth. Then,

$$\|F(v) - T_{p-1}(v;u)\|_* \le \frac{L_p}{p!}\|v-u\|^p. \tag{6}$$
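The bound in eq. 6 can be checked numerically on a simple instance. The sketch below assumes a hypothetical scalar field $F(z) = \sin z$ with $p = 2$: its derivative $\cos z$ is 1-Lipschitz, so $L_2 = 1$ and eq. 6 reads $|F(v) - T_1(v;u)| \le \tfrac12 |v-u|^2$. The field, grid, and function names are illustrative assumptions.

```python
import math

def taylor_1(u, v):
    """First-order Taylor expansion T_1(v; u) of the assumed field F(z) = sin(z)."""
    return math.sin(u) + math.cos(u) * (v - u)

def residual_ratio(u, v):
    """|F(v) - T_1(v; u)| divided by the bound (L_2 / 2!) * |v - u|^2 with L_2 = 1."""
    bound = 0.5 * (v - u) ** 2
    return abs(math.sin(v) - taylor_1(u, v)) / bound

# Evaluate the ratio on a grid of distinct expansion and evaluation points.
points = [0.37 * k - 3.0 for k in range(17)]
worst = max(residual_ratio(u, v) for u in points for v in points if u != v)
```

A ratio at or below 1 on every pair is what eq. 6 predicts here; the same pattern extends to larger $p$ by adding Taylor terms and replacing the bound with $\frac{L_p}{p!}|v-u|^p$.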

Finally, our algorithms will all require the following assumption:

###### Assumption 2.6.

There exists a solution $z^*$ to the weak variational inequality MVI$(F, Z)$, namely $z^*$ is a point that satisfies eq. 1.

Assumption 2.6 always holds when $Z$ is a compact convex set and $F$ is continuous on $Z$ (Kinderlehrer and Stampacchia, 1980).

### 2.1 Notions of convergence for variational inequalities

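The main solution concept for eq. 1 that we consider is an $\varepsilon$-approximate weak solution to MVI$(F, Z)$, namely a point $z^* \in Z$ such that:

$$\forall z \in Z,\quad \langle F(z), z^* - z\rangle \le \varepsilon. \tag{7}$$

Our main bounds will be of the form:

$$\forall z \in Z,\quad \frac{1}{\Gamma_T}\sum_{t=1}^{T}\gamma_t\langle F(z_t), z_t - z\rangle \le \varepsilon, \tag{8}$$

where $z_t \in Z$ are iterates produced by our algorithm, $\gamma_t$ are positive constants, and $\Gamma_T = \sum_{t=1}^T \gamma_t$. We now show conditions under which a guarantee of the form eq. 8 gives $\varepsilon$-approximate weak solutions.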

###### Lemma 2.7.

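Let $F$ be monotone, let $z_t \in Z$ for $t \in [T]$, and let $\gamma_t > 0$. Let $\bar{z}_t = \frac{1}{\Gamma_T}\sum_{t=1}^T \gamma_t z_t$. Assume eq. 8 holds. Then $\bar{z}_t$ is an $\varepsilon$-approximate weak solution to MVI$(F, Z)$.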

###### Proof.

By monotonicity, we have:

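$$\langle F(z_t), z_t - z\rangle \ge \langle F(z), z_t - z\rangle.$$

Therefore,

$$\sum_{t=1}^{T}\gamma_t\langle F(z_t), z_t - z\rangle \ge \sum_{t=1}^{T}\gamma_t\langle F(z), z_t - z\rangle = \Gamma_T\langle F(z), \bar{z}_t - z\rangle.$$

Dividing by $\Gamma_T$ and applying eq. 8 gives $\langle F(z), \bar{z}_t - z\rangle \le \varepsilon$ for all $z \in Z$. Then $\bar{z}_t$ is an $\varepsilon$-approximate solution to the weak MVI problem. ∎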

### 2.2 Solving convex-concave min-max problems with variational inequalities

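The classic notion of convergence for eq. 2 is the duality gap $\Phi_{X \times Y}$:

$$\Phi_{X\times Y}(\hat{x}, \hat{y}) \stackrel{\text{def}}{=} \max_{y \in Y} g(\hat{x}, y) - \min_{x \in X} g(x, \hat{y}). \tag{9}$$

The duality gap is defined in terms of a min-max objective $g$, but we leave it implicit because the relevant $g$ will be clear from context. We will now show how to prove bounds on the duality gap given a bound like in eq. 8.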

We will use the following lemma to prove bounds on the duality gap:

###### Lemma 2.8.

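Let $z_t = (x_t, y_t) \in Z$ for $t \in [T]$, and let $\gamma_t > 0$. Let $\bar{z}_t = (\bar{x}_t, \bar{y}_t) = \frac{1}{\Gamma_T}\sum_{t=1}^T \gamma_t z_t$. Assume eq. 8 holds. If $F$ is the gradient descent-ascent field for a convex-concave problem (as in eq. 4), then $\Phi_{X\times Y}(\bar{x}_t, \bar{y}_t) \le \varepsilon$.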

###### Proof.

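When $F$ is the gradient descent-ascent field for a convex-concave problem, we have:

$$\begin{aligned}
\langle F(z_t), z_t - z\rangle &= \langle \nabla_x g(x_t,y_t), x_t - x\rangle + \langle -\nabla_y g(x_t,y_t), y_t - y\rangle\\
&\ge g(x_t,y_t) - g(x,y_t) + g(x_t,y) - g(x_t,y_t)\\
&= g(x_t,y) - g(x,y_t).
\end{aligned}$$

Overall, we then have:

$$\sum_{t=1}^{T}\gamma_t\langle F(z_t), z_t - z\rangle \ge \sum_{t=1}^{T}\gamma_t\big(g(x_t,y) - g(x,y_t)\big) \ge \Gamma_T\big(g(\bar{x}_t,y) - g(x,\bar{y}_t)\big).$$

Since eq. 8 bounds the left-hand side by $\Gamma_T\varepsilon$ for every $z = (x,y) \in X \times Y$, taking the supremum over $(x,y)$ on the right-hand side gives $\Phi_{X\times Y}(\bar{x}_t, \bar{y}_t) \le \varepsilon$. ∎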
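For intuition about the quantity being bounded, the sketch below evaluates a duality gap on a toy problem. It assumes the standard definition $\Phi(\hat x, \hat y) = \max_{y} g(\hat x, y) - \min_{x} g(x, \hat y)$ and a hypothetical bilinear objective $g(x,y) = xy$ over $X = Y = [-1, 1]$; the names and the domain are illustrative choices, not taken from the text.

```python
def g(x, y):
    """Assumed toy objective: bilinear, hence convex in x and concave in y."""
    return x * y

def duality_gap(x_hat, y_hat, lo=-1.0, hi=1.0):
    """max_y g(x_hat, y) - min_x g(x, y_hat) over the interval [lo, hi].

    g is linear in each argument, so each inner optimum is attained at an endpoint.
    """
    best_for_max_player = max(g(x_hat, y) for y in (lo, hi))
    best_for_min_player = min(g(x, y_hat) for x in (lo, hi))
    return best_for_max_player - best_for_min_player

gap_at_saddle = duality_gap(0.0, 0.0)
gap_elsewhere = duality_gap(0.5, -0.25)
```

On this instance the gap equals $|\hat x| + |\hat y|$, so it vanishes exactly at the saddle point $(0, 0)$ and grows linearly away from it.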

### 2.3 Related work

#### Monotone variational inequalities.

The weak MVI eq. 1 is a classic and well-studied optimization problem (Minty et al., 1962; Komlósi, 1999; Nemirovski, 2004; Monteiro and Svaiter, 2010). It is closely related to the strong MVI problem (Stampacchia, 1970), where the goal is to find a $z^* \in Z$ such that

$$\forall z \in Z,\quad \langle F(z^*), z^* - z\rangle \le 0. \tag{10}$$

When $F$ is continuous and single-valued, any solution to the weak MVI eq. 1 is a solution to the strong MVI.

Our algorithm is based on the Mirror Prox (MP) algorithm of Nemirovski (2004), which is a generalization of the extragradient method of Korpelevich (1976). MP is a first-order method that achieves $O(1/T)$ iteration complexity, which is tight (Nemirovski and Yudin, 1983). Monteiro and Svaiter (2010) prove convergence rates for MP in the unconstrained case by formulating MP as an instance of what they call a Hybrid Proximal Extragradient (HPE) algorithm. Monteiro and Svaiter (2012) provide a second-order algorithm to solve eq. 1 in settings with second-order smoothness. That algorithm achieves an $O(1/T^{3/2})$ iteration complexity, and its analysis goes through the HPE framework from Monteiro and Svaiter (2010).

#### Min-max optimization.

Many convex-concave min-max optimization problems are either solved with MP or first-order no-regret algorithms. Ouyang and Xu (2018) show a lower bound of for first-order methods in constrained smooth convex-concave saddle point problems, even in the simple case when for convex and . A number of recent works have also applied second-order methods to unconstrained smooth min-max problems, where the second-order information is often accessed through Hessian-vector products (Balduzzi et al., 2018; Gemp and Mahadevan, 2018; Letcher et al., 2019; Adolphs et al., 2019; Abernethy et al., 2019; Schäfer and Anandkumar, 2019).

#### Higher-order methods for convex optimization.

Higher-order methods have a long history of use in solving convex optimization problems. Assuming Lipschitz continuity of the Hessian, Nesterov (2008) provided an accelerated variant of the cubic regularization method (Nesterov and Polyak, 2006), which was further generalized by Baes (2009) under $p^{\text{th}}$-order smoothness assumptions. The rate in (Nesterov, 2008) was later improved by Monteiro and Svaiter (2013), and since then several works concerning lower bounds in this setting (Agarwal and Hazan, 2018; Arjevani et al., 2018) have shown that this rate is essentially tight (up to logarithmic factors) when the Hessian is Lipschitz continuous. Recently, several works have shown that the lower bound is also essentially tight for $p > 2$ (Gasnikov et al., 2018; Jiang et al., 2018; Bubeck et al., 2018; Bullins, 2018), leading to advances in related problems, such as regression (Bullins and Peng, 2019) and parallel non-smooth convex optimization (Bubeck et al., 2019).

## 3 Main results

Our main result is a new higher-order method HigherOrderMirrorProx (Algorithm 1) for solving MVIs and convex-concave min-max problems with higher-order smoothness. We prove the following convergence rate:

###### Theorem 3.1.

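Suppose $F$ is $p^{\text{th}}$-order $L_p$-smooth. Let $\bar{z}_t = \frac{1}{\Gamma_T}\sum_{t=1}^T \gamma_t \hat{z}_t$. Moreover, let $\varepsilon = \frac{16 L_p}{p!}\left(\frac{\sup_{z \in Z} D(z, z_1)}{T}\right)^{\frac{p+1}{2}}$. Then for $\bar{z}_t = (\bar{x}_t, \bar{y}_t)$ as output by Algorithm 1:

1. If $F$ is monotone, then $\bar{z}_t$ is an $\varepsilon$-approximate solution to the weak MVI problem.

2. If $F$ is the gradient descent-ascent field for a convex-concave problem over $X$ and $Y$, then $\Phi_{X\times Y}(\bar{x}_t, \bar{y}_t) \le \varepsilon$.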

Our result matches the rate of Monteiro and Svaiter (2012) when $p = 2$ and gives improved convergence rates for higher $p$. To our knowledge, this is the first algorithm to achieve improved iteration complexity in the presence of higher-order smoothness. We compare our algorithm to that of Monteiro and Svaiter (2012) in more detail in Section 3.3.

Similar to other higher-order algorithms, which require an oracle for solving a minimization over a $p^{\text{th}}$-order Taylor series (Gasnikov et al., 2018; Jiang et al., 2018; Bubeck et al., 2018), our algorithm requires an oracle for solving a fixed point problem of a $p^{\text{th}}$-order equation. While this oracle is stronger, we believe it is justified given that the MVI and convex-concave min-max settings are significantly more difficult than convex minimization problems. A common downside of higher-order algorithms is that the required oracle may be difficult to compute, particularly in the constrained setting. We can also consider running our algorithm in the unconstrained setting, which requires a slightly weaker unconstrained oracle rather than a constrained oracle. We discuss how to interpret our bounds in the unconstrained setting in Section 3.1.

Finally, we show how to instantiate our method in the second-order unconstrained case, giving the following running time bounds:

###### Theorem 3.2 (Main theorem, p=2 (Informal)).

Suppose is sufficiently smooth, and let be the output of HigherOrderMirrorProx (Algorithm 2). Then, for , the iterates satisfy, for all ,

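$$\frac{1}{\Gamma_T}\sum_{t=1}^{T}\langle \gamma_t F(\hat{z}_t), \hat{z}_t - z\rangle \le 8L_2\left(\frac{\max\{D(z,z_1), 1\}}{T}\right)^{\frac{3}{2}}, \tag{11}$$

with per-iteration cost dominated by matrix inversions. (Here we use the notation $\tilde{O}(\cdot)$ to suppress logarithmic factors.)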

### 3.1 Interpreting our results in the unconstrained setting

In the unconstrained setting, the standard solution concepts for MVIs and min-max problems can be vacuous in general. For example, for and the associated vector field , all approximate solutions to the min-max problem / MVI are exact solutions. However, the bounds we prove are still meaningful. In the MVI case, our guarantee can be interpreted as stating that for all such that , we have as long as . Likewise, for min-max problems, if is a convex set containing , then we can say that , where .

### 3.2 Explanation of our algorithm

Our algorithm is inspired by the Mirror Prox (MP) algorithm of Nemirovski (2004), defined as follows:

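$$\hat{z}_t = \arg\min_{z \in Z}\{\langle \gamma_t F(z_t), z - z_t\rangle + D(z, z_t)\} \tag{15}$$

$$z_{t+1} = \arg\min_{z \in Z}\{\langle \gamma_t F(\hat{z}_t), z - \hat{z}_t\rangle + D(z, z_t)\} \tag{16}$$

where $D$ is a Bregman divergence. Nemirovski (2004) motivates MP with a “conceptual prox method”, which is given as follows:

$$z_{t+1} = \arg\min_{z \in Z}\{\langle \gamma_{t+1} F(z_{t+1}), z - z_{t+1}\rangle + D(z, z_t)\}. \tag{17}$$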

This is an implicit method, as computing $z_{t+1}$ requires solving the equation above for a given step-size $\gamma_{t+1}$. However, this method has good iteration complexity. Nemirovski (2004) shows that if one could run eq. 17 exactly, then the $\gamma$-averaged iterate converges at a rate of $O(1/\Gamma_T)$. Thus, if one could implement eq. 17 with large step-sizes, one could achieve faster iteration complexity.
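The effect of the step-size in the implicit update can be seen on a toy instance where eq. 17 has a closed form. The sketch below assumes the unconstrained Euclidean setting ($D(u,v) = \tfrac12\|u-v\|_2^2$) and the hypothetical linear field $F(x,y) = (y, -x)$, which is the gradient descent-ascent field of $g(x,y) = xy$. In that case the optimality condition of eq. 17 is $z_{t+1} + \gamma F(z_{t+1}) = z_t$, a 2-by-2 linear system solved by hand here. It tracks the last iterate rather than the averaged iterate, purely for illustration.

```python
import math

def field(z):
    """Assumed field F(x, y) = (y, -x)."""
    x, y = z
    return (y, -x)

def implicit_prox_step(z, gamma):
    """Solve z_next + gamma * F(z_next) = z, i.e. (I + gamma*A) z_next = z with A = [[0, 1], [-1, 0]]."""
    x, y = z
    scale = 1.0 / (1.0 + gamma * gamma)
    return (scale * (x - gamma * y), scale * (gamma * x + y))

def run(z0, gamma, steps):
    z = z0
    for _ in range(steps):
        z = implicit_prox_step(z, gamma)
    return z

def norm(z):
    return math.hypot(z[0], z[1])

dist_small_step = norm(run((1.0, 1.0), 0.1, 20))
dist_large_step = norm(run((1.0, 1.0), 10.0, 20))
```

Each implicit step shrinks the distance to the solution $(0,0)$ by a factor $1/\sqrt{1+\gamma^2}$ on this instance, so larger step-sizes help, which is the behavior the paragraph above appeals to. For a general nonlinear $F$ the update has no such closed form.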

It turns out that as long as one approximates eq. 17 with small error, one can achieve a similar convergence rate. The MP algorithm with constant $\gamma_t$ does just that, leading to an $O(1/T)$ convergence rate. While one would like to increase the step-size in MP to improve the convergence rate, this approach does not work because MP with large step-sizes will no longer approximate eq. 17 with small error.
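A minimal sketch of the explicit MP updates eq. 15 and eq. 16 may help here. It assumes the unconstrained Euclidean setting, where both argmins reduce to gradient-style steps, and the hypothetical field $F(x,y) = (y, -x)$ (the gradient descent-ascent field of $g(x,y) = xy$, which is 1-Lipschitz). The step-size $0.5$ and all names are illustrative assumptions; plain gradient descent-ascent is included only as a baseline for comparison.

```python
def field(z):
    """Assumed field F(x, y) = (y, -x)."""
    x, y = z
    return (y, -x)

def gda_step(z, gamma):
    """Plain step z - gamma * F(z); also the extrapolation step (eq. 15) in this setting."""
    f = field(z)
    return (z[0] - gamma * f[0], z[1] - gamma * f[1])

def mirror_prox_step(z, gamma):
    """One MP iteration with D(u, v) = 0.5 * ||u - v||_2^2 and Z = R^2."""
    z_hat = gda_step(z, gamma)                                   # eq. 15
    f_hat = field(z_hat)
    return (z[0] - gamma * f_hat[0], z[1] - gamma * f_hat[1])    # eq. 16

def sq_norm(z):
    return z[0] ** 2 + z[1] ** 2

z_gda = z_mp = (1.0, 1.0)
for _ in range(100):
    z_gda = gda_step(z_gda, 0.5)
    z_mp = mirror_prox_step(z_mp, 0.5)
```

On this instance the plain step multiplies the squared distance to the solution by $1 + \gamma^2$ each iteration, while the MP step multiplies it by $1 - \gamma^2 + \gamma^4$, which is below 1 only when $\gamma < 1$. That is one way to see why simply enlarging the step-size in first-order MP breaks the approximation of eq. 17.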

In our algorithm, we replace the first-order minimization in MP eq. 15 with a $p^{\text{th}}$-order minimization (12). We also simultaneously choose a particular step-size. This can be viewed as approximating eq. 17 with large step-sizes while using the higher-order minimization to ensure that our algorithm is still a “good” approximation of eq. 17.

### 3.3 Comparison to Monteiro and Svaiter (2012)

Monteiro and Svaiter (2012) give a second-order algorithm for solving eq. 1 with iteration complexity $O(1/T^{3/2})$ in the presence of second-order smoothness. Like our algorithm, their algorithm also heavily relies on the idea of approximating a proximal point method with a large step-size. In fact, their algorithm is very similar to our algorithm in the second-order case. However, our analysis is rather different and arguably simpler. While their analysis goes through the Hybrid Proximal Extragradient framework of Monteiro and Svaiter (2010), our analysis relies on a natural extension of the Mirror Prox analysis. Finally, Monteiro and Svaiter (2012) only deal with the Euclidean setting, whereas we allow arbitrary norms.

While Monteiro and Svaiter (2012) do not explicitly instantiate their second-order oracle, they mention that their oracle reduces to solving a strongly monotone variational inequality, which can then be solved using a variety of approaches, including interior point methods. In the $p = 2$ case, our oracle can be similarly instantiated.

## 4 Higher-Order Mirror Prox Guarantees

In this section, we prove our main result on the convergence guarantees provided by Algorithm 1.

###### Lemma 4.1.

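Suppose $F$ is $p^{\text{th}}$-order $L_p$-smooth and let $\Gamma_T = \sum_{t=1}^T \gamma_t$. Then, the iterates $\hat{z}_t$ generated by Algorithm 1 satisfy, for all $z \in Z$,

$$\frac{1}{\Gamma_T}\sum_{t=1}^{T}\langle \gamma_t F(\hat{z}_t), \hat{z}_t - z\rangle \le \frac{16 L_p}{p!}\left(\frac{D(z,z_1)}{T}\right)^{\frac{p+1}{2}}. \tag{18}$$

###### Theorem 4.2.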

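Suppose $F$ is $p^{\text{th}}$-order $L_p$-smooth. Let $\bar{z}_t = \frac{1}{\Gamma_T}\sum_{t=1}^T \gamma_t \hat{z}_t$. Moreover, let $\varepsilon = \frac{16 L_p}{p!}\left(\frac{\sup_{z \in Z} D(z, z_1)}{T}\right)^{\frac{p+1}{2}}$. Then for $\bar{z}_t = (\bar{x}_t, \bar{y}_t)$ as output by Algorithm 1:

1. If $F$ is monotone, then $\bar{z}_t$ is an $\varepsilon$-approximate solution to the weak MVI problem.

2. If $F$ is the gradient descent-ascent field for a convex-concave problem over $X$ and $Y$, then $\Phi_{X\times Y}(\bar{x}_t, \bar{y}_t) \le \varepsilon$.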

Theorem 4.2 follows immediately from Lemmas 2.7, 2.8, and 4.1. To prove Lemma 4.1, we will need to establish our main technical result (Lemma 4.3), which we prove in Section 4.1 and whose proof proceeds in a similar manner to the Mirror Prox analysis (Nemirovski, 2004; Tseng, 2008).

###### Lemma 4.3.

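Suppose $F$ is $p^{\text{th}}$-order $L_p$-smooth. Then, the iterates $z_t$ and $\hat{z}_t$ as generated by Algorithm 1 satisfy, for all $z \in Z$,

$$\sum_{t=1}^{T}\langle \gamma_t F(\hat{z}_t), \hat{z}_t - z\rangle + \frac{1}{4}\sum_{t=1}^{T}\|\hat{z}_t - z_t\|^2 + \frac{1}{4}\sum_{t=1}^{T}\|z_{t+1} - \hat{z}_t\|^2 \le D(z, z_1) - D(z, z_{T+1}). \tag{19}$$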

We will also need the following technical lemma:

###### Lemma 4.4.

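Let $\xi_t > 0$ for all $t \in [T]$, and let $\sum_{t=1}^T \xi_t^2 \le R$. Then $\sum_{t=1}^T \frac{1}{\xi_t^p} \ge \frac{T^{1+p/2}}{R^{p/2}}$.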

###### Proof.

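We use the following power means:

$$M_1(x) = \frac{\sum_{t=1}^T x_t}{T}, \qquad M_{-2/p}(x) = \left(\frac{\sum_{t=1}^T x_t^{-2/p}}{T}\right)^{-p/2}.$$

By the power mean inequality, we have $M_1(x) \ge M_{-2/p}(x)$, so letting $x_t = \xi_t^{-p}$ gives:

$$\frac{\sum_{t=1}^T \frac{1}{\xi_t^p}}{T} \ge \left(\frac{T}{\sum_{t=1}^T \xi_t^2}\right)^{p/2} \ge \left(\frac{T}{R}\right)^{p/2} \quad\Rightarrow\quad \sum_{t=1}^T \frac{1}{\xi_t^p} \ge \frac{T^{1+p/2}}{R^{p/2}}.$$

∎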
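The inequality derived in this proof, read here as $\sum_t \xi_t^{-p} \ge T^{1+p/2}/R^{p/2}$ whenever $\xi_t > 0$ and $\sum_t \xi_t^2 \le R$, is easy to probe numerically. The sketch below is a sanity check under that reading, taking the tightest admissible $R = \sum_t \xi_t^2$; the sampling range, trial counts, and names are arbitrary illustrative choices.

```python
import random

def lemma_sides(xi, p):
    """Return (sum_t xi_t^{-p}, T^{1+p/2} / R^{p/2}) with R = sum_t xi_t^2."""
    T = len(xi)
    R = sum(x * x for x in xi)
    lhs = sum(x ** (-p) for x in xi)
    rhs = T ** (1 + p / 2) / R ** (p / 2)
    return lhs, rhs

rng = random.Random(0)
trials = []
for p in (1, 2, 3):
    for _ in range(200):
        xi = [rng.uniform(0.1, 3.0) for _ in range(10)]
        trials.append(lemma_sides(xi, p))

# Allow a tiny relative slack for floating-point rounding.
all_hold = all(lhs >= rhs * (1 - 1e-12) for lhs, rhs in trials)
```

Equality holds when all $\xi_t$ are equal, which matches the equality case of the power mean inequality and shows the bound cannot be improved in general.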

We now have the necessary tools to prove Lemma 4.1.

###### Proof of Lemma 4.1.

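Using Lemma 4.3, we can divide both sides of (19) by $\Gamma_T$, and so using the non-negativity of the norm terms and the Bregman divergence, we get:

$$\frac{1}{\Gamma_T}\sum_{t=1}^{T}\langle \gamma_t F(\hat{z}_t), \hat{z}_t - z\rangle \le \frac{D(z,z_1)}{\Gamma_T}.$$

We simply need to lower bound $\Gamma_T$ in order to prove our convergence rate result. By Assumption 2.6, we know that there exists a solution $z^*$ to MVI$(F, Z)$, which means that for all $t \in [T]$, we have $\langle F(\hat{z}_t), \hat{z}_t - z^*\rangle \ge 0$. We can combine this with Lemma 4.3 (taking $z = z^*$) to get that $\sum_{t=1}^T \|\hat{z}_t - z_t\|^2 \le 4D(z^*, z_1)$. Since the step-size condition of Algorithm 1 lower bounds each $\gamma_t$ by a multiple of $\|\hat{z}_t - z_t\|^{-(p-1)}$, we can apply Lemma 4.4 (with exponent $p-1$) by setting $\xi_t = \|\hat{z}_t - z_t\|$ and $R = 4D(z^*, z_1)$, which gives the result. ∎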

### 4.1 Proof of main technical result (Lemma 4.3)

Before proving Lemma 4.3, we state a useful lemma concerning the updates (12) and (14) in Algorithm 1.

###### Lemma 4.5 (Tseng (2008)).

Let $\phi$ be a convex function, let $z \in Z$, and let

$$z^+ = \arg\min_{x \in Z}\{\phi(x) + D(x, z)\}. \tag{20}$$

Then, for all $x \in Z$,

$$\phi(x) + D(x, z) \ge \phi(z^+) + D(z^+, z) + D(x, z^+). \tag{21}$$

###### Proof.

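By the optimality condition for $z^+$, we know that for all $x \in Z$,

$$\phi(x) + \langle \nabla_x D(z^+, z), x - z^+\rangle \ge \phi(z^+). \tag{22}$$

Rearranging and adding $D(x, z)$ to both sides gives us

$$\begin{aligned}
\phi(x) + D(x, z) &\ge \phi(z^+) + D(x, z) - \langle \nabla_x D(z^+, z), x - z^+\rangle\\
&= \phi(z^+) + D(x, z) + D(x, z^+) + D(z^+, z) - D(x, z)\\
&= \phi(z^+) + D(x, z^+) + D(z^+, z),
\end{aligned}$$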

where the first equality comes from the Bregman three-point property, i.e.,

$$\langle \nabla d(w) - \nabla d(v), u - v\rangle = D(u, v) + D(v, w) - D(u, w), \quad \text{for all } u, v, w \in Z. \tag{23}$$

∎
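The three-point property in eq. 23 holds for any differentiable distance generating function, which can be confirmed numerically. The sketch below assumes a hypothetical non-quadratic choice, $d(z) = \tfrac12\|z\|_2^2 + \tfrac14\sum_i z_i^4$, so that the check is not trivially a statement about squared Euclidean distances. The function names and test points are illustrative.

```python
def d(z):
    """Assumed distance generating function: 0.5*||z||^2 + 0.25*sum(z_i^4)."""
    return sum(0.5 * zi ** 2 + 0.25 * zi ** 4 for zi in z)

def grad_d(z):
    return [zi + zi ** 3 for zi in z]

def bregman(u, v):
    """D(u, v) = d(u) - d(v) - <grad d(v), u - v> as in eq. 3."""
    inner = sum(g * (ui - vi) for g, ui, vi in zip(grad_d(v), u, v))
    return d(u) - d(v) - inner

def three_point_sides(u, v, w):
    """Return both sides of <grad d(w) - grad d(v), u - v> = D(u,v) + D(v,w) - D(u,w)."""
    lhs = sum((gw - gv) * (ui - vi)
              for gw, gv, ui, vi in zip(grad_d(w), grad_d(v), u, v))
    rhs = bregman(u, v) + bregman(v, w) - bregman(u, w)
    return lhs, rhs

lhs, rhs = three_point_sides((0.3, -1.2), (1.0, 0.5), (-0.7, 2.0))
```

The two sides agree up to floating-point error because the identity is purely algebraic: the $d(\cdot)$ terms cancel and only the gradient inner products remain.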

We now prove Lemma 4.3, which is our main technical result.

###### Proof of Lemma 4.3.

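By Lemma 4.5, along with the algorithm’s determination of $\hat{z}_t$, we have that for all $z \in Z$,

$$\gamma_t\langle T_{p-1}(\hat{z}_t; z_t), \hat{z}_t - z\rangle \le D(z, z_t) - D(z, \hat{z}_t) - D(\hat{z}_t, z_t). \tag{24}$$

Using Lemma 4.5 again with the choice of $z_{t+1}$, it follows that for all $z \in Z$,

$$\gamma_t\langle F(\hat{z}_t), z_{t+1} - z\rangle \le D(z, z_t) - D(z, z_{t+1}) - D(z_{t+1}, z_t). \tag{25}$$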

We may now observe that

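$$\begin{aligned}
\gamma_t\langle F(\hat{z}_t), \hat{z}_t - z\rangle &= \gamma_t\langle F(\hat{z}_t), \hat{z}_t - z_{t+1}\rangle + \gamma_t\langle F(\hat{z}_t), z_{t+1} - z\rangle\\
&= \gamma_t\langle F(\hat{z}_t) - T_{p-1}(\hat{z}_t; z_t), \hat{z}_t - z_{t+1}\rangle + \gamma_t\langle T_{p-1}(\hat{z}_t; z_t), \hat{z}_t - z_{t+1}\rangle + \gamma_t\langle F(\hat{z}_t), z_{t+1} - z\rangle\\
&\le \gamma_t\langle F(\hat{z}_t) - T_{p-1}(\hat{z}_t; z_t), \hat{z}_t - z_{t+1}\rangle - D(z_{t+1}, \hat{z}_t) - D(\hat{z}_t, z_t) + D(z, z_t) - D(z, z_{t+1}),
\end{aligned}$$

where the final inequality follows from (24), applied with $z = z_{t+1}$, and (25). Now by Hölder’s inequality, using eq. (6), and the 1-strong convexity of $d$ w.r.t. $\|\cdot\|$, it follows that

$$\begin{aligned}
\gamma_t\langle F(\hat{z}_t), \hat{z}_t - z\rangle &\le \gamma_t\|F(\hat{z}_t) - T_{p-1}(\hat{z}_t; z_t)\|_* \cdot \|\hat{z}_t - z_{t+1}\| - D(z_{t+1}, \hat{z}_t) - D(\hat{z}_t, z_t) + D(z, z_t) - D(z, z_{t+1})\\
&\le \frac{\gamma_t L_p}{p!}\|\hat{z}_t - z_t\|^p \cdot \|\hat{z}_t - z_{t+1}\| - D(z_{t+1}, \hat{z}_t) - D(\hat{z}_t, z_t) + D(z, z_t) - D(z, z_{t+1})\\
&\le \frac{\gamma_t L_p}{p!}\|\hat{z}_t - z_t\|^p \cdot \|\hat{z}_t - z_{t+1}\| - \frac{1}{2}\|z_{t+1} - \hat{z}_t\|^2 - \frac{1}{2}\|\hat{z}_t - z_t\|^2 + D(z, z_t) - D(z, z_{t+1}).
\end{aligned}$$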

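Finally, by our guarantee from Algorithm 1 that $\frac{\gamma_t L_p}{p!}\|\hat{z}_t - z_t\|^{p-1} \le \frac{1}{2}$, and using the fact that $ab \le \frac{a^2 + b^2}{2}$ for $a, b \ge 0$, it follows that

$$\gamma_t\langle F(\hat{z}_t), \hat{z}_t - z\rangle + \frac{1}{4}\|\hat{z}_t - z_t\|^2 + \frac{1}{4}\|z_{t+1} - \hat{z}_t\|^2 \le D(z, z_t) - D(z, z_{t+1}). \tag{26}$$

Summing over $t = 1, \dots, T$ gives the result. ∎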

## 5 Instantiating HigherOrderMirrorProx (for p=2)

In this section, we provide an efficient implementation of HigherOrderMirrorProx for the case where $F$ is second-order smooth. In particular, we consider the unconstrained problem (i.e., $Z = \mathbb{R}^d$) with the Bregman divergence chosen as $D(u, v) = \frac{1}{2}\|u - v\|_2^2$. First, for technical reasons, we require the following assumption:

###### Assumption 5.1.

During the execution of Algorithm 2, for all , , we assume that is invertible and .

As we later discuss in Section 5.2, these conditions always hold for convex-concave min-max problems. We then arrive at the following result for this setting:

###### Theorem 5.2 (Main theorem, p=2).

Suppose is first-order -smooth, second-order -smooth, and Assumption 5.1 holds. Let be a solution to MVI(), let , and let be the output of HigherOrderMirrorProx + (Algorithm 2). Further assume that, for all , . Then, for , the iterates satisfy, for all ,

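$$\frac{1}{\Gamma_T}\sum_{t=1}^{T}\langle \gamma_t F(\hat{z}_t), \hat{z}_t - z\rangle \le 8L_2\left(\frac{\max\{D(z,z_1), 1\}}{T}\right)^{\frac{3}{2}}. \tag{27}$$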

In addition, the computational cost of each iteration of Algorithm 2 is dominated by a total of
matrix inversions.

###### Proof.

We will first show that the choices of and are valid binary search bounds whenever
is called by Algorithm 2, i.e., that and . We begin with our choice of . Suppose that, for some iteration , it is the case that . If so, then the algorithm sets , which means that . Therefore, since we know that

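$$\frac{1}{\Gamma_T}\sum_{t=1}^{T}\langle \gamma_t F(\hat{z}_t), \hat{z}_t - z\rangle \le \frac{8L_2 D(z,z_1)}{\Gamma_T}, \tag{28}$$

it follows that

$$\frac{1}{\Gamma_T}\sum_{t=1}^{T}\langle \gamma_t F(\hat{z}_t), \hat{z}_t - z\rangle \le \frac{8L_2 D(z,z_1)}{T^{\frac{3}{2}}} \le 8L_2\left(\frac{\max\{D(z,z_1), 1\}}{T}\right)^{\frac{3}{2}}, \tag{29}$$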

and so we would be done. In addition, supposing it is the case that (at which point, the algorithm sets ), we again reach this conclusion by the same reasoning. For ensuring the validity of , note that by (36), it follows that .

Having established the validity of the binary search bounds in the case that the search routine is in fact called, we now move on to show how we may explicitly instantiate the implicitly defined update in (12). Namely, in this setting the key conditions (12) and (13) that must simultaneously hold can be equivalently expressed as

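$$\hat{z}_t = \arg\min_{z \in \mathbb{R}^d}\left\{\gamma_t\langle F(z_t) + \nabla F(z_t)(\hat{z}_t - z_t), z - z_t\rangle + \frac{1}{2}\|z - z_t\|^2\right\}, \text{ and} \tag{30}$$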
 116L1∥^zt−zt∥2≤γt≤18L1∥^zt−zt∥2. (31)

From (30), it follows by first-order optimality conditions that $\gamma_t\big(F(z_t) + \nabla F(z_t)(\hat{z}_t - z_t)\big) + \hat{z}_t - z_t = 0$, and so rearranging gives us

 (I+γt∇F(zt))^zt=(I+γt∇F(zt))zt−γ