A Unified Analysis of Extra-gradient and Optimistic Gradient Methods for Saddle Point Problems: Proximal Point Approach

01/24/2019 ∙ by Aryan Mokhtari, et al. ∙ MIT

We consider solving convex-concave saddle point problems. We focus on two variants of gradient descent-ascent algorithms, Extra-gradient (EG) and Optimistic Gradient (OGDA) methods, and show that they admit a unified analysis as approximations of the classical proximal point method for solving saddle point problems. This viewpoint enables us to generalize EG (in terms of extrapolation steps) and OGDA (in terms of parameters) and obtain new convergence rate results for these algorithms for the bilinear case as well as the strongly convex-concave case.


1 Introduction

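We consider the following saddle point problem

$$\min_{\mathbf{x} \in \mathbb{R}^m} \; \max_{\mathbf{y} \in \mathbb{R}^n} \; f(\mathbf{x}, \mathbf{y}), \tag{1}$$

where the function $f: \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$ is convex-concave, i.e., $f(\cdot, \mathbf{y})$ is convex for all $\mathbf{y} \in \mathbb{R}^n$ and $f(\mathbf{x}, \cdot)$ is concave for all $\mathbf{x} \in \mathbb{R}^m$. We are interested in computing a saddle point of problem (1), defined as a pair $(\mathbf{x}^*, \mathbf{y}^*)$ that satisfies

$$f(\mathbf{x}^*, \mathbf{y}) \le f(\mathbf{x}^*, \mathbf{y}^*) \le f(\mathbf{x}, \mathbf{y}^*)$$

for all $\mathbf{x} \in \mathbb{R}^m$, $\mathbf{y} \in \mathbb{R}^n$. This problem appears in several areas, including zero-sum games Basar & Olsder (1999), robust optimization Ben-Tal et al. (2009), robust control Hast et al. (2013), and more recently in machine learning in the context of Generative Adversarial Networks (GANs) (see Goodfellow et al. (2014) for an introduction to GANs and Arjovsky et al. (2017) for the formulation of Wasserstein GANs).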

Motivated by the interest in computational methods in GANs, in this paper we consider convergence rate analysis of discrete-time gradient-based optimization algorithms for finding a saddle point of problem (1). We focus on Extra-gradient (EG) and Optimistic Gradient Descent Ascent (OGDA) methods, which have attracted much attention in the recent literature because of their superior empirical performance in GAN training. EG is a classical method introduced in Korpelevich (1976), with its linear rate of convergence for strongly convex-concave problems established in the variational inequality literature Facchinei & Pang (2007). (Footnote 1: Facchinei & Pang (2007) establishes a linear rate of convergence for the general case of pseudo-monotone VIs under an additional error bound condition.) The convergence properties of OGDA were recently studied in Daskalakis et al. (2017), which showed the convergence of the iterates to a neighborhood of the solution when the objective function is bilinear, i.e., $f(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top \mathbf{B} \mathbf{y}$. Liang & Stokes (2018) used a dynamical system approach to prove the linear convergence of the OGDA and EG methods for the special case when $f(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top \mathbf{B} \mathbf{y}$ and the matrix $\mathbf{B}$ is square and full rank. They also presented a linear convergence rate of the vanilla Gradient Descent Ascent (GDA) method when the objective function is strongly convex-concave. In a recent paper, Gidel et al. (2018) consider a variant of the EG method, relating it to OGDA updates, and show the linear convergence of the corresponding EG iterates in the case where $f$ is strongly convex-concave (though without showing the convergence rate for the OGDA iterates). (Footnote 2: When we state that $f$ is strongly convex-concave, we mean that $f(\cdot, \mathbf{y})$ is strongly convex for all $\mathbf{y}$ and $f(\mathbf{x}, \cdot)$ is strongly concave for all $\mathbf{x}$.)

The previous works use disparate approaches to analyze EG and OGDA methods, obtaining results in several different settings and making it difficult to see the connections and unifying principles between these iterative methods. In this paper we show that the updates of EG and OGDA can be seen as approximations of the Proximal Point (PP) method, introduced in Martinet (1970) and studied in Rockafellar (1976b). This viewpoint allows us to understand why EG and OGDA are convergent for a bilinear problem. It enables us to generalize EG (in terms of extrapolation steps) and OGDA (in terms of parameters) and obtain new convergence rate results for these generalized algorithms for the bilinear case as well as the strongly convex-concave case. Our results recover the linear convergence rate results of Liang & Stokes (2018) for the bilinear case, and provide new linear convergence rate estimates for the classical EG and OGDA for the strongly convex-concave case.

Figure 1: Convergence trajectories of proximal point (PP), extra-gradient (EG), optimistic gradient descent ascent (OGDA), and gradient descent ascent (GDA) for . The proximal point method has the fastest convergence. EG and OGDA approximate the trajectory of PP and both converge to the optimal solution. The GDA method is the only method that diverges.

Related Work

There are several papers that study the convergence rate of algorithms for solving saddle point problems over a compact set. Nemirovski (2004) shows an $\mathcal{O}(1/k)$ convergence rate for the mirror-prox algorithm (a special case of which is the EG method) in convex-concave saddle point problems over compact sets. Nedić & Ozdaglar (2009) analyzes the (sub)Gradient Descent Ascent (GDA) algorithm for convex-concave saddle point problems when the (sub)gradients are bounded over the constraint set.

Several papers study the special case of Problem (1) when the objective function is of the form . When the functions and are strongly convex, primal-dual gradient-type methods converge linearly Chen & Rockafellar (1997); Bauschke et al. (2011). Further, Du & Hu (2018) shows that GDA achieves a linear convergence rate when is convex and is strongly convex. Chambolle & Pock (2011) introduce a primal-dual variant of the proximal point method that converges to a saddle point at a sublinear rate of when are convex and at a linear convergence rate when are strongly convex.

For the case that $f(\mathbf{x}, \mathbf{y})$ is strongly concave with respect to $\mathbf{y}$, but possibly nonconvex with respect to $\mathbf{x}$, Sanjabi et al. (2018) provided convergence to a first-order stationary point using an algorithm that requires running multiple updates with respect to $\mathbf{y}$ at each step.

Optimistic gradient methods have also been studied in the context of convex online learning. In particular, Rakhlin & Sridharan (2013a, b) introduce the general version of the Optimistic Mirror Descent algorithm in the framework of online optimization. Prior to this work, a special case of Optimistic Mirror descent was analyzed in Chiang et al. (2012), again in the context of online learning.

Outline. The rest of the paper is organized as follows. In Section 2, we present the definitions and preliminaries required for our results. Then, we revisit the Proximal Point (PP) method in Section 3 and present its convergence properties for bilinear problems (Theorem 1) and general strongly convex-concave problems (Theorem 2). In Section 4, we recap the update of the Extra-gradient (EG) method for solving a saddle point problem. Then, we show that EG can be interpreted as an approximation of PP (Proposition 1) and use this interpretation to study the convergence properties of EG in bilinear problems (Theorem 3) and general strongly convex-concave problems (Theorem 4). We then generalize the EG method by increasing the number of extrapolation points and provide the convergence rate for this generalized method for the strongly convex-concave case (Theorem 5). In Section 5, we similarly show that the Optimistic Gradient Descent Ascent (OGDA) method is an approximation of PP (Proposition 2) and prove its linear convergence rate for bilinear (Theorem 6) and strongly convex-concave problems (Theorem 7). We generalize the OGDA method in terms of its parameters and show the convergence of the generalized OGDA method for the bilinear case (Theorem 8) and the strongly convex-concave case (Theorem 9). In Section 6, we present our numerical results, comparing the performance of PP, EG, and OGDA for solving both a bilinear problem and a quadratic program. We conclude the paper with final remarks.

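Notation. Lowercase boldface $\mathbf{v}$ denotes a vector and uppercase boldface $\mathbf{A}$ denotes a matrix. We use $\|\mathbf{v}\|$ to denote the Euclidean norm of vector $\mathbf{v}$. Given a multi-input function $f(\mathbf{x}, \mathbf{y})$, its gradients with respect to $\mathbf{x}$ and $\mathbf{y}$ at the point $(\mathbf{x}_0, \mathbf{y}_0)$ are denoted by $\nabla_{\mathbf{x}} f(\mathbf{x}_0, \mathbf{y}_0)$ and $\nabla_{\mathbf{y}} f(\mathbf{x}_0, \mathbf{y}_0)$, respectively. We refer to the largest and smallest eigenvalues of a matrix $\mathbf{A}$ by $\lambda_{\max}(\mathbf{A})$ and $\lambda_{\min}(\mathbf{A})$, respectively.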

2 Preliminaries

In this section we present properties and notations used in our results.

Definition 1.

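A function $\phi: \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth if it has $L$-Lipschitz continuous gradients on $\mathbb{R}^n$, i.e., for any $\mathbf{x}, \hat{\mathbf{x}} \in \mathbb{R}^n$, we have

$$\|\nabla \phi(\mathbf{x}) - \nabla \phi(\hat{\mathbf{x}})\| \le L \, \|\mathbf{x} - \hat{\mathbf{x}}\|.$$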

Definition 2.

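A continuously differentiable function $\phi: \mathbb{R}^n \to \mathbb{R}$ is $\mu$-strongly convex on $\mathbb{R}^n$ if for any $\mathbf{x}, \hat{\mathbf{x}} \in \mathbb{R}^n$, we have

$$\phi(\hat{\mathbf{x}}) \ge \phi(\mathbf{x}) + \nabla \phi(\mathbf{x})^\top (\hat{\mathbf{x}} - \mathbf{x}) + \frac{\mu}{2} \|\hat{\mathbf{x}} - \mathbf{x}\|^2.$$

Further, $\phi$ is $\mu$-strongly concave if $-\phi$ is $\mu$-strongly convex. If we set $\mu = 0$, then we recover the definition of convexity for a continuously differentiable function.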

Definition 3.

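The pair $(\mathbf{x}^*, \mathbf{y}^*)$ is a saddle point of a convex-concave function $f(\mathbf{x}, \mathbf{y})$ if for any $\mathbf{x} \in \mathbb{R}^m$ and $\mathbf{y} \in \mathbb{R}^n$, we have

$$f(\mathbf{x}^*, \mathbf{y}) \le f(\mathbf{x}^*, \mathbf{y}^*) \le f(\mathbf{x}, \mathbf{y}^*).$$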

Throughout the paper, we will consider two specific cases for Problem (1) stated in the next two assumptions.

Assumption 1.

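The function $f(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top \mathbf{B} \mathbf{y}$, where $\mathbf{B} \in \mathbb{R}^{d \times d}$ is a square full-rank matrix. The point $(\mathbf{x}^*, \mathbf{y}^*) = (\mathbf{0}, \mathbf{0})$ is the unique saddle point. In this case, we define the condition number $\kappa := \lambda_{\max}(\mathbf{B}^\top \mathbf{B}) / \lambda_{\min}(\mathbf{B}^\top \mathbf{B})$.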

Assumption 2.

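The function $f(\mathbf{x}, \mathbf{y})$ is continuously differentiable in $\mathbf{x}$ and $\mathbf{y}$. It is $\mu_x$-strongly convex in $\mathbf{x}$ and $\mu_y$-strongly concave in $\mathbf{y}$. The unique saddle point of $f$ is denoted by $(\mathbf{x}^*, \mathbf{y}^*)$. We define $\mu := \min\{\mu_x, \mu_y\}$.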

For the strongly convex-concave case, we also make the following smoothness assumption for analyzing the Extra-gradient (Section 4) and the Optimistic Gradient Descent Ascent (Section 5) methods.

Assumption 3.

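The gradient $\nabla_{\mathbf{x}} f(\mathbf{x}, \mathbf{y})$ is $L_x$-Lipschitz in $\mathbf{x}$ and $L_{xy}$-Lipschitz in $\mathbf{y}$, i.e.,

$$\|\nabla_{\mathbf{x}} f(\mathbf{x}_1, \mathbf{y}) - \nabla_{\mathbf{x}} f(\mathbf{x}_2, \mathbf{y})\| \le L_x \|\mathbf{x}_1 - \mathbf{x}_2\| \ \text{ for all } \mathbf{y}, \qquad \|\nabla_{\mathbf{x}} f(\mathbf{x}, \mathbf{y}_1) - \nabla_{\mathbf{x}} f(\mathbf{x}, \mathbf{y}_2)\| \le L_{xy} \|\mathbf{y}_1 - \mathbf{y}_2\| \ \text{ for all } \mathbf{x}.$$

Moreover, the gradient $\nabla_{\mathbf{y}} f(\mathbf{x}, \mathbf{y})$ is $L_y$-Lipschitz in $\mathbf{y}$ and $L_{yx}$-Lipschitz in $\mathbf{x}$, i.e.,

$$\|\nabla_{\mathbf{y}} f(\mathbf{x}, \mathbf{y}_1) - \nabla_{\mathbf{y}} f(\mathbf{x}, \mathbf{y}_2)\| \le L_y \|\mathbf{y}_1 - \mathbf{y}_2\| \ \text{ for all } \mathbf{x}, \qquad \|\nabla_{\mathbf{y}} f(\mathbf{x}_1, \mathbf{y}) - \nabla_{\mathbf{y}} f(\mathbf{x}_2, \mathbf{y})\| \le L_{yx} \|\mathbf{x}_1 - \mathbf{x}_2\| \ \text{ for all } \mathbf{y}.$$

We define $L := \max\{L_x, L_{xy}, L_y, L_{yx}\}$.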

Remark 1.

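Under Assumptions 2 and 3, we define the condition number $\kappa := L / \mu$.

In the following sections, we present and analyze three different iterative algorithms for solving the saddle point problem introduced in (1). The iterates of these algorithms are denoted by $\{\mathbf{x}_k, \mathbf{y}_k\}$. We denote $r_k = \|\mathbf{x}_k - \mathbf{x}^*\|^2 + \|\mathbf{y}_k - \mathbf{y}^*\|^2$, where $(\mathbf{x}^*, \mathbf{y}^*)$ is the unique saddle point.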

3 Proximal Point Method

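The Proximal Point (PP) method for minimizing a convex function $h$ is defined by the following update

$$\mathbf{x}_{k+1} = \arg\min_{\mathbf{x}} \left\{ h(\mathbf{x}) + \frac{1}{2\eta} \|\mathbf{x} - \mathbf{x}_k\|^2 \right\}, \tag{2}$$

where $\eta$ is a positive scalar Bertsekas (1999); Beck (2017). (Footnote 3: Note that $\mathbf{x}_{k+1}$ is unique since the objective function of problem (2) is strongly convex.) Using the optimality condition of the update in (2), one can also write the update of the PP method as

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla h(\mathbf{x}_{k+1}). \tag{3}$$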

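Indeed, this expression shows that the PP method is an implicit algorithm. The PP method can also be interpreted as a backward Euler discretization of the ODE $\dot{\mathbf{x}} = -\nabla h(\mathbf{x})$ with stepsize $\eta$. Convergence properties of the PP method for convex minimization have been extensively studied Rockafellar (1976a); Güler (1991); Ferris (1991); Eckstein & Bertsekas (1992); Parikh et al. (2014); Beck (2017). The extension of the PP method for solving saddle point problems has been considered in Rockafellar (1976b). Here, we first formally define the saddle point variant of the update in (2), where the iterates $(\mathbf{x}_{k+1}, \mathbf{y}_{k+1})$ are defined as the unique solution to the saddle point problem

$$\min_{\mathbf{x}} \max_{\mathbf{y}} \left\{ f(\mathbf{x}, \mathbf{y}) + \frac{1}{2\eta} \|\mathbf{x} - \mathbf{x}_k\|^2 - \frac{1}{2\eta} \|\mathbf{y} - \mathbf{y}_k\|^2 \right\}. \tag{4}$$

(Footnote 4: Again, $(\mathbf{x}_{k+1}, \mathbf{y}_{k+1})$ is unique since the objective function of problem (4) is strongly convex in $\mathbf{x}$ and strongly concave in $\mathbf{y}$.)

It can be verified that if the pair $(\mathbf{x}_{k+1}, \mathbf{y}_{k+1})$ is the solution of problem (4), then $\mathbf{x}_{k+1}$ and $\mathbf{y}_{k+1}$ satisfy

$$\mathbf{x}_{k+1} = \arg\min_{\mathbf{x}} \left\{ f(\mathbf{x}, \mathbf{y}_{k+1}) + \frac{1}{2\eta} \|\mathbf{x} - \mathbf{x}_k\|^2 \right\}, \tag{5}$$

$$\mathbf{y}_{k+1} = \arg\max_{\mathbf{y}} \left\{ f(\mathbf{x}_{k+1}, \mathbf{y}) - \frac{1}{2\eta} \|\mathbf{y} - \mathbf{y}_k\|^2 \right\}. \tag{6}$$

Using the optimality conditions of the updates in (5) and (6) (which are necessary and sufficient since the problems in (5) and (6) are convex), the update of the PP method for the saddle point problem in (1) can be written as

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla_{\mathbf{x}} f(\mathbf{x}_{k+1}, \mathbf{y}_{k+1}), \tag{7}$$

$$\mathbf{y}_{k+1} = \mathbf{y}_k + \eta \nabla_{\mathbf{y}} f(\mathbf{x}_{k+1}, \mathbf{y}_{k+1}). \tag{8}$$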

The steps of the PP method for solving the saddle point problem in (1) are summarized in Algorithm 1. Note that implementing the system of updates in (7)-(8) requires computing the operators $(\mathbf{I} + \eta \nabla_{\mathbf{x}} f)^{-1}$ and $(\mathbf{I} - \eta \nabla_{\mathbf{y}} f)^{-1}$, and, therefore, may not be computationally affordable for a general function $f$.

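Algorithm 1 Proximal point method for saddle point problem
Require: Stepsize $\eta > 0$, initial vectors $\mathbf{x}_0 \in \mathbb{R}^m$, $\mathbf{y}_0 \in \mathbb{R}^n$
1:  for $k = 0, 1, 2, \dots$ do
2:     Compute $\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla_{\mathbf{x}} f(\mathbf{x}_{k+1}, \mathbf{y}_{k+1})$;
3:     Compute $\mathbf{y}_{k+1} = \mathbf{y}_k + \eta \nabla_{\mathbf{y}} f(\mathbf{x}_{k+1}, \mathbf{y}_{k+1})$;
4:  end for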
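For a bilinear objective the implicit PP step can be solved exactly, which makes the method easy to experiment with. The following is a minimal sketch, assuming $f(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top \mathbf{B} \mathbf{y}$ with a square matrix; the function name `pp_step_bilinear`, the random test matrix, and the stepsize are illustrative choices, not taken from the paper.

```python
import numpy as np

def pp_step_bilinear(B, x, y, eta):
    """One proximal point step for f(x, y) = x^T B y (illustrative sketch).

    The implicit system is
        x_new = x - eta * B   @ y_new
        y_new = y + eta * B.T @ x_new.
    Substituting the second line into the first gives the linear system
        (I + eta^2 B B^T) x_new = x - eta * B @ y.
    """
    m = B.shape[0]
    x_new = np.linalg.solve(np.eye(m) + eta**2 * B @ B.T, x - eta * B @ y)
    y_new = y + eta * B.T @ x_new
    return x_new, y_new

# Hypothetical test data (not from the paper).
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
x0, y0 = rng.standard_normal(3), rng.standard_normal(3)
eta = 0.5

x1, y1 = pp_step_bilinear(B, x0, y0, eta)

# Residuals of the two implicit equations; both should be at rounding level.
res_x = np.linalg.norm(x1 - (x0 - eta * B @ y1))
res_y = np.linalg.norm(y1 - (y0 + eta * B.T @ x1))

# Squared distance to the saddle point (0, 0) before and after the step.
r0 = float(x0 @ x0 + y0 @ y0)
r1 = float(x1 @ x1 + y1 @ y1)
```

For a general $f$ the same step would need an inner solver for the implicit system, which is the cost the explicit methods in the later sections avoid.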

In the following theorem, we show that the PP method converges linearly to $(\mathbf{x}^*, \mathbf{y}^*) = (\mathbf{0}, \mathbf{0})$, which is the unique solution of the problem (see Theorem 2 in Rockafellar (1976b)).

Theorem 1.

Consider the saddle point problem under Assumption 1 and the PP method outlined in Algorithm 1. For any , the iterates generated by the PP method satisfy

We would like to emphasize that the bilinear function $f(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top \mathbf{B} \mathbf{y}$ is neither strongly convex with respect to $\mathbf{x}$ nor strongly concave with respect to $\mathbf{y}$, but the PP method achieves a linear convergence rate in this setting.

In the following theorem, we characterize the convergence rate of PP for a function $f(\mathbf{x}, \mathbf{y})$ that is strongly convex with respect to $\mathbf{x}$ and strongly concave with respect to $\mathbf{y}$. This result was established in Rockafellar (1976b). We include its proof in the appendix for completeness.

Theorem 2.

Consider the saddle point problem under Assumption 2 and the PP method outlined in Algorithm 1. For any , the iterates generated by the PP method satisfy

The result in Theorem 2 states that for the general saddle point problem defined in (1), if the function is strongly convex-concave, the iterates generated by the PP method converge linearly to the optimal solution.

4 Extra-gradient Method

In this section, we first study the classical Extra-gradient (EG) method for solving the general saddle point problem in (1) and provide linear rates of convergence for the bilinear and the strongly convex-concave case. We next introduce a new variant of the EG method (by increasing the number of extrapolation steps) and show better convergence rates than the classical EG rates in terms of problem parameters in the strongly convex-concave case.

4.1 Convergence rate of the EG Method

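The main idea of the EG method is to use the gradient at the current point to find a mid-point, and then use the gradient at that mid-point to find the next iterate. To be more precise, given a stepsize $\eta > 0$, the update of EG at step $k$ for solving the saddle point problem in (1) has two steps. First, we find mid-point iterates $\mathbf{x}_{k+1/2}$ and $\mathbf{y}_{k+1/2}$ by performing a primal-dual gradient update as

$$\mathbf{x}_{k+1/2} = \mathbf{x}_k - \eta \nabla_{\mathbf{x}} f(\mathbf{x}_k, \mathbf{y}_k), \qquad \mathbf{y}_{k+1/2} = \mathbf{y}_k + \eta \nabla_{\mathbf{y}} f(\mathbf{x}_k, \mathbf{y}_k).$$

Then, the gradients evaluated at the midpoints $\mathbf{x}_{k+1/2}$ and $\mathbf{y}_{k+1/2}$ are used to compute the new iterates $\mathbf{x}_{k+1}$ and $\mathbf{y}_{k+1}$ by performing the updates

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla_{\mathbf{x}} f(\mathbf{x}_{k+1/2}, \mathbf{y}_{k+1/2}), \qquad \mathbf{y}_{k+1} = \mathbf{y}_k + \eta \nabla_{\mathbf{y}} f(\mathbf{x}_{k+1/2}, \mathbf{y}_{k+1/2}).$$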

The steps of the EG method for solving saddle point problems are outlined in Algorithm 2.
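The two-stage structure of EG can be sketched in a few lines. This is an illustrative implementation with gradient callbacks, run next to plain GDA on the hypothetical scalar problem $f(x, y) = xy$ (saddle point at the origin); the helper names, stepsize, and iteration count are assumptions made for the demonstration.

```python
import math

def eg_step(grad_x, grad_y, x, y, eta):
    """One extra-gradient step (illustrative sketch)."""
    # Stage 1: ordinary descent-ascent step to get the midpoint.
    x_mid = x - eta * grad_x(x, y)
    y_mid = y + eta * grad_y(x, y)
    # Stage 2: restart from (x, y), but use the gradients at the midpoint.
    x_new = x - eta * grad_x(x_mid, y_mid)
    y_new = y + eta * grad_y(x_mid, y_mid)
    return x_new, y_new

def gda_step(grad_x, grad_y, x, y, eta):
    """One gradient descent-ascent step, for comparison."""
    return x - eta * grad_x(x, y), y + eta * grad_y(x, y)

# Hypothetical test problem: f(x, y) = x * y.
gx = lambda x, y: y   # df/dx
gy = lambda x, y: x   # df/dy

eta = 0.1
z_eg = (1.0, 1.0)
z_gda = (1.0, 1.0)
for _ in range(200):
    z_eg = eg_step(gx, gy, z_eg[0], z_eg[1], eta)
    z_gda = gda_step(gx, gy, z_gda[0], z_gda[1], eta)

start_dist = math.hypot(1.0, 1.0)
eg_dist = math.hypot(*z_eg)     # distance to the saddle point after EG
gda_dist = math.hypot(*z_gda)   # distance to the saddle point after GDA
```

On this toy problem the EG iterates move toward the origin while the GDA iterates spiral outward, since one GDA step multiplies the squared distance by $1 + \eta^2$.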

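Note that in the update of the EG method, as the name suggests, for both primal and dual updates we need to evaluate an extra gradient at the midpoints $\mathbf{x}_{k+1/2}$ and $\mathbf{y}_{k+1/2}$, which doubles the per-iteration computational cost of this algorithm compared to the vanilla Gradient Descent Ascent (GDA) method. We show next that the EG method approximates the Proximal Point (PP) method more accurately than the GDA method does. Consider the bilinear saddle point problem (Assumption 1). By following the update of the PP method in Section 3 and simplifying the expressions, the PP update for the bilinear problem under Assumption 1 can be written as

$$\mathbf{x}_{k+1} = (\mathbf{I} + \eta^2 \mathbf{B} \mathbf{B}^\top)^{-1} (\mathbf{x}_k - \eta \mathbf{B} \mathbf{y}_k), \qquad \mathbf{y}_{k+1} = (\mathbf{I} + \eta^2 \mathbf{B}^\top \mathbf{B})^{-1} (\mathbf{y}_k + \eta \mathbf{B}^\top \mathbf{x}_k).$$

As the computation of the inverse $(\mathbf{I} + \eta^2 \mathbf{B} \mathbf{B}^\top)^{-1}$ could be costly, one can use $\mathbf{I}$ instead with an error of $\mathcal{O}(\eta^2)$. This approximation retrieves the update of GDA, which is known to possibly diverge for bilinear saddle point problems (see Daskalakis et al. (2017)). If we use the more accurate approximation $(\mathbf{I} - \eta^2 \mathbf{B} \mathbf{B}^\top)$, which has an error of $\mathcal{O}(\eta^4)$, we obtain the following system of updates

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \mathbf{B} \mathbf{y}_k - \eta^2 \mathbf{B} \mathbf{B}^\top \mathbf{x}_k + \eta^3 \mathbf{B} \mathbf{B}^\top \mathbf{B} \mathbf{y}_k, \tag{9}$$

$$\mathbf{y}_{k+1} = \mathbf{y}_k + \eta \mathbf{B}^\top \mathbf{x}_k - \eta^2 \mathbf{B}^\top \mathbf{B} \mathbf{y}_k - \eta^3 \mathbf{B}^\top \mathbf{B} \mathbf{B}^\top \mathbf{x}_k. \tag{10}$$

If we ignore the extra terms in (9)-(10), which are of $\mathcal{O}(\eta^3)$, we recover the update of the EG method for the bilinear saddle point problem (Assumption 1):

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \mathbf{B} (\mathbf{y}_k + \eta \mathbf{B}^\top \mathbf{x}_k), \qquad \mathbf{y}_{k+1} = \mathbf{y}_k + \eta \mathbf{B}^\top (\mathbf{x}_k - \eta \mathbf{B} \mathbf{y}_k).$$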

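Algorithm 2 Extra-gradient method for saddle point problem
Require: Stepsize $\eta > 0$, initial vectors $\mathbf{x}_0 \in \mathbb{R}^m$, $\mathbf{y}_0 \in \mathbb{R}^n$
1:  for $k = 0, 1, 2, \dots$ do
2:     Compute $\mathbf{x}_{k+1/2} = \mathbf{x}_k - \eta \nabla_{\mathbf{x}} f(\mathbf{x}_k, \mathbf{y}_k)$;
3:     Compute $\mathbf{y}_{k+1/2} = \mathbf{y}_k + \eta \nabla_{\mathbf{y}} f(\mathbf{x}_k, \mathbf{y}_k)$;
4:     Compute $\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla_{\mathbf{x}} f(\mathbf{x}_{k+1/2}, \mathbf{y}_{k+1/2})$;
5:     Compute $\mathbf{y}_{k+1} = \mathbf{y}_k + \eta \nabla_{\mathbf{y}} f(\mathbf{x}_{k+1/2}, \mathbf{y}_{k+1/2})$;
6:  end for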

Therefore, in the bilinear problem, the EG method can be interpreted as an approximation of the PP method with error $\mathcal{O}(\eta^3)$. In the following proposition, we extend this result and show that for a general smooth (possibly nonconvex) function $f$, EG is an approximation of PP.
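The approximation orders in the bilinear case can be checked numerically. The sketch below takes one step of PP, GDA, and EG from the same point and measures how far the GDA and EG iterates land from the PP iterate; halving the stepsize should shrink the GDA gap by roughly $2^2$ and the EG gap by roughly $2^3$. The function name `one_step_gaps`, the random matrix, its normalization, and the stepsizes are assumptions made for this illustration.

```python
import numpy as np

def one_step_gaps(B, x, y, eta):
    """Distance of one GDA step and one EG step from one exact PP step,
    all started at (x, y), for f(x, y) = x^T B y (illustrative sketch)."""
    m = B.shape[0]
    # Exact PP step via the linear system (I + eta^2 B B^T) x_pp = x - eta B y.
    x_pp = np.linalg.solve(np.eye(m) + eta**2 * B @ B.T, x - eta * B @ y)
    y_pp = y + eta * B.T @ x_pp
    # GDA step (this is also the EG midpoint).
    x_gda = x - eta * B @ y
    y_gda = y + eta * B.T @ x
    # EG step: gradients evaluated at the midpoint.
    x_eg = x - eta * B @ y_gda
    y_eg = y + eta * B.T @ x_gda

    def gap(xa, ya):
        return float(np.sqrt(np.sum((xa - x_pp) ** 2) + np.sum((ya - y_pp) ** 2)))

    return gap(x_gda, y_gda), gap(x_eg, y_eg)

# Hypothetical test data (not from the paper).
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
B = B / np.linalg.norm(B, 2)          # scale to spectral norm 1
x, y = rng.standard_normal(3), rng.standard_normal(3)

gda_big, eg_big = one_step_gaps(B, x, y, 1e-2)
gda_small, eg_small = one_step_gaps(B, x, y, 5e-3)

ratio_gda = gda_big / gda_small   # expected near 4 for an O(eta^2) gap
ratio_eg = eg_big / eg_small      # expected near 8 for an O(eta^3) gap
```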

Proposition 1.

Consider the saddle point problem in (1). Given the stepsize , the update of the EG method is an approximation of the PP method with an error of .

The next theorem views the EG method as the PP method with an error and properly bounds the error to provide convergence rate estimates for the EG method in the bilinear case.

Theorem 3.

Consider the saddle point problem under Assumption 1 and the EG method outlined in Algorithm 2. If the stepsize satisfies the condition , then the iterates generated by the EG method satisfy

(11)

where and

The result in Theorem 3 shows that if the stepsize is chosen so that the condition of the theorem holds, then the iterates generated by EG converge linearly to the optimal solution. Indeed, the best possible rate is achieved by minimizing the convergence factor in (11) with respect to the stepsize, i.e., the best convergence factor for the EG method is

(12)

Using solvers one can find the optimal which is the minimizer of the expression in (12). In the following corollary, we pick a particular stepsize , which allows us to provide a convergence rate estimate that highlights dependence on the relevant problem parameters.

Corollary 1.

Suppose the conditions in Theorem 3 are satisfied. Let which leads to . Then there exists a positive constant such that

The results in Theorem 3 and Corollary 1 show linear convergence of the EG method in the bilinear case where the matrix is square and full rank. In other words, we obtain that the overall number of iterations to reach a point satisfying is at most . This result is similar to the one in Liang & Stokes (2018). The following theorem characterizes the convergence rate of the EG method when is strongly convex-concave.

Theorem 4.

Consider the saddle point problem under Assumptions 2 and 3 and the EG method outlined in Algorithm 2. For stepsize , there exists a constant such that the iterates generated by the EG method satisfy

The result in Theorem 4 shows that the computational complexity of EG to achieve an -suboptimal solution, i.e., , is , where is the condition number. Note that GDA achieves an -suboptimal solution in , see Du & Hu (2018). We see that these upper bounds suggest a better dependence of the EG method on the condition number compared to GDA.

4.2 Generalized Extra-gradient method

In this section, we introduce and analyze a generalized version of the Extra-Gradient (EG) method, which has ‘midpoints’ (Note that the original EG method has ). The generalized EG method can be written as follows:

where:

for and

For these updates, we have the following convergence result.

Theorem 5.

Consider the saddle point problem under Assumptions 2 and 3 and the EG method outlined in Algorithm 2. Let . Then, there exists a constant such that the iterates generated by the generalized EG method satisfy

The result in Theorem 5 shows that by increasing the number of midpoints , the total computational cost, including the per iteration cost, of the generalized EG method becomes , where the additional multiplicand is due to the fact that we need gradients per iteration. The convergence rate of the generalized EG can be shown in a similar manner for the bilinear case (Assumption 1). In the bilinear case, the generalized EG approximates the inverse up to higher orders of , which in turn reduces the order of the error between the PP method and the generalized EG method. We do not state the convergence rate results for the bilinear case here due to space limitations.

5 Optimistic Gradient Descent Ascent Method

The Optimistic Gradient Descent Ascent (OGDA) method is a popular method for solving the saddle point problem (1); see Algorithm 3 for the steps of the OGDA method (Daskalakis et al. (2017)).

5.1 Convergence rate of the OGDA Method

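The main idea behind the updates of the OGDA method is the addition of a “negative-momentum” term to the updates, which can be clearly seen when we write the iterations as follows:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla_{\mathbf{x}} f(\mathbf{x}_k, \mathbf{y}_k) - \eta \left( \nabla_{\mathbf{x}} f(\mathbf{x}_k, \mathbf{y}_k) - \nabla_{\mathbf{x}} f(\mathbf{x}_{k-1}, \mathbf{y}_{k-1}) \right),$$

$$\mathbf{y}_{k+1} = \mathbf{y}_k + \eta \nabla_{\mathbf{y}} f(\mathbf{x}_k, \mathbf{y}_k) + \eta \left( \nabla_{\mathbf{y}} f(\mathbf{x}_k, \mathbf{y}_k) - \nabla_{\mathbf{y}} f(\mathbf{x}_{k-1}, \mathbf{y}_{k-1}) \right).$$

The last term in parentheses in each of the updates can be interpreted as a “negative-momentum”, differentiating the OGDA method from vanilla Gradient Descent Ascent (GDA).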

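Algorithm 3 OGDA method for saddle point problems
Require: Stepsize $\eta > 0$, vectors $\mathbf{x}_{-1}, \mathbf{x}_0 \in \mathbb{R}^m$, $\mathbf{y}_{-1}, \mathbf{y}_0 \in \mathbb{R}^n$
1:  for $k = 0, 1, 2, \dots$ do
2:     $\mathbf{x}_{k+1} = \mathbf{x}_k - 2\eta \nabla_{\mathbf{x}} f(\mathbf{x}_k, \mathbf{y}_k) + \eta \nabla_{\mathbf{x}} f(\mathbf{x}_{k-1}, \mathbf{y}_{k-1})$;
3:     $\mathbf{y}_{k+1} = \mathbf{y}_k + 2\eta \nabla_{\mathbf{y}} f(\mathbf{x}_k, \mathbf{y}_k) - \eta \nabla_{\mathbf{y}} f(\mathbf{x}_{k-1}, \mathbf{y}_{k-1})$;
4:  end for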
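The OGDA loop can be sketched as follows. This is an illustrative implementation, assuming the initialization $\mathbf{x}_{-1} = \mathbf{x}_0$, $\mathbf{y}_{-1} = \mathbf{y}_0$ (the algorithm leaves these vectors as inputs); the test problem $f(x, y) = xy$, the stepsize, and the iteration count are hypothetical choices for the demonstration.

```python
import math

def ogda(grad_x, grad_y, x0, y0, eta, num_iters):
    """Optimistic gradient descent ascent (illustrative sketch).

    Assumption: the "previous" iterate is initialized to the starting point,
    i.e. x_{-1} = x_0 and y_{-1} = y_0.
    """
    x_prev, y_prev = x0, y0
    x, y = x0, y0
    for _ in range(num_iters):
        gx, gy = grad_x(x, y), grad_y(x, y)
        gx_prev, gy_prev = grad_x(x_prev, y_prev), grad_y(x_prev, y_prev)
        # Gradient step plus the negative-momentum correction:
        # x - eta*gx - eta*(gx - gx_prev) = x - 2*eta*gx + eta*gx_prev
        x_next = x - 2.0 * eta * gx + eta * gx_prev
        y_next = y + 2.0 * eta * gy - eta * gy_prev
        x_prev, y_prev, x, y = x, y, x_next, y_next
    return x, y

# Hypothetical test problem: f(x, y) = x * y, saddle point at (0, 0).
gx = lambda x, y: y
gy = lambda x, y: x

x_final, y_final = ogda(gx, gy, 1.0, 1.0, eta=0.1, num_iters=1000)
final_dist = math.hypot(x_final, y_final)
start_dist = math.hypot(1.0, 1.0)
```

Unlike EG, each iteration needs only one new gradient evaluation if the previous gradient is cached; the sketch recomputes it for clarity.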

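We analyze the OGDA method as an approximation of the Proximal Point (PP) method presented in Section 3. We first focus on the bilinear case (Assumption 1), for which the OGDA updates are as follows:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - 2\eta \mathbf{B} \mathbf{y}_k + \eta \mathbf{B} \mathbf{y}_{k-1}, \qquad \mathbf{y}_{k+1} = \mathbf{y}_k + 2\eta \mathbf{B}^\top \mathbf{x}_k - \eta \mathbf{B}^\top \mathbf{x}_{k-1}.$$

Note that the update of the PP method for the variable $\mathbf{x}$ can be written as

$$\mathbf{x}_{k+1} = (\mathbf{I} + \eta^2 \mathbf{B} \mathbf{B}^\top)^{-1} (\mathbf{x}_k - \eta \mathbf{B} \mathbf{y}_k) = \mathbf{x}_k - \eta \mathbf{B} \mathbf{y}_k - \eta^2 \mathbf{B} \mathbf{B}^\top \mathbf{x}_k + o(\eta^2),$$

where we used the fact that $(\mathbf{I} - \eta^2 \mathbf{B} \mathbf{B}^\top)$ is an approximation of $(\mathbf{I} + \eta^2 \mathbf{B} \mathbf{B}^\top)^{-1}$ with an error of $o(\eta^2)$. Regrouping the terms and using the PP update $\mathbf{y}_k = \mathbf{y}_{k-1} + \eta \mathbf{B}^\top \mathbf{x}_k$ yield

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \mathbf{B} \mathbf{y}_k - \eta \mathbf{B} (\mathbf{y}_k - \mathbf{y}_{k-1}) + o(\eta^2) = \mathbf{x}_k - 2\eta \mathbf{B} \mathbf{y}_k + \eta \mathbf{B} \mathbf{y}_{k-1} + o(\eta^2),$$

where the last expression is the OGDA update for variable $\mathbf{x}$ plus an additional error of $o(\eta^2)$. A similar derivation can be done for the update of variable $\mathbf{y}$ to show that OGDA is an approximation of the PP method up to $o(\eta^2)$. In the following proposition, we show that this observation can be generalized to a general smooth (possibly nonconvex) function $f$.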

Proposition 2.

Consider the saddle point problem in (1). The update of the OGDA method is an approximation of the PP method with an error of .

To analyze the convergence of OGDA, we view it as a PP algorithm with an additional error term. In the following theorem, we characterize the convergence rate of the OGDA method for the bilinear saddle point problem (Assumption 1).

Theorem 6.

Consider the saddle point problem under Assumption 1 and the OGDA method outlined in Algorithm 3. If the stepsize satisfies the condition , then the iterates generated by the OGDA method satisfy

where and

The result in Theorem 6 shows that if the stepsize is chosen so that the condition of the theorem holds, then after at most four iterations the error of OGDA decreases by a constant factor. The best possible rate is achieved by minimizing the convergence factor with respect to the stepsize, i.e.,

Similar to EG, we identify a specific , and corresponding , which provides explicit rate estimates for the OGDA method in the bilinear case.

Corollary 2.

Suppose the conditions in Theorem 6 are satisfied. Let which leads to , then for a constant , the iterates generated by the OGDA method satisfy

where .

The result in Theorem 6 shows linear convergence of OGDA in the bilinear case where the matrix is square and full rank. This result is similar to the one in Liang & Stokes (2018), except here we analyze OGDA as an approximation of PP. We use this interpretation to provide a convergence rate estimate for OGDA when it is used for solving a general strongly convex-concave saddle point problem.

Theorem 7.

Consider the saddle point problem under Assumptions 2 and 3 and the OGDA method outlined in Algorithm 3. Let the stepsize . Then for a constant , the iterates generated by the OGDA method satisfy

where

The result in Theorem 7 shows that the OGDA method converges linearly to the optimal solution under the assumption that is smooth and strongly convex-concave. In other words, it shows that to achieve a point with error , we need to run at most iterations of OGDA. Note that the factor in the power of the linear convergence factor does not change the order of overall complexity as it appears as a factor in the overall complexity.

5.2 Generalized OGDA method

In this section we consider the following OGDA dynamics with general stepsize parameters :

(13)
(14)

Note that for , we recover the original OGDA method. We have the following results for the generalized OGDA method described in Equations (13) and (14)
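To make the role of the two parameters concrete, the sketch below assumes the generalized update takes the form $\mathbf{x}_{k+1} = \mathbf{x}_k - (\alpha + \beta) \nabla_{\mathbf{x}} f(\mathbf{x}_k, \mathbf{y}_k) + \beta \nabla_{\mathbf{x}} f(\mathbf{x}_{k-1}, \mathbf{y}_{k-1})$, with the mirrored signs for $\mathbf{y}$. The symbols `alpha` and `beta`, this exact form, and the test values are assumptions made for illustration; under this form, equal parameters give the OGDA step and `beta = 0` gives the GDA step.

```python
def generalized_ogda_step(grad_x, grad_y, x, y, x_prev, y_prev, alpha, beta):
    """One step of a two-parameter OGDA update (assumed form, see lead-in).

    alpha weights the plain gradient step; beta weights the
    negative-momentum correction (current gradient minus previous gradient).
    """
    gx, gy = grad_x(x, y), grad_y(x, y)
    gx_prev, gy_prev = grad_x(x_prev, y_prev), grad_y(x_prev, y_prev)
    x_new = x - (alpha + beta) * gx + beta * gx_prev
    y_new = y + (alpha + beta) * gy - beta * gy_prev
    return x_new, y_new

# Hypothetical test problem: f(x, y) = x * y.
gx = lambda x, y: y
gy = lambda x, y: x

x_prev, y_prev = 1.0, 1.0   # previous iterate (hypothetical values)
x, y = 0.9, 1.1             # current iterate (hypothetical values)
eta = 0.1

# Equal parameters: should coincide with the OGDA step with stepsize eta.
ogda_like = generalized_ogda_step(gx, gy, x, y, x_prev, y_prev, eta, eta)
ogda_ref = (x - 2 * eta * gx(x, y) + eta * gx(x_prev, y_prev),
            y + 2 * eta * gy(x, y) - eta * gy(x_prev, y_prev))

# Zero momentum weight: should coincide with a plain GDA step.
gda_like = generalized_ogda_step(gx, gy, x, y, x_prev, y_prev, eta, 0.0)
gda_ref = (x - eta * gx(x, y), y + eta * gy(x, y))
```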

Theorem 8.

Consider the saddle point problem under Assumption 1 and the generalized OGDA method Assume and , then the iterates generated by the generalized OGDA method satisfy

where , and

Theorem 8 shows that it is not necessary to use a factor of in the OGDA update to have a linearly convergent method and for a wide range of parameters this result holds. 555A result similar to Theorem 8 can be established when . We do not state the results here due to space limitations. The difference from Theorem 6 is in the last two terms of , which have the factor . In the classical OGDA, , and we recover the result of Theorem 6. As done in the previous sections, we can substitute explicit values for the stepsizes to characterize the exact rate dependence on the problem parameters. We do not state these results here due to space limitations.

The following theorem shows the convergence of the generalized OGDA method when the function is strongly convex-concave.

Theorem 9.

Consider the saddle point problem under Assumptions 2 and 3 and the generalized OGDA method. Set the stepsizes such that Then, for a constant , the iterates generated by the generalized OGDA method satisfy

where

Note that in Theorem 9, only the maximum of takes a specific value, leaving the other parameter free. This generalizes Theorem 7, which analyzes classical OGDA, where . Note that in Theorem 8, , was restricted to lie in a range around . In other words, for the bilinear case, we have a lower bound for , but we do not need such a lower bound in the strongly convex-concave case. This is because here the Gradient Descent Ascent (GDA) method converges, and the GDA method is nothing but a special case of the generalized OGDA method with the parameter . In the bilinear case, since the GDA method may possibly diverge, we cannot set arbitrarily small in the generalized OGDA method.

Figure 2: Convergence of proximal point (PP), extra-gradient (EG), and optimistic gradient descent ascent (OGDA) in terms of number of iterations for the bilinear problem in (15). All algorithms converge linearly, and the proximal point method has the best performance. Stepsizes of EG and OGDA were tuned for best performance.

6 Numerical Experiments

In this section, we compare the performance of the Proximal Point (PP) method with the Extra-gradient (EG), Gradient Descent Ascent (GDA), and Optimistic Gradient Descent Ascent (OGDA) methods.

We first focus on the following bilinear problem

(15)

where we set to be a diagonal matrix with a condition number of , and we set the dimension of the problem to . The iterates are initialized at and , where is a dimensional vector with all elements equal to . Figure 2 demonstrates the errors of PP, OGDA, and EG versus number of iterations for this bilinear problem. Note that in this figure we do not show the error of GDA since it diverges for this problem, as illustrated in Figure 1 (For more details check Daskalakis et al. (2017)). We can observe that all the three considered algorithms converge linearly to the optimal solution .
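A small experiment in the spirit of this comparison can be sketched as follows. All concrete values here (dimension, singular values of the diagonal matrix, initialization, stepsizes, iteration count) are hypothetical and untuned, so the numbers it produces are not those of Figure 2; the sketch only illustrates that PP, EG, and OGDA all reduce the error on a bilinear problem with a diagonal matrix.

```python
import numpy as np

d = 10                                    # hypothetical dimension
B = np.diag(np.linspace(1.0, 10.0, d))    # hypothetical diagonal matrix
x0 = np.ones(d)                           # hypothetical initialization
y0 = np.ones(d)

def err(x, y):
    # Squared distance to the saddle point (0, 0).
    return float(x @ x + y @ y)

def run_pp(eta, iters):
    x, y = x0.copy(), y0.copy()
    M = np.eye(d) + eta**2 * B @ B.T
    for _ in range(iters):
        x = np.linalg.solve(M, x - eta * B @ y)   # x_{k+1}
        y = y + eta * B.T @ x                     # y_{k+1} uses x_{k+1}
    return err(x, y)

def run_eg(eta, iters):
    x, y = x0.copy(), y0.copy()
    for _ in range(iters):
        x_mid, y_mid = x - eta * B @ y, y + eta * B.T @ x
        x, y = x - eta * B @ y_mid, y + eta * B.T @ x_mid
    return err(x, y)

def run_ogda(eta, iters):
    x, y = x0.copy(), y0.copy()
    x_prev, y_prev = x0.copy(), y0.copy()   # assumption: x_{-1} = x_0, y_{-1} = y_0
    for _ in range(iters):
        x_next = x - 2 * eta * B @ y + eta * B @ y_prev
        y_next = y + 2 * eta * B.T @ x - eta * B.T @ x_prev
        x_prev, y_prev, x, y = x, y, x_next, y_next
    return err(x, y)

initial_err = err(x0, y0)
pp_err = run_pp(eta=1.0, iters=500)
eg_err = run_eg(eta=0.05, iters=500)
ogda_err = run_ogda(eta=0.04, iters=500)
```

The PP stepsize can be taken large because the step is implicit, whereas the EG and OGDA stepsizes in this sketch are kept below the inverse of the largest singular value of the matrix.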

Figure 3: Convergence of proximal point (PP), extra-gradient (EG), optimistic gradient descent ascent (OGDA), and gradient descent ascent (GDA) in terms of number of iterations for the quadratic problem in (16). Stepsizes of EG, OGDA and GDA were tuned for best performance.

We proceed to study the performance of PP, EG, GDA, and OGDA for solving the following strongly convex-concave saddle point problem

(16)

This is the saddle point reformulation of the linear regression