Convergence Rates of Two-Time-Scale Gradient Descent-Ascent Dynamics for Solving Nonconvex Min-Max Problems

There has been much recent interest in solving nonconvex min-max optimization problems due to their broad applications in many areas including machine learning, networked resource allocation, and distributed optimization. Perhaps the most popular first-order method for solving min-max optimization problems is the so-called simultaneous (or single-loop) gradient descent-ascent algorithm, due to its simplicity of implementation. However, theoretical guarantees on the convergence of this algorithm are very sparse, since it can diverge even on a simple bilinear problem. In this paper, our focus is to characterize the finite-time performance (or convergence rates) of the continuous-time variant of the simultaneous gradient descent-ascent algorithm. In particular, we derive the rates of convergence of this method under a number of different conditions on the underlying objective function, namely, two-sided Polyak-Łojasiewicz (PŁ), one-sided PŁ, nonconvex-strongly concave, and strongly convex-nonconcave conditions. Our convergence results improve the ones in prior works under the same conditions on the objective functions. The key idea in our analysis is to use the classic singular perturbation theory and coupling Lyapunov functions to address the time-scale difference and interactions between the gradient descent and ascent dynamics. Our results on the behavior of the continuous-time algorithm may be used to enhance the convergence properties of its discrete-time counterpart.


1 Introduction

In this paper, we consider the following min-max optimization problem

$$\min_{x \in \mathbb{R}^d} \max_{y \in \mathbb{R}^n} f(x, y), \qquad (1)$$

where $f:\mathbb{R}^d \times \mathbb{R}^n \rightarrow \mathbb{R}$ is a nonconvex function w.r.t. $x$ for a fixed $y$ and (possibly) nonconcave w.r.t. $y$ for a fixed $x$. The min-max problem has received much interest over the years due to its broad applications in different areas including control, machine learning, and economics. In particular, many problems in these areas can be formulated as problem (1), for example, game theory [1, 2], stochastic control and reinforcement learning [3, 4], training generative adversarial networks (GANs) [5, 6], adversarial and robust machine learning [7, 8], resource allocation over networks [9], and distributed optimization [10, 11], to name just a few.

In the existing literature, there are two types of iterative first-order methods for solving problem (1), namely, nested-loop algorithms and single-loop algorithms. Nested-loop algorithms implement multiple inner steps in each iteration to solve the maximization problem either exactly or approximately. However, this approach is not applicable when $f$ is nonconcave in $y$, since the inner maximization problem is then NP-hard; one can only find a stationary point of the maximization problem, which is likely to degrade the quality of the solution to the minimization problem.

On the other hand, single-loop algorithms simultaneously update the iterates $x$ and $y$ by using vanilla gradient descent and ascent steps at different time scales, respectively. As a result, this approach applies to more general settings and is more practical due to its simplicity of implementation. However, single-loop algorithms may not converge in many settings; for example, they fail to converge even in a simple bilinear zero-sum game [12]. Indeed, theoretical guarantees for these methods are very sparse.

Our focus in this paper is to study the continuous-time variant of the single-loop gradient descent-ascent method for solving problem (1). Considering the continuous-time variant gives us a better understanding of the behavior of this method through studying the convergence of the corresponding differential equations using Lyapunov theory. Such an understanding can then be used to enhance the analysis of the discrete-time algorithms, as recently observed in the single-objective optimization counterpart [13, 14, 15, 16]. Our main contributions are summarized below.

Main Contributions. The focus of this paper is to study the performance of the continuous-time gradient descent-ascent dynamics in solving nonconvex min-max optimization problems. In particular, we derive the rates of convergence of this method under a number of different conditions on the underlying objective function, namely, two-sided Polyak-Łojasiewicz (PŁ), one-sided PŁ, nonconvex-strongly concave, and strongly convex-nonconcave conditions. These rates are summarized in Table 1 and presented in detail in Section 3, where we show that our results improve the ones in prior works under the same conditions on the objective function. The key idea in our analysis is to use the classic singular perturbation theory and a Lyapunov function coupling the fast and slow dynamics to address the time-scale difference and interactions between the gradient descent and ascent dynamics. Proper choices of step sizes allow us to derive improved convergence properties of the two-time-scale gradient descent-ascent dynamics.

1.1 Related Works

Convex-Concave Settings.

Given the broad applications of problem (1), there are a large number of works studying algorithms and their convergence for solving this problem, especially in the convex-concave setting. Some examples include the prox-method and its variants [17, 18, 19, 20, 21], extragradient and optimistic gradient methods [22, 23, 24, 25, 26, 27], and recently Hamiltonian gradient descent methods [6, 12, 28]. Some algorithms in these settings have convergence rates matching the lower-bound complexity; see the recent work [26] for a detailed discussion.

Nonconvex-Concave Settings.

Unlike the convex-concave settings, algorithmic development and theoretical understanding in the general nonconvex settings are very limited. Indeed, finding the global optimum of a nonconvex-nonconcave problem is NP-hard, or at least as hard as solving a single nonconvex objective problem. As a result, the existing literature often aims to find a stationary point of problem (1) when the max problem is concave. For example, multiple-loop algorithms have been studied in [29, 30, 31, 32, 33]. Our work in this paper is closely related to the recent literature on single-loop algorithms [34, 35, 36, 37, 38]. While these works study discrete-time algorithms, we consider the continuous-time counterpart. We will show that in some settings our approach improves the existing convergence results.

Other Settings.

We also want to mention some related literature in game theory [39, 40, 41, 42, 43], two-time-scale stochastic approximation [44, 45, 46, 47, 48, 49, 50, 51, 52, 53], reinforcement learning [54, 55, 56, 57, 58], two-time-scale optimization [59, 60], and decentralized optimization [61, 62, 63, 64, 65, 66, 67]. These works study different variants of two-time-scale methods, mostly for solving a single optimization problem, and often aim to find global optima (or fixed points) using different structures of the underlying problems (e.g., the Markov structure in stochastic games and reinforcement learning, or strong monotonicity in stochastic approximation). As a result, their techniques may not be applicable to the context of problem (1) considered in the current paper.

Notation.

Given any vector $x \in \mathbb{R}^d$, we use $\|x\|$ to denote its $\ell_2$-norm. We denote by $\nabla_x f(x, y)$ and $\nabla_y f(x, y)$ the partial gradients of $f$ with respect to $x$ and $y$, respectively.

2 Two-Time-Scale Gradient Descent-Ascent Dynamics

For solving problem (1), we are interested in studying the two-time-scale gradient descent-ascent dynamics (GDAD), where we implement simultaneously the following two differential equations

$$\dot{x}(t) = -\alpha\, \nabla_x f\bigl(x(t), y(t)\bigr), \qquad \dot{y}(t) = \beta\, \nabla_y f\bigl(x(t), y(t)\bigr). \qquad (2)$$

Here, $\alpha, \beta > 0$ are two step sizes, whose values will be specified later. In the convex-concave setting, one can choose $\alpha = \beta$. However, as observed in [68], choosing different step sizes achieves better convergence in the context of nonconvex problems. Indeed, we will choose $\alpha < \beta$ since, in the settings studied in the following sections, the maximization problem is often easier to solve than the minimization problem. In this case, the dynamic of $y$ is implemented at a faster time scale (using a larger step size) than that of $x$ (using a smaller step size). The time-scale difference is loosely defined as the ratio $\beta/\alpha$. Thus, one has to design these two step sizes properly so that the method converges as fast as possible.
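To make the dynamics (2) concrete, the following minimal sketch integrates them with a forward-Euler scheme. The toy objective, the step sizes $\alpha < \beta$, and the discretization step are illustrative assumptions of this sketch, not choices made in the paper.

```python
# Toy objective f(x, y) = 0.5*x^2 + x*y - 0.5*y^2 (strongly convex in x,
# strongly concave in y); purely illustrative.
def grad_x(x, y):
    return x + y   # partial gradient of f w.r.t. x

def grad_y(x, y):
    return x - y   # partial gradient of f w.r.t. y

alpha, beta = 0.05, 0.5   # two-time-scale step sizes: y is the fast variable
dt = 0.01                 # forward-Euler discretization step
x, y = 2.0, -1.0          # arbitrary initial condition

for _ in range(20000):
    # Euler step for (2): dx/dt = -alpha*grad_x f, dy/dt = +beta*grad_y f.
    x, y = x - dt * alpha * grad_x(x, y), y + dt * beta * grad_y(x, y)

print(f"(x, y) after integration: ({x:.6f}, {y:.6f})")  # expect roughly (0, 0)
```

For this toy problem the unique saddle point is the origin, so both iterates should approach $(0, 0)$.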

Technical Approach. The convergence analysis of (2) studied in this paper is mainly motivated by the classic singular perturbation theory [69]. The main idea of our approach can be explained as follows. Since $y$ is implemented at a faster time scale than $x$, one can consider $x$ being fixed in the $y$-dynamic and separately study the stability of this system using Lyapunov theory. Let $G$ be the Lyapunov function corresponding to the fast dynamic. When $y$ converges to an equilibrium (e.g., a maximizer $y^{\star}(x)$ of $f(x, \cdot)$), one can fix $y = y^{\star}(x)$ and study the stability of the slow $x$-dynamic. Let $W$ be the corresponding Lyapunov function of this dynamic. We note that $W$ and $G$ both depend on $x$ and $y$; as a result, their time derivatives are coupled through the dynamics in (2). Addressing this coupling and the time-scale difference between the two dynamics is the key idea in our approach. To do that, we will consider the following Lyapunov function

$$V = W + c\,\eta\, G, \qquad (3)$$

where $\eta$ represents the time-scale difference, while the constant $c$ will be properly chosen to eliminate the impact of the fast dynamic on the convergence of the slow one and vice versa. Proper choices of these constants will also help us to derive the convergence rates of (2). A similar approach has been used in different settings of two-time-scale methods; see for example [53, 66].
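As a rough illustration of the coupling idea behind (3), one can track a weighted sum of a slow potential (the primal gap of $g(x) = \max_y f(x, y)$) and a fast potential (the squared distance of $y$ to its best response) along the simulated trajectory. These particular potentials are plausible stand-ins for $W$ and $G$ on the toy quadratic from the previous sketch; they are assumptions of this illustration, not the functions used in the paper's proofs.

```python
# Same toy objective as the previous sketch:
# f(x, y) = 0.5*x^2 + x*y - 0.5*y^2, with best response y*(x) = x and
# g(x) = max_y f(x, y) = x^2, minimized at x = 0.
def coupled_lyapunov(x, y, c=1.0):
    slow = x**2                # primal gap g(x) - min_x g(x)
    fast = 0.5 * (y - x)**2    # squared distance of y to y*(x)
    return slow + c * fast     # coupled function in the spirit of (3)

alpha, beta, dt = 0.05, 0.5, 0.01
x, y = 2.0, -1.0
values = [coupled_lyapunov(x, y)]
for _ in range(5000):
    x, y = x - dt * alpha * (x + y), y + dt * beta * (x - y)
    values.append(coupled_lyapunov(x, y))

# The coupled function decays along the trajectory (up to discretization
# error), which is exactly the property the analysis exploits.
print(values[0], values[-1])
```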

We conclude this section by introducing two assumptions used in our analysis below.

Assumption 1.

The function $f$ has Lipschitz continuous gradients in each variable, i.e., there exist positive constants $L_x$, $L_y$, and $L_{xy}$ such that for all $x_1, x_2 \in \mathbb{R}^d$ and $y_1, y_2 \in \mathbb{R}^n$ we have

$$\begin{aligned}
\|\nabla_x f(x_1, y_1) - \nabla_x f(x_2, y_2)\| &\le L_x \|x_1 - x_2\| + L_{xy} \|y_1 - y_2\|,\\
\|\nabla_y f(x_1, y_1) - \nabla_y f(x_2, y_2)\| &\le L_{xy} \|x_1 - x_2\| + L_y \|y_1 - y_2\|.
\end{aligned} \qquad (4)$$
Assumption 2.

Given any $x \in \mathbb{R}^d$, the problem $\max_{y} f(x, y)$ has a nonempty solution set, i.e., there exists $y^{\star}(x)$ such that $f(x, y^{\star}(x)) = \max_{y} f(x, y)$.

Objectives      | Prior Works | This Paper
PŁ & PŁ         | [36]        |
NCvex & PŁ      | [33]        |
NCvex & SCave   | [37]        |
SCvex & NCave   | [37]        |
Table 1: Convergence rates of GDAD for solving (1) given some accuracy $\epsilon$. The abbreviations NCvex, NCave, SCvex, SCave, and PŁ stand for nonconvex, nonconcave, strongly convex, strongly concave, and Polyak-Łojasiewicz conditions, respectively. The condition number $\kappa$ is defined in (11); the rate for the NCvex & PŁ case also depends on the size of the compact set used in [33].

3 Main Results

In this section, we present the main results of this paper, where we derive the convergence rates of GDAD under different conditions on the objective function $f$. Our results are summarized in Table 1. First, our approach improves the analysis in [36]: we show in Section 3.1 that for two-sided PŁ functions the convergence rate of GDAD scales more favorably with the condition number $\kappa$ than the rate established in [36]. Our result addresses the conjecture raised in [36], where the authors state that such an improvement may not be possible. Second, our analysis achieves a better result than the one in [33] for the case of one-sided PŁ functions. We note that a nested-loop method is studied in [33], while GDAD is a single-loop method. Finally, our result is the same as the one in [37] when $f$ is nonconvex in $x$ and strongly concave in $y$ for fixed $x$. In Section 3.4, we will show that this observation also holds when $f$ is strongly convex in $x$ and nonconcave in $y$. Note that, as compared to the analysis in [37], we use a simpler analysis and a simpler choice of step sizes to achieve these results.

3.1 Two-Sided Polyak–Łojasiewicz Conditions

We first study the convergence rates of GDAD when $f$ satisfies a two-sided Polyak–Łojasiewicz (PŁ) condition, which was considered in [36] and is stated here for convenience.

Definition 1 (Two-Sided PŁ Conditions).

A continuously differentiable function $f$ is said to satisfy the two-sided PŁ condition if there exist two positive constants $\mu_1$ and $\mu_2$ such that the following conditions hold for all $(x, y)$:

$$\|\nabla_x f(x, y)\|^2 \ge 2\mu_1 \bigl(f(x, y) - \min_{x'} f(x', y)\bigr) \quad \text{and} \quad \|\nabla_y f(x, y)\|^2 \ge 2\mu_2 \bigl(\max_{y'} f(x, y') - f(x, y)\bigr). \qquad (5)$$

The two-sided PŁ condition, which we will assume to hold in this subsection, is a generalized variant of the popular PŁ condition, proposed in [70] as a sufficient condition to guarantee that the classic gradient descent method converges exponentially to the optimal value of an unconstrained minimization problem. As shown in [71], the PŁ condition also implies the quadratic growth condition, i.e., for a function $h$ satisfying the PŁ condition with constant $\mu$ and any $x$ we have

$$h(x) - h^{\star} \ge \frac{\mu}{2}\, \|x - \mathcal{P}_{X^{\star}}(x)\|^2, \qquad (6)$$

where we assume that $X^{\star}$ is a nonempty solution set of $\min_x h(x)$ and $\mathcal{P}_{X^{\star}}(x)$ is the projection of $x$ onto this set. More discussion of the PŁ condition can be found in [71], while some examples of functions satisfying the two-sided PŁ condition are given in [36].
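To make the PŁ inequality tangible, the sketch below numerically checks it on a grid for $h(x) = x^2 + 3\sin^2(x)$, a standard example of a nonconvex function satisfying the PŁ condition; the constant $\mu = 1/32$ is the value usually quoted for this example in the literature, and both the function and the constant are assumptions of this illustration rather than material from this paper.

```python
import numpy as np

# h(x) = x^2 + 3*sin(x)^2 is nonconvex but satisfies the PL inequality
# (1/2)*|h'(x)|^2 >= mu*(h(x) - h*), with h* = 0 attained at x = 0.
def h(x):
    return x**2 + 3.0 * np.sin(x)**2

def h_prime(x):
    return 2.0 * x + 3.0 * np.sin(2.0 * x)  # since d/dx sin(x)^2 = sin(2x)

mu = 1.0 / 32.0
xs = np.linspace(-10.0, 10.0, 100001)
lhs = 0.5 * h_prime(xs)**2     # squared gradient term
rhs = mu * (h(xs) - 0.0)       # PL lower bound
print("PL inequality holds on the grid:", bool(np.all(lhs >= rhs)))
```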

Our focus in this section is to show that GDAD converges exponentially to the global min-max solution of (1) under the two-sided PŁ condition. To do that, we consider the following assumption and lemmas, which are useful for our analysis later. We first consider an assumption on the existence of a global min-max solution $(x^{\star}, y^{\star})$ of (1).

Assumption 3.

There exists a global min-max solution $(x^{\star}, y^{\star})$ of problem (1), i.e.,

$$f(x^{\star}, y) \le f(x^{\star}, y^{\star}) \le f(x, y^{\star}) \quad \text{for all } x \in \mathbb{R}^d,\ y \in \mathbb{R}^n.$$

Next, we consider the following lemma about the Lipschitz continuity of the gradient of $g(x) \triangleq \max_{y} f(x, y)$, which is a variant of the well-known Danskin lemma [72, Proposition B.25] and is studied in [33, Lemma A.5].

Lemma 1.

Suppose that Assumptions 1–3 hold. Then the function $g$ is differentiable and its gradient is Lipschitz continuous with some constant $L_g > 0$.
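Lemma 1 is what licenses treating $g(x) = \max_y f(x, y)$ as a smooth function of $x$ alone. The following sketch illustrates the Danskin-style gradient computation $\nabla g(x) = \nabla_x f(x, y^{\star}(x))$, with the inner maximizer approximated by gradient ascent; the toy objective and solver parameters are assumptions of this illustration.

```python
# Danskin-style evaluation of grad g(x) for g(x) = max_y f(x, y):
# once y*(x) is available, grad g(x) = grad_x f(x, y*(x)).
# Toy objective (illustrative): f(x, y) = 0.5*x^2 + x*y - 0.5*y^2,
# for which y*(x) = x and hence g(x) = x^2.
def grad_x(x, y):
    return x + y

def grad_y(x, y):
    return x - y

def best_response(x, steps=2000, lr=0.1):
    """Approximate y*(x) = argmax_y f(x, y) by gradient ascent."""
    y = 0.0
    for _ in range(steps):
        y += lr * grad_y(x, y)
    return y

def grad_g(x):
    return grad_x(x, best_response(x))

print(grad_g(1.5))  # g(x) = x^2 here, so this is approximately 2*1.5 = 3.0
```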

Finally, for our analysis we consider the following two Lyapunov functions

$$W(x) = g(x) - f(x^{\star}, y^{\star}), \qquad (7)$$

$$G(x, y) = g(x) - f(x, y), \qquad (8)$$

where it is easy to see that $W$ and $G$ are nonnegative. The time derivatives of $W$ and $G$ over the trajectories $x(t)$ and $y(t)$ are given in the following lemma, whose proof can be found in Section 4.1.

Lemma 2.

Suppose that Assumptions 1–3 hold. Then we have

(9)
(10)

As mentioned, the dynamics of $x$ and $y$ are implemented at different time scales, where this difference is often loosely defined as the ratio $\beta/\alpha$. To capture such a time-scale difference in our analysis, we will utilize the coupling Lyapunov function defined in (3). We denote by $L$ the largest of the Lipschitz constants in Assumption 1 and by $\kappa$ the condition number

$$\kappa = \frac{L}{\min\{\mu_1, \mu_2\}}, \qquad (11)$$

representing the condition number of $f$. The convergence rate of GDAD under the two-sided PŁ condition is formally stated in the following theorem.

Theorem 1.

Suppose that Assumptions 1–3 hold. Let $\alpha$ and $\beta$ be chosen as

(12)

Then we have, for all $t \ge 0$,

(13)
Proof.

By (5) we have

Thus, by using (9), (10), (3), and the preceding relation we have

(14)

Using (12) we have

which, when substituted into (14) and combined with (5), yields

where the last inequality is due to

Integrating both sides of the relation above immediately gives (13). ∎

3.2 Nonconvex–Polyak-Łojasiewicz Conditions

In this subsection, we consider an extension of the result studied in the previous section, where we assume that the objective function satisfies the Polyak-Łojasiewicz condition in $y$ given any $x$ and is nonconvex in $x$ given any $y$.

Assumption 4 (One-Sided PŁ Conditions).

We assume that $f$ is nonconvex in $x$ for any fixed $y$ and satisfies the PŁ condition in $y$ for any fixed $x$, that is, there exists a positive constant $\mu_2$ such that the following condition holds for any $(x, y)$:

$$\|\nabla_y f(x, y)\|^2 \ge 2\mu_2 \bigl(\max_{y'} f(x, y') - f(x, y)\bigr). \qquad (15)$$

Since $f$ satisfies only a one-sided PŁ condition, we give up the hope of finding a globally optimal solution of (1), as studied in Theorem 1. Instead, we will show that GDAD returns a stationary point of $g$, as studied in [33]. Note that under Assumption 2 the result in Lemma 1 still holds, since the work in [33] only assumes a one-sided PŁ condition. In addition, since we relax the two-sided PŁ condition, we introduce the following two Lyapunov functions for our analysis.

(16)
(17)

where it is easy to see that both functions are nonnegative. The time derivatives of these functions over the trajectories $x(t)$ and $y(t)$ are given in the following lemma, whose proof is presented in Section 4.2.

Lemma 3.

Suppose that Assumptions 1, 2, and 4 hold. Then we have

(18)
(19)

Similar to the previous subsection, we utilize the following coupling Lyapunov function

(20)

for some positive constant, which will be defined below. The convergence rate of GDAD under the nonconvex-PŁ condition is formally stated in the following theorem.

Theorem 2.

Suppose that Assumptions 1, 2, and 4 hold. Let $\alpha$ and $\beta$ be chosen as

(21)

Then we have, for all $t > 0$,

(22)
Proof.

By using (18), (19), and (20) we have

(23)

where in the last inequality we use (15) to obtain

Using (21) and the preceding relation we have

which, when substituted into (23), gives

Integrating both sides over $(0, t)$ for some $t > 0$ and rearranging, we obtain

which, by the nonnegativity of the Lyapunov functions and by using (21), gives

which concludes our proof. ∎

3.3 Nonconvex–Strongly Concave Conditions

In this subsection, we study the rate of GDAD when the function $f$ is nonconvex in $x$ given any $y$ and strongly concave in $y$ given any $x$. In particular, we consider the following assumption.

Assumption 5.

The objective function $f$ is nonconvex in $x$ for any given $y$ and strongly concave in $y$ with constant $\mu_2 > 0$ for any given $x$. The latter is equivalent to

$$\langle \nabla_y f(x, y_1) - \nabla_y f(x, y_2),\, y_1 - y_2 \rangle \le -\mu_2 \|y_1 - y_2\|^2 \quad \text{for all } x, y_1, y_2. \qquad (24)$$

For our analysis in this section, we introduce the following two Lyapunov functions

(25)
(26)

The time derivatives of these Lyapunov functions over the trajectories $x(t)$ and $y(t)$ are given in the following lemma, whose proof is presented in Section 4.3.

Lemma 4.

Suppose that Assumptions 1 and 5 hold. Then we have

(27)
(28)

We next derive the convergence rate of GDAD under Assumption 5 in the following theorem, where we show that GDAD converges sublinearly to a stationary point of problem (1).

Theorem 3.

Suppose that Assumptions 1 and 5 hold. Let $\alpha$ and $\beta$ be chosen as

(29)

Then we have, for all $t > 0$,

(30)
Proof.

By using (27) and (28) we consider

(31)

where in the last equality we use (29) to obtain

Using (29) one more time we obtain

which, when used in (31), gives

Integrating both sides from $0$ to $t$ and rearranging, we obtain

Thus, the preceding relation gives (30) for all $t > 0$. ∎
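As a sanity check of Theorem 3's message, the sketch below simulates the dynamics (2) on a toy nonconvex-strongly concave objective and monitors the gradient of $g(x) = \max_y f(x, y)$, which should vanish as $x(t)$ approaches a stationary point. The objective, step sizes, and horizon are assumptions of this illustration.

```python
import math

# Toy nonconvex-strongly concave objective (illustrative):
# f(x, y) = sin(x)*y - 0.5*y^2, strongly concave in y with modulus 1,
# nonconvex in x. Here y*(x) = sin(x) and g(x) = 0.5*sin(x)^2.
def grad_x(x, y):
    return math.cos(x) * y

def grad_y(x, y):
    return math.sin(x) - y

def grad_g(x):
    # Danskin: grad g(x) = grad_x f(x, y*(x)) = cos(x)*sin(x).
    return math.cos(x) * math.sin(x)

alpha, beta, dt = 0.1, 1.0, 0.01
x, y = 1.2, 0.0
for _ in range(100000):
    x, y = x - dt * alpha * grad_x(x, y), y + dt * beta * grad_y(x, y)

# x(t) should approach a stationary point of g (here, a minimizer with
# sin(x) = 0), so |grad g(x)| should be near zero.
print(x, abs(grad_g(x)))
```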

3.4 Strongly Convex–Nonconcave Conditions

As mentioned, the single-loop GDA method is applicable to the convex-nonconcave min-max problem, while the nested-loop GDA method is not. In this section, we complete our analysis by studying the rate of GDAD when the function $f$ is strongly convex in $x$ and (possibly) nonconcave in $y$.