# On Gradient Descent Ascent for Nonconvex-Concave Minimax Problems

We consider nonconvex-concave minimax problems, $\min_{\mathbf{x}} \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$, where $f$ is nonconvex in $\mathbf{x}$ but concave in $\mathbf{y}$. The standard algorithm for solving this problem is the celebrated gradient descent ascent (GDA) algorithm, which has been widely used in machine learning, control theory and economics. However, despite the solid theory for the convex-concave setting, GDA can converge to limit cycles or even diverge in a general setting. In this paper, we present a nonasymptotic analysis of GDA for solving nonconvex-concave minimax problems, showing that GDA can find a stationary point of the function $\Phi(\cdot) := \max_{\mathbf{y} \in \mathcal{Y}} f(\cdot, \mathbf{y})$ efficiently. To the best of our knowledge, this is the first theoretical guarantee for GDA in this setting, shedding light on its practical performance in many real applications.


## 1 Introduction

We consider the following minimax optimization problem:

$$\min_{\mathbf{x} \in \mathbb{R}^m} \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y}), \tag{1.1}$$

where $f$ is a smooth (possibly nonconvex in $\mathbf{x}$) function and $\mathcal{Y}$ is a convex set. Since von Neumann’s pioneering work in 1928 [36], the problem of finding the solution to problem (1.1) has been a major endeavor in mathematics, economics and computer science [4, 37, 46]. In recent years minimax optimization theory has begun to see applications in machine learning, including adversarial learning [15, 27], statistical learning [6, 47, 1, 13], certification of robustness in deep learning [43] and distributed computing [42, 28]. On the other hand, real-world machine-learning systems are often embedded in larger economic markets and subject to game-theoretic constraints [21].

The most widely used and seemingly the simplest algorithm to solve problem (1.1) is a natural generalization of gradient descent (GD). Known as gradient descent ascent (GDA), it alternates between gradient descent on the variable $\mathbf{x}$ and gradient ascent on the variable $\mathbf{y}$. There is a vast literature that applies GDA, and stochastic variants of GDA (SGDA), to problems in the form of (1.1) [15, 27, 43]. However, the theoretical understanding of the algorithm is still fairly limited. In particular, most of the asymptotic and nonasymptotic convergence results [22, 7, 33, 34, 11] are established for the special case of convex-concave problem (1.1), that is, $f$ is convex in $\mathbf{x}$ and concave in $\mathbf{y}$. Unlike the convex-concave case, for which the behavior of GDA has been investigated quite thoroughly, the convergence of GDA remains largely open in the general setting. More specifically, there is no shortage of work highlighting that GDA can converge to limit cycles or even diverge in a game-theoretic setting [5, 19, 9, 31]. Despite recent progress on solving general minimax optimization problems via a range of techniques [32, 8, 18, 2, 24, 30, 29], it remains unclear why GDA and SGDA often work well in various applications in which the objective is not convex-concave.

The following general structure arises in many applications: $f(\mathbf{x}, \cdot)$ is concave for any $\mathbf{x}$ and $\mathcal{Y}$ is a bounded set. For example, consider the problem of certifying robustness in deep learning [43]. Training a model is basically a nonconvex minimization problem in which the loss function refers to a neural network evaluated over data samples. Since neural networks are vulnerable to adversarial examples [16], it is necessary to develop efficient procedures with rigorous guarantees for small to moderate amounts of robustness. An example of such a scheme, involving the solution of a nonconvex-strongly-concave minimax problem, is presented in [43]. A second example is robust learning from multiple distributions [27]. Given multiple empirical distributions from an underlying true distribution, the goal is to introduce robustness by minimizing the maximum of expected loss over these distributions. This problem can also be posed as a nonconvex-concave minimax problem.

Despite the popularity of GDA and SGDA in practice, few results have been established on their efficiency beyond the convex-concave setting. Thus, a natural question arises:

Are GDA and SGDA provably efficient for solving nonconvex-concave minimax problems?

This paper presents an affirmative answer to this question. In particular, we first characterize stationary conditions for nonconvex-strongly-concave and nonconvex-concave minimax problems, respectively. For nonconvex-strongly-concave problems, GDA and SGDA return an $\epsilon$-stationary point within $O(\kappa^2 \epsilon^{-2})$ gradient evaluations and $O(\kappa^3 \epsilon^{-4})$ stochastic gradient evaluations, where $\kappa$ is a condition number. For nonconvex-concave problems, GDA and SGDA return an $\epsilon$-stationary point within $O(\epsilon^{-6})$ gradient evaluations and $O(\epsilon^{-8})$ stochastic gradient evaluations.

Technically, the concavity of $f(\mathbf{x}, \cdot)$ makes it computationally feasible to find the corresponding global maximum, $\mathbf{y}^\star(\mathbf{x})$, for any $\mathbf{x}$. A straightforward way to solve nonconvex-concave minimax problems is a class of nested-loop variants of GDA, which find $\mathbf{y}^\star(\mathbf{x}_t)$ for every iterate $\mathbf{x}_t$. We denote this as gradient descent with max-oracle (GDmax), and realize the max-oracle by gradient ascent on $\mathbf{y}$. We use GDmax and stochastic GDmax (SGDmax) as the baseline approaches in this paper; see Table 1 for the details.

The complexity analysis for GDmax and SGDmax can be decomposed into two parts: the number of gradient evaluations required by a max-oracle, and the number of iterations required to find a stationary point of $\Phi$. In contrast, the complexity analysis for GDA and SGDA is more challenging, as the iterate $\mathbf{y}_t$ is not guaranteed to be close to $\mathbf{y}^\star(\mathbf{x}_t)$, so that it becomes less clear why $\nabla_{\mathbf{x}} f(\mathbf{x}_t, \mathbf{y}_t)$ is a reasonable direction to follow. In response to this, we develop techniques to analyze concave optimization with an objective that changes slowly over the course of optimization, which may be of independent interest.

Related work: Historically, an early concrete instantiation of problem (1.1) involved computing a pair of probability vectors $(\mathbf{x}, \mathbf{y})$, or equivalently solving $\min_{\mathbf{x} \in \Delta^m} \max_{\mathbf{y} \in \Delta^n} \mathbf{x}^\top A \mathbf{y}$ for a matrix $A \in \mathbb{R}^{m \times n}$ and probability simplices $\Delta^m$ and $\Delta^n$. This so-called bilinear minimax problem together with von Neumann’s minimax theorem [36] was a cornerstone in the development of game theory. A general algorithm schema was developed for solving this problem in which the min and max players each run a simple learning procedure in tandem; e.g., the fictitious play [39]. Later, Sion [44] generalized von Neumann’s result from bilinear functions to general convex-concave functions, $\min_{\mathbf{x}} \max_{\mathbf{y}} f(\mathbf{x}, \mathbf{y}) = \max_{\mathbf{y}} \min_{\mathbf{x}} f(\mathbf{x}, \mathbf{y})$, and triggered a line of algorithmic research on convex-concave minimax optimization, in continuous time [23, 8] and discrete time [45, 14, 22, 34, 33]. Unfortunately, the techniques in these works rely heavily on the convex-concave structure and cannot be extended to nonconvex-concave minimax problems.

During the past decade, the study of general minimax problems has become a central topic in machine learning, inspired in part by the advent of adversarial learning [15, 27, 43]. Most recent work has focused on reducing oscillations and speeding up convergence of the gradient dynamics; see, e.g., consensus optimization [32], two-timescale GDA [18], the symplectic gradient [3] [2], optimistic mirror descent [30, 24], inexact proximal point algorithm [25] and the two-timescale algorithm [29]. Despite empirical successes in real applications, the existing convergence analysis of these methods is still limited: all of the existing global convergence results are asymptotic and require strong conditions on the problem structure.

Nonconvex-concave minimax problems appear to be a class of tractable problems in the form of problem (1.1) and have emerged as a focus in optimization and machine learning [43, 38, 41, 17, 26]. Grnarova et al. [17] proposed a variant of GDA for nonconvex-concave problems but did not provide theoretical guarantees for it. Rafique et al. [38] proposed a proximally guided stochastic mirror descent (PG-SMD) and proved that it finds an approximate stationary point of $\Phi$. However, PG-SMD is a nested-loop algorithm, and thus relatively complex to implement; one would like to know whether the nested-loop structure is necessary or whether GDA, a single-loop algorithm, can also be shown to converge in the nonconvex-strongly-concave setting. Such a convergence result has been established in the special case where $f(\mathbf{x}, \cdot)$ is a linear function [38]. Lu et al. [26] analyzed a variant of GDA for nonconvex-concave problems and provided a theoretical guarantee under a slightly different setting and a different notion of optimality. A class of inexact nonconvex SGD algorithms [43, 41] can be categorized as variants of one of the algorithms that we analyze here (SGDmax in Algorithm 2). We provide a theoretical guarantee for such algorithms in the general nonconvex-concave case.

## 2 Preliminaries

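Notation. We use bold lower-case letters to denote vectors, as in $\mathbf{x}, \mathbf{y}, \mathbf{z}$. We use $\|\cdot\|$ to denote the $\ell_2$-norm of vectors and the spectral norm of matrices. For a function $f: \mathbb{R}^n \to \mathbb{R}$, $\partial f(\mathbf{z})$ denotes the subdifferential of $f$ at $\mathbf{z}$. If $f$ is differentiable, then $\partial f(\mathbf{z}) = \{\nabla f(\mathbf{z})\}$ where $\nabla f(\mathbf{z})$ denotes the gradient of $f$ at $\mathbf{z}$, and $\nabla_{\mathbf{x}} f(\mathbf{z})$ denotes the partial gradient of $f$ with respect to $\mathbf{x}$ at $\mathbf{z}$. For a symmetric matrix $A \in \mathbb{R}^{n \times n}$, we denote the largest and smallest eigenvalues of $A$ as $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$. We use calligraphic upper-case letters to denote sets, as in $\mathcal{X}, \mathcal{Y}, \mathcal{Z}$.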

Before presenting the objectives in nonconvex-concave minimax optimization, we first describe some standard definitions.

###### Definition 2.1

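$f$ is $L$-Lipschitz if for $\forall \mathbf{x}, \mathbf{x}'$, we have $\|f(\mathbf{x}) - f(\mathbf{x}')\| \le L \|\mathbf{x} - \mathbf{x}'\|$.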

###### Definition 2.2

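$f$ is $\ell$-gradient Lipschitz if for $\forall \mathbf{x}, \mathbf{x}'$, we have $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{x}')\| \le \ell \|\mathbf{x} - \mathbf{x}'\|$.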

Intuitively, a function being Lipschitz means that the function values at two nearby points must also be close; a function being gradient Lipschitz means that the gradients at two nearby points must also be close. Recall that the minimax problem (1.1) is equivalent to the following minimization problem:

$$\min_{\mathbf{x} \in \mathbb{R}^m} \left\{ \Phi(\mathbf{x}) := \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y}) \right\}. \tag{2.1}$$

In this paper, we study the special case where $f(\mathbf{x}, \cdot)$ is either concave or strongly concave, so the maximization problem $\max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$ can be solved efficiently for any $\mathbf{x}$. However, since $\Phi$ is a nonconvex function, it is NP-hard to find its global minimum in general, even in the idealized setting in which the maximizer $\mathbf{y}^\star(\mathbf{x})$ for any $\mathbf{x}$ can be computed for free.

#### Objectives in this paper.

We begin by specifying a notion of a local surrogate for a global minimum.

###### Definition 2.3

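We call $\mathbf{x}$ an $\epsilon$-stationary point ($\epsilon \ge 0$) of a differentiable function $\Phi$ if $\|\nabla \Phi(\mathbf{x})\| \le \epsilon$. If $\epsilon = 0$, then $\mathbf{x}$ is called a stationary point.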

Unfortunately, even if $f$ is Lipschitz and gradient-Lipschitz, $\Phi$ need not be differentiable. A weaker condition that is sufficient for the purpose of our paper is the following notion of weak convexity.

###### Definition 2.4

Function $\Phi$ is $\ell$-weakly convex if the function $\Phi(\cdot) + \frac{\ell}{2}\|\cdot\|^2$ is convex.

In particular, when $\Phi$ is twice differentiable, $\Phi$ is $\ell$-gradient Lipschitz if and only if all the eigenvalues of the Hessian $\nabla^2 \Phi(\mathbf{x})$ are upper and lower bounded by $\ell$ and $-\ell$, while $\Phi$ is $\ell$-weakly convex if and only if all the eigenvalues of the Hessian $\nabla^2 \Phi(\mathbf{x})$ are lower bounded by $-\ell$.

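For any $\ell$-weakly convex function $\Phi$, its subdifferential $\partial \Phi$ can be uniquely determined by the subdifferential of $\Phi(\cdot) + \frac{\ell}{2}\|\cdot\|^2$. A naive measure of approximate stationarity can be defined as a point $\mathbf{x}$ such that at least one subgradient is small: $\min_{\xi \in \partial \Phi(\mathbf{x})} \|\xi\| \le \epsilon$.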

However, this criterion can be very restrictive when optimizing nonsmooth functions. For example, when $\Phi(\cdot) = |\cdot|$ is a one-dimensional function, an $\epsilon$-stationary point must be $0$ for any $\epsilon \in [0, 1)$. This means that finding an approximate stationary point under this notion is as difficult as solving the minimization exactly. An alternative criterion based on the Moreau envelope of $\Phi$ has been recognized as standard when $\Phi$ is weakly convex [10].

###### Definition 2.5

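Function $\Phi_\lambda$ is the Moreau envelope of $\Phi$ with parameter $\lambda > 0$ if $\Phi_\lambda(\mathbf{x}) = \min_{\mathbf{w}} \left\{ \Phi(\mathbf{w}) + \frac{1}{2\lambda}\|\mathbf{w} - \mathbf{x}\|^2 \right\}$ for any $\mathbf{x}$.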

###### Lemma 2.6

If $f$ is $\ell$-gradient Lipschitz and $\mathcal{Y}$ is bounded, then the Moreau envelope $\Phi_{1/2\ell}$ is differentiable, $\ell$-gradient Lipschitz, and $\ell$-strongly convex.

An $\epsilon$-stationary point of an $\ell$-weakly convex function can thus be alternatively defined as a point where the gradient of the Moreau envelope is small.

###### Definition 2.7

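We call $\mathbf{x}$ an $\epsilon$-stationary point ($\epsilon \ge 0$) of an $\ell$-weakly convex function $\Phi$ if $\|\nabla \Phi_{1/2\ell}(\mathbf{x})\| \le \epsilon$. If $\epsilon = 0$, then $\mathbf{x}$ is called a stationary point.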

We can also express Definition 2.7 in terms of the original function $\Phi$.

###### Lemma 2.8

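If $\mathbf{x}$ is an $\epsilon$-stationary point of an $\ell$-weakly convex function $\Phi$ (Definition 2.7), then there exists $\hat{\mathbf{x}}$ such that $\min_{\xi \in \partial \Phi(\hat{\mathbf{x}})} \|\xi\| \le \epsilon$ and $\|\mathbf{x} - \hat{\mathbf{x}}\| \le \epsilon / 2\ell$.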

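Lemma 2.8 shows that an $\epsilon$-stationary point defined by the Moreau envelope can be interpreted as a relaxation of the condition $\min_{\xi \in \partial \Phi(\mathbf{x})} \|\xi\| \le \epsilon$. More specifically, if $\mathbf{x}$ is an $\epsilon$-stationary point of an $\ell$-weakly convex function $\Phi$, it is close to a point $\hat{\mathbf{x}}$ which has a small subgradient.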
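To make the Moreau-envelope criterion concrete, the following minimal sketch evaluates the envelope and its gradient for the one-dimensional function $|\cdot|$, using the closed-form proximal map (soft-thresholding). The function names and the choice $\ell = 1$ are assumptions made for illustration only and do not come from the paper.

```python
def prox_abs(x: float, lam: float) -> float:
    """Proximal point of Phi(w) = |w|: argmin_w |w| + (w - x)^2 / (2 * lam)."""
    # Soft-thresholding is the closed-form minimizer for the absolute value.
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_env_abs(x: float, lam: float) -> float:
    """Moreau envelope Phi_lam(x) = min_w |w| + (w - x)^2 / (2 * lam)."""
    w = prox_abs(x, lam)
    return abs(w) + (w - x) ** 2 / (2.0 * lam)


def moreau_grad_abs(x: float, lam: float) -> float:
    """Gradient of the envelope, (x - prox(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam


if __name__ == "__main__":
    ell = 1.0              # hypothetical weak-convexity parameter
    lam = 1.0 / (2 * ell)  # envelope parameter 1 / (2 * ell)
    for x in (0.0, 0.05, 0.3, 2.0):
        print(x, moreau_env_abs(x, lam), moreau_grad_abs(x, lam))
```

Under these assumptions, the point $x = 0.05$ has an envelope gradient of magnitude $0.1$ even though every subgradient of $|\cdot|$ at that point has magnitude $1$, which illustrates why the envelope-based notion is less restrictive than the subgradient-based one.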

## 3 Main Results

In this section, we establish the nonasymptotic convergence rate of GD with a max-oracle (GDmax), SGD with a max-oracle (SGDmax), GDA and SGDA for nonconvex-strongly-concave minimax problems and nonconvex-concave minimax problems.

We present pseudocode for GDmax and SGDmax in Algorithms 1 and 2. For fixed $\mathbf{x}_t$, the max-oracle approximately solves $\max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}_t, \mathbf{y})$ at each iteration. Although GDmax and SGDmax are easier to understand, they have two disadvantages compared with GDA and SGDA: (1) Both GDmax and SGDmax are nested-loop algorithms. Since it is difficult to pre-determine the number of iterations for the inner loop, these algorithms are complex to implement in practice; (2) In the general setting where $f(\mathbf{x}, \cdot)$ is nonconcave, GDmax and SGDmax are inapplicable as we cannot efficiently find a global optimum. In contrast, GDA and SGDA are single-loop algorithms; see Algorithms 3 and 4.
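The difference between the nested-loop and single-loop schemes can be sketched in a few lines of Python. This is only an illustrative sketch and not the paper's Algorithms 1 and 3: the scalar toy objective $f(x, y) = y \sin(x) - y^2/2$ on $\mathcal{Y} = [-2, 2]$, the function names, and the step sizes are all assumptions chosen for the example.

```python
import math
from typing import Callable, Tuple

Grad = Callable[[float, float], float]


def project(y: float, lo: float, hi: float) -> float:
    """Euclidean projection onto the interval Y = [lo, hi]."""
    return min(max(y, lo), hi)


def gda(grad_x: Grad, grad_y: Grad, x0: float, y0: float,
        eta_x: float, eta_y: float, steps: int,
        y_lo: float, y_hi: float) -> Tuple[float, float]:
    """Single loop: one descent step on x and one projected ascent step on y."""
    x, y = x0, y0
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)  # both evaluated at (x_t, y_t)
        x = x - eta_x * gx
        y = project(y + eta_y * gy, y_lo, y_hi)
    return x, y


def gdmax(grad_x: Grad, grad_y: Grad, x0: float, y0: float,
          eta_x: float, eta_y: float, outer_steps: int, inner_steps: int,
          y_lo: float, y_hi: float) -> Tuple[float, float]:
    """Nested loop: approximate the max-oracle by projected gradient ascent."""
    x, y = x0, y0
    for _ in range(outer_steps):
        for _ in range(inner_steps):  # inner loop plays the role of the max-oracle
            y = project(y + eta_y * grad_y(x, y), y_lo, y_hi)
        x = x - eta_x * grad_x(x, y)
    return x, y


# Hypothetical toy objective f(x, y) = y * sin(x) - y^2 / 2, strongly concave in y.
def toy_grad_x(x: float, y: float) -> float:
    return y * math.cos(x)


def toy_grad_y(x: float, y: float) -> float:
    return math.sin(x) - y


if __name__ == "__main__":
    print(gda(toy_grad_x, toy_grad_y, 1.0, 0.0, 0.05, 0.5, 2000, -2.0, 2.0))
    print(gdmax(toy_grad_x, toy_grad_y, 1.0, 0.0, 0.05, 0.5, 2000, 20, -2.0, 2.0))
```

For this toy objective, $\Phi(x) = \frac{1}{2}\sin^2(x)$, so the quantity $|\sin(x)\cos(x)|$ measures stationarity of the returned $x$. Note that `gdmax` needs an extra parameter, the number of inner steps, which is the implementation burden mentioned above.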

For the stochastic gradient algorithm, we assume that the stochastic gradient oracle satisfies the following condition.

###### Assumption 3.1

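The stochastic gradient $G(\mathbf{x}, \mathbf{y}, \xi)$ is unbiased and has bounded variance $\sigma^2$. For $\forall \mathbf{x} \in \mathbb{R}^m, \mathbf{y} \in \mathcal{Y}$, we have $\mathbb{E}[G(\mathbf{x}, \mathbf{y}, \xi)] = \nabla f(\mathbf{x}, \mathbf{y})$ and $\mathbb{E}[\|G(\mathbf{x}, \mathbf{y}, \xi) - \nabla f(\mathbf{x}, \mathbf{y})\|^2] \le \sigma^2$.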

### 3.1 Nonconvex-Strongly-Concave Minimax Problem

In this subsection, we present the results for the nonconvex-strongly concave minimax problem. We make the following assumption.

###### Assumption 3.2

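The objective function and constraint set pair, $(f, \mathcal{Y})$, satisfy

1. $f$ is $\ell$-gradient Lipschitz and $f(\mathbf{x}, \cdot)$ is $\mu$-strongly concave for $\forall \mathbf{x}$.

2. $\mathcal{Y}$ is a convex and bounded set with a diameter $D \ge 0$.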

While the gradient-Lipschitz assumption is standard in the optimization literature, strong concavity is crucial here, along with the boundedness of $\mathcal{Y}$, allowing for an efficient solution of $\max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$. We let $\kappa = \ell / \mu$ denote the problem condition number throughout this section. The following structural lemma provides further information about $\Phi$ in the strongly-concave setting.

###### Lemma 3.3

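Under Assumption 3.2, $\Phi$ (as defined in Eq. (2.1)) is $2\kappa\ell$-gradient Lipschitz and $\nabla \Phi(\mathbf{x}) = \nabla_{\mathbf{x}} f(\mathbf{x}, \mathbf{y}^\star(\mathbf{x}))$, where $\mathbf{y}^\star(\cdot)$ is $\kappa$-Lipschitz.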
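The Lipschitz property of the maximizer in Lemma 3.3 can be checked numerically on a simple instance. The sketch below assumes a hypothetical scalar objective $f(x, y) = y \sin(x) - \frac{\mu}{2} y^2$ on $\mathcal{Y} = [-r, r]$, which is not taken from the paper. Its maximizer is the clipped value $\sin(x)/\mu$, and since the cross derivative $\cos(x)$ reaches magnitude $1$, any valid gradient-Lipschitz constant satisfies $\ell \ge 1$, so $\kappa \ge 1/\mu$.

```python
import math


def y_star(x: float, mu: float, radius: float) -> float:
    """Maximizer of y * sin(x) - (mu / 2) * y^2 over the interval [-radius, radius]."""
    return min(max(math.sin(x) / mu, -radius), radius)


def max_slope(mu: float, radius: float, n: int = 2000,
              lo: float = -3.0, hi: float = 3.0) -> float:
    """Largest finite-difference slope of y_star on a uniform grid over [lo, hi]."""
    xs = [lo + (hi - lo) * i / n for i in range(n + 1)]
    return max(
        abs(y_star(b, mu, radius) - y_star(a, mu, radius)) / (b - a)
        for a, b in zip(xs, xs[1:])
    )


if __name__ == "__main__":
    mu = 0.5
    # The observed slope should not exceed 1 / mu, which is at most kappa here.
    print(max_slope(mu, radius=10.0), 1.0 / mu)
```

The grid check is a sanity test under the stated assumptions rather than a proof; the proof of Lemma 3.3 in Appendix A gives the general argument.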

The target is to find an $\epsilon$-stationary point (cf. Definition 2.3) given only gradient (or stochastic gradient) access to $f$. Denoting $\Delta_\Phi = \Phi(\mathbf{x}_0) - \min_{\mathbf{x}} \Phi(\mathbf{x})$, we have the following complexity bound for GDmax.

###### Theorem 3.4 (Complexity Bound for GDmax)

Under Assumption 3.2, letting the step size and the tolerance for the max-oracle be and , the number of iterations required by Algorithm 1 to return an -stationary point is bounded by . Furthermore, the -accurate max-oracle can be realized by gradient ascent (GA) with the stepsize for iterations, which gives the total gradient complexity of the algorithm:

$$O\left(\frac{\kappa^2 \ell \Delta_\Phi}{\epsilon^2} \log\left(\frac{\ell D}{\epsilon}\right)\right).$$

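Theorem 3.4 shows that if we alternate between one step of gradient descent over $\mathbf{x}$ and multiple steps of gradient ascent over $\mathbf{y}$, with a pair of proper learning rates $(\eta_{\mathbf{x}}, \eta_{\mathbf{y}})$, we can find at least one $\epsilon$-stationary point of $\Phi$ within $O(\kappa^2 \epsilon^{-2} \log(1/\epsilon))$ gradient evaluations.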

We present similar guarantees when only stochastic gradients are available in the following theorem.

###### Theorem 3.5 (Complexity Bound for SGDmax)

Under Assumptions 3.1 and 3.2, letting the step size and the tolerance for the max-oracle be the same in Theorem 3.4 with the batch size , the number of iterations required by Algorithm 2 to return an -stationary point is bounded by . Furthermore, the -accurate max-oracle can be realized by mini-batch stochastic gradient ascent (SGA) with the step size and the mini-batch size for gradient evaluations, which gives the total gradient complexity of the algorithm:

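$$O\left(\frac{\kappa^2 \ell \Delta_\Phi}{\epsilon^2} \log\left(\frac{\ell D}{\epsilon}\right) \max\left\{1, \frac{\kappa \sigma^2}{\epsilon^2}\right\}\right).$$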

The sample size guarantees that the variance is less than so that the average stochastic gradients over the batch are sufficiently close to the true gradients and .
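The role of the batch size can be made explicit with a short calculation. As a sketch, assume the batch consists of $M$ i.i.d. samples $\xi_1, \ldots, \xi_M$ and write $G$ for the stochastic gradient oracle of Assumption 3.1 (the symbols $M$ and $G$ are notation assumed here for illustration). Since each term is unbiased and the samples are independent, the cross terms vanish in expectation and

$$\mathbb{E}\left[\left\|\frac{1}{M}\sum_{i=1}^{M} G(\mathbf{x}, \mathbf{y}, \xi_i) - \nabla f(\mathbf{x}, \mathbf{y})\right\|^2\right] = \frac{1}{M^2}\sum_{i=1}^{M} \mathbb{E}\left[\left\|G(\mathbf{x}, \mathbf{y}, \xi_i) - \nabla f(\mathbf{x}, \mathbf{y})\right\|^2\right] \le \frac{\sigma^2}{M}.$$

Under these assumptions, averaging over a batch of size $M$ reduces the variance bound from $\sigma^2$ to $\sigma^2 / M$, which is why a larger batch brings the averaged stochastic gradient closer to the true gradient.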

We proceed to provide a theoretical guarantee for the single-loop GDA and SGDA algorithms. Since an iterate $\mathbf{y}_t$ generated by GDA or SGDA can be far from the maximizer, $\mathbf{y}^\star(\mathbf{x}_t)$, it is nontrivial to identify a Lyapunov function that decreases monotonically. We thus need to devise a new proof technique (see Section 4 for details). Based on this technique, we are able to derive the following results for the complexity of the GDA and SGDA algorithms.

###### Theorem 3.6 (Complexity Bound for GDA)

Under Assumption 3.2, letting the step sizes and be chosen as and , the number of iterations required by Algorithm 3 to return an -stationary point is bounded by

$$O\left(\frac{\kappa^2 \ell \Delta_\Phi + \kappa \ell^2 D^2}{\epsilon^2}\right),$$

which is also the total gradient complexity of the algorithm.

###### Theorem 3.7 (Complexity Bound for SGDA)

Under Assumptions 3.1 and 3.2, let the step sizes and be the same in Theorem 3.6 with the batch size , the number of iterations required by Algorithm 4 to return an -stationary point is bounded by , which gives the total gradient complexity of the algorithm:

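$$O\left(\frac{\kappa^2 \ell \Delta_\Phi + \kappa \ell^2 D^2}{\epsilon^2} \max\left\{1, \frac{\kappa \sigma^2}{\epsilon^2}\right\}\right).$$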

Theorems 3.6 and 3.7 show that GDA and SGDA can find an $\epsilon$-stationary point with proper step sizes, and that the convergence rate matches that of GDmax and SGDmax up to a logarithmic factor. Moreover, GDA and SGDA are simpler and more practical than GDmax and SGDmax, which in practice require accurately determining the tolerance for the max-oracle.
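To compare the bounds in Theorems 3.4, 3.6 and 3.7 for concrete parameter values, one can evaluate the expressions inside the $O(\cdot)$ directly. The helper functions below are a hypothetical sketch: they ignore the constants hidden by the $O(\cdot)$ notation, so the outputs indicate scaling only and are not actual iteration counts.

```python
import math


def gdmax_bound(kappa: float, ell: float, delta_phi: float,
                diam: float, eps: float) -> float:
    """Expression inside O(.) in Theorem 3.4 (constants ignored)."""
    return kappa ** 2 * ell * delta_phi / eps ** 2 * math.log(ell * diam / eps)


def gda_bound(kappa: float, ell: float, delta_phi: float,
              diam: float, eps: float) -> float:
    """Expression inside O(.) in Theorem 3.6 (constants ignored)."""
    return (kappa ** 2 * ell * delta_phi + kappa * ell ** 2 * diam ** 2) / eps ** 2


def sgda_bound(kappa: float, ell: float, delta_phi: float,
               diam: float, eps: float, sigma: float) -> float:
    """Expression inside O(.) in Theorem 3.7 (constants ignored)."""
    batch_factor = max(1.0, kappa * sigma ** 2 / eps ** 2)
    return gda_bound(kappa, ell, delta_phi, diam, eps) * batch_factor


if __name__ == "__main__":
    args = dict(kappa=10.0, ell=1.0, delta_phi=1.0, diam=1.0, eps=0.1)
    print("GDmax:", gdmax_bound(**args))
    print("GDA:  ", gda_bound(**args))
    print("SGDA: ", sgda_bound(sigma=1.0, **args))
```

With zero noise the SGDA expression reduces to the GDA one, and the extra factor $\max\{1, \kappa\sigma^2/\epsilon^2\}$ accounts for the whole gap between the deterministic and stochastic bounds.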

The gradient complexity of GDA can be improved to with a proper initialization for . More specifically, the extra term in Theorem 3.6 can be removed by performing gradient ascent to find so that . It is unclear if the gradient complexity of GDmax could be improved by similar warm-start strategies. In fact, in GDmax is large so even though is close to , is possibly large. This means that gradient ascent in the max-oracle starts with a bad initial point. In contrast, is small in GDA and so is , leading to an automatically good initialization.

The ratio of learning rates for GDA is equivalent to that for GDmax times the number of gradient ascent steps in the inner loop up to the logarithmic factor. More specifically, the ratio is for GDA while the ratio is and the number of gradient ascent steps in the max-oracle is for GDmax. While [20] suggests that is necessary as in a general setting, our result is nonasymptotic with independent of , obtained by carefully exploiting the structure of the nonconvex-strongly-concave minimax problem.

### 3.2 Nonconvex-Concave Minimax Problems

In this subsection, we present the results for the nonconvex-concave minimax problem. The main assumption is the following.

###### Assumption 3.8

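The objective function and constraint set pair, $(f, \mathcal{Y})$, satisfy:

1. $f$ is $\ell$-gradient Lipschitz, $f(\cdot, \mathbf{y})$ is $L$-Lipschitz for $\forall \mathbf{y} \in \mathcal{Y}$ and $f(\mathbf{x}, \cdot)$ is concave for $\forall \mathbf{x}$.

2. $\mathcal{Y}$ is a convex and bounded set with a diameter $D \ge 0$.

Since $f(\mathbf{x}, \cdot)$ is only required to be concave for any $\mathbf{x}$, $\Phi$ is possibly not differentiable. Fortunately, the Lipschitz and gradient-Lipschitz assumptions guarantee that $\Phi$ is $\ell$-weakly convex and $L$-Lipschitz.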

###### Lemma 3.9

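Under Assumption 3.8, $\Phi$ is $\ell$-weakly convex and $L$-Lipschitz with $\nabla_{\mathbf{x}} f(\mathbf{x}, \mathbf{y}^\star(\mathbf{x})) \in \partial \Phi(\mathbf{x})$ where $\mathbf{y}^\star(\mathbf{x}) \in \arg\max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$.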

The target is to find an $\epsilon$-stationary point of a weakly convex function (Definition 2.7) given only gradient (or stochastic gradient) access to $f$. Denoting $\widehat{\Delta}_\Phi = \Phi_{1/2\ell}(\mathbf{x}_0) - \min_{\mathbf{x}} \Phi_{1/2\ell}(\mathbf{x})$, we present the gradient complexity for GDmax and SGDmax in the following two theorems.

###### Theorem 3.10 (Complexity Bound for GDmax)

Under Assumption 3.8, letting the step size and the tolerance for the max-oracle be and , the number of iterations required by Algorithm 1 to return an -stationary point is bounded by . Furthermore, the -accurate max-oracle is realized by GA with the step size for iterations, which gives the total gradient complexity of the algorithm:

$$O\left(\frac{\ell^3 L^2 D^2 \widehat{\Delta}_\Phi}{\epsilon^6}\right).$$

###### Theorem 3.11 (Complexity Bound for SGDmax)

Under Assumptions 3.1 and 3.8, letting the tolerance for the max-oracle be chosen as the same as in Theorem 3.10 with a step size and a batch size given by and , the number of iterations required by Algorithm 2 to return an -stationary point is bounded by . Furthermore, the -accurate max-oracle is realized by SGA with the step size and a batch size for iterations, which gives the following total gradient complexity of the algorithm:

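$$O\left(\frac{\ell^3 (L^2 + \sigma^2) D^2 \widehat{\Delta}_\Phi}{\epsilon^6} \max\left\{1, \frac{\sigma^2}{\epsilon^2}\right\}\right).$$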

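When $\sigma^2 \le \epsilon^2$, the stochastic gradients are sufficiently close to the true gradients $\nabla_{\mathbf{x}} f(\mathbf{x}_t, \mathbf{y}_t)$ and $\nabla_{\mathbf{y}} f(\mathbf{x}_t, \mathbf{y}_t)$, and the gradient complexity of SGDmax matches that of GDmax.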

We now provide a theoretical guarantee for the GDA and SGDA algorithms. While the complexity analysis for GDmax and SGDmax is nearly the same as that in subsection 3.1, the proof techniques for GDA and SGDA are quite different; see Section 4 for the details. In what follows, we present the gradient complexity of GDA and SGDA algorithms.

###### Theorem 3.12 (Complexity Bound for GDA)

Under Assumption 3.8, letting the step sizes and be chosen as and , the number of iterations required by Algorithm 3 to return an -stationary point is bounded by

$$O\left(\frac{\ell^2 L^2 \widehat{\Delta}_\Phi}{\epsilon^6}\right),$$

which is also the total gradient complexity of the algorithm.

###### Theorem 3.13 (Complexity Bound for SGDA)

Under Assumptions 3.1 and 3.8, letting the step sizes and , and a batch size be chosen as , and , the number of iterations required by Algorithm 4 to return an -stationary point is bounded by

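$$O\left(\frac{\ell^3 (L^2 + \sigma^2) D^2 \widehat{\Delta}_\Phi}{\epsilon^6} \max\left\{1, \frac{\sigma^2}{\epsilon^2}\right\}\right),$$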

which is also the total gradient complexity of the algorithm.

Theorem 3.12 shows that the ratio of learning rates for GDA equals that for GDmax times the number of gradient ascent in the max-oracle. depends on and tends to as . In contrast to [20], we obtain an nonasymptotic result that by exploiting the problem structure with new technique. We note our result does not contradict [12, Proposition 1], which shows that GDA diverges on a simple bilinear minimax problem. In fact, while in their example, is assumed to be compact in our setting, which together with a large ratio , prevents the divergence issue.

## 4 Overview of Proofs

In this section, we present the key ideas behind our theoretical results for GDA and SGDA. The main technical contribution is to develop new techniques for analyzing convex (or concave) optimization with an objective that changes slowly over the iterations. In particular, we focus on the complexity analysis for GDA in the nonconvex-strongly-concave and nonconvex-concave minimax settings (Theorems 3.6 and 3.12), and omit the proof overview for SGDA.

### 4.1 Nonconvex-Strongly-Concave Minimax Problems

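In the nonconvex-strongly-concave setting, Lemma 3.3 implies that $\Phi$ is gradient Lipschitz, and $\nabla \Phi(\mathbf{x}) = \nabla_{\mathbf{x}} f(\mathbf{x}, \mathbf{y}^\star(\mathbf{x}))$ where $\mathbf{y}^\star(\mathbf{x}) = \arg\max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$. This implies that, if we can find $\mathbf{y}^\star(\mathbf{x}_t)$ for each iterate $\mathbf{x}_t$, then we can just use the standard technique in nonconvex smooth optimization and provide an efficient guarantee for finding an $\epsilon$-stationary point (cf. Definition 2.3).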

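Unfortunately, this is not the case for GDA, where $\mathbf{y}_t \neq \mathbf{y}^\star(\mathbf{x}_t)$ in general. To overcome this difficulty, the high-level idea in our proof is to control a pair of learning rates that force $\{\mathbf{x}_t\}$ to move more slowly than $\{\mathbf{y}_t\}$. More specifically, Lemma 3.3 guarantees that $\mathbf{y}^\star(\cdot)$ is $\kappa$-Lipschitz:

$$\|\mathbf{y}^\star(\mathbf{x}_1) - \mathbf{y}^\star(\mathbf{x}_2)\| \le \kappa \|\mathbf{x}_1 - \mathbf{x}_2\|, \quad \forall \mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^m.$$

That is, if $\mathbf{x}_t$ changes slowly, then $\mathbf{y}^\star(\mathbf{x}_t)$ also changes slowly. This allows us to perform gradient ascent on a slowly changing strongly-concave function $f(\mathbf{x}_t, \cdot)$, guaranteeing that $\|\mathbf{y}_t - \mathbf{y}^\star(\mathbf{x}_t)\|$ is small in an amortized sense.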

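More precisely, letting the error be $\delta_t = \|\mathbf{y}^\star(\mathbf{x}_t) - \mathbf{y}_t\|^2$, Lemma B.3 implies that $\delta_t$ comes into the standard analysis of nonconvex smooth optimization via the final term in the following inequality:

$$\Phi(\mathbf{x}_{T+1}) - \Phi(\mathbf{x}_0) \le -\Omega(\eta_{\mathbf{x}}) \sum_{t=0}^{T} \|\nabla \Phi(\mathbf{x}_t)\|^2 + O(\eta_{\mathbf{x}} \ell^2) \sum_{t=0}^{T} \delta_t.$$

The remaining step is to show that the additional error term (the second term on the right-hand side) is always small compared to the first term on the right-hand side. This is done via a recursion for $\delta_t$ (cf. Lemma B.2):

$$\delta_t \le \gamma \delta_{t-1} + \beta \|\nabla \Phi(\mathbf{x}_{t-1})\|^2,$$

where $\gamma < 1$ and $\beta$ is small. Therefore, $\delta_t$ has a linear contraction and can be well controlled.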
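To see why this recursion keeps the accumulated error under control, it helps to unroll it. The following is a sketch that assumes $0 \le \gamma < 1$ and $\beta \ge 0$ and ignores the constants from Lemmas B.2 and B.3. Applying the recursion repeatedly gives

$$\delta_t \le \gamma^t \delta_0 + \beta \sum_{j=0}^{t-1} \gamma^{t-1-j} \|\nabla \Phi(\mathbf{x}_j)\|^2,$$

and summing over $t = 0, \ldots, T$ while bounding each geometric series by $1/(1-\gamma)$ yields

$$\sum_{t=0}^{T} \delta_t \le \frac{\delta_0}{1-\gamma} + \frac{\beta}{1-\gamma} \sum_{t=0}^{T-1} \|\nabla \Phi(\mathbf{x}_t)\|^2.$$

Under these assumptions, the accumulated error is at most a constant plus a multiple $\beta/(1-\gamma)$ of the accumulated squared gradient norms, so when $\beta/(1-\gamma)$ is small enough the error term can be absorbed into the descent term.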

### 4.2 Nonconvex-Concave Minimax Problems

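In the nonconvex-concave case, the main idea is again to control a pair of learning rates to force $\{\mathbf{x}_t\}$ to move more slowly than $\{\mathbf{y}_t\}$. Different from the setting in the last subsection, $f(\mathbf{x}, \cdot)$ is only guaranteed to be concave and $\mathbf{y}^\star(\cdot)$ is possibly not Lipschitz or even uniquely defined. This means that, even if $\mathbf{x}_1, \mathbf{x}_2$ are extremely close, $\mathbf{y}^\star(\mathbf{x}_1)$ can be dramatically different from $\mathbf{y}^\star(\mathbf{x}_2)$. Therefore, $\|\mathbf{y}_t - \mathbf{y}^\star(\mathbf{x}_t)\|$ is no longer a viable error to control.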

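Fortunately, Lemma 3.9 implies that $\Phi$ is Lipschitz. This implies that, when the learning rate $\eta_{\mathbf{x}}$ is very small, the maximum function value changes slowly:

$$|\Phi(\mathbf{x}_t) - \Phi(\mathbf{x}_{t-1})| \le L \|\mathbf{x}_t - \mathbf{x}_{t-1}\| \le \eta_{\mathbf{x}} L^2.$$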
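The second inequality follows from the form of the descent step. As a sketch, assume the GDA update on the minimization variable is $\mathbf{x}_t = \mathbf{x}_{t-1} - \eta_{\mathbf{x}} \nabla_{\mathbf{x}} f(\mathbf{x}_{t-1}, \mathbf{y}_{t-1})$ (Algorithm 3 is not reproduced here, so this update form is an assumption). Since $f(\cdot, \mathbf{y})$ is $L$-Lipschitz under Assumption 3.8, its partial gradient has norm at most $L$, and therefore

$$\|\mathbf{x}_t - \mathbf{x}_{t-1}\| = \eta_{\mathbf{x}} \|\nabla_{\mathbf{x}} f(\mathbf{x}_{t-1}, \mathbf{y}_{t-1})\| \le \eta_{\mathbf{x}} L.$$

Combining this with the $L$-Lipschitz property of $\Phi$ gives the factor $\eta_{\mathbf{x}} L^2$.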

Again, this allows us to perform gradient ascent on concave functions that change slowly in terms of maximum function value, and guarantees that $\Delta_t = \Phi(\mathbf{x}_t) - f(\mathbf{x}_t, \mathbf{y}_t)$ is small in an amortized sense. Indeed, Lemma C.1 implies that

$$\Phi_{1/2\ell}(\mathbf{x}_{T+1}) - \Phi_{1/2\ell}(\mathbf{x}_0) \le -\Omega(\eta_{\mathbf{x}}) \sum_{t=0}^{T} \|\nabla \Phi_{1/2\ell}(\mathbf{x}_t)\|^2 + O(\eta_{\mathbf{x}}^2 \ell L^2)(T+1) + O(\eta_{\mathbf{x}} \ell) \sum_{t=0}^{T} \Delta_t,$$

where the last term on the right-hand side is the error term additional to the standard analysis in nonconvex nonsmooth optimization. The goal of the analysis is again to show that the error term is small compared to the sum of the first two terms on the right-hand side.

To bound the error term, the standard analysis in convex optimization (where the optimal point does not change) uses the following inequalities and a telescoping argument:

 (4.1)

The major challenge here is that the optimal points can change dramatically, and the telescoping argument does not go through. An important observation is, however, that (4.1) can also be proved if we replace the optimal point on the right-hand side by $\mathbf{y}^\star(\mathbf{x}_s)$, while paying an additional cost that depends on the difference in function value between the iterates. More specifically, we pick a block of iterations and show in Lemma C.2 that, for any $t$ in a block starting at index $s$, the following statement holds:

$$\Delta_{t-1} \le O(\eta_{\mathbf{x}} L^2)(t - 1 - s) + O(\ell)\left(\|\mathbf{y}_t - \mathbf{y}^\star(\mathbf{x}_s)\|^2 - \|\mathbf{y}_{t+1} - \mathbf{y}^\star(\mathbf{x}_s)\|^2\right).$$

We perform the analysis on blocks in which the concave problems are similar, so that the telescoping argument goes through. By carefully choosing the block size, the error term can also be well controlled.

## 5 Conclusions

We have presented a theoretical complexity analysis for GDA and SGDA in the setting of nonconvex-strongly-concave and nonconvex-concave minimax problems. We characterize the stationarity conditions in both settings and prove that GDA and SGDA return an $\epsilon$-stationary point within $O(\kappa^2 \epsilon^{-2})$ gradient and $O(\kappa^3 \epsilon^{-4})$ stochastic gradient evaluations for the nonconvex-strongly-concave minimax problems, and $O(\epsilon^{-6})$ gradient and $O(\epsilon^{-8})$ stochastic gradient evaluations for the nonconvex-concave minimax problems. Moreover, we analyze GDmax and SGDmax based on a max-oracle at each iteration, providing a complete complexity analysis. Future directions include the investigation of lower bounds for solving minimax problems and obtaining theoretical guarantees for GDA in a still wider range of problems.

## Appendix A Proof of Technical Lemmas

In this section, we provide complete proofs for the lemmas in Section 2 and Section 3.

### A.1 Proof of Lemma 2.6

We provide a proof for an expanded version of Lemma 2.6.

###### Lemma A.1

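If $f$ is $\ell$-gradient Lipschitz and $\mathcal{Y}$ is bounded, we have

1. $\Phi_{1/2\ell}(\mathbf{x})$ and $\mathrm{prox}_{\Phi/2\ell}(\mathbf{x})$ are well-defined for $\forall \mathbf{x} \in \mathbb{R}^m$.

2. $\Phi(\mathrm{prox}_{\Phi/2\ell}(\mathbf{x})) \le \Phi(\mathbf{x})$ for $\forall \mathbf{x} \in \mathbb{R}^m$.

3. $\Phi_{1/2\ell}$ is $\ell$-gradient Lipschitz with $\nabla \Phi_{1/2\ell}(\mathbf{x}) = 2\ell(\mathbf{x} - \mathrm{prox}_{\Phi/2\ell}(\mathbf{x}))$.

4. $\Phi_{1/2\ell}(\mathbf{x}') - \Phi_{1/2\ell}(\mathbf{x}) - (\mathbf{x}' - \mathbf{x})^\top \nabla \Phi_{1/2\ell}(\mathbf{x}) \le \frac{\ell}{2}\|\mathbf{x}' - \mathbf{x}\|^2$ for $\forall \mathbf{x}, \mathbf{x}' \in \mathbb{R}^m$.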

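Proof. By the definition of $\Phi$, we have

$$\Psi(\mathbf{x}) \doteq \Phi(\mathbf{x}) + \frac{\ell}{2}\|\mathbf{x}\|^2 = \max_{\mathbf{y} \in \mathcal{Y}} \left\{ f(\mathbf{x}, \mathbf{y}) + \frac{\ell}{2}\|\mathbf{x}\|^2 \right\}.$$

Since $f$ is $\ell$-gradient Lipschitz, $f(\mathbf{x}, \mathbf{y}) + \frac{\ell}{2}\|\mathbf{x}\|^2$ is convex in $\mathbf{x}$ for $\forall \mathbf{y} \in \mathcal{Y}$. Since $\mathcal{Y}$ is bounded, Danskin’s theorem [40] implies that $\Psi$ is convex. Putting these pieces together yields that $\Phi(\mathbf{w}) + \ell\|\mathbf{w} - \mathbf{x}\|^2$ is $\ell$-strongly convex in $\mathbf{w}$. This implies that $\Phi_{1/2\ell}(\mathbf{x})$ and $\mathrm{prox}_{\Phi/2\ell}(\mathbf{x})$ are well-defined. Furthermore, by the definition of $\mathrm{prox}_{\Phi/2\ell}(\mathbf{x})$, we have

$$\Phi(\mathrm{prox}_{\Phi/2\ell}(\mathbf{x})) \le \Phi_{1/2\ell}(\mathbf{x}) \le \Phi(\mathbf{x}), \quad \forall \mathbf{x} \in \mathbb{R}^m.$$

Moreover, [10, Lemma 2.2] implies that $\Phi_{1/2\ell}$ is $\ell$-gradient Lipschitz with

$$\nabla \Phi_{1/2\ell}(\mathbf{x}) = 2\ell\left(\mathbf{x} - \mathrm{prox}_{\Phi/2\ell}(\mathbf{x})\right).$$

Finally, it follows from [35, Theorem 2.1.5] that $\Phi_{1/2\ell}$ satisfies the last inequality.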

### A.2 Proof of Lemma 2.8

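Denoting $\hat{\mathbf{x}} := \mathrm{prox}_{\Phi/2\ell}(\mathbf{x})$, we have $\nabla \Phi_{1/2\ell}(\mathbf{x}) = 2\ell(\mathbf{x} - \hat{\mathbf{x}})$ (cf. Lemma 2.6) and hence

$$\|\hat{\mathbf{x}} - \mathbf{x}\| = \frac{\|\nabla \Phi_{1/2\ell}(\mathbf{x})\|}{2\ell}.$$

Furthermore, the optimality condition for $\mathrm{prox}_{\Phi/2\ell}(\mathbf{x})$ implies that $2\ell(\mathbf{x} - \hat{\mathbf{x}}) \in \partial \Phi(\hat{\mathbf{x}})$. Putting these pieces together yields that $\min_{\xi \in \partial \Phi(\hat{\mathbf{x}})} \|\xi\| \le \|\nabla \Phi_{1/2\ell}(\mathbf{x})\|$.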

### A.3 Proof of Lemma 3.3

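Since $f(\mathbf{x}, \mathbf{y})$ is strongly concave in $\mathbf{y}$ for $\forall \mathbf{x}$, $\mathbf{y}^\star(\mathbf{x})$ is unique and well-defined. Then we claim that $\mathbf{y}^\star(\cdot)$ is $\kappa$-Lipschitz. Indeed, let $\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^m$; the optimality of $\mathbf{y}^\star(\mathbf{x}_1)$ and $\mathbf{y}^\star(\mathbf{x}_2)$ implies that

$$(\mathbf{y} - \mathbf{y}^\star(\mathbf{x}_1))^\top \nabla_{\mathbf{y}} f(\mathbf{x}_1, \mathbf{y}^\star(\mathbf{x}_1)) \le 0, \quad \forall \mathbf{y} \in \mathcal{Y}, \tag{A.1}$$

$$(\mathbf{y} - \mathbf{y}^\star(\mathbf{x}_2))^\top \nabla_{\mathbf{y}} f(\mathbf{x}_2, \mathbf{y}^\star(\mathbf{x}_2)) \le 0, \quad \forall \mathbf{y} \in \mathcal{Y}. \tag{A.2}$$

Letting $\mathbf{y} = \mathbf{y}^\star(\mathbf{x}_2)$ in (A.1) and $\mathbf{y} = \mathbf{y}^\star(\mathbf{x}_1)$ in (A.2) and summing the resulting two inequalities yields

$$(\mathbf{y}^\star(\mathbf{x}_2) - \mathbf{y}^\star(\mathbf{x}_1))^\top \left(\nabla_{\mathbf{y}} f(\mathbf{x}_1, \mathbf{y}^\star(\mathbf{x}_1)) - \nabla_{\mathbf{y}} f(\mathbf{x}_2, \mathbf{y}^\star(\mathbf{x}_2))\right) \le 0. \tag{A.3}$$

Recalling that $f(\mathbf{x}_1, \cdot)$ is $\mu$-strongly concave, we have

$$(\mathbf{y}^\star(\mathbf{x}_2) - \mathbf{y}^\star(\mathbf{x}_1))^\top \left(\nabla_{\mathbf{y}} f(\mathbf{x}_1, \mathbf{y}^\star(\mathbf{x}_2)) - \nabla_{\mathbf{y}} f(\mathbf{x}_1, \mathbf{y}^\star(\mathbf{x}_1))\right) + \mu \|\mathbf{y}^\star(\mathbf{x}_2) - \mathbf{y}^\star(\mathbf{x}_1)\|^2 \le 0. \tag{A.4}$$

Then we conclude the desired result by combining (A.3), (A.4) and the fact that $f$ is $\ell$-gradient Lipschitz, i.e.,

$$\mu \|\mathbf{y}^\star(\mathbf{x}_2) - \mathbf{y}^\star(\mathbf{x}_1)\|^2 \le (\mathbf{y}^\star(\mathbf{x}_2) - \mathbf{y}^\star(\mathbf{x}_1))^\top \left(\nabla_{\mathbf{y}} f(\mathbf{x}_2, \mathbf{y}^\star(\mathbf{x}_2)) - \nabla_{\mathbf{y}} f(\mathbf{x}_1, \mathbf{y}^\star(\mathbf{x}_2))\right) \le \ell \|\mathbf{y}^\star(\mathbf{x}_2) - \mathbf{y}^\star(\mathbf{x}_1)\| \|\mathbf{x}_2 - \mathbf{x}_1\|.$$

Finally, since $\mathbf{y}^\star(\mathbf{x})$ is unique and $\mathcal{Y}$ is convex and bounded, we conclude from Danskin’s theorem [40] that $\Phi$ is differentiable with $\nabla \Phi(\mathbf{x}) = \nabla_{\mathbf{x}} f(\mathbf{x}, \mathbf{y}^\star(\mathbf{x}))$. Since $\nabla \Phi(\mathbf{x}) = \nabla_{\mathbf{x}} f(\mathbf{x}, \mathbf{y}^\star(\mathbf{x}))$, we have

$$\|\nabla \Phi(\mathbf{x}) - \nabla \Phi(\mathbf{x}')\| = \|\nabla_{\mathbf{x}} f(\mathbf{x}, \mathbf{y}^\star(\mathbf{x})) - \nabla_{\mathbf{x}} f(\mathbf{x}', \mathbf{y}^\star(\mathbf{x}'))\| \le \ell \left(\|\mathbf{x} - \mathbf{x}'\| + \|\mathbf{y}^\star(\mathbf{x}) - \mathbf{y}^\star(\mathbf{x}')\|\right).$$

Since $\mathbf{y}^\star(\cdot)$ is $\kappa$-Lipschitz, we conclude the desired result by plugging in $\|\mathbf{y}^\star(\mathbf{x}) - \mathbf{y}^\star(\mathbf{x}')\| \le \kappa \|\mathbf{x} - \mathbf{x}'\|$. Since $\kappa \ge 1$, $\Phi$ is $2\kappa\ell$-gradient Lipschitz. The last inequality follows from [35, Theorem 2.1.5].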

### A.4 Proof of Lemma 3.9

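By the proof in Lemma A.1, $\Phi$ is $\ell$-weakly convex and $\partial \Phi(\mathbf{x}) = \partial \Psi(\mathbf{x}) - \ell \mathbf{x}$ where

$$\Psi(\mathbf{x}) = \max_{\mathbf{y} \in \mathcal{Y}} \left\{ f(\mathbf{x}, \mathbf{y}) + \frac{\ell}{2}\|\mathbf{x}\|^2 \right\}.$$

Since $f(\mathbf{x}, \mathbf{y}) + \frac{\ell}{2}\|\mathbf{x}\|^2$ is convex in $\mathbf{x}$ for $\forall \mathbf{y} \in \mathcal{Y}$ and $\mathcal{Y}$ is bounded, Danskin’s theorem [40] implies that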