# Uniform Convergence and Generalization for Nonconvex Stochastic Minimax Problems

This paper studies the uniform convergence and generalization bounds for nonconvex-(strongly)-concave (NC-SC/NC-C) stochastic minimax optimization. We first establish the uniform convergence between the empirical minimax problem and the population minimax problem and show the $\tilde{\mathcal{O}}(d\kappa^2\epsilon^{-2})$ and $\tilde{\mathcal{O}}(d\epsilon^{-4})$ sample complexities respectively for the NC-SC and NC-C settings, where $d$ is the dimension number and $\kappa$ is the condition number. To the best of our knowledge, this is the first uniform convergence result measured by the first-order stationarity in stochastic minimax optimization. Based on the uniform convergence, we shed light on the sample and gradient complexities required for finding an approximate stationary point for stochastic minimax optimization in the NC-SC and NC-C settings.


## 1 Introduction

In this paper, we consider nonconvex stochastic minimax problems:

$$\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}}\; F(x,y) \triangleq \mathbb{E}_{\xi}\left[f(x,y;\xi)\right], \tag{1}$$

where $\mathcal{X} \subseteq \mathbb{R}^{d}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d'}$ ($d, d' \in \mathbb{N}_{+}$) are two nonempty closed convex sets, $\xi$ is a random variable following an unknown distribution $\mathcal{D}$, and $f(x, y; \xi)$ is continuously differentiable and Lipschitz smooth jointly in $x$ and $y$ for any $\xi$. We denote the objective (1) as the population minimax problem. Throughout the paper, we focus on the case where $F$ is nonconvex in $x$ and (strongly) concave in $y$, i.e., nonconvex-(strongly)-concave (NC-SC / NC-C). Such problems widely appear in practical applications like adversarial training (madry2018towards; wang2019convergence), generative adversarial networks (goodfellow2014generative; sanjabi2018convergence; lei2020sgd), reinforcement learning (dai2017learning; dai2018sbeed; huang2020convergence), and robust training (sinha2018certifying). The distribution $\mathcal{D}$ is often unknown and one generally only has access to a dataset $S = \{\xi_1, \ldots, \xi_n\}$ consisting of $n$ i.i.d. samples from $\mathcal{D}$ and instead solves the following empirical minimax problem:

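$$\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}}\; F_S(x,y) \triangleq \frac{1}{n}\sum_{i=1}^{n} f(x,y;\xi_i). \tag{2}$$

Since the functions $F$ and $F_S$ are nonconvex in $x$ and pursuing their global optimal solutions is intractable in general, one instead aims to design an algorithm $\mathcal{A}$ that finds an $\epsilon$-stationary point,

$$\|\nabla\Phi(\mathcal{A}_x(S))\| \le \epsilon \quad\text{or}\quad \mathrm{dist}\big(0, \partial\Phi(\mathcal{A}_x(S))\big) \le \epsilon, \tag{3}$$

where $\Phi(x) \triangleq \max_{y\in\mathcal{Y}} F(x,y)$ and $\Phi_S(x) \triangleq \max_{y\in\mathcal{Y}} F_S(x,y)$ are primal functions, $\mathcal{A}_x(S)$ is the $x$-component of the output of any algorithm $\mathcal{A}$ for solving (2), and $\partial\Phi$ is the (Fréchet) subdifferential of $\Phi$. When $\Phi$ is nonsmooth, we resort to the gradient norm of its Moreau envelope to measure the first-order stationarity as it provides an upper bound on $\mathrm{dist}(0, \partial\Phi)$ (davis2019stochastic).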

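Take the NC-SC setting as an example. The optimization error for solving the population minimax problem (1) consists of two terms (here, for simplicity of illustration, we assume there is no constraint and the primal functions are differentiable; the detailed setting is formally introduced in Section 2):

$$\mathbb{E}\|\nabla\Phi(\mathcal{A}_x(S))\| \le \mathbb{E}\|\nabla\Phi_S(\mathcal{A}_x(S))\| + \mathbb{E}\|\nabla\Phi(\mathcal{A}_x(S)) - \nabla\Phi_S(\mathcal{A}_x(S))\|, \tag{4}$$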

where the first term on the right-hand side corresponds to the optimization error of solving the empirical minimax problem (2) and the second term corresponds to the generalization error. Such a decomposition of the gradient norm has been studied recently in nonconvex minimization, e.g., foster2018uniform; mei2018landscape; davis2022graphical. Recently, a line of work has developed efficient algorithms for solving the empirical minimax problem, which gives a hint on the optimization error; see, e.g., (luo2020stochastic; yang2020catalyst), just to list a few. However, a full characterization of the generalization error is still lacking.

Characterizing the generalization error is not easy as both $\Phi_S$ and $\mathcal{A}_x(S)$ depend on the dataset $S$, which induces some correlation. One way to address this dependence issue in generalization bounds is to establish the stability argument of specific algorithms in stochastic optimization (bousquet2002stability; shalev2010learnability; hardt2016train) and stochastic minimax optimization (farnia2021train; lei2021stability; boob2021optimal; yang2022differentially). However, these stability-based generalization bounds have several drawbacks:

1. Generally they require a case-by-case analysis for different algorithms, i.e., these bounds are algorithm-dependent.

2. Existing stability analysis only applies to simple gradient-based algorithms for minimization and minimax problems (note that for minimax optimization, simple algorithms such as stochastic gradient descent ascent often turn out to be suboptimal), and such analysis can be difficult to generalize to more sophisticated state-of-the-art algorithms.

3. Existing stability analysis generally requires specific parameters (e.g., stepsizes), which may misalign with those required for convergence analysis, thus making the generalization bounds less informative.

4. Existing stability-based generalization bounds generally use function value-based gap as the measurement of the algorithm, which may not be suitable concerning the nonconvex landscape.

To the best of our knowledge, there are no generalization bound results measured by the first-order stationarity in nonconvex minimax optimization.

To overcome these difficulties, we aim to derive generalization bounds via establishing the uniform convergence between the empirical minimax and the population minimax problem, i.e., $\mathbb{E}\left[\max_{x\in\mathcal{X}}\|\nabla\Phi(x) - \nabla\Phi_S(x)\|\right]$. Note that uniform convergence is invariant to the choice of algorithms and provides an upper bound on the generalization error for any $\mathcal{A}_x(S) \in \mathcal{X}$, thus the derived generalization bound is algorithm-agnostic. Although uniform convergence has been extensively studied in the literature of stochastic optimization (kleywegt2002sample; mei2018landscape; davis2022graphical), a key difference in uniform convergence for minimax optimization is that the primal function cannot be written as the average over $n$ i.i.d. random functions and one needs to additionally characterize the differences between the maximizers $y^*(x) \triangleq \arg\max_{y\in\mathcal{Y}} F(x,y)$ and $y_S^*(x) \triangleq \arg\max_{y\in\mathcal{Y}} F_S(x,y)$. Thus techniques in uniform convergence for classical stochastic optimization are not directly applicable.

We are interested in both the sample complexity and gradient complexity for achieving stationarity convergence of the population minimax problem (1). Here the sample complexity refers to the number of samples $n$, and the gradient complexity refers to the number of gradient evaluations of $f$. Combining the derived generalization error with the optimization error from existing algorithms in finite-sum nonconvex minimax optimization, e.g., luo2020stochastic; yang2020catalyst; zhang2021complexity, one automatically obtains the sample and gradient complexity bounds of these algorithms for solving the population minimax problem.

### 1.1 Contributions

Our contributions are two-fold:

• We establish the first uniform convergence results between the population and the empirical nonconvex minimax optimization in NC-SC and NC-C settings, measured by the gradients of the primal functions (or their Moreau envelopes). This provides an algorithm-agnostic generalization bound for any algorithm that solves the empirical minimax problem. Specifically, the sample complexities to achieve an $\epsilon$-uniform convergence and an $\epsilon$-generalization error are $\tilde{\mathcal{O}}(d\kappa^2\epsilon^{-2})$ and $\tilde{\mathcal{O}}(d\epsilon^{-4})$ for the NC-SC and NC-C settings, respectively.

• Combined with algorithms for nonconvex finite-sum minimax optimization, the generalization results further imply gradient complexities for solving NC-SC and NC-C stochastic minimax problems. See Table 1 for a summary. In terms of dependence on the accuracy $\epsilon$ and the condition number $\kappa$, the achieved sample complexities significantly improve over the sample complexities of SOTA SGD-type algorithms in the literature (luo2020stochastic; yang2022faster; rafique2021weakly; lin2020gradient; boct2020alternating), and the achieved gradient complexities match existing SOTA results. The dependence on the dimension $d$ may be avoided if one directly analyzes SGD-type algorithms, as shown in the literature on stochastic optimization (kleywegt2002sample; nemirovski2009robust; davis2022graphical; hu2020sample) and evidenced for both NC-SC and NC-C minimax problems in our paper.

### 1.2 Literature Review

#### Nonconvex Minimax Optimization

In the NC-SC setting, many algorithms have been proposed, e.g., nouiehed2019solving; lin2020gradient; lin2020near; luo2020stochastic; yang2020global; boct2020alternating; xu2020unified; lu2020hybrid; yan2020optimal; guo2021novel; sharma2022federated. Among them, zhang2021complexity achieved the optimal complexity in the deterministic case by introducing the Catalyst acceleration scheme (lin2015universal; paquette2018catalyst) into minimax problems, and luo2020stochastic; zhang2021complexity achieved the best known complexities in the finite-sum case. For purely stochastic NC-SC minimax problems, yang2022faster introduced a stochastic smoothed-AGDA algorithm, which achieves the best complexity, while luo2020stochastic achieves the best complexity if further assuming average smoothness. The lower bounds of NC-SC problems in deterministic, finite-sum, and stochastic settings have been extensively studied recently in zhang2021complexity; han2021lower; li2021complexity.

In general, NC-C problems are harder than NC-SC problems since their primal functions can be both nonsmooth and nonconvex (thekumparampil2019efficient). Recent years witnessed a surge of algorithms for NC-C problems in deterministic, finite-sum, and stochastic settings, e.g., zhang2020single; ostrovskii2021efficient; thekumparampil2019efficient; zhao2020primal; nouiehed2019solving; yang2020catalyst; lin2020gradient; boct2020alternating, to name a few. To the best of our knowledge, thekumparampil2019efficient; yang2020catalyst; lin2020near achieved the best complexity in the deterministic case, while yang2020catalyst achieved the best complexity in the finite-sum case, and rafique2021weakly provided the best complexity in the purely stochastic case.

#### Uniform Convergence

A series of works from stochastic optimization and statistical learning theory studied uniform convergence on the worst-case differences between the population objective and its empirical objective constructed via sample average approximation (SAA, also known as empirical risk minimization). Interested readers may refer to prominent results in statistical learning (fisher1922mathematical; vapnik1999overview; van2000asymptotic). For finite-dimensional problems, kleywegt2002sample showed that the sample complexity is $\mathcal{O}(d\epsilon^{-2})$ to achieve an $\epsilon$-uniform convergence in high probability. For nonconvex empirical objectives, mei2018landscape and davis2022graphical established the sample complexity of uniform convergence measured by the stationarity for nonconvex smooth and weakly convex functions, respectively. For infinite-dimensional functional stochastic optimization with a finite VC-dimension, uniform convergence still holds (vapnik1999overview). In addition, wang2017differentially uses uniform convergence to demonstrate the generalization and the gradient complexity of differentially private algorithms for stochastic optimization.

#### Stability-Based Generalization Bounds

Another line of research focuses on generalization bounds of stochastic optimization via the uniform stability of specific algorithms, including SAA (bousquet2002stability; shalev2009stochastic), stochastic gradient descent (hardt2016train; bassily2020stability), and uniformly stable algorithms (klochkov2021stability). Recently, a series of works further extended the analysis to understand the generalization performances of various algorithms in minimax problems. farnia2021train gave the generalization bound for the outputs of gradient-descent-ascent (GDA) and proximal-point algorithm (PPA) in both (strongly)-convex-(strongly)-concave and nonconvex-nonconcave smooth minimax problems. lei2021stability focused on GDA and provided a comprehensive study for different settings of minimax problems with various generalization measures on function value gaps. boob2021optimal provided stability and generalization results of extragradient algorithm (EG) in the smooth convex-concave setting. On the other hand, zhang2021generalization studied stability and generalization of the empirical minimax problem under the (strongly)-convex-(strongly)-concave setting, assuming that one can find the optimal solution to the empirical minimax problem.

## 2 Problem Setting

#### Notations

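Throughout the paper, we use $\|\cdot\|$ as the $\ell_2$-norm and $\nabla g$ as the gradient of a function $g$. For nonnegative functions $f$ and $g$, we say $f = \mathcal{O}(g)$ if $f(x) \le c\,g(x)$ for some $c > 0$. We denote $\mathrm{Proj}_{\mathcal{X}}(\cdot)$ as the projection operator onto $\mathcal{X}$. Let $\mathcal{A}(S) \triangleq (\mathcal{A}_x(S), \mathcal{A}_y(S))$ denote the output of an algorithm $\mathcal{A}$ on the empirical minimax problem (2) with dataset $S$. Given $\mu \ge 0$, we say a function $g$ is $\mu$-strongly convex if $g(x) - \frac{\mu}{2}\|x\|^2$ is convex, and it is $\mu$-strongly concave if $-g$ is $\mu$-strongly convex. A function $g$ is $L$-weakly convex if $g(x) + \frac{L}{2}\|x\|^2$ is convex (see more notations and standard definitions in Appendix A).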

###### Definition 2.1 (Smooth Function)

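We say a function $f: \mathcal{X}\times\mathcal{Y} \to \mathbb{R}$ is $L$-smooth jointly in $(x, y)$ if the function is continuously differentiable, and there exists a constant $L > 0$ such that for any $(x_1, y_1), (x_2, y_2) \in \mathcal{X}\times\mathcal{Y}$, we have $\|\nabla_x f(x_1,y_1) - \nabla_x f(x_2,y_2)\| \le L(\|x_1 - x_2\| + \|y_1 - y_2\|)$ and $\|\nabla_y f(x_1,y_1) - \nabla_y f(x_2,y_2)\| \le L(\|x_1 - x_2\| + \|y_1 - y_2\|)$.

By definition, it is easy to see that an $L$-smooth function is also $L$-weakly convex. Next we introduce the main assumptions used throughout the paper.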
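The step from smoothness to weak convexity can be spelled out as follows. This is a short sketch for a differentiable function $g$ of $x$ alone whose gradient is $L$-Lipschitz; the auxiliary function $h$ is introduced here only for illustration.

```latex
% Lipschitz gradient and Cauchy-Schwarz give a lower bound on the inner product:
\[
\langle \nabla g(x) - \nabla g(x'),\, x - x' \rangle
\;\ge\; -\|\nabla g(x) - \nabla g(x')\|\,\|x - x'\|
\;\ge\; -L\|x - x'\|^2 .
\]
% Define h(x) = g(x) + (L/2)||x||^2, so that grad h(x) = grad g(x) + L x. Then
\[
\langle \nabla h(x) - \nabla h(x'),\, x - x' \rangle
= \langle \nabla g(x) - \nabla g(x'),\, x - x' \rangle + L\|x - x'\|^2
\;\ge\; 0 ,
\]
% i.e. grad h is monotone, hence h is convex and g is L-weakly convex.
```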

###### Assumption 2.1 (Main Settings)

We assume the following:

• The function $f(x, y; \xi)$ is $L$-smooth jointly in $(x, y)$ for any $\xi$.

• The function $f(x, \cdot\,; \xi)$ is $\mu$-strongly concave in $y$ for any $x \in \mathcal{X}$ and any $\xi$, where $\mu \ge 0$.

• The gradient norms of and are bounded by respectively for any .

• The domains $\mathcal{X}$ and $\mathcal{Y}$ are compact convex sets, i.e., there exist constants $D_{\mathcal{X}}, D_{\mathcal{Y}} > 0$ such that for any $x \in \mathcal{X}$, $\|x\|^2 \le D_{\mathcal{X}}$, and for any $y \in \mathcal{Y}$, $\|y\|^2 \le D_{\mathcal{Y}}$, respectively.

Note that the compact domain assumption is widely used in the uniform convergence literature (kleywegt2002sample; davis2022graphical).

Under Assumption 2.1, the objective function $F$ is $L$-smooth in $(x, y)$ and $F(x, \cdot)$ is $\mu$-strongly concave for any $x \in \mathcal{X}$. When $\mu > 0$, we call the population minimax problem (1) a nonconvex-strongly-concave (NC-SC) minimax problem; when $\mu = 0$, we call it a nonconvex-concave (NC-C) minimax problem.

###### Definition 2.2 (Moreau Envelope)

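For an $L$-weakly convex function $\Phi$ and $0 < \lambda < 1/L$, we use $\Phi^{\lambda}(x)$ and $\mathrm{prox}_{\lambda\Phi}(x)$ to denote the Moreau envelope of $\Phi$ and the proximal point of $\Phi$ for a given point $x$, defined as follows:

$$\Phi^{\lambda}(x) \triangleq \min_{z\in\mathcal{X}}\left\{\Phi(z) + \frac{1}{2\lambda}\|z - x\|^2\right\}, \qquad \mathrm{prox}_{\lambda\Phi}(x) \triangleq \arg\min_{z\in\mathcal{X}}\left\{\Phi(z) + \frac{1}{2\lambda}\|z - x\|^2\right\}. \tag{5}$$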
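As a concrete illustration of Definition 2.2, the following minimal sketch evaluates the Moreau envelope and the proximal point numerically in one dimension. The test function $|z| - z^2/4$ (which is $0.5$-weakly convex), the domain $[-2, 2]$, and the helper names `golden_min`, `prox`, `moreau`, and `moreau_grad` are all assumptions made for this example and do not come from the paper.

```python
import math


def golden_min(fn, lo, hi, tol=1e-10):
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if fn(c) < fn(d):
            b = d
        else:
            a = c
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return 0.5 * (a + b)


def phi(z):
    # Hypothetical 0.5-weakly convex test function: |z| - z^2 / 4.
    return abs(z) - 0.25 * z * z


def prox(x, lam, lo=-2.0, hi=2.0):
    # Proximal point: argmin over z in [lo, hi] of phi(z) + (z - x)^2 / (2 lam).
    # The objective is strongly convex whenever lam < 1 / 0.5 = 2.
    return golden_min(lambda z: phi(z) + (z - x) ** 2 / (2.0 * lam), lo, hi)


def moreau(x, lam, lo=-2.0, hi=2.0):
    # Moreau envelope: the optimal value of the proximal subproblem.
    z = prox(x, lam, lo, hi)
    return phi(z) + (z - x) ** 2 / (2.0 * lam)


def moreau_grad(x, lam):
    # Gradient of the envelope via the identity (x - prox(x)) / lam.
    return (x - prox(x, lam)) / lam
```

A central finite difference of `moreau` can be compared against `moreau_grad` to check the gradient identity at interior points such as `x = 1.5` with `lam = 1.0`.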

Below we recall some important properties on the primal function and its Moreau envelope presented in the literature (davis2019stochastic; thekumparampil2019efficient; lin2020gradient).

###### Lemma 2.1 (Properties of Φ and Φλ)

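In the NC-SC setting ($\mu > 0$), both $\Phi$ and $\Phi_S$ are $(L + \kappa L)$-smooth with the condition number $\kappa \triangleq L/\mu$. In the NC-C setting ($\mu = 0$), the primal function $\Phi$ is $L$-weakly convex, its Moreau envelope $\Phi^{\lambda}$ is differentiable and Lipschitz smooth, and $\nabla\Phi^{\lambda}(x) = \lambda^{-1}(x - \hat{x})$, $\|\nabla\Phi^{\lambda}(x)\| \ge \mathrm{dist}(0, \partial\Phi(\hat{x}))$, where $\hat{x} = \mathrm{prox}_{\lambda\Phi}(x)$ and $0 < \lambda < 1/L$.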

#### Performance Measurement

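In the NC-SC setting, the primal functions $\Phi$ and $\Phi_S$ are both $(L + \kappa L)$-smooth. Regarding the constraint, we measure the difference between the population and empirical minimax problems using the generalized gradient of the population and the empirical primal functions, i.e., $\|\mathcal{G}_{\Phi}(x) - \mathcal{G}_{\Phi_S}(x)\|$, where $\mathcal{G}_{\Phi}(x) \triangleq \frac{1}{\eta}\big(x - \mathrm{Proj}_{\mathcal{X}}(x - \eta\nabla\Phi(x))\big)$ for a stepsize $\eta > 0$. The following inequality summarizes the relationship between the measurement in terms of the generalized gradient and the measurement in terms of the gradient used in Section 1.

$$\underbrace{\mathbb{E}\left\|\mathcal{G}_{\Phi}(\mathcal{A}_x(S)) - \mathcal{G}_{\Phi_S}(\mathcal{A}_x(S))\right\|}_{\text{generalization error of Algorithm } \mathcal{A}} \le \mathbb{E}\left\|\nabla\Phi(\mathcal{A}_x(S)) - \nabla\Phi_S(\mathcal{A}_x(S))\right\| \le \underbrace{\mathbb{E}\left[\max_{x\in\mathcal{X}}\left\|\nabla\Phi(x) - \nabla\Phi_S(x)\right\|\right]}_{\text{algorithm-agnostic uniform convergence}}, \tag{6}$$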

where the first inequality holds as projection is a non-expansive operator. The term in the left-hand side (LHS) above is the generalization error of an algorithm we desire in the NC-SC case.
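The first inequality in (6) can be checked in one line. The sketch below assumes the generalized gradient takes the standard projected form $\mathcal{G}_{\Phi}(x) = \frac{1}{\eta}\big(x - \mathrm{Proj}_{\mathcal{X}}(x - \eta\nabla\Phi(x))\big)$ with a stepsize $\eta > 0$; this form and the symbol $\eta$ are assumptions of the example.

```latex
\[
\begin{aligned}
\|\mathcal{G}_{\Phi}(x) - \mathcal{G}_{\Phi_S}(x)\|
&= \frac{1}{\eta}\,\big\|\mathrm{Proj}_{\mathcal{X}}\big(x - \eta\nabla\Phi_S(x)\big)
   - \mathrm{Proj}_{\mathcal{X}}\big(x - \eta\nabla\Phi(x)\big)\big\| \\
&\le \frac{1}{\eta}\,\big\|\big(x - \eta\nabla\Phi_S(x)\big) - \big(x - \eta\nabla\Phi(x)\big)\big\|
   && \text{(non-expansiveness of the projection)} \\
&= \|\nabla\Phi(x) - \nabla\Phi_S(x)\| .
\end{aligned}
\]
% Setting x = A_x(S) and taking expectations gives the first inequality in (6).
```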

For the NC-C case, the primal function $\Phi$ is $L$-weakly convex, and we use the gradient of its Moreau envelope to characterize the (near)-stationarity (davis2019stochastic). We measure the proximity between the population and empirical problems using the difference between the gradients of their respective Moreau envelopes. The generalization error and the uniform convergence in the NC-C case are given as follows:

$$\underbrace{\mathbb{E}\left\|\nabla\Phi^{1/(2L)}(\mathcal{A}_x(S)) - \nabla\Phi_S^{1/(2L)}(\mathcal{A}_x(S))\right\|}_{\text{generalization error of Algorithm } \mathcal{A}} \le \underbrace{\mathbb{E}\left[\max_{x\in\mathcal{X}}\left\|\nabla\Phi^{1/(2L)}(x) - \nabla\Phi_S^{1/(2L)}(x)\right\|\right]}_{\text{algorithm-agnostic uniform convergence}}. \tag{7}$$

The term in the LHS above is the generalization error of an algorithm we desire in the NC-C case.

## 3 Uniform Convergence and Generalization Bounds

In this section, we discuss the sample complexity for achieving $\epsilon$-uniform convergence and $\epsilon$-generalization error for NC-SC and NC-C stochastic minimax optimization.

### 3.1 NC-SC Stochastic Minimax Optimization

Under the NC-SC setting, we demonstrate in the following theorem the uniform convergence between gradients of primal functions of the population and empirical minimax problems, which provides an upper bound on the generalization error for any algorithm $\mathcal{A}$. We defer the proof to Appendix B.

###### Theorem 3.1 (Uniform Convergence and Generalization Error, NC-SC)

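Under Assumption 2.1 with $\mu > 0$, we have

$$\mathbb{E}\left[\max_{x\in\mathcal{X}}\left\|\nabla\Phi(x) - \nabla\Phi_S(x)\right\|\right] = \tilde{\mathcal{O}}\left(d^{1/2}\kappa n^{-1/2}\right). \tag{8}$$

Furthermore, to achieve $\epsilon$-uniform convergence and $\epsilon$-generalization error for any algorithm $\mathcal{A}$ such that the error $\mathbb{E}\left\|\mathcal{G}_{\Phi}(\mathcal{A}_x(S)) - \mathcal{G}_{\Phi_S}(\mathcal{A}_x(S))\right\| \le \epsilon$, it suffices to have

$$n = n^*_{\mathrm{NCSC}} \triangleq \tilde{\mathcal{O}}\left(d\kappa^2\epsilon^{-2}\right). \tag{9}$$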
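The sample size in (9) follows from inverting the rate in (8). In the sketch below, $C$ is a hypothetical constant standing in for the constant and poly-logarithmic factors hidden by $\tilde{\mathcal{O}}$.

```latex
\[
C\, d^{1/2}\,\kappa\, n^{-1/2} \le \epsilon
\quad\Longleftrightarrow\quad
n^{1/2} \ge \frac{C\, d^{1/2}\kappa}{\epsilon}
\quad\Longleftrightarrow\quad
n \ge \frac{C^{2}\, d\,\kappa^{2}}{\epsilon^{2}} ,
\]
% so n of order d * kappa^2 * eps^{-2} (up to logarithmic factors) suffices.
```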

To the best of our knowledge, it is the first uniform convergence and algorithm-agnostic generalization error bound result for NC-SC stochastic minimax problem. In comparison, existing works in the generalization error analysis (farnia2021train; lei2021stability) utilize stability arguments for certain algorithms and thus are algorithm-specific. zhang2021generalization establish algorithm-agnostic stability and generalization in the strongly-convex-strongly-concave regime, yet their analysis does not extend to the nonconvex regime. Our generalization results apply to any algorithms for solving finite-sum problems, especially the SOTA algorithms like Catalyst-SVRG (zhang2021complexity) and finite-sum version SREDA (luo2020stochastic). These algorithms are generally very complicated, and they lack stability-based generalization bounds analysis.

The achieved sample complexity further implies that for any algorithm that achieves an $\epsilon$-stationary point of the empirical minimax problem, its sample complexity for finding an $\mathcal{O}(\epsilon)$-stationary point of the population minimax problem is $\tilde{\mathcal{O}}(d\kappa^2\epsilon^{-2})$. In terms of the dependence on the accuracy $\epsilon$ and the condition number $\kappa$, this sample complexity is better than the SOTA sample complexity results achieved via directly applying gradient-based methods on the population minimax optimization, i.e., by Stochastic Smoothed-AGDA (yang2022faster) and by SREDA (luo2020stochastic).
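To make the uniform convergence statement tangible, here is a minimal one-dimensional sketch. It assumes a toy objective $f(x, y; \xi) = y(\sin x + \xi) - \frac{\mu}{2}y^2$ with Gaussian $\xi$, for which the inner maximizer and both primal gradients are available in closed form. The objective, the constants `MU` and `MEAN`, and all function names are illustrative assumptions, and $y$ is left unconstrained for simplicity, which departs from the compact-domain assumption of the paper.

```python
import math
import random

MU = 0.5    # strong-concavity parameter (illustrative assumption)
MEAN = 0.3  # population mean of xi (illustrative assumption)


def grad_primal(x, xi_mean, mu=MU):
    """Gradient of the primal function (sin x + xi_mean)^2 / (2 mu).

    It arises from f(x, y; xi) = y * (sin x + xi) - (mu / 2) * y^2, whose
    inner maximizer is y*(x) = (sin x + xi_mean) / mu.
    """
    return (math.sin(x) + xi_mean) * math.cos(x) / mu


def sup_gradient_gap(n, rng, grid_size=2001):
    """Return the largest gap |grad Phi(x) - grad Phi_S(x)| over a grid of
    [-pi, pi], together with the sample mean of the n draws."""
    xi_bar = sum(rng.gauss(MEAN, 1.0) for _ in range(n)) / n
    grid = [-math.pi + 2.0 * math.pi * k / (grid_size - 1) for k in range(grid_size)]
    gap = max(abs(grad_primal(x, MEAN) - grad_primal(x, xi_bar)) for x in grid)
    return gap, xi_bar


def average_gap(n, trials=20, seed=0):
    """Average the grid-wise sup gap over independent datasets of size n."""
    rng = random.Random(seed)
    return sum(sup_gradient_gap(n, rng)[0] for _ in range(trials)) / trials
```

In this toy problem the gap equals $|\bar{\xi} - \mathbb{E}\xi|\,|\cos x| / \mu$, so its maximum over $x$ scales like $1/(\mu\sqrt{n})$; comparing `average_gap(100)` with `average_gap(10000)` should show roughly that decay.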

### 3.2 NC-C Stochastic Minimax Optimization

In this subsection, we derive the uniform convergence and algorithm-agnostic generalization bounds for NC-C stochastic minimax problems in the following theorem. Recall that the primal function $\Phi$ is $L$-weakly convex (thekumparampil2019efficient) and $\nabla\Phi$ is not well-defined. We use the gradient of the Moreau envelope of the primal function as the measurement (davis2019stochastic).

###### Theorem 3.2 (Uniform Convergence and Generalization Error, NC-C)

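Under Assumption 2.1 with $\mu = 0$, we have

$$\mathbb{E}\left[\max_{x\in\mathcal{X}}\left\|\nabla\Phi_S^{1/(2L)}(x) - \nabla\Phi^{1/(2L)}(x)\right\|\right] = \tilde{\mathcal{O}}\left(d^{1/4} n^{-1/4}\right). \tag{10}$$

Furthermore, to achieve $\epsilon$-uniform convergence and $\epsilon$-generalization error for any algorithm $\mathcal{A}$ such that the error $\mathbb{E}\left\|\nabla\Phi^{1/(2L)}(\mathcal{A}_x(S)) - \nabla\Phi_S^{1/(2L)}(\mathcal{A}_x(S))\right\| \le \epsilon$, it suffices to have

$$n = n^*_{\mathrm{NCC}} \triangleq \tilde{\mathcal{O}}\left(d\epsilon^{-4}\right). \tag{11}$$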

#### Proof Sketch

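The analysis of Theorem 3.2 consists of three parts. By the expression of the gradient of the Moreau envelope, it holds that when $0 < \lambda < 1/L$,

$$\left\|\nabla\Phi_S^{\lambda}(x) - \nabla\Phi^{\lambda}(x)\right\| \le \frac{1}{\lambda}\left\|\mathrm{prox}_{\lambda\Phi}(x) - \mathrm{prox}_{\lambda\Phi_S}(x)\right\|.$$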
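This bound is immediate from the gradient formula of the Moreau envelope recalled in Lemma 2.1, as the following short derivation sketches (it in fact holds with equality).

```latex
\[
\nabla\Phi^{\lambda}(x) = \frac{1}{\lambda}\big(x - \mathrm{prox}_{\lambda\Phi}(x)\big),
\qquad
\nabla\Phi_S^{\lambda}(x) = \frac{1}{\lambda}\big(x - \mathrm{prox}_{\lambda\Phi_S}(x)\big),
\]
\[
\Longrightarrow\quad
\nabla\Phi_S^{\lambda}(x) - \nabla\Phi^{\lambda}(x)
= \frac{1}{\lambda}\big(\mathrm{prox}_{\lambda\Phi}(x) - \mathrm{prox}_{\lambda\Phi_S}(x)\big).
\]
% Taking norms on both sides yields the displayed bound.
```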

We first use a -net  (vapnik1999overview) to handle the dependence issue between and .

Then we build a connection between NC-C stochastic minimax optimization problems and NC-SC stochastic minimax optimization problems via adding an $\ell_2$-regularization and carefully choosing a regularization parameter. The following lemma characterizes the distance between the proximal point $\mathrm{prox}_{\lambda\Phi}(x)$ of the primal function of the original NC-C problem and the proximal point $\mathrm{prox}_{\lambda\hat{\Phi}}(x)$ of the primal function of the regularized NC-SC problem. Note that the lemma may be of independent interest for the design and the analysis of gradient-based methods for NC-C problems.

###### Lemma 3.1

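For $\nu > 0$, denote $\hat{\Phi}(x) \triangleq \max_{y\in\mathcal{Y}}\left\{F(x,y) - \frac{\nu}{2}\|y\|^2\right\}$ as the primal function of the regularized NC-C problem. It holds for $0 < \lambda < (L+\nu)^{-1}$ that

$$\left\|\mathrm{prox}_{\lambda\Phi}(x) - \mathrm{prox}_{\lambda\hat{\Phi}}(x)\right\|^2 \le \frac{\nu D_{\mathcal{Y}}\lambda}{1 - \lambda(L+\nu)}.$$

This lemma implies that for a small regularization parameter $\nu$, the difference between the proximal point of the primal function of the NC-C problem and that of the primal function of the regularized NC-SC problem is small.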

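Proof. Since $F(x,y)$ is $L$-smooth, it is obvious that $F(x,y) - \frac{\nu}{2}\|y\|^2$ is $(L+\nu)$-smooth. By (thekumparampil2019efficient, Lemma 3), $\hat{\Phi}$ is $(L+\nu)$-weakly convex in $x$. Therefore, $\hat{\Phi}(z) + \frac{1}{2\lambda}\|z - x\|^2$ is $(1/\lambda - (L+\nu))$-strongly convex in $z$ for any fixed $x$. Denote

$$\hat{y}(x) \triangleq \arg\max_{y\in\mathcal{Y}}\left\{F(x,y) - \frac{\nu}{2}\|y\|^2\right\}, \qquad y^*(x) \triangleq \arg\max_{y\in\mathcal{Y}} F(x,y). \tag{12}$$

It holds that

$$\begin{aligned}
&\tfrac{1}{2}\big(1/\lambda - (L+\nu)\big)\left\|\mathrm{prox}_{\lambda\Phi}(x) - \mathrm{prox}_{\lambda\hat{\Phi}}(x)\right\|^2 \\
\le{}& \hat{\Phi}(\mathrm{prox}_{\lambda\Phi}(x)) + \tfrac{1}{2\lambda}\|\mathrm{prox}_{\lambda\Phi}(x) - x\|^2 - \hat{\Phi}(\mathrm{prox}_{\lambda\hat{\Phi}}(x)) - \tfrac{1}{2\lambda}\|\mathrm{prox}_{\lambda\hat{\Phi}}(x) - x\|^2 \\
={}& F\big(\mathrm{prox}_{\lambda\Phi}(x), \hat{y}(\mathrm{prox}_{\lambda\Phi}(x))\big) - \tfrac{\nu}{2}\|\hat{y}(\mathrm{prox}_{\lambda\Phi}(x))\|^2 + \tfrac{1}{2\lambda}\|\mathrm{prox}_{\lambda\Phi}(x) - x\|^2 \\
& - F\big(\mathrm{prox}_{\lambda\hat{\Phi}}(x), \hat{y}(\mathrm{prox}_{\lambda\hat{\Phi}}(x))\big) + \tfrac{\nu}{2}\|\hat{y}(\mathrm{prox}_{\lambda\hat{\Phi}}(x))\|^2 - \tfrac{1}{2\lambda}\|\mathrm{prox}_{\lambda\hat{\Phi}}(x) - x\|^2 \\
\le{}& F\big(\mathrm{prox}_{\lambda\Phi}(x), y^*(\mathrm{prox}_{\lambda\Phi}(x))\big) + \tfrac{1}{2\lambda}\|\mathrm{prox}_{\lambda\Phi}(x) - x\|^2 - \tfrac{\nu}{2}\|\hat{y}(\mathrm{prox}_{\lambda\Phi}(x))\|^2 \\
& - F\big(\mathrm{prox}_{\lambda\hat{\Phi}}(x), \hat{y}(\mathrm{prox}_{\lambda\hat{\Phi}}(x))\big) - \tfrac{1}{2\lambda}\|\mathrm{prox}_{\lambda\hat{\Phi}}(x) - x\|^2 + \tfrac{\nu}{2}\|\hat{y}(\mathrm{prox}_{\lambda\hat{\Phi}}(x))\|^2 \\
\le{}& F\big(\mathrm{prox}_{\lambda\Phi}(x), y^*(\mathrm{prox}_{\lambda\Phi}(x))\big) + \tfrac{1}{2\lambda}\|\mathrm{prox}_{\lambda\Phi}(x) - x\|^2 - \tfrac{\nu}{2}\|\hat{y}(\mathrm{prox}_{\lambda\Phi}(x))\|^2 \\
& - F\big(\mathrm{prox}_{\lambda\hat{\Phi}}(x), y^*(\mathrm{prox}_{\lambda\hat{\Phi}}(x))\big) - \tfrac{1}{2\lambda}\|\mathrm{prox}_{\lambda\hat{\Phi}}(x) - x\|^2 + \tfrac{\nu}{2}\|y^*(\mathrm{prox}_{\lambda\hat{\Phi}}(x))\|^2 \\
={}& \Phi(\mathrm{prox}_{\lambda\Phi}(x)) + \tfrac{1}{2\lambda}\|\mathrm{prox}_{\lambda\Phi}(x) - x\|^2 - \Phi(\mathrm{prox}_{\lambda\hat{\Phi}}(x)) - \tfrac{1}{2\lambda}\|\mathrm{prox}_{\lambda\hat{\Phi}}(x) - x\|^2 \\
& + \tfrac{\nu}{2}\|y^*(\mathrm{prox}_{\lambda\hat{\Phi}}(x))\|^2 - \tfrac{\nu}{2}\|\hat{y}(\mathrm{prox}_{\lambda\Phi}(x))\|^2 \\
\le{}& \tfrac{\nu}{2}\|y^*(\mathrm{prox}_{\lambda\hat{\Phi}}(x))\|^2 - \tfrac{\nu}{2}\|\hat{y}(\mathrm{prox}_{\lambda\Phi}(x))\|^2 \\
\le{}& \frac{\nu D_{\mathcal{Y}}}{2},
\end{aligned}$$

where the first inequality holds by strong convexity of $\hat{\Phi}(z) + \frac{1}{2\lambda}\|z - x\|^2$ and optimality of $\mathrm{prox}_{\lambda\hat{\Phi}}(x)$ for this function, the first equality holds by definition of $\hat{\Phi}$, the second inequality holds by optimality of $y^*(\cdot)$, the third inequality holds by optimality of $\hat{y}(\cdot)$, the second equality holds by definition of $\Phi$, the fourth inequality holds by optimality of $\mathrm{prox}_{\lambda\Phi}(x)$, and the last inequality holds by the compact domain $\mathcal{Y}$. Dividing both sides by $\frac{1}{2}(1/\lambda - (L+\nu))$ concludes the proof.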
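A quick numerical sanity check of Lemma 3.1 can be run on a toy problem. The sketch below assumes $F(x, y) = xy$ with $\mathcal{X} = [-2, 2]$ and $\mathcal{Y} = [-1, 1]$, so that $L = 1$, $D_{\mathcal{Y}} = 1$, $\Phi(x) = |x|$, and $\hat{\Phi}$ is a Huber-type function. The toy problem, the parameter values, and the helper names are assumptions made for this example only.

```python
import math

L_SMOOTH = 1.0  # smoothness constant of the toy objective F(x, y) = x * y
D_Y = 1.0       # bound on ||y||^2 over Y = [-1, 1]


def golden_min(fn, lo, hi, tol=1e-10):
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if fn(c) < fn(d):
            b = d
        else:
            a = c
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return 0.5 * (a + b)


def phi(x):
    # Primal function of the toy NC-C problem: max over |y| <= 1 of x * y = |x|.
    return abs(x)


def phi_reg(x, nu):
    # Regularized primal: max over |y| <= 1 of x * y - (nu / 2) * y^2,
    # attained at y = clip(x / nu, -1, 1).
    y = max(-1.0, min(1.0, x / nu))
    return x * y - 0.5 * nu * y * y


def prox(fn, x, lam, lo=-2.0, hi=2.0):
    # Proximal point of fn at x over X = [lo, hi].
    return golden_min(lambda z: fn(z) + (z - x) ** 2 / (2.0 * lam), lo, hi)


def max_sq_gap(lam, nu, num=401):
    """Largest squared distance between the two proximal points on a grid of X."""
    xs = [-2.0 + 4.0 * k / (num - 1) for k in range(num)]
    return max(
        (prox(phi, x, lam) - prox(lambda z: phi_reg(z, nu), x, lam)) ** 2
        for x in xs
    )


def lemma_bound(lam, nu):
    # Right-hand side of Lemma 3.1; requires lam < 1 / (L + nu).
    return nu * D_Y * lam / (1.0 - lam * (L_SMOOTH + nu))
```

For instance, with `lam = 0.25` and `nu = 0.1`, one can compare `max_sq_gap(0.25, 0.1)` against `lemma_bound(0.25, 0.1)`; shrinking `nu` should shrink both quantities.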

It remains to characterize the distance between and and show that is a sub-Gaussian random variable. For the distance between and , by definition, it is equivalent to the difference between the optimal solutions on of strongly-convex strongly-concave (SC-SC) population minimax problem and its empirical minimax problem. We utilize the existing stability-based results for SC-SC minimax optimization (zhang2021generalization) to build the upper bound for the distance and show the variable is sub-Gaussian. The proof of Theorem 3.2 is deferred to Appendix C.

To the best of our knowledge, this is the first algorithm-agnostic generalization error result in NC-C stochastic minimax optimization. Similar to the NC-SC setting, Theorem 3.2 indicates that the sample complexity to guarantee an $\epsilon$-generalization error in the NC-C case for any algorithm is $\tilde{\mathcal{O}}(d\epsilon^{-4})$. In comparison, it is much better than the $\tilde{\mathcal{O}}(\epsilon^{-6})$ sample complexity achieved by the SOTA stochastic approximation-based algorithms (rafique2021weakly) for NC-C stochastic minimax optimization for small accuracy $\epsilon$ and moderate dimension $d$.

###### Remark 3.1 (Comparison Between Minimization, NC-SC, and NC-C Settings)

For general stochastic nonconvex optimization, the sample complexity of achieving uniform convergence is $\tilde{\mathcal{O}}(d\epsilon^{-2})$ (davis2022graphical; mei2018landscape). There are two key differences in minimax optimization.

1. The primal function is not in the form of averaging over samples and thus existing analysis for minimization problem is not directly applicable. Instead if we care about the uniform convergence in terms of the gradient of , i.e., , the existing analysis in mei2018landscape directly gives a sample complexity.

2. For a given , the optimal point differs from and such difference brings in an additional error term. In the NC-SC case, such error is upper bounded by , which is of the same scale of the error from . Thus the eventual uniform convergence bound is of the same order as that for minimization problem (mei2018landscape; davis2022graphical). However, in the NC-C case, may not be well defined. Instead, we bound the distance between

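$$\hat{y}_S^*(x) \triangleq \arg\max_{y\in\mathcal{Y}}\left\{F_S(x,y) - \frac{\nu}{2}\|y\|^2\right\} \quad\text{and}\quad \hat{y}^*(x) \triangleq \arg\max_{y\in\mathcal{Y}}\left\{F(x,y) - \frac{\nu}{2}\|y\|^2\right\}$$

for a small $\nu > 0$. Such error is controlled by . Thus the sample complexity for achieving $\epsilon$-uniform convergence for the NC-C case is larger than that of the NC-SC case.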

We leave it for future investigation to see if one could achieve smaller sample complexity in the NC-C case via a better characterization of the extra error brought in by in the NC-C setting.

### 3.3 Gradient Complexity for Solving Stochastic Nonconvex Minimax Optimization

The uniform convergence and the algorithm-agnostic generalization error shed light on the tightness of the complexity of algorithms for solving stochastic minimax optimization. We summarize related results in Table 1 and elaborate the details in this subsection.

Combining sample complexities for achieving $\epsilon$-generalization error and gradient complexities of existing algorithms for solving empirical minimax problems, we can directly obtain gradient complexities of these algorithms for solving population minimax problems. Note that the SOTA gradient complexities for solving NC-SC empirical problems are given in luo2020stochastic and zhang2021complexity, while for solving NC-C empirical problems the SOTA is given in yang2020catalyst. (The gradient complexity of luo2020stochastic holds under the sample-size condition mentioned therein, and our sample complexity result in Theorem 3.1 aligns with this requirement. The results therein also assume average smoothness, which is a weaker condition than the individual smoothness in our paper.) We substitute the required sample size given by Theorem 3.1 and Theorem 3.2 to get the corresponding gradient complexity for solving the population minimax problem (1). Recalling the definition of (near)-stationarity, the next theorem shows the achieved gradient complexity; we defer the proof to Appendix D.

###### Theorem 3.3 (Gradient Complexity of Specific Algorithms)

Under Assumption 2.1, we have:


• In the NC-SC case, if we use the finite-sum version of SREDA proposed in luo2020stochastic for the empirical minimax problem (2) with , the algorithm output is an -stationary point of the population minimax problem (1), and the corresponding gradient complexity is .

• In the NC-C case, if we use the Catalyst-SVRG algorithm proposed in yang2020catalyst for the empirical minimax problem (2) with size , the algorithm output is an -stationary point of the population minimax problem (1), and the corresponding gradient complexity is .

#### Dependence on the Dimension d

The gradient complexities obtained in Theorem 3.3 come with a dependence on the dimension $d$, which stems from the uniform convergence argument, as it aims to bound the error on the worst-case $x \in \mathcal{X}$. On the other hand, to achieve a small optimization error on the population minimax problem, it only requires a small generalization error on the specific output $\mathcal{A}_x(S)$. Thus the gradient complexity obtained from uniform convergence has its own limitation.

Nevertheless, the obtained sample and gradient complexities are still meaningful in terms of the dependence on . In addition, we point out that the dependence on can generally be avoided if one directly analyzes some SGD-type methods for the population minimax problem. We have witnessed in various settings that the complexity bound of SAA has a dependence on dimension while there exist some SGD-type algorithms with dimension-free gradient complexities. See kleywegt2002sample and nemirovski2009robust for classical stochastic convex optimization, davis2022graphical and davis2019stochastic for stochastic nonconvex optimization.

On the other hand, there are several structured machine learning models that enjoy dimension-free uniform convergence results (davis2022graphical; foster2018uniform; mei2018landscape). We leave the investigation of dimension-free uniform convergence for specific applications with nonconvex minimax structure as a future direction.

#### Matching SGD-Type Algorithms in Stochastic Nonconvex Minimax Problems

In fact, the above argument that one can get rid of the dependence on $d$ in the analysis of SGD-type algorithms is already verified. In NC-SC stochastic minimax optimization, the stochastic version of the SREDA algorithm in luo2020stochastic achieves $\mathcal{O}(\kappa^3\epsilon^{-3})$ gradient complexity, which matches the first bullet point in Theorem 3.3 except for the dependence on dimension. In the NC-C case, the PG-SMD algorithm proposed in rafique2021weakly achieves $\tilde{\mathcal{O}}(\epsilon^{-6})$ gradient complexity, which matches the second result in Theorem 3.3 while being free of the dimension dependence.

We point out that the discussion above relies on the gradient complexities of existing SOTA algorithms in NC-SC or NC-C finite-sum minimax optimization, which may not be sharp enough in terms of the dependence on the condition number $\kappa$ or the sample size $n$. It is still possible to further improve the gradient complexity if one could design faster algorithms for solving empirical nonconvex minimax optimization problems.

###### Remark 3.2 (Tightness of Lower and Upper Complexity Bounds)

In the NC-SC setting, zhang2021complexity provides a lower complexity bound for NC-SC finite-sum problems under the average smoothness assumption, which is strictly lower than the SOTA upper bounds in (luo2020stochastic; zhang2021complexity). If the lower bound is sharp in our setting, then with the error decomposition (4) we can conjecture that there exists an algorithm for solving NC-SC stochastic minimax problems with a better gradient complexity. For the NC-C setting, there is no specific lower complexity bound (to the best of our knowledge, currently there is no lower bound result specifically for NC-C minimax optimization; the existing lower bounds for nonconvex minimization [carmon2019lower; carmon2019lowerII; fang2018spider; zhou2019lower; arjevani2019lower] and NC-SC minimax problems [zhang2021complexity; han2021lower; li2021complexity] are trivial lower bounds for nonconvex minimax problems), so it is still an open problem whether the current SOTA complexity in yang2020catalyst is optimal. It remains an open question whether one can design an algorithm with improved complexity and what the lower complexity bound of the NC-C setting is.

On the other hand, the SOTA gradient complexity bounds for NC-SC finite-sum problems are given by luo2020stochastic and zhang2021complexity: one has a better dependence on the sample size $n$ and the other has a better dependence on the condition number $\kappa$. With the latter upper bound, the error decomposition (4) implies a gradient complexity that is clearly sub-optimal in terms of the dependence on the accuracy $\epsilon$. Note that the $\epsilon$-dependence of the gradient complexity induced by the former upper bound has hit the lower bound in nonconvex smooth optimization (arjevani2019lower). We conjecture that the gradient complexity achieved in luo2020stochastic has an optimal dependence on the sample size $n$, as it provides a matching dependence on the accuracy $\epsilon$ with the lower bound when combined with our uniform convergence result.

## 4 Conclusion

In this paper, we take an initial step towards understanding the the uniform convergence and corresponding generalization performances of NC-SC and NC-C minimax problems measured by the first-order stationarity. We hope that this work will shed light on the design of algorithms with improved complexities for solving stochastic nonconvex minimax optimization.

Several future directions are worthy of further investigation. It remains interesting to see whether we can improve the uniform convergence results under the NC-C setting, particularly the dependence on the accuracy $\epsilon$. In addition, is it possible to design algorithms for the NC-SC finite-sum setting with better complexities and close the gap to the lower bound? In terms of generalization bounds, it remains open to derive algorithm-specific stability-based generalization bounds under the stationarity measurement.

## Appendix A Additional Definitions and Tools

For convenience, we summarize the notations commonly used throughout the paper.

• Population minimax problem and its primal function:

 $$F(x,y) \triangleq \mathbb{E}_{\xi} f(x,y;\xi), \qquad \Phi(x) \triangleq \max_{y\in\mathcal{Y}} F(x,y), \qquad y^{*}(x) \triangleq \operatorname*{argmax}_{y\in\mathcal{Y}} F(x,y).$$

 (Another commonly used convergence criterion in minimax optimization is the first-order stationarity of , i.e., and (or its corresponding gradient mapping) [lin2020gradient, xu2020unified]. We refer readers to lin2020gradient, yang2022faster for a thorough comparison of these two measurements. In this paper, we always stick to the convergence measured by the stationarity of the primal function.)
• Empirical minimax problem and its primal function

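 $$F_S(x,y) \triangleq \frac{1}{n}\sum_{i=1}^{n} f(x,y;\xi_i), \qquad \Phi_S(x) \triangleq \max_{y\in\mathcal{Y}} F_S(x,y), \qquad y_S^{*}(x) \triangleq \operatorname*{argmax}_{y\in\mathcal{Y}} F_S(x,y).$$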
• Moreau envelope and corresponding proximal point:

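 $$\begin{aligned}
 \Phi^{\lambda}(x) &\triangleq \min_{z\in\mathcal{X}} \Big\{\Phi(z) + \frac{1}{2\lambda}\|z-x\|^2\Big\}, & \operatorname{prox}_{\lambda\Phi}(x) &\triangleq \operatorname*{argmin}_{z\in\mathcal{X}} \Big\{\Phi(z) + \frac{1}{2\lambda}\|z-x\|^2\Big\}, \\
 \Phi_S^{\lambda}(x) &\triangleq \min_{z\in\mathcal{X}} \Big\{\Phi_S(z) + \frac{1}{2\lambda}\|z-x\|^2\Big\}, & \operatorname{prox}_{\lambda\Phi_S}(x) &\triangleq \operatorname*{argmin}_{z\in\mathcal{X}} \Big\{\Phi_S(z) + \frac{1}{2\lambda}\|z-x\|^2\Big\}.
 \end{aligned}$$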

• : -norm.

• : the gradient of a function .

• : the projection operator.

• : the output of an algorithm on the empirical minimax problem (2) with dataset .

• NC / WC: nonconvex, weakly convex.

• NC-SC / NC-C: nonconvex-(strongly)-concave.

• SOTA: state-of-the-art.

• $d$: dimension number of $\mathcal{X}$.

• : condition number , : Lipschitz smoothness parameter, : strong concavity parameter.

• $\tilde{\mathcal{O}}(\cdot)$ hides poly-logarithmic factors.

• if for some and nonnegative functions and .

• We say a function is convex if and , we have .

• A function is -smooth if is continuously differentiable in and there exists a constant such that holds for any . (Here the smoothness definition for single-variable functions is subtly different from that of two-variable functions in Definition 2.1, so we list it here for completeness.)
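As a hedged illustration of the population and empirical notation above, the following sketch uses a toy objective that is not taken from the paper: $f(x,y;\xi) = (\sin x + \xi)\,y - y^2/2$ with $\mathbb{E}\xi = 0$ and $\mathcal{Y} = \mathbb{R}$ (an assumption made only so that everything is available in closed form). In this toy case $y^{*}(x) = \sin x$, $\Phi(x) = \tfrac{1}{2}\sin^2 x$, $y_S^{*}(x) = \sin x + \bar{\xi}$ and $\Phi_S(x) = \tfrac{1}{2}(\sin x + \bar{\xi})^2$, where $\bar{\xi}$ is the sample mean, so that $\nabla\Phi_S(x) - \nabla\Phi(x) = \bar{\xi}\cos x$. All function names are hypothetical.

```python
import numpy as np


def grad_phi(x):
    """Gradient of the population primal function Phi(x) = sin(x)^2 / 2."""
    return np.sin(x) * np.cos(x)


def grad_phi_S(x, xi):
    """Gradient of the empirical primal function Phi_S(x) = (sin(x) + mean(xi))^2 / 2."""
    xi_bar = np.mean(xi)
    return (np.sin(x) + xi_bar) * np.cos(x)


def uniform_gradient_gap(xi, grid):
    """Largest gap |grad Phi_S(x) - grad Phi(x)| over a finite grid of x values."""
    return float(np.max(np.abs(grad_phi_S(grid, xi) - grad_phi(grid))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
    grid = np.linspace(-np.pi, np.pi, 201)
    for n in (10, 100, 1000, 10000):
        xi = rng.normal(0.0, 1.0, size=n)  # zero-mean noise (toy assumption)
        print(n, uniform_gradient_gap(xi, grid))
```

In this toy problem the uniform gap over the grid equals $|\bar{\xi}|$, so it shrinks at the usual $n^{-1/2}$ rate of a sample mean; the sketch only illustrates the quantities being compared and says nothing about the rates proved in the paper.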

For completeness, we introduce the definition of a sub-Gaussian random variable and a related lemma, which are important tools in the analysis.

###### Definition A.1 (Sub-Gaussian Random Variable)

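A random variable $\eta$ is a zero-mean sub-Gaussian random variable with variance proxy $\sigma_\eta^2$ if $\mathbb{E}[\eta] = 0$ and either of the following two conditions holds:

 $$\text{(a)}\ \ \mathbb{E}[\exp(s\eta)] \le \exp\Big(\frac{\sigma_\eta^2 s^2}{2}\Big) \ \text{ for any } s\in\mathbb{R}; \qquad \text{(b)}\ \ \mathbb{P}(|\eta| \ge t) \le 2\exp\Big(-\frac{t^2}{2\sigma_\eta^2}\Big) \ \text{ for any } t > 0.$$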
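To make the two conditions concrete, here is a minimal sketch for a Rademacher variable (uniform on $\{-1,+1\}$), a standard example chosen here for illustration and not taken from the paper. Its moment generating function is $\cosh(s)$ and it satisfies both conditions with variance proxy $\sigma_\eta^2 = 1$. The helper names are hypothetical.

```python
import math


def rademacher_mgf(s):
    """E[exp(s * eta)] for eta uniform on {-1, +1}."""
    return math.cosh(s)


def subgaussian_mgf_bound(s, sigma2):
    """Right-hand side of condition (a): exp(sigma^2 * s^2 / 2)."""
    return math.exp(sigma2 * s * s / 2.0)


def rademacher_tail(t):
    """P(|eta| >= t) for eta uniform on {-1, +1}."""
    return 1.0 if t <= 1.0 else 0.0


def subgaussian_tail_bound(t, sigma2):
    """Right-hand side of condition (b): 2 * exp(-t^2 / (2 * sigma^2))."""
    return 2.0 * math.exp(-t * t / (2.0 * sigma2))


if __name__ == "__main__":
    for s in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(s, rademacher_mgf(s), subgaussian_mgf_bound(s, 1.0))
```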

We use the following McDiarmid’s inequality to show that a random variable is sub-Gaussian.

###### Lemma A.1 (McDiarmid’s inequality)

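Let $\eta_1,\dots,\eta_n$ be independent random variables. Let $h$ be any function with the $(c_1,\dots,c_n)$-bounded differences property: for every $i\in\{1,\dots,n\}$ and every $(\eta_1,\dots,\eta_n)$, $(\eta_1',\dots,\eta_n')$ that differ only in the $i$-th coordinate ($\eta_j = \eta_j'$ for all $j\neq i$), we have

 $$\big|h(\eta_1,\dots,\eta_n) - h(\eta_1',\dots,\eta_n')\big| \le c_i.$$

For any $t > 0$, it holds that

 $$\mathbb{P}\big(|h(\eta_1,\dots,\eta_n) - \mathbb{E}\,h(\eta_1,\dots,\eta_n)| \ge t\big) \le 2\exp\Big(-\frac{2t^2}{\sum_{i=1}^{n} c_i^2}\Big).$$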
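As a small sanity check of how the inequality is typically used, the sketch below applies it to the sample mean of $n$ independent variables taking values in $[0,1]$. This is an illustrative choice, not the function analyzed in the paper. Changing one coordinate moves the mean by at most $c_i = 1/n$, so the bound reads $2\exp(-2nt^2)$. The function names and the uniform distribution are assumptions made for the example.

```python
import numpy as np


def mcdiarmid_bound(t, c):
    """Two-sided McDiarmid bound 2 * exp(-2 t^2 / sum_i c_i^2)."""
    c = np.asarray(c, dtype=float)
    return float(2.0 * np.exp(-2.0 * t ** 2 / np.sum(c ** 2)))


def empirical_deviation_prob(n, t, trials, seed=0):
    """Monte Carlo estimate of P(|mean - 1/2| >= t) for n i.i.d. Uniform[0, 1] draws."""
    rng = np.random.default_rng(seed)
    samples = rng.uniform(0.0, 1.0, size=(trials, n))
    means = samples.mean(axis=1)
    return float(np.mean(np.abs(means - 0.5) >= t))


if __name__ == "__main__":
    n, t = 100, 0.1
    print("bound    :", mcdiarmid_bound(t, np.full(n, 1.0 / n)))
    print("empirical:", empirical_deviation_prob(n, t, trials=20000))
```

The empirical frequency is expected to sit well below the bound, since McDiarmid's inequality uses only the bounded-differences constants and no further distributional information.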

###### Lemma A.2 (Properties of $\Phi$ and $\Phi^{\lambda}$, Restated)

In the NC-SC setting (), both and are -smooth with the condition number , both and are -Lipschitz continuous and . In the NC-C setting (), the primal function is -weakly convex, and its Moreau envelope is differentiable and Lipschitz smooth, and satisfies

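 $$\nabla\Phi^{\lambda}(x) = \lambda^{-1}(x - \hat{x}), \qquad \big\|\nabla\Phi^{\lambda}(x)\big\| \ge \operatorname{dist}\big(0, \partial\Phi(\hat{x})\big), \tag{13}$$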

where and .
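The relations in (13) can be checked numerically on a simple example. The sketch below assumes $\Phi(z) = |z|$ on $\mathcal{X} = \mathbb{R}$, taking $\hat{x}$ to be the proximal point $\operatorname{prox}_{\lambda\Phi}(x)$. This function is convex, hence a special case of weak convexity, and is chosen only because its proximal point is the soft-thresholding operator and its Moreau envelope is the Huber function. It is not an example from the paper, and the function names are hypothetical.

```python
import numpy as np


def prox_abs(x, lam):
    """Proximal point of Phi(z) = |z|: soft-thresholding with threshold lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)


def moreau_abs(x, lam):
    """Moreau envelope of |z|, evaluated at the proximal point."""
    z = prox_abs(x, lam)
    return np.abs(z) + (z - x) ** 2 / (2.0 * lam)


def grad_moreau_abs(x, lam):
    """Gradient of the Moreau envelope via the identity (x - x_hat) / lam."""
    return (x - prox_abs(x, lam)) / lam


def dist_zero_to_subdiff_abs(z):
    """dist(0, subdifferential of |.| at z): 0 at z = 0 (set [-1, 1]), else 1."""
    return np.where(z == 0.0, 0.0, 1.0)


if __name__ == "__main__":
    lam = 0.5
    xs = np.linspace(-2.0, 2.0, 9)
    x_hat = prox_abs(xs, lam)
    print(np.abs(grad_moreau_abs(xs, lam)))  # norm of the envelope gradient
    print(dist_zero_to_subdiff_abs(x_hat))   # lower bound from (13)
```

A small envelope gradient at $x$ therefore certifies that the nearby point $\hat{x}$ is nearly stationary for $\Phi$, which is the reason the Moreau envelope serves as the stationarity measure in the weakly convex case.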

For completeness, we formally define the stationary point here. Note that the generalized gradient is defined on while the Moreau envelope is defined on the whole domain .

###### Definition A.2 (Stationary Point)

Let , for an -smooth function , we call a point an -stationary point of if , where is the gradient mapping (or generalized gradient) defined as ; for an -weakly convex function , we call a point an -(nearly)-stationary point of if .

## Appendix B Proof of Theorem 3.1

Proof. To derive the desired generalization bounds, we take an -net on so that there exists a for any such that . Note that such -net exists with for compact [kleywegt2002sample]. Utilizing the definition of the -net, we have

 Emaxx∈X∥∇ΦS(x)−∇Φ(x)∥≤Emaxx∈X[∥∇ΦS(x)−∇ΦS(xk