1 Introduction
In this paper, we consider nonconvex stochastic minimax problems:
(1) $\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} F(x,y) := \mathbb{E}_{\xi \sim \mathcal{P}}[f(x,y;\xi)],$
where $\mathcal{X} \subseteq \mathbb{R}^{d}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d'}$ are two nonempty closed convex sets,
$\xi$ is a random variable following an unknown distribution
$\mathcal{P}$, and $f$ is continuously differentiable and Lipschitz smooth jointly in $x$ and $y$ for any $\xi$. We refer to (1) as the population minimax problem. Throughout the paper, we focus on the case where $F$ is nonconvex in $x$ and (strongly) concave in $y$, i.e., the nonconvex-(strongly)-concave (NCSC / NCC) setting. Such problems appear widely in practical applications such as adversarial training (madry2018towards; wang2019convergence), generative adversarial networks (GANs)
(goodfellow2014generative; sanjabi2018convergence; lei2020sgd), reinforcement learning (dai2017learning; dai2018sbeed; huang2020convergence), and robust training (sinha2018certifying). The distribution $\mathcal{P}$ is often unknown; in practice, one generally only has access to a dataset $S = \{\xi_i\}_{i=1}^{n}$ of $n$ i.i.d. samples from $\mathcal{P}$ and instead solves the following empirical minimax problem:
(2) $\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} F_S(x,y) := \frac{1}{n} \sum_{i=1}^{n} f(x,y;\xi_i).$
Since the functions $F$ and $F_S$ are nonconvex in $x$ and pursuing their global optimal solutions is intractable in general, one instead aims to design an algorithm $\mathcal{A}$ that finds an $\epsilon$-stationary point,
(3) $\mathrm{dist}\big(0, \partial \Phi(\hat{x}_S)\big) \le \epsilon,$
where $\Phi(x) := \max_{y \in \mathcal{Y}} F(x,y)$ and $\Phi_S(x) := \max_{y \in \mathcal{Y}} F_S(x,y)$ are the primal functions, $\hat{x}_S$ is the $x$-component of the output of any algorithm for solving (2), and $\partial$ denotes the (Fréchet) subdifferential. When $\Phi$ is nonsmooth, we resort to the gradient norm of its Moreau envelope to measure first-order stationarity, as it provides an upper bound on the near-stationarity of $\Phi$ (davis2019stochastic).
Take the NCSC setting as an example. The optimization error for solving the population minimax problem (1) consists of two terms^2:
(4) $\|\nabla \Phi(\hat{x}_S)\| \le \|\nabla \Phi_S(\hat{x}_S)\| + \|\nabla \Phi(\hat{x}_S) - \nabla \Phi_S(\hat{x}_S)\|,$
where the first term on the right-hand side corresponds to the optimization error of solving the empirical minimax problem (2) and the second term corresponds to the generalization error. Such a decomposition of the gradient norm has been studied recently in nonconvex minimization, e.g., foster2018uniform; mei2018landscape; davis2022graphical. Recently, a line of work has developed efficient algorithms for solving empirical minimax problems, which gives a handle on the optimization error; see, e.g., (luo2020stochastic; yang2020catalyst), just to list a few. However, a full characterization of the generalization error is still lacking.
^2 Here, for simplicity of illustration, we assume there is no constraint and the primal functions are differentiable; the detailed setting is formally introduced in Section 2.
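To make the error decomposition above concrete, the following self-contained sketch builds a toy unconstrained NCSC instance of our own construction (it is not from the paper): $f(x,y;\xi) = \cos(x) + (x+\xi)y - y^2$, whose primal functions admit closed forms, and checks the triangle-inequality decomposition numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy NCSC instance: f(x, y; xi) = cos(x) + (x + xi) * y - y**2
# (nonconvex in x, 2-strongly concave in y). With E[xi] = 0, maximizing
# over y gives the population primal Phi(x) = cos(x) + x**2 / 4, so
# grad Phi(x) = -sin(x) + x / 2.
def grad_phi_pop(x):
    return -np.sin(x) + x / 2.0

def grad_phi_emp(x, xi):
    # Empirical primal: maximizing over y gives y = (x + mean(xi)) / 2.
    return -np.sin(x) + (x + xi.mean()) / 2.0

n = 10_000
xi = rng.normal(size=n)
x_hat = 1.3  # stand-in for the output of some empirical solver

pop = abs(grad_phi_pop(x_hat))                        # population stationarity
emp = abs(grad_phi_emp(x_hat, xi))                    # optimization error
gen = abs(grad_phi_pop(x_hat) - grad_phi_emp(x_hat, xi))  # generalization error

# The decomposition: population stationarity <= optimization + generalization.
assert pop <= emp + gen + 1e-12
print(f"population={pop:.4f}  optimization={emp:.4f}  generalization={gen:.4f}")
```

For this toy model the generalization term equals $|\bar{\xi}|/2$, which vanishes as $n$ grows, matching the intuition that the second term in the decomposition is a statistical error.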
Characterizing the generalization error is not easy, as both $\Phi_S$ and $\hat{x}_S$ depend on the dataset $S$, which induces correlation between them. One way to address this dependence issue in generalization bounds is to establish stability arguments for specific algorithms in stochastic optimization (bousquet2002stability; shalev2010learnability; hardt2016train) and stochastic minimax optimization (farnia2021train; lei2021stability; boob2021optimal; yang2022differentially). However, these stability-based generalization bounds have several drawbacks:

Generally, they require a case-by-case analysis for different algorithms, i.e., these bounds are algorithm-dependent.

Existing stability analysis only applies to simple gradient-based algorithms for minimization and minimax problems (note that for minimax optimization, simple algorithms such as stochastic gradient descent ascent often turn out to be suboptimal), and such analysis can be difficult to extend to more sophisticated state-of-the-art algorithms.

Existing stability analysis generally requires specific parameter choices (e.g., stepsizes), which may misalign with those required by the convergence analysis, making the resulting generalization bounds less informative.

Existing stability-based generalization bounds generally use a function-value-based gap as the performance measure, which may not be suitable given the nonconvex landscape.
To the best of our knowledge, there are no generalization bounds measured by first-order stationarity in nonconvex minimax optimization.
To overcome these difficulties, we aim to derive generalization bounds by establishing uniform convergence between the empirical and population minimax problems, i.e., bounding $\sup_{x \in \mathcal{X}} \|\nabla \Phi(x) - \nabla \Phi_S(x)\|$. Note that uniform convergence is invariant to the choice of algorithm and provides an upper bound on the generalization error for any output $\hat{x}_S$; thus, the derived generalization bound is algorithm-agnostic. Although uniform convergence has been extensively studied in the stochastic optimization literature (kleywegt2002sample; mei2018landscape; davis2022graphical), a key difference in minimax optimization is that the primal function $\Phi_S$ cannot be written as an average of i.i.d. random functions, and one needs to additionally characterize the difference between the maximizers $y^*(x) := \arg\max_{y \in \mathcal{Y}} F(x,y)$ and $y^*_S(x) := \arg\max_{y \in \mathcal{Y}} F_S(x,y)$. Thus, techniques from uniform convergence for classical stochastic optimization are not directly applicable.
We are interested in both the sample complexity and the gradient complexity for achieving stationarity convergence on the population minimax problem (1). Here, the sample complexity refers to the number of samples $n$, and the gradient complexity refers to the number of gradient evaluations of $f$. Combining the derived generalization error with the optimization error of existing algorithms for finite-sum nonconvex minimax optimization, e.g., luo2020stochastic; yang2020catalyst; zhang2021complexity, one automatically obtains the sample and gradient complexity bounds of these algorithms for solving the population minimax problem.
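To illustrate what the uniform convergence quantity measures, the Monte Carlo sketch below (again a toy model of our own construction, not from the paper) estimates the worst-case primal-gradient gap over a grid of $x$ values and shows it shrinking as the sample size $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: f(x, y; xi) = cos(x) + (1 + xi) * x * y - y**2 on X = [-1, 1].
# Maximizing over y in closed form:
#   population primal gradient:  -sin(x) + x / 2
#   empirical primal gradient:   -sin(x) + (1 + mean(xi))**2 * x / 2
# so the gap at x is |(1 + mean(xi))**2 - 1| * |x| / 2.
xs = np.linspace(-1.0, 1.0, 201)

def sup_gap(n):
    xi_bar = rng.normal(size=n).mean()
    gap = np.abs((1 + xi_bar) ** 2 - 1) * np.abs(xs) / 2.0
    return gap.max()  # sup over the grid of x values

# Average the worst-case gap over repeated draws of the dataset.
avg = {n: np.mean([sup_gap(n) for _ in range(200)]) for n in (100, 10_000)}
print(avg)
```

With more samples the worst-case gap concentrates toward zero, which is exactly the behavior the uniform convergence bounds quantify.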
1.1 Contributions
Our contributions are twofold:


We establish the first uniform convergence results between the population and empirical problems in nonconvex minimax optimization, in both the NCSC and NCC settings, measured by the gradients of the primal functions (or of their Moreau envelopes). This yields an algorithm-agnostic generalization bound for any algorithm that solves the empirical minimax problem, together with explicit sample complexities for achieving $\epsilon$-uniform convergence and an $\epsilon$-generalization error in the NCSC and NCC settings, respectively.

Combined with algorithms for nonconvex finite-sum minimax optimization, the generalization results further imply gradient complexities for solving NCSC and NCC stochastic minimax problems, respectively; see Table 1 for a summary. In terms of the dependence on the accuracy $\epsilon$ and the condition number $\kappa$, the achieved sample complexities significantly improve over those of SOTA SGD-type algorithms in the literature (luo2020stochastic; yang2022faster; rafique2021weakly; lin2020gradient; boct2020alternating), and the achieved gradient complexities match existing SOTA results. The dependence on the dimension $d$ may be avoided if one directly analyzes SGD-type algorithms, as shown in the stochastic optimization literature (kleywegt2002sample; nemirovski2009robust; davis2022graphical; hu2020sample) and evidenced for both NCSC and NCC minimax problems in our paper.
1.2 Literature Review
Nonconvex Minimax Optimization
In the NCSC setting, many algorithms have been proposed, e.g., nouiehed2019solving; lin2020gradient; lin2020near; luo2020stochastic; yang2020global; boct2020alternating; xu2020unified; lu2020hybrid; yan2020optimal; guo2021novel; sharma2022federated. Among them, (zhang2021complexity) achieved the optimal complexity in the deterministic case by introducing the Catalyst acceleration scheme (lin2015universal; paquette2018catalyst) into minimax problems, and luo2020stochastic; zhang2021complexity achieved the best known complexities in the finite-sum case so far. For purely stochastic NCSC minimax problems, yang2022faster introduced a stochastic smoothed-AGDA algorithm that achieves the best known complexity, while luo2020stochastic achieves the best complexity under the additional assumption of average smoothness. Lower bounds for NCSC problems in the deterministic, finite-sum, and stochastic settings have been extensively studied recently in zhang2021complexity; han2021lower; li2021complexity.
In general, NCC problems are harder than NCSC problems since the primal function can be both nonsmooth and nonconvex (thekumparampil2019efficient). Recent years have witnessed a surge of algorithms for NCC problems in the deterministic, finite-sum, and stochastic settings, e.g., zhang2020single; ostrovskii2021efficient; thekumparampil2019efficient; zhao2020primal; nouiehed2019solving; yang2020catalyst; lin2020gradient; boct2020alternating, to name a few. To the best of our knowledge, thekumparampil2019efficient; yang2020catalyst; lin2020near achieved the best complexity in the deterministic case, yang2020catalyst achieved the best complexity in the finite-sum case, and rafique2021weakly provided the best complexity in the purely stochastic case.
Uniform Convergence
A series of works in stochastic optimization and statistical learning theory studied uniform convergence of the worst-case difference between the population objective
and its empirical counterpart constructed via sample average approximation (SAA, also known as empirical risk minimization). Interested readers may refer to prominent results in statistical learning (fisher1922mathematical; vapnik1999overview; van2000asymptotic). For finite-dimensional problems, kleywegt2002sample characterized the sample complexity needed to achieve $\epsilon$-uniform convergence with high probability. For nonconvex empirical objectives, mei2018landscape and davis2022graphical established sample complexities for uniform convergence measured by stationarity for nonconvex smooth and weakly convex functions, respectively. For infinite-dimensional functional stochastic optimization with a finite VC-dimension, uniform convergence still holds (vapnik1999overview). In addition, wang2017differentially used uniform convergence to demonstrate the generalization and gradient complexity of differentially private algorithms for stochastic optimization.
Stability-Based Generalization Bounds
Another line of research focuses on generalization bounds for stochastic optimization via the uniform stability of specific algorithms, including SAA (bousquet2002stability; shalev2009stochastic), stochastic gradient descent (hardt2016train; bassily2020stability), and uniformly stable algorithms (klochkov2021stability). Recently, a series of works further extended the analysis to understand the generalization performance of various algorithms in minimax problems. farnia2021train gave generalization bounds for the outputs of gradient descent ascent (GDA) and the proximal point algorithm (PPA) in both (strongly)-convex-(strongly)-concave and nonconvex-nonconcave smooth minimax problems. lei2021stability focused on GDA and provided a comprehensive study of different minimax settings with various generalization measures based on function-value gaps. boob2021optimal provided stability and generalization results for the extragradient (EG) algorithm in the smooth convex-concave setting. On the other hand, zhang2021generalization studied stability and generalization of the empirical minimax problem in the (strongly)-convex-(strongly)-concave setting, assuming that one can find the optimal solution of the empirical minimax problem.
2 Problem Setting
Notations
Throughout the paper, we use $\|\cdot\|$ to denote the $\ell_2$ norm and $\nabla g$ the gradient of a function $g$; for nonnegative functions $f$ and $g$, we write $f = O(g)$ if $f \le C g$ for some constant $C > 0$. We denote $\mathrm{proj}_{\mathcal{X}}$ as the projection operator onto $\mathcal{X}$. Let $\hat{x}_S$ denote the output of an algorithm $\mathcal{A}$ on the empirical minimax problem (2) with dataset $S$. Given $\mu > 0$, we say a function $g$ is $\mu$-strongly convex if $g - \frac{\mu}{2}\|\cdot\|^2$ is convex, and $\mu$-strongly concave if $-g$ is $\mu$-strongly convex. A function $g$ is $\rho$-weakly convex if $g + \frac{\rho}{2}\|\cdot\|^2$ is convex (see more notations and standard definitions in Appendix A).
Definition 2.1 (Smooth Function)
We say a function $f$ is $\ell$-smooth jointly in $(x,y)$ if $f$ is continuously differentiable and there exists a constant $\ell > 0$ such that for any $(x,y)$ and $(x',y')$, we have $\|\nabla_x f(x,y) - \nabla_x f(x',y')\| \le \ell \|(x,y) - (x',y')\|$ and $\|\nabla_y f(x,y) - \nabla_y f(x',y')\| \le \ell \|(x,y) - (x',y')\|$.
By definition, it is easy to see that an $\ell$-smooth function is also $\ell$-weakly convex. Next, we introduce the main assumptions used throughout the paper.
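The claim that smoothness implies weak convexity admits a short derivation; the following sketch (a standard argument, not specific to this paper) spells it out.

```latex
% \ell-smoothness gives the quadratic lower bound
%   g(z') \ge g(z) + \langle \nabla g(z), z' - z \rangle - \tfrac{\ell}{2}\|z' - z\|^2.
% Let h(z) := g(z) + \tfrac{\ell}{2}\|z\|^2, so \nabla h(z) = \nabla g(z) + \ell z. Then
\begin{aligned}
h(z') - h(z) - \langle \nabla h(z),\, z' - z\rangle
&= \Big( g(z') - g(z) - \langle \nabla g(z),\, z' - z\rangle \Big)
   + \tfrac{\ell}{2}\,\|z' - z\|^2 \\
&\ge -\tfrac{\ell}{2}\,\|z' - z\|^2 + \tfrac{\ell}{2}\,\|z' - z\|^2 = 0,
\end{aligned}
% so h satisfies the first-order convexity condition, i.e., g is \ell-weakly convex.
```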
Assumption 2.1 (Main Settings)
We assume the following:


The function $f(\cdot,\cdot;\xi)$ is $\ell$-smooth jointly in $(x,y)$ for any $\xi$.

The function $f(x,\cdot;\xi)$ is $\mu$-strongly concave in $y$ for any $x$ and any $\xi$, where $\mu \ge 0$.

The gradient norms of $\nabla_x f$ and $\nabla_y f$ are uniformly bounded, respectively, for any $\xi$.

The domains $\mathcal{X}$ and $\mathcal{Y}$ are compact convex sets, i.e., there exist constants $D_{\mathcal{X}}, D_{\mathcal{Y}} > 0$ such that $\|x - x'\| \le D_{\mathcal{X}}$ for any $x, x' \in \mathcal{X}$ and $\|y - y'\| \le D_{\mathcal{Y}}$ for any $y, y' \in \mathcal{Y}$, respectively.
Note that the compact domain assumption is widely used in the uniform convergence literature (kleywegt2002sample; davis2022graphical).
Under Assumption 2.1, the objective function $F$ is $\ell$-smooth and $\mu$-strongly concave in $y$ for any $x$. When $\mu > 0$, we call the population minimax problem (1) a nonconvex-strongly-concave (NCSC) minimax problem; when $\mu = 0$, we call it a nonconvex-concave (NCC) minimax problem.
Definition 2.2 (Moreau Envelope)
For an $\ell$-weakly convex function $\Phi$ and $\lambda \in (0, \ell^{-1})$, we use $\Phi_\lambda(x)$ and $\mathrm{prox}_{\lambda\Phi}(x)$ to denote the Moreau envelope of $\Phi$ and the proximal point of $\Phi$ at a given point $x$, defined as follows:
(5) $\Phi_\lambda(x) := \min_{z} \Big\{ \Phi(z) + \frac{1}{2\lambda}\|z - x\|^2 \Big\}, \qquad \mathrm{prox}_{\lambda\Phi}(x) := \arg\min_{z} \Big\{ \Phi(z) + \frac{1}{2\lambda}\|z - x\|^2 \Big\}.$
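The envelope and proximal point are easy to compute for simple one-dimensional examples. The sketch below (our illustration; it uses $\varphi(x) = |x|$, which is convex and hence trivially weakly convex) cross-checks the closed-form soft-threshold proximal point against brute-force minimization, and evaluates the envelope gradient $\nabla \varphi_\lambda(x) = (x - \mathrm{prox}_{\lambda\varphi}(x))/\lambda$.

```python
import numpy as np

# Moreau envelope of phi(x) = |x| with parameter lam.
# For this phi, the proximal point has the closed-form soft-threshold.
lam = 0.5

def prox(x):
    return np.sign(x) * max(abs(x) - lam, 0.0)

def moreau_grad(x):
    # Gradient of the envelope: (x - prox(x)) / lam.
    return (x - prox(x)) / lam

# Cross-check the closed form against brute-force minimization of
# z -> |z| + (z - x)^2 / (2 * lam) on a fine grid.
x0 = 2.0
zs = np.linspace(-4.0, 4.0, 80_001)
z_star = zs[np.argmin(np.abs(zs) + (zs - x0) ** 2 / (2 * lam))]

assert abs(z_star - prox(x0)) < 1e-3
print(prox(x0), moreau_grad(x0))  # 1.5 1.0
```

Note that the envelope gradient at $x_0 = 2$ equals $1$, the (sub)gradient of $|x|$ at the proximal point, illustrating why a small envelope gradient certifies near-stationarity.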
Below we recall some important properties on the primal function and its Moreau envelope presented in the literature (davis2019stochastic; thekumparampil2019efficient; lin2020gradient).
Lemma 2.1 (Properties of and )
In the NCSC setting ($\mu > 0$), both $\Phi$ and $\Phi_S$ are smooth, with smoothness constant governed by the condition number $\kappa := \ell/\mu$. In the NCC setting ($\mu = 0$), the primal function $\Phi$ is $\ell$-weakly convex, and its Moreau envelope $\Phi_\lambda$ is differentiable and Lipschitz smooth; also $\nabla \Phi_\lambda(x) = \lambda^{-1}(x - \hat{x})$ and $\mathrm{dist}(0, \partial \Phi(\hat{x})) \le \|\nabla \Phi_\lambda(x)\|$, where $\hat{x} := \mathrm{prox}_{\lambda\Phi}(x)$ and $\lambda \in (0, \ell^{-1})$.
Performance Measurement
In the NCSC setting, the primal functions $\Phi$ and $\Phi_S$ are both smooth. To account for the constraint, we measure the difference between the population and empirical minimax problems using the generalized gradients of the population and empirical primal functions, i.e., $G_\Phi$ and $G_{\Phi_S}$, where $G_\Phi(x) := \tau^{-1}\big(x - \mathrm{proj}_{\mathcal{X}}(x - \tau \nabla \Phi(x))\big)$ for a stepsize $\tau > 0$. The following inequality summarizes the relationship between the measurement in terms of the generalized gradient and the measurement in terms of the gradient used in Section 1:
(6) $\|G_\Phi(\hat{x}_S) - G_{\Phi_S}(\hat{x}_S)\| \le \|\nabla \Phi(\hat{x}_S) - \nabla \Phi_S(\hat{x}_S)\| \le \sup_{x \in \mathcal{X}} \|\nabla \Phi(x) - \nabla \Phi_S(x)\|,$
where the first inequality holds since the projection is a nonexpansive operator. The term on the left-hand side (LHS) above is the generalization error of an algorithm that we desire in the NCSC case.
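Both ingredients used here, the gradient mapping and the nonexpansiveness of the projection, can be sanity-checked numerically. The sketch below is our own toy setup (with $\mathcal{X} = [-1, 1]$ and an arbitrary stepsize $\tau$), purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Projection onto X = [-1, 1] and the generalized gradient (gradient mapping)
# G_tau(x) = (x - proj_X(x - tau * grad(x))) / tau.
def proj(x, lo=-1.0, hi=1.0):
    return np.clip(x, lo, hi)

def grad_map(x, grad, tau=0.1):
    return (x - proj(x - tau * grad(x))) / tau

# Nonexpansiveness of the projection, the fact behind inequality (6):
a, b = rng.normal(size=1000), rng.normal(size=1000)
assert np.all(np.abs(proj(a) - proj(b)) <= np.abs(a - b) + 1e-12)

# When the projected step stays in the interior, the gradient mapping
# reduces to the gradient itself.
g = lambda x: 0.5 * x  # gradient of a toy primal function
assert abs(grad_map(0.2, g) - g(0.2)) < 1e-12
```

The second assertion shows why, in the unconstrained case, measuring stationarity by the gradient mapping coincides with measuring it by the plain gradient.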
For the NCC case, the primal function $\Phi$ is weakly convex, and we use the gradient of its Moreau envelope to characterize (near-)stationarity (davis2019stochastic). We measure the proximity between the population and empirical problems by the difference between the gradients of their respective Moreau envelopes. The generalization error and the uniform convergence in the NCC case are given as follows:
(7) $\|\nabla \Phi_\lambda(\hat{x}_S) - \nabla (\Phi_S)_\lambda(\hat{x}_S)\| \le \sup_{x \in \mathcal{X}} \|\nabla \Phi_\lambda(x) - \nabla (\Phi_S)_\lambda(x)\|.$
The term on the LHS above is the generalization error of an algorithm that we desire in the NCC case.
3 Uniform Convergence and Generalization Bounds
In this section, we discuss the sample complexity for achieving uniform convergence and generalization error for NCSC and NCC stochastic minimax optimization.
3.1 NCSC Stochastic Minimax Optimization
Under the NCSC setting, the following theorem demonstrates uniform convergence between the gradients of the primal functions of the population and empirical minimax problems, which provides an upper bound on the generalization error for any algorithm $\mathcal{A}$. We defer the proof to Appendix B.
Theorem 3.1 (Uniform Convergence and Generalization Error, NCSC)
Under Assumption 2.1 with $\mu > 0$, we have
(8) 
Furthermore, to achieve $\epsilon$-uniform convergence and an $\epsilon$-generalization error for any algorithm $\mathcal{A}$, it suffices to have
(9) 
To the best of our knowledge, this is the first uniform convergence and algorithm-agnostic generalization error bound for NCSC stochastic minimax problems. In comparison, existing works on generalization error analysis (farnia2021train; lei2021stability) utilize stability arguments for certain algorithms and are thus algorithm-specific. zhang2021generalization established algorithm-agnostic stability and generalization in the strongly-convex-strongly-concave regime, yet their analysis does not extend to the nonconvex regime. Our generalization results apply to any algorithm for solving finite-sum problems, in particular SOTA algorithms such as Catalyst-SVRG (zhang2021complexity) and the finite-sum version of SREDA (luo2020stochastic). These algorithms are generally quite complicated and lack stability-based generalization analyses.
The achieved sample complexity further implies that for any algorithm that finds an $\epsilon$-stationary point of the empirical minimax problem, the corresponding sample complexity for finding an $\epsilon$-stationary point of the population minimax problem follows. In terms of the dependence on the accuracy $\epsilon$ and the condition number $\kappa$, this sample complexity is better than the SOTA sample complexities achieved by directly applying gradient-based methods to the population minimax problem, i.e., by Stochastic Smoothed-AGDA (yang2022faster) and by SREDA (luo2020stochastic).
3.2 NCC Stochastic Minimax Optimization
In this subsection, we derive uniform convergence and algorithm-agnostic generalization bounds for NCC stochastic minimax problems in the following theorem. Recall that in this setting the primal function $\Phi$ is weakly convex (thekumparampil2019efficient) and its gradient is not well-defined everywhere. We therefore use the gradient of the Moreau envelope of the primal function as the measurement (davis2019stochastic).
Theorem 3.2 (Uniform Convergence and Generalization Error, NCC)
Under Assumption 2.1 with $\mu = 0$, we have
(10) 
Furthermore, to achieve $\epsilon$-uniform convergence and an $\epsilon$-generalization error for any algorithm $\mathcal{A}$, it suffices to have
(11) 
Proof Sketch
The analysis of Theorem 3.2 consists of three parts, starting from the expression of the gradient of the Moreau envelope of the primal function.
We first use a covering-net argument (vapnik1999overview) to handle the dependence between the empirical primal function and the algorithm output.
Then we build a connection between NCC and NCSC stochastic minimax problems by adding a small strongly concave regularizer and carefully choosing the regularization parameter. The following lemma characterizes the distance between the proximal points of the primal function of the original NCC problem and of the regularized NCSC problem. Note that the lemma may be of independent interest for the design and analysis of gradient-based methods for NCC problems.
Lemma 3.1
For $\sigma > 0$, denote $\Phi^\sigma$ as the primal function of the regularized problem. It then holds that
This lemma implies that for a small regularization parameter $\sigma$, the difference between the proximal point of the primal function of the NCC problem and that of the regularized NCSC problem is small.
Proof Since $f$ is $\ell$-smooth, the regularized objective is Lipschitz smooth as well. By (thekumparampil2019efficient, Lemma 3), the primal function is weakly convex in $x$; therefore, the corresponding proximal objective is strongly convex for any fixed $x$. Denote
(12) 
It holds that
where the first inequality holds by strong convexity and the optimality of the minimizer; the first equality holds by definition; the second inequality holds by optimality; the third inequality holds by optimality; the second equality holds by definition; the fourth inequality holds by optimality; and the last inequality holds by the compactness of the domain, which concludes the proof.
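Lemma 3.1 can be illustrated numerically on a toy instance of our own construction (not from the paper): the NCC problem $F(x,y) = xy$ on $\mathcal{X} = \mathcal{Y} = [-1,1]$, whose primal function is $\Phi(x) = |x|$, regularized to $F_\sigma(x,y) = xy - \frac{\sigma}{2}y^2$. The sketch computes proximal points of both primal functions by brute force and checks that they get closer as $\sigma$ shrinks.

```python
import numpy as np

# Toy NCC problem on X = Y = [-1, 1]: F(x, y) = x * y, so Phi(x) = |x|.
# Regularized NCSC version: F_sig(x, y) = x * y - (sig / 2) * y**2.
ys = np.linspace(-1.0, 1.0, 401)
zs = np.linspace(-1.0, 1.0, 2001)
lam, x0 = 0.25, 0.3  # Moreau parameter and query point (arbitrary choices)

def prox_of_primal(sig):
    # Primal function by maximizing over a y-grid, then the proximal point
    # by minimizing phi(z) + (z - x0)^2 / (2 * lam) over a z-grid.
    phi = np.max(zs[:, None] * ys[None, :] - 0.5 * sig * ys[None, :] ** 2, axis=1)
    return zs[np.argmin(phi + (zs - x0) ** 2 / (2 * lam))]

p0 = prox_of_primal(0.0)  # proximal point of the original NCC primal
gaps = [abs(prox_of_primal(s) - p0) for s in (0.5, 0.05)]
assert gaps[0] >= gaps[1]  # smaller sigma => closer proximal points
print(p0, gaps)
```

For this instance the unregularized proximal point is the soft-threshold value $0.05$, and the gap to the regularized proximal point vanishes as $\sigma \to 0$, matching the message of the lemma.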
It remains to characterize the distance between the proximal points of the population and empirical primal functions and to show that the relevant quantity is a sub-Gaussian random variable. By definition, this distance is equivalent to the difference between the optimal solutions of a strongly-convex-strongly-concave (SCSC) population minimax problem and those of its empirical counterpart. We utilize existing stability-based results for SCSC minimax optimization (zhang2021generalization) to upper bound this distance and to show that the quantity is sub-Gaussian. The proof of Theorem 3.2 is deferred to Appendix C.
To the best of our knowledge, this is the first algorithm-agnostic generalization error result for NCC stochastic minimax optimization. Similar to the NCSC setting, Theorem 3.2 gives the sample complexity needed to guarantee an $\epsilon$-generalization error in the NCC case for any algorithm $\mathcal{A}$. In comparison, for small accuracy $\epsilon$ and moderate dimension $d$, it is much better than the sample complexity achieved by the SOTA stochastic approximation-based algorithms (rafique2021weakly) for NCC stochastic minimax optimization.
Remark 3.1 (Comparison Between Minimization, NCSC, and NCC Settings)
For general stochastic nonconvex optimization, the sample complexity of achieving uniform convergence is known (davis2022graphical; mei2018landscape). There are two key differences in minimax optimization.

The primal function is not an average over samples, so existing analyses for minimization problems are not directly applicable. If instead we consider uniform convergence in terms of the gradient of the joint objective, i.e., $\sup_{x,y} \|\nabla F(x,y) - \nabla F_S(x,y)\|$, the existing analysis in mei2018landscape directly gives a sample complexity bound.

For a given $x$, the optimal point $y^*_S(x)$ differs from $y^*(x)$, and this difference introduces an additional error term. In the NCSC case, this error is of the same scale as the error from the gradient difference, so the eventual uniform convergence bound is of the same order as that for minimization problems (mei2018landscape; davis2022graphical). However, in the NCC case, $y^*(x)$ may not be well-defined. Instead, we bound the distance between the proximal points of the NCC primal function and its regularized counterpart for a small regularization parameter. This extra error is controlled by the regularization, and thus the sample complexity for achieving uniform convergence in the NCC case is larger than that of the NCSC case.
We leave it for future investigation whether one can achieve a smaller sample complexity in the NCC case via a better characterization of the extra error introduced by the regularization in the NCC setting.
3.3 Gradient Complexity for Solving Stochastic Nonconvex Minimax Optimization
The uniform convergence and algorithm-agnostic generalization error results shed light on the tightness of the complexity of algorithms for solving stochastic minimax optimization. We summarize related results in Table 1 and elaborate on the details in this subsection.
Combining the sample complexities for achieving an $\epsilon$-generalization error with the gradient complexities of existing algorithms for solving empirical minimax problems, we directly obtain the gradient complexities of these algorithms for solving population minimax problems. The SOTA gradient complexities for solving NCSC empirical problems are achieved by (luo2020stochastic)^3 and (zhang2021complexity), while for NCC empirical problems the SOTA is achieved by (yang2020catalyst). We substitute the required sample size given by Theorems 3.1 and 3.2 to get the corresponding gradient complexity for solving the population minimax problem (1). Recalling the definition of (near-)stationarity, the next theorem states the achieved gradient complexities; we defer the proof to Appendix D.
^3 This gradient complexity holds under the sample-size condition mentioned in luo2020stochastic; our sample complexity result in Theorem 3.1 aligns with that requirement. Also, the results therein assume average smoothness, which is a weaker condition than the individual smoothness assumed in our paper.
Theorem 3.3 (Gradient Complexity of Specific Algorithms)
Under Assumption 2.1, we have:

Dependence on the Dimension
The gradient complexities obtained in Theorem 3.3 come with a dependence on the dimension $d$, which stems from the uniform convergence argument, as it bounds the error at the worst-case $x$. On the other hand, achieving a small optimization error on the population minimax problem only requires a small generalization error at the specific output $\hat{x}_S$. Thus, the gradient complexity obtained from uniform convergence has its own limitations.
Nevertheless, the obtained sample and gradient complexities are still meaningful in terms of their dependence on the accuracy $\epsilon$. In addition, we point out that the dependence on $d$ can generally be avoided if one directly analyzes SGD-type methods for the population minimax problem. In various settings, the complexity bound of SAA carries a dependence on the dimension while some SGD-type algorithms enjoy dimension-free gradient complexities; see kleywegt2002sample and nemirovski2009robust for classical stochastic convex optimization, and davis2022graphical and davis2019stochastic for stochastic nonconvex optimization.
On the other hand, several structured machine learning models enjoy dimension-free uniform convergence results (davis2022graphical; foster2018uniform; mei2018landscape). We leave the investigation of dimension-free uniform convergence for specific applications with nonconvex minimax structure as a future direction.
Matching SGD-Type Algorithms in Stochastic Nonconvex Minimax Problems
In fact, the claim that one can remove the dependence on $d$ via SGD-type algorithm analysis is already verified. In NCSC stochastic minimax optimization, the stochastic version of the SREDA algorithm in luo2020stochastic achieves a gradient complexity that matches the first bullet point in Theorem 3.3 except for the dependence on the dimension. In the NCC case, the PG-SMD algorithm proposed in rafique2021weakly achieves a gradient complexity that matches the second result in Theorem 3.3 while being free of the dimension dependence.
We point out that the discussion above relies on the gradient complexities of existing SOTA algorithms for NCSC and NCC finite-sum minimax optimization, which may not be sharp in terms of the dependence on the condition number $\kappa$ or the sample size $n$. It is still possible to further improve the gradient complexity by designing faster algorithms for solving empirical nonconvex minimax optimization problems.
Remark 3.2 (Tightness of Lower and Upper Complexity Bounds)
In the NCSC setting, zhang2021complexity provides a lower complexity bound for NCSC finite-sum problems under the average smoothness assumption, which is strictly lower than the SOTA upper bounds in (luo2020stochastic; zhang2021complexity). If the lower bound is sharp in our setting, then with the error decomposition (4) we can conjecture that there exists an algorithm for solving NCSC stochastic minimax problems with a better gradient complexity. For the NCC setting, there is no specific lower complexity bound^4, so it remains an open question whether the current SOTA complexity in yang2020catalyst is optimal, whether one can design an algorithm with improved complexity, and what the lower complexity bound of the NCC setting is.
^4 To the best of our knowledge, there is currently no lower bound result specifically for NCC minimax optimization. The existing lower bounds for nonconvex minimization (carmon2019lower; carmon2019lowerII; fang2018spider; zhou2019lower; arjevani2019lower) and NCSC minimax problems (zhang2021complexity; han2021lower; li2021complexity) are trivial lower bounds for nonconvex minimax problems.
On the other hand, there are two SOTA gradient complexity bounds for NCSC finite-sum problems, due to (luo2020stochastic) and (zhang2021complexity): the former has a better dependence on the sample size $n$, and the latter a better dependence on the condition number $\kappa$. With the latter upper bound, the error decomposition (4) implies a gradient complexity that is clearly suboptimal in terms of the dependence on the accuracy $\epsilon$. Note that the $\epsilon$-dependence of the gradient complexity induced by the former upper bound matches the lower bound in nonconvex smooth optimization (arjevani2019lower). We conjecture that the gradient complexity achieved in luo2020stochastic has an optimal dependence on the sample size $n$, as it provides a matching dependence on the accuracy with the lower bound when combined with our uniform convergence result.
4 Conclusion
In this paper, we take an initial step towards understanding the uniform convergence and the corresponding generalization performance of NCSC and NCC minimax problems, measured by first-order stationarity. We hope that this work will shed light on the design of algorithms with improved complexities for solving stochastic nonconvex minimax optimization.
Several future directions are worth further investigation. It remains interesting to see whether the uniform convergence results can be improved under the NCC setting, particularly the dependence on the accuracy $\epsilon$. In addition, is it possible to design algorithms for the NCSC finite-sum setting with better complexities and close the gap to the lower bound? In terms of generalization bounds, it remains open to derive algorithm-specific stability-based generalization bounds under the stationarity measurement.
References
Appendix A Additional Definitions and Tools
For convenience, we summarize the notations commonly used throughout the paper.


Population minimax problem and its primal function^5
^5 Another commonly used convergence criterion in minimax optimization is the first-order stationarity of the joint objective (or its corresponding gradient mapping) [lin2020gradient, xu2020unified]. We refer readers to lin2020gradient, yang2022faster for a thorough comparison of these two measurements. In this paper, we always stick to convergence measured by the stationarity of the primal function.

Empirical minimax problem and its primal function

Moreau envelope and corresponding proximal point:

: generalized gradient (gradient mapping) of a function .

: norm.

: the gradient of a function .

: the projection operator.

: the output of an algorithm on the empirical minimax problem (2) with dataset .

NC / WC: nonconvex, weakly convex.

NCSC / NCC: nonconvex-strongly-concave / nonconvex-concave.

SOTA: stateoftheart.

: dimension number of .

$\kappa$: condition number $\kappa = \ell/\mu$; $\ell$: Lipschitz smoothness parameter; $\mu$: strong concavity parameter.

$\tilde{O}(\cdot)$: hides polylogarithmic factors.

$f = O(g)$ if $f \le C g$ for some constant $C > 0$ and nonnegative functions $f$ and $g$.

We say a function $g$ is convex if for any $x, x'$ and $\theta \in [0,1]$, we have $g(\theta x + (1-\theta)x') \le \theta g(x) + (1-\theta) g(x')$.

A function $g$ is $\ell$-smooth^6 if $g$ is continuously differentiable and there exists a constant $\ell > 0$ such that $\|\nabla g(x) - \nabla g(x')\| \le \ell \|x - x'\|$ holds for any $x, x'$.
^6 Here the smoothness definition for single-variable functions is subtly different from that for two-variable functions in Definition 2.1, so we list it here for completeness.
For completeness, we introduce the definition of a subGaussian random variable and related lemma, which are important tools in the analysis.
Definition A.1 (Sub-Gaussian Random Variable)
A random variable $X$
is a zero-mean sub-Gaussian random variable with variance proxy $\sigma^2$
if $\mathbb{E}[X] = 0$ and either of the following two conditions holds:
We use the following McDiarmid's inequality to show that a random variable is sub-Gaussian.
Lemma A.1 (McDiarmid’s inequality)
Let $X_1, \dots, X_n$ be independent random variables, and let $g$ be any function with the bounded differences property: for every $i$ and every pair of inputs $x$ and $x'$ that differ only in the $i$-th coordinate ($x_j = x'_j$ for all $j \neq i$), we have $|g(x) - g(x')| \le c_i$.
Then for any $t > 0$, it holds that $\Pr\big(|g(X_1,\dots,X_n) - \mathbb{E}\, g(X_1,\dots,X_n)| \ge t\big) \le 2 \exp\Big(\frac{-2t^2}{\sum_{i=1}^n c_i^2}\Big).$
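As a sanity check on McDiarmid's inequality, the following sketch (our illustration) applies it to the sample mean of bounded variables, where the bounded differences property holds with $c_i = 1/n$, and compares the bound against an empirical tail estimate.

```python
import numpy as np

rng = np.random.default_rng(3)

# McDiarmid for the sample mean of [0, 1]-valued variables: changing one
# coordinate moves the mean by at most c_i = 1/n, so
#   P(|mean - E mean| >= t) <= 2 * exp(-2 * t**2 / sum(c_i**2))
#                            = 2 * exp(-2 * n * t**2).
n, trials, t = 200, 20_000, 0.1
means = rng.uniform(size=(trials, n)).mean(axis=1)
emp_tail = np.mean(np.abs(means - 0.5) >= t)
bound = 2 * np.exp(-2 * n * t ** 2)

assert emp_tail <= bound  # the concentration bound holds empirically
print(emp_tail, bound)
```

The empirical tail is far below the bound here; McDiarmid is conservative but dimension- and distribution-free, which is what the sub-Gaussian argument in the analysis needs.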
Lemma A.2 (Properties of $\Phi$ and $\Phi_\lambda$, Restated)
In the NCSC setting ($\mu > 0$), both $\Phi$ and $\Phi_S$ are smooth with the condition number $\kappa$, and both are Lipschitz continuous. In the NCC setting ($\mu = 0$), the primal function $\Phi$ is $\ell$-weakly convex, and its Moreau envelope $\Phi_\lambda$ is differentiable and Lipschitz smooth; also
(13) $\nabla \Phi_\lambda(x) = \lambda^{-1}(x - \hat{x}), \qquad \mathrm{dist}\big(0, \partial \Phi(\hat{x})\big) \le \|\nabla \Phi_\lambda(x)\|,$
where $\hat{x} := \mathrm{prox}_{\lambda\Phi}(x)$ and $\lambda \in (0, \ell^{-1})$.
For completeness, we formally define the stationary point here. Note that the generalized gradient is defined on the constrained domain $\mathcal{X}$, while the Moreau envelope is defined on the whole space.
Definition A.2 (Stationary Point)
Let $\epsilon > 0$. For an $\ell$-smooth function $\Phi$, we call a point $x$ an $\epsilon$-stationary point of $\Phi$ if $\|G_\Phi(x)\| \le \epsilon$, where $G_\Phi$ is the gradient mapping (or generalized gradient) defined as $G_\Phi(x) := \tau^{-1}\big(x - \mathrm{proj}_{\mathcal{X}}(x - \tau \nabla \Phi(x))\big)$; for an $\ell$-weakly convex function $\Phi$, we call a point $x$ an $\epsilon$-(nearly-)stationary point of $\Phi$ if $\|\nabla \Phi_\lambda(x)\| \le \epsilon$.
Appendix B Proof of Theorem 3.1
Proof To derive the desired generalization bounds, we take an $\epsilon_0$-net on $\mathcal{X}$ so that for any $x \in \mathcal{X}$ there exists a net point $x'$ with $\|x - x'\| \le \epsilon_0$. Note that such a net exists for compact $\mathcal{X}$ [kleywegt2002sample]. Utilizing the definition of the net, we have