# On Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions

We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions. In particular, we study the class of Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions for which the chain rule of calculus holds. This class contains important examples such as ReLU neural networks and others with non-differentiable activation functions. First, we show that finding an ϵ-stationary point with first-order methods is impossible in finite time. Therefore, we introduce the notion of (δ, ϵ)-stationarity, a generalization that allows for a point to be within distance δ of an ϵ-stationary point and reduces to ϵ-stationarity for smooth functions. We propose a series of randomized first-order methods and analyze their complexity of finding a (δ, ϵ)-stationary point. Furthermore, we provide a lower bound and show that our stochastic algorithm has min-max optimal dependence on δ. Empirically, our methods perform well for training ReLU neural networks.


## 1 Introduction

Gradient-based optimization underlies most of machine learning and has attracted tremendous research attention over the years. While non-asymptotic complexity analysis of gradient-based methods is well established for convex and smooth nonconvex problems, little is known for nonsmooth nonconvex problems. We summarize the known rates (black) in Table 1 based on the references (Nesterov, 2018; Carmon et al., 2017; Arjevani et al., 2019). Within the nonsmooth nonconvex setting, recent research results have focused on asymptotic convergence analysis (Benaïm et al., 2005; Kiwiel, 2007; Majewski et al., 2018; Davis et al., 2018; Bolte and Pauwels, 2019). Despite their advances, these results fail to address finite-time, non-asymptotic convergence rates. Given the widespread use of nonsmooth nonconvex problems in machine learning, a canonical example being deep ReLU neural networks, obtaining a non-asymptotic convergence analysis is an important open problem of fundamental interest.

We tackle this problem for nonsmooth functions that are Lipschitz and directionally differentiable. This class is rich enough to cover common machine learning problems, including ReLU neural networks. Surprisingly, even for this seemingly restricted class, finding an $\epsilon$-stationary point, i.e., a point $x$ for which $d(0, \partial f(x)) \le \epsilon$, is intractable. In other words, no algorithm can guarantee to find an $\epsilon$-stationary point within a finite number of iterations. This intractability suggests that, to obtain meaningful non-asymptotic results, we need to refine the notion of stationarity. We introduce such a notion and base our analysis on it, leading to the following main contributions of the paper:

• We show that a traditional $\epsilon$-stationary point cannot be obtained in finite time (Theorem 5).

• We study the notion of $(\delta, \epsilon)$-stationary points (see Definition 4). For smooth functions, this notion reduces to usual $\epsilon$-stationarity by setting $\delta = O(\epsilon/L)$. We provide an $\Omega(\delta^{-1})$ lower bound on the number of calls if algorithms are only allowed access to a generalized gradient oracle.

• We propose a normalized “gradient descent” style algorithm that achieves $\tilde{O}(\epsilon^{-3}\delta^{-1})$ complexity in finding a $(\delta, \epsilon)$-stationary point in the deterministic setting.

• We propose a momentum-based algorithm that achieves $\tilde{O}(\epsilon^{-4}\delta^{-1})$ complexity in finding a $(\delta, \epsilon)$-stationary point in the stochastic finite-variance setting.

As a proof of concept to validate our theoretical findings, we implement our stochastic algorithm and show that it matches the performance of the SGD-with-momentum method commonly used in practice for training ResNets on the CIFAR10 dataset. Our results attempt to bridge the gap from recent advances in developing a non-asymptotic theory for nonconvex optimization algorithms to settings that apply to training deep neural networks, where, due to non-differentiability of the activations, most existing theory does not directly apply.

## 2 Preliminaries

In this section, we set up the notion of generalized directional derivatives that will play a central role in our analysis. Throughout the paper, we assume that the nonsmooth function $f$ is $L$-Lipschitz continuous (more precise assumptions on the function class are outlined in §2.3).

### 2.1 Generalized gradients

###### Definition 1.

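Given a point $x \in \mathbb{R}^d$ and direction $d$, the generalized directional derivative of $f$ is defined as

$$f^\circ(x; d) := \limsup_{y \to x,\; t \downarrow 0} \frac{f(y + td) - f(y)}{t}.$$

###### Definition 2.

The generalized gradient of $f$ is defined as

$$\partial f(x) := \{ g \mid \langle g, d \rangle \le f^\circ(x; d),\ \forall d \in \mathbb{R}^d \}.$$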

We recall below some basic properties of the generalized gradient, see e.g., (Clarke, 1990) for details.

###### Proposition 1 (Properties of generalized gradients).

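1. $\partial f(x)$ is a nonempty, convex, compact set. For all vectors $g \in \partial f(x)$, we have $\|g\| \le L$.

2. $f^\circ(x; d) = \max\{ \langle g, d \rangle \mid g \in \partial f(x) \}$.

3. $\partial f(x)$ is an upper-semicontinuous set-valued map.

4. $f$ is differentiable almost everywhere (as it is $L$-Lipschitz); let $\mathrm{conv}(\cdot)$ denote the convex hull, then we have that

$$\partial f(x) = \mathrm{conv}\left(\left\{ g \;\middle|\; g = \lim_{k \to \infty} \nabla f(x_k),\ x_k \to x \right\}\right).$$

5. Let $B$ denote the unit Euclidean ball. Then,

$$\partial f(x) = \bigcap_{\delta > 0} \bigcup_{y \in x + \delta B} \partial f(y).$$

6. For any $y, z$, there exist $\lambda \in (0, 1)$ and $g \in \partial f(\lambda y + (1 - \lambda) z)$ such that

$$f(y) - f(z) = \langle g, y - z \rangle.$$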

### 2.2 Directional derivatives

Since general nonsmooth functions can have arbitrarily large variations in their “gradients,” we must restrict the function class to be able to develop a meaningful complexity theory. We show below that directionally differentiable functions match this purpose well.

###### Definition 3.

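A function $f$ is called directionally differentiable in the sense of Hadamard (cf. (Sova, 1964; Shapiro, 1990)) if for any mapping $\varphi : \mathbb{R}_+ \to \mathbb{R}^d$ for which $\varphi(0) = x$ and $\lim_{t \to 0^+} \frac{\varphi(t) - \varphi(0)}{t} = d$, the following limit exists:

$$f'(x; d) = \lim_{t \to 0^+} \frac{1}{t}\big( f(\varphi(t)) - f(x) \big). \tag{1}$$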

In the rest of the paper, we will say a function $f$ is directionally differentiable if it is directionally differentiable in the sense of Hadamard at all $x$. This directional differentiability is also referred to as Hadamard semidifferentiability in (Delfour, 2019). Notably, such directional differentiability is satisfied by most problems of interest in machine learning. It includes functions such as $f(x) = -|x|$ that do not satisfy the so-called regularity inequality (equation (51) in (Majewski et al., 2018)). Moreover, it covers the class of semialgebraic functions, as well as o-minimally definable functions (see Lemma 6.1 in (Coste, 2000)) discussed in (Davis et al., 2018). Currently, we are unaware whether the notion of Whitney stratifiability (studied in some recent works on nonsmooth optimization) implies directional differentiability. A very important property of directional differentiability is that it is preserved under composition.

###### Lemma 2 (Chain rule).

Let $\phi$ be Hadamard directionally differentiable at $x$, and $\psi$ be Hadamard directionally differentiable at $\phi(x)$. Then the composite mapping $\psi \circ \phi$ is Hadamard directionally differentiable at $x$ and

$$(\psi \circ \phi)'_x = \psi'_{\phi(x)} \circ \phi'_x.$$

A proof of this lemma can be found in (Shapiro, 1990, Proposition 3.6). As a consequence, any neural network function composed of directionally differentiable functions, including ReLU/LeakyReLU, is directionally differentiable. Directional differentiability also implies key properties useful in the analysis of nonsmooth problems. In particular, it enables the use of (Lebesgue) path integrals as follows.
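To illustrate the chain rule on a concrete case, the following minimal sketch propagates a direction through a one-hidden-layer ReLU network layer by layer. The network shape, the weights `W1` and `W2`, and the helper names `relu_dd`, `net`, and `net_dd` are assumptions made for this illustration and do not come from the paper.

```python
import numpy as np

def relu_dd(z, dz):
    """Directional derivative of ReLU at z along dz (elementwise).

    Away from the kink it is linear in dz; at z == 0 it equals max(dz, 0).
    """
    return np.where(z > 0, dz, np.where(z < 0, 0.0, np.maximum(dz, 0.0)))

def net(x, W1, W2):
    """Hypothetical network f(x) = W2 . relu(W1 x)."""
    return float(W2 @ np.maximum(W1 @ x, 0.0))

def net_dd(x, d, W1, W2):
    """f'(x; d) obtained by composing the layers' directional derivatives."""
    z, dz = W1 @ x, W1 @ d              # linear layer maps the direction linearly
    return float(W2 @ relu_dd(z, dz))   # ReLU layer, then the final linear layer

W1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]])
W2 = np.array([1.0, -2.0, 0.5])
x = np.array([0.0, 1.0])                # the first hidden unit sits exactly on its kink
d = np.array([1.0, 0.25])
print(net_dd(x, d, W1, W2), net_dd(x, -d, W1, W2))
```

In this example the derivative along `d` and along `-d` coincide, so the map from direction to derivative is not linear at `x`; this is exactly the situation in which an ordinary gradient does not exist but the directional derivative does.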

###### Lemma 3.

Given any $x, y$, let $\gamma(t) = x + t(y - x)$, $t \in [0, 1]$. If $f$ is directionally differentiable and Lipschitz, then

$$f(y) - f(x) = \int_{[0,1]} f'(\gamma(t); y - x)\, dt.$$

The following important lemma further connects directional derivatives with generalized gradients.

###### Lemma 4.

Assume that the directional derivative exists. For any $x, d$, there exists $g \in \partial f(x)$ s.t. $\langle g, d \rangle = f'(x; d)$.

### 2.3 Nonsmooth function class of interest

Throughout the paper, we focus on the set of Lipschitz, directionally differentiable and bounded (below) functions:

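$$\mathcal{F}(\Delta, L) := \{ f \mid f \text{ is } L\text{-Lipschitz};\ f \text{ is directionally differentiable};\ f(x_0) - \inf_x f(x) \le \Delta \}, \tag{2}$$

where a function $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-Lipschitz if

$$|f(x) - f(y)| \le L \|x - y\|, \quad \forall\, x, y \in \mathbb{R}^n.$$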

As indicated previously, ReLU neural networks with bounded weight norms are included in this function class.

## 3 Stationary points and oracles

We now formally define our notion of stationarity and discuss the intractability of the standard notion. Afterwards, we formalize the optimization oracles and define measures of complexity for algorithms that use these oracles.

### 3.1 Stationary points

With the generalized gradient in hand, commonly a point $x$ is called stationary if $0 \in \partial f(x)$ (Clarke, 1990). A natural question is, what is the necessary complexity to obtain an $\epsilon$-stationary point, i.e., a point $x$ for which

$$\min\{ \|g\| \mid g \in \partial f(x) \} \le \epsilon.$$

It turns out that attaining such a point is intractable. In particular, there is no finite-time algorithm that can guarantee $\epsilon$-stationarity in the nonconvex nonsmooth setting. We make this claim precise in our first main result.

###### Theorem 5.

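Given any algorithm $\mathcal{A}$ that accesses function value and generalized gradient of $f$ in each iteration, for any $\epsilon \in [0, 1)$ and for any finite iteration $T$, there exists $f \in \mathcal{F}(\Delta, L)$ such that the sequence $\{x_t\}_{t \in [1, T]}$ generated by $\mathcal{A}$ on the objective $f$ does not contain any $\epsilon$-stationary point with probability more than $\frac{1}{2}$.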

A key ingredient of the proof is that an algorithm $\mathcal{A}$ is uniquely determined by $\{f(x_t), \partial f(x_t)\}_{t \in [1, T]}$, the function values and gradients at the query points. For any two functions $f_1$ and $f_2$ that have the same function values and gradients at the same set of queried points $\{x_1, \dots, x_T\}$, the distribution of the iterates generated by $\mathcal{A}$ is identical for $f_1$ and $f_2$. However, due to the richness of the class of nonsmooth functions, we can find $f_1$ and $f_2$ such that the sets of $\epsilon$-stationary points of $f_1$ and $f_2$ are disjoint. Therefore, the algorithm cannot find a stationary point with probability more than $\frac{1}{2}$ for both $f_1$ and $f_2$ simultaneously.

Intuitively, such functions exist because a nonsmooth function could vary arbitrarily, e.g., a nonsmooth nonconvex function could have constant gradient norms except at the (local) extrema, as happens for a piecewise linear zigzag function. Moreover, the set of extrema could be of measure zero. Therefore, unless the algorithm lands exactly in this measure-zero set, it cannot find any $\epsilon$-stationary point.

Theorem 5 suggests the need for rethinking the definition of stationary points. Intuitively, even though we are unable to find an $\epsilon$-stationary point, one could hope to find a point that is close to an $\epsilon$-stationary point. This motivates us to adopt the following more refined notion:

###### Definition 4.

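A point $x$ is called $(\delta, \epsilon)$-stationary if

$$d(0, \partial f(x + \delta B)) \le \epsilon,$$

where $\partial f(x + \delta B) := \mathrm{conv}\big(\bigcup_{y \in x + \delta B} \partial f(y)\big)$ is the Goldstein $\delta$-subdifferential, introduced in Goldstein (1977).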

In particular, a point $x$ is $(\delta, \epsilon)$-stationary whenever we can find a point $y$ at most distance $\delta$ away from $x$ such that $y$ is $\epsilon$-stationary. At first glance, this appears to be a weaker notion since if $x$ is $\epsilon$-stationary, then it is also a $(\delta, \epsilon)$-stationary point for any $\delta \ge 0$, but not vice versa. We show that the converse implication indeed holds, assuming smoothness.

###### Proposition 6.

The following statements hold:

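(a) $\epsilon$-stationarity implies $(\delta, \epsilon)$-stationarity for any $\delta \ge 0$.

(b) If $f$ is smooth with an $L$-Lipschitz gradient and if $x$ is $(\frac{\epsilon}{3L}, \frac{\epsilon}{3})$-stationary, then $x$ is also $\epsilon$-stationary, i.e.

$$d\big(0, \partial f(x + \tfrac{\epsilon}{3L} B)\big) \le \tfrac{\epsilon}{3} \implies \|\nabla f(x)\| \le \epsilon.$$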
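The second implication follows from a triangle inequality. The short derivation below is a sketch under the assumption that the distance to the Goldstein subdifferential is attained by a finite convex combination of gradients at points $y_i$ in the ball (otherwise one passes to a limit); the symbols $g$, $y_i$, and $\lambda_i$ are introduced here for illustration only.

```latex
% Sketch of the smooth case with \delta = \epsilon / (3L).
\begin{align*}
g &= \sum_i \lambda_i \nabla f(y_i), \qquad \|y_i - x\| \le \delta, \quad
     \lambda_i \ge 0, \quad \sum_i \lambda_i = 1, \quad \|g\| \le \tfrac{\epsilon}{3}, \\
\|\nabla f(x) - g\| &\le \sum_i \lambda_i \,\|\nabla f(x) - \nabla f(y_i)\|
     \le L \delta = \tfrac{\epsilon}{3}, \\
\|\nabla f(x)\| &\le \|g\| + \|\nabla f(x) - g\| \le \tfrac{2\epsilon}{3} \le \epsilon .
\end{align*}
```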

Consequently, the two notions of stationarity are equivalent for differentiable functions. It is then natural to ask: does $(\delta, \epsilon)$-stationarity permit a finite-time analysis? The answer is positive, as we will show later, revealing an intrinsic difference between the two notions of stationarity. Besides providing algorithms, in Theorem 11 we also prove an $\Omega(\delta^{-1})$ lower bound on the dependency of $\delta$ for algorithms that can only access a generalized gradient oracle. We also note that $(\delta, \epsilon)$-stationarity behaves well as $\delta \downarrow 0$.

###### Lemma 7.

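The set $\partial f(x + \delta B)$ converges as $\delta \downarrow 0$ as

$$\lim_{\delta \downarrow 0} \partial f(x + \delta B) = \partial f(x).$$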

Lemma 7 enables a straightforward routine for transforming non-asymptotic analyses for finding $(\delta, \epsilon)$-stationary points to asymptotic results for finding $\epsilon$-stationary points. Indeed, assume that a finite-time algorithm for finding $(\delta, \epsilon)$-stationary points is provided. Then, by repeating the algorithm with decreasing $\delta_k$ (e.g., $\delta_k = 1/k$), any accumulation point of the repeated algorithm is an $\epsilon$-stationary point with high probability.
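The difference between the two notions of stationarity can be seen on a one-dimensional zigzag. The sketch below uses a triangle wave with slopes $\pm 1$ and kinks at the integers; this particular function and the helper names `clarke_dist` and `goldstein_dist` are assumptions chosen for illustration, not objects from the paper.

```python
import math

# Illustrative objective: f(x) = |((x + 1) mod 2) - 1|, a triangle wave whose
# slope is +1 or -1 everywhere except at the integer kinks.

def clarke_dist(x):
    """d(0, generalized gradient at x): 0 at a kink, where it is [-1, 1]; 1 elsewhere."""
    return 0.0 if x == math.floor(x) else 1.0

def goldstein_dist(x, delta):
    """d(0, Goldstein delta-subdifferential at x) in one dimension.

    The convex hull of the slopes seen on [x - delta, x + delta] contains 0
    exactly when an integer kink lies in that interval.
    """
    has_kink = math.floor(x + delta) >= math.ceil(x - delta)
    return 0.0 if has_kink else 1.0

for x in (0.05, 0.3):
    print(x, clarke_dist(x), goldstein_dist(x, 0.1))
```

At `x = 0.05` the point is not $\epsilon$-stationary for any $\epsilon < 1$, yet it is $(0.1, 0)$-stationary because a kink lies within distance 0.1; at `x = 0.3` neither notion is satisfied.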

### 3.2 Gradient oracles

###### Assumption 1.

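Given $x, d$, the oracle $\mathbb{O}(x, d)$ returns a function value $f_x$ and a generalized gradient $g_x$,

$$(f_x, g_x) = \mathbb{O}(x, d),$$

such that

(a) In the deterministic setting, the oracle returns

$$f_x = f(x), \quad g_x \in \partial f(x) \ \text{ satisfying } \ \langle g_x, d \rangle = f'(x, d).$$

(b) In the stochastic finite-variance setting, the oracle only returns a stochastic gradient $g$ with $\mathbb{E}[g] = g_x$, where $g_x \in \partial f(x)$ satisfies $\langle g_x, d \rangle = f'(x, d)$. Moreover, the variance $\mathbb{E}[\|g - g_x\|^2] \le \sigma^2$ is bounded. In particular, no function value is accessible.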

We remark that one cannot generally evaluate the generalized gradient $\partial f$ in practice at any point where $f$ is not differentiable. When the function $f$ is not directionally differentiable, one needs to incorporate gradient sampling to estimate $\partial f$ (Burke et al., 2002). Our oracle queries only an element of the generalized gradient and is thus weaker than querying the entire set $\partial f$. Still, finding a vector $g_x$ such that $\langle g_x, d \rangle$ equals the directional derivative $f'(x, d)$ is non-trivial in general. Yet, when the objective function is a composition of directionally differentiable functions, such as ReLU neural networks, and if a closed-form directional derivative is available for each function in the composition, then we can find the desired $g_x$ by appealing to the chain rule in Lemma 2. This property justifies our choice of oracles.
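As a rough illustration of such an oracle, the sketch below handles a one-hidden-layer ReLU network by letting the query direction decide which side of each kink to differentiate on. The network, the weights, and the name `relu_net_oracle` are hypothetical; the code only checks the identity $\langle g_x, d \rangle = f'(x, d)$ numerically and does not verify membership in the generalized gradient in degenerate cases.

```python
import numpy as np

def relu_net_oracle(x, d, W1, W2):
    """Return (f(x), g) for f(x) = W2 . relu(W1 x), with <g, d> = f'(x; d).

    A hidden unit on its kink (pre-activation exactly 0) is treated as active
    only if the direction d pushes it into the positive side.
    """
    z, dz = W1 @ x, W1 @ d
    active = (z > 0) | ((z == 0) & (dz > 0))
    fx = float(W2 @ np.maximum(z, 0.0))
    g = W1.T @ (W2 * active)            # gradient of the linear piece selected by d
    return fx, g

W1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]])
W2 = np.array([1.0, -2.0, 0.5])
x = np.array([0.0, 1.0])                # first hidden unit is on its kink
d = np.array([1.0, 0.25])
fx, g_plus = relu_net_oracle(x, d, W1, W2)
_, g_minus = relu_net_oracle(x, -d, W1, W2)
print(fx, g_plus, g_minus)
```

The two queries return different vectors at the same point, which reflects that the oracle output depends on the direction passed to it.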

### 3.3 Algorithm class and complexity measures

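An algorithm $A$ maps a function $f \in \mathcal{F}(\Delta, L)$ to a sequence of points $\{x_k\}_{k \ge 0}$ in $\mathbb{R}^n$. We denote by $A^{(k)}$ the mapping from the previous $k$ iterations to $x_{k+1}$. Each $x_k$ can potentially be a random variable, due to the stochastic oracles or algorithm design. Let $\{\mathcal{F}_k\}_{k \ge 0}$ be the filtration generated by $\{x_k\}$ such that $x_k$ is adapted to $\mathcal{F}_k$. Based on the definition of the oracle, we assume that the iterates follow the structure

$$x_{k+1} = A^{(k)}(x_1, g_1, f_1, x_2, g_2, f_2, \dots, x_k, g_k, f_k), \tag{3}$$

where $(f_k, g_k) = \mathbb{O}(y_k, d_k)$, and the point $y_k$ and direction $d_k$ are (stochastic) functions of the iterates $x_1, \dots, x_k$. For a random process $\{x_k\}_{k \in \mathbb{N}}$, we define the complexity of $\{x_k\}_{k \in \mathbb{N}}$ for a function $f$ as the value

$$T_{\delta, \epsilon}(\{x_t\}_{t \in \mathbb{N}}, f) := \inf\left\{ t \in \mathbb{N} \;\middle|\; \mathrm{Prob}\{ d(0, \partial f(x_k + \delta B)) \ge \epsilon \ \text{ for all } k \le t \} \le \tfrac{1}{3} \right\}. \tag{4}$$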

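Let $A[f, x_0]$ denote the sequence of points generated by algorithm $A$ for function $f$. Then, we define the iteration complexity of an algorithm class $\mathcal{A}$ on a function class $\mathcal{F}$ as

$$\mathcal{N}(\mathcal{A}, \mathcal{F}, \epsilon, \delta) := \inf_{A \in \mathcal{A}} \sup_{f \in \mathcal{F}} T_{\delta, \epsilon}(A[f, x_0], f). \tag{5}$$

At a high level, (5) is the minimum number of oracle calls required for a fixed algorithm to find a $(\delta, \epsilon)$-stationary point with probability at least $\frac{2}{3}$ for all functions in class $\mathcal{F}$.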

## 4 Deterministic Setting

For optimizing $L$-smooth functions, a crucial inequality is

$$f\big(x - \tfrac{1}{L} \nabla f(x)\big) - f(x) \le -\tfrac{1}{2L} \|\nabla f(x)\|^2. \tag{6}$$

In other words, either the gradient is small or the function value decreases sufficiently along the negative gradient. However, when the objective function is nonsmooth, this descent property is no longer satisfied. Thus, defining an appropriate descent direction is non-trivial. Our key innovation is to solve this problem via randomization. More specifically, in our algorithm, Interpolated Normalized Gradient Descent (Ingd), we derive a local search strategy to find the descent direction at an iterate $x_t$. The vector $m_{t,k}$ plays the role of descent direction and we sequentially update it until the condition

$$f(x_{t,k}) - f(x_t) < -\frac{\delta \|m_{t,k}\|}{4} \qquad \text{(descent condition)}$$

is satisfied. To connect with the descent property (6), observe that when $f$ is smooth, with $m_{t,k} = \nabla f(x_t)$ and $\delta = \|\nabla f(x_t)\| / L$, (descent condition) is the same as (6) up to a factor $2$. This connection motivates our choice of descent condition.

When the descent condition is satisfied, the next iterate $x_{t+1}$ is obtained by taking a normalized step from $x_t$ along the direction $m_{t,k}$. Otherwise, we stay at $x_t$ and continue the search for a descent direction. We draw special attention to the fact that inside the $k$-loop, the iterates $x_{t,k}$ are always obtained by taking a normalized step from $x_t$. Thus, all the inner iterates $x_{t,k}$ have distance exactly $\delta$ from $x_t$.

To update the descent direction, we incorporate a randomized strategy. We randomly sample an interpolation point $y_{t,k+1}$ on the segment $[x_t, x_{t,k}]$ and evaluate the generalized gradient $g_{t,k+1}$ at this random point. Then, we update the descent direction $m_{t,k+1}$ as a convex combination of $g_{t,k+1}$ and the previous direction $m_{t,k}$. Due to lack of smoothness, the violation of the descent condition does not directly imply that $g_{t,k+1}$ is small. Instead, the projection of the generalized gradient is small along the direction $m_{t,k}$ on average. Hence, with a proper linear combination, the random interpolation allows us to guarantee the decrease of $\|m_{t,k}\|$ in expectation. This reasoning allows us to derive the non-asymptotic convergence rate in high probability.
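The local search described above can be sketched in a few lines. The following is an illustrative reading of the description, not the paper's algorithm listing: the test objective $f(x) = |x_1| + 2|x_2|$, the function names `oracle` and `ingd_sketch`, the iteration caps, and in particular the mixing weight are assumptions. Where the text calls for "a proper linear combination", this sketch swaps in the minimum-norm convex combination of the previous direction and the new gradient.

```python
import numpy as np

def oracle(x, d):
    """Illustrative oracle for f(x) = |x_1| + 2|x_2| (not from the paper).

    Returns f(x) and a generalized gradient g with <g, d> = f'(x; d): on a kink
    the sign is chosen according to the query direction d.
    """
    w = np.array([1.0, 2.0])
    s = np.where(x != 0.0, np.sign(x), np.sign(d))
    return float(w @ np.abs(x)), w * s

def ingd_sketch(x0, delta, eps, max_outer=2000, K=500, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_outer):
        fx, m = oracle(x, np.zeros_like(x))          # restart the direction at x_t
        for _ in range(K):
            norm_m = np.linalg.norm(m)
            if norm_m <= eps:                        # x_t is (delta, eps)-stationary
                return x, True
            x_trial = x - delta * m / norm_m         # normalized step of length delta
            f_trial, _ = oracle(x_trial, np.zeros_like(x))
            if f_trial - fx < -delta * norm_m / 4.0: # descent condition
                x = x_trial
                break
            # Random interpolation point on the segment [x_t, x_trial].
            y = x + rng.uniform() * (x_trial - x)
            _, g = oracle(y, -m)
            # Assumption: min-norm convex combination instead of the paper's weight.
            diff = g - m
            denom = float(diff @ diff)
            beta = 1.0 if denom == 0.0 else float(np.clip((diff @ g) / denom, 0.0, 1.0))
            m = beta * m + (1.0 - beta) * g
    return x, False

x_out, converged = ingd_sketch(np.array([1.3, -0.7]), delta=0.1, eps=0.5)
print(x_out, converged)
```

Because the direction is always a convex combination of generalized gradients taken within distance `delta` of the current iterate, a small norm certifies $(\delta, \epsilon)$-stationarity of that iterate; for this objective it forces both coordinates to lie within `delta` of the kink at the origin.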

###### Theorem 8.

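In the deterministic setting and with Assumption 1(a), the Ingd algorithm with parameters $K = \frac{48 L^2}{\epsilon^2}$ and $T = \frac{4\Delta}{\epsilon \delta}$ finds a $(\delta, \epsilon)$-stationary point for function class $\mathcal{F}(\Delta, L)$ with probability $1 - \gamma$ using at most

$$\frac{192\, \Delta L^2}{\epsilon^3 \delta} \log\left(\frac{4\Delta}{\gamma \delta \epsilon}\right) \ \text{ oracle calls.}$$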

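Since we introduce random sampling for choosing the interpolation point, even in the deterministic setting we can only guarantee a high probability result. The detailed proof is deferred to Appendix C. A sketch of the proof is as follows. Since $\|x_{t,k} - x_t\| = \delta$ for any $k$, the interpolation point $y_{t,k}$ is inside the ball $x_t + \delta B$. Hence $m_{t,k} \in \partial f(x_t + \delta B)$ for any $k$. In other words, as soon as $\|m_{t,k}\| \le \epsilon$ (line 7), the reference point $x_t$ is $(\delta, \epsilon)$-stationary. If this is not true, i.e., $\|m_{t,k}\| > \epsilon$, then we check whether (descent condition) holds, in which case

$$f(x_{t,k}) - f(x_t) < -\frac{\delta \|m_{t,k}\|}{4} < -\frac{\epsilon \delta}{4}.$$

Knowing that the function value is lower bounded, this can happen at most $T = \frac{4\Delta}{\epsilon \delta}$ times. Thus, for at least one $x_t$, the local search inside the while loop is not broken by the descent condition. Finally, given that $\|m_{t,k}\| > \epsilon$ and the descent condition is not satisfied, we show that

$$\mathbb{E}[\|m_{t,k+1}\|^2] \le \left(1 - \frac{\mathbb{E}[\|m_{t,k}\|^2]}{3L^2}\right) \mathbb{E}[\|m_{t,k}\|^2].$$

This implies that $\mathbb{E}[\|m_{t,k}\|^2]$ follows a decrease of order $O(1/k)$. Hence with $K = O(1/\epsilon^2)$, we are guaranteed to find $\|m_{t,k}\| \le \epsilon$ with high probability.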
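The step from the recursion to the $O(1/k)$ rate can be made explicit. The derivation below is a sketch under the assumption that the recursion holds at every inner step; the shorthand $a_k$ and $c$ is introduced here for illustration, and the bound $a_k \le L^2$ uses the fact that generalized gradients of an $L$-Lipschitz function have norm at most $L$.

```latex
% Shorthand: a_k := E[ ||m_{t,k}||^2 ],  c := 3 L^2,  so that 0 < a_k <= L^2 < c.
\begin{align*}
a_{k+1} \le a_k \Big(1 - \frac{a_k}{c}\Big)
  \;&\Longrightarrow\;
  \frac{1}{a_{k+1}} \ge \frac{1}{a_k} \cdot \frac{1}{1 - a_k / c}
  \ge \frac{1}{a_k}\Big(1 + \frac{a_k}{c}\Big) = \frac{1}{a_k} + \frac{1}{c}, \\
\frac{1}{a_k} \ge \frac{1}{a_1} + \frac{k - 1}{c} \ge \frac{1}{L^2} + \frac{k - 1}{3 L^2}
  \;&\Longrightarrow\;
  a_k \le \frac{3 L^2}{k + 2}.
\end{align*}
% Hence a_k <= \epsilon^2 once k >= 3 L^2 / \epsilon^2, i.e. after O(1/\epsilon^2) inner steps.
```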

###### Remark 9.

If the problem is smooth, the descent condition is always satisfied in one iteration. Hence the global complexity of our algorithm reduces to $T = O(\frac{1}{\epsilon \delta})$. Due to the equivalence of the notions of stationarity (Prop. 6), with $\delta = O(\epsilon / L)$, our algorithm recovers the standard $O(1/\epsilon^2)$ convergence rate for finding an $\epsilon$-stationary point. In other words, our algorithm can adapt to the smoothness condition.

## 5 Stochastic Setting

In the deterministic setting, one of the key ingredients used in Ingd is to check whether the function value decreases sufficiently. However, evaluating the function value can be computationally expensive, or even infeasible in the stochastic setting. For example, when training neural networks, evaluating the entire loss function requires going through all the data, which is impractical. As a result, we do not assume access to function value in the stochastic setting and instead propose a variant of Ingd that only relies on gradient information.

One of the challenges of using stochastic gradients is the noisiness of the gradient evaluation. To control the variance of the associated updates, we introduce a parameter $q$ into the normalized step size:

$$\eta_t = \frac{1}{p \|m_t\| + q}.$$

A similar strategy is used in adaptive methods like Duchi et al. (2011); Kingma and Ba (2015) to prevent instability. Here, we show that the constant $q$ allows us to control the variance of $x_{t+1} - x_t$. In particular, it implies the bound

$$\mathbb{E}[\|x_{t+1} - x_t\|^2] \le \frac{G^2}{q},$$

where $G^2 := L^2 + \sigma^2$ is a trivial upper bound on the expected squared norm of any sampled gradient $g$.

Another substantial change (relative to Ingd) is the removal of the explicit local search, since the stopping criterion can no longer be tested without access to the function value. Instead, one may view $x_{t-K+1}, \dots, x_t$ as an implicit local search with respect to the reference point $x_{t-K}$. In particular, we show that when the direction $m_t$ has a small norm, then $x_{t-K}$ is a $(\delta, \epsilon)$-stationary point, but not $x_t$. This discrepancy explains why we output $x_{t-K}$ instead of $x_t$.

In the deterministic setting, the direction $m_{t,k}$ inside each local search is guaranteed to belong to $\partial f(x_t + \delta B)$. Hence, controlling the norm of $m_{t,k}$ implies the $(\delta, \epsilon)$-stationarity of $x_t$. In the stochastic case, however, we have two complications. First, only the expectation of the gradient evaluation satisfies the membership $\mathbb{E}[g(y_k)] \in \partial f(y_k)$. Second, the direction $m_t$ is a convex combination of all the previous gradients $g(y_1), \dots, g(y_t)$, with all coefficients being nonzero. In contrast, we use a re-initialization in the deterministic setting. We overcome these difficulties and their ensuing subtleties to finally obtain the following complexity result:

###### Theorem 10.

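In the stochastic setting, with Assumption 1(b), the Stochastic-Ingd algorithm (Algorithm 2) with appropriately chosen parameters $\beta$, $p$, $q$, $T$, and $K$ ensures

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[\|m_t\|] \le \frac{\epsilon}{4}.$$

In other words, the number of gradient calls to achieve a $(\delta, \epsilon)$-stationary point is upper bounded by $\tilde{O}(\epsilon^{-4}\delta^{-1})$.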

For readability, the constants in Theorem 10 have not been optimized. The high-level idea of the proof is to relate the norm of $m_t$ to the function value decrease between consecutive iterates, and then to perform a telescopic sum. We would like to emphasize the use of the adaptive step size $\eta_t$ and the momentum term $m_t$. These techniques arise naturally from our goal to find a $(\delta, \epsilon)$-stationary point. The step size helps us ensure that the distance moved is at most $\frac{1}{p}$, and hence we are certain that adjacent iterates are close to each other. The momentum term serves as a convex combination of generalized gradients, as postulated by Definition 4. Further, even though the parameter $K$ does not directly influence the updates of our algorithm, it plays an important role in understanding our algorithm. Indeed, we show that

$$d\big(\mathbb{E}[m_t \mid x_{t-K}],\ \partial f(x_{t-K} + \delta B)\big) \le \frac{\epsilon}{16}.$$

In other words, the conditional expectation $\mathbb{E}[m_t \mid x_{t-K}]$ is approximately in the $\delta$-subdifferential at $x_{t-K}$. This relationship is non-trivial. On one hand, by imposing $K \le p\delta$, we ensure that $x_{t-K+1}, \dots, x_t$ are inside the $\delta$-ball of center $x_{t-K}$. On the other hand, we guarantee that the contribution of $m_{t-K}$ to $m_t$ is small, providing an appropriate upper bound on the coefficient $\beta^K$. These two requirements help balance the different parameters in our final choice. Details of the proof may be found in Appendix D. Recall that we do not access the function value in this stochastic setting, which is a strength of the algorithm. In fact, we can show that our $\delta$ dependence is tight when the oracle has only access to generalized gradients.
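The ingredients discussed above (step size $\eta_t = 1/(p\|m_t\| + q)$, a momentum average of stochastic gradients, and an output that lags the sampled index by $K$ steps) can be put together in a short sketch. Everything specific here is an assumption for illustration: the noisy test objective, the names `noisy_oracle` and `stochastic_ingd_sketch`, the parameter values, the interpolation step borrowed from the deterministic description, and the rule for choosing the returned index. It is not the paper's Algorithm 2.

```python
import numpy as np

def noisy_oracle(x, d, rng, sigma=0.1):
    """Stochastic gradient of the illustrative f(x) = |x_1| + 2|x_2|.

    The mean is a generalized gradient g_x with <g_x, d> = f'(x; d); Gaussian
    noise with standard deviation sigma per coordinate is added on top.
    """
    w = np.array([1.0, 2.0])
    s = np.where(x != 0.0, np.sign(x), np.sign(d))
    return w * s + sigma * rng.standard_normal(x.shape)

def stochastic_ingd_sketch(x0, p, q, beta, T, K, seed=0):
    rng = np.random.default_rng(seed)
    xs = [np.asarray(x0, dtype=float)]
    m = noisy_oracle(xs[0], np.zeros_like(xs[0]), rng)
    for _ in range(T):
        x = xs[-1]
        eta = 1.0 / (p * np.linalg.norm(m) + q)   # adaptive, normalized step size
        x_next = x - eta * m                      # step length is below 1/p
        y = x + rng.uniform() * (x_next - x)      # assumed: random point on the segment
        g = noisy_oracle(y, -m, rng)
        m = beta * m + (1.0 - beta) * g           # momentum as a convex combination
        xs.append(x_next)
    i = int(rng.integers(1, T + 1))               # assumed output rule: uniform index,
    return xs[max(i - K, 0)], xs                  # shifted back by K iterations

x_out, xs = stochastic_ingd_sketch(np.array([1.3, -0.7]),
                                   p=10.0, q=1.0, beta=0.9, T=200, K=5)
steps = [np.linalg.norm(b - a) for a, b in zip(xs[:-1], xs[1:])]
print(x_out, max(steps))
```

Since every step is shorter than `1/p`, any `K` consecutive iterates stay within `K/p` of the first one, which is the mechanism that keeps the averaged gradients inside a $\delta$-ball around the returned point when $K \le p\delta$.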

###### Theorem 11 (Lower bound on δ dependence).

Let $\mathcal{A}$ denote the class of algorithms defined in Section 3.3 and $\mathcal{F}(\Delta, L)$ denote the class of functions defined in Equation (2). Assume $\epsilon \in (0, 1)$ and $L = 1$. Then the iteration complexity is lower bounded by $\Omega(\Delta / \delta)$ if the algorithm only has access to generalized gradients.

The proof is inspired by Theorem 1.1.2 in Nesterov (2018). We show that unless the number of distinct queried points is of order $\Delta / \delta$, we can construct two different functions in the function class that have gradient norm $1$ at all the queried points, and the stationary points of both functions are $\Omega(\delta)$ away. For more details, see Appendix E. This theorem also implies the negative result for finite-time analyses that we showed in Theorem 5. Indeed, when an algorithm finds an $\epsilon$-stationary point, the point is also $(\delta, \epsilon)$-stationary for any $\delta > 0$. Thus, the iteration complexity must grow without bound as $\delta \to 0$, i.e., no finite-time algorithm can guarantee to find an $\epsilon$-stationary point.

Before moving on to the experimental section, we would like to make several comments related to different settings. First, since the stochastic setting is strictly stronger than the deterministic setting, the stochastic variant Stochastic-INGD is applicable to the deterministic setting too. Moreover, in that case the analysis can be extended to yield a complexity of $\tilde{O}(\epsilon^{-3}\delta^{-1})$, which is the same as the deterministic algorithm. However, the stochastic variant does not adapt to the smoothness condition. In other words, even if the function is differentiable, we will not obtain a faster convergence rate. In particular, if the function is smooth, by using the equivalence of the types of stationary points, Stochastic-INGD finds an $\epsilon$-stationary point in $O(\epsilon^{-5})$ while standard SGD enjoys an $O(\epsilon^{-4})$ convergence rate. We do not know whether a better convergence result is achievable, as our lower bound does not provide an explicit dependency on $\epsilon$; we leave this as a future research direction.

## 6 Experiments

In this section, we evaluate the performance of our proposed algorithm Stochastic-Ingd on image classification tasks. We train the ResNet20 (He et al., 2016) model on the CIFAR10 (Krizhevsky and Hinton, 2009) classification dataset. The dataset contains 50k training images and 10k test images in 10 classes. We implement Stochastic-Ingd in PyTorch with the inbuilt auto differentiation algorithm (Paszke et al., 2017). We remark that except on the kink points, the auto differentiation matches the generalized gradient oracle, which justifies our choice. We benchmark the experiments with two popular machine learning optimizers, SGD with momentum and ADAM (Kingma and Ba, 2015). We train the model for 100 epochs with the standard hyper-parameters from the Github repository:

• For SGD with momentum, we start from the standard initial learning rate and momentum and reduce the learning rate by a factor of 10 at epochs 50 and 75. Weight decay is applied.

• For ADAM, we use a constant learning rate, and choose the betas and the weight decay parameter for the best performance.

• For Stochastic-Ingd, we fix the parameters of the algorithm together with a weight decay parameter.

The training and test accuracy for all three algorithms are plotted in Figure 1. We observe that Stochastic-Ingd matches the SGD baseline and outperforms the ADAM algorithm in terms of test accuracy. These results suggest that the experimental implications of our algorithm could be interesting, but we leave a more systematic study as a future direction.

## 7 Conclusions and Future Directions

In this paper, we investigate the complexity of finding first-order stationary points of nonconvex nondifferentiable functions. We focus in particular on Hadamard semi-differentiable functions, which we suspect is perhaps the most general class of functions for which the chain rule of calculus holds; see the monograph (Delfour, 2019). We further extend the standard definition of $\epsilon$-stationary points for smooth functions into a new notion of $(\delta, \epsilon)$-stationary points. We justify our definition by showing that no algorithm can find a $(0, \epsilon)$-stationary point for any $\epsilon < 1$ in a finite number of iterations and conclude that a positive $\delta$ is necessary for a finite-time analysis. Using the above definition and a more refined gradient oracle, we prove that the proposed algorithms find stationary points within $\tilde{O}(\epsilon^{-3}\delta^{-1})$ iterations in the deterministic setting and within $\tilde{O}(\epsilon^{-4}\delta^{-1})$ iterations in the stochastic setting.

Our results provide the first non-asymptotic analysis of nonconvex optimization algorithms in the general Lipschitz continuous setting. Yet, they also open further questions. The first question is whether the current dependence on $\epsilon$ in our complexity bound is optimal. A future research direction is to try to find provably faster algorithms or construct adversarial examples that close the gap between upper and lower bounds on $\epsilon$. Second, the rate we obtain in the deterministic case requires function evaluations and is randomized, leading to high probability bounds. Can similar rates be obtained by an algorithm oblivious to the function value? Another possible direction would be to obtain a deterministic convergence result. More specialized questions include whether one can remove the logarithmic factors from our bounds. Aside from the above questions on the rate, we can take a step back and ask high-level questions. Are there better alternatives to the current definition of $(\delta, \epsilon)$-stationary points? One should also investigate whether everywhere directional differentiability is necessary.

In addition to the open problems listed above, our work uncovers another very interesting observation. In the standard stochastic, nonconvex, and smooth setting, stochastic gradient descent is known to be theoretically optimal (Arjevani et al., 2019), while widely used practical techniques such as momentum-based and adaptive step size methods usually lead to worse theoretical convergence rates. In our proposed setting, momentum and adaptivity naturally show up in algorithm design, and become necessary for the convergence analysis. Hence we believe that studying optimization under more relaxed assumptions may lead to theorems that can better bridge the widening theory-practice divide in optimization for training deep neural networks, and ultimately lead to better insights for practitioners.

## 8 Acknowledgement

SS acknowledges support from an NSF-CAREER Award (Number 1846088) and an Amazon Research Award. AJ acknowledges support from an MIT-IBM-Exploratory project on adaptive, robust, and collaborative optimization.

## References

• N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma (2017) Finding approximate local minima faster than gradient descent. In

Proceedings of the 49th Annual ACM Symposium on Theory of Computing

,
pp. 1195–1199. Cited by: §1.1.
• Z. Allen-Zhu (2018) How to make the gradients small stochastically: even faster convex and nonconvex SGD. In Advances in Neural Information Processing Systems, pp. 1157–1167. Cited by: §1.1.
• Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth (2019) Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365. Cited by: §1.1, §1, §7.
• A. Beck and N. Hallak (2020) On the convergence to stationary points of deterministic and randomized feasible descent directions methods. SIAM Journal on Optimization 30 (1), pp. 56–79. Cited by: §1.1.
• M. Benaïm, J. Hofbauer, and S. Sorin (2005) Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization 44 (1), pp. 328–348. Cited by: §1.1, §1.
• J. Bolte and E. Pauwels (2019)

Conservative set valued fields, automatic differentiation, stochastic gradient method and deep learning

.
arXiv preprint arXiv:1909.10300. Cited by: §1.1, §1.
• J. Bolte, S. Sabach, M. Teboulle, and Y. Vaisbourd (2018) First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM Journal on Optimization 28 (3), pp. 2131–2151. Cited by: §1.1.
• J. V. Burke, F. E. Curtis, A. S. Lewis, M. L. Overton, and L. E. Simões (2018) Gradient sampling methods for nonsmooth optimization. arXiv preprint arXiv:1804.11003. Cited by: §1.1.
• J. V. Burke, A. S. Lewis, and M. L. Overton (2002) Approximating subdifferentials by random sampling of gradients. Mathematics of Operations Research 27 (3), pp. 567–584. Cited by: §3.2.
• Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford (2017) Lower bounds for finding stationary points I. Mathematical Programming, pp. 1–50. Cited by: §1.1, §1.
• Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford (2018) Accelerated methods for nonconvex optimization. SIAM Journal on Optimization 28 (2), pp. 1751–1772. Cited by: §1.1.
• F. H. Clarke (1990) Optimization and nonsmooth analysis. Vol. 5, SIAM. Cited by: §2.1, §2.1, §3.1.
• M. Coste (2000) An introduction to o-minimal geometry. Istituti editoriali e poligrafici internazionali Pisa. Cited by: §2.2.
• H. Daneshmand, J. Kohler, A. Lucchi, and T. Hofmann (2018) Escaping saddles with stochastic gradients. In International Conference on Machine Learning, pp. 1163–1172. Cited by: §1.1.
• D. Davis, D. Drusvyatskiy, S. Kakade, and J. D. Lee (2018) Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics, pp. 1–36. Cited by: §1.1, §1, §2.2.
• D. Davis and D. Drusvyatskiy (2019) Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization 29 (1), pp. 207–239. Cited by: §1.1.
• M. C. Delfour (2019) Introduction to optimization and Hadamard semidifferential calculus. SIAM. Cited by: §2.2, §7.
• Y. Drori and O. Shamir (2019) The complexity of finding stationary points with stochastic gradient descent. arXiv preprint arXiv:1910.01845. Cited by: §1.1.
• D. Drusvyatskiy and C. Paquette (2019) Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming 178 (1-2), pp. 503–558. Cited by: §1.1.
• J. C. Duchi and F. Ruan (2018) Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization 28 (4), pp. 3229–3259. Cited by: §1.1.
• J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §5.
• C. Fang, C. J. Li, Z. Lin, and T. Zhang (2018) Spider: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pp. 689–699. Cited by: §1.1.
• C. Fang, Z. Lin, and T. Zhang (2019) Sharp analysis for nonconvex SGD escaping from saddle points. In Conference on Learning Theory. Cited by: §1.1.
• D. Foster, A. Sekhari, O. Shamir, N. Srebro, K. Sridharan, and B. Woodworth (2019) The complexity of making the gradient small in stochastic convex optimization. arXiv preprint arXiv:1902.04686. Cited by: §1.1.
• R. Ge, F. Huang, C. Jin, and Y. Yuan (2015) In Conference on Learning Theory, pp. 797–842. Cited by: §1.1.
• S. Ghadimi and G. Lan (2013) Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23 (4), pp. 2341–2368. Cited by: §1.1.
• A. Goldstein (1977) Optimization of Lipschitz continuous functions. Mathematical Programming 13 (1), pp. 14–22. Cited by: Definition 4.
• K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §6.
• C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan (2017) How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1724–1732. Cited by: §1.1.
• D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations. Cited by: §5, §6.
• K. C. Kiwiel (2007) Convergence of the gradient sampling algorithm for nonsmooth nonconvex optimization. SIAM Journal on Optimization 18 (2), pp. 379–388. Cited by: §1.1, §1.
• A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §6.
• S. Majewski, B. Miasojedow, and E. Moulines (2018) Analysis of nonsmooth stochastic approximation: the differential inclusion approach. arXiv preprint arXiv:1805.01916. Cited by: §1.1, §1, §2.2.
• R. Mifflin (1977) An algorithm for constrained optimization with semismooth functions. Mathematics of Operations Research 2 (2), pp. 191–207. Cited by: §1.1.
• Y. Nesterov (2018) Lectures on convex optimization. Vol. 137, Springer. Cited by: §B.1, §1, §5.
• L. M. Nguyen, M. van Dijk, D. T. Phan, P. H. Nguyen, T. Weng, and J. R. Kalagnanam (2019) Optimal finite-sum smooth non-convex optimization with SARAH. arXiv preprint arXiv:1901.07648. Cited by: §1.1.
• A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. Cited by: §6.
• S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola (2016) Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pp. 314–323. Cited by: §1.1.
• A. Shapiro (1990) On concepts of directional differentiability. Journal of optimization theory and applications 66 (3), pp. 477–487. Cited by: §2.2, Definition 3.
• M. Sova (1964) General theory of differentiation in linear topological spaces. Czechoslovak Mathematical Journal 14, pp. 485–508. Cited by: Definition 3.
• S. Zhang and N. He (2018) On the convergence rate of stochastic mirror descent for nonsmooth nonconvex optimization. arXiv preprint arXiv:1806.04781. Cited by: §1.1.
• D. Zhou, P. Xu, and Q. Gu (2018) Stochastic nested variance reduction for nonconvex optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3925–3936. Cited by: §1.1.

## Appendix A Proof of Lemmas in Preliminaries

### A.1 Proof of Lemma 3

###### Proof.

Let $g(t) = f(x + t(y-x))$ for $t \in [0,1]$. Then $g$ is $L\|y-x\|$-Lipschitz, which implies that $g$ is absolutely continuous. Thus, by the fundamental theorem of calculus (Lebesgue), $g$ has a derivative $g'$ almost everywhere, and this derivative is Lebesgue integrable with

$$g(t) = g(0) + \int_0^t g'(s)\,ds.$$

Moreover, if $g$ is differentiable at $t$, then

$$g'(t) = \lim_{\delta t \to 0} \frac{g(t+\delta t) - g(t)}{\delta t} = \lim_{\delta t \to 0} \frac{f(x + (t+\delta t)(y-x)) - f(x + t(y-x))}{\delta t} = f'(x + t(y-x), y-x).$$

Since this equality holds almost everywhere, we have

$$f(y) - f(x) = g(1) - g(0) = \int_0^1 g'(t)\,dt = \int_0^1 f'(x + t(y-x), y-x)\,dt.$$
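As a sanity check of the identity just proved, the following sketch compares both sides numerically for one nonsmooth Lipschitz function. The test function, the endpoints, and the helper names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def f(z):
    # Illustrative nonsmooth, Lipschitz test function: |z_0| + max(z_1, 0)
    return abs(z[0]) + max(z[1], 0.0)

def directional_derivative(z, d):
    # One-sided directional derivative f'(z; d) of the test function, in closed form
    first = np.sign(z[0]) * d[0] if z[0] != 0 else abs(d[0])
    if z[1] > 0:
        second = d[1]
    elif z[1] < 0:
        second = 0.0
    else:
        second = max(d[1], 0.0)
    return first + second

def line_integral(x, y, n=20_000):
    # Midpoint rule for the integral of f'(x + t(y - x); y - x) over t in [0, 1]
    d = y - x
    ts = (np.arange(n) + 0.5) / n
    return sum(directional_derivative(x + t * d, d) for t in ts) / n

x = np.array([-1.0, -2.0])
y = np.array([2.0, 1.0])
lhs = f(y) - f(x)          # left-hand side of the identity
rhs = line_integral(x, y)  # right-hand side, approximated numerically
```

The segment from `x` to `y` crosses both kinks of the test function, so the integrand is discontinuous at two values of `t`; the two sides still agree up to the quadrature error, as the almost-everywhere argument predicts.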

### A.2 Proof of Lemma 4

###### Proof.

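For any direction $d$ as given in Definition 3, let $t_k \downarrow 0$. Denote $x_k = x + t_k d$. By Proposition 1.6, we know that there exists a generalized gradient $g_{k,j}$ of $f$ at some point on the segment between $x$ and $x_k$ such that

$$f(x_k) - f(x) = \langle g_{k,j}, x_k - x \rangle.$$

By the existence of the directional derivative, we know that

$$\lim_{k\to\infty} \langle g_{k,j}, d \rangle = \lim_{k\to\infty} \frac{\langle g_{k,j}, t_k d \rangle}{t_k} = f'(x,d).$$

Each $g_{k,j}$ lies in a bounded set with norm at most $L$, so the sequence has an accumulation point. The lemma follows from the fact that any accumulation point of $g_{k,j}$ is in $\partial f(x)$ due to the upper semicontinuity of $\partial f$. ∎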

## Appendix B Proof of Lemmas in Algorithm Complexity

### B.1 Proof of Theorem 5

Our proof strategy is similar to that of Theorem 1.1.2 in Nesterov (2018): we use a resisting strategy to prove the lower bound. Given a one-dimensional function $f$, let $x_k$, $k \in [1,K]$, be the sequence of queried points, listed in ascending order rather than in query order. We assume without loss of generality that the initial point is queried and is an element of $\{x_k\}_{k\in[1,K]}$ (otherwise, query the initial point first before proceeding with the algorithm). We then define the resisting strategy: always return

$$f(x) = 0 \quad\text{and}\quad \nabla f(x) = L.$$

Suppose we can prove that for any set of points $x_k$, $k \in [1,K]$, there exist two functions that both satisfy the resisting strategy $f(x_k) = 0$, $\nabla f(x_k) = L$ for all $k \in [1,K]$, and that do not share any common stationary point. Then no randomized or deterministic algorithm can return an $\epsilon$-stationary point with probability more than $1/2$ for both functions simultaneously. In other words, no algorithm that queries $K$ points can distinguish these two functions. This proves the theorem, following the definition of complexity in (5) with $\delta = 0$. It remains to show that two such functions exist, which is done in the lemma below.

###### Lemma 12.

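Given a finite sequence of real numbers $\{x_k\}_{k\in[1,K]}$, there is a family of functions $\{f_\theta\}$ such that for any $k \in [1,K]$,

$$f_\theta(x_k) = 0 \quad\text{and}\quad \nabla f_\theta(x_k) = L,$$

and, for $\epsilon$ sufficiently small, the sets of $\epsilon$-stationary points of the $f_\theta$ are pairwise disjoint, i.e., $\{\epsilon\text{-stationary points of } f_{\theta_1}\} \cap \{\epsilon\text{-stationary points of } f_{\theta_2}\} = \emptyset$ for any $\theta_1 \neq \theta_2$.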

###### Proof.

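Up to a permutation of the indices, we can reorder the sequence in increasing order, so WLOG we assume that $x_k$ is increasing. Let $\delta = \min\{\min_{x_i \neq x_j} |x_i - x_j|, \Delta/L\}$. For any $0 < \theta < 1/2$, we define $f_\theta$ by

$$\begin{aligned}
f_\theta(x) &= -L(x - x_1 + 2\theta\delta) && \text{for } x \in (-\infty, x_1 - \theta\delta], \\
f_\theta(x) &= L(x - x_k) && \text{for } x \in \left[x_k - \theta\delta, \tfrac{x_k + x_{k+1}}{2} - \theta\delta\right], \\
f_\theta(x) &= -L(x - x_{k+1} + 2\theta\delta) && \text{for } x \in \left[\tfrac{x_k + x_{k+1}}{2} - \theta\delta, x_{k+1} - \theta\delta\right], \\
f_\theta(x) &= L(x - x_K) && \text{for } x \in [x_K - \theta\delta, +\infty).
\end{aligned}$$

It is clear that $f_\theta$ is directionally differentiable at every point and that $\nabla f_\theta(x_k) = L$. Moreover, the minimum is $f_\theta^* = -L\theta\delta \ge -\Delta$. This implies that $f_\theta \in \mathcal{F}(\Delta, L)$. Note that $\nabla f_\theta = L$ or $-L$ except at the local extrema. Therefore, for any $\epsilon < L$, the set of $\epsilon$-stationary points of $f_\theta$ is exactly

$$\{\epsilon\text{-stationary points of } f_\theta\} = \{x_k - \theta\delta \mid k \in [1,K]\} \cup \left\{\tfrac{x_k + x_{k+1}}{2} - \theta\delta \mid k \in [1,K-1]\right\},$$

which is clearly distinct for different choices of $\theta$. ∎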
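The resisting construction can be made concrete with a short sketch. It builds the sawtooth function for two values of $\theta$ and checks that both answer every query with value $0$ and slope $L$ while having different kinks. The helper name `make_f_theta`, the query points, and the parameter values are hypothetical; the sketch assumes $0 < \theta < 1/2$ and a shift parameter $\delta$ no larger than the smallest gap between queries.

```python
import numpy as np

def make_f_theta(xs, theta, L, delta):
    """Build a sawtooth f_theta on the query points xs (hypothetical helper).

    Assumes 0 < theta < 1/2 and delta no larger than the smallest gap in xs.
    Returns the function and the array of its kinks (local minima and maxima).
    """
    xs = np.sort(np.asarray(xs, dtype=float))
    shift = theta * delta
    minima = xs - shift                       # local minima x_k - theta*delta
    maxima = (xs[:-1] + xs[1:]) / 2 - shift   # local maxima between neighbours

    def f(x):
        if x <= minima[0]:
            return -L * (x - xs[0] + 2 * shift)
        for k in range(len(xs) - 1):
            if minima[k] <= x <= maxima[k]:
                return L * (x - xs[k])
            if maxima[k] <= x <= minima[k + 1]:
                return -L * (x - xs[k + 1] + 2 * shift)
        return L * (x - xs[-1])               # rightmost increasing piece

    return f, np.concatenate([minima, maxima])

queries, L, delta = [0.0, 1.0, 2.5], 2.0, 1.0
f_a, kinks_a = make_f_theta(queries, 0.1, L, delta)
f_b, kinks_b = make_f_theta(queries, 0.3, L, delta)

h = 1e-3  # finite-difference step, smaller than either shift
values = [f(x) for f in (f_a, f_b) for x in queries]
slopes = [(f(x + h) - f(x - h)) / (2 * h) for f in (f_a, f_b) for x in queries]
shared_kinks = np.intersect1d(np.round(kinks_a, 9), np.round(kinks_b, 9))
```

Both functions are indistinguishable from the oracle responses at the queried points, yet their kinks, which are the only candidates for $\epsilon$-stationary points when $\epsilon < L$, do not overlap.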

### B.2 Proof of Proposition 6

###### Proof.

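When $x$ is $(\frac{\epsilon}{3L}, \frac{\epsilon}{3})$-stationary, we have $d(0, \partial f(x + \frac{\epsilon}{3L}B)) \le \frac{\epsilon}{3}$. By definition, we can find $g \in \mathrm{conv}(\cup_{y \in x + \frac{\epsilon}{3L}B} \nabla f(y))$ such that $\|g\| \le \frac{2\epsilon}{3}$. This means that there exist $x_1, \dots, x_k \in x + \frac{\epsilon}{3L}B$ and $\alpha_1, \dots, \alpha_k \in [0,1]$ such that $\alpha_1 + \dots + \alpha_k = 1$ and

$$g = \sum_{i=1}^k \alpha_i \nabla f(x_i).$$

Therefore,

$$\begin{aligned}
\|\nabla f(x)\| &\le \|g\| + \|\nabla f(x) - g\| \\
&\le \frac{2\epsilon}{3} + \sum_{i=1}^k \alpha_i \|\nabla f(x) - \nabla f(x_i)\| \\
&\le \frac{2\epsilon}{3} + \sum_{i=1}^k \alpha_i L \|x - x_i\| \\
&\le \frac{2\epsilon}{3} + \sum_{i=1}^k \alpha_i L \frac{\epsilon}{3L} = \epsilon.
\end{aligned}$$

Therefore, $x$ is an $\epsilon$-stationary point in the standard sense. ∎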
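The key step of this argument, that a convex combination of gradients taken within radius $\epsilon/(3L)$ of $x$ differs from $\nabla f(x)$ by at most $\epsilon/3$, can be checked numerically. The sketch below uses $f = \sin$ (whose gradient $\cos$ is $1$-Lipschitz) purely as an illustrative assumption; the function choice, the sampling scheme, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
L, eps = 1.0, 0.3          # cos is 1-Lipschitz, so L = 1 for f = sin
radius = eps / (3 * L)     # the delta used in the proposition

def gradient_gap(x, k=5):
    # Sample k points in the ball of the given radius around x and random
    # convex weights, then compare the combined gradient g with grad f(x).
    pts = x + rng.uniform(-radius, radius, size=k)
    alpha = rng.dirichlet(np.ones(k))   # nonnegative weights summing to one
    g = float(alpha @ np.cos(pts))
    return abs(np.cos(x) - g)

# Largest observed gap over a grid of base points
worst_gap = max(gradient_gap(x) for x in np.linspace(-3.0, 3.0, 601))
```

Whenever such a combination additionally satisfies $\|g\| \le 2\epsilon/3$, the triangle inequality gives $\|\nabla f(x)\| \le \epsilon$, which is the conclusion of the proof.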

### B.3 Proof of Lemma 7

###### Proof.

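First, we show that the limit exists. By Lipschitzness and Jensen's inequality, we know that $\partial f(x+\delta B)$ lies in a bounded ball with radius $L$. For any sequence $\{\delta_k\}$ with $\delta_k \downarrow 0$, we know that $\partial f(x+\delta_{k+1}B) \subseteq \partial f(x+\delta_k B)$. Therefore, the limit exists by the monotone convergence theorem.

Next, we show that $\lim_{\delta\downarrow 0} \partial f(x+\delta B) = \partial f(x)$. For one direction, we show that $\partial f(x) \subseteq \lim_{\delta\downarrow 0} \partial f(x+\delta B)$. This follows from Proposition 1.5 and the fact that

$$\cup_{y \in x+\delta B}\, \partial f(y) \subseteq \mathrm{conv}\left(\cup_{y \in x+\delta B}\, \partial f(y)\right) = \partial f(x+\delta B).$$

Next, we show the other direction, $\lim_{\delta\downarrow 0} \partial f(x+\delta B) \subseteq \partial f(x)$. By upper semicontinuity, we know that for any $\epsilon > 0$, there exists $\delta > 0$ such that

$$\cup_{y \in x+\delta B}\, \partial f(y) \subseteq \partial f(x) + \epsilon B.$$

Then, by convexity of $\partial f(x)$ and $\epsilon B$, we know that their Minkowski sum $\partial f(x) + \epsilon B$ is convex. Therefore, we conclude that for any $\epsilon > 0$, there exists $\delta > 0$ such that

$$\partial f(x+\delta B) = \mathrm{conv}\left(\cup_{y \in x+\delta B}\, \partial f(y)\right) \subseteq \partial f(x) + \epsilon B.$$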
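A one-dimensional example makes the shrinking behaviour visible. For $f(x) = |x|$ the set $\partial f(x+\delta B)$ is an interval that can be written down directly; the sketch below, whose function name and sample values are illustrative assumptions, evaluates it for decreasing $\delta$.

```python
def goldstein_subdifferential_abs(x, delta):
    """Interval [lo, hi] = conv of the generalized gradients of |.| over [x - delta, x + delta].

    The generalized gradient of |y| is {-1} for y < 0, [-1, 1] at y = 0, and {1} for y > 0,
    so the hull is determined by whether the closed ball reaches the origin from each side.
    """
    lo = -1.0 if x - delta <= 0 else 1.0
    hi = 1.0 if x + delta >= 0 else -1.0
    return lo, hi

# Away from the kink, the set collapses to the gradient once delta < |x|.
away_large = goldstein_subdifferential_abs(0.5, 1.0)
away_small = goldstein_subdifferential_abs(0.5, 0.25)
# At the kink, the set stays equal to the full interval for every delta.
at_kink = goldstein_subdifferential_abs(0.0, 1e-9)
```

In both cases the intervals are nested as $\delta$ decreases and converge to the generalized gradient at $x$, matching the statement of the lemma.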

## Appendix C Proof of Theorem 8

Before we prove the theorem, we first analyze how many times the algorithm iterates in the while loop.

###### Lemma 13.

Let . Given ,

$$\mathbb{E}\left[\|m_{t,K}\|^2\right] \le \frac{\epsilon^2}{16},$$

where for convenience of analysis, we define for all if the -loop breaks at . Consequently, for any , with probability , there are at most restarts of the while loop at the -th iteration.

###### Proof.

Let , then We denote as the event that -loop does not break at , i.e. and . It is clear that . Let . Note that