# Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition

We provide a theoretical explanation for the fast convergence of gradient clipping and adaptively scaled gradient methods commonly used in neural network training. Our analysis is based on a novel relaxation of gradient smoothness conditions that is weaker than the commonly used Lipschitz smoothness assumption. We validate the new smoothness condition in experiments on large-scale neural network training applications where adaptively-scaled methods have been empirically shown to outperform standard gradient based algorithms. Under this new smoothness condition, we prove that two popular adaptively scaled methods, gradient clipping and normalized gradient, converge faster than the theoretical lower bound of fixed-step gradient descent. We verify this fast convergence empirically in neural network training for language modeling and image classification.


## 1 Introduction

We study gradient-based optimization algorithms for minimizing a differentiable nonconvex function $f:\mathbb{R}^d \to \mathbb{R}$, where $f$ can potentially be stochastic, i.e., $f(x) = \mathbb{E}_\xi[F(x,\xi)]$. Such choices of $f$ cover a wide range of problems in machine learning; as a result, their study motivates a vast body of current optimization literature. Classical approaches for minimizing

(Duchi et al., 2011), ADAM (Kingma and Ba, 2014)

, and RMSProp

discuss the gradient explosion problem in recurrent models and consider clipping as an intuitive trick to work around the explosion. We formalize this argument and prove that clipped GD can be arbitrarily faster than ordinary GD. By examining the smoothness condition and providing new convergence bounds on adaptively-scaled methods, we hope this work can help close the following gap between theory and practice. On the one hand, powerful techniques such as Nesterov's momentum and variance reduction have been proposed to theoretically accelerate convex and nonconvex optimization. However, these techniques, at least for now, seem to have limited applicability in deep learning (Defazio and Bottou, 2018). On the other hand, some widely used empirical techniques (e.g., heavy-ball momentum, adaptivity) do not have theoretical acceleration guarantees. We suspect that one of the many reasons is a misalignment of the problem assumptions. Our work demonstrates that the concept of acceleration critically relies on the problem assumptions and that the standard global Lipschitz-gradient assumption may not hold in some applications.

### 1.1 Contributions

We now summarize the main contributions of this paper as follows:

• We propose a new smoothness condition that allows the local smoothness constant to change and increase with the gradient norm. This condition is strictly weaker than the standard Lipschitz-gradient assumption, and it is supported by empirical evidence in neural network training.

• We provide a convergence rate for clipped GD under our smoothness assumption (Theorem 3).

• We prove an upper-bound (Theorem 7) and a lower-bound (Theorem 5) on the convergence rate of GD under our relaxed smoothness assumption. The lower-bound demonstrates that GD with fixed step size can be arbitrarily slower than clipped GD.

• We show that stochastic clipped GD converges at the expected rate (Theorem 8). We explain why our proof does not apply to SGD with fixed step sizes, outlining the key hurdles.

We support our proposed theory with several experiments. Since gradient clipping is widely used in training recurrent models for natural language processing, we validate our smoothness condition (see Assumption 3) in this setting; we observe that the smoothness grows with gradient norms along the training trajectory (Fig. 1(a)). Additional experiments suggest that clipping allows the training trajectory to cross non-smooth regions of the loss, thereby accelerating convergence. Moreover, we show that clipped GD can converge faster (in training) than momentum-SGD, and achieve the same generalization performance as a strong baseline algorithm (e.g., test accuracy in 200 epochs for ResNet20 on the Cifar10 dataset). Please see Section 6 for more details.

## 2 Problem setup and algorithms

In this section, we set up the problem and introduce our new and relaxed smoothness assumption. Recall that we wish to solve the non-convex optimization problem

$$\min_{x\in\mathbb{R}^d} f(x).$$

In general this problem is intractable; so, instead of seeking a global optimum, we seek an $\epsilon$-stationary point, i.e., a point $x$ such that $\|\nabla f(x)\| \le \epsilon$. Furthermore, we assume that the following conditions hold in the neighborhood $\mathcal{S}$ of the sublevel set for a given initialization $x_0$, where

$$\mathcal{S} := \{x \mid \exists\, y \text{ such that } f(y)\le f(x_0) \text{ and } \|x-y\|\le 1\}. \tag{1}$$

The constant "1" in expression (1) is arbitrary and can be replaced by any fixed positive constant.

###### Assumption 1.

Function $f$ is lower bounded by $f^* > -\infty$.

###### Assumption 2.

Function $f$ is twice differentiable.

The above assumptions are standard. Below we introduce our new relaxed smoothness assumption.

###### Assumption 3 ($(L_0, L_1)$-smoothness).

$f$ is $(L_0, L_1)$-smooth if there exist positive constants $L_0$ and $L_1$ such that $\|\nabla^2 f(x)\| \le L_0 + L_1\|\nabla f(x)\|$.

Section 3 will motivate Assumption 3 and discuss how it relaxes the canonical Lipschitz-gradient assumption and enlarges the class of functions considered. We note here a brief point regarding Assumption 2: the $(L_0, L_1)$-smoothness can be generalized to once differentiable functions by replacing Assumption 3 with the following definition:

$$\limsup_{\delta\to 0}\frac{\|\nabla f(x)-\nabla f(x+\delta)\|}{\|\delta\|}\le L_1\|\nabla f(x)\|+L_0.$$

This condition implies that $\nabla f(x)$ is locally Lipschitz, and hence almost everywhere differentiable. All our results go through by handling the integrations more carefully. But to avoid such complications and simplify exposition, we assume that the function is twice differentiable.

In this section, we review a few of the well-known variants of gradient based algorithms that relate to this work. We start with ordinary gradient descent,

$$x_{k+1} = x_k - \eta\nabla f(x_k), \tag{2}$$

where $\eta$ is a fixed step size. This algorithm is the baseline algorithm used in neural network training. Many modifications of it have been proposed to stabilize or accelerate training. One such technique, of particular importance to this work, is clipped gradient descent. The update for clipped GD can be written as

$$x_{k+1} = x_k - h_c\nabla f(x_k), \quad \text{where } h_c := \min\Big\{\eta_c, \frac{\gamma\eta_c}{\|\nabla f(x_k)\|}\Big\}. \tag{3}$$

Another algorithm that is less common in practice but has attracted theoretical interest is normalized gradient descent (NGD). The update for the normalized GD method can be written as

$$x_{k+1} = x_k - h_n\nabla f(x_k), \quad \text{where } h_n := \frac{\eta_n}{\|\nabla f(x_k)\|+\beta}. \tag{4}$$

Clipped GD and NGD are almost equivalent. Indeed, if we set $\eta_n = \gamma\eta_c$ and $\beta = \gamma$, then

$$\tfrac{1}{2}h_c \le h_n \le 2h_c.$$

Therefore, clipped GD is equivalent to NGD up to a constant factor in the step size choice. Consequently, the convergence rates in Section 4 and Section 5 for clipped GD also apply to NGD. Thus, we omit repeating the analysis for conciseness.
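The constant-factor relation between the two step sizes can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the coupling `eta_n = gamma * eta_c`, `beta = gamma` is an assumption chosen here because it makes the sandwich bound hold; the values of `eta_c` and `gamma` are arbitrary.

```python
def clipped_step(grad_norm, eta_c, gamma):
    """Step size h_c of clipped GD, as in update (3)."""
    if grad_norm == 0.0:
        return eta_c
    return min(eta_c, gamma * eta_c / grad_norm)


def normalized_step(grad_norm, eta_n, beta):
    """Step size h_n of normalized GD, as in update (4)."""
    return eta_n / (grad_norm + beta)


def sandwich_holds(grad_norm, eta_c=0.1, gamma=2.0):
    # Assumed coupling of the parameters: eta_n = gamma * eta_c, beta = gamma.
    h_c = clipped_step(grad_norm, eta_c, gamma)
    h_n = normalized_step(grad_norm, gamma * eta_c, gamma)
    return 0.5 * h_c <= h_n <= 2.0 * h_c


# Sweep gradient norms over twelve orders of magnitude.
grad_norms = [10.0 ** p for p in range(-6, 7)]
all_ok = all(sandwich_holds(g) for g in grad_norms)
print("0.5 * h_c <= h_n <= 2 * h_c on the whole sweep:", all_ok)
```

The check works because $\|\nabla f\| + \gamma$ always lies between $\max\{\gamma, \|\nabla f\|\}$ and twice that value.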

## 3 Relaxed smoothness condition and motivations

In this section, we discuss and motivate the relaxed smoothness condition in Assumption 3. We start with the traditional definition of smoothness, recalling how it leads to the step size choice in GD.

### 3.1 Function smoothness (Lipschitz gradients)

Recall that we wish to solve $\min_{x\in\mathbb{R}^d} f(x)$. The objective $f$ is called $L$-smooth if

$$\|\nabla f(x)-\nabla f(y)\| \le L\|x-y\|, \quad \text{for all } x,y\in\mathbb{R}^d. \tag{5}$$

For twice differentiable functions, condition (5) is equivalent to $\|\nabla^2 f(x)\| \le L$ for all $x\in\mathbb{R}^d$. Under this smoothness, one can show the following well-known upper bound:

$$f(y) \le f(x)+\langle\nabla f(x),y-x\rangle+\frac{L}{2}\|y-x\|^2. \tag{6}$$

Suppose we set $y = x - h\nabla f(x)$; then we can pick the step size $h$ to minimize the corresponding upper bound (6), to obtain

$$h^* = \arg\min_h\; f(x)-h\|\nabla f(x)\|^2+\frac{Lh^2}{2}\|\nabla f(x)\|^2 = \frac{1}{L}. \tag{7}$$

This choice of $h$ leads to GD with a fixed step. Carmon et al. (2017) show that GD with $h = 1/L$ is optimal up to a constant for optimizing smooth nonconvex functions. Noting this optimality relation between $L$-smoothness and step size choice, we are led to ask the question: "Is clipped gradient descent optimized for a different smoothness condition?" We answer this question in Section 3.2.

The usual $L$-smoothness assumption (5) enables clean theoretical analysis but has its limitations. Assuming the existence of a global constant $L$ that upper bounds the variation of the gradient is very restrictive. For example, simple polynomials such as $f(x) = x^4$ break the assumption. One workaround is to assume that $L$ exists in a compact region, and either prove that the iterates do not escape the region or run projection-based algorithms. However, such an assumption can make $L$ very large and slow down the rate. In Section 4, we will show that a slow rate is unavoidable for gradient descent with fixed step size, whereas clipped gradient descent can greatly improve the dependency on $L$.

Moreover, though the bound (6) is optimal in the worst case, it can be too conservative. It is true that within any compact region, the function smoothness is bounded. However, the local smoothness can vary drastically (see Figure 1 for example). Gradient based methods can speed up convergence by taking larger steps in flat regions. Intuitively, this is why adaptive gradient methods can be faster.
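To make the limitation of a single global smoothness constant concrete, the following sketch runs fixed-step GD and clipped GD on the quartic $f(x) = x^4$. This is an illustration chosen here, not an experiment from the paper; the step size, clipping threshold, initial points, and function names are all assumed values.

```python
def grad(x):
    # Gradient of f(x) = x**4; the second derivative 12 * x**2 is unbounded.
    return 4.0 * x ** 3


def run_gd(x0, step, iters=200, blowup=1e6):
    """Fixed-step GD; returns |x| at the end, or inf once the iterates blow up."""
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
        if abs(x) > blowup:
            return float("inf")
    return abs(x)


def run_clipped_gd(x0, eta, gamma, iters=200):
    """Clipped GD with step size min(eta, gamma * eta / |grad|)."""
    x = x0
    for _ in range(iters):
        g = grad(x)
        h = eta if g == 0.0 else min(eta, gamma * eta / abs(g))
        x = x - h * g
    return abs(x)


step = 0.1
gd_small_init = run_gd(1.0, step)    # curvature near x = 1 is compatible with this step
gd_large_init = run_gd(10.0, step)   # curvature near x = 10 is 100 times larger
clipped_large_init = run_clipped_gd(10.0, eta=step, gamma=1.0)

print("GD from x0 = 1:          ", gd_small_init)
print("GD from x0 = 10:         ", gd_large_init)
print("clipped GD from x0 = 10: ", clipped_large_init)
```

With the same nominal step size, fixed-step GD is stable from the small initialization but diverges from the large one, while clipped GD moves at most `gamma * eta` per iteration and reaches the flat region from either.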

### 3.2 Relaxed smoothness assumption

We now return to the question raised in Section 3.1. As step sizes for clipped GD and NGD are related by a constant factor, we answer the question by studying NGD. Inspired by the quadratic upper bound (6), assume that the NGD step size $h_n = \eta / (\|\nabla f(x)\| + \beta)$ optimizes the quadratic function

$$f(x)-h\|\nabla f(x)\|^2+\frac{L(x)h^2}{2}\|\nabla f(x)\|^2.$$

Then we can deduce that

$$L(x) = \frac{\|\nabla f(x)\|+\beta}{\eta}. \tag{8}$$
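For completeness, the step from the quadratic model to (8) can be spelled out as follows. This is a short worked derivation under two assumptions: the minimizing step size is identified with the NGD step size from (4) with $\eta_n = \eta$, and $\nabla f(x) \neq 0$. The symbols $q$ and $h^\star$ are notation introduced here.

```latex
\begin{aligned}
q(h) &= f(x) - h\,\|\nabla f(x)\|^2 + \frac{L(x)\,h^2}{2}\,\|\nabla f(x)\|^2, \\
q'(h) &= -\|\nabla f(x)\|^2 + L(x)\,h\,\|\nabla f(x)\|^2 = 0
  \;\Longrightarrow\; h^\star = \frac{1}{L(x)}, \\
h^\star &= \frac{\eta}{\|\nabla f(x)\| + \beta}
  \;\Longrightarrow\; L(x) = \frac{\|\nabla f(x)\| + \beta}{\eta}.
\end{aligned}
```

In words, a step size that shrinks like the inverse gradient norm is the optimal fixed-model step for a local smoothness constant that grows linearly in the gradient norm.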

Based on the intuition from (8), we propose the following relaxed smoothness condition.

###### Definition 1.

A second order differentiable function $f$ is $(L_0, L_1)$-smooth if

$$\|\nabla^2 f(x)\| \le L_0+L_1\|\nabla f(x)\|.$$

Definition 1 strictly relaxes the usual $L$-smoothness. There are two ways to interpret the relaxation. First, when we focus on a compact region, we can balance the constants $L_0$ and $L_1$ such that $L_0 \ll L$ while $L_1 \ll L$. Second, there exist functions that are $(L_0, L_1)$-smooth globally, but not $L$-smooth. Hence the constant $L$ for $L$-smoothness gets larger as the compact set increases, but $L_0$ and $L_1$ stay fixed. An example is given in Lemma 2.

###### Remark 1.

It is worth noting that we do not need the Hessian operator norm and gradient norm to necessarily satisfy a linear relation. As long as they are positively correlated, clipped gradient descent can be shown to achieve faster rate than fixed step size gradient descent. We use the linear relationship for simplicity of exposition.

###### Lemma 2.

Let $f$ be the univariate polynomial $f(x) = \sum_{i=1}^{d} a_i x^i$. When $d \ge 3$, $f$ is $(L_0, L_1)$-smooth for some $L_0$ and $L_1$ but not $L$-smooth.

###### Proof.

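The first claim follows from $\lim_{x\to\infty}\left|\frac{f'(x)}{f''(x)}\right| = \lim_{x\to-\infty}\left|\frac{f'(x)}{f''(x)}\right| = \infty$. The second claim follows by the unboundedness of $f''(x)$. ∎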
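As a concrete instance of a polynomial that is relaxed-smooth but not Lipschitz-smooth, the sketch below checks the quartic $f(x) = x^4$ on a grid. The polynomial, the grid, and the constants `L0 = 16`, `L1 = 1` are choices made here for illustration (they follow from maximizing $12x^2 - 4|x|^3$, which peaks at $|x| = 2$), not values taken from the paper.

```python
def d1(x):
    # First derivative of f(x) = x**4.
    return 4.0 * x ** 3


def d2(x):
    # Second derivative of f(x) = x**4.
    return 12.0 * x ** 2


# Assumed constants for the relaxed condition |f''(x)| <= L0 + L1 * |f'(x)|.
L0, L1 = 16.0, 1.0

xs = [i / 10.0 for i in range(-1000, 1001)]  # grid on [-100, 100]

# The ratio stays at most 1 exactly when the relaxed condition holds on the grid.
worst_ratio = max(abs(d2(x)) / (L0 + L1 * abs(d1(x))) for x in xs)

# No single Lipschitz constant L works: the curvature keeps growing with |x|.
largest_curvature = max(abs(d2(x)) for x in xs)

print("max |f''| / (L0 + L1 |f'|) on the grid:", worst_ratio)
print("max |f''| on the grid:", largest_curvature)
```

Enlarging the grid leaves `worst_ratio` unchanged while `largest_curvature` grows quadratically, which is the distinction the lemma draws.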

### 3.3 Smoothness in neural networks

We showed that our proposed smoothness condition relaxes the traditional smoothness assumption and is naturally motivated by normalized gradient descent. In this section, we argue that it captures the structure of neural network training. To justify our claim, in Figure 1(a) we empirically show that a strong linear correlation exists between the gradient norm and the estimated local smoothness for LSTM-based language-model training when gradient clipping is applied. For more details of the experiment, please refer to Section 6.

Below we develop some high-level intuition for this phenomenon. We conjecture that the said positive correlation results from the common components in expressions of the gradient and the Hessian. We illustrate the reasoning behind this conjecture by considering an $\ell$-layer linear network with quadratic loss; a similar computation also holds for nonlinear networks. The training loss of a deep linear network is the quadratic loss $L(Y, f(X))$ with $f(X) = W_\ell \cdots W_1 X$, where $Y$ denotes labels, $X$ denotes the input data matrix, and $W_i$ denotes the weights in the $i$-th layer. By (Lemma 4.3 Kawaguchi, 2016), we know that

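$$\nabla_{\mathrm{vec}(W_i)} L(Y,f(X)) = \big((W_\ell\cdots W_{i+1})\otimes(W_{i-1}\cdots W_2W_1X)^T\big)^T\,\mathrm{vec}(f(X)-Y),$$

where $\mathrm{vec}(\cdot)$ flattens a matrix in $\mathbb{R}^{m\times n}$ into a vector in $\mathbb{R}^{mn}$, and $\otimes$ denotes the Kronecker product. For constants $i, j$ such that $\ell \ge j > i > 0$, the second order derivative is

$$\begin{aligned}
\nabla_{\mathrm{vec}(W_j)}\nabla_{\mathrm{vec}(W_i)} L(Y,f(X)) ={}& \big((W_\ell\cdots W_{i+1})\otimes(W_{i-1}\cdots W_2W_1X)^T\big)^T\big((W_\ell\cdots W_{j+1})\otimes(W_{j-1}\cdots W_2W_1X)^T\big) \\
&+ \big((W_{j-1}\cdots W_{i+1})\otimes(W_{i-1}\cdots W_2W_1X)\big)\big(I\otimes((f(X)-Y)W_\ell\cdots W_{j+1})\big).
\end{aligned}$$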

When $j = i$, the second term equals $0$. Based on the above expressions, we notice that the gradient norm and Hessian norm may be positively correlated due to the following two observations. First, the gradient and the Hessian share many components, such as the matrix product of weights across layers. Second, if one naively upper bounds the norm using Cauchy-Schwarz, then both upper bounds would be monotonically increasing with respect to $\|W_i\|$ and $\|f(X)-Y\|$.

## 4 Convergence in the full batch setting

In this section, we analyze the convergence rates of GD and clipped GD under our proposed conditions. We bound the number of iterations required by each algorithm to find an $\epsilon$-stationary point.

We start by analyzing the clipped GD algorithm with update defined in equation (3).

###### Theorem 3.

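Assume that Assumptions 1, 2, and 3 hold in the set $\mathcal{S}$ defined in (1). With the parameters $\eta_c = \frac{1}{10L_0}$ and $\gamma = \min\big\{\frac{1}{\eta_c}, \frac{1}{10L_1\eta_c}\big\}$, clipped GD (Algorithm 3) terminates in

$$\frac{20L_0(f(x_0)-f^*)}{\epsilon^2}+\frac{20\max\{1,L_1^2\}(f(x_0)-f^*)}{L_0} \quad \text{iterations}.$$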

The proof of Theorem 3 starts by bounding the Hessian norm in a neighborhood of the current iterate using Grönwall’s inequality. Afterwards we may use the standard proof of gradient descent and show that function value decreases at each iteration. Details are included in Appendix B.
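The iteration bound can be sanity-checked on a one-dimensional example. The sketch below is a toy illustration under stated assumptions, not the paper's experiment: the objective $f(x) = \cosh(x)$ is chosen here because $\cosh(x) \le 1 + |\sinh(x)|$, so it satisfies the relaxed condition with $L_0 = L_1 = 1$; the parameter choice $\eta_c = 1/(10L_0)$, $\gamma = \min\{1/\eta_c, 1/(10L_1\eta_c)\}$ is assumed because it is consistent with the inequalities used in Appendix B; function names and the values of `x0` and `eps` are hypothetical.

```python
import math

# Assumed constants: |f''(x)| = cosh(x) <= 1 + |sinh(x)| = L0 + L1 * |f'(x)|.
L0, L1 = 1.0, 1.0
F_STAR = 1.0  # minimum value of cosh


def f(x):
    return math.cosh(x)


def grad(x):
    return math.sinh(x)


def clipped_gd(x0, eps, max_iters=100000):
    """Clipped GD, update (3), run until |f'(x)| <= eps."""
    eta_c = 1.0 / (10.0 * L0)
    gamma = min(1.0 / eta_c, 1.0 / (10.0 * L1 * eta_c))
    x, k = x0, 0
    while abs(grad(x)) > eps and k < max_iters:
        g = grad(x)
        h = min(eta_c, gamma * eta_c / abs(g))
        x -= h * g
        k += 1
    return x, k


x0, eps = 10.0, 0.1
x_final, iters = clipped_gd(x0, eps)

# Iteration bound of Theorem 3 evaluated for this instance.
gap = f(x0) - F_STAR
bound = 20.0 * L0 * gap / eps ** 2 + 20.0 * max(1.0, L1 ** 2) * gap / L0

print("iterations used:", iters)
print("Theorem 3 bound:", bound)
```

On this instance the observed iteration count sits far below the worst-case bound, which is expected since the bound must hold for every function in the class.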

### 4.2 Gradient descent with fixed step size

Gradient descent with a fixed step size is known to converge to first order $\epsilon$-stationary points in $\mathcal{O}\big(L(f(x_0)-f^*)\epsilon^{-2}\big)$ iterations for $L$-smooth nonconvex functions. By the following theorem of Carmon et al. (2017), this rate is optimal up to a constant.

###### Theorem 4 (Thm 1 in (Carmon et al., 2017)).

For any deterministic first-order optimization algorithm using gradient oracles, the iteration complexity to optimize an $L$-smooth function to an $\epsilon$-stationary point is at least

$$c_0 L(f(x_0)-f^*)\epsilon^{-2},$$

for some numerical constant $c_0 > 0$.

However, we will show below that gradient descent is suboptimal under our relaxed $(L_0, L_1)$-smoothness condition. In particular, to prove the convergence rate for gradient descent with fixed step size, we need to make an additional assumption on gradient norms.

###### Assumption 4.

Given an initialization $x_0$, we assume that

$$M := \sup\{\|\nabla f(x)\| \mid x \text{ such that } f(x)\le f(x_0)\} < \infty.$$

The next theorem states that this assumption is necessary. Particularly, we show that gradient descent with fixed step size cannot converge faster than when . Therefore, GD can be arbitrarily slower than clipped GD under our relaxed smoothness assumption.

###### Theorem 5.

Let $\mathcal{F}$ be the class of objectives satisfying Assumptions 1, 2, 3, and 4 with fixed constants $L_0$, $L_1$, $M$. If GD with fixed step size is convergent for any function in $\mathcal{F}$, then there is a function $f \in \mathcal{F}$ such that GD with a fixed step size takes at least

$$\frac{L_1M\,(f(x_0)-f^*-5\epsilon/8)}{8\epsilon^2(\log M+1)}$$

iterations to converge to an $\epsilon$-stationary point.

The proof starts with an exponentially growing function and shows that the step size for gradient descent must be small. The small step size leads to very slow convergence for another almost linear function with a small gradient. Details of this construction can be found in Appendix C.

###### Remark 6.

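Theorems 4 and 5 together show that gradient descent with a fixed step size cannot converge to an $\epsilon$-stationary point faster than $\Omega\big((ML_1/\log(M)+L_0)(f(x_0)-f^*)\epsilon^{-2}\big)$. Recall that the clipped GD algorithm converges as $\mathcal{O}\big(L_0(f(x_0)-f^*)\epsilon^{-2}+\max\{1,L_1^2\}(f(x_0)-f^*)L_0^{-1}\big)$. This comparison shows that clipped GD converges much faster than GD when $M$ is large, or in other words, when the problem has a poor initialization.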

Below, we provide an iteration upper bound for the fixed-step gradient descent update (2).

###### Theorem 7.

Suppose Assumptions 1, 2, 3, and 4 hold in the set $\mathcal{S}$ defined in (1). If we pick parameters such that $h = \frac{1}{2(ML_1+L_0)}$, then GD with a fixed step size defined in Algorithm 2 terminates in

$$\frac{4(ML_1+L_0)(f(x_0)-f^*)}{\epsilon^2} \quad \text{iterations}.$$

Please refer to Appendix D for the proof. Theorem 7 shows that gradient descent with a fixed step size converges in $\mathcal{O}\big((ML_1+L_0)(f(x_0)-f^*)/\epsilon^2\big)$ iterations. This suggests that the lower bound in Remark 6 is tight up to a log factor in $M$.
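The gap between fixed-step GD and clipped GD under a poor initialization can be seen on a toy problem. The sketch below is illustrative only and rests on several assumptions made here: the objective $f(x) = \cosh(x)$ with $L_0 = L_1 = 1$; the fixed step size $h = 1/(2(ML_1 + L_0))$, which is one choice compatible with the proof in Appendix D; the clipped GD parameters $\eta_c = 1/(10L_0)$ and $\gamma = \min\{1/\eta_c, 1/(10L_1\eta_c)\}$; and hypothetical function names and values for `x0` and `eps`.

```python
import math

L0, L1 = 1.0, 1.0  # assumed constants: cosh(x) <= 1 + |sinh(x)|
F_STAR = 1.0       # minimum value of cosh


def grad(x):
    return math.sinh(x)


def fixed_step_gd(x0, eps, max_iters=10 ** 6):
    """GD, update (2), with a step size that scales like 1 / M."""
    M = abs(grad(x0))                 # sup of |f'| over the sublevel set of x0
    h = 1.0 / (2.0 * (M * L1 + L0))   # assumed step size
    x, k = x0, 0
    while abs(grad(x)) > eps and k < max_iters:
        x -= h * grad(x)
        k += 1
    return k


def clipped_gd(x0, eps, max_iters=10 ** 6):
    """Clipped GD, update (3), with the assumed parameters."""
    eta_c = 1.0 / (10.0 * L0)
    gamma = min(1.0 / eta_c, 1.0 / (10.0 * L1 * eta_c))
    x, k = x0, 0
    while abs(grad(x)) > eps and k < max_iters:
        g = grad(x)
        x -= min(eta_c, gamma * eta_c / abs(g)) * g
        k += 1
    return k


x0, eps = 8.0, 0.1
M = abs(grad(x0))
gd_iters = fixed_step_gd(x0, eps)
clipped_iters = clipped_gd(x0, eps)
theorem7_bound = 4.0 * (M * L1 + L0) * (math.cosh(x0) - F_STAR) / eps ** 2

print("fixed-step GD iterations:", gd_iters)
print("clipped GD iterations:   ", clipped_iters)
print("Theorem 7 bound:         ", theorem7_bound)
```

Because the fixed step must be small enough for the steepest part of the sublevel set, it stays tiny even near the minimizer; increasing `x0` widens the gap between the two iteration counts.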

## 5 Convergence in the stochastic setting

###### Assumption 5.

$\mathbb{E}[g_k \mid x_k] = \nabla f(x_k)$, where $g_k$ denotes the stochastic gradient at $x_k$; that is, we have unbiased stochastic gradients.

###### Assumption 6.

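$\|g_k\| \le G$ almost surely, where $g_k$ denotes the stochastic gradient at $x_k$. This implies that $\|\nabla f(x_k)\| \le G$.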

Bounded gradient is a strong assumption but it is commonly used in proving convergence for adaptive gradient methods (see (Reddi et al., 2019; Ward et al., 2018; Zhou et al., 2018)). In the analysis below, we only discuss the case when , otherwise the condition is equivalent to smooth. The main result of this section is the following convergence guarantee for stochastic clipped GD (based on the stochastic version of the update (3)).

###### Theorem 8.

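Let Assumptions 1, 2, 3, 5, and 6 hold globally. Let $h_k = \min\big\{\eta, \frac{\eta}{\eta L_1\|g_k\|+b}\big\}$ with $\eta = 1/\sqrt{T}$, where $0 < b < 1$. Stochastic clipped GD after $T$ iterations of update (3) satisfies

$$\frac{1}{T}\sum_{k=1}^{T}\mathbb{E}\big[\|\nabla f(x_k)\|^2\big] \le \frac{4}{\sqrt{T}}\Big[(f(x_0)-f^*)+\Big(4L_1G^3\Big(1+\frac{1}{b(1-b)}\Big)+5L_0G^2\Big)\Big].$$

As a result, the algorithm converges to an $\epsilon$-stationary point in $\mathcal{O}(\epsilon^{-4})$ iterations.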

The convergence proof critically relies on the fact that the update distance in each iteration has a fixed upper-bound due to clipping. However, the bounded radius causes a problem in the proof for fixed-step-size SGD. If we only assume bounded second moments of the stochastic gradient oracle, we cannot control the distance between the current point and the updated point. Hence one cannot apply Lemma 9 the same way as for (10) in Appendix E. Though we cannot prove the convergence of SGD with fixed step size, one also cannot theoretically rule out the possibility that it converges. Nevertheless, if we additionally assume that the noise in stochastic gradient oracle is sub-Gaussian, then we can show that SGD with fixed step size converges at rate . In order to avoid diversion from discussing adaptive methods, we omit including this analysis for conciseness.
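The role of the bounded update distance can be illustrated directly. The sketch below assumes the step size form $h_k = \min\{\eta, \eta/(\eta L_1\|g_k\| + b)\}$ that appears in Appendix E; the numerical values of `eta`, `L1`, `b` and the function name are hypothetical. It compares the update length $h_k\|g_k\|$ with the length $\eta\|g_k\|$ of a fixed-step update over a wide range of stochastic gradient norms.

```python
def stochastic_clipped_step(g_norm, eta, L1, b):
    """Assumed step size h_k = min(eta, eta / (eta * L1 * ||g_k|| + b))."""
    return min(eta, eta / (eta * L1 * g_norm + b))


eta, L1, b = 0.05, 2.0, 0.5
g_norms = [10.0 ** p for p in range(-3, 9)]  # gradient norms from 1e-3 to 1e8

# Length of the update x_{k+1} - x_k for each gradient norm.
clipped_moves = [stochastic_clipped_step(g, eta, L1, b) * g for g in g_norms]
fixed_moves = [eta * g for g in g_norms]

print("largest clipped update length:", max(clipped_moves), "(radius 1 / L1 =", 1.0 / L1, ")")
print("largest fixed-step update length:", max(fixed_moves))
```

The clipped update never leaves the ball of radius $1/L_1$ around the current iterate, which is the region where the local gradient bound of Lemma 9 applies; a fixed step offers no such guarantee when the stochastic gradient is large.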

## 6 Experiments

In this section, we summarize our experimental findings on the positive correlation between gradient norm and local smoothness. We then show that clipping accelerates convergence during neural network training. Our experiments are based on two tasks: language modeling and image classification. We run language modeling on the Penn Treebank (PTB) (Mikolov et al., 2010) dataset with AWD-LSTM models (Merity et al., 2018). For image classification, we train ResNet20 (He et al., 2016) on the Cifar10 dataset (Krizhevsky and Hinton, 2009). Details about the smoothness estimation and experimental setups are explained in Appendix F.

First, our experiments test whether the local smoothness constant increases with the gradient norm, as suggested by the relaxed smoothness conditions defined in Section 3. To do so, we evaluate both quantities at points generated by the optimization procedure. We then scatter the local smoothness constants against the gradient norms in Figure 1 and Figure 2. Note that the plots are on a log-scale. A linear scale plot is shown in Appendix Figure 4. We notice that the correlation exists in the default training procedure for language modeling (see Figure 1(a)) but not in the default training for image classification (see Figure 2(a)). This difference aligns with the fact that gradient clipping is widely used in language modeling but is less popular in ResNet training, offering empirical support to our theoretical findings.

We further investigate the cause of correlation. The plots in Figures 1 and 2 show that correlation appears when the models are trained with clipped GD and large learning rates. We propose the following explanation. Clipping enables the training trajectory to stably traverse non-smooth regions. Hence, we can observe that gradient norms and smoothness are positively correlated in Figures 1(a) and 2(c). Without clipping, the optimizer has to adopt a small learning rate and stays in a region where local smoothness does not vary much; otherwise the sequence diverges, and a different learning rate is used. Therefore, in other plots of Figures 1 and 2, the correlation is much weaker.

As positive correlations are present in both language modeling and image classification experiments with large step sizes, our next set of experiments checks whether clipping helps accelerate convergence as predicted by our theory. From Figure 3, we find that the ability to traverse non-smooth regions indeed accelerates convergence. Because gradient clipping is a standard practice in language modeling, the LSTM models trained with clipping achieve the best validation performance and the fastest training loss convergence, as expected. For image classification, surprisingly, clipped GD also achieves the fastest convergence and matches the test performance of SGD+momentum. These plots show that clipping can accelerate convergence and achieve good test performance at the same time. We do not analyze the theory of this generalization capability, as it is beyond the scope of this work.

## 7 Discussion

Much progress has been made to close the gap between upper and lower oracle complexities for first order smooth optimization. The works dedicated to this goal provide important insights and tools for us to understand the optimization procedure. However, there is another gap that separates theoretically accelerated algorithms from empirically fast algorithms. This work aims to close this gap. Specifically, we proposed a relaxed smoothness assumption that is supported by empirical evidence. We analyzed a simple but widely used optimization technique known as gradient clipping and provided a theoretical guarantee that clipping can accelerate gradient descent. This phenomenon aligns remarkably well with empirical observations.

There is still much to be explored in this direction. First, though our smoothness condition relaxes the usual Lipschitz assumption, it is unclear if there is a better condition that matches the experimental observations while also enabling a clean theoretical analysis. Second, we only studied the convergence of clipped gradient descent. Studying the convergence properties of other techniques such as momentum, coordinate-wise learning rates (more generally, preconditioning), and variance reduction is also interesting. Finally, the most important question is: "can we design fast algorithms based on relaxed conditions and actually achieve faster convergence in neural network training?"

Our experiments also have notable implications. First, though advocating clipped gradient descent in ResNet training is not a main point of this work, it is interesting to note that gradient descent and clipped gradient descent with large step sizes can achieve a similar test performance as momentum-SGD. Second, we learned that the performance of the baseline algorithm can actually beat some recently proposed algorithms. Therefore, when we design or learn about new algorithms, we need to pay extra attention to check whether the baseline algorithms are properly tuned.

## Appendix B Proof of Theorem 3

We start by proving a lemma that is repeatedly used in later proofs. The lemma bounds the gradient in a neighborhood of the current point by Grönwall’s inequality.

###### Lemma 9.

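Given $x$ such that $f(x) \le f(x_0)$, for any $x^+$ such that $\|x^+ - x\| \le \min\{1/L_1, 1\}$, we have $\|\nabla f(x^+)\| \le 4\big(L_0/L_1 + \|\nabla f(x)\|\big)$.

###### Remark 10.

Note that the constant "1" comes from the definition of $\mathcal{S}$ in (1). If Assumption 3 holds globally, then we do not need to constrain $\|x^+ - x\| \le 1$. This version will be used in Theorem 8.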

###### Proof.

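Let $\gamma(t)$ be a curve defined below,

$$\gamma(t) = t(x^+ - x) + x, \quad t\in[0,1].$$

Then we have

$$\nabla f(\gamma(t)) = \int_0^t \nabla^2 f(\gamma(\tau))(x^+ - x)\,d\tau + \nabla f(\gamma(0)).$$

By the Cauchy-Schwarz inequality, we get

$$\begin{aligned}
\|\nabla f(\gamma(t))\| &\le \|x^+ - x\|\int_0^t \|\nabla^2 f(\gamma(\tau))\|\,d\tau + \|\nabla f(x)\| \\
&\le \frac{1}{L_1}\int_0^t \big(L_0 + L_1\|\nabla f(\gamma(\tau))\|\big)\,d\tau + \|\nabla f(x)\|.
\end{aligned}$$

The second inequality follows by Assumption 3. Then we can apply the integral form of Grönwall's inequality and get

$$\|\nabla f(\gamma(t))\| \le \frac{L_0}{L_1} + \|\nabla f(x)\| + \int_0^t \Big(\frac{L_0}{L_1} + \|\nabla f(x)\|\Big)\exp(t-\tau)\,d\tau.$$

The lemma follows by setting $t = 1$, since the right-hand side then equals $e\,\big(L_0/L_1 + \|\nabla f(x)\|\big) \le 4\big(L_0/L_1 + \|\nabla f(x)\|\big)$. ∎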
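The local gradient bound used in the next subsection, $\|\nabla f(x^+)\| \le 4(L_0/L_1 + \|\nabla f(x)\|)$ for $\|x^+ - x\| \le 1/L_1$, can be checked numerically on a simple example. The sketch below assumes the one-dimensional objective $f(x) = \cosh(x)$ with $L_0 = L_1 = 1$ (valid since $\cosh(x) \le 1 + |\sinh(x)|$); the grid and the function name `lemma9_gap` are hypothetical choices made here.

```python
import math

L0, L1 = 1.0, 1.0               # assumed constants for f(x) = cosh(x)
radius = min(1.0 / L1, 1.0)     # admissible distance between x and x_plus


def grad(x):
    return math.sinh(x)


def lemma9_gap(x, delta):
    """Right-hand side minus left-hand side of the bound at x_plus = x + delta."""
    return 4.0 * (L0 / L1 + abs(grad(x))) - abs(grad(x + delta))


xs = [i / 4.0 for i in range(-40, 41)]               # base points in [-10, 10]
deltas = [radius * j / 10.0 for j in range(-10, 11)]  # displacements within the radius

smallest_gap = min(lemma9_gap(x, d) for x in xs for d in deltas)
print("smallest slack in the bound over the grid:", smallest_gap)
```

A nonnegative slack on the whole grid is consistent with the lemma; it does not replace the proof, since only finitely many points are tested.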

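### B.1 Proof of the theorem

We parameterize the path between $x_k$ and its updated iterate $x_{k+1}$ as follows:

$$\gamma(t) = t(x_{k+1}-x_k)+x_k, \quad \forall t\in[0,1].$$

Since $x_{k+1} = x_k - h_k\nabla f(x_k)$, using Taylor's theorem, the triangle inequality, and Cauchy-Schwarz, we obtain

$$f(x_{k+1}) \le f(x_k) - h_k\|\nabla f(x_k)\|^2 + \frac{\|x_{k+1}-x_k\|^2}{2}\int_0^1 \|\nabla^2 f(\gamma(t))\|\,dt.$$

Since

$$h_k \le \frac{\gamma\eta}{\|\nabla f(x_k)\|} \le \min\Big\{\frac{1}{\|\nabla f(x_k)\|}, \frac{1}{L_1\|\nabla f(x_k)\|}\Big\},$$

we know by Lemma 9

$$\|\nabla f(\gamma(t))\| \le 4\Big(\frac{L_0}{L_1} + \|\nabla f(x_k)\|\Big).$$

Then by Assumption 3, we have

$$f(x_{k+1}) \le f(x_k) - h_k\|\nabla f(x_k)\|^2 + \frac{5L_0+4L_1\|\nabla f(x_k)\|}{2}\,\|\nabla f(x_k)\|^2 h_k^2.$$

Therefore, as long as $h_k \le \frac{1}{5L_0+4L_1\|\nabla f(x_k)\|}$ (which follows by our choice of $\eta$ and $\gamma$), we have

$$f(x_{k+1}) \le f(x_k) - \frac{h_k\|\nabla f(x_k)\|^2}{2}.$$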

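When $\|\nabla f(x_k)\| \ge L_0/L_1$, we have

$$\frac{h_k\|\nabla f(x_k)\|^2}{2} \ge \frac{L_0}{20\max\{1,L_1^2\}}.$$

When $\epsilon \le \|\nabla f(x_k)\| \le L_0/L_1$, we have

$$\frac{h_k\|\nabla f(x_k)\|^2}{2} \ge \frac{\|\nabla f(x_k)\|^2}{20L_0} \ge \frac{\epsilon^2}{20L_0}.$$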

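Therefore,

$$f(x_{k+1}) \le f(x_k) - \min\Big\{\frac{L_0}{20\max\{1,L_1^2\}}, \frac{\epsilon^2}{20L_0}\Big\}.$$

Assume that the algorithm does not terminate in $T$ iterations. By doing a telescopic sum, we get

$$\sum_{k=0}^{T-1}\big(f(x_{k+1})-f(x_k)\big) \le -T\min\Big\{\frac{L_0}{20\max\{1,L_1^2\}}, \frac{\epsilon^2}{20L_0}\Big\}.$$

Rearranging, we get

$$T \le \frac{20L_0(f(x_0)-f^*)}{\epsilon^2}+\frac{20\max\{1,L_1^2\}(f(x_0)-f^*)}{L_0}.$$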

## Appendix C Proof of Theorem 5

We will prove a lower bound for the convergence rate of GD with fixed step size. The high level idea is that if GD converges for all functions satisfying the assumptions, then the step size needs to be small. However, this small step size will lead to very slow convergence for another function. We start with a function that grows exponentially. Let $L_1 > 1$, $M > 1$ be fixed constants. Pick the initial point $x_0 = (\log(M)+1)/L_1$. Let the objective be defined as follows,

$$f(x) = \begin{cases}
\dfrac{e^{-L_1x}}{L_1e}, & \text{for } x < -\dfrac{1}{L_1}, \\[2mm]
\dfrac{L_1x^2}{2}+\dfrac{1}{2L_1}, & \text{for } x\in\Big[-\dfrac{1}{L_1},\dfrac{1}{L_1}\Big], \\[2mm]
\dfrac{e^{L_1x}}{L_1e}, & \text{for } x > \dfrac{1}{L_1}.
\end{cases}$$

We notice that the function satisfies the assumptions with constants

$$L_0 = 1, \quad L_1 > 1, \quad M > 1. \tag{9}$$

When $h > 2x_0/M$, we would have $|x_1| > |x_0|$. By symmetry of the function and the super-linear growth of the gradient norm, we know that the iterates will diverge. Hence, in order for gradient descent with a fixed step size $h$ to converge, $h$ must be small enough. Formally,

$$h \le \frac{2x_0}{M} = \frac{2\log(M)+2}{ML_1}.$$
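The step size restriction can be observed by running GD on the exponential objective above. The sketch below is a hedged illustration: the values `L1 = 2` and `M = 100`, the divergence test (ten times the initial distance), and the function names are choices made here, and the initial point is taken as $x_0 = (\log M + 1)/L_1$, which is the value implied by the identity $2x_0/M = (2\log(M)+2)/(ML_1)$.

```python
import math


def grad(x, L1):
    """Derivative of the piecewise exponential objective."""
    if x > 1.0 / L1:
        return math.exp(L1 * x) / math.e
    if x < -1.0 / L1:
        return -math.exp(-L1 * x) / math.e
    return L1 * x


def run_gd(h, L1, M, iters=20):
    """Fixed-step GD from x0; returns |x| at the end, or inf if the iterates escape."""
    x_start = (math.log(M) + 1.0) / L1   # gradient at this point is M
    x = x_start
    for _ in range(iters):
        x = x - h * grad(x, L1)
        if abs(x) > 10.0 * x_start:      # stop before exp() overflows
            return float("inf")
    return abs(x)


L1, M = 2.0, 100.0
x0 = (math.log(M) + 1.0) / L1
threshold = 2.0 * x0 / M                 # largest admissible step size

diverged = run_gd(1.5 * threshold, L1, M)   # step above the threshold
converged = run_gd(0.5 * threshold, L1, M)  # step below the threshold

print("step threshold 2 * x0 / M:", threshold)
print("above threshold:", diverged)
print("below threshold:", converged)
```

The admissible step shrinks like $\log(M)/M$, so a large gradient at the initialization forces a step that is tiny everywhere else; this is the mechanism the second objective in the construction exploits.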

Now, let’s look at a different objective that grows slowly.

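$$f(x) = \begin{cases}
-2\epsilon(x+1)+\dfrac{5\epsilon}{4}, & \text{for } x < -1, \\[2mm]
\dfrac{\epsilon}{4}\,(6x^2-x^4), & \text{for } x\in[-1,1], \\[2mm]
2\epsilon(x-1)+\dfrac{5\epsilon}{4}, & \text{for } x > 1.
\end{cases}$$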

This function is also second order differentiable and satisfies the assumptions with constants in (9). If we set for some constant , we know that . With the step size choice , we know that in each step, . Therefore, for ,

 ∥∇f(xk)∥=2ϵ.

## Appendix D Proof of Theorem 7

We start by parametrizing the function value along the update,

$$f(\gamma(t)) := f\big(x_k - th\nabla f(x_k)\big), \quad t\in[0,1].$$

Note that with this parametrization, we have . Now we would like to argue that if , then . Assume by contradiction that this is not true. Then there exists such that . Since can be made arbitrarily small below a threshold, we assume . Denote

$$t^* = \inf\{t \mid \|\nabla f(x(t))\| \ge M+\epsilon\}.$$

The value exists by continuity of as a function of . Then we know by Assumption 4 that . However, by Taylor expansion, we know that

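$$\begin{aligned}
f(x(t^*)) &\le f(x_k) - t^*h\|\nabla f(x_k)\|^2 + (t^*h)^2\|\nabla f(x_k)\|^2\int_0^{t^*}\|\nabla^2 f(x(\tau))\|\,d\tau \\
&\le f(x_k) - t^*h\|\nabla f(x_k)\|^2 + (t^*h)^2\|\nabla f(x_k)\|^2\big(L_1(M+\epsilon)+L_0\big) \\
&\le f(x_k).
\end{aligned}$$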

The last inequality follows by . Hence we get a contradiction and conclude that for all , . Therefore, following the above inequality and Assumption 3, we get

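$$\begin{aligned}
f(x_{k+1}) &\le f(x_k) - h\|\nabla f(x_k)\|^2 + h^2\,\frac{L_1M+L_0}{2}\,\|\nabla f(x_k)\|^2 \\
&\le f(x_k) - \frac{\epsilon^2}{4(ML_1+L_0)}.
\end{aligned}$$

The conclusion follows by the same argument as in Theorem 3 via a telescopic sum over $k$.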

## Appendix E Proof of Theorem 8

By the fact that

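$$h_k \le \frac{\eta}{\eta L_1\|g_k\|+b} \le \frac{1}{L_1\|g_k\|},$$

we know $\|x_{k+1}-x_k\| = h_k\|g_k\| \le 1/L_1$. Hence by Lemma 9, we know

$$f(x_{k+1}) \le f(x_k) - h_k\langle g_k,\nabla f(x_k)\rangle + \frac{h_k^2\|g_k\|^2}{2}\big(5L_0+4L_1\|\nabla f(x_k)\|\big). \tag{10}$$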

Let be a filtration such that is generated by . Then after taking the expectation we get

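$$\begin{aligned}
\mathbb{E}[f(x_{k+1})\mid\mathcal{F}_k] \le{}& f(x_k) - \mathbb{E}[h_k\langle g_k,\nabla f(x_k)\rangle\mid\mathcal{F}_k] + \mathbb{E}\Big[\frac{h_k^2\|g_k\|^2}{2}\,\Big|\,\mathcal{F}_k\Big]\big(5L_0+4L_1\|\nabla f(x_k)\|\big) \\
\le{}& f(x_k) - \mathbb{E}[h_k\mid\mathcal{F}_k]\,\|\nabla f(x_k)\|^2 + \mathbb{E}[h_k\langle\nabla f(x_k)-g_k,\nabla f(x_k)\rangle\mid\mathcal{F}_k] \\
&+ \mathbb{E}\Big[\frac{h_k^2\|g_k\|^2}{2}\,\Big|\,\mathcal{F}_k\Big]\big(5L_0+4L_1\|\nabla f(x_k)\|\big).
\end{aligned}$$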

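Notice that $\mathbb{E}[\eta\langle\nabla f(x_k)-g_k,\nabla f(x_k)\rangle\mid\mathcal{F}_k] = 0$ by Assumption 5. Inspired by the proof in (Ward et al., 2018), subtracting this zero-mean term gives

$$\begin{aligned}
\mathbb{E}[f(x_{k+1})\mid\mathcal{F}_k] \le{}& f(x_k) - \mathbb{E}[h_k\mid\mathcal{F}_k]\,\|\nabla f(x_k)\|^2 + \mathbb{E}[(h_k-\eta)\langle\nabla f(x_k)-g_k,\nabla f(x_k)\rangle\mid\mathcal{F}_k] \\
&+ \mathbb{E}\Big[\frac{h_k^2\|g_k\|^2}{2}\,\Big|\,\mathcal{F}_k\Big]\big(5L_0+4L_1\|\nabla f(x_k)\|\big).
\end{aligned} \tag{11}$$

We further prove in Lemma 11 that

$$\mathbb{E}[(h_k-\eta)\langle\nabla f(x_k)-g_k,\nabla f(x_k)\rangle\mid\mathcal{F}_k] \le \frac{4\eta^2L_1G^3}{b(1-b)}. \tag{12}$$

We also notice that $h_k \le \eta$, $\|g_k\| \le G$, and $\|\nabla f(x_k)\| \le G$. Then rearranging (11), we get

$$\mathbb{E}[h_k\mid\mathcal{F}_k]\,\|\nabla f(x_k)\|^2 \le f(x_k) - \mathbb{E}[f(x_{k+1})\mid\mathcal{F}_k] + \frac{4\eta^2L_1G^3}{b(1-b)} + \eta^2G^2(5L_0+4L_1G).$$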

Furthermore, we know that

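$$\mathbb{E}[h_k\mid\mathcal{F}_k] = \mathbb{E}\Big[\min\Big\{\eta, \frac{\eta}{\eta L_1\|g_k\|+b}\Big\}\,\Big|\,\mathcal{F}_k\Big] \ge \mathbb{E}\Big[\frac{\eta}{2}\,\mathbb{1}\{\eta L_1\|g_k\|\le 1\}\,\Big|\,\mathcal{F}_k\Big] \tag{13}$$

$$= \frac{\eta}{2}\Pr\Big\{\|g_k\|\le\frac{1}{L_1\eta}\Big\} \ge \frac{\eta}{2}\,(1-GL_1\eta). \tag{14}$$

The last inequality follows by $\|g_k\| \le G$ and Markov's inequality. When $\eta \le \frac{1}{2GL_1}$, so that $\mathbb{E}[h_k\mid\mathcal{F}_k] \ge \eta/4$, by combining the bounds above and telescoping, we get

$$\frac{\eta}{4}\sum_{k=1}^{T}\mathbb{E}\big[\|\nabla f(x_k)\|^2\big] \le f(x_0)-f^* + \eta^2T\Big(4L_1G^3\Big(1+\frac{1}{b(1-b)}\Big)+5L_0G^2\Big),$$

$$\frac{1}{T}\sum_{k=1}^{T}\mathbb{E}\big[\|\nabla f(x_k)\|^2\big] \le \frac{4}{\eta T}\Big[(f(x_0)-f^*) + \eta^2T\Big(4L_1G^3\Big(1+\frac{1}{b(1-b)}\Big)+5L_0G^2\Big)\Big].$$

The result follows by setting $\eta = 1/\sqrt{T}$.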

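### E.1 Technical lemma

Here we complete the proof of Theorem 8 by proving the inequality (12).

###### Lemma 11.

The following inequality holds in the context of Theorem 8.

$$\mathbb{E}[(h_k-\eta)\langle\nabla f(x_k)-g_k,\nabla f(x_k)\rangle\mid\mathcal{F}_k] \le \frac{4\eta^2L_1G^3}{b(1-b)}.$$

###### Proof.

Notice that

$$\begin{aligned}
\mathbb{E}[(h_k-\eta)\langle\nabla f(x_k)-g_k,\nabla f(x_k)\rangle\mid\mathcal{F}_k]
={}& \mathbb{E}\Big[(h_k-\eta)\langle\nabla f(x_k)-g_k,\nabla f(x_k)\rangle\,\mathbb{1}\Big\{\|g_k\|\le\frac{1-b}{\eta L_1}\Big\}\,\Big|\,\mathcal{F}_k\Big] \\
&+ \mathbb{E}\Big[(h_k-\eta)\langle\nabla f(x_k)-g_k,\nabla f(x_k)\rangle\,\mathbb{1}\Big\{\|g_k\|>\frac{1-b}{\eta L_1}\Big\}\,\Big|\,\mathcal{F}_k\Big].
\end{aligned}$$

Since $h_k = \eta$ whenever $\|g_k\| \le \frac{1-b}{\eta L_1}$, we know that

$$\begin{aligned}
\mathbb{E}[(h_k-\eta)\langle\nabla f(x_k)-g_k,\nabla f(x_k)\rangle\mid\mathcal{F}_k]
={}& \mathbb{E}\Big[(h_k-\eta)\langle\nabla f(x_k),\nabla f(x_k)\rangle\,\mathbb{1}\Big\{\|g_k\|>\frac{1-b}{\eta L_1}\Big\}\,\Big|\,\mathcal{F}_k\Big] \\
&+ \mathbb{E}\Big[(h_k-\eta)\langle -g_k,\nabla f(x_k)\rangle\,\mathbb{1}\Big\{\|g_k\|>\frac{1-b}{\eta L_1}\Big\}\,\Big|\,\mathcal{F}_k\Big].
\end{aligned} \tag{15}$$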