Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition

05/28/2019 ∙ by Jingzhao Zhang, et al. ∙ MIT

We provide a theoretical explanation for the fast convergence of gradient clipping and adaptively scaled gradient methods commonly used in neural network training. Our analysis is based on a novel relaxation of gradient smoothness conditions that is weaker than the commonly used Lipschitz smoothness assumption. We validate the new smoothness condition in experiments on large-scale neural network training tasks where adaptively scaled methods have been shown empirically to outperform standard gradient-based algorithms. Under this new smoothness condition, we prove that two popular adaptively scaled methods, gradient clipping and normalized gradient descent, converge faster than the theoretical lower bound for fixed-step gradient descent. We verify this fast convergence empirically in neural network training for language modeling and image classification.


1 Introduction

We study gradient-based optimization algorithms for minimizing a differentiable nonconvex function $f$, where $f$ can potentially be stochastic, i.e., $f(x) = \mathbb{E}_{\xi}[F(x, \xi)]$. Such choices of $f$ cover a wide range of problems in machine learning; as a result, their study motivates a vast body of current optimization literature. Classical approaches for minimizing $f$ include gradient descent (GD) and stochastic gradient descent (SGD). More recently, adaptive gradient methods, e.g., Adagrad (Duchi et al., 2011), ADAM (Kingma and Ba, 2014), and RMSProp (Tieleman and Hinton, 2012), have gained popularity due to their empirical performance, in particular their faster convergence on complex optimization problems such as adversarial training and language modeling. Adaptive methods differ from GD and SGD in that they allow step sizes to depend on past gradients and to vary across coordinates. Previous analyses have shown that adaptive methods are more robust to variation in hyper-parameters (Ward et al., 2018) and adapt to sparse gradients (Duchi et al., 2011). We provide a more detailed review of related literature in Appendix A.

In this paper, we focus on a subclass of adaptive gradient methods and theoretically justify their empirical effectiveness and applicability. Specifically, we show that adaptively scaled gradient methods converge arbitrarily faster than fixed-step gradient descent. This result holds under a novel smoothness condition that is strictly weaker than the standard Lipschitz-gradient assumption pervasive in the literature, and hence it captures many functions that are not globally Lipschitz smooth. More importantly, the proposed smoothness condition is validated precisely in the same type of neural network training experiments for which there is empirical evidence that adaptive gradient methods outperform standard gradient methods.

More specifically, we analyze the convergence properties of a widely used technique, clipped gradient descent. In terms of step-size choice, gradient clipping is, up to constant factors, equivalent to normalized gradient descent (NGD), a canonical adaptive method that is widely used in practice. Instead of using a constant step size, clipped GD adaptively chooses a step size based on the (stochastic) gradient norm. Even though clipping is standard practice in tasks such as language modeling (e.g., Merity et al., 2018; Gehring et al., 2017; Peters et al., 2018), it lacks a solid theoretical grounding. Goodfellow et al. (2016); Pascanu et al. (2013, 2012) discuss the gradient explosion problem in recurrent models and consider clipping as an intuitive trick to work around the explosion. We formalize this argument and prove that clipped GD can be arbitrarily faster than ordinary GD.

By examining the smoothness condition and providing new convergence bounds for adaptively scaled methods, we hope this work helps close the following gap between theory and practice. On the one hand, powerful techniques such as Nesterov's momentum and variance reduction have been proposed to theoretically accelerate convex and nonconvex optimization; however, these techniques, at least for now, seem to have limited applicability in deep learning (Defazio and Bottou, 2018). On the other hand, some widely used empirical techniques (e.g., heavy-ball momentum, adaptivity) do not have theoretical acceleration guarantees. We suspect that one of the many reasons is a misalignment of problem assumptions. Our work demonstrates that the concept of acceleration critically relies on the problem assumptions, and that the standard global Lipschitz-gradient assumption may not hold in some applications.

1.1 Contributions

We now summarize the main contributions of this paper as follows:

  • We propose a new smoothness condition that allows the local smoothness constant to change and increase with the gradient norm. This condition is strictly weaker than the standard Lipschitz-gradient assumption, and it is supported by empirical evidence in neural network training.

  • We provide a convergence rate for clipped GD under our smoothness assumption (Theorem 3).

  • We prove an upper-bound (Theorem 7) and a lower-bound (Theorem 5) on the convergence rate of GD under our relaxed smoothness assumption. The lower-bound demonstrates that GD with fixed step size can be arbitrarily slower than clipped GD.

  • We show that stochastic clipped GD converges at the expected $O(\epsilon^{-4})$ rate (Theorem 8). We also explain why our proof does not apply to SGD with fixed step sizes, outlining the key hurdles.

We support our proposed theory with several experiments. Since gradient clipping is widely used in training recurrent models for natural language processing, we validate our smoothness condition (see Assumption 3) in this setting; we observe that the local smoothness grows with the gradient norm along the training trajectory (Fig. 1(a)). Additional experiments suggest that clipping allows the training trajectory to cross non-smooth regions of the loss, thereby accelerating convergence. Moreover, we show that clipped GD can converge faster (in training loss) than momentum-SGD while achieving the same generalization performance as this strong baseline (e.g., matching test accuracy after 200 epochs for ResNet20 on the Cifar10 dataset). Please see Section 6 for more details.

2 Problem setup and algorithms

In this section, we set up the problem and introduce our new, relaxed smoothness assumption. Recall that we wish to solve the nonconvex optimization problem $\min_{x \in \mathbb{R}^d} f(x)$.

In general this problem is intractable; so, instead of seeking a global optimum, we seek an $\epsilon$-stationary point, i.e., a point $x$ such that $\|\nabla f(x)\| \le \epsilon$. Furthermore, we assume that the following conditions hold in a neighborhood of the sublevel set for a given initialization $x_0$, i.e., on the set

$$\mathcal{S} := \bigl\{\, x \;:\; \exists\, y \text{ such that } f(y) \le f(x_0) \text{ and } \|x - y\| \le 1 \,\bigr\}. \qquad (1)$$

(The constant "1" in expression (1) is arbitrary and can be replaced by any fixed positive constant.)
Assumption 1.

The function $f$ is lower bounded by $f^* > -\infty$.

Assumption 2.

The function $f$ is twice differentiable.

The above assumptions are standard. Below we introduce our new relaxed smoothness assumption.

Assumption 3 ($(L_0, L_1)$-smoothness).

$f$ is $(L_0, L_1)$-smooth if there exist positive constants $L_0$ and $L_1$ such that $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$.

Section 3 will motivate Assumption 3 and discuss how it relaxes the canonical Lipschitz-gradient assumption and enlarges the class of functions considered. We note here a brief point regarding Assumption 2: $(L_0, L_1)$-smoothness can be generalized to once-differentiable functions by replacing Assumption 3 with a first-order condition that bounds the gradient difference $\|\nabla f(x) - \nabla f(y)\|$ by $(L_0 + L_1 \|\nabla f(x)\|)\,\|x - y\|$ for all sufficiently close $x$ and $y$. This condition implies that $\nabla f$ is locally Lipschitz, and hence almost everywhere differentiable. All our results go through under this definition by handling the integrations more carefully, but to avoid such complications and to simplify the exposition, we assume that the function is twice differentiable.

2.1 Gradient descent algorithms

In this section, we review a few well-known variants of gradient-based algorithms that relate to this work. We start with ordinary gradient descent,

$$x_{k+1} = x_k - \eta\, \nabla f(x_k), \qquad (2)$$

where $\eta$ is a fixed step size. This algorithm is the baseline algorithm used in neural network training. Many modifications of it have been proposed to stabilize or accelerate training. One such technique, of particular importance to this work, is clipped gradient descent. The update for clipped GD can be written as

$$x_{k+1} = x_k - h_k\, \nabla f(x_k), \qquad h_k := \min\left\{\eta,\ \frac{\gamma\,\eta}{\|\nabla f(x_k)\|}\right\}, \qquad (3)$$

where $\eta$ is the step size and $\gamma$ is the clipping threshold. Another algorithm that is less common in practice but has attracted theoretical interest is normalized gradient descent (NGD). The update for NGD can be written as

$$x_{k+1} = x_k - \frac{\eta_n}{\|\nabla f(x_k)\| + \beta}\, \nabla f(x_k). \qquad (4)$$

Clipped GD and NGD are almost equivalent. Indeed, if we set $\eta_n = \gamma\,\eta$ and $\beta = \gamma$, then

$$\tfrac{1}{2}\, h_k \;\le\; \frac{\eta_n}{\|\nabla f(x_k)\| + \beta} \;\le\; h_k.$$

Therefore, clipped GD is equivalent to NGD up to a constant factor in the step-size choice. Consequently, the convergence rates proved in Sections 4 and 5 for clipped GD also apply to NGD, and we omit the repeated analysis for conciseness.
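For concreteness, here is a minimal NumPy sketch of the three updates above. The objective, its gradient, and all parameter values are illustrative placeholders of ours, not the paper's experimental setup.

import numpy as np

def grad_f(x):
    # Illustrative objective f(x) = sum(x**4); its gradient is 4 * x**3.
    return 4.0 * x ** 3

def gd_step(x, eta):
    # Ordinary gradient descent, update (2): fixed step size.
    return x - eta * grad_f(x)

def clipped_gd_step(x, eta, gamma):
    # Clipped GD, update (3): shrink the step once the gradient norm exceeds gamma.
    g = grad_f(x)
    h = min(eta, gamma * eta / (np.linalg.norm(g) + 1e-12))
    return x - h * g

def ngd_step(x, eta, beta):
    # Normalized GD, update (4): step size inversely proportional to the gradient norm.
    g = grad_f(x)
    return x - eta / (np.linalg.norm(g) + beta) * g

x = np.array([3.0, -2.0])          # deliberately poor initialization: large gradient
for _ in range(100):
    x = clipped_gd_step(x, eta=0.1, gamma=1.0)
print(np.linalg.norm(grad_f(x)))   # gradient norm after 100 clipped steps

With the same step size, plain gd_step diverges from this initialization (the gradient grows like the cube of the iterate), while the clipped and normalized updates cap the movement per iteration and converge; this mirrors the comparison formalized in Section 4.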

3 Relaxed smoothness condition and motivations

In this section, we discuss and motivate the relaxed smoothness condition in Assumption 3. We start with the traditional definition of smoothness, recalling how it leads to the step size choice in GD.

3.1 Function smoothness (Lipschitz gradients)

Recall that we wish to solve $\min_x f(x)$. The objective $f$ is called $L$-smooth if

$$\|\nabla f(x) - \nabla f(y)\| \;\le\; L\,\|x - y\| \qquad \text{for all } x, y. \qquad (5)$$

For twice differentiable functions, condition (5) is equivalent to $\|\nabla^2 f(x)\| \le L$. Under this smoothness assumption, one can show the following well-known upper bound:

$$f(y) \;\le\; f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\,\|y - x\|^2. \qquad (6)$$

Suppose we set $y = x - h\,\nabla f(x)$; then we can pick the step size $h$ to minimize the corresponding upper bound (6), i.e., minimize $f(x) - h\,\|\nabla f(x)\|^2 + \tfrac{h^2 L}{2}\,\|\nabla f(x)\|^2$ over $h$, to obtain

$$h = \frac{1}{L}. \qquad (7)$$

This choice of $h$ leads to GD with a fixed step size. Carmon et al. (2017) show that GD with $h = 1/L$ is, up to a constant, optimal for optimizing smooth nonconvex functions. Noting this optimality relation between $L$-smoothness and the step-size choice, we are led to ask the question: "Is clipped gradient descent optimized for a different smoothness condition?" We answer this question in Section 3.2.

The usual $L$-smoothness assumption (5) enables clean theoretical analysis, but it has its limitations. Assuming the existence of a global constant $L$ that upper bounds the variation of the gradient is very restrictive; for example, simple polynomials of degree larger than two already break the assumption. One workaround is to assume that such an $L$ exists on a compact region, and to either prove that the iterates never escape the region or run projection-based algorithms. However, such an assumption can make $L$ very large and slow down the theoretical rate. In Section 4, we show that a slow rate is unavoidable for gradient descent with a fixed step size, whereas clipped gradient descent can greatly improve the dependence on $L$. Moreover, although the bound (6) is optimal in the worst case, it can be too conservative: within any compact region the function smoothness is bounded, but the local smoothness can vary drastically (see Figure 1 for an example). Gradient-based methods can speed up convergence by taking larger steps in flat regions; intuitively, this is why adaptive gradient methods can be faster.
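For completeness, plugging the step $y = x_k - \tfrac{1}{L}\nabla f(x_k)$ into (6) gives the classical per-step guarantee behind the $O(\epsilon^{-2})$ rate quoted in Section 4.2 (a standard derivation, not specific to this paper):

$$f(x_{k+1}) \;\le\; f(x_k) - \frac{1}{2L}\,\|\nabla f(x_k)\|^2,$$

so while $\|\nabla f(x_k)\| > \epsilon$ the objective decreases by at least $\epsilon^2/(2L)$ per iteration, and GD reaches an $\epsilon$-stationary point within $2L\,(f(x_0) - f^*)/\epsilon^2$ iterations.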

3.2 Relaxed smoothness assumption

We now return to the question raised in Section 3.1. As the step sizes of clipped GD and NGD are related by a constant factor, we answer the question by studying NGD. Inspired by the upper bound (6), suppose that the NGD step size $h = \eta/\|\nabla f(x)\|$ minimizes a quadratic upper bound of the form (6) in which the global constant $L$ is replaced by a local smoothness constant $L(x)$. Since the minimizer of that bound is $h = 1/L(x)$, we can deduce that

$$L(x) \;=\; \frac{\|\nabla f(x)\|}{\eta}. \qquad (8)$$

In other words, the step size of NGD is optimal for a local smoothness constant that grows linearly with the gradient norm. Based on the intuition from (8), we propose the following relaxed smoothness condition.

Definition 1.

A second-order differentiable function $f$ is $(L_0, L_1)$-smooth if $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$ for all $x$.

Definition 1 strictly relaxes the usual $L$-smoothness. There are two ways to interpret the relaxation. First, when we focus on a compact region, we can balance the two constants, choosing $L_0$ much smaller than the global Lipschitz constant $L$ while letting the $L_1\,\|\nabla f(x)\|$ term absorb the remaining curvature. Second, there exist functions that are $(L_0, L_1)$-smooth globally but not $L$-smooth for any finite $L$; for such functions, the constant required for $L$-smoothness grows as the compact set increases, whereas $L_0$ and $L_1$ stay fixed. An example is given in Lemma 2.

Remark 1.

It is worth noting that the Hessian operator norm and the gradient norm need not satisfy a linear relation. As long as they are positively correlated, clipped gradient descent can be shown to achieve a faster rate than fixed-step-size gradient descent. We use the linear relationship for simplicity of exposition.

Lemma 2.

Let $f$ be a univariate polynomial of degree $d$. When $d \ge 3$, $f$ is $(L_0, L_1)$-smooth for some $L_0$ and $L_1$ but not $L$-smooth.

Proof.

The first claim follows from the fact that the ratio $|f''(x)|/|f'(x)|$ tends to zero as $|x| \to \infty$ (the degree of $f''$ is smaller than that of $f'$), so that $|f''|$ is bounded by $L_0 + L_1\,|f'|$ for suitable constants. The second claim follows from the unboundedness of $f''$. ∎
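As a concrete worked example in the spirit of Lemma 2 (our own choice of polynomial), take $f(x) = x^4$, so that $f'(x) = 4x^3$ and $f''(x) = 12x^2$. For $|x| \le 1$ we have $f''(x) \le 12$, while for $|x| > 1$ we have $12x^2 \le 12|x|^3 = 3\,|f'(x)|$. Hence

$$|f''(x)| \;\le\; 12 + 3\,|f'(x)| \qquad \text{for all } x,$$

so $f$ is $(L_0, L_1)$-smooth with $L_0 = 12$ and $L_1 = 3$; on the other hand, $f''$ is unbounded, so no global Lipschitz constant for $\nabla f$ exists.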

3.3 Smoothness in neural networks

We showed that our proposed smoothness condition relaxes the traditional smoothness assumption and is naturally motivated by normalized gradient descent. In this section, we argue that it also captures the structure of neural network training. To justify this claim, in Figure 1(a) we empirically show that a strong linear correlation exists between the gradient norm and the estimated local smoothness during LSTM-based language-model training when gradient clipping is applied. For more details of the experiment, please refer to Section 6. Below we develop some high-level intuition for this phenomenon.

We conjecture that the observed positive correlation results from the common components in the expressions for the gradient and the Hessian. We illustrate the reasoning behind this conjecture by considering a deep linear network with quadratic loss; a similar computation also holds for nonlinear networks. The training loss of a deep linear network with $m$ layers is $\tfrac{1}{2}\|W_m W_{m-1} \cdots W_1 X - Y\|_F^2$, where $Y$ denotes the labels, $X$ denotes the input data matrix, and $W_i$ denotes the weights of the $i$-th layer. By Lemma 4.3 of Kawaguchi (2016), the gradient with respect to $W_i$ can be written using Kronecker products of the partial weight products $W_m \cdots W_{i+1}$ and $W_{i-1} \cdots W_1 X$ applied to the vectorized residual $W_m \cdots W_1 X - Y$, where $\mathrm{vec}(\cdot)$ flattens a matrix into a vector and $\otimes$ denotes the Kronecker product; an analogous expression holds for the second-order derivatives with respect to pairs of layers. Based on these expressions, we notice that the gradient norm and the Hessian norm may be positively correlated due to the following two observations. First, the gradient and the Hessian share many components, such as products of the weight matrices across layers. Second, if one naively upper bounds both norms using Cauchy-Schwarz, then both upper bounds are monotonically increasing in the norms of the weight matrices and of the residual.
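As a sanity check of this intuition, consider a toy two-layer scalar example (ours, not the paper's computation): for scalars $w_1, w_2, x, y$, let

$$f(w_1, w_2) = \tfrac{1}{2}\,(w_2 w_1 x - y)^2, \qquad \frac{\partial f}{\partial w_1} = (w_2 w_1 x - y)\, w_2 x, \qquad \frac{\partial^2 f}{\partial w_1^2} = w_2^2 x^2, \qquad \frac{\partial^2 f}{\partial w_1 \partial w_2} = (2 w_1 w_2 x - y)\, x.$$

Both the gradient and the Hessian entries contain cross-layer products of the weights (and the residual $w_2 w_1 x - y$), so when the weights or the residual are large, the gradient norm and the Hessian norm tend to grow together.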

4 Convergence in the full batch setting

In this section, we analyze the convergence rates of GD and clipped GD under our proposed conditions. We bound the number of iterations required by each algorithm to find an $\epsilon$-stationary point.

4.1 Clipped gradient descent

We start by analyzing the clipped GD algorithm with update defined in equation (3).

Theorem 3.

Assume that Assumptions 1, 2, and 3 hold on the set $\mathcal{S}$ defined in (1). With the step size and clipping threshold chosen as functions of $L_0$ and $L_1$ only, clipped GD (Algorithm 3) terminates in

$$O\!\left(\frac{L_0\,\bigl(f(x_0) - f^*\bigr)}{\epsilon^2}\right)$$

iterations, up to an additive term that does not depend on $\epsilon$; in particular, the bound does not depend on the maximum gradient norm along the trajectory.

The proof of Theorem 3 starts by bounding the Hessian norm in a neighborhood of the current iterate using Grönwall's inequality. Afterwards, we follow the standard gradient descent analysis and show that the function value decreases at each iteration. Details are included in Appendix B.
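To convey the flavor of the argument, the following is a hedged sketch of the per-iteration decrease, with an unspecified absolute constant $c$ absorbing the Grönwall factor from Lemma 9 (the precise constants are worked out in Appendix B). For the clipped step $x_{k+1} = x_k - h_k \nabla f(x_k)$,

$$f(x_{k+1}) \;\le\; f(x_k) - h_k\,\|\nabla f(x_k)\|^2 + \frac{c\,\bigl(L_0 + L_1\|\nabla f(x_k)\|\bigr)}{2}\, h_k^2\,\|\nabla f(x_k)\|^2.$$

Choosing $h_k \approx 1 / \bigl(c\,(L_0 + L_1\|\nabla f(x_k)\|)\bigr)$, which is, up to constants, exactly the clipped step in (3), gives

$$f(x_k) - f(x_{k+1}) \;\ge\; \frac{\|\nabla f(x_k)\|^2}{2c\,\bigl(L_0 + L_1\|\nabla f(x_k)\|\bigr)} \;\ge\; \min\left\{\frac{\epsilon^2}{4cL_0},\; \frac{\epsilon}{4cL_1}\right\}$$

whenever $\|\nabla f(x_k)\| > \epsilon$. Summing this decrease over iterations and using the lower bound from Assumption 1 yields the iteration bound.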

4.2 Gradient descent with fixed step size

Gradient descent with a fixed step size is known to converge to a first-order $\epsilon$-stationary point in $O\!\left(L\,(f(x_0) - f^*)/\epsilon^2\right)$ iterations for $L$-smooth nonconvex functions. By the following theorem of Carmon et al. (2017), this rate is optimal up to a constant.

Theorem 4 (Thm 1 in (Carmon et al., 2017)).

For any deterministic first-order optimization algorithm using a gradient oracle, the iteration complexity to optimize an $L$-smooth function to an $\epsilon$-stationary point is at least

$$c\,\frac{L\,\bigl(f(x_0) - f^*\bigr)}{\epsilon^2}$$

for some numerical constant $c$.

However, we will show below that gradient descent is suboptimal under our relaxed $(L_0, L_1)$-smoothness condition. In particular, to prove a convergence rate for gradient descent with a fixed step size, we need to make an additional assumption on gradient norms.

Assumption 4.

Given an initialization $x_0$, we assume that the gradient norm is bounded on the set $\mathcal{S}$ defined in (1), i.e., $M := \sup_{x \in \mathcal{S}} \|\nabla f(x)\| < \infty$.

The next theorem states that this assumption is necessary. In particular, we show that the iteration complexity of gradient descent with a fixed step size must grow at least linearly in $M$ (up to a logarithmic factor) when $M$ is large. Therefore, GD can be arbitrarily slower than clipped GD under our relaxed smoothness assumption.

Theorem 5.

Let $\mathcal{F}$ be the class of objectives satisfying Assumptions 1, 2, 3, and 4 with fixed constants $L_0, L_1, M > 0$. If GD with a fixed step size converges on every function in $\mathcal{F}$, then there is a function in $\mathcal{F}$ on which GD with that fixed step size takes at least

$$\Omega\!\left(\frac{M L_1\,\bigl(f(x_0) - f^*\bigr)}{\epsilon^2\,\log M}\right)$$

iterations to converge to an $\epsilon$-stationary point.

The proof starts with an exponentially growing function and shows that, for gradient descent to converge on it, the step size must be small. This small step size then leads to very slow convergence on another, almost linear function with a small gradient. Details of the construction can be found in Appendix C.

Remark 6.

Theorems 4 and 5 together show that the iteration complexity of gradient descent with a fixed step size grows at least linearly in $M$ (up to a logarithmic factor). Recall that the rate of clipped GD in Theorem 3 does not depend on $M$. This shows that clipped GD converges much faster than GD when $M$ is large, or in other words, when the problem has a poor initialization.

Below, we provide an iteration upper bound for the fixed-step gradient descent update (2).

Theorem 7.

Suppose Assumptions 1, 2, 3, and 4 hold on the set $\mathcal{S}$ defined in (1). If we pick the step size $h$ on the order of $1/(L_0 + M L_1)$, then GD with a fixed step size as defined in Algorithm 2 terminates in

$$O\!\left(\frac{(L_0 + M L_1)\,\bigl(f(x_0) - f^*\bigr)}{\epsilon^2}\right)$$

iterations.

Please refer to Appendix D for the proof. Theorem 7 shows that gradient descent with a fixed step size converges within $O\!\left((L_0 + M L_1)\,(f(x_0) - f^*)/\epsilon^2\right)$ iterations. This suggests that the lower bound in Remark 6 is tight up to a logarithmic factor in $M$.

5 Convergence in the stochastic setting

In the stochastic setting, we assume access to a stochastic gradient oracle $g(x)$ in place of the exact gradient $\nabla f(x)$. For simplicity, we write $g_k := g(x_k)$ below. We need the following assumptions.

Assumption 5.

$\mathbb{E}[g(x)] = \nabla f(x)$ for all $x$; that is, we have unbiased stochastic gradients.

Assumption 6.

$\|g(x)\| \le M$ almost surely. This implies that $\|\nabla f(x)\| \le M$.

The bounded-gradient condition is a strong assumption, but it is commonly used in proving convergence of adaptive gradient methods (see Reddi et al., 2019; Ward et al., 2018; Zhou et al., 2018). In the analysis below, we only discuss the case $L_1 > 0$; otherwise the condition is equivalent to ordinary smoothness. The main result of this section is the following convergence guarantee for stochastic clipped GD (based on the stochastic version of update (3)).

Theorem 8.

Let Assumptions 1, 2, 3, 5, and 6 hold globally. With the step size and clipping threshold chosen as functions of the iteration budget $T$ and the problem constants, stochastic clipped GD after $T$ iterations of update (3) finds a point whose expected gradient norm is $O(T^{-1/4})$ up to problem-dependent constants.

As a result, the algorithm converges to an $\epsilon$-stationary point in $O(\epsilon^{-4})$ iterations.

The convergence proof critically relies on the fact that the update distance in each iteration has a fixed upper bound due to clipping. This bounded radius is precisely what is missing in the proof for fixed-step-size SGD: if we only assume bounded second moments of the stochastic gradient oracle, we cannot control the distance between the current point and the updated point, and hence cannot apply Lemma 9 the same way as for (10) in Appendix E. Though we cannot prove the convergence of SGD with a fixed step size, one also cannot theoretically rule out the possibility that it converges. Nevertheless, if we additionally assume that the noise in the stochastic gradient oracle is sub-Gaussian, then we can show that SGD with a fixed step size does converge to an $\epsilon$-stationary point. To avoid diverting from the discussion of adaptive methods, we omit this analysis for conciseness.
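For reference, here is a minimal PyTorch sketch of the stochastic clipped update. The model, data, learning rate, and clipping threshold are illustrative placeholders; the hyper-parameters actually used in our experiments are listed in Appendix F.

import torch
from torch import nn

model = nn.Linear(10, 1)                             # placeholder model
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=1.0)    # large step size; clipping keeps it stable

def clipped_sgd_step(x_batch, y_batch, max_norm=0.25):
    opt.zero_grad()
    loss = loss_fn(model(x_batch), y_batch)
    loss.backward()
    # Rescale the stochastic gradient so its global norm is at most max_norm.
    # The resulting step size is min{lr, lr * max_norm / ||g||}, as in update (3).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    opt.step()
    return loss.item()

for _ in range(100):
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    clipped_sgd_step(x, y)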

6 Experiments

(a) Learning rate 30, with clipping.
(b) Learning rate 2, without clipping.
(c) Learning rate 2, with clipping.
Figure 1: Gradient norm vs. local smoothness on a log scale for LM training. The dot color indicates the iteration number; darker dots correspond to earlier iterations. Note that the axis ranges are not fixed across plots.
(a) SGD with momentum.
(b) Learning rate 1, without clipping.
(c) Learning rate 5, with clipping.
Figure 2: Gradient norm vs. local smoothness on a linear scale for ResNet20 training. The dot color indicates the iteration number.
(a) Training loss of LSTM with different optimization parameters.
(b) Validation loss of LSTM with different optimization parameters.
(c) Training loss of ResNet20 with different optimization parameters.
(d) Test accuracy of ResNet20 with different optimization parameters.
Figure 3: Training and validation loss obtained with different training methods for LSTM and ResNet training. The validation loss plots the cross entropy; the training loss additionally includes the weight regularization term. In the legend, 'lr30clip0.25' denotes that clipped SGD uses step size 30 and that the norm of the stochastic gradient is thresholded at 0.25. In ResNet training, the stochastic gradient norm is thresholded at a fixed value when clipping is applied.

In this section, we summarize our experimental findings on the positive correlation between the gradient norm and the local smoothness, and we then show that clipping accelerates convergence during neural network training. Our experiments are based on two tasks: language modeling and image classification. We run language modeling on the Penn Treebank (PTB) dataset (Mikolov et al., 2010) with AWD-LSTM models (Merity et al., 2018). For image classification, we train ResNet20 (He et al., 2016) on the Cifar10 dataset (Krizhevsky and Hinton, 2009). Details about the smoothness estimation and the experimental setup are explained in Appendix F.

First, our experiments test whether the local smoothness constant increases with the gradient norm, as suggested by the relaxed smoothness condition defined in Section 3. To do so, we evaluate both quantities at points generated by the optimization procedure and scatter the local smoothness estimates against the gradient norms in Figure 1 and Figure 2. Note that the plots in Figure 1 are on a log scale; a linear-scale plot is shown in Appendix Figure 4. We notice that the correlation is present in the default training procedure for language modeling (see Figure 1(a)) but not in the default training procedure for image classification (see Figure 2(a)). This difference aligns with the fact that gradient clipping is widely used in language modeling but is less popular in ResNet training, offering empirical support for our theoretical findings.

We further investigate the cause of this correlation. The plots in Figures 1 and 2 show that the correlation appears when the models are trained with clipped GD and large learning rates. We propose the following explanation: clipping enables the training trajectory to stably traverse non-smooth regions, and hence gradient norms and smoothness are positively correlated in Figures 1(a) and 2(c). Without clipping, the optimizer has to adopt a small learning rate and stays in a region where the local smoothness does not vary much; otherwise the sequence diverges and a different learning rate must be used. Therefore, in the other plots of Figures 1 and 2, the correlation is much weaker.

As positive correlations are present in both the language modeling and the image classification experiments with large step sizes, our next set of experiments checks whether clipping helps accelerate convergence as predicted by our theory. From Figure 3, we find that the ability to traverse non-smooth regions indeed accelerates convergence. Because gradient clipping is standard practice in language modeling, the LSTM models trained with clipping achieve the best validation performance and the fastest training-loss convergence, as expected. For image classification, surprisingly, clipped GD also achieves the fastest convergence and matches the test performance of SGD with momentum. These plots show that clipping can accelerate convergence and achieve good test performance at the same time. We do not analyze the theory behind this generalization capability, as it is beyond the scope of this work.
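For completeness, here is one plausible way to estimate the local smoothness along the training trajectory: a finite-difference estimate between consecutive iterates. This is a sketch under our own assumptions (the loss_fn callable and the flattened parameter copies are placeholders); the estimator actually used for the figures is specified in Appendix F.

import torch

def flat_grad(model, loss):
    # Gradient of loss w.r.t. all model parameters, flattened into one vector.
    # Assumes every parameter requires grad and contributes to the loss.
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def load_flat_params(model, flat):
    # Copy a flat parameter vector back into the model, slice by slice.
    offset = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p.copy_(flat[offset:offset + n].view_as(p))
            offset += n

def local_smoothness_estimate(model, loss_fn, batch, params_prev, params_next):
    # Finite-difference estimate ||grad f(x_{k+1}) - grad f(x_k)|| / ||x_{k+1} - x_k||,
    # where params_prev / params_next are flat copies of the iterates at steps k and k+1.
    grads = []
    for flat in (params_prev, params_next):
        load_flat_params(model, flat)
        grads.append(flat_grad(model, loss_fn(model, batch)))
    return ((grads[1] - grads[0]).norm() / (params_next - params_prev).norm()).item()

Plotting this estimate against the gradient norm at the earlier iterate over the course of training produces scatter plots of the kind shown in Figures 1 and 2.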

7 Discussion

Much progress has been made to close the gap between the upper and lower oracle complexities of first-order smooth optimization. The works dedicated to this goal provide important insights and tools for understanding the optimization procedure. However, there is another gap, one that separates theoretically accelerated algorithms from empirically fast algorithms. This work aims to help close that gap. Specifically, we proposed a relaxed smoothness assumption that is supported by empirical evidence. We analyzed a simple but widely used optimization technique, gradient clipping, and provided theoretical guarantees that clipping can accelerate gradient descent. This phenomenon aligns remarkably well with empirical observations.

There is still much to be explored in this direction. First, though our smoothness condition relaxes the usual Lipschitz assumption, it is unclear whether there is a better condition that matches the experimental observations while also enabling a clean theoretical analysis. Second, we only studied the convergence of clipped gradient descent; studying the convergence properties of other techniques such as momentum, coordinate-wise learning rates (or, more generally, preconditioning), and variance reduction would also be interesting. Finally, the most important question is: can we design fast algorithms based on relaxed conditions and actually achieve faster convergence in neural network training?

Our experiments also have notable implications. First, though advocating clipped gradient descent in ResNet training is not a main point of this work, it is interesting to note that gradient descent and clipped gradient descent with large step sizes can achieve test performance similar to that of momentum-SGD. Second, we learned that a properly tuned baseline algorithm can actually beat some recently proposed algorithms. Therefore, when we design or evaluate new algorithms, we need to pay extra attention to whether the baseline algorithms are properly tuned.

References

Appendix A More related work on accelerating gradient methods

Variance reduction. Many efforts have been made to accelerate gradient-based methods. One elegant approach is variance reduction (e.g., Schmidt et al., 2017; Johnson and Zhang, 2013; Defazio et al., 2014; Bach and Moulines, 2013; Konečnỳ and Richtárik, 2013; Xiao and Zhang, 2014; Gong and Ye, 2014; Fang et al., 2018). This technique aims to solve stochastic and finite-sum problems by averaging out the noise of the stochastic oracle, exploiting the smoothness of the objectives.

Momentum methods. Another line of work focuses on achieving acceleration with momentum. Polyak (1964) showed that momentum can accelerate optimization for quadratic problems; later, Nesterov (1983) designed a variation that provably accelerates optimization of any smooth convex problem. Based on Nesterov's work, much theoretical progress has been made on accelerating different variations of the original smooth convex problem (e.g., Ghadimi and Lan, 2016, 2012; Beck and Teboulle, 2009; Shalev-Shwartz and Zhang, 2014; Jin et al., 2018; Carmon et al., 2018; Allen-Zhu, 2017; Lin et al., 2015; Nesterov, 2012).

Adaptive step sizes. The idea of varying the step size in each iteration has long been studied. Armijo (1966) proposed the famous backtracking line-search algorithm to choose the step size dynamically. Polyak (1987) proposed a strategy to choose the step size based on function suboptimality and the gradient norm. More recently, Duchi et al. (2011) designed the Adagrad algorithm, which can exploit sparsity in stochastic gradients. Over the past year, there has been a surge of work studying the theoretical properties of adaptive gradient methods. One starting point is (Reddi et al., 2019), which pointed out that ADAM is not convergent and proposed the AMSGrad algorithm to fix the problem. Ward et al. (2018) and Li and Orabona (2018) prove that Adagrad converges to a stationary point for nonconvex stochastic problems; Zhou et al. (2018) generalized the result to a class of algorithms named Padam. Zou et al. (2018) give sufficient conditions for the convergence of ADAM. Staib et al. (2019) show that adaptive methods can escape saddle points faster than SGD under certain conditions. In addition, Levy (2016) showed that normalized gradient descent may have a better convergence rate in the presence of injected noise; however, the rate comparison there is dimension dependent. Hazan et al. (2015) studied the convergence of normalized gradient descent for quasi-convex functions.

Appendix B Proof of Theorem 3

We start by proving a lemma that will be used repeatedly in later proofs. The lemma bounds the gradient norm in a neighborhood of the current point via Grönwall's inequality.

Lemma 9.

Given $x$ such that $f(x) \le f(x_0)$, for any $x^+$ such that $\|x^+ - x\| \le 1$, we have $\|\nabla f(x^+)\| \le \bigl(\|\nabla f(x)\| + \tfrac{L_0}{L_1}\bigr)\, e^{L_1 \|x^+ - x\|}$.

Remark 10.

Note that the constant "1" comes from the definition of $\mathcal{S}$ in (1). If Assumption 3 holds globally, then we do not need to constrain $\|x^+ - x\| \le 1$. This version will be used in Theorem 8.

Proof.

Let $\gamma(t)$ be the curve defined by $\gamma(t) := x + t\,(x^+ - x)$ for $t \in [0, 1]$. Then we have

$$\nabla f(\gamma(t)) = \nabla f(x) + \int_0^t \nabla^2 f(\gamma(\tau))\,(x^+ - x)\, d\tau.$$

By the Cauchy-Schwarz inequality, we get

$$\|\nabla f(\gamma(t))\| \;\le\; \|\nabla f(x)\| + \int_0^t \|\nabla^2 f(\gamma(\tau))\|\,\|x^+ - x\|\, d\tau \;\le\; \|\nabla f(x)\| + \int_0^t \bigl(L_0 + L_1\|\nabla f(\gamma(\tau))\|\bigr)\,\|x^+ - x\|\, d\tau.$$

The second inequality follows by Assumption 3. We can then apply the integral form of Grönwall's inequality to get

$$\|\nabla f(\gamma(t))\| \;\le\; \Bigl(\|\nabla f(x)\| + \frac{L_0}{L_1}\Bigr)\, e^{L_1 \|x^+ - x\|\, t} - \frac{L_0}{L_1}.$$

The lemma follows by setting $t = 1$ and dropping the negative term. ∎
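For reference, here is the variant of the integral Grönwall inequality invoked above, stated in our own notation (it is a direct corollary of the standard integral form). Suppose $u : [0, T] \to \mathbb{R}$ is continuous and, for constants $A, B \ge 0$ and $\beta > 0$,

$$u(t) \;\le\; A + B\,t + \beta \int_0^t u(s)\, ds \qquad \text{for all } t \in [0, T].$$

Then

$$u(t) \;\le\; \Bigl(A + \frac{B}{\beta}\Bigr)\, e^{\beta t} - \frac{B}{\beta} \qquad \text{for all } t \in [0, T].$$

Applying this with $u(t) = \|\nabla f(\gamma(t))\|$, $A = \|\nabla f(x)\|$, $B = L_0\,\|x^+ - x\|$, and $\beta = L_1\,\|x^+ - x\|$ gives the bound displayed in the proof.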

B.1 Proof of the theorem

We parametrize the path between the current iterate $x_k$ and the updated iterate $x_{k+1}$ as follows:

Using Taylor's theorem, the triangle inequality, and the Cauchy-Schwarz inequality, we obtain

Since the update distance is bounded by our choice of step size and clipping threshold, Lemma 9 gives

Then by Assumption 3, we have

Therefore, as long as the resulting step is small enough (which follows from our choice of parameters), we have

When the gradient norm is large and clipping is active, we have

When clipping is inactive, we have

Therefore,

Assume that the algorithm doesn’t terminate in iterations. By doing a telescopic sum, we get

Rearrange and we get

Appendix C Proof of Theorem 5

We prove a lower bound on the convergence rate of GD with a fixed step size. The high-level idea is that if GD converges for all functions satisfying the assumptions, then its step size must be small; this small step size then leads to very slow convergence on another function. We start with a function that grows exponentially. Let $L_0$, $L_1$, and $M$ be fixed constants and pick the initial point $x_0$ accordingly. Let the objective be defined as follows,

We notice that the function satisfies the assumptions with constants

(9)

If the step size is too large relative to the gradient at the initialization, the first update overshoots; by the symmetry of the function and the super-linear growth of the gradient norm, the iterates then diverge. Hence, in order for gradient descent with a fixed step size to converge, the step size must be small enough. Formally,

Now, let’s look at a different objective that grows slowly.

This function is also twice differentiable and satisfies the assumptions with the constants in (9). Because its gradient is small, the step-size bound derived above forces each iteration to make only a small amount of progress. Therefore, reaching an $\epsilon$-stationary point requires at least the number of iterations claimed in Theorem 5.
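As a hedged illustration of the first part of this construction (a toy instance of ours, not necessarily the exact function used in the proof), consider

$$f(x) = \frac{L_0}{L_1^2}\, e^{L_1 x}, \qquad f'(x) = \frac{L_0}{L_1}\, e^{L_1 x}, \qquad f''(x) = L_0\, e^{L_1 x} = L_1\, f'(x) \;\le\; L_0 + L_1\,|f'(x)|,$$

so $f$ is $(L_0, L_1)$-smooth even though $f''$ is unbounded. If the gradient norm at the initialization is of order $M$, the local curvature there is of order $L_1 M$, and a fixed step size must satisfy $h \lesssim 1/(L_1 M)$ to avoid overshooting and diverging. On a second, nearly flat function whose gradients are of order $\epsilon$, such a step decreases the objective by only about $h\,\epsilon^2 \lesssim \epsilon^2/(L_1 M)$ per iteration, which is the source of the $M$-dependence in the lower bound.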

Appendix D Proof of Theorem 7

We start by parametrizing the function value along the update,

Note that with this parametrization, we have . Now we would like to argue that if , then . Assume by contradiction that this is not true. Then there exists such that . Since can be made arbitrarily small below a threshold, we assume . Denote

The value exists by continuity of as a function of . Then we know by Assumption 4 that . However, by Taylor expansion, we know that

The last inequality follows by . Hence we get a contradiction and conclude that for all , . Therefore, following the above inequality and Assumption 3, we get

The conclusion follows by the same argument as in Theorem 3, via a telescoping sum over the iterations.

Appendix E Proof of Theorem 8

By the fact that

we know . Hence by Lemma 9, we know

(10)

Let be a filtration such that is generated by . Then after taking the expectation we get

Notice that . Inspired by the proof in (Ward et al., 2018), we get by ,

(11)

We further prove in Lemma 11 that

(12)

We also notice that . Then rearrange (11) and we get

Furthermore, we know that

(13)
(14)

The last inequality follows from the preceding bound together with Markov's inequality. By telescoping inequality (13), we get

The result follows by choosing the parameters as in the theorem statement.

E.1 Technical lemma

Here we complete the proof of Theorem 8 by proving inequality (12).

Lemma 11.

The following inequality holds in the context of Theorem  8.

Proof.

Notice that

Since , we know that

(15)