# Accelerated Almost-Sure Convergence Rates for Nonconvex Stochastic Gradient Descent using Stochastic Learning Rates

Large-scale optimization problems require algorithms both effective and efficient. One such popular and proven algorithm is Stochastic Gradient Descent which uses first-order gradient information to solve these problems. This paper studies almost-sure convergence rates of the Stochastic Gradient Descent method when instead of deterministic, its learning rate becomes stochastic. In particular, its learning rate is equipped with a multiplicative stochasticity, producing a stochastic learning rate scheme. Theoretical results show accelerated almost-sure convergence rates of Stochastic Gradient Descent in a nonconvex setting when using an appropriate stochastic learning rate, compared to a deterministic-learning-rate scheme. The theoretical results are verified empirically.

## 1 Introduction

Solving large-scale optimization problems usually requires the optimization algorithm to process large amounts of data, which can lead to long computation times and large memory requirements. For this reason, simple and cost-effective algorithms are needed. The most prominent examples are the so-called gradient algorithms, which use gradient information to iteratively estimate a minimizer of the optimization problem; perhaps the most prominent of these is Stochastic Gradient Descent (SGD), initially introduced in Robbins and Monro (1951).

Even though in most optimization algorithms (including SGD) the learning rate $\eta_k$ (where $k$ is the iteration count), also referred to as the step size, is commonly taken to be deterministic, the novelty of this work is that it considers the setting where the learning rate becomes stochastic by equipping it with multiplicative stochasticity. In general, it has been observed that minimization performance can be noticeably improved under appropriate learning rate schemes, e.g., under deterministic adaptive learning rate schemes. Examples include ADAM Kingma and Ba (2015), some of its variants (e.g., AMSGrad Reddi and others (2018) or ADAMW Loshchilov and Hutter (2019)), and precursors (e.g., McMahan and Streeter (2010), Duchi and others (2011)). This work demonstrates that performance can be significantly improved if, instead of being only adaptive, the learning rate becomes both adaptive and stochastic.

Specifically, the learning rate, instead of $\eta_k$, now becomes $\eta_k u_k$ (where e.g., or , with a positive constant; with some abuse of terminology, $\eta_k$ will be referred to as the step size), where $u_k$ is a random variable, referred to as the stochasticity factor (SF) in the rest of the paper. Because of the multiplicative nature of the SF, the stochastic learning rate examined in this work will be referred to as the Multiplicative-Stochastic-Learning-Rate (MSLR). A stochastic learning rate whose SF is a uniformly distributed random variable at each iteration will be referred to as the Uniform-Multiplicative-Stochastic-Learning-Rate (UMSLR). This work uses MSLR on SGD to prove accelerated almost sure convergence rates in a nonconvex and smooth setting compared to a deterministic learning rate.
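The MSLR idea described above can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: the least-squares objective, the step size schedule `eta0 / sqrt(k + 1)`, the uniform interval `[0.5, 1.5]` for the SF, and all function names are assumptions chosen for the example.

```python
import numpy as np

# Toy noiseless least-squares data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true

def loss(x):
    """Full objective: average of the per-sample squared residuals."""
    return 0.5 * np.mean((A @ x - b) ** 2)

def grad_sample(x, i):
    """Stochastic gradient computed from the single sample i."""
    return (A[i] @ x - b[i]) * A[i]

def sgd_umslr(x0, steps=3000, eta0=0.05, low=0.5, high=1.5, seed=1):
    """SGD whose deterministic step size is multiplied by a uniform SF."""
    local_rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for k in range(steps):
        eta_k = eta0 / np.sqrt(k + 1)        # deterministic step size (assumed schedule)
        u_k = local_rng.uniform(low, high)   # stochasticity factor, drawn independently of the data
        i = local_rng.integers(n)            # index of the sampled data point
        x = x - eta_k * u_k * grad_sample(x, i)
    return x

x0 = np.zeros(d)
x_hat = sgd_umslr(x0)
print(loss(x0), loss(x_hat))
```

Setting `low = high = 1.0` makes the SF constant and recovers plain SGD with the same deterministic step sizes, which gives a simple baseline for comparison.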

SGD has long been a topic of research and has been investigated in numerous works, e.g., Bertsekas and Tsitsiklis (2000); Zhou and others (2017); Loizou and Richtárik (2018). In Mertikopoulos and others (2020) it is shown that the gradient norm of SGD converges almost surely (a.s.). Loizou and others (2021) show convergence without knowledge of the smoothness constant using Line-Search Nocedal and Wright (2006) and Polyak stepsizes Polyak (1987). Almost sure convergence rates have been provided in works such as Bottou (2003); Nguyen and others (2018). The authors in Sebbouh and others (2021) derive almost sure convergence rates for the minimum squared gradient norm of SGD in a nonconvex and smooth setting by utilizing a convergence result from Robbins and Siegmund (1971). In-expectation convergence analyses have been made in works such as Nemirovski and others (2009); Moulines and Bach (2011); Bottou and others (2018) and references therein. An in-expectation convergence analysis of SGD with stochastic learning rates was made in Mamalis and others (2021).

In summary, this work provides accelerated a.s. convergence rates for SGD in the nonconvex and smooth setting using MSLR, compared to the deterministic learning rate case. Experiments demonstrate the improved a.s. convergence rates empirically. In more detail, the main contributions of this paper are:

• The introduction of the notion of stochastic learning rate schemes. Note that in this work a stochastic learning rate scheme should not depend on the stochasticity induced by the samples and how they are processed (e.g., as is usually the case with the computation of the gradients in most stochastic gradient methods). It should be possible to directly control the stochasticity in the learning rates so that it follows any desired distribution, a requirement whose importance is discussed in the next point. This is in contrast to the stochasticity induced by randomly sampling the dataset, for which it can be impossible, or at least more difficult, to obtain a (discrete) distribution with a prespecified property, regardless of the sampling pattern.

• Accelerated a.s. convergence rates for the SGD algorithm. Stochastic learning rate schemes can be rigorously shown to provide better convergence rates than the deterministic-learning-rate versions of the SGD algorithm in the nonconvex and smooth setting. Specifically, moments of the distribution of the SF enter the a.s. convergence rate of SGD, directly affecting it. This means that by choosing the distribution of the learning rate stochasticity, i.e., of the SF, to possess appropriate prespecified properties, the a.s. convergence rates of the resulting stochastic learning rate algorithm can be improved compared to its deterministic counterpart. In the absence of an SF the a.s. convergence rates reduce to those of the deterministic learning rate algorithms, as expected.

• An empirical demonstration that MSLR schemes, and in particular UMSLR, exhibit significantly improved optimization performance for SGD on some popular datasets, in accordance with the theoretical results.

The remainder of the paper is organized as follows. Section 2 states the context of this work in mathematical terms, along with a lemma and the necessary assumptions. The mathematical derivation of the almost sure convergence rates of the SGD algorithm with MSLR is given in Section 3. Section 4 constructs an appropriate SF which provides accelerated a.s. convergence rates for SGD, and provides a discussion on selecting hyperparameters that complement the SF distribution in accelerating performance. In Section 5, the experimental results for SGD using a UMSLR scheme are presented and compared to the deterministic-learning-rate case. Section 6 concludes the paper with avenues of future work.

## 2 Problem Formulation and Assumptions

Let $v$ be a random variable with distribution $\mathcal{D}$, a set in , and $f_v$ a function that depends on $v$ and $x$. Then the optimization problem is:

$$\min_{x\in\mathbb{R}^d} \mathbb{E}_{v\sim\mathcal{D}}\left[f_v(x)\right]. \tag{1}$$

Define $f(x) = \mathbb{E}_{v\sim\mathcal{D}}[f_v(x)]$. In the well-known case of Empirical Risk Minimization (ERM) , where $v$ represents a random sample from the available training data and $x$ the parameters to be learned. Assume $f$ is $L$-smooth, with $L$ being its smoothness constant. It is assumed that at least one minimizer of (1) exists, yielding the optimal value $f^*$. Then, the SGD algorithm is given by:

$$x_{k+1} = x_k - \eta_k u_k \nabla f_{v_k}(x_k). \tag{2}$$

The stochastic learning rate is $\eta_k u_k$, where $u_k$ denotes the SF and $\eta_k$ is deterministic and will be referred to as the stepsize. This learning rate scheme will be referred to as MSLR. Typical assumptions made in the context of stochastic approximations include either bounded gradients or bounded variance of the gradients. Below, a more general assumption is given:

###### Assumption 1 (Expected Smoothness).

There exist constants $A, B, C$ such that for all $x \in \mathbb{R}^d$:

$$\mathbb{E}_v\left[\|\nabla f_v(x)\|^2\right] \le A\,(f(x) - f^*) + B\,\|\nabla f(x)\|^2 + C. \tag{3}$$

This assumption is introduced in Khaled and Richtárik (2020), where it is called Expected Smoothness and where its properties and significance, especially for settings such as ERM, are discussed. Next, a lemma is presented:
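As a brief illustration of why the Expected Smoothness inequality is more general than the bounded-variance assumption mentioned above, the following sketch assumes unbiased stochastic gradients, $\mathbb{E}_v[\nabla f_v(x)] = \nabla f(x)$, and a variance bound $\sigma^2$ (the symbol $\sigma^2$ is introduced here only for illustration). Under these assumptions the inequality holds with $A = 0$, $B = 1$, and $C = \sigma^2$:

```latex
\begin{align*}
\mathbb{E}_v\big[\|\nabla f_v(x)\|^2\big]
  &= \|\nabla f(x)\|^2
     + \mathbb{E}_v\big[\|\nabla f_v(x) - \nabla f(x)\|^2\big] \\
  &\le 0 \cdot (f(x) - f^*) + 1 \cdot \|\nabla f(x)\|^2 + \sigma^2 .
\end{align*}
```

The first line is the usual bias-variance decomposition, which is where unbiasedness is used.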

###### Lemma 1.

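Consider a filtration $(\mathcal{F}_k)_k$, the nonnegative sequences of $(\mathcal{F}_k)_k$-adapted processes $(V_k)_k$, $(U_k)_k$, and $(Z_k)_k$, and a sequence of positive numbers $(\gamma_k)_k$ such that $\sum_k Z_k < \infty$ almost surely and $\prod_{k=0}^{\infty}(1+\gamma_k) < \infty$, and:

$$\forall k\in\mathbb{N},\quad \mathbb{E}\left[V_{k+1}\mid\mathcal{F}_k\right] + U_{k+1} \le (1+\gamma_k)V_k + Z_k. \tag{4}$$

Then $(V_k)_k$ converges and $\sum_k U_k < \infty$ almost surely.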

This lemma (Robbins and Siegmund (1971)) is used extensively in the proofs of the theorems of this paper. The following assumption includes standard conditions for the step size that appear in the stochastic approximations literature and which usually hold in practice.

###### Assumption 2.

The sequence is decreasing, , and .

Moreover, for the expected value of the SF the following assumption is made.

###### Assumption 3.

The sequence $(\mathbb{E}_u[u_k])_k$ is monotone.

At this point it is noted that distributions for which Assumption 3 holds can be constructed. Discussion of the distribution constructed in this work is deferred to Section 4.
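To make the monotonicity requirement on the mean of the SF concrete, the sketch below uses a uniformly distributed SF on an interval $[a_k, b_k]$ and the standard formulas for the mean and variance of a uniform distribution. The endpoint schedule `a_k = 1`, `b_k = 1 + 1/k` and the helper names are hypothetical choices for illustration, not the construction of Section 4.

```python
import numpy as np

def uniform_sf_moments(a, b):
    """Mean and variance of a uniform random variable on [a, b]."""
    mean = (a + b) / 2.0
    var = (b - a) ** 2 / 12.0
    return mean, var

def is_monotone(seq):
    """True if the sequence is entirely nondecreasing or entirely nonincreasing."""
    diffs = np.diff(seq)
    return bool(np.all(diffs >= 0) or np.all(diffs <= 0))

k = np.arange(1, 101)
a_k = np.ones_like(k, dtype=float)   # hypothetical lower endpoints
b_k = 1.0 + 1.0 / k                  # hypothetical upper endpoints, shrinking toward 1

mean_u, var_u = uniform_sf_moments(a_k, b_k)
print(is_monotone(mean_u), mean_u[:3], var_u[:3])
```

With this schedule the mean $1 + 1/(2k)$ decreases in $k$, so the sequence of means is monotone.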

## 3 Accelerated Nonconvex SGD Almost-Sure Convergence Rates

This section presents results for accelerated a.s. convergence rates of the SGD algorithm in the nonconvex case. The first result is formulated as follows:

###### Theorem 1.

Consider the iterates of (2). Assume that Assumption 1 holds. Assume that the stochasticity factor is bounded, i.e., with . Assume that the first moment of the stochasticity factor is bounded, i.e., with . Choose a stochasticity factor which satisfies Assumption 3. Assume that the stepsizes satisfy Assumption 2.

2.1a. If is decreasing, if is increasing, if , and if for all , then:

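$$\min_{t=0,\dots,k-1}\|\nabla f(x_t)\|^2 = o\left(\frac{\left(\mathbb{E}_u[u_k]-\mathrm{Var}_u[u_k]\right)^{-1}}{\sum_{t=0}^{k-1}\eta_t}\right)\quad\text{a.s.} \tag{5}$$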

2.1b. If is decreasing, and if for all , then:

$$\min_{t=0,\dots,k-1}\|\nabla f(x_t)\|^2 = o\left(\frac{1}{\mathbb{E}_u[u_k]\sum_{t=0}^{k-1}\eta_t}\right)\quad\text{a.s.} \tag{6}$$

2.2. If is increasing, if is decreasing, if , if , and if for all , then:

$$\min_{t=0,\dots,k-1}\|\nabla f(x_t)\|^2 = o\left(\frac{\mathbb{E}_u[u_k]-\mathrm{Var}_u[u_k]}{\sum_{t=0}^{k-1}\eta_t}\right)\quad\text{a.s.} \tag{7}$$

Several remarks on the SGD convergence rate results in the nonconvex case are in order. First, the assumptions of Theorem 1 guarantee the a.s. convergence rates presented in the theorem. However, whether the MSLR a.s. convergence rates are accelerated or not compared to the deterministic-learning-rate case, whose rate is $o\left(1/\sum_{t=0}^{k-1}\eta_t\right)$ (Sebbouh and others (2021)), depends on the choice of the SF distribution of MSLR SGD. To that end, it can be readily observed what properties the SF needs to satisfy so that the MSLR scheme produces faster a.s. convergence rates than the deterministic-learning-rate case. Firstly, for 2.1a, for the moments of the SF it should be that . Secondly, for 2.1b, it should be that and thirdly, for 2.2, it should be that . However, the last condition can be equivalently written as . This is implied by the two conditions and (for 2.2, these two acceleration conditions happen to also be theorem assumptions, but this is not true in general), given that . The reason that the SF is taken to satisfy these two conditions instead of , which is a weaker condition, is because in practice they can be more easily checked for adaptive distributions, e.g., as in the case of a uniformly distributed SF discussed in Section 4 in more detail.

Second, the requirements of Theorem 1 are sufficient but not necessary, i.e., they can be replaced by weaker assumptions and Theorem 1 will still hold. Nonetheless, the weaker assumptions are not as easily verified theoretically, and perhaps experimentally, as the assumptions currently given. For example, the mean and variance monotonicity assumptions along with the assumptions that follow them in 2.1a and 2.2 could be replaced by the assumptions and respectively. These are satisfied when respectively and hold, where the derivatives are with respect to . This is since if, e.g., then which gives the required weaker assumption for 2.1a (respectively for 2.2 by replacing with in the previous). However, given that checking assumptions which include both the derivatives of the mean and variance of some adaptive distributions can become complex quickly, the current assumptions that consider the monotonicity of each of the mean and variance separately, are deemed simpler to check than the assumptions dealing with a combination of the derivatives of these moments.

Third, it is noted that the various requirements that appear in Theorem 1 are consistent. In 2.1a, cannot escape to infinity since it is less than a decreasing and positive . In 2.1b, cannot diverge to negative infinity since it is positive, and in 2.2, cannot diverge to infinity since it is less than unity. Moreover, in the denominator of 2.1a the difference appears whereas in 2.1b the larger quantity appears. This should not be taken that 2.1b provides faster convergence rates than 2.1a since in case 2.1a the stepsize, and therefore the term , is larger than in case 2.1b so there is a trade-off between the acceleration provided by 2.1a and 2.1b (e.g., one case where the latter gives faster convergence rates would be when ).

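Finally, when the SF becomes constant unity, the deterministic-learning-rate a.s. convergence rate is recovered by the MSLR SGD a.s. convergence rates in Theorem 1, since for $u_k = 1$ it is $\mathbb{E}_u[u_k] = 1$ and $\mathrm{Var}_u[u_k] = 0$, yielding $o\left(1/\sum_{t=0}^{k-1}\eta_t\right)$.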

All in all, Theorem 1 demonstrates that for appropriate choices of SFs the MSLR scheme accelerates the a.s. convergence rates of SGD compared to the a.s. convergence rates of its deterministic-learning-rate counterpart, meanwhile using either the same or larger stepsizes.
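The expressions inside the little-o terms of Theorem 1 can be evaluated numerically for a given step size sequence and given SF moments. The sketch below is only an illustration: the function name, the dictionary keys, and the example schedule `1 / sqrt(k + 1)` are assumptions, and the values it returns are the comparison sequences, not the actual gradient norms.

```python
import numpy as np

def rate_bounds(eta, mean_u, var_u):
    """Evaluate the sequences inside o(.) for k = 1, ..., len(eta) - 1.

    eta[t] is the step size at iteration t; mean_u[k] and var_u[k] are the
    mean and variance of the SF at iteration k.
    """
    cum = np.cumsum(eta)[:-1]           # sum_{t=0}^{k-1} eta_t for k = 1, 2, ...
    m, v = mean_u[1:], var_u[1:]        # moments aligned with the same k
    return {
        "deterministic": 1.0 / cum,
        "case_2.1a": 1.0 / ((m - v) * cum),
        "case_2.1b": 1.0 / (m * cum),
        "case_2.2": (m - v) / cum,
    }

K = 10
eta = 1.0 / np.sqrt(np.arange(1, K + 1))    # assumed decreasing step sizes
bounds = rate_bounds(eta, np.ones(K), np.zeros(K))
print(bounds["deterministic"][:3])
```

With a constant SF equal to one (mean one, variance zero), all three MSLR expressions coincide with the deterministic one, matching the remark above.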

###### Proof.

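The proof follows the proof of Lemma 2 in Khaled and Richtárik (2020). Starting with (2) and writing $g(x_k) = \nabla f_{v_k}(x_k)$ for the stochastic gradient, equation (46) in Khaled and Richtárik (2020) becomes:

$$f(x_{k+1}) \le f(x_k) - u_k\eta_k\left\langle\nabla f(x_k), g(x_k)\right\rangle + \frac{L u_k^2\eta_k^2}{2}\|g(x_k)\|^2 \tag{8}$$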

which ultimately results in:

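$$\mathbb{E}_k\left[f(x_{k+1})-f^*\right] + \frac{\eta_k}{2}\left(2u_k - u_k^2 LB\eta_k\right)\|\nabla f(x_k)\|^2 \le \left(1+u_k^2\eta_k^2AL\right)(f(x_k)-f^*) + \frac{u_k^2\eta_k^2LC}{2}. \tag{9}$$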

### 3.1 Case 1a: $\mathbb{E}_u[u_{k+1}] \le \mathbb{E}_u[u_k]$ and $\eta_k \le \frac{1}{LB}$.

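Using $\eta_k \le \frac{1}{LB}$ in (9):

$$\mathbb{E}_k\left[f(x_{k+1})-f^*\right] + \frac{\eta_k}{2}\left(2u_k - u_k^2\right)\|\nabla f(x_k)\|^2 \le \left(1+u_k^2\eta_k^2AL\right)(f(x_k)-f^*) + \frac{u_k^2\eta_k^2LC}{2}. \tag{10}$$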

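From the boundedness of $u_k$ and Assumption 2 it is that $\sum_k u_k^2\eta_k^2\frac{LC}{2} < \infty$. Also it is that $\sum_k u_k^2\eta_k^2AL < \infty$, which means $\prod_k\left(1+u_k^2\eta_k^2AL\right) < \infty$. Thus, from Lemma 1, $(f(x_k)-f^*)_k$ converges a.s. Taking expectations with respect to $u_k$ yields: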

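$$\mathbb{E}_k\left[f(x_{k+1})-f^*\right] + \frac{\eta_k}{2}\left(2\mathbb{E}_u[u_k] - \mathbb{E}_u[u_k^2]\right)\|\nabla f(x_k)\|^2 \le \left(1+\mathbb{E}_u[u_k^2]\eta_k^2AL\right)(f(x_k)-f^*) + \frac{\mathbb{E}_u[u_k^2]\eta_k^2LC}{2}. \tag{11}$$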

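Using $\mathbb{E}_u[u_k^2] = \mathrm{Var}_u[u_k] + \mathbb{E}_u[u_k]^2$:

$$\mathbb{E}_k\left[f(x_{k+1})-f^*\right] + \frac{\mathbb{E}_u[u_k]\eta_k}{2}\left(2 - \left(\mathbb{E}_u[u_k] + \frac{\mathrm{Var}_u[u_k]}{\mathbb{E}_u[u_k]}\right)\right)\|\nabla f(x_k)\|^2 \le \left(1+\mathbb{E}_u[u_k^2]\eta_k^2AL\right)(f(x_k)-f^*) + \frac{\mathbb{E}_u[u_k^2]\eta_k^2LC}{2}. \tag{12}$$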

This means:

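$$\mathbb{E}_k\left[f(x_{k+1})-f^*\right] + \frac{\mathbb{E}_u[u_k]\eta_k}{2}\left(1 - \frac{\mathrm{Var}_u[u_k]}{\mathbb{E}_u[u_k]}\right)\|\nabla f(x_k)\|^2 \le \left(1+\mathbb{E}_u[u_k^2]\eta_k^2AL\right)(f(x_k)-f^*) + \frac{\mathbb{E}_u[u_k^2]\eta_k^2LC}{2}. \tag{13}$$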
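The passage from (10) to (13) rests on the second-moment identity for the SF. The worked sketch below spells out the algebra for the coefficient of the squared gradient norm. The final inequality assumes $0 < \mathbb{E}_u[u_k] \le 1$; that condition is an assumption made here for illustration and is not quoted from the theorem statement.

```latex
\begin{align*}
\mathbb{E}_u\big[2u_k - u_k^2\big]
  &= 2\,\mathbb{E}_u[u_k] - \mathbb{E}_u[u_k^2] \\
  &= 2\,\mathbb{E}_u[u_k] - \mathrm{Var}_u[u_k] - \mathbb{E}_u[u_k]^2 \\
  &= \mathbb{E}_u[u_k]\left(2 - \mathbb{E}_u[u_k]
       - \frac{\mathrm{Var}_u[u_k]}{\mathbb{E}_u[u_k]}\right) \\
  &\ge \mathbb{E}_u[u_k]\left(1
       - \frac{\mathrm{Var}_u[u_k]}{\mathbb{E}_u[u_k]}\right)
  \qquad \text{if } 0 < \mathbb{E}_u[u_k] \le 1 .
\end{align*}
```

The gap between the last two lines equals $\mathbb{E}_u[u_k]\,(1 - \mathbb{E}_u[u_k])$, which is nonnegative exactly under the stated condition.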

Then, let:

$$w_k = \frac{2\eta_k}{\sum_{t=0}^{k}\eta_t},\qquad g_0 = \|\nabla f(x_0)\|^2,\qquad g_{k+1} = (1-w_k)\,g_k + w_k\|\nabla f(x_k)\|^2 \tag{14}$$

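for all $k\in\mathbb{N}$. Moreover, since $g_k$ is a linear combination of $\|\nabla f(x_0)\|^2,\dots,\|\nabla f(x_{k-1})\|^2$ it is that:

$$g_k = \sum_{t=0}^{k-1}\tilde{w}_t\|\nabla f(x_t)\|^2 \tag{15}$$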

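for some sequence $(\tilde{w}_t)_t$ with $\sum_{t=0}^{k-1}\tilde{w}_t = 1$, and where $\tilde{w}_t \ge 0$ since $(\eta_k)_k$ is decreasing. Then, using (14) to replace $\|\nabla f(x_k)\|^2$ in (13) yields: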

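$$\begin{aligned}
&\mathbb{E}_k\left[f(x_{k+1})-f^*\right] + \mathbb{E}_u[u_k]\left(1-\frac{\mathrm{Var}_u[u_k]}{\mathbb{E}_u[u_k]}\right)\frac{\sum_{t=0}^{k}\eta_t}{2}\,g_{k+1} + \frac{\eta_k}{2}\mathbb{E}_u[u_k]\left(1-\frac{\mathrm{Var}_u[u_k]}{\mathbb{E}_u[u_k]}\right)g_k \\
&\quad\le \left(1+\mathbb{E}_u[u_k^2]\eta_k^2AL\right)(f(x_k)-f^*) + \mathbb{E}_u[u_k]\left(1-\frac{\mathrm{Var}_u[u_k]}{\mathbb{E}_u[u_k]}\right)\frac{\sum_{t=0}^{k-1}\eta_t}{2}\,g_k + \frac{\mathbb{E}_u[u_k^2]\eta_k^2LC}{2}.
\end{aligned} \tag{16}$$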

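Since $\mathbb{E}_u[u_{k+1}] \le \mathbb{E}_u[u_k]$ and $\mathrm{Var}_u[u_{k+1}] \ge \mathrm{Var}_u[u_k]$:

$$\begin{aligned}
&\mathbb{E}_k\left[f(x_{k+1})-f^*\right] + \mathbb{E}_u[u_{k+1}]\left(1-\frac{\mathrm{Var}_u[u_{k+1}]}{\mathbb{E}_u[u_{k+1}]}\right)\frac{\sum_{t=0}^{k}\eta_t}{2}\,g_{k+1} + \frac{\eta_k}{2}\mathbb{E}_u[u_k]\left(1-\frac{\mathrm{Var}_u[u_k]}{\mathbb{E}_u[u_k]}\right)g_k \\
&\quad\le \left(1+\mathbb{E}_u[u_k^2]\eta_k^2AL\right)(f(x_k)-f^*) + \mathbb{E}_u[u_k]\left(1-\frac{\mathrm{Var}_u[u_k]}{\mathbb{E}_u[u_k]}\right)\frac{\sum_{t=0}^{k-1}\eta_t}{2}\,g_k + \frac{\mathbb{E}_u[u_k^2]\eta_k^2LC}{2}
\end{aligned} \tag{17}$$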

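and since $1+\mathbb{E}_u[u_k^2]\eta_k^2AL \ge 1$:

$$\begin{aligned}
&\mathbb{E}_k\left[f(x_{k+1})-f^*\right] + \mathbb{E}_u[u_{k+1}]\left(1-\frac{\mathrm{Var}_u[u_{k+1}]}{\mathbb{E}_u[u_{k+1}]}\right)\frac{\sum_{t=0}^{k}\eta_t}{2}\,g_{k+1} + \frac{\eta_k}{2}\mathbb{E}_u[u_k]\left(1-\frac{\mathrm{Var}_u[u_k]}{\mathbb{E}_u[u_k]}\right)g_k \\
&\quad\le \left(1+\mathbb{E}_u[u_k^2]\eta_k^2AL\right)(f(x_k)-f^*) + \left(1+\mathbb{E}_u[u_k^2]\eta_k^2AL\right)\mathbb{E}_u[u_k]\left(1-\frac{\mathrm{Var}_u[u_k]}{\mathbb{E}_u[u_k]}\right)\frac{\sum_{t=0}^{k-1}\eta_t}{2}\,g_k + \frac{\mathbb{E}_u[u_k^2]\eta_k^2LC}{2}.
\end{aligned} \tag{18}$$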

Or equivalently:

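$$\mathbb{E}_k\left[(f(x_{k+1})-f^*)+\varphi_1(k+1)\right] + \frac{\eta_k}{2}\mathbb{E}_u[u_k]\left(1-\frac{\mathrm{Var}_u[u_k]}{\mathbb{E}_u[u_k]}\right)g_k \le \left(1+\mathbb{E}_u[u_k^2]\eta_k^2AL\right)\big((f(x_k)-f^*)+\varphi_1(k)\big) + \frac{\mathbb{E}_u[u_k^2]\eta_k^2LC}{2} \tag{19}$$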

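where $\varphi_1(k) = \mathbb{E}_u[u_k]\left(1-\frac{\mathrm{Var}_u[u_k]}{\mathbb{E}_u[u_k]}\right)\frac{\sum_{t=0}^{k-1}\eta_t}{2}\,g_k$. Then, from the boundedness of $u_k$, which gives boundedness of $\mathbb{E}_u[u_k^2]$, and from Assumption 2 it is that $\sum_k\mathbb{E}_u[u_k^2]\eta_k^2AL < \infty$, which means $\prod_k\left(1+\mathbb{E}_u[u_k^2]\eta_k^2AL\right) < \infty$. Moreover, it is that $\sum_k\mathbb{E}_u[u_k^2]\eta_k^2\frac{LC}{2} < \infty$ from Assumption 2. Then, from Lemma 1 this means that $\sum_k\frac{\eta_k}{2}\mathbb{E}_u[u_k]\left(1-\frac{\mathrm{Var}_u[u_k]}{\mathbb{E}_u[u_k]}\right)g_k < \infty$ a.s. and also that:

$$\big((f(x_k)-f^*)+\varphi_1(k)\big)_k\ \text{converges a.s.}$$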

This gives that $\varphi_1(k)$ converges a.s. since it is that $f(x_k)-f^*$ converges a.s. Then, algebraically manipulating it is that:

 (20)

Thus, from Assumption 2, i.e., from . Therefore:

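$$g_k = o\left(\frac{1}{\mathbb{E}_u[u_k]\left(1-\frac{\mathrm{Var}_u[u_k]}{\mathbb{E}_u[u_k]}\right)\sum_{t=0}^{k-1}\eta_t}\right)\quad\text{a.s.} \tag{21}$$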

that is:

$$g_k = o\left(\frac{1}{\left(\mathbb{E}_u[u_k]-\mathrm{Var}_u[u_k]\right)\sum_{t=0}^{k-1}\eta_t}\right)\quad\text{a.s.} \tag{22}$$

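Moreover, from (15) it is that $g_k \ge \min_{t=0,\dots,k-1}\|\nabla f(x_t)\|^2$, which yields:

$$\min_{t=0,\dots,k-1}\|\nabla f(x_t)\|^2 = o\left(\frac{\left(\mathbb{E}_u[u_k]-\mathrm{Var}_u[u_k]\right)^{-1}}{\sum_{t=0}^{k-1}\eta_t}\right), \tag{23}$$

almost surely.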
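The averaging device in (14) and (15) can be checked numerically. The sketch below uses random positive numbers as stand-ins for the squared gradient norms and an assumed decreasing step size schedule; it verifies that each averaged value stays above the running minimum, which is the property the last step of the argument relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 50
eta = 1.0 / np.sqrt(np.arange(1, K + 1))     # decreasing step sizes (assumed schedule)
sq_norms = rng.uniform(0.1, 2.0, size=K)     # stand-ins for the squared gradient norms

g = np.empty(K + 1)
g[0] = sq_norms[0]                           # initial value of the average
for k in range(K):
    w_k = 2.0 * eta[k] / eta[: k + 1].sum()  # weight from the recursion
    g[k + 1] = (1.0 - w_k) * g[k] + w_k * sq_norms[k]

# running_min[k] is the minimum of sq_norms[0..k]
running_min = np.minimum.accumulate(sq_norms)
print(g[1:6], running_min[:5])
```

The first weight equals 2, but because the initial average equals the first squared norm, the first update leaves it unchanged; from the second step on the weights lie in $[0, 1]$ for a decreasing schedule, so every later value is a convex combination of earlier squared norms.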

### 3.2 Case 1b: $\mathbb{E}_u[u_{k+1}] \le \mathbb{E}_u[u_k]$ and $\eta_k \le \frac{1}{LBc_2}$.

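Using $u_k \le c_2$ in (9):

$$\mathbb{E}_k\left[f(x_{k+1})-f^*\right] + \frac{u_k\eta_k}{2}\left(2 - c_2LB\eta_k\right)\|\nabla f(x_k)\|^2 \le \left(1+u_k^2\eta_k^2AL\right)(f(x_k)-f^*) + \frac{u_k^2\eta_k^2LC}{2}. \tag{24}$$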

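Using $\eta_k \le \frac{1}{LBc_2}$ results in:

$$\mathbb{E}_k\left[f(x_{k+1})-f^*\right] + \frac{u_k\eta_k}{2}\|\nabla f(x_k)\|^2 \le \left(1+u_k^2\eta_k^2AL\right)(f(x_k)-f^*) + \frac{u_k^2\eta_k^2LC}{2}. \tag{25}$$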
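The two substitutions that lead from (9) to (25) can be written as a single chain for the coefficient of the squared gradient norm. The sketch assumes $0 \le u_k \le c_2$, $L, B \ge 0$, and $\eta_k \le \frac{1}{LBc_2}$, with $c_2$ read as the upper bound on the SF, as suggested by the heading of this case.

```latex
\begin{align*}
\frac{\eta_k}{2}\left(2u_k - u_k^2 LB\eta_k\right)
  &= \frac{u_k\eta_k}{2}\left(2 - u_k LB\eta_k\right) \\
  &\ge \frac{u_k\eta_k}{2}\left(2 - c_2 LB\eta_k\right)
     && \text{since } u_k \le c_2 \\
  &\ge \frac{u_k\eta_k}{2}
     && \text{since } c_2 LB\eta_k \le 1 .
\end{align*}
```

Because the coefficient on the left-hand side only gets smaller along the chain, the inequality in (9) is preserved at each step.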

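By taking expectations with respect to $u_k$:

$$\mathbb{E}_k\left[f(x_{k+1})-f^*\right] + \frac{\mathbb{E}_u[u_k]\eta_k}{2}\|\nabla f(x_k)\|^2 \le \left(1+\mathbb{E}_u[u_k^2]\eta_k^2AL\right)(f(x_k)-f^*) + \frac{\mathbb{E}_u[u_k^2]\eta_k^2LC}{2}. \tag{26}$$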

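From Assumption 2 it is that $\sum_k\mathbb{E}_u[u_k^2]\eta_k^2\frac{LC}{2} < \infty$. Also it is that $\sum_k\mathbb{E}_u[u_k^2]\eta_k^2AL < \infty$, which means $\prod_k\left(1+\mathbb{E}_u[u_k^2]\eta_k^2AL\right) < \infty$. Thus from Lemma 1, $(f(x_k)-f^*)_k$ converges a.s. Using (14) to replace $\|\nabla f(x_k)\|^2$ in (26) gives: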

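$$\mathbb{E}_k\left[f(x_{k+1})-f^*\right] + \mathbb{E}_u[u_k]\frac{\sum_{t=0}^{k}\eta_t}{2}\,g_{k+1} + \frac{\eta_k}{2}\mathbb{E}_u[u_k]\,g_k \le \left(1+\mathbb{E}_u[u_k^2]\eta_k^2AL\right)(f(x_k)-f^*) + \mathbb{E}_u[u_k]\frac{\sum_{t=0}^{k-1}\eta_t}{2}\,g_k + \frac{\mathbb{E}_u[u_k^2]\eta_k^2LC}{2}. \tag{27}$$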

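Since $\mathbb{E}_u[u_{k+1}] \le \mathbb{E}_u[u_k]$:

$$\mathbb{E}_k\left[f(x_{k+1})-f^*\right] + \mathbb{E}_u[u_{k+1}]\frac{\sum_{t=0}^{k}\eta_t}{2}\,g_{k+1} + \frac{\eta_k}{2}\mathbb{E}_u[u_k]\,g_k \le \left(1+\mathbb{E}_u[u_k^2]\eta_k^2AL\right)(f(x_k)-f^*) + \mathbb{E}_u[u_k]\frac{\sum_{t=0}^{k-1}\eta_t}{2}\,g_k + \frac{\mathbb{E}_u[u_k^2]\eta_k^2LC}{2} \tag{28}$$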

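and since $1+\mathbb{E}_u[u_k^2]\eta_k^2AL \ge 1$:

$$\mathbb{E}_k\left[f(x_{k+1})-f^*+\varphi_2(k+1)\right] + \frac{\eta_k}{2}\mathbb{E}_u[u_k]\,g_k \le \left(1+\mathbb{E}_u[u_k^2]\eta_k^2AL\right)\big((f(x_k)-f^*)+\varphi_2(k)\big) + \frac{\mathbb{E}_u[u_k^2]\eta_k^2LC}{2} \tag{29}$$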

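where $\varphi_2(k) = \mathbb{E}_u[u_k]\frac{\sum_{t=0}^{k-1}\eta_t}{2}\,g_k$. Then from the boundedness of $u_k$, which gives boundedness of $\mathbb{E}_u[u_k^2]$, and from Assumption 2 it is that $\sum_k\mathbb{E}_u[u_k^2]\eta_k^2AL < \infty$, which means $\prod_k\left(1+\mathbb{E}_u[u_k^2]\eta_k^2AL\right) < \infty$. Moreover, it is that $\sum_k\mathbb{E}_u[u_k^2]\eta_k^2\frac{LC}{2} < \infty$ from Assumption 2. Then, from Lemma 1 this means that $\sum_k\frac{\eta_k}{2}\mathbb{E}_u[u_k]\,g_k < \infty$ a.s. and also that $\big((f(x_k)-f^*)+\varphi_2(k)\big)_k$ converges a.s. This gives that $\varphi_2(k)$ converges a.s. since it is also that $f(x_k)-f^*$ converges a.s. This yields for :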

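$$\lim_{k\to\infty}\frac{\eta_k}{\sum_{t=0}^{k-1}\eta_t}\,\mathbb{E}_u[u_k]\sum_{t=0}^{k-1}\eta_t\,g_k = \lim_{k\to\infty}\eta_k\,\mathbb{E}_u[u_k]\,g_k = 0.$$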

This means that from Assumption 2, i.e., from