Solving large-scale optimization problems usually requires the optimization algorithm to process large amounts of data, which can lead to long computation times as well as large memory requirements. For this reason, simple and cost-effective algorithms need to be used. The most prominent examples are the so-called gradient algorithms, which use gradient information to iteratively estimate a minimizer of the optimization problem; perhaps the best known is Stochastic Gradient Descent (SGD), initially introduced in Robbins and Monro (1951).
In most optimization algorithms (including SGD), the learning rate at each iteration, also referred to as step size, is commonly taken to be deterministic. The novelty of this work is that it introduces the problem in which the learning rate becomes stochastic by equipping it with multiplicative stochasticity. In general, it has been observed that minimization performance can be noticeably improved under appropriate learning schemes, e.g., under deterministic adaptive learning rate schemes. Examples include ADAM Kingma and Ba (2015) and some variants (e.g., AMSGrad Reddi and others (2018) or ADAMW Loshchilov and Hutter (2019)) or precursors (e.g., McMahan and Streeter (2010), Duchi and others (2011)). This work demonstrates that performance can be significantly improved if the learning rate, instead of being only adaptive, becomes both adaptive and stochastic.
Specifically, the deterministic learning rate is replaced by its product with a random variable, referred to as the stochasticity factor (SF) in the rest of the paper; with some abuse of terminology, the deterministic part (e.g., a schedule involving a positive constant) will be referred to as the step size. Because of the multiplicative nature of the SF, the stochastic learning rate examined in this work will be referred to as Multiplicative-Stochastic-Learning-Rate (MSLR). A stochastic learning rate whose SF is a uniformly distributed random variable at each iteration will be referred to as Uniform-Multiplicative-Stochastic-Learning-Rate (UMSLR). This work uses MSLR on SGD to prove accelerated almost sure convergence rates in a nonconvex and smooth setting compared to a deterministic learning rate.
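The idea of a UMSLR scheme can be illustrated with a minimal sketch. The function name `sgd_umslr`, the toy quadratic objective, the noise level, the step-size schedule `0.5 / sqrt(n)`, and the SF range `[0.5, 1.5]` are all assumptions made for illustration and are not taken from the paper; the sketch only shows the mechanics of multiplying a deterministic step size by a uniformly distributed SF, and makes no claim about convergence rates.

```python
import numpy as np

def sgd_umslr(grad, x0, step, n_iters, sf_low=1.0, sf_high=1.0, seed=0):
    """SGD whose deterministic step size is multiplied by a uniform SF.

    With sf_low == sf_high == 1 the SF is constant unity and the update
    reduces to SGD with a deterministic learning rate.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for n in range(1, n_iters + 1):
        # Stochasticity factor for this iteration (degenerate if bounds coincide).
        sf = sf_low if sf_low == sf_high else rng.uniform(sf_low, sf_high)
        # Stochastic learning rate = deterministic step size * SF.
        x = x - step(n) * sf * grad(x, rng)
    return x

# Hypothetical toy problem: F(x) = 0.5 * ||x||^2 with additive gradient noise.
def noisy_grad(x, rng):
    return x + 0.1 * rng.standard_normal(x.shape)

def step(n):
    # Assumed decreasing step-size schedule.
    return 0.5 / np.sqrt(n)

x_det = sgd_umslr(noisy_grad, np.ones(5), step, 2000)            # SF == 1
x_sto = sgd_umslr(noisy_grad, np.ones(5), step, 2000, 0.5, 1.5)  # uniform SF
print(np.linalg.norm(x_det), np.linalg.norm(x_sto))
```

Setting both SF bounds to one gives the deterministic-learning-rate baseline, so the same routine can be used for side-by-side comparisons.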
The SGD algorithm has been a traditional topic of research, investigated in numerous works, e.g., Bertsekas and Tsitsiklis (2000); Zhou and others (2017); Loizou and Richtárik (2018). In Mertikopoulos and others (2020) it is shown that the gradient norm of SGD converges almost surely (a.s.). Loizou and others (2021) show convergence without knowledge of the smoothness constant using Line-Search Nocedal and Wright (2006) and Polyak stepsizes Polyak (1987). Almost sure convergence rates have been provided in works such as Bottou (2003); Nguyen and others (2018). The authors in Sebbouh and others (2021) derive almost sure convergence rates for the minimum squared gradient norm of SGD in a nonconvex and smooth setting by utilizing a convergence result from Robbins and Siegmund (1971). In-expectation convergence analyses have been carried out in works such as Nemirovski and others (2009); Moulines and Bach (2011); Bottou and others (2018) and references therein. An in-expectation convergence analysis of SGD with stochastic learning rates was given in Mamalis and others (2021).
In summary, this work provides accelerated a.s. convergence rates for SGD in the nonconvex and smooth setting using MSLR, compared to the deterministic learning rate case. Experiments demonstrate the improved a.s. convergence rates empirically. In more detail, the main contributions of this paper are:
The introduction of the notion of stochastic learning rate schemes. Note that in this work a stochastic learning rate scheme should not depend on the stochasticity induced by the samples and how they are processed (e.g., as is usually the case with the computation of the gradients in most stochastic gradient methods). It should be possible to directly control the stochasticity in the learning rates so that it follows any desired distribution, a requirement whose importance is discussed in the next point. This is in contrast to the stochasticity induced by randomly sampling the dataset, for which it can be impossible, or at least more difficult, to obtain a (discrete) distribution with a prespecified property, regardless of the sampling pattern.
Accelerated a.s. convergence rates for the SGD algorithm. The introduction of stochastic learning rate schemes can be rigorously shown to provide better convergence rates than the deterministic-learning-rate version of the SGD algorithm in the nonconvex and smooth setting. Specifically, moments of the distribution of the SF enter the a.s. convergence rate of SGD and directly affect it. This means that by choosing the distribution of the learning rate stochasticity, i.e., of the SF, to possess appropriate prespecified properties, the a.s. convergence rates of the resulting stochastic learning rate algorithm can be improved compared to its deterministic counterpart. In the absence of an SF, the a.s. convergence rates reduce to those of the deterministic learning rate algorithm, as expected.
An empirical demonstration that MSLR schemes, and in particular UMSLR, exhibit significantly improved optimization performance for SGD on some popular datasets, in accordance with the theoretical results.
The remainder of the paper is organized as follows. Section 2 states the context of this work in mathematical terms, along with a lemma and the necessary assumptions. The mathematical derivation of the almost sure convergence rates of the SGD algorithm with MSLR is given in Section 3. Section 4 constructs an appropriate SF which provides accelerated a.s. convergence rates for SGD, and provides a discussion on selecting hyperparameters that complement the SF distribution in accelerating performance. In Section 5, the experimental results for SGD using a UMSLR scheme are presented and compared to the deterministic-learning-rate case. Section 6 concludes the paper with avenues of future work.
2 Problem Formulation and Assumptions
Let be a random variable with distribution , a set in , and a function that depends on and . Then the optimization problem is:
Define . In the well-known case of Empirical Risk Minimization (ERM) , where represents a random sample from the available training data and the parameters to be learned. Assume is smooth with being its smoothness constant. It is assumed that at least one minimizer to (1) exists yielding . Then, the SGD algorithm is given by:
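As a sketch of the setup described above, problem (1) and the SGD update (2) with MSLR can be written as follows. The symbols $x$, $\xi$, $\mathcal{D}$, $f$, $F$, $\eta_n$ and $u_n$ are assumed notation chosen here for illustration (the unconstrained domain $\mathbb{R}^d$ is likewise an assumption) and may differ from the notation of the original equations.

```latex
% (1) Optimization problem: minimize the expectation of f over the random variable \xi
\min_{x \in \mathbb{R}^d} F(x), \qquad F(x) := \mathbb{E}_{\xi \sim \mathcal{D}}\big[ f(x, \xi) \big] \tag{1}

% (2) SGD with a multiplicative stochastic learning rate:
%     \eta_n is the deterministic stepsize, u_n the stochasticity factor (SF)
x_{n+1} = x_n - \eta_n \, u_n \, \nabla f(x_n, \xi_n) \tag{2}
```

With $u_n \equiv 1$ the update reduces to SGD with the deterministic learning rate $\eta_n$.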
The stochastic learning rate is the product of the SF and a deterministic factor, which will be referred to as the stepsize. This learning rate scheme will be referred to as MSLR. Typical assumptions made in the context of stochastic approximation include either bounded gradients or bounded variance of the gradients. Below, a more general assumption is given:
Assumption 1 (Expected Smoothness).
There exist constants A,B,C s.t. for all :
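In the form in which it is commonly stated following Khaled and Richtárik (2020), the Expected Smoothness inequality reads as below. The symbols $x$, $\xi$, $f$, $F$ and $F^{*}$ (the minimum value of the objective) are assumed notation, so this should be read as a sketch rather than as a verbatim reproduction of the original display.

```latex
\mathbb{E}_{\xi}\big[ \| \nabla f(x, \xi) \|^{2} \big]
  \;\le\; 2A \,\big( F(x) - F^{*} \big) \;+\; B \, \| \nabla F(x) \|^{2} \;+\; C
```

Bounded gradients correspond to $A = B = 0$, and bounded gradient variance corresponds to $A = 0$, $B = 1$, which is why the assumption is more general than either.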
This assumption is introduced in Khaled and Richtárik (2020) where is called Expected Smoothness and wherein the properties and significance of this assumption, especially for settings such as ERM, are discussed. Next, a lemma is presented:
Lemma 1.
Consider a filtration , the nonnegative sequences of -adapted processes , and , and a sequence of positive numbers such that almost surely and , and:
Then converges and almost surely.
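For reference, the Robbins and Siegmund result is usually stated as in the following sketch. The symbols $\mathcal{F}_n$, $V_n$, $U_n$, $Z_n$ and $\gamma_n$ are assumed notation for the filtration, the three nonnegative adapted processes, and the sequence of positive numbers, respectively.

```latex
% Hypotheses
\sum_{n} Z_n < \infty \ \text{a.s.}, \qquad
\prod_{n} (1 + \gamma_n) < \infty, \qquad
\mathbb{E}\big[ V_{n+1} \mid \mathcal{F}_n \big] + U_{n+1} \;\le\; (1 + \gamma_n)\, V_n + Z_n .

% Conclusion
(V_n)_n \ \text{converges and} \ \sum_{n} U_n < \infty \ \text{almost surely.}
```

For positive $\gamma_n$, the product condition is equivalent to $\sum_n \gamma_n < \infty$.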
This lemma (Robbins and Siegmund (1971)) is used extensively in the proofs of the theorems of this paper. The following assumption includes standard conditions for the step size that appear in the stochastic approximations literature and which usually hold in practice.
Assumption 2.
The sequence of stepsizes is decreasing, , and .
Moreover, for the expected value of the SF the following assumption is made.
Assumption 3.
The sequence of expected values of the SF is monotone.
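For a uniformly distributed SF, the monotonicity of the mean can be checked numerically from the bounds of the distribution. The sketch below uses the standard mean and variance formulas of the uniform distribution; the helper names and the particular bound sequences `low` and `high` are hypothetical and chosen only for illustration.

```python
import numpy as np

def uniform_sf_moments(low, high):
    """Mean and variance of a Uniform(low, high) SF, elementwise over iterations."""
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    mean = (low + high) / 2.0
    var = (high - low) ** 2 / 12.0
    return mean, var

def is_monotone(seq):
    """True if the sequence is nondecreasing or nonincreasing."""
    d = np.diff(seq)
    return bool(np.all(d >= 0) or np.all(d <= 0))

# Hypothetical iteration-dependent bounds of the uniform SF.
n = np.arange(1, 101)
low = 1.0 - 0.5 / n
high = 1.0 + 1.0 / n

mean, var = uniform_sf_moments(low, high)
print(is_monotone(mean), is_monotone(var))
```

Checking the mean and the variance separately mirrors the way the moments of the SF are treated one at a time in the assumptions.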
3 Accelerated Nonconvex SGD Almost-Sure Convergence Rates
This section presents results for accelerated a.s. convergence rates of the SGD algorithm in the nonconvex case. The first result is formulated as follows:
Theorem 1.
Consider the iterates of (2). Assume that Assumption 1 holds. Assume that the stochasticity factor is bounded, i.e., with . Assume that the first moment of the stochasticity factor is bounded, i.e., with . Choose a stochasticity factor that satisfies Assumption 3. Assume that the stepsizes satisfy Assumption 2.
2.1a. If is decreasing, if is increasing, if , and if for all , then:
2.1b. If is decreasing, and if for all , then:
2.2. If is increasing, if is decreasing, if , if , and if for all , then:
Several remarks on the SGD convergence rate results in the nonconvex case are in order. First, the assumptions of Theorem 1 guarantee the a.s. convergence rates presented in the theorem. However, whether the MSLR a.s. convergence rates are accelerated or not compared to the deterministic-learning-rate case, whose rate is (Sebbouh and others (2021)), depends on the choice of the SF distribution of MSLR SGD. To that end, it can be readily observed what properties the SF needs to satisfy so that the MSLR scheme produces faster a.s. convergence rates than the deterministic-learning-rate case. Firstly, for 2.1a, for the moments of the SF it should be that . Secondly, for 2.1b, it should be that and thirdly, for 2.2, it should be that . However, the last condition can be equivalently written as . This is implied by the two conditions and (for 2.2, these two acceleration conditions happen to also be theorem assumptions, but this is not true in general), given that . The reason that the SF is taken to satisfy these two conditions instead of , which is a weaker condition, is because in practice they can be more easily checked for adaptive distributions, e.g., as in the case of a uniformly distributed SF discussed in Section 4 in more detail.
Second, the requirements of Theorem 1 are sufficient but not necessary, i.e., they can be replaced by weaker assumptions and Theorem 1 will still hold. Nonetheless, the weaker assumptions are not as easily verified theoretically, and perhaps experimentally, as the assumptions currently given. For example, the mean and variance monotonicity assumptions along with the assumptions that follow them in 2.1a and 2.2 could be replaced by the assumptions and respectively. These are satisfied when respectively and hold, where the derivatives are with respect to . This is because if, e.g., then which gives the required weaker assumption for 2.1a (respectively for 2.2 by replacing with in the previous). However, given that checking assumptions which include both the derivatives of the mean and variance of some adaptive distributions can quickly become complex, the current assumptions, which consider the monotonicity of the mean and of the variance separately, are deemed simpler to check than assumptions dealing with a combination of the derivatives of these moments.
Third, it is noted that the various requirements that appear in Theorem 1 are consistent. In 2.1a, cannot escape to infinity since it is less than a decreasing and positive . In 2.1b, cannot diverge to negative infinity since it is positive, and in 2.2, cannot diverge to infinity since it is less than unity. Moreover, in the denominator of 2.1a the difference appears whereas in 2.1b the larger quantity appears. This should not be taken to mean that 2.1b provides faster convergence rates than 2.1a, since in case 2.1a the stepsize, and therefore the term , is larger than in case 2.1b, so there is a trade-off between the acceleration provided by 2.1a and 2.1b (e.g., one case where the latter gives faster convergence rates would be when ).
Finally, when the SF becomes constant unity, the deterministic-learning-rate a.s. convergence rate is recovered by the MSLR SGD a.s. convergence rates in Theorem 1 since for it is and yielding .
All in all, Theorem 1 demonstrates that for appropriate choices of SFs the MSLR scheme accelerates the a.s. convergence rates of SGD compared to the a.s. convergence rates of its deterministic-learning-rate counterpart, while using either the same or larger stepsizes.
3.1 Case 1a: and .
Using in (9):
for all . Moreover, since is a linear combination of it is that:
Since and :
and since :
This gives that converges a.s. since it is that converges a.s. Then, algebraically manipulating it is that:
Thus, from Assumption 2, i.e., from . Therefore:
Moreover, from (15) it is that , which yields:
3.2 Case 1b: and .
Using in (9):
Using results in:
By taking expectations with respect to :
and since :
where . Then from which gives and from Assumption 2 it is that , which means . Moreover, it is that from Assumption 2. Then, from Lemma 1 this means that a.s. and also that converges a.s. This gives that converges a.s. since it is also that converges a.s. This yields for :
This means that from Assumption 2, i.e., from . Therefore: