The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure

There is a stark disparity between the step size schedules used in practical large scale machine learning and those that are considered optimal by the theory of stochastic approximation. In theory, most results utilize polynomially decaying learning rate schedules, while, in practice, the "Step Decay" schedule is among the most popular schedules, where the learning rate is cut every constant number of epochs (i.e. this is a geometrically decaying schedule). This work examines the step-decay schedule for the stochastic optimization problem of streaming least squares regression (both in the non-strongly convex and strongly convex case), where we show that a sharp theoretical characterization of an optimal learning rate schedule is far more nuanced than suggested by previous work. We focus specifically on the rate that is achievable when using the final iterate of stochastic gradient descent, as is commonly done in practice. Our main result provably shows that a properly tuned geometrically decaying learning rate schedule provides an exponential improvement (in terms of the condition number) over any polynomially decaying learning rate schedule. We also provide experimental support for wider applicability of these results, including for training modern deep neural networks.


1 Introduction

Large scale machine learning and deep learning rely almost exclusively on stochastic optimization methods, primarily SGD (Robbins and Monro, 1951) and variants. Such methods are heavily tuned to the problem at hand (often with parallelized hyper-parameter searches (Li et al., 2017)). There are two predominant approaches in stochastic optimization: methods that decay learning rate schedules to achieve the best performance (Krizhevsky et al., 2012; Sutskever et al., 2013; Kidambi et al., 2018), and methods that rely on various forms of approximate preconditioning (Duchi et al., 2011; Tieleman and Hinton, 2012; Kingma and Ba, 2014) to obtain reasonably accurate results on classes of problem instances (often) with minimal hyper-parameter tuning. This work examines the former class of methods, where our goal is to present a more refined characterization of optimal learning rate schedules, through both sharp theoretical analysis (on the special case of convex quadratics) and empirical studies.

In this work, we will restrict our attention to only the SGD algorithm where we are concerned with the behavior of the final iterate (i.e. the last point when we choose to terminate the algorithm). While the majority of (minimax optimal) theoretical results for SGD focus on iterate averaging techniques (e.g. Polyak and Juditsky (1992)), practical implementations of SGD predominantly return the final iterate of the SGD procedure. Thus, it is of importance (both from theoretical and practical perspectives) to quantify what is achievable with the final iterate of an SGD procedure.

In theory, it is known that the final iterate of SGD (Robbins and Monro, 1951) will (asymptotically) converge to the (local) minimizer only if the learning rates are not summable but are square summable (the former condition ensures that the initial conditions are forgotten, and the latter ensures that the error due to the noise goes to zero) (Kushner and Clark, 1978; Kushner and Yin, 2003). In particular, most of the theoretically studied learning rate schedules are of the form η_t = a/(b + t)^α for some a, b > 0 and α ∈ (1/2, 1] (Robbins and Monro, 1951; Polyak and Juditsky, 1992); we refer to these schedules as polynomial decay schemes, and they are convergent precisely because they are not summable but are square summable. Furthermore, it is known that such polynomial decay schemes can yield near-minimax optimal rates (up to log factors) on the final iterate for certain classes of non-smooth stochastic convex optimization problems (Shamir and Zhang, 2012), with/without strong convexity.
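As a quick numerical illustration of these two conditions, the sketch below (the constants a = b = 1 and the horizon are arbitrary choices, not values from the paper) checks that the α = 1 polynomial schedule has diverging stepsize sums but converging squared sums, which is exactly the classical Robbins-Monro requirement:

```python
# Hedged sketch: numerically contrast the two summability conditions for a
# polynomial decay schedule eta_t = a / (b + t)**alpha. The parameters
# a = b = 1 and the horizon are illustrative.

def polynomial_schedule(a, b, alpha, horizon):
    """Return the first `horizon` stepsizes eta_t = a / (b + t)**alpha."""
    return [a / (b + t) ** alpha for t in range(1, horizon + 1)]

etas = polynomial_schedule(a=1.0, b=1.0, alpha=1.0, horizon=100_000)
sum_eta = sum(etas)                    # grows like log(horizon): non-summable
sum_eta_sq = sum(e * e for e in etas)  # converges (here toward pi^2/6 - 1)
```

Doubling the horizon keeps increasing `sum_eta` without bound, while `sum_eta_sq` stays below its finite limit.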

In practice, a widely used stepsize schedule involves cutting the learning rate (by a constant factor) every constant number of epochs; such schemes are referred to as "Step Decay" schedules. Clearly, such a scheme decays the learning rate geometrically, and is therefore a non-convergent scheme (from the stochastic approximation perspective). However, in practice, the schedule at which the rate is (geometrically) cut is tuned to obtain good performance when the algorithm is terminated (Krizhevsky et al., 2012; He et al., 2016b), as opposed to one that obtains the best rates in the limit of a large number of updates. Such schemes are so widely used that they are available as a standard option in popular deep learning libraries such as PyTorch.
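A minimal sketch of such a schedule (the base rate 0.1, cut factor 0.1, and 30-epoch interval below are illustrative defaults, not values prescribed by this paper):

```python
def step_decay_lr(epoch, base_lr=0.1, drop_factor=0.1, epochs_per_drop=30):
    """Step Decay: cut the learning rate by `drop_factor` every
    `epochs_per_drop` epochs. All default values here are illustrative."""
    return base_lr * drop_factor ** (epoch // epochs_per_drop)

# First three "phases" of the schedule: 0.1, then 0.01, then 0.001.
lrs = [step_decay_lr(e) for e in range(90)]
```

The same rule is exposed in PyTorch as `torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)`.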

Our contributions:

This work establishes near optimality of the step-decay schedule (Algorithm 1) on the final iterate of an SGD procedure (with a known time horizon). In particular, the variance on the final iterate of a step-decay schedule is shown to offer an exponential improvement (in terms of the condition number) over that of the polynomially decaying step size schemes standard in the theory of stochastic approximation (Kushner and Yin, 2003). Figure 1 illustrates that this difference is evident (empirically) even when optimizing a two-dimensional convex quadratic. Table 1 provides a summary.
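The flavor of this gap can be reproduced in a few lines. The sketch below (with illustrative constants, and a simplified additive-noise model rather than the paper's exact Figure 1 setup) evolves the expected squared error of SGD coordinate-wise on a two-dimensional quadratic with eigenvalues (1, 1/κ) for κ = 100, and compares the final-iterate risk of a 1/t schedule against a step-decay schedule that halves the rate over log₂T phases:

```python
# Hedged numerical sketch (illustrative constants, additive-noise model):
# evolve the *expected* per-coordinate squared error of SGD on a 2-d
# quadratic and compare polynomial decay against step decay.
import math

def expected_risk(schedule, hs=(1.0, 0.01), sigma2=0.01, v0=1.0):
    """v_{i,t} = (1 - eta_t h_i)^2 v_{i,t-1} + eta_t^2 sigma2 h_i ;
    returns the final excess risk sum_i h_i v_i / 2."""
    v = [v0 for _ in hs]
    for eta in schedule:
        v = [(1 - eta * h) ** 2 * vi + eta ** 2 * sigma2 * h
             for h, vi in zip(hs, v)]
    return sum(h * vi for h, vi in zip(hs, v)) / 2

T = 10_000
poly = [1.0 / (t + 1) for t in range(T)]                  # eta_t = 1/(t+1)
phases = math.ceil(math.log2(T))
phase_len = math.ceil(T / phases)
step = [1.0 * 0.5 ** (t // phase_len) for t in range(T)]  # halve each phase

risk_poly = expected_risk(poly)
risk_step = expected_risk(step)
```

Under this toy model, the step-decay run ends with a final-iterate risk that is orders of magnitude smaller: the polynomial schedule barely decays the error along the low-curvature direction.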

Our main contributions are as follows:

• Sub-optimality of polynomially decaying learning rate schemes: For the case of optimizing strongly convex quadratics, this work shows that the final iterate of a polynomially decaying stepsize scheme (i.e., with η_t = a/(b + t)^α, for any a, b ≥ 0 and α ∈ [0, 1]) is off the statistical minimax rate by a factor of the condition number κ of the problem. For the non-strongly convex case of optimizing quadratics, any polynomially decaying stepsize scheme can achieve a rate no better than 1/√T (up to log factors), while the statistical minimax rate is σ²d/T. We note that our main Theorem 2, for the non-strongly convex case of quadratics, offers a rate on the initial error (i.e., the bias term) that is off the best known rate (Bach and Moulines, 2013) (which employs iterate averaging) by a dimension factor.

• Near-optimality of the step-decay scheme: Given a fixed end time T, the step-decay scheme (Algorithm 1) presents a final iterate that is off the statistical minimax rate by just a log T factor, for optimizing both the strongly convex and the non-strongly convex case of quadratics (this dependence can be improved to a poly-logarithmic factor of the condition number, for the strongly convex case, using a more refined stepsize decay scheme; see Proposition 3), thus indicating vast improvements over polynomially decaying stepsize schedules. Algorithm 1 is rather straightforward and requires only the knowledge of an initial learning rate and the number of iterations for its implementation.

• SGD has to query bad points (or iterates) infinitely often: For the case of optimizing strongly convex quadratics, this work shows that any stochastic gradient procedure (in a sense) must query sub-optimal iterates (off the minimax rate by nearly a factor of the condition number) infinitely often.

Table 1 summarizes this paper’s results. Note that the sub-optimality of standard polynomially decaying stepsizes for classes of smooth and strongly convex optimization doesn’t contradict the (minimax) optimality results in stochastic approximation (Polyak and Juditsky, 1992). Iterate averaging coupled with polynomially decaying learning rates clearly does achieve minimax optimal statistical rates in the limit (Ruppert, 1988; Polyak and Juditsky, 1992). In fact, recent results for the special case of quadratics indicate that a constant learning rate coupled with iterate averaging achieves anytime minimax optimal statistical rates (as opposed to results that work with the knowledge of the time horizon) (Bach and Moulines, 2013; Jain et al., 2016, 2017b). However, as mentioned previously, this work deals with the behavior of the final iterate (i.e. without iterate averaging) of a stochastic gradient procedure, which is clearly of relevance to practice.

Extending results on the performance of Step Decay schemes to more general convex optimization problems, beyond stochastic optimization of quadratics, is an important future direction.

Related work:

Stochastic Gradient Descent (SGD) and the problem of stochastic approximation were introduced in the work of Robbins and Monro (1951). That work elaborates on stepsize schemes satisfied by asymptotically convergent stochastic gradient methods: we refer to these schemes as "convergent" stepsize sequences. The asymptotic statistical optimality of SGD equipped with larger stepsize sequences and iterate averaging was shown in Ruppert (1988); Polyak and Juditsky (1992). In terms of oracle models and notions of optimality, there exist two lines of thought, as elaborated below. See also Jain et al. (2017b) for a detailed discussion in this regard.

One line of thought considers the goal of matching the excess risk of the statistically optimal estimator (Anbar, 1971; Kushner and Clark, 1978; Polyak and Juditsky, 1992) on every problem instance. Several recent works (Bach and Moulines, 2013; Frostig et al., 2015; Dieuleveut et al., 2016; Jain et al., 2016, 2017b) present non-asymptotic results in this oracle model, in conjunction with iterate averaging, and achieve minimax rates (on a per-problem basis) (Lehmann and Casella, 1998; Kushner and Clark, 1978). This paper studies the final iterate of SGD and characterizes its behavior under this oracle model, with both the standard polynomially decaying stepsizes and the step decay schedule.

The other line of thought designs algorithms under worst case assumptions such as bounded noise, with the goal of matching the lower bounds provided in Nemirovsky and Yudin (1983); Raginsky and Rakhlin (2011); Agarwal et al. (2012). Working in this oracle model, various asymptotic properties of convergent learning rate schemes in the stochastic approximation literature have been studied in great detail (Kushner and Clark, 1978; Ljung et al., 1992; Bharath and Borkar, 1999; Kushner and Yin, 2003; Lai, 2003), for broad function classes. Using iterate averaged SGD, the efforts of Lacoste-Julien et al. (2012); Rakhlin et al. (2012); Ghadimi and Lan (2012, 2013); Bubeck (2014); Dieuleveut et al. (2016) achieve (near-)minimax rates for various problem classes. The work of Shamir and Zhang (2012) is closest in spirit to ours (despite working with a different oracle model), and presents near minimax rates (up to log factors) using the final iterate of an SGD procedure for non-smooth stochastic optimization, with/without strong convexity assumptions. Note that Harvey et al. (2018) established a lower bound showing that, with the "standard" polynomially decaying stepsizes (and when the end time is not known), the final iterate of SGD on non-smooth objectives necessarily suffers the extra logarithmic dependence on time, over the minimax rate (Nemirovsky and Yudin, 1983; Raginsky and Rakhlin, 2011; Agarwal et al., 2012), that appears in the upper bounds of Shamir and Zhang (2012).

Paper organization:

Section 2 describes notation and problem setup. Section 3 presents our results on the sub-optimality of polynomial decay schemes and the near optimality of the step decay scheme. Section 3.3 presents results on the anytime behavior of SGD (i.e. the asymptotic/infinite horizon case). Section 4 presents experimental results and Section 5 presents conclusions.

2 Problem Setup

Notation: We present the setup and associated notation in this section. We represent scalars with normal font (e.g., a, η), vectors with boldface lowercase characters (e.g., w, x), and matrices with boldface uppercase characters (e.g., H). We represent positive semidefinite (PSD) ordering between two matrices using ⪯. The symbol ≲ represents that the corresponding inequality holds up to some universal constant.

Our theoretical results focus on the stochastic approximation problem of (streaming) least squares regression and this involves minimizing the following expected square loss objective:

 min_{w∈ℝ^d} f(w), where f(w) := (1/2)·E_{(x,y)∼D}[(y − w⊤x)²]. (3)

Note that the Hessian of the problem is H := ∇²f(w) = E[xx⊤]. In this paper, we are provided access to stochastic gradients, obtained by sampling a fresh input-output pair (x_t, y_t) ∼ D and using it to compute an unbiased estimate of the gradient of the objective f. This stochastic gradient, evaluated at an iterate w_t, is represented as:

 ∇̂f(w_t) = −(y_t − ⟨w_t, x_t⟩)·x_t. (4)
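For concreteness, the oracle in equation 4 can be checked for unbiasedness by exact enumeration over a small discrete data distribution (the four data points below are illustrative): the averaged stochastic gradient must equal the full gradient E[xx⊤]w − E[yx].

```python
# Hedged sketch: the stochastic gradient of eq. (4) for streaming least
# squares, checked for unbiasedness by exact enumeration over a small
# discrete (uniform) data distribution. The data itself is illustrative.

def sgd_gradient(w, x, y):
    """hat-grad f(w) = -(y - <w, x>) x, for a single example (x, y)."""
    residual = y - sum(wi * xi for wi, xi in zip(w, x))
    return [-residual * xi for xi in x]

# Uniform distribution over four (x, y) pairs. For this data,
# H = E[x x^T] = 0.75 I and E[y x] = (1.5, -0.75).
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0),
        ([1.0, 1.0], 1.0), ([1.0, -1.0], 3.0)]
w = [0.5, -0.5]
avg_grad = [sum(sgd_gradient(w, x, y)[i] for x, y in data) / len(data)
            for i in range(2)]
# Full gradient H w - E[y x] = (0.75*0.5 - 1.5, 0.75*(-0.5) + 0.75).
```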

Our goal in this paper is to consider the stochastic gradient descent method (Robbins and Monro, 1951), wherein, given an initial iterate w_0 and a stepsize sequence {η_t}, we perform the following update:

 w_t = w_{t−1} − η_t·∇̂f(w_{t−1}).

For examples drawn from the underlying distribution D, the input x and the output y are related to each other as:

 y = ⟨w*, x⟩ + ϵ,

where ϵ is the noise on the example pair (x, y) and w* is a minimizer of the objective f. We assume that this noise satisfies the following condition:

 Σ := E[∇̂f(w*) ∇̂f(w*)⊤] = E_{(x,y)∼D}[(y − ⟨w*, x⟩)² xx⊤] ⪯ σ²H. (5)

Next, we assume that the covariates x satisfy the following fourth moment inequality:

 E[‖x‖² xx⊤] ⪯ R²·H. (6)

This assumption is satisfied, for instance, when the norm of the covariates is bounded (i.e., ‖x‖² ≤ R² almost surely), but it also holds in more general situations (i.e., this assumption is weaker than a bounded norm assumption).

Finally, note that both conditions 5 and 6 are fairly general and are used in several recent works (Bach and Moulines, 2013; Jain et al., 2016, 2017b) that present a sharp analysis of SGD (and its variants) for the streaming least squares regression problem. Next, we denote by

 μ := λ_min(H), L := λ_max(H), and κ := R²/μ

the smallest eigenvalue, the largest eigenvalue and the condition number of H, respectively. We have μ > 0 in the strongly convex case, but not necessarily so in the non-strongly convex case (in Section 3, the non-strongly convex quadratic objective is referred to as the "smooth" case).

Let w* ∈ argmin_w f(w). The excess risk of an iterate w is given by f(w) − f(w*). It is well known that, given t accesses to the stochastic gradient oracle in equation 4, any algorithm that uses these stochastic gradients and outputs an estimate ŵ_t has sub-optimality lower bounded by σ²d/t. More concretely, we have (Van der Vaart, 2000):

 lim_{t→∞} (E[f(ŵ_t)] − f(w*)) / (σ²d/t) ≥ 1.

There exist schemes that achieve this rate of σ²d/t, e.g., constant stepsize SGD with iterate averaging (Ruppert, 1988; Polyak and Juditsky, 1992; Bach and Moulines, 2013). This rate of σ²d/t is called the statistical minimax rate.

3 Main results

In this section, we present the main results of this paper. We begin with the sub-optimality of polynomially decaying stepsizes (Section 3.1) and the (surprising) near-optimal behavior of the step-decay schedule (Section 3.2), followed by the fundamental limitation of SGD that makes it query points with highly sub-optimal function values infinitely often (Section 3.3).

3.1 Suboptimality of polynomial decay schemes

This paper begins by showing that there exist problem instances where the traditional polynomial decay schemes presented by the theory of stochastic approximation (Robbins and Monro (1951); Polyak and Juditsky (1992)), i.e., those of the form η_t = a/(b + t)^α for any choice of a, b ≥ 0 and α ∈ [0, 1], are significantly suboptimal (by a factor of the condition number of the problem) compared to the statistical minimax rate (Kushner and Clark, 1978).

Theorem 1.

Under assumptions 5 and 6, there exists a class of problem instances where the following lower bounds hold on the final iterate of a stochastic gradient procedure with polynomially decaying stepsizes, when given access to the oracle as written in equation 4.

Strongly convex case: Suppose μ > 0. For any condition number κ, there exists a problem instance with initial suboptimality f(w_0) − f(w*) such that, for any end time T, for all a, b ≥ 0 and α ∈ [0, 1], and for the learning rate scheme η_t = a/(b + t)^α, we have

 E[f(w_T)] − f(w*) ≥ exp(−T/(κ log T))·(f(w_0) − f(w*)) + (σ²d/64)·(κ/T).

Smooth case: For any fixed end time T, there exists a problem instance such that, for all a, b ≥ 0 and α ∈ [0, 1], and for the learning rate scheme η_t = a/(b + t)^α, we have

 E[f(w_T)] − f(w*) ≥ (R²·‖w_0 − w*‖² + σ²d)·1/(√T·log T).

In both of the cases above, the statistical minimax rate is σ²d/T. In the strongly convex case, we thus have a suboptimality factor of κ, and in the smooth case, a suboptimality factor of order √T (up to log factors).

3.2 Near optimality of Step Decay schemes

This section presents results on the Step Decay schedules. In particular, given the knowledge of an end time T when the algorithm is terminated, the step decay learning rate schedule (Algorithm 1) offers significant improvements over standard polynomially decaying stepsize schemes, and obtains near minimax rates (off by only a log T factor).

Theorem 2.

Suppose we are given access to the stochastic gradient oracle 4 satisfying assumptions 5 and 6. Running Algorithm 1 with an initial stepsize of 1/(2R²) allows the algorithm to achieve the following excess risk guarantees.

• Strongly convex case: Suppose μ > 0. We have:

 E[f(w_T)] − f(w*) ≲ exp(−T/(κ log T))·(f(w_0) − f(w*)) + σ²d·(log T)/T.

• Smooth case: We have:

 E[f(w_T)] − f(w*) ≤ 2·(R²d·‖w_0 − w*‖² + 2σ²d)·(log T)/T.

We note that, while the above theorem presents significant improvements over standard polynomial decay (or constant learning rate schemes (Polyak and Juditsky, 1992; Bach and Moulines, 2013; Défossez and Bach, 2015; Jain et al., 2016)) with iterate averaging, the result presents a worse rate on the initial error (by a dimension factor) in the smooth case, compared to the best known result (Bach and Moulines, 2013), which relies heavily on iterate averaging to remove this factor. It is an open question whether this factor can be improved. The above result shows that the Step Decay scheme significantly improves over polynomial decay schemes, which are plagued by a polynomial dependence on the condition number in the variance of the final iterate. Furthermore, note that Algorithm 1 requires access only to R² (just as standard SGD for least squares (Bach and Moulines, 2013; Jain et al., 2016)) and knowledge of the end time T; it does not require access to the strong convexity parameter, in contrast to standard results for the strongly convex setting (e.g., Rakhlin et al. (2012); Shamir and Zhang (2012); Lacoste-Julien et al. (2012); Bubeck (2014)), which achieve their rates given access to the strong convexity parameter (which is often harder to obtain in practice), and, more often than not, using iterate averaging. These results are off from the statistical minimax rates achieved using iterate averaging (Kushner and Clark, 1978; Polyak and Juditsky, 1992) by only a log T factor. Note that this factor can be improved to a log²κ factor for the strongly convex quadratic case by using an additional polynomial decay scheme in the beginning, before switching to the Step Decay scheme.

Proposition 3.

Suppose we are given access to the stochastic gradient oracle 4 satisfying assumptions 5 and 6. Let μ > 0 and let κ = R²/μ. For any problem instance and fixed time horizon T, there exists a learning rate scheme that achieves

 E[f(w_T)] − f(w*) ≤ 2·exp(−T/(6κ log κ))·(f(w_0) − f(w*)) + 100 log²κ·(σ²d/T).

Note that, in order to improve the dependence on the variance from log T (in Theorem 2) to log²κ (in Proposition 3), we do require access to the strong convexity parameter μ, in addition to R² and the knowledge of the end time T. However, this is indeed the case even for standard analyses for the strongly convex setting, e.g., Rakhlin et al. (2012); Shamir and Zhang (2012); Lacoste-Julien et al. (2012); Bubeck (2014).

As a final remark, recall that our results in this section (on step decay schemes) assumed the knowledge of a fixed time horizon. In contrast, most results on SGD's averaged iterate obtain anytime (i.e., limiting/infinite horizon) guarantees. Can we hope to achieve such guarantees with the final iterate?

3.3 SGD queries bad points infinitely often

Our main result in this section shows that obtaining near statistical minimax rates with the final iterate is not possible without knowledge of the time horizon T. More concretely, we show the following limitation of SGD for the strongly convex quadratic case: for any learning rate sequence (be it polynomially decaying or step-decay), SGD must query a point whose sub-optimality is at least a κ/log κ factor above the statistical minimax rate for infinitely many time steps t.

Theorem 4.

Suppose we are given access to a stochastic gradient oracle 4 satisfying assumptions 5 and 6. There exists a universal constant C > 0 and a problem instance such that, for the SGD algorithm run with any stepsizes satisfying η_t ≲ 1/R² for all t (learning rates larger than this order will make the algorithm diverge), we have

 limsup_{T→∞} (E[f(w_T)] − f(w*)) / (σ²d/T) ≥ C·κ/log(κ + 1).

The bad points guaranteed to exist by Theorem 4 are not rare: one can in fact show that such points recur at regular intervals. This claim is formalized in Theorem 16 in Appendix D.

4 Experimental Results

We present experimental validation of the suitability of the Step Decay schedule (or, more precisely, its continuous counterpart, the exponentially decaying schedule), and compare it with polynomially decaying stepsize schedules. In particular, we consider the use of an exponentially decaying scheme (7) and two polynomially decaying schemes (8) and (9), where we perform a systematic grid search on the parameters of each scheme. In the section below, we consider a real world non-convex optimization problem of training a residual network on the cifar-10 dataset, with an aim to illustrate the practical implications of the results described in the paper. Complete details of the setup are given in Appendix E.
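The grid-search protocol can be sketched as follows. The two schedule families and the synthetic `validate` stub (a deterministic one-dimensional expected-risk computation standing in for "train, then measure validation error") are illustrative stand-ins, not the paper's exact parametrizations (7)-(9):

```python
# Hedged sketch of the grid-search protocol: evaluate each candidate
# schedule with a validation score and keep the best. All families,
# grids, and the `validate` stub are illustrative.
import itertools

def exponential(eta0, gamma):
    return lambda t: eta0 * gamma ** t

def polynomial(eta0, b, alpha):
    return lambda t: eta0 / (1.0 + b * t) ** alpha

def validate(schedule, T=200):
    """Stand-in for training + validation: deterministic expected risk of
    SGD on a 1-d quadratic (h = 1, sigma^2 = 0.01), smaller is better."""
    v, h, sigma2 = 1.0, 1.0, 0.01
    for t in range(T):
        eta = schedule(t)
        v = (1 - eta * h) ** 2 * v + eta ** 2 * sigma2 * h
    return 0.5 * h * v

candidates = [("exp", p, exponential(*p))
              for p in itertools.product([0.5, 1.0], [0.95, 0.99])]
candidates += [("poly", p, polynomial(*p))
               for p in itertools.product([0.5, 1.0], [0.1, 1.0], [0.5, 1.0])]
best = min(candidates, key=lambda c: validate(c[2]))
```

In the paper's protocol, `validate` would be replaced by an actual training run scored on the held-out validation split, and the grid would be extended until the best parameter lies in its interior.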

4.1 Non-Convex Optimization: Training a Residual Net on cifar-10

Consider the task of training a 44-layer deep residual network (He et al., 2016b) with pre-activation blocks (He et al., 2016a) (dubbed preresnet-44) for the cifar-10 classification problem; an open-source implementation of this architecture is publicly available. For all experiments, we use Nesterov's accelerated gradient method (Nesterov, 1983) as implemented in pytorch, with the momentum, batch size, number of training epochs, and ℓ₂ regularization fixed across runs as detailed in Appendix E.

Our experiments are based on grid searching for the best learning rate decay scheme over the parametric family of learning rate schemes described above ((7)-(9)); all grid searches are performed on a separate validation set (obtained by setting aside one-tenth of the training dataset), with models trained on the remaining samples. For the final numbers presented in the plots/tables, we employ the best hyperparameters from the validation stage, train on the entire training set, and average results over runs with different random seeds. The parameters for the grid searches and other details are presented in Appendix E. Furthermore, we always extend the grid so that the best performing parameter lies in the interior of our grid search.

Comparison between different schemes: Figure 2 and Table 2 present a comparison of the performance of the three schemes (7)-(9). They demonstrate that the exponential scheme outperforms the polynomial step-size schemes.

Hyperparameter selection using truncated runs: Figure 3 and Tables 3 and 4 present a comparison of the performance of three exponential decay schemes, each of which performs best at one of three increasing (truncated) epoch budgets. The key point to note is that the best performing hyperparameters at the two shorter budgets are not the best performing at the full budget (which is made stark from the perspective of the validation error). This demonstrates that hyperparameter selection using truncated runs (e.g., as in hyperband (Li et al., 2017)) might necessitate rethinking.

5 Conclusions and Discussion

The main contribution of this work is to show that the issue of learning rate scheduling is far more nuanced than suggested by prior theoretical results: one does not even need to move to non-convex optimization to show that schemes starkly different from the traditional polynomially decaying stepsizes considered in theory can be far more effective. This is important from a practical perspective, in that the Step Decay schedule is widely used in practical SGD implementations for both convex and non-convex optimization.

Is quadratic loss minimization special? One may ask whether there is something particularly special about quadratic loss minimization that makes its minimax rates differ from those of more general convex (and non-convex) optimization problems. Ideally, we would hope that our theoretical results can be formalized in more general cases: this would serve as an exciting direction for future research. Interestingly, Allen-Zhu (2018) shows marked improvements for making the gradient norm small (as opposed to function values, as considered in this paper), when working with stochastic gradients, for general function classes, with factors that appear similar to the ones obtained in this work.

Acknowledgements: Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery, and the NSF Award 1740551, and the ONR award N00014-18-1-2247.

Appendix A Preliminaries

Before presenting the lemmas establishing the behavior of SGD under various learning rate schemes, we introduce some notation. Recall that the SGD update rule is given by:

 wt=wt−1−ηtˆ∇f(wt−1)

We then write out the expression for the stochastic gradient ∇̂f(w_{t−1}):

 ˆ∇f(wt−1)=xtx⊤t(wt−1−w∗)−ϵtxt,

where, given an example pair (x_t, y_t) with y_t = ⟨w*, x_t⟩ + ϵ_t, the above stochastic gradient expression naturally follows. Now, in order to analyze the contraction properties of the SGD update rule, we require the following notation:

 Pt=I−ηtxtx⊤t.
Lemma 5.

[See, e.g., Appendix A.2.2 of Jain et al. (2016)] Bias-variance tradeoff: Running SGD for T steps starting from w_0 with a stepsize sequence {η_t} yields a final iterate whose excess risk is upper bounded as:

 ⟨H, E[(w_T − w*) ⊗ (w_T − w*)]⟩ ≤ 2·(⟨H, E[P_T⋯P_1·(w_0 − w*) ⊗ (w_0 − w*)·P_1⋯P_T]⟩ + ⟨H, Σ_{τ=1}^T η_τ²·E[P_T⋯P_{τ+1}·n_τ ⊗ n_τ·P_{τ+1}⋯P_T]⟩),

where n_τ := ϵ_τ x_τ and P_t := I − η_t x_t x_t⊤. Note that E[n_τ | F_{τ−1}] = 0 and E[n_τ ⊗ n_τ] ⪯ σ²H, where F_τ is the filtration formed by all samples drawn until time τ.

Proof.

One can view the two terms above as stemming from unrolling SGD's updates, which can be written as:

 w_t = w_{t−1} − η_t·∇̂f(w_{t−1})
 w_t − w* = (I − η_t x_t x_t⊤)(w_{t−1} − w*) + η_t n_t
 w_t − w* = P_t⋯P_1·(w_0 − w*) + Σ_{τ=1}^t P_t⋯P_{τ+1}·η_τ n_τ

From the above equation, the result of the lemma follows straightforwardly. Now, clearly, if the noise and the inputs are independent of each other, and if the noise is zero mean (i.e., E[n_t | F_{t−1}] = 0), the above inequality holds with equality (without the factor of two). This is true more generally iff the cross terms between the two contributions vanish in expectation.

For more details, refer to Défossez and Bach (2015).

Now, in order to bound the total error, note that the original stochastic process associated with SGD's updates can be decoupled into two (simpler) processes. The first is the noiseless process (which captures the decay of the initial error, and is termed "bias"), whose recurrence evolves as:

 w_t^bias − w* = P_t·(w_{t−1}^bias − w*), with w_0^bias = w_0. (10)

The second recursion corresponds to the dependence on the noise (termed "variance"), wherein the process is initiated at the solution, i.e., w_0^var = w*, and is driven by the noise n_t. The update for this process is:

 w_t^var − w* = P_t·(w_{t−1}^var − w*) + η_t n_t, with w_0^var = w* (11)
 = Σ_{τ=1}^t P_t⋯P_{τ+1}·(η_τ n_τ).
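Since all three recursions are linear and share the same matrices P_t and noises n_t, the full SGD error decomposes exactly as the sum of the bias process (10) and the variance process (11). This exactness can be verified numerically; the dimensions, stepsize and noise level below are illustrative:

```python
# Hedged sketch: verify that the SGD error equals bias + variance exactly
# when the two decoupled processes are driven by the same samples and
# noises. All problem constants here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, T, sigma, eta = 3, 50, 0.1, 0.01
w_star = rng.normal(size=d)
w0 = rng.normal(size=d)

err, bias, var = w0 - w_star, w0 - w_star, np.zeros(d)
for t in range(T):
    x = rng.normal(size=d)
    eps = sigma * rng.normal()
    P = np.eye(d) - eta * np.outer(x, x)
    n = eps * x                  # n_t = eps_t x_t
    err = P @ err + eta * n      # full SGD error recursion
    bias = P @ bias              # noiseless bias process (10)
    var = P @ var + eta * n      # noise-driven variance process (11)
```

At every step, `err` coincides with `bias + var` up to floating point error, which is precisely why the excess risk splits into the two terms of Lemma 5.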

We denote by B_t the covariance of the t-th iterate of the bias process, i.e.,

 B_t = E[(w_t^bias − w*)(w_t^bias − w*)⊤] = E[P_t B_{t−1} P_t⊤] = E[P_t⋯P_1 B_0 P_1⋯P_t].

The quantity that routinely shows up when bounding SGD's convergence behavior is the covariance of the variance error, i.e., V_t := E[(w_t^var − w*) ⊗ (w_t^var − w*)]. Since the noise is zero mean, the cross terms vanish, and this yields the following (simplified) expression for V_t:

 V_t = Σ_{τ,τ'} E[P_t⋯P_{τ+1}·(η_τ n_τ) ⊗ (η_{τ'} n_{τ'})·P_{τ'+1}⋯P_t] = Σ_{τ=1}^t η_τ²·E[P_t⋯P_{τ+1}·n_τ ⊗ n_τ·P_{τ+1}⋯P_t].

Firstly, note that this naturally implies that the sequence of covariances, initialized at the solution (i.e., V_0 = 0), grows monotonically to its steady state covariance, i.e.,

 V_1 ⪯ V_2 ⪯ ⋯ ⪯ V_∞.

See Lemma 3 of Jain et al. (2017a) for more details. Furthermore, the recursion relating V_t to V_{t−1} naturally follows as:

 V_t ⪯ E[P_t V_{t−1} P_t⊤] + η_t² σ² H. (12)

Lemma 6 (Lemma 5 of Jain et al. (2017a)).

Running SGD with a (constant) stepsize sequence η_t = η achieves the following steady-state covariance:

 V_∞ ⪯ (η σ² / (1 − η R²))·I.
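In the scalar case with a deterministic covariate x of squared norm r², the variance recursion can be iterated exactly, giving a quick check of the steady-state bound (the numbers η = 0.1 and σ² = r² = 1 are illustrative):

```python
# Hedged scalar check of Lemma 6's steady-state bound. With a deterministic
# covariate (x^2 = r2, so H = r2 and R^2 = r2), the variance recursion is
# exactly v_t = (1 - eta*r2)^2 v_{t-1} + eta^2 sigma2 r2. Constants are
# illustrative.
r2, eta, sigma2 = 1.0, 0.1, 1.0
v = 0.0
for _ in range(10_000):
    v = (1 - eta * r2) ** 2 * v + eta ** 2 * sigma2 * r2
bound = eta * sigma2 / (1 - eta * r2)   # Lemma 6's steady-state bound
```

Here the iteration converges to η²σ²r² / (1 − (1 − ηr²)²) ≈ 0.0526, safely below the bound 0.1/0.9 ≈ 0.111.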
Lemma 7.

Suppose η ≤ 1/(2R²) and V_0 = 0. Then, for any sequence of learning rates {η_t} with η_t ≤ η for all t,

 V_t ⪯ 2ησ²·I ∀ t.
Proof.

We will prove the lemma using an inductive argument. The base case follows from the problem statement: for SGD, V_0 = 0, so the claim holds at t = 0. Now suppose V_t satisfies the claimed bound; from equation 12, we have the following bound on the covariance V_{t+1}:

 V_{t+1} ⪯ E[P_t V_t P_t⊤] + η_t²σ²H
 = E[(I − η_t x_t x_t⊤) V_t (I − η_t x_t x_t⊤)] + η_t²σ²H
 ⪯ 2ησ²·E[(I − η_t x_t x_t⊤)(I − η_t x_t x_t⊤)] + η_t²σ²H
 ⪯ 2ησ²·I − 4η_t ησ²·H + 2η_t²ησ²R²·H + η_t²σ²H
 ⪯ 2ησ²·I − 2η_t ησ²·H + η_t²σ²H
 = 2ησ²·I + η_t·(η_t − 2η)·σ²H
 ⪯ 2ησ²·I,

from which the lemma follows. ∎

Lemma 8.

(Reduction from the multiplicative noise oracle) Let V_t be the (expected) covariance of the variance error. Then the recursion connecting V_{t+1} to V_t can be expressed as:

 V_{t+1} ⪯ (I − η_t H) V_t (I − η_t H) + 2η_t²σ²H.
Proof.

From equation 12, we already know that the evolution of the covariance of the variance error follows:

 V_{t+1} ⪯ E[P_t V_t P_t⊤] + η_t²σ²H
 ⪯ (I − η_t H) V_t (I − η_t H) + η_t²·E[x_t x_t⊤ V_t x_t x_t⊤] + η_t²σ²H
 ⪯ (I − η_t H) V_t (I − η_t H) + η_t²·‖V_t‖₂ R²·H + η_t²σ²H
 ⪯ (I − η_t H) V_t (I − η_t H) + η_t²·2ησ²R²·H + η_t²σ²H
 ⪯ (I − η_t H) V_t (I − η_t H) + 2η_t²σ²H,

where the steps follow from Lemma 7 (to bound ‖V_t‖₂ ≤ 2ησ²) and from the fact that 2ηR² ≤ 1. ∎

Note: Basically, one could analyze an auxiliary process driven by noise with variance off by a factor of two and convert the analysis into one involving exact (deterministic) gradients.

Lemma 9.

[Bias decay - strongly convex case] Let the minimal eigenvalue of the Hessian be μ := λ_min(H) > 0. Consider the bias recursion as in equation 10 with the stepsize set as η = 1/(2R²). Then,

 E[‖w_t^bias − w*‖²] ≤ (1 − 1/(2κ))·E[‖w_{t−1}^bias − w*‖²].
Proof.

The proof follows through straightforward computations:

 E[‖w_t^bias − w*‖²] ≤ E[‖w_{t−1}^bias − w*‖²] − η·E[‖w_{t−1}^bias − w*‖²_H]
 ≤ (1 − ημ)·E[‖w_{t−1}^bias − w*‖²],

where the first line follows by expanding the square and using η ≤ 1/(2R²) (together with assumption 6) to absorb the second order term, and the result follows through the definitions of μ and of κ = R²/μ, since ημ = 1/(2κ). ∎
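This per-step contraction can be checked exactly on a tiny discrete covariate distribution (the distribution below is illustrative: x is one of two axis-aligned vectors with equal probability, so H = diag(2, 0.5), R² = 4, μ = 1/2, κ = 8, and every expectation is a finite sum):

```python
# Hedged exact check of the contraction factor (1 - 1/(2*kappa)) with
# eta = 1/(2 R^2), enumerated over a tiny illustrative discrete
# distribution of covariates that satisfies assumption (6) with R^2 = 4.

xs = [((2.0, 0.0), 0.5), ((0.0, 1.0), 0.5)]   # (covariate, probability)
R2, mu = 4.0, 0.5                              # R^2 from (6); mu = lambda_min(H)
eta = 1.0 / (2 * R2)
kappa = R2 / mu

def expected_contraction(v):
    """E ||(I - eta x x^T) v||^2, computed exactly by enumeration."""
    total = 0.0
    for (x, p) in xs:
        dot = sum(xi * vi for xi, vi in zip(x, v))
        u = [vi - eta * dot * xi for xi, vi in zip(x, v)]
        total += p * sum(ui * ui for ui in u)
    return total

for v in [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-0.3, 0.7)]:
    norm2 = sum(vi * vi for vi in v)
    assert expected_contraction(v) <= (1 - 1 / (2 * kappa)) * norm2
```

Every direction v contracts by at least the factor 1 − 1/(2κ) = 15/16, matching the lemma.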

Lemma 10.

[Reduction of the bias recursion with multiplicative noise to one resembling the variance recursion] Consider the bias recursion that evolves as

 B_t = E[(w_t − w*)(w_t − w*)⊤] = E[(I − γ_t x_t x_t⊤) B_{t−1} (I − γ_t x_t x_t⊤)], with B_0 = (w_0 − w*)(w_0 − w*)⊤.

Then, the following recursion holds:

 B_t ⪯ (I − γ_t H) B_{t−1} (I − γ_t H) + γ_t² R² ‖w_0 − w*‖²·H.
Proof.

The result follows owing to the following computations:

 B_t = E[(w_t − w*)(w_t − w*)⊤]
 = E[(I − γ_t x_t x_t⊤) B_{t−1} (I − γ_t x_t x_t⊤)]
 ⪯ (I − γ_t H) B_{t−1} (I − γ_t H) + γ_t²·E[(x_t⊤ B_{t−1} x_t)·x_t x_t⊤]
 ⪯ (I − γ_t H) B_{t−1} (I − γ_t H) + γ_t²·E[‖w_{t−1} − w*‖²]·R² H
 ⪯ (I − γ_t H) B_{t−1} (I − γ_t H) + γ_t²·E[‖w_0 − w*‖²]·R² H,

with the last inequality holding true if the squared distance to the optimum does not grow (in expectation) along the recursion. We verify that this is indeed the case:

 E[‖w_{t−1} − w*‖²] = E[‖(I − γ_{t−1} x_{t−1} x_{t−1}⊤)(w_{t−2} − w*)‖²] ≤ E[‖w_{t−2} − w*‖²],

where the inequality holds for γ_{t−1} ≤ 1/(2R²), as in Lemma 9. Recursively applying the above argument yields the desired result. ∎

Note: This result implies that the bias error (in the smooth, non-strongly convex case of least squares regression with multiplicative noise) can be bounded with a lemma similar to that for the variance, where the quantity R²‖w_0 − w*‖² plays the role of the noise variance σ² that drives the process.

Lemma 11.

[Lower bounds for the additive noise oracle imply lower bounds for the multiplicative noise oracle] Suppose the noise covariance satisfies Σ = σ²H (i.e., equality holds in equation 5). Let V_t be the (expected) covariance of the variance error, so that the recursion connecting V_{t+1} to V_t can be expressed as:

 V_{t+1} = E[(I − η_t x_t x_t⊤) V_t (I − η_t x_t x_t⊤)] + η_t²σ²H.

Then,

 V_{t+1} ⪰ (I − η_t H) V_t (I − η_t H) + η_t²σ²H.
Proof.

Let us first consider the setting of (bounded) additive noise. Here, we have:

 ∇̂f(w_t) = H(w_t − w*) + ζ_t, with E[ζ_t | w_t] = 0 and E[ζ_t ζ_t⊤ | w_t] = σ²H.