The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure

04/29/2019 ∙ by Rong Ge, et al. ∙ Duke University, Microsoft, University of Washington

There is a stark disparity between the step size schedules used in practical large scale machine learning and those that are considered optimal by the theory of stochastic approximation. In theory, most results utilize polynomially decaying learning rate schedules, while, in practice, the "Step Decay" schedule is among the most popular schedules, where the learning rate is cut every constant number of epochs (i.e. this is a geometrically decaying schedule). This work examines the step-decay schedule for the stochastic optimization problem of streaming least squares regression (both in the non-strongly convex and strongly convex case), where we show that a sharp theoretical characterization of an optimal learning rate schedule is far more nuanced than suggested by previous work. We focus specifically on the rate that is achievable when using the final iterate of stochastic gradient descent, as is commonly done in practice. Our main result provably shows that a properly tuned geometrically decaying learning rate schedule provides an exponential improvement (in terms of the condition number) over any polynomially decaying learning rate schedule. We also provide experimental support for wider applicability of these results, including for training modern deep neural networks.


1 Introduction

Large scale machine learning and deep learning rely almost exclusively on stochastic optimization methods, primarily SGD (Robbins and Monro, 1951) and variants. Such methods are heavily tuned to the problem at hand (often with parallelized hyper-parameter searches (Li et al., 2017)). There are two predominant approaches in stochastic optimization: methods that decay the learning rate according to a schedule to achieve the best performance (Krizhevsky et al., 2012; Sutskever et al., 2013; Kidambi et al., 2018), and methods that rely on various forms of approximate preconditioning (Duchi et al., 2011; Tieleman and Hinton, 2012; Kingma and Ba, 2014) to obtain reasonably accurate results on classes of problem instances (often) with minimal hyper-parameter tuning. This work examines the former class of methods, where our goal is to present a more refined characterization of optimal learning rate schedules, through both sharp theoretical analysis (on the special case of convex quadratics) and empirical studies.

In this work, we restrict our attention to the SGD algorithm, and we are concerned with the behavior of the final iterate (i.e. the last point produced when we choose to terminate the algorithm). While the majority of (minimax optimal) theoretical results for SGD focus on iterate averaging techniques (e.g. Polyak and Juditsky (1992)), practical implementations of SGD predominantly return the final iterate of the SGD procedure. Thus, it is of importance (from both theoretical and practical perspectives) to quantify what is achievable with the final iterate of an SGD procedure.

In theory, it is known that the final iterate of SGD (Robbins and Monro, 1951) will (asymptotically) converge to the (local) minimizer only if the learning rates are not summable but are square summable (the former condition ensures that initial conditions are forgotten, and the latter ensures that the error due to the noise goes to zero) (Kushner and Clark, 1978; Kushner and Yin, 2003). In particular, most theoretically studied learning rate schedules are of the form η_t = a / t^α for some a > 0 and α ∈ (1/2, 1] (Robbins and Monro, 1951; Polyak and Juditsky, 1992) – we refer to these schedules as polynomial decay schemes; such polynomial decay schemes are convergent precisely because they are not summable but are square summable. Furthermore, it is known that such polynomial decay schemes can yield near-minimax optimal rates (up to log factors) on the final iterate for certain classes of non-smooth stochastic convex optimization problems (Shamir and Zhang, 2012), with/without strong convexity.
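To make these convergence conditions concrete, the two summability requirements for this polynomial family work out as follows (a standard calculation, included here for completeness):

    η_t = a / t^α :    Σ_{t≥1} η_t = ∞  ⟺  α ≤ 1,        Σ_{t≥1} η_t² < ∞  ⟺  α > 1/2,

so both conditions hold exactly when 1/2 < α ≤ 1, which is the regime in which these schedules are convergent.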

In practice, a widely used stepsize schedule involves cutting the learning rate by a constant factor every constant number of epochs; such schemes are referred to as “Step Decay” schedules (see, e.g., Towards Data Science: Stepsize schedules). Clearly, such a scheme decays the learning rate geometrically, and is therefore a non-convergent scheme (from the stochastic approximation perspective). However, in practice, the schedule at which the rate is (geometrically) cut is tuned (see http://cs231n.github.io/neural-networks-3/#baby) to obtain good performance when the algorithm is terminated (Krizhevsky et al., 2012; He et al., 2016b), as opposed to one that obtains the best rates in the limit of a large number of updates. Such schemes are so widely used that they are available as a standard option in popular deep learning libraries such as PyTorch (see the PyTorch learning rate schedulers, e.g. “reduce on plateau”) and TensorFlow (https://www.tensorflow.org/api_docs/python/tf/train/exponential_decay).
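As a concrete illustration, the following is a minimal sketch of how such a schedule is typically invoked in PyTorch; the model and the step_size/gamma values are illustrative placeholders, not the settings used in this paper's experiments:

    import torch

    model = torch.nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    # "Step Decay": cut the learning rate by a factor of 10 every 30 epochs.
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)

    for epoch in range(90):
        # ... forward/backward passes and opt.step() for one epoch go here ...
        sched.step()  # lr: 0.1 for epochs 0-29, 0.01 for 30-59, 0.001 after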

Input: Initial vector w_0, starting learning rate γ, number of steps T
Output: final iterate w_T
1 for ℓ = 1 to log₂(T) do
2       η_ℓ ← γ / 2^(ℓ−1)
3       for t = 1 to T / log₂(T) do
4             w ← w − η_ℓ · ∇̂L(w)
5       end for
6 end for
Algorithm 1: Step Decay scheme
Figure 1: (Left) The Step Decay scheme for stochastic gradient descent. Note that the algorithm requires just two parameters – the starting learning rate γ and the number of iterations T. (Right) Plot of the function value error (in log scale) of the final iterate vs. condition number κ for the polynomially decaying stepsizes (equations (7), (8)) and the smoothed geometrically decaying (i.e. exponentially decaying) stepsizes (equation (9)), for the 2-d quadratic problem (equation (3)), which also captures the behavior of a 2-d linear regression problem. The condition number κ is varied over a wide range, and an exhaustive grid search is performed over all stepsize parameters (η₀ and b in equations (7)–(9)). The initial error and the total number of steps are held fixed, and results are averaged over random seeds. Observe that the final iterate's error grows linearly as a function of the condition number for the polynomially decaying stepsize schemes, whereas the error grows only logarithmically in κ for the smoothed geometric stepsize scheme. For details, refer to Section E.1 in Appendix E.
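For concreteness, the following is a small NumPy sketch of Algorithm 1 as reconstructed above (halve the learning rate every T/log₂(T) steps); the function and variable names, and the choice of exactly halving, reflect our reading of the scheme rather than a reference implementation:

    import numpy as np

    def step_decay_sgd(grad_oracle, w0, gamma, T):
        """Step Decay SGD sketch: log2(T) phases of T/log2(T) steps each,
        halving the learning rate after every phase."""
        w = np.array(w0, dtype=float)
        n_phases = max(1, int(np.log2(T)))
        phase_len = max(1, T // n_phases)
        lr = gamma
        for _ in range(n_phases):
            for _ in range(phase_len):
                w = w - lr * grad_oracle(w)
            lr = lr / 2.0  # geometric cut
        return w

Here grad_oracle(w) can be any unbiased stochastic gradient; the streaming least-squares oracle of Section 2 is one instantiation.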

Assumptions                          | Minimax rate | Rate w/ final iterate, best poly-decay                     | Rate w/ final iterate, Step Decay
General convex functions             | O(1/√T)      | near-minimax, up to log factors (Shamir and Zhang, 2012)   | –
Non-strongly convex quadratics (2)   | σ²d/T        | σ²d/√T (This work – Theorem 1)                             | σ²d·(log T)/T (This work – Theorem 2)
General strongly convex functions    | O(1/T)       | near-minimax, up to log factors (Shamir and Zhang, 2012)   | –
Strongly convex quadratics (2)       | σ²d/T        | κ·σ²d/T (This work – Theorem 1)                            | σ²d·(log T)/T (This work – Theorem 2)

Table 1: Comparison of the sub-optimality of the final iterate of SGD (i.e., w_T) in different settings. The minimax rate refers to the best possible worst case rate with access to T stochastic gradients (typically achieved with iterate averaging methods (Polyak and Juditsky, 1992; Ghadimi and Lan, 2012)); the table shows the multiplicative factor increase (over the minimax rate) incurred by the final iterate under two different learning rate decay schedules. Polynomial decay rates are of the form η_t = a/(b + t^α) (for appropriately chosen a, b, α). For the general cases above, the polynomial decay schemes achieve near optimal rates on the final iterate. Throughout, ∇̂L denotes the stochastic gradient, ∇L the gradient, and H the Hessian of the objective L. For quadratics, we assume the fourth-moment and noise conditions summarized in equation (2). This assumption is satisfied by the multiplicative noise that is introduced when employing sampled stochastic gradients, and it features in several recent efforts (Bach and Moulines, 2013; Jain et al., 2016, 2017b). While the polynomial decay schemes are nearly minimax optimal for general (strongly) convex functions, they are notably suboptimal for convex quadratics. The geometrically decaying Step Decay schedule provides marked improvements over any polynomial decay scheme for convex quadratics. For simplicity of presentation, the results for quadratics do not show the dependence on the initial error. See Theorems 1 and 2 for precise statements (and Nemirovsky and Yudin (1983); Ghadimi and Lan (2012); Shamir and Zhang (2012) for precise statements of the general case).

Our contributions:

This work establishes near-optimality of the step-decay schedule (Algorithm 1) on the final iterate of an SGD procedure (with a known time horizon T). In particular, the variance of the final iterate under a step-decay schedule is shown to offer an exponential improvement over that of the polynomially decaying step size schemes standard in the theory of stochastic approximation (Kushner and Yin, 2003). Figure 1 illustrates that this difference is evident (empirically) even when optimizing a two-dimensional convex quadratic. Table 1 provides a summary.

Our main contributions are as follows:

  • Sub-optimality of polynomially decaying learning rate schemes: For the case of optimizing strongly convex quadratics, this work shows that the final iterate of a polynomially decaying stepsize scheme (i.e., η_t = a/(b + t^α), with α ≤ 1) is off the statistical minimax rate by a factor of the condition number κ of the problem. For the non-strongly convex case of optimizing quadratics, any polynomially decaying stepsize scheme achieves a rate that is polynomially (in the horizon T) worse than the statistical minimax rate of σ²d/T. We note that our main Theorem 2, for the non-strongly convex case of quadratics, offers a rate on the initial error (i.e., the bias term) that is off the best known rate (Bach and Moulines, 2013) (which employs iterate averaging) by a dimension factor.

  • Near-optimality of the step-decay scheme: Given a fixed end time T, the step-decay scheme (Algorithm 1) presents a final iterate that is off the statistical minimax rate by just a log T factor, for both the strongly convex and the non-strongly convex case of quadratics (this dependence can be improved to a log of the condition number, for the strongly convex case, using a more refined stepsize decay scheme; see Proposition 3), thus indicating vast improvements over polynomially decaying stepsize schedules. Algorithm 1 is rather straightforward and requires knowledge of just an initial learning rate γ and the number of iterations T for its implementation.

  • SGD has to query bad points (or iterates) infinitely often: For the case of optimizing strongly convex quadratics, this work shows that any stochastic gradient procedure (in a sense made precise in Section 3.3) must query sub-optimal iterates (off by nearly a condition number factor) infinitely often.

Table 1 summarizes this paper’s results. Note that the sub-optimality of standard polynomially decaying stepsizes for classes of smooth and strongly convex optimization doesn’t contradict the (minimax) optimality results in stochastic approximation (Polyak and Juditsky, 1992). Iterate averaging coupled with polynomially decaying learning rates clearly does achieve minimax optimal statistical rates in the limit (Ruppert, 1988; Polyak and Juditsky, 1992). In fact, recent results for the special case of quadratics indicate that a constant learning rate coupled with iterate averaging achieves anytime minimax optimal statistical rates (as opposed to results that work with the knowledge of the time horizon) (Bach and Moulines, 2013; Jain et al., 2016, 2017b). However, as mentioned previously, this work deals with the behavior of the final iterate (i.e. without iterate averaging) of a stochastic gradient procedure, which is clearly of relevance to practice.

Extending results on the performance of Step Decay schemes to more general convex optimization problems, beyond stochastic optimization of quadratics, is an important future direction.

Related work:

Stochastic Gradient Descent (SGD) and the problem of stochastic approximation were introduced in the work of Robbins and Monro (1951), which elaborates on the stepsize conditions under which stochastic gradient methods converge asymptotically: we refer to such schemes as “convergent” stepsize sequences. The asymptotic statistical optimality of SGD equipped with larger stepsize sequences and iterate averaging was shown in Ruppert (1988); Polyak and Juditsky (1992). In terms of oracle models and notions of optimality, there exist two lines of thought, as elaborated below. See also Jain et al. (2017b) for a detailed discussion in this regard.

One line of thought considers the goal of matching the excess risk of the statistically optimal estimator (Anbar, 1971; Kushner and Clark, 1978; Polyak and Juditsky, 1992) on every problem instance. Several recent works (Bach and Moulines, 2013; Frostig et al., 2015; Dieuleveut et al., 2016; Jain et al., 2016, 2017b) present non-asymptotic results in this oracle model, in conjunction with iterate averaging, and achieve minimax rates (on a per-problem basis) (Lehmann and Casella, 1998; Kushner and Clark, 1978). This paper studies the final iterate of SGD and characterizes its behavior, under this oracle model, for both the standard polynomially decaying stepsizes and the step decay schedule.

The other line of thought designs algorithms under worst case assumptions, such as bounded noise, with the goal of matching the lower bounds provided in Nemirovsky and Yudin (1983); Raginsky and Rakhlin (2011); Agarwal et al. (2012). Working in this oracle model, various asymptotic properties of convergent learning rate schemes have been studied in great detail in the stochastic approximation literature (Kushner and Clark, 1978; Ljung et al., 1992; Bharath and Borkar, 1999; Kushner and Yin, 2003; Lai, 2003), for broad function classes. Using iterate averaged SGD, the efforts of Lacoste-Julien et al. (2012); Rakhlin et al. (2012); Ghadimi and Lan (2012, 2013); Bubeck (2014); Dieuleveut et al. (2016) achieve (near-)minimax rates for various problem classes. The work of Shamir and Zhang (2012) is closest in spirit to ours (despite working with a different oracle model), and presents near minimax rates (up to log factors) using the final iterate of an SGD procedure for non-smooth stochastic optimization, with/without strong convexity assumptions. Note that Harvey et al. (2018) established a lower bound showing that, with “standard” polynomially decaying stepsizes (and when the end time is not known), the final iterate of SGD on non-smooth objectives indeed suffers the extra logarithmic-in-time factor over the minimax rate (Nemirovsky and Yudin, 1983; Raginsky and Rakhlin, 2011; Agarwal et al., 2012) that appears in the upper bound of Shamir and Zhang (2012).

Paper organization:

Section 2 describes notation and problem setup. Section 3 presents our results on the sub-optimality of polynomial decay schemes and the near optimality of the step decay scheme. Section 3.3 presents results on the anytime behavior of SGD (i.e. the asymptotic/infinite horizon case). Section 4 presents experimental results and Section 5 presents conclusions.

2 Problem Setup

Notation: We present the setup and associated notation in this section. We represent scalars with normal font (e.g., a, T), vectors with boldface lowercase characters (e.g., w, x), and matrices with boldface uppercase characters (e.g., H). We represent positive semidefinite (PSD) ordering between two matrices using ⪯. The symbol ≲ indicates that the corresponding inequality holds up to a universal constant.

Our theoretical results focus on the stochastic approximation problem of (streaming) least squares regression, which involves minimizing the following expected square loss objective:

L(w) = (1/2) · E_{(x,y)∼D} [ (y − ⟨w, x⟩)² ]    (3)

Note that the Hessian of the problem is H := ∇²L(w) = E[x xᵀ]. In this paper, we are provided access to stochastic gradients obtained by sampling a fresh example input-output pair (x_t, y_t) ∼ D and using it to compute an unbiased estimator of the gradient of the objective. This stochastic gradient, evaluated at an iterate w, is:

∇̂L_t(w) = (⟨w, x_t⟩ − y_t) · x_t    (4)

Our goal in this paper is to study the stochastic gradient descent method (Robbins and Monro, 1951), wherein, given an initial iterate w_0 and a step size sequence {η_t}, we perform the following update:

w_{t+1} = w_t − η_t · ∇̂L_t(w_t)

For examples drawn from the underlying distribution D, the input x and the output y are related to each other as:

y = ⟨w*, x⟩ + ε,

where ε is the noise on the example pair (x, y) and w* is a minimizer of the objective L(·). We assume that this noise satisfies the following condition:

E[ ε² · x xᵀ ] ⪯ σ² H    (5)

Next, we assume that the covariates x within the samples satisfy the following fourth moment inequality:

E[ ‖x‖² · x xᵀ ] ⪯ R² H    (6)

This assumption is satisfied, for instance, when the norm of the covariates is bounded, i.e. ‖x‖² ≤ R² almost surely, but it holds true even in more general situations (i.e. this assumption is more general than a bounded norm assumption).

Finally, note that both conditions (5) and (6) are fairly general and are used in several recent works (Bach and Moulines, 2013; Jain et al., 2016, 2017b) that present a sharp analysis of SGD (and its variants) for the streaming least squares regression problem. Next, we denote by μ := λ_min(H), ‖H‖ := λ_max(H), and κ := ‖H‖/μ the smallest eigenvalue, the largest eigenvalue, and the condition number of H, respectively. We have μ > 0 in the strongly convex case but not necessarily so in the non-strongly convex case (in Section 3, the non-strongly convex quadratic objective is referred to as the “smooth” case).

Let L* := L(w*) = min_w L(w). The excess risk of an iterate w is given by L(w) − L*. It is well known that, given T accesses to the stochastic gradient oracle in equation (4), any algorithm that uses these stochastic gradients and outputs an estimate ŵ_T has sub-optimality lower bounded by σ²d/T; more concretely, E[L(ŵ_T)] − L* ≳ σ²d/T (Van der Vaart, 2000). There exist schemes that achieve this rate of σ²d/T, e.g., constant step size SGD with averaging (Ruppert, 1988; Polyak and Juditsky, 1992; Bach and Moulines, 2013). This rate of σ²d/T is called the statistical minimax rate.
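To make the setup concrete, the following is a small NumPy sketch of the streaming least-squares oracle of equations (3)–(4); the Gaussian design and the specific parameter names are illustrative assumptions, not prescribed by the paper:

    import numpy as np

    rng = np.random.default_rng(0)

    def make_oracle(h, w_star, sigma):
        """Streaming least-squares stochastic gradient oracle (equation 4).
        Draws x ~ N(0, diag(h)) (so H = diag(h)), y = <w*, x> + sigma * eps,
        and returns the unbiased gradient estimate (<w, x> - y) x."""
        def grad(w):
            x = rng.normal(0.0, np.sqrt(h))
            y = w_star @ x + sigma * rng.normal()
            return (w @ x - y) * x
        return grad

    # Example: a 2-d problem with condition number kappa = 100.
    oracle = make_oracle(h=np.array([1.0, 0.01]), w_star=np.zeros(2), sigma=1.0)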

3 Main results

In this section, we present the main results of this paper. We begin with the sub-optimality of polynomially decaying stepsizes (Section 3.1) and the (surprising) near-optimal behavior of the step-decay schedule (Section 3.2), followed by a fundamental limitation of SGD that makes it query points with highly sub-optimal function values infinitely often (Section 3.3).

3.1 Suboptimality of polynomial decay schemes

This paper begins by showing that there exist problem instances where the traditional polynomial decay schemes presented by the theory of stochastic approximation (Robbins and Monro, 1951; Polyak and Juditsky, 1992), i.e., those of the form η_t = a/(b + t^α) for any choice of a, b and α, are significantly suboptimal (by a factor of the condition number of the problem) compared to the statistical minimax rate (Kushner and Clark, 1978).

Theorem 1.

Under assumptions (5), (6), there exists a class of problem instances where the following lower bounds hold on the final iterate of a stochastic gradient procedure with polynomially decaying stepsizes, when given access to the oracle in equation (4).

Strongly convex case: Suppose μ > 0. For any condition number κ, there exists a problem instance (with an appropriately chosen initial suboptimality) such that, for any T, for all a, b and α ∈ [0, 1], and for the learning rate scheme η_t = a/(b + t^α), we have

E[L(w_T)] − L* ≳ κ · σ²d / T.

Smooth case: For any fixed T, there exists a problem instance such that, for all a, b and α, and for the learning rate scheme η_t = a/(b + t^α), we have

E[L(w_T)] − L* ≳ σ²d / √T.

In both cases above, the statistical minimax rate is σ²d/T. In the strongly convex case, we thus have a suboptimality factor of κ, and in the smooth case, a suboptimality factor that is polynomial in T.
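The gap in Theorem 1 is easy to observe numerically. Below is a self-contained sketch (in the spirit of Figure 1, with illustrative rather than grid-searched parameters) comparing the final-iterate excess risk of an O(1/t) polynomial schedule against a halving step-decay schedule on a 2-d streaming least-squares problem:

    import numpy as np

    rng = np.random.default_rng(1)

    def final_excess_risk(lr_fn, kappa, T, sigma=1.0, trials=200):
        """Average final-iterate excess risk 0.5 * w^T H w of SGD on a 2-d
        least-squares problem with H = diag(1, 1/kappa) and w* = 0."""
        h = np.array([1.0, 1.0 / kappa])
        total = 0.0
        for _ in range(trials):
            w = np.ones(2)  # since w* = 0, w itself is the error vector
            for t in range(1, T + 1):
                x = rng.normal(0.0, np.sqrt(h))
                g = (w @ x - sigma * rng.normal()) * x  # stochastic gradient
                w = w - lr_fn(t) * g
            total += 0.5 * np.sum(h * w * w)
        return total / trials

    T, kappa = 2048, 50
    phase = T // int(np.log2(T))  # halve every T/log2(T) steps (Algorithm 1)
    poly = lambda t: 0.25 / t
    step = lambda t: 0.25 / 2 ** ((t - 1) // phase)
    print("poly decay final risk:", final_excess_risk(poly, kappa, T))
    print("step decay final risk:", final_excess_risk(step, kappa, T))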

3.2 Near optimality of Step Decay schemes

This section presents results on the Step Decay schedules. In particular, given knowledge of the end time T at which the algorithm is terminated, the step decay learning rate schedule (Algorithm 1) offers significant improvements over standard polynomially decaying stepsize schemes, and obtains near minimax rates (off by only a log T factor).

Theorem 2.

Suppose we are given access to the stochastic gradient oracle (4) satisfying assumptions (5) and (6). Running Algorithm 1 with a suitable initial stepsize (of order 1/R², with R² as in assumption (6)) yields the following excess risk guarantees on the final iterate.

  • Strongly convex case: Suppose μ > 0. We have:

E[L(w_T)] − L* ≲ exp( −T / (κ log T) ) · ( L(w_0) − L* ) + σ²d · (log T) / T.

  • Smooth case: We have:

E[L(w_T)] − L* ≲ d R² ‖w_0 − w*‖² · (log T) / T + σ²d · (log T) / T.

We note that, while the above theorem presents significant improvements over standard polynomial decay (or constant learning rate schemes (Polyak and Juditsky, 1992; Bach and Moulines, 2013; Défossez and Bach, 2015; Jain et al., 2016)) with iterate averaging, the result presents a worse rate on the initial error (by a dimension factor) in the smooth case, compared to the best known result (Bach and Moulines, 2013), which relies heavily on iterate averaging to remove this factor. Whether this factor can actually be improved is an open question. The above result shows that the Step Decay scheme significantly improves over polynomial decay schemes, which are plagued by a polynomial dependence on the condition number in the variance of the final iterate. Furthermore, note that Algorithm 1 requires access only to R² (just as standard SGD for least squares (Bach and Moulines, 2013; Jain et al., 2016)) and to the end time T, and does not require access to the strong convexity parameter, in contrast to standard results for the strongly convex setting (e.g. Rakhlin et al. (2012); Shamir and Zhang (2012); Lacoste-Julien et al. (2012); Bubeck (2014)), which achieve their rates given access to the strong convexity parameter (which is often harder to obtain in practice), and, more often than not, using iterate averaging. These results are off from the statistical minimax rates achieved using iterate averaging (Kushner and Clark, 1978; Polyak and Juditsky, 1992) by only a log T factor. Note that this factor can be improved to a log κ factor for the strongly convex quadratic case by using an additional polynomial decay scheme in the beginning, before switching to the Step Decay scheme.

Proposition 3.

Suppose we are given access to the stochastic gradient oracle (4) satisfying assumptions (5) and (6). Let μ > 0 and let κ := ‖H‖/μ. For any problem and fixed time horizon T, there exists a learning rate scheme that achieves

E[L(w_T)] − L* ≲ exp( −T/κ ) · ( L(w_0) − L* ) + σ²d · (log κ) / T.

Note that in order to improve the dependence of the variance term from log T (in Theorem 2) to log κ (in Proposition 3), we do require access to the strong convexity parameter μ, in addition to R² and knowledge of the end time T. However, this is indeed the case even for standard analyses in the strongly convex setting, e.g., Rakhlin et al. (2012); Shamir and Zhang (2012); Lacoste-Julien et al. (2012); Bubeck (2014).

As a final remark, recall that our results in this section (on step decay schemes) assume knowledge of a fixed time horizon. In contrast, most results on SGD's averaged iterate obtain anytime (i.e., limiting/infinite horizon) guarantees. Can we hope to achieve such guarantees with the final iterate?

3.3 SGD queries bad points infinitely often

Our main result in this section shows that obtaining near statistical minimax rates with the final iterate is not possible without knowledge of the time horizon T. More concretely, we show the following limitation of SGD for the strongly convex quadratic case: for any learning rate sequence (be it polynomially decaying or step-decay), SGD must query a point whose sub-optimality exceeds the minimax rate by nearly a condition number factor for infinitely many time steps t.

Theorem 4.

Suppose we are given access to a stochastic gradient oracle (4) satisfying assumptions (5), (6). There exists a universal constant c > 0 and a problem instance such that, for the SGD algorithm run with any learning rate sequence satisfying η_t ≲ 1/‖H‖ for all t (learning rates much larger than 1/‖H‖ will make the algorithm diverge), we have

lim sup_{t→∞}  ( E[L(w_t)] − L* ) · t / ( κ σ²d )  ≥ c.

The bad points guaranteed to exist by Theorem 4 are not rare: one can in fact show that such points recur with a quantifiable frequency. This claim is formalized in Theorem 16 in Appendix D.

4 Experimental Results

We present experimental validation of the suitability of the Step Decay schedule (or, more precisely, its continuous counterpart, the exponentially decaying schedule), and compare it with the polynomially decaying stepsize schedules. In particular, we consider the use of:

η_t = η₀ / (1 + b·t)      (7)
η_t = η₀ / (1 + b·√t)     (8)
η_t = η₀ · exp(−b·t)      (9)

where we perform a systematic grid search over the parameters η₀ and b. In the section below, we consider a real-world non-convex optimization problem – training a residual network on the cifar-10 dataset – with the aim of illustrating the practical implications of the results described in the paper. Complete details of the setup are given in Appendix E.
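A small sketch of the three parametric schedule families as reconstructed above (the exact functional forms of (7) and (8) are our reading of the two polynomial families; Appendix E has the authoritative parameterization):

    import numpy as np

    def poly_decay(eta0, b, t):       # equation (7): O(1/t) decay
        return eta0 / (1.0 + b * t)

    def sqrt_decay(eta0, b, t):       # equation (8): O(1/sqrt(t)) decay
        return eta0 / (1.0 + b * np.sqrt(t))

    def exp_decay(eta0, b, t):        # equation (9): smoothed step decay
        return eta0 * np.exp(-b * t)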

4.1 Non-Convex Optimization: Training a Residual Net on cifar-10 

Consider the task of training a 44-layer deep residual network (He et al., 2016b) with pre-activation blocks (He et al., 2016a) (dubbed preresnet-44) for the cifar-10 classification problem. The code for implementing the network can be found at https://github.com/D-X-Y/ResNeXt-DenseNet. For all experiments, we use Nesterov's accelerated gradient method (Nesterov, 1983) as implemented in pytorch (https://github.com/pytorch), with the momentum, batch size, number of training epochs, and ℓ₂ regularization fixed across runs (values are listed in Appendix E).

Our experiments are based on grid searching for the best learning rate decay scheme within the parametric family of learning rate schemes described above (equations (7)–(9)); all grid searches are performed on a separate validation set (obtained by setting aside one-tenth of the training dataset), with models trained on the remaining 45,000 samples. For the final numbers presented in the plots/tables, we take the best hyperparameters from the validation stage, train on the entire 50,000 samples, and average results over runs with different random seeds. The parameters for the grid searches and other details are presented in Appendix E. Furthermore, we always extend the grid so that the best performing grid search parameter lies in the interior of our grid search.
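As an illustration of this protocol, here is a minimal PyTorch sketch of the validation split used for the grid search; the transform and loader settings are illustrative placeholders:

    import torch
    import torchvision
    import torchvision.transforms as transforms

    train_full = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True,
        transform=transforms.ToTensor())
    # Hold out one-tenth (5,000 of 50,000) of the training set for validation.
    train_set, val_set = torch.utils.data.random_split(
        train_full, [45000, 5000],
        generator=torch.Generator().manual_seed(0))
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
    val_loader = torch.utils.data.DataLoader(val_set, batch_size=128)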

Comparison between different schemes: Figure 2 and Table 2 present a comparison of the performance of the three schemes (7)-(9). They demonstrate that the exponential scheme outperforms the polynomial step-size schemes.

Table 2: Comparing the train cross-entropy and test error of the three learning rate decay schemes (equations (7)–(9)) for the classification task on cifar-10, using a 44-layer residual net with pre-activations; the exponential scheme (9) achieves the best train function value and test error.
Figure 2: Plot of the training function value (left) and test error (right) comparing the three decay schemes – two polynomial ((7), (8)) and one exponential ((9)) – for the cifar-10 classification problem with a 44-layer residual net with pre-activation blocks.
Figure 3: Plot of the training function value (left) and test error (right) comparing the exponential decay scheme (equation (9)) with parameters optimized for three different epoch budgets, on the cifar-10 classification problem with a 44-layer residual net with pre-activation blocks.

Hyperparameter selection using truncated runs: Figure 3 and Tables 3 and 4 present a comparison of three exponential decay schemes, each of which has the best performance at one of three successively longer epoch budgets. The key point to note is that the best performing hyperparameters at the two shorter budgets are not the best performing at the full budget (which is especially stark from the perspective of the validation error). This demonstrates that hyperparameter selection using truncated runs (as done, e.g., in hyperband (Li et al., 2017)) might necessitate rethinking.

Table 3: Comparing the training (softmax) function value of models obtained by optimizing the exponential decay scheme (equation (9)) for three different end times (in epochs), for the classification task on the cifar-10 dataset using a 44-layer residual net; each row is a scheme optimized for one end time, and each column reports the train function value at one of the three end times.

Table 4: Comparing the test error of models obtained by optimizing the exponential decay scheme (equation (9)) for three different end times (in epochs), for the classification task on the cifar-10 dataset using a 44-layer residual net; rows and columns are organized as in Table 3.

5 Conclusions and Discussion

The main contribution of this work is to show that learning rate scheduling is far more nuanced than suggested by prior theoretical results: one does not even need to move to non-convex optimization to show that schemes starkly different from the standard polynomially decaying stepsizes considered in theory can be far more effective. This is important from a practical perspective, in that the Step Decay schedule is widely used in practical SGD implementations for both convex and non-convex optimization.

Is quadratic loss minimization special? One may ask whether there is something particularly special about quadratic loss minimization that makes its minimax rates differ from those for more general convex (and non-convex) optimization problems. Ideally, we would hope that our theoretical results can be extended to more general cases: this would serve as an exciting direction for future research. Interestingly, Allen-Zhu (2018) shows marked improvements for making the gradient norm small (as opposed to function values, as considered in this paper) when working with stochastic gradients, for general function classes, with factors that appear similar to the ones obtained in this work.

Acknowledgements: Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery, NSF Award 1740551, and ONR award N00014-18-1-2247.

References

Appendix A Preliminaries

Before presenting the lemmas establishing the behavior of SGD under various learning rate schemes, we introduce some notation. Recall the SGD update rule:

w_{t+1} = w_t − η_t · ∇̂L_t(w_t).

We then write out the expression for the stochastic gradient:

∇̂L_t(w) = (⟨w, x_t⟩ − y_t) · x_t = x_t x_tᵀ (w − w*) − ε_t x_t,

where, given the example (x_t, y_t) with y_t = ⟨w*, x_t⟩ + ε_t, the above stochastic gradient expression naturally follows. Now, in order to analyze the contraction properties of the SGD update rule, we work with the centered iterate θ_t := w_t − w*, which evolves as:

θ_{t+1} = (I − η_t x_t x_tᵀ) θ_t + η_t ε_t x_t.

Lemma 5.

[See e.g. Appendix A.2.2 of Jain et al. (2016)] Bias-variance tradeoff: Running SGD for T steps starting from w_0, with a stepsize sequence {η_t}, produces a final iterate whose excess risk is upper-bounded as:

E[L(w_T)] − L* ≤ 2 ( E[L(w_T^{bias})] − L* ) + 2 ( E[L(w_T^{variance})] − L* ),

where w_T^{bias} and w_T^{variance} are the final iterates of the bias recursion (10) and the variance recursion (11) defined below. Note that E[θ_t^{variance}] = 0 for all t and that θ_t^{variance} is measurable with respect to F_t, where F_t is the filtration formed by all samples until time t.

Proof.

One can view the contribution of the above two terms as stemming from the decomposition of SGD's error recursion, which can be written as:

θ_{t+1} = (I − η_t x_t x_tᵀ) θ_t + η_t ε_t x_t = θ_{t+1}^{bias} + θ_{t+1}^{variance}.

From the above equation, the result of the lemma follows straightforwardly (using (a + b)ᵀ H (a + b) ≤ 2 aᵀ H a + 2 bᵀ H b). Now, clearly, if the noise and the inputs are independent of each other, and if the noise is zero mean, i.e. E[ε] = 0, the above inequality holds with equality (without the factor of two). This is true more generally iff the cross term E[(θ_T^{bias})ᵀ H θ_T^{variance}] vanishes.

For more details, refer to Défossez and Bach (2015).

Now, in order to bound the total error, note that the original stochastic process associated with SGD's updates can be decoupled into two (simpler) processes. The first is the noiseless process (which corresponds to reducing the dependence on the initial error, and is termed “bias”), initialized at θ_0^{bias} = w_0 − w*, whose recurrence evolves as:

θ_{t+1}^{bias} = (I − η_t x_t x_tᵀ) θ_t^{bias}    (10)

The second recursion corresponds to the dependence on the noise (termed “variance”), wherein the process is initiated at the solution, i.e. θ_0^{variance} = 0, and is driven by the noise ε_t. The update for this process is:

θ_{t+1}^{variance} = (I − η_t x_t x_tᵀ) θ_t^{variance} + η_t ε_t x_t    (11)
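Since the decomposition is exact (by linearity of the recursion), it is easy to verify numerically; the following sketch, with illustrative parameters, checks that the SGD error iterate equals the sum of the bias and variance iterates when all three processes share the same samples:

    import numpy as np

    rng = np.random.default_rng(3)
    d, T, eta, sigma = 2, 500, 0.1, 0.5
    h = np.array([1.0, 0.1])            # eigenvalues of H = diag(h)
    theta = np.array([1.0, -1.0])       # theta_0 = w_0 - w*
    bias, var = theta.copy(), np.zeros(d)
    for _ in range(T):
        x = rng.normal(0.0, np.sqrt(h))
        eps = sigma * rng.normal()
        P = np.eye(d) - eta * np.outer(x, x)
        theta = P @ theta + eta * eps * x   # full error recursion
        bias = P @ bias                     # bias process, equation (10)
        var = P @ var + eta * eps * x       # variance process, equation (11)
    print(np.allclose(theta, bias + var))   # True: the decomposition is exact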

We denote by B_t the covariance of the iterate of the bias process, i.e.,

B_t := E[ θ_t^{bias} (θ_t^{bias})ᵀ ].

The quantity that routinely shows up when bounding SGD's convergence behavior is the covariance of the variance error, i.e. C_t := E[ θ_t^{variance} (θ_t^{variance})ᵀ ]. Expanding the recursion (11) and using E[ε_t x_t] = 0 yields the following (simplified) expression for C_{t+1}:

C_{t+1} = E[ (I − η_t x xᵀ) C_t (I − η_t x xᵀ) ] + η_t² E[ ε² x xᵀ ].

Firstly, note that this naturally implies that the sequence of covariances, initialized at (say) the solution, i.e., C_0 = 0, naturally grows towards its steady state covariance:

0 = C_0 ⪯ C_1 ⪯ C_2 ⪯ ⋯ ⪯ C_∞.

See Lemma 3 of Jain et al. (2017a) for more details. Furthermore, combining the display above with the noise assumption (5), what naturally follows in relating C_t to C_{t+1} is:

C_{t+1} ⪯ E[ (I − η_t x xᵀ) C_t (I − η_t x xᵀ) ] + η_t² σ² H    (12)

Lemma 6 (Lemma 5 of Jain et al. (2017a)).

Running SGD with a constant stepsize sequence η_t = η (with η ≤ 1/(2R²)) achieves a steady-state covariance satisfying:

C_∞ ⪯ ( η σ² / (1 − η R²) ) · I.

Lemma 7.

Suppose C_0 = 0, and η_t ≤ 1/(2R²) for all t. Then, for any such sequence of learning rates,

C_t ⪯ (σ²/R²) · I for all t.

Proof.

We will prove the lemma using an inductive argument. The base case, C_0 ⪯ (σ²/R²) I, follows from the problem statement: the variance process of SGD starts at the solution, implying C_0 = 0, so the statement naturally holds. Suppose C_t satisfies the bound above. From equation (12), and using E[x xᵀ C_t x xᵀ] ⪯ (σ²/R²) E[‖x‖² x xᵀ] ⪯ σ² H (assumption (6)) together with η_t ≤ 1/(2R²), we have the following bound on the covariance C_{t+1}:

C_{t+1} ⪯ (σ²/R²) I − η_t (2 − η_t R²) (σ²/R²) H + η_t² σ² H ⪯ (σ²/R²) I,

from which the lemma follows. ∎

Lemma 8.

(Reduction from the multiplicative noise oracle) Let C_t be the (expected) covariance of the variance error, and suppose η_t ≤ 1/(2R²) for all t. Then the recursion that connects C_{t+1} to C_t can be relaxed to:

C_{t+1} ⪯ (I − η_t H) C_t (I − η_t H) + 2 η_t² σ² H.

Proof.

From equation (12), we already know that the evolution of the covariance of the variance error satisfies:

C_{t+1} ⪯ E[ (I − η_t x xᵀ) C_t (I − η_t x xᵀ) ] + η_t² σ² H
        = (I − η_t H) C_t (I − η_t H) + η_t² ( E[x xᵀ C_t x xᵀ] − H C_t H ) + η_t² σ² H
        ⪯ (I − η_t H) C_t (I − η_t H) + 2 η_t² σ² H,

where the steps follow from Lemma 7 (which gives C_t ⪯ (σ²/R²) I, so that E[x xᵀ C_t x xᵀ] ⪯ (σ²/R²) E[‖x‖² x xᵀ] ⪯ σ² H) and owing to the fact that H C_t H ⪰ 0. ∎

Note: In essence, one can analyze an auxiliary process driven by additive noise with variance inflated by a factor of two, and thereby convert the analysis into one involving exact (deterministic) gradients.

Lemma 9.

[Bias decay – strongly convex case] Let the minimal eigenvalue of the Hessian be μ := λ_min(H) > 0. Consider the bias recursion in equation (10) with the stepsize set as η_t = η ≤ 1/(2R²). Then,

E‖θ_t^{bias}‖² ≤ (1 − η μ)^t · ‖θ_0^{bias}‖².

Proof.

The proof follows through straightforward computations:

E‖θ_{t+1}^{bias}‖² = E[ (θ_t^{bias})ᵀ (I − η x xᵀ)² θ_t^{bias} ]
                  ≤ E‖θ_t^{bias}‖² − η (2 − η R²) E[ (θ_t^{bias})ᵀ H θ_t^{bias} ]
                  ≤ (1 − η μ) E‖θ_t^{bias}‖²,

where the first inequality follows from the fact that E[‖x‖² x xᵀ] ⪯ R² H (assumption (6)) together with η ≤ 1/(2R²), and the result follows from the definition of μ. ∎
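The geometric decay in Lemma 9 is straightforward to check by simulation; below is a sketch with illustrative Gaussian covariates (for which assumption (6) holds with R² a small multiple of tr(H)):

    import numpy as np

    rng = np.random.default_rng(4)
    h = np.array([1.0, 0.25])      # eigenvalues of H; mu = 0.25
    eta, T, trials = 0.05, 200, 2000
    sq_norm = np.zeros(T + 1)
    for _ in range(trials):
        theta = np.array([1.0, 1.0])
        sq_norm[0] += theta @ theta
        for t in range(1, T + 1):
            x = rng.normal(0.0, np.sqrt(h))
            theta = theta - eta * (theta @ x) * x   # bias recursion (10)
            sq_norm[t] += theta @ theta
    sq_norm /= trials
    mu = h.min()
    # Empirically, E||theta_t||^2 stays below (1 - eta*mu)^t * ||theta_0||^2.
    bound = (1 - eta * mu) ** np.arange(T + 1) * sq_norm[0]
    print(np.all(sq_norm <= 1.05 * bound))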

Lemma 10.

[Reduction of the bias recursion with multiplicative noise to one resembling the variance recursion] Consider the bias recursion, which evolves as

θ_{t+1}^{bias} = (I − η_t x_t x_tᵀ) θ_t^{bias},

with η_t ≤ 1/(2R²), and let B_t := E[ θ_t^{bias} (θ_t^{bias})ᵀ ]. Then, the following recursion holds:

B_{t+1} ⪯ (I − η_t H) B_t (I − η_t H) + η_t² R² ‖θ_0^{bias}‖² H.

Proof.

The result follows owing to the following computations:

B_{t+1} = E[ (I − η_t x xᵀ) B_t (I − η_t x xᵀ) ]
        = (I − η_t H) B_t (I − η_t H) + η_t² ( E[x xᵀ B_t x xᵀ] − H B_t H )
        ⪯ (I − η_t H) B_t (I − η_t H) + η_t² R² E‖θ_t^{bias}‖² H
        ⪯ (I − η_t H) B_t (I − η_t H) + η_t² R² ‖θ_0^{bias}‖² H,

with the last inequality holding true if the squared distance to the optimum does not grow as a part of the recursion. We prove that this indeed is the case:

E‖θ_{t+1}^{bias}‖² ≤ E‖θ_t^{bias}‖² − η_t (2 − η_t R²) E[ (θ_t^{bias})ᵀ H θ_t^{bias} ] ≤ E‖θ_t^{bias}‖².

Recursively applying the above argument yields the desired result. ∎

Note: This result implies that the bias error (in the smooth, non-strongly convex case of least squares regression with multiplicative noise) can be bounded by employing a lemma similar to that for the variance, where the quantity R² ‖w_0 − w*‖² plays the role of the noise variance σ² that drives the process.

Lemma 11.

[Lower bounds for the additive noise oracle imply lower bounds for the multiplicative noise oracle] Suppose the covariance of the noise satisfies E[ ε² x xᵀ ] ⪰ σ² H. Let C_t be the (expected) covariance of the variance error. Then, the recursion that connects C_{t+1} to C_t satisfies:

C_{t+1} ⪰ (I − η_t H) C_t (I − η_t H) + η_t² σ² H.

Proof.

Let us first consider the setting of (bounded) additive noise. Here, we have: