Large-scale machine learning and deep learning rely almost exclusively on stochastic optimization methods, primarily SGD (Robbins and Monro, 1951) and its variants. Such methods are heavily tuned to the problem at hand (often with parallelized hyper-parameter searches (Li et al., 2017)). There are two predominant approaches in stochastic optimization: methods that decay the learning rate according to a schedule to achieve the best performance (Krizhevsky et al., 2012; Sutskever et al., 2013; Kidambi et al., 2018), and methods that rely on various forms of approximate preconditioning (Duchi et al., 2011; Tieleman and Hinton, 2012; Kingma and Ba, 2014) to obtain reasonably accurate results on classes of problem instances, (often) with minimal hyper-parameter tuning. This work examines the former class of methods, where our goal is to present a more refined characterization of optimal learning rate schedules, through both sharp theoretical analysis (on the special case of convex quadratics) and empirical studies.
In this work, we will restrict our attention to only the SGD algorithm where we are concerned with the behavior of the final iterate (i.e. the last point when we choose to terminate the algorithm). While the majority of (minimax optimal) theoretical results for SGD focus on iterate averaging techniques (e.g. Polyak and Juditsky (1992)), practical implementations of SGD predominantly return the final iterate of the SGD procedure. Thus, it is of importance (both from theoretical and practical perspectives) to quantify what is achievable with the final iterate of an SGD procedure.
In theory, it is known that the final iterate (Robbins and Monro, 1951) of SGD will (asymptotically) converge to the (local) minimizer only if the learning rates are not summable but are square summable (the former condition ensures that the initial conditions are forgotten, and the latter ensures that the error due to the noise goes to zero) (Kushner and Clark, 1978; Kushner and Yin, 2003). In particular, much of the theoretically studied learning schedules are of the form for some and (Robbins and Monro, 1951; Polyak and Juditsky, 1992). We refer to these schedules as polynomial decay schemes; such schemes are convergent because they are not summable but are square summable. Furthermore, it is known that such polynomial decay schemes can yield near-minimax optimal rates (up to log factors) on the final iterate for certain classes of non-smooth stochastic convex optimization problems (Shamir and Zhang, 2012), with/without strong convexity.
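As a small numerical illustration of the two conditions above, the sketch below accumulates partial sums for the hypothetical schedule 1/t, chosen here only as a representative polynomial decay scheme; the function names are illustrative and not taken from this paper. The sum of the rates keeps growing with the horizon, while the sum of the squared rates stays bounded.

```python
import math


def partial_sums(schedule, n_steps):
    """Return (sum of eta_t, sum of eta_t**2) for t = 1..n_steps."""
    total, total_sq = 0.0, 0.0
    for t in range(1, n_steps + 1):
        eta = schedule(t)
        total += eta
        total_sq += eta * eta
    return total, total_sq


def inverse_time(t):
    # Hypothetical representative of a polynomial decay scheme: eta_t = 1/t.
    return 1.0 / t


sum_1k, sq_1k = partial_sums(inverse_time, 1_000)
sum_100k, sq_100k = partial_sums(inverse_time, 100_000)

# The plain sum grows like log(t); the squared sum is capped by pi^2 / 6.
print(sum_1k, sum_100k)
print(sq_1k, sq_100k, math.pi ** 2 / 6)
```

Extending the horizon from a thousand to a hundred thousand steps adds roughly log(100) to the first sum and almost nothing to the second.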
In practice, a widely used stepsize schedule involves cutting the learning rate (by a constant factor) every constant number of epochs; such schemes are referred to as “Step Decay” schedules (see Towards Data Science: Stepsize schedules). Clearly, such a scheme decays the learning rate geometrically and is therefore a non-convergent scheme (from the stochastic approximation perspective). However, in practice, the schedule at which the rate is (geometrically) cut is tuned (see http://cs231n.github.io/neural-networks-3/#baby) to obtain good performance when the algorithm is terminated (Krizhevsky et al., 2012; He et al., 2016b), as opposed to one that obtains the best rates in the limit of a large number of updates. Such schemes are so widely used that they are available as a standard option in popular deep learning libraries such as PyTorch (see PyTorch Learning Rate Scheduler: Reduce on plateau; see also https://www.tensorflow.org/api_docs/python/tf/train/exponential_decay).
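A step decay schedule of this kind can be sketched as follows. The sketch assumes that the rate is halved once per phase and that a known horizon is split into roughly log2(T) equal phases; both choices, and the name `step_decay_lr`, are assumptions made for illustration rather than the exact procedure of Algorithm 1.

```python
import math


def step_decay_lr(eta0, t, total_steps):
    """Learning rate at step t (0-indexed) of a step decay schedule.

    Assumption: the horizon is split into about log2(total_steps) equal
    phases and the rate is halved at the start of each new phase.
    """
    n_phases = max(1, int(math.log2(total_steps)))
    phase_len = max(1, total_steps // n_phases)
    # Leftover steps at the end of the horizon stay in the last phase.
    phase = min(t // phase_len, n_phases - 1)
    return eta0 / (2 ** phase)


rates = [step_decay_lr(0.1, t, 1024) for t in range(1024)]
print(rates[0], rates[-1], len(set(rates)))
```

In PyTorch, a comparable schedule is exposed through `torch.optim.lr_scheduler.StepLR`, which multiplies the rate by `gamma` every `step_size` epochs.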
| Assumptions | Minimax rate | Rate w/ final iterate using best poly-decay | Rate w/ final iterate using Step Decay |
| General convex functions | | (Shamir and Zhang, 2012) | – |
| Non-strongly convex quadratics (2) | | (This work - Theorem 1) | (This work - Theorem 2) |
| General strongly convex functions | | (Shamir and Zhang, 2012) | – |
| Strongly convex quadratics (2) | | (This work - Theorem 1) | (This work - Theorem 2) |
This work establishes near optimality of the step-decay schedule (Algorithm 1) on the final iterate of an SGD procedure (with a known time horizon). In particular, the variance on the final iterate of a step-decay schedule is shown to offer an exponential improvement over that of the polynomially decaying step size schemes that are standard in the theory of stochastic approximation (Kushner and Yin, 2003). Figure 1 illustrates that this difference is evident (empirically) even when optimizing a two-dimensional convex quadratic. Table 1 provides a summary.
Our main contributions are as follows:
Sub-optimality of polynomially decaying learning rate schemes: For the case of optimizing strongly convex quadratics, this work shows that the final iterate of a polynomially decaying stepsize scheme (i.e. with , with ) is off the statistical minimax rate by a factor of the condition number of the problem. For the non-strongly convex case of optimizing quadratics, any polynomially decaying stepsize scheme can achieve a rate no better than (up to factors), while the statistical minimax rate is . We would like to make a note here that our main theorem 2, for the non-strongly convex case of quadratics, offers a rate on the initial error (i.e., the bias term) that is off the best known rate (Bach and Moulines, 2013) (that employs iterate averaging) by a dimension factor.
Near-optimality of the step-decay scheme: Given a fixed end time , the step-decay scheme (Algorithm 1) presents a final iterate that is off the statistical minimax rate by just a factor for optimizing both strongly convex and non-strongly convex quadratics, thus indicating vast improvements over polynomially decaying stepsize schedules. (Footnote: This dependence can be improved to of the condition number of the problem (for the strongly convex case) using a more refined stepsize decay scheme.) Algorithm 1 is rather straightforward and employs the knowledge of just an initial learning rate and number of iterations for its implementation.
SGD has to query bad points (or iterates) infinitely often: For the case of optimizing strongly convex quadratics, this work shows that any stochastic gradient procedure (in a sense) must query sub-optimal iterates (off by nearly a condition number) infinitely often.
Table 1 summarizes this paper’s results. Note that the sub-optimality of standard polynomially decaying stepsizes for classes of smooth and strongly convex optimization doesn’t contradict the (minimax) optimality results in stochastic approximation (Polyak and Juditsky, 1992). Iterate averaging coupled with polynomially decaying learning rates clearly does achieve minimax optimal statistical rates in the limit (Ruppert, 1988; Polyak and Juditsky, 1992). In fact, recent results for the special case of quadratics indicate that a constant learning rate coupled with iterate averaging achieves anytime minimax optimal statistical rates (as opposed to results that work with the knowledge of the time horizon) (Bach and Moulines, 2013; Jain et al., 2016, 2017b). However, as mentioned previously, this work deals with the behavior of the final iterate (i.e. without iterate averaging) of a stochastic gradient procedure, which is clearly of relevance to practice.
Extending results on the performance of Step Decay schemes to more general convex optimization problems, beyond stochastic optimization of quadratics, is an important future direction.
Stochastic Gradient Descent (SGD) and the problem of stochastic approximation were introduced in the work of Robbins and Monro (1951). This work elaborates on stepsize schemes satisfied by asymptotically convergent stochastic gradient methods: we refer to these schemes as “convergent” stepsize sequences. The asymptotic statistical optimality of SGD equipped with larger stepsize sequences and iterate averaging was shown in Ruppert (1988); Polyak and Juditsky (1992). In terms of oracle models and notions of optimality, there exist two lines of thought, as elaborated below. See also Jain et al. (2017b) for a detailed discussion in this regard.
One line of thought considers the goal of matching the excess risk of the statistically optimal estimator (Anbar, 1971; Kushner and Clark, 1978; Polyak and Juditsky, 1992) on every problem instance. Several recent works (Bach and Moulines, 2013; Frostig et al., 2015; Dieuleveut et al., 2016; Jain et al., 2016, 2017b) present non-asymptotic results in this oracle model, in conjunction with iterate averaging, and achieve minimax rates (on a per-problem basis) (Lehmann and Casella, 1998; Kushner and Clark, 1978). This paper studies the final iterate of SGD and characterizes its behavior with both the standard polynomially decaying stepsizes and the step decay schedule under this oracle model.
The other line of thought designs algorithms under worst case assumptions such as bounded noise, with the goal of matching the lower bounds provided in Nemirovsky and Yudin (1983); Raginsky and Rakhlin (2011); Agarwal et al. (2012). Working in this oracle model, various asymptotic properties of convergent learning rate schemes in the stochastic approximation literature have been studied in great detail (Kushner and Clark, 1978; Ljung et al., 1992; Bharath and Borkar, 1999; Kushner and Yin, 2003; Lai, 2003), for broad function classes. Using iterate averaged SGD, the efforts of Lacoste-Julien et al. (2012); Rakhlin et al. (2012); Ghadimi and Lan (2012, 2013); Bubeck (2014); Dieuleveut et al. (2016) achieve (near-)minimax rates for various problem classes. The work of Shamir and Zhang (2012) is closest in spirit to our work (despite working with a different oracle model), and presents near minimax rates (up to factors) using the final iterate of an SGD procedure for non-smooth stochastic optimization with/without strong convexity assumptions. Note that Harvey et al. (2018) established a lower bound indicating that, when optimizing non-smooth objectives, the final iterate of SGD with “standard” polynomially decaying stepsizes (under specific classes of such stepsizes, and when the end time is not known) suffers an extra logarithmic dependence on the time over the minimax rate (Nemirovsky and Yudin, 1983; Raginsky and Rakhlin, 2011; Agarwal et al., 2012); this is the same extra logarithmic dependence that appears in the rates of Shamir and Zhang (2012).
Section 2 describes notation and problem setup. Section 3 presents our results on the sub-optimality of polynomial decay schemes and the near optimality of the step decay scheme. Section 3.3 presents results on the anytime behavior of SGD (i.e. the asymptotic/infinite horizon case). Section 4 presents experimental results and Section 5 presents conclusions.
2 Problem Setup
Notation: We present the setup and associated notation in this section. We represent scalars with normal font etc., vectors with boldface lowercase characters etc. and matrices with boldface uppercase characters etc. We represent positive semidefinite (PSD) ordering between two matrices using . The symbol represents that the direction of inequality holds for some universal constant.
Our theoretical results focus on the stochastic approximation problem of (streaming) least squares regression and this involves minimizing the following expected square loss objective:
Note that the Hessian of the problem . In this paper, we are provided access to stochastic gradients, which involves sampling a fresh example input-output pair and using this to compute an unbiased estimator of the gradient of the objective. This stochastic gradient , evaluated at some iterate is represented as:
Our goal in this paper is to consider the stochastic gradient descent method (Robbins and Monro, 1951), wherein, given an initial iterate and step size sequence , we perform the following update:
With regards to examples drawn from the underlying distribution , the input and the output are related to each other as:
where, is the noise on the example pair and is a minimizer of the objective . We assume that this noise satisfies the following condition:
Next, we assume that covariates within the samples satisfy the following fourth moment inequality:
This assumption is satisfied, for instance, when the norm of the covariates , but holds true even in more general situations (i.e. this assumption is more general than a bounded norm assumption).
Finally, note that both the conditions 5 and 6 are fairly general and used in several recent works (Bach and Moulines, 2013; Jain et al., 2016, 2017b) that present a sharp analysis of SGD (and its variants) for the streaming least squares regression problem. Next, we denote by
the smallest eigenvalue, largest eigenvalue and condition number of the Hessian, respectively. in the strongly convex case but not necessarily so in the non-strongly convex case (in Section 3, the non-strongly convex quadratic objective is referred to as the “smooth” case).
Let . The excess risk of an iterate is given by . It is well known that given accesses to the stochastic gradient oracle in equation 4, any algorithm that uses these stochastic gradients and outputs has sub-optimality that is lower bounded by . More concretely, we have that (Van der Vaart, 2000)
There exist schemes that achieve this rate of e.g., constant step size SGD with averaging (Ruppert, 1988; Polyak and Juditsky, 1992; Bach and Moulines, 2013). This rate of is called the statistical minimax rate.
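The setup of this section can be sketched in a few lines of Python. The sketch assumes the objective is one half of the expected squared residual, so that the stochastic gradient at an iterate is the negative residual times the input; the helper names and the noiseless two-dimensional instance are hypothetical and serve only as a sanity check. The routine returns the final iterate, with no averaging.

```python
import random


def sgd_least_squares(sample, w0, step_size, n_steps):
    """Run SGD on streaming least squares and return the final iterate.

    sample() returns a fresh pair (x, y); step_size(t) returns eta_t.
    Assumes the objective 0.5 * E[(y - <w, x>)^2], so the stochastic
    gradient at w is -(y - <w, x>) * x.
    """
    w = list(w0)
    for t in range(n_steps):
        x, y = sample()
        residual = y - sum(wi * xi for wi, xi in zip(w, x))
        eta = step_size(t)
        # Gradient step: w <- w - eta * (-(residual) * x).
        w = [wi + eta * residual * xi for wi, xi in zip(w, x)]
    return w


# Hypothetical noiseless 2-d instance with minimizer w* = (1, -2).
rng = random.Random(0)
w_star = [1.0, -2.0]


def sample():
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    y = sum(wi * xi for wi, xi in zip(w_star, x))
    return x, y


w_final = sgd_least_squares(sample, [0.0, 0.0], lambda t: 0.1, 2000)
print(w_final)
```

Because the example has no label noise, a constant stepsize already drives the final iterate to the minimizer; the stepsize schedule matters once noise is present.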
3 Main results
In this section, we present the main results of this paper. We begin with the sub-optimality of polynomially decaying stepsizes (Section 3.1), then turn to the (surprising) near-optimal behavior of the step-decay schedule (Section 3.2), followed by the fundamental limitation that forces SGD to query points with highly sub-optimal function values infinitely often (Section 3.3).
3.1 Suboptimality of polynomial decay schemes
We begin by showing that there exist problem instances where the traditional polynomial decay schemes prescribed by the theory of stochastic approximation (Robbins and Monro, 1951; Polyak and Juditsky, 1992), i.e., those of the form , for any choice of and , are significantly suboptimal (by a factor of the condition number of the problem) compared to the statistical minimax rate (Kushner and Clark, 1978).
Under assumptions 5, 6, there exists a class of problem instances where the following lower bounds hold on the final iterate of a stochastic gradient procedure with polynomially decaying stepsizes when given access to the oracle as written in equation 4.
Strongly convex case: Suppose . For any condition number , there exists a problem instance with initial suboptimality such that, for any , and for all and , and for the learning rate scheme , we have
Smooth case: For any fixed , there exists a problem instance such that, for all and , and for the learning rate scheme , we have
In both the cases above, the statistical minimax rate is . In the strongly convex case, we have a suboptimality factor of and in the smooth case, we have a suboptimality factor of .
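The gap between the two kinds of schedules can be illustrated on a two-dimensional quadratic without any sampling. The sketch below makes a simplifying assumption that differs from the oracle analyzed in this paper: the gradient noise is additive with covariance equal to the noise level times the Hessian, so the expected squared error of each coordinate obeys an exact scalar recursion. The polynomial schedule shown is one illustrative choice of constants, not a tuned one, and the step decay rule (halving once per phase) is likewise an assumption.

```python
import math


def expected_excess_risk(eigenvalues, step_size, n_steps,
                         sigma2=1.0, init_sq_err=1.0):
    """Exact expected excess risk of the final iterate on a diagonal quadratic.

    Simplifying assumption: an additive-noise gradient oracle whose noise has
    covariance sigma2 * H, so coordinate i obeys
        v <- (1 - eta * lam_i)**2 * v + eta**2 * sigma2 * lam_i,
    where v is the expected squared error in that coordinate.
    """
    sq_err = [init_sq_err for _ in eigenvalues]
    for t in range(n_steps):
        eta = step_size(t)
        sq_err = [(1.0 - eta * lam) ** 2 * v + eta ** 2 * sigma2 * lam
                  for lam, v in zip(eigenvalues, sq_err)]
    return 0.5 * sum(lam * v for lam, v in zip(eigenvalues, sq_err))


T = 10_000
kappa = 50.0
eigs = [1.0, 1.0 / kappa]        # two-dimensional, condition number kappa

n_phases = int(math.log2(T))     # assumed: halve the rate once per phase
phase_len = T // n_phases


def step_decay(t):
    return 1.0 / 2 ** min(t // phase_len, n_phases - 1)


def poly_decay(t):
    # One illustrative polynomial schedule a / (b + t) with a = b = kappa.
    return kappa / (kappa + t)


risk_step = expected_excess_risk(eigs, step_decay, T)
risk_poly = expected_excess_risk(eigs, poly_decay, T)
minimax = 1.0 * len(eigs) / T    # sigma^2 * d / T
print(risk_step, risk_poly, minimax)
```

On this instance the final-iterate risk under step decay comes out below that of the polynomial schedule, whose variance in the well-conditioned direction is set by the last, comparatively large, stepsize.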
3.2 Near optimality of Step Decay schemes
This section presents results on the Step Decay schedules. In particular, given the knowledge of an end time when the algorithm is terminated, the step decay learning rate schedule (Algorithm 1) offers significant improvements over standard polynomially decaying stepsize schemes, and obtains near minimax rates (off by only a factor).
We note that, while the above theorem presents significant improvements over standard polynomial decay (or constant learning rate schemes (Polyak and Juditsky, 1992; Bach and Moulines, 2013; Défossez and Bach, 2015; Jain et al., 2016)) with iterate averaging, the result presents a worse rate on the initial error (by a dimension factor) in the smooth case, compared to the best known result (Bach and Moulines, 2013), which relies heavily on iterate averaging to remove this factor. Whether this factor can actually be improved is an open question. The above result shows that the Step Decay scheme significantly improves over polynomial decay schemes, for which the variance of the final iterate has a polynomial dependence on the condition number. Furthermore, note that Algorithm 1 just requires access to (just as standard SGD for least squares (Bach and Moulines, 2013; Jain et al., 2016)) and the knowledge of the end time and doesn’t require access to the strong convexity parameter, in contrast to standard results for the strongly convex setting (e.g., Rakhlin et al. (2012); Shamir and Zhang (2012); Lacoste-Julien et al. (2012); Bubeck (2014)), which achieve rates given access to the strong convexity parameter (which is often harder to obtain in practice), and, more often, using iterate averaging. These results are off from statistical minimax rates achieved using iterate averaging (Kushner and Clark, 1978; Polyak and Juditsky, 1992) by only a factor. Note that this factor can be improved to a factor for the strongly convex quadratic case by using an additional polynomial decay scheme in the beginning before switching to the Step Decay scheme.
Note that in order to improve the dependence on the variance from (in theorem 2) to (in proposition 3), we do require access to the strong convexity parameter in addition to and knowledge of the end time . However, this is indeed the case even for standard analyses for the strongly convex setting, e.g., Rakhlin et al. (2012); Shamir and Zhang (2012); Lacoste-Julien et al. (2012); Bubeck (2014).
As a final remark, recall that our results in this section (on step decay schemes) assumed the knowledge of a fixed time horizon. In contrast, most results on SGD’s averaged iterate obtain anytime (i.e., limiting/infinite horizon) guarantees. Can we hope to achieve such guarantees with the final iterate?
3.3 SGD queries bad points infinitely often
Our main result in this section shows that obtaining near statistical minimax rates with the final iterate is not possible without knowledge of the time horizon . More concretely, we show the following limitation of SGD for the strongly convex quadratic case: for any learning rate sequence (be it polynomially decaying or step-decay), SGD must query a point with sub-optimality at least for infinitely many time steps .
4 Experimental Results
We present experimental validation of the suitability of the Step-decay schedule (or more precisely, its continuous counterpart, the exponentially decaying schedule), and compare it with the polynomially decaying stepsize schedules. In particular, we consider the use of:
Here, we perform a systematic grid search on the parameters and . In the section below, we consider a real-world non-convex optimization problem of training a residual network on the cifar-10 dataset, with the aim of illustrating the practical implications of the results described in the paper. Complete details of the setup are given in Appendix E.
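For concreteness, decay families of the kind compared in this section, together with a two-parameter grid, can be sketched as below. The functional forms, the parameter names `eta0` and `b`, and the grid values are hypothetical parametrizations chosen for illustration; they are not claimed to match the exact schemes or the grids used in the experiments.

```python
import math


# Hypothetical parametrizations; eta0 and b are the two tuned parameters.
def inverse_time_decay(eta0, b, t):
    return eta0 / (1.0 + b * t)


def inverse_sqrt_decay(eta0, b, t):
    return eta0 / (1.0 + b * math.sqrt(t))


def exponential_decay(eta0, b, t):
    return eta0 * math.exp(-b * t)


def grid(eta0_values, b_values):
    """All (eta0, b) pairs of a two-parameter grid search."""
    return [(eta0, b) for eta0 in eta0_values for b in b_values]


candidates = grid([0.01, 0.1, 1.0], [1e-4, 1e-3, 1e-2])
for eta0, b in candidates[:3]:
    print(eta0, b,
          inverse_time_decay(eta0, b, 1000),
          inverse_sqrt_decay(eta0, b, 1000),
          exponential_decay(eta0, b, 1000))
```

With the same pair of parameters, the exponential family reaches far smaller rates by the end of training than either polynomial family, which is why each family needs its own grid.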
4.1 Non-Convex Optimization: Training a Residual Net on cifar-10
Consider the task of training a 44-layer deep residual network (He et al., 2016b) with pre-activation blocks (He et al., 2016a) (dubbed preresnet-44) for the cifar-10 classification problem. The code for implementing the network can be found at https://github.com/D-X-Y/ResNeXt-DenseNet. For all experiments, we use Nesterov’s accelerated gradient method (Nesterov, 1983) implemented in PyTorch (https://github.com/pytorch) with a momentum set to and batchsize set to , training epochs, regularization set to .
Our experiments are based on grid searching for the best learning rate decay scheme on the parametric family of learning rate schemes described above (equations 7, 8, 9); all grid searches are performed on a separate validation set (obtained by setting aside one-tenth of the training dataset) and with models trained on the remaining samples. For presenting the final numbers in the plots/tables, we employ the best hyperparameters from the validation stage, train on the entire training set, and average results over runs with different random seeds. The parameters for grid searches and other details are presented in Appendix E. Furthermore, we always extend the grid so that the best performing grid search parameter lies in the interior of our grid search.
Comparison between different schemes: Figure 2 and Table 2 present a comparison of the performance of the three schemes (7)-(9). They demonstrate that the exponential scheme outperforms the polynomial step-size schemes.
| Decay Scheme | Train Function Value | Test error |
Hyperparameter selection using truncated runs: Figure 3 and Tables 3 and 4 present a comparison of the performance of three exponential decay schemes each of which has the best performance at , and epochs respectively. The key point to note is that the best performing hyperparameters at and epochs are not the best performing at epochs (which is made stark from the perspective of the validation error). This demonstrates that hyperparameter selection using truncated runs (e.g., in Hyperband (Li et al., 2017)) might necessitate rethinking.
5 Conclusions and Discussion
The main contribution of this work is to show that the issue of learning rate scheduling is far more nuanced than suggested by prior theoretical results: we do not even need to move to non-convex optimization to show that starkly different schemes can be far more effective than the standard polynomially decaying stepsizes considered in theory. This is important from a practical perspective in that the Step Decay schedule is widely used in practical SGD implementations for both convex and non-convex optimization.
Is quadratic loss minimization special? One may ask if there is something particularly special about why the minimax rates are different for quadratic loss minimization as opposed to more general convex (and non-convex) optimization problems? Ideally, we would hope that our theoretical results can be formalized in more general cases: this would serve as an exciting direction for future research. Interestingly, Allen-Zhu (2018) shows marked improvements for making gradient norm small (as opposed to function values, as considered in this paper), when working with stochastic gradients, for general function classes, with factors that appear similar to ones obtained in this work.
Acknowledgements: Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery, and the NSF Award 1740551, and the ONR award N00014-18-1-2247.
- Agarwal et al. (2012) A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 2012.
- Allen-Zhu (2018) Z. Allen-Zhu. How to make the gradients small stochastically. CoRR, abs/1801.02982, 2018.
- Anbar (1971) D. Anbar. On Optimal Estimation Methods Using Stochastic Approximation Procedures. University of California, 1971. URL http://books.google.com/books?id=MmpHJwAACAAJ.
- Bach and Moulines (2013) F. R. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In NIPS 26, 2013.
- Bharath and Borkar (1999) B. Bharath and V. S. Borkar. Stochastic approximation algorithms: overview and recent trends. Sādhanā, 1999.
- Bubeck (2014) S. Bubeck. Theory of convex optimization for machine learning. CoRR, abs/1405.4980, 2014.
- Défossez and Bach (2015) A. Défossez and F. R. Bach. Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. In Artificial Intelligence and Statistics (AISTATS), 2015.
- Dieuleveut et al. (2016) A. Dieuleveut, N. Flammarion, and F. R. Bach. Harder, better, faster, stronger convergence rates for least-squares regression. CoRR, abs/1602.05419, 2016.
- Duchi et al. (2011) J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
- Frostig et al. (2015) R. Frostig, R. Ge, S. M. Kakade, and A. Sidford. Competing with the empirical risk minimizer in a single pass. In COLT, 2015.
- Ghadimi and Lan (2012) S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 2012.
- Ghadimi and Lan (2013) S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 2013.
- Harvey et al. (2018) N. J. A. Harvey, C. Liaw, Y. Plan, and S. Randhawa. Tight analyses for non-smooth stochastic gradient descent. CoRR, 2018. URL http://arxiv.org/abs/1812.05217.
- He et al. (2016a) K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV (4), Lecture Notes in Computer Science, pages 630–645. Springer, 2016a.
- He et al. (2016b) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016b.
- Jain et al. (2016) P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Parallelizing stochastic approximation through mini-batching and tail-averaging. arXiv preprint arXiv:1610.03774, 2016.
- Jain et al. (2017a) P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, V. K. Pillutla, and A. Sidford. A markov chain theory approach to characterizing the minimax optimality of stochastic gradient descent (for least squares). CoRR, 2017a. URL http://arxiv.org/abs/1710.09430.
- Jain et al. (2017b) P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Accelerating stochastic gradient descent. arXiv preprint arXiv:1704.08227, 2017b.
- Kidambi et al. (2018) R. Kidambi, P. Netrapalli, P. Jain, and S. M. Kakade. On the insufficiency of existing momentum schemes for stochastic optimization. CoRR, 2018.
- Kingma and Ba (2014) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- Kushner and Clark (1978) H. J. Kushner and D. S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, 1978.
- Kushner and Yin (2003) H. J. Kushner and G. Yin. Stochastic approximation and recursive algorithms and applications. Springer-Verlag, 2003.
- Lacoste-Julien et al. (2012) S. Lacoste-Julien, M. W. Schmidt, and F. R. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. CoRR, 2012. URL http://arxiv.org/abs/1212.2002.
- Lai (2003) T. L. Lai. Stochastic approximation: invited paper, 2003.
- Lehmann and Casella (1998) E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer Texts in Statistics. Springer, 1998. ISBN 9780387985022.
- Li et al. (2017) L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765–6816, 2017.
- Ljung et al. (1992) L. Ljung, G. Pflug, and H. Walk. Stochastic Approximation and Optimization of Random Systems. Birkhauser Verlag, Basel, Switzerland, 1992. ISBN 3-7643-2733-2.
- Nemirovsky and Yudin (1983) A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley, 1983.
- Nesterov (1983) Y. E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN SSSR, 269, 1983.
- Polyak and Juditsky (1992) B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, volume 30, 1992.
- Raginsky and Rakhlin (2011) M. Raginsky and A. Rakhlin. Information-based complexity, feedback and dynamics in convex programming. IEEE Transactions on Information Theory, 2011.
- Rakhlin et al. (2012) A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.
- Robbins and Monro (1951) H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, vol. 22, 1951.
- Ruppert (1988) D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Tech. Report, ORIE, Cornell University, 1988.
- Shamir and Zhang (2012) O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. CoRR, abs/1212.1824, 2012.
- Sutskever et al. (2013) I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.
- Tieleman and Hinton (2012) T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 2012.
- Van der Vaart (2000) A. W. Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.
Appendix A Preliminaries
Before presenting the lemmas establishing the behavior of SGD under various learning rate schemes, we introduce some notation. We recall the SGD update rule:
We then write out the expression for the stochastic gradient .
where, given the stochastic gradient corresponding to an example , with , the above stochastic gradient expression naturally follows. Now, in order to analyze the contraction properties of the SGD update rule, we require the following notation:
[For e.g. Appendix A.2.2 from Jain et al. (2016)] Bias-Variance tradeoff: Running SGD for steps starting from and a stepsize sequence presents a final iterate whose excess risk is upper-bounded as:
where, and . Note that and , where, is the filtration formed by all samples until time .
One can view the contribution of the above two terms as ones stemming from SGD’s updates, which can be written as:
From the above equation, the result of the lemma follows straightforwardly. Now, clearly, if the noise and the inputs are independent of each other, and if the noise is zero mean i.e. , the above inequality holds with equality (without the factor of two). This is true more generally iff
For more details, refer to (Défossez and Bach, 2015).
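The equality case mentioned above can be checked numerically in a scalar model. The sketch assumes a single coordinate with curvature `lam` and additive, zero-mean noise that is independent of the iterate, so the expected squared error follows a linear recursion; the constants and step sizes are hypothetical. Under these assumptions, the error of the full process equals the error of a noiseless process started at the initial point plus that of a noise-driven process started at the solution, with no factor of two.

```python
def second_moment(lam, etas, v0, noise_var):
    """Iterate v <- (1 - eta*lam)^2 * v + eta^2 * noise_var over the step sizes."""
    v = v0
    for eta in etas:
        v = (1.0 - eta * lam) ** 2 * v + eta ** 2 * noise_var
    return v


lam, noise_var, v0 = 0.5, 2.0, 3.0
etas = [1.0 / (1 + t) for t in range(200)]   # hypothetical step sizes

total = second_moment(lam, etas, v0, noise_var)
# Bias: noiseless process started from the initial error.
bias = second_moment(lam, etas, v0, 0.0)
# Variance: noise-driven process started at the solution (zero error).
variance = second_moment(lam, etas, 0.0, noise_var)
print(total, bias + variance)
```

The split is exact here because the recursion is linear in the second moment; with multiplicative noise the two processes interact, which is where the factor of two in the lemma comes from.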
Now, in order to bound the total error, note that the original stochastic process associated with SGD’s updates can be decoupled into two (simpler) processes, one being the noiseless process (which corresponds to reducing the dependence on the initial error, and is termed “bias”), i.e., where, the recurrence evolves as:
The second recursion corresponds to the dependence on the noise (termed as variance), wherein, the process is initiated at the solution, i.e. and is driven by the noise . The update for this process corresponds to:
We represent by the covariance of the iterate of the bias process, i.e.,
The quantity that routinely shows up when bounding SGD’s convergence behavior is the covariance of the variance error, i.e. . This implies the following (simplified) expression for :
Firstly, note that this naturally implies that the sequence of covariances , initialized at (say), the solution, i.e., naturally grows to its steady state covariance, i.e.,
See lemma 3 of Jain et al. (2017a) for more details. Furthermore, what naturally follows in relating to is:
Lemma 6 (Lemma 5 of Jain et al. (2017a)).
Running SGD with a (constant) stepsize sequence achieves the following steady-state covariance:
Suppose , and . For any sequence of learning rates , then,
We will prove the lemma using an inductive argument. The base case, i.e. follows from the problem statement. Note also that for SGD, implying the statement naturally follows. If, say, satisfies the equation above, from equation 12, we have the following covariance for :
from which the lemma follows. ∎
(Reduction from Multiplicative noise oracle) Let be the (expected) covariance of the variance error. Then, the recursion that connects to can be expressed as:
Note: Basically, one could analyze an auxiliary process driven by noise with variance off by a factor of two and convert the analysis into one involving exact (deterministic) gradients.
[Bias decay - strongly convex case] Let the minimal eigenvalue of the Hessian . Consider the bias recursion as in equation 10 with the stepsize set as . Then,
The proof follows through straightforward computations:
where, the first line follows from the fact that and the result follows through the definition of . ∎
[Reduction of the bias recursion with multiplicative noise to one resembling the variance recursion] Consider the bias recursion that evolves as
Then, the following recursion holds :
The result follows owing to the following computations:
with the last inequality holding true if the squared distance to the optimum doesn’t grow as a part of the recursion. We prove that this indeed is the case below:
Recursively applying the above argument yields the desired result. ∎
Note: This result implies that the bias error (in the smooth non-strongly convex case of the least squares regression with multiplicative noise) can be bounded by employing a similar lemma as that of the variance, where one can look at the quantity as the analog of the variance that drives the process.
[Lower bounds on the additive noise oracle imply ones for the multiplicative noise oracle] Under the assumption that the covariance of noise , the following statement holds. Let be the (expected) covariance of the variance error. Then, the recursion that connects to can be expressed as:
Let us consider firstly, the setting of (bounded) additive noise. Here, we have: