Stochastic gradient descent is among the most commonly used practical algorithms for large-scale stochastic optimization. The seminal result of [9, 8] formalized this effectiveness, showing that for certain (locally quadratic) problems, asymptotically, stochastic gradient descent is statistically minimax optimal (provided the iterates are averaged). There are a number of more modern proofs [1, 3, 2, 5] of this fact, which provide finite rates of convergence. Other recent algorithms also achieve the statistically optimal minimax rate, with finite convergence rates.
This work provides a short proof of this minimax optimality for SGD for the special case of least squares through a characterization of SGD as a stochastic process. The proof builds on ideas developed in [2, 5].
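SGD for least squares. The expected square loss for a weight vector $w \in \mathbb{R}^d$ over input-output pairs $(x, y)$, where $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$ are sampled from a distribution $\mathcal{D}$, is:
$$L(w) = \frac{1}{2}\, \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[(y - w \cdot x)^2\big].$$
The optimal weight is denoted by:
$$w^* := \arg\min_{w} L(w).$$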
Assume the argmin is unique.
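Stochastic gradient descent proceeds as follows: at each iteration $t$, using an i.i.d. sample $(x_t, y_t) \sim \mathcal{D}$, the update of $w_t$ is:
$$w_t = w_{t-1} + \gamma\,(y_t - w_{t-1} \cdot x_t)\, x_t,$$
where $\gamma$ is a fixed stepsize.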
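The constant-stepsize update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes the loss carries a factor of 1/2 (so the stochastic gradient is the negative residual times the input), and the function name `sgd_least_squares`, the synthetic data, and the stepsize value are hypothetical choices made for the example.

```python
import numpy as np


def sgd_least_squares(X, y, stepsize, w0=None):
    """Run one pass of constant-stepsize SGD over the rows of (X, y).

    Returns an array of shape (n + 1, d) holding w_0, w_1, ..., w_n.
    """
    n, d = X.shape
    w = np.zeros(d) if w0 is None else np.asarray(w0, dtype=float).copy()
    iterates = [w.copy()]
    for x_t, y_t in zip(X, y):
        residual = y_t - w @ x_t           # prediction error on the fresh sample
        w = w + stepsize * residual * x_t  # stochastic gradient step
        iterates.append(w.copy())
    return np.stack(iterates)


# Synthetic, hypothetical data: y = w_true . x + small Gaussian noise.
rng = np.random.default_rng(0)
d, n = 3, 5000
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

iterates = sgd_least_squares(X, y, stepsize=0.05)
print(np.linalg.norm(iterates[-1] - w_true))
```

Each sample is used exactly once, which matches the i.i.d. sampling in the update; keeping all iterates is only for inspection and is not needed to run the method.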
Notation. For a symmetric positive definite matrix $A$ and a vector $x$, define:
$$\|x\|_A^2 := x^\top A x.$$
For a symmetric matrix , define the induced matrix norm under as:
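The statistically optimal rate. Using $n$ samples (and for large enough $n$), the minimax optimal rate is achieved by the maximum likelihood estimator (the MLE), or, equivalently, the empirical risk minimizer. Given $n$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{n}$, define
$$\hat{w}^{\mathrm{MLE}}_n := \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2}\,(y_i - w \cdot x_i)^2,$$
where $\hat{w}^{\mathrm{MLE}}_n$ denotes the MLE over the $n$ samples.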
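For least squares, the empirical risk minimizer is the ordinary least-squares solution, so it can be computed directly. The sketch below is illustrative only: the helper name `empirical_risk_minimizer` and the synthetic data are assumptions, and it relies on `numpy.linalg.lstsq` rather than any procedure from the paper.

```python
import numpy as np


def empirical_risk_minimizer(X, y):
    """Return the least-squares solution argmin_w (1/n) sum_i 0.5 * (y_i - w . x_i)^2."""
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat


# Synthetic, hypothetical data from a well-specified linear model.
rng = np.random.default_rng(1)
n, d = 2000, 3
w_true = np.array([0.5, 1.5, -1.0])
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

w_hat = empirical_risk_minimizer(X, y)
# At the minimizer the empirical gradient (1/n) X^T (X w - y) vanishes.
grad = X.T @ (X @ w_hat - y) / n
print(np.linalg.norm(grad))
```

The vanishing gradient is the first-order optimality condition, which is a convenient sanity check that the returned vector really minimizes the empirical risk.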
This rate can be characterized as follows: define
For the case of additive noise models (i.e. the “well-specified” case), the assumption is that , with being independent of . Here, it is straightforward to see that:
2 Statistical Risk Bounds
and so the optimal constant in the rate can be written as:
For the mis-specified case, it is helpful to define:
which can be viewed as a measure of how mis-specified the model is. Note if the model is well-specified, then .
Denote the average iterate, averaged from iteration to , by:
Suppose . The risk is bounded as:
The bias term (the first term) decays at a geometric rate (one can set or maintain multiple running averages if is not known in advance). If and the model is well-specified (
), then the variance term is, and the rate of the bias contraction is . If the model is not well specified, then using a smaller stepsize of , leads to the same minimax optimal rate (up to a constant factor of 2), albeit at a slower bias contraction rate. In the mis-specified case, an example in  shows that such a smaller stepsize is required in order to be within a constant factor of the minimax rate. An even smaller stepsize leads to a constant even closer to that of the optimal rate.
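The averaged iterate that the bound refers to, and the remark about maintaining running averages, can be sketched as follows. The indexing convention (a half-open range of iterates) and the helper names `tail_average` and `running_tail_average` are assumptions made for this illustration.

```python
import numpy as np


def tail_average(iterates, start, stop):
    """Average iterates[start], ..., iterates[stop - 1] (half-open range, an assumed convention)."""
    iterates = np.asarray(iterates, dtype=float)
    if not 0 <= start < stop <= len(iterates):
        raise ValueError("need 0 <= start < stop <= len(iterates)")
    return iterates[start:stop].mean(axis=0)


def running_tail_average(iterate_stream, start):
    """Online version: consume iterates one by one and keep only a running mean."""
    mean, count = None, 0
    for t, w in enumerate(iterate_stream):
        if t < start:
            continue  # skip the initial iterates, which still carry the bias
        count += 1
        w = np.asarray(w, dtype=float)
        mean = w.copy() if mean is None else mean + (w - mean) / count
    return mean


iterates = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
batch = tail_average(iterates, 1, 4)
online = running_tail_average(iterates, 1)
print(batch, online)
```

The online form needs memory for only one vector per running average, so several averages with different starting points can be maintained side by side when the total number of iterations is not fixed in advance.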
The analysis first characterizes a bias/variance decomposition, where the variance is bounded in terms of properties of the stationary covariance of . Then this asymptotic covariance matrix is analyzed.
3.1 The Bias-Variance Decomposition
The gradient at in iteration is:
which is a mean-zero quantity. Also define:
The update rule can be written as:
Roughly speaking, the above shows how the process on consists of a contraction along with an addition of a zero mean quantity.
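The "contraction plus noise" form of the update is a purely algebraic identity, so it can be checked numerically on a single step. In the sketch below the names `B` (the matrix identity minus stepsize times the outer product of the input) and `xi` (the gradient of the sample loss at the optimal weight) are assumptions chosen for illustration, as is the factor of 1/2 in the loss.

```python
import numpy as np

rng = np.random.default_rng(2)
d, stepsize = 4, 0.1
w_star = rng.normal(size=d)   # stands in for the optimal weight
w_prev = rng.normal(size=d)   # current iterate
x = rng.normal(size=d)
y = rng.normal()              # any label works: the identity is purely algebraic

# Plain SGD step on the sample (x, y).
w_next = w_prev + stepsize * (y - w_prev @ x) * x

# Same step, written as "contraction + noise" around w_star.
B = np.eye(d) - stepsize * np.outer(x, x)   # random contraction matrix
xi = -(y - w_star @ x) * x                  # gradient of the sample loss at w_star
error_next = B @ (w_prev - w_star) - stepsize * xi

print(np.allclose(w_next - w_star, error_next))
```

The check holds for any reference point in place of `w_star`; what is special about the true minimizer is that the noise term then has mean zero, which this single-step check does not test.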
It is helpful to consider a certain bias and variance decomposition. Let us write:
(The first conditional expectation slightly abuses notation and should be taken as a definition; the abuse is that the right-hand side drops the conditioning.)
The error is bounded as:
Bias. The bias term is characterized as follows:
For all ,
Assume for all . Observe:
which completes the proof. ∎
Variance. Now suppose . Define the covariance matrix:
Using the recursion, ,
which follows from:
(these hold since is mean zero and both and are independent of ).
Suppose . There exists a unique such that:
Using that is mean zero and independent of and for ,
Now using that and that and are independent (for ),
which proves .
To prove the limit exists, it suffices to first argue the trace of is uniformly bounded from above, for all . By taking the trace of the update rule, Equation 2, for ,
and, using ,
proving the uniform boundedness of the trace of . Now, for any fixed , the limit of exists, by the monotone convergence theorem. From this, it follows that every entry of the matrix converges. ∎
double counting the diagonal terms . For , . To see why, consider the recursion and take expectations to get since the sample is independent of the . From this,
Notice that for any non-negative integer . Since and , because the product of two commuting PSD matrices is PSD. Also note that for PSD matrices , . Hence,
from lemma 3 where followed from
and the series converges because . ∎
3.2 Stationary Distribution Analysis
Define two linear operators on symmetric matrices, and — where and can be viewed as matrices acting on dimensions — as follows:
With this, is the solution to:
(due to Equation 3).
(Crude bound) is bounded as:
Define one more linear operator as follows:
The inverse of this operator can be written as:
which exists since the sum converges due to the fact that .
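A series representation of an operator inverse of this kind can be checked numerically. The sketch below is an illustration under stated assumptions, not a transcription of the paper's definition: it assumes the operator maps a symmetric matrix M to HM + MH minus stepsize times HMH, where H is a symmetric positive definite matrix, and that the stepsize is small enough for the identity minus stepsize times H to have eigenvalues in [0, 1). The function names are hypothetical.

```python
import numpy as np


def apply_operator(H, M, stepsize):
    """Assumed operator: M -> H M + M H - stepsize * H M H."""
    return H @ M + M @ H - stepsize * H @ M @ H


def apply_inverse_by_series(H, M, stepsize, num_terms=2000):
    """Truncated series stepsize * sum_t P^t M P^t with P = I - stepsize * H."""
    P = np.eye(H.shape[0]) - stepsize * H
    term = M.copy()
    total = np.zeros_like(M)
    for _ in range(num_terms):
        total += term
        term = P @ term @ P   # next term P^(t+1) M P^(t+1)
    return stepsize * total


rng = np.random.default_rng(3)
d = 3
A = rng.normal(size=(d, d))
H = A @ A.T / d + 0.5 * np.eye(d)              # symmetric positive definite
stepsize = 0.5 / np.linalg.eigvalsh(H).max()   # keeps eigenvalues of I - stepsize*H in [0.5, 1)
M = rng.normal(size=(d, d))
M = (M + M.T) / 2                              # symmetric test matrix

X = apply_inverse_by_series(H, M, stepsize)
print(np.allclose(apply_operator(H, X, stepsize), M))
```

The series telescopes: with P as above, X minus PXP equals stepsize times M, and expanding PXP shows that the left-hand side is stepsize times the operator applied to X. Convergence of the sum is exactly where the eigenvalue condition on P is used.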
A few inequalities are helpful: If , then
(which follows since ). Also, if , then
The following inequality is also of use:
which completes the proof. ∎
(Refined bound) The is bounded as:
3.3 Completing the proof of Theorem 1
The proof of the theorem is completed by applying the developed lemmas. For the bias term, using convexity leads to:
For the variance term, observe that
which completes the proof. ∎
Sham Kakade acknowledges funding from the Washington Research Foundation Fund for Innovation in Data-Intensive Discovery. The authors also thank Zaid Harchaoui for helpful discussions.
References
[1] Francis R. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research (JMLR), volume 15, 2014.
[2] Alexandre Défossez and Francis R. Bach. Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. In AISTATS, volume 38, 2015.
[3] Aymeric Dieuleveut and Francis R. Bach. Non-parametric stochastic approximation with large step sizes. The Annals of Statistics, 2015.
[4] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Competing with the empirical risk minimizer in a single pass. In COLT, 2015.
[5] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic approximation through mini-batching and tail-averaging. CoRR, abs/1610.03774, 2016.
[6] Harold J. Kushner and Dean S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, 1978.
[7] Erich L. Lehmann and George Casella. Theory of Point Estimation. Springer Texts in Statistics. Springer, 1998.
[8] Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, volume 30, 1992.
[9] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Tech. Report, ORIE, Cornell University, 1988.
[10] Aad W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.