# A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares)

This work provides a simplified proof of the statistical minimax optimality of (iterate averaged) stochastic gradient descent (SGD), for the special case of least squares. This result is obtained by analyzing SGD as a stochastic process and by sharply characterizing the stationary covariance matrix of this process. The finite rate optimality characterization captures the constant factors and addresses model mis-specification.


## 1 Introduction

Stochastic gradient descent is among the most commonly used practical algorithms for large scale stochastic optimization. The seminal results of [9, 8] formalized this effectiveness, showing that for certain (locally quadratic) problems, asymptotically, stochastic gradient descent is statistically minimax optimal (provided the iterates are averaged). There are a number of more modern proofs [1, 3, 2, 5] of this fact, which provide finite rates of convergence. Other recent algorithms also achieve the statistically optimal minimax rate, with finite convergence rates [4].

This work provides a short proof of this minimax optimality for SGD for the special case of least squares through a characterization of SGD as a stochastic process. The proof builds on ideas developed in [2, 5].

SGD for least squares. The expected square loss of $w \in \mathbb{R}^d$ over input-output pairs $(x, y)$, where $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$ are sampled from a distribution $\mathcal{D}$, is:

$$L(w) = \frac{1}{2}\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[(y - w\cdot x)^2\big].$$

The optimal weight is denoted by:

$$w_* := \arg\min_w L(w).$$

Assume the argmin is unique.

Stochastic gradient descent proceeds as follows: at each iteration $t$, using an i.i.d. sample $(x_t, y_t) \sim \mathcal{D}$, the update of $w_t$ is:

$$w_t = w_{t-1} + \gamma\,(y_t - w_{t-1}\cdot x_t)\,x_t,$$

where $\gamma$ is a fixed stepsize.
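As a concrete reference point, the update above can be sketched in a few lines of numpy. This is a minimal illustration rather than the paper's code; the data distribution, stepsize, and function name are our choices:

```python
import numpy as np

def sgd_least_squares(xs, ys, gamma, w0=None):
    """Constant-stepsize SGD on the square loss, one pass over the data.

    xs: (n, d) array of inputs; ys: (n,) array of outputs.
    Returns the full iterate sequence w_0, w_1, ..., w_n as an (n+1, d) array.
    """
    n, d = xs.shape
    w = np.zeros(d) if w0 is None else np.asarray(w0, dtype=float).copy()
    iterates = [w.copy()]
    for x, y in zip(xs, ys):
        # w_t = w_{t-1} + gamma * (y_t - w_{t-1} . x_t) x_t
        w = w + gamma * (y - w @ x) * x
        iterates.append(w.copy())
    return np.array(iterates)
```

The estimator analyzed below is not the last iterate but an average of the tail of this sequence.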

Notation. For a symmetric positive definite matrix $A$ and a vector $x$, define:

$$\|x\|_A^2 := x^\top A x.$$

For a symmetric matrix $M$, define the induced matrix norm under $A$ as:

$$\|M\|_A := \max_{\|v\|=1}\frac{v^\top M v}{v^\top A v} = \|A^{-1/2} M A^{-1/2}\|.$$

The statistically optimal rate. Using $n$ samples (and for $n$ large enough), the minimax optimal rate is achieved by the maximum likelihood estimator (the MLE), or, equivalently, the empirical risk minimizer. Given $n$ i.i.d. samples $(x_1, y_1), \ldots, (x_n, y_n)$, define

$$\hat{w}^{\mathrm{MLE}}_n := \arg\min_w\,\frac{1}{n}\sum_{i=1}^n \frac{1}{2}(y_i - w\cdot x_i)^2,$$

where $\hat{w}^{\mathrm{MLE}}_n$ denotes the MLE estimator over the $n$ samples.
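For comparison with SGD later on, note that this ERM/MLE is an ordinary linear least-squares problem with a direct solution; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def erm_least_squares(xs, ys):
    """Empirical risk minimizer: argmin_w (1/n) sum_i (1/2)(y_i - w . x_i)^2."""
    # np.linalg.lstsq minimizes ||xs @ w - ys||^2, which has the same argmin.
    w_hat, *_ = np.linalg.lstsq(xs, ys, rcond=None)
    return w_hat
```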

This rate can be characterized as follows: define

$$\sigma^2_{\mathrm{MLE}} := \frac{1}{2}\,\mathbb{E}\big[(y - w_*\cdot x)^2\,\|x\|^2_{H^{-1}}\big],$$

and the (asymptotic) rate of the MLE is $\sigma^2_{\mathrm{MLE}}/n$ [7, 10]. Precisely,

$$\lim_{n\to\infty}\frac{\mathbb{E}\big[L(\hat{w}^{\mathrm{MLE}}_n)\big] - L(w_*)}{\sigma^2_{\mathrm{MLE}}/n} = 1.$$

The works of [9, 8] proved that a certain averaged stochastic gradient method achieves this minimax rate, in the limit.

For the case of additive noise models (i.e. the “well-specified” case), the assumption is that $y = w_*\cdot x + \eta$, with the noise $\eta$ being independent of $x$. Here, with $\sigma^2 := \mathbb{E}[\eta^2]$, it is straightforward to see that:

$$\frac{\sigma^2_{\mathrm{MLE}}}{n} = \frac{1}{2}\,\frac{d\,\sigma^2}{n}.$$

The rate of $\sigma^2_{\mathrm{MLE}}/n$ is still minimax optimal even among mis-specified models, where the additive noise assumption may not hold [6, 7, 10].
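The well-specified identity above is easy to check by Monte Carlo. The sketch below uses a standard Gaussian design (so $H = I$), which is our choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, n_mc = 4, 0.5, 200_000

# Well-specified model: y = w_* . x + eta, with eta independent of x.
xs = rng.normal(size=(n_mc, d))          # standard Gaussian design: H = I
etas = sigma * rng.normal(size=n_mc)

# sigma^2_MLE = (1/2) E[(y - w_* . x)^2 ||x||^2_{H^{-1}}] = (1/2) E[eta^2 ||x||^2]
sq_norms = (xs ** 2).sum(axis=1)         # ||x||^2_{H^{-1}} with H = I
sigma2_mle = 0.5 * np.mean(etas ** 2 * sq_norms)

print(sigma2_mle, 0.5 * d * sigma ** 2)  # both close to d * sigma^2 / 2 = 0.5
```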

Assumptions. Assume the fourth moment of $x$ is finite. Denote the second moment matrix of $x$ as

$$H := \mathbb{E}\big[x x^\top\big],$$

and suppose $H$ is strictly positive definite with minimal eigenvalue:

$$\mu := \sigma_{\min}(H).$$

Define $R^2$ as the smallest value which satisfies:

$$\mathbb{E}\big[\|x\|^2\,x x^\top\big] \preceq R^2\,\mathbb{E}\big[x x^\top\big].$$

This implies $\mathrm{Tr}(H) \le R^2$.
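These quantities can be sanity-checked empirically. The sketch below (Gaussian design, our choice) estimates $H$ and $\mu$ from samples and tests the $R^2$ condition in the PSD order; for a standard Gaussian in $d$ dimensions, $\mathbb{E}[\|x\|^2 x x^\top] = (d+2)\,I$, so any $R^2 \ge d+2$ works:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_mc = 3, 100_000
xs = rng.normal(size=(n_mc, d))                 # standard Gaussian design

H = xs.T @ xs / n_mc                            # estimate of E[x x^T]
mu = np.linalg.eigvalsh(H).min()                # estimate of sigma_min(H)

sq_norms = (xs ** 2).sum(axis=1)
M4 = (xs.T * sq_norms) @ xs / n_mc              # estimate of E[||x||^2 x x^T]

def satisfies_R2(R2):
    """Empirically check E[||x||^2 x x^T] <= R2 * E[x x^T] in the PSD order."""
    return np.linalg.eigvalsh(R2 * H - M4).min() >= 0

print(mu, satisfies_R2(d + 2.5), satisfies_R2(1.0))
```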

## 2 Statistical Risk Bounds

Define:

$$\Sigma := \mathbb{E}\big[(y - w_*\cdot x)^2\,x x^\top\big],$$

and so the optimal constant in the rate can be written as:

$$\sigma^2_{\mathrm{MLE}} = \frac{1}{2}\,\mathrm{Tr}(H^{-1}\Sigma) = \frac{1}{2}\,\mathbb{E}\big[(y - w_*\cdot x)^2\,\|x\|^2_{H^{-1}}\big].$$

For the mis-specified case, it is helpful to define:

$$\rho_{\mathrm{misspec}} := \frac{d\,\|\Sigma\|_H}{\mathrm{Tr}(H^{-1}\Sigma)},$$

which can be viewed as a measure of how mis-specified the model is. Note that if the model is well-specified, then $\rho_{\mathrm{misspec}} = 1$.

Denote the average iterate, averaged from iteration $t$ to $T$, by:

$$\bar{w}_{t:T} := \frac{1}{T-t}\sum_{t'=t}^{T-1} w_{t'}.$$
###### Theorem 1.

Suppose $\gamma < \frac{1}{R^2}$. The risk is bounded as:

$$\mathbb{E}\big[L(\bar{w}_{t:T})\big] - L(w_*) \le \left(\sqrt{\frac{1}{2}\,\exp(-\gamma\mu t)\,R^2\,\|w_0-w_*\|^2} + \sqrt{\Big(1 + \frac{\gamma R^2}{1-\gamma R^2}\,\rho_{\mathrm{misspec}}\Big)\,\frac{\sigma^2_{\mathrm{MLE}}}{T-t}}\right)^2.$$

The bias term (the first term) decays at a geometric rate (one can set $t = T/2$, or maintain multiple running averages if $T$ is not known in advance). If $\gamma = \frac{1}{2R^2}$ and the model is well-specified ($\rho_{\mathrm{misspec}} = 1$), then the variance term is $\frac{2\sigma^2_{\mathrm{MLE}}}{T-t}$, and the rate of the bias contraction is $\exp(-\frac{\mu t}{2R^2})$. If the model is not well specified, then using a smaller stepsize of $\gamma = \frac{1}{2\rho_{\mathrm{misspec}}R^2}$ leads to the same minimax optimal rate (up to a constant factor of $2$), albeit at a slower bias contraction rate. In the mis-specified case, an example in [5] shows that such a smaller stepsize is required in order to be within a constant factor of the minimax rate. An even smaller stepsize leads to a constant even closer to that of the optimal rate.

## 3 Analysis

The analysis first characterizes a bias/variance decomposition, where the variance is bounded in terms of properties of the stationary covariance of the iterates $w_t$. Then this asymptotic covariance matrix is analyzed.

Throughout assume:

$$\gamma < \frac{1}{R^2}.$$

### 3.1 The Bias-Variance Decomposition

The (stochastic) gradient at $w_*$ in iteration $t$ is:

$$\xi_t := -(y_t - w_*\cdot x_t)\,x_t,$$

which is a mean zero quantity. Also define:

$$B_t := I - \gamma\,x_t x_t^\top.$$

The update rule can be written as:

$$\begin{aligned}
w_t - w_* &= w_{t-1} - w_* + \gamma\,(y_t - w_{t-1}\cdot x_t)\,x_t \\
&= (I - \gamma\,x_t x_t^\top)(w_{t-1}-w_*) - \gamma\xi_t \\
&= B_t(w_{t-1}-w_*) - \gamma\xi_t.
\end{aligned}$$

Roughly speaking, the above shows how the process on $w_t - w_*$ consists of a contraction along with an addition of the zero mean quantity $-\gamma\xi_t$.

From recursion,

$$w_t - w_* = B_t\cdots B_1(w_0-w_*) - \gamma\big(\xi_t + B_t\xi_{t-1} + \cdots + B_t\cdots B_2\,\xi_1\big). \qquad (1)$$
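Equation 1 can be checked mechanically for a few iterations. The sketch below (synthetic data; the setup is our choice) unrolls the product form and compares it to the SGD recursion:

```python
import numpy as np

rng = np.random.default_rng(4)
d, t_max, gamma = 3, 6, 0.05
w_star = rng.normal(size=d)
w0 = rng.normal(size=d)

# Samples (x_1, y_1), ..., (x_{t_max}, y_{t_max}); index 0 is unused padding.
xs = rng.normal(size=(t_max + 1, d))
ys = xs @ w_star + 0.1 * rng.normal(size=t_max + 1)

# Run the SGD recursion directly.
w = w0.copy()
for t in range(1, t_max + 1):
    w = w + gamma * (ys[t] - w @ xs[t]) * xs[t]

# Rebuild w_t - w_* from the expanded form of Equation 1.
I = np.eye(d)
B = [None] + [I - gamma * np.outer(xs[t], xs[t]) for t in range(1, t_max + 1)]
xi = [None] + [-(ys[t] - w_star @ xs[t]) * xs[t] for t in range(1, t_max + 1)]

prod = w0 - w_star
for t in range(1, t_max + 1):
    prod = B[t] @ prod                    # B_t ... B_1 (w_0 - w_*)

noise = np.zeros(d)
for tau in range(1, t_max + 1):
    term = xi[tau]
    for s in range(tau + 1, t_max + 1):
        term = B[s] @ term                # B_t ... B_{tau+1} xi_tau
    noise += term

err = prod - gamma * noise
print(np.max(np.abs((w - w_star) - err)))  # agrees up to floating-point error
```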

It is helpful to consider a certain bias and variance decomposition. Let us write:

$$\mathbb{E}\big[\|\bar{w}_{t:T}-w_*\|_H^2 \,\big|\, \xi_0=\cdots=\xi_T=0\big] := \frac{1}{(T-t)^2}\,\mathbb{E}\Bigg[\bigg\|\sum_{\tau=t}^{T-1} B_\tau\cdots B_1(w_0-w_*)\bigg\|_H^2\Bigg],$$

and let $\mathbb{E}\big[\|\bar{w}_{t:T}-w_*\|_H^2 \,\big|\, w_0=w_*\big]$ denote the error of the process started at $w_0 = w_*$, so that it is driven purely by the noise terms $\xi_\tau$. (The first conditional expectation notation slightly abuses notation and should be taken as a definition; the abuse is due to the fact that the right hand side drops the conditioning.)

###### Lemma 1.

The error is bounded as:

$$\mathbb{E}\big[L(\bar{w}_{t:T})\big] - L(w_*) \le \frac{1}{2}\left(\sqrt{\mathbb{E}\big[\|\bar{w}_{t:T}-w_*\|_H^2 \,\big|\, \xi_0=\cdots=\xi_T=0\big]} + \sqrt{\mathbb{E}\big[\|\bar{w}_{t:T}-w_*\|_H^2 \,\big|\, w_0=w_*\big]}\right)^2.$$
###### Proof.

Equation 1 implies that:

$$\bar{w}_{t:T}-w_* = \frac{1}{T-t}\sum_{\tau=t}^{T-1} B_\tau\cdots B_1(w_0-w_*) - \frac{\gamma}{T-t}\sum_{\tau=t}^{T-1}\big(\xi_\tau + B_\tau\xi_{\tau-1} + \cdots + B_\tau\cdots B_2\,\xi_1\big).$$

Now observe that, for vector valued random variables $u$ and $v$, the triangle inequality for the norm $\sqrt{\mathbb{E}\|\cdot\|_H^2}$ implies

$$\mathbb{E}\|u+v\|_H^2 \le \Big(\sqrt{\mathbb{E}\|u\|_H^2} + \sqrt{\mathbb{E}\|v\|_H^2}\Big)^2;$$

the proof of the lemma follows by noting that $\mathbb{E}\big[L(\bar{w}_{t:T})\big] - L(w_*) = \frac{1}{2}\,\mathbb{E}\|\bar{w}_{t:T}-w_*\|_H^2$. ∎

Bias. The bias term is characterized as follows:

###### Lemma 2.

For all $t$,

$$\mathbb{E}\big[\|w_t-w_*\|^2 \,\big|\, \xi_0=\cdots=\xi_T=0\big] \le \exp(-\gamma\mu t)\,\|w_0-w_*\|^2.$$
###### Proof.

Assume $\xi_\tau = 0$ for all $\tau$, so that $w_t - w_* = B_t(w_{t-1}-w_*)$. Observe:

$$\begin{aligned}
\mathbb{E}\|w_t-w_*\|^2 &= \mathbb{E}\Big[\|w_{t-1}-w_*\|^2 - 2\gamma\,(w_{t-1}-w_*)^\top\mathbb{E}[xx^\top](w_{t-1}-w_*) \\
&\qquad\qquad + \gamma^2\,(w_{t-1}-w_*)^\top\mathbb{E}\big[\|x\|^2 xx^\top\big](w_{t-1}-w_*)\Big] \\
&\le \mathbb{E}\Big[\|w_{t-1}-w_*\|^2 - 2\gamma\,\|w_{t-1}-w_*\|_H^2 + \gamma^2 R^2\,\|w_{t-1}-w_*\|_H^2\Big] \\
&\le \mathbb{E}\|w_{t-1}-w_*\|^2 - \gamma\,\mathbb{E}\|w_{t-1}-w_*\|_H^2 \\
&\le (1-\gamma\mu)\,\mathbb{E}\|w_{t-1}-w_*\|^2,
\end{aligned}$$

where the second-to-last step uses $\gamma R^2 \le 1$. Iterating and using $(1-\gamma\mu)^t \le \exp(-\gamma\mu t)$ completes the proof. ∎

Variance. Now suppose $w_0 = w_*$. Define the covariance matrix:

$$C_t := \mathbb{E}\big[(w_t-w_*)(w_t-w_*)^\top \,\big|\, w_0=w_*\big].$$

Using the recursion $w_{t+1}-w_* = B_{t+1}(w_t-w_*) - \gamma\xi_{t+1}$,

$$C_{t+1} = C_t - \gamma HC_t - \gamma C_tH + \gamma^2\,\mathbb{E}\big[(x^\top C_tx)\,xx^\top\big] + \gamma^2\Sigma \qquad (2)$$

which follows from:

$$\mathbb{E}\big[(w_t-w_*)\xi_{t+1}^\top\big] = 0, \quad\text{and}\quad \mathbb{E}\big[(x_{t+1}x_{t+1}^\top)(w_t-w_*)\xi_{t+1}^\top\big] = 0$$

(these hold since $\xi_{t+1}$ is mean zero, both $x_{t+1}$ and $\xi_{t+1}$ are independent of $w_t$, and $\mathbb{E}[w_t-w_*] = 0$ when $w_0=w_*$).

###### Lemma 3.

Suppose $\gamma < 1/R^2$. There exists a unique $C_\infty$ such that:

$$0 = C_0 \preceq C_1 \preceq \cdots \preceq C_\infty,$$

where $C_\infty$ satisfies:

$$C_\infty = C_\infty - \gamma HC_\infty - \gamma C_\infty H + \gamma^2\,\mathbb{E}\big[(x^\top C_\infty x)\,xx^\top\big] + \gamma^2\Sigma. \qquad (3)$$
###### Proof.

By recursion (with $w_0 = w_*$),

$$w_t-w_* = B_t(w_{t-1}-w_*) - \gamma\xi_t = -\gamma\big(\xi_t + B_t\xi_{t-1} + \cdots + B_t\cdots B_2\,\xi_1\big).$$

Using that each $\xi_\tau$ is mean zero and independent of $\xi_{\tau'}$ and $B_{\tau'}$ for $\tau' \neq \tau$, the cross terms vanish and

$$C_t = \gamma^2\Big(\mathbb{E}[\xi_t\xi_t^\top] + \mathbb{E}[B_t\xi_{t-1}\xi_{t-1}^\top B_t^\top] + \cdots + \mathbb{E}[B_t\cdots B_2\,\xi_1\xi_1^\top B_2^\top\cdots B_t^\top]\Big).$$

Now using that $\mathbb{E}[\xi_\tau\xi_\tau^\top] = \Sigma$, that the $B_\tau$ are i.i.d., and that $B_\tau$ and $\xi_{\tau'}$ are independent (for $\tau \neq \tau'$),

$$C_t = \gamma^2\Big(\Sigma + \mathbb{E}[B_2\Sigma B_2^\top] + \cdots + \mathbb{E}[B_t\cdots B_2\,\Sigma\,B_2^\top\cdots B_t^\top]\Big) = C_{t-1} + \gamma^2\,\mathbb{E}\big[B_t\cdots B_2\,\Sigma\,B_2^\top\cdots B_t^\top\big],$$

which proves $C_{t-1} \preceq C_t$.

To prove the limit exists, it suffices to first argue that the trace of $C_t$ is uniformly bounded from above, for all $t$. Taking the trace of the update rule, Equation 2,

$$\mathrm{Tr}(C_{t+1}) = \mathrm{Tr}(C_t) - 2\gamma\,\mathrm{Tr}(HC_t) + \gamma^2\,\mathrm{Tr}\big(\mathbb{E}[(x^\top C_tx)\,xx^\top]\big) + \gamma^2\,\mathrm{Tr}(\Sigma).$$

Observe:

$$\mathrm{Tr}\big(\mathbb{E}[(x^\top C_tx)\,xx^\top]\big) = \mathbb{E}\big[(x^\top C_tx)\,\|x\|^2\big] = \mathrm{Tr}\big(C_t\,\mathbb{E}[\|x\|^2 xx^\top]\big) \le R^2\,\mathrm{Tr}(C_tH) \qquad (4)$$

and, using $\gamma R^2 \le 1$ and $\mathrm{Tr}(HC_t) \ge \mu\,\mathrm{Tr}(C_t)$,

$$\mathrm{Tr}(C_{t+1}) \le \mathrm{Tr}(C_t) - \gamma\,\mathrm{Tr}(HC_t) + \gamma^2\,\mathrm{Tr}(\Sigma) \le (1-\gamma\mu)\,\mathrm{Tr}(C_t) + \gamma^2\,\mathrm{Tr}(\Sigma) \le \frac{\gamma\,\mathrm{Tr}(\Sigma)}{\mu},$$

where the last inequality follows by induction, proving the uniform boundedness of the trace of $C_t$. Now, for any fixed vector $v$, the limit of $v^\top C_tv$ exists, since the sequence is non-decreasing and bounded above (by the monotone convergence theorem). From this, it follows that every entry of the matrix $C_t$ converges. ∎
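Lemma 3 also suggests a way to compute $C_\infty$ numerically: iterate the recursion (2) from $C_0 = 0$ until it stabilizes. The sketch below does this for a Gaussian design, where $\mathbb{E}[(x^\top Mx)\,xx^\top]$ has the closed form $2HMH + \mathrm{Tr}(MH)\,H$ (a standard Gaussian moment identity), so no sampling is needed; the setup is our choice:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 3
A = rng.normal(size=(d, d))
H = A @ A.T / d + 0.5 * np.eye(d)       # a positive definite second-moment matrix
Sigma = 0.25 * H                         # well-specified case: Sigma = sigma^2 H
R2 = 2 * np.linalg.eigvalsh(H).max() + np.trace(H)  # a valid R^2 for x ~ N(0, H)
gamma = 0.5 / R2

def S(M):
    """E[(x^T M x) x x^T] for x ~ N(0, H): equals 2 H M H + Tr(M H) H."""
    return 2 * H @ M @ H + np.trace(M @ H) * H

# Iterate Equation (2) from C_0 = 0 until it stabilizes.
C = np.zeros((d, d))
for _ in range(20_000):
    C = C - gamma * (H @ C + C @ H) + gamma ** 2 * S(C) + gamma ** 2 * Sigma

# Check stationarity, Equation (3): H C + C H = gamma S(C) + gamma Sigma.
lhs = H @ C + C @ H
rhs = gamma * S(C) + gamma * Sigma
print(np.max(np.abs(lhs - rhs)))
```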

###### Lemma 4.

Define:

$$\bar{w}_T := \frac{1}{T}\sum_{t=0}^{T-1} w_t.$$

Then:

$$\frac{1}{2}\,\mathbb{E}\big[\|\bar{w}_T-w_*\|_H^2 \,\big|\, w_0=w_*\big] \le \frac{\mathrm{Tr}(C_\infty)}{\gamma T}.$$
###### Proof.

Note

$$\begin{aligned}
\mathbb{E}\big[(\bar{w}_T-w_*)(\bar{w}_T-w_*)^\top \,\big|\, w_0=w_*\big] &= \frac{1}{T^2}\sum_{t=0}^{T-1}\sum_{t'=0}^{T-1}\mathbb{E}\big[(w_t-w_*)(w_{t'}-w_*)^\top \,\big|\, w_0=w_*\big] \\
&\preceq \frac{1}{T^2}\sum_{t=0}^{T-1}\sum_{t'=t}^{T-1}\Big(\mathbb{E}\big[(w_t-w_*)(w_{t'}-w_*)^\top \,\big|\, w_0=w_*\big] \\
&\qquad\qquad + \mathbb{E}\big[(w_{t'}-w_*)(w_t-w_*)^\top \,\big|\, w_0=w_*\big]\Big),
\end{aligned}$$

double counting the diagonal terms $t = t'$ (each of which is positive semidefinite). For $t' \ge t$,

$$\mathbb{E}\big[(w_{t'}-w_*)(w_t-w_*)^\top \,\big|\, w_0=w_*\big] = (I-\gamma H)^{t'-t}\,C_t.$$

To see why, consider the recursion $w_{t'}-w_* = B_{t'}(w_{t'-1}-w_*) - \gamma\xi_{t'}$ and take expectations to get $\mathbb{E}\big[(w_{t'}-w_*)(w_t-w_*)^\top\big] = (I-\gamma H)\,\mathbb{E}\big[(w_{t'-1}-w_*)(w_t-w_*)^\top\big]$, since the sample $(x_{t'}, y_{t'})$ is independent of the past. From this,

$$\mathbb{E}\big[(\bar{w}_T-w_*)(\bar{w}_T-w_*)^\top \,\big|\, w_0=w_*\big] \preceq \frac{1}{T^2}\sum_{t=0}^{T-1}\sum_{\tau=0}^{T-t-1}\Big((I-\gamma H)^\tau C_t + C_t(I-\gamma H)^\tau\Big),$$

and so,

$$\mathbb{E}\big[\|\bar{w}_T-w_*\|_H^2 \,\big|\, w_0=w_*\big] = \mathrm{Tr}\Big(H\,\mathbb{E}\big[(\bar{w}_T-w_*)(\bar{w}_T-w_*)^\top \,\big|\, w_0=w_*\big]\Big) \le \frac{1}{T^2}\sum_{t=0}^{T-1}\sum_{\tau=0}^{T-t-1}\Big(\mathrm{Tr}\big(H(I-\gamma H)^\tau C_t\big) + \mathrm{Tr}\big(C_t(I-\gamma H)^\tau H\big)\Big).$$

Notice that $H(I-\gamma H)^\tau \succeq 0$ for any non-negative integer $\tau$, since $H \succeq 0$ and $I-\gamma H \succeq 0$, because the product of two commuting PSD matrices is PSD. Also note that for PSD matrices $A$ and $B$, $\mathrm{Tr}(AB) \ge 0$. Hence,

$$\begin{aligned}
\mathbb{E}\big[\|\bar{w}_T-w_*\|_H^2 \,\big|\, w_0=w_*\big] &\le \frac{2}{T^2}\sum_{t=0}^{T-1}\sum_{\tau=0}^{\infty}\mathrm{Tr}\big(H(I-\gamma H)^\tau C_t\big) \\
&= \frac{2}{T^2}\sum_{t=0}^{T-1}\mathrm{Tr}\Big(H\Big(\sum_{\tau=0}^{\infty}(I-\gamma H)^\tau\Big)C_t\Big) \\
&= \frac{2}{T^2}\sum_{t=0}^{T-1}\mathrm{Tr}\big(H(\gamma H)^{-1}C_t\big) \qquad (*) \\
&= \frac{2}{\gamma T^2}\sum_{t=0}^{T-1}\mathrm{Tr}(C_t) \\
&\le \frac{2}{\gamma T}\,\mathrm{Tr}(C_\infty),
\end{aligned}$$

where the last step uses $C_t \preceq C_\infty$ from Lemma 3, and where $(*)$ followed from

$$(\gamma H)^{-1} = \big(I-(I-\gamma H)\big)^{-1} = \sum_{\tau=0}^{\infty}(I-\gamma H)^\tau,$$

and the series converges because $\|I-\gamma H\| < 1$. ∎

### 3.2 Stationary Distribution Analysis

Define two linear operators on symmetric matrices, $\mathcal{S}$ and $\mathcal{T}$ — both of which can be viewed as matrices acting on $d^2$ dimensions — as follows:

$$\mathcal{S}\circ M := \mathbb{E}\big[(x^\top Mx)\,xx^\top\big], \qquad \mathcal{T}\circ M := HM + MH.$$

With this, $C_\infty$ is the solution to:

$$\mathcal{T}\circ C_\infty = \gamma\,\mathcal{S}\circ C_\infty + \gamma\Sigma \qquad (5)$$

(due to Equation 3).
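Since $\mathcal{S}$ and $\mathcal{T}$ are linear, Equation (5) is a $d^2 \times d^2$ linear system in $\mathrm{vec}(C_\infty)$. The sketch below builds the operator matrices and solves it directly, again under a Gaussian design so that $\mathcal{S}$ has the closed form $2HMH + \mathrm{Tr}(MH)\,H$; the setup is our choice:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 3
A = rng.normal(size=(d, d))
H = A @ A.T / d + 0.5 * np.eye(d)       # a positive definite second-moment matrix
Sigma = 0.25 * H                         # well-specified case: Sigma = sigma^2 H
R2 = 2 * np.linalg.eigvalsh(H).max() + np.trace(H)  # a valid R^2 for x ~ N(0, H)
gamma = 0.5 / R2

I = np.eye(d)
# Row-major vec identity: vec(A X B) = (A kron B^T) vec(X).
T_mat = np.kron(H, I) + np.kron(I, H)                       # M -> HM + MH
S_mat = 2 * np.kron(H, H) + np.outer(H.ravel(), H.ravel())  # M -> 2HMH + Tr(MH) H

# Equation (5): (T - gamma S) vec(C_inf) = gamma vec(Sigma).
C_inf = np.linalg.solve(T_mat - gamma * S_mat, gamma * Sigma.ravel()).reshape(d, d)

# Stationarity residual of Equation (5), evaluated directly on matrices:
S_C = 2 * H @ C_inf @ H + np.trace(C_inf @ H) * H
resid = H @ C_inf + C_inf @ H - gamma * S_C - gamma * Sigma
print(np.max(np.abs(resid)))
```

The crude bound of Lemma 5 below can then be verified directly on this solution.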

###### Lemma 5.

(Crude bound) $C_\infty$ is bounded as:

$$C_\infty \preceq \frac{\gamma\,\|\Sigma\|_H}{1-\gamma R^2}\,I.$$
###### Proof.

Define one more linear operator $\widetilde{\mathcal{T}}$ as follows:

$$\widetilde{\mathcal{T}}\circ M := \mathcal{T}\circ M - \gamma HMH = HM + MH - \gamma HMH.$$

The inverse of this operator can be written as:

$$\widetilde{\mathcal{T}}^{-1}\circ M = \gamma\sum_{t=0}^{\infty}\big(\mathcal{I}-\gamma\widetilde{\mathcal{T}}\big)^t\circ M = \gamma\sum_{t=0}^{\infty}(I-\gamma H)^t M (I-\gamma H)^t,$$

which exists since the sum converges, due to the fact that $\|I-\gamma H\| < 1$.

A few inequalities are helpful. If $0 \preceq M \preceq M'$, then

$$0 \preceq \widetilde{\mathcal{T}}^{-1}\circ M \preceq \widetilde{\mathcal{T}}^{-1}\circ M', \qquad (6)$$

since

$$\widetilde{\mathcal{T}}^{-1}\circ M = \gamma\sum_{t=0}^{\infty}(I-\gamma H)^t M (I-\gamma H)^t \preceq \gamma\sum_{t=0}^{\infty}(I-\gamma H)^t M' (I-\gamma H)^t = \widetilde{\mathcal{T}}^{-1}\circ M'$$

(which follows since $AMA^\top \preceq AM'A^\top$ for any matrix $A$ when $M \preceq M'$). Also, if $0 \preceq M \preceq M'$, then

$$0 \preceq \mathcal{S}\circ M \preceq \mathcal{S}\circ M', \qquad (7)$$

which implies:

$$0 \preceq \widetilde{\mathcal{T}}^{-1}\circ\mathcal{S}\circ M \preceq \widetilde{\mathcal{T}}^{-1}\circ\mathcal{S}\circ M'. \qquad (8)$$

The following inequality is also of use:

$$\Sigma \preceq \|H^{-1/2}\Sigma H^{-1/2}\|\,H = \|\Sigma\|_H\,H.$$

By definition of $\widetilde{\mathcal{T}}$,

$$\widetilde{\mathcal{T}}\circ C_\infty = \gamma\,\mathcal{S}\circ C_\infty + \gamma\Sigma - \gamma HC_\infty H.$$

Using this and Equation 6,

$$\begin{aligned}
C_\infty &= \gamma\,\widetilde{\mathcal{T}}^{-1}\circ\mathcal{S}\circ C_\infty + \gamma\,\widetilde{\mathcal{T}}^{-1}\circ\Sigma - \gamma\,\widetilde{\mathcal{T}}^{-1}\circ(HC_\infty H) \\
&\preceq \gamma\,\widetilde{\mathcal{T}}^{-1}\circ\mathcal{S}\circ C_\infty + \gamma\,\widetilde{\mathcal{T}}^{-1}\circ\Sigma \\
&\preceq \gamma\,\widetilde{\mathcal{T}}^{-1}\circ\mathcal{S}\circ C_\infty + \gamma\,\|\Sigma\|_H\,\widetilde{\mathcal{T}}^{-1}\circ H.
\end{aligned}$$

Proceeding recursively by using Equation 8,

$$C_\infty \preceq \big(\gamma\,\widetilde{\mathcal{T}}^{-1}\circ\mathcal{S}\big)^2\circ C_\infty + \gamma\|\Sigma\|_H\,\big(\gamma\,\widetilde{\mathcal{T}}^{-1}\circ\mathcal{S}\big)\circ\widetilde{\mathcal{T}}^{-1}\circ H + \gamma\|\Sigma\|_H\,\widetilde{\mathcal{T}}^{-1}\circ H \preceq \cdots$$

Using

$$\mathcal{S}\circ I = \mathbb{E}\big[\|x\|^2 xx^\top\big] \preceq R^2 H$$

and

$$\widetilde{\mathcal{T}}^{-1}\circ H = \gamma\sum_{t=0}^{\infty}(I-\gamma H)^{2t}H \preceq \gamma\sum_{t=0}^{\infty}(I-\gamma H)^tH = \gamma(\gamma H)^{-1}H = I$$

(where the middle inequality holds since $0 \preceq I-\gamma H \preceq I$), it follows that

$$C_\infty \preceq \gamma\,\|\Sigma\|_H\sum_{t=0}^{\infty}(\gamma R^2)^t\,I = \frac{\gamma\,\|\Sigma\|_H}{1-\gamma R^2}\,I,$$

which completes the proof. ∎

###### Lemma 6.

(Refined bound) $\mathrm{Tr}(C_\infty)$ is bounded as:

$$\mathrm{Tr}(C_\infty) \le \frac{\gamma}{2}\,\mathrm{Tr}(H^{-1}\Sigma) + \frac{1}{2}\,\frac{\gamma^2 R^2}{1-\gamma R^2}\,d\,\|\Sigma\|_H.$$
###### Proof.

From Lemma 5 and Equation 7,

$$\mathcal{S}\circ C_\infty \preceq \frac{\gamma\,\|\Sigma\|_H}{1-\gamma R^2}\,\mathcal{S}\circ I \preceq \frac{\gamma R^2\,\|\Sigma\|_H}{1-\gamma R^2}\,H.$$

Also, from Equation 3, $C_\infty$ satisfies:

$$HC_\infty + C_\infty H = \gamma\,\mathcal{S}\circ C_\infty + \gamma\Sigma.$$

Multiplying this by $\frac{1}{2}H^{-1}$ and taking the trace leads to:

$$\begin{aligned}
\mathrm{Tr}(C_\infty) &= \frac{\gamma}{2}\,\mathrm{Tr}\big(H^{-1}(\mathcal{S}\circ C_\infty)\big) + \frac{\gamma}{2}\,\mathrm{Tr}(H^{-1}\Sigma) \\
&\le \frac{1}{2}\,\frac{\gamma^2 R^2}{1-\gamma R^2}\,\|\Sigma\|_H\,\mathrm{Tr}(H^{-1}H) + \frac{\gamma}{2}\,\mathrm{Tr}(H^{-1}\Sigma) \\
&= \frac{1}{2}\,\frac{\gamma^2 R^2}{1-\gamma R^2}\,d\,\|\Sigma\|_H + \frac{\gamma}{2}\,\mathrm{Tr}(H^{-1}\Sigma),
\end{aligned}$$

which completes the proof. ∎

### 3.3 Completing the proof of Theorem 1

###### Proof.

The proof of the theorem is completed by applying the developed lemmas. For the bias term, using $\|v\|_H^2 \le \|H\|\,\|v\|^2 \le R^2\,\|v\|^2$, convexity of the squared norm, and Lemma 2 leads to:

$$\begin{aligned}
\frac{1}{2}\,\mathbb{E}\big[\|\bar{w}_{t:T}-w_*\|_H^2 \,\big|\, \xi_0=\cdots=\xi_T=0\big] &\le \frac{1}{2}\,R^2\,\mathbb{E}\big[\|\bar{w}_{t:T}-w_*\|^2 \,\big|\, \xi_0=\cdots=\xi_T=0\big] \\
&\le \frac{R^2}{2(T-t)}\sum_{t'=t}^{T-1}\mathbb{E}\big[\|w_{t'}-w_*\|^2 \,\big|\, \xi_0=\cdots=\xi_T=0\big] \\
&\le \frac{1}{2}\,\exp(-\gamma\mu t)\,R^2\,\|w_0-w_*\|^2.
\end{aligned}$$

For the variance term, observe that, by Lemma 4 (applied to the average over the window $[t, T)$) and Lemma 6,

$$\frac{1}{2}\,\mathbb{E}\big[\|\bar{w}_{t:T}-w_*\|_H^2 \,\big|\, w_0=w_*\big] \le \frac{\mathrm{Tr}(C_\infty)}{\gamma(T-t)} \le \frac{1}{T-t}\left(\frac{1}{2}\,\mathrm{Tr}(H^{-1}\Sigma) + \frac{1}{2}\,\frac{\gamma R^2}{1-\gamma R^2}\,d\,\|\Sigma\|_H\right),$$

which equals $\big(1 + \frac{\gamma R^2}{1-\gamma R^2}\,\rho_{\mathrm{misspec}}\big)\frac{\sigma^2_{\mathrm{MLE}}}{T-t}$ by the definitions of $\sigma^2_{\mathrm{MLE}}$ and $\rho_{\mathrm{misspec}}$. Combining the two bounds via Lemma 1 completes the proof. ∎

#### Acknowledgements.

Sham Kakade acknowledges funding from the Washington Research Foundation Fund for Innovation in Data-Intensive Discovery. The authors also thank Zaid Harchaoui for helpful discussions.

## References

• [1] Francis R. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research (JMLR), volume 15, 2014.
• [2] Alexandre Défossez and Francis R. Bach. Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. In AISTATS, volume 38, 2015.
• [3] Aymeric Dieuleveut and Francis R. Bach. Non-parametric stochastic approximation with large step sizes. The Annals of Statistics, 2015.
• [4] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Competing with the empirical risk minimizer in a single pass. In COLT, 2015.
• [5] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic approximation through mini-batching and tail-averaging. CoRR, abs/1610.03774, 2016.
• [6] Harold J. Kushner and Dean S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, 1978.
• [7] Erich L. Lehmann and George Casella. Theory of Point Estimation. Springer Texts in Statistics. Springer, 1998.
• [8] Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, volume 30, 1992.
• [9] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Tech. Report, ORIE, Cornell University, 1988.
• [10] Aad W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.