 # Stochastic gradient descent algorithms for strongly convex functions at O(1/T) convergence rates

With a weighting scheme proportional to t, a traditional stochastic gradient descent (SGD) algorithm achieves a high-probability convergence rate of O(κ/T) for strongly convex functions, instead of O(κ ln(T)/T). We also prove that an accelerated SGD algorithm achieves a rate of O(κ/T).

## 1 Introduction

Consider a stochastic optimization problem

$$\min_{x\in\mathcal{X}}\Big\{f(x):=\mathbb{E}_{\xi}\,F(x,\xi)\Big\}$$

where $\mathcal{X}$ is a nonempty bounded closed convex set, $\xi$ is a random variable, $F(\cdot,\xi)$ is a smooth convex function, and $f$ is a smooth strongly convex function. The requirement of smoothness simplifies the analysis. If the objective function is nonsmooth but Lipschitz continuous, stochastic gradient descent algorithms can replace gradients with subgradients, but the analysis has to introduce an additional term of the same order as the variance term. Some nonsmooth cases have been studied in

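Assume that the domain is bounded, i.e. $\|x-y\|\le D$ for all $x,y\in\mathcal{X}$. Let $\hat g(x,\xi)$ be a stochastic gradient of the function $f$ at $x$ with a random variable $\xi$. Then $g(x):=\mathbb{E}_{\xi}\,\hat g(x,\xi)$ is a gradient of $f$. Assume that $\|g(x)-g(y)\|_*\le L\|x-y\|$, where $L$ is known as the Lipschitz constant. We only consider strongly convex functions in this note, thus assume that there is $\mu>0$ such that $f(y)\ge f(x)+\langle g(x),y-x\rangle+\frac{\mu}{2}\|y-x\|^2$. We assume that stochastic gradients are bounded, i.e., there exists $Q>0$ such that $\|\hat g(x,\xi)-g(x)\|_*\le Q$.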

We are interested in the condition number $\kappa$, which is defined as $\kappa:=L/\mu$. The condition number, $\kappa$, could be as large as , where is the number of samples and . One reference case is regularized linear classifiers [smale03:_estim_approx_error_learn_theor], where the regularization factor could be as large as . The other reference case is the condition number of a random matrix [rudelson09:_small], where the smallest singular value is . When , , which bridges the gap between the convergence rate for strongly convex functions and that for functions without the strong convexity condition. In this note, we assume . We use big-$O$ notation in terms of $T$ and $\kappa$ and hide the factors $D$, $Q$ and $L$ besides constants.

### Notation

Denote by . Let be a sequence of independent random variables. Denote . We define . Then , and for , .

## 2 Stochastic gradient descent algorithm

Algorithm 1 shows the stochastic gradient descent method. Unlike the conventional averaging by equal weights , we use a weighting scheme , where . Theorem 1 shows a convergence rate of $O(\kappa/T)$, assuming that . Let , , , and the coefficients and . The informal argument is that the weighting scheme equalizes the variance of each iteration, since and are assuming that .

###### Theorem 1.

Assume that the underlying function is strongly convex, i.e., . Let . If , , then it holds for Algorithm 1 that for $\theta>0$,

$$\Pr\Big\{f(\bar x_T)-f(x^*)\ge\bar K(T)+\sqrt{2\theta}\,\tilde K(T)+\theta\hat K(T)\Big\}\le\exp\{-\theta\}, \tag{1}$$

where

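$$\begin{aligned}
\bar K(T)&:=\frac{D^2L}{T}+\frac{2\kappa Q^2}{LT}=O(\kappa/T),\\
\tilde K(T)&:=\frac{4DQ(\kappa+1)}{T^{3/2}}+\frac{2\sqrt{2}\,\kappa Q^2}{LT}+\frac{4\sqrt{2}\,\kappa^{3/2}Q^2\sqrt{1+\ln T}}{LT^{3/2}}=O(\kappa/T),\\
\hat K(T)&:=\frac{10\kappa Q^2}{LT}=O(\kappa/T).
\end{aligned}$$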
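The text does not reproduce Algorithm 1, so the following Python sketch is only one plausible reading of it: projected SGD that returns the weighted average of the iterates with weights proportional to $t$. The step size $\gamma_t=2/(\mu(t+2\kappa))$ is inferred from the coefficients that appear in the proof of Theorem 1. The quadratic objective, the unit-ball domain, the uniform noise model and all function names are illustrative assumptions.

```python
import math
import random

def project_ball(x, radius):
    """Euclidean projection onto the ball of the given radius (the assumed domain)."""
    norm = math.sqrt(sum(v * v for v in x))
    return x if norm <= radius else [v * radius / norm for v in x]

def weighted_sgd(h, T, radius=1.0, noise=0.5, seed=0):
    """Projected SGD on f(x) = 0.5 * sum(h_i * x_i^2), averaged with weights proportional to t."""
    rng = random.Random(seed)
    mu, L = min(h), max(h)
    kappa = L / mu
    x = project_ball([1.0] * len(h), radius)     # arbitrary feasible starting point
    avg = [0.0] * len(h)
    weight_sum = 0.0
    for t in range(1, T + 1):
        gamma = 2.0 / (mu * (t + 2.0 * kappa))   # step size inferred from the proof
        # stochastic gradient = exact gradient + bounded zero-mean noise
        g_hat = [hi * xi + rng.uniform(-noise, noise) for hi, xi in zip(h, x)]
        x = project_ball([xi - gamma * gi for xi, gi in zip(x, g_hat)], radius)
        # running weighted average: avg = sum(s * x_s) / sum(s) over s <= t
        weight_sum += t
        avg = [a + (t / weight_sum) * (xi - a) for a, xi in zip(avg, x)]
    return avg

def objective(h, x):
    return 0.5 * sum(hi * xi * xi for hi, xi in zip(h, x))

h = [1.0, 2.0, 5.0, 10.0]        # mu = 1, L = 10, kappa = 10
x_bar = weighted_sgd(h, T=2000)
gap = objective(h, x_bar)        # the minimizer is x* = 0, so f(x*) = 0
print(f"f(x_bar) - f(x*) = {gap:.2e}")
```

Replacing `t / weight_sum` by `1 / t` in the averaging line gives the equal-weight average, which makes it easy to compare the two schemes on the same noise sequence.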

Similarly, with the traditional equal weighting scheme, , we have a convergence rate of $O(\kappa\ln(T)/T)$ in Proposition 2. Informally, implies a convergence rate of .

###### Proposition 2.

Assume that . Let . If , , then for $\theta>0$,

$$\Pr\Big\{f(\bar x_T)-f(x^*)\ge\bar K(T)+\sqrt{2\theta}\,\tilde K(T)+\theta\hat K(T)\Big\}\le\exp\{-\theta\},$$

where

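$$\begin{aligned}
\bar K(T)&:=\frac{LD^2}{2T}+\frac{\kappa Q^2}{2LT}(1+\ln T),\\
\tilde K(T)&:=\frac{DQ\sqrt{\kappa+1}}{T}+\frac{\kappa Q^2}{LT}\sqrt{1+\ln T},\\
\hat K(T)&:=\frac{6\kappa Q^2}{LT}.
\end{aligned}$$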
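To see where the $\ln T$ factor enters, the leading terms $\bar K(T)$ of Theorem 1 and Proposition 2 can be evaluated side by side. The sketch below only transcribes those two displayed formulas; the function names and the numerical constants are hypothetical.

```python
import math

def k_bar_weighted(T, D, L, Q, kappa):
    """Leading term of Theorem 1 (weights proportional to t)."""
    return D**2 * L / T + 2 * kappa * Q**2 / (L * T)

def k_bar_uniform(T, D, L, Q, kappa):
    """Leading term of Proposition 2 (equal weights)."""
    return L * D**2 / (2 * T) + kappa * Q**2 * (1 + math.log(T)) / (2 * L * T)

params = dict(D=1.0, L=1.0, Q=1.0, kappa=10.0)   # hypothetical problem constants
for T in (10**3, 10**6):
    # T * K_bar stays constant for the weighted scheme and grows like ln T otherwise
    print(T, T * k_bar_weighted(T, **params), T * k_bar_uniform(T, **params))
```

For small $T$ the equal-weight constant can be the smaller of the two, since its first term carries a factor 1/2; the comparison reverses once $\ln T$ dominates.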

Proposition 1 shows that if the optimal solution is an interior point, it is possible to simply take the non-averaged solution, $x_T$. The convergence rate is $O(\kappa^2/T)$. However, if , means not convergent, just like the non-averaged SGD solution without strong convexity.

###### Proposition 1.

Assume that and the optimal solution is an interior point. Let . If , then for $\theta>0$,

$$\Pr\Big\{f(x_T)-f(x^*)\ge\bar K(T)+\sqrt{2\theta}\,\tilde K(T)+\theta\hat K(T)\Big\}\le\exp\{-\theta\},$$

where

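$$\begin{aligned}
\bar K(T)&:=\frac{D^2L(\kappa+1)^2}{2(T+\kappa)^2}+\frac{\kappa^2Q^2\big(T+\kappa(1+\ln T)\big)}{2L(T+\kappa)^2}=O(\kappa^2/T),\\
\tilde K(T)&:=\frac{DQ(\kappa+1)^2}{\sqrt{2}\,(T+\kappa)^{3/2}}+\frac{\kappa^2Q^2}{2L(T+\kappa)}+\frac{\kappa^2Q^2\sqrt{\kappa T(1+\ln T)}}{2L(T+\kappa)^2}=O(\kappa^2/T),\\
\hat K(T)&:=\frac{6\kappa^2Q^2}{L(T+\kappa)}=O(\kappa^2/T).
\end{aligned}$$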

There are studies on the high-probability convergence rate of stochastic algorithms on strongly convex functions, such as [rakhlin12:_makin_gradien_descen_optim_stron]. The convergence rate usually is . Here, we prove a convergence rate of $O(\kappa/T)$ with a proper weighting scheme.

## 3 Accelerated Stochastic Gradient Descent Algorithm

Algorithm 2 is a stochastic variant of Nesterov’s accelerated methods. The convergence rate is also $O(\kappa/T)$. Compared with Theorem 1, the deterministic part in Theorem 3 has a better rate, i.e. $O(1/T^2)$.

###### Theorem 3.

Assume that . If , , then for $\theta>0$,

$$\Pr\Big\{f(\bar x_T)-f(x^*)>\bar K(T)+\sqrt{2\theta}\,\tilde K(T)+\theta\hat K(T)\Big\}\le\exp\{-\theta\},$$

where

 ¯K(T) :=2D2LT2+2κQ2LT, ~K(T) :=√20κDQT3/2+√10κQ22LT, ^K(T) :=8κQ2LT.

The paper [ghadimi12:_optim_stoch_approx_algor_stron] gives a strongly convex version of AC-SA under a sub-Gaussian gradient assumption, but its proof relies on a multi-stage algorithm.

Although SAGE [hu09:_accel_gradien_method_stoch_optim_onlin_learn] also provided a stochastic algorithm based on Nesterov’s method for strong convexity, the high-probability bound was not given in the paper.

## 4 A note on weighting schemes

In this study, we find an interesting property of the weighting scheme with , i.e. . The scheme takes advantage of a sequence with variance at the decay rate of . Now let us informally investigate a sequence with homogeneous variance, say . With a constant weighting scheme, , i.e. , the averaged variance is . With an exponential weighting scheme, , , i.e. and , the averaged variance is , which means that the number of effective tail samples is a constant . With the weighting scheme or , the averaged variance is , which translates to effective tail samples. This is a trade-off between sample efficiency and recency. To make other trade-offs, we can use a generalized scheme (an alternative scheme is or , where ), or . Then the averaged variance is approximately .
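The trade-off between sample efficiency and recency can be made concrete with the usual effective sample size $(\sum_t w_t)^2/\sum_t w_t^2$, which is the number of equally weighted samples that would give the same variance. The sketch below assumes independent samples with a common variance; the decay factor `alpha` and the function name are hypothetical, and the closed forms in the comments are standard sums rather than values quoted from the text.

```python
def effective_samples(weights):
    """(sum w)^2 / sum w^2: number of equally weighted samples with the same variance."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

T = 1000
alpha = 0.9  # hypothetical decay factor for the exponential scheme

uniform = [1.0] * T
linear = [float(t) for t in range(1, T + 1)]
exponential = [alpha ** (T - t) for t in range(1, T + 1)]

print(effective_samples(uniform))      # equals T
print(effective_samples(linear))       # equals 3T(T+1) / (2(2T+1)), roughly 3T/4
print(effective_samples(exponential))  # close to (1 + alpha) / (1 - alpha), independent of T
```

Under these assumptions, weights proportional to $t$ keep about three quarters of the samples effective while still emphasizing recent iterates, whereas exponential weights keep only a constant number.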

## 5 Proofs

The proof strategy is first to construct inequalities from the algorithms in Lemma 5 and 5, then to apply Lemma 5 to derive the probability inequalities. Assume that is martingale difference, , , , , , , , , and

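$$\begin{aligned}
X_t&=w_t\big(\tilde a_tA_{t-1}+2\tilde b_tB_t+\tilde c_tC_t\big), \qquad (2)\\
A_t&\le d_t\big(a_tA_{t-1}+2b_tB_t+c_tC_t\big), \qquad (3)\\
B_t^2&\le A_{t-1}C_t,\\
C_t&\le 1.
\end{aligned}$$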

If the following conditions hold

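1. for $u\in\big(0,\frac{1}{2\hat R_T}\big)$,

$$\mathbb{E}_{|T}\exp(uX_{T+1})\le\exp\Big(\Big(u\bar P_T+\frac{2u^2\tilde P_T^2}{1-u\hat R_T}\Big)A_T+u\bar R_T+\frac{2u^2\tilde R_T^2}{1-u\hat R_T}\Big), \tag{4}$$
2. for $t\in\{1,\dots,T\}$,

$$\begin{aligned}
a_td_t\bar P_t+w_t\tilde a_t&\le\bar P_{t-1}, \qquad (5)\\
\bar R_t+w_t\tilde c_t+c_td_t\bar P_t&\le\bar R_{t-1},\\
a_td_t\tilde P_t^2+4\big(w_t\tilde b_t+b_td_t\bar P_t\big)^2&\le\tilde P_{t-1}^2,\\
\tilde R_t^2+c_td_t\tilde P_t^2&\le\tilde R_{t-1}^2,\\
\hat R_t&\le\hat R_{t-1},\\
a_td_t\tilde P_t^2\hat R_t+4b_td_t\big(w_t\tilde b_t+b_td_t\bar P_t\big)\tilde P_t^2&\le\tilde P_{t-1}^2\hat R_{t-1},\\
a_td_t\tilde P_t^2\hat R_t^2+4b_td_t\big(w_t\tilde b_t+b_td_t\bar P_t\big)\tilde P_t^2\hat R_t+2b_t^2d_t^2\tilde P_t^4&\le\tilde P_{t-1}^2\hat R_{t-1}^2,
\end{aligned}$$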

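then for $\theta>0$,

$$\Pr\Big\{\sum_{t=1}^{T+1}X_t\ge\bar P_0D^2+\bar R_0+\sqrt{2\theta}\,\big(\tilde P_0D^2+\tilde R_0\big)+2\theta\hat R_0\Big\}\le\exp\{-\theta\}. \tag{6}$$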
###### Proof.

We will prove the following inequality by induction,

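$$\mathbb{E}_{|t}\exp\Big(u\sum_{\tau=t+1}^{T+1}X_\tau\Big)\le\exp\Big(\Big(u\bar P_t+\frac{2u^2\tilde P_t^2}{1-u\hat R_t}\Big)A_t+u\bar R_t+\frac{2u^2\tilde R_t^2}{1-u\hat R_t}\Big),\quad\forall u\in\Big(0,\frac{1}{2\hat R_t}\Big). \tag{7}$$

Eq. (4) implies that Eq. (7) holds for $t=T$. For $t\in\{1,\dots,T\}$,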

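$$\begin{aligned}
&\mathbb{E}_{|t-1}\exp\Big(u\sum_{\tau=t}^{T+1}X_\tau\Big)\\
&\le\mathbb{E}_{|t-1}\exp\Big(uX_t+\Big(u\bar P_t+\frac{u^2\tilde P_t^2}{2(1-u\hat R_t)}\Big)A_t+u\bar R_t+\frac{u^2\tilde R_t^2}{2(1-u\hat R_t)}\Big) \qquad (8)\\
&\le\mathbb{E}_{|t-1}\exp\Big(uw_t\big(\tilde a_tA_{t-1}+2\tilde b_tB_t+\tilde c_tC_t\big)+\Big(u\bar P_t+\frac{u^2\tilde P_t^2}{2(1-u\hat R_t)}\Big)d_t\big(a_tA_{t-1}+2b_tB_t+c_tC_t\big)\\
&\qquad\qquad+u\bar R_t+\frac{u^2\tilde R_t^2}{2(1-u\hat R_t)}\Big) \qquad (9)\\
&\le\exp\Big(\Big(u\big(\bar P_td_ta_t+w_t\tilde a_t\big)+\frac{u^2\tilde P_t^2d_ta_t}{2(1-u\hat R_t)}\Big)A_{t-1}+u\big(\bar R_t+w_t\tilde c_t+\bar P_td_tc_t\big)+\frac{u^2\tilde R_t^2}{2(1-u\hat R_t)}+\frac{u^2\tilde P_t^2d_tc_t}{2(1-u\hat R_t)}\Big) \qquad (10)\\
&\qquad\times\mathbb{E}_{|t-1}\exp\Big(2u\Big(w_t\tilde b_t+b_td_t\bar P_t+\frac{ub_td_t\tilde P_t^2}{2(1-u\hat R_t)}\Big)B_t\Big)\\
&\le\exp\Big(\Big(u\big(\bar P_td_ta_t+w_t\tilde a_t\big)+\frac{u^2\tilde P_t^2d_ta_t}{2(1-u\hat R_t)}+2u^2\Big(w_t\tilde b_t+b_td_t\bar P_t+\frac{ub_td_t\tilde P_t^2}{2(1-u\hat R_t)}\Big)^2\Big)A_{t-1}\\
&\qquad\qquad+u\big(\bar R_t+w_t\tilde c_t+\bar P_td_tc_t\big)+\frac{u^2\tilde R_t^2}{2(1-u\hat R_t)}+\frac{u^2\tilde P_t^2d_tc_t}{2(1-u\hat R_t)}\Big) \qquad (11)\\
&\le\exp\Big(\Big(u\big(\bar P_td_ta_t+w_t\tilde a_t\big)+\frac{u^2\tilde P_t^2d_ta_t}{2(1-u\hat R_t)}+2u^2\big(w_t\tilde b_t+b_td_t\bar P_t\big)^2+\frac{u^3\big(w_t\tilde b_t+b_td_t\bar P_t\big)b_td_t\tilde P_t^2}{2(1-u\hat R_t)}+\frac{2u^4b_t^2d_t^2\tilde P_t^4}{2(1-u\hat R_t)}\Big)A_{t-1}\\
&\qquad\qquad+u\big(\bar R_t+w_t\tilde c_t+\bar P_td_tc_t\big)+\frac{u^2\tilde R_t^2}{2(1-u\hat R_t)}+\frac{u^2\tilde P_t^2d_tc_t}{2(1-u\hat R_t)}\Big) \qquad (12)\\
&\le\exp\Big(\Big(u\bar P_{t-1}+\frac{u^2\tilde P_{t-1}^2}{2(1-u\hat R_{t-1})}\Big)A_{t-1}+u\bar R_{t-1}+\frac{u^2\tilde R_{t-1}^2}{2(1-u\hat R_{t-1})}\Big), \qquad (13)
\end{aligned}$$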

where Eq. (8) is due to the induction hypothesis; Eq. (9) is due to Eqs. (2) and (3); Eq. (10) is due to ; Eq. (11) is due to , , and Hoeffding’s lemma, thus ; Eq. (12) is due to ; Eq. (13) is due to the conditions in Eq. (5). Then for $u\in\big(0,\frac{1}{2\hat R_0}\big)$,

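$$\mathbb{E}\exp\Big(u\sum_{\tau=1}^{T+1}X_\tau\Big)\le\exp\Big(\Big(u\bar P_0+\frac{u^2\tilde P_0^2}{2(1-u\hat R_0)}\Big)A_0+u\bar R_0+\frac{u^2\tilde R_0^2}{2(1-u\hat R_0)}\Big)\le\exp\Big(u\big(\bar P_0D^2+\bar R_0\big)+\frac{u^2\big(\tilde P_0^2D^2+\tilde R_0^2\big)}{2(1-2u\hat R_0)}\Big).$$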

Eq. (6) follows Lemma Supporting lemma. ∎
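The last step, from a moment generating function bound to a tail bound, is the standard Bernstein-type Chernoff argument. The sketch below assumes that this is what the supporting lemma does and checks it numerically: if $\ln\mathbb{E}\exp(uX)\le um+\frac{u^2\sigma^2}{2(1-bu)}$ for $u\in(0,1/b)$, then the Chernoff exponent at the threshold $m+\sqrt{2\theta}\,\sigma+\theta b$ equals $-\theta$ at the optimal $u$. Here `mean`, `sigma` and `b` are hypothetical names for $m$, $\sigma$ and $b$; in the lemma they would correspond to $\bar P_0D^2+\bar R_0$, $\sqrt{\tilde P_0^2D^2+\tilde R_0^2}$ and $2\hat R_0$.

```python
import math

def chernoff_exponent(theta, mean, sigma, b):
    """Chernoff exponent at the threshold mean + sqrt(2*theta)*sigma + theta*b."""
    threshold = mean + math.sqrt(2 * theta) * sigma + theta * b
    eps = threshold - mean
    # maximizer of u*eps - u^2 sigma^2 / (2 (1 - b u)) over u in (0, 1/b)
    u = (1 - 1 / math.sqrt(1 + 2 * b * eps / sigma**2)) / b
    log_mgf_bound = u * mean + u**2 * sigma**2 / (2 * (1 - b * u))
    # Markov's inequality gives Pr{X >= threshold} <= exp(log_mgf_bound - u * threshold)
    return log_mgf_bound - u * threshold

print(chernoff_exponent(theta=3.0, mean=0.2, sigma=0.5, b=0.1))  # approximately -3.0
```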

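We prove Lemma 5, which is the same as Lemma 7 of [lan08:_effic_method_stoch_compos_optim] except for the strong convexity. Let $\delta_t:=\hat g_t-g(x_{t-1})$, $A_t:=\|x_t-x^*\|^2$, $B_t:=\langle\delta_t,x_{t-1}-x^*\rangle/Q$, $C_t:=\|\delta_t\|_*^2/Q^2$. If and , it holds for Algorithm 1 that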

$$f(x_t)-f(x^*)\le\frac{1-\gamma_t\mu}{2\gamma_t}A_{t-1}-\frac{1}{2\gamma_t}A_t-QB_t+\frac{\gamma_t}{2(1-\gamma_tL)}Q^2C_t.$$
###### Proof.

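Let $d_t:=x_t-x_{t-1}$.

$$\begin{aligned}
f(x_t)&\le f(x_{t-1})+\langle g(x_{t-1}),d_t\rangle+\frac{L}{2}\|d_t\|^2 \qquad (14)\\
&\le f(x^*)+\langle g(x_{t-1}),x_t-x^*\rangle-\frac{\mu}{2}\|x_{t-1}-x^*\|^2+\frac{L}{2}\|d_t\|^2 \qquad (15)\\
&=f(x^*)+\langle\hat g_t,x_t-x^*\rangle-\frac{\mu}{2}\|x_{t-1}-x^*\|^2+\frac{L}{2}\|d_t\|^2-\langle\delta_t,x_t-x^*\rangle\\
&\le f(x^*)+\frac{1-\gamma_t\mu}{2\gamma_t}\|x_{t-1}-x^*\|^2-\frac{1}{2\gamma_t}\|x_t-x^*\|^2-\frac{1-\gamma_tL}{2\gamma_t}\|d_t\|^2-\langle\delta_t,d_t\rangle-\langle\delta_t,x_{t-1}-x^*\rangle \qquad (16)\\
&\le f(x^*)+\frac{1-\gamma_t\mu}{2\gamma_t}\|x_{t-1}-x^*\|^2-\frac{1}{2\gamma_t}\|x_t-x^*\|^2+\frac{\gamma_t}{2(1-\gamma_tL)}\|\delta_t\|_*^2-\langle\delta_t,x_{t-1}-x^*\rangle. \qquad (17)
\end{aligned}$$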

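Eq. (14) is due to the Lipschitz continuity of $g$, Eq. (15) due to the strong convexity of $f$, Eq. (16) due to the optimality of Step 4, and Eq. (17) due to the inequality $-\frac{a}{2}\|d_t\|^2-\langle\delta_t,d_t\rangle\le\frac{1}{2a}\|\delta_t\|_*^2$ with $a=(1-\gamma_tL)/\gamma_t>0$. ∎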
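The one-step inequality can be sanity-checked numerically. The sketch below assumes that Step 4 of Algorithm 1 is a Euclidean projected gradient step, $x_t=\Pi_{\mathcal{X}}(x_{t-1}-\gamma_t\hat g_t)$, and uses a diagonal quadratic on the unit ball with minimizer $x^*=0$; both choices, as well as the noise model and the names, are illustrative assumptions. It evaluates the right-hand side minus the left-hand side over many random trials.

```python
import math
import random

def check_one_step(rng, h=(1.0, 4.0), radius=1.0, noise=0.3):
    """Return the slack (right side minus left side) of the one-step inequality."""
    mu, L = min(h), max(h)
    f = lambda x: 0.5 * sum(hi * xi * xi for hi, xi in zip(h, x))
    sq = lambda v: sum(vi * vi for vi in v)

    def project(x):
        n = math.sqrt(sq(x))
        return x if n <= radius else [xi * radius / n for xi in x]

    gamma = rng.uniform(0.01, 0.9) / L                 # keeps gamma * L < 1
    x_prev = project([rng.uniform(-1, 1) for _ in h])
    delta = [rng.uniform(-noise, noise) for _ in h]    # gradient noise delta_t
    g_hat = [hi * xi + di for hi, xi, di in zip(h, x_prev, delta)]
    x_new = project([xi - gamma * gi for xi, gi in zip(x_prev, g_hat)])

    # x* = 0 and f(x*) = 0 for this quadratic
    lhs = f(x_new)
    rhs = ((1 - gamma * mu) / (2 * gamma) * sq(x_prev)
           - sq(x_new) / (2 * gamma)
           - sum(di * xi for di, xi in zip(delta, x_prev))
           + gamma / (2 * (1 - gamma * L)) * sq(delta))
    return rhs - lhs

rng = random.Random(0)
worst = min(check_one_step(rng) for _ in range(10000))
print(f"smallest slack over 10000 trials: {worst:.3e}")
```

A non-negative smallest slack (up to rounding) is what the lemma predicts; the check is not a proof, only a guard against sign or factor errors when transcribing the bound.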

###### Proof of Theorem 1.

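Because $C_t\le 1$, it follows from Lemma 5 that

$$\begin{aligned}
f(x_t)-f(x^*)&\le\frac{1-\gamma_t\mu}{2\gamma_t}A_{t-1}-\frac{1}{2\gamma_t}A_t-QB_t+\frac{\gamma_tQ^2}{2(1-\gamma_tL)}\\
&\le\frac{(t+2\kappa-2)\mu A_{t-1}}{4}-\frac{(t+2\kappa)\mu A_t}{4}-QB_t+\frac{Q^2}{\mu t}.
\end{aligned}$$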
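The coefficients in the second line pin down the step size: they match the first line exactly when $\gamma_t=2/(\mu(t+2\kappa))$. This value is inferred from the display rather than quoted from the theorem statement, so the following check, written with exact rational arithmetic, should be read as a consistency test of that inference. The constants are hypothetical.

```python
from fractions import Fraction

def coefficients(t, mu, L):
    """Coefficients of A_{t-1}, A_t and Q^2 in the one-step bound for the inferred step size."""
    kappa = L / mu
    gamma = 2 / (mu * (t + 2 * kappa))
    return ((1 - gamma * mu) / (2 * gamma),   # multiplies A_{t-1}
            1 / (2 * gamma),                  # multiplies A_t (with a minus sign in the bound)
            gamma / (2 * (1 - gamma * L)))    # multiplies Q^2

mu, L = Fraction(1, 3), Fraction(5)           # hypothetical constants, exact arithmetic
kappa = L / mu
for t in range(1, 6):
    a_prev, a_cur, q2 = coefficients(Fraction(t), mu, L)
    assert a_prev == (t + 2 * kappa - 2) * mu / 4
    assert a_cur == (t + 2 * kappa) * mu / 4
    assert q2 == 1 / (mu * t)
print("all coefficients match")
```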

As $f(x_t)-f(x^*)\ge 0$, it follows from Lemma 5 that

$$A_t\le d_t\big(a_tA_{t-1}+2b_tB_t+c_tC_t\big),$$

where , , and . Let . Assume that and . Then

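$$\begin{aligned}
f(\bar x_T)-f(x^*)&\le\sum_{t=1}^{T}w_t\big(f(x_t)-f(x^*)\big)\le\sum_{t=1}^{T}w_t\Big(\frac{1-\gamma_t\mu}{2\gamma_t}A_{t-1}-\frac{1}{2\gamma_t}A_t-QB_t+\frac{\gamma_tQ^2}{2(1-\gamma_tL)}\Big)\\
&\le\sum_{t=1}^{T}w_t\Big(\frac{1-\gamma_t\mu}{2\gamma_t}-\frac{w_{t-1}}{2w_t\gamma_{t-1}}\Big)A_{t-1}-\sum_{t=1}^{T}w_tQB_t+\sum_{t=1}^{T}w_t\frac{\gamma_tQ^2}{2(1-\gamma_tL)}\\
&\le\sum_{t=1}^{T}w_t\Big(\frac{L}{2t}A_{t-1}-QB_t+\frac{Q^2}{\mu t}\Big)\le\frac{LD^2}{T}+\sum_{t=1}^{T}w_t\Big(-QB_t+\frac{Q^2}{\mu t}\Big).
\end{aligned}$$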
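The step from the second to the third line of this chain relies on the telescoped coefficient of $A_{t-1}$ being at most $L/(2t)$. Assuming weights proportional to $t$ (so that $w_{t-1}/w_t=(t-1)/t$) and the step size $\gamma_t=2/(\mu(t+2\kappa))$ inferred from the coefficients above, the coefficient has the closed form $(2\kappa-1)\mu/(4t)$. The sketch below checks both facts in exact arithmetic; the constants and the function name are hypothetical.

```python
from fractions import Fraction

def telescoped_coefficient(t, mu, L):
    """Coefficient of A_{t-1}, divided by w_t, after telescoping with w_t proportional to t."""
    kappa = L / mu
    inv_2gamma = lambda s: (s + 2 * kappa) * mu / 4   # 1 / (2 gamma_s) for the inferred step size
    keep = inv_2gamma(t) - mu / 2                     # (1 - gamma_t mu) / (2 gamma_t)
    cancel = Fraction(t - 1, t) * inv_2gamma(t - 1)   # w_{t-1} / (2 w_t gamma_{t-1})
    return keep - cancel

mu, L = Fraction(1, 2), Fraction(4)                   # hypothetical constants
kappa = L / mu
for t in range(1, 50):
    c = telescoped_coefficient(t, mu, L)
    assert c == (2 * kappa - 1) * mu / (4 * t)        # closed form
    assert c <= L / (2 * t)                           # the bound used in the chain
print("telescoping bound verified")
```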

Note that we use the factor for simplicity. Let , , , , and

$$\bar P_t=0,\qquad\bar R_t=\frac{LD^2}{T}+\frac{2\kappa Q^2(T-t)}{LT^2},$$