# Making the Last Iterate of SGD Information Theoretically Optimal

Stochastic gradient descent (SGD) is one of the most widely used algorithms for large scale optimization problems. While classical theoretical analysis of SGD for convex problems studies (suffix) averages of iterates and obtains information theoretically optimal bounds on suboptimality, the last point of SGD is, by far, the most preferred choice in practice. The best known results for the last point of SGD [1], however, are suboptimal compared to information theoretic lower bounds by a $\log T$ factor, where $T$ is the number of iterations. [2] shows that this additional $\log T$ factor is in fact tight for the standard step size sequences of $1/\sqrt{t}$ and $1/t$ in the non-strongly convex and strongly convex settings, respectively. Similarly, even for subgradient descent (GD) applied to non-smooth, convex functions, the best known step-size sequences still lead to $O(\log T)$-suboptimal convergence rates (on the final iterate). The main contribution of this work is to design new step size sequences that enjoy information theoretically optimal bounds on the suboptimality of the last point of SGD as well as GD. We achieve this by designing a modification scheme that converts one sequence of step sizes to another, so that the last point of SGD/GD with the modified sequence has the same suboptimality guarantees as the average of SGD/GD with the original sequence. We also show that our result holds with high probability. We validate our results through simulations, which demonstrate that the new step size sequence indeed improves the final iterate significantly compared to the standard step size sequences.

## 1 Introduction

Stochastic Gradient Descent (SGD) is one of the most popular algorithms for solving large-scale empirical risk minimization (ERM) problems [3, 4, 5]. The algorithm updates the iterates using stochastic gradients obtained by sampling data points uniformly at random. The algorithm has been studied for several decades [6], but there are still significant gaps between practical implementations and theoretical analyses. In particular, the standard analyses hold only for some kind of average of iterates, but most practitioners just use the final iterate of SGD. So, [7] asked the natural question of whether the final iterate of SGD, as opposed to an average of iterates, is provably good. It was partly answered in [1], which gave a sub-optimality bound for the last point of SGD, but the obtained sub-optimality rates are $O(\log T)$ worse than the information theoretically optimal rates; $T$ is the number of iterations.

[2] showed that the above result is tight for the standard step-size sequence used by most existing theoretical results. The extra logarithmic factor is not due to the stochastic nature of SGD. In fact, even for subgradient descent (GD) applied to general non-smooth, convex functions, the last point's convergence rates are sub-optimal by an $O(\log T)$ factor.

So, this work addresses the following two fundamental questions:
“Does there exist a step-size sequence for which the last point of SGD when applied to general convex functions as well as to strongly-convex functions has optimal error (sub-optimality) rate?”, and,
“Does there exist a step-size sequence for which the last point of GD when applied to general non-smooth convex functions has optimal error (sub-optimality) rate?”
In this paper, we answer both questions in the affirmative. That is, we provide novel step size sequences and show that the final iterate of SGD run with these step size sequences has the information theoretically optimal error (suboptimality) rate. In particular, for general non-smooth convex functions, our results ensure an error rate of $O(1/\sqrt{T})$, and for strongly-convex functions, the error rate is $O(1/T)$. We also present high-probability versions, i.e., we show that with probability at least $1-\delta$, the suboptimality is $O(\sqrt{\log(1/\delta)/T})$ and $O(\log(1/\delta)/T)$ respectively (see Theorems 1 and 2). For GD, we show that a similarly modified step-size sequence leads to suboptimality of $O(1/T)$ and $O(1/\sqrt{T})$ for non-smooth convex functions, with and without strong convexity respectively, which is optimal.

In general, SGD takes the iterates near the optimum value, but since the objective isn't smooth near the optimizer $x^*$, the gradients don't become small even when the points are close to $x^*$. Standard step sizes don't decay appreciably with time to ensure fast enough convergence to $x^*$. Therefore the iterates $x_t$, after going close to $x^*$, start oscillating around it without actually approaching it (see Section 4 for concrete examples). Our new step sizes, given in Section 2.1, ensure that the step sizes decay fast enough after a certain point, making the iterates go closer to the optimum $x^*$. The exact mode of this decay ensures that the last iterate approaches the optimum at the information theoretic rate.

Our results utilize a general step size modification scheme which ensures that the upper bounds for the average function value with the original step sizes gets transferred to the last iterate when the modified step sizes are used (see Theorems 3 and 4). A key technical contribution of the paper is the proof of Theorem 2 that constructs a sequence of averaging schemes which are ‘good’ with high probability such that the last averaging scheme consists only of the last iterate and hence lets us conclude that the last iterate is ‘good’ with high probability.

Our new step-size sequence requires that the number of iterations or horizon $T$ is known a priori. In contrast, standard step-size sequences do not require $T$ a priori, and hence guarantee any-time results. Information about $T$ a priori helps us ensure that we do not drop the step size too early; only after we are close to the optimum does the step size drop rapidly. In fact, we conjecture that in the absence of a priori information about $T$, no step-size sequence can ensure the information theoretically optimal error rates for the final iterate of SGD. As a step towards proving this, we show that in the case of strongly convex objectives, any choice of step sizes with infinite horizon (i.e., without knowledge of the total number of iterations) is either suboptimal almost surely or suboptimal in expectation for infinitely many points. We show this in Theorem 5.

Related Work: Averaging was used first in the stochastic approximation setting by [8] to show optimal rates of convergence. Gradient Descent type methods have been shown to achieve information theoretically optimal error rates in the convex and strongly convex settings when averaging of iterates is used ([9], [10], [11], [12], Epoch GD in [13], SGD [14] and [15]). The question of the last iterate was first considered in [1], which gives a bound of $O(\log T/\sqrt{T})$ and $O(\log T/T)$ in expectation for the general case and the strongly convex case respectively. [2] show matching high probability bounds and show that for the standard step sizes ($1/\sqrt{t}$ in the general case and $1/t$ in the strongly convex case), the logarithmic-suboptimal bounds are tight.

Organization: The setting and main results are presented in Section 2. In particular, Section 2.1 describes the general step size modification considered and states key results regarding this modification and the lower bound is presented in Section 2.2. Key technical ideas are developed in Section 3 and the main theorems are proved. We present some experimental results in Section 4 and conclude in Section 5. Skipped proofs of technical lemmas are given in the appendix.

## 2 Problem Setup and Main Results

Consider the following optimization problem:

$$\min_{x \in \mathcal{W}} F(x), \tag{1}$$

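where the objective function $F$ is a convex function and $\mathcal{W}$ is a closed convex set. Let the global minimizer of $F$ be $x^*$. We start the SGD algorithm at a point in $\mathcal{W}$ and iteratively obtain estimates $x_t$ for the minimizer of $F$. We assume that at each time step, we have access to an independent, unbiased estimate $\hat{g}_t(x_t)$ of a subgradient $g_t \in \partial F(x_t)$. That is, $\mathbb{E}[\hat{g}_t(x_t) \mid x_t] = g_t$, and the estimates at different time steps are independent. We pick step sizes $\alpha_t$. Let $\Pi_{\mathcal{W}}$ be the projection operator onto the set $\mathcal{W}$. The SGD algorithm (Algorithm 1) is given as follows:

$$x_{t+1} = \Pi_{\mathcal{W}}\big(x_t - \alpha_t \hat{g}_t(x_t)\big).$$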

Henceforth, we will retain the assumptions made above. Whenever we use , it is implied that . Throughout the paper, we assume that is a Lipschitz continuous convex function.
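The projected update $x_{t+1} = \Pi_{\mathcal{W}}(x_t - \alpha_t \hat{g}_t(x_t))$ used throughout the analysis can be sketched in a few lines. The following is only an illustrative sketch and not the authors' implementation: it assumes a one-dimensional domain $\mathcal{W} = [\text{lo}, \text{hi}]$, and the names `project_interval`, `projected_sgd` and `noisy_abs_subgradient` are hypothetical. The toy oracle returns a subgradient of $F(x) = |x|$ plus zero-mean bounded noise.

```python
import random

def project_interval(x, lo, hi):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return min(max(x, lo), hi)

def projected_sgd(stoch_grad, x_start, step_sizes, lo, hi):
    """Run x_{t+1} = Pi_W(x_t - alpha_t * g_hat(x_t)) on W = [lo, hi].

    Returns the list of all iterates, starting with x_start.
    """
    xs = [x_start]
    for alpha in step_sizes:
        g_hat = stoch_grad(xs[-1])
        xs.append(project_interval(xs[-1] - alpha * g_hat, lo, hi))
    return xs

def noisy_abs_subgradient(x, rng):
    """Toy oracle (assumption): subgradient of |x| plus uniform noise in [-0.5, 0.5]."""
    sign = (x > 0) - (x < 0)  # 0 at x = 0, which is a valid subgradient of |x|
    return sign + rng.uniform(-0.5, 0.5)

if __name__ == "__main__":
    rng = random.Random(0)
    iterates = projected_sgd(lambda x: noisy_abs_subgradient(x, rng),
                             1.0, [0.1] * 50, -1.0, 1.0)
    print("last iterate:", iterates[-1])
```

In this toy setting the oracle is bounded by $1.5$ in absolute value, which plays the role of $G$ in Assumption 1.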

###### Assumption 1 (Lipschitz Continuity).

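$F$ is a $G$-Lipschitz continuous convex function over the closed convex set $\mathcal{W}$, i.e., $\|g\| \le G$ for every $x \in \mathcal{W}$ and every $g \in \partial F(x)$. Furthermore, the stochastic gradients satisfy $\|\hat{g}_t(x)\| \le G$ almost surely for every $x \in \mathcal{W}$.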

###### Assumption 2 (Closed and bounded set).

The diameter of the closed convex set $\mathcal{W}$ is bounded by $D$, i.e., $\sup_{x, y \in \mathcal{W}} \|x - y\| \le D$.

###### Assumption 3 (Strong convexity).

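Let $\lambda > 0$. A convex function $F$ is said to be $\lambda$-strongly convex over $\mathcal{W}$ iff $F(y) \ge F(x) + \langle g, y - x \rangle + \frac{\lambda}{2}\|y - x\|^2$ for all $x, y \in \mathcal{W}$ and all $g \in \partial F(x)$.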

Step size sequence for general convex functions: we first define,

$$k := \inf\{i : T \cdot 2^{-i} \le 1\}, \quad T_i := T - \lceil T \cdot 2^{-i} \rceil,\ 0 \le i \le k, \quad \text{and } T_{k+1} := T. \tag{2}$$

Clearly, $k = \lceil \log_2 T \rceil$. We note in particular that $T_0 = 0$ and $T_k = T - 1$. Let $C > 0$ be arbitrary. Then, we choose the step size as follows:

$$\alpha_t = \frac{C \cdot 2^{-i}}{\sqrt{T}} \quad \text{when } T_i < t \le T_{i+1},\ 0 \le i \le k. \tag{3}$$
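As a concrete illustration of the breakpoints in Equation (2) and the resulting schedule, here is a minimal sketch. It assumes that the step size rule applies on the blocks $T_i < t \le T_{i+1}$; the function names `breakpoints` and `general_convex_steps` are hypothetical and not taken from the paper.

```python
import math

def breakpoints(T):
    """Return [T_0, ..., T_{k+1}] from Equation (2) for a horizon T >= 1."""
    k = 0
    while (1 << k) < T:  # smallest k with T * 2**-k <= 1
        k += 1
    # -(-T // 2**i) is the integer ceiling of T / 2**i (no float rounding).
    Ts = [T - (-(-T // (1 << i))) for i in range(k + 1)]
    Ts.append(T)  # T_{k+1} := T
    return Ts

def general_convex_steps(T, C):
    """Return [alpha_1, ..., alpha_T] with alpha_t = C * 2**-i / sqrt(T) on block i."""
    Ts = breakpoints(T)
    alphas = []
    for i in range(len(Ts) - 1):
        block_len = Ts[i + 1] - Ts[i]
        alphas.extend([C * 2.0 ** (-i) / math.sqrt(T)] * block_len)
    return alphas

if __name__ == "__main__":
    print(breakpoints(16))
    print(general_convex_steps(16, 1.0))
```

For $T = 16$ the breakpoints are $0, 8, 12, 14, 15, 16$: the first half of the run uses the constant step $C/\sqrt{T}$, and the step size is halved on each successive block, with the block lengths halving as well.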

The theorem below provides suboptimality guarantee for the SGD algorithm with the step-size sequence mentioned above.

###### Theorem 1 (SGD/GD Last Point for General Convex Functions).

Let Assumptions 1 and 2 hold. Given , let be the iterates of SGD (Algorithm 1) with step size as defined in Equation (3). Then, the following holds for all :

$$\mathbb{E}[F(x_T)] \le F(x^*) + \frac{4D^2}{C\sqrt{T}} + \frac{11 G^2 C}{\sqrt{T}}.$$

In particular, if we choose , we have: Furthermore, the following holds w.p. for any :

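$$F(x_T) = F(x^*) + O\left(\frac{D^2}{C\sqrt{T}} + \frac{C G^2}{\sqrt{T}} \log\frac{1}{\delta}\right) \le F(x^*) + O\left(DG\sqrt{\frac{\log\frac{1}{\delta}}{T}}\right).$$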

Finally, under the same assumptions, the GD update with the same step-size sequence given in (3) also ensures the following after $T$ iterations:

$$F(x_T) \le F(x^*) + \frac{4D^2}{C\sqrt{T}} + \frac{11 G^2 C}{\sqrt{T}}.$$

We will prove this theorem in Section 3 after developing some general ideas.
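To make the deterministic (GD) part of Theorem 1 concrete, the sketch below runs subgradient descent on the toy objective $F(x) = |x|$ over $\mathcal{W} = [-1, 1]$ (so $G = 1$, $D = 2$, $F(x^*) = 0$) with the step sizes $C \cdot 2^{-i}/\sqrt{T}$ on the blocks $T_i < t \le T_{i+1}$, and compares the final error with the bound $4D^2/(C\sqrt{T}) + 11G^2C/\sqrt{T}$. The objective, the starting point and the function names are assumptions chosen for illustration, and "after $T$ iterations" is read here as "after $T$ updates".

```python
import math

def last_iterate_gd_abs(T, C, x_start=0.9):
    """Subgradient descent on F(x) = |x| over [-1, 1]; returns F(x_T) - F(x*)."""
    k = 0
    while (1 << k) < T:  # k from Equation (2)
        k += 1
    Ts = [T - (-(-T // (1 << i))) for i in range(k + 1)] + [T]
    x = x_start
    for i in range(k + 1):
        alpha = C * 2.0 ** (-i) / math.sqrt(T)  # step size on block i
        for _ in range(Ts[i + 1] - Ts[i]):
            g = (x > 0) - (x < 0)                   # a subgradient of |x|
            x = min(max(x - alpha * g, -1.0), 1.0)  # projection onto [-1, 1]
    return abs(x)

def theorem1_bound(T, C, D, G):
    """Right-hand side of the Theorem 1 bound, minus F(x*)."""
    return 4 * D ** 2 / (C * math.sqrt(T)) + 11 * G ** 2 * C / math.sqrt(T)

if __name__ == "__main__":
    T, C = 1024, 1.0
    print("final error:", last_iterate_gd_abs(T, C))
    print("bound      :", theorem1_bound(T, C, D=2.0, G=1.0))
```

On such a simple one-dimensional problem the bound is far from tight; the sketch only checks consistency with the statement and is not a substitute for the experiments of Section 4.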

Remarks: (1) Note that the bounds on sub-optimality (for SGD and GD) are information theoretically optimal up to constants.
(2) Our result on the expected sub-optimality improves upon that of [1] by a multiplicative $\log T$ factor, and our result on the high probability sub-optimality likewise improves upon that of [2]. On the other hand, our step-size sequence requires a priori knowledge of $T$. We conjecture that for any-time algorithm (i.e., without a priori knowledge of $T$) expected error rate of is information theoretically optimal.
(3) The rate obtained above for last point of GD (in the deterministic setting) is also optimal in the gradient oracle model and to the best of our knowledge, is the first such result for last point of GD.

Step size sequence for strongly-convex functions: Let $F$ be $\lambda$-strongly convex (Assumption 3). Let . We pick $\alpha_t$ as follows:

$$\alpha_t = 2^{-i} \cdot \frac{1}{\lambda t}, \quad \forall\ T_i < t \le T_{i+1},\ 0 \le i \le k. \tag{4}$$

We now present our result for last point of SGD with strong-convexity assumption.

###### Theorem 2 (SGD Last Point for Strongly Convex Functions).

Let $F$ satisfy Assumptions 1 and 3. Then the following holds for the $T$-th iterate of the SGD algorithm (Algorithm 1) when run with the step size sequence given in Equation (4):

$$\mathbb{E}[F(x_T)] \le F(x^*) + \frac{130\, G^2}{\lambda T}.$$

Furthermore, the following holds for all $\delta \in (0, 1)$ with probability at least $1 - \delta$:

$$F(x_T) = F(x^*) + O\left(\frac{G^2 \log\left(\frac{1}{\delta}\right)}{\lambda T}\right).$$

Under the same assumptions, the GD update with the same step-size sequence given in (4) also ensures the following after $T$ iterations:

$$F(x_T) \le F(x^*) + \frac{130\, G^2}{\lambda T}.$$

Here again, we note that the result is information theoretically optimal up to a constant factor.

### 2.1 General Step Size Modification

Theorems 1 and 2 are consequences of our general results on step size modification that we present below. Consider an SGD step size sequence $\gamma_t$. We obtain the modified step size sequence $\alpha_t$ as follows:

$$\alpha_t := 2^{-i} \gamma_t \quad \forall\ T_i < t \le T_{i+1},\ 0 \le i \le k. \tag{5}$$

Under certain mild conditions, we will show that the last iterate of SGD with step size $\alpha_t$ is as good as the average iterate of SGD with step size $\gamma_t$. We make these notions precise below:
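The modification scheme is mechanical: keep the original step size on the first block and multiply it by $2^{-i}$ on block $i$. The sketch below is one possible rendering under the assumption that block $i$ is $T_i < t \le T_{i+1}$; `modify_step_sizes` and the value of `lam` are hypothetical. Applying it to $\gamma_t = 1/(\lambda t)$ yields the strongly convex schedule, and applying it to the constant $\gamma_t = C/\sqrt{T}$ yields the general convex one.

```python
def modify_step_sizes(gamma, T):
    """Map an original schedule gamma(t), t = 1..T, to [alpha_1, ..., alpha_T].

    alpha_t = 2**-i * gamma(t) for T_i < t <= T_{i+1}, with T_i as in Equation (2).
    """
    k = 0
    while (1 << k) < T:
        k += 1
    Ts = [T - (-(-T // (1 << i))) for i in range(k + 1)] + [T]
    alphas = []
    for i in range(k + 1):
        for t in range(Ts[i] + 1, Ts[i + 1] + 1):
            alphas.append(2.0 ** (-i) * gamma(t))
    return alphas

if __name__ == "__main__":
    lam = 0.5  # hypothetical strong convexity parameter
    T = 8
    alphas = modify_step_sizes(lambda t: 1.0 / (lam * t), T)
    for t, a in enumerate(alphas, start=1):
        print(t, a)
```

Passing the original schedule as a callable keeps the scheme independent of any particular $\gamma_t$, which mirrors how Theorems 3 and 4 are stated for a general sequence.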

###### Assumption 4 (Slowly Decreasing Step Size Sequence).

We call a step size sequence ‘decreasing’ if . We say that step size sequence has ‘at most polynomial decay’ with decay constant if for every .

We have the following general theorem:

###### Theorem 3.

Let $\gamma_t$ be a decreasing step size sequence with at most polynomial decay with decay constant $\beta$. Let the iterates of SGD with step size $\gamma_t$ be $y_t$. Let $\alpha_t$ be the modification of $\gamma_t$ as defined in Equation (5). Let the iterates of SGD with step size $\alpha_t$ be $x_t$. Then, for all , we have:

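$$\mathbb{E}[F(x_T)] \le 5 G^2 \gamma_T \left(\frac{1}{\beta^2} + \frac{1}{\beta^4}\right) + \inf_{\lceil T/4 \rceil \le t \le T_1} \mathbb{E}[F(y_t)].$$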

We also give a high probability version of Theorem 3.

###### Theorem 4.

Let $\delta \in (0, 1)$. Let $q^{(0)}$ be any arbitrary fixed probability distribution over the set $\{\lceil T/4 \rceil, \dots, T_1\}$. With probability at least $1 - \delta$, we have:

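$$F(x_T) \le \gamma_T \cdot G^2 \left(120 \log\left(\frac{1}{\delta}\right) + 400\right) \cdot \left(\frac{1}{\beta^2} + \frac{1}{\beta^4}\right) + \sum_{s = \lceil T/4 \rceil}^{T_1} q^{(0)}(s) F(y_s).$$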

That is, the above theorems show that compared to any weighted average of function values of the iterates $y_t$ in the iterations $\lceil T/4 \rceil \le t \le T_1$, the error of $x_T$ is not significantly larger if $\beta$ is reasonably large and $\gamma_T$ is small. Now, using standard analysis, we can ensure a small average function value for the iterates in the iterations $\lceil T/4 \rceil \le t \le T_1$. A small value of $\gamma_T$ and a bound on $\beta$ hold trivially for standard step-size sequences.

See Section 3 for detailed proofs of the above theorems. We first develop general technique and prove key lemmas in the next section, and then present proofs for all the theorems.

### 2.2 Lower Bounds

The step size modification procedure described above assumed the knowledge of the last iterate (this is not a setback in practice). We study the case of infinite horizon SGD. In this section we state our bounds on the last iterate of ‘any time’ (infinite horizon) SGD in the case of strongly convex objectives. We will first introduce the notion of suboptimality that we consider. In particular, we look at two kinds of ‘bad performance’ in infinite horizon SGD for non-smooth strongly convex optimization. Consider any infinite step size sequence .

1. The sequence is said to be ‘bad in expectation’ if for an objective satisfying assumptions 1,2 and 3, some choice of subgradient oracle, and SGD iterates with step size , there is a fixed subsequence such that .

2. The sequence is said to be ‘bad almost surely’ if for an objective satisfying assumptions 1,2 and 3, some choice of subgradient oracle, and SGD iterates with step size , with probability there exists a random infinite sequence of times such that

We give a ‘no free lunch’ theorem: that is we show that infinite horizon step-size sequence for non-smooth strongly convex optimization is either ‘bad in expectation’ or ‘bad almost surely’. More precisely, we will show that if any infinite horizon SGD is good in ‘expectation’ for every for every strongly convex function, then it is ‘bad almost surely’ for some function .

###### Theorem 5.

Consider infinite horizon SGD with step size such that assumptions 1 2 and  3 hold for the objective function. Then, for any choice of , the algorithm is either bad in expectation or bad almost surely.

We give the proof in Section B.

## 3 Technical Ideas and Proofs

Recall the definition of $T_i$ from Section 2. The rough idea behind the proof is as follows: we will find a ‘good point’ in the range $[\lceil T/4 \rceil, T_1]$ and then show that this implies that there is a ‘good point’ between $T_1$ and $T_2$ and so on, until we conclude that $x_T$ is a good point.

To this end, we first provide a key lemma that bounds the total weighted deviation of SGD iterates from a given iterate (in terms of function value), i.e., it intuitively shows that once we find an iterate with small function value, the remaining iterates cannot deviate from it significantly. The lemma uses a trick that was first used in [16] and then also in [1].

###### Lemma 1.

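Let $x_t$ be the output of the SGD algorithm (Algorithm 1) with step size sequence $\alpha_t$ defined by (3). Then, given any $t_0 \le t_1$,

$$\sum_{t=t_0}^{t_1} 2\alpha_t\, \mathbb{E}[F(x_t) - F(x_{t_0})] \le \sum_{t=t_0}^{t_1} G^2 \alpha_t^2.$$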
###### Proof.

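By convexity of $\mathcal{W}$, the projection $\Pi_{\mathcal{W}}$ is non-expansive, and since $x_{t_0} \in \mathcal{W}$ we have:

$$\|x_{t+1} - x_{t_0}\| = \|\Pi_{\mathcal{W}}(x_t - \alpha_t \hat{g}_t(x_t)) - x_{t_0}\| \le \|x_t - \alpha_t \hat{g}_t(x_t) - x_{t_0}\|$$

Taking squares and expanding on both sides,

$$\|x_{t+1} - x_{t_0}\|^2 \le \|x_t - x_{t_0}\|^2 + \alpha_t^2 \|\hat{g}_t(x_t)\|^2 - 2\alpha_t \langle \hat{g}_t(x_t), x_t - x_{t_0} \rangle$$

Taking expectation on both sides, and realizing that $\hat{g}_t$ is independent of $x_t$ and $x_{t_0}$, we conclude,

$$\mathbb{E}[\|x_{t+1} - x_{t_0}\|^2] \le \mathbb{E}[\|x_t - x_{t_0}\|^2] + \alpha_t^2 G^2 - 2\alpha_t\, \mathbb{E}\langle g_t, x_t - x_{t_0} \rangle$$

Here we have used the fact that $\|\hat{g}_t(x_t)\| \le G$. Using convexity, $\langle g_t, x_t - x_{t_0} \rangle$ is lower bounded by $F(x_t) - F(x_{t_0})$. We conclude that:

$$\mathbb{E}[\|x_{t+1} - x_{t_0}\|^2] \le \mathbb{E}[\|x_t - x_{t_0}\|^2] + \alpha_t^2 G^2 - 2\alpha_t\, \mathbb{E}[F(x_t) - F(x_{t_0})]$$

The result now follows by summing the above inequality from $t = t_0$ to $t = t_1$: the squared distances telescope, the term at $t = t_0$ is zero, and the remaining term $\mathbb{E}[\|x_{t_1+1} - x_{t_0}\|^2]$ is non-negative. ∎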
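With exact subgradients, the same telescoping argument gives the inequality of Lemma 1 without expectations, so it can be checked numerically. The sketch below does this for subgradient descent on the toy objective $F(x) = |x|$ over $[-1, 1]$ with $G = 1$; the objective, the step sizes, the zero-based indexing and the names `run_gd_abs` and `lemma1_gap` are illustrative assumptions.

```python
def run_gd_abs(x_start, alphas):
    """Subgradient descent on F(x) = |x| over [-1, 1]; xs[t] is updated with alphas[t]."""
    xs = [x_start]
    for a in alphas:
        x = xs[-1]
        g = (x > 0) - (x < 0)  # exact subgradient of |x|
        xs.append(min(max(x - a * g, -1.0), 1.0))
    return xs

def lemma1_gap(xs, alphas, F, G, t0, t1):
    """Return (right-hand side) - (left-hand side) of Lemma 1 for indices t0..t1."""
    lhs = sum(2 * alphas[t] * (F(xs[t]) - F(xs[t0])) for t in range(t0, t1 + 1))
    rhs = sum(G ** 2 * alphas[t] ** 2 for t in range(t0, t1 + 1))
    return rhs - lhs

if __name__ == "__main__":
    alphas = [0.3 / (t + 1) ** 0.5 for t in range(60)]
    xs = run_gd_abs(0.8, alphas)
    worst = min(lemma1_gap(xs, alphas, abs, 1.0, t0, t1)
                for t0 in range(60) for t1 in range(t0, 60))
    print("smallest gap over all (t0, t1):", worst)
```

A non-negative gap for every pair $(t_0, t_1)$ reflects the intuition stated before the lemma: once an iterate with small function value is found, later iterates cannot be much worse on a step-size-weighted average.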

We now provide a high probability version of Lemma 1. To this end, we construct an exponential super-martingale that when combined with a Chernoff bound leads to exponential concentration bound. The method used is somewhat similar to the one used in [2], but our technique is specifically for Lemma 1 and is more concise.

For simplicity of exposition, we first define a few key quantities. Let and . We define the sequence as follows: for as follows:

 Lt1=1e⋅r, t∈[t0,t1],   Lt−1=Lt+L2t, t0≤t−1

Using Lemma 3, . Now, for any $l$ such that $t_0 \le l \le t_1$, we define the following random variables:

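$$A(l, t_1) := \sum_{t=l}^{t_1} L_t \left[2\alpha_t (F(x_t) - F(x_l)) - \alpha_t^2 G^2\right], \quad A^*(t_0, t_1) := \sum_{t=t_0}^{t_1} L_t \left[2\alpha_t (F(x_t) - F(x^*)) - \alpha_t^2 G^2\right]. \tag{7}$$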

We note the difference between $A$ and $A^*$: $A(l, t_1)$ considers suboptimality with respect to $x_l$ whereas $A^*(t_0, t_1)$ considers the suboptimality with respect to the optimizer $x^*$.

###### Lemma 2.

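Let $A$ and $A^*$ be as defined by (7). Let $p$ be any probability distribution over $\{t_0, \dots, t_1\}$. We let $(p.A)(t_0, t_1) := \sum_{l=t_0}^{t_1} p(l) A(l, t_1)$. Also, let $\alpha_t$ be a decreasing step size sequence. Then,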

$$\mathbb{P}[(p.A)(t_0, t_1) > \eta] \le \exp\left(-\frac{\eta}{8\alpha_{t_0}^2 G^2}\right).$$

Additionally, if almost surely, we have:

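$$\mathbb{P}[A^*(t_0, t_1) > \eta] \le \exp\left(\frac{2D^2 L_{t_0}}{8\alpha_{t_0}^2 G^2}\right) \exp\left(-\frac{\eta}{8\alpha_{t_0}^2 G^2}\right).$$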
###### Lemma 3.

Let be fixed. Let , , , . Then, for every ,

See Section A for proofs of the above given lemmata. We also require the following technical lemma:

###### Lemma 4.

Let $T_i$ be as defined in Section 2. Then, for all $0 \le i \le k - 1$:

$$4(T_{i+2} - T_{i+1}) \ge T_{i+1} - T_i.$$
###### Proof.

Lemma follows from the fact that . ∎
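The inequality of Lemma 4 involves only the integer breakpoints $T_i$, so it is easy to verify exhaustively for a range of horizons. The following sketch does so; `breakpoints` and `lemma4_holds` are hypothetical helper names, and the index range $0 \le i \le k-1$ is assumed (it is the largest range for which $T_{i+2}$ is defined).

```python
def breakpoints(T):
    """Return [T_0, ..., T_{k+1}] with T_i = T - ceil(T / 2**i) and T_{k+1} = T."""
    k = 0
    while (1 << k) < T:
        k += 1
    return [T - (-(-T // (1 << i))) for i in range(k + 1)] + [T]

def lemma4_holds(T):
    """Check 4 * (T_{i+2} - T_{i+1}) >= T_{i+1} - T_i for all 0 <= i <= k - 1."""
    Ts = breakpoints(T)
    return all(4 * (Ts[i + 2] - Ts[i + 1]) >= Ts[i + 1] - Ts[i]
               for i in range(len(Ts) - 2))

if __name__ == "__main__":
    print(all(lemma4_holds(T) for T in range(1, 5000)))
```

The factor 4 (rather than 2) absorbs the rounding introduced by the ceilings when $T$ is not a power of two.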

### 3.1 Step Size Modification

Henceforth, we will assume that $\gamma_t$ is a decreasing step size sequence with at most polynomial decay (decay constant being $\beta$). We let $\alpha_t$ be the modification of $\gamma_t$ as defined in Equation (5). Let,

 τi:=arginfTi

Note that . We note that the $\tau_i$ are completely deterministic and only used as part of the proof. The ability to compute $\tau_i$ is not necessary.

###### Lemma 5.

Let ’s be iterates of SGD (Algorithm 1) with modified step size sequence of defined in (5); sequence satisfies Assumption 4. Let be as defined by (2), and , be as defined in (8). Also, let . Then, the following holds for all :

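$$\mathbb{E}[F(x_{\tau_{i+1}}) - F(x_{\tau_i})] \le \frac{5 G^2 \gamma_T}{\beta^2} 2^{-i}, \qquad \mathbb{E}[F(x_{\tau_1}) - F(x_{\tau_0})] \le \frac{5 G^2 \gamma_T}{\beta^4}.$$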
###### Proof.

We first consider $i \ge 1$. If , the proof is done. Else, using Lemma 1 with $t_0 = \tau_i$ and $t_1 = T_{i+2}$, and the fact that $\alpha_t$ is a decreasing sequence, we get:

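$$\frac{\sum_{t=\tau_i}^{T_{i+2}} 2\alpha_t\, \mathbb{E}[F(x_t) - F(x_{\tau_i})]}{T_{i+2} - \tau_i + 1} \le \frac{\sum_{t=\tau_i}^{T_{i+2}} G^2 \alpha_t^2}{T_{i+2} - \tau_i + 1} \le G^2 \alpha_{T_i + 1}^2. \tag{9}$$

By definition of $\tau_i$, $\mathbb{E}[F(x_t)] \ge \mathbb{E}[F(x_{\tau_i})]$ whenever $T_i < t \le T_{i+1}$. Hence,

$$G^2 2^{-2i} \gamma_{T_i + 1}^2 = G^2 \alpha_{T_i + 1}^2 \ge \frac{\sum_{t=\tau_i}^{T_{i+2}} 2\alpha_t\, \mathbb{E}[F(x_t) - F(x_{\tau_i})]}{T_{i+2} - \tau_i + 1} \ge \frac{\sum_{t=T_{i+1}+1}^{T_{i+2}} 2\alpha_t\, \mathbb{E}[F(x_t) - F(x_{\tau_i})]}{T_{i+2} - \tau_i + 1}, \tag{10}$$

where the first equality follows from the definition of $\alpha_t$ in (5), the first inequality follows from Equation (9), and the final inequality follows from the fact that $\mathbb{E}[F(x_t)] \ge \mathbb{E}[F(x_{\tau_i})]$ when $T_i < t \le T_{i+1}$ (see the definition of $\tau_i$ in (8)).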

Now, by using the above inequality with the assumption , and the fact that , we have:

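$$\begin{aligned} G^2 2^{-2i} \gamma_{T_i + 1}^2 &\ge 2\alpha_{T_{i+2}} \frac{T_{i+2} - T_{i+1}}{T_{i+2} - T_i}\, \mathbb{E}[F(x_{\tau_{i+1}}) - F(x_{\tau_i})] \overset{\zeta_1}{\ge} \frac{2\alpha_{T_{i+2}}}{5}\, \mathbb{E}[F(x_{\tau_{i+1}}) - F(x_{\tau_i})] \\ &= \frac{2^{-i} \gamma_{T_{i+2}}}{5}\, \mathbb{E}[F(x_{\tau_{i+1}}) - F(x_{\tau_i})] \ge \frac{2^{-i} \beta \gamma_{T_i + 1}}{5}\, \mathbb{E}[F(x_{\tau_{i+1}}) - F(x_{\tau_i})], \end{aligned} \tag{11}$$

where $\zeta_1$ follows from Lemma 4. The equality follows from the definition of $\alpha_t$, and the last inequality follows from the $\beta$-slowly decaying assumption for $\gamma_t$ (Assumption 4). That is, we obtain the result for the case $i \ge 1$. The proof for the case $i = 0$ follows with minor modifications to the arguments given above. ∎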

We now present a high probability version of Lemma 5.

###### Lemma 6.

Consider the setting of Lemma 5. Let and define for and for . Let be any probability distribution over . Let , where and the sequence is defined by (6). Then, for any and , the following holds with probability at least :

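$$\sum_{t=T_{i+1}+1}^{T_{i+2}} p^{(i+1)}(t) F(x_t) \le \frac{G^2 \gamma_T 2^{-i}}{\beta^2} \left(15 + 120 \log\frac{1}{\delta_i}\right) + \sum_{s=T_i+1}^{T_{i+1}} q^{(i)}(s) F(x_s).$$

For $i = 0$, the following holds with probability at least $1 - \delta_0$:

$$\sum_{t=T_1+1}^{T_2} p^{(1)}(t) F(x_t) \le \frac{G^2 \gamma_T 2^{-i}}{\beta^4} \left(15 + 120 \log\frac{1}{\delta_i}\right) + \sum_{s=\lceil T/4 \rceil}^{T_1} q^{(0)}(s) F(x_s).$$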
###### Proof.

We will only show the case . The case follows by a similar proof. For , we define . We let be defined as follows over :

 κ(Ti+1):=Γ(Ti+1)Γ(Ti+1)⋅q(i)(Ti+1), κ(t):=Γ(Ti+1)Γ(t)q(i)(t)+αtLt⋅(∑t−1s=Ti+1κ(s))Γ(t), t∈(Ti+1,Ti+1], κ(t):=0, ∀t≥Ti+1. (12)

From Lemma 7, we conclude that is a probability distribution over . From Lemma 2, we conclude that with probability atleast :

$$(\kappa.A)(t_0, t_1) \le 8\alpha_{t_0}^2 G^2 \log\frac{1}{\delta_i} \tag{13}$$

We will show that when this event happens, the inequality in the statement of the lemma holds. If , then the statement of the lemma holds trivially. Now assume . We use the fact that is supported over and hence:

$$(\kappa.A)(t_0, t_1) = \sum_{l=T_i+1}^{T_{i+1}} \sum_{t=l}^{T_{i+2}} \kappa(l) L_t \left[2\alpha_t (F(x_t) - F(x_l)) - \alpha_t^2 G^2\right]$$

We exchange summation and collect the coefficients of the term to conclude:

where (empty sum being by definition). By definition of , and . Therefore, we conclude:

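$$(\kappa.A)(t_0, t_1) = \sum_{t=T_{i+1}+1}^{T_{i+2}} 2\alpha_t L_t F(x_t) - \sum_{t=T_{i+1}+1}^{T_{i+2}} \alpha_t^2 G^2 L_t - \sum_{s=T_i+1}^{T_{i+1}} \alpha_s^2 G^2 L_s \sigma_s - \left(\sum_{t=T_{i+1}+1}^{T_{i+2}} 2\alpha_t L_t\right) \left(\sum_{s=T_i+1}^{T_{i+1}} q^{(i)}(s) F(x_s)\right). \tag{14}$$

We recall that whenever . The rest of the proof is similar to Equation (11) in Lemma 5. We use the fact that $\alpha_t$ is the modification of $\gamma_t$, that $\gamma_t$ has at most polynomial decay, and Lemma 4 in Equation (14) to conclude the result. ∎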

###### Lemma 7.

Let $\kappa$ be as defined in (12). Then, $\kappa$ is a probability distribution over $\{T_i + 1, \dots, T_{i+1}\}$.

The proof of this lemma is given in Section A.

### 3.2 Proof of Theorem 3

###### Proof.

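Recall the definition of $\tau_i$ in (8). Clearly, $\tau_{k+1} = T$. Summing the bounds in Lemma 5 we conclude:

$$\begin{aligned} \mathbb{E}[F(x_T)] = \mathbb{E}[F(x_{\tau_{k+1}})] &= \mathbb{E}[F(x_{\tau_0})] + \sum_{i=0}^{k} \mathbb{E}[F(x_{\tau_{i+1}}) - F(x_{\tau_i})] \\ &\le \mathbb{E}[F(x_{\tau_0})] + \frac{5 G^2 \gamma_T}{\beta^4} + \sum_{i=1}^{k} \frac{5 G^2 \gamma_T}{\beta^2} 2^{-i} \le 5 G^2 \gamma_T \left(\frac{1}{\beta^2} + \frac{1}{\beta^4}\right) + \inf_{\lceil T/4 \rceil \le t \le T_1} \mathbb{E}[F(x_t)]. \end{aligned}$$

We conclude the result by noting that $x_t = y_t$ for all $t \le T_1$, since the modified and original step sizes coincide on the first block. ∎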

### 3.3 Proof of Theorem 4

###### Proof.

This proof is similar to the proof of Theorem 3, but instead of Lemma 5 we use Lemma 6. In Lemma 6, we pick for and we let be arbitrary. We let . By the union bound, the inequalities in the statement of Lemma 6 hold for all $i$ simultaneously with probability at least $1 - \delta$. Summing all these inequalities, we conclude:

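$$\sum_{t=T_k+1}^{T_{k+1}} p^{(k)}(t) F(x_t) \le \gamma_T G^2 \left[120 \log\left(\frac{1}{\delta}\right) + 400\right] \left[\frac{1}{\beta^2} + \frac{1}{\beta^4}\right] + \sum_{s=\lceil T/4 \rceil}^{T_1} q^{(0)}(s) F(x_s).$$

We note that the distribution $p^{(k)}$ has unit mass over the point $T$ and that $x_s = y_s$ when $s \le T_1$ to conclude the result. ∎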

### 3.4 Proof of Theorem 1

###### Proof.

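We note that the step size $\alpha_t$ defined in Equation (3) is the modification of the standard step size $\gamma_t = C/\sqrt{T}$. Let $y_t$ be the output of SGD under the assumptions of the theorem when step size $\gamma_t$ is used. Using the fact that the infimum is smaller than any weighted average, we have:

$$\inf_{\lceil T/4 \rceil \le t \le T_1} \mathbb{E}[F(y_t) - F(x^*)] \le \frac{1}{T_1 - \lceil T/4 \rceil + 1} \sum_{t=\lceil T/4 \rceil}^{T_1} \mathbb{E}[F(y_t) - F(x^*)]$$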