# Stochastic Approximation of Smooth and Strongly Convex Functions: Beyond the O(1/T) Convergence Rate

Stochastic approximation (SA) is a classical approach for stochastic convex optimization. Previous studies have demonstrated that the convergence rate of SA can be improved by introducing either the smoothness or the strong convexity condition. In this paper, we make use of smoothness and strong convexity simultaneously to boost the convergence rate. Let λ be the modulus of strong convexity, κ be the condition number, F_* be the minimal risk, and α>1 be some small constant. First, we demonstrate that, in expectation, an O(1/[λT^α] + κF_*/T) risk bound is attainable when T = Ω(κ^α). Thus, when F_* is small, the convergence rate could be faster than O(1/[λT]) and approaches O(1/[λT^α]) in the ideal case. Second, to further benefit from small risk, we show that, in expectation, an O(1/2^{T/κ} + F_*) risk bound is achievable. Thus, the excess risk reduces exponentially until reaching O(F_*), and if F_* = 0, we obtain a global linear convergence. Finally, we emphasize that our proof is constructive and each risk bound is equipped with an efficient stochastic algorithm attaining that bound.


## 1 Introduction

Stochastic optimization (SO) is frequently encountered in a vast number of areas, including telecommunication, medicine, and finance, to name but a few (Shapiro et al., 2014). SO aims to minimize an objective function which is given in a form of the expectation. Formally, the problem can be formulated as

$$\min_{w \in \mathcal{W}} \; F(w) = \mathrm{E}_{f \sim \mathbb{P}}\big[f(w)\big] \tag{1}$$

where $f$ is a random function sampled from a distribution $\mathbb{P}$. A well-known special case is the risk minimization in machine learning, whose objective function is

$$F(w) = \mathrm{E}_{(x,y) \sim \mathbb{D}}\big[\ell(y, \langle w, x \rangle)\big]$$

where $(x, y)$ denotes a random instance-label pair sampled from a certain distribution $\mathbb{D}$, $w$ is the model for prediction, and $\ell(\cdot,\cdot)$ is a loss that measures the prediction error (Vapnik, 1998).

In this paper, we focus on stochastic convex optimization (SCO), in which both the domain $\mathcal{W}$ and the expected function $F(\cdot)$ are convex. A basic difficulty of solving stochastic optimization problems is that the distribution $\mathbb{P}$ is generally unknown, or even if known, it is hard to evaluate the expectation exactly (Nemirovski et al., 2009). To address this challenge, two different approaches have been proposed: sample average approximation (SAA) (Kim et al., 2015) and stochastic approximation (SA) (Kushner and Yin, 2003). SAA collects a set of random functions from $\mathbb{P}$, and constructs their empirical average to approximate the expected function $F(\cdot)$. In contrast, SA tackles the stochastic optimization problem directly, at each iteration using a noisy observation of $F(\cdot)$ to improve the current iterate.
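As a toy illustration of the two approaches (our own sketch, not taken from the paper; the least-squares instance and all constants are invented for exposition), SAA first draws a sample set and minimizes the empirical average, while SA updates the iterate with one stochastic gradient at a time:

```python
# Illustrative sketch: SAA vs. SA on the scalar least-squares problem
# F(w) = E[(x*w - y)^2] with y = 2x + noise, so the minimizer is w* = 2.
import random

random.seed(0)

def sample():
    x = random.gauss(0.0, 1.0)
    return x, 2.0 * x + random.gauss(0.0, 0.1)

# SAA / ERM: collect n samples, then minimize the empirical average exactly.
data = [sample() for _ in range(1000)]
w_saa = sum(x * y for x, y in data) / sum(x * x for x, y in data)

# SA / SGD: one noisy gradient per iteration, nothing stored.
w_sa = 0.0
for t in range(1, 5001):
    x, y = sample()
    grad = 2.0 * (x * w_sa - y) * x                       # stochastic gradient
    w_sa = max(-10.0, min(10.0, w_sa - grad / (2.0 * t))) # projected step 1/(lambda*t)
print(w_saa, w_sa)
```

Both estimates approach the minimizer $w_* = 2$; SA never stores the sample set, which is the low per-iteration cost referred to below.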

Compared with SAA, SA is more efficient due to its low computational cost per iteration, and has received significant research interest from the optimization and machine learning communities (Zhang, 2004; Duchi et al., 2011; Ge et al., 2015; Wang et al., 2017). The performance of SA algorithms is typically measured by the excess risk:

$$F(w_T) - \min_{w \in \mathcal{W}} F(w)$$

where $w_T$ is the solution returned after $T$ iterations. For Lipschitz continuous convex functions, stochastic gradient descent (SGD) achieves the unimprovable $O(1/\sqrt{T})$ rate of convergence. Alternatively, if the optimization problem has certain curvature properties, then faster rates are sometimes possible. Specifically, for smooth functions, SGD is equipped with an $O(1/T + \sqrt{F_*/T})$ risk bound, where $F_*$ is the minimal risk (Srebro et al., 2010). Thus, the convergence rate for smooth functions could be faster than $O(1/\sqrt{T})$ when the minimal risk is small. For strongly convex functions, the convergence rate can also be improved to $O(1/[\lambda T])$, where $\lambda$ is the modulus of strong convexity (Hazan and Kale, 2011).

From the above discussions, we observe that either smoothness or strong convexity could be exploited to improve the convergence rate of SA. This observation motivates subsequent studies that boost the convergence rate by considering smoothness and strong convexity simultaneously. However, existing results are unsatisfactory because they either rely on strong assumptions (Mahdavi and Jin, 2013; Schmidt and Roux, 2013), are only applicable to unconstrained domains (Moulines and Bach, 2011; Needell et al., 2014), or are limited to the problem of finite sums (Roux et al., 2012; Shalev-Shwartz and Zhang, 2013; Johnson and Zhang, 2013). This paper demonstrates that for the general SO problem, the convergence rate of SA could be faster than $O(1/[\lambda T])$ when both smoothness and strong convexity are present and the minimal risk is small. Our work is similar in spirit to a recent study of SAA (Zhang et al., 2017a), which also establishes faster rates under similar conditions. The main contributions of our paper are summarized below.

• First, we propose a fast algorithm for stochastic approximation (FASA), which applies epoch gradient descent (Epoch-GD) (Hazan and Kale, 2011) with a carefully designed initial solution and step size. Let $\kappa$ be the condition number and $\alpha > 1$ be some small constant. Our theoretical analysis shows that, in expectation, FASA achieves an $O(1/[\lambda T^\alpha] + \kappa F_*/T)$ risk bound when the number of iterations $T = \Omega(\kappa^\alpha)$. As a result, the convergence rate could be faster than $O(1/[\lambda T])$ when $F_*$ is small, and approaches $O(1/[\lambda T^\alpha])$ when $F_* = 0$.

• Second, to further benefit from small risk, we propose to use a fixed step size in Epoch-GD, and establish an $O(1/2^{T/\kappa} + F_*)$ risk bound which holds in expectation. Thus, the excess risk reduces exponentially until reaching $O(F_*)$, and if $F_* = 0$, we obtain a global linear convergence.

## 2 Related Work

In this section, we review related work on SA and SAA.

### 2.1 Stochastic Approximation (SA)

For brevity, we only discuss first-order methods of SA, and results of zero-order methods can be found in the literature (Nesterov, 2011; Wibisono et al., 2012).

For Lipschitz continuous convex functions, stochastic gradient descent (SGD) exhibits the optimal $O(1/\sqrt{T})$ risk bound (Nemirovski and Yudin, 1983; Zinkevich, 2003). When the random function is nonnegative and smooth, SGD (with a suitable step size) has a risk bound of $O(1/T + \sqrt{F_*/T})$, becoming $O(1/T)$ if the minimal risk $F_* = 0$ (Srebro et al., 2010, Corollary 4). If the expected function is $\lambda$-strongly convex, some variants of SGD (Hazan and Kale, 2011, 2014; Rakhlin et al., 2012; Shamir and Zhang, 2013) achieve an $O(1/[\lambda T])$ rate which is known to be minimax optimal (Agarwal et al., 2012). For the square loss and the logistic loss, an $O(1/T)$ rate is attainable without strong convexity (Bach and Moulines, 2013). When the random function is $\alpha$-exponentially concave, the online Newton step (ONS) is equipped with an $O(d/[\alpha T])$ risk bound, where $d$ is the dimensionality (Hazan et al., 2007; Mahdavi et al., 2015). When the expected function is both smooth and strongly convex, we still have the $O(1/[\lambda T])$ convergence rate but with a smaller constant (Ghadimi and Lan, 2012). Specifically, the constant in the big-$O$ notation depends on the variance of the stochastic gradient instead of the maximum norm of the gradient.

There are some studies that have established convergence rates faster than $O(1/[\lambda T])$ when both smoothness and strong convexity are present. Moulines and Bach (2011) and Needell et al. (2014) demonstrate that the distance between the SGD iterate and the optimal solution decreases at a linear rate in the beginning, but their results are limited to unconstrained problems. When an upper bound of $F_*$ is available, Mahdavi and Jin (2013) show that it is possible to reduce the excess risk at a linear rate until a certain level. Under a strong growth condition, Schmidt and Roux (2013) prove that SGD could achieve a global linear rate. Recently, a variety of variance reduction techniques have been proposed that yield faster rates for SA (Roux et al., 2012; Shalev-Shwartz and Zhang, 2013; Johnson and Zhang, 2013). However, these methods are restricted to the special case that the expected function is a finite sum, and thus cannot be applied if the distribution $\mathbb{P}$ is unknown. As can be seen, existing fast rates of SA are restricted to special problems or rely on strong assumptions. We will provide detailed comparisons in Section 3 to illustrate the advantage of this study—our setting is more general and our convergence rates are faster.

While our paper focuses on stochastic convex optimization, we note that there has been a recent surge of interest in developing SA algorithms for non-convex problems (Ge et al., 2015; Allen-Zhu and Hazan, 2016; Reddi et al., 2016; Zhang et al., 2017b).

### 2.2 Sample Average Approximation (SAA)

SAA is also referred to as empirical risk minimization (ERM) in machine learning. In the literature, there are plenty of theories for SAA (Kim et al., 2015) or ERM (Vapnik, 1998). In the following, we only discuss related work on SAA in the past decade.

To present the results in SAA, we use $n$ to denote the total number of training samples. When the random function is Lipschitz continuous, Shalev-Shwartz et al. (2009) establish an $O(1/\sqrt{n})$ risk bound. When the random function is $\lambda$-strongly convex and Lipschitz continuous, Shalev-Shwartz et al. (2009) further prove an $O(1/[\lambda n])$ risk bound which holds in expectation. When the random function is $\alpha$-exponentially concave, an $O(d/[\alpha n])$ risk bound is attainable, where $d$ is the dimensionality (Koren and Levy, 2015; Mehta, 2016). Lower bounds of ERM for stochastic optimization are investigated by Feldman (2016). In a recent work, Zhang et al. (2017a) establish an $O(1/n)$-type risk bound when the random function is smooth and the expected function is Lipschitz continuous. The most surprising result is that when the random function is smooth and the expected function is Lipschitz continuous and $\lambda$-strongly convex, Zhang et al. (2017a) prove an $O(1/[\lambda n^2] + \kappa F_*/n)$ risk bound when $n$ is sufficiently large. Thus, the convergence rate of ERM could be faster than $O(1/[\lambda n])$ when both smoothness and strong convexity are present and the number of training samples is large enough.

## 3 Our Results

We first introduce assumptions used in our analysis, then present our algorithms and theoretical guarantees.

### 3.1 Assumptions

###### Assumption 1

The random function $f(\cdot)$ is nonnegative.

###### Assumption 2

The random function $f(\cdot)$ is (almost surely) $L$-smooth over $\mathcal{W}$, that is,

$$\big\|\nabla f(w) - \nabla f(w')\big\| \le L\,\|w - w'\|, \quad \forall w, w' \in \mathcal{W}. \tag{2}$$

###### Assumption 3

The expected function $F(\cdot)$ is $\lambda$-strongly convex over $\mathcal{W}$, that is,

$$F(w) + \langle \nabla F(w), w' - w \rangle + \frac{\lambda}{2}\|w' - w\|^2 \le F(w'), \quad \forall w, w' \in \mathcal{W}. \tag{3}$$

###### Assumption 4

The gradient of the random function is (almost surely) upper bounded by $G$, that is,

$$\|\nabla f(w)\| \le G, \quad \forall w \in \mathcal{W}. \tag{4}$$
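As a sanity check (an illustrative instance of our own, not from the paper), the scalar least-squares loss $f(w) = (xw - y)^2$ on $\mathcal{W} = [-1, 1]$ satisfies all four assumptions, with $L = 2x^2$, $\lambda = L$ (here $F = f$ since the sample is fixed), and $G$ attained at the boundary of $\mathcal{W}$:

```python
# Numerically verify Assumptions 1-4 for f(w) = (x*w - y)^2 on W = [-1, 1].
import random

random.seed(1)
x, y = 1.5, 0.5
L = 2 * x * x            # smoothness constant: f''(w) = 2x^2
lam = L                  # with a single fixed sample, F = f, so lambda = L
G = 2 * abs(x) * max(abs(x * w - y) for w in (-1.0, 1.0))  # max gradient on W

f = lambda w: (x * w - y) ** 2
grad = lambda w: 2 * x * (x * w - y)

for _ in range(1000):
    w1, w2 = random.uniform(-1, 1), random.uniform(-1, 1)
    assert f(w1) >= 0                                           # Assumption 1
    assert abs(grad(w1) - grad(w2)) <= L * abs(w1 - w2) + 1e-9  # (2)
    assert f(w1) + grad(w1) * (w2 - w1) + lam / 2 * (w2 - w1) ** 2 \
           <= f(w2) + 1e-9                                      # (3)
    assert abs(grad(w1)) <= G + 1e-9                            # (4)
print("kappa =", L / lam)  # prints "kappa = 1.0"
```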
##### Remark 1

We have the following comments regarding our assumptions.

• The above assumptions hold for many popular machine learning problems, such as (regularized) linear regression or logistic regression.

• Based on Assumptions 2 and 3, we define the condition number $\kappa = L/\lambda$, which will be used to characterize the performance of our methods. For simplicity, we assume $\kappa$ is a constant, and thus $L$ and $\lambda$ are on the same order.

• Let $w_* = \operatorname{argmin}_{w \in \mathcal{W}} F(w)$ be the optimal solution to (1). Assumption 3 implies (Hazan and Kale, 2011)

$$\frac{\lambda}{2}\|w - w_*\|^2 \le F(w) - F(w_*), \quad \forall w \in \mathcal{W}. \tag{5}$$

Actually, in our analysis, we only make use of (5) instead of (3).

### 3.2 A General Algorithm

We first introduce a general algorithm for SA, which always achieves an $O(1/[\lambda T])$ rate, and becomes faster when $F_*$ is small.

#### 3.2.1 Fast Algorithm for Stochastic Approximation (FASA)

Our fast algorithm for stochastic approximation (FASA) takes epoch gradient descent (Epoch-GD) as a subroutine. Although Hazan and Kale (2011) have established the convergence rate of Epoch-GD under the strong convexity condition, they did not utilize smoothness in their analysis. The procedures of Epoch-GD and FASA are described in Algorithm 1 and Algorithm 2, respectively.

Epoch-GD is an extension of stochastic gradient descent (SGD). It divides the optimization process into a sequence of epochs. In each epoch, Epoch-GD applies SGD multiple times, and the averaged iterate is passed to the next epoch. In the algorithm, we use $\Pi_{\mathcal{W}}(\cdot)$ to denote the projection onto the nearest point in $\mathcal{W}$. There are 4 input parameters of Epoch-GD: (1) $\eta_1$, the step size used in the first epoch; (2) $T_1$, the size of the first epoch; (3) $T$, the total number of stochastic gradients that can be consumed; and (4) $w_1^1$, the initial solution. In each consecutive epoch, the step size decreases exponentially and the size of the epoch increases exponentially.
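The procedure can be sketched as follows. This is a minimal, illustrative implementation on a one-dimensional toy objective; the oracle, domain, and parameter values are our own choices, not the paper's:

```python
# A minimal sketch of Epoch-GD (Hazan and Kale, 2011).  `grad_oracle`
# returns a stochastic gradient of F(w) = (w - 3)^2 and `project` is Pi_W.
import random

random.seed(2)

def project(w):                       # projection onto W = [-10, 10]
    return max(-10.0, min(10.0, w))

def grad_oracle(w):                   # noisy gradient of F(w) = (w - 3)^2
    return 2.0 * (w - 3.0) + random.gauss(0.0, 0.5)

def epoch_gd(eta1, T1, T, w0):
    w, eta, Tk, used = w0, eta1, T1, 0
    while used + Tk <= T:             # run epochs until the budget T is spent
        iterates = []
        for _ in range(Tk):
            w = project(w - eta * grad_oracle(w))
            iterates.append(w)
        w = sum(iterates) / len(iterates)   # averaged iterate seeds next epoch
        used += Tk
        eta, Tk = eta / 2.0, 2 * Tk   # halve step size, double epoch size
    return w

w_hat = epoch_gd(eta1=0.25, T1=8, T=4000, w0=0.0)
print(w_hat)                          # close to the minimizer w* = 3
```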

In FASA, we first invoke Epoch-GD with an arbitrary initial solution and a budget of $T/2$ stochastic gradients. The purpose of this step is to obtain a good solution $\hat{w}$ at the expense of half of the total budget. (In this step, Epoch-GD can be replaced with any algorithm that achieves the optimal $O(1/[\lambda T])$ rate for strongly convex stochastic optimization, e.g., SGD with $\alpha$-suffix averaging (Rakhlin et al., 2012).) Then, Epoch-GD is invoked again with $\hat{w}$ as its initial solution and a budget of $T/2$ stochastic gradients. This time, we set a large initial epoch size to utilize the fact that the initial solution $\hat{w}$ is of high quality. The convergence rate of FASA is given below.

###### Theorem 1

Suppose

$$T \ge \kappa^\alpha \tag{6}$$

where $\alpha > 1$ is some constant. Under Assumptions 1, 2, 3 and 4, the solution $\tilde{w}$ returned by Algorithm 2 satisfies

$$\mathrm{E}[F(\tilde{w})] - F_* \le \frac{2^{\alpha^2+5\alpha+5}\,G^2}{\lambda T^\alpha} + \frac{2^{2\alpha+5}\,\kappa F_*}{(2^{\alpha-1}-1)\,T}$$

where $F_* = F(w_*)$ is the minimal risk.

##### Remark 2

The above theorem implies that when $T$ is large enough, i.e., $T = \Omega(\kappa^\alpha)$, FASA achieves an

$$O\!\left(\frac{1}{\lambda T^\alpha} + \frac{\kappa F_*}{T}\right)$$

rate of convergence, which is faster than $O(1/[\lambda T])$ when the minimal risk $F_*$ is small. In particular, when $F_* = 0$, the convergence rate is improved to $O(1/[\lambda T^\alpha])$. Note that the upper bound has an exponential dependence on $\alpha$, so it is meaningful only when $\alpha$ is chosen as a small constant.

##### Remark 3

Note that our algorithm is translation-invariant, i.e., it does not change if we translate the function by a constant. Since the upper bound in Theorem 1 depends on the minimal risk $F_*$, one may attempt to subtract a constant from the function to make the bound tighter. However, because of the nonnegativity requirement in Assumption 1, the best we can do is to redefine

$$f(w) \leftarrow f(w) - \operatorname*{ess\,inf}_{f \sim \mathbb{P}}\, \inf_{w \in \mathcal{W}} f(w)$$

and replace $F_*$ in Theorem 1 with $F_* - \operatorname*{ess\,inf}_{f \sim \mathbb{P}} \inf_{w \in \mathcal{W}} f(w)$.

To simplify Theorem 1, we provide the following corollary by setting $\alpha = 2$.

###### Corollary 2

Suppose $T \ge \kappa^2$. Under the same conditions as Theorem 1, we have

$$\mathrm{E}[F(\tilde{w})] - F_* \le \frac{2^{19} G^2}{\lambda T^2} + \frac{2^9 \kappa F(w_*)}{T} = O\!\left(\frac{1}{\lambda T^2} + \frac{\kappa F_*}{T}\right).$$
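The constants in Corollary 2 are just those of Theorem 1 evaluated at $\alpha = 2$, which one can check mechanically:

```python
# Specialize the constants of Theorem 1 at alpha = 2:
# 2^(alpha^2 + 5*alpha + 5) -> 2^19 and 2^(2*alpha + 5)/(2^(alpha-1) - 1) -> 2^9.
alpha = 2
c1 = 2 ** (alpha ** 2 + 5 * alpha + 5)
c2 = 2 ** (2 * alpha + 5) // (2 ** (alpha - 1) - 1)
assert c1 == 2 ** 19 and c2 == 2 ** 9
print(c1, c2)  # prints "524288 512"
```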

#### 3.2.2 Comparisons with Previous Results

In the following, we compare our Theorem 1 and Corollary 2 with related work in SA (Ghadimi and Lan, 2012; Moulines and Bach, 2011; Needell et al., 2014) and SAA (Zhang et al., 2017a).

For smooth and strongly convex functions, Ghadimi and Lan (2012, Proposition 9) have established an $O(\sigma^2/[\lambda T])$ rate for the expected risk, where $\sigma^2$ is the variance of the stochastic gradient. Note that this rate is worse than that in Corollary 2 because $\sigma^2$ is a constant in general, even when $F_*$ is small. For example, consider the problem of linear regression

$$\min_{w \in \mathcal{W}} \; F(w) = \mathrm{E}_{(x,y) \sim \mathbb{D}}\big[(x^\top w - y)^2\big],$$

and assume $y = x^\top w_* + \epsilon$, where $\epsilon$ is zero-mean Gaussian random noise independent of $x$, and $w_* \in \mathcal{W}$. Then $F_* = \mathrm{E}[\epsilon^2]$, which approaches zero as $\epsilon \to 0$. On the other hand, the variance of the stochastic gradient at a solution $w_t$ can be decomposed as

$$\sigma^2 = \mathrm{E}\Big[\big\|2(x^\top w_t - y)x - \mathrm{E}[2(x^\top w_t - y)x]\big\|^2\Big] = 4\,\mathrm{E}\Big[\big\|(xx^\top - \mathrm{E}[xx^\top])(w_t - w_*)\big\|^2\Big] + 4\,\mathrm{E}\big[\|\epsilon x\|^2\big].$$

Even when there is no noise, i.e., $\epsilon = 0$, the variance is nonzero due to the randomness of $x$.
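A quick Monte-Carlo check (our own illustration with invented constants) confirms this: with $w_t \ne w_*$, $x \sim N(0,1)$, and zero noise, the stochastic gradient $2(x^\top w_t - y)x = 2x^2(w_t - w_*)$ still has variance $4(w_t - w_*)^2 \operatorname{Var}(x^2) = 8 > 0$:

```python
# Variance of the stochastic gradient in noiseless scalar linear regression:
# g = 2*x^2*(w_t - w_star) varies with x, so sigma^2 > 0 even though F_* = 0.
import random

random.seed(3)
w_star, w_t = 2.0, 1.0
grads = []
for _ in range(100000):
    x = random.gauss(0.0, 1.0)
    y = x * w_star                         # epsilon = 0, hence F_* = 0
    grads.append(2.0 * (x * w_t - y) * x)
mean = sum(grads) / len(grads)
var = sum((g - mean) ** 2 for g in grads) / len(grads)
print(var)   # close to 4 * Var(x^2) = 8, clearly nonzero
```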

For unconstrained problems, Moulines and Bach (2011) and Needell et al. (2014) have analyzed the distance between the SGD iterate and the optimal solution under the smoothness and strong convexity conditions. In particular, Theorem 1 of Moulines and Bach (2011) (with suitably chosen parameters) implies the following convergence rate for the expected risk:

$$O\!\left(\frac{\exp(\kappa^2)}{T^2} + \frac{\kappa F_* \log T}{T}\right)$$

which is worse than our Corollary 2 because of the additional $\log T$ factor in the second term. Theorem 2.1 of Needell et al. (2014) leads to the following rate:

$$O\!\left(\Big(1 - \frac{\lambda}{T}\Big)^{T} + \frac{\kappa F_*}{T}\right) \tag{7}$$

which is also worse than our Corollary 2 because $(1 - \lambda/T)^T$ approaches a constant as $T$ increases. We note that it is possible to extend the analysis of Needell et al. (2014) to constrained problems, but the convergence rate becomes slower, and thus is worse than our rate. Detailed discussions about how to simplify and extend the result of Needell et al. (2014) are provided in Appendix A.

The convergence rate in Corollary 2 matches the state-of-the-art convergence rate of SAA (Zhang et al., 2017a). Specifically, under similar conditions, Zhang et al. (2017a, Theorem 3) have proved an $O(1/[\lambda n^2] + \kappa F_*/n)$ risk bound for SAA, where $n$ is the number of samples. Compared with the results of Zhang et al. (2017a), our theoretical guarantees have the following advantages:

• The lower bound of $T$ in our results is independent of the dimensionality, and thus our results can be applied to infinite-dimensional problems, e.g., learning with kernels. In contrast, the lower bound of $n$ given by Zhang et al. (2017a, Theorem 3) depends on the dimensionality.

• For the special problem of supervised learning, Zhang et al. (2017a, Theorem 7) show that the lower bound on $n$ can be replaced with a dimensionality-independent one. However, it does not support the case $1 < \alpha < 2$, which is covered by our Theorem 1.

• The convergence rate in Theorem 1 keeps improving as $\alpha$ increases. As a result, when $\alpha > 2$, the convergence rate in Theorem 1 is faster than that of SAA given by Zhang et al. (2017a).

### 3.3 A Special Algorithm for Small Risk

The convergence rate of FASA cannot go beyond $O(1/[\lambda T^\alpha])$, even when $F_*$ is $0$. In the following, we develop a special algorithm for the case that $F_*$ is small. The new algorithm achieves a linear convergence when $F_*$ is small, although it may not perform well otherwise.

#### 3.3.1 Epoch Gradient Descent with Fixed Step Size (Epoch-GD-F)

The new algorithm is a variant of Epoch-GD, in which the step size, as well as the size of each epoch, is fixed. We name the new algorithm epoch gradient descent with fixed step size (Epoch-GD-F), and summarize it in Algorithm 3. Epoch-GD-F has 4 parameters: (1) $\eta$, the fixed step size; (2) $T'$, the size of each epoch; (3) $T$, the total number of stochastic gradients that can be consumed; and (4) $w_1^1$, the initial solution. We bound the excess risk of Epoch-GD-F in the following theorem.

###### Theorem 3

Set

$$\eta = \frac{1}{4\beta L}, \quad T' = 16\beta\kappa \tag{8}$$

where $\beta \ge 1$ is some constant, and let the initial solution $w_1^1$ be any point in $\mathcal{W}$. Under Assumptions 1, 2 and 3, the solution $\tilde{w}$ returned by Algorithm 3 satisfies

$$\mathrm{E}[F(\tilde{w})] - F_* \le \frac{F(w_1^1) - F_*}{2^{k^\dagger}} + \frac{2F_*}{\beta}$$

where $k^\dagger = \lfloor T/T' \rfloor$.

##### Remark 4

From the above theorem, we observe that the excess risk is upper bounded by two terms: the first one decreases exponentially w.r.t. the number of epochs $k^\dagger$, and the second one depends on $F_*$. When $\beta = O(1)$, the excess risk is on the order of

$$O\!\left(\frac{1}{2^{T/\kappa}} + F_*\right)$$

which means it reduces exponentially until reaching $O(F_*)$. Note that if $F_* = 0$, we obtain a global linear convergence.
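The geometric decay is easy to observe on a toy realizable instance (our own construction, not the paper's experiment: $F(w) = w^2$ with a noiseless gradient oracle, so $F_* = 0$, $L = \lambda = 2$, $\kappa = 1$, and (8) with $\beta = 1$ gives $\eta = 1/8$ and $T' = 16$):

```python
# Epoch-GD-F sketch: fixed step size and fixed epoch size, as in (8).
def project(w):                       # Pi_W with W = [-10, 10]
    return max(-10.0, min(10.0, w))

def grad_oracle(w):                   # gradient of F(w) = w^2 (noiseless, F_* = 0)
    return 2.0 * w

def epoch_gd_f(eta, T_prime, T, w0):
    w = w0
    for _ in range(T // T_prime):     # k_dagger = floor(T / T') epochs
        iterates = []
        for _ in range(T_prime):
            w = project(w - eta * grad_oracle(w))
            iterates.append(w)
        w = sum(iterates) / len(iterates)   # averaged iterate seeds next epoch
    return w

w_tilde = epoch_gd_f(eta=1.0 / 8.0, T_prime=16, T=160, w0=5.0)
print(w_tilde)   # essentially 0: the excess risk decayed geometrically
```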

To better illustrate the convergence rate in Theorem 3, we present the iteration complexity of Epoch-GD-F.

###### Corollary 4

Assume

$$T = \Omega\!\left(\beta\kappa\log\frac{1}{\epsilon}\right).$$

Under the same conditions as Theorem 3, the solution $\tilde{w}$ returned by Algorithm 3 satisfies

$$\mathrm{E}[F(\tilde{w})] - F_* \le \epsilon + \frac{2F_*}{\beta}.$$

#### 3.3.2 Comparisons with Previous Results

In the following, we compare our Theorem 3 and Corollary 4 with related work in SA (Mahdavi and Jin, 2013; Schmidt and Roux, 2013; Moulines and Bach, 2011; Needell et al., 2014).

When a prior knowledge $\epsilon_{\mathrm{prior}} \ge F_*$ is given beforehand, Mahdavi and Jin (2013) show that when

$$T = \Omega\!\left(d\beta^3\kappa^4\log\frac{1}{\epsilon}\right),$$

their stochastic algorithm is able to find a solution $\hat{w}$ such that, with high probability,

$$F(\hat{w}) \le \epsilon_{\mathrm{prior}} + \epsilon + \frac{2\epsilon_{\mathrm{prior}}}{\beta}.$$

Although our Corollary 4 only holds in expectation, it is stronger than that of Mahdavi and Jin (2013) in the following aspects:

• Their algorithm needs a prior knowledge $\epsilon_{\mathrm{prior}} \ge F_*$, while our algorithm does not.

• The final risk of their solution is upper bounded in terms of $\epsilon_{\mathrm{prior}}$, while in our case, the risk is upper bounded in terms of $F_*$, which is smaller than $\epsilon_{\mathrm{prior}}$.

• Their sample complexity has a linear dependence on the dimensionality $d$; in contrast, ours is dimensionality-independent. Thus, our results can be applied to the non-parametric setting where hypotheses lie in a functional space of infinite dimension.

• The dependence of their sample complexity on $\beta$ and $\kappa$ is much higher than ours.

Under a strong growth condition (Solodov, 1998), Schmidt and Roux (2013) have established the following linear convergence rate for SGD when applied to unconstrained problems:

$$O\!\left(\Big(1 - \frac{1}{\kappa}\Big)^{T}\right).$$

The strong growth condition requires that all stochastic gradients are $0$ at the optimal solution $w_*$, which is itself a necessary condition for $F_* = 0$, because all the random functions are nonnegative. In this case, our Theorem 3 also achieves a linear rate of the same order. However, our results have the following advantages:

• Our Theorem 3 is more general because it covers the cases that $F_*$ is nonzero.

• Our results are applicable even when there is a domain constraint.

For unconstrained problems, Theorem 2.1 of Needell et al. (2014) with a suitable step size also implies the following rate:

$$O\!\left(\Big(1 - \frac{1}{\kappa}\Big)^{T} + \kappa F_*\right) \tag{9}$$

which is slower than our rate in Theorem 3, because of the additional dependence on $\kappa$ in the second term. Besides, Needell et al. (2014, (2.4) and (2.2)) provided the iteration complexity of their algorithm, as well as that of Moulines and Bach (2011), when the minimal risk is known. Specifically, the iteration complexities of Moulines and Bach (2011) and Needell et al. (2014) for finding an $\epsilon$-optimal solution are

$$\Omega\!\left(\log\frac{1}{\epsilon}\Big(\kappa^2 + \frac{\kappa^2 F_*}{\epsilon}\Big)\right) \quad \text{and} \quad \Omega\!\left(\log\frac{1}{\epsilon}\Big(\kappa + \frac{\kappa^2 F_*}{\epsilon}\Big)\right), \tag{10}$$

respectively. In this case, our Theorem 3 with $\beta = \Theta(1 + F_*/\epsilon)$ implies the following iteration complexity:

$$\Omega\!\left(\log\frac{1}{\epsilon}\Big(\kappa + \frac{\kappa F_*}{\epsilon}\Big)\right). \tag{11}$$

Compared with the bounds in (10), our iteration complexity is better because (i) it has a smaller dependence on $\kappa$, and (ii) it holds for constrained problems.

## 4 Analysis

Our analysis follows from well-known and standard techniques, including the analysis of stochastic gradient descent (Zinkevich, 2003), self-bounding property of smooth functions (Srebro et al., 2010), and the implication of strong convexity (Hazan and Kale, 2011).

### 4.1 Proof of Theorem 1

We first state the excess risk of $\hat{w}$, the solution returned by the first call of Epoch-GD. From Theorem 5 of Hazan and Kale (2014), we have

$$\mathrm{E}[F(\hat{w})] - F(w_*) \le \frac{32G^2}{\lambda T} \overset{(6)}{\le} \frac{32G^2}{\lambda\kappa^\alpha}. \tag{12}$$

We proceed to analyze the solution returned by the second call of Epoch-GD. In each epoch, the standard stochastic gradient descent (SGD) (Zinkevich, 2003) is applied. The following lemma shows how the excess risk decreases in each epoch. Apply $T$ iterations of the update

$$w_{t+1} = \Pi_{\mathcal{W}}\big[w_t - \eta\nabla f_t(w_t)\big]$$

where $f_t$ is a random function sampled from $\mathbb{P}$, and $\eta < 1/(2L)$. Assume $\mathcal{W}$ is convex and Assumptions 1 and 2 hold. Then, for any $w \in \mathcal{W}$, we have

$$\mathrm{E}[F(\bar{w})] - F(w) \le \frac{1}{2\eta T(1-2\eta L)}\,\mathrm{E}\big[\|w_1 - w\|^2\big] + \frac{2\eta L}{1-2\eta L}\,F(w)$$

where $\bar{w} = \frac{1}{T}\sum_{t=1}^T w_t$.

Based on the above lemma, we establish the following result for bounding the excess risk of the intermediate iterates. Consider the second call of Epoch-GD, with step size $\eta_1$, initial epoch size $T_1$, a budget of $T/2$ stochastic gradients, and initial solution $\hat{w}$. For any $k \ge 1$, we have

$$\mathrm{E}[F(w_1^{k+1})] - F(w_*) \le \frac{2^{\alpha^2+2\alpha+5}\,G^2}{\lambda T_k^\alpha} + \frac{2^{\alpha+3}\,\kappa F(w_*)}{T_k}\left(\sum_{i=1}^{k} \frac{1}{2^{(i-1)(\alpha-1)}}\right). \tag{13}$$

The number of epochs made is given by the largest value of $k$ satisfying $\sum_{i=1}^k T_i \le T/2$, i.e.,

$$\sum_{i=1}^{k} T_i = T_1\sum_{i=1}^{k} 2^{i-1} = T_1(2^k - 1) \le \frac{T}{2}.$$

This value is

$$k^\dagger = \left\lfloor \log_2\left(\frac{T}{2T_1} + 1\right) \right\rfloor,$$

and the final solution is $\tilde{w} = w_1^{k^\dagger+1}$. From Lemma 4.1, we have

$$\begin{aligned} \mathrm{E}[F(w_1^{k^\dagger+1})] - F(w_*) &\le \frac{2^{\alpha^2+2\alpha+5}\,G^2}{\lambda T_{k^\dagger}^\alpha} + \frac{2^{\alpha+3}\,\kappa F(w_*)}{T_{k^\dagger}}\left(\sum_{i=1}^{k^\dagger} \frac{1}{2^{(i-1)(\alpha-1)}}\right) \\ &\le \frac{2^{\alpha^2+2\alpha+5}\,G^2}{\lambda T_{k^\dagger}^\alpha} + \frac{2^{\alpha+3}\,\kappa F(w_*)}{T_{k^\dagger}} \cdot \frac{2^{\alpha-1}}{2^{\alpha-1}-1} \\ &\le \frac{2^{\alpha^2+5\alpha+5}\,G^2}{\lambda T^\alpha} + \frac{2^{2\alpha+5}\,\kappa F(w_*)}{(2^{\alpha-1}-1)\,T} \end{aligned}$$

where the last step is due to

$$T_{k^\dagger} = T_1 2^{k^\dagger - 1} \ge \frac{T_1}{4}\left(\frac{T}{2T_1} + 1\right) \ge \frac{T}{8}.$$
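The epoch-count formula above can be confirmed by brute force (a small self-check, not part of the paper):

```python
# Verify: the largest k with T1*(2^k - 1) <= T/2 equals floor(log2(T/(2*T1) + 1)).
import math

for T1 in (1, 2, 5, 8, 32):
    for T in range(2 * T1, 5000, 7):
        k = 0
        while T1 * (2 ** (k + 1) - 1) <= T / 2:
            k += 1                    # k ends as the largest feasible value
        assert k == int(math.floor(math.log2(T / (2 * T1) + 1)))
print("epoch-count formula verified")
```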

### 4.2 Proof of Lemma 4.1

We first introduce the self-bounding property of smooth functions (Srebro et al., 2010, Lemma 4.1): for an $H$-smooth and nonnegative function $f(\cdot)$,

$$\|\nabla f(w)\| \le \sqrt{4Hf(w)}, \quad \forall w \in \mathcal{W}.$$

Assumptions 1 and 2 imply that $f_t(\cdot)$ is nonnegative and $L$-smooth. From Lemma 4.2, we have

$$\|\nabla f_t(w)\|^2 \le 4Lf_t(w), \quad \forall w \in \mathcal{W}. \tag{14}$$
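The self-bounding property is easy to verify numerically on an illustrative smooth nonnegative loss (our own example: $f(w) = (xw - y)^2$, for which $L = 2x^2$ and in fact $\|\nabla f\|^2 = 2Lf \le 4Lf$):

```python
# Check |f'(w)|^2 <= 4*L*f(w) for the nonnegative L-smooth loss f(w) = (x*w - y)^2.
import random

random.seed(5)
for _ in range(1000):
    x = random.uniform(-3, 3)
    y = random.uniform(-3, 3)
    w = random.uniform(-5, 5)
    L = 2 * x * x                  # smoothness constant of f
    f = (x * w - y) ** 2
    g = 2 * x * (x * w - y)        # f'(w)
    assert g * g <= 4 * L * f + 1e-9
print("self-bounding property holds on all samples")
```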

Let $w'_{t+1} = w_t - \eta\nabla f_t(w_t)$. Following the analysis of online gradient descent (Zinkevich, 2003), for any $w \in \mathcal{W}$, we have

$$\begin{aligned} F(w_t) - F(w) &\le \langle \nabla F(w_t), w_t - w \rangle \\ &= \langle \nabla f_t(w_t), w_t - w \rangle + \langle \nabla F(w_t) - \nabla f_t(w_t), w_t - w \rangle \\ &= \frac{1}{2\eta}\big(\|w_t - w\|^2 - \|w'_{t+1} - w\|^2\big) + \frac{\eta}{2}\|\nabla f_t(w_t)\|^2 + \langle \nabla F(w_t) - \nabla f_t(w_t), w_t - w \rangle \\ &\le \frac{1}{2\eta}\big(\|w_t - w\|^2 - \|w_{t+1} - w\|^2\big) + \frac{\eta}{2}\|\nabla f_t(w_t)\|^2 + \langle \nabla F(w_t) - \nabla f_t(w_t), w_t - w \rangle \\ &\overset{(14)}{\le} \frac{1}{2\eta}\big(\|w_t - w\|^2 - \|w_{t+1} - w\|^2\big) + 2\eta Lf_t(w_t) + \langle \nabla F(w_t) - \nabla f_t(w_t), w_t - w \rangle \end{aligned}$$

where the first inequality is due to the convexity of $F(\cdot)$, and the second inequality is due to the nonexpanding property of the projection operator (Nemirovski et al., 2009, (1.5)).

Summing up over all $t = 1, \ldots, T$, we get

$$\sum_{t=1}^{T}\big(F(w_t) - F(w)\big) \le \frac{1}{2\eta}\|w_1 - w\|^2 + 2\eta L\sum_{t=1}^{T} f_t(w_t) + \sum_{t=1}^{T}\langle \nabla F(w_t) - \nabla f_t(w_t), w_t - w \rangle.$$

Recall that $\mathrm{E}[f_t(\cdot)] = F(\cdot)$ and $f_t$ is independent of $w_t$. Taking expectation over both sides, we have

$$\mathrm{E}\left[\sum_{t=1}^{T}\big(F(w_t) - F(w)\big)\right] \le \frac{1}{2\eta}\mathrm{E}\big[\|w_1 - w\|^2\big] + 2\eta L\,\mathrm{E}\left[\sum_{t=1}^{T} F(w_t)\right].$$

Rearranging the above inequality, we obtain

$$\mathrm{E}\left[\sum_{t=1}^{T}\big(F(w_t) - F(w)\big)\right] \le \frac{1}{2\eta(1-2\eta L)}\mathrm{E}\big[\|w_1 - w\|^2\big] + \frac{2\eta LT}{1-2\eta L}F(w).$$

Dividing both sides by $T$, we have

$$\mathrm{E}[F(\bar{w})] - F(w) \le \frac{1}{T}\,\mathrm{E}\left[\sum_{t=1}^{T}\big(F(w_t) - F(w)\big)\right] \le \frac{1}{2\eta T(1-2\eta L)}\mathrm{E}\big[\|w_1 - w\|^2\big] + \frac{2\eta L}{1-2\eta L}F(w)$$

where the first step is due to Jensen's inequality.

### 4.3 Proof of Lemma 4.1

Recall that the following parameters are used in the second call of Epoch-GD:

$$\eta_1 = \frac{1}{4L}, \quad T_1 = 2^{\alpha+3}\kappa, \quad T_{k+1} = 2T_k, \quad \eta_{k+1} = \frac{\eta_k}{2}, \quad k \ge 1.$$

Then, we have

$$\eta_k L \le \eta_1 L = \frac{1}{4}, \tag{15}$$

$$\lambda\eta_k T_k = \lambda\eta_1 T_1 = 2^{\alpha+1}. \tag{16}$$

We prove this lemma by induction on $k$. When $k = 1$, from Lemma 4.1, we have

$$\begin{aligned} \mathrm{E}[F(w_1^{2})] - F(w_*) &\le \frac{1}{2\eta_1 T_1(1-2\eta_1 L)}\mathrm{E}\big[\|w_1^1 - w_*\|^2\big] + \frac{2\eta_1 L}{1-2\eta_1 L}F(w_*) \\ &\overset{(15)}{\le} \frac{1}{\eta_1 T_1}\mathrm{E}\big[\|w_1^1 - w_*\|^2\big] + 4\eta_1 LF(w_*) \\ &\overset{(16)}{=} \frac{\lambda}{2^{\alpha+1}}\mathrm{E}\big[\|w_1^1 - w_*\|^2\big] + \frac{2^{\alpha+3}\kappa F(w_*)}{T_1} \\ &\overset{(5)}{\le} \frac{\lambda}{2^{\alpha+1}}\cdot\frac{2}{\lambda}\,\mathrm{E}\big[F(w_1^1) - F(w_*)\big] + \frac{2^{\alpha+3}\kappa F(w_*)}{T_1} \\ &\overset{(12)}{\le} \frac{1}{2^\alpha}\cdot\frac{32G^2}{\lambda\kappa^\alpha} + \frac{2^{\alpha+3}\kappa F(w_*)}{T_1} \\ &= \frac{2^{\alpha^2+2\alpha+5}G^2}{\lambda T_1^\alpha} + \frac{2^{\alpha+3}\kappa F(w_*)}{T_1} \end{aligned}$$

where the last step uses $T_1 = 2^{\alpha+3}\kappa$. Assume that (13) is true for some $k \ge 1$; we now prove the inequality for $k+1$. According to Lemma 4.1, we have

$$\begin{aligned} \mathrm{E}[F(w_1^{k+2})] - F(w_*) &\le \frac{1}{2\eta_{k+1} T_{k+1}(1-2\eta_{k+1} L)}\mathrm{E}\big[\|w_1^{k+1} - w_*\|^2\big] + \frac{2\eta_{k+1} L}{1-2\eta_{k+1} L}F(w_*) \\ &\overset{(15)}{\le} \frac{1}{\eta_{k+1} T_{k+1}}\mathrm{E}\big[\|w_1^{k+1} - w_*\|^2\big] + 4\eta_{k+1} LF(w_*) \\ &\overset{(16)}{=} \frac{\lambda}{2^{\alpha+1}}\mathrm{E}\big[\|w_1^{k+1} - w_*\|^2\big] + \frac{2^{\alpha+3}\kappa F(w_*)}{T_{k+1}} \\ &\overset{(5)}{\le} \frac{1}{2^{\alpha}}\,\mathrm{E}\big[F(w_1^{k+1}) - F(w_*)\big] + \frac{2^{\alpha+3}\kappa F(w_*)}{T_{k+1}} \\ &\overset{(13)}{\le} \frac{1}{2^\alpha}\left(\frac{2^{\alpha^2+2\alpha+5}G^2}{\lambda T_k^\alpha} + \frac{2^{\alpha+3}\kappa F(w_*)}{T_k}\sum_{i=1}^{k}\frac{1}{2^{(i-1)(\alpha-1)}}\right) + \frac{2^{\alpha+3}\kappa F(w_*)}{T_{k+1}} \\ &= \frac{2^{\alpha^2+2\alpha+5}G^2}{\lambda T_{k+1}^\alpha} + \frac{2^{\alpha+3}\kappa F(w_*)}{T_{k+1}}\left(\sum_{i=1}^{k+1}\frac{1}{2^{(i-1)(\alpha-1)}}\right). \end{aligned}$$

### 4.4 Proof of Theorem 3

We first establish the following lemma for bounding the excess risk of the intermediate iterates. For any $k \ge 1$, we have

$$\mathrm{E}[F(w_1^{k+1})] - F(w_*) \le \frac{F(w_1^1) - F(w_*)}{2^k} + \frac{F(w_*)}{\beta}\left(\sum_{i=1}^{k} \frac{1}{2^{i-1}}\right). \tag{17}$$

The number of epochs made is $k^\dagger = \lfloor T/T' \rfloor$ and the final solution is $\tilde{w} = w_1^{k^\dagger+1}$. From Lemma 4.4, we have

$$\mathrm{E}[F(w_1^{k^\dagger+1})] - F(w_*) \le \frac{F(w_1^1) - F(w_*)}{2^{k^\dagger}} + \frac{F(w_*)}{\beta}\left(\sum_{i=1}^{k^\dagger} \frac{1}{2^{i-1}}\right) \le \frac{F(w_1^1) - F(w_*)}{2^{k^\dagger}} + \frac{2F(w_*)}{\beta}.$$

### 4.5 Proof of Lemma 4.4

From (8), we know that

$$\eta L = \frac{1}{4\beta} \le \frac{1}{4}, \tag{18}$$

$$\lambda\eta T' = 4. \tag{19}$$

We prove this lemma by induction on $k$. When $k = 1$, from Lemma 4.1, we have

$$\begin{aligned} \mathrm{E}[F(w_1^{2})] - F(w_*) &\le \frac{1}{2\eta T'(1-2\eta L)}\|w_1^1 - w_*\|^2 + \frac{2\eta L}{1-2\eta L}F(w_*) \\ &\overset{(18)}{\le} \frac{1}{\eta T'}\|w_1^1 - w_*\|^2 + \frac{F(w_*)}{\beta} \\ &\overset{(19)}{=} \frac{\lambda}{4}\|w_1^1 - w_*\|^2 + \frac{F(w_*)}{\beta} \\ &\overset{(5)}{\le} \frac{F(w_1^1) - F(w_*)}{2} + \frac{F(w_*)}{\beta}. \end{aligned}$$

Assume that (17) is true for some $k \ge 1$; we now prove the inequality for $k+1$. According to Lemma 4.1, we have

$$\begin{aligned} \mathrm{E}[F(w_1^{k+2})] - F(w_*) &\le \frac{1}{2\eta T'(1-2\eta L)}\mathrm{E}\big[\|w_1^{k+1} - w_*\|^2\big] + \frac{2\eta L}{1-2\eta L}F(w_*) \\ &\overset{(18)}{\le} \frac{1}{\eta T'}\mathrm{E}\big[\|w_1^{k+1} - w_*\|^2\big] + \frac{F(w_*)}{\beta} \\ &\overset{(19)}{=} \frac{\lambda}{4}\mathrm{E}\big[\|w_1^{k+1} - w_*\|^2\big] + \frac{F(w_*)}{\beta} \\ &\overset{(5)}{\le} \frac{1}{2}\,\mathrm{E}\big[F(w_1^{k+1}) - F(w_*)\big] + \frac{F(w_*)}{\beta} \\ &\overset{(17)}{\le} \frac{1}{2}\left(\frac{F(w_1^1) - F(w_*)}{2^k} + \frac{F(w_*)}{\beta}\sum_{i=1}^{k}\frac{1}{2^{i-1}}\right) + \frac{F(w_*)}{\beta} \\ &= \frac{F(w_1^1) - F(w_*)}{2^{k+1}} + \frac{F(w_*)}{\beta}\left(\sum_{i=1}^{k+1}\frac{1}{2^{i-1}}\right). \end{aligned}$$