# Inexact SARAH Algorithm for Stochastic Optimization

We develop and analyze a variant of the variance-reduced stochastic gradient algorithm SARAH that does not require computation of the exact gradient. Thus this new method can be applied to general expectation minimization problems rather than only to finite-sum problems. While the original SARAH algorithm, as well as its predecessor SVRG, requires an exact gradient computation on each outer iteration, the inexact variant of SARAH (iSARAH), which we develop here, requires only a stochastic gradient computed on a mini-batch of sufficient size. The proposed method combines variance reduction via sample-size selection with iterative stochastic gradient updates. We analyze the convergence rate of the algorithm for the strongly convex, convex, and nonconvex cases, with an appropriate mini-batch size selected for each case. We show that, under an additional reasonable assumption, iSARAH achieves the best known complexity among stochastic methods for general convex stochastic objective functions.


## 1 Introduction

We consider the problem of stochastic optimization

$$\min_{w\in\mathbb{R}^d}\ \Big\{F(w)=\mathbb{E}[f(w;\xi)]\Big\}, \tag{1}$$

where $\xi$ is a random variable. One of the most popular applications of this problem is expected risk minimization in supervised learning. In this case, the random variable $\xi$ represents a random data sample, or a set of such samples. We can consider a set of realizations $\{\xi_{[i]}\}_{i=1}^{n}$ of $\xi$ corresponding to a set of random samples, and define $f_i(w):=f(w;\xi_{[i]})$. Then the sample average approximation of $F$, known as empirical risk in supervised learning, is written as

$$\min_{w\in\mathbb{R}^d}\ \Big\{F(w)=\frac{1}{n}\sum_{i=1}^{n}f_i(w)\Big\}. \tag{2}$$

Throughout the paper, we assume the existence of an unbiased gradient estimator, that is, $\mathbb{E}[\nabla f(w;\xi)]=\nabla F(w)$ for any fixed $w$. In addition, we assume that the function $F$ is bounded below.

A version of the SVRG algorithm, SCSG, has been recently proposed and analyzed in [6, 7]. While this method has been developed for (2), it can be directly applied to (1) because the exact gradient computation is replaced with a mini-batch stochastic gradient. The size of the inner loop of SCSG is then set to a geometrically distributed random variable with distribution dependent on the size of the mini-batch used in the outer iteration. In this paper, we propose and analyze an inexact version of SARAH (iSARAH), which can be applied to solve (1). Instead of the exact gradient computation, a mini-batch gradient is computed using a sufficiently large sample size. We develop a total sample complexity analysis for this method under various convexity assumptions on $F$. These complexity results are summarized in Tables 1-3 and are compared to the result for SCSG from [6, 7] when applied to (1). We also list the complexity bounds for SVRG, SARAH, and SCSG when applied to the finite-sum problem (2).

Like SVRG, SCSG, and SARAH, the iSARAH algorithm consists of an outer loop, which performs variance reduction by computing a sufficiently accurate gradient estimate, and an inner loop, which performs the stochastic gradient updates. The main difference between SARAH and SVRG is that the inner loop of SARAH is by itself a convergent stochastic gradient algorithm, while the inner loop of SVRG is not. In other words, if only one outer iteration of SARAH is performed and then followed by sufficiently many inner iterations, we refer to this algorithm as one-loop SARAH. In [10], one-loop SARAH is analyzed and shown to match the complexity of stochastic gradient descent. Here, along with multiple-loop iSARAH, we analyze one-loop iSARAH, which is obtained from one-loop SARAH by replacing the first full gradient computation with a stochastic gradient based on a sufficiently large mini-batch.

We summarize our complexity results and compare them to those of other methods in Tables 1-3. All our complexity results present the bound on the number of iterations it takes to achieve $\mathbb{E}[\|\nabla F(w)\|^2]\le\epsilon$. These complexity results are developed under the assumption that $f(\cdot;\xi)$ is $L$-smooth for every realization of the random variable $\xi$. Table 1 shows the complexity results in the case when $F$, but not necessarily every realization $f(\cdot;\xi)$, is $\mu$-strongly convex, with $\kappa=L/\mu$ denoting the condition number. Notice that the results for one-loop iSARAH (and SARAH) are the same in the strongly convex case as in the general convex case. The convergence rate of multiple-loop iSARAH, on the other hand, is better, in terms of dependence on $\kappa$, than the rate achieved by SCSG, which is the only other variance reduction method of the type we consider here that applies to (1). The general convex case is summarized in Table 2. In this case, multiple-loop iSARAH achieves the best convergence rate with respect to $\epsilon$ among the compared stochastic methods, but this rate is derived under an additional assumption (Assumption 4), which is discussed in detail in Section 3. We note here that Assumption 4 is weaker than the assumption, common in the analysis of stochastic algorithms, that the iterates remain in a bounded set. In the recent paper [14], a stochastic line search method is shown to have a favorable expected iteration complexity under the assumption that the iterates remain in a bounded set; however, no total sample complexity is derived in [14].

Finally, the results for nonconvex problems are presented in Table 3. In this case, SCSG achieves the best convergence rate under the bounded variance assumption, which requires that $\mathbb{E}[\|\nabla f(w;\xi)-\nabla F(w)\|^2]\le\sigma^2$ for some $\sigma>0$ and all $w$. While the convergence rate of multiple-loop iSARAH for nonconvex functions remains an open question (as it does for the original SARAH algorithm), we derive a convergence rate for one-loop iSARAH without the bounded variance assumption. This convergence rate matches that of the general stochastic gradient algorithm, since the one-loop iSARAH method is not a variance reduction method. The one-loop iSARAH method can be viewed as a variant of a momentum SGD method.

We summarize the key results we obtained in this paper as follows.

• We develop and analyze a stochastic variance reduction method for solving the general stochastic optimization problem (1), which is an inexact version of the SARAH algorithm [10].

• To the best of our knowledge, the proposed algorithm achieves better sample complexity for convex problems than any other algorithm for (1), under an additional, relatively mild, assumption.

• One-loop iSARAH presents a version of a momentum SGD method. We analyze convergence rates of this algorithm in the general convex and nonconvex cases without assuming boundedness of the variance of the stochastic gradients.

### 1.1 Paper Organization

The remainder of the paper is organized as follows. In Section 2, we describe the Inexact SARAH (iSARAH) algorithm in detail. We provide the convergence analysis of iSARAH in Section 3, which includes the sample complexity bounds for the one-loop case when applied to strongly convex, convex, and nonconvex functions, and for the multiple-loop case when applied to strongly convex and convex functions. We conclude the paper and discuss future work in Section 4.

## 2 The Algorithm

Like SVRG and SARAH, iSARAH consists of an outer loop and an inner loop. The inner loop performs recursive stochastic gradient updates, while the outer loops of SVRG and SARAH compute the exact gradient. Specifically, given an iterate $w_0$ at the beginning of each outer loop, SVRG and SARAH compute $v_0=\nabla F(w_0)$. The only difference between SARAH and iSARAH is that the latter replaces the exact gradient computation with a gradient estimate based on a sample set of size $b$.

In other words, given an iterate $w_0$ and a sample set size $b$, $v_0$ is a random vector computed as

$$v_0=\frac{1}{b}\sum_{i=1}^{b}\nabla f(w_0;\zeta_i), \tag{3}$$

where $\zeta_1,\dots,\zeta_b$ are i.i.d.¹ realizations of the random variable $\xi$. We have $\mathbb{E}[v_0\mid w_0]=\nabla F(w_0)$. The larger $b$ is, the more accurately $v_0$ approximates $\nabla F(w_0)$. The key idea of the analysis of iSARAH is to establish bounds on $b$ which ensure sufficient accuracy for recovering the original SARAH convergence rate.

¹ Independent and identically distributed. We note from probability theory that if $\zeta_1,\dots,\zeta_b$ are i.i.d. random variables, then $g(\zeta_1),\dots,g(\zeta_b)$ are also i.i.d. random variables for any measurable function $g$.
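As a concrete illustration, the estimate (3) is simply an average of $b$ i.i.d. stochastic gradients. The sketch below is our own toy construction, not from the paper: the quadratic components and the `minibatch_gradient` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical components: f(w; xi) = 0.5 * ||w - xi||^2, so grad f(w; xi) = w - xi,
# and F averages these components over the finite data set below.
data = rng.normal(size=(1000, 5))     # realizations of the random variable xi
grad_f = lambda w, xi: w - xi         # per-sample stochastic gradient

def minibatch_gradient(w, b):
    """Form v0 as in (3): average b i.i.d. stochastic gradients at w."""
    idx = rng.integers(len(data), size=b)
    return np.mean([grad_f(w, data[i]) for i in idx], axis=0)

w0 = np.zeros(5)
exact = np.mean([grad_f(w0, xi) for xi in data], axis=0)  # exact gradient of F
v0 = minibatch_gradient(w0, 512)
# E[v0 | w0] equals the exact gradient; larger b shrinks the deviation.
deviation = np.linalg.norm(v0 - exact)
```

Here the exact gradient is computable only because the toy distribution is a finite data set; in the general problem (1) it is not, which is precisely why iSARAH replaces it with $v_0$.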

The key step of the algorithm is a recursive update of the stochastic gradient estimate (SARAH update)

$$v_t=\nabla f(w_t;\xi_t)-\nabla f(w_{t-1};\xi_t)+v_{t-1}, \tag{4}$$

followed by the iterate update

$$w_{t+1}=w_t-\eta v_t. \tag{5}$$

Let $\mathcal{F}_t$ be the $\sigma$-algebra generated by $w_0,\xi_1,\xi_2,\dots,\xi_{t-1}$. We note that $\xi_t$ is independent of $\mathcal{F}_t$ and that $v_t$ is a biased estimator of the gradient $\nabla F(w_t)$:

$$\mathbb{E}[v_t\mid\mathcal{F}_t]=\nabla F(w_t)-\nabla F(w_{t-1})+v_{t-1}.$$

In contrast, SVRG update is given by

$$v_t=\nabla f(w_t;\xi_t)-\nabla f(w_0;\xi_t)+v_0, \tag{6}$$

which implies that $v_t$ is an unbiased estimator of $\nabla F(w_t)$.

The outer loop of iSARAH algorithm is summarized in Algorithm 1 and the inner loop is summarized in Algorithm 2.

As a variant of SARAH, iSARAH inherits the special property that one-loop iSARAH, the variant of Algorithm 1 with a single outer iteration and sufficiently many inner iterations, is a convergent algorithm. In the next section we provide the analysis for both the one-loop and multiple-loop versions of iSARAH.
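To make the two loops concrete, here is a minimal sketch of the scheme of Algorithms 1 and 2. It is our own illustration, under simplifying assumptions: all names are hypothetical, the objective in the usage example is a toy quadratic, and we return the last inner iterate rather than a randomly sampled one.

```python
import numpy as np

def isarah(grad_f, sample, w0, eta, b, m, T, rng):
    """Sketch of iSARAH: T outer loops, each starting from the mini-batch
    estimate (3) and running m recursive updates (4)-(5).
    grad_f(w, xi): stochastic gradient; sample(rng, k): k i.i.d. draws of xi."""
    w_tilde = np.asarray(w0, dtype=float)
    for _ in range(T):
        w_prev = w_tilde
        # Outer step: inexact gradient estimate from a mini-batch of size b, as in (3).
        v = np.mean([grad_f(w_prev, xi) for xi in sample(rng, b)], axis=0)
        w = w_prev - eta * v
        # Inner loop: SARAH recursion (4) followed by the iterate update (5).
        for _ in range(m):
            xi = sample(rng, 1)[0]
            v = grad_f(w, xi) - grad_f(w_prev, xi) + v
            w_prev, w = w, w - eta * v
        # The analysis picks the next starting point among the inner iterates;
        # for simplicity we take the last one here.
        w_tilde = w
    return w_tilde

# Toy run: f(w; xi) = 0.5 * ||w - xi||^2, so F is minimized at E[xi] = 0.
rng = np.random.default_rng(1)
sample = lambda rng, k: rng.normal(size=(k, 3))
w_out = isarah(lambda w, xi: w - xi, sample, 5.0 * np.ones(3),
               eta=0.1, b=256, m=50, T=3, rng=rng)
```

On this toy problem the iterates are driven toward the minimizer of $F$ even though no exact gradient is ever computed.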

Convergence criteria. Our iteration complexity analysis aims to bound the total number of stochastic gradient evaluations needed to achieve a desired bound on the gradient norm. For that we will need to bound the number of outer iterations $T$ which is needed to guarantee that $\|\nabla F(w_T)\|^2\le\epsilon$, and also to bound the inner loop size $m$ and the mini-batch size $b$. Since the algorithm is stochastic and $w_T$ is random, the $\epsilon$-accurate solution is only achieved in expectation, i.e.,

$$\mathbb{E}[\|\nabla F(w_T)\|^2]\le\epsilon. \tag{7}$$

## 3 Convergence Analysis of iSARAH

### 3.1 Basic Assumptions

The analysis of the proposed algorithm will be performed under an appropriate subset of the following key assumptions.

###### Assumption 1 (L-smooth).

$f(\cdot;\xi)$ is $L$-smooth for every realization of $\xi$, i.e., there exists a constant $L>0$ such that

$$\|\nabla f(w;\xi)-\nabla f(w';\xi)\|\le L\|w-w'\|,\quad\forall w,w'\in\mathbb{R}^d. \tag{8}$$

Note that this assumption implies that $F$ is also $L$-smooth. The following strong convexity assumption will be made only for the appropriate parts of the analysis; otherwise, it is dropped.

###### Assumption 2 (μ-strongly convex).

The function $F$ is $\mu$-strongly convex, i.e., there exists a constant $\mu>0$ such that, for all $w,w'\in\mathbb{R}^d$,

$$F(w)\ge F(w')+\nabla F(w')^\top(w-w')+\frac{\mu}{2}\|w-w'\|^2.$$

Under Assumption 2, let us define the (unique) optimal solution of (2) as $w_*$. Then strong convexity of $F$ implies that

$$2\mu[F(w)-F(w_*)]\le\|\nabla F(w)\|^2,\quad\forall w\in\mathbb{R}^d. \tag{9}$$
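For completeness, (9) follows by applying Assumption 2 at the pair $(w_*,w)$ and minimizing the resulting quadratic lower bound over the displacement $u=w_*-w$:

```latex
F(w_*) \;\ge\; F(w) + \nabla F(w)^\top (w_* - w) + \frac{\mu}{2}\|w_* - w\|^2
       \;\ge\; F(w) + \min_{u\in\mathbb{R}^d}\Big\{\nabla F(w)^\top u + \frac{\mu}{2}\|u\|^2\Big\}
       \;=\; F(w) - \frac{1}{2\mu}\|\nabla F(w)\|^2,
```

and rearranging gives $2\mu[F(w)-F(w_*)]\le\|\nabla F(w)\|^2$.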

Under the strong convexity assumption, we will use $\kappa$ to denote the condition number $\kappa=L/\mu$.

Finally, as a special case of strong convexity with $\mu=0$, we state the general convexity assumption, which we will use for some of the convergence results.

###### Assumption 3.

$f(\cdot;\xi)$ is convex for every realization of $\xi$, i.e., for all $w,w'\in\mathbb{R}^d$,

$$f(w;\xi)\ge f(w';\xi)+\nabla f(w';\xi)^\top(w-w').$$

We note that Assumption 2 does not imply Assumption 3, because the latter applies to all realizations, while the former applies only to the expectation.

Hence in our analysis, depending on the result we aim at, we will require Assumption 3 to hold by itself, or Assumption 2 and Assumption 3 to hold together. We will always use Assumption 1.

### 3.2 Existing Results

We provide some well-known results from the existing literature that support our theoretical analysis. We start by introducing two standard lemmas from smooth convex optimization ([8]) for a general function $f$.

###### Lemma 1 (Theorem 2.1.5 in [8]).

Suppose that $f$ is convex and $L$-smooth. Then, for any $w,w'\in\mathbb{R}^d$,

$$f(w)\le f(w')+\nabla f(w')^\top(w-w')+\frac{L}{2}\|w-w'\|^2, \tag{10}$$

$$f(w)\ge f(w')+\nabla f(w')^\top(w-w')+\frac{1}{2L}\|\nabla f(w)-\nabla f(w')\|^2, \tag{11}$$

$$(\nabla f(w)-\nabla f(w'))^\top(w-w')\ge\frac{1}{L}\|\nabla f(w)-\nabla f(w')\|^2. \tag{12}$$
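These three inequalities can be sanity-checked numerically on a convex quadratic, where $L$ is the top eigenvalue of the Hessian. This is a toy check of ours, not from [8]; all names are our own.

```python
import numpy as np

rng = np.random.default_rng(1)

# Convex L-smooth test function f(w) = 0.5 * w^T A w with A symmetric PSD;
# its gradient is A w and its smoothness constant L is the top eigenvalue of A.
M = rng.normal(size=(4, 4))
A = M @ M.T
L = np.linalg.eigvalsh(A).max()

f = lambda w: 0.5 * w @ A @ w
g = lambda w: A @ w

w, wp = rng.normal(size=4), rng.normal(size=4)
d, gd = w - wp, g(w) - g(wp)

upper = f(wp) + g(wp) @ d + 0.5 * L * (d @ d)        # right-hand side of (10)
lower = f(wp) + g(wp) @ d + (0.5 / L) * (gd @ gd)    # right-hand side of (11)
slack = gd @ d - (1.0 / L) * (gd @ gd)               # slack in (12), should be >= 0
```

For this quadratic all three bounds hold with $L=\lambda_{\max}(A)$, since $A(LI-A)$ is positive semidefinite.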

Note that (10) does not require the convexity of $f$.

###### Lemma 2 (Theorem 2.1.11 in [8]).

Suppose that $f$ is $\mu$-strongly convex and $L$-smooth. Then, for any $w,w'\in\mathbb{R}^d$,

$$(\nabla f(w)-\nabla f(w'))^\top(w-w')\ge\frac{\mu L}{\mu+L}\|w-w'\|^2+\frac{1}{\mu+L}\|\nabla f(w)-\nabla f(w')\|^2. \tag{13}$$

The following existing results are more specific properties of the component functions $f(\cdot;\xi)$.

###### Lemma 3 ([2]).

Suppose that Assumptions 1 and 3 hold. Then, for any $w\in\mathbb{R}^d$,

$$\mathbb{E}[\|\nabla f(w;\xi)-\nabla f(w_*;\xi)\|^2]\le 2L[F(w)-F(w_*)], \tag{14}$$

where $w_*$ is any optimal solution of $F$.

###### Lemma 4 (Lemma 1 in [12]).

Suppose that Assumptions 1 and 3 hold. Then, for any $w\in\mathbb{R}^d$,

$$\mathbb{E}[\|\nabla f(w;\xi)\|^2]\le 4L[F(w)-F(w_*)]+2\,\mathbb{E}[\|\nabla f(w_*;\xi)\|^2], \tag{15}$$

where $w_*$ is any optimal solution of $F$.

###### Lemma 5 (Lemma 1 in [11]).

Let $\xi_1,\dots,\xi_b$ be i.i.d. random variables with $\mathbb{E}[\nabla f(w;\xi_i)]=\nabla F(w)$ for all $i=1,\dots,b$. Then,

$$\mathbb{E}\left[\left\|\frac{1}{b}\sum_{i=1}^{b}\nabla f(w;\xi_i)-\nabla F(w)\right\|^2\right]=\frac{\mathbb{E}[\|\nabla f(w;\xi)\|^2]-\|\nabla F(w)\|^2}{b}. \tag{16}$$
###### Proof.

The proof of this lemma is given in [11]; we reproduce it here for completeness. We use mathematical induction. For $b=1$, it is easy to see that

$$\mathbb{E}[\|\nabla f(w;\xi_1)-\nabla F(w)\|^2]=\mathbb{E}[\|\nabla f(w;\xi_1)\|^2]-2\|\nabla F(w)\|^2+\|\nabla F(w)\|^2=\mathbb{E}[\|\nabla f(w;\xi_1)\|^2]-\|\nabla F(w)\|^2.$$

Let us assume that the result holds for $b=m-1$; we show that it also holds for $b=m$. We have

$$\begin{aligned}
&\mathbb{E}\left[\left\|\frac{1}{m}\sum_{i=1}^{m}\nabla f(w;\xi_i)-\nabla F(w)\right\|^2\right]\\
&=\mathbb{E}\left[\left\|\frac{\sum_{i=1}^{m-1}\nabla f(w;\xi_i)-(m-1)\nabla F(w)+\big(\nabla f(w;\xi_m)-\nabla F(w)\big)}{m}\right\|^2\right]\\
&=\frac{1}{m^2}\mathbb{E}\left[\left\|\sum_{i=1}^{m-1}\nabla f(w;\xi_i)-(m-1)\nabla F(w)\right\|^2\right]+\frac{1}{m^2}\mathbb{E}\big[\|\nabla f(w;\xi_m)-\nabla F(w)\|^2\big]\\
&\qquad+\frac{2}{m^2}\mathbb{E}\left[\Big(\sum_{i=1}^{m-1}\nabla f(w;\xi_i)-(m-1)\nabla F(w)\Big)^\top\big(\nabla f(w;\xi_m)-\nabla F(w)\big)\right]\\
&=\frac{1}{m^2}\Big((m-1)\mathbb{E}[\|\nabla f(w;\xi_1)\|^2]-(m-1)\|\nabla F(w)\|^2+\mathbb{E}[\|\nabla f(w;\xi_m)\|^2]-\|\nabla F(w)\|^2\Big)\\
&=\frac{1}{m}\Big(\mathbb{E}[\|\nabla f(w;\xi_1)\|^2]-\|\nabla F(w)\|^2\Big).
\end{aligned}$$

The third and the last equalities follow since $\xi_1,\dots,\xi_m$ are i.i.d. with $\mathbb{E}[\nabla f(w;\xi_i)]=\nabla F(w)$. Therefore, the desired result is achieved. ∎
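Since (16) is an exact identity, it can also be verified by brute-force enumeration when $\xi$ has a small finite support. The check below is a toy construction of ours (hypothetical quadratic components, $b=2$), computing both sides exactly:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# xi uniform over n points; hypothetical components with grad f(w; xi) = w - xi.
n, b = 6, 2
points = rng.normal(size=(n, 3))
w = rng.normal(size=3)
grads = w - points              # all n realizations of grad f(w; xi)
gF = grads.mean(axis=0)         # grad F(w) = E[grad f(w; xi)]

# Left side of (16): exact expectation over all n**b equally likely i.i.d. draws.
lhs = np.mean([np.sum((np.mean(grads[list(tup)], axis=0) - gF) ** 2)
               for tup in itertools.product(range(n), repeat=b)])

# Right side of (16): (E ||grad f(w; xi)||^2 - ||grad F(w)||^2) / b.
rhs = (np.mean(np.sum(grads ** 2, axis=1)) - np.sum(gF ** 2)) / b
```

Both sides agree to machine precision, as the identity predicts.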

Lemmas 4 and 5 clearly imply the following result.

###### Corollary 1.

Suppose that Assumptions 1 and 3 hold. Let $\xi_1,\dots,\xi_b$ be i.i.d. random variables with $\mathbb{E}[\nabla f(w;\xi_i)]=\nabla F(w)$ for all $i=1,\dots,b$. Then,

$$\mathbb{E}\left[\left\|\frac{1}{b}\sum_{i=1}^{b}\nabla f(w;\xi_i)-\nabla F(w)\right\|^2\right]\le\frac{4L[F(w)-F(w_*)]+2\,\mathbb{E}[\|\nabla f(w_*;\xi)\|^2]-\|\nabla F(w)\|^2}{b}, \tag{17}$$

where $w_*$ is any optimal solution of $F$.

Based on the above lemmas, we will show in detail how to achieve our main results in the following subsections.

### 3.3 Special Property of SARAH Update

The most important property of the SVRG algorithm is the variance reduction of its steps. This property holds as the number of outer iterations grows, but it does not hold if only the number of inner iterations increases. In other words, if we simply run the inner loop for many iterations (without executing additional outer loops), the variance of the steps does not reduce in the case of SVRG, while it goes to zero in the case of SARAH with a large learning rate in the strongly convex case. We recall the SARAH update as follows.

$$v_t=\nabla f(w_t;\xi_t)-\nabla f(w_{t-1};\xi_t)+v_{t-1}, \tag{18}$$

followed by the iterate update:

$$w_{t+1}=w_t-\eta v_t. \tag{19}$$

We will now show that $\|v_t\|$ goes to zero in expectation in the strongly convex case. This result substantiates our conclusion that SARAH uses more stable stochastic gradient estimates than SVRG.

###### Proposition 1.

Suppose that Assumptions 1, 2 and 3 hold. Consider $v_t$ defined by (18) with $\eta<2/L$ and any given $w_0$. Then, for any $t\ge 1$,

$$\mathbb{E}[\|v_t\|^2]\le\Big(1-\frac{2\eta\mu L}{\mu+L}\Big)^t\,\mathbb{E}[\|v_0\|^2].$$

The proof of this proposition can be derived directly from Theorem 1b in [13]. This result implies that by choosing $\eta=2/(\mu+L)$, we obtain linear convergence of $\mathbb{E}[\|v_t\|^2]$ with the rate $\big(\frac{\kappa-1}{\kappa+1}\big)^2$.
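The decay of the step norms can be observed on a toy example of our own construction: for the hypothetical components $f(w;\xi)=\frac{1}{2}\|w-\xi\|^2$, the recursion (18) collapses to $v_t=(1-\eta)v_{t-1}$, so $\|v_t\|$ shrinks geometrically inside a single inner loop, whereas the SVRG direction $v_t=(w_t-w_0)+v_0$ has no such decay.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy components f(w; xi) = 0.5 * ||w - xi||^2, so grad f(w; xi) = w - xi.
data = rng.normal(size=(100, 4))
eta = 0.2
w_prev = np.ones(4)
v = w_prev - data.mean(axis=0)        # v_0: full gradient, as in SARAH
w = w_prev - eta * v

norms = [np.linalg.norm(v)]
for _ in range(30):
    xi = data[rng.integers(len(data))]
    v = (w - xi) - (w_prev - xi) + v          # SARAH update (18)
    w_prev, w = w, w - eta * v                # iterate update (19)
    norms.append(np.linalg.norm(v))
# For this model (w - xi) - (w_prev - xi) = -eta * v_{t-1}, so every update
# multiplies v by (1 - eta) and the norms decay as (1 - eta)**t.
```

The quadratic model makes the decay deterministic; for general strongly convex components the same behavior holds in expectation, as in Proposition 1.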

We will provide our convergence analysis in detail in the next subsections. We divide our results into two parts: the one-loop results corresponding to iSARAH-IN (Algorithm 2), and the multiple-loop results corresponding to iSARAH (Algorithm 1).

### 3.4 One-loop (iSARAH-IN) Results

We begin by providing two useful lemmas that do not require a convexity assumption. Lemma 6 bounds the sum of the expected values of $\|\nabla F(w_t)\|^2$, and Lemma 7 expands the value of $\mathbb{E}[\|\nabla F(w_t)-v_t\|^2]$.

###### Lemma 6.

Suppose that Assumption 1 holds. Consider iSARAH-IN (Algorithm 2). Then we have

$$\sum_{t=0}^{m}\mathbb{E}[\|\nabla F(w_t)\|^2]\le\frac{2}{\eta}\mathbb{E}[F(w_0)-F(w_*)]+\sum_{t=0}^{m}\mathbb{E}[\|\nabla F(w_t)-v_t\|^2]-(1-L\eta)\sum_{t=0}^{m}\mathbb{E}[\|v_t\|^2], \tag{20}$$

where $w_*$ is a minimizer of $F$.

###### Proof.

By Assumption 1 and the update $w_{t+1}=w_t-\eta v_t$, we have

$$\begin{aligned}\mathbb{E}[F(w_{t+1})]&\le\mathbb{E}[F(w_t)]-\eta\,\mathbb{E}[\nabla F(w_t)^\top v_t]+\frac{L\eta^2}{2}\mathbb{E}[\|v_t\|^2]\\&=\mathbb{E}[F(w_t)]-\frac{\eta}{2}\mathbb{E}[\|\nabla F(w_t)\|^2]+\frac{\eta}{2}\mathbb{E}[\|\nabla F(w_t)-v_t\|^2]-\Big(\frac{\eta}{2}-\frac{L\eta^2}{2}\Big)\mathbb{E}[\|v_t\|^2],\end{aligned}$$

where the last equality follows from the fact that $a^\top b=\frac{1}{2}\big(\|a\|^2+\|b\|^2-\|a-b\|^2\big)$, applied with $a=\nabla F(w_t)$ and $b=v_t$. By summing over $t=0,\dots,m$, we have

$$\mathbb{E}[F(w_{m+1})]\le\mathbb{E}[F(w_0)]-\frac{\eta}{2}\sum_{t=0}^{m}\mathbb{E}[\|\nabla F(w_t)\|^2]+\frac{\eta}{2}\sum_{t=0}^{m}\mathbb{E}[\|\nabla F(w_t)-v_t\|^2]-\Big(\frac{\eta}{2}-\frac{L\eta^2}{2}\Big)\sum_{t=0}^{m}\mathbb{E}[\|v_t\|^2],$$

which is equivalent to (20):

$$\begin{aligned}\sum_{t=0}^{m}\mathbb{E}[\|\nabla F(w_t)\|^2]&\le\frac{2}{\eta}\mathbb{E}[F(w_0)-F(w_{m+1})]+\sum_{t=0}^{m}\mathbb{E}[\|\nabla F(w_t)-v_t\|^2]-(1-L\eta)\sum_{t=0}^{m}\mathbb{E}[\|v_t\|^2]\\&\le\frac{2}{\eta}\mathbb{E}[F(w_0)-F(w_*)]+\sum_{t=0}^{m}\mathbb{E}[\|\nabla F(w_t)-v_t\|^2]-(1-L\eta)\sum_{t=0}^{m}\mathbb{E}[\|v_t\|^2],\end{aligned}$$

where the second inequality follows since $F(w_{m+1})\ge F(w_*)$. ∎

###### Lemma 7.

Suppose that Assumption 1 holds. Consider $v_t$ defined by (4) in iSARAH-IN (Algorithm 2). Then, for any $t\ge 1$,

$$\mathbb{E}[\|\nabla F(w_t)-v_t\|^2]=\mathbb{E}[\|\nabla F(w_0)-v_0\|^2]+\sum_{j=1}^{t}\mathbb{E}[\|v_j-v_{j-1}\|^2]-\sum_{j=1}^{t}\mathbb{E}[\|\nabla F(w_j)-\nabla F(w_{j-1})\|^2].$$
###### Proof.

Let $\mathcal{F}_j$ be the $\sigma$-algebra generated by $w_0,\xi_1,\xi_2,\dots,\xi_{j-1}$; $\mathcal{F}_j$ contains all the information of $w_0,\dots,w_j$ as well as $v_0,\dots,v_{j-1}$. We note that $\xi_j$ is independent of $\mathcal{F}_j$. For $j\ge 1$, we have

$$\begin{aligned}
\mathbb{E}[\|\nabla F(w_j)-v_j\|^2\mid\mathcal{F}_j]&=\mathbb{E}\big[\|[\nabla F(w_{j-1})-v_{j-1}]+[\nabla F(w_j)-\nabla F(w_{j-1})]-[v_j-v_{j-1}]\|^2\mid\mathcal{F}_j\big]\\
&=\|\nabla F(w_{j-1})-v_{j-1}\|^2+\|\nabla F(w_j)-\nabla F(w_{j-1})\|^2+\mathbb{E}[\|v_j-v_{j-1}\|^2\mid\mathcal{F}_j]\\
&\quad+2(\nabla F(w_{j-1})-v_{j-1})^\top(\nabla F(w_j)-\nabla F(w_{j-1}))\\
&\quad-2(\nabla F(w_{j-1})-v_{j-1})^\top\mathbb{E}[v_j-v_{j-1}\mid\mathcal{F}_j]\\
&\quad-2(\nabla F(w_j)-\nabla F(w_{j-1}))^\top\mathbb{E}[v_j-v_{j-1}\mid\mathcal{F}_j]\\
&=\|\nabla F(w_{j-1})-v_{j-1}\|^2-\|\nabla F(w_j)-\nabla F(w_{j-1})\|^2+\mathbb{E}[\|v_j-v_{j-1}\|^2\mid\mathcal{F}_j],
\end{aligned}$$

where the last equality follows from

$$\mathbb{E}[v_j-v_{j-1}\mid\mathcal{F}_j]=\mathbb{E}[\nabla f(w_j;\xi_j)-\nabla f(w_{j-1};\xi_j)\mid\mathcal{F}_j]=\nabla F(w_j)-\nabla F(w_{j-1}).$$

By taking the total expectation of the above equation, we have

$$\mathbb{E}[\|\nabla F(w_j)-v_j\|^2]=\mathbb{E}[\|\nabla F(w_{j-1})-v_{j-1}\|^2]-\mathbb{E}[\|\nabla F(w_j)-\nabla F(w_{j-1})\|^2]+\mathbb{E}[\|v_j-v_{j-1}\|^2].$$

By summing over $j=1,\dots,t$, we have

$$\mathbb{E}[\|\nabla F(w_t)-v_t\|^2]=\mathbb{E}[\|\nabla F(w_0)-v_0\|^2]+\sum_{j=1}^{t}\mathbb{E}[\|v_j-v_{j-1}\|^2]-\sum_{j=1}^{t}\mathbb{E}[\|\nabla F(w_j)-\nabla F(w_{j-1})\|^2].$$

∎

#### 3.4.1 General Convex Case

In this subsection, we analyze the one-loop results of Inexact SARAH (Algorithm 2) in the general convex case. We first derive a bound on $\mathbb{E}[\|\nabla F(w_t)-v_t\|^2]$.

###### Lemma 8.

Suppose that Assumptions 1 and 3 hold. Consider $v_t$ defined as (4) in iSARAH (Algorithm 1) with $\eta<2/L$. Then we have that, for any $t\ge 1$,

$$\mathbb{E}[\|\nabla F(w_t)-v_t\|^2]\le\frac{\eta L}{2-\eta L}\Big[\mathbb{E}[\|v_0\|^2]-\mathbb{E}[\|v_t\|^2]\Big]+\mathbb{E}[\|\nabla F(w_0)-v_0\|^2]. \tag{21}$$
###### Proof.

For $j\ge 1$, we have

$$\begin{aligned}
\mathbb{E}[\|v_j\|^2\mid\mathcal{F}_j]&=\mathbb{E}\big[\|v_{j-1}-(\nabla f(w_{j-1};\xi_j)-\nabla f(w_j;\xi_j))\|^2\mid\mathcal{F}_j\big]\\
&=\|v_{j-1}\|^2+\mathbb{E}\Big[\|\nabla f(w_{j-1};\xi_j)-\nabla f(w_j;\xi_j)\|^2-\frac{2}{\eta}(\nabla f(w_{j-1};\xi_j)-\nabla f(w_j;\xi_j))^\top(w_{j-1}-w_j)\mid\mathcal{F}_j\Big]\\
&\le\|v_{j-1}\|^2+\Big(1-\frac{2}{\eta L}\Big)\mathbb{E}[\|\nabla f(w_{j-1};\xi_j)-\nabla f(w_j;\xi_j)\|^2\mid\mathcal{F}_j]\\
&=\|v_{j-1}\|^2+\Big(1-\frac{2}{\eta L}\Big)\mathbb{E}[\|v_j-v_{j-1}\|^2\mid\mathcal{F}_j],
\end{aligned}$$

where the inequality follows from (12) and the fact that $w_{j-1}-w_j=\eta v_{j-1}$. Taking the total expectation, this implies that

$$\mathbb{E}[\|v_j-v_{j-1}\|^2]\le\frac{\eta L}{2-\eta L}\Big[\mathbb{E}[\|v_{j-1}\|^2]-\mathbb{E}[\|v_j\|^2]\Big],$$

when $\eta<2/L$.

By summing the above inequality over $j=1,\dots,t$, we have

$$\sum_{j=1}^{t}\mathbb{E}[\|v_j-v_{j-1}\|^2]\le\frac{\eta L}{2-\eta L}\Big[\mathbb{E}[\|v_0\|^2]-\mathbb{E}[\|v_t\|^2]\Big]. \tag{22}$$

By Lemma 7, we have

$$\begin{aligned}\mathbb{E}[\|\nabla F(w_t)-v_t\|^2]&\le\sum_{j=1}^{t}\mathbb{E}[\|v_j-v_{j-1}\|^2]+\mathbb{E}[\|\nabla F(w_0)-v_0\|^2]\\&\le\frac{\eta L}{2-\eta L}\Big[\mathbb{E}[\|v_0\|^2]-\mathbb{E}[\|v_t\|^2]\Big]+\mathbb{E}[\|\nabla F(w_0)-v_0\|^2].\end{aligned}$$

∎

###### Lemma 9.

Suppose that Assumptions 1 and 3 hold. Consider $v_0$ defined as (3) in iSARAH (Algorithm 1). Then we have

$$\frac{\eta L}{2-\eta L}\mathbb{E}[\|v_0\|^2]+\mathbb{E}[\|\nabla F(w_0)-v_0\|^2]\le\frac{2}{2-\eta L}\cdot\frac{4L\,\mathbb{E}[F(w_0)-F(w_*)]+2\,\mathbb{E}[\|\nabla f(w_*;\xi)\|^2]-\mathbb{E}[\|\nabla F(w_0)\|^2]}{b}+\frac{\eta L}{2-\eta L}\mathbb{E}[\|\nabla F(w_0)\|^2]. \tag{23}$$
###### Proof.

By Corollary 1, we have

$$\begin{aligned}
&\frac{\eta L}{2-\eta L}\mathbb{E}[\|v_0\|^2\mid w_0]-\frac{\eta L}{2-\eta L}\|\nabla F(w_0)\|^2+\mathbb{E}[\|\nabla F(w_0)-v_0\|^2\mid w_0]\\
&=\frac{2}{2-\eta L}\Big[\mathbb{E}[\|v_0\|^2\mid w_0]-\|\nabla F(w_0)\|^2\Big]\\
&=\frac{2}{2-\eta L}\mathbb{E}[\|v_0-\nabla F(w_0)\|^2\mid w_0]\\
&\le\frac{2}{2-\eta L}\cdot\frac{4L[F(w_0)-F(w_*)]+2\,\mathbb{E}[\|\nabla f(w_*;\xi)\|^2]-\|\nabla F(w_0)\|^2}{b}.
\end{aligned}$$

Taking the total expectation and adding $\frac{\eta L}{2-\eta L}\mathbb{E}[\|\nabla F(w_0)\|^2]$ to both sides, the desired result is achieved. ∎

We then derive this basic result for the convex case by using Lemmas 8 and 9.

###### Lemma 10.

Suppose that Assumptions 1 and 3 hold. Consider iSARAH-IN (Algorithm 2) with $\eta\le 1/L$, where $\tilde{w}$ is chosen uniformly at random from $\{w_t\}_{t=0}^{m}$. Then we have

$$\begin{aligned}\mathbb{E}[\|\nabla F(\tilde{w})\|^2]&\le\frac{2}{\eta(m+1)}\mathbb{E}[F(w_0)-F(w_*)]+\frac{\eta L}{2-\eta L}\mathbb{E}[\|\nabla F(w_0)\|^2]\\&\quad+\frac{2}{2-\eta L}\cdot\frac{4L\,\mathbb{E}[F(w_0)-F(w_*)]+2\,\mathbb{E}[\|\nabla f(w_*;\xi)\|^2]-\mathbb{E}[\|\nabla F(w_0)\|^2]}{b}.\end{aligned}$$