1 Introduction
We consider the problem of stochastic optimization

(1) $\min_{w \in \mathbb{R}^d} \{ F(w) = \mathbb{E}[f(w; \xi)] \}$,

where $\xi$ is a random variable. One of the most popular applications of this problem is expected risk minimization in supervised learning. In this case, the random variable $\xi$ represents a random data sample, or a set of such samples. We can consider a set of realizations $\{\xi_i\}_{i=1}^n$ of $\xi$ corresponding to a set of $n$ random samples, and define $f_i(w) := f(w; \xi_i)$. Then the sample average approximation of $F(w)$, known as the empirical risk in supervised learning, is written as

(2) $F(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$.
Throughout the paper, we assume the existence of an unbiased gradient estimator, that is, $\mathbb{E}[\nabla f(w; \xi)] = \nabla F(w)$ for any fixed $w$. In addition, we assume that there exists a lower bound of the function $F$.

In recent years, a class of variance reduction methods [17, 5, 1, 2, 4, 10] has been proposed for problem (2); these methods have smaller computational complexity than both the full gradient descent method and the stochastic gradient method. All of these methods rely on the finite-sum form of (2) and are, thus, not readily extendable to (1). In particular, SVRG [2] and SARAH [10] are two similar methods that consist of an outer loop, which includes one exact gradient computation at each outer iteration, and an inner loop with multiple iterative stochastic gradient updates. The only difference between SVRG and SARAH is how the iterative updates are performed in the inner loop. The advantage of SARAH is that its inner loop by itself results in a convergent stochastic gradient algorithm. Hence, it is possible to apply only one loop of SARAH with a sufficiently large number of steps to obtain an approximately optimal solution (in expectation). The convergence behavior of one-loop SARAH is similar to that of the standard stochastic gradient method [10]. The multiple-loop SARAH algorithm matches the convergence rates of SVRG; however, due to its convergent inner loop, it has the additional practical advantage of being able to use an adaptive inner loop size (see [10] for details).
A version of the SVRG algorithm, SCSG, has been recently proposed and analyzed in [6, 7]. While this method was developed for (2), it can be directly applied to (1) because the exact gradient computation is replaced with a mini-batch stochastic gradient. The size of the inner loop of SCSG is then set to a geometrically distributed random variable with distribution dependent on the size of the mini-batch used in the outer iteration. In this paper, we propose and analyze an inexact version of SARAH (iSARAH) which can be applied to solve (1). Instead of an exact gradient computation, a mini-batch gradient is computed using a sufficiently large sample size. We develop a total sample complexity analysis for this method under various convexity assumptions. These complexity results are summarized in Tables 1-3 and are compared to the results for SCSG from [6, 7] when applied to (1). We also list the complexity bounds for SVRG, SARAH, and SCSG when applied to the finite-sum problem (2).

Like SVRG, SCSG, and SARAH, the iSARAH algorithm consists of an outer loop, which performs variance reduction by computing a sufficiently accurate gradient estimate, and an inner loop, which performs the stochastic gradient updates. The main difference between SARAH and SVRG is that the inner loop of SARAH by itself is a convergent stochastic gradient algorithm, while the inner loop of SVRG is not. If only one outer iteration of SARAH is performed and then followed by sufficiently many inner iterations, we refer to the resulting algorithm as one-loop SARAH. In [10], one-loop SARAH is analyzed and shown to match the complexity of stochastic gradient descent. Here, along with multiple-loop iSARAH, we analyze one-loop iSARAH, which is obtained from one-loop SARAH by replacing the first full gradient computation with a stochastic gradient based on a sufficiently large mini-batch.
We summarize our complexity results and compare them to those of other methods in Tables 1-3. All of our complexity results bound the number of iterations it takes to achieve $\mathbb{E}[\|\nabla F(w)\|^2] \leq \varepsilon$. These complexity results are developed under the assumption that $f(\cdot; \xi)$ is smooth for every realization of the random variable $\xi$. Table 1 shows the complexity results in the case when $F$, but not necessarily every realization $f(\cdot; \xi)$, is strongly convex, with $\kappa$ denoting the condition number. Notice that the results for one-loop iSARAH (and SARAH) are the same in the strongly convex case as in the general convex case. The convergence rate of multiple-loop iSARAH, on the other hand, is better, in terms of dependence on $\kappa$, than the rate achieved by SCSG, which is the only other variance reduction method of the type we consider here that applies to (1). The general convex case is summarized in Table 2. In this case, multiple-loop iSARAH achieves the best convergence rate with respect to $\varepsilon$ among the compared stochastic methods, but this rate is derived under an additional assumption (Assumption 4), which is discussed in detail in Section 3. We note here that Assumption 4 is weaker than the assumption, common in the analysis of stochastic algorithms, that the iterates remain in a bounded set. In the recent paper [14], a stochastic line search method is shown to have a bounded expected iteration complexity under the assumption that the iterates remain in a bounded set; however, no total sample complexity is derived in [14].
Finally, the results for nonconvex problems are presented in Table 3. In this case, SCSG achieves the best convergence rate under the bounded variance assumption, which requires that $\mathbb{E}[\|\nabla f(w; \xi) - \nabla F(w)\|^2] \leq \sigma^2$ for some $\sigma > 0$ and all $w$. While the convergence rate of multiple-loop iSARAH for nonconvex functions remains an open question (as it does for the original SARAH algorithm), we derive the convergence rate of one-loop iSARAH without the bounded variance assumption. This convergence rate matches that of the general stochastic gradient algorithm, since the one-loop iSARAH method is not a variance reduction method. The one-loop iSARAH method can be viewed as a variant of a momentum SGD method.
Table 1: Comparison of total complexity results for the strongly convex case.

Method  Bound  Problem type
SARAH (one-loop) [10, 13]  Finite-sum
SARAH (multiple-loop) [10]  Finite-sum
SVRG [2, 16]  Finite-sum
SCSG [6, 7]  Finite-sum
SCSG  Expectation
SGD [12]  Expectation
iSARAH (one-loop)  Expectation
iSARAH (multiple-loop)  Expectation
Table 2: Comparison of total complexity results for the general convex case.

Method  Bound  Problem type  Additional assumption
SARAH (one-loop) [10, 13]  Finite-sum  None
SARAH (multiple-loop) [10]  Finite-sum  Assumption 4
SVRG [2, 16]  Finite-sum  None
SCSG [6, 7]  Finite-sum  None
SCSG  Expectation  None
SGD  Expectation  Bounded variance
iSARAH (one-loop)  Expectation  None
iSARAH (multiple-loop)  Expectation  Assumption 4
Table 3: Comparison of total complexity results for the nonconvex case.

Method  Bound  Problem type  Additional assumption
SARAH (one-loop) [10, 13]  Finite-sum  None
SVRG [2, 16]  Finite-sum  None
SCSG [6, 7]  Finite-sum  Bounded variance
SCSG  Expectation  Bounded variance
SGD  Expectation  Bounded variance
iSARAH (one-loop)  Expectation  None
We summarize the key results obtained in this paper as follows.

To the best of our knowledge, the proposed algorithm achieves better sample complexity for convex problems, under an additional relatively mild assumption, than any other algorithm for (1).

One-loop iSARAH presents a version of the momentum SGD method. We analyze the convergence rates of this algorithm in the general convex and nonconvex cases without assuming boundedness of the variance of the stochastic gradients.
1.1 Paper Organization
The remainder of the paper is organized as follows. In Section 2, we describe the Inexact SARAH (iSARAH) algorithm in detail. We provide the convergence analysis of iSARAH in Section 3, which includes the sample complexity bounds for the one-loop case when applied to strongly convex, convex, and nonconvex functions, and for the multiple-loop case when applied to strongly convex and convex functions. We conclude the paper and discuss future work in Section 4.
2 The Algorithm
Like SVRG and SARAH, iSARAH consists of an outer loop and an inner loop. The inner loop performs recursive stochastic gradient updates, while the outer loop of SVRG and SARAH computes the exact gradient. Specifically, given an iterate $w_0$ at the beginning of each outer loop, SVRG and SARAH compute $\nabla F(w_0)$. The only difference between SARAH and iSARAH is that the latter replaces the exact gradient computation by a gradient estimate based on a sample set of size $b$.

In other words, given an iterate $w_0$ and a sample set size $b$, $v_0$ is a random vector computed as

(3) $v_0 = \frac{1}{b} \sum_{j=1}^{b} \nabla f(w_0; \zeta_j)$,

where $\zeta_1, \ldots, \zeta_b$ are i.i.d.¹ realizations of $\xi$, and $\mathbb{E}[v_0] = \nabla F(w_0)$. The larger $b$ is, the more accurately the gradient estimate $v_0$ approximates $\nabla F(w_0)$. The key idea of the analysis of iSARAH is to establish bounds on $b$ which ensure sufficient accuracy for recovering the original SARAH convergence rate.

¹ Independent and identically distributed. We note from probability theory that if $\zeta_1, \ldots, \zeta_b$ are i.i.d. random variables, then $g(\zeta_1), \ldots, g(\zeta_b)$ are also i.i.d. random variables for any measurable function $g$.

The key step of the algorithm is a recursive update of the stochastic gradient estimate (SARAH update)
(4) $v_t = \nabla f(w_t; \xi_t) - \nabla f(w_{t-1}; \xi_t) + v_{t-1}$,

followed by the iterate update

(5) $w_{t+1} = w_t - \eta v_t$.

Let $\mathcal{F}_t$ be the $\sigma$-algebra generated by the history of the algorithm up to iteration $t$. We note that $\xi_t$ is independent of $\mathcal{F}_t$ and that $v_t$ is a biased estimator of the gradient $\nabla F(w_t)$.
The outer loop of iSARAH algorithm is summarized in Algorithm 1 and the inner loop is summarized in Algorithm 2.
As a variant of SARAH, iSARAH inherits the special property that one-loop iSARAH, the variant of Algorithm 1 that performs a single outer iteration with sufficiently many inner iterations, is a convergent algorithm. In the next section, we provide the analysis for both the one-loop and multiple-loop versions of iSARAH.
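To make the structure concrete, here is a minimal NumPy sketch of one outer iteration of iSARAH, combining the mini-batch estimate (3) with the SARAH updates (4) and (5). The component function, sampling distribution, step size, and loop sizes are our illustrative choices, not prescriptions from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(w, xi):
    """Stochastic gradient of f(w; xi) = 0.5 * ||w - xi||^2 (illustrative choice)."""
    return w - xi

def isarah_outer(w0, eta=0.1, b=64, m=50, dim=2):
    """One outer iteration: mini-batch estimate (3), then m SARAH updates (4)-(5)."""
    # (3): v_0 is a mini-batch gradient estimate based on b i.i.d. samples.
    zetas = rng.normal(size=(b, dim))
    v = np.mean([grad_f(w0, z) for z in zetas], axis=0)
    w_prev, w = w0, w0 - eta * v
    for _ in range(m):
        xi = rng.normal(size=dim)
        # (4): recursive SARAH update of the gradient estimate.
        v = grad_f(w, xi) - grad_f(w_prev, xi) + v
        # (5): iterate update.
        w_prev, w = w, w - eta * v
    return w

w_final = isarah_outer(np.array([5.0, -5.0]))
```

For this toy objective the minimizer of the expected risk is the mean of the sampling distribution, and a single outer iteration already moves the iterate much closer to it.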
Convergence criteria. Our iteration complexity analysis aims to bound the total number of stochastic gradient evaluations needed to achieve a desired bound on the gradient norm. For that, we will need to bound the number of outer iterations needed to guarantee the desired accuracy, and also to bound the inner loop size and the outer mini-batch size. Since the algorithm is stochastic and its output is random, an accurate solution is only achieved in expectation, i.e.,

(7) $\mathbb{E}[\|\nabla F(\tilde{w})\|^2] \leq \varepsilon$.
3 Convergence Analysis of iSARAH
3.1 Basic Assumptions
The analysis of the proposed algorithm will be performed under an appropriate subset of the following key assumptions.
Assumption 1 ($L$-smooth).
$f(\cdot; \xi)$ is $L$-smooth for every realization of $\xi$, i.e., there exists a constant $L > 0$ such that, $\forall w, w' \in \mathbb{R}^d$,

(8) $\|\nabla f(w; \xi) - \nabla f(w'; \xi)\| \leq L \|w - w'\|$.

Note that this assumption implies that $F$ is also $L$-smooth. The following strong convexity assumption will be made for the appropriate parts of the analysis; otherwise, it will be dropped.
Assumption 2 ($\mu$-strongly convex).
The function $F$ is $\mu$-strongly convex, i.e., there exists a constant $\mu > 0$ such that, $\forall w, w' \in \mathbb{R}^d$,
$F(w) \geq F(w') + \nabla F(w')^\top (w - w') + \frac{\mu}{2} \|w - w'\|^2$.

Under Assumption 2, let us define the (unique) optimal solution of (2) as $w_*$. Then strong convexity of $F$ implies that

(9) $2\mu (F(w) - F(w_*)) \leq \|\nabla F(w)\|^2$, $\forall w \in \mathbb{R}^d$.

Under the strong convexity assumption, we will use $\kappa$ to denote the condition number $\kappa = L/\mu$.
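For completeness, the gradient-dominance inequality implied by strong convexity (sometimes called the Polyak-Lojasiewicz inequality) follows by minimizing the right-hand side of the strong convexity inequality over its first argument:

```latex
F(w_*) \;\ge\; \min_{z} \Big\{ F(w) + \nabla F(w)^\top (z - w) + \tfrac{\mu}{2}\|z - w\|^2 \Big\}
       \;=\; F(w) - \tfrac{1}{2\mu}\|\nabla F(w)\|^2,
\qquad\text{hence}\qquad
2\mu \big( F(w) - F(w_*) \big) \;\le\; \|\nabla F(w)\|^2 .
```

The minimum over $z$ is attained at $z = w - \frac{1}{\mu}\nabla F(w)$, which gives the middle equality.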
Finally, as a special case of strong convexity with $\mu = 0$, we state the general convexity assumption, which we will use for some of the convergence results.

Assumption 3.
$f(\cdot; \xi)$ is convex for every realization of $\xi$, i.e., $\forall w, w' \in \mathbb{R}^d$,
$f(w; \xi) \geq f(w'; \xi) + \nabla f(w'; \xi)^\top (w - w')$.
3.2 Existing Results
We provide some well-known results from the existing literature that support our theoretical analysis. First, we introduce two standard lemmas in smooth convex optimization ([8]) for a general function $f$.
Lemma 1 (Theorem 2.1.5 in [8]).
Suppose that $f$ is convex and $L$-smooth. Then, for any $w, w' \in \mathbb{R}^d$,

(10) $f(w) \leq f(w') + \nabla f(w')^\top (w - w') + \frac{L}{2} \|w - w'\|^2$,

(11) $f(w) \geq f(w') + \nabla f(w')^\top (w - w') + \frac{1}{2L} \|\nabla f(w) - \nabla f(w')\|^2$,

(12) $(\nabla f(w) - \nabla f(w'))^\top (w - w') \geq \frac{1}{L} \|\nabla f(w) - \nabla f(w')\|^2$.

Note that (10) does not require the convexity of $f$.
Lemma 2 (Theorem 2.1.11 in [8]).
Suppose that $f$ is $\mu$-strongly convex and $L$-smooth. Then, for any $w, w' \in \mathbb{R}^d$,

(13) $(\nabla f(w) - \nabla f(w'))^\top (w - w') \geq \frac{\mu L}{\mu + L} \|w - w'\|^2 + \frac{1}{\mu + L} \|\nabla f(w) - \nabla f(w')\|^2$.
The following existing results are more specific properties of the component functions $f(\cdot; \xi)$.
Lemma 3 ([2]).
Lemma 4 (Lemma 1 in [12]).
Lemma 5 (Lemma 1 in [11]).
Let $\xi_1, \ldots, \xi_b$ be i.i.d. random variables with $\mathbb{E}[\xi_j] = u$ and $\mathbb{E}\|\xi_j - u\|^2 = \sigma^2$ for all $j$. Then,

(16) $\mathbb{E}\left\| \frac{1}{b} \sum_{j=1}^{b} \xi_j - u \right\|^2 = \frac{\sigma^2}{b}$.
Proof.
The proof of this lemma is given in [11]; we reproduce it here using mathematical induction. For $b = 1$, the result is immediate.
Assume that the result holds for $b - 1$; we now show that it also holds for $b$. We have
The third and the last equalities follow since $\xi_1, \ldots, \xi_b$ are i.i.d. with $\mathbb{E}[\xi_j] = u$. Therefore, the desired result is achieved. ∎
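Lemma 5 is easy to check numerically: the expected squared deviation of an i.i.d. sample mean from its expectation scales as $1/b$. A quick Monte Carlo sketch (the distribution and sample sizes are our choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_sq_deviation(b, trials=20000):
    """Monte Carlo estimate of E||(1/b) sum_j xi_j - u||^2 for i.i.d. N(0,1) xi_j (u = 0)."""
    means = rng.normal(size=(trials, b)).mean(axis=1)
    return np.mean(means ** 2)

# Lemma 5 predicts E||mean - u||^2 = sigma^2 / b; here sigma^2 = 1.
d1, d4 = mean_sq_deviation(1), mean_sq_deviation(4)
```

The estimates come out close to $1$ and $1/4$ respectively, matching the $\sigma^2 / b$ prediction.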
Corollary 1.
Based on the above lemmas, we show in detail how to obtain our main results in the following subsections.
3.3 Special Property of SARAH Update
The most important property of the SVRG algorithm is the variance reduction of the steps. This property holds as the number of outer iterations grows, but it does not hold if only the number of inner iterations increases. In other words, if we simply run the inner loop for many iterations (without executing additional outer loops), the variance of the steps does not reduce in the case of SVRG, while it goes to zero in the case of SARAH with a large learning rate in the strongly convex case. We recall the SARAH update as follows.
(18) $v_t = \nabla f(w_t; \xi_t) - \nabla f(w_{t-1}; \xi_t) + v_{t-1}$,

followed by the iterate update:

(19) $w_{t+1} = w_t - \eta v_t$.

We will now show that $\mathbb{E}[\|v_t\|^2]$ goes to zero in the strongly convex case. These results substantiate our conclusion that SARAH uses more stable stochastic gradient estimates than SVRG.
Proposition 1.
The proof of this proposition can be derived directly from Theorem 1b in [13]. This result implies that, by choosing the learning rate appropriately, we obtain linear convergence of $\mathbb{E}[\|v_t\|^2]$.
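The decay of the SARAH steps described in this proposition can be observed on a simple strongly convex quadratic. In the sketch below (our illustrative problem, not an experiment from the paper), the norm of $v_t$ decays geometrically within a single inner loop:

```python
import numpy as np

rng = np.random.default_rng(2)

def grad_f(w, xi):
    """Gradient of the strongly convex component f(w; xi) = 0.5 * ||w - xi||^2."""
    return w - xi

# One SARAH inner loop starting from a mini-batch estimate v_0; track ||v_t||.
eta, dim = 0.5, 3
w_prev = np.array([2.0, -1.0, 4.0])
v = np.mean([grad_f(w_prev, z) for z in rng.normal(size=(32, dim))], axis=0)
w = w_prev - eta * v
norms = [np.linalg.norm(v)]
for _ in range(20):
    xi = rng.normal(size=dim)
    v = grad_f(w, xi) - grad_f(w_prev, xi) + v  # SARAH update (18)
    w_prev, w = w, w - eta * v                  # iterate update (19)
    norms.append(np.linalg.norm(v))
```

For this particular quadratic the gradient difference equals $w_t - w_{t-1} = -\eta v_{t-1}$, so $v_t = (1 - \eta) v_{t-1}$ and the step norms shrink by a constant factor every inner iteration, with no additional outer loops needed.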
3.4 One-loop (iSARAH-IN) Results
We begin by providing two useful lemmas that do not require a convexity assumption. Lemma 6 bounds the sum of expected squared gradient norms, and Lemma 7 expands the expected squared error of the gradient estimate $v_t$.
Lemma 6.
Proof.
By summing over $t$, we have
which is equivalent to:
where the second inequality follows since . ∎
Lemma 7.
Proof.
Let $\mathcal{F}_t$ be the $\sigma$-algebra generated by the history of the algorithm up to iteration $t$.² We note that $\xi_t$ is independent of $\mathcal{F}_t$. For $t \geq 1$, we have

² $\mathcal{F}_t$ contains all the information of the iterates up to $w_t$ as well as the past gradient estimates.
where the last equality follows from
Taking the expectation of the above equation, we have
By summing over $t$, we have
∎
3.4.1 General Convex Case
In this subsection, we analyze the one-loop results of Inexact SARAH (Algorithm 2) in the general convex case. We first derive a bound on the error of the gradient estimate.
Lemma 8.
Proof.
For , we have
which, if we take expectation, implies that
when .
By summing the above inequality over $t$, we have
(22) 
Lemma 9.
Proof.
By Corollary 1, we have
Taking the expectation and adding the corresponding term to both sides, the desired result is achieved. ∎