 # SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient

In this paper, we propose a StochAstic Recursive grAdient algoritHm (SARAH), as well as its practical variant SARAH+, as a novel approach to the finite-sum minimization problems. Different from the vanilla SGD and other modern stochastic methods such as SVRG, S2GD, SAG and SAGA, SARAH admits a simple recursive framework for updating stochastic gradient estimates; when comparing to SAG/SAGA, SARAH does not require a storage of past gradients. The linear convergence rate of SARAH is proven under strong convexity assumption. We also prove a linear convergence rate (in the strongly convex case) for an inner loop of SARAH, the property that SVRG does not possess. Numerical experiments demonstrate the efficiency of our algorithm.


## 1 Introduction

We are interested in solving a problem of the form

$$\min_{w \in \mathbb{R}^d} \left\{ P(w) \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i \in [n]} f_i(w) \right\}, \tag{1}$$

where each $f_i$, $i \in [n] \stackrel{\text{def}}{=} \{1, \dots, n\}$, is convex with a Lipschitz continuous gradient. Throughout the paper, we assume that there exists an optimal solution $w^*$ of (1).

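Problems of this type arise frequently in supervised learning applications (Hastie et al., 2009). Given a training set $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, the least squares regression model, for example, is written as (1) with $f_i(w) \stackrel{\text{def}}{=} (x_i^T w - y_i)^2 + \frac{\lambda}{2}\|w\|^2$, where $\|\cdot\|$ denotes the $\ell_2$-norm. The $\ell_2$-regularized logistic regression for binary classification is written with $f_i(w) \stackrel{\text{def}}{=} \log(1 + \exp(-y_i x_i^T w)) + \frac{\lambda}{2}\|w\|^2$, where $y_i \in \{-1, 1\}$.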

In recent years, many advanced optimization methods have been developed for problem (1). While the objective function is smooth and convex, traditional optimization methods, such as gradient descent (GD) or Newton's method, are often impractical for this problem when $n$, the number of training samples and hence the number of $f_i$'s, is very large. In particular, GD updates iterates as follows

$$w_{t+1} = w_t - \eta_t \nabla P(w_t), \quad t = 0, 1, 2, \dots$$

Under a strong convexity assumption on $P$ and with an appropriate choice of $\eta_t$, GD converges at a linear rate in terms of objective function values $P(w_t)$. However, when $n$ is large, computing $\nabla P(w_t)$ at each iteration can be prohibitive.

As an alternative, stochastic gradient descent (SGD), originating from the seminal work of Robbins and Monro (Robbins & Monro, 1951), has become the method of choice for solving (1). (We remark that even though stochastic gradient is referred to as SG in the literature, the term stochastic gradient descent (SGD) has been widely used in many important works on large-scale learning, including SAG/SAGA, SDCA, SVRG and MISO.) At each step, SGD picks an index $i \in [n]$ uniformly at random, and updates the iterate as $w_{t+1} = w_t - \eta_t \nabla f_i(w_t)$, which is up to $n$ times cheaper than an iteration of a full gradient method. The convergence rate of SGD is slower than that of GD; in particular, it is sublinear in the strongly convex case. The trade-off, however, is advantageous due to the tremendous per-iteration savings and the fact that low accuracy solutions are sufficient. This trade-off has been thoroughly analyzed in (Bottou, 1998). Unfortunately, in practice SGD is often too slow and its performance is too sensitive to the variance in the sample gradients $\nabla f_i(w_t)$. Mini-batches (averaging multiple sample gradients $\nabla f_i(w_t)$) were used in (Shalev-Shwartz et al., 2007; Cotter et al., 2011; Takáč et al., 2013) to reduce the variance and improve the convergence rate by constant factors. A diminishing sequence $\{\eta_t\}$ is used to control the variance (Shalev-Shwartz et al., 2011; Bottou et al., 2016), but the practical convergence of SGD is known to be very sensitive to the choice of this sequence, which needs to be hand-picked.
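To make the cost comparison concrete, the following minimal sketch contrasts one GD step with one SGD step on a small least squares instance. The synthetic data, the helper names `grad_i` and `full_grad`, and the step size are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))   # synthetic features (assumption)
y = rng.normal(size=n)        # synthetic targets (assumption)

def grad_i(w, i):
    """Gradient of the hypothetical component f_i(w) = 0.5 * (x_i^T w - y_i)^2."""
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    """Gradient of P(w) = (1/n) * sum_i f_i(w); costs n component gradients."""
    return X.T @ (X @ w - y) / n

w = np.zeros(d)
eta = 0.01                        # illustrative step size

w_gd = w - eta * full_grad(w)     # one GD step: n gradient evaluations
i = rng.integers(n)               # index drawn uniformly at random
w_sgd = w - eta * grad_i(w, i)    # one SGD step: a single gradient evaluation

# Averaging the component gradients over all indices recovers the full gradient,
# which is why the SGD direction is an unbiased estimate of the GD direction.
mean_component_grad = np.mean([grad_i(w, j) for j in range(n)], axis=0)
```

The single-sample step touches one row of the data while the full step touches all of them, which is the source of the per-iteration saving discussed above.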

Recently, a class of more sophisticated algorithms has emerged, which use the specific finite-sum form of (1) and combine some deterministic and stochastic aspects to reduce the variance of the steps. Examples of these methods are SAG/SAGA (Le Roux et al., 2012; Defazio et al., 2014), SDCA (Shalev-Shwartz & Zhang, 2013), SVRG (Johnson & Zhang, 2013; Xiao & Zhang, 2014), DIAG (Mokhtari et al., 2017), MISO (Mairal, 2013) and S2GD (Konečný & Richtárik, 2013), all of which enjoy a faster convergence rate than that of SGD and use a fixed learning rate parameter $\eta$. In this paper we introduce a new method in this category, SARAH, which further improves several aspects of the existing methods. In Table 2 we summarize complexity and some other properties of the existing methods and SARAH when applied to strongly convex problems. Although SVRG and SARAH have the same convergence rate, we introduce a practical variant of SARAH that outperforms SVRG in our experiments.

In addition, theoretical results for complexity of the methods or their variants when applied to general convex functions have been derived (Schmidt et al., 2016; Defazio et al., 2014; Reddi et al., 2016; Allen-Zhu & Yuan, 2016; Allen-Zhu, 2017). In Table 2 we summarize the key complexity results, noting that convergence rate is now sublinear.

##### Our Contributions.

In this paper, we propose a novel algorithm which combines some of the good properties of existing algorithms, such as SAGA and SVRG, while aiming to improve on both of these methods. In particular, our algorithm does not take steps along a stochastic gradient direction, but rather along an accumulated direction using past stochastic gradient information (as in SAGA) and occasional exact gradient information (as in SVRG). We summarize the key properties of the proposed algorithm below.


• Similarly to SVRG, SARAH’s iterations are divided into the outer loop where a full gradient is computed and the inner loop where only stochastic gradient is computed. Unlike the case of SVRG, the steps of the inner loop of SARAH are based on accumulated stochastic information.

• Like SAG/SAGA and SVRG, SARAH has a sublinear rate of convergence for general convex functions, and a linear rate of convergence for strongly convex functions.

• SARAH uses a constant learning rate, whose size is larger than that of SVRG. We analyze and discuss the optimal choice of the learning rate and the number of inner loop steps. However, unlike SAG/SAGA but similar to SVRG, SARAH does not require a storage of past stochastic gradients.

• We also prove a linear convergence rate (in the strongly convex case) for the inner loop of SARAH, the property that SVRG does not possess. We show that the variance of the steps inside the inner loop goes to zero, thus SARAH is theoretically more stable and reliable than SVRG.

• We provide a practical variant of SARAH based on the convergence properties of the inner loop, where the simple stable stopping criterion for the inner loop is used (see Section 4 for more details). This variant shows how SARAH can be made more stable than SVRG in practice.

## 2 Stochastic Recursive Gradient Algorithm

Now we are ready to present our SARAH (Algorithm 1).

The key step of the algorithm is a recursive update of the stochastic gradient estimate (SARAH update)

$$v_t = \nabla f_{i_t}(w_t) - \nabla f_{i_t}(w_{t-1}) + v_{t-1}, \tag{2}$$

followed by the iterate update:

$$w_{t+1} = w_t - \eta v_t. \tag{3}$$

For comparison, SVRG update can be written in a similar way as

$$v_t = \nabla f_{i_t}(w_t) - \nabla f_{i_t}(w_0) + v_0. \tag{4}$$

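Observe that in SVRG, $v_t$ is an unbiased estimator of the gradient, while this is not true for SARAH. Specifically,

$$\mathbb{E}[v_t \mid \mathcal{F}_t] = \nabla P(w_t) - \nabla P(w_{t-1}) + v_{t-1} \neq \nabla P(w_t), \tag{5}$$

where $\mathbb{E}[\cdot \mid \mathcal{F}_t] = \mathbb{E}_{i_t}[\cdot]$ is the expectation with respect to the random choice of index $i_t$ (conditioned on $w_0, i_1, i_2, \dots, i_{t-1}$), and $\mathcal{F}_t = \sigma(w_0, i_1, i_2, \dots, i_{t-1})$ is the $\sigma$-algebra generated by $w_0, i_1, i_2, \dots, i_{t-1}$; $\mathcal{F}_0 = \mathcal{F}_1 = \sigma(w_0)$. Note that $\mathcal{F}_t$ also contains all the information of $w_0, \dots, w_t$ as well as $v_0, \dots, v_{t-1}$. Hence, SARAH is different from SGD and SVRG type methods. However, the following total expectation holds, $\mathbb{E}[v_t] = \mathbb{E}[\nabla P(w_t)]$, differentiating SARAH from SAG/SAGA.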
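The total expectation identity can be checked by a short induction on $t$. The sketch below assumes only that the inner loop starts from the full gradient, $v_0 = \nabla P(w_0)$, and uses the tower property of conditional expectation together with (5):

$$
\begin{aligned}
\mathbb{E}[v_t] &= \mathbb{E}\big[\mathbb{E}[v_t \mid \mathcal{F}_t]\big] \\
&= \mathbb{E}[\nabla P(w_t)] - \mathbb{E}[\nabla P(w_{t-1})] + \mathbb{E}[v_{t-1}] \\
&= \mathbb{E}[\nabla P(w_t)] \quad \text{whenever } \mathbb{E}[v_{t-1}] = \mathbb{E}[\nabla P(w_{t-1})].
\end{aligned}
$$

The base case $t = 0$ holds with equality by the choice of $v_0$, so the bias in (5) is a conditional effect that averages out over the history of sampled indices.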

SARAH is similar to SVRG since they both contain outer loops which require one full gradient evaluation per outer iteration followed by one full gradient descent step with a given learning rate. The difference lies in the inner loop, where SARAH updates the stochastic step direction $v_t$ recursively by adding and subtracting component gradients to and from the previous $v_{t-1}$ ($t \geq 1$) in (2). Each inner iteration evaluates 2 stochastic gradients and hence the total work per outer iteration is $\mathcal{O}(n + m)$ in terms of the number of gradient evaluations. Note that due to its nature, without running the inner loop, SARAH reduces to the GD algorithm.
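The structure described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the authors' reference implementation: the function name `sarah`, its signature, and the inner-loop indexing (iterates $w_0, \dots, w_m$, with the outer iterate drawn uniformly from them) are assumptions, and all inner iterates are stored only to keep the sketch short.

```python
import numpy as np

def sarah(grad_i, n, w0, eta, m, outer_iters, rng):
    """Sketch of SARAH: one full gradient per outer loop, recursive estimate inside.

    grad_i(w, i) is assumed to return the gradient of the i-th component f_i at w.
    """
    w_tilde = np.asarray(w0, dtype=float)
    for _ in range(outer_iters):
        w_prev = w_tilde
        # v_0: full gradient at the current outer iterate (n component gradients).
        v = np.mean([grad_i(w_prev, i) for i in range(n)], axis=0)
        w = w_prev - eta * v                      # w_1 = w_0 - eta * v_0
        iterates = [w_prev, w]
        for _ in range(1, m):
            i = rng.integers(n)
            # Recursive update (2): two component gradients per inner step.
            v = grad_i(w, i) - grad_i(w_prev, i) + v
            w_prev, w = w, w - eta * v            # iterate update (3)
            iterates.append(w)
        # Next outer iterate: chosen uniformly at random among w_0, ..., w_m.
        w_tilde = iterates[rng.integers(len(iterates))]
    return w_tilde
```

A call such as `sarah(grad_i, n, np.zeros(d), eta=0.1, m=2 * n, outer_iters=10, rng=np.random.default_rng(0))` would run ten outer loops; the particular values of `eta` and `m` here are placeholders rather than recommended settings.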

## 3 Theoretical Analysis

To proceed with the analysis of the proposed algorithm, we will make the following common assumptions.

###### Assumption 1 (L-smooth).

Each $f_i : \mathbb{R}^d \to \mathbb{R}$, $i \in [n]$, is $L$-smooth, i.e., there exists a constant $L > 0$ such that

$$\|\nabla f_i(w) - \nabla f_i(w')\| \leq L \|w - w'\|, \quad \forall w, w' \in \mathbb{R}^d.$$

Note that this assumption implies that $P$ is also $L$-smooth. The following strong convexity assumption will be made for the appropriate parts of the analysis; otherwise, it will be dropped.

###### Assumption 2c (μ-strongly convex).

The function $P : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex, i.e., there exists a constant $\mu > 0$ such that $\forall w, w' \in \mathbb{R}^d$,

$$P(w) \geq P(w') + \nabla P(w')^T (w - w') + \frac{\mu}{2} \|w - w'\|^2.$$

Another, stronger, assumption of $\mu$-strong convexity for (1) will also be imposed when required in our analysis. Note that Assumption 2d implies Assumption 2c but not vice versa.

###### Assumption 2d.

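Each function $f_i : \mathbb{R}^d \to \mathbb{R}$, $i \in [n]$, is strongly convex with $\mu > 0$.

Under Assumption 2c, let us define the (unique) optimal solution of (1) as $w^*$. Then strong convexity of $P$ implies that

$$2\mu [P(w) - P(w^*)] \leq \|\nabla P(w)\|^2, \quad \forall w \in \mathbb{R}^d. \tag{6}$$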
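For completeness, here is a brief sketch of why (6) follows from Assumption 2c. Fix $w' \in \mathbb{R}^d$ and minimize the right-hand side of the strong convexity inequality over $w$; the minimizer is $w = w' - \frac{1}{\mu} \nabla P(w')$, which gives a lower bound valid for every $w$, in particular for $w = w^*$:

$$
\begin{aligned}
P(w^*) &\geq P(w') + \nabla P(w')^T (w^* - w') + \frac{\mu}{2} \|w^* - w'\|^2 \\
&\geq \min_{w} \left\{ P(w') + \nabla P(w')^T (w - w') + \frac{\mu}{2} \|w - w'\|^2 \right\} \\
&= P(w') - \frac{1}{2\mu} \|\nabla P(w')\|^2.
\end{aligned}
$$

Rearranging and renaming $w'$ as $w$ yields $2\mu [P(w) - P(w^*)] \leq \|\nabla P(w)\|^2$.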

We note here, for future use, that for strongly convex functions of the form (1), arising in machine learning applications, the condition number is defined as $\kappa \stackrel{\text{def}}{=} L/\mu$. Furthermore, we should also notice that Assumptions 2c and 2d both cover a wide range of problems, e.g. $\ell_2$-regularized empirical risk minimization problems with convex losses.

Finally, as a special case of the strong convexity of all $f_i$'s with $\mu = 0$, we state the general convexity assumption, which we will use for convergence analysis.

###### Assumption 5.

Each function $f_i : \mathbb{R}^d \to \mathbb{R}$, $i \in [n]$, is convex, i.e.,

$$f_i(w) \geq f_i(w') + \nabla f_i(w')^T (w - w'), \quad \forall i \in [n].$$

Again, we note that Assumption 2d implies Assumption 5, but Assumption 2c does not. Hence in our analysis, depending on the result we aim at, we will require Assumption 5 to hold by itself, or Assumption 2c and Assumption 5 to hold together, or Assumption 2d to hold by itself. We will always use Assumption 1.

Our iteration complexity analysis aims to bound the number of outer iterations $T$ (or total number of stochastic gradient evaluations) which is needed to guarantee that $\|\nabla P(w_T)\|^2 \leq \epsilon$. In this case we will say that $w_T$ is an $\epsilon$-accurate solution. However, as is common practice for stochastic gradient algorithms, we aim to obtain the bound on the number of iterations, which is required to guarantee the bound on the expected squared norm of a gradient, i.e.,

$$\mathbb{E}[\|\nabla P(w_T)\|^2] \leq \epsilon. \tag{7}$$

### 3.1 Linearly Diminishing Step-Size in a Single Inner Loop

The most important property of the SVRG algorithm is the variance reduction of the steps. This property holds as the number of outer iterations grows, but it does not hold if only the number of inner iterations increases. In other words, if we simply run the inner loop for many iterations (without executing additional outer loops), the variance of the steps does not reduce in the case of SVRG, while it goes to zero in the case of SARAH. To illustrate this effect, let us take a look at Figures 1 and 2.

In Figure 1, we applied one outer loop of SVRG and SARAH to a sum of quadratic functions in a two-dimensional space, where the optimal solution is at the origin, the black lines and black dots indicate the trajectory of each algorithm and the red point indicates the final iterate. Initially, both SVRG and SARAH take steps along stochastic gradient directions towards the optimal solution. However, later iterations of SVRG wander randomly around the origin with large deviation from it, while SARAH follows a much more stable convergent trajectory, with a final iterate falling in a small neighborhood of the optimal solution.

Figure 1: A two-dimensional example of $\min_w P(w)$ with $n = 5$ for SVRG (left) and SARAH (right).

In Figure 2, the x-axis denotes the number of effective passes, which is equivalent to the number of passes through all of the data in the dataset, the cost of each pass being equal to the cost of one full gradient evaluation; the y-axis represents $\|v_t\|^2$. Figure 2 shows the evolution of $\|v_t\|^2$ for SARAH, SVRG, SGD+ (SGD with decreasing learning rate) and FISTA (an accelerated version of GD (Beck & Teboulle, 2009)) with $m = 4n$, where the left plot shows the trend over multiple outer iterations and the right plot shows a single outer iteration. (In the plots of Figure 2, since the data for SVRG is noisy, we smooth it by using moving average filters with spans 100 for the left plot and 10 for the right one.) We can see that for SVRG, $\|v_t\|^2$ decreases over the outer iterations, while it has an increasing or oscillating trend within each inner loop. In contrast, SARAH enjoys decreasing trends both in the outer and the inner loop iterations.

We will now show that the stochastic steps computed by SARAH converge linearly in the inner loop. We present two linear convergence results based on our two different assumptions of $\mu$-strong convexity. These results substantiate our conclusion that SARAH uses more stable stochastic gradient estimates than SVRG. The following theorem is our first result to demonstrate the linear convergence of our stochastic recursive step $v_t$.

###### Theorem 1b.

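Suppose that Assumptions 1, 2c and 5 hold. Consider $v_t$ defined by (2) in SARAH (Algorithm 1) with $\eta < 2/L$. Then, for any $t \geq 1$,

$$\mathbb{E}[\|v_t\|^2] \leq \left[ 1 - \left( \frac{2}{\eta L} - 1 \right) \mu^2 \eta^2 \right]^t \mathbb{E}[\|\nabla P(w_0)\|^2].$$

This result implies that by choosing $\eta = \mathcal{O}(1/L)$, we obtain the linear convergence of $\|v_t\|^2$ in expectation with the rate $(1 - 1/\kappa^2)$. Below we show that a better convergence rate can be obtained under a stronger convexity assumption.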

###### Theorem 1c.

Suppose that Assumptions 1 and 2d hold. Consider $v_t$ defined by (2) in SARAH (Algorithm 1) with $\eta \leq 2/(\mu + L)$. Then the following bound holds, $\forall t \geq 1$,

$$\mathbb{E}[\|v_t\|^2] \leq \left( 1 - \frac{2 \mu L \eta}{\mu + L} \right)^t \mathbb{E}[\|\nabla P(w_0)\|^2].$$

Again, by setting $\eta = \mathcal{O}(1/L)$, we derive the linear convergence with the rate of $(1 - 1/\kappa)$, which is a significant improvement over the result of Theorem 1b when the problem is severely ill-conditioned.
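The decay of $\|v_t\|^2$ within a single inner loop can be observed numerically. The sketch below runs the recursive update (2) on a small ridge-regularized least squares problem, where every component is strongly convex, and compares $\|v_t\|^2$ with the envelope $(1 - 2\mu L\eta/(\mu+L))^t \|v_0\|^2$ from Theorem 1c. The synthetic data, the choice $\eta = 1/L$, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 40, 4, 0.5
X = rng.normal(size=(n, d))   # synthetic features (assumption)
y = rng.normal(size=n)        # synthetic targets (assumption)

def grad_i(w, i):
    """Gradient of f_i(w) = 0.5*(x_i^T w - y_i)^2 + 0.5*lam*||w||^2 (illustrative)."""
    return (X[i] @ w - y[i]) * X[i] + lam * w

mu = lam                                   # every f_i is lam-strongly convex
L = np.max(np.sum(X**2, axis=1)) + lam     # every f_i is L-smooth for this L
eta = 1.0 / L                              # satisfies eta <= 2 / (mu + L)
rate = 1.0 - 2.0 * mu * L * eta / (mu + L)

w_prev = np.zeros(d)
v = np.mean([grad_i(w_prev, i) for i in range(n)], axis=0)   # v_0 = grad P(w_0)
w = w_prev - eta * v
sq_norms = [v @ v]
for t in range(1, 200):
    i = rng.integers(n)
    v = grad_i(w, i) - grad_i(w_prev, i) + v                 # SARAH update (2)
    w_prev, w = w, w - eta * v
    sq_norms.append(v @ v)

# Envelope from Theorem 1c, evaluated along the same inner loop.
bounds = [rate**t * sq_norms[0] for t in range(len(sq_norms))]
```

Plotting `sq_norms` against `bounds` on a logarithmic axis gives a quick visual check that the recursive estimate shrinks without any additional full gradient evaluation.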

### 3.2 Convergence Analysis

In this section, we derive the general convergence rate results for Algorithm 1. First, we present two important Lemmas as the foundation of our theory. Then, we proceed to prove sublinear convergence rate of a single outer iteration when applied to general convex functions. In the end, we prove that the algorithm with multiple outer iterations has linear convergence rate in the strongly convex case.

We begin by proving two useful lemmas that do not require any convexity assumption. The first, Lemma 1, bounds the sum of expected values of $\|\nabla P(w_t)\|^2$. The second, Lemma 2, bounds $\mathbb{E}[\|\nabla P(w_t) - v_t\|^2]$.

###### Lemma 1.

Suppose that Assumption 1 holds. Consider SARAH (Algorithm 1). Then, we have

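$$\sum_{t=0}^{m} \mathbb{E}[\|\nabla P(w_t)\|^2] \leq \frac{2}{\eta} \mathbb{E}[P(w_0) - P(w^*)] + \sum_{t=0}^{m} \mathbb{E}[\|\nabla P(w_t) - v_t\|^2] - (1 - L\eta) \sum_{t=0}^{m} \mathbb{E}[\|v_t\|^2]. \tag{8}$$
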
###### Lemma 2.

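Suppose that Assumption 1 holds. Consider $v_t$ defined by (2) in SARAH (Algorithm 1). Then for any $t \geq 1$,

$$\mathbb{E}[\|\nabla P(w_t) - v_t\|^2] = \sum_{j=1}^{t} \mathbb{E}[\|v_j - v_{j-1}\|^2] - \sum_{j=1}^{t} \mathbb{E}[\|\nabla P(w_j) - \nabla P(w_{j-1})\|^2].$$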

Now we are ready to provide our main theoretical results.

#### 3.2.1 General Convex Case

Following from Lemma 2, we can obtain the following upper bound for $\mathbb{E}[\|\nabla P(w_t) - v_t\|^2]$ for convex functions $f_i$, $i \in [n]$.

###### Lemma 3.

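Suppose that Assumptions 1 and 5 hold. Consider $v_t$ defined as (2) in SARAH (Algorithm 1) with $\eta < 2/L$. Then we have that for any $t \geq 1$,

$$\mathbb{E}[\|\nabla P(w_t) - v_t\|^2] \leq \frac{\eta L}{2 - \eta L} \left[ \mathbb{E}[\|v_0\|^2] - \mathbb{E}[\|v_t\|^2] \right] \leq \frac{\eta L}{2 - \eta L} \mathbb{E}[\|v_0\|^2]. \tag{9}$$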

Using the above lemmas, we can state and prove one of our core theorems as follows.

###### Theorem 4.

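Suppose that Assumptions 1 and 5 hold. Consider SARAH (Algorithm 1) with $\eta \leq 1/L$. Then for any $s \geq 1$, we have

$$\mathbb{E}[\|\nabla P(\tilde{w}_s)\|^2] \leq \frac{2}{\eta (m+1)} \mathbb{E}[P(\tilde{w}_{s-1}) - P(w^*)] + \frac{\eta L}{2 - \eta L} \mathbb{E}[\|\nabla P(\tilde{w}_{s-1})\|^2]. \tag{10}$$
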
###### Proof.

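Since $v_0 = \nabla P(w_0)$ implies $\|\nabla P(w_0) - v_0\|^2 = 0$, then by Lemma 3, we can write

$$\sum_{t=0}^{m} \mathbb{E}[\|\nabla P(w_t) - v_t\|^2] \leq \frac{m \eta L}{2 - \eta L} \mathbb{E}[\|v_0\|^2]. \tag{11}$$

Hence, by Lemma 1 with $\eta \leq 1/L$, we have

$$\begin{aligned} \sum_{t=0}^{m} \mathbb{E}[\|\nabla P(w_t)\|^2] &\leq \frac{2}{\eta} \mathbb{E}[P(w_0) - P(w^*)] + \sum_{t=0}^{m} \mathbb{E}[\|\nabla P(w_t) - v_t\|^2] \\ &\leq \frac{2}{\eta} \mathbb{E}[P(w_0) - P(w^*)] + \frac{m \eta L}{2 - \eta L} \mathbb{E}[\|v_0\|^2]. \end{aligned} \tag{12}$$

Since we are considering one outer iteration, with $s \geq 1$, we have $v_0 = \nabla P(w_0) = \nabla P(\tilde{w}_{s-1})$ (since $w_0 = \tilde{w}_{s-1}$), and $\tilde{w}_s = w_t$, where $t$ is picked uniformly at random from $\{0, 1, \dots, m\}$. Therefore, the following holds,

$$\begin{aligned} \mathbb{E}[\|\nabla P(\tilde{w}_s)\|^2] &= \frac{1}{m+1} \sum_{t=0}^{m} \mathbb{E}[\|\nabla P(w_t)\|^2] \\ &\leq \frac{2}{\eta (m+1)} \mathbb{E}[P(\tilde{w}_{s-1}) - P(w^*)] + \frac{\eta L}{2 - \eta L} \mathbb{E}[\|\nabla P(\tilde{w}_{s-1})\|^2]. \qquad \square \end{aligned}$$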

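Theorem 4, in the case when $\eta \leq 1/L$, implies that

$$\mathbb{E}[\|\nabla P(\tilde{w}_s)\|^2] \leq \frac{2}{\eta (m+1)} \mathbb{E}[P(\tilde{w}_{s-1}) - P(w^*)] + \eta L \, \mathbb{E}[\|\nabla P(\tilde{w}_{s-1})\|^2].$$

By choosing the learning rate $\eta = \sqrt{\frac{2}{L(m+1)}}$ (with $m$ such that $\sqrt{\frac{2}{L(m+1)}} \leq 1/L$) we can derive the following convergence result,

$$\mathbb{E}[\|\nabla P(\tilde{w}_s)\|^2] \leq \sqrt{\frac{2L}{m+1}} \, \mathbb{E}\left[ P(\tilde{w}_{s-1}) - P(w^*) + \|\nabla P(\tilde{w}_{s-1})\|^2 \right].$$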
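The common factor $\sqrt{2L/(m+1)}$ arises because this learning rate balances the two coefficients in the preceding bound. As a quick check, assuming $\eta = \sqrt{2/(L(m+1))}$, which is the value that solves $\frac{2}{\eta(m+1)} = \eta L$:

$$
\frac{2}{\eta (m+1)} = \frac{2}{m+1} \sqrt{\frac{L(m+1)}{2}} = \sqrt{\frac{2L}{m+1}},
\qquad
\eta L = L \sqrt{\frac{2}{L(m+1)}} = \sqrt{\frac{2L}{m+1}}.
$$

Both terms therefore carry the same coefficient, and the requirement $\eta \leq 1/L$ translates into $m + 1 \geq 2L$.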

Clearly, this result shows a sublinear convergence rate for SARAH under the general convexity assumption within a single inner loop, with increasing $m$, and consequently, we have the following result for the complexity bound.

###### Corollary 1.

Suppose that Assumptions 1 and 5 hold. Consider SARAH (Algorithm 1) within a single outer iteration with the learning rate $\eta = \mathcal{O}\left(\frac{1}{\sqrt{m+1}}\right)$ where $m$ is the total number of iterations. Then $\|\nabla P(w_t)\|^2$ converges sublinearly in expectation with a rate of $\mathcal{O}\left(\sqrt{\frac{1}{m+1}}\right)$, and therefore, the total complexity to achieve an $\epsilon$-accurate solution defined in (7) is $\mathcal{O}(n + 1/\epsilon^2)$.

Figure 3: Theoretical comparisons of learning rates (left) and convergence rates (middle and right) with $n = 1{,}000{,}000$ for SVRG and SARAH in one inner loop.

We now turn to estimating convergence of SARAH with multiple outer steps. Simply using Theorem 4 for each of the outer steps we have the following result.

###### Theorem 5.

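Suppose that Assumptions 1 and 5 hold. Consider SARAH (Algorithm 1) and define

$$\delta_k = \frac{2}{\eta (m+1)} \mathbb{E}[P(\tilde{w}_k) - P(w^*)], \quad k = 0, 1, \dots, s-1,$$

and $\delta = \max_{0 \leq k \leq s-1} \delta_k$. Then we have

$$\mathbb{E}[\|\nabla P(\tilde{w}_s)\|^2] - \Delta \leq \alpha^s \left( \|\nabla P(\tilde{w}_0)\|^2 - \Delta \right), \tag{13}$$

where $\Delta = \delta \left( 1 + \frac{\eta L}{2(1 - \eta L)} \right)$ and $\alpha = \frac{\eta L}{2 - \eta L}$.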

Based on Theorem 5, we have the following total complexity for SARAH in the general convex case.

###### Corollary 2.

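Let us choose $\Delta = \epsilon/4$, $\alpha = 1/2$ (with $\eta = 2/(3L)$), and $m = \mathcal{O}(1/\epsilon)$ in Theorem 5. Then, the total complexity to achieve an $\epsilon$-accuracy solution defined in (7) is $\mathcal{O}((n + 1/\epsilon) \log(1/\epsilon))$.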

#### 3.2.2 Strongly Convex Case

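We now turn to the discussion of the linear convergence rate of SARAH under the strong convexity assumption on $P$. From Theorem 4, for any $s \geq 1$, using property (6) of the $\mu$-strongly convex $P$, we have

$$\begin{aligned} \mathbb{E}[\|\nabla P(\tilde{w}_s)\|^2] &\leq \frac{2}{\eta (m+1)} \mathbb{E}[P(\tilde{w}_{s-1}) - P(w^*)] + \frac{\eta L}{2 - \eta L} \mathbb{E}[\|\nabla P(\tilde{w}_{s-1})\|^2] \\ &\leq \left( \frac{1}{\mu \eta (m+1)} + \frac{\eta L}{2 - \eta L} \right) \mathbb{E}[\|\nabla P(\tilde{w}_{s-1})\|^2], \end{aligned}$$

and equivalently,

$$\mathbb{E}[\|\nabla P(\tilde{w}_s)\|^2] \leq \sigma_m \mathbb{E}[\|\nabla P(\tilde{w}_{s-1})\|^2]. \tag{14}$$

Let us define $\sigma_m \stackrel{\text{def}}{=} \frac{1}{\mu \eta (m+1)} + \frac{\eta L}{2 - \eta L}$. Then by choosing $\eta$ and $m$ such that $\sigma_m < 1$, and applying (14) recursively, we are able to reach the following convergence result.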

###### Theorem 6.

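Suppose that Assumptions 1, 2c and 5 hold. Consider SARAH (Algorithm 1) with the choice of $\eta$ and $m$ such that

$$\sigma_m \stackrel{\text{def}}{=} \frac{1}{\mu \eta (m+1)} + \frac{\eta L}{2 - \eta L} < 1. \tag{15}$$

Then, we have

$$\mathbb{E}[\|\nabla P(\tilde{w}_s)\|^2] \leq (\sigma_m)^s \|\nabla P(\tilde{w}_0)\|^2.$$
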
###### Remark 1.

Theorem 6 implies that any $\eta < 1/L$ will work for SARAH. Let us compare our convergence rate to that of SVRG. The linear rate of SVRG, as presented in (Johnson & Zhang, 2013), is given by

$$\alpha_m = \frac{1}{\mu \eta (1 - 2L\eta) m} + \frac{2 \eta L}{1 - 2 \eta L} < 1.$$

We observe that it implies that the learning rate has to satisfy $\eta < 1/(4L)$, which is a tighter restriction than $\eta < 1/L$ required by SARAH. In addition, with the same values of $m$ and $\eta$, the rate of convergence (of the outer iterations) of SARAH is always smaller than that of SVRG:

$$\sigma_m = \frac{1}{\mu \eta (m+1)} + \frac{\eta L}{2 - \eta L} = \frac{1}{\mu \eta (m+1)} + \frac{1}{2/(\eta L) - 1} < \frac{1}{\mu \eta (1 - 2L\eta) m} + \frac{1}{0.5/(\eta L) - 1} = \alpha_m.$$

###### Remark 2.

To further demonstrate the better convergence properties of SARAH, let us consider the following optimization problems

$$\min_{0 < \eta < 1/L} \sigma_m, \qquad \min_{0 < \eta < 1/(4L)} \alpha_m,$$

which can be interpreted as the best convergence rates for different values of $m$, for both SARAH and SVRG. After simple calculations, we plot both learning rates and the corresponding theoretical rates of convergence, as shown in Figure 3, where the right plot is a zoom-in on a part of the middle plot. The left plot shows that the optimal learning rate for SARAH is significantly larger than that of SVRG, while the other two plots show significant improvement in outer iteration convergence rates for SARAH over SVRG.
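The comparison in Remarks 1 and 2 is easy to reproduce numerically. The sketch below evaluates the two rate expressions for a shared learning rate and then runs a crude grid search over the admissible learning rates of each method. The constants `L`, `mu`, `m`, the grid resolution, and the function names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigma_m(eta, m, L, mu):
    """Outer-loop rate of SARAH, as defined in (15)."""
    return 1.0 / (mu * eta * (m + 1)) + eta * L / (2.0 - eta * L)

def alpha_m(eta, m, L, mu):
    """Outer-loop rate of SVRG, as quoted in Remark 1."""
    return (1.0 / (mu * eta * (1.0 - 2.0 * L * eta) * m)
            + 2.0 * eta * L / (1.0 - 2.0 * eta * L))

L, mu, m = 1.0, 0.01, 2000          # illustrative problem constants (assumption)

# Same learning rate and inner loop size for both methods (Remark 1).
same_eta = 0.1
sarah_rate = sigma_m(same_eta, m, L, mu)
svrg_rate = alpha_m(same_eta, m, L, mu)

# Grid search for the best learning rate of each method (Remark 2).
etas_sarah = np.linspace(1e-3, 1.0 / L, 2000, endpoint=False)
etas_svrg = np.linspace(1e-3, 0.25 / L, 2000, endpoint=False)
best_eta_sarah = etas_sarah[np.argmin(sigma_m(etas_sarah, m, L, mu))]
best_eta_svrg = etas_svrg[np.argmin(alpha_m(etas_svrg, m, L, mu))]
```

Repeating the grid search over a range of `m` values would give curves of the kind described for Figure 3, under the assumption that the figure was produced from these two expressions.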

Based on Theorem 6, we are able to derive the following total complexity for SARAH in the strongly convex case.

###### Corollary 3.

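Fix $\epsilon \in (0, 1)$, and let us run SARAH with $\eta = 1/(2L)$ and $m = 4.5\kappa$ for $T$ iterations, where $T = \lceil \log(\|\nabla P(\tilde{w}_0)\|^2 / \epsilon) / \log(9/7) \rceil$. Then we can derive an $\epsilon$-accuracy solution defined in (7). Furthermore, we can obtain the total complexity of SARAH, to achieve the $\epsilon$-accuracy solution, as $\mathcal{O}((n + \kappa) \log(1/\epsilon))$.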

## 4 A Practical Variant

While SVRG is an efficient variance-reducing stochastic gradient method, one of its main drawbacks is the sensitivity of its practical performance to the choice of $m$. It is known that $m$ should be around $\mathcal{O}(\kappa)$, while the exact best choice remains unknown. (In practice, when $n$ is large, $P(w)$ is often considered as a regularized Empirical Loss Minimization problem with regularization parameter $\lambda = 1/n$, in which case $\kappa \sim \mathcal{O}(n)$.) In this section, we propose a practical variant of SARAH, called SARAH+ (Algorithm 2), which provides an automatic and adaptive choice of the inner loop size $m$. Guided by the linear convergence of the steps in the inner loop, demonstrated in Figure 2, we introduce a stopping criterion based on the values of $\|v_t\|^2$ while upper-bounding the total number of steps by a large enough $m$ for robustness. The other modification compared to SARAH (Algorithm 1) is the more practical choice $\tilde{w}_s = w_t$, where $t$ is the last index of the particular inner loop, instead of a randomly selected intermediate index.

Different from SARAH, SARAH+ allows earlier termination and removes the need for a careful choice of $m$, and it also covers classical gradient descent when we set $\gamma = 1$ (since the while loop does not proceed). In Figure 4 we present the numerical performance of SARAH+ with different $\gamma$s on the rcv1 and news20 datasets. The size of the inner loop provides a trade-off between the fast sub-linear convergence in the inner loop and linear convergence in the outer loop. From the results, it appears that $\gamma = 1/8$ is the optimal choice. With a larger $\gamma$, i.e. $\gamma > 1/8$, the iterates in the inner loop do not provide sufficient reduction before another full gradient computation is required, while with $\gamma < 1/8$ an unnecessary number of inner steps is performed without gaining substantial progress. Clearly $\gamma$ is another parameter that requires tuning; however, in our experiments, the performance of SARAH+ has been very robust with respect to the choice of $\gamma$ and did not vary much from one dataset to another.
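The stopping rule described in this section can be sketched as follows. This is a hedged reading of the text rather than the authors' Algorithm 2: the name `sarah_plus`, its signature, and the exact form of the test (continue while $\|v_{t-1}\|^2 > \gamma \|v_0\|^2$ and $t < m$) are assumptions chosen so that $\gamma = 1$ skips the inner loop and reduces to gradient descent, as stated above.

```python
import numpy as np

def sarah_plus(grad_i, n, w0, eta, gamma, m, outer_iters, rng):
    """Sketch of SARAH+: the inner loop stops once ||v||^2 <= gamma * ||v_0||^2.

    grad_i(w, i) is assumed to return the gradient of the i-th component f_i at w;
    m is an upper bound on the number of inner steps.
    """
    w_tilde = np.asarray(w0, dtype=float)
    for _ in range(outer_iters):
        w_prev = w_tilde
        v0 = np.mean([grad_i(w_prev, i) for i in range(n)], axis=0)  # full gradient
        v0_sq = v0 @ v0
        v, w, t = v0, w_prev - eta * v0, 1
        # Keep going while the latest step is still large relative to v_0.
        while v @ v > gamma * v0_sq and t < m:
            i = rng.integers(n)
            v = grad_i(w, i) - grad_i(w_prev, i) + v   # recursive update (2)
            w_prev, w = w, w - eta * v
            t += 1
        w_tilde = w   # last inner iterate instead of a random one
    return w_tilde
```

With `gamma=1.0` the while condition is false on entry, so each outer iteration performs exactly one full gradient step; smaller values of `gamma` trade more inner steps for fewer full gradient evaluations.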

Similarly to SVRG, $\|v_t\|^2$ decreases in the outer iterations of SARAH+. However, unlike SVRG, SARAH+ also inherits from SARAH the consistent decrease of $\|v_t\|^2$ in expectation in the inner loops. It is not possible to apply the same idea of adaptively terminating the inner loop of SVRG based on the reduction in $\|v_t\|^2$, as $\|v_t\|^2$ may have side fluctuations as shown in Figure 2.

Figure 4: An example of $\ell_2$-regularized logistic regression on rcv1 (left) and news20 (right) training datasets for SARAH+ with different $\gamma$s on loss residuals $P(w) - P(w^*)$.

## 5 Numerical Experiments

Figure 5: Comparisons of loss residuals $P(w) - P(w^*)$ (top) and test errors (bottom) from different modern stochastic methods on covtype, ijcnn1, news20 and rcv1.

To support the theoretical analyses and insights, we present our empirical experiments, comparing SARAH and SARAH+ with the state-of-the-art first-order methods for $\ell_2$-regularized logistic regression problems with

$$f_i(w) = \log(1 + \exp(-y_i x_i^T w)) + \frac{\lambda}{2} \|w\|^2,$$

on the datasets covtype, ijcnn1, news20 and rcv1 (all datasets are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). For ijcnn1 and rcv1 we use the predefined testing and training sets, while covtype and news20 do not have test data, hence we randomly split the datasets with 70% for training and 30% for testing. Some statistics of the datasets are summarized in Table 3.
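For reference, the component function above and its gradient can be written as follows. This is a minimal sketch for a single example $(x_i, y_i)$ with $y_i \in \{-1, 1\}$; the function names and the numerically stable formulations (`np.logaddexp`, the `tanh` form of the sigmoid) are implementation choices made here, not details from the paper.

```python
import numpy as np

def logistic_loss_i(w, x_i, y_i, lam):
    """f_i(w) = log(1 + exp(-y_i * x_i^T w)) + (lam / 2) * ||w||^2."""
    margin = y_i * (x_i @ w)
    # logaddexp(0, -margin) evaluates log(1 + exp(-margin)) without overflow.
    return np.logaddexp(0.0, -margin) + 0.5 * lam * (w @ w)

def logistic_grad_i(w, x_i, y_i, lam):
    """Gradient of f_i: -y_i * x_i / (1 + exp(margin)) + lam * w."""
    margin = y_i * (x_i @ w)
    # 0.5 * (1 - tanh(margin / 2)) equals 1 / (1 + exp(margin)) and avoids overflow.
    weight = 0.5 * (1.0 - np.tanh(0.5 * margin))
    return -y_i * weight * x_i + lam * w
```

A function of this shape, closed over the dataset, is what a stochastic method needs as its per-sample gradient oracle.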

The penalty parameter $\lambda$ is set to $1/n$, as is common practice (Le Roux et al., 2012). Note that like SVRG/S2GD and SAG/SAGA, SARAH also allows an efficient sparse implementation named “lazy updates” (Konečný et al., 2016). We conduct and compare numerical results of SARAH with SVRG, SAG, SGD+ and FISTA. SVRG (Johnson & Zhang, 2013) and SAG (Le Roux et al., 2012) are classic modern stochastic methods. SGD+ is SGD with decreasing learning rate $\eta = \eta_0 / (k+1)$, where $k$ is the number of effective passes and $\eta_0$ is some initial constant learning rate. FISTA (Beck & Teboulle, 2009) is the Fast Iterative Shrinkage-Thresholding Algorithm, well-known as an efficient accelerated version of gradient descent. Even though there is a theoretical safe learning rate for each method, we compare the results for the best learning rates in hindsight.

Figure 5 shows numerical results in terms of loss residuals (top) and test errors (bottom) on the four datasets. SARAH is sometimes comparable to or a little worse than the other methods at the beginning. However, it quickly catches up to or surpasses all other methods, demonstrating a faster rate of decrease across all experiments. We observe that on covtype and rcv1, SARAH, SVRG and SAG are comparable, with some advantage of SARAH on covtype. On ijcnn1 and news20, SARAH and SVRG consistently surpass the other methods.

In particular, to validate the efficiency of our practical variant SARAH+, we provide an insight into how important the choices of $m$ and $\eta$ are for SVRG and SARAH in Table 4 and Figure 6. Table 4 presents the optimal choices of $m$ and $\eta$ for each of the algorithms, while Figure 6 shows the behaviors of SVRG and SARAH with different choices of $m$ for covtype and ijcnn1, where $m^*$s denote the best choices. In Table 4, the optimal learning rates of SARAH vary less among different datasets compared to all the other methods and they approximate the theoretical upper bound for SARAH ($\eta \leq 1/L$); on the contrary, for the other methods the empirical optimal rates can exceed their theoretical limits (SVRG with $\eta < 1/(4L)$, SAG with $\eta \leq 1/(16L)$, FISTA with $\eta \leq 1/L$). These empirical studies suggest that it is much easier to tune and find the ideal learning rate for SARAH. As observed in Figure 6, the behaviors of both SARAH and SVRG are quite sensitive to the choice of $m$. With improper choices of $m$, the loss residuals can increase considerably on both covtype (in 40 effective passes) and ijcnn1 (in 17 effective passes) for SARAH/SVRG.

Figure 6: Comparisons of loss residuals $P(w) - P(w^*)$ for different inner loop sizes with SVRG (top) and SARAH (bottom) on covtype and ijcnn1.

## 6 Conclusion

We propose a new variance reducing stochastic recursive gradient algorithm, SARAH, which combines some of the properties of well known existing algorithms, such as SAGA and SVRG. For smooth convex functions, we show a sublinear convergence rate, while for strongly convex cases, we prove a linear convergence rate and the same computational complexity as those of SVRG and SAG. However, compared to SVRG, SARAH’s convergence rate constant is smaller and the algorithm is more stable both theoretically and numerically. Additionally, we prove linear convergence for the inner loops of SARAH, which supports the claim of stability. Based on this convergence we derive a practical version of SARAH, with a simple stopping criterion for the inner loops.

## Acknowledgements

The authors would like to thank the reviewers for useful suggestions which helped to improve the exposition in the paper.

## References

• Allen-Zhu (2017) Allen-Zhu, Zeyuan. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. Proceedings of the 49th Annual ACM on Symposium on Theory of Computing (to appear), 2017.
• Allen-Zhu & Yuan (2016) Allen-Zhu, Zeyuan and Yuan, Yang. Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives. In ICML, pp. 1080–1089, 2016.
• Beck & Teboulle (2009) Beck, Amir and Teboulle, Marc. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
• Bottou (1998) Bottou, Léon. Online learning and stochastic approximations. In Saad, David (ed.), Online Learning in Neural Networks, pp. 9–42. Cambridge University Press, New York, NY, USA, 1998. ISBN 0-521-65263-4.
• Bottou et al. (2016) Bottou, Léon, Curtis, Frank E, and Nocedal, Jorge. Optimization methods for large-scale machine learning. arXiv:1606.04838, 2016.
• Cotter et al. (2011) Cotter, Andrew, Shamir, Ohad, Srebro, Nati, and Sridharan, Karthik. Better mini-batch algorithms via accelerated gradient methods. In NIPS, pp. 1647–1655, 2011.
• Defazio et al. (2014) Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pp. 1646–1654, 2014.
• Hastie et al. (2009) Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, 2nd edition, 2009.
• Johnson & Zhang (2013) Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.
• Konečný et al. (2016) Konečný, Jakub, Liu, Jie, Richtárik, Peter, and Takáč, Martin. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10:242–255, 2016.
• Konečný & Richtárik (2013) Konečný, Jakub and Richtárik, Peter. Semi-stochastic gradient descent methods. arXiv:1312.1666, 2013.
• Le Roux et al. (2012) Le Roux, Nicolas, Schmidt, Mark, and Bach, Francis. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pp. 2663–2671, 2012.
• Mairal (2013) Mairal, Julien. Optimization with first-order surrogate functions. In ICML, pp. 783–791, 2013.
• Mokhtari et al. (2017) Mokhtari, Aryan, Gürbüzbalaban, Mert, and Ribeiro, Alejandro. A double incremental aggregated gradient method with linear convergence rate for large-scale optimization. Proceedings of IEEE International Conference on Acoustic, Speech and Signal Processing (to appear), 2017.
• Nesterov (2004) Nesterov, Yurii. Introductory lectures on convex optimization : a basic course. Applied optimization. Kluwer Academic Publ., Boston, Dordrecht, London, 2004. ISBN 1-4020-7553-7.
• Reddi et al. (2016) Reddi, Sashank J., Hefny, Ahmed, Sra, Suvrit, Póczos, Barnabás, and Smola, Alexander J. Stochastic variance reduction for nonconvex optimization. In ICML, pp. 314–323, 2016.
• Robbins & Monro (1951) Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
• Schmidt et al. (2016) Schmidt, Mark, Le Roux, Nicolas, and Bach, Francis. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pp. 1–30, 2016.
• Shalev-Shwartz & Zhang (2013) Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
• Shalev-Shwartz et al. (2007) Shalev-Shwartz, Shai, Singer, Yoram, and Srebro, Nathan. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, pp. 807–814, 2007.
• Shalev-Shwartz et al. (2011) Shalev-Shwartz, Shai, Singer, Yoram, Srebro, Nathan, and Cotter, Andrew. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
• Takáč et al. (2013) Takáč, Martin, Bijral, Avleen Singh, Richtárik, Peter, and Srebro, Nathan. Mini-batch primal and dual methods for SVMs. In ICML, pp. 1022–1030, 2013.
• Xiao & Zhang (2014) Xiao, Lin and Zhang, Tong. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

## Appendix A Technical Results

###### Lemma 4 (Theorem 2.1.5 in (Nesterov, 2004)).

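Suppose that $f$ is convex and $L$-smooth. Then, for any $w$, $w' \in \mathbb{R}^d$,

$$f(w) \leq f(w') + \nabla f(w')^T (w - w') + \frac{L}{2} \|w - w'\|^2, \tag{16}$$

$$f(w) \geq f(w') + \nabla f(w')^T (w - w') + \frac{1}{2L} \|\nabla f(w) - \nabla f(w')\|^2, \tag{17}$$

$$(\nabla f(w) - \nabla f(w'))^T (w - w') \geq \frac{1}{L} \|\nabla f(w) - \nabla f(w')\|^2. \tag{18}$$

Note that (16) does not require the convexity of $f$.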

###### Lemma 5 (Theorem 2.1.11 in (Nesterov, 2004)).

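Suppose that $f$ is $\mu$-strongly convex and $L$-smooth. Then, for any $w$, $w' \in \mathbb{R}^d$,

$$(\nabla f(w) - \nabla f(w'))^T (w - w') \geq \frac{\mu L}{\mu + L} \|w - w'\|^2 + \frac{1}{\mu + L} \|\nabla f(w) - \nabla f(w')\|^2. \tag{19}$$
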
###### Lemma 6 (Choices of m and η).

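Consider the rate of convergence $\sigma_m$ in Theorem 6. If we choose $\eta = 1/(\theta L)$ with $\theta > 1$ and fix $\sigma_m$, then the best choice of $m$ is

$$m^* = \frac{1}{2} (2\theta^* - 1)^2 \kappa - 1,$$

where $\kappa \stackrel{\text{def}}{=} L/\mu$, with $\theta^*$ calculated as:

$$\theta^* = \frac{\sigma_m + 1 + \sqrt{\sigma_m + 1}}{2 \sigma_m}.$$

Furthermore, we require $\theta^* > 1 + \sqrt{2}/2$ for $\sigma_m < 1$.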

## Appendix B Proofs

### B.1 Proof of Lemma 1

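By Assumption 1 and $w_{t+1} = w_t - \eta v_t$, we have

$$\begin{aligned} \mathbb{E}[P(w_{t+1})] &\leq \mathbb{E}[P(w_t)] - \eta \mathbb{E}[\nabla P(w_t)^T v_t] + \frac{L \eta^2}{2} \mathbb{E}[\|v_t\|^2] \\ &= \mathbb{E}[P(w_t)] - \frac{\eta}{2} \mathbb{E}[\|\nabla P(w_t)\|^2] + \frac{\eta}{2} \mathbb{E}[\|\nabla P(w_t) - v_t\|^2] - \left( \frac{\eta}{2} - \frac{L \eta^2}{2} \right) \mathbb{E}[\|v_t\|^2], \end{aligned}$$

where the first inequality is (16) applied to $P$, and the last equality follows from the fact that $a^T b = \frac{1}{2} \left[ \|a\|^2 + \|b\|^2 - \|a - b\|^2 \right]$ for any $a, b \in \mathbb{R}^d$.

By summing over $t = 0, \dots, m$, we have

$$\mathbb{E}[P(w_{m+1})] \leq \mathbb{E}[P(w_0)] - \frac{\eta}{2} \sum_{t=0}^{m} \mathbb{E}[\|\nabla P(w_t)\|^2] + \frac{\eta}{2} \sum_{t=0}^{m} \mathbb{E}[\|\nabla P(w_t) - v_t\|^2] - \left( \frac{\eta}{2} - \frac{L \eta^2}{2} \right) \sum_{t=0}^{m} \mathbb{E}[\|v_t\|^2],$$

which is equivalent to ($\eta > 0$):

$$\begin{aligned} \sum_{t=0}^{m} \mathbb{E}[\|\nabla P(w_t)\|^2] &\leq \frac{2}{\eta} \mathbb{E}[P(w_0) - P(w_{m+1})] + \sum_{t=0}^{m} \mathbb{E}[\|\nabla P(w_t) - v_t\|^2] - (1 - L\eta) \sum_{t=0}^{m} \mathbb{E}[\|v_t\|^2] \\ &\leq \frac{2}{\eta} \mathbb{E}[P(w_0) - P(w^*)] + \sum_{t=0}^{m} \mathbb{E}[\|\nabla P(w_t) - v_t\|^2] - (1 - L\eta) \sum_{t=0}^{m} \mathbb{E}[\|v_t\|^2], \end{aligned}$$

where the last inequality follows since $w^*$ is a global minimizer of $P$.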

### B.2 Proof of Lemma 2

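Note that $\mathcal{F}_j$ contains all the information of $w_0, \dots, w_j$ as well as $v_0, \dots, v_{j-1}$. For $j \geq 1$, we have

$$\begin{aligned} \mathbb{E}[\|\nabla P(w_j) - v_j\|^2 \mid \mathcal{F}_j] &= \mathbb{E}\left[ \left\| [\nabla P(w_{j-1}) - v_{j-1}] + [\nabla P(w_j) - \nabla P(w_{j-1})] - [v_j - v_{j-1}] \right\|^2 \mid \mathcal{F}_j \right] \\ &= \|\nabla P(w_{j-1}) - v_{j-1}\|^2 + \|\nabla P(w_j) - \nabla P(w_{j-1})\|^2 + \mathbb{E}[\|v_j - v_{j-1}\|^2 \mid \mathcal{F}_j] \\ &\quad + 2 (\nabla P(w_{j-1}) - v_{j-1})^T (\nabla P(w_j) - \nabla P(w_{j-1})) \\ &\quad - 2 (\nabla P(w_{j-1}) - v_{j-1})^T \mathbb{E}[v_j - v_{j-1} \mid \mathcal{F}_j] \\ &\quad - 2 (\nabla P(w_j) - \nabla P(w_{j-1}))^T \mathbb{E}[v_j - v_{j-1} \mid \mathcal{F}_j] \\ &= \|\nabla P(w_{j-1}) - v_{j-1}\|^2 - \|\nabla P(w_j) - \nabla P(w_{j-1})\|^2 + \mathbb{E}[\|v_j - v_{j-1}\|^2 \mid \mathcal{F}_j], \end{aligned}$$

where the last equality follows from

$$\mathbb{E}[v_j - v_{j-1} \mid \mathcal{F}_j] = \mathbb{E}[\nabla f_{i_j}(w_j) - \nabla f_{i_j}(w_{j-1}) \mid \mathcal{F}_j] = \nabla P(w_j) - \nabla P(w_{j-1}).$$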

By taking expectation for the above equation, we have

 E[∥∇P(wj)−vj∥