 # Hybrid Stochastic Gradient Descent Algorithms for Stochastic Nonconvex Optimization

We introduce a hybrid stochastic estimator to design stochastic gradient algorithms for solving stochastic optimization problems. Such a hybrid estimator is a convex combination of two existing estimators, one biased and one unbiased, and this combination yields useful properties for its variance. We limit our consideration to a hybrid SARAH-SGD estimator for nonconvex expectation problems. However, our idea can be extended to handle a broader class of estimators in both convex and nonconvex settings. We propose a new single-loop stochastic gradient descent algorithm that achieves an $\mathcal{O}(\max\{\sigma^3\varepsilon^{-1},\sigma\varepsilon^{-3}\})$ complexity bound to obtain an $\varepsilon$-stationary point under smoothness and $\sigma^2$-bounded variance assumptions. This complexity is better than the $\mathcal{O}(\sigma^2\varepsilon^{-4})$ bound often obtained by state-of-the-art SGDs when $\sigma < \mathcal{O}(\varepsilon^{-3})$. We also consider different extensions of our method, including constant and adaptive step-sizes with single-loop, double-loop, and mini-batch variants. We compare our algorithms with existing methods on several datasets using two nonconvex models.


## 1 Introduction

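Consider the following stochastic nonconvex optimization problem:

$$\min_{x\in\mathbb{R}^p}\Big\{ f(x) := \mathbb{E}_{\xi}\big[f(x;\xi)\big] \Big\}, \tag{1}$$

where $f(\cdot;\cdot) : \mathbb{R}^p\times\Omega\to\mathbb{R}$ is a stochastic function defined such that for each $x\in\mathbb{R}^p$, $f(x;\cdot)$ is a random variable in a given probability space $(\Omega,\mathbb{P})$, while for each realization $\xi\in\Omega$, $f(\cdot;\xi)$ is smooth on $\mathbb{R}^p$; and $\mathbb{E}_{\xi}[f(x;\xi)]$ is the expectation of $f(x;\xi)$ w.r.t. $\xi$ over $\Omega$.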

##### Our goals and assumptions:

Since (1) is nonconvex, our goal in this paper is to develop a new class of stochastic gradient algorithms to find an $\varepsilon$-approximate stationary point $\tilde{x}$ of (1) such that $\mathbb{E}\big[\|\nabla f(\tilde{x})\|^2\big]\leq\varepsilon^2$ under mild assumptions as stated in Assumption 1.1.

###### Assumption 1.1.

The objective function of (1) satisfies the following conditions:

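• (Boundedness from below) There exists a finite lower bound $f^\star := \inf_{x\in\mathbb{R}^p} f(x) > -\infty$.

• ($L$-average smoothness) The function $f(\cdot;\xi)$ is $L$-average smooth on $\mathbb{R}^p$, i.e. there exists $L\in(0,+\infty)$ such that

$$\mathbb{E}_{\xi}\big[\|\nabla f(x;\xi)-\nabla f(y;\xi)\|^2\big] \leq L^2\|x-y\|^2, \quad \forall x,y\in\mathbb{R}^p. \tag{2}$$

• (Bounded variance) There exists $\sigma\in[0,+\infty)$ such that

$$\mathbb{E}_{\xi}\big[\|\nabla f(x;\xi)-\nabla f(x)\|^2\big] \leq \sigma^2, \quad \forall x\in\mathbb{R}^p. \tag{3}$$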

These assumptions are standard in stochastic optimization methods [10, 14]. The $L$-average smoothness of $f$ is weaker than the $L$-smoothness of $f(\cdot;\xi)$ for each realization $\xi$. Note that our methods described in the sequel are also applicable to the finite-sum problem as long as the above assumptions hold. However, we do not specialize our methods to this problem. In this case, $\sigma$ in (3) can be replaced by other alternatives, e.g., .

##### Our key idea:

Different from existing methods, we introduce a convex combination of a biased and an unbiased estimator of the gradient of $f$, which we call a hybrid stochastic gradient estimator. While the biased estimator exploited in this paper is SARAH, the unbiased one can be any unbiased estimator. SARAH is a recursive, biased, and variance-reduced estimator for $\nabla f$. Combining it with an unbiased estimator allows us to reduce the bias and variance of the hybrid estimator. In this paper, we only focus on the standard stochastic estimator as an unbiased candidate.

##### Related work:

Under Assumption 1.1, problem (1) covers a large number of applications in machine learning and data science. The stochastic gradient descent (SGD) method was first studied in , and has become extremely popular in recent years.  seems to be the first work showing the convergence rates of robust SGD variants in the convex setting, while  provides an intensive complexity analysis for many optimization algorithms, including stochastic methods. Variance reduction methods have also been widely studied, see, e.g. [2, 6, 8, 11, 20, 22, 26, 27, 29]. In the nonconvex setting,  seems to be the first algorithm achieving -complexity bound. Other researchers have also made significant progress in this direction, including [3, 4, 5, 9, 13, 17, 18, 19, 23, 28, 31]. A majority of these works, including [3, 4, 5, 13, 23], rely on the SVRG estimator in order to obtain better complexity bounds. Hitherto, the complexity of SVRG-based methods remains worse than the best-known results, which are obtained in [9, 21, 28] via the SARAH estimator. However, as discussed in [21, 28], the method called SPIDER in [9, 13] does not perform well in practice due to its small step-size and its dependence on the reciprocal of the estimator’s norm.  amends this issue by using a large constant step-size, but requires a large mini-batch and does not consider the single-sample case or single-loop variants.  provides a more general framework to treat composite problems, which covers (1) as a special case, but it does not consider the single-loop setting as in SGDs.

##### Our contribution:

To this end, our contribution can be summarized as follows:

• We propose a hybrid stochastic estimator for a stochastic gradient of a nonconvex function $f$ in (1) by combining the SARAH estimator and any unbiased stochastic estimator such as SGD and SVRG. However, we only focus on the SGD estimator in this paper. We prove some key properties of this hybrid estimator that can be used to design new algorithms.

• We exploit our hybrid estimator to develop a single-loop SGD algorithm that can achieve an $\varepsilon$-stationary point $\tilde{x}$ such that $\mathbb{E}\big[\|\nabla f(\tilde{x})\|^2\big]\leq\varepsilon^2$ in at most $\mathcal{O}(\max\{\sigma^3\varepsilon^{-1},\sigma\varepsilon^{-3}\})$ stochastic gradient evaluations. This complexity significantly improves on the $\mathcal{O}(\sigma^2\varepsilon^{-4})$ complexity of SGD if $\sigma<\mathcal{O}(\varepsilon^{-3})$. We extend our algorithm to a double loop variant, which requires stochastic gradient evaluations. This is the best-known complexity in the literature for stochastic gradient-type methods for solving (1).

• We also investigate other variants of our method, including adaptive step-sizes and mini-batches. In all these cases, our methods achieve the best-known complexity bounds.

Let us emphasize the following points of our contribution. Firstly, although our single-loop method requires three gradients per iteration compared to standard SGDs, it can achieve a better complexity bound. Secondly, it can be cast as a variance reduction method where it starts from a “good” approximation of , and aggressively reduces the variance. Thirdly, our step-size is which is larger than in SGDs. Fourthly, the step-size of the adaptive variant is increasing instead of diminishing as in SGDs. Finally, our method achieves the same best-known complexity as the variance reduction methods studied in [9, 21, 28]. We believe that our approach can be extended to other estimators such as SVRG and SAGA, and can be used for Hessians to develop second-order methods as well as to solve convex and composite problems.

##### Paper organization:

The rest of this paper is organized as follows. Section 2 introduces our new hybrid stochastic estimator for the gradient of $f$ and investigates its properties. Section 3 proposes a single-loop hybrid SARAH-SGD algorithm together with its complexity analysis. It also considers double-loop and mini-batch variants with rigorous complexity analysis. Section 4 provides two numerical examples to illustrate our methods and compares them with state-of-the-art methods. All the proofs and additional experiments can be found in the Supplementary Document.

Notation: We work with Euclidean spaces, $\mathbb{R}^p$, equipped with the standard inner product $\langle\cdot,\cdot\rangle$ and norm $\|\cdot\|$. For a smooth function $f$ (i.e., $f$ is continuously differentiable), $\nabla f$ denotes its gradient. We use to denote a distribution on with probability . If is uniform, then we simply use . We also use $\mathcal{O}(\cdot)$ to denote big-O notation in complexity theory, and $\mathcal{F}_t$ to denote a $\sigma$-field.

## 2 Hybrid stochastic gradient estimators

In this section, we propose new stochastic estimators for the gradient of a smooth function $f$.

Let $u_t$ be an unbiased estimator of $\nabla f(x_t)$ formed by a realization $\zeta_t$ of $\xi$, i.e. $\mathbb{E}_{\zeta_t}[u_t] = \nabla f(x_t)$. We attempt to develop the following stochastic estimator for $\nabla f(x_t)$ in (1):

$$v_t := \beta_{t-1}v_{t-1} + \beta_{t-1}\big(\nabla f(x_t;\xi_t) - \nabla f(x_{t-1};\xi_t)\big) + (1-\beta_{t-1})u_t, \tag{4}$$

where $\xi_t$ and $\zeta_t$ are two independent realizations of $\xi$ on $\Omega$. Clearly, if $\beta_{t-1} = 0$, then we obtain a simple unbiased stochastic estimator, and if $\beta_{t-1} = 1$, we obtain the SARAH estimator. We are interested in the case $\beta_{t-1}\in(0,1)$, in which we call $v_t$ in (4) a hybrid stochastic estimator.

Note that we can rewrite $v_t$ as

$$v_t := \beta_{t-1}\nabla f(x_t;\xi_t) + (1-\beta_{t-1})u_t + \beta_{t-1}\big(v_{t-1} - \nabla f(x_{t-1};\xi_t)\big).$$

The first two terms are two stochastic gradients estimated at $x_t$, while the third term is the difference between the previous estimator and a stochastic gradient at the previous iterate. Here, since $\beta_{t-1}\in[0,1]$, the main idea is to exploit more recent information than older information. In fact, the hybrid estimator covers many other estimators, including SGD, SVRG, and SARAH. We can use one of the following two concrete unbiased estimators $u_t$ of $\nabla f(x_t)$:

• The SGD estimator: $u_t := \nabla f(x_t;\zeta_t)$.

• The SVRG estimator: $u_t := \nabla f(\tilde{x}) + \nabla f(x_t;\zeta_t) - \nabla f(\tilde{x};\zeta_t)$, where $\nabla f(\tilde{x})$ is a full gradient evaluated at a given snapshot point $\tilde{x}$.

However, for the sake of presentation, we only focus on the SGD estimator $u_t := \nabla f(x_t;\zeta_t)$.
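As a minimal sketch of the update (4) with the SGD choice of the unbiased term, the following Python snippet forms the hybrid estimator from its four ingredients. The names `hybrid_estimator` and `stoch_grad`, and the toy scalar oracle $f(x;s) = \tfrac12 (x-s)^2$, are illustrative assumptions and not part of the paper. The arithmetic works unchanged for floats or NumPy arrays.

```python
import random


def hybrid_estimator(v_prev, g_new, g_old, u, beta):
    """Hybrid estimator (4): beta * (SARAH step) + (1 - beta) * (unbiased term).

    v_prev : previous estimator v_{t-1}
    g_new  : stochastic gradient at x_t, evaluated with the sample xi_t
    g_old  : stochastic gradient at x_{t-1}, evaluated with the same sample xi_t
    u      : unbiased estimator u_t built from an independent sample zeta_t
    beta   : weight in [0, 1]
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    sarah_step = v_prev + (g_new - g_old)
    return beta * sarah_step + (1.0 - beta) * u


def stoch_grad(x, sample):
    """Toy oracle (assumption): gradient in x of f(x; s) = 0.5 * (x - s)^2."""
    return x - sample


rng = random.Random(0)
x_prev, x_curr = 1.0, 0.5
v_prev = stoch_grad(x_prev, rng.gauss(0.0, 1.0))
xi, zeta = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)  # two independent samples
v = hybrid_estimator(
    v_prev,
    stoch_grad(x_curr, xi),
    stoch_grad(x_prev, xi),
    stoch_grad(x_curr, zeta),
    beta=0.9,
)
```

Setting `beta=0.0` returns the plain SGD estimator and `beta=1.0` returns the SARAH step, matching the two extreme cases discussed after (4).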

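We first prove the following property of the estimator $v_t$, showing how its variance is estimated.

###### Lemma 2.1.

Let $v_t$ be defined by (4). Then

$$\mathbb{E}_{(\xi_t,\zeta_t)}[v_t] = \nabla f(x_t) + \beta_{t-1}\big(v_{t-1} - \nabla f(x_{t-1})\big). \tag{5}$$

If $\beta_{t-1}\neq 0$, then $v_t$ is a biased estimator. Moreover, we have

$$\begin{aligned}
\mathbb{E}_{(\xi_t,\zeta_t)}\big[\|v_t - \nabla f(x_t)\|^2\big] ={}& \beta_{t-1}^2\|v_{t-1} - \nabla f(x_{t-1})\|^2 - \beta_{t-1}^2\|\nabla f(x_{t-1}) - \nabla f(x_t)\|^2 \\
&+ \beta_{t-1}^2\,\mathbb{E}_{\xi_t}\big[\|\nabla f(x_t;\xi_t) - \nabla f(x_{t-1};\xi_t)\|^2\big] \\
&+ (1-\beta_{t-1})^2\,\mathbb{E}_{\zeta_t}\big[\|u_t - \nabla f(x_t)\|^2\big].
\end{aligned} \tag{6}$$
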
###### Remark 2.1.

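From (4), we can see that $v_t$ remains a biased estimator as long as $\beta_{t-1}\neq 0$. Its bias term is

$$\mathrm{Bias}(v_t) = \big\|\mathbb{E}_{(\xi_t,\zeta_t)}\big[v_t - \nabla f(x_t)\mid\mathcal{F}_t\big]\big\| = \beta_{t-1}\|v_{t-1} - \nabla f(x_{t-1})\| \leq \|v_{t-1} - \nabla f(x_{t-1})\|.$$

This shows that the bias of the hybrid estimator is no larger than that of the SARAH estimator, which is $\|v_{t-1} - \nabla f(x_{t-1})\|$.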
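For readers who want the intermediate step, the identity (5) can be checked directly from (4). The derivation below is a short sketch that assumes $\xi_t$ and $\zeta_t$ are independent of the past iterates, that $\mathbb{E}_{\xi_t}[\nabla f(x;\xi_t)] = \nabla f(x)$, and that $\mathbb{E}_{\zeta_t}[u_t] = \nabla f(x_t)$:

$$\begin{aligned}
\mathbb{E}_{(\xi_t,\zeta_t)}[v_t]
&= \beta_{t-1}v_{t-1} + \beta_{t-1}\big(\nabla f(x_t) - \nabla f(x_{t-1})\big) + (1-\beta_{t-1})\nabla f(x_t) \\
&= \nabla f(x_t) + \beta_{t-1}\big(v_{t-1} - \nabla f(x_{t-1})\big).
\end{aligned}$$

Taking norms of the second term gives the bias expression in Remark 2.1.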

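The following lemma bounds the second moment of $v_t - \nabla f(x_t)$ with $v_t$ defined in (4).

###### Lemma 2.2.

Assume that $f(\cdot;\xi)$ is $L$-smooth and $u_t$ is an SGD estimator. Then, we have the following upper bound on the variance of $v_t$:

$$\mathbb{E}\big[\|v_t - \nabla f(x_t)\|^2\big] \leq \omega_t\,\mathbb{E}\big[\|v_0 - \nabla f(x_0)\|^2\big] + L^2\sum_{i=0}^{t-1}\omega_{i,t}\,\mathbb{E}\big[\|x_{i+1} - x_i\|^2\big] + S_t, \tag{7}$$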

where the expectation is taken over all the randomness , , for , and for .

Lemmas 2.1 and 2.2 provide two key properties used to develop the stochastic algorithms in Section 3.

## 3 Hybrid SARAH-SGD algorithms

In this section, we utilize our hybrid stochastic estimator in (4) to develop stochastic gradient methods for solving (1). We consider three different variants using the hybrid SARAH-SGD estimator.

### 3.1 The generic algorithm framework

Using $v_t$ defined by (4), we can develop a new algorithm for solving (1) as in Algorithm 1.

Algorithm 1 looks essentially the same as any SGD scheme with only one loop. The differences are at Step 3, which uses a mini-batch estimator, and at Step 6, where we use our hybrid gradient estimator $v_t$. In addition, we will show in the sequel that it uses different step-sizes, which lead to different variants. Unlike the inner loop of SARAH or SVRG, each iteration of Algorithm 1 requires three individual gradient evaluations instead of two as in these methods. The snapshot at Step 3 of Algorithm 1 relies on a mini-batch of the size $b$, which is independent of in the loop .
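The description above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `hybrid_sgd`, the exact ordering of the steps, and the toy scalar problem $f(x;s) = \tfrac12(x-s)^2$ with Gaussian samples are all hypothetical choices. It uses a mini-batch snapshot estimator at the start, followed by the hybrid update (4) with the SGD choice of $u_t$, which costs three stochastic gradients per iteration.

```python
import random


def hybrid_sgd(stoch_grad, sample, x0, eta, beta, batch_size, num_iters):
    """Single-loop hybrid SARAH-SGD sketch; returns the list of iterates."""
    # Snapshot step: mini-batch estimate v_0 of the gradient at x_0.
    v = sum(stoch_grad(x0, sample()) for _ in range(batch_size)) / batch_size
    x_prev, x = x0, x0 - eta * v
    iterates = [x_prev, x]
    for _ in range(num_iters):
        xi, zeta = sample(), sample()  # two independent samples
        # SARAH part: same sample xi at the current and previous iterate.
        sarah_step = v + stoch_grad(x, xi) - stoch_grad(x_prev, xi)
        # Hybrid estimator (4) with the SGD estimator u_t = grad f(x_t; zeta_t).
        v = beta * sarah_step + (1.0 - beta) * stoch_grad(x, zeta)
        x_prev, x = x, x - eta * v
        iterates.append(x)
    return iterates


# Toy scalar problem (assumption): f(x; s) = 0.5 * (x - s)^2 with s ~ N(0, 0.1^2),
# so f(x) = 0.5 * x^2 + const and the unique stationary point is x = 0.
rng = random.Random(0)
iterates = hybrid_sgd(
    stoch_grad=lambda x, s: x - s,
    sample=lambda: rng.gauss(0.0, 0.1),
    x0=5.0,
    eta=0.1,
    beta=0.9,
    batch_size=20,
    num_iters=300,
)
```

The step-size `eta` and weight `beta` are left as plain arguments here so that either the constant or the adaptive choices analyzed in the convergence results can be plugged in.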

### 3.2 Convergence analysis

We analyze two cases: constant step-size and adaptive step-size. In both cases, $\beta_t := \beta$ is fixed for all $t$.

#### 3.2.a Convergence of Algorithm 1 with constant step-size η and constant β

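Assume that we run Algorithm 1 within iterations . In this case, given , we choose $\eta$ and $\beta$ in Algorithm 1 as follows:

$$\eta := \frac{2}{L\big(\sqrt{1+4\alpha_m^2}+1\big)} \quad\text{with}\quad \beta := 1-\frac{c_1}{\sqrt{b(m+1)}} \quad\text{and}\quad \alpha_m^2 := \frac{\beta^2(1-\beta^{2m})}{1-\beta^2}. \tag{8}$$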
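The following Python helper is a small sketch of how the quantities in (8) might be computed. It reads (8) as $\eta = 2/\big(L(\sqrt{1+4\alpha_m^2}+1)\big)$, $\beta = 1 - c_1/\sqrt{b(m+1)}$, and $\alpha_m^2 = \beta^2(1-\beta^{2m})/(1-\beta^2)$. The function name `constant_step_size` and the default `c1=1.0` are assumptions made only for illustration.

```python
import math


def constant_step_size(L, b, m, c1=1.0):
    """Return (eta, beta, alpha_sq) following (8).

    L  : smoothness constant
    b  : size of the snapshot mini-batch
    m  : iteration index bound used in (8)
    c1 : positive constant (free parameter; its value here is an assumption)
    """
    beta = 1.0 - c1 / math.sqrt(b * (m + 1))
    if not 0.0 < beta < 1.0:
        raise ValueError("need 0 < c1 < sqrt(b * (m + 1)) so that beta lies in (0, 1)")
    alpha_sq = beta**2 * (1.0 - beta ** (2 * m)) / (1.0 - beta**2)
    # eta is the positive root of L^2 * alpha_sq * eta^2 + L * eta - 1 = 0.
    eta = 2.0 / (L * (math.sqrt(1.0 + 4.0 * alpha_sq) + 1.0))
    return eta, beta, alpha_sq


eta, beta, alpha_sq = constant_step_size(L=1.0, b=100, m=99)
```

Since $\alpha_m^2 > 0$, the resulting step-size always satisfies $\eta < 1/L$.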

The following theorem estimates the complexity of Algorithm 1 to approximate an $\varepsilon$-stationary point of (1). Its proof is given in Subsection 2.2 of the supplementary document.

###### Theorem 3.1.

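Let $\{x_t\}$ be the sequence generated by Algorithm 1 using the step-size $\eta$ defined by (8). Let us choose . Then

• The step-size $\eta$ satisfies . In addition, we have

$$\mathbb{E}\big[\|\nabla f(\tilde{x}_m)\|^2\big] \leq \frac{3b^{1/4}L\,[f(x_0)-f^\star]}{\sqrt{c_1}\,(m+1)^{3/4}} + \Big(c_1+\frac{1}{c_1}\Big)\frac{\sigma^2}{\sqrt{b(m+1)}}. \tag{9}$$

• If we choose $b := c_2\sigma^{8/3}(m+1)^{1/3}$ for any $c_2 > 0$, then to guarantee $\mathbb{E}\big[\|\nabla f(\tilde{x}_m)\|^2\big]\leq\varepsilon^2$, we need to choose

$$m := \Big\lfloor \frac{\sigma}{\varepsilon^3}\Big[\frac{3Lc_2^{1/4}}{\sqrt{c_1}}\,[f(x_0)-f^\star] + \Big(c_1+\frac{1}{c_1}\Big)\frac{1}{\sqrt{c_2}}\Big]^{3/2}\Big\rfloor = \mathcal{O}\Big(\frac{\sigma}{\varepsilon^3}\Big). \tag{10}$$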

In particular, if we choose , then the number of oracle calls is

 (11)

Moreover, the step-size satisfies .

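Here, $T_{ge}$ stands for the number of stochastic gradient evaluations of $f$ in (1). The complexity in (11) can be written as $\mathcal{O}(\max\{\sigma^3\varepsilon^{-1},\sigma\varepsilon^{-3}\})$. If $\sigma\leq\mathcal{O}(\varepsilon^{-1})$, then our complexity is $\mathcal{O}(\sigma\varepsilon^{-3})$. Even if $\sigma$ exceeds this range, our complexity is still better than the $\mathcal{O}(\sigma^2\varepsilon^{-4})$ complexity of SGD as long as $\sigma<\mathcal{O}(\varepsilon^{-3})$.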
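To make the comparison with SGD concrete, the snippet below evaluates the two orders of magnitude quoted in the abstract, $\max\{\sigma^3\varepsilon^{-1},\sigma\varepsilon^{-3}\}$ and $\sigma^2\varepsilon^{-4}$, with all hidden constants set to one. This is only a back-of-the-envelope sketch: the function names are hypothetical, and the constants dropped here do matter in practice.

```python
def hybrid_bound(sigma, eps):
    """Order of the hybrid method's complexity, with constants set to 1."""
    return max(sigma**3 / eps, sigma / eps**3)


def sgd_bound(sigma, eps):
    """Order of the standard SGD complexity, with constants set to 1."""
    return sigma**2 / eps**4


eps = 0.1
moderate = (hybrid_bound(1.0, eps), sgd_bound(1.0, eps))      # sigma well below eps^-3
extreme = (hybrid_bound(2000.0, eps), sgd_bound(2000.0, eps)) # sigma above eps^-3
```

With $\varepsilon = 0.1$, the hybrid order is the smaller one for $\sigma = 1$, while for $\sigma = 2000 > \varepsilon^{-3}$ the ordering flips, in line with the condition $\sigma < \mathcal{O}(\varepsilon^{-3})$.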

#### 3.2.b Convergence of Algorithm 1 with adaptive step-size ηt and constant β

Let $\beta$ be fixed for some . Instead of fixing the step-size $\eta$ as in (8), we can update it adaptively as

$$\eta_m := \frac{1}{L}, \quad\text{and}\quad \eta_t := \frac{1}{L + L^2\big[\beta^2\eta_{t+1} + \beta^4\eta_{t+2} + \cdots + \beta^{2(m-t)}\eta_m\big]} \quad\text{for } t = 0,\cdots,m-1. \tag{12}$$

It can be shown that $0 < \eta_0 < \eta_1 < \cdots < \eta_m = \frac{1}{L}$. Interestingly, our step-size is updated in an increasing manner instead of diminishing as in existing SGD-type methods. Moreover, given , we can pre-compute the sequence of these step-sizes in advance within basic operations. Therefore, it does not add significant computational cost to our method.
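One possible way to pre-compute the step-sizes in (12) is a single backward pass that maintains the bracketed weighted sum as a running quantity, so that the whole sequence costs a number of operations linear in $m$. The sketch below reads (12) as $\eta_m = 1/L$ and $\eta_t = 1/\big(L + L^2\sum_{j=t+1}^{m}\beta^{2(j-t)}\eta_j\big)$. The function name `adaptive_step_sizes` and the running-sum trick are illustrative assumptions, not taken from the paper.

```python
def adaptive_step_sizes(L, beta, m):
    """Return [eta_0, ..., eta_m] following (12), computed backward in O(m)."""
    omega = beta**2
    etas = [0.0] * (m + 1)
    etas[m] = 1.0 / L
    # tail holds sum_{j=t+1}^{m} omega^(j-t) * eta_j for the current t.
    tail = 0.0
    for t in range(m - 1, -1, -1):
        # Shifting the index by one multiplies every weight by omega.
        tail = omega * (etas[t + 1] + tail)
        etas[t] = 1.0 / (L + L**2 * tail)
    return etas


etas = adaptive_step_sizes(L=1.0, beta=0.9, m=10)
```

The returned list is increasing in $t$ and ends at $1/L$, which is the behaviour described above.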

The following theorem states the convergence of Algorithm 1 under the adaptive update (12), whose proof is given in Subsection 2.3 of the supplementary document.

###### Theorem 3.2.

Let $\{x_t\}$ be the sequence generated by Algorithm 1 using the step-size $\eta_t$ defined by (12). Let , and with . Then

• The sum $\Sigma_m := \sum_{t=0}^m\eta_t$ is bounded from below as .

• If we choose for any , then to guarantee $\mathbb{E}\big[\|\nabla f(\tilde{x}_m)\|^2\big]\leq\varepsilon^2$, we need to choose . Therefore, the number of stochastic gradient evaluations is at most the same as in (11).

Note that in the finite sum case, i.e. , we set in both Theorems 3.1 and 3.2. This complexity remains the same as in Theorem 3.1. However, the adaptive step-size potentially gives better performance in practice, as we will see in Section 4.

Algorithm 1 can be considered as a single-loop variance reduction method, which is similar to SAGA, but Algorithm 1 aims at solving the nonconvex problem (1). It differs from standard SGD methods in that it is initialized by a mini-batch and then updates the estimator using three individual gradients. Therefore, each iteration has the same cost as SGD with a mini-batch of size $3$. In return, we obtain an improvement in the complexity bound, as in Theorems 3.1 and 3.2.

### 3.3 Convergence analysis of the double loop variant

Since the step-size $\eta$ depends on $m$, it is natural to run Algorithm 1 with multiple stages. This leads to a double-loop algorithm like SVRG, SARAH, and SPIDER, where Algorithm 1 is restarted at each outer iteration $s$. The details of this variant are described in Algorithm 2.

To analyze Algorithm 2, we use $x_t^{(s)}$ to represent the iterate of Algorithm 1 at the $t$-th inner iteration within each stage $s$. From (45), we can see that at each stage $s$, the following estimate holds

$$\frac{\eta}{2}\sum_{t=0}^{m}\mathbb{E}\big[\|\nabla f(x_t^{(s)})\|^2\big] \leq \mathbb{E}\big[f(x_0^{(s)})\big] - \mathbb{E}\big[f(x_{m+1}^{(s)})\big] + \frac{\eta\sigma^2\sqrt{m+1}}{(1+\beta)\sqrt{b}}.$$

Here, we assume that we fix the step-size $\eta$ for simplicity of analysis. The complexity of Algorithm 2 is given in the following theorem, whose proof is in Supplementary Document 2.4.

###### Theorem 3.3.

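Let $\{x_t^{(s)}\}$ be the sequence generated by Algorithm 2 using the constant step-size $\eta$ in (8). Then, the following estimate holds

$$\frac{1}{S(m+1)}\sum_{s=1}^{S}\sum_{t=0}^{m}\mathbb{E}\big[\|\nabla f(x_t^{(s)})\|^2\big] \leq \frac{3Lb^{1/4}}{S(m+1)^{3/4}}\big[f(\tilde{x}_0)-f^\star\big] + \frac{2\sigma^2}{\sqrt{b(m+1)}}. \tag{13}$$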

Let . If we choose and for some constants and and , then, to guarantee , we require at most

 (14)

Consequently, the total number of stochastic gradient evaluations does not exceed

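$$T_{ge} := (b+3m)S = \frac{3L(c_1+3c_2)\,c_1^{1/4}\,[f(\tilde{x}_0)-f^\star]\,\sigma}{c_2^{3/4}\Big(1-\frac{2}{\sqrt{c_1c_2}}\Big)\varepsilon^3} = \mathcal{O}\Big(\frac{\sigma}{\varepsilon^3}\Big). \tag{15}$$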

Note that the complexity (15) only holds if . Otherwise, the total complexity is , where other constants independent of and , and are hidden. Practically, if $\beta$ is very close to $1$, one can remove the unbiased SGD term to save one stochastic gradient evaluation. In this case, our estimator reduces to SARAH but with a different step-size. We observed empirically that when , the performance of our methods is not affected if we do so.

### 3.4 Extensions to mini-batch cases

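We consider a mini-batch hybrid stochastic estimator $\hat{v}_t$ for the gradient $\nabla f(x_t)$ defined as:

$$\hat{v}_t := \beta_{t-1}\hat{v}_{t-1} + \frac{\beta_{t-1}}{\hat{b}_t}\sum_{i\in\hat{\mathcal{B}}_t}\big(\nabla f(x_t;\xi_i) - \nabla f(x_{t-1};\xi_i)\big) + (1-\beta_{t-1})u_t, \tag{16}$$

where $\beta_{t-1}\in[0,1]$, and $\hat{\mathcal{B}}_t$ is a mini-batch of the size $\hat{b}_t$ and independent of the unbiased estimator $u_t$. Note that $u_t$ can also be a mini-batch unbiased estimator. For example, $u_t$ can be a mini-batch SGD estimator whose mini-batch is independent of $\hat{\mathcal{B}}_t$.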
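A minimal Python sketch of the mini-batch update (16) is given below. The function name `minibatch_hybrid_estimator` and its argument layout (two equal-length lists of per-sample gradients, evaluated at the current and previous iterates with the same samples) are assumptions for illustration. The entries may be floats or NumPy arrays.

```python
def minibatch_hybrid_estimator(v_prev, grads_new, grads_old, u, beta):
    """Mini-batch hybrid estimator (16).

    v_prev    : previous estimator
    grads_new : per-sample gradients at x_t over the mini-batch
    grads_old : per-sample gradients at x_{t-1}, using the same samples
    u         : unbiased estimator built from independent samples
    beta      : weight in [0, 1]
    """
    if not grads_new or len(grads_new) != len(grads_old):
        raise ValueError("gradient lists must be non-empty and of equal length")
    batch = len(grads_new)
    # Average of the SARAH-type differences over the mini-batch.
    avg_diff = sum(gn - go for gn, go in zip(grads_new, grads_old)) / batch
    return beta * (v_prev + avg_diff) + (1.0 - beta) * u


v_hat = minibatch_hybrid_estimator(
    v_prev=1.0, grads_new=[2.0, 4.0], grads_old=[1.0, 1.0], u=3.0, beta=0.5
)
```

With a mini-batch of size one, this reduces to the single-sample estimator (4).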

Using $\hat{v}_t$ defined by (16), we can design a mini-batch variant of Algorithm 1 to solve (1). The following corollary is obtained as a result of Theorem 3.1 for the mini-batch variant of Algorithm 1. Its proof is in Subsection 3.2 of the supplementary document.

###### Corollary 3.1.

Let Algorithm 1 be applied to solve (1) using mini-batch update (16) for with fixed, , and the step-size

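$$\eta := \frac{2}{L\big(1+\sqrt{1+4\rho\alpha_m^2}\big)} \quad\text{with}\quad \alpha_m^2 := \frac{\beta^2(1-\beta^{2m})}{1-\beta^2} \quad\text{and}\quad \beta := 1-\frac{c_1}{\sqrt{\rho b(m-1)}}. \tag{17}$$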

If we choose for any , then to guarantee , we need to choose

 (18)

Therefore, the number of oracle calls is

$$T_{ge} := \mathcal{O}\Big(\frac{\rho^{1/2}\sigma^3}{\varepsilon} + \frac{\sigma\rho^{1/2}}{\varepsilon^3}\Big), \tag{19}$$

where if is finite, and , otherwise. In particular, if we choose for some , then, the overall complexity is .

We can also develop a mini-batch variant of Algorithm 2 and estimate its complexity as in Theorem 3.3. For more details, we refer to Subsection 3.3 in the supplementary document due to space limit.

## 4 Numerical experiments

We verify our algorithms on two numerical examples and compare them with several existing methods: SVRG, SVRG+, SPIDER, SpiderBoost, and SGD. Due to space limits, the detailed configuration of our experiments as well as more numerical experiments can be found in Supplementary Document D. Our numerical experiments are implemented in Python and run on a MacBook Pro laptop with a 2.7 GHz Intel Core i5 and 16 GB of memory.

### 4.1 Logistic regression with nonconvex regularizer

Our first example is the following well-known problem used in many papers:

$$\min_{x\in\mathbb{R}^p}\Big\{ f(x) := \frac{1}{n}\sum_{i=1}^{n}\Big[ f_i(x) := \log\big(1+\exp(-a_i^{\top}x)\big) + \lambda\sum_{j=1}^{p}\frac{x_j^2}{1+x_j^2} \Big] \Big\}, \tag{20}$$

where $a_i$ are given for $i = 1,\dots,n$, and $\lambda > 0$ is a regularization parameter. Clearly, problem (20) fits (1) well with . In this experiment, we choose and normalize the data. One can also verify Assumption 1.1 due to the bounded Hessian of $f_i$.
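For concreteness, the objective in (20) and its gradient can be written with NumPy as follows. This is a sketch under assumptions: the rows of the matrix `A` are taken to be the vectors $a_i$ exactly as they appear in (20) (so any labels are assumed to be already absorbed into $a_i$), the regularizer is summed over the coordinates of $x$, and the function names `objective` and `gradient` and the random test data are hypothetical.

```python
import numpy as np


def objective(x, A, lam):
    """f(x) = mean_i log(1 + exp(-a_i^T x)) + lam * sum_j x_j^2 / (1 + x_j^2)."""
    z = A @ x
    loss = np.mean(np.logaddexp(0.0, -z))  # stable log(1 + exp(-z))
    reg = lam * np.sum(x**2 / (1.0 + x**2))
    return loss + reg


def gradient(x, A, lam):
    """Gradient of the objective above."""
    z = A @ x
    # d/dz log(1 + exp(-z)) = -1 / (1 + exp(z)) = -0.5 * (1 - tanh(z / 2)).
    weights = -0.5 * (1.0 - np.tanh(0.5 * z))
    grad_loss = A.T @ weights / A.shape[0]
    # d/dx [x^2 / (1 + x^2)] = 2x / (1 + x^2)^2, applied coordinate-wise.
    grad_reg = lam * 2.0 * x / (1.0 + x**2) ** 2
    return grad_loss + grad_reg


rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))  # synthetic data, for illustration only
x = rng.normal(size=5)
g = gradient(x, A, lam=0.1)
```

A stochastic gradient for a single index $i$ is obtained by passing the one-row matrix `A[i:i+1]` instead of `A`.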

We use three datasets from LibSVM for (20): w8a, rcv1.binary, and real-sim. We run the following algorithms. Hybrid-SGD-SL and Hybrid-SGD-ASL are Algorithm 1 with constant step-size and adaptive step-size, using our theoretical step-sizes (8) and (12), respectively, without tuning. Hybrid-SGD-DL is Algorithm 2. SGD1 is SGD with constant step-size , and SGD2 is SGD with adaptive step-size . Since the step-size of SPIDER depends on an accuracy $\varepsilon$, we choose to get a larger step-size. Our first result in the single-sample case (i.e. when , not using mini-batch) is plotted in Fig. 1 after epochs.

From Fig. 1, we observe that Hybrid-SGD-SL has similar convergence behavior as SGD1, but Hybrid-SGD-ASL works better. Hybrid-SGD-DL is the best but has some oscillation. SGD2 works better than SGD1 and is comparable with Hybrid-SGD-SL/ASL in the two last datasets. SVRG performs very poorly due to its small step-size. SVRG+ works much better than SVRG, and is comparable with our methods. SPIDER is also slow even when we have increased its step-size.

Now, we run 3 single-loop algorithms with mini-batch of the size . The result is shown in Fig. 2 after epochs.

Fig. 2 shows similar performance between Hybrid-SGD-SL and ASL and SGD2. Clearly, these theoretical variants of Algorithm 1 are slightly better than the adaptive SGD variant (SGD2), where a careful step-size is used.

### 4.2 Binary classification involving nonconvex loss and Tikhonov’s regularizer

We consider the following binary classification problem involving a nonconvex loss:

$$f^\star := \min_{x\in\mathbb{R}^p}\Big\{ f(x) := \frac{1}{n}\sum_{i=1}^{n}\ell(a_i^{\top}x, b_i) + \frac{\lambda}{2}\|x\|^2 \Big\}, \tag{21}$$

where $a_i$ and $b_i$ are given data for $i = 1,\dots,n$, $\lambda > 0$ is a regularization parameter, and $\ell$ is a nonconvex loss of the forms:

(used in two-layer neural networks). One can check that (21) satisfies Assumption 1.1 with . We choose , and test three variants of Algorithm 2 (Hybrid-SGD-DL) and compare them with SpiderBoost, SVRG, and SVRG+. Due to space limits, we only plot one experiment in Fig. 3 after epochs. Additional experiments can be found in Supplementary Document D.

As we can see from Fig. 3, Algorithm 2 performs well and is slightly better than SpiderBoost. Note that SpiderBoost simply uses the SARAH estimator with a constant step-size but with a mini-batch of the size . It is not surprising that SpiderBoost makes very good progress in decreasing the gradient norms. Both SVRG and SVRG+ perform much worse than Hybrid-SGD-DL and SpiderBoost in this test, but SVRG+ is slightly better than SVRG. Thanks to the SARAH part, our methods make similar progress to SpiderBoost while using different step-sizes.

## 5 Conclusion

We have introduced a new hybrid SARAH-SGD estimator for the objective gradient of expectation optimization problems. Under standard assumptions, we have shown that this estimator has a better variance reduction property than SARAH. By exploiting such an estimator, we have developed a new Hybrid-SGD algorithm, Algorithm 1, that has better complexity bounds than state-of-the-art SGDs. Our algorithm works with both constant and adaptive step-sizes. We have also studied its double-loop and mini-batch variants. We believe that our approach can be extended to other choices of unbiased estimators, Hessian estimators for second-order stochastic methods, and adaptive $\beta_t$.

## Appendix A: Properties of the hybrid stochastic estimator

This supplementary document provides the full proof of all the results in the main text. First, we need the following lemma in the sequel.

###### Lemma A.1.

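Given $L > 0$ and $\omega\in(0,1)$, let $\{\eta_t\}_{t=0}^{m}$ be the sequence updated by

$$\eta_m := \frac{1}{L}, \qquad \eta_t := \frac{1}{L + L^2\big[\omega\eta_{t+1} + \omega^2\eta_{t+2} + \cdots + \omega^{m-t}\eta_m\big]} \tag{22}$$

for $t = 0,\dots,m-1$. Then

$$0 < \eta_0 < \eta_1 < \cdots < \eta_m = \frac{1}{L}, \quad\text{and}\quad \Sigma_m := \sum_{t=0}^{m}\eta_t \geq \frac{(m+1)\sqrt{1-\omega}}{2L}. \tag{23}$$
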
###### Proof.

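First, from (22) it is straightforward to show that $0 < \eta_0 < \eta_1 < \cdots < \eta_m = \frac{1}{L}$. At the same time, since $\omega\in(0,1)$, we have $\omega > \omega^2 > \cdots > \omega^{m-t}$. By Chebyshev’s sum inequality, we have

$$(m-t)\big(\omega\eta_{t+1} + \omega^2\eta_{t+2} + \cdots + \omega^{m-t}\eta_m\big) \leq \Big(\sum_{j=t+1}^{m}\eta_j\Big)\big(\omega + \omega^2 + \cdots + \omega^{m-t}\big) \leq \frac{\omega}{1-\omega}\Big(\sum_{j=t+1}^{m}\eta_j\Big). \tag{24}$$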

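From the update (22), we also have

$$\begin{cases}
L^2\eta_0\big(\omega\eta_1 + \omega^2\eta_2 + \cdots + \omega^{m}\eta_m\big) = 1 - L\eta_0 \\
L^2\eta_1\big(\omega\eta_2 + \omega^2\eta_3 + \cdots + \omega^{m-1}\eta_m\big) = 1 - L\eta_1 \\
\qquad\cdots\qquad\cdots \\
L^2\eta_{m-1}\,\omega\eta_m = 1 - L\eta_{m-1} \\
0 = 1 - L\eta_m.
\end{cases} \tag{25}$$

Using (24) in (25), we get

$$\begin{cases}
\frac{\omega L^2}{1-\omega}\eta_0(\eta_0 + \eta_1 + \cdots + \eta_m) \geq m - mL\eta_0 + \frac{\omega L^2}{1-\omega}\eta_0^2 \\
\frac{\omega L^2}{1-\omega}\eta_1(\eta_0 + \eta_1 + \cdots + \eta_m) \geq (m-1) - (m-1)L\eta_1 + \frac{\omega L^2}{1-\omega}(\eta_1\eta_0 + \eta_1^2) \\
\qquad\cdots\qquad\cdots \\
\frac{\omega L^2}{1-\omega}\eta_{m-1}(\eta_0 + \eta_1 + \cdots + \eta_m) \geq 1 - L\eta_{m-1} + \frac{\omega L^2}{1-\omega}(\eta_{m-1}\eta_0 + \cdots + \eta_{m-1}^2) \\
\frac{\omega L^2}{1-\omega}\eta_m(\eta_0 + \eta_1 + \cdots + \eta_m) \geq 1 - L\eta_m + \frac{\omega L^2}{1-\omega}(\eta_m\eta_0 + \cdots + \eta_m^2).
\end{cases}$$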

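Let $\Sigma_m := \sum_{t=0}^{m}\eta_t$ and $S_m := \sum_{t=0}^{m}\eta_t^2$. Summing up both sides of the above inequalities, we get

$$\frac{\omega L^2}{1-\omega}\Sigma_m^2 \geq \frac{m^2+m+2}{2} - L\big(m\eta_0 + (m-1)\eta_1 + \cdots + \eta_{m-1} + \eta_m\big) + \frac{\omega L^2}{2(1-\omega)}\big(S_m + \Sigma_m^2\big).$$

Using Chebyshev’s sum inequality again, we have

$$m\eta_0 + (m-1)\eta_1 + \cdots + \eta_{m-1} + \eta_m \leq \frac{m^2+m+2}{2(m+1)}\Big(\sum_{t=0}^{m}\eta_t\Big) = \frac{(m^2+m+2)\Sigma_m}{2(m+1)}.$$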

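Note that $\Sigma_m^2 \leq (m+1)S_m$ by the Cauchy-Schwarz inequality, which shows that $S_m + \Sigma_m^2 \geq \frac{m+2}{m+1}\Sigma_m^2$. Combining the last three inequalities, we obtain the following quadratic inequality in $\Sigma_m$:

$$\frac{m\omega L^2}{1-\omega}\Sigma_m^2 + L(m^2+m+2)\Sigma_m - (m+1)(m^2+m+2) \geq 0.$$

Solving this inequality with respect to $\Sigma_m$, we obtain

$$\Sigma_m \geq \frac{(1-\omega)\Big[\sqrt{(m^2+m+2)^2 + \frac{4m(m+1)(m^2+m+2)\omega}{1-\omega}} - (m^2+m+2)\Big]}{2\omega mL}$$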