1 Introduction
Consider the following stochastic nonconvex optimization problem:
(1) $\min_{x\in\mathbb{R}^d}\Big\{ F(x) := \mathbb{E}_{\xi}\big[f(x;\xi)\big]\Big\},$
where $f(\cdot;\xi)$ is a stochastic function such that for each $x\in\mathbb{R}^d$, $f(x;\cdot)$ is a random variable in a given probability space $(\Omega,\mathcal{F},\mathbb{P})$, while for each realization $\xi$, $f(\cdot;\xi)$ is smooth on $\mathbb{R}^d$; and $F$ is the expectation of $f(\cdot;\xi)$ w.r.t. $\xi$ over $\Omega$.
Our goals and assumptions:
Since (1) is nonconvex, our goal in this paper is to develop a new class of stochastic gradient algorithms to find an approximate stationary point $\widetilde{x}$ of (1) such that $\mathbb{E}\big[\|\nabla F(\widetilde{x})\|^2\big] \le \varepsilon^2$, under the mild conditions stated in Assumption 1.1.
Assumption 1.1.
The objective function of (1) satisfies the following conditions:

(Boundedness from below) There exists a finite lower bound $F^{\star} := \inf_{x\in\mathbb{R}^d} F(x) > -\infty$.

(Average smoothness) The function $f(\cdot;\xi)$ is $L$-average smooth on $\mathbb{R}^d$, i.e., there exists $L \in (0, +\infty)$ such that
(2) $\mathbb{E}_{\xi}\big[\|\nabla f(x;\xi) - \nabla f(y;\xi)\|^2\big] \le L^2\|x - y\|^2, \quad \forall x, y \in \mathbb{R}^d.$

(Bounded variance) There exists $\sigma \in (0, +\infty)$ such that
(3) $\mathbb{E}_{\xi}\big[\|\nabla f(x;\xi) - \nabla F(x)\|^2\big] \le \sigma^2, \quad \forall x \in \mathbb{R}^d.$
These assumptions are standard in stochastic optimization methods [10, 14]. The average smoothness of $f(\cdot;\xi)$ is weaker than requiring smoothness of $f(\cdot;\xi)$ for each realization $\xi$. Note that the methods described in the sequel are also applicable to the finite-sum problem $F(x) := \frac{1}{n}\sum_{i=1}^n f_i(x)$ as long as the above assumptions hold; however, we do not specialize our methods to this problem. In this case, $\sigma^2$ in (3) can be replaced by other alternatives, e.g., its finite-sum analogue $\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x) - \nabla F(x)\|^2 \le \sigma^2$.
Our key idea:
In contrast to existing methods, we introduce a convex combination of a biased and an unbiased estimator of the gradient of $F$, which we call a hybrid stochastic gradient estimator. While the biased estimator exploited in this paper is SARAH [16], the unbiased one can be any unbiased estimator. SARAH is a recursive, biased, variance-reduced estimator of $\nabla F$. Combining it with an unbiased estimator allows us to reduce both the bias and the variance of the hybrid estimator. In this paper, we focus only on the standard stochastic (SGD) estimator as the unbiased candidate.
Related work:
Under Assumption 1.1, problem (1) covers a large number of applications in machine learning and data science. The stochastic gradient descent (SGD) method was first studied in [25] and has become extremely popular in recent years. [14] seems to be the first work showing convergence rates of robust SGD variants in the convex setting, while [15] provides an intensive complexity analysis for many optimization algorithms, including stochastic methods. Variance reduction methods have also been widely studied; see, e.g., [2, 6, 8, 11, 20, 22, 26, 27, 29]. In the nonconvex setting, [10] seems to be the first work establishing a complexity bound for SGD. Other researchers have also made significant progress in this direction, including [3, 4, 5, 9, 13, 17, 18, 19, 23, 28, 31]. A majority of these works, including [3, 4, 5, 13, 23], rely on the SVRG estimator in order to obtain better complexity bounds. Hitherto, the complexity of SVRG-based methods remains worse than the best-known results, which are obtained in [9, 21, 28] via the SARAH estimator. However, as discussed in [21, 28], the method called SPIDER in [9, 13] does not perform well in practice due to its small stepsize and its dependence on the reciprocal of the estimator's norm. [28] amends this issue by using a large constant stepsize, but requires large mini-batches and considers neither the single-sample case nor single-loop variants. [21] provides a more general framework treating composite problems, which covers (1) as a special case, but it does not consider a single loop as in SGDs.
Our contribution:
To this end, our contribution can be summarized as follows:

We propose a hybrid stochastic estimator for the stochastic gradient of the nonconvex function in (1) by combining the SARAH estimator from [16] with any unbiased stochastic estimator such as the SGD or SVRG estimator; however, we focus only on the SGD estimator in this paper. We prove some key properties of this hybrid estimator that can be used to design new algorithms.

We exploit our hybrid estimator to develop a single-loop SGD algorithm that can reach an $\varepsilon$-stationary point of (1) with a stochastic gradient complexity that significantly improves on that of SGD in certain regimes. We extend our algorithm to a double-loop variant whose complexity matches the best-known bound in the literature for stochastic gradient-type methods for solving (1).

We also investigate other variants of our method, including adaptive stepsizes and mini-batches. In all these cases, our methods achieve the best-known complexity bounds.
Let us emphasize the following points of our contribution. Firstly, although our single-loop method requires three stochastic gradients per iteration compared to standard SGDs, it achieves a better complexity bound. Secondly, it can be cast as a variance reduction method that starts from a "good" approximation of $\nabla F$ and aggressively reduces the variance. Thirdly, our stepsize is larger than the standard stepsize used in SGDs. Fourthly, the stepsize of the adaptive variant is increasing instead of diminishing as in SGDs. Finally, our method achieves the same best-known complexity as the variance reduction methods studied in [9, 21, 28]. We believe that our approach can be extended to other estimators such as SVRG [11] and SAGA [8], and can be applied to Hessians to develop second-order methods as well as to solve convex and composite problems.
Paper organization:
The rest of this paper is organized as follows. Section 2 introduces our new hybrid stochastic estimator for the gradient of $F$ and investigates its properties. Section 3 proposes a single-loop hybrid SGD-SARAH algorithm together with its complexity analysis; it also considers double-loop and mini-batch variants with rigorous complexity analysis. Section 4 provides two numerical examples to illustrate our methods and compares them with state-of-the-art methods. All proofs and additional experiments can be found in the Supplementary Document.
Notation: We work with Euclidean spaces $\mathbb{R}^d$ equipped with the standard inner product $\langle\cdot,\cdot\rangle$ and norm $\|\cdot\|$. For a smooth function $f$ (i.e., $f$ is continuously differentiable), $\nabla f$ denotes its gradient. We use $\mathbf{U}_p(S)$ to denote a distribution on a finite set $S$ with probability vector $p$; if $p$ is uniform, then we simply write $\mathbf{U}(S)$. We also use $\mathcal{O}(\cdot)$ for the big-O notation in complexity theory, and $\sigma(\cdot)$ to denote a $\sigma$-field.
2 Hybrid stochastic gradient estimators
In this section, we propose new stochastic estimators for the gradient of a smooth function $F$.
Let $u_t$ be an unbiased estimator of $\nabla F(x_t)$ formed by a realization $\xi_t$ of $\xi$, i.e., $\mathbb{E}[u_t] = \nabla F(x_t)$. We develop the following stochastic estimator for $\nabla F(x_t)$ in (1):
(4) $v_t := \beta_t\big(v_{t-1} + \nabla f(x_t;\zeta_t) - \nabla f(x_{t-1};\zeta_t)\big) + (1-\beta_t)\, u_t, \quad \beta_t \in [0, 1],$
where $\zeta_t$ and $\xi_t$ are two independent realizations of $\xi$ on $\Omega$. Clearly, if $\beta_t = 0$, then we obtain a simple unbiased stochastic estimator, and if $\beta_t = 1$, we obtain the SARAH estimator in [16]. We are interested in the case $\beta_t \in (0, 1)$, in which case we call $v_t$ in (4) a hybrid stochastic estimator.
Note that we can rewrite $v_t$ as
$v_t = \beta_t \nabla f(x_t;\zeta_t) + (1-\beta_t)\, u_t + \beta_t\big(v_{t-1} - \nabla f(x_{t-1};\zeta_t)\big).$
The first two terms are two stochastic gradients estimated at $x_t$, while the third term is the difference between the previous estimator $v_{t-1}$ and a stochastic gradient at the previous iterate $x_{t-1}$. Here, since $\beta_t \in (0, 1)$, the main idea is to weight recent information more heavily than old information. In fact, the hybrid estimator covers many other estimators, including SGD, SVRG, and SARAH. We can use one of the following two concrete unbiased estimators $u_t$ of $\nabla F(x_t)$:

The SGD estimator: $u_t^{\mathrm{SGD}} := \nabla f(x_t;\xi_t)$.

The SVRG estimator: $u_t^{\mathrm{SVRG}} := \nabla F(\widetilde{x}) + \nabla f(x_t;\xi_t) - \nabla f(\widetilde{x};\xi_t)$, where $\nabla F(\widetilde{x})$ is a full gradient evaluated at a given snapshot point $\widetilde{x}$.
However, for the sake of presentation, we focus only on the SGD estimator $u_t^{\mathrm{SGD}}$.
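To make the construction concrete, here is a minimal NumPy sketch of one update of the hybrid estimator; the function name and the generic `grad(x, sample)` oracle are our own illustrative choices, not the paper's notation.

```python
import numpy as np

def hybrid_estimator(grad, x_t, x_prev, v_prev, beta, zeta, xi):
    """One update of the hybrid estimator (4):
    v_t = beta * (v_{t-1} + g(x_t; zeta) - g(x_{t-1}; zeta)) + (1 - beta) * g(x_t; xi),
    where zeta and xi are two independent realizations and g = grad."""
    sarah_part = v_prev + grad(x_t, zeta) - grad(x_prev, zeta)  # biased, recursive part
    sgd_part = grad(x_t, xi)                                    # unbiased part
    return beta * sarah_part + (1.0 - beta) * sgd_part
```

Setting `beta = 0` recovers the plain SGD estimator, while `beta = 1` recovers SARAH, mirroring the discussion above.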
We first prove the following property of the hybrid estimator $v_t$, showing how its variance can be bounded.
Lemma 2.1.
Remark 2.1.
Lemma 2.2.
Assume that $f(\cdot;\xi)$ is smooth and $u_t$ is an SGD estimator. Then, we have the following upper bound on the variance of $v_t$:
(7) 
where the expectation is taken over all the randomness generated by the estimator up to iteration $t$.
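The variance reduction effect can also be checked empirically. The following sketch compares the one-step mean-squared error of the hybrid estimator against the plain SGD estimator; the toy model with multiplicative noise, the sample count, and the value of beta are our own assumptions for illustration.

```python
import numpy as np

# Toy model (an assumption for illustration): f(x; xi) = 0.5 * xi * x**2 with
# xi ~ N(1, 1), so F(x) = 0.5 * x**2 and grad f(x; xi) = xi * x.
rng = np.random.default_rng(42)
n_trials = 20000
x_prev, x_t, beta = 1.0, 0.95, 0.9
v_prev = x_prev                       # assume v_{t-1} = grad F(x_{t-1}) exactly
true_grad = x_t                       # grad F(x_t)

zeta = rng.normal(1.0, 1.0, n_trials)  # samples for the SARAH part
xi = rng.normal(1.0, 1.0, n_trials)    # independent samples for the SGD part
v_hybrid = beta * (v_prev + zeta * x_t - zeta * x_prev) + (1 - beta) * (xi * x_t)
v_sgd = xi * x_t

mse_hybrid = np.mean((v_hybrid - true_grad) ** 2)
mse_sgd = np.mean((v_sgd - true_grad) ** 2)
```

Because consecutive iterates are close, the SARAH difference term carries little noise, and the hybrid estimator's error is dominated by the down-weighted SGD part.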
3 Hybrid SARAH-SGD algorithms
In this section, we utilize our hybrid stochastic estimator in (4) to develop stochastic gradient methods for solving (1). We consider three different variants using the hybrid SARAH-SGD estimator.
3.1 The generic algorithm framework
Algorithm 1 looks essentially the same as any single-loop SGD scheme. The differences are at Step 3, with a mini-batch snapshot estimator, and at Step 6, where we use our hybrid gradient estimator $v_t$. In addition, we will show in the sequel that it uses different stepsizes, leading to different variants. Unlike the inner loop of SARAH or SVRG, each iteration of Algorithm 1 requires three individual gradient evaluations instead of two as in those methods. The snapshot at Step 3 of Algorithm 1 relies on a mini-batch whose size is fixed and independent of the iteration counter of the loop.
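The scheme just described can be sketched as follows. The function signature, the constant stepsize/weight schedule, and the mini-batch size are illustrative placeholders rather than the paper's tuned choices; the mini-batch snapshot and the three-gradient hybrid update mirror Steps 3 and 6.

```python
import numpy as np

def hybrid_sgd_single_loop(grad, sample, x0, T, eta, beta, b0, rng):
    """Sketch of the single-loop hybrid scheme (illustrative parameters).
    grad(x, s): stochastic gradient at x for sample s; sample(rng): one sample."""
    x_prev = np.asarray(x0, dtype=float).copy()
    # Snapshot step: mini-batch estimate of the gradient at the initial point
    v = np.mean([grad(x_prev, sample(rng)) for _ in range(b0)], axis=0)
    x = x_prev - eta * v
    for _ in range(T):
        zeta, xi = sample(rng), sample(rng)   # two independent realizations
        # Hybrid update: three individual stochastic gradients per iteration
        v = beta * (v + grad(x, zeta) - grad(x_prev, zeta)) + (1 - beta) * grad(x, xi)
        x_prev, x = x, x - eta * v
    return x
```

On a toy quadratic with noisy gradients, the iterates settle near the true minimizer despite the constant stepsize, thanks to the damped noise in the hybrid estimator.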
3.2 Convergence analysis
We analyze two cases: constant stepsize and adaptive stepsize. In both cases, the weight $\beta_t \equiv \beta$ is fixed for all iterations.
3.2.a Convergence of Algorithm 1 with constant stepsize and constant $\beta$
Assume that we run Algorithm 1 for a fixed number of iterations $T \ge 1$. In this case, we choose the stepsize and the weight $\beta$ in Algorithm 1 as follows:
(8) 
The following theorem estimates the complexity of Algorithm 1 to approximate an $\varepsilon$-stationary point of (1); its proof is given in Subsection 2.2 of the supplementary document.
Theorem 3.1.
Let $\{x_t\}$ be the sequence generated by Algorithm 1 using the stepsize defined by (8). Then:

The stepsize satisfies . In addition, we have
(9) 
If we choose for any , then to guarantee , we need to choose
(10) In particular, if we choose , then the number of oracle calls is
(11) Moreover, the stepsize satisfies .
3.2.b Convergence of Algorithm 1 with adaptive stepsize and constant $\beta$
Let $\beta$ be fixed. Instead of fixing the stepsize as in (8), we can update it adaptively as
(12) 
Interestingly, our stepsize is updated in an increasing manner instead of diminishing as in existing SGD-type methods. Moreover, given $T$, we can precompute the whole sequence of stepsizes in advance with basic operations; therefore, this does not significantly increase the computational cost of our method.
The following theorem states the convergence of Algorithm 1 under the adaptive update (12), whose proof is given in Subsection 2.3 of the supplementary document.
Theorem 3.2.
Note that in the finite-sum case, we use the same setting in both Theorems 3.1 and 3.2, and the complexity remains the same as in Theorem 3.1. However, the adaptive stepsize potentially gives better performance in practice, as we will see in Section 4.
Algorithm 1 can be considered a single-loop variance reduction method, similar in spirit to SAGA [8], but Algorithm 1 aims at solving the nonconvex problem (1). It differs from standard SGD methods in that it is initialized with a mini-batch estimate and then updates the estimator using three individual stochastic gradients per iteration. Therefore, its per-iteration cost is the same as that of SGD with a small mini-batch. As compensation, we obtain the improved complexity bounds of Theorems 3.1 and 3.2.
3.3 Convergence analysis of the double loop variant
Since the stepsize depends on the number of iterations, it is natural to run Algorithm 1 in multiple stages. This leads to a double-loop algorithm, as in SVRG, SARAH, and SPIDER, where Algorithm 1 is restarted at each outer iteration. The details of this variant are described in Algorithm 2.
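A sketch of the restarting idea, under the same illustrative conventions as before: each outer stage re-initializes the estimator with a fresh mini-batch snapshot and then runs the inner hybrid updates. Parameter names are placeholders, not the paper's settings.

```python
import numpy as np

def hybrid_sgd_double_loop(grad, sample, x0, S, T, eta, beta, b0, rng):
    """Sketch of the restarting (double-loop) variant: S outer stages,
    each re-running the single-loop scheme from the current iterate."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(S):                   # outer stages (restarts)
        x_prev = x.copy()
        # Fresh mini-batch snapshot at the start of each stage
        v = np.mean([grad(x_prev, sample(rng)) for _ in range(b0)], axis=0)
        x = x_prev - eta * v
        for _ in range(T):               # inner loop: hybrid updates
            zeta, xi = sample(rng), sample(rng)
            v = beta * (v + grad(x, zeta) - grad(x_prev, zeta)) + (1 - beta) * grad(x, xi)
            x_prev, x = x, x - eta * v
    return x
```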
To analyze Algorithm 2, we use the iterate of Algorithm 1 at the $t$-th inner iteration within each stage. From (45), we can see that, for each stage, the following estimate holds
Here, we fix the stepsize for simplicity of analysis. The complexity of Algorithm 2 is given in the following theorem, whose proof is in Supplementary Document 2.4.
Theorem 3.3.
Let the sequence be generated by Algorithm 2 using the constant stepsize in (8). Then, the following estimate holds
(13) 
Let . If we choose and for some constants and , then, to guarantee , we require at most
(14) 
Consequently, the total number of stochastic gradient evaluations does not exceed
(15) 
Note that the complexity (15) only holds under the stated condition on the parameters; otherwise, the total complexity differs, with constants independent of the accuracy hidden. Practically, if $\beta_t$ is very close to 1, one can remove the unbiased SGD term to save one stochastic gradient evaluation. In this case, our estimator reduces to SARAH, but with a different stepsize. We observed empirically that when $\beta_t$ is close to 1, the performance of our methods is not affected if we do so.
3.4 Extensions to minibatch cases
We consider a mini-batch hybrid stochastic estimator for the gradient $\nabla F(x_t)$ defined as:
(16) $v_t := \beta_t\Big(v_{t-1} + \tfrac{1}{b_t}\textstyle\sum_{\zeta\in\mathcal{B}_t}\big[\nabla f(x_t;\zeta) - \nabla f(x_{t-1};\zeta)\big]\Big) + (1-\beta_t)\, u_t,$
where $\mathcal{B}_t$ is a mini-batch of size $b_t$, independent of the unbiased estimator $u_t$. Note that $u_t$ can also be a mini-batch unbiased estimator; for example, $u_t := \tfrac{1}{\hat{b}_t}\sum_{\xi\in\widehat{\mathcal{B}}_t}\nabla f(x_t;\xi)$ is a mini-batch SGD estimator with a mini-batch $\widehat{\mathcal{B}}_t$ of size $\hat{b}_t$, where $\widehat{\mathcal{B}}_t$ is independent of $\mathcal{B}_t$.
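A sketch of this mini-batch estimator, with the two independent batches passed as explicit lists of samples; the function and argument names are our own illustrative choices.

```python
import numpy as np

def minibatch_hybrid_estimator(grad, x_t, x_prev, v_prev, beta, B_t, Bhat_t):
    """Mini-batch hybrid estimator: the SARAH difference is averaged over the
    mini-batch B_t, and the unbiased SGD part over an independent mini-batch
    Bhat_t (both given as lists of samples)."""
    diff = np.mean([grad(x_t, s) - grad(x_prev, s) for s in B_t], axis=0)
    u = np.mean([grad(x_t, s) for s in Bhat_t], axis=0)
    return beta * (v_prev + diff) + (1.0 - beta) * u
```

As in the single-sample case, `beta = 0` gives a mini-batch SGD estimator and `beta = 1` gives a mini-batch SARAH estimator.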
Using $v_t$ defined by (16), we can design a mini-batch variant of Algorithm 1 to solve (1). The following corollary is a consequence of Theorem 3.1 for the mini-batch variant of Algorithm 1; its proof is in Subsection 3.2 of the supplementary document.
Corollary 3.1.
Let Algorithm 1 be applied to solve (1) using the mini-batch update (16) with fixed mini-batch sizes and the stepsize
(17) 
If we choose for any , then to guarantee , we need to choose
(18) 
Therefore, the number of oracle calls is
(19) 
where if is finite, and , otherwise. In particular, if we choose for some , then, the overall complexity is .
4 Numerical experiments
We verify our algorithms on two numerical examples and compare them with several existing methods: SVRG [24], SVRG+ [12], SPIDER [9], SpiderBoost [28], and SGD [10]. Due to space limits, the detailed configuration of our experiments, as well as additional numerical experiments, can be found in Supplementary Document D. Our numerical experiments are implemented in Python and run on a MacBook Pro laptop with a 2.7GHz Intel Core i5 and 16GB of memory.
4.1 Logistic regression with nonconvex regularizer
Our first example is the following well-known problem used in many papers, including [28]:
(20) $\min_{w\in\mathbb{R}^d}\Big\{ F(w) := \tfrac{1}{n}\textstyle\sum_{i=1}^n \log\big(1 + \exp(-b_i a_i^{\top} w)\big) + \lambda \sum_{j=1}^d \tfrac{w_j^2}{1 + w_j^2} \Big\},$
where $(a_i, b_i) \in \mathbb{R}^d\times\{-1, 1\}$ are given for $i = 1, \dots, n$, and $\lambda > 0$ is a regularization parameter. Clearly, problem (20) fits (1) well. In this experiment, we fix the regularization parameter and normalize the data. One can also verify Assumption 1.1 due to the bounded Hessian of the objective.
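For reference, a sketch of this objective and its full gradient, assuming the commonly used form of this example from [28]: the logistic loss plus the nonconvex regularizer $\lambda\sum_j w_j^2/(1+w_j^2)$. The function name and the sample data in the check are our own.

```python
import numpy as np

def loss_and_grad(w, A, y, lam):
    """Logistic loss with the nonconvex regularizer lam * sum(w_j^2 / (1 + w_j^2)).
    A: (n, d) data matrix; y in {-1, +1}^n; w: parameter vector."""
    t = y * (A @ w)
    # log(1 + exp(-t)) computed stably via logaddexp
    loss = np.mean(np.logaddexp(0.0, -t)) + lam * np.sum(w**2 / (1.0 + w**2))
    # d/dw of the logistic term: -(1/n) * A^T (y / (1 + exp(t)))
    g_logit = -(A.T @ (y / (1.0 + np.exp(t)))) / A.shape[0]
    # d/dw_j of w_j^2/(1 + w_j^2) is 2 w_j / (1 + w_j^2)^2
    g_reg = lam * 2.0 * w / (1.0 + w**2) ** 2
    return loss, g_logit + g_reg
```

The gradient can be validated against central finite differences on random data.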
We use three datasets from LibSVM for (20): w8a, rcv1.binary, and real-sim. We run the following algorithms: Algorithm 1 with constant stepsize (HybridSGDSL) and adaptive stepsize (HybridSGDASL), using our theoretical stepsizes (8) and (12), respectively, without tuning; HybridSGDDL, which is Algorithm 2; SGD1, which is SGD with a constant stepsize; and SGD2, which is SGD with an adaptive stepsize. Since the stepsize of SPIDER depends on an accuracy $\varepsilon$, we choose it so as to obtain a larger stepsize. Our first result, in the single-sample case (i.e., not using a mini-batch), is plotted in Fig. 1.
From Fig. 1, we observe that HybridSGDSL has similar convergence behavior to SGD1, but HybridSGDASL works better. HybridSGDDL is the best, but shows some oscillation. SGD2 works better than SGD1 and is comparable with HybridSGDSL/ASL on the last two datasets. SVRG performs very poorly due to its small stepsize. SVRG+ works much better than SVRG and is comparable with our methods. SPIDER is also slow, even though we have increased its stepsize.
Now, we run the three single-loop algorithms with a mini-batch. The result is shown in Fig. 2.
4.2 Binary classification involving nonconvex loss and Tikhonov’s regularizer
We consider the following binary classification problem studied in [30], involving a nonconvex loss:
(21) $\min_{w\in\mathbb{R}^d}\Big\{ F(w) := \tfrac{1}{n}\textstyle\sum_{i=1}^n \ell\big(b_i a_i^{\top} w\big) + \tfrac{\lambda}{2}\|w\|^2 \Big\},$
where $a_i$ and $b_i$ are given data for $i = 1, \dots, n$, $\lambda > 0$ is a regularization parameter, and $\ell$ is a nonconvex loss (of the form used in two-layer neural networks). One can check that (21) satisfies Assumption 1.1. We test three variants of Algorithm 2 (HybridSGDDL) and compare them with SpiderBoost, SVRG, and SVRG+. Due to space limits, we only plot one experiment in Fig. 3; additional experiments can be found in Supplementary Document D. As we can see from Fig. 3, Algorithm 2 performs well and is slightly better than SpiderBoost. Note that SpiderBoost simply uses the SARAH estimator with a constant stepsize and a mini-batch, so it is not surprising that it makes very good progress in decreasing the gradient norms. Both SVRG and SVRG+ perform much worse than HybridSGDDL and SpiderBoost in this test, but SVRG+ is slightly better than SVRG. Thanks to the SARAH part, our methods make similar progress to SpiderBoost, but with different stepsizes.
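As an illustration of such a nonconvex loss, the sketch below uses the squared-sigmoid loss $\ell(s) = (1 - 1/(1+e^{-s}))^2$, often associated with two-layer networks, together with the Tikhonov regularizer; whether this matches the exact loss chosen in [30] is an assumption, and the names are our own.

```python
import numpy as np

def nonconvex_loss_objective(w, A, y, lam):
    """Objective of the form (1/n) * sum_i ell(y_i * a_i^T w) + (lam/2) * ||w||^2
    with the nonconvex loss ell(s) = (1 - sigmoid(s))^2 (an assumed choice)."""
    s = y * (A @ w)
    sig = 1.0 / (1.0 + np.exp(-s))
    loss = np.mean((1.0 - sig) ** 2) + 0.5 * lam * np.sum(w**2)
    # chain rule: d ell / ds = 2 * (1 - sig) * (-sig * (1 - sig)) = -2 * sig * (1 - sig)^2
    dl_ds = -2.0 * sig * (1.0 - sig) ** 2
    grad = (A.T @ (dl_ds * y)) / A.shape[0] + lam * w
    return loss, grad
```

Again, the gradient can be validated with central finite differences.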
5 Conclusion
We have introduced a new hybrid SARAH-SGD estimator for the objective gradient of expectation optimization problems. Under standard assumptions, we have shown that this estimator has a better variance reduction property than SARAH. By exploiting this estimator, we have developed a new HybridSGD algorithm, Algorithm 1, that has better complexity bounds than state-of-the-art SGDs. Our algorithm works with both constant and adaptive stepsizes. We have also studied its double-loop and mini-batch variants. We believe that our approach can be extended to other choices of unbiased estimators, to Hessian estimators for second-order stochastic methods, and to adaptive weights $\beta_t$.
Appendix A Appendix: Properties of the hybrid stochastic estimator
This supplementary document provides the full proof of all the results in the main text. First, we need the following lemma in the sequel.
Lemma A.1.
Given and . Let be the sequence updated by
(22) 
for . Then
(23) 
Proof.
First, from (22) it is straightforward to show that . At the same time, since , we have . By Chebyshev's sum inequality, we have
(24) 
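Chebyshev's sum inequality states that for two similarly ordered sequences, the mean of the products dominates the product of the means; a quick numeric sanity check (with arbitrary random data):

```python
import numpy as np

# For a_1 <= ... <= a_n and b_1 <= ... <= b_n (similarly ordered),
# Chebyshev's sum inequality gives mean(a * b) >= mean(a) * mean(b).
rng = np.random.default_rng(1)
a = np.sort(rng.random(100))
b = np.sort(rng.random(100))
lhs = np.mean(a * b)
rhs = np.mean(a) * np.mean(b)
```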
From the update (22), we also have
(25) 
Let and . Summing up both sides of the above inequalities, we get
Using again Chebyshev’s sum inequality, we have
Note that by the Cauchy-Schwarz inequality, which shows that . Combining the last three inequalities, we obtain the following quadratic inequality in
Solving this inequality with respect to , we obtain