Consider the following stochastic nonconvex optimization problem:
$$\min_{x\in\mathbb{R}^d}\Big\{ F(x) := \mathbb{E}_{\xi}\big[f(x;\xi)\big]\Big\}, \tag{1}$$
where $f(\cdot;\xi)$ is a stochastic function such that for each $x \in \mathbb{R}^d$, $f(x;\cdot)$ is measurable, while for each realization $\xi$, $f(\cdot;\xi)$ is smooth on $\mathbb{R}^d$; and $F$ is the expectation of $f(\cdot;\xi)$ w.r.t. $\xi$ over its probability space.
Our goals and assumptions:
Since (1) is nonconvex, our goal in this paper is to develop a new class of stochastic gradient algorithms to find an $\varepsilon$-approximate stationary point $x_T$ of (1), i.e., $\mathbb{E}\big[\|\nabla F(x_T)\|^2\big] \le \varepsilon^2$, under the mild conditions stated in Assumption 1.1.
The objective function of (1) satisfies the following conditions:
(Boundedness from below) There exists a finite lower bound $F^{\star} := \inf_{x\in\mathbb{R}^d} F(x) > -\infty$.
($L$-average smoothness) The function $f(\cdot;\xi)$ is $L$-average smooth on $\mathbb{R}^d$, i.e., there exists $L \in (0,+\infty)$ such that
$$\mathbb{E}_{\xi}\big[\|\nabla f(x;\xi) - \nabla f(y;\xi)\|^2\big] \le L^2\|x - y\|^2, \quad \forall x, y \in \mathbb{R}^d.$$
(Bounded variance) There exists $\sigma \in (0,+\infty)$ such that
$$\mathbb{E}_{\xi}\big[\|\nabla f(x;\xi) - \nabla F(x)\|^2\big] \le \sigma^2, \quad \forall x \in \mathbb{R}^d.$$
These assumptions are standard in stochastic optimization methods [10, 14]. The $L$-average smoothness of $f(\cdot;\xi)$ is weaker than requiring $f(\cdot;\xi)$ to be $L$-smooth for every realization $\xi$. Note that the methods described in the sequel are also applicable to the finite-sum problem as long as the above assumptions hold; however, we do not specialize our methods to that setting. In the finite-sum case, $\sigma^2$ in (3) can be replaced by suitable finite-sum alternatives.
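As a quick illustration (not from the paper), both conditions can be checked numerically on a toy finite-sum least-squares instance; all names and constants below are our own choices:

```python
import numpy as np

# Toy finite-sum instance f_i(x) = 0.5 * (a_i^T x - b_i)^2 with
# F(x) = (1/n) * sum_i f_i(x); we check the L-average smoothness and
# bounded-variance conditions empirically at random points.
rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(x, i):
    # Gradient of the i-th component f_i at x.
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    return A.T @ (A @ x - b) / n

# For this instance, E_i ||grad_i(x) - grad_i(y)||^2 <= L^2 ||x - y||^2
# holds with L^2 = (1/n) * sum_i ||a_i||^4.
L2 = np.mean(np.sum(A**2, axis=1) ** 2)
x, y = rng.standard_normal(d), rng.standard_normal(d)
avg_sq = np.mean([np.sum((grad_i(x, i) - grad_i(y, i)) ** 2) for i in range(n)])
print("average smoothness holds:", avg_sq <= L2 * np.sum((x - y) ** 2) + 1e-9)

# Bounded variance at x: the empirical analogue of (3) is finite.
var_x = np.mean([np.sum((grad_i(x, i) - full_grad(x)) ** 2) for i in range(n)])
print("empirical gradient variance at x:", var_x)
```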
Our key idea:
Different from existing methods, we introduce a convex combination of a biased estimator and an unbiased estimator of the gradient of $F$, which we call a hybrid stochastic gradient estimator. While the biased estimator exploited in this paper is SARAH, the unbiased one can be any unbiased estimator. SARAH is a recursive, biased, variance-reduced estimator of $\nabla F$. Combining it with an unbiased estimator allows us to reduce both the bias and the variance of the hybrid estimator. In this paper, we focus only on the standard stochastic gradient (SGD) estimator as the unbiased candidate.
To this end, our contribution can be summarized as follows:
We propose a hybrid stochastic estimator for the gradient of the nonconvex function in (1) by combining the SARAH estimator with any unbiased stochastic estimator, such as the SGD or SVRG estimator; in this paper, however, we focus only on the SGD estimator. We prove some key properties of this hybrid estimator that can be used to design new algorithms.
We exploit our hybrid estimator to develop a single-loop SGD algorithm that achieves an $\varepsilon$-stationary point $x_T$ with $\mathbb{E}\big[\|\nabla F(x_T)\|^2\big] \le \varepsilon^2$ in at most $\mathcal{O}(\varepsilon^{-3})$ stochastic gradient evaluations. This complexity significantly improves the $\mathcal{O}(\varepsilon^{-4})$ complexity of SGD. We extend our algorithm to a double-loop variant, which also requires $\mathcal{O}(\varepsilon^{-3})$ stochastic gradient evaluations. This is the best-known complexity in the literature for stochastic gradient-type methods for solving (1).
We also investigate other variants of our method, including adaptive step-sizes and mini-batches. In all these cases, our methods achieve the best-known complexity bounds.
Let us emphasize the following points of our contribution. Firstly, although our single-loop method requires three gradient evaluations per iteration instead of one as in standard SGDs, it achieves a better complexity bound. Secondly, it can be cast as a variance reduction method that starts from a “good” approximation of the gradient and aggressively reduces the variance. Thirdly, our method allows a larger step-size than SGDs. Fourthly, the step-size of the adaptive variant is increasing instead of diminishing as in SGDs. Finally, our method achieves the same best-known complexity as the variance reduction methods studied in [9, 21, 28]. We believe that our approach can be extended to other estimators such as SVRG and SAGA, and can be applied to Hessians to develop second-order methods, as well as to convex and composite problems.
The rest of this paper is organized as follows. Section 2 introduces our new hybrid stochastic estimator for the gradient of $F$ and investigates its properties. Section 3 proposes a single-loop hybrid SGD-SARAH algorithm together with its complexity analysis; it also considers double-loop and mini-batch variants with rigorous complexity analysis. Section 4 provides two numerical examples to illustrate our methods and compares them with state-of-the-art methods. All proofs and additional experiments can be found in the Supplementary Document.
Notation: We work with Euclidean spaces $\mathbb{R}^d$ equipped with the standard inner product and norm $\|\cdot\|$. For a smooth (i.e., continuously differentiable) function $f$, $\nabla f$ denotes its gradient. We write a distribution on $\{1,\dots,n\}$ by listing its probabilities; if it is uniform, we simply omit it. We also use $\mathcal{O}(\cdot)$ to denote the big-O notation in complexity theory, and $\sigma(\cdot)$ to denote a $\sigma$-field.
2 Hybrid stochastic gradient estimators
In this section, we propose new stochastic estimators for the gradient of the smooth function $F$ defined in (1).
Let $u_t$ be an unbiased estimator of $\nabla F(x_t)$ formed by a realization $\xi_t$ of $\xi$, i.e., $\mathbb{E}\big[u_t \mid \mathcal{F}_t\big] = \nabla F(x_t)$. We develop the following stochastic estimator for $\nabla F(x_t)$ in (1):
$$v_t := \beta\,v_{t-1} + \beta\big(\nabla f(x_t;\zeta_t) - \nabla f(x_{t-1};\zeta_t)\big) + (1-\beta)\,u_t, \qquad \beta \in [0,1], \tag{4}$$
where $\zeta_t$ and $\xi_t$ are two independent realizations of $\xi$ on $\Omega$. Clearly, if $\beta = 0$, then we obtain a simple unbiased stochastic estimator, while if $\beta = 1$, we obtain the SARAH estimator. We are interested in the case $\beta \in (0,1)$, in which we call $v_t$ in (4) a hybrid stochastic estimator.
Note that we can rewrite $v_t$ as
$$v_t = \beta\,\nabla f(x_t;\zeta_t) + (1-\beta)\,u_t + \beta\big(v_{t-1} - \nabla f(x_{t-1};\zeta_t)\big).$$
The first two terms are stochastic gradients estimated at $x_t$, while the third term is the difference between the previous estimator and a stochastic gradient at the previous iterate. Since $\beta \in (0,1)$, the main idea is to weight recent information more heavily than old information. In fact, the hybrid estimator covers many other estimators, including SGD, SVRG, and SARAH. We can use one of the following two concrete unbiased estimators $u_t$ of $\nabla F(x_t)$:
The SGD estimator: $u_t := \nabla f(x_t;\xi_t)$.
The SVRG estimator: $u_t := \nabla F(\tilde{x}) + \nabla f(x_t;\xi_t) - \nabla f(\tilde{x};\xi_t)$, where $\nabla F(\tilde{x})$ is a full gradient evaluated at a given snapshot point $\tilde{x}$.
However, for the sake of presentation, we focus only on the SGD estimator $u_t := \nabla f(x_t;\xi_t)$.
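To make the update concrete, here is a minimal sketch of the hybrid update with an SGD unbiased term on a toy finite-sum problem; the function names, the problem instance, and the value of beta are our illustrative choices, not the paper's:

```python
import numpy as np

# Hybrid update sketch: v_t = beta * v_{t-1}
#                             + beta * (g(x_t; z) - g(x_{t-1}; z))
#                             + (1 - beta) * g(x_t; xi),
# with z and xi drawn independently, following the hybrid estimator (4).
rng = np.random.default_rng(1)
n, d = 100, 4
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def stoch_grad(x, i):
    # Unbiased stochastic gradient of F(x) = (1/n) sum_i 0.5*(a_i^T x - b_i)^2.
    return (A[i] @ x - b[i]) * A[i]

def hybrid_update(v_prev, x_prev, x_cur, beta, rng):
    z = rng.integers(n)   # sample for the biased (SARAH) part
    xi = rng.integers(n)  # independent sample for the unbiased (SGD) part
    return (beta * v_prev
            + beta * (stoch_grad(x_cur, z) - stoch_grad(x_prev, z))
            + (1.0 - beta) * stoch_grad(x_cur, xi))

# beta = 1 recovers the SARAH update; beta = 0 recovers plain SGD.
x_prev = np.zeros(d)
x_cur = 0.1 * rng.standard_normal(d)
v0 = A.T @ (A @ x_prev - b) / n   # start from a full-gradient snapshot
v1 = hybrid_update(v0, x_prev, x_cur, beta=0.9, rng=rng)
print(v1)
```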
We first prove the following properties of the estimator $v_t$, showing how its variance can be bounded.
Let $v_t$ be defined by (4). Then
$$\mathbb{E}\big[v_t \mid \mathcal{F}_t\big] = \nabla F(x_t) + \beta\big(v_{t-1} - \nabla F(x_{t-1})\big).$$
Hence, if $\beta \in (0,1)$, then $v_t$ is a biased estimator of $\nabla F(x_t)$, and its bias can be bounded explicitly.
Assume that $f(\cdot;\xi)$ is $L$-smooth and $u_t$ is an SGD estimator. Then the variance $\mathbb{E}\big[\|v_t - \nabla F(x_t)\|^2\big]$ admits an explicit upper bound, where the expectation is taken over all the randomness $\zeta_j$ and $\xi_j$ for $j = 1,\dots,t$.
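The variance reduction effect can be illustrated by Monte-Carlo simulation on a toy least-squares instance (ours, not the paper's): with the previous estimator set to the exact gradient and two nearby iterates, the hybrid estimator's mean-squared error falls far below that of the plain SGD estimator.

```python
import numpy as np

# Monte-Carlo comparison of the hybrid estimator's MSE against plain SGD.
# With v_{t-1} equal to the exact gradient at x_{t-1} and x_t close to
# x_{t-1}, the hybrid MSE is roughly (1 - beta)^2 times the SGD variance.
rng = np.random.default_rng(2)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
grad_F = lambda x: A.T @ (A @ x - b) / n

x_prev = rng.standard_normal(d)
x_cur = x_prev + 0.01 * rng.standard_normal(d)
g_true, v_prev, beta = grad_F(x_cur), grad_F(x_prev), 0.95

trials, mse_hyb, mse_sgd = 5000, 0.0, 0.0
for _ in range(trials):
    z, xi = rng.integers(n), rng.integers(n)
    v = (beta * v_prev + beta * (grad_i(x_cur, z) - grad_i(x_prev, z))
         + (1 - beta) * grad_i(x_cur, xi))
    mse_hyb += np.sum((v - g_true) ** 2) / trials
    mse_sgd += np.sum((grad_i(x_cur, xi) - g_true) ** 2) / trials

print(f"MSE hybrid = {mse_hyb:.4f}  vs  MSE SGD = {mse_sgd:.4f}")
```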
3 Hybrid SARAH-SGD algorithms
3.1 The generic algorithm framework
Algorithm 1 looks essentially the same as any single-loop SGD scheme. The differences are at Step 3, where a mini-batch estimator initializes the search direction, and at Step 6, where we use our hybrid gradient estimator $v_t$. In addition, we will show in the sequel that it uses different step-sizes, leading to different variants. Unlike the inner loop of SARAH or SVRG, each iteration of Algorithm 1 requires three individual gradient evaluations instead of two as in those methods. The snapshot at Step 3 of Algorithm 1 relies on a mini-batch whose size is independent of the iteration counter of the loop.
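A compact sketch of this single-loop scheme on a synthetic least-squares problem follows; the problem instance, step-size, mini-batch size, and weight beta are placeholder choices for the demo, not the theoretical values from Section 3.2:

```python
import numpy as np

# Single-loop sketch: snapshot v_0 from a mini-batch, then SGD steps with
# the hybrid gradient estimator (three stochastic gradients per iteration).
rng = np.random.default_rng(3)
n, d = 200, 10
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(n)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]   # stochastic gradient
grad_F = lambda x: A.T @ (A @ x - b) / n         # full gradient (for checks)

def hybrid_sgd(T=2000, eta=0.05, beta=0.95, batch=32):
    x = np.zeros(d)
    # Step 3: snapshot estimate v_0 from an initial mini-batch.
    S = rng.choice(n, size=batch, replace=False)
    v = np.mean([grad_i(x, i) for i in S], axis=0)
    for _ in range(T):
        x_new = x - eta * v                      # SGD step along v
        z, xi = rng.integers(n), rng.integers(n)
        # Step 6: hybrid update with independent samples z and xi.
        v = (beta * v + beta * (grad_i(x_new, z) - grad_i(x, z))
             + (1 - beta) * grad_i(x_new, xi))
        x = x_new
    return x

x_out = hybrid_sgd()
print("final gradient norm:", np.linalg.norm(grad_F(x_out)))
```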
3.2 Convergence analysis
We analyze two cases: constant step-size and adaptive step-size. In both cases, the weight $\beta$ is fixed for all iterations.
3.2.a Convergence of Algorithm 1 with constant step-size and constant weight $\beta$
The step-size satisfies . In addition, we have
If we choose the parameters appropriately, then to guarantee $\mathbb{E}\big[\|\nabla F(\bar{x}_T)\|^2\big] \le \varepsilon^2$, we need to choose
In particular, if we choose , then the number of oracle calls is
Moreover, the step-size satisfies .
3.2.b Convergence of Algorithm 1 with adaptive step-size and constant weight $\beta$
Let be fixed for some . Instead of fixing step-size as in (8), we can update it adaptively as
It can be shown that the resulting step-sizes satisfy the required condition. Interestingly, our step-size is updated in an increasing manner instead of diminishing as in existing SGD-type methods. Moreover, given the iteration budget, we can pre-compute the whole sequence of step-sizes in advance with only basic operations. Therefore, it does not significantly increase the computational cost of our method.
Note that in the finite-sum case, i.e., when $F(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, the same setting applies in both Theorems 3.1 and 3.2, and the complexity remains the same as in Theorem 3.1. However, the adaptive step-size potentially gives better performance in practice, as we will see in Section 4.
Algorithm 1 can be considered a single-loop variance reduction method similar to SAGA, but it aims at solving the nonconvex problem (1). It differs from standard SGD methods in that it is initialized with a mini-batch estimate and then updates the estimator using three individual gradients per iteration. Therefore, its per-iteration cost is the same as that of SGD with a mini-batch of size 3. In compensation, we obtain the improved complexity bounds of Theorems 3.1 and 3.2.
3.3 Convergence analysis of the double loop variant
Since the step-size depends on the number of iterations, it is natural to run Algorithm 1 in multiple stages. This leads to a double-loop algorithm, as in SVRG, SARAH, and SPIDER, where Algorithm 1 is restarted at each outer iteration. The details of this variant are described in Algorithm 2.
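The restarted scheme can be sketched as follows under the same kind of toy setup (our illustrative instance and parameters, not Algorithm 2's theoretical ones):

```python
import numpy as np

# Double-loop sketch: each outer stage takes a fresh mini-batch snapshot
# and then runs the single-loop hybrid updates for a fixed inner horizon.
rng = np.random.default_rng(4)
n, d = 200, 8
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(n)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
grad_F = lambda x: A.T @ (A @ x - b) / n

def double_loop(S=5, T=400, eta=0.05, beta=0.95, batch=64):
    x = np.zeros(d)
    for _ in range(S):                       # outer loop: restart
        I = rng.choice(n, size=batch, replace=False)
        v = np.mean([grad_i(x, i) for i in I], axis=0)  # fresh snapshot
        for _ in range(T):                   # inner loop: Algorithm 1 steps
            x_new = x - eta * v
            z, xi = rng.integers(n), rng.integers(n)
            v = (beta * v + beta * (grad_i(x_new, z) - grad_i(x, z))
                 + (1 - beta) * grad_i(x_new, xi))
            x = x_new
    return x

x_out = double_loop()
print("final gradient norm:", np.linalg.norm(grad_F(x_out)))
```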
Let . If we choose and for some constants and and , then, to guarantee , we require at most
Consequently, the total number of stochastic gradient evaluations does not exceed
Note that the complexity (15) only holds in the stated parameter regime; otherwise, the total complexity is as given above, where constants independent of the accuracy are hidden. Practically, if $\beta$ is very close to $1$, one can remove the unbiased SGD term to save one stochastic gradient evaluation per iteration. In this case, our estimator reduces to SARAH, but with a different step-size. We observed empirically that when $\beta \approx 1$, the performance of our methods is not affected if we do so.
3.4 Extensions to mini-batch cases
We consider a mini-batch hybrid stochastic estimator for the gradient, defined as:
$$\hat{v}_t := \beta\,\hat{v}_{t-1} + \frac{\beta}{\hat{b}}\sum_{\zeta_i\in\hat{\mathcal{B}}_t}\big(\nabla f(x_t;\zeta_i) - \nabla f(x_{t-1};\zeta_i)\big) + (1-\beta)\,u_t, \tag{16}$$
where $\hat{\mathcal{B}}_t$ is a mini-batch of size $\hat{b}$, independent of the unbiased estimator $u_t$. Note that $u_t$ can also be a mini-batch unbiased estimator. For example, $u_t := \frac{1}{b}\sum_{\xi_i\in\mathcal{B}_t}\nabla f(x_t;\xi_i)$ is a mini-batch SGD estimator with a mini-batch $\mathcal{B}_t$ of size $b$, where $\mathcal{B}_t$ is independent of $\hat{\mathcal{B}}_t$.
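A minimal sketch of this mini-batch estimator (with our own toy instance, names, and batch sizes) reads:

```python
import numpy as np

# Mini-batch hybrid estimator sketch: the SARAH-style difference is averaged
# over a batch B_hat, the unbiased SGD part over an independent batch B.
rng = np.random.default_rng(5)
n, d = 300, 6
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]

def minibatch_hybrid(v_prev, x_prev, x_cur, beta, b_hat=16, b_sgd=16):
    B_hat = rng.choice(n, size=b_hat, replace=False)  # batch for the SARAH part
    B = rng.choice(n, size=b_sgd, replace=False)      # independent SGD batch
    diff = np.mean([grad_i(x_cur, i) - grad_i(x_prev, i) for i in B_hat], axis=0)
    u = np.mean([grad_i(x_cur, i) for i in B], axis=0)
    return beta * (v_prev + diff) + (1.0 - beta) * u

x_prev = np.zeros(d)
x_cur = 0.05 * rng.standard_normal(d)
v0 = A.T @ (A @ x_prev - b) / n    # full-gradient snapshot at x_prev
v1 = minibatch_hybrid(v0, x_prev, x_cur, beta=0.9)
print(v1)
```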
Using the estimator defined by (16), we can design a mini-batch variant of Algorithm 1 to solve (1). The following corollary is a consequence of Theorem 3.1 for the mini-batch variant of Algorithm 1; its proof is in Subsection 3.2 of the supplementary document.
If we choose the parameters appropriately, then to guarantee $\mathbb{E}\big[\|\nabla F(\bar{x}_T)\|^2\big] \le \varepsilon^2$, we need to choose
Therefore, the number of oracle calls is
where the first case applies when the problem is a finite sum, and the second case otherwise. In particular, for an appropriate choice of the mini-batch size, the overall complexity matches the best-known bound.
4 Numerical experiments
We verify our algorithms on two numerical examples and compare them with several existing methods: SVRG, SVRG+, SPIDER, SpiderBoost, and SGD. Due to space limits, the detailed configuration of our experiments, as well as additional numerical experiments, can be found in Supplementary Document D. Our numerical experiments are implemented in Python and run on a MacBook Pro laptop with a 2.7GHz Intel Core i5 and 16GB of memory.
4.1 Logistic regression with nonconvex regularizer
Our first example is the following well-known problem used in many papers:
where the data points $(a_i, b_i)$ are given for $i = 1,\dots,n$, and $\lambda > 0$ is a regularization parameter. Clearly, problem (20) fits (1) well. In this experiment, we normalize the data, and one can verify Assumption 1.1 due to the bounded Hessian of the objective.
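As a concrete (assumed) instance of this setup, the snippet below implements a logistic loss with the nonconvex regularizer $r(w) = \lambda \sum_j w_j^2/(1+w_j^2)$, a common choice in this line of work, and verifies the analytic gradient by finite differences; whether this matches the exact regularizer of (20) is our assumption:

```python
import numpy as np

# Logistic loss with a nonconvex regularizer (an assumed instance of (20)):
# F(w) = (1/n) sum_i log(1 + exp(-y_i * x_i^T w)) + lam * sum_j w_j^2/(1+w_j^2).
rng = np.random.default_rng(6)
n, d = 40, 7
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
lam = 0.01

def F(w):
    loss = np.mean(np.log1p(np.exp(-y * (X @ w))))
    reg = lam * np.sum(w**2 / (1.0 + w**2))
    return loss + reg

def grad_F(w):
    s = -y / (1.0 + np.exp(y * (X @ w)))      # derivative of each logistic term
    loss_grad = X.T @ s / n
    reg_grad = lam * 2.0 * w / (1.0 + w**2) ** 2
    return loss_grad + reg_grad

# Finite-difference check that grad_F is consistent with F.
w = rng.standard_normal(d)
g = grad_F(w)
eps = 1e-6
g_fd = np.array([(F(w + eps * e) - F(w - eps * e)) / (2 * eps)
                 for e in np.eye(d)])
print("max gradient error:", np.max(np.abs(g - g_fd)))
```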
We use three datasets from LibSVM for (20): w8a, rcv1.binary, and real-sim. We run the following algorithms: Algorithm 1 with constant step-size (Hybrid-SGD-SL) and with adaptive step-size (Hybrid-SGD-ASL), using our theoretical step-sizes (8) and (12), respectively, without tuning; Hybrid-SGD-DL, which is Algorithm 2; SGD1, i.e., SGD with constant step-size; and SGD2, i.e., SGD with adaptive step-size. Since the step-size of SPIDER depends on an accuracy $\varepsilon$, we choose a moderate accuracy to obtain a larger step-size. Our first result, in the single-sample case (i.e., without mini-batching), is plotted in Fig. 1.
From Fig. 1, we observe that Hybrid-SGD-SL has similar convergence behavior to SGD1, but Hybrid-SGD-ASL works better. Hybrid-SGD-DL is the best, but exhibits some oscillation. SGD2 works better than SGD1 and is comparable with Hybrid-SGD-SL/ASL on the last two datasets. SVRG performs poorly due to its small step-size. SVRG+ works much better than SVRG and is comparable with our methods. SPIDER is also slow, even though we increased its step-size.
Now, we run the three single-loop algorithms with mini-batches. The results are shown in Fig. 2.
4.2 Binary classification involving nonconvex loss and Tikhonov’s regularizer
We consider the following binary classification problem involving a nonconvex loss, which has been studied in the literature:
where the data points $(x_i, y_i)$ are given for $i = 1,\dots,n$, $\lambda > 0$ is a regularization parameter, and the loss is a nonconvex function of one of several standard forms (e.g., those used in two-layer neural networks). One can check that (21) satisfies Assumption 1.1. We test three variants of Algorithm 2 (Hybrid-SGD-DL) and compare them with SpiderBoost, SVRG, and SVRG+. Due to space limits, we only plot one experiment in Fig. 3. Additional experiments can be found in Supplementary Document D.
As we can see from Fig. 3, Algorithm 2 performs well and is slightly better than SpiderBoost. Note that SpiderBoost simply uses the SARAH estimator with a constant step-size but with a large mini-batch. It is not surprising that SpiderBoost makes very good progress in decreasing the gradient norms. Both SVRG and SVRG+ perform much worse than Hybrid-SGD-DL and SpiderBoost in this test, although SVRG+ is slightly better than SVRG. Thanks to the SARAH component, our methods make similar progress to SpiderBoost while using different step-sizes.
5 Conclusion
We have introduced a new hybrid SARAH-SGD estimator for the objective gradient of the expectation problem (1). Under standard assumptions, we have shown that this estimator has a better variance reduction property than SARAH. By exploiting this estimator, we have developed a new Hybrid-SGD algorithm, Algorithm 1, that has better complexity bounds than state-of-the-art SGDs. Our algorithm works with both constant and adaptive step-sizes. We have also studied its double-loop and mini-batch variants. We believe that our approach can be extended to other choices of unbiased estimators, to Hessian estimators for second-order stochastic methods, and to adaptive weights $\beta_t$.
Appendix A Appendix: Properties of the hybrid stochastic estimator
This supplementary document provides the full proofs of all results in the main text. First, we need the following lemma.
Given and . Let be the sequence updated by
for . Then
First, from (22), it is straightforward to show that . At the same time, since , we have . By Chebyshev’s sum inequality, we have
From the update (22), we also have
Let and . Summing up both sides of the above inequalities, we get
Using again Chebyshev’s sum inequality, we have
Note that, by the Cauchy–Schwarz inequality, , which shows that . Combining the last three inequalities, we obtain the following quadratic inequality in
Solving this inequality with respect to , we obtain