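In this paper, we consider the following stochastic composite, nonconvex, and possibly nonsmooth optimization problem:
$$\min_{x\in\mathbb{R}^d}\Big\{ F(x) := f(x) + \psi(x) \equiv \mathbb{E}\big[f(x;\xi)\big] + \psi(x) \Big\}, \tag{1}$$
where $f(x) := \mathbb{E}[f(x;\xi)]$ is the expectation of a stochastic function $f(x;\xi)$ depending on a random vector $\xi$ in a given probability space, and $\psi : \mathbb{R}^d \to \mathbb{R}\cup\{+\infty\}$ is a proper, closed, and convex function.
When $\xi$ is uniformly distributed over a finite support, (1) reduces to the composite finite-sum problem:
$$\min_{x\in\mathbb{R}^d}\Big\{ F(x) := f(x) + \psi(x) \equiv \frac{1}{n}\sum_{i=1}^{n} f_i(x) + \psi(x) \Big\}, \tag{2}$$
where $f_i(x) := f(x;\xi_i)$ for $i = 1,\dots,n$. Problem (2) is often referred to as regularized empirical risk minimization in machine learning and finance.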
Problems (1) and (2) cover a broad range of applications in machine learning and statistics, especially in neural networks, see, e.g., (Bottou, 1998, 2010; Bottou et al., 2018; Goodfellow et al., 2016; Sra et al., 2012). Hitherto, state-of-the-art numerical optimization methods for solving these problems rely on stochastic approaches, see, e.g., (Johnson and Zhang, 2013; Schmidt et al., 2017; Shapiro et al., 2009; Defazio et al., 2014). In the convex case, both non-composite and composite settings of (1) and (2) have been intensively studied with different schemes such as standard stochastic gradient (Robbins and Monro, 1951), proximal stochastic gradient (Ghadimi and Lan, 2013; Nemirovski et al., 2009), stochastic dual coordinate descent (Shalev-Shwartz and Zhang, 2013), variance reduction methods (e.g., SVRG and SAGA) (Allen-Zhu, 2017a; Defazio et al., 2014; Johnson and Zhang, 2013; Nitanda, 2014; Schmidt et al., 2017; Xiao and Zhang, 2014), stochastic conditional gradient (Frank-Wolfe) methods (Reddi et al., 2016a), and stochastic primal-dual methods (Chambolle et al., 2018). Thanks to variance reduction techniques, several efficient methods with constant step-sizes have been developed for convex settings that match the lower-bound worst-case complexity (Agarwal et al., 2010). However, variance reduction methods for nonconvex settings are still limited and focus heavily on the non-composite form of (1) and (2), i.e., $\psi = 0$, and on the SVRG estimator.
Theory and stochastic methods for nonconvex problems are still in progress and require substantial effort to obtain efficient algorithms with rigorous convergence guarantees. It is shown in (Fang et al., 2018; Zhou and Gu, 2019) that there is still a gap between the upper-bound complexity in state-of-the-art methods and the lower-bound worst-case complexity for the nonconvex problem (2) under standard smoothness assumption. Motivated by this fact, we make an attempt to develop a new algorithmic framework that can reduce and at least nearly close this gap in the composite finite-sum setting (2). In addition to the best-known complexity bounds, we expect to design practical algorithms advancing beyond existing methods by providing an adaptive rule to update step-sizes with rigorous complexity analysis. Our algorithms rely on a recent biased stochastic estimator for the objective gradient, called SARAH, introduced in (Nguyen et al., 2017a) for convex problems.
In the nonconvex case, both problems (1) and (2) have been intensively studied in recent years in a large number of research papers. While numerical algorithms for solving the non-composite setting, i.e., $\psi = 0$, are well-developed and have received considerable attention (Allen-Zhu, 2017b; Allen-Zhu and Li, 2018; Allen-Zhu and Yuan, 2016; Fang et al., 2018; Lihua et al., 2017; Nguyen et al., 2017b, 2018b, 2019; Reddi et al., 2016b; Zhou et al., 2018), methods for the composite setting remain limited (Reddi et al., 2016b; Wang et al., 2018). In terms of algorithms, (Reddi et al., 2016b) studies a non-composite finite-sum problem as a special case of (2) using the SVRG estimator from (Johnson and Zhang, 2013). Additionally, they extend their method to the composite setting by simply applying the proximal operator of $\psi$ as in the well-known forward-backward scheme. Another related work using the SVRG estimator can be found in (Li and Li, 2018). These algorithms have some limitations, as will be discussed later. The same technique was applied in (Wang et al., 2018) to develop other variants for both (1) and (2), but using the SARAH estimator from (Nguyen et al., 2017a). The authors derive a large constant step-size, but at the same time must control the mini-batch size to achieve the desired complexity bounds. Consequently, their method has an essential limitation, as will also be discussed in Subsection 3.4. Both algorithms achieve the best-known complexity bounds for solving (1) and (2). In (Reddi et al., 2016a), the authors propose a stochastic Frank-Wolfe method that can handle constraints as special cases of (2). Recently, a stochastic variance reduction method with momentum was studied in (Zhou et al., 2019) for solving (2), which can be viewed as a modification of SpiderBoost in (Wang et al., 2018).
Our algorithm remains a variance reduction stochastic method, but it is different from these works at two major points: an additional averaging step and two different step-sizes. Having two step-sizes allows us to flexibly trade-off them and develop an adaptive update rule. Note that our averaging step looks similar to the robust stochastic gradient method in (Nemirovski et al., 2009), but fundamentally different since it evaluates the proximal step at the averaging point. In fact, it is closely related to averaged fixed-point schemes in the literature, see, e.g. (Bauschke and Combettes, 2017).
In terms of theory, many researchers have focused on theoretical aspects of existing algorithms. For example, (Ghadimi and Lan, 2013) appears to be one of the first pioneering works studying convergence rates of stochastic gradient descent-type methods for nonconvex and non-composite finite-sum problems. They later extend it to the composite setting in (Ghadimi et al., 2016). (Wang et al., 2018) also investigate the gradient dominance case, and (Karimi et al., 2016) consider both finite-sum and composite finite-sum under different assumptions.
Whereas many researchers have been trying to improve complexity upper bounds of stochastic first-order methods using different techniques (Allen-Zhu, 2017b; Allen-Zhu and Li, 2018; Allen-Zhu and Yuan, 2016; Fang et al., 2018), other researchers attempt to construct examples for lower-bound complexity estimates. In the convex case, there exist numerous research papers including (Agarwal et al., 2010; Nemirovskii and Yudin, 1983; Nesterov, 2004). In (Fang et al., 2018; Zhou and Gu, 2019), the authors have constructed a lower-bound complexity for the nonconvex finite-sum problem covered by (2). They showed that the lower-bound complexity for any stochastic gradient method using only the smoothness assumption to achieve an $\varepsilon$-stationary point in expectation is $\Omega\big(n^{1/2}\varepsilon^{-2}\big)$, given that the number of objective components $n$ does not exceed $\mathcal{O}\big(\varepsilon^{-4}\big)$.
For the expectation problem (1), the best-known complexity bound to achieve an $\varepsilon$-stationary point in expectation is $\mathcal{O}\big(\sigma\varepsilon^{-3} + \sigma^2\varepsilon^{-2}\big)$ as shown in (Fang et al., 2018; Wang et al., 2018), where $\sigma > 0$ is an upper bound of the variance (see Assumption 2.3). Unfortunately, we have not seen any lower-bound complexity for the nonconvex setting of (1) under standard assumptions in the literature.
Our approach and contribution:
We exploit the SARAH estimator, a biased stochastic recursive gradient estimator introduced in (Nguyen et al., 2017a), to design new proximal variance reduction stochastic gradient algorithms to solve both composite expectation and finite-sum problems (1) and (2). The SARAH algorithm is simply a double-loop stochastic gradient method with a flavor of SVRG (Johnson and Zhang, 2013), but it uses a novel biased estimator that is different from SVRG. SARAH is a recursive method like SAGA (Defazio et al., 2014), but avoids the major issue of storing gradients as in SAGA. Our method relies on the SARAH estimator, as in SPIDER and SpiderBoost, combined with an averaging proximal-gradient scheme to solve both (1) and (2).
The contribution of this paper is a new algorithmic framework that covers different variants with constant and adaptive step-sizes, single sample and mini-batch, and achieves best-known theoretical complexity bounds. More specifically, our main contribution can be summarized as follows:
Composite settings: We propose a general stochastic variance reduction framework relying on the SARAH estimator to solve both expectation and finite-sum problems (1) and (2) in composite settings. We analyze our framework to design appropriate constant step-sizes instead of diminishing step-sizes as in standard stochastic gradient descent methods. As usual, the algorithm has double loops, where the outer loop can either take full gradient or mini-batch to reduce computational burden in large-scale and expectation settings. The inner loop can work with single sample or a broad range of mini-batch sizes.
Best-known complexity: In the finite-sum setting (2), our method achieves an $\mathcal{O}\big(n + n^{1/2}\varepsilon^{-2}\big)$ complexity bound to attain an $\varepsilon$-stationary point in expectation under only the smoothness of $f$. This complexity matches the lower-bound worst-case complexity in (Fang et al., 2018; Zhou and Gu, 2019) up to a constant factor when $n \le \mathcal{O}\big(\varepsilon^{-4}\big)$. In the expectation setting (1), our algorithm requires $\mathcal{O}\big(\sigma^2\varepsilon^{-2} + \sigma\varepsilon^{-3}\big)$ first-order oracle calls of $f(\cdot;\xi)$ to achieve an $\varepsilon$-stationary point in expectation under only the smoothness of $f$ and bounded variance $\sigma^2 > 0$. To the best of our knowledge, this is the best-known complexity so far for (1) under standard assumptions in both the single sample and mini-batch cases.
Adaptive step-sizes: Apart from constant step-size algorithms, we also specify our framework to obtain adaptive step-size variants for both composite and non-composite settings in both single sample and mini-batch cases. Our adaptive step-sizes are increasing along the inner iterations rather than diminishing as in stochastic proximal gradient descent methods. The adaptive variants often outperform the constant step-sizes schemes in several test cases.
Our result covers the non-composite setting in the finite-sum case (Nguyen et al., 2019), and matches the best-known complexity in (Fang et al., 2018; Wang et al., 2018) for both problems (1) and (2). Since the composite setting covers a broader class of nonconvex problems including convex constraints, we believe that our method has a better chance of handling new applications than non-composite methods. It also allows one to deal with composite problems under different types of regularizers, such as sparsity or constraints on weights, as in neural network training applications.
Hitherto, we have found three different variance reduction algorithms of the stochastic proximal gradient method for nonconvex problems that are most related to our work: proximal SVRG (called ProxSVRG) in (Reddi et al., 2016b), ProxSVRG+ in (Li and Li, 2018), and ProxSpiderBoost in (Wang et al., 2018). Other methods such as proximal stochastic gradient descent (ProxSGD) scheme (Ghadimi et al., 2016), ProxSAGA in (Reddi et al., 2016b), and Natasha variants in (Allen-Zhu, 2017b) are quite different and already intensively compared in previous works (Li and Li, 2018; Reddi et al., 2016b; Wang et al., 2018), and hence we do not include them here.
In terms of theory, Table 1 compares different methods for solving (1) and (2) regarding the stochastic first-order oracle calls (SFO), the applicability to finite-sum and/or expectation and composite settings, step-sizes, and the use of adaptive step-sizes.
|Algorithm|Finite-sum|Expectation|Composite|Step-size|Adaptive step-size|
|---|---|---|---|---|---|
|GD (Nesterov, 2004)||NA|✓||Yes|
|SGD (Ghadimi and Lan, 2013)|NA||✓||Yes|
|SVRG/SAGA (Reddi et al., 2016b)||NA|✓||No|
|SVRG+ (Li and Li, 2018)|||✓||No|
|SCSG (Lihua et al., 2017)|||✗||No|
|SNVRG (Zhou et al., 2018)|||✗||No|
|SPIDER (Fang et al., 2018)|||✗||Yes|
|SpiderBoost (Wang et al., 2018)|||✓||No|
|ProxSARAH (This work)|||✓||Yes|
Single sample for the finite-sum case:
The performance of gradient descent-type algorithms crucially depends on the step-size (i.e., learning rate). Let us make a comparison between different methods in terms of step-size for single sample case, and the corresponding complexity bound.
As shown in (Reddi et al., 2016b, Theorem 1), in the single sample case, i.e. the mini-batch size of the inner loop , ProxSVRG for solving (2) has a small step-size , and its corresponding complexity is , see (Reddi et al., 2016b, Corollary 1), which is the same as in standard proximal gradient methods.
ProxSVRG+ in (Li and Li, 2018, Theorem 3) is a variant of ProxSVRG, and in the single sample case, it uses a different step-size . This step-size is only better than that of ProxSVRG if . With this step-size, the complexity of ProxSVRG+ remains as in ProxSVRG.
In the non-composite case, SPIDER (Fang et al., 2018) relies on an adaptive step-size , where is the SARAH stochastic estimator. Clearly, this step-size is very small if the target accuracy is small, and/or is large. However, SPIDER achieves complexity bound, which is nearly optimal. Note that this step-size is problem dependent since it depends on . We also emphasize that SPIDER did not consider the composite problems.
In our constant step-size ProxSARAH variants, we use two step-sizes: averaging step-size and proximal-gradient step-size , and their product presents a combined step-size, which is (see (23) for our definition of step-size). Clearly, our step-size is much larger than that of both ProxSVRG and ProxSVRG+. It can be larger than that of SPIDER if is small and is large. With these step-sizes, our complexity bound is , and if , then it reduces to , which is also nearly optimal.
As we can observe from Algorithm 1 in the sequel, the number of proximal operator calls in our method remains the same as in ProxSVRG and ProxSVRG+.
Mini-batch for the finite-sum case:
Now, we consider the case of using mini-batch.
As indicated in (Reddi et al., 2016b, Theorem 2), if we choose the batch size and , then the step-size can be chosen as , and its complexity is improved up to for ProxSVRG. However, the mini-batch size is close to the full dataset .
For SPIDER, again in the non-composite setting, if we choose the batch-size , then its step-size is . In addition, SPIDER limits the batch size in the range of , and did not consider larger mini-batch sizes.
For SpiderBoost in (Wang et al., 2018), it requires to properly set mini-batch size to achieve complexity for solving (2). More precisely, from (Wang et al., 2018, Theorem 1), we can see that one needs to set and to achieve such a complexity. This mini-batch size can be large if is large, and less flexible to adjust the performance of the algorithm. Unfortunately, ProxSpiderBoost does not have theoretical guarantee for the single sample case.
In our methods, it is flexible to choose the epoch length and the batch size such that we can obtain different step-sizes and complexity bounds. Our batch-size can be any value in for (2). Given , we can properly choose to obtain the best-known complexity bound when and , otherwise. More details can be found in Subsection 3.4.
Online or expectation problems:
For online or expectation problems, a mini-batch is required to evaluate snapshot gradient estimators for the outer loop.
In the online or expectation case (1), SPIDER in (Fang et al., 2018, Theorem 1) achieves an complexity. In the single sample case, SPIDER’s step-size becomes , which can be very small, and depends on and . Note that is often unknown or hard to estimate. Moreover, in early iterations, is often large potentially making this method slow.
ProxSpiderBoost in (Wang et al., 2018) achieves the same complexity bound as SPIDER for the composite problem (1), but requires to set the mini-batch for both outer and inner loops. The size of these mini-batches has to be fixed a priori in order to use a constant step-size, which is certainly less flexible. The total complexity of this method is .
As shown in Theorem 3.5, our complexity is given that . Otherwise, it is , which is the same as in ProxSpiderBoost. Note that our complexity can be achieved for both single sample and a wide range of mini-batch sizes as opposed to a predefined mini-batch size of ProxSpiderBoost.
From an algorithmic point of view, our method is fundamentally different from existing methods due to its averaging step and large step-sizes in the composite settings. Moreover, our methods have more chance to improve the performance due to the use of adaptive step-sizes and an additional damped step-size , and the flexibility to choose the epoch length , the inner mini-batch size , and the snapshot batch size .
The rest of this paper is organized as follows. Section 2 discusses the fundamental assumptions and optimality conditions. Section 3 presents the main algorithmic framework and its convergence results for two settings. Section 4 considers extensions and special cases of our algorithms. Section 5 provides some numerical examples to verify our methods and compare them with existing state-of-the-art methods.
2 Mathematical tools and preliminary results
Firstly, we recall some basic notation and concepts in optimization, which can be found in (Bauschke and Combettes, 2017; Nesterov, 2004). Next, we state our blanket assumptions and discuss the optimality condition of (1) and (2). Finally, we provide preliminary results needed in the sequel.
2.1 Basic notation and concepts
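We work with finite dimensional spaces, $\mathbb{R}^d$, equipped with standard inner product $\langle\cdot,\cdot\rangle$ and Euclidean norm $\|\cdot\|$. Given a function $f$, we use $\mathrm{dom}(f) := \{x\in\mathbb{R}^d \mid f(x) < +\infty\}$ to denote its (effective) domain. If $f$ is proper, closed, and convex, $\partial f(x) := \{w\in\mathbb{R}^d \mid f(y) \ge f(x) + \langle w, y - x\rangle,\ \forall y\in\mathrm{dom}(f)\}$ denotes its subdifferential at $x$, and $\mathrm{prox}_f(x) := \arg\min_{y}\big\{f(y) + \tfrac{1}{2}\|y - x\|^2\big\}$ denotes its proximal operator. Note that if $f$ is the indicator of a nonempty, closed, and convex set $\mathcal{X}$, i.e. $f(x) = \delta_{\mathcal{X}}(x)$, then $\mathrm{prox}_f(x) = \mathrm{proj}_{\mathcal{X}}(x)$, the projection of $x$ onto $\mathcal{X}$. Any element of $\partial f(x)$ is called a subgradient of $f$ at $x$. If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$, the gradient of $f$ at $x$. A continuously differentiable function $f$ is said to be $L_f$-smooth if $\nabla f$ is Lipschitz continuous on its domain, i.e. $\|\nabla f(x) - \nabla f(y)\| \le L_f\|x - y\|$ for $x, y\in\mathrm{dom}(f)$. We use $\mathbf{U}_p(\mathcal{S})$ to denote a finite set $\mathcal{S} := \{s_1, s_2, \dots, s_n\}$ equipped with a probability distribution $p$ over $\mathcal{S}$. If $p$ is uniform, then we simply use $\mathbf{U}(\mathcal{S})$. For any real number $a$, $\lfloor a\rfloor$ denotes the largest integer less than or equal to $a$. We use $[n]$ to denote the set $\{1, 2, \dots, n\}$.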
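To make the proximal operator and its projection special case concrete, the following minimal sketch evaluates two closed-form instances: soft-thresholding for an $\ell_1$ regularizer and clipping for a box constraint. The function names `prox_l1` and `project_box` and the sample vector are illustrative assumptions, not part of the paper.

```python
from typing import List


def prox_l1(x: List[float], lam: float) -> List[float]:
    """Prox of lam * ||.||_1: coordinate-wise soft-thresholding."""
    out = []
    for xi in x:
        if xi > lam:
            out.append(xi - lam)
        elif xi < -lam:
            out.append(xi + lam)
        else:
            out.append(0.0)
    return out


def project_box(x: List[float], lo: float, hi: float) -> List[float]:
    """Prox of the indicator of the box [lo, hi]^d, i.e. the Euclidean projection."""
    return [min(max(xi, lo), hi) for xi in x]


# Hypothetical sample point used only for illustration.
x = [3.0, -0.5, 1.0, -2.0]
shrunk = prox_l1(x, 1.0)             # entries with |x_i| <= 1 are set to zero
clipped = project_box(x, -1.0, 1.0)  # entries are clipped to [-1, 1]
```

Both operators are cheap to evaluate, which is the usual reason such regularizers and constraints are handled through the proximal step rather than through the gradient.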
2.2 Fundamental assumptions
Assumption 2.1 (Bounded from below)
This assumption usually holds in practice since $f$ often represents a loss function which is nonnegative or bounded from below. In addition, the regularizer $\psi$ is also nonnegative or bounded from below, and its domain intersects $\mathrm{dom}(f)$.
Our next assumption is the smoothness of with respect to the argument .
Assumption 2.2 (-average smoothness)
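In the finite-sum setting, we can write (4) as $\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f_i(y)\|^2 \le L^2\|x - y\|^2$. Note that (4) is weaker than assuming that each component $f_i$ is $L_i$-smooth, i.e., $\|\nabla f_i(x) - \nabla f_i(y)\| \le L_i\|x - y\|$ for all $x, y\in\mathrm{dom}(f)$ and $i\in[n]$. Indeed, the individual $L_i$-smoothness implies (4) with $L^2 := \frac{1}{n}\sum_{i=1}^{n}L_i^2$. Conversely, if (4) holds, then $\|\nabla f_i(x) - \nabla f_i(y)\|^2 \le \sum_{j=1}^{n}\|\nabla f_j(x) - \nabla f_j(y)\|^2 \le nL^2\|x - y\|^2$ for $i\in[n]$. Therefore, each component $f_i$ is $\sqrt{n}L$-smooth, a constant that is larger than the one in (4) by a factor of $\sqrt{n}$ in the worst case. We emphasize that ProxSVRG, ProxSVRG+, and ProxSpiderBoost all require the $L$-smoothness of each component $f_i$ in (2).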
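The relation between average and individual smoothness can be checked numerically on a toy instance. The sketch below assumes scalar quadratic components $f_i(x) = \tfrac{1}{2}c_i x^2$, whose individual smoothness constants are the curvatures $c_i$, and assumes the average smoothness condition takes the mean-square form $\frac{1}{n}\sum_i |\nabla f_i(x) - \nabla f_i(y)|^2 \le L^2 |x-y|^2$. The data and names are hypothetical.

```python
import math

# Hypothetical curvatures: f_i(x) = 0.5 * c_i * x^2, so grad f_i(x) = c_i * x.
curvatures = [1.0, 2.0, 10.0]
n = len(curvatures)


def grad_i(i: int, x: float) -> float:
    return curvatures[i] * x


# Average smoothness constant: L^2 is the mean of the squared curvatures.
L_avg = math.sqrt(sum(c * c for c in curvatures) / n)


def avg_sq_diff(x: float, y: float) -> float:
    """Mean over i of |grad f_i(x) - grad f_i(y)|^2."""
    return sum((grad_i(i, x) - grad_i(i, y)) ** 2 for i in range(n)) / n


x, y = 1.5, -0.25
lhs = avg_sq_diff(x, y)
rhs = (L_avg * (x - y)) ** 2         # tight for quadratics
worst_individual = max(curvatures)   # largest individual constant
```

On this instance the average constant is about 5.9 while the largest individual constant is 10, which stays below the worst-case bound $\sqrt{n}\,L$.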
It is well-known that the -smooth condition leads to the following bound
Indeed, from (3), we have
In the expectation setting (1), we need the following bounded variance condition:
Assumption 2.3 (Bounded variance)
For the expectation problem (1), there exists a uniform constant $\sigma\in(0, +\infty)$ such that $\mathbb{E}\big[\|\nabla_x f(x;\xi) - \nabla f(x)\|^2\big] \le \sigma^2$ for all $x\in\mathrm{dom}(f)$.
This assumption is standard in stochastic optimization and is required by almost every solution method for (1), see, e.g., (Ghadimi and Lan, 2013). For problem (2), if $n$ is extremely large, passing over all $n$ data points is expensive or impossible. We refer to this case as the online case mentioned in (Fang et al., 2018), which can be cast into Assumption 2.3. Therefore, we do not consider this case separately. However, our theory and algorithms developed in this paper do apply to such a setting.
2.3 Optimality conditions
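Since $\psi$ is proper, closed, and convex, its proximal operator $\mathrm{prox}_{\eta\psi}$ is nonexpansive, i.e. $\|\mathrm{prox}_{\eta\psi}(x) - \mathrm{prox}_{\eta\psi}(y)\| \le \|x - y\|$ for all $x, y\in\mathbb{R}^d$.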
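Now, for any fixed $\eta > 0$, we define the following quantity, known as the gradient mapping of $F$:
$$G_{\eta}(x) := \frac{1}{\eta}\Big(x - \mathrm{prox}_{\eta\psi}\big(x - \eta\nabla f(x)\big)\Big). \tag{9}$$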
The condition (11) is standard in stochastic nonconvex optimization methods. Stronger results such as approximate second-order optimality or strictly local minimum require additional assumptions and more sophisticated optimization methods such as cubic regularized Newton-type schemes, see, e.g., (Nesterov and Polyak, 2006).
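As a small illustration of how a gradient-mapping-based stationarity measure behaves, the sketch below assumes the standard form $G_\eta(x) = \frac{1}{\eta}\big(x - \mathrm{prox}_{\eta\psi}(x - \eta\nabla f(x))\big)$ and a toy problem with $f(x) = \tfrac12\|x-a\|^2$ and $\psi = \lambda\|\cdot\|_1$, whose minimizer is the soft-thresholding of $a$. The problem data and helper names are assumptions made for this example only.

```python
import math
from typing import List


def soft_threshold(x: List[float], tau: float) -> List[float]:
    """Prox of tau * ||.||_1."""
    return [xi - tau if xi > tau else xi + tau if xi < -tau else 0.0 for xi in x]


def gradient_mapping(x: List[float], a: List[float], lam: float, eta: float) -> List[float]:
    """G_eta(x) for f(x) = 0.5 * ||x - a||^2 and psi = lam * ||.||_1."""
    grad = [xi - ai for xi, ai in zip(x, a)]                # gradient of f at x
    forward = [xi - eta * gi for xi, gi in zip(x, grad)]    # gradient step
    backward = soft_threshold(forward, eta * lam)           # proximal step
    return [(xi - bi) / eta for xi, bi in zip(x, backward)]


def norm(v: List[float]) -> float:
    return math.sqrt(sum(vi * vi for vi in v))


a = [3.0, -0.5, 1.5]   # hypothetical data
lam, eta = 1.0, 0.5
x_star = soft_threshold(a, lam)   # known minimizer of this toy problem

norm_at_solution = norm(gradient_mapping(x_star, a, lam, eta))
norm_at_origin = norm(gradient_mapping([0.0, 0.0, 0.0], a, lam, eta))
```

The mapping vanishes at the minimizer and is nonzero elsewhere, which is why its norm is commonly used as the accuracy measure for composite nonconvex problems.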
2.4 Stochastic gradient estimators
Single sample estimators:
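A simple estimator of $\nabla f(x_t)$ can be computed as follows:
$$\widetilde{\nabla} f(x_t) := \nabla_x f(x_t;\xi_t),$$
where $\xi_t$ is a realization of $\xi$. This estimator is unbiased, i.e., $\mathbb{E}\big[\widetilde{\nabla} f(x_t) \mid \mathcal{F}_t\big] = \nabla f(x_t)$, but its variance is fixed for any $x_t$, where $\mathcal{F}_t$ is the history of randomness collected up to the $t$-th iteration, i.e.:
$$\mathcal{F}_t := \sigma\big(x_0, x_1, \dots, x_t\big).$$
This is a $\sigma$-field generated by the random variables $\{x_0, x_1, \dots, x_t\}$. In the finite-sum setting (2), we have $\widetilde{\nabla} f(x_t) := \nabla f_{i_t}(x_t)$, where $i_t$ is sampled uniformly from $[n] := \{1, 2, \dots, n\}$.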
In recent years, there has been huge interest in designing stochastic estimators with variance reduction properties. The first variance reduction method was perhaps the one proposed in (Schmidt et al., 2017), which dates back to 2013, followed by (Defazio et al., 2014) for convex optimization. However, the most well-known method is SVRG, introduced in (Johnson and Zhang, 2013), which works for both convex and nonconvex problems. The SVRG estimator for $\nabla f$ in (2) is given as
$$\widetilde{\nabla} f(x_t) := \nabla f(\tilde{x}) + \nabla f_{i_t}(x_t) - \nabla f_{i_t}(\tilde{x}),$$
where $\nabla f(\tilde{x})$ is the full gradient of $f$ at a snapshot point $\tilde{x}$, and $i_t$ is a uniformly random index in $[n]$. It is clear that $\mathbb{E}\big[\widetilde{\nabla} f(x_t) \mid \mathcal{F}_t\big] = \nabla f(x_t)$, which shows that $\widetilde{\nabla} f(x_t)$ is an unbiased estimator of $\nabla f(x_t)$. Moreover, its variance is reduced along the snapshots.
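The unbiasedness of the SVRG estimator can be verified exactly on a small finite-sum instance by averaging over every possible index. The sketch below assumes the standard SVRG form (snapshot full gradient plus the difference of component gradients) and uses hypothetical scalar least-squares components $f_i(x) = \tfrac12(a_i x - b_i)^2$.

```python
from typing import List

# Hypothetical toy data: f_i(x) = 0.5 * (a_i * x - b_i)^2 with scalar x.
A: List[float] = [1.0, 2.0, -1.0, 0.5]
B: List[float] = [1.0, 0.0, 2.0, -1.0]
N = len(A)


def grad_i(i: int, x: float) -> float:
    """Gradient of the i-th component at x."""
    return A[i] * (A[i] * x - B[i])


def full_grad(x: float) -> float:
    """Gradient of f = (1/N) * sum_i f_i at x."""
    return sum(grad_i(i, x) for i in range(N)) / N


def svrg_estimator(i: int, x: float, snapshot: float, snapshot_grad: float) -> float:
    """SVRG estimate of the full gradient at x using component i."""
    return snapshot_grad + grad_i(i, x) - grad_i(i, snapshot)


snapshot = 0.0
snapshot_grad = full_grad(snapshot)
x = 0.75

# Averaging over all indices gives the exact conditional expectation.
estimates = [svrg_estimator(i, x, snapshot, snapshot_grad) for i in range(N)]
mean_estimate = sum(estimates) / N

# At the snapshot itself every estimate equals the full gradient (zero variance).
at_snapshot = [svrg_estimator(i, snapshot, snapshot, snapshot_grad) for i in range(N)]
```

The second check hints at the variance reduction effect: the closer the iterate is to the snapshot, the smaller the spread of the estimates.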
Our methods rely on the SARAH estimator introduced in (Nguyen et al., 2017a) for the non-composite convex problem instances of (2). We instead consider it in a more general setting to cover both (2) and (1), which is defined as follows:
$$v_t := v_{t-1} + \nabla f(x_t;\xi_t) - \nabla f(x_{t-1};\xi_t),$$
for a given realization $\xi_t$ of $\xi$. Each evaluation of $v_t$ requires two gradient evaluations. Clearly, the SARAH estimator is biased, since $\mathbb{E}\big[v_t \mid \mathcal{F}_t\big] = \nabla f(x_t) + v_{t-1} - \nabla f(x_{t-1}) \neq \nabla f(x_t)$. But it has a variance reduction property.
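Mini-batch estimators:
The mini-batch counterparts of the simple estimator and of the SARAH estimator are, respectively,
$$\widetilde{\nabla} f_{\mathcal{B}_t}(x_t) := \frac{1}{b_t}\sum_{\xi_i\in\mathcal{B}_t}\nabla f(x_t;\xi_i) \quad\text{and}\quad v_t := v_{t-1} + \frac{1}{b_t}\sum_{\xi_i\in\mathcal{B}_t}\big(\nabla f(x_t;\xi_i) - \nabla f(x_{t-1};\xi_i)\big), \tag{16}$$
where $\mathcal{B}_t$ is a mini-batch of the size $b_t \ge 1$. For the finite-sum problem (2), we replace $f(\cdot;\xi_i)$ by $f_i(\cdot)$. In this case, $\mathcal{B}_t$ is a uniformly random subset of $[n]$. Clearly, if $b_t = n$, then we take the full gradient $\nabla f$ as the exact estimator.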
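The bias of the SARAH recursion can be made explicit on the same kind of toy problem. The sketch below assumes the single-sample recursion $v_t = v_{t-1} + \nabla f_{i_t}(x_t) - \nabla f_{i_t}(x_{t-1})$ and checks, by averaging over all indices, that the conditional mean of $v_t$ equals $\nabla f(x_t) + v_{t-1} - \nabla f(x_{t-1})$. The data, iterates, and sampled index are hypothetical.

```python
from typing import List

# Hypothetical toy data: f_i(x) = 0.5 * (a_i * x - b_i)^2 with scalar x.
A: List[float] = [1.0, 2.0, -1.0, 0.5]
B: List[float] = [1.0, 0.0, 2.0, -1.0]
N = len(A)


def grad_i(i: int, x: float) -> float:
    return A[i] * (A[i] * x - B[i])


def full_grad(x: float) -> float:
    return sum(grad_i(i, x) for i in range(N)) / N


def sarah_update(v_prev: float, i: int, x_curr: float, x_prev: float) -> float:
    """One SARAH recursion step using component i."""
    return v_prev + grad_i(i, x_curr) - grad_i(i, x_prev)


# Arbitrary iterates, standing in for the output of some outer scheme.
x0, x1, x2 = 0.0, 1.0, 0.5

v0 = full_grad(x0)                # snapshot: exact gradient
v1 = sarah_update(v0, 1, x1, x0)  # one stochastic step with index 1

# Conditional mean of v2 given the past, computed by enumerating all indices.
mean_v2 = sum(sarah_update(v1, i, x2, x1) for i in range(N)) / N
predicted_mean = full_grad(x2) + v1 - full_grad(x1)
bias = mean_v2 - full_grad(x2)    # inherited error of v1
```

The bias of $v_2$ is exactly the error already carried by $v_1$, which is why the estimator is periodically reset with a snapshot gradient in double-loop schemes.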
2.5 Basic properties of stochastic and SARAH estimators
Consequently, for any , we have
Our next result is some properties of the mini-batch estimators in (16). Most of the proof is presented in (Harikandeh et al., 2015; Lohr, 2009; Nguyen et al., 2017b, 2018a), and we only provide the missing proof of (21) and (22) in Appendix A.
If is generated by (16) for the finite support case , then
where is defined as
3 ProxSARAH framework and convergence analysis
We describe our unified algorithmic framework and then specify it to solve different instances of (1) and (2) under appropriate structures. The general algorithm is described in Algorithm 1, which is abbreviated by ProxSARAH.
Algorithmically, ProxSARAH differs from SARAH in that it has one proximal step followed by an additional averaging step, Step 8. However, using the gradient mapping defined by (9) with the estimator $v_t$ in place of $\nabla f(x_t)$, we can view Step 8 as:
$$x_{t+1} := x_t - \eta_t\gamma_t\,\widetilde{G}_{\eta_t}(x_t), \quad\text{where}\quad \widetilde{G}_{\eta}(x_t) := \frac{1}{\eta}\Big(x_t - \mathrm{prox}_{\eta\psi}\big(x_t - \eta v_t\big)\Big).$$
Hence, this step is similar to a gradient step applied to the gradient mapping $\widetilde{G}_{\eta_t}(x_t)$. In particular, if we set $\gamma_t = 1$, then we obtain a vanilla proximal SARAH variant which is similar to ProxSVRG, ProxSVRG+, and ProxSpiderBoost discussed above. ProxSVRG, ProxSVRG+, and ProxSpiderBoost are simply vanilla proximal gradient-type methods in stochastic settings. If $\psi = 0$, then ProxSARAH reduces to SARAH in (Nguyen et al., 2017a, b, 2018b) with a step-size $\hat{\eta}_t := \gamma_t\eta_t$. Note that Step 8 can be represented as a weighted averaging step with given weights :
which is similar to averaged fixed-point schemes (e.g. the Krasnosel’skiĭ – Mann scheme) in the literature, see, e.g., (Bauschke and Combettes, 2017).
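The equivalence between the averaging step and a damped step along the gradient mapping can be sketched as follows. Algorithm 1 is not reproduced in this section, so the code assumes the two-step update $\hat{x}_{t+1} = \mathrm{prox}_{\eta\psi}(x_t - \eta v_t)$ followed by $x_{t+1} = (1-\gamma)x_t + \gamma\hat{x}_{t+1}$, with $\psi = \lambda\|\cdot\|_1$ chosen only for illustration. All names and numbers are hypothetical.

```python
from typing import List


def soft_threshold(x: List[float], tau: float) -> List[float]:
    """Prox of tau * ||.||_1."""
    return [xi - tau if xi > tau else xi + tau if xi < -tau else 0.0 for xi in x]


def prox_then_average(x: List[float], v: List[float], eta: float, gamma: float,
                      lam: float) -> List[float]:
    """Assumed inner update: proximal-gradient step, then averaging with x."""
    x_hat = soft_threshold([xi - eta * vi for xi, vi in zip(x, v)], eta * lam)
    return [(1.0 - gamma) * xi + gamma * hi for xi, hi in zip(x, x_hat)]


def gradient_mapping_step(x: List[float], v: List[float], eta: float, gamma: float,
                          lam: float) -> List[float]:
    """Same update written as x - gamma * eta * G, with G the gradient mapping at v."""
    x_hat = soft_threshold([xi - eta * vi for xi, vi in zip(x, v)], eta * lam)
    g_map = [(xi - hi) / eta for xi, hi in zip(x, x_hat)]
    return [xi - gamma * eta * gi for xi, gi in zip(x, g_map)]


x = [1.0, -2.0, 0.25]   # current iterate (hypothetical)
v = [0.5, -1.0, 1.0]    # gradient estimate, e.g. from a SARAH recursion
eta, gamma, lam = 0.5, 0.5, 0.5

x_avg = prox_then_average(x, v, eta, gamma, lam)
x_map = gradient_mapping_step(x, v, eta, gamma, lam)
x_vanilla = prox_then_average(x, v, eta, 1.0, lam)  # gamma = 1: plain proximal step
```

Keeping `eta` and `gamma` as separate parameters mirrors the two step-sizes discussed in the text: one scales the proximal-gradient step and the other damps the move toward its output.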
In addition, we will show in our analysis a key difference in terms of step-sizes and , mini-batch, and epoch length between ProxSARAH and existing methods, including SPIDER (Fang et al., 2018) and SpiderBoost (Wang et al., 2018).
3.1 Analysis of the inner-loop: Key estimates
This subsection proves two key estimates of the inner loop for to . We break our analysis into two different lemmas, which provide key estimates for our convergence analysis. We assume that the mini-batch size in the inner loop is fixed.
where , , and are any given positive sequences, , , and
The proof of Lemma 3.1 is deferred to Appendix B.1. The next lemma shows how to choose constant step-sizes and by fixing other parameters in Lemma 3.1 to obtain a descent property. The proof of this lemma is given in Appendix B.2.
Under Assumption 2.2 and