1 Introduction
In this paper, we consider the following stochastic composite, nonconvex, and possibly nonsmooth optimization problem:
(1) \min_{w\in\mathbb{R}^d}\big\{ F(w) := f(w) + \psi(w) \big\},
where f(w) := \mathbb{E}_{\xi}[f(w;\xi)] is the expectation of a stochastic function f(\cdot;\xi) depending on a random vector \xi in a given probability space, and \psi : \mathbb{R}^d \to \mathbb{R}\cup\{+\infty\} is a proper, closed, and convex function. As a special case of (1), if \xi is a uniformly random vector defined on a finite support set \Omega := \{\xi_1,\cdots,\xi_n\}, then (1) reduces to the following composite nonconvex finite-sum minimization problem:
(2) \min_{w\in\mathbb{R}^d}\Big\{ F(w) := \frac{1}{n}\sum_{i=1}^n f_i(w) + \psi(w) \Big\},
where f_i(w) := f(w;\xi_i) for i \in \{1,\cdots,n\}. Problem (2) is often referred to as regularized empirical risk minimization in machine learning and finance.
Motivation:
Problems (1) and (2) cover a broad range of applications in machine learning and statistics, especially in neural networks; see, e.g. (Bottou, 1998, 2010; Bottou et al., 2018; Goodfellow et al., 2016; Sra et al., 2012). Hitherto, state-of-the-art numerical optimization methods for solving these problems rely on stochastic approaches; see, e.g. (Johnson and Zhang, 2013; Schmidt et al., 2017; Shapiro et al., 2009; Defazio et al., 2014). In the convex case, both the non-composite and composite settings (1) and (2) have been intensively studied with different schemes such as standard stochastic gradient (Robbins and Monro, 1951), proximal stochastic gradient (Ghadimi and Lan, 2013; Nemirovski et al., 2009), stochastic dual coordinate descent (Shalev-Shwartz and Zhang, 2013), variance reduction methods (e.g., SVRG and SAGA) (Allen-Zhu, 2017a; Defazio et al., 2014; Johnson and Zhang, 2013; Nitanda, 2014; Schmidt et al., 2017; Xiao and Zhang, 2014), stochastic conditional gradient (Frank-Wolfe) methods (Reddi et al., 2016a), and stochastic primal-dual methods (Chambolle et al., 2018). Thanks to variance reduction techniques, several efficient methods with constant stepsizes have been developed for convex settings that match the lower-bound worst-case complexity (Agarwal et al., 2010). However, variance reduction methods for nonconvex settings are still limited and heavily focus on the non-composite form of (1) and (2), i.e. \psi = 0, and on the SVRG estimator.
Theory and stochastic methods for nonconvex problems are still in progress and require substantial effort to obtain efficient algorithms with rigorous convergence guarantees. It is shown in (Fang et al., 2018; Zhou and Gu, 2019) that there is still a gap between the upper-bound complexity of state-of-the-art methods and the lower-bound worst-case complexity for the nonconvex problem (2) under the standard smoothness assumption. Motivated by this fact, we make an attempt to develop a new algorithmic framework that can reduce, and at least nearly close, this gap in the composite finite-sum setting (2). In addition to the best-known complexity bounds, we aim to design practical algorithms that advance beyond existing methods by providing an adaptive rule to update stepsizes with a rigorous complexity analysis. Our algorithms rely on a recent biased stochastic estimator of the objective gradient, called SARAH, introduced in (Nguyen et al., 2017a) for convex problems.
Related work:
In the nonconvex case, both problems (1) and (2) have been intensively studied in recent years, with a vast number of research papers. While numerical algorithms for solving the non-composite setting, i.e. \psi = 0, are well-developed and have received considerable attention (Allen-Zhu, 2017b; Allen-Zhu and Li, 2018; Allen-Zhu and Yuan, 2016; Fang et al., 2018; Lihua et al., 2017; Nguyen et al., 2017b, 2018b, 2019; Reddi et al., 2016b; Zhou et al., 2018), methods for the composite setting remain limited (Reddi et al., 2016b; Wang et al., 2018). In terms of algorithms, (Reddi et al., 2016b) studies a non-composite finite-sum problem as a special case of (2) using the SVRG estimator from (Johnson and Zhang, 2013). Additionally, they extend their method to the composite setting by simply applying the proximal operator of \psi as in the well-known forward-backward scheme. Another related work using the SVRG estimator can be found in (Li and Li, 2018). These algorithms have some limitations, as will be discussed later. The same technique was applied in (Wang et al., 2018) to develop other variants for both (1) and (2), but using the SARAH estimator from (Nguyen et al., 2017a). The authors derive a large constant stepsize, but at the same time control the mini-batch size to achieve the desired complexity bounds. Consequently, it has an essential limitation, as will also be discussed in Subsection 3.4. Both algorithms achieve the best-known complexity bounds for solving (1) and (2). In (Reddi et al., 2016a), the authors propose a stochastic Frank-Wolfe method that can handle constraints as special cases of (2). Recently, a stochastic variance reduction method with momentum was studied in (Zhou et al., 2019) for solving (2), which can be viewed as a modification of SpiderBoost in (Wang et al., 2018).
Our algorithm remains a variance reduction stochastic method, but it differs from these works in two major points: an additional averaging step and two different stepsizes. Having two stepsizes allows us to trade them off flexibly and to develop an adaptive update rule. Note that our averaging step looks similar to the robust stochastic gradient method in (Nemirovski et al., 2009), but is fundamentally different, since it evaluates the proximal step at the averaging point. In fact, it is closely related to averaged fixed-point schemes in the literature; see, e.g. (Bauschke and Combettes, 2017).
In terms of theory, many researchers have focused on theoretical aspects of existing algorithms. For example, (Ghadimi and Lan, 2013) appears to be one of the first pioneering works studying convergence rates of stochastic gradient descent-type methods for nonconvex and non-composite finite-sum problems. They later extended it to the composite setting in (Ghadimi et al., 2016). (Wang et al., 2018) also investigate the gradient dominance case, and (Karimi et al., 2016) consider both finite-sum and composite finite-sum problems under different assumptions. Whereas many researchers have been trying to improve complexity upper bounds of stochastic first-order methods using different techniques (Allen-Zhu, 2017b; Allen-Zhu and Li, 2018; Allen-Zhu and Yuan, 2016; Fang et al., 2018), other researchers attempt to construct examples for lower-bound complexity estimates. In the convex case, there exist numerous research papers, including (Agarwal et al., 2010; Nemirovskii and Yudin, 1983; Nesterov, 2004). In (Fang et al., 2018; Zhou and Gu, 2019), the authors construct a lower-bound complexity for the nonconvex finite-sum problem covered by (2). They show that the lower-bound complexity for any stochastic gradient method using only the smoothness assumption to achieve an \varepsilon-stationary point in expectation is \Omega(\sqrt{n}\,\varepsilon^{-2}), given that the number of objective components n does not exceed \mathcal{O}(\varepsilon^{-4}).
For the expectation problem (1), the best-known complexity bound to achieve an \varepsilon-stationary point in expectation is \mathcal{O}(\sigma\varepsilon^{-3} + \sigma^2\varepsilon^{-2}), as shown in (Fang et al., 2018; Wang et al., 2018), where \sigma^2 is an upper bound on the variance (see Assumption 2.3). Unfortunately, we have not seen any lower-bound complexity for the nonconvex setting of (1) under standard assumptions in the literature.
Our approach and contribution:
We exploit the SARAH estimator, a biased stochastic recursive gradient estimator introduced in (Nguyen et al., 2017a), to design new proximal variance reduction stochastic gradient algorithms to solve both the composite expectation and finite-sum problems (1) and (2). The SARAH algorithm is simply a double-loop stochastic gradient method in the flavor of SVRG (Johnson and Zhang, 2013), but using a novel biased estimator that is different from SVRG's. Like SAGA (Defazio et al., 2014), SARAH is recursive, but it avoids SAGA's major issue of storing gradients. Our method relies on the SARAH estimator, as in SPIDER and SpiderBoost, combined with an averaging proximal-gradient scheme, to solve both (1) and (2).
The contribution of this paper is a new algorithmic framework that covers different variants with constant and adaptive stepsizes, single sample and mini-batch, and achieves the best-known theoretical complexity bounds. More specifically, our main contribution can be summarized as follows:

Composite settings: We propose a general stochastic variance reduction framework relying on the SARAH estimator to solve both the expectation and finite-sum problems (1) and (2) in composite settings. We analyze our framework to design appropriate constant stepsizes instead of the diminishing stepsizes used in standard stochastic gradient descent methods. As usual, the algorithm has double loops, where the outer loop can take either a full gradient or a mini-batch to reduce the computational burden in large-scale and expectation settings. The inner loop can work with a single sample or a broad range of mini-batch sizes.

Best-known complexity: In the finite-sum setting (2), our method achieves an \mathcal{O}(n + n^{1/2}\varepsilon^{-2}) complexity bound to attain an \varepsilon-stationary point in expectation under only the smoothness of f. This complexity matches the lower-bound worst-case complexity in (Fang et al., 2018; Zhou and Gu, 2019) up to a constant factor when n \le \mathcal{O}(\varepsilon^{-4}). In the expectation setting (1), our algorithm requires \mathcal{O}(\sigma\varepsilon^{-3} + \sigma^2\varepsilon^{-2}) first-order oracle calls of f(\cdot;\xi) to achieve an \varepsilon-stationary point in expectation under only the smoothness of f(\cdot;\xi) and the bounded variance \sigma^2. To the best of our knowledge, this is the best-known complexity so far for (1) under standard assumptions in both the single sample and mini-batch cases.

Adaptive stepsizes: Apart from constant stepsize algorithms, we also specialize our framework to obtain adaptive stepsize variants for both composite and non-composite settings, in both the single sample and mini-batch cases. Our adaptive stepsizes are increasing along the inner iterations rather than diminishing, as in stochastic proximal gradient descent methods. The adaptive variants often outperform the constant stepsize schemes in several test cases.
Our result covers the non-composite setting in the finite-sum case (Nguyen et al., 2019), and matches the best-known complexity in (Fang et al., 2018; Wang et al., 2018) for both problems (1) and (2). Since the composite setting covers a broader class of nonconvex problems, including convex constraints, we believe that our method has a better chance to handle new applications than non-composite methods. It also allows one to deal with composite problems under different types of regularizers, such as sparsity or constraints on weights, as in neural network training applications.
Comparison:
Hitherto, we have found three different variance reduction algorithms of the stochastic proximal gradient method for nonconvex problems that are most related to our work: proximal SVRG (called ProxSVRG) in (Reddi et al., 2016b), ProxSVRG+ in (Li and Li, 2018), and ProxSpiderBoost in (Wang et al., 2018). Other methods, such as the proximal stochastic gradient descent (ProxSGD) scheme (Ghadimi et al., 2016), ProxSAGA in (Reddi et al., 2016b), and the Natasha variants in (Allen-Zhu, 2017b), are quite different and have already been intensively compared in previous works (Li and Li, 2018; Reddi et al., 2016b; Wang et al., 2018); hence we do not include them here.
In terms of theory, Table 1 compares different methods for solving (1) and (2) regarding the number of stochastic first-order oracle calls (SFO), the applicability to the finite-sum and/or expectation and composite settings, the stepsizes, and the use of adaptive stepsizes.
| Algorithms | Finite-sum | Expectation | Composite | Stepsize | Adaptive stepsize |
| --- | --- | --- | --- | --- | --- |
| GD (Nesterov, 2004) |  | NA | ✓ |  | Yes |
| SGD (Ghadimi and Lan, 2013) | NA |  | ✓ |  | Yes |
| SVRG/SAGA (Reddi et al., 2016b) |  | NA | ✓ |  | No |
| SVRG+ (Li and Li, 2018) |  |  | ✓ |  | No |
| SCSG (Lihua et al., 2017) |  |  | ✗ |  | No |
| SNVRG (Zhou et al., 2018) |  |  | ✗ |  | No |
| SPIDER (Fang et al., 2018) |  |  | ✗ |  | Yes |
| SpiderBoost (Wang et al., 2018) |  |  | ✓ |  | No |
| ProxSARAH (This work) |  |  | ✓ |  | Yes |
Assumptions:
Single sample for the finite-sum case:
The performance of gradient descent-type algorithms crucially depends on the stepsize (i.e., the learning rate). Let us compare different methods in terms of the stepsize in the single sample case, together with the corresponding complexity bounds.

As shown in (Reddi et al., 2016b, Theorem 1), in the single sample case, i.e. when the mini-batch size of the inner loop is b = 1, ProxSVRG for solving (2) has a small stepsize \eta = \frac{1}{3Ln}, and its corresponding complexity is \mathcal{O}(n\varepsilon^{-2}); see (Reddi et al., 2016b, Corollary 1), which is the same as in standard proximal gradient methods.

ProxSVRG+ in (Li and Li, 2018, Theorem 3) is a variant of ProxSVRG, and in the single sample case it uses a different stepsize \eta = \frac{1}{6L}. This stepsize is only better than that of ProxSVRG if n > 2. With this stepsize, the complexity of ProxSVRG+ remains \mathcal{O}(n\varepsilon^{-2}), as in ProxSVRG.

In the non-composite case, SPIDER (Fang et al., 2018) relies on an adaptive stepsize \eta_t := \min\big\{\frac{\varepsilon}{Ln_0\|v_t\|}, \frac{1}{2Ln_0}\big\}, where v_t is the SARAH stochastic estimator. Clearly, this stepsize is very small if the target accuracy \varepsilon is small and/or \|v_t\| is large. However, SPIDER achieves an \mathcal{O}(n + n^{1/2}\varepsilon^{-2}) complexity bound, which is nearly optimal. Note that this stepsize is problem dependent, since it depends on \varepsilon. We also emphasize that SPIDER did not consider composite problems.

In our constant stepsize ProxSARAH variants, we use two stepsizes: an averaging stepsize \gamma and a proximal-gradient stepsize \eta, and their product \gamma\eta presents a combined stepsize of order \mathcal{O}\big(\frac{1}{L\sqrt{m}}\big) for the epoch length m (see (23) for our definition of the stepsize). Clearly, our stepsize is much larger than that of both ProxSVRG and ProxSVRG+. It can also be larger than that of SPIDER if \varepsilon is small and \|v_t\| is large. With these stepsizes, our complexity bound is \mathcal{O}(n + n^{1/2}\varepsilon^{-2}), and if n \le \mathcal{O}(\varepsilon^{-4}), then it reduces to \mathcal{O}(n^{1/2}\varepsilon^{-2}), which is also nearly optimal.

As we can observe from Algorithm 1 in the sequel, the number of proximal operator calls in our method remains the same as in ProxSVRG and ProxSVRG+.
Mini-batch for the finite-sum case:
Now, we consider the case of using mini-batches.

As indicated in (Reddi et al., 2016b, Theorem 2), if we choose the batch size b = n^{2/3}, then the stepsize can be chosen as \eta = \frac{1}{3L}, and the complexity of ProxSVRG is improved up to \mathcal{O}(n + n^{2/3}\varepsilon^{-2}). However, the mini-batch size n^{2/3} is close to the full dataset size n.

For SPIDER, again in the non-composite setting, if we choose a batch size b > 1, then its stepsize is still adaptive, of the form \eta_t := \min\big\{\frac{\varepsilon}{Ln_0\|v_t\|}, \frac{1}{2Ln_0}\big\} with n_0 := \frac{n^{1/2}}{b}. In addition, SPIDER limits the batch size to the range [1, n^{1/2}], and did not consider larger mini-batch sizes.

For SpiderBoost in (Wang et al., 2018), it is required to properly set the mini-batch size to achieve the \mathcal{O}(n + n^{1/2}\varepsilon^{-2}) complexity for solving (2). More precisely, from (Wang et al., 2018, Theorem 1), we can see that one needs to set the mini-batch size b = n^{1/2} and the epoch length m = n^{1/2} to achieve such a complexity. This mini-batch size can be large if n is large, and is less flexible for adjusting the performance of the algorithm. Unfortunately, ProxSpiderBoost does not have a theoretical guarantee for the single sample case.

In our methods, it is flexible to choose the epoch length m and the batch size b such that we can obtain different stepsizes and complexity bounds. Our batch size can take any value b \in \{1,\cdots,n\} for (2). Given b, we can properly choose m to obtain the best-known complexity bound. More details can be found in Subsection 3.4.
Online or expectation problems:
For online or expectation problems, a mini-batch is required to evaluate the snapshot gradient estimators in the outer loop.

In the online or expectation case (1), SPIDER in (Fang et al., 2018, Theorem 1) achieves an \mathcal{O}(\varepsilon^{-3}) complexity. In the single sample case, SPIDER's stepsize becomes \eta_t := \min\big\{\frac{\varepsilon}{Ln_0\|v_t\|}, \frac{1}{2Ln_0}\big\} with n_0 depending on \frac{\sigma}{\varepsilon}, which can be very small and depends on both \sigma and \varepsilon. Note that \sigma is often unknown or hard to estimate. Moreover, in early iterations, \|v_t\| is often large, potentially making this method slow.

ProxSpiderBoost in (Wang et al., 2018) achieves the same complexity bound as SPIDER for the composite problem (1), but requires setting the mini-batch sizes of both the outer and inner loops. The sizes of these mini-batches have to be fixed a priori in order to use a constant stepsize, which is certainly less flexible. The total complexity of this method is \mathcal{O}(\sigma^2\varepsilon^{-2} + \sigma\varepsilon^{-3}).

As shown in Theorem 3.5, our complexity is \mathcal{O}(\sigma\varepsilon^{-3}) given that \sigma \le \mathcal{O}(\varepsilon^{-1}). Otherwise, it is \mathcal{O}(\sigma^2\varepsilon^{-2}), which is the same as in ProxSpiderBoost. Note that our complexity can be achieved for both a single sample and a wide range of mini-batch sizes, as opposed to the predefined mini-batch size of ProxSpiderBoost.
From an algorithmic point of view, our method is fundamentally different from existing methods due to its averaging step and large stepsizes in the composite settings. Moreover, our methods have a better chance to improve performance due to the use of adaptive stepsizes and an additional damped stepsize \gamma, and the flexibility to choose the epoch length, the inner mini-batch size, and the snapshot batch size.
Paper organization:
The rest of this paper is organized as follows. Section 2 discusses the fundamental assumptions and optimality conditions. Section 3 presents the main algorithmic framework and its convergence results for two settings. Section 4 considers extensions and special cases of our algorithms. Section 5 provides some numerical examples to verify our methods and compare them with existing state-of-the-art methods.
2 Mathematical tools and preliminary results
Firstly, we recall some basic notation and concepts in optimization, which can be found in (Bauschke and Combettes, 2017; Nesterov, 2004). Next, we state our blanket assumptions and discuss the optimality condition of (1) and (2). Finally, we provide preliminary results needed in the sequel.
2.1 Basic notation and concepts
We work with finite-dimensional spaces \mathbb{R}^d, equipped with the standard inner product \langle\cdot,\cdot\rangle and Euclidean norm \|\cdot\|. Given a function f : \mathbb{R}^d \to \mathbb{R}\cup\{+\infty\}, we use \mathrm{dom}(f) := \{w \in \mathbb{R}^d : f(w) < +\infty\} to denote its (effective) domain. If f is proper, closed, and convex, then \partial f(w) denotes its subdifferential at w, and \mathrm{prox}_f(w) := \arg\min_z\big\{f(z) + \tfrac{1}{2}\|z - w\|^2\big\} denotes its proximal operator. Note that if f is the indicator function of a nonempty, closed, and convex set \mathcal{W}, i.e. f(w) = \delta_{\mathcal{W}}(w), then \mathrm{prox}_f(w) = \mathrm{proj}_{\mathcal{W}}(w), the projection of w onto \mathcal{W}. Any element of \partial f(w) is called a subgradient of f at w. If f is differentiable at w, then \partial f(w) = \{\nabla f(w)\}, the gradient of f at w. A continuously differentiable function f is said to be L-smooth if \nabla f is Lipschitz continuous on its domain, i.e. \|\nabla f(w) - \nabla f(w')\| \le L\|w - w'\| for all w, w' \in \mathrm{dom}(f). We use \mathcal{U}_{\mathbf{p}}(\mathcal{S}) to denote a finite set \mathcal{S} := \{s_1,\cdots,s_n\} equipped with a probability distribution \mathbf{p} over \mathcal{S}. If \mathbf{p} is uniform, then we simply use \mathcal{U}(\mathcal{S}). For any real number a, \lfloor a\rfloor denotes the largest integer less than or equal to a. We use [n] to denote the set \{1,2,\cdots,n\}.
2.2 Fundamental assumptions
To develop numerical methods for solving (1) and (2), we rely on some basic assumptions usually used in stochastic optimization methods.
Assumption 2.1 (Bounded from below) Both problems (1) and (2) are bounded from below, i.e. F^{\star} := \inf_{w\in\mathbb{R}^d} F(w) > -\infty, and \mathrm{dom}(F) := \mathrm{dom}(f)\cap\mathrm{dom}(\psi) is nonempty.
This assumption usually holds in practice, since f often represents a loss function, which is nonnegative or bounded from below. In addition, the regularizer \psi is also often nonnegative or bounded from below, and its domain intersects \mathrm{dom}(f). Our next assumption is the smoothness of f(\cdot;\xi) with respect to the argument w.
Assumption 2.2 (average smoothness) The function f(\cdot;\xi) is L-average smooth, i.e. there exists L \in (0,+\infty) such that
(4) \mathbb{E}_{\xi}\big[\|\nabla f(w;\xi) - \nabla f(w';\xi)\|^2\big] \le L^2\|w - w'\|^2, \quad \forall w, w' \in \mathrm{dom}(f).
In the finite-sum case (2), we can write (4) as \frac{1}{n}\sum_{i=1}^n\|\nabla f_i(w) - \nabla f_i(w')\|^2 \le L^2\|w - w'\|^2. Note that (4) is weaker than assuming that each component f_i is L_i-smooth, i.e. \|\nabla f_i(w) - \nabla f_i(w')\| \le L_i\|w - w'\| for all i \in [n]. Indeed, the individual smoothness implies (4) with L^2 := \frac{1}{n}\sum_{i=1}^n L_i^2. Conversely, if (4) holds, then \|\nabla f_i(w) - \nabla f_i(w')\| \le \sqrt{n}\,L\|w - w'\| for i \in [n]. Therefore, each component f_i is \sqrt{n}L-smooth, whose constant is larger than the L in (4) by a factor of \sqrt{n} in the worst case. We emphasize that ProxSVRG, ProxSVRG+, and ProxSpiderBoost all require the smoothness of each component f_i in (2).
It is well-known that L-smoothness of a function g leads to the following bound:
(5) g(w) \le g(w') + \langle\nabla g(w'), w - w'\rangle + \frac{L}{2}\|w - w'\|^2, \quad \forall w, w' \in \mathrm{dom}(g).
Indeed, by Jensen's inequality, \|\nabla f(w) - \nabla f(w')\| = \|\mathbb{E}_{\xi}[\nabla f(w;\xi) - \nabla f(w';\xi)]\| \le \big(\mathbb{E}_{\xi}[\|\nabla f(w;\xi) - \nabla f(w';\xi)\|^2]\big)^{1/2} \le L\|w - w'\|, which shows that f itself is L-smooth. Hence, using (4), we get
(6) f(w) \le f(w') + \langle\nabla f(w'), w - w'\rangle + \frac{L}{2}\|w - w'\|^2, \quad \forall w, w' \in \mathrm{dom}(f).
In the expectation setting (1), we need the following bounded variance condition:
Assumption 2.3 (Bounded variance) For the expectation problem (1), there exists a uniform constant \sigma \in (0,+\infty) such that
(7) \mathbb{E}_{\xi}\big[\|\nabla f(w;\xi) - \nabla f(w)\|^2\big] \le \sigma^2, \quad \forall w \in \mathrm{dom}(f).
This assumption is standard in stochastic optimization and is required in almost any solution method for solving (1); see, e.g. (Ghadimi and Lan, 2013). For problem (2), if n is extremely large, passing over all n data points is prohibitively expensive or impossible. We refer to this case as the online case mentioned in (Fang et al., 2018); it can be cast into Assumption 2.3. Therefore, we do not consider this case separately. However, the theory and algorithms developed in this paper do apply to such a setting.
2.3 Optimality conditions
Under Assumption 2.1, we have \mathrm{dom}(F) \neq \emptyset. When f is nonconvex in w, the first-order optimality condition of (1) can be stated as
(8) 0 \in \nabla f(w^{\star}) + \partial\psi(w^{\star}).
Here, w^{\star} is called a stationary point of F, and we denote the set of all such stationary points by \mathrm{crit}(F). The condition (8) is called the first-order optimality condition, and it also holds for (2).
Since \psi is proper, closed, and convex, its proximal operator \mathrm{prox}_{\eta\psi} is nonexpansive, i.e. \|\mathrm{prox}_{\eta\psi}(w) - \mathrm{prox}_{\eta\psi}(w')\| \le \|w - w'\| for all w, w' \in \mathbb{R}^d.
Now, for any fixed \eta > 0, we define the following quantity:
(9) \mathcal{G}_{\eta}(w) := \frac{1}{\eta}\big(w - \mathrm{prox}_{\eta\psi}(w - \eta\nabla f(w))\big).
This quantity is called the gradient mapping of F (Nesterov, 2004). Indeed, if \psi = 0, then \mathcal{G}_{\eta}(w) = \nabla f(w), which is exactly the gradient of f at w. By using \mathcal{G}_{\eta}, the optimality condition (8) can be equivalently written as
(10) \mathcal{G}_{\eta}(w^{\star}) = 0.
If we apply gradient-type methods to solve (1) or (2), then we can only aim at finding an approximate stationary point \widetilde{w}_T of (10) after at most T iterations within a given accuracy \varepsilon > 0, i.e.:
(11) \mathbb{E}\big[\|\mathcal{G}_{\eta}(\widetilde{w}_T)\|^2\big] \le \varepsilon^2.
The condition (11) is standard in stochastic nonconvex optimization methods. Stronger results, such as approximate second-order optimality or a strict local minimum, require additional assumptions and more sophisticated optimization methods, such as cubic regularized Newton-type schemes; see, e.g., (Nesterov and Polyak, 2006).
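To make the gradient mapping concrete, here is a small numerical sketch (our own illustration, not part of the original development) that evaluates it for the specific choice \psi(w) = \lambda\|w\|_1, whose proximal operator is the well-known soft-thresholding; all function names are ours:

```python
import numpy as np

def prox_l1(w, lam):
    # Proximal operator of psi(w) = lam * ||w||_1 (soft-thresholding).
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def gradient_mapping(w, grad_fw, eta, lam):
    # G_eta(w) = (1/eta) * (w - prox_{eta*psi}(w - eta * grad f(w))).
    return (w - prox_l1(w - eta * grad_fw, eta * lam)) / eta

# Example: f(w) = 0.5 * ||w||^2 (so grad f(w) = w), psi(w) = lam * ||w||_1.
lam, eta = 0.1, 0.5
w = np.array([1.0, -0.05, 0.0])
g = gradient_mapping(w, w, eta, lam)
# At w = 0, the minimizer of F = f + psi, the gradient mapping vanishes,
# matching the optimality condition (10).
```

Note that the stationarity measure in (11) is then just the squared norm of `g` (in expectation, with the true gradient replaced by its estimator).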
2.4 Stochastic gradient estimators
One key step in designing a stochastic gradient method for (1) or (2) is to query an estimator of the gradient \nabla f at any iterate w. Let us recall some existing stochastic estimators.
Single sample estimators:
A simple estimator of \nabla f can be computed as follows:
(12) v_t := \nabla f(w_t; \xi_t),
where \xi_t is a realization of \xi. This estimator is unbiased, i.e. \mathbb{E}[v_t \mid \mathcal{F}_t] = \nabla f(w_t), but its variance is fixed for any t, where \mathcal{F}_t is the history of randomness collected up to the t-th iteration, i.e.:
(13) \mathcal{F}_t := \sigma(w_0, w_1, \cdots, w_t).
This is a \sigma-field generated by the random variables w_0, w_1, \cdots, w_t. In the finite-sum setting (2), we have v_t := \nabla f_{i_t}(w_t), where i_t \in [n] is chosen uniformly at random.
In recent years, there has been huge interest in designing stochastic estimators with variance reduction properties. The first variance reduction method was perhaps proposed in (Schmidt et al., 2017), available since 2013, and then in (Defazio et al., 2014) for convex optimization. However, the most well-known method is SVRG, introduced by Johnson and Zhang in (Johnson and Zhang, 2013), which works for both convex and nonconvex problems. The SVRG estimator for \nabla f in (2) is given as
(14) v_t := \nabla f_{i_t}(w_t) - \nabla f_{i_t}(\widetilde{w}) + \nabla f(\widetilde{w}),
where \nabla f(\widetilde{w}) is the full gradient of f at a snapshot point \widetilde{w}, and i_t is a uniformly random index in [n]. It is clear that \mathbb{E}[v_t \mid \mathcal{F}_t] = \nabla f(w_t), which shows that v_t is an unbiased estimator of \nabla f(w_t). Moreover, its variance is reduced along the snapshots.
Our methods rely on the SARAH estimator, introduced in (Nguyen et al., 2017a) for the non-composite convex instances of (2). We instead consider it in a more general setting covering both (2) and (1), where it is defined as follows:
(15) v_t := \nabla f(w_t; \xi_t) - \nabla f(w_{t-1}; \xi_t) + v_{t-1},
for a given realization \xi_t of \xi. Each evaluation of v_t requires two gradient evaluations. Clearly, the SARAH estimator is biased, since \mathbb{E}[v_t \mid \mathcal{F}_t] = \nabla f(w_t) - \nabla f(w_{t-1}) + v_{t-1} \neq \nabla f(w_t). But it possesses a variance reduction property.
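To contrast (14) and (15) numerically, the following sketch (our own illustration, with least-squares components f_i(w) = \tfrac{1}{2}(a_i^{\top}w - b_i)^2 and all names ours) checks that the SVRG estimator is unbiased while the SARAH estimator generally is not:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(w, i):
    # Gradient of f_i(w) = 0.5 * (a_i^T w - b_i)^2.
    return A[i] * (A[i] @ w - b[i])

def full_grad(w):
    return A.T @ (A @ w - b) / n

def svrg_est(w, w_snap, i):
    # SVRG estimator (14): unbiased given the snapshot.
    return grad_i(w, i) - grad_i(w_snap, i) + full_grad(w_snap)

def sarah_est(w, w_prev, v_prev, i):
    # SARAH estimator (15): recursive and biased.
    return grad_i(w, i) - grad_i(w_prev, i) + v_prev

w_prev, w = rng.standard_normal(d), rng.standard_normal(d)
v_prev = grad_i(w_prev, 0)  # a single-sample estimate, not the exact gradient
svrg_mean = np.mean([svrg_est(w, w_prev, i) for i in range(n)], axis=0)
sarah_mean = np.mean([sarah_est(w, w_prev, v_prev, i) for i in range(n)], axis=0)
# svrg_mean equals grad f(w); sarah_mean equals
# grad f(w) - grad f(w_prev) + v_prev, which differs from grad f(w).
```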
Minibatch estimators:
We consider mini-batch versions of the gradient estimator (12) and of the SARAH estimator (15), respectively, as follows:
(16) v_t := \frac{1}{b}\sum_{\xi_i \in \mathcal{B}_t}\nabla f(w_t; \xi_i) \quad \text{and} \quad v_t := \frac{1}{b}\sum_{\xi_i \in \mathcal{B}_t}\big(\nabla f(w_t; \xi_i) - \nabla f(w_{t-1}; \xi_i)\big) + v_{t-1},
where \mathcal{B}_t is a mini-batch of size b := |\mathcal{B}_t|. For the finite-sum problem (2), we replace \nabla f(\cdot; \xi_i) by \nabla f_i(\cdot); in this case, \mathcal{B}_t is a uniformly random subset of [n]. Clearly, if b = n, then we obtain the full gradient as the exact estimator.
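A mini-batch SARAH estimator for the finite-sum case could be sketched as follows (illustrative data and names, ours); taking the batch to be all of [n] recovers the exact gradient, as noted above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 3
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def batch_grad(w, batch):
    # Average gradient of f_i(w) = 0.5 * (a_i^T w - y_i)^2 over the batch.
    r = A[batch] @ w - y[batch]
    return A[batch].T @ r / len(batch)

def sarah_minibatch(w, w_prev, v_prev, batch):
    # Mini-batch SARAH estimator, second expression in (16).
    return batch_grad(w, batch) - batch_grad(w_prev, batch) + v_prev

full = np.arange(n)                    # batch of size b = n
w_prev, w = rng.standard_normal(d), rng.standard_normal(d)
v_prev = batch_grad(w_prev, full)      # exact gradient at w_prev
v = sarah_minibatch(w, w_prev, v_prev, full)
# With b = n, the estimator is exact: v == grad f(w).
```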
2.5 Basic properties of stochastic and SARAH estimators
We recall some basic properties of the standard stochastic and SARAH estimators for (1) and (2). The following result was proved in (Nguyen et al., 2017a).
Let v_t be defined by (15) and \mathcal{F}_t be defined by (13). Then
(17) 
Consequently, for any t \ge 0, we have
(18) 
Our next result provides some properties of the mini-batch estimators in (16). Most of the proof is presented in (Harikandeh et al., 2015; Lohr, 2009; Nguyen et al., 2017b, 2018a); we only provide the missing proofs of (21) and (22) in Appendix A.
If v_t is generated by (16), then, under Assumption 2.3, we have
(19) 
If v_t is generated by (16) for the finite support case \Omega := \{\xi_1,\cdots,\xi_n\}, then
(20) 
where the corresponding constant is defined as
If v_t is generated by (16) for the finite-sum problem (2), then
(21) 
If v_t is generated by (16) for the expectation problem (1), then
(22) 
Note that if b = n, i.e. we take a full gradient estimate, then the second estimate of (20) vanishes. The second term of (21) also vanishes.
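Although the exact constants in (19)-(22) are derived elsewhere, the qualitative behavior (the variance of the mini-batch gradient estimator in (16) shrinks as b grows and vanishes at b = n under sampling without replacement) is easy to verify numerically. The sketch below (our illustration; all names ours) averages the squared error over every batch of each size:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n, d = 8, 2
G = rng.standard_normal((n, d))   # per-component gradients grad f_i(w) at a fixed w
gbar = G.mean(axis=0)             # full gradient grad f(w)

def mb_variance(b):
    # Exact E||v - grad f(w)||^2 over all batches of size b (without replacement),
    # computed by enumerating every batch.
    errs = [np.sum((G[list(B)].mean(axis=0) - gbar) ** 2)
            for B in combinations(range(n), b)]
    return float(np.mean(errs))

vars_by_b = [mb_variance(b) for b in range(1, n + 1)]
# vars_by_b decreases monotonically in b and is exactly 0 at b = n,
# consistent with the full-gradient remark above.
```

This matches the classical finite-population formula \mathrm{Var} = \frac{S^2}{b}\big(1 - \frac{b}{n}\big), which is decreasing in b.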
3 ProxSARAH framework and convergence analysis
We describe our unified algorithmic framework and then specify it to solve different instances of (1) and (2) under appropriate structures. The general algorithm is described in Algorithm 1, which we abbreviate as ProxSARAH.
In terms of the algorithm, ProxSARAH differs from SARAH in that it has one proximal step followed by an additional averaging step, Step 8. However, using the gradient mapping defined by (9), we can view Step 8 as:
(23) w_{t+1} := w_t - \gamma_t\eta_t\,\widehat{\mathcal{G}}_{\eta_t}(w_t), \quad \text{where } \widehat{\mathcal{G}}_{\eta_t}(w_t) := \frac{1}{\eta_t}\big(w_t - \mathrm{prox}_{\eta_t\psi}(w_t - \eta_t v_t)\big).
Hence, this step is similar to a gradient step applied to the gradient mapping, with the SARAH estimator v_t in place of \nabla f(w_t). In particular, if we set \gamma_t = 1, then we obtain a vanilla proximal SARAH variant similar to ProxSVRG, ProxSVRG+, and ProxSpiderBoost discussed above; these three methods are simply vanilla proximal gradient-type methods in stochastic settings. If \psi = 0, then ProxSARAH reduces to SARAH in (Nguyen et al., 2017a, b, 2018b) with a stepsize \gamma_t\eta_t. Note that Step 8 can be represented as a weighted averaging step with given weights \{\gamma_t\}:
w_{t+1} := (1 - \gamma_t)w_t + \gamma_t\,\widehat{w}_{t+1}, \quad \text{where } \widehat{w}_{t+1} := \mathrm{prox}_{\eta_t\psi}(w_t - \eta_t v_t).
Compared to (Ghadimi and Lan, 2012; Nemirovski et al., 2009), ProxSARAH evaluates the proximal step at the averaged point. Therefore, it can be written as
w_{t+1} := (1 - \gamma_t)w_t + \gamma_t\,\mathrm{prox}_{\eta_t\psi}(w_t - \eta_t v_t),
which is similar to averaged fixed-point schemes (e.g. the Krasnosel'skii-Mann scheme) in the literature; see, e.g., (Bauschke and Combettes, 2017).
In addition, we will show in our analysis a key difference in terms of the stepsizes \gamma and \eta, the mini-batch size, and the epoch length between ProxSARAH and existing methods, including SPIDER (Fang et al., 2018) and SpiderBoost (Wang et al., 2018).
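One inner iteration of the scheme just described (the proximal step followed by the averaging step, Step 8) can be sketched as follows; this is our own illustration with the \ell_1-regularizer as an example of \psi, constant stepsizes, and illustrative names:

```python
import numpy as np

def prox_l1(w, lam):
    # Proximal operator of psi(w) = lam * ||w||_1 (soft-thresholding).
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def prox_sarah_step(w, v, eta, gamma, lam):
    # Proximal step followed by the averaging step (Step 8):
    #   w_hat  = prox_{eta*psi}(w - eta * v)
    #   w_next = (1 - gamma) * w + gamma * w_hat
    w_hat = prox_l1(w - eta * v, eta * lam)
    return (1.0 - gamma) * w + gamma * w_hat

w = np.array([0.8, -0.3, 0.0])
v = np.array([0.5, 0.1, -0.2])   # a SARAH gradient estimate at w
w_next = prox_sarah_step(w, v, eta=0.4, gamma=0.7, lam=0.05)
# gamma = 1 recovers the vanilla proximal gradient step, as noted above.
```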
3.1 Analysis of the innerloop: Key estimates
This subsection proves two key estimates for the inner loop. We break our analysis into two lemmas, which provide the key estimates for our convergence analysis. We assume that the mini-batch size in the inner loop is fixed.
Let \{w_t\} be generated by the inner loop of Algorithm 1 with a fixed mini-batch size. Then, under Assumption 2.2, we have
(24) 
where , , and are any given positive sequences, , , and
(25) 
Here, if Algorithm 1 solves (1), and if Algorithm 1 solves (2).
The proof of Lemma 3.1 is deferred to Appendix B.1. The next lemma shows how to choose the constant stepsizes \gamma and \eta, by fixing the other parameters in Lemma 3.1, to obtain a descent property. The proof of this lemma is given in Appendix B.2.
Under Assumption 2.2 and