Variance_Reduced_Optimizers_Pytorch
PyTorch Implementation of Variance Reduced Optimization Algorithms  SARAH and SVRG.
There has been extensive research on developing stochastic variance reduced methods to solve large-scale optimization problems. More recently, a novel algorithm of this type named SPIDER was developed in Fang et al. (2018), which was shown to outperform existing algorithms of the same type and to meet the lower bound in certain regimes. Though interesting in theory, SPIDER requires an $\epsilon$-level stepsize to guarantee convergence, and consequently runs slowly in practice. This paper proposes SpiderBoost as an improved SPIDER scheme, which comes with two major advantages compared to SPIDER. First, it allows a much larger stepsize without sacrificing the convergence rate, and hence runs substantially faster than SPIDER in practice. Second, it extends much more easily to proximal algorithms with guaranteed convergence for solving composite optimization problems, which appears challenging for SPIDER due to its stringent requirement on the per-iteration increment to guarantee convergence. Both advantages can be attributed to the new convergence analysis we develop for SpiderBoost, which allows much more flexibility in choosing algorithm parameters. As a further generalization of SpiderBoost, we show that proximal SpiderBoost achieves a stochastic first-order oracle (SFO) complexity of $O(\min\{n^{1/2}\epsilon^{-1}, \epsilon^{-3/2}\})$ for composite optimization, which improves the existing best results by a factor of $O(\min\{n^{1/6}, \epsilon^{-1/6}\})$.
Large-scale machine learning problems can typically be modeled as the following finite-sum optimization problem
(P) $\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)$,
where the function $f$ denotes the total loss on the $n$ training samples and in general is nonconvex. Since the sample size $n$
can be very large, the full-batch gradient descent algorithm has high computational complexity. Various stochastic gradient descent (SGD) algorithms have been proposed and extensively studied. For nonconvex optimization, the basic SGD algorithm, which calculates one gradient per iteration, has been shown to yield an overall stochastic first-order oracle (SFO) complexity, i.e., gradient complexity, of
(Ghadimi et al., 2016) to attain a first-order stationary point that satisfies . It has been shown that the vanilla SGD with a constant stepsize converges only to a neighborhood of a first-order stationary point. This issue can be addressed by diminishing the stepsize (Bottou et al., 2018) or by choosing a sufficiently large batch size in each iteration. Furthermore, various variance reduction methods have been proposed to reduce the variance of the gradient estimator in SGD by constructing a more sophisticated and accurate gradient estimator, such as SAG
(Roux et al., 2012), SAGA (Defazio et al., 2014) and SVRG (Johnson and Zhang, 2013). In particular, SAGA and SVRG have been shown to yield an overall SFO complexity of (Reddi et al., 2016a; Allen-Zhu and Hazan, 2016) to obtain an approximate first-order stationary point for nonconvex problems. These variance reduction methods also demonstrate for the first time that stochastic gradient-based methods dominate deterministic gradient descent methods with an order of for nonconvex optimization. More recently, Nguyen et al. (2017a, b) proposed a novel variance reduction method named SARAH, where the gradient estimator is designed to be sequentially updated with the iterate in the inner loop to improve the estimation accuracy. In particular, SARAH takes the stepsize (where is the number of iterations in each inner loop), and has been shown in Nguyen et al. (2017a) to achieve an overall SFO complexity to attain an approximate first-order stationary point for nonconvex optimization. Another variance reduction method of the same type named SPIDER was also proposed in Fang et al. (2018), which uses the same gradient estimator as SARAH but adopts a normalized gradient update with a learning rate . Fang et al. (2018) showed that SPIDER achieves an overall SFO complexity, which was further shown to be optimal for the regime with .
Though SPIDER is theoretically appealing, two important issues of SPIDER require further attention. (1) SPIDER requires a very restrictive stepsize in order to guarantee convergence, which prevents SPIDER from making big progress even when that is possible. (Footnote 1: SPIDER in Fang et al. (2018) takes a normalized gradient descent update with a stepsize . It can be equivalently viewed as a gradient descent update with an adaptive stepsize , where is the estimate of the gradient at the th step. During the initial stage of the algorithm, can be much larger than so that the resulting stepsize can be very small.) Relaxing such a condition does not appear easy under the current convergence analysis framework. (2) The convergence analysis of SPIDER requires a very small per-iteration increment , which is difficult to guarantee if one generalizes it to a proximal algorithm for solving a composite optimization problem, due to the nonlinearity of the proximal operator. Hence, generalizing SPIDER to proximal algorithms with a provable convergence guarantee is challenging, if not impossible. Thus, two natural questions arise as follows.
Can we relax the parameter restrictions of SPIDER without losing the guaranteed convergence rate?
If an improved SPIDER can be designed, does such an improvement facilitate the generalization to proximal algorithms with a convergence guarantee? Does the resulting algorithm improve the SFO complexity of existing proximal algorithms?
Our study here provides affirmative answers to both of the above questions. Our contributions are summarized as follows.
Inspired by SARAH and SPIDER, we propose a more practical variant, which we call SpiderBoost. SpiderBoost has two main advantages. (1) SpiderBoost allows a much larger stepsize than the stepsize (if viewed under the gradient descent update) adopted by SPIDER, and at the same time achieves the same state-of-the-art complexity order as SPIDER (see Table 1). This is due to the new convergence analysis idea that we develop, which analyzes the increments of variables over each entire inner loop rather than over each inner-loop iteration, and hence yields a tighter bound and consequently a more relaxed stepsize requirement. As a result, SpiderBoost achieves significantly larger progress towards a first-order stationary point than SPIDER, especially in the initial optimization phase where is large, as demonstrated in Figure 1. (2) SpiderBoost comes with a natural generalization to proximal algorithms for solving composite optimization problems with a convergence guarantee. This is because the convergence analysis we develop for SpiderBoost does not require a bound on , and such an attribute significantly facilitates the convergence analysis for proximal algorithms. This is in contrast to the convergence analysis of SPIDER, which explicitly exploits the condition , which is difficult to satisfy for proximal algorithms.
Table 1
Algorithms  | Stepsize  | Finite-sum SFO  | Finite-sum/Online SFO [2]
GD  | (Nesterov, 2014)  | N/A [3]  | 
SGD  | (Ghadimi et al., 2016)  | N/A  | 
SVRG  | (Reddi et al., 2016a; Allen-Zhu and Hazan, 2016)  | N/A  | 
SCSG  | (Lei et al., 2017)  | 
SARAH  | (Nguyen et al., 2017b, a)  | N/A  | 
SNVRG  | (Zhou et al., 2018)  | 
SPIDER  | (Fang et al., 2018)  | [4]  | 
SpiderBoost  | (This Work)  | 
[2] The online setting refers to the case where the objective function takes the form of the expected value of the loss function over the data distribution. In such a case, the batch size for estimating the gradient is typically chosen to be dependent. Such a method can also be applied to solve the finite-sum problem, and hence the SFO complexity in the last column of Table 1 is applicable to both the finite-sum and online problems. Thus, for algorithms in Table 1 that have SFO bounds available in both of the last two columns, the minimum of the two bounds provides the best bound for the finite-sum problem.
[3] For deterministic algorithms, the online setting does not exist.
[4] SPIDER uses normalized gradient descent, which can also be viewed as gradient descent with the stepsize .
Algorithms  | Stepsize  | Finite-Sum SFO  | Finite-Sum PO  | Finite-Sum/Online SFO  | Finite-Sum/Online PO
ProxGD  | (Ghadimi et al., 2016)  | N/A  | N/A  | 
ProxSGD  | (Ghadimi et al., 2016)  | N/A  | N/A  | 
ProxSVRG/SAGA  | (Reddi et al., 2016b)  | N/A  | N/A  | 
Natasha1.5  | (Allen-Zhu, 2017)  | N/A  | N/A  | 
ProxSVRG  | (Li and Li, 2018) [5]  | 
ProxSpiderBoost  | (This Work)  | 
[5] Li and Li (2018) contains a detailed discussion on the choice of the outer-loop batch size. Here, we include only the best result. Moreover, their result is based on an additional assumption that the total number of iterations is a multiple of the number of iterations of the inner loop, so the additional term in the other bounds disappears in their bound.
Building on the second advantage above, we propose ProxSpiderBoost for solving the composite problem (Q) (see Section 3), where the objective function consists of a finite-sum function and a nonsmooth regularizer function. We show that ProxSpiderBoost achieves an SFO complexity of and a proximal oracle (PO) complexity of . Such an SFO complexity improves the existing best results by a factor of (see Table 3). We further extend ProxSpiderBoost to solve the constrained composite optimization problem using the proximal mapping under a non-Euclidean geometry, i.e., by replacing in the problem (Q) with a convex constraint set , and replacing the Euclidean distance with a generalized Bregman distance. Under certain conditions, we prove that the obtained algorithm achieves the same SFO complexity and PO complexity as ProxSpiderBoost for solving the unconstrained problem (Q). For nonconvex composite optimization problems that satisfy the so-called gradient dominance condition (see Definition 1), we propose a variant of the ProxSpiderBoost algorithm and establish its oracle complexity result for finding a stationary point. Our proposed algorithm achieves an SFO complexity in the order , which outperforms the state-of-the-art complexity bounds achieved by other stochastic proximal algorithms in several regimes (see Table 3).
We finally propose and study ProxSpiderBoosto for the online stochastic composite optimization problem, where the objective function takes the form with the expectation over the underlying data distribution rather than the finite-sum form. Our results show that ProxSpiderBoosto achieves an SFO complexity of , which improves the existing best SFO complexity (see Table 3) for online stochastic composite optimization by a factor of . The same complexity result also holds for general constrained optimization under a non-Euclidean geometry.
We note that two very recent studies (Zhou et al., 2018; Zhang et al., 2018) have extended the idea of SARAH and SPIDER to optimization problems over manifolds. We anticipate that SpiderBoost may help to improve the practical performance of these studies.
Notations:
For a vector $x \in \mathbb{R}^d$, $\|x\|$ denotes the $\ell_2$ norm of the vector $x$. We use $\nabla f$ to denote the gradient of $f$. We use $\mathbb{R}$, $\mathbb{R}_+$ and $\mathbb{R}^d$ to denote the set of all real numbers, nonnegative real numbers and $d$-dimensional real vectors, respectively.
In this section, we introduce the SpiderBoost algorithm inspired by the SARAH and SPIDER algorithms. Recall the following finite-sum nonconvex optimization problem.
(P) $\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)$
In Nguyen et al. (2017a), a novel estimator of the full gradient of the problem (P) was introduced for reducing the variance. More specifically, consider a certain inner loop $\{x_k\}$ of the SPIDER algorithm. The initialization of the estimator is set to be $v_0 = \nabla f(x_0)$. Then, for each subsequent iteration $k$, an index set $S$ is sampled and the corresponding estimator $v_k$ is constructed as
(1) $v_k = \frac{1}{|S|} \sum_{i \in S} \left[ \nabla f_i(x_k) - \nabla f_i(x_{k-1}) \right] + v_{k-1}$
Comparing the estimator in eq. 1 with the estimator used in the conventional SVRG (Johnson and Zhang, 2013), the estimator in eq. 1 is constructed iteratively based on the information that is obtained from the previous update, whereas the SVRG estimator is constructed based on the information of the initialization of that loop (i.e., replace $\nabla f_i(x_{k-1})$ and $v_{k-1}$ in eq. 1 with $\nabla f_i(x_0)$ and $v_0$, respectively). Therefore, the estimator in eq. 1 utilizes fresh information and yields a more accurate estimate of the full gradient than that provided by the SVRG estimator.
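To make the contrast between the two estimators concrete, the following is a minimal sketch rather than the authors' code. It assumes the recursive form $v_k = \frac{1}{|S|}\sum_{i \in S}[\nabla f_i(x_k) - \nabla f_i(x_{k-1})] + v_{k-1}$ for the SARAH/SPIDER estimator and the anchor-based form for SVRG. The toy components $f_i(x) = \frac{1}{4}(x - a_i)^4$ and all function names are hypothetical.

```python
def grad_i(i, x, a):
    # Hypothetical component gradient of f_i(x) = 0.25 * (x - a[i]) ** 4.
    return (x - a[i]) ** 3

def full_grad(x, a):
    # Full gradient: the average of all component gradients.
    return sum(grad_i(i, x, a) for i in range(len(a))) / len(a)

def recursive_estimator(v_prev, x_new, x_prev, batch, a):
    # SARAH/SPIDER style: correct the previous estimate v_{k-1} with the
    # mini-batch gradient difference between x_k and x_{k-1}.
    diff = sum(grad_i(i, x_new, a) - grad_i(i, x_prev, a) for i in batch)
    return diff / len(batch) + v_prev

def svrg_estimator(v0, x_new, x0, batch, a):
    # SVRG style: always correct relative to the loop's anchor point x_0.
    diff = sum(grad_i(i, x_new, a) - grad_i(i, x0, a) for i in batch)
    return diff / len(batch) + v0

a = [-1.0, 0.5, 2.0, 3.0]
x0, x1, x2 = 0.0, 0.4, 0.7
v0 = full_grad(x0, a)                              # exact at the start of the loop
v1 = recursive_estimator(v0, x1, x0, [0, 2], a)    # uses (x1, x0)
v2 = recursive_estimator(v1, x2, x1, [1, 3], a)    # uses (x2, x1), not the anchor
w2 = svrg_estimator(v0, x2, x0, [1, 3], a)         # uses (x2, x0)
```

With a full batch, both estimators reproduce the full gradient exactly; they differ only in which reference point the mini-batch correction is measured against.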
The estimator in eq. 1 has been adopted by Nguyen et al. (2017a, b) and Fang et al. (2018) for proposing the SARAH (see Algorithm 1) and SPIDER (see Algorithm 2) algorithms, respectively. The comparison of their complexity can be seen in Table 1, where SPIDER outperforms SARAH for nonconvex optimization, and was shown in Fang et al. (2018) to be optimal for the regime with .
Though SPIDER has desired performance in theory, it can run very slowly in practice due to the choice of a conservative stepsize. To illustrate, as can be seen from Algorithm 2, SPIDER uses a very small stepsize (where is the desired accuracy). Then, the normalized gradient descent step yields that , i.e., a small increment per iteration. By following the analysis of SPIDER, such a stepsize appears to be necessary in order to achieve the desired convergence rate.
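The effect of the normalized step can be checked numerically. The sketch below is an illustration under the assumption that the normalized update has the form $x_{k+1} = x_k - \eta\, v_k / \|v_k\|$; the function names and the numbers are hypothetical. It shows that the increment has norm exactly $\eta$ no matter how large the estimated gradient is, whereas a plain gradient step scales with the gradient norm.

```python
import math

def normalized_step(x, v, eta):
    # x_{k+1} = x_k - eta * v / ||v||; returns a copy of x if v is zero.
    norm = math.sqrt(sum(vi * vi for vi in v))
    if norm == 0.0:
        return list(x)
    return [xi - eta * vi / norm for xi, vi in zip(x, v)]

def plain_step(x, v, eta):
    # x_{k+1} = x_k - eta * v (ordinary gradient step).
    return [xi - eta * vi for xi, vi in zip(x, v)]

def increment_norm(x_new, x_old):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x_new, x_old)))

x = [1.0, 2.0]
v = [300.0, -400.0]          # a large estimated gradient, ||v|| = 500
small_eta = 0.01             # stands in for an accuracy-level stepsize
inc_normalized = increment_norm(normalized_step(x, v, small_eta), x)
inc_plain = increment_norm(plain_step(x, v, 0.5), x)   # constant stepsize
```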
Such a conservative stepsize adopted by SPIDER motivates our design of an improved algorithm named SpiderBoost (see Algorithm 3), which uses the same estimator (eq. 1) as SARAH and SPIDER, but adopts a much larger stepsize , as opposed to taken by SPIDER. Also, SpiderBoost updates the variable via a gradient descent step (same as SARAH), as opposed to the normalized gradient descent step taken by SPIDER. Furthermore, SpiderBoost generates the output variable via a random strategy, whereas SPIDER outputs deterministically. Collectively, SpiderBoost can make considerably larger progress per iteration than SPIDER, especially in the initial optimization phase where the estimated gradient norm is large, and is still guaranteed to achieve the same desirable convergence rate as SPIDER, as we show in the next subsection.
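The description above can be sketched in a few lines. This is an illustrative sketch and not Algorithm 3 itself: the stepsize $\eta = 1/(2L)$, the epoch length $q$, the batch size, sampling with replacement, and the scalar toy problem $f_i(x) = \frac{c_i}{2}(x - a_i)^2$ are all assumptions made here, and every name is hypothetical.

```python
import random

def spiderboost(grad_i, n, x0, eta, q, batch_size, num_iters, seed=0):
    """Sketch of a SpiderBoost-style loop on a scalar variable."""
    rng = random.Random(seed)
    x, x_prev, v = x0, x0, 0.0
    history = []
    for k in range(num_iters):
        if k % q == 0:
            # Start of an inner loop: refresh with the full gradient.
            v = sum(grad_i(i, x) for i in range(n)) / n
        else:
            # Recursive estimator from a mini-batch sampled with replacement.
            batch = [rng.randrange(n) for _ in range(batch_size)]
            v += sum(grad_i(i, x) - grad_i(i, x_prev) for i in batch) / batch_size
        history.append(x)
        # Plain gradient descent step with a constant stepsize.
        x_prev, x = x, x - eta * v
    # Return the last iterate and a uniformly random iterate from the history.
    return x, rng.choice(history)

# Hypothetical toy problem: f_i(x) = 0.5 * c[i] * (x - a[i]) ** 2, so L = max(c).
n = 16
c = [0.5 + 0.5 * i / (n - 1) for i in range(n)]
a = [((7 * i) % n) / 8.0 - 1.0 for i in range(n)]

def toy_grad_i(i, x):
    return c[i] * (x - a[i])

def toy_full_grad(x):
    return sum(toy_grad_i(i, x) for i in range(n)) / n

L = max(c)
x_last, x_rand = spiderboost(toy_grad_i, n, x0=5.0, eta=1.0 / (2.0 * L),
                             q=4, batch_size=4, num_iters=80)
```

Returning both the last iterate and a random one is a convenience of this sketch; the text only states that the output is chosen by a random strategy.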
Figure 1: Comparison between SPIDER and SpiderBoost for solving a nonconvex logistic regression problem. Left: gradient norm vs. number of epochs. Right: function value gap vs. number of epochs.
Next, as an illustration, we compare the practical performance of SPIDER and SpiderBoost for solving a logistic regression problem with a nonconvex regularizer, which takes the following form
(2) 
For both algorithms, we use the same parameter setting except for the stepsize, and wish to achieve a first-order stationary condition . As specified in Fang et al. (2018) for SPIDER, we set . On the other hand, SpiderBoost allows us to set . Figure 1 shows the convergence of the gradient norm and the function value gap of both algorithms versus the number of passes that are taken over the data. It can be seen that SpiderBoost converges much faster than SPIDER because it allows a large stepsize. Furthermore, SPIDER oscillates around a point where the gradient norm is about , which is the predefined accuracy value. This implies that setting a larger stepsize for SPIDER would cause it to saturate and start to oscillate at a larger gradient norm and a larger loss value, which is undesirable.
To summarize, SpiderBoost takes updates with a more aggressive stepsize that can substantially accelerate the convergence in practice without sacrificing the theoretical performance, as we show in the next subsection. Moreover, SpiderBoost is more amenable than SPIDER to extension to composite nonconvex optimization problems, and improves on the state-of-the-art complexity result, as we study in Section 3.
In this subsection, we study the convergence rate and complexity of SpiderBoost for finding a firstorder stationary point within accuracy. In particular, we adopt the following standard assumptions on the objective function in the problem (P).
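Assumption 1. The objective function in the problem (P) satisfies:
Function $f$ is continuously differentiable and bounded below, i.e., $f^* := \inf_{x \in \mathbb{R}^d} f(x) > -\infty$;
For every $i = 1, \ldots, n$, the gradient $\nabla f_i$ is $L$-Lipschitz continuous, i.e.,
(3) $\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|$ for all $x, y \in \mathbb{R}^d$.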
Assumption 1 essentially assumes that the smooth objective function has a nontrivial minimum and the corresponding gradient is Lipschitz continuous, which are valid and standard conditions in nonconvex optimization. Then, we obtain the following convergence result for SpiderBoost.
Let Assumption 1 hold and apply SpiderBoost in Algorithm 3 to solve the problem (P) with parameters and stepsize . Then, the corresponding output satisfies provided that the total number of iterations satisfies
Moreover, the resulting total SFO complexity is .
Theorem 1 shows that the output of SpiderBoost achieves the firstorder stationary condition within accuracy with a total SFO complexity . This matches the lower bound that one can expect for firstorder algorithms in the regime (Fang et al., 2018). As we explain in Section 2.1, SpiderBoost differs from SPIDER mainly in the utilization of a large constant stepsize, which yields significant acceleration over the original SPIDER in practice as we illustrate in Figure 1.
We note that the analysis of SpiderBoost in Theorem 1 is very different from that of SPIDER, which depends on an $\epsilon$-level stepsize and the normalized gradient descent step to guarantee a constant increment in every iteration. In contrast, SpiderBoost exploits the special structure of the SPIDER estimator and analyzes the algorithm over the entire inner loop rather than over each iteration, and thus yields a better bound.
Many machine learning optimization problems add a regularization term to the original loss function in order to promote certain desired structures (e.g., sparsity) to the obtained solution. Such a regularization technique can substantially improve the solution quality. In such a case, the resulting optimization problem has a composite objective function that is more challenging to solve, especially when the regularization term is a nonsmooth function. To handle such nonsmoothness, we next generalize the SpiderBoost algorithm to solve nonconvex composite optimization problems, which take the form
(Q) $\min_{x \in \mathbb{R}^d} \Psi(x) := f(x) + h(x)$, where $f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)$,
where the function $f$ denotes the total loss on the training samples, and the function $h$ corresponds to a possibly nonsmooth regularizer. To handle the nonsmoothness, we next introduce the proximal mapping, which is an effective tool for composite optimization.
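Consider a proper and lower-semicontinuous function $h$ (which can be non-differentiable). We define its proximal mapping at $x \in \mathbb{R}^d$ with parameter $\eta > 0$ as
(4) $\mathrm{prox}_{\eta h}(x) := \arg\min_{u \in \mathbb{R}^d} \left\{ h(u) + \frac{1}{2\eta} \|u - x\|^2 \right\}$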
Such a mapping is well defined and is unique particularly for convex functions. Furthermore, the proximal mapping can be used to generalize the firstorder stationary condition of smooth optimization to nonsmooth composite optimization via the following fact.
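As a concrete instance, the proximal mapping of the $\ell_1$ regularizer has a closed form (soft-thresholding). The sketch below assumes the definition $\mathrm{prox}_{\eta h}(x) = \arg\min_u \{h(u) + \frac{1}{2\eta}\|u - x\|^2\}$ with the hypothetical choice $h(u) = \lambda \|u\|_1$; the function name and the numbers are illustrative only.

```python
def prox_l1(x, eta, lam):
    """Proximal mapping of h(u) = lam * ||u||_1 with parameter eta.

    Each coordinate is shrunk toward zero by eta * lam and set to zero
    if its magnitude is below that threshold (soft-thresholding).
    """
    t = eta * lam
    return [max(abs(xi) - t, 0.0) * (1.0 if xi > 0 else -1.0) for xi in x]

def prox_objective(u, x, eta, lam):
    # The objective minimized by the proximal mapping, for sanity checks.
    return lam * sum(abs(ui) for ui in u) + sum(
        (ui - xi) ** 2 for ui, xi in zip(u, x)) / (2.0 * eta)

point = [3.0, -0.5, 0.2, -2.0]
shrunk = prox_l1(point, eta=0.5, lam=2.0)   # threshold eta * lam = 1.0
```

Coordinates with magnitude at most the threshold become exactly zero, which is why this mapping promotes sparsity.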
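Let $h$ be a proper and convex function. Define the following notion of generalized gradient
(5) $G_\eta(x) := \frac{1}{\eta} \left( x - \mathrm{prox}_{\eta h}(x - \eta \nabla f(x)) \right)$
Then, $x$ is a critical point of the function $\Psi = f + h$ (i.e., $0 \in \nabla f(x) + \partial h(x)$) if and only if
(6) $G_\eta(x) = 0$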
Fact 1 introduces a generalized notion of gradient for nonsmooth composite optimization. To elaborate, consider the case in which the regularizer is identically zero, so that the corresponding proximal mapping is the identity mapping. Then, the generalized gradient reduces to the gradient $\nabla f(x)$ of the unconstrained optimization problem. Therefore, the first-order stationary condition for composite optimization is naturally defined as .
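The two properties just described can be checked on small examples. The sketch below assumes the generalized gradient has the standard form $\frac{1}{\eta}\left(x - \mathrm{prox}_{\eta h}(x - \eta \nabla f(x))\right)$; the helper names, the quadratic $f$, and the $\ell_1$ regularizer are hypothetical choices for illustration.

```python
def generalized_gradient(x, grad_f, prox, eta):
    # (x - prox(x - eta * grad f(x))) / eta, coordinate by coordinate.
    z = [xi - eta * gi for xi, gi in zip(x, grad_f(x))]
    p = prox(z, eta)
    return [(xi - pi) / eta for xi, pi in zip(x, p)]

def identity_prox(z, eta):
    # Proximal mapping of the zero regularizer.
    return list(z)

def l1_prox(z, eta, lam=1.0):
    # Soft-thresholding: proximal mapping of lam * ||.||_1.
    t = eta * lam
    return [max(abs(zi) - t, 0.0) * (1.0 if zi > 0 else -1.0) for zi in z]

# Case 1: zero regularizer, f(x) = 0.5 * ||x||^2, so grad f(x) = x.
g_smooth = generalized_gradient([1.0, -2.0], lambda x: list(x),
                                identity_prox, eta=0.25)

# Case 2: f(x) = 0.5 * (x - 0.3)^2 plus |x|; x = 0 is a critical point
# because 0 lies in grad f(0) + [-1, 1] = [-1.3, 0.7].
g_critical = generalized_gradient([0.0], lambda x: [x[0] - 0.3],
                                  l1_prox, eta=0.5)
```

In the first case the generalized gradient coincides with the ordinary gradient, and in the second it vanishes at a critical point even though the smooth part has a nonzero gradient there.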
Next, we introduce the algorithm scheme of ProxSpiderBoost for solving composite optimization problems and study its oracle complexity.
To generalize to composite optimization, SpiderBoost admits a natural extension ProxSpiderBoost, whereas SPIDER encounters challenges. The main reason is that SpiderBoost admits a constant stepsize and its convergence guarantee does not impose any restriction on the per-iteration increment of the variable. However, the convergence of SPIDER requires the per-iteration increment of the variable to be at the $\epsilon$-level, which is quite challenging to satisfy under the nonlinear proximal operator in composite optimization. For example, one way to guarantee the per-iteration condition is to add such a condition to the proximal map, which consequently complicates the computation of the proximal mapping, because the proximal mappings of many major regularizers no longer have analytical forms under such an additional constraint. Another possible approach is to further normalize the progress direction suggested by the proximal mapping of the current variable to satisfy the per-iteration condition, but such an update becomes problematic since it loses the property of being a minimizer of the proximal mapping. Moreover, the conservative stepsize slows down the convergence. In contrast, SpiderBoost does not require such a restriction in its convergence guarantee, and hence extends flexibly to nonconvex composite optimization.
The detailed steps of ProxSpiderBoost (which generalizes SpiderBoost to composite optimization objectives) are described in Algorithm 4. Compared to SpiderBoost that uses a gradient step for smooth optimization, ProxSpiderBoost updates the variable via a proximal gradient step to handle the possible nonsmoothness in composite optimization.
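A minimal sketch of the proximal variant follows; it is not Algorithm 4 itself. It assumes the only change from the smooth case is that the gradient step is passed through the proximal mapping, and it returns the last iterate for simplicity rather than a randomly selected one. The stepsize $\eta = 1/(2L)$, the scalar toy problem, the $\ell_1$ weight, and all names are hypothetical.

```python
import random

def soft_threshold(z, t):
    # Scalar proximal mapping of t * |.|.
    return max(abs(z) - t, 0.0) * (1.0 if z > 0 else -1.0)

def prox_spiderboost(grad_i, n, prox, x0, eta, q, batch_size, num_iters, seed=0):
    """Sketch of a proximal SpiderBoost-style loop on a scalar variable."""
    rng = random.Random(seed)
    x, x_prev, v = x0, x0, 0.0
    for k in range(num_iters):
        if k % q == 0:
            v = sum(grad_i(i, x) for i in range(n)) / n      # full gradient
        else:
            batch = [rng.randrange(n) for _ in range(batch_size)]
            v += sum(grad_i(i, x) - grad_i(i, x_prev) for i in batch) / batch_size
        # Proximal gradient step replaces the plain gradient step.
        x_prev, x = x, prox(x - eta * v, eta)
    return x

# Hypothetical toy problem: f_i(x) = 0.5 * c[i] * (x - a[i]) ** 2, h(x) = lam * |x|.
n = 16
c = [0.5 + 0.5 * i / (n - 1) for i in range(n)]
a = [((7 * i) % n) / 8.0 - 1.0 for i in range(n)]
lam = 1.5   # large enough that x = 0 is the minimizer of f + h here

def toy_grad_i(i, x):
    return c[i] * (x - a[i])

def toy_prox(z, eta):
    return soft_threshold(z, eta * lam)

eta = 1.0 / (2.0 * max(c))
# q = 1 recomputes the full gradient every step (proximal gradient descent).
x_full = prox_spiderboost(toy_grad_i, n, toy_prox, 5.0, eta, 1, 4, 60)
# Started at the minimizer, the stochastic variant does not move.
x_fixed = prox_spiderboost(toy_grad_i, n, toy_prox, 0.0, eta, 4, 4, 40)
```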
Next, we characterize the oracle complexity of ProxSpiderBoost for achieving the generalized firstorder stationary condition.
Let Assumption 1 hold and consider the problem (Q) with a convex regularizer . Apply the ProxSpiderBoost in Algorithm 4 to solve the problem (Q) with parameters and stepsize . Then, the corresponding output satisfies provided that the total number of iterations satisfies
Moreover, the total SFO complexity is , and the proximal oracle (PO) complexity is .
Theorem 2 shows that the output of ProxSpiderBoost achieves the generalized first-order stationary condition within accuracy with a total SFO complexity . This improves the state-of-the-art complexity result by a factor of . On the other hand, note that the complexity lower bound for achieving the first-order stationary condition in unregularized optimization (Fang et al., 2018) is also a valid lower bound for composite optimization (by considering the special case ). Therefore, the SFO complexity of our ProxSpiderBoost matches the corresponding complexity lower bound in the regime with , and is hence near optimal.
ProxSpiderBoost proposed in the previous subsection adopts the proximal mapping that solves an unconstrained subproblem under the Euclidean distance (see the definition of the proximal mapping). Such a mapping can be further generalized to solve constrained composite optimization under a nonEuclidean geometry.
To elaborate, consider solving the composite optimization problem (Q) subject to a convex constraint set $\mathcal{X}$. We introduce the following Bregman distance $V$ associated with a kernel function $\omega$ that is defined as: for all $x, y \in \mathcal{X}$,
(7) $V(x, y) := \omega(x) - \omega(y) - \langle \nabla \omega(y), x - y \rangle$
Here, the function $\omega$ is smooth and strongly convex with respect to a certain generic norm. The specific choice of the kernel function should be compatible with the underlying geometry of the constraint set. As an example, for the unconstrained case one can choose $\omega(x) = \frac{1}{2}\|x\|^2$ so that $V(x, y) = \frac{1}{2}\|x - y\|^2$, which is 1-strongly convex with regard to the $\ell_2$ norm, whereas for the simplex constraint set, one can choose $\omega(x) = \sum_i x_i \log x_i$, which yields the KL relative entropy distance $V(x, y) = \sum_i x_i \log(x_i / y_i)$, which is strongly convex with regard to the $\ell_1$ norm.
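The two examples can be verified numerically. The sketch below assumes the standard Bregman distance $V(x, y) = \omega(x) - \omega(y) - \langle \nabla\omega(y), x - y\rangle$ and checks that the squared-norm kernel gives half the squared Euclidean distance and that the negative-entropy kernel gives the KL divergence on the simplex. All names and test points are hypothetical.

```python
import math

def bregman(omega, grad_omega, x, y):
    # V(x, y) = omega(x) - omega(y) - <grad omega(y), x - y>
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_omega(y), x, y))
    return omega(x) - omega(y) - inner

# Euclidean kernel: omega(x) = 0.5 * ||x||^2.
def sq_norm(x):
    return 0.5 * sum(xi * xi for xi in x)

def sq_norm_grad(x):
    return list(x)

# Negative-entropy kernel: omega(x) = sum_i x_i * log(x_i), for x > 0.
def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def neg_entropy_grad(x):
    return [math.log(xi) + 1.0 for xi in x]

def kl(x, y):
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y))

x = [0.2, 0.3, 0.5]      # both points lie on the probability simplex
y = [0.5, 0.25, 0.25]
v_euclid = bregman(sq_norm, sq_norm_grad, x, y)
v_entropy = bregman(neg_entropy, neg_entropy_grad, x, y)
```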
Based on the Bregman distance, the proximal gradient step in Algorithm 4 can be generalized to the following update rule for solving the constrained composite optimization.
(8) 
Moreover, the characterization of critical points in Fact 1 remains valid by defining the generalized gradient as . Then, we obtain the following oracle complexity result of ProxSpiderBoost under the Bregman distance (replace the proximal step in Algorithm 4 by its general version (eq. 8)) for solving constrained composite optimization.
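As an illustration of a Bregman proximal step, the sketch below works out one special case under stated assumptions: the update is taken to have the common form $\arg\min_{u \in \mathcal{X}} \{ h(u) + \langle v, u \rangle + \frac{1}{\eta} V(u, x) \}$, with zero regularizer, the probability simplex as the constraint set, and the KL distance as $V$. In that case the minimizer has the closed form $u_i \propto x_i \exp(-\eta v_i)$. This form of the update and all names are assumptions, not taken from eq. 8.

```python
import math

def entropic_prox_step(x, v, eta):
    # Closed-form minimizer of <v, u> + KL(u || x) / eta over the simplex.
    w = [xi * math.exp(-eta * vi) for xi, vi in zip(x, v)]
    total = sum(w)
    return [wi / total for wi in w]

def step_objective(u, x, v, eta):
    # The objective that the step minimizes, used only for checking.
    kl = sum(ui * math.log(ui / xi) for ui, xi in zip(u, x))
    return sum(vi * ui for vi, ui in zip(v, u)) + kl / eta

x = [1.0 / 3.0] * 3
v = [1.0, 0.0, 0.0]          # a hypothetical gradient estimate
x_next = entropic_prox_step(x, v, eta=1.0)
```

The update stays on the simplex without any explicit projection and shifts mass away from the coordinate with the largest estimated gradient.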
Let Assumption 1 hold and consider the problem (Q) with a convex regularizer and subject to a convex constraint set . Apply ProxSpiderBoost with a proper Bregman distance that is strongly convex, where . Choose the parameters and stepsize . Then, the algorithm outputs a point satisfying provided that the total number of iterations is at least
Moreover, the total SFO complexity is , and the PO complexity is .
Despite the nonconvex geometry of many machine learning problems, their landscapes have been shown to have amenable properties for optimization. In particular, the so-called gradient dominance condition has been shown to hold for a variety of nonconvex problems such as phase retrieval (Zhou et al., 2016), blind deconvolution (Li et al., 2018) and neural networks (Zhou and Liang, 2017). Such a desirable property has been shown to accelerate the convergence of various first-order algorithms. This motivates us to explore the theoretical performance of ProxSpiderBoost for solving the composite optimization problem (Q) under the generalized gradient dominance geometry we define below, where the function can still be nonconvex.
Let be a minimizer of function . Then, is said to be gradient dominated if for all and one has
where is the generalized gradient defined in Fact 1.
Definition 1 generalizes the traditional gradient dominance condition for single smooth objective functions to composite objective functions. In particular, such a condition allows the objective function to be nonsmooth and nonconvex, and it requires the growth of the function value to be controlled by the gradient norm.
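A small numerical check shows that gradient dominance does not require convexity. The sketch below assumes the condition has the form $\Psi(x) - \Psi(x^*) \le \tau \|G_\eta(x)\|^2$ and specializes it to a zero regularizer, where it reads $f(x) - f^* \le \tau |f'(x)|^2$. The test function $f(x) = x^2 + 3\sin^2(x)$ is a standard nonconvex example of this kind; the constant $\tau = 16$ and the grid are choices made here for illustration.

```python
import math

def f(x):
    return x * x + 3.0 * math.sin(x) ** 2      # minimum value 0 at x = 0

def df(x):
    return 2.0 * x + 3.0 * math.sin(2.0 * x)

def d2f(x):
    return 2.0 + 6.0 * math.cos(2.0 * x)

tau = 16.0                                      # assumed dominance constant
grid = [-10.0 + 0.01 * k for k in range(2001)]

# Gradient dominance on the grid: f(x) - f* <= tau * f'(x)^2 with f* = 0.
dominated = all(f(x) <= tau * df(x) ** 2 + 1e-12 for x in grid)

# Nonconvexity: the second derivative is negative at x = pi / 2.
curvature = d2f(math.pi / 2.0)
```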
In order to solve the composite optimization problems under the generalized gradient dominance condition, we propose a variant of ProxSpiderBoost, which we refer to as ProxSpiderBoostgd, described in Algorithm 5. We note that ProxSpiderBoostgd in Algorithm 5 differs from ProxSpiderBoost in Algorithm 4 (for general nonconvex optimization) in several aspects. First, ProxSpiderBoostgd uses a constant level minibatch size in the gradient estimator. Second, after every iterations (i.e., ), we set to be a random draw from the previous iterations, whereas ProxSpiderBoost chooses the output randomly from all the iteration history. ProxSpiderBoostgd can also be viewed as a generalization of SARAH (Nguyen et al., 2017b) to a proximal algorithm, with further differences lying in a much larger stepsize than that chosen by SARAH and random sampling with replacement for inner loop iterations, as opposed to the sampling without replacement taken by SARAH.
Next, we present the convergence rate characterization of Algorithm 5 for solving composite optimization problems under the generalized gradient dominance condition.
Let Assumption 1 hold and apply ProxSpiderBoostgd in Algorithm 5 to solve the problem (Q). Assume the objective function is gradient dominated and set . Then, the generated variable sequence satisfies, for all
Consequently, the oracle complexity of Algorithm 5 for finding a point that satisfies is in the order .
Theorem 4 shows that ProxSpiderBoostgd in Algorithm 5 converges linearly to a stationary point for solving composite optimization problems under the generalized gradient dominance condition. We compare the oracle complexity in Theorem 4 with those of other stochastic proximal algorithms in Table 3. We note that the results of both ProxSVRG and ProxSVRG+ require the condition number to satisfy , in which regime our ProxSpiderBoostgd outperforms the SFO complexity of these existing algorithms. Furthermore, our result for ProxSpiderBoostgd does not require the aforementioned condition, and has the most relaxed dependency on and , demonstrating the superior performance of ProxSpiderBoostgd for optimizing gradient dominated functions.
For the case with (i.e., the problem objective reduces to the smooth function ), our algorithm achieves a total SFO complexity of , which is the same as that achieved by SARAH (Nguyen et al., 2017b). However, we note that our algorithm allows the use of a constant stepsize, i.e., , while the stepsize used in SARAH is in the order .
Algorithms  | Stepsize  | Finite-Sum SFO  | Finite-Sum PO  | Additional Condition
ProxGD  | (Karimi et al., 2016)  |   | 
ProxSVRG/SAGA  | (Reddi et al., 2016b)  | [7]  | 
ProxSVRG  | (Li and Li, 2018)  | 
ProxSpiderBoostgd  | (This Work)  |   | 
[7] Following convention, we treat as a constant in all the bounds, so the original requirement becomes .
In this section, we study the performance of a variant of ProxSpiderBoost for solving nonconvex composite optimization problems under the online setting.
In this subsection, we study the following composite optimization problem.
(R) 
Here, the objective function consists of a population risk over the underlying data distribution and a regularizer . Such a problem can be viewed as having infinitely many samples, as opposed to finitely many samples in the finite-sum problem (as in problem (Q)), and the underlying data distribution is typically unknown a priori. Therefore, one cannot evaluate the full gradient over the underlying data distribution in practice. For this type of problem, we propose a variant of ProxSpiderBoost, which applies stochastic sampling to estimate the full gradient for initializing the gradient estimator in the inner loops. We refer to this variant as ProxSpiderBoosto, the details of which are summarized in Algorithm 6.
It can be seen that ProxSpiderBoosto in Algorithm 6 draws stochastic samples to estimate the full gradient for initializing the gradient estimator. To analyze its performance, we introduce the following standard assumption on the variance of stochastic gradients.
The variance of stochastic gradients is bounded, i.e., there exists a constant such that for all and all random draws of , it holds that .
Under Assumption 2, the total variance of a minibatch of stochastic gradients can be upper bounded by . We obtain the following result on the oracle complexity for ProxSpiderBoosto in Algorithm 6.
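The way averaging over a minibatch shrinks the variance can be verified exactly on a small example. The sketch below assumes the usual bound of the form $\sigma^2 / |S|$ for a minibatch $S$ drawn independently with replacement, and checks it by enumerating every possible minibatch of a hypothetical set of four scalar gradients. All names and values are illustrative.

```python
import itertools

def minibatch_variance(g, b):
    """Exact variance of the mean of b gradients drawn with replacement from g."""
    n = len(g)
    mu = sum(g) / n
    total = 0.0
    # Enumerate all n ** b equally likely ordered minibatches.
    for batch in itertools.product(range(n), repeat=b):
        mean = sum(g[i] for i in batch) / b
        total += (mean - mu) ** 2
    return total / n ** b

g = [1.0, -2.0, 0.5, 3.5]                     # hypothetical per-sample gradients
mu = sum(g) / len(g)
sigma2 = sum((gi - mu) ** 2 for gi in g) / len(g)   # single-sample variance
var_b2 = minibatch_variance(g, 2)
var_b3 = minibatch_variance(g, 3)
```

For independent draws the bound holds with equality, so doubling the batch size halves the variance of the averaged gradient.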
Let Assumptions 1 and 2 hold and consider the problem (R) with a convex regularizer . Apply ProxSpiderBoosto in Algorithm 6 to solve the problem (R) with parameters . Then, the corresponding output satisfies provided that the total number of iterations satisfies
Moreover, the resulting total SFO complexity is , and the PO complexity is .
To the best of our knowledge, the SFO complexity of Algorithm 6 improves the state-of-the-art result (Li and Li, 2018; Allen-Zhu, 2017) for online stochastic composite optimization by a factor of .
In the smooth case with , the problem (R) reduces to the online case of problem (P), and Algorithm 6 reduces to the SpiderBoost algorithm except the outer loop gradient is estimated by a batch of samples instead of the full gradient. We refer to such an algorithm as SpiderBoosto. The following corollary characterizes the performance of SpiderBoosto to solve an online problem.
Let Assumptions 1 and 2 hold and consider the online setting of problem (P). Apply SpiderBoosto with parameters to solve such a problem. Then, the corresponding output satisfies provided that the total number of iterations satisfies
Moreover, the resulting total SFO complexity is , and the PO complexity is .
Corollary 1 follows directly from Theorem 5, because the online setting of problem (P) is a special case of problem (R). ∎
Algorithm 6 can be generalized to solve the online optimization problem (R) subject to a convex constraint set with a general distance function. To do this, one replaces the proximal gradient update in Algorithm 6 with the generalized proximal gradient step in eq. 8 which is based on a proper Bregman distance . For such an algorithm, we obtain the following result on the oracle complexity for ProxSpiderBoosto in solving constrained stochastic composite optimization under nonEuclidean geometry.
Let Assumptions 1 and 2 hold and consider the problem (R) with a convex regularizer , which is further subject to a convex constraint set . Apply ProxSpiderBoosto with a proper Bregman distance that is strongly convex with . Choose the parameters as and stepsize Then, the algorithm outputs a point that satisfies provided that the total number of iterations is at least
Moreover, the resulting total SFO complexity is and the PO complexity is .
In this paper, we proposed an algorithm named SpiderBoost for solving smooth nonconvex optimization problems, which is guaranteed to output an approximate first-order stationary point with at most SFO complexity (same as SPIDER), but allows a much larger stepsize than SPIDER and hence runs faster in practice. Moreover, we extended the proposed SpiderBoost to ProxSpiderBoost to solve nonsmooth nonconvex optimization problems, which achieves an approximate first-order stationary point with at most SFO complexity and PO complexity. The SFO complexity outperforms the existing best result by a factor of . We anticipate that SpiderBoost has great potential to be applied to various other large-scale optimization problems.
Throughout the paper, let such that . We first present an auxiliary lemma from Fang et al. (2018).
Under Assumption 1, the SPIDER estimator satisfies for all ,
(9) 
Telescoping Lemma 1 over from to , we obtain that
(10) 
We note that the above inequality also holds for , which can be simply checked by plugging into the above inequality.
Next, we prove our main result that yields Theorem 1.
Under Assumption 1, if the parameters and are chosen such that
(11) 
and if it holds that for , we always have
(12) 
then the output point of SpiderBoost satisfies
(13) 
By Assumption 1, the entire objective function is smooth, which further implies that
where (i) follows from the update rule of SpiderBoost, (ii) uses the inequality that for . Taking expectation on both sides of the above inequality yields that