1 Introduction
In this paper, we study the optimization problem
(1.1)  $\min_{x \in \mathbb{R}^d} f(x) := \mathbb{E}\left[F(x;\zeta)\right]$,
where the stochastic component $F(x;\zeta)$, indexed by some random vector $\zeta$, is smooth and possibly nonconvex. Nonconvex optimization problems of form (1.1) encompass many large-scale statistical learning tasks, and optimization methods that solve (1.1) are gaining tremendous popularity due to their favorable computational and statistical efficiency (Bottou, 2010; Bubeck et al., 2015; Bottou et al., 2018). Typical examples of form (1.1) include principal component analysis, estimation of graphical models, and the training of deep neural networks
(Goodfellow et al., 2016). The expectation-minimization structure of the stochastic optimization problem (1.1) allows us to perform iterative updates, minimizing the objective using a stochastic gradient as an estimator of its deterministic counterpart.

A special case of central interest is when the stochastic vector $\zeta$ is finitely sampled. In this finite-sum (or offline) case, we denote each component function by $f_i(x)$, and (1.1) can be restated as
(1.2)  $\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x)$,
where $n$ is the number of individual functions. In the other case, when $n$ is reasonably large or even infinite, a full pass over the whole dataset is expensive or impossible; we refer to this as the online (or streaming) case. For simplicity of notation, we study the optimization problem in form (1.2), in both the finite-sum and online cases, throughout the rest of this paper.
One important task in nonconvex optimization is to search for, given a precision accuracy $\epsilon$, an approximate first-order stationary point, i.e. a point $x$ with $\|\nabla f(x)\| \le \epsilon$. In this paper, we propose a new technique, called the Stochastic Path-Integrated Differential EstimatoR (Spider), which enables us to construct an estimator that tracks a deterministic quantity with significantly lower sampling costs. As the reader will see, the Spider technique further allows us to design an algorithm with a faster rate of convergence for the nonconvex problem (1.2), in which we utilize the idea of Normalized Gradient Descent (NGD) (Nesterov, 2004; Hazan et al., 2015). NGD is a variant of Gradient Descent (GD) in which the step size is picked to be inversely proportional to the norm of the full gradient. Compared to GD, NGD exhibits faster convergence, especially in the neighborhood of stationary points (Levy, 2016). However, NGD has been less popular due to its requirement of accessing the full gradient and its norm at each update. In this paper, we estimate and track the gradient and its norm via the Spider technique and then hybridize NGD with it. Measured by the gradient cost, i.e. the total number of stochastic gradient computations, our proposed SpiderSFO algorithm achieves a faster convergence rate of $\mathcal{O}(\min(n^{1/2}\epsilon^{-2}, \epsilon^{-3}))$, which outperforms the previous best-known results in both the finite-sum case (Allen-Zhu & Hazan, 2016; Reddi et al., 2016) and the online case (Lei et al., 2017) by a factor of $\mathcal{O}(\min(n^{1/6}, \epsilon^{-1/3}))$.
For the task of finding first-order stationary points, for which we have already achieved a faster convergence rate via our proposed SpiderSFO algorithm, a natural follow-up question is: is SpiderSFO optimal for an appropriate class of smooth functions? In this paper, we provide an affirmative answer to this question in the finite-sum case. To be specific, inspired by a counterexample proposed by Carmon et al. (2017b), we prove that the gradient cost upper bound of the SpiderSFO algorithm matches an algorithmic lower bound. Put differently, the gradient cost of SpiderSFO cannot be further improved for finding first-order stationary points for some particular nonconvex functions.
Nevertheless, it has been shown that for machine learning methods such as deep learning, approximate stationary points that have at least one negative Hessian direction, including saddle points and local maximizers, are often not sufficient and need to be avoided or escaped from (Dauphin et al., 2014; Ge et al., 2015). Specifically, under the smoothness condition for $F(x;\zeta)$ and an additional Hessian-Lipschitz condition, we aim to find an approximate second-order stationary point, i.e. a point $x$ satisfying $\|\nabla f(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(x)) \ge -\delta$ (Nesterov & Polyak, 2006). As a side result, we propose a variant of our SpiderSFO algorithm, named SpiderSFO^{+} (Algorithm 2), for finding an approximate second-order stationary point, based on a so-called Negative-Curvature-Search method. Under the additional Hessian-Lipschitz assumption, SpiderSFO^{+} achieves an approximate second-order stationary point at a gradient cost of $\widetilde{\mathcal{O}}(\epsilon^{-3})$ in the online case, which improves upon the best-known gradient cost in that case by a factor of $\widetilde{\mathcal{O}}(\epsilon^{-1/4})$ (Allen-Zhu & Li, 2018). For the finite-sum case, the gradient cost of Spider is sharper than that of the state-of-the-art Neon+FastCubic/CDHS algorithms in Agarwal et al. (2017); Carmon et al. (2016) in the regime of moderate $n$.^{1}

^{1} In the finite-sum case, when $n$ is large, SpiderSFO has a slower rate than the state-of-the-art rate achieved by Neon+FastCubic/CDHS (Allen-Zhu & Li, 2018). Neon+FastCubic/CDHS exploits appropriate acceleration techniques, which have not been considered for Spider.

1.1 Related Work
In recent years, there has been a surge of literature in the machine learning community analyzing the convergence properties of nonconvex optimization algorithms. Limited by space and our knowledge, we list the works that we believe are most related to this one. We refer the reader to the monograph by Jain et al. (2017) and the references therein for recent general and model-specific convergence rate results in nonconvex optimization.
First- and Zeroth-Order Optimization and Variance Reduction
For the general problem of finding approximate first-order stationary points under the smoothness condition on $f$, it is known that vanilla Gradient Descent (GD) and Stochastic Gradient Descent (SGD), which trace back to Cauchy (1847) and Robbins & Monro (1951), achieve an approximate stationary point at gradient costs of $\mathcal{O}(n\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-4})$, respectively (Nesterov, 2004; Ghadimi & Lan, 2013; Nesterov & Spokoiny, 2011; Shamir, 2017). Recently, the convergence rates of GD and SGD have been improved by variance-reduction-type algorithms (Johnson & Zhang, 2013; Schmidt et al., 2017). In particular, the finite-sum Stochastic Variance-Reduced Gradient (SVRG) method and the online Stochastically Controlled Stochastic Gradient (SCSG) method reduce the gradient cost to $\mathcal{O}(\min(n^{2/3}\epsilon^{-2}, \epsilon^{-10/3}))$ (Allen-Zhu & Hazan, 2016; Reddi et al., 2016; Lei et al., 2017).

First-order methods for finding approximate second-order stationary points
Recently, many works have studied the problem of how to avoid or escape saddle points and achieve an approximate second-order stationary point at a polynomial gradient cost (Ge et al., 2015; Jin et al., 2017a; Xu et al., 2017; Allen-Zhu & Li, 2018; Hazan et al., 2015; Levy, 2016; Allen-Zhu, 2018; Reddi et al., 2018; Tripuraneni et al., 2018; Jin et al., 2017b; Lee et al., 2016; Agarwal et al., 2017; Carmon et al., 2016; Paquette et al., 2018). Among them, Ge et al. (2015); Jin et al. (2017a) proposed noise-perturbed variants of Gradient Descent (PGD) and Stochastic Gradient Descent (SGD) that escape from all saddle points and achieve an approximate second-order stationary point at a polynomial cost in stochastic gradients. Levy (2016) proposed a noise-perturbed variant of NGD which yields faster evasion of saddle points than GD.
A breakthrough in gradient cost for finding second-order stationary points was achieved in 2016/2017, when two recent lines of work, namely FastCubic (Agarwal et al., 2017) and CDHS (Carmon et al., 2016), as well as their stochastic versions (Allen-Zhu, 2018; Tripuraneni et al., 2018), achieved what served as the best-known gradient costs for finding an approximate second-order stationary point before the initial submission of this paper.^{2} ^{3} In particular, Agarwal et al. (2017); Tripuraneni et al. (2018) converted the cubic regularization method for finding second-order stationary points (Nesterov & Polyak, 2006) into stochastic-gradient-based and stochastic-Hessian-vector-product-based methods, and Carmon et al. (2016); Allen-Zhu (2018) used a Negative-Curvature-Search method to avoid saddle points. See also the recent work by Reddi et al. (2018) for related saddle-point-escaping methods that achieve similar rates for finding an approximate second-order stationary point.

^{2} Allen-Zhu (2018) also obtains a reduced gradient cost for achieving a (modified and weakened) approximate second-order stationary point. ^{3} Here and in many places afterwards, the gradient cost also includes the number of stochastic Hessian-vector product accesses, each of which has a similar running time to computing a stochastic gradient.
Online PCA and the NEON method
In late 2017, two groups, Xu et al. (2017) and Allen-Zhu & Li (2018), proposed a generic saddle-point-escaping method called Neon, a Negative-Curvature-Search method using stochastic gradients. Using the Neon method, one can convert a series of optimization algorithms whose update rules use stochastic gradients and Hessian-vector products (GD, SVRG, FastCubic/CDHS, SGD, SCSG, Natasha2, etc.) into ones using only stochastic gradients, without increasing the gradient cost. The idea of Neon was built upon Oja's iteration for principal component estimation (Oja, 1982), whose global convergence rate was proved to be near-optimal (Li et al., 2017; Jain et al., 2016). Allen-Zhu & Li (2017) later extended this analysis to the rank-$k$ case as well as the gap-free case, the latter of which serves as the pillar of the Neon method.
Other concurrent works
As the current work was carried out in its final phase, we became aware that a similar idea was earlier presented in an algorithm named the StochAstic Recursive grAdient algoritHm (SARAH) (Nguyen et al., 2017a, b). Both our Spider-type algorithms and theirs adopt a recursive stochastic gradient update framework. Nevertheless, our techniques essentially differ from those of Nguyen et al. (2017a, b) in two aspects:
Soon after the initial submission to NIPS and arXiv release of this paper, we became aware that similar convergence rate results for stochastic first-order methods were also achieved independently by the so-called SNVRG algorithm (Zhou et al., 2018b, a).^{4} SNVRG (Zhou et al., 2018b) obtains a gradient complexity of $\widetilde{\mathcal{O}}(\min(n^{1/2}\epsilon^{-2}, \epsilon^{-3}))$ for finding an approximate first-order stationary point, and achieves an analogous gradient complexity for finding an approximate second-order stationary point (Zhou et al., 2018a) for a wide range of $\delta$. By exploiting a third-order smoothness condition, SNVRG can also achieve an approximate second-order stationary point at a reduced gradient cost.

^{4} To the best of our knowledge, the works by Zhou et al. (2018b, a) appeared online on June 20, 2018 and June 22, 2018, respectively.
1.2 Our Contributions
In this work, we propose the Stochastic Path-Integrated Differential Estimator (Spider) technique, which significantly reduces excessive access to stochastic oracles and hence the time complexity. This technique can potentially be applied to many stochastic estimation problems.

As a first application of our Spider technique, we propose the SpiderSFO algorithm (Algorithm 1) for finding an approximate first-order stationary point for the nonconvex stochastic optimization problem (1.2), and prove the optimality of its rate in at least one case. Inspired by the recent works Johnson & Zhang (2013); Carmon et al. (2016, 2017b), and independently of Zhou et al. (2018b, a), this is the first time that a gradient cost of $\mathcal{O}(\min(n^{1/2}\epsilon^{-2}, \epsilon^{-3}))$, in both the upper bound and the (finite-sum only) lower bound, has been obtained for finding first-order stationary points of problem (1.2).

Following Carmon et al. (2016); Allen-Zhu & Li (2018); Xu et al. (2017), we propose the SpiderSFO^{+} algorithm (Algorithm 2) for finding an approximate second-order stationary point for the nonconvex stochastic optimization problem. To the best of our knowledge, this is also the first time that such a gradient cost has been achieved under standard assumptions.

As a second application of our Spider technique, we apply it to zeroth-order optimization for problem (1.2) and obtain a significantly reduced number of individual function accesses. To the best of our knowledge, this is also the first time that a variance-reduction technique (Schmidt et al., 2017; Johnson & Zhang, 2013) has been used to reduce the number of individual function accesses for nonconvex zeroth-order problems.
Organization. The rest of this paper is organized as follows. §2 presents the core idea of the stochastic path-integrated differential estimator, which can track certain quantities with much-reduced computational costs. §3 provides the Spider method for stochastic first-order optimization, together with the convergence rate theorems of this paper for finding approximate first-order and second-order stationary points, and details a comparison with concurrent works. §4 provides the Spider method for stochastic zeroth-order optimization and the relevant convergence rate theorems. §5 concludes the paper with future directions. All detailed proofs are deferred to the appendix, in their order of appearance.
Notation. Throughout this paper, we treat the parameters $L$, $\Delta$, $\sigma$, and $\rho$, to be specified later, as global constants. Let $\|\cdot\|$ denote the Euclidean norm of a vector or the spectral norm of a square matrix. For a sequence of vectors $a_k$ and positive scalars $b_k$, denote $a_k = \mathcal{O}(b_k)$ if there is a global constant $C$ such that $\|a_k\| \le C\, b_k$, where $\widetilde{\mathcal{O}}(b_k)$ additionally hides a polylogarithmic factor of the parameters. Denote $a_k = \Omega(b_k)$ if there is a global constant $C$ such that $\|a_k\| \ge C\, b_k$. Let $\lambda_{\min}(A)$ denote the least eigenvalue of a real symmetric matrix $A$. For fixed $k' \ge k \ge 0$, let $x_{k:k'}$ denote the sequence $(x_k, \dots, x_{k'})$. Let $|S|$ denote the cardinality of a multiset $S$ of samples (a generic set that allows elements of multiple instances). For simplicity, we further denote the averaged subsampled stochastic estimator $B_S := (1/|S|)\sum_{i \in S} B_i$ and the averaged subsampled gradient $\nabla f_S := (1/|S|)\sum_{i \in S} \nabla f_i$. Other notations are explained at their first appearance.

2 Stochastic Path-Integrated Differential Estimator: Core Idea
In this section, we present in detail the idea underlying our Stochastic Path-Integrated Differential Estimator (Spider) technique and the algorithm design it supports. As the reader will see, this technique significantly reduces excessive access to the stochastic oracle and hence the complexity; it is of independent interest and has potential applications in many stochastic estimation problems.
Let us consider an arbitrary deterministic vector quantity $Q(x)$. Assume that we observe a sequence $x_{0:K}$, and that we want to dynamically track $Q(x_k)$ for $k = 0, 1, \dots, K$. Assume further that we have an initial estimate $\widetilde{Q}(x_0)$ of $Q(x_0)$, and an unbiased estimate $\xi_i(x_{0:i})$ of $Q(x_i) - Q(x_{i-1})$ such that for each $i = 1, \dots, K$,
$\mathbb{E}\left[\xi_i(x_{0:i}) \mid x_{0:i}\right] = Q(x_i) - Q(x_{i-1}).$
Then we can integrate (in the discrete sense) the stochastic differential estimates as
(2.1)  $\widetilde{Q}(x_{0:K}) := \widetilde{Q}(x_0) + \sum_{i=1}^{K} \xi_i(x_{0:i}).$
We call the estimator $\widetilde{Q}(x_{0:K})$ the Stochastic Path-Integrated Differential EstimatoR, or Spider for brevity. The following proposition bounds the error of our estimator, in terms of both expectation and high probability:
Proposition 1.
We have:
(i) The martingale variance bound
(2.2)  $\mathbb{E}\left\|\widetilde{Q}(x_{0:K}) - Q(x_K)\right\|^2 = \mathbb{E}\left\|\widetilde{Q}(x_0) - Q(x_0)\right\|^2 + \sum_{i=1}^{K}\mathbb{E}\left\|\xi_i(x_{0:i}) - \left(Q(x_i) - Q(x_{i-1})\right)\right\|^2.$
(ii) Suppose the initial error is almost surely bounded as in
(2.3)  and that for each $i$ the martingale differences are almost surely bounded as in
(2.4)  Then for any $K$ and a given $\delta \in (0,1)$, we have with probability at least $1 - \delta$ the bound
(2.5)
Proposition 1(i) follows easily from the properties of square-integrable martingales. To prove the high-probability bound in Proposition 1(ii), we apply an Azuma–Hoeffding-type concentration inequality (Pinelis, 1994). See §A in the Appendix for more details.
Now, let $\xi$ map any pair of iterates to a random estimate such that, conditioning on the observed sequence $x_{0:i}$, we have for each $i$,
(2.6)
At each step $i$, let $S_*$ be a subset that samples elements with replacement, and let the stochastic estimator satisfy
(2.7)
for all $i$. Finally, we set our estimator of $Q(x_K)$ to be the path-integrated sum of these mini-batch difference estimates.
Applying Proposition 1 immediately yields the following lemma, which gives an error bound for the estimator in terms of the second moment of the mini-batch difference estimates:

Lemma 1.
Under condition (2.7), we have for all $K$,
(2.8)
It turns out that one can use Spider to track many quantities of interest, such as stochastic gradients, function values, zeroth-order estimated gradients, functionals of Hessian matrices, etc. The Spider-based algorithms proposed in this paper take $Q(x)$ to be the gradient and the zeroth-order estimated gradient, respectively.
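To make the path-integration idea concrete, here is a minimal NumPy sketch of our own (all names are illustrative, not from the paper) that tracks a deterministic quantity $Q(x)$, here the gradient of $f(x) = \frac{1}{2}\|x\|^2$, by integrating noisy but unbiased difference estimates along an iterate path, as in (2.1):

```python
import numpy as np

rng = np.random.default_rng(0)

def Q(x):
    # Deterministic quantity to track: the gradient of f(x) = 0.5 * ||x||^2.
    return x

def noisy_difference(x_new, x_old, noise=0.01):
    # Unbiased stochastic estimate xi_i of Q(x_i) - Q(x_{i-1}).
    return (Q(x_new) - Q(x_old)) + noise * rng.standard_normal(x_new.shape)

# An observed iterate path x_0, ..., x_K.
path = [np.array([1.0, 2.0]) - 0.1 * k * np.ones(2) for k in range(11)]

# Spider: start from an (here exact) initial estimate and integrate the
# stochastic differential estimates along the path, as in (2.1).
estimate = Q(path[0]).copy()
for i in range(1, len(path)):
    estimate = estimate + noisy_difference(path[i], path[i - 1])

error = np.linalg.norm(estimate - Q(path[-1]))
```

Because each difference estimate is unbiased, the accumulated error behaves like a martingale: it grows only with the square root of the number of steps times the per-step noise, which is the content of Proposition 1(i).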
3 SPIDER for Stochastic First-Order Method
In this section, we apply Spider to the tasks of finding first-order and second-order stationary points in nonconvex stochastic optimization. The main advantage of SpiderSFO lies in using Spider to estimate the gradient at a low computational cost. We introduce the basic settings and assumptions in §3.1 and present the main error-bound theorems for finding approximate first-order and second-order stationary points in §3.2 and §3.3, respectively.
3.1 Settings and Assumptions
We first introduce the formal definitions of approximate first-order and second-order stationary points.

Definition 1.
We call $x$ an approximate first-order stationary point, or simply an FSP, if
(3.1)  $\|\nabla f(x)\| \le \epsilon.$
Also, we call $x$ an approximate second-order stationary point, or simply an SSP, if
(3.2)  $\|\nabla f(x)\| \le \epsilon \quad \text{and} \quad \lambda_{\min}\left(\nabla^2 f(x)\right) \ge -\delta.$
The definition of an approximate second-order stationary point generalizes the classical version with $\delta = \sqrt{\rho\epsilon}$; see e.g. Nesterov & Polyak (2006). For our analysis, we also make the following additional assumption:
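As a quick illustration (a sketch of our own, not part of the paper's algorithms), the two conditions above can be checked numerically for a point whose gradient and Hessian are available:

```python
import numpy as np

def is_fsp(grad, eps):
    # First-order stationarity check (3.1): ||grad f(x)|| <= eps.
    return np.linalg.norm(grad) <= eps

def is_ssp(grad, hess, eps, delta):
    # Second-order check (3.2): additionally lambda_min(hess f(x)) >= -delta.
    return is_fsp(grad, eps) and np.linalg.eigvalsh(hess).min() >= -delta

# The origin is a saddle of f(x) = 0.5 * (x1^2 - x2^2): the gradient vanishes,
# but the Hessian diag(1, -1) has a negative eigenvalue.
grad = np.zeros(2)
hess = np.diag([1.0, -1.0])
fsp = is_fsp(grad, 1e-3)             # True: first-order stationary
ssp = is_ssp(grad, hess, 1e-3, 0.1)  # False: a negative-curvature direction exists
```

This is exactly why FSPs are insufficient for the saddle-ridden landscapes discussed in §1: a point can satisfy (3.1) while violating (3.2).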
Assumption 1.
We assume the following:

(i) The initial objective gap $\Delta := f(x_0) - f^*$ is finite, where $f^*$ is the global infimum value of $f$;

(ii) The component function $f_i$ has an averaged $L$-Lipschitz gradient, i.e. for all $x, y$, $\mathbb{E}\left\|\nabla f_i(x) - \nabla f_i(y)\right\|^2 \le L^2 \|x - y\|^2$;

(iii) (For the online case only) the stochastic gradient has a finite variance bounded by $\sigma^2$, i.e. $\mathbb{E}\left\|\nabla f_i(x) - \nabla f(x)\right\|^2 \le \sigma^2$.
Alternatively, to obtain high-probability results using concentration inequalities, we make the following more stringent assumptions:

Assumption 2.
We assume that Assumption 1 holds and, in addition,

(ii') (Optional) each component function $f_i$ has an $L$-Lipschitz continuous gradient, i.e. for all $x, y$, $\left\|\nabla f_i(x) - \nabla f_i(y)\right\| \le L \|x - y\|$;

Note that when each $f_i$ is twice continuously differentiable, Assumption 1(ii) is equivalent to a second-moment bound on $\nabla^2 f_i(x)$ for all $x$ and is weaker than the additional Assumption 2(ii'), since an almost-sure squared-norm bound controls the variance of any random vector.

(iii') (For the online case only) the gradient of each component function has deviation bounded by $\sigma$ with probability $1$, i.e. for all $x$, $\left\|\nabla f_i(x) - \nabla f(x)\right\| \le \sigma$.

Assumption 2 is common when applying concentration laws to obtain high-probability results.^{5}

^{5} In this paper, we use an Azuma–Hoeffding-type concentration inequality to obtain high-probability results, as in Xu et al. (2017); Allen-Zhu & Li (2018). By applying a Bernstein inequality under Assumption 1, the parameters in Assumption 2 are allowed to be larger without hurting the convergence rate.
For the problem of finding an approximate second-order stationary point, we make, in addition to the previous assumptions, the following:

Assumption 3.
We assume that Assumption 2 (including (ii')) holds and, in addition, each component function $f_i$ has a $\rho$-Lipschitz continuous Hessian, i.e. for all $x, y$, $\left\|\nabla^2 f_i(x) - \nabla^2 f_i(y)\right\| \le \rho \|x - y\|$.
3.2 First-Order Stationary Point
Recall that NGD has the iteration update rule
(3.3)  $x_{k+1} = x_k - \eta \cdot \frac{\nabla f(x_k)}{\|\nabla f(x_k)\|},$
where $\eta$ is a constant step size. The NGD update rule (3.3) ensures that $\|x_{k+1} - x_k\|$ is constantly equal to the step size $\eta$, and it can quickly escape saddle points and converge to a second-order stationary point (Levy, 2016). We propose SpiderSFO in Algorithm 1, which can be seen as a stochastic variant of NGD with the Spider technique applied, so as to maintain a gradient estimator in each epoch at higher accuracy under a limited gradient budget. To analyze the convergence rate of SpiderSFO, let us first consider the online case for Algorithm 1. We let the input parameters be
(3.4)
where $n_0 \ge 1$ is a free parameter to choose.^{6} In this case, $v_k$ in Line 5 of Algorithm 1 is a Spider for $\nabla f(x_k)$. To see this, recall that $\nabla f_{S_2}$ is the averaged stochastic gradient over the mini-batch $S_2$ drawn at step $k$, and
(3.5)  $v_k = \nabla f_{S_2}(x_k) - \nabla f_{S_2}(x_{k-1}) + v_{k-1}.$
Plugging this into Lemma 1 of §2, we can use $v_k$ in Algorithm 1 as the Spider estimate and conclude the following lemma, which is pivotal to our analysis.

^{6} When $n_0$ takes its extreme value, the mini-batch size attains the largest value that Algorithm 1 allows one to choose.
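For intuition, here is a schematic NumPy sketch of our own (toy problem and illustrative parameter choices, not the paper's Algorithm 1 verbatim) showing the epoch structure: a large-batch gradient refresh every $q$ steps, the Spider update (3.5) in between, and an NGD-style normalized step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite-sum objective: f(x) = (1/n) * sum_i 0.5 * ||x - a_i||^2,
# whose unique stationary point is the mean of the a_i.
n, d = 200, 5
A = rng.standard_normal((n, d))

def grad_batch(x, idx):
    # Averaged subsampled gradient over the index multiset idx.
    return np.mean(x[None, :] - A[idx], axis=0)

def spider_sfo(x0, n_epochs=30, q=10, s2=10, eps=1e-2, L=1.0, n0=1.0):
    x, x_prev, v = x0.copy(), x0.copy(), None
    step = eps / (L * n0)                 # constant step length, as in NGD
    for k in range(n_epochs * q):
        if k % q == 0:
            v = grad_batch(x, np.arange(n))       # large-batch refresh
        else:
            idx = rng.integers(0, n, size=s2)     # small batch, with replacement
            v = grad_batch(x, idx) - grad_batch(x_prev, idx) + v  # Spider (3.5)
        x_prev = x
        x = x - step * v / max(np.linalg.norm(v), 1e-12)  # normalized step
    return x

x_out = spider_sfo(np.zeros(d))
```

On this particular quadratic the mini-batch differences happen to be exact, so $v_k$ coincides with the full gradient; in general, Lemma 2 is what controls how far $v_k$ drifts from $\nabla f(x_k)$ within an epoch.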
Lemma 2.
Lemma 2 shows that our Spider estimate $v_k$ maintains an error of order $\epsilon$ relative to the true gradient. Using this lemma, we are ready to present the following results for the Stochastic First-Order (SFO) method for finding first-order stationary points of (1.2).
Upper Bound for Finding First-Order Stationary Points, in Expectation
Theorem 1 (First-Order Stationary Point, online setting, expectation).
The relatively reduced mini-batch size serves as a key ingredient in the superior performance of SpiderSFO. For illustration, let us compare the sampling efficiency of SGD, SCSG, and SpiderSFO in special cases. With some involved analysis of these algorithms, one concludes that, to ensure a sufficient function value decrease at each iteration,

for SGD the choice of mini-batch size is $\mathcal{O}(\sigma^2/\epsilon^2)$;

our SpiderSFO needs only a significantly reduced mini-batch size.
Turning to the finite-sum case, and analogously to the online case, we let
(3.7)
where $n_0$ is chosen analogously. In this case, one computes the full gradient in Line 3 of Algorithm 1. We state our second upper-bound result:
Theorem 2 (First-Order Stationary Point, finite-sum setting).
Lower Bound for Finding First-Order Stationary Points
To conclude the optimality of our algorithm, we need an algorithmic lower-bound result (Carmon et al., 2017b; Woodworth & Srebro, 2016). Consider the finite-sum case and any randomized algorithm $A$ that maps functions to a sequence of iterates, with
(3.8)
where the $A^k$ are measurable mappings, $i_k$ is the index of the individual function chosen by $A$ at iteration $k$, and the algorithm's seed is a uniform random vector. The lower-bound result for solving (1.2) is stated as follows:
Theorem 3 (Lower bound for SFO in the finite-sum setting).
Note that the condition in Theorem 3 ensures that our lower bound matches, up to a constant factor of the relevant parameters, our upper bound for the finite-sum case in Theorem 2, which is hence near-optimal. Inspired by Carmon et al. (2017b), our proof of Theorem 3 utilizes a specific counterexample function that requires at least $\Omega(\sqrt{n}\,\epsilon^{-2})$ stochastic gradient accesses. Carmon et al. (2017b) analyzed such a counterexample in the deterministic case, and we generalize the analysis to the finite-sum case.
Remark 1.
Note that by choosing $n$ sufficiently large, the lower-bound complexity in Theorem 3 can exceed the online upper bound. We emphasize that this does not violate the upper bound in the online case [Theorem 1], since the counterexample established in the lower bound depends not on the stochastic gradient variance $\sigma^2$ specified in Assumption 1(iii), but on the component number $n$. To obtain a lower-bound result for the online case under the additional Assumption 1(iii), with more effort one might be able to construct a second counterexample that requires a matching number of stochastic gradient accesses with knowledge of $\sigma^2$ instead of $n$. We leave this as future work.
Upper Bound for Finding First-Order Stationary Points, in High Probability
We now consider obtaining high-probability results. With Theorem 1 and Theorem 2 in hand, by Markov's inequality the output satisfies the desired gradient-norm bound with constant probability. Thus, a straightforward way to obtain a high-probability result is to add a verification step at the end of Algorithm 1, in which we check whether the output satisfies $\|\nabla f(x)\| \le \epsilon$ (in the online case, when full gradients are inaccessible, under Assumption 2(iii') we can draw samples to estimate the gradient norm to high accuracy). If the check fails, we restart Algorithm 1 (at most logarithmically many times) until it finds a desired solution. However, because this approach requires running Algorithm 1 multiple times, in the following we show that, under Assumption 2 (including (ii')), the original Algorithm 1 already obtains a solution, at the cost of an additional polylogarithmic factor, with high probability.
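The restart-and-verify amplification just described is generic; a minimal sketch of our own (with hypothetical stand-ins for the run and the verification, not from the paper) looks as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def boost_success(run_once, verify, max_restarts=20):
    """Amplify an in-expectation guarantee into a high-probability one: by
    Markov's inequality a single run succeeds with constant probability, so
    logarithmically many independent restarts suffice."""
    x = None
    for _ in range(max_restarts):
        x = run_once()
        if verify(x):
            return x
    return x  # last candidate if verification never passed

# Toy illustration with hypothetical stand-ins: each "run" returns a point
# whose measured gradient norm is random; "verify" checks it against eps.
run_once = lambda: rng.uniform(0.0, 2.0)  # stand-in for one run of Algorithm 1
verify = lambda g: g <= 0.5               # stand-in for checking ||grad|| <= eps
g_out = boost_success(run_once, verify)
```

If each run succeeds with probability at least $1/4$, then $\log_{4/3}(1/p)$ restarts drive the failure probability below $p$.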
Theorem 4 (First-Order Stationary Point, online setting, high probability).
For the online case, set the parameters as in (3.4). Then, under Assumption 2 (including (ii')), with probability at least $1 - p$, Algorithm 1 terminates within the stated number of iterations and outputs an $\tilde{x}$ satisfying
(3.9)
The gradient cost to find an FSP satisfying (3.9) with probability $1 - p$ is bounded as stated, for any admissible choice of $n_0$. Treating the remaining parameters as constants, the stochastic gradient complexity is $\widetilde{\mathcal{O}}(\epsilon^{-3})$.
Theorem 5 (First-Order Stationary Point, finite-sum setting, high probability).
In the finite-sum case, set the parameters as in (3.7), i.e. we compute the full gradient in Line 3. Then, under Assumption 2 (including (ii')), with probability at least $1 - p$, Algorithm 1 terminates within the stated number of iterations and outputs an $\tilde{x}$ satisfying
(3.10)
The gradient cost to find an FSP satisfying (3.10) with probability $1 - p$ is bounded as stated, for any admissible choice of $n_0$. Treating the remaining parameters as constants, the stochastic gradient complexity is $\widetilde{\mathcal{O}}(n + n^{1/2}\epsilon^{-2})$.
3.3 Second-Order Stationary Point
To find a second-order stationary point satisfying (3.2), we can fuse our SpiderSFO in Algorithm 1 with a Negative-Curvature-Search (NC-Search) iteration that solves the following task: given a point $x$, decide whether $\lambda_{\min}(\nabla^2 f(x)) \ge -\delta$, or find a unit vector $w$ such that $w^\top \nabla^2 f(x)\, w \le -\delta/2$ (for numerical reasons, one has to leave some room between the two bounds). In the online case, NC-Search can be solved efficiently by Oja's algorithm (Oja, 1982; Allen-Zhu, 2018) and also by Neon (Allen-Zhu & Li, 2018; Xu et al., 2017) using only stochastic gradients.^{7} When such a $w$ is found, one can update $x$ along $\pm w$, where the sign is chosen uniformly at random. Then, under Assumption 3, Taylor's expansion implies (Allen-Zhu & Li, 2018)
(3.11)
Taking expectations, the first-order term vanishes. This indicates that when we find a direction of negative curvature, updating along it decreases the function value by $\mathcal{O}(\delta^3/\rho^2)$ in expectation. Our SpiderSFO algorithm fused with NC-Search proceeds in the following steps:

^{7} Recall that the NEgative-curvature-Originated-from-Noise method (Neon for short), proposed independently by Allen-Zhu & Li (2018); Xu et al. (2017), is a generic procedure that converts an algorithm that finds an approximate first-order stationary point into one that finds an approximate second-order stationary point.

Run an efficient NC-Search iteration to find an approximate negative-curvature direction $w$ using stochastic gradients, e.g. via Neon2 (Allen-Zhu & Li, 2018).

If NC-Search finds such a $w$, update the iterate along $\pm w$ in equal-length mini-steps, and simultaneously use Spider to maintain an estimate of the gradient. Then go to Step 1.

If not, run SpiderSFO directly for one epoch, reusing the Spider estimate (without restarting it) from Step 2. Then go to Step 1.

During Step 3, if we find an iterate whose gradient estimate is sufficiently small, return it.
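The mini-step idea in Step 2 can be sketched as follows (a schematic illustration of our own, not the paper's Algorithm 2; the Spider maintenance is indicated only by a comment):

```python
import numpy as np

rng = np.random.default_rng(0)

def nc_descent_ministeps(x, w, total_len=0.1, num_ministeps=10):
    """Move along an (approximate) negative-curvature unit direction w in many
    small, equal-length mini-steps instead of one large step, so that a Spider
    gradient estimate could be maintained along the way."""
    sign = 1.0 if rng.random() < 0.5 else -1.0  # random sign, as below (3.11)
    eta = total_len / num_ministeps
    for _ in range(num_ministeps):
        x = x + sign * eta * w
        # (In SpiderSFO+, each mini-step would also update the Spider estimate.)
    return x

# Toy saddle: f(x) = 0.5 * (x1^2 - x2^2); e2 is a negative-curvature direction.
f = lambda x: 0.5 * (x[0] ** 2 - x[1] ** 2)
x0 = np.array([0.0, 1e-6])
x1 = nc_descent_ministeps(x0, np.array([0.0, 1.0]))
```

Because the sign is random, the linear term of the Taylor expansion cancels in expectation, while the negative quadratic term guarantees descent; on this symmetric toy saddle either sign decreases $f$.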
The formal pseudocode of the algorithm described above, which we refer to as SpiderSFO^{+}, is detailed in Algorithm 2.^{8} The core reason that SpiderSFO^{+} enjoys a highly competitive convergence rate is that, instead of performing a single large step in the approximate direction of negative curvature as in Neon2 (Allen-Zhu & Li, 2018), we split that one large step into many small, equal-length mini-steps in Step 2, where each mini-step moves the iterate by a fixed small distance. This allows the algorithm to continually maintain the Spider estimate of the current gradient in Step 3 and to avoid recomputing the gradient in Step 1.

^{8} In our initial version, SpiderSFO^{+} first finds an FSP and then runs the NC-Search iteration to find an SSP, which also ensures a competitive rate. Our new SpiderSFO^{+} is easier to fuse with the momentum technique when $n$ is small; see the discussion later.
Our final result on the convergence rate of Algorithm 2 is stated as:
Theorem 6 (Second-Order Stationary Point).
Let Assumption 3 hold. For the online case, set the parameters as in (3.4), with any admissible choice of $n_0$. Then, with probability at least $1 - p$,^{9} Algorithm 2 outputs an $\tilde{x}$ satisfying
(3.12)
The gradient cost to find a second-order stationary point with probability at least $1 - p$ is upper bounded accordingly. Analogously, for the finite-sum case, under the same setting as Theorem 2, set the parameters as in (3.7); then, with probability $1 - p$, Algorithm 2 outputs an $\tilde{x}$ satisfying (3.12), with the corresponding gradient cost.

^{9} By multiple rounds (at most logarithmically many) of verification and restarts of Algorithm 2, one can also obtain a high-probability result.
Corollary 7.
Treating $\Delta$, $L$, $\sigma$, and $\rho$ as positive constants, with high probability the gradient cost for finding an approximate second-order stationary point is as stated for the online and finite-sum cases, respectively. When $\delta = \mathcal{O}(\epsilon^{1/2})$, the gradient cost in the online case is $\widetilde{\mathcal{O}}(\epsilon^{-3})$.
Notice that one may directly apply an online variant of the Neon method to the SpiderSFO Algorithm 1, alternating between Second-Order Descent (without maintaining Spider) and First-Order Descent (running a fresh SpiderSFO). Simple analysis suggests that such a Neon+SpiderSFO algorithm achieves a certain gradient cost in both the online and finite-sum cases (Allen-Zhu & Li, 2018; Xu et al., 2017). We discuss the differences in detail.

The dominant term in the gradient cost of Neon+SpiderSFO is the so-called coupling term in the regime of interest, for the online and finite-sum cases separately. Due to this term, most convergence-rate results in concurrent works for the online case, such as Reddi et al. (2018); Tripuraneni et al. (2018); Xu et al. (2017); Allen-Zhu & Li (2018); Zhou et al. (2018a), have gradient costs that cannot break this barrier when $\delta$ is chosen sufficiently small. Observe that one always needs to run a fresh SpiderSFO, which costs at least a full epoch of stochastic gradient accesses.

Our analysis sharpens the seemingly non-improvable coupling term by replacing the single large Neon step with many mini-steps. This modification enables us to maintain the Spider estimates and obtain a smaller coupling term for SpiderSFO^{+}, which improves upon the Neon coupling term by a nontrivial factor.

For the finite-sum case, SpiderSFO^{+} enjoys a convergence rate that is faster than existing methods only in the regime of moderate $n$ [Table 1]. Outside this regime, using Spider to track the gradient in the Neon procedure can be more costly than applying appropriate acceleration techniques (Agarwal et al., 2017; Carmon et al., 2016).^{10} This is because it is well known that the momentum technique (Nesterov, 1983) provably ensures faster convergence rates when $\epsilon$ is sufficiently small (Shalev-Shwartz & Zhang, 2016). One can also apply the momentum technique to solve the subproblems in Steps 1 and 3, as in Carmon et al. (2016); Allen-Zhu & Li (2018), and thus achieve the state-of-the-art gradient cost in all scenarios.

^{10} SpiderSFO^{+} enjoys a faster rate than Neon+SpiderSFO in the regime where computing the "full" gradient dominates the gradient cost, in both the online and the finite-sum cases.
3.4 Comparison with Concurrent Works
[Table 1: gradient-cost comparison, in the online and finite-sum settings, of GD/SGD (Nesterov, 2004), SVRG/SCSG, SpiderSFO (this work), and perturbed GD/SGD variants; the individual rate entries are not recoverable here.]