1 Introduction
In the era of machine learning, optimization problems associated with practical applications have a rapidly increasing data volume. In many scenarios, such optimization problems take the following composite form:
(P)  $\min_{x \in \mathbb{R}^d} \Psi(x) := f(x) + g(x), \quad f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x),$
where $x \in \mathbb{R}^d$ is the optimization variable, the integer $n$ denotes the total sample size, $f_i$ is a differentiable function that corresponds to the loss on the $i$-th data sample, and $g$ denotes a possibly nonsmooth regularizer. Solving the problem (P) can be demanding due to the tremendous data size $n$ and the complex machine learning models (e.g., neural networks) that result in a highly nonconvex and nonsmooth loss landscape (Goodfellow et al., 2016). Therefore, stochastic gradient-like algorithms are commonly used in practice for their sample efficiency and implementation simplicity, while maintaining provable convergence guarantees in nonconvex optimization.
A variety of stochastic algorithms have been proposed in the literature for solving the problem (P) without the regularizer $g$
(i.e., smooth nonconvex optimization). The simplest algorithm is stochastic gradient descent (SGD) (Robbins & Monro, 1951; Bottou, 2010), which approximates the full gradient by one mini-batch of stochastic samples. Although SGD has a low per-iteration complexity, its convergence rate can deteriorate significantly due to the intrinsic variance of its stochastic estimator. This issue has been successfully resolved by more advanced stochastic variance-reduced gradient estimators that induce a smaller variance, leading to a variety of stochastic variance-reduced algorithms such as SAG (Schmidt et al., 2017), SAGA (Defazio et al., 2014), SVRG (Johnson & Zhang, 2013), etc. To further handle the nonsmooth regularizer $g$, proximal versions of these algorithms have been developed (Xiao & Zhang, 2014; Ghadimi et al., 2016; Reddi et al., 2016). However, these algorithms do not yield an optimal stochastic gradient oracle complexity for generic nonconvex optimization.
Recently, (Nguyen et al., 2017a, b) and (Fang et al., 2018) proposed a new type of stochastic variance-reduced algorithms called SARAH and SPIDER, respectively. Specifically, it has been shown in (Fang et al., 2018) that, under an accuracy-dependent stepsize, a normalized gradient descent step taken in SPIDER yields the optimal stochastic gradient oracle complexity (in the parameter regime $n \le \mathcal{O}(\epsilon^{-4})$, where $\epsilon$ corresponds to the desired accuracy) for solving the problem (P) without the regularizer $g$. In a subsequent work (Wang et al., 2018a), the authors further proposed an improved algorithm scheme called Proximal SpiderBoost that allows a much larger constant-level stepsize and achieves the same order-level stochastic gradient oracle complexity for solving the problem (P) under a convex regularizer $g$.
Although the aforementioned SPIDER-based algorithms achieve the optimal stochastic gradient oracle complexity in nonconvex optimization, recent works (Nguyen et al., 2017b; Fang et al., 2018) have found their practical performance to be hardly advantageous over that of the traditional SVRG. Therefore, it is of vital importance to exploit the structure of the SPIDER estimator in other algorithmic dimensions to further improve the practical performance of SPIDER-based algorithms. Momentum is such a promising and important perspective. However, two major challenges remain in designing momentum schemes for variance-reduced algorithms in nonconvex optimization. First, while momentum schemes have been well studied for (stochastic) gradient algorithms (Ghadimi & Lan, 2016) in nonconvex optimization, the convergence guarantee of stochastic variance-reduced algorithms with momentum has only been explored for SVRG in certain convex scenarios (Nitanda, 2016; Allen-Zhu, 2017, 2018; Shang et al., 2018). Therefore, it is not clear whether a momentum scheme can be applied to stochastic variance-reduced algorithms based on SPIDER and yield the optimal stochastic gradient oracle complexity for nonconvex optimization. Second, the existing momentum scheme for stochastic algorithms that solve the nonconvex problem (P) has a convergence guarantee only for convex regularizers with a bounded domain (Ghadimi & Lan, 2016), which does not cover the variety of application scenarios where regularizers with unbounded domain (e.g., the $\ell_1$ norm) are commonly used.
In this paper, we explore momentum schemes for SPIDER-based variance reduction algorithms that can solve the nonconvex and nonsmooth problem (P) under a much broader choice of regularizers with convergence guarantees. We summarize our contributions as follows.
Summary of Contributions
We consider solving the problem (P) with nonconvex loss functions and an arbitrary convex regularizer (possibly nonsmooth). We propose Proximal SPIDER-M, a proximal stochastic algorithm that exploits both the SPIDER variance-reduction scheme and a momentum scheme for solving the problem (P). We show that the output point generated by Proximal SPIDER-M satisfies a generalized $\epsilon$-first-order stationary condition within $\mathcal{O}(\epsilon^{-2})$ iterations, and the corresponding stochastic gradient oracle complexity is in the order of $\mathcal{O}(n + \sqrt{n}\epsilon^{-2})$, matching the complexity lower bound for nonconvex optimization. To the best of our knowledge, this is the first known theoretical guarantee for stochastic variance-reduced algorithms with momentum in nonconvex optimization. We also note that the design of our momentum scheme is applicable to arbitrary convex regularizers, which significantly relaxes the constraint of the existing momentum scheme that requires the regularizer to have a bounded domain in order to have a convergence guarantee for nonconvex optimization (Ghadimi & Lan, 2016).
We further propose two variants of the momentum scheme for Proximal SPIDER-M, i.e., epochwise-diminishing momentum and epochwise-restart momentum. We establish the same order-level oracle complexity result in nonconvex optimization as mentioned above. To the best of our knowledge, this is the first formal theoretical guarantee for epochwise-diminishing and restart momentum schemes in nonconvex optimization.
Moreover, we generalize Proximal SPIDER-M to solve the problem (P) in an online setting, and show that the algorithm satisfies the generalized $\epsilon$-first-order stationary condition within $\mathcal{O}(\epsilon^{-2})$ iterations, and the associated stochastic gradient oracle complexity is in the order of $\mathcal{O}(\epsilon^{-3})$, matching the state-of-the-art result. Our numerical experiments demonstrate that the momentum scheme substantially improves the practical performance of SPIDER and outperforms other momentum-based variance-reduced algorithms.
Related Work
Stochastic algorithms for nonconvex optimization: For nonconvex optimization, SGD has been shown to achieve an $\epsilon$-first-order stationary condition with an overall stochastic gradient oracle complexity of $\mathcal{O}(\epsilon^{-4})$ (Ghadimi et al., 2016). Convergence guarantees for various stochastic variance-reduced algorithms have been established in nonconvex optimization. Specifically, SAGA and SVRG have been shown to yield an overall stochastic gradient oracle complexity of $\mathcal{O}(n + n^{2/3}\epsilon^{-2})$ (Reddi et al., 2016; Allen-Zhu & Hazan, 2016) to achieve an $\epsilon$-first-order stationary condition. More recently, (Nguyen et al., 2017a, b) proposed a novel stochastic variance reduction algorithm named SARAH and showed that the corresponding stochastic gradient oracle complexity is $\mathcal{O}(n + \epsilon^{-4})$ to attain an $\epsilon$-first-order stationary point. The SPIDER algorithm (Fang et al., 2018) is a variant of SARAH that uses the same gradient estimator as SARAH but adopts a normalized gradient descent update. (Fang et al., 2018) showed that SPIDER achieves an overall $\mathcal{O}(n + n^{1/2}\epsilon^{-2})$ stochastic gradient oracle complexity, which is optimal within the regime $n \le \mathcal{O}(\epsilon^{-4})$. (Wang et al., 2018a) further proposed an improved SPIDER scheme that allows a constant-level stepsize and can solve composite nonconvex optimization problems. In (Zhou et al., 2018), the authors proposed a nested stochastic variance reduction scheme for nonconvex optimization and achieved the same order-level oracle complexity result as that of SPIDER. More recently, (Zhou et al., 2019; Zhang et al., 2018) further applied the SARAH and SPIDER estimators to nonconvex optimization problems over manifolds.
Momentum schemes for nonconvex optimization: Momentum schemes were originally designed for accelerating gradient algorithms to achieve an optimal convergence rate in convex optimization (Nesterov, 2014; Beck & Teboulle, 2009; Tseng, 2010; Ghadimi & Lan, 2016). For nonconvex optimization, (Ghadimi & Lan, 2016) established convergence of stochastic gradient algorithms with momentum to an $\epsilon$-first-order stationary point with an overall stochastic gradient oracle complexity of $\mathcal{O}(\epsilon^{-4})$. The convergence guarantee of SVRG with momentum has been explored under a certain local gradient dominance geometry in nonconvex optimization (Li et al., 2017). However, the momentum scheme there requires comparing the objective function value (and hence calculating the total loss) at each iteration, and is therefore not sample efficient. A similar momentum scheme has also been explored in second-order algorithms for nonconvex optimization (Wang et al., 2018b).
2 Preliminaries
In this section, we introduce some definitions and assumptions that are used throughout the paper. Recall that we are interested in solving the following optimization problem with a composite objective function
(P)  $\min_{x \in \mathbb{R}^d} \Psi(x) := f(x) + g(x), \quad f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x),$
where the function $f$ denotes the total loss on the training data and the function $g$ corresponds to the regularizer that penalizes the violation of a desired structure (e.g., sparsity, low-rankness, etc.). We adopt the following standard assumptions on the problem (P).
Assumption 1.
The objective function in the problem (P) satisfies:
1. The function $\Psi$ is bounded below, i.e.,
(1) $\Psi^* := \inf_{x \in \mathbb{R}^d} \Psi(x) > -\infty.$
2. The loss functions $f_i$ are $L$-smooth, i.e., for all $i = 1, \ldots, n$, there exists an $L > 0$ such that for all $x, y \in \mathbb{R}^d$,
(2) $\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|.$
3. The regularizer function $g$ is proper and convex. (An extended-valued function is called proper if its domain is nonempty.)
Intuitively, item 1 of Assumption 1 guarantees the feasibility of the optimization problem (P) and item 2 imposes smoothness on the individual loss functions. Many convex regularizers (e.g., $\ell_1$, elastic net, etc.) are not differentiable, and therefore one cannot use the gradient to evaluate the first-order stationary condition for such a nonsmooth composite objective function. This motivates us to introduce a generalized notion of gradient, as we elaborate below.
We first introduce the following proximal mapping that is useful to handle the nonsmoothness of a function.
Definition 1 (Proximal mapping).
For any proper and convex function $h$, its proximal mapping evaluated at $x \in \mathbb{R}^d$ with parameter $\eta > 0$ is the unique point defined as
$\mathrm{prox}_{\eta h}(x) := \arg\min_{u \in \mathbb{R}^d} \big\{ h(u) + \frac{1}{2\eta} \|u - x\|^2 \big\}.$
The proximal mapping is uniquely defined for convex functions. In particular, in the special case where $h$ is the indicator function of a convex set, its proximal mapping reduces to the projection operator onto the convex set. More importantly, the proximal mapping can be used to characterize the first-order stationary condition of nonsmooth composite functions in the following way.
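As an illustration of Definition 1, the following minimal sketch evaluates two proximal mappings that have closed forms. It assumes the standard definition of the proximal mapping as the minimizer of $h(u) + \frac{1}{2\eta}\|u - x\|^2$; the function names `prox_l1` and `prox_box_indicator`, the `weight` parameter, and the test vector are hypothetical and chosen only for this example.

```python
import numpy as np

def prox_l1(x, eta, weight=1.0):
    """Proximal mapping of h(u) = weight * ||u||_1 with parameter eta.

    The minimizer has the well-known closed form of soft-thresholding
    with threshold eta * weight.
    """
    return np.sign(x) * np.maximum(np.abs(x) - eta * weight, 0.0)

def prox_box_indicator(x, lo, hi):
    """Proximal mapping of the indicator function of the box [lo, hi]^d.

    For an indicator function the proximal mapping is the Euclidean
    projection onto the set, independent of the parameter eta.
    """
    return np.clip(x, lo, hi)

x = np.array([1.5, -0.2, 0.7])
p_l1 = prox_l1(x, eta=0.5)                 # large entries shrink by 0.5, small ones become 0
p_box = prox_box_indicator(x, -1.0, 1.0)   # entries outside [-1, 1] are clipped
```

The second function mirrors the remark that the proximal mapping of an indicator function is a projection.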
Fact 1.
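(Bauschke & Combettes, 2011) Let $h$ be a proper and convex function. Define the following notion of generalized gradient
(3) $G_\eta(x, \nabla f(x)) := \frac{1}{\eta} \big( x - \mathrm{prox}_{\eta h}(x - \eta \nabla f(x)) \big).$
Then, $x$ is a critical point of the function $f + h$ (i.e., $0 \in \nabla f(x) + \partial h(x)$) if and only if $G_\eta(x, \nabla f(x)) = 0$.
Intuitively, $G_\eta(x, \nabla f(x))$ can be understood as a generalized notion of gradient for the composite objective function. In the special case where $h \equiv 0$, the generalized gradient reduces to the usual gradient $\nabla f(x)$.
Based on the above definition, throughout the paper, we say that a point $x$ satisfies an $\epsilon$-first-order stationary condition of the problem (P) if $\|G_\eta(x, \nabla f(x))\| \le \epsilon$.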
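The stationarity measure above can be checked numerically. The sketch below assumes the usual form of the generalized gradient, $G_\eta(x, u) = (x - \mathrm{prox}_{\eta h}(x - \eta u))/\eta$, and uses a toy objective (a quadratic loss plus an $\ell_1$ penalty) that is not taken from the paper; all names are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal mapping of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def generalized_gradient(x, grad, eta, prox):
    """G_eta(x, grad) = (x - prox(x - eta * grad, eta)) / eta."""
    return (x - prox(x - eta * grad, eta)) / eta

# Toy composite objective (assumed): f(x) = 0.5 * ||x - b||^2, h(x) = mu * ||x||_1.
b = np.array([2.0, 0.3, -1.0])
mu = 0.5
grad_f = lambda x: x - b
prox_h = lambda v, eta: soft_threshold(v, eta * mu)

# The minimizer of f + h is soft_threshold(b, mu); G vanishes there.
x_star = soft_threshold(b, mu)
G_star = generalized_gradient(x_star, grad_f(x_star), 0.1, prox_h)

# With h = 0 the prox is the identity and G equals the ordinary gradient.
x = np.ones(3)
G_smooth = generalized_gradient(x, grad_f(x), 0.1, lambda v, eta: v)
```

A point would then be declared $\epsilon$-stationary when `np.linalg.norm(G)` is at most $\epsilon$.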
3 Proximal SPIDERM for Nonconvex Composite Optimization
In this section, we propose a proximal SPIDER algorithm that incorporates a momentum scheme (referred to as Proximal SPIDER-M) for solving the composite problem (P), and study its theoretical guarantee as well as its oracle complexity.
3.1 Algorithm Design
We present the detailed update rule of Proximal SPIDER-M in Algorithm 1, where “Unif” denotes the uniform sampling scheme with replacement.
To elaborate on the algorithm design, note that Proximal SPIDER-M generates a tuple of variable sequences $\{x_k, y_k, z_k\}_k$ according to the momentum scheme. Specifically, the variables $x_k$ and $y_k$ are updated via proximal gradient-like steps using the gradient estimate $v_k$ proposed for SARAH in (Nguyen et al., 2017a, b) and different stepsizes $\lambda_k$ and $\beta_k$, respectively. Then, their convex combination with momentum coefficient $\alpha_k$ yields the variable $z_k$. We choose a standard diminishing momentum coefficient, which serves to prove the convergence guarantee in nonconvex optimization. We also note that the two updates for $x_k$ and $y_k$ do not introduce extra computation overhead compared to a single update, since they both depend on the same stochastic gradient $v_k$.
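Since Algorithm 1 itself is not reproduced in this text, the following is only a rough sketch of the structure described above, not the authors' exact update rule. The momentum coefficient $2/(k+2)$, the relation between the two stepsizes, the precise form of the coupled update, and the toy problem (least squares with an $\ell_1$ penalty) are all assumptions made for illustration; every name is hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal mapping of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def objective(A, b, mu, x):
    """Toy composite objective: (1/2n)||Ax - b||^2 + mu * ||x||_1."""
    return 0.5 * np.mean((A @ x - b) ** 2) + mu * np.abs(x).sum()

def prox_spider_m_sketch(A, b, mu, num_iters, q, batch, beta, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape

    def grad(x, idx):
        # Average gradient of the least-squares losses indexed by idx.
        return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

    x, y = np.zeros(d), np.zeros(d)
    z_prev, v = None, None
    zs = []
    for k in range(num_iters):
        alpha = 2.0 / (k + 2)              # assumed diminishing momentum coefficient
        lam = (1.0 + alpha) * beta         # assumed relation between the two stepsizes
        z = (1.0 - alpha) * y + alpha * x  # convex combination of y and x
        if k % q == 0:
            v = grad(z, np.arange(n))      # full gradient at the start of an epoch
        else:
            idx = rng.integers(0, n, size=batch)      # uniform sampling with replacement
            v = grad(z, idx) - grad(z_prev, idx) + v  # recursive SARAH/SPIDER estimate
        x_next = soft_threshold(x - lam * v, lam * mu)  # the only prox evaluation
        y = z + (beta / lam) * (x_next - x)             # assumed coupled update of y
        x, z_prev = x_next, z
        zs.append(z)
    return zs

rng = np.random.default_rng(1)
A = rng.normal(size=(400, 5))
x_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
b = A @ x_true + 0.1 * rng.normal(size=400)
L = np.max(np.sum(A ** 2, axis=1))  # smoothness constant of the individual losses
zs = prox_spider_m_sketch(A, b, mu=0.01, num_iters=400, q=20, batch=20,
                          beta=1.0 / (8 * L))
z_out = zs[np.random.default_rng(2).integers(len(zs))]  # randomized output
```

Note that each iteration evaluates the proximal mapping once and reuses the result for both the `x` and `y` updates, which reflects the remark that the two updates share the same stochastic gradient.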
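We want to highlight the difference between our momentum scheme design for Proximal SPIDER-M and the existing momentum scheme designs for proximal SGD in (Ghadimi & Lan, 2016) and proximal SVRG in (Allen-Zhu, 2017). These works use the following proximal gradient steps for updating the variables $x_k$ and $y_k$, where $v_k$ denotes the respective gradient estimate:
(4)  $x_{k+1} = \mathrm{prox}_{\lambda_k g}(x_k - \lambda_k v_k),$
(5)  $y_{k+1} = \mathrm{prox}_{\beta_k g}(z_k - \beta_k v_k).$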
Note that eq. 4 and eq. 5 use different proximal gradient updates that are based on $x_k$ and $z_k$, respectively. In comparison, our momentum scheme in Algorithm 1 applies the same proximal gradient term to update both variables $x_k$ and $y_k$, and therefore requires less computation for evaluating the proximal mapping. Moreover, our update for the variable $y_k$ is not a single proximal gradient update (as opposed to eq. 5); it couples with the variables $x_k$ and $z_k$.
The momentum scheme introduced in (Ghadimi & Lan, 2016) based on eq. 4 and eq. 5 was shown to have a convergence guarantee in nonconvex composite optimization only for convex regularizers that have a bounded domain. Therefore, it cannot yield a provable convergence guarantee for regularizers with unbounded domain, which are commonly used in practical applications, e.g., $\ell_1$, $\ell_2$, elastic net, etc. On the other hand, the momentum scheme introduced in (Allen-Zhu, 2017) was not proven to have a convergence guarantee in nonconvex optimization. In the next subsection, we prove that our momentum scheme in Algorithm 1 has a provable convergence guarantee for nonconvex composite optimization with arbitrary convex regularizers, thereby eliminating the restriction on the regularizers in (Ghadimi & Lan, 2016).
3.2 Convergence and Complexity Analysis
In this subsection, we study the convergence guarantee as well as the stochastic gradient oracle complexity of Proximal SPIDER-M for solving the problem (P). We obtain the following main theorem.
Theorem 1.
Let Assumption 1 hold. Apply Proximal SPIDER-M (see Algorithm 1) to solve the problem (P) with parameters , , and . Then, the output produced by the algorithm satisfies for any provided that the total number of iterations satisfies
(6) 
Moreover, the total number of stochastic gradient oracle calls is at most $\mathcal{O}(n + \sqrt{n}\epsilon^{-2})$ and the total number of proximal mapping oracle calls is at most $\mathcal{O}(\epsilon^{-2})$.
Theorem 1 establishes the convergence rate of Proximal SPIDER-M to satisfy the generalized first-order stationary condition and the corresponding oracle complexity. Specifically, the iteration complexity to achieve the generalized $\epsilon$-first-order stationary condition is in the order of $\mathcal{O}(\epsilon^{-2})$, which matches the state-of-the-art result for stochastic nonsmooth nonconvex optimization (Wang et al., 2018a). Furthermore, the corresponding stochastic gradient oracle complexity matches the lower bound for nonconvex optimization (Fang et al., 2018). Therefore, Proximal SPIDER-M enjoys the same optimal convergence guarantee as Proximal SpiderBoost (Wang et al., 2018a) in nonconvex optimization, and it further benefits from the momentum scheme, which can lead to significant acceleration in practical applications (as we demonstrate via experiments in Section 6). We also note that the design of Proximal SPIDER-M allows constant stepsizes, as opposed to the accuracy-dependent stepsize adopted by the original SPIDER (Fang et al., 2018). This also facilitates the convergence of the algorithm in practice.
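To give a feel for what these orders mean, the short calculation below compares leading-order stochastic gradient oracle counts with all constants dropped. It assumes the standard orders reported in the cited literature: $\epsilon^{-4}$ for SGD, $n + n^{2/3}\epsilon^{-2}$ for SVRG/SAGA, and $n + \sqrt{n}\epsilon^{-2}$ for SPIDER-type methods. The values of $n$ and $\epsilon$ are arbitrary illustrative choices, not settings from the experiments.

```python
import math

def oracle_orders(n, eps):
    """Leading-order stochastic gradient oracle counts (constants dropped)."""
    return {
        "SGD": eps ** -4,
        "SVRG/SAGA": n + n ** (2 / 3) * eps ** -2,
        "SPIDER-type": n + math.sqrt(n) * eps ** -2,
    }

# Illustrative regime with n <= eps^{-4}: one million samples, accuracy 1e-3.
orders = oracle_orders(n=10 ** 6, eps=1e-3)
for name, count in orders.items():
    print(f"{name:12s} ~ {count:.3e}")
```

In this regime the SPIDER-type order is roughly a factor $n^{1/6}$ below the SVRG/SAGA order, since the two differ only in the power of $n$ multiplying $\epsilon^{-2}$.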
Outline of Proof for Theorem 1.
As the technical proof is involved, we briefly outline the key intermediate steps below to convey some intuition on the analysis. The detailed proof is provided in the supplementary materials.
Based on the definition of generalized gradient (see Fact 1), we can rewrite the updates in Algorithm 1 as follows:
It can be seen that the term serves as a generalized gradient in the updates. Then, under the above momentum scheme, we can characterize the per-iteration progress of Proximal SPIDER-M by bounding the progressive function value gap as
where and we have hidden the constant factors for simplicity of presentation. The next key step is to bound the estimation error term in terms of the generalized gradient term as
where denotes the index of the period that iteration belongs to. Then, combining the above two inequalities, telescoping and simplifying with much effort yield that
Based on the above result, we further exploit the randomized output strategy and finally obtain that
where is selected from uniformly at random. Then, the desired convergence rate and oracle complexity results follow. ∎
From a technical perspective, we highlight the following three major new developments in the proof of Theorem 1 that differ from the proof for the basic stochastic gradient algorithm with momentum (Ghadimi & Lan, 2016) in nonconvex optimization: 1) Our proof exploits the martingale structure of the SPIDER estimate, which allows us to bound the mean-square error term tightly under the momentum scheme. In the traditional analysis of stochastic algorithms with momentum (Ghadimi & Lan, 2016), such an error term corresponds to the variance of the stochastic estimator and is assumed to be bounded by a universal constant. 2) Our proof requires a very careful manipulation of the bounding strategy to handle the accumulation of the mean-square error over the entire optimization path. 3) Our design of the momentum scheme allows us to prove convergence under arbitrary convex regularizers, whereas the proof of (Ghadimi & Lan, 2016) requires the regularizer to have a bounded domain.
4 Other Momentum Scheduling Schemes for Proximal SPIDER-M
It turns out that the design of Proximal SPIDER-M in Algorithm 1 admits more flexible momentum schemes in nonconvex optimization. In this section, we explore two variant momentum schemes for Proximal SPIDER-M that can be useful in practice and study the corresponding convergence guarantees.
4.1 Epochwise-diminishing Momentum
The Proximal SPIDER-M in Algorithm 1 uses a momentum coefficient that diminishes to zero iterationwise. As the epoch length $q$ usually consists of many inner iterations (e.g., multiple passes over the data), the momentum coefficient can be very small after several epochs and hence leads to limited acceleration. Therefore, one strategy to alleviate this problem is to let the momentum coefficient diminish epochwise, i.e., set
(Epochwise-diminishing momentum):
where $q$ corresponds to the number of inner iterations within each epoch and $\lceil \cdot \rceil$ denotes the ceiling function. Under this setting, the momentum coefficient remains constant within each epoch and diminishes slowly over successive epochs. We note that a similar momentum coefficient setting is adopted in (Allen-Zhu, 2017; Shang et al., 2018) for accelerating SVRG. However, the focus there is on convex optimization problems, and no convergence guarantee was established for nonconvex optimization.
4.2 Epochwise-restart Momentum
Another widely used momentum setting is to restart the momentum scheme after a fixed number of iterations. Specifically, in the context of Proximal SPIDER-M, after every epoch we synchronize the variables $x_k$ and $y_k$ to be the $z_{k-1}$ obtained in the previous iteration, i.e., we add the following step to Proximal SPIDER-M in Algorithm 1:
If  
This can be understood as an epochwise reinitialization of the variables. In addition, we restart the momentum coefficient after every epoch as
Under such a momentum scheme, the momentum coefficient reboots at the beginning of every epoch, consistently injecting a periodic momentum into the algorithm dynamics. Finally, the algorithm outputs the point where is selected from uniformly at random.
The momentum scheme with restart has been applied to the gradient descent algorithm in (O’Donoghue & Candès, 2015). There, it was shown that a proper restart scheme can significantly accelerate the practical convergence of the algorithm. However, it is unclear whether a restart momentum scheme can have a convergence guarantee in nonconvex and nonsmooth optimization, especially under the more sample-efficient SPIDER scheme. We establish such a theoretical result in the next subsection.
To further illustrate the differences among these three momentum schemes, we plot and compare the scheduling of their momentum coefficients in Figure 1. The area below each curve roughly corresponds to the total momentum that the corresponding momentum scheme injects into the algorithm dynamics. One can see that the original momentum scheme, which diminishes iterationwise, has the smallest total momentum, whereas the epochwise-diminishing momentum scheme has the largest total momentum (within a considerable number of epochs). We further demonstrate via numerical experiments in Section 6 that the practical performance of these momentum schemes is highly correlated with the cumulative momentum.
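The comparison of total momentum can be mimicked numerically. The exact coefficient formulas are not reproduced in this text, so the three schedules below are assumed forms chosen only to match the verbal description (diminishing per iteration, constant within an epoch and diminishing across epochs, and resetting at every epoch); the function names and the values of `q` and `num_epochs` are hypothetical.

```python
import math

def alpha_iterationwise(k):
    """Assumed form: diminishes with the iteration counter k."""
    return 2.0 / (k + 2)

def alpha_epochwise_diminishing(k, q):
    """Assumed form: constant within an epoch, diminishing across epochs."""
    epoch = math.ceil((k + 1) / q)  # 1 for k = 0..q-1, 2 for k = q..2q-1, ...
    return 2.0 / (epoch + 1)

def alpha_epochwise_restart(k, q):
    """Assumed form: the iterationwise schedule, reset at every epoch."""
    return 2.0 / (k % q + 2)

q, num_epochs = 100, 5
K = q * num_epochs
total = {
    "iterationwise": sum(alpha_iterationwise(k) for k in range(K)),
    "epochwise-diminishing": sum(alpha_epochwise_diminishing(k, q) for k in range(K)),
    "epochwise-restart": sum(alpha_epochwise_restart(k, q) for k in range(K)),
}
for name, value in total.items():
    print(f"{name:22s} total momentum ~ {value:.1f}")
```

Under these assumed schedules the sums play the role of the area under each curve, and their ordering agrees with the qualitative comparison described for Figure 1.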
4.3 Convergence and Complexity Analysis
In this subsection, we present the convergence result and the corresponding oracle complexity of Proximal SPIDER-M under the variants of momentum schemes introduced in the previous subsections. We obtain the following main result.
Theorem 2.
Let Assumption 1 hold. Apply Proximal SPIDER-M with either epochwise-diminishing momentum or epochwise-restart momentum to solve the problem (P). Set parameters , and . Then, the output of the algorithm satisfies for any under the same complexity requirements as those in Theorem 1.
Theorem 2 shows that Proximal SPIDER-M maintains the optimal stochastic gradient oracle complexity in nonconvex optimization under the more flexible epochwise-diminishing and epochwise-restart momentum schemes. This demonstrates that the algorithmic structure of SPIDER provides much flexibility in designing compatible momentum schemes in the nonconvex regime.
5 Proximal SPIDER-M for Online Nonconvex Composite Optimization
The objective function in the optimization problem (P) contains a finite number of data samples that are typically drawn from an underlying data distribution. Therefore, it can be viewed as a finite-sample approximation of the population risk $f(x) := \mathbb{E}_{\xi}[f_\xi(x)]$, where the data sample $\xi$ is generated from an underlying distribution. In this section, we study the following online composite optimization problem that involves the population risk:
(R)  $\min_{x \in \mathbb{R}^d} \Psi(x) := f(x) + g(x), \quad f(x) := \mathbb{E}_{\xi}[f_\xi(x)],$
where the function $g$ corresponds to the regularizer. As the problem (R) depends on the population risk, which involves infinitely many samples, we propose a variant of Proximal SPIDER-M that can solve it in an online setting. We summarize the detailed steps of the algorithm in Algorithm 2, where we refer to it as Online Proximal SPIDER-M.
Note that unlike Proximal SPIDER-M for the finite-sum case, Online Proximal SPIDER-M keeps drawing new data samples from the underlying distribution (uniformly at random) to construct the gradient estimate $v_k$. To study its convergence guarantee, we make the following standard assumption on the variance of the random sampling.
Assumption 2.
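There exists a constant $\sigma > 0$ such that for all $x \in \mathbb{R}^d$ and all random samples $\xi$, it holds that $\mathbb{E}_{\xi} \|\nabla f_\xi(x) - \nabla f(x)\|^2 \le \sigma^2$.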
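The online construction of the gradient estimate can be sketched as follows. Algorithm 2 is not reproduced in this text, so this is an assumed structure in the spirit of online SPIDER-type estimators: a large fresh batch at the start of each epoch and a recursive correction with a small fresh batch otherwise. The data model, batch sizes, and all names are hypothetical.

```python
import numpy as np

def sample_batch(rng, x_true, size, noise=0.1):
    """Draw fresh samples (a, b) with b = <a, x_true> + noise (toy data model)."""
    A = rng.normal(size=(size, x_true.size))
    b = A @ x_true + noise * rng.normal(size=size)
    return A, b

def batch_grad(x, A, b):
    """Mini-batch gradient of the loss f_xi(x) = 0.5 * (<a, x> - b)^2."""
    return A.T @ (A @ x - b) / len(b)

def online_estimate(k, q, z, z_prev, v_prev, rng, x_true, big_batch, small_batch):
    """Gradient estimate built only from freshly drawn samples."""
    if k % q == 0:
        # Start of an epoch: a large fresh batch replaces the full gradient.
        A, b = sample_batch(rng, x_true, big_batch)
        return batch_grad(z, A, b)
    # Inside an epoch: recursive correction evaluated on a small fresh batch.
    A, b = sample_batch(rng, x_true, small_batch)
    return batch_grad(z, A, b) - batch_grad(z_prev, A, b) + v_prev

rng = np.random.default_rng(0)
x_true = np.array([1.0, -1.0, 0.5])
z0 = np.zeros(3)
v0 = online_estimate(0, 10, z0, None, None, rng, x_true, 20000, 32)
v1 = online_estimate(1, 10, z0, z0, v0, rng, x_true, 20000, 32)
```

For this toy model the population gradient at `z` is `z - x_true`, so `v0` should be close to `-x_true`; a bounded-variance condition of the kind stated in Assumption 2 is what controls how large the initial batch must be.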
Based on Assumption 2, we obtain the following convergence guarantee for Online Proximal SPIDER-M.
Theorem 3.
Let Assumptions 1 and 2 hold. Apply Online Proximal SPIDER-M (see Algorithm 2) to solve the problem (R). Choose any desired accuracy and set parameters , and . Then, the output of the algorithm satisfies provided that the total number of iterations satisfies
(7) 
Moreover, the total number of stochastic gradient calls is at most $\mathcal{O}(\epsilon^{-3})$ and the total number of proximal mapping calls is at most $\mathcal{O}(\epsilon^{-2})$.
The orders of the results in Theorem 3 match the state-of-the-art results (Fang et al., 2018; Wang et al., 2018a). Our result demonstrates that the momentum scheme can be applied to facilitate the convergence of Proximal SPIDER for solving online nonsmooth and nonconvex problems with a provable convergence guarantee. Moreover, we obtain a similar complexity result for Online Proximal SPIDER-M under the other two momentum schemes proposed in Section 4 in the following theorem.
Theorem 4.
Let Assumptions 1 and 2 hold. Apply Online Proximal SPIDER-M with either epochwise-diminishing momentum or epochwise-restart momentum to solve the problem (R). Choose any desired accuracy and set parameters , and . Then, the output of the algorithm satisfies under the same complexity requirements as those in Theorem 3.
6 Experiments
In this section, we compare the practical performance of the following stochastic variance-reduced algorithms: SVRG (Johnson & Zhang, 2013), SpiderBoost (Wang et al., 2018a), Katyusha (Allen-Zhu, 2017), ASVRG (Shang et al., 2018), RSAG (Ghadimi & Lan, 2016), Proximal SPIDER-M (Algorithm 1 in this paper), Proximal SPIDER-MED (epochwise-diminishing momentum) and Proximal SPIDER-MER (epochwise-restart momentum). All algorithms use certain momentum schemes except for SVRG and SpiderBoost. For all algorithms considered, we set their learning rates to be . For each experiment, we initialize all the algorithms at the same point, which is generated randomly from the normal distribution. Also, we choose a fixed mini-batch size and set the epoch length to be such that all algorithms pass over the entire dataset twice in each epoch.
6.1 Unregularized Nonconvex Optimization
We first apply these algorithms to unregularized nonconvex optimization problems. The first problem is the following nonconvex logistic regression problem
where denotes the features and corresponds to the labels, and we set the loss to be the cross-entropy loss and . For this problem, we use two different datasets from LIBSVM (Chang & Lin, 2011): the a9a dataset () and the w8a dataset (). We report the learning curves on the function value gap of these algorithms in Figure 2.
In this experiment, one can see from Figure 2 that our SPIDER-MED with epochwise-diminishing momentum achieves the best performance and significantly outperforms the other algorithms. Also, we note that Katyusha and ASVRG do not achieve much acceleration in this nonconvex case, as these algorithms were originally developed to accelerate the solution of convex problems. This demonstrates that our design of SPIDER-M has a stable performance in nonconvex optimization as well as a provable theoretical guarantee. We note that the curve of SpiderBoost overlaps with that of SVRG (both algorithms have similar performance). On the other hand, among all SPIDER-M-type algorithms, the one that uses the epochwise-diminishing momentum (SPIDER-MED) has the best performance, whereas the one that uses the iterationwise-diminishing momentum (SPIDER-M) is the slowest. This corroborates the comparison of the total momentum that we illustrate in Figure 1.
Next, we compare these algorithms in solving the following nonconvex robust linear regression problem
where we use the nonconvex loss function . We report the learning curves on the function value gap of these algorithms in Figure 3. One can see that our SPIDER-MED with epochwise-diminishing momentum has a performance comparable to that of Katyusha, and they both outperform the other algorithms.
6.2 Nonsmooth and Nonconvex Optimization
We further add an $\ell_1$ nonsmooth regularizer with weight coefficient to the objective functions of the above two optimization problems, and apply the corresponding proximal versions of these algorithms to solve the nonconvex composite optimization problems. All the results are presented in Figures 4 and 5. One can see that our Proximal SPIDER-MED still significantly outperforms all the other algorithms in these nonconvex and nonsmooth scenarios. This demonstrates that our novel design of the coupled update in the momentum scheme is efficient in the nonsmooth and nonconvex setting. Also, Katyusha and ASVRG suffer from slow convergence (they converge only after around 40 epochs). Together with the first two experiments, this implies that their performance is not stable and that they may not be generally suitable for solving nonsmooth and nonconvex problems.
7 Conclusion
In this paper, we design an efficient proximal stochastic variance-reduced algorithm with momentum to solve nonconvex composite optimization problems with a provable convergence guarantee. Under a basic momentum scheme, we show that our Proximal SPIDER-M achieves the best possible stochastic gradient oracle complexity for nonconvex optimization. Our algorithm design further accommodates other momentum schemes and extends to online composite optimization problems with an optimal oracle complexity. We anticipate that our algorithm design will inspire the development of more advanced momentum acceleration schemes for stochastic nonconvex optimization. It is also interesting to explore whether our algorithm can achieve the best possible convergence rate in convex optimization.
References
 Allen-Zhu (2017) Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. Journal of Machine Learning Research, 18(1):8194–8244, January 2017.
 Allen-Zhu (2018) Allen-Zhu, Z. Katyusha X: Simple momentum method for stochastic sum-of-nonconvex optimization. In Proc. International Conference on Machine Learning (ICML), volume 80, pp. 179–185, 10–15 Jul 2018.
 Allen-Zhu & Hazan (2016) Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization. In Proc. International Conference on Machine Learning (ICML), pp. 699–707, 2016.
 Bauschke & Combettes (2011) Bauschke, H. and Combettes, P. L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.
 Beck & Teboulle (2009) Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, March 2009.
 Bottou (2010) Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proc. International Conference on Computational Statistics (COMPSTAT), pp. 177–186, 2010.
 Chang & Lin (2011) Chang, C. and Lin, C. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–27, 2011.
 Defazio et al. (2014) Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 1646–1654, 2014.
 Fang et al. (2018) Fang, C., Li, C., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 687–697, 2018.
 Ghadimi & Lan (2016) Ghadimi, S. and Lan, G. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1–2):59–99, March 2016.
 Ghadimi et al. (2016) Ghadimi, S., Lan, G., and Zhang, H. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267–305, Jan 2016.
 Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.
 Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Proc. International Conference on Neural Information Processing Systems (NIPS), pp. 315–323, 2013.
 Li et al. (2017) Li, Q., Zhou, Y., Liang, Y., and Varshney, P. K. Convergence analysis of proximal gradient with momentum for nonconvex optimization. In Proc. International Conference on Machine Learning (ICML), volume 70, pp. 2111–2119, 2017.
 Nesterov (2014) Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2014.
 Nguyen et al. (2017a) Nguyen, L., Liu, J., Scheinberg, K., and Takáč, M. Stochastic recursive gradient algorithm for nonconvex optimization. ArXiv:1705.07261, May 2017a.
 Nguyen et al. (2017b) Nguyen, L., Liu, J., Scheinberg, K., and Takáč, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proc. International Conference on Machine Learning (ICML), 2017b.

 Nitanda (2016) Nitanda, A. Accelerated stochastic gradient descent for minimizing finite sums. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), volume 51, pp. 195–203, May 2016.
 O’donoghue & Candès (2015) O’donoghue, B. and Candès, E. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, June 2015.
 Reddi et al. (2016) Reddi, S. J., Sra, S., Poczos, B., and Smola, A. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 1145–1153, 2016.
 Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, September 1951.
 Schmidt et al. (2017) Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1):83–112, Mar 2017.
 Shang et al. (2018) Shang, F., Jiao, L., Zhou, K., Cheng, J., Ren, Y., and Jin, Y. ASVRG: Accelerated proximal SVRG. In Proc. Asian Conference on Machine Learning, volume 95, pp. 815–830, 2018.
 Tseng (2010) Tseng, P. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 125(2):263–295, Oct 2010.
 Wang et al. (2018a) Wang, Z., Ji, K., Zhou, Y., Liang, Y., and Tarokh, V. SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. ArXiv:1810.10690, October 2018a.
 Wang et al. (2018b) Wang, Z., Zhou, Y., Liang, Y., and Lan, G. Cubic regularization with momentum for nonconvex optimization. ArXiv:1810.03763, October 2018b.
 Xiao & Zhang (2014) Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
 Zhang et al. (2018) Zhang, J., Zhang, H., and Sra, S. R-SPIDER: A fast Riemannian stochastic optimization algorithm with curvature independent rate. ArXiv:1811.04194, 2018.
 Zhou et al. (2018) Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 3925–3936, 2018.
 Zhou et al. (2019) Zhou, P., Yuan, X., and Feng, J. Faster first-order methods for stochastic non-convex optimization on Riemannian manifolds. 2019.
Appendix A Auxiliary Lemmas for Analysis of Algorithm 1
In this section, we collect some auxiliary results that facilitate the analysis of Algorithm 1. Throughout, for any , denote the unique integer such that . We also define and for . Since we set , it is easy to check that .
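The closed form mentioned above can be checked with exact arithmetic. The sketch below assumes the momentum schedule α_k = 2/(k+1) and the recursion Γ_1 = 1, Γ_k = (1 − α_k)Γ_{k−1}, as commonly used in accelerated schemes such as Ghadimi & Lan (2016); both the schedule and the function name `gamma_sequence` are assumptions made for illustration. Under these assumptions the recursion yields Γ_k = 2/(k(k+1)).

```python
from fractions import Fraction


def gamma_sequence(K):
    """Return [Gamma_1, ..., Gamma_K] with Gamma_1 = 1 and
    Gamma_k = (1 - alpha_k) * Gamma_{k-1}, using exact rationals."""
    gammas = [Fraction(1)]
    for k in range(2, K + 1):
        alpha_k = Fraction(2, k + 1)  # assumed schedule alpha_k = 2 / (k + 1)
        gammas.append((1 - alpha_k) * gammas[-1])
    return gammas


# Compare the recursion with the assumed closed form 2 / (k (k + 1)).
closed_form = [Fraction(2, k * (k + 1)) for k in range(1, 11)]
matches = gamma_sequence(10) == closed_form
```

Exact rationals are used instead of floats so that the comparison with the closed form is an equality rather than a tolerance check.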
We first present an auxiliary lemma from (Fang et al., 2018).
Lemma 1.
(Fang et al., 2018) Under Assumption 1, the gradient estimate constructed by SPIDER satisfies, for all ,
Telescoping Lemma 1 and noting that for all such that , we obtain the following bound.
Lemma 2.
Under Assumption 1, the gradient estimate constructed by SPIDER satisfies, for all ,
(8) 
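To make the estimator in Lemmas 1 and 2 concrete, the following Python sketch implements a SPIDER-style recursive gradient estimate in the form described by Fang et al. (2018): a full gradient every q iterations, and otherwise the previous estimate plus a minibatch average of gradient differences. The least-squares components f_i(z) = ½(a_i·z − b_i)², the toy data, and all function names are assumptions chosen for illustration only.

```python
def grad_i(a, b, z):
    """Gradient of the assumed component f_i(z) = 0.5 * (a . z - b)^2."""
    r = sum(aj * zj for aj, zj in zip(a, z)) - b
    return [r * aj for aj in a]


def full_grad(A, b, z):
    """Average of the component gradients over all n samples."""
    n = len(A)
    g = [0.0] * len(z)
    for a_i, b_i in zip(A, b):
        for j, gj in enumerate(grad_i(a_i, b_i, z)):
            g[j] += gj / n
    return g


def spider_estimate(A, b, z, z_prev, v_prev, k, q, batch):
    """Return v_k: a full gradient when k is a multiple of q, otherwise
    v_{k-1} plus the minibatch average of grad_i(z_k) - grad_i(z_{k-1})."""
    if k % q == 0 or v_prev is None:
        return full_grad(A, b, z)
    v = list(v_prev)
    for i in batch:
        g_new = grad_i(A[i], b[i], z)
        g_old = grad_i(A[i], b[i], z_prev)
        for j in range(len(z)):
            v[j] += (g_new[j] - g_old[j]) / len(batch)
    return v


# Toy usage: with the batch equal to the whole data set, the recursion
# reproduces the full gradient at the new point.
A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b = [1.0, 0.0, 2.0]
z0, z1 = [0.0, 0.0], [0.5, -0.5]
v0 = spider_estimate(A, b, z0, None, None, k=0, q=2, batch=[])
v1 = spider_estimate(A, b, z1, z0, v0, k=1, q=2, batch=[0, 1, 2])
```

With a strict minibatch the estimate is no longer exact, and its error accumulates along the path of iterates until the next full-gradient refresh; this accumulated error is the quantity that the lemmas above control.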
Next, recall the following definition of the gradient mapping for some and :
Based on this definition, we can rewrite the updates of Algorithm 1 as follows:
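As an illustration of how a proximal step can be expressed through a gradient mapping, the sketch below assumes the standard form G_λ(x, v) = (x − prox_{λh}(x − λv))/λ, in the spirit of Ghadimi et al. (2016), together with an ℓ1 regularizer h = μ‖·‖₁ whose proximal operator is soft-thresholding. The choice of regularizer, the parameter name `mu`, and the function names are assumptions for illustration. Under this definition the proximal step x⁺ = prox_{λh}(x − λv) is identical to x⁺ = x − λ G_λ(x, v).

```python
def soft_threshold(u, t):
    """Proximal operator of t * ||.||_1, applied coordinate-wise."""
    return [max(abs(ui) - t, 0.0) * (1.0 if ui > 0 else -1.0) for ui in u]


def gradient_mapping(x, v, lam, mu):
    """Assumed form G_lam(x, v) = (x - prox_{lam*h}(x - lam*v)) / lam
    with h = mu * ||.||_1."""
    shifted = [xi - lam * vi for xi, vi in zip(x, v)]
    x_plus = soft_threshold(shifted, lam * mu)
    return [(xi - pi) / lam for xi, pi in zip(x, x_plus)]


# Toy usage: the proximal step and the gradient-mapping step coincide.
x = [1.0, -0.2, 0.05]
v = [0.5, -0.1, 0.3]
lam, mu = 0.1, 1.0
G = gradient_mapping(x, v, lam, mu)
prox_step = soft_threshold([xi - lam * vi for xi, vi in zip(x, v)], lam * mu)
mapping_step = [xi - lam * gi for xi, gi in zip(x, G)]
```

When the regularizer vanishes (mu = 0), the mapping reduces to the gradient estimate v itself, which is why it serves as a generalized gradient in the composite setting.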
Next, we prove the following auxiliary lemma.
Lemma 3.
Let the sequences be generated by Algorithm 1. Then, the following inequalities hold
(9)  
(10)  
(11) 
Proof.
We prove the first equality. By the update rule of the momentum scheme, we obtain that
(12) 
Dividing both sides by and noting that , we further obtain that
(13) 
Telescoping the above equality over yields the first desired equality.
Next, we prove the second inequality. Based on the first equality, we obtain that
(14) 
where (i) uses the facts that is a decreasing sequence, and Jensen’s inequality.
Finally, we prove the third inequality. By the update rule of the momentum scheme, we obtain that . Then, we further obtain that
The desired result follows by taking the square on both sides of the above inequality and using the fact that . ∎
We also need the following lemma, which was established as Lemma 1 and Proposition 1 in (Ghadimi et al., 2016).
Lemma 4 (Lemma 1 and Proposition 1, (Ghadimi et al., 2016)).
Let be a proper and closed convex function. Then, for all and , the following statements hold:
Appendix B Proof of Theorem 1
Consider any iteration of the algorithm. By smoothness of , we obtain that
where (i) follows from Lemma 4. Rearranging the above inequality and using the Cauchy-Schwarz inequality yields that