Large-scale machine learning problems can typically be modeled as the following finite-sum optimization problem
$$\min_{x\in\mathbb{R}^d} f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad(\text{P})$$
where the function $f$ denotes the total loss on the $n$ training samples and in general is nonconvex. Since the sample size $n$ can be very large, the full-batch gradient descent algorithm has high computational complexity. Various stochastic gradient descent (SGD) algorithms have therefore been proposed and extensively studied. For nonconvex optimization, the basic SGD algorithm, which computes one gradient per iteration, has been shown to yield an overall stochastic first-order oracle (SFO) complexity, i.e., gradient complexity, of $\mathcal{O}(\epsilon^{-4})$ (Ghadimi et al., 2016) to attain a first-order stationary point $x$ that satisfies $\|\nabla f(x)\|\le\epsilon$. It has also been shown that vanilla SGD with a constant stepsize converges only to a neighborhood of a first-order stationary point. This issue can be addressed by diminishing the stepsize (Bottou et al., 2018) or by choosing a sufficiently large batch size in each iteration.
Furthermore, various variance reduction methods have been proposed to reduce the variance of the gradient estimator in SGD by constructing a more sophisticated and accurate gradient estimator, such as SAG (Roux et al., 2012), SAGA (Defazio et al., 2014) and SVRG (Johnson and Zhang, 2013). In particular, SAGA and SVRG have been shown to yield an overall SFO complexity of $\mathcal{O}(n + n^{2/3}\epsilon^{-2})$ (Reddi et al., 2016a; Allen-Zhu and Hazan, 2016) to obtain an $\epsilon$-approximate first-order stationary point for nonconvex problems. These variance reduction methods also demonstrate, for the first time, that stochastic gradient-based methods dominate deterministic gradient descent methods by a factor of $\mathcal{O}(n^{1/3})$ for nonconvex optimization.
More recently, Nguyen et al. (2017a, b) proposed a novel variance reduction method named SARAH, where the gradient estimator is sequentially updated along with the iterate in the inner loop to improve the estimation accuracy. In particular, SARAH takes a stepsize that diminishes with the inner-loop length $m$ (where $m$ is the number of iterations in each inner loop), and has been shown in Nguyen et al. (2017a) to achieve an overall SFO complexity of $\mathcal{O}(n + \epsilon^{-4})$ to attain an $\epsilon$-approximate first-order stationary point for nonconvex optimization. Another variance reduction method of the same type, named SPIDER, was proposed in Fang et al. (2018); it uses the same gradient estimator as SARAH but adopts a natural gradient update with a learning rate of order $\mathcal{O}(\epsilon/L)$. Fang et al. (2018) showed that SPIDER achieves an overall SFO complexity of $\mathcal{O}(n + \sqrt{n}\,\epsilon^{-2})$, which was further shown to be optimal in the regime $n \le \mathcal{O}(\epsilon^{-4})$.
Though SPIDER is theoretically appealing, two important issues of SPIDER require further attention. (1) SPIDER requires a very restrictive stepsize of order $\mathcal{O}(\epsilon/L)$ in order to guarantee convergence. (SPIDER in Fang et al. (2018) takes a natural gradient descent update with a stepsize $\epsilon/L$; it can be equivalently viewed as a gradient descent update with an adaptive stepsize $\epsilon/(L\|v_k\|)$, where $v_k$ is the estimate of the gradient at the $k$-th step. During the initial stage of the algorithm, $\|v_k\|$ can be much larger than $\epsilon$, so the resulting stepsize can be very small.) This prevents SPIDER from making big progress even when it is possible, and relaxing such a condition appears difficult under the current convergence analysis framework. (2) The convergence analysis of SPIDER requires a very small per-iteration increment $\|x_{k+1} - x_k\| \le \mathcal{O}(\epsilon/L)$, which is difficult to guarantee if one generalizes it to a proximal algorithm for solving a composite optimization problem, due to the nonlinearity of the proximal operator. Hence, generalizing SPIDER to proximal algorithms with a provable convergence guarantee is challenging, if not impossible. Thus, two natural questions arise as follows.
Can we relax the parameter restrictions of SPIDER without losing the guaranteed convergence rate?
If an improved SPIDER can be designed, does such an improvement facilitate the generalization to proximal algorithms with convergence guarantees? Does the resulting algorithm improve the SFO complexity of existing proximal algorithms?
Our study here provides affirmative answers to both of the above questions. Our contributions are summarized as follows.
Inspired by SARAH and SPIDER, we propose a more practical variant, which we call SpiderBoost. SpiderBoost has two main advantages. (1) SpiderBoost allows a much larger stepsize of $1/(2L)$ than the stepsize $\epsilon/(L\|v_k\|)$ (if viewed under the gradient descent update) adopted by SPIDER, and at the same time achieves the same state-of-the-art complexity order as SPIDER (see Table 1). This is due to the new convergence analysis idea that we develop, which analyzes the increments of the variables over each entire inner loop rather than over each inner-loop iteration, and hence yields a tighter bound and consequently a more relaxed stepsize requirement. As a result, SpiderBoost makes significantly larger progress towards a first-order stationary point than SPIDER, especially in the initial optimization phase where $\|v_k\|$ is large, as demonstrated in Figure 1. (2) SpiderBoost comes with a natural generalization to proximal algorithms for solving composite optimization problems with convergence guarantees. This is because the convergence analysis we develop for SpiderBoost does not require a bound on the per-iteration increment $\|x_{k+1} - x_k\|$, and such an attribute significantly facilitates the convergence analysis for proximal algorithms. This is in contrast to the convergence analysis of SPIDER, which explicitly exploits the condition $\|x_{k+1} - x_k\| \le \mathcal{O}(\epsilon/L)$, a condition that is difficult to maintain for proximal algorithms.
The online setting refers to the case where the objective function takes the form of the expected value of the loss function over the data distribution. In such a case, the batch size for estimating the gradient is typically chosen to be $\epsilon$-dependent. Such a method can also be applied to solve the finite-sum problem, and hence the SFO complexity in the last column of Table 1 is applicable to both the finite-sum and online problems. Thus, for algorithms in Table 1 that have SFO bounds available in both of the last two columns, the minimum of the two bounds provides the best bound for the finite-sum problem.
| GD | (Nesterov, 2014) | N/A (for deterministic algorithms, the online setting does not exist) |
| SGD | (Ghadimi et al., 2016) | N/A |
| SVRG | (Reddi et al., 2016a; Allen-Zhu and Hazan, 2016) | N/A |
| SCSG | (Lei et al., 2017) | |
| SARAH | (Nguyen et al., 2017b, a) | N/A |
| SNVRG | (Zhou et al., 2018) | |
| SPIDER | (Fang et al., 2018) | (SPIDER uses the natural gradient descent, which can also be viewed as the gradient descent with the stepsize $\epsilon/(L\|v_k\|)$.) |
| ProxGD | (Ghadimi et al., 2016) | N/A | N/A |
| ProxSGD | (Ghadimi et al., 2016) | N/A | N/A |
| ProxSVRG/SAGA | (Reddi et al., 2016b) | N/A | N/A |
| ProxSVRG+ | (Li and Li, 2018) | (Li and Li (2018) contains a detailed discussion on the choice of the outer-loop batch size; here we include only the best result. Moreover, their result is based on an additional assumption that the total number of iterations is a multiple of the inner-loop length, so the additive term appearing in the other bounds disappears in their bound.) |
Exploiting the aforementioned second advantage, we propose Prox-SpiderBoost for solving the composite problem (Q) (see Section 3), where the objective function consists of a finite-sum function and a nonsmooth regularizer. We show that Prox-SpiderBoost achieves an SFO complexity of $\mathcal{O}(n + \sqrt{n}\,\epsilon^{-2})$ and a proximal oracle (PO) complexity of $\mathcal{O}(\epsilon^{-2})$. Such an SFO complexity improves the existing best results by a factor of $\mathcal{O}(n^{1/6})$ (see Table 3). We further extend Prox-SpiderBoost to solving the constrained composite optimization problem using the proximal mapping under a non-Euclidean geometry, i.e., by restricting the problem (Q) to a convex constraint set $\Omega$ and replacing the Euclidean distance with a generalized Bregman distance. Under certain conditions, we prove that the resulting algorithm achieves the same SFO and PO complexities as Prox-SpiderBoost for solving the unconstrained problem (Q). For nonconvex composite optimization problems that satisfy the so-called $\tau$-gradient dominance condition (see Definition 1), we propose a variant of the Prox-SpiderBoost algorithm and establish its oracle complexity for finding a stationary point. Our proposed algorithm achieves an SFO complexity that outperforms the state-of-the-art complexity bounds achieved by other stochastic proximal algorithms in several regimes (see Table 3).
We finally propose and study Prox-SpiderBoost-o for the online stochastic composite optimization problem, where the objective function takes the form of an expectation over the underlying data distribution rather than the finite-sum form. Our results show that Prox-SpiderBoost-o achieves an SFO complexity of $\mathcal{O}(\epsilon^{-3})$, which improves the existing best SFO complexity (see Table 3) for online stochastic composite optimization by a factor of $\mathcal{O}(\epsilon^{-1/3})$. The same complexity result also holds for the general constrained optimization under a non-Euclidean geometry.
We note that two very recent studies (Zhou et al., 2018; Zhang et al., 2018) have extended the idea of SARAH and SPIDER to optimization problems over manifolds. We anticipate that SpiderBoost may help to improve the practical performance of these studies.
For a vector $x$, $\|x\|$ denotes the $\ell_2$-norm of $x$. We use $\nabla f$ to denote the gradient of a function $f$. We use $\mathbb{R}$, $\mathbb{R}_+$ and $\mathbb{R}^d$ to denote the set of all real numbers, the non-negative real numbers and the $d$-dimensional real vectors, respectively.
2 SpiderBoost for Nonconvex Optimization
2.1 SpiderBoost Algorithm
In this section, we introduce the SpiderBoost algorithm, inspired by the SARAH and SPIDER algorithms. Recall the following finite-sum nonconvex optimization problem:
$$\min_{x\in\mathbb{R}^d} f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x). \qquad(\text{P})$$
In Nguyen et al. (2017a), a novel estimator of the full gradient of the problem (P) was introduced to reduce the variance. More specifically, consider a certain inner loop of the algorithm. The initialization of the estimator is set to be the full gradient, i.e., $v_0 = \nabla f(x_0)$. Then, for each subsequent iteration $k \ge 1$, an index set $S$ is sampled and the corresponding estimator is constructed as
$$v_k = \frac{1}{|S|}\sum_{i\in S}\big(\nabla f_i(x_k) - \nabla f_i(x_{k-1})\big) + v_{k-1}. \qquad(1)$$
Comparing the estimator in eq. 1 with that used in the conventional SVRG (Johnson and Zhang, 2013): the estimator in eq. 1 is constructed iteratively based on the information obtained from the previous update, whereas the SVRG estimator is constructed based on the information at the initialization of that loop (i.e., replacing $x_{k-1}$ and $v_{k-1}$ in eq. 1 with $x_0$ and $v_0$, respectively). Therefore, the estimator in eq. 1 utilizes fresher information and yields a more accurate estimate of the full gradient than that provided by the SVRG estimator.
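To make the contrast concrete, the following toy sketch (our own construction; the least-squares finite sum, batch size, and stepsize are illustrative assumptions, not the paper's setup) builds both estimators side by side:

```python
import numpy as np

# Toy finite sum: f_i(x) = 0.5 * (a_i^T x - b_i)^2, compared estimators:
# eq. 1 (recursive, SARAH/SPIDER-style) vs. the snapshot-anchored SVRG estimator.
rng = np.random.default_rng(0)
n, d = 100, 5
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)

def grad_i(x, idx):                      # mini-batch gradient over index set idx
    return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

def full_grad(x):
    return A.T @ (A @ x - b) / n

x_prev = np.zeros(d)
v = full_grad(x_prev)                    # v_0: full gradient at the loop's start
x0, v0 = x_prev.copy(), v.copy()         # frozen snapshot used by SVRG

x = x_prev - 0.1 * v
for _ in range(5):
    S = rng.choice(n, size=10, replace=False)
    v = grad_i(x, S) - grad_i(x_prev, S) + v        # eq. 1: recursive update
    v_svrg = grad_i(x, S) - grad_i(x0, S) + v0      # SVRG: anchored at snapshot
    x_prev, x = x, x - 0.1 * v

err_rec = np.linalg.norm(v - full_grad(x_prev))     # error of the eq. 1 estimator
err_svrg = np.linalg.norm(v_svrg - full_grad(x_prev))
print(err_rec, err_svrg)
```

Both estimators are unbiased corrections of a reference gradient; the recursive one anchors at the freshest iterate, which is what drives the tighter variance bounds discussed above.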
The estimator in eq. 1 has been adopted by Nguyen et al. (2017a, b) and Fang et al. (2018) for proposing the SARAH (see Algorithm 1) and SPIDER (see Algorithm 2) algorithms, respectively. The comparison of their complexity can be seen in Table 1, where SPIDER outperforms SARAH for nonconvex optimization, and was shown in Fang et al. (2018) to be optimal in the regime $n \le \mathcal{O}(\epsilon^{-4})$.
Though SPIDER has desirable performance in theory, it can run very slowly in practice due to the choice of a conservative stepsize. To illustrate, as can be seen from Algorithm 2, SPIDER uses a very small stepsize of order $\mathcal{O}(\epsilon/L)$ (where $\epsilon$ is the desired accuracy). The normalized gradient descent step then yields $\|x_{k+1} - x_k\| = \mathcal{O}(\epsilon/L)$, i.e., a small increment per iteration. By the analysis of SPIDER, such a stepsize appears to be necessary in order to achieve the desired convergence rate.
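The effect is easy to see numerically. In the sketch below (the constants $L$, $\epsilon$, and the gradient-estimate norm are assumed values for illustration), SPIDER's normalized step moves exactly $\epsilon/L$ per iteration regardless of how large the gradient estimate is, while a constant stepsize scales its progress with the gradient magnitude:

```python
# SPIDER's natural-gradient update x_{k+1} = x_k - (eps / (L * ||v_k||)) * v_k
# pins the per-iteration movement to eps/L; a constant stepsize does not.
L = 10.0          # assumed Lipschitz constant
eps = 1e-3        # target accuracy
v_norm = 50.0     # a large gradient-estimate norm, typical early in training

spider_step = eps / (L * v_norm)        # SPIDER's effective (adaptive) stepsize
spiderboost_step = 1.0 / (2.0 * L)      # a constant stepsize, as used later

spider_progress = spider_step * v_norm       # ||x_{k+1} - x_k|| = eps/L, always
spiderboost_progress = spiderboost_step * v_norm   # scales with ||v_k||

print(spider_progress)       # approximately eps/L, regardless of v_norm
print(spiderboost_progress)  # thousands of times larger in this regime
```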
Such a conservative stepsize adopted by SPIDER motivates our design of an improved algorithm named SpiderBoost (see Algorithm 3), which uses the same estimator eq. 1 as SARAH and SPIDER, but adopts a much larger constant stepsize $\eta = 1/(2L)$, as opposed to the $\mathcal{O}(\epsilon/L)$ stepsize taken by SPIDER. Also, SpiderBoost updates the variable via a gradient descent step (the same as SARAH), as opposed to the normalized gradient descent step taken by SPIDER. Furthermore, SpiderBoost generates the output variable via a random strategy, whereas SPIDER outputs deterministically. Collectively, SpiderBoost can make considerably larger progress per iteration than SPIDER, especially in the initial optimization phase where the estimated gradient norm is large, and is still guaranteed to achieve the same desirable convergence rate as SPIDER, as we show in the next subsection.
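The overall loop structure can be sketched as follows. This is our own minimal rendering of the scheme on an assumed least-squares finite sum (the data, inner-loop length $q \approx \sqrt{n}$, batch size, and helper names are illustrative assumptions, not the paper's Algorithm 3 verbatim):

```python
import numpy as np

# A sketch of the SpiderBoost scheme: recursive estimator (eq. 1), plain gradient
# steps with constant stepsize 1/(2L), full-gradient restart every q iterations,
# and a randomly chosen output iterate.
rng = np.random.default_rng(1)
n, d = 200, 10
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
L = np.linalg.norm(A, 2) ** 2 / n          # Lipschitz constant of the gradient

def mb_grad(x, idx):
    return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

def spiderboost(T=60):
    q = S = int(np.ceil(np.sqrt(n)))       # inner-loop length and batch size ~ sqrt(n)
    eta = 1.0 / (2.0 * L)                  # the large constant stepsize
    x = np.zeros(d)
    iterates = []
    for k in range(T):
        if k % q == 0:
            v = mb_grad(x, np.arange(n))   # restart with the full gradient
        else:
            idx = rng.choice(n, size=S, replace=True)
            v = mb_grad(x, idx) - mb_grad(x_prev, idx) + v   # recursive estimator
        x_prev, x = x, x - eta * v         # plain (not normalized) gradient step
        iterates.append(x)
    return iterates[rng.integers(T)]       # random output strategy

x_out = spiderboost()
```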
Next, as an illustration, we compare the practical performance of SPIDER and SpiderBoost for solving a logistic regression problem with a smooth nonconvex regularizer (e.g., a penalty of the form $\alpha\sum_{i} w_i^2/(1+w_i^2)$).
For both algorithms, we use the same parameter settings except for the stepsize, and aim to achieve the first-order stationary condition $\|\nabla f(x)\| \le \epsilon$. As specified in Fang et al. (2018), we set the SPIDER stepsize to $\mathcal{O}(\epsilon/L)$. On the other hand, SpiderBoost allows the stepsize $1/(2L)$. Figure 1 shows the convergence of the gradient norm and the function value gap of both algorithms versus the number of passes taken over the data. It can be seen that SpiderBoost enjoys much faster convergence than SPIDER, due to its large stepsize. Furthermore, SPIDER oscillates around a point where the gradient norm is about the predefined accuracy $\epsilon$. This implies that setting a larger stepsize for SPIDER would cause it to saturate and start to oscillate at a larger gradient norm as well as a larger loss value, which is undesirable.
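An objective of this type can be sketched as below. The regularizer form $\alpha\sum_j w_j^2/(1+w_j^2)$ is an assumed instance of a smooth nonconvex penalty (the paper's exact constants are not reproduced); the finite-difference check at the end validates the hand-derived gradient:

```python
import numpy as np

# Logistic loss with a smooth nonconvex regularizer (assumed form).
def logistic_loss(w, X, y, alpha=0.1):
    z = -y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, z))          # mean log(1 + exp(-y * x^T w))
    reg = alpha * np.sum(w**2 / (1.0 + w**2))     # bounded nonconvex penalty
    return loss + reg

def logistic_grad(w, X, y, alpha=0.1):
    z = -y * (X @ w)
    s = 1.0 / (1.0 + np.exp(-z))                  # sigmoid(z)
    data_grad = (X.T @ (-y * s)) / len(y)
    reg_grad = alpha * 2.0 * w / (1.0 + w**2) ** 2
    return data_grad + reg_grad

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 4))
y = np.sign(rng.standard_normal(50))
w = rng.standard_normal(4)

# central finite-difference check of the first gradient coordinate
e = np.zeros(4); e[0] = 1e-6
fd = (logistic_loss(w + e, X, y) - logistic_loss(w - e, X, y)) / 2e-6
```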
To summarize, SpiderBoost takes updates with a more aggressive stepsize that can substantially accelerate convergence in practice without sacrificing theoretical performance, as we show in the next subsection. Moreover, SpiderBoost is more amenable than SPIDER to extension for solving composite nonconvex optimization problems, and achieves an improved complexity over the state-of-the-art result, as we study in Section 3.
2.2 Convergence Analysis of SpiderBoost
In this subsection, we study the convergence rate and complexity of SpiderBoost for finding a first-order stationary point within $\epsilon$-accuracy. In particular, we adopt the following standard assumptions on the objective function in the problem (P).
The objective function in the problem (P) satisfies:
Function $f$ is continuously differentiable and bounded below, i.e., $f^* := \inf_{x\in\mathbb{R}^d} f(x) > -\infty$;
For every $i$, the gradient $\nabla f_i$ is $L$-Lipschitz continuous, i.e., $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^d$.
Assumption 1 essentially assumes that the smooth objective function has a non-trivial minimum and that the corresponding gradients are Lipschitz continuous, which are valid and standard conditions in nonconvex optimization. Then, we obtain the following convergence result for SpiderBoost.
Theorem 1 shows that the output of SpiderBoost achieves the first-order stationary condition within $\epsilon$-accuracy with a total SFO complexity of $\mathcal{O}(n + \sqrt{n}\,\epsilon^{-2})$. This matches the lower bound that one can expect for first-order algorithms in the regime $n \le \mathcal{O}(\epsilon^{-4})$ (Fang et al., 2018). As we explain in Section 2.1, SpiderBoost differs from SPIDER mainly in the utilization of a large constant stepsize, which yields significant acceleration over the original SPIDER in practice, as we illustrate in Figure 1.
We note that the analysis of SpiderBoost in Theorem 1 is very different from that of SPIDER, which depends on an $\epsilon$-level stepsize and the normalized gradient descent step to guarantee a constant increment in every iteration. In contrast, the analysis of SpiderBoost exploits the special structure of the SPIDER estimator and analyzes the algorithm over the entire inner loop rather than over each iteration, and thus yields a tighter bound.
3 Prox-SpiderBoost for Nonconvex Composite Optimization
Many machine learning optimization problems add a regularization term to the original loss function in order to promote certain desired structures (e.g., sparsity) of the obtained solution. Such a regularization technique can substantially improve the solution quality. In such a case, the resulting optimization problem has a composite objective function that is more challenging to solve, especially when the regularization term is a non-smooth function. To handle such non-smoothness, we next generalize the SpiderBoost algorithm to solve nonconvex composite optimization problems, which take the form
$$\min_{x\in\mathbb{R}^d} \Psi(x) := f(x) + h(x), \quad \text{where } f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad(\text{Q})$$
where the function $f$ denotes the total loss on the training samples, and the function $h$ corresponds to a possibly non-smooth regularizer. To handle the non-smoothness, we next introduce the proximal mapping, which is an effective tool for composite optimization.
3.1 Preliminaries on Proximal Mapping
Consider a proper and lower-semicontinuous function $h$ (which can be non-differentiable). We define its proximal mapping at $x$ with parameter $\eta > 0$ as
$$\mathrm{prox}_{\eta h}(x) := \operatorname*{arg\,min}_{y\in\mathbb{R}^d}\Big\{ h(y) + \frac{1}{2\eta}\|y - x\|^2 \Big\}.$$
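For many common regularizers this mapping has a closed form. As a concrete example (our illustrative choice, not specific to this paper), the proximal mapping of the $\ell_1$-norm $h(x) = \lambda\|x\|_1$ is the well-known soft-thresholding operator:

```python
import numpy as np

# Closed-form proximal mapping of h(x) = lam * ||x||_1 (soft-thresholding):
# argmin_y { lam * ||y||_1 + ||y - x||^2 / (2 * eta) }
def prox_l1(x, eta, lam):
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

x = np.array([3.0, -0.5, 0.2, -2.0])
print(prox_l1(x, eta=1.0, lam=1.0))   # shrinks toward zero by eta*lam, clips small entries
```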
Such a mapping is well defined, and its output is unique in particular for convex functions. Furthermore, the proximal mapping can be used to generalize the first-order stationary condition from smooth optimization to non-smooth composite optimization via the following fact.
Let $h$ be a proper and convex function. Define the following notion of generalized gradient:
$$G_\eta(x) := \frac{1}{\eta}\Big(x - \mathrm{prox}_{\eta h}\big(x - \eta\nabla f(x)\big)\Big).$$
Then, $x$ is a critical point of the function $\Psi = f + h$ (i.e., $0 \in \partial\Psi(x)$) if and only if $G_\eta(x) = 0$.
Proposition 1 introduces a generalized notion of gradient for non-smooth composite optimization. To elaborate, consider the case $h \equiv 0$, so that the corresponding proximal mapping is the identity mapping. Then, the generalized gradient $G_\eta$ reduces to the gradient $\nabla f$ of the unconstrained optimization problem. Therefore, the $\epsilon$-first-order stationary condition for composite optimization is naturally defined as $\|G_\eta(x)\| \le \epsilon$.
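The generalized gradient is directly computable. The sketch below (our construction; a quadratic $f$ and an $\ell_1$ regularizer are assumed for illustration) evaluates $G_\eta$ and verifies that it vanishes at the composite minimizer:

```python
import numpy as np

# G_eta(x) = (x - prox_{eta*h}(x - eta * grad_f(x))) / eta, with h = lam * ||.||_1.
def prox_l1(x, eta, lam):
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

def gen_grad(x, grad_f, eta, lam):
    return (x - prox_l1(x - eta * grad_f(x), eta, lam)) / eta

grad_f = lambda x: x - np.array([1.0, 0.0])   # gradient of f(x) = 0.5*||x - (1,0)||^2
eta, lam = 0.5, 0.1

# Minimizer of 0.5*||x - (1,0)||^2 + 0.1*||x||_1 is x* = (0.9, 0) (soft-thresholded).
x_star = np.array([0.9, 0.0])
print(np.linalg.norm(gen_grad(x_star, grad_f, eta, lam)))   # vanishes at x*
```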
Next, we introduce the algorithm scheme of Prox-SpiderBoost for solving composite optimization problems and study its oracle complexity.
3.2 Prox-SpiderBoost and Oracle Complexity
To generalize to composite optimization, SpiderBoost admits a natural extension, Prox-SpiderBoost, whereas SPIDER encounters challenges. The main reason is that SpiderBoost admits a constant stepsize and its convergence guarantee does not place any restriction on the per-iteration increment of the variable. In contrast, the convergence of SPIDER requires the per-iteration increment of the variable to be at the $\epsilon$-level, which is quite challenging to satisfy under the nonlinear proximal operator in composite optimization. For example, one way to guarantee the per-iteration condition is to add it as a constraint to the proximal map, which complicates the computation of the proximal step, because the proximal mappings of many common regularizers no longer have analytical forms under such an additional constraint. Another possible approach is to normalize the progress direction suggested by the proximal mapping of the current variable so as to satisfy the per-iteration condition, but such an update becomes problematic since it loses the property of being a minimizer of the proximal mapping. Moreover, the conservative stepsize slows down convergence. In contrast, SpiderBoost does not require such a restriction in its convergence guarantee, and hence extends flexibly to nonconvex composite optimization.
The detailed steps of Prox-SpiderBoost (which generalizes SpiderBoost to composite optimization objectives) are described in Algorithm 4. Compared to SpiderBoost that uses a gradient step for smooth optimization, Prox-SpiderBoost updates the variable via a proximal gradient step to handle the possible non-smoothness in composite optimization.
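One inner-loop step of this scheme can be sketched as follows (a toy least-squares finite sum and the $\ell_1$ regularizer are our illustrative assumptions; the exact parameters of Algorithm 4 are not reproduced here):

```python
import numpy as np

# One Prox-SpiderBoost-style step: recursive estimator (eq. 1) followed by a
# proximal gradient step x <- prox_{eta*h}(x - eta * v) with h = lam * ||.||_1.
rng = np.random.default_rng(3)
n, d = 100, 6
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
lam = 0.05
L = np.linalg.norm(A, 2) ** 2 / n
eta = 1.0 / (2.0 * L)

def mb_grad(x, idx):
    return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

def prox_l1(x, eta, lam):
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

x_prev = np.zeros(d)
v = mb_grad(x_prev, np.arange(n))                   # full gradient at loop start
x = prox_l1(x_prev - eta * v, eta, lam)             # proximal gradient step

idx = rng.choice(n, size=10)
v = mb_grad(x, idx) - mb_grad(x_prev, idx) + v      # recursive estimator update
x_prev, x = x, prox_l1(x - eta * v, eta, lam)       # next proximal step
```

Note that nothing here constrains the size of $x_{k+1} - x_k$, which is exactly why the SpiderBoost-style analysis carries over while the SPIDER-style analysis does not.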
Next, we characterize the oracle complexity of Prox-SpiderBoost for achieving the generalized -first-order stationary condition.
Let Assumption 1 hold and consider the problem (Q) with a convex regularizer $h$. Apply Prox-SpiderBoost in Algorithm 4 to solve the problem (Q) with inner-loop length and batch size of order $\sqrt{n}$ and stepsize $\eta = 1/(2L)$. Then, the corresponding output $\bar{x}$ satisfies $\mathbb{E}\|G_\eta(\bar{x})\| \le \epsilon$ provided that the total number of iterations is at least of order $\epsilon^{-2}$.
Moreover, the total SFO complexity is $\mathcal{O}(n + \sqrt{n}\,\epsilon^{-2})$, and the proximal oracle (PO) complexity is $\mathcal{O}(\epsilon^{-2})$.
Theorem 2 shows that the output of Prox-SpiderBoost achieves the generalized first-order stationary condition within $\epsilon$-accuracy with a total SFO complexity of $\mathcal{O}(n + \sqrt{n}\,\epsilon^{-2})$. This improves the state-of-the-art complexity result by a factor of $\mathcal{O}(n^{1/6})$. On the other hand, note that the complexity lower bound for achieving the $\epsilon$-first-order stationary condition in un-regularized optimization (Fang et al., 2018) is also a valid lower bound for composite optimization (by considering the special case $h \equiv 0$). Therefore, the SFO complexity of our Prox-SpiderBoost matches the corresponding complexity lower bound in the regime $n \le \mathcal{O}(\epsilon^{-4})$, and is hence near optimal.
3.3 Constrained Optimization under Non-Euclidean Distance
Prox-SpiderBoost proposed in the previous subsection adopts the proximal mapping that solves an unconstrained subproblem under the Euclidean distance (see the definition of the proximal mapping). Such a mapping can be further generalized to solve constrained composite optimization under a non-Euclidean geometry.
To elaborate, consider solving the composite optimization problem (Q) subject to a convex constraint set $\Omega$. We introduce the Bregman distance associated with a kernel function $w$, defined for all $x, y$ as
$$V(x, y) := w(x) - w(y) - \langle \nabla w(y), x - y \rangle.$$
Here, the kernel function $w$ is smooth and $\alpha$-strongly convex with respect to a certain generic norm. The specific choice of the kernel function should be compatible with the underlying geometry of the constraint set. As an example, for the unconstrained case one can choose $w(x) = \frac{1}{2}\|x\|_2^2$ so that $V(x, y) = \frac{1}{2}\|x - y\|_2^2$, which is 1-strongly convex with respect to the $\ell_2$-norm, whereas for the simplex constraint set, one can choose the entropy kernel $w(x) = \sum_i x_i \log x_i$, which yields the KL relative entropy distance $V(x, y) = \sum_i x_i \log(x_i/y_i)$ and is 1-strongly convex with respect to the $\ell_1$-norm.
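The entropy-kernel case can be sketched concretely. Assuming $h \equiv 0$ for simplicity (our choice, not the general setting), the Bregman proximal step over the simplex has the closed-form exponentiated-gradient update $u \propto x \odot \exp(-\eta g)$:

```python
import numpy as np

# Entropy kernel w(x) = sum_i x_i*log(x_i) on the simplex: V(u, x) = KL(u, x),
# and argmin_{u in simplex} { <g, u> + V(u, x)/eta } is the multiplicative update.
def kl(u, x):
    return float(np.sum(u * np.log(u / x)))

def bregman_step(x, g, eta):
    u = x * np.exp(-eta * g)      # closed-form minimizer (up to normalization)
    return u / u.sum()            # renormalize back onto the simplex

x = np.array([0.25, 0.25, 0.5])
g = np.array([1.0, 0.0, -1.0])
x_new = bregman_step(x, g, eta=0.1)
print(x_new.sum())                # the iterate stays on the simplex
```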
Based on the Bregman distance, the proximal gradient step in Algorithm 4 can be generalized to the following update rule for solving the constrained composite optimization problem:
$$x_{k+1} = \operatorname*{arg\,min}_{u\in\Omega}\Big\{ \langle v_k, u \rangle + h(u) + \frac{1}{\eta} V(u, x_k) \Big\}. \qquad(8)$$
Moreover, the characterization of critical points in Proposition 1 remains valid by defining the generalized gradient as $G_\eta(x_k) := \frac{1}{\eta}(x_k - x_{k+1})$, with $x_{k+1}$ given by eq. 8. Then, we obtain the following oracle complexity result for Prox-SpiderBoost under the Bregman distance (replacing the proximal step in Algorithm 4 by its general version eq. 8) for solving constrained composite optimization.
Let Assumption 1 hold and consider the problem (Q) with a convex regularizer $h$, subject to a convex constraint set $\Omega$. Apply Prox-SpiderBoost with a proper Bregman distance $V$ that is $\alpha$-strongly convex. Choose the inner-loop length and batch size of order $\sqrt{n}$ and a stepsize of order $1/L$. Then, the algorithm outputs a point $\bar{x}$ satisfying $\mathbb{E}\|G_\eta(\bar{x})\| \le \epsilon$ provided that the total number of iterations is at least of order $\epsilon^{-2}$.
Moreover, the total SFO complexity is $\mathcal{O}(n + \sqrt{n}\,\epsilon^{-2})$, and the PO complexity is $\mathcal{O}(\epsilon^{-2})$.
4 Prox-SpiderBoost under Gradient Dominance Condition
Despite the nonconvex geometry of many machine learning problems, their landscapes have been shown to have properties amenable to optimization. In particular, the so-called gradient dominance condition has been shown to hold for a variety of nonconvex problems such as phase retrieval (Zhou et al., 2016), blind deconvolution (Li et al., 2018) and neural networks (Zhou and Liang, 2017). Such a desirable property has been shown to accelerate the convergence of various first-order algorithms.
This motivates us to explore the theoretical performance of Prox-SpiderBoost for solving the composite optimization problem (Q) under the generalized gradient dominance geometry defined below, where the function $f$ can still be nonconvex.
Let $y^*$ be a minimizer of the function $\Psi$. Then, $\Psi$ is said to be $\tau$-gradient dominated if for all $x \in \mathbb{R}^d$ one has
$$\Psi(x) - \Psi(y^*) \le \tau \,\|G_\eta(x)\|^2,$$
where $G_\eta$ is the generalized gradient defined in Proposition 1.
Definition 1 generalizes the traditional gradient dominance condition for single smooth objective functions to composite objective functions. In particular, such a condition allows the objective function to be non-smooth and nonconvex, and it requires the growth of the function value to be controlled by the gradient norm.
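For the smooth special case, the connection to the classical condition can be written out explicitly (a standard fact, cf. Karimi et al. (2016); the constant pairing below is our choice of normalization):

```latex
% With h \equiv 0 the generalized gradient reduces to \nabla f, and Definition 1
% becomes the classical Polyak-Lojasiewicz (gradient dominance) condition:
f(x) - f(x^*) \;\le\; \tau \,\|\nabla f(x)\|^2
\quad\Longleftrightarrow\quad
\|\nabla f(x)\|^2 \;\ge\; 2\mu \big( f(x) - f(x^*) \big),
\qquad \mu := \tfrac{1}{2\tau}.
```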
In order to solve composite optimization problems under the generalized gradient dominance condition, we propose a variant of Prox-SpiderBoost, which we refer to as Prox-SpiderBoost-gd, described in Algorithm 5. We note that Prox-SpiderBoost-gd in Algorithm 5 differs from Prox-SpiderBoost in Algorithm 4 (for general nonconvex optimization) in several aspects. First, Prox-SpiderBoost-gd uses a constant-level mini-batch size in the gradient estimator. Second, after every $q$ iterations (i.e., whenever $\mathrm{mod}(k, q) = 0$), we set the iterate to be a random draw from the previous $q$ iterations, whereas Prox-SpiderBoost chooses the output randomly from the entire iteration history. Prox-SpiderBoost-gd can also be viewed as a generalization of SARAH (Nguyen et al., 2017b) to a proximal algorithm, with the further differences of a much larger stepsize than that chosen by SARAH and random sampling with replacement in the inner-loop iterations, as opposed to the sampling without replacement taken by SARAH.
Next, we present the convergence rate characterization of Algorithm 5 for solving composite optimization problems under the generalized gradient dominance condition.
Let Assumption 1 hold and apply Prox-SpiderBoost-gd in Algorithm 5 to solve the problem (Q). Assume the objective function is $\tau$-gradient dominated and set the algorithm parameters accordingly. Then, the expected optimality gap $\mathbb{E}[\Psi(x_s)] - \Psi(y^*)$ of the generated variable sequence contracts geometrically over the outer iterations $s$.
Consequently, the oracle complexity of Algorithm 5 for finding a point $x$ that satisfies $\mathbb{E}[\Psi(x)] - \Psi(y^*) \le \epsilon$ scales only logarithmically with $1/\epsilon$.
Theorem 4 shows that Prox-SpiderBoost-gd in Algorithm 5 converges linearly to a stationary point for composite optimization problems under the generalized gradient dominance condition. We compare the oracle complexity in Theorem 4 with those of other stochastic proximal algorithms in Table 3. We note that the results for both ProxSVRG and ProxSVRG+ require a restriction on the condition number, in which regime our Prox-SpiderBoost-gd outperforms the SFO complexity of these existing algorithms. Furthermore, our result for Prox-SpiderBoost-gd does not require such a condition, and has the most relaxed dependency on $n$ and the condition number, demonstrating the superior performance of Prox-SpiderBoost-gd for optimizing gradient dominated functions.
For the case $h \equiv 0$ (i.e., the problem objective reduces to the smooth function $f$), our algorithm achieves the same total SFO complexity as that achieved by SARAH (Nguyen et al., 2017b). However, we note that our algorithm allows a constant stepsize of order $\mathcal{O}(1/L)$, while the stepsize used in SARAH is of a smaller order.
| ProxGD | (Karimi et al., 2016) | - |
| ProxSVRG/SAGA | (Reddi et al., 2016b) | (For convenience, problem-independent constants are treated as $\mathcal{O}(1)$ in all bounds; the original requirement simplifies accordingly.) |
| ProxSVRG+ | (Li and Li, 2018) |
5 Prox-SpiderBoost-o for Online Nonconvex Composite Optimization
In this section, we study the performance of a variant of Prox-SpiderBoost for solving nonconvex composite optimization problems under the online setting.
5.1 Unconstrained Optimization under Euclidean Geometry
In this subsection, we study the following composite optimization problem:
$$\min_{x\in\mathbb{R}^d} \Psi(x) := f(x) + h(x), \quad \text{where } f(x) := \mathbb{E}_{\xi}\big[\ell(x; \xi)\big]. \qquad(\text{R})$$
Here, the objective function consists of a population risk $f$ over the underlying data distribution and a regularizer $h$. Such a problem can be viewed as having infinitely many samples, as opposed to the finitely many samples of the finite-sum problem (Q), and the underlying data distribution is typically unknown a priori. Therefore, one cannot evaluate the full gradient over the underlying data distribution in practice. For this type of problem, we propose a variant of Prox-SpiderBoost that applies stochastic sampling to estimate the full gradient when initializing the gradient estimator in the inner loops. We refer to this variant as Prox-SpiderBoost-o, the details of which are summarized in Algorithm 6.
It can be seen that Prox-SpiderBoost-o in Algorithm 6 draws a batch of stochastic samples to estimate the full gradient for initializing the gradient estimator. To analyze its performance, we introduce the following standard assumption on the variance of the stochastic gradients.
The variance of the stochastic gradients is bounded, i.e., there exists a constant $\sigma > 0$ such that for all $x$ and all random draws of $\xi$, it holds that $\mathbb{E}\|\nabla \ell(x; \xi) - \nabla f(x)\|^2 \le \sigma^2$.
Under Assumption 2, the total variance of a mini-batch $S$ of stochastic gradients can be upper bounded by $\sigma^2/|S|$. We obtain the following result on the oracle complexity of Prox-SpiderBoost-o in Algorithm 6.
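This $1/|S|$ variance reduction is easy to verify empirically. The sketch below assumes an additive Gaussian noise model (our illustrative choice; Assumption 2 only requires bounded variance) and compares the mean squared estimation error for batch sizes 1 and 16:

```python
import numpy as np

# Averaging |S| independent stochastic gradients shrinks the variance bound
# from sigma^2 to sigma^2 / |S|.
rng = np.random.default_rng(4)
d, sigma = 8, 2.0
true_grad = np.ones(d)

def stoch_grad():
    return true_grad + sigma * rng.standard_normal(d)

def batch_grad(S):
    return np.mean([stoch_grad() for _ in range(S)], axis=0)

errs = [np.mean([np.sum((batch_grad(S) - true_grad) ** 2)
                 for _ in range(2000)]) for S in (1, 16)]
print(errs[0] / errs[1])   # roughly 16, matching the 1/|S| reduction
```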
Let Assumptions 1 and 2 hold and consider the problem (R) with a convex regularizer $h$. Apply Prox-SpiderBoost-o in Algorithm 6 to solve the problem (R) with properly chosen batch sizes and stepsize. Then, the corresponding output $\bar{x}$ satisfies $\mathbb{E}\|G_\eta(\bar{x})\| \le \epsilon$ provided that the total number of iterations is at least of order $\epsilon^{-2}$.
Moreover, the resulting total SFO complexity is $\mathcal{O}(\epsilon^{-3})$, and the PO complexity is $\mathcal{O}(\epsilon^{-2})$.
In the smooth case $h \equiv 0$, the problem (R) reduces to the online case of problem (P), and Algorithm 6 reduces to the SpiderBoost algorithm, except that the outer-loop gradient is estimated by a batch of samples instead of the full gradient. We refer to such an algorithm as SpiderBoost-o. The following corollary characterizes the performance of SpiderBoost-o for solving an online problem.
Let Assumptions 1 and 2 hold and consider the online setting of problem (P). Apply SpiderBoost-o with properly chosen parameters to solve such a problem. Then, the corresponding output $\bar{x}$ satisfies $\mathbb{E}\|\nabla f(\bar{x})\| \le \epsilon$ provided that the total number of iterations is at least of order $\epsilon^{-2}$.
Moreover, the resulting total SFO complexity is $\mathcal{O}(\epsilon^{-3})$.
5.2 Constrained Optimization under Non-Euclidean Geometry
Algorithm 6 can be generalized to solve the online optimization problem (R) subject to a convex constraint set with a general distance function. To do so, one replaces the proximal gradient update in Algorithm 6 with the generalized proximal gradient step in eq. 8, which is based on a proper Bregman distance $V$. For such an algorithm, we obtain the following result on the oracle complexity of Prox-SpiderBoost-o for solving constrained stochastic composite optimization under a non-Euclidean geometry.
Let Assumptions 1 and 2 hold and consider the problem (R) with a convex regularizer $h$, further subject to a convex constraint set $\Omega$. Apply Prox-SpiderBoost-o with a proper Bregman distance $V$ that is $\alpha$-strongly convex, together with properly chosen batch sizes and stepsize. Then, the algorithm outputs a point $\bar{x}$ that satisfies $\mathbb{E}\|G_\eta(\bar{x})\| \le \epsilon$ provided that the total number of iterations is at least of order $\epsilon^{-2}$.
Moreover, the resulting total SFO complexity is $\mathcal{O}(\epsilon^{-3})$, and the PO complexity is $\mathcal{O}(\epsilon^{-2})$.
In this paper, we proposed an algorithm named SpiderBoost for solving smooth nonconvex optimization problems, which is guaranteed to output an $\epsilon$-approximate first-order stationary point with at most $\mathcal{O}(n + \sqrt{n}\,\epsilon^{-2})$ SFO complexity (the same as SPIDER), but allows a much larger stepsize than SPIDER and hence runs faster in practice. Moreover, we extended SpiderBoost to Prox-SpiderBoost for solving nonsmooth nonconvex optimization, which achieves an $\epsilon$-approximate first-order stationary point with at most $\mathcal{O}(n + \sqrt{n}\,\epsilon^{-2})$ SFO complexity and $\mathcal{O}(\epsilon^{-2})$ PO complexity. The SFO complexity outperforms the existing best result by a factor of $\mathcal{O}(n^{1/6})$. We anticipate that SpiderBoost has great potential to be applied to various other large-scale optimization problems.
- Allen-Zhu (2017) Allen-Zhu, Z. (2017). Natasha 2: Faster non-convex optimization than SGD. ArXiv:1708.08694.
- Allen-Zhu and Hazan (2016) Allen-Zhu, Z. and Hazan, E. (2016). Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning(ICML), pages 699–707.
- Bottou et al. (2018) Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311.
- Defazio et al. (2014) Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), pages 1646–1654.
- Fang et al. (2018) Fang, C., Li, C. J., Lin, Z., and Zhang, T. (2018). Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems (NIPS).
- Ghadimi et al. (2016) Ghadimi, S., Lan, G., and Zhang, H. (2016). Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267–305.
- Johnson and Zhang (2013) Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), pages 315–323.
- Karimi et al. (2016) Karimi, H., Nutini, J., and Schmidt, M. (2016). Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Frasconi, P., Landwehr, N., Manco, G., and Vreeken, J., editors, Machine Learning and Knowledge Discovery in Databases, pages 795–811.
- Lei et al. (2017) Lei, L., Ju, C., Chen, J., and Jordan, M. I. (2017). Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems (NIPS), pages 2348–2358.
- Li et al. (2018) Li, X., Ling, S., Strohmer, T., and Wei, K. (2018). Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Applied and Computational Harmonic Analysis.
- Li and Li (2018) Li, Z. and Li, J. (2018). A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing System (NIPS).
- Nesterov (2014) Nesterov, Y. (2014). Introductory lectures on convex optimization: A basic course. Springer Publishing Company, Incorporated.
- Nguyen et al. (2017a) Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. (2017a). SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70, pages 2613–2621.
- Nguyen et al. (2017b) Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. (2017b). Stochastic recursive gradient algorithm for nonconvex optimization. ArXiv:1705.07261.
- Reddi et al. (2016a) Reddi, S. J., Hefny, A., Sra, S., Poczos, B., and Smola, A. (2016a). Stochastic variance reduction for nonconvex optimization. In Proceedings of The 33rd International Conference on Machine Learning (ICML), pages 314–323.
- Reddi et al. (2016b) Reddi, S. J., Sra, S., Poczos, B., and Smola, A. (2016b). Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems (NIPS), pages 1145–1153.
- Roux et al. (2012) Roux, N. L., Schmidt, M., and Bach, F. R. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems (NIPS), pages 2663–2671.
- Zhang et al. (2018) Zhang, J., Zhang, H., and Sra, S. (2018). R-SPIDER: A Fast Riemannian Stochastic Optimization Algorithm with Curvature Independent Rate. ArXiv:1811.04194.
- Zhou et al. (2018) Zhou, P., Yuan, X.-T., and Feng, J. (2018). Faster First-Order Methods for Stochastic Non-Convex Optimization on Riemannian Manifolds. ArXiv:1811.08109.
- Zhou and Liang (2017) Zhou, Y. and Liang, Y. (2017). Characterization of gradient dominance and regularity conditions for neural networks. ArXiv:1710.06910v2.
- Zhou et al. (2016) Zhou, Y., Zhang, H., and Liang, Y. (2016). Geometrical properties and accelerated gradient solvers of non-convex phase retrieval. In Proc. 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 331–335.
Appendix A Analysis of SpiderBoost
Throughout the appendix, let $f^*$ be such that $f(x) \ge f^*$ for all $x \in \mathbb{R}^d$, which exists by Assumption 1. We first present an auxiliary lemma from Fang et al. (2018).
Lemma 1 (Fang et al. (2018), Lemma 1).
Under Assumption 1, the SPIDER estimator in eq. 1 satisfies, for all $k$,
$$\mathbb{E}\,\|v_k - \nabla f(x_k)\|^2 \le \mathbb{E}\,\|v_{k-1} - \nabla f(x_{k-1})\|^2 + \frac{L^2}{|S|}\,\mathbb{E}\,\|x_k - x_{k-1}\|^2.$$
Telescoping Lemma 1 over the iterations from $k_0 + 1$ to $k$, where $k_0$ denotes the start of the current inner loop (so that $v_{k_0} = \nabla f(x_{k_0})$ and the initial estimation error is zero), we obtain that
$$\mathbb{E}\,\|v_k - \nabla f(x_k)\|^2 \le \frac{L^2}{|S|} \sum_{j=k_0+1}^{k} \mathbb{E}\,\|x_j - x_{j-1}\|^2.$$
We note that the above inequality also holds for $k = k_0$, which can be simply checked by plugging $k = k_0$ into the above inequality (the sum is then empty and the left-hand side is zero).
Next, we prove our main result that yields Theorem 1.
Under Assumption 1, if the stepsize $\eta$, the inner-loop length $q$, and the batch size $|S|$ are chosen appropriately, then the output point of SpiderBoost satisfies the first-order stationarity bound stated in Theorem 1.
By Assumption 1, the entire objective function $f$ is $L$-smooth, which further implies that
$$f(x_{k+1}) \le f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k \rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 \overset{(i)}{=} f(x_k) - \eta\,\langle \nabla f(x_k), v_k \rangle + \frac{L\eta^2}{2}\|v_k\|^2 \overset{(ii)}{=} f(x_k) - \frac{\eta}{2}\|\nabla f(x_k)\|^2 - \Big(\frac{\eta}{2} - \frac{L\eta^2}{2}\Big)\|v_k\|^2 + \frac{\eta}{2}\|\nabla f(x_k) - v_k\|^2,$$
where (i) follows from the update rule of SpiderBoost, and (ii) uses the identity $2\langle a, b \rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2$ for any vectors $a, b$. Taking expectation on both sides of the above inequality yields that