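We are interested in solving the finite-sum smooth minimization problem
$$\min_{w \in \mathbb{R}^d} \left\{ P(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w) \right\}, \tag{1}$$
where each $f_i$, $i \in [n] = \{1, \dots, n\}$, has a Lipschitz continuous gradient with constant $L > 0$. Throughout the paper, we consider the case where $P$ has a finite lower bound $P^*$.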
Problems of the form (1) cover a wide range of convex and nonconvex problems in machine learning applications, including but not limited to logistic regression, neural networks, and multi-kernel learning. In many of these applications, the number of component functions $n$ is very large, which makes the classical Gradient Descent (GD) method less efficient since it requires computing a full gradient many times. Instead, a traditional alternative is to employ stochastic gradient descent (SGD) (Robbins & Monro, 1951; Shalev-Shwartz et al., 2011; Bottou et al., 2016). In recent years, a large number of improved variants of stochastic gradient algorithms called variance reduction methods have emerged, in particular, SAG/SAGA (Schmidt et al., 2016; Defazio et al., 2014), SDCA (Shalev-Shwartz & Zhang, 2013), MISO (Mairal, 2013), SVRG/S2GD (Johnson & Zhang, 2013; Konečný & Richtárik, 2013), SARAH (Nguyen et al., 2017a), etc. These methods were first analyzed for strongly convex problems of the form (1). Due to recent interest in deep neural networks, nonconvex problems of the form (1) have been studied and analyzed by considering a number of different approaches, including many variants of variance reduction techniques (see e.g. (Reddi et al., 2016; Lei et al., 2017; Allen-Zhu, 2017a, b; Fang et al., 2018), etc.).
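We study the SARAH algorithm (Nguyen et al., 2017a, b) depicted in Algorithm 1, slightly modified. We use upper index $s$ to indicate the $s$-th outer loop and lower index $t$ to indicate the $t$-th iteration in the inner loop. The key update rule is
$$v_t^{(s)} = \nabla f_{i_t}(w_t^{(s)}) - \nabla f_{i_t}(w_{t-1}^{(s)}) + v_{t-1}^{(s)}. \tag{2}$$
The computed $v_t^{(s)}$ is used to update
$$w_{t+1}^{(s)} = w_t^{(s)} - \eta v_t^{(s)}. \tag{3}$$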
After $m$ iterations in the inner loop, the outer loop remembers the last computed iterate and starts its loop anew, first with a full gradient computation before again entering the inner loop with updates (2). Instead of remembering the last iterate for the next outer loop, the original SARAH algorithm in (Nguyen et al., 2017a) uses an inner-loop iterate whose index is chosen uniformly at random. The authors of (Nguyen et al., 2017a) chose to do this in order to be able to analyze the convergence rate for a single outer loop. Since in practice it makes sense to keep the last computed iterate if multiple outer loop iterations are used, we give full credit for Algorithm 1 to (Nguyen et al., 2017a) and call it SARAH.
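The loop structure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's Algorithm 1: the toy least-squares components, the data `A` and `b`, and the names `grad_i`, `full_gradient`, and `sarah` are all assumptions made for the example, and the toy problem is convex only so that the result is easy to check.

```python
import math
import random

# Toy data (hypothetical): f_i(w) = 0.5 * (a_i . w - b_i)^2, minimized at w = (1, 2).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
b = [1.0, 2.0, 3.0, -1.0]
n = len(A)

def grad_i(i, w):
    """Gradient of the i-th toy component at w."""
    r = sum(a * x for a, x in zip(A[i], w)) - b[i]
    return [r * a for a in A[i]]

def full_gradient(w):
    """Average of all n component gradients (costs n gradient evaluations)."""
    g = [0.0] * len(w)
    for i in range(n):
        for j, gij in enumerate(grad_i(i, w)):
            g[j] += gij / n
    return g

def sarah(w0, eta, m, S, seed=0):
    """S outer loops; each does one full gradient step and m recursive inner steps."""
    rng = random.Random(seed)
    w_tilde = list(w0)
    for _ in range(S):
        w_prev = list(w_tilde)
        v = full_gradient(w_prev)                       # v_0: full gradient
        w = [x - eta * vj for x, vj in zip(w_prev, v)]  # first step uses v_0
        for _ in range(m):
            i = rng.randrange(n)                        # sample one component uniformly
            g_new, g_old = grad_i(i, w), grad_i(i, w_prev)
            # recursive (biased) estimator: v_t = grad_i(w_t) - grad_i(w_{t-1}) + v_{t-1}
            v = [gn - go + vj for gn, go, vj in zip(g_new, g_old, v)]
            w_prev, w = w, [x - eta * vj for x, vj in zip(w, v)]
        w_tilde = w                                     # keep the last iterate
    return w_tilde

w_out = sarah([0.0, 0.0], eta=0.1, m=4, S=20)
grad_norm_start = math.sqrt(sum(g * g for g in full_gradient([0.0, 0.0])))
grad_norm_end = math.sqrt(sum(g * g for g in full_gradient(w_out)))
```

Each inner step evaluates two component gradients, while each outer loop adds one full gradient, which is where the trade-off between the inner loop size and the number of outer loops comes from.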
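We will analyze SARAH for smooth nonconvex optimization, i.e., we study (1) where we only assume that the component functions have Lipschitz continuous gradients, with no other assumptions:
Assumption 1 ($L$-smooth).
Each $f_i: \mathbb{R}^d \to \mathbb{R}$, $i \in [n]$, is $L$-smooth, i.e., there exists a constant $L > 0$ such that, $\forall w, w' \in \mathbb{R}^d$,
$$\| \nabla f_i(w) - \nabla f_i(w') \| \le L \| w - w' \|.$$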
We stress that our convergence analysis relies only on the above smoothness assumption, without a bounded variance assumption (as required in (Lei et al., 2017; Zhou et al., 2018)) or a Hessian-Lipschitz assumption (as required in (Fang et al., 2018)).
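As a quick numerical illustration of what the smoothness assumption asks for, the sketch below checks the gradient-Lipschitz inequality on a single least-squares component. The component $f(w) = \tfrac{1}{2}(a^\top w - b)^2$, the vector `a`, and all function names are assumptions chosen for the example; for this particular $f$ the gradient difference equals $a a^\top (w - w')$, so the constant $\|a\|^2$ is valid.

```python
import math
import random

def grad_ls(a, b, w):
    """Gradient of the hypothetical component f(w) = 0.5 * (a . w - b)^2."""
    r = sum(ai * wi for ai, wi in zip(a, w)) - b
    return [r * ai for ai in a]

def norm(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(xi * xi for xi in x))

a, b = [1.0, -2.0, 0.5], 0.3
L = sum(ai * ai for ai in a)  # ||a||^2 is a valid smoothness constant for this f

rng = random.Random(0)
worst_ratio = 0.0
for _ in range(1000):
    w = [rng.uniform(-5, 5) for _ in a]
    w2 = [rng.uniform(-5, 5) for _ in a]
    num = norm([g1 - g2 for g1, g2 in zip(grad_ls(a, b, w), grad_ls(a, b, w2))])
    den = norm([x - y for x, y in zip(w, w2)])
    worst_ratio = max(worst_ratio, num / den)  # observed ||grad diff|| / ||w - w'||
```

The observed ratio never exceeds `L`, which is exactly the inequality in the assumption; no bound on the variance of the stochastic gradients is involved.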
We measure the convergence rate in terms of total complexity , i.e., the total number of gradient computations. For SARAH we have
We notice that SARAH, using the notation and definition of (Fang et al., 2018), is a random algorithm that maps functions to a sequence of iterates
where is a measure mapping, is the individual function chosen by at iteration , and
is a uniform random vector with entries in. Rephrasing Theorem 3 in (Fang et al., 2018) states the following lower bound: There exists a function such that in order to find a point for which accuracy , must have a total complexity of at least stochastic gradient computations. Applying this bound to SARAH tells us that if the final output has
Our main contribution is to meet this lower bound and show that in SARAH we can choose parameters and such that the total complexity is
This significantly improves over prior work which only achieves :
Related Work: The paper that introduces SARAH (Nguyen et al., 2017b) is only able to analyze convergence of a single outer loop giving a total complexity of .
Besides the lower bound, (Fang et al., 2018) introduces SPIDER, as a variant of SARAH, which achieves to date the best known convergence result in the nonconvex case. SPIDER uses the SARAH update rule (2) as was originally proposed in (Nguyen et al., 2017a) and the mini-batch version of SARAH in (Nguyen et al., 2017b). SPIDER and SARAH are different in terms of iteration (3), which are and , respectively. Also, SPIDER does not divide into outer loop and inner loop as SARAH does although SPIDER does also perform a full gradient update after a certain fixed number of iterations. A recent technical report (Wang et al., 2018) provides an improved version of SPIDER called SpiderBoost which allows a larger learning rate. Both SPIDER and SpiderBoost are able to show for smooth nonconvex optimization a total complexity of
which is called “near-optimal” in (Fang et al., 2018) since, except for the term, it almost matches the lower bound.
|Method|Total complexity|Additional assumption|
|---|---|---|
|GD (Nesterov, 2004)||None|
|SVRG (Reddi et al., 2016)||None|
|SCSG (Lei et al., 2017)||Bounded variance|
|SNVRG (Zhou et al., 2018)||Bounded variance|
|SPIDER (Fang et al., 2018)||None|
|SpiderBoost (Wang et al., 2018)||None|
|R-SPIDER (Zhang et al., 2018)||None|
|SARAH (this paper)||None|
Table 1 shows the comparison of results on the total complexity for smooth nonconvex optimization.
(a) Each of the complexities in Table 1 also depends on the Lipschitz constant $L$; however, since we consider smooth optimization and it is customary to assume/design , we ignore the dependency on $L$ in the complexity results.
(b) Although many algorithms have appeared during the past few years, we only compare algorithms whose convergence result supposes only the smoothness assumption. For example, (Fang et al., 2018) can also prove a total complexity of by requiring an additional Hessian-Lipschitz assumption and adding dependence on the Hessian-Lipschitz constant to their analysis. This result is not part of the table because it is weaker: the analysis supposes an additional property of the component functions.
(c) Among algorithms with convergence results that suppose only the smoothness assumption, Table 1 mentions only recent state-of-the-art results. For example, we do not provide comparisons with SGD (Robbins & Monro, 1951) and SGD-like methods (e.g. (Duchi et al., 2011; Kingma & Ba, 2014)) since they achieve a much worse complexity of .
(d) Although the bounded variance assumption is accepted in much of the existing literature, this additional assumption limits the applicability of these convergence results since it adds dependence on which can be arbitrarily large. For fair comparison with convergence analysis without the bounded variance assumption, must be set to go to infinity, and this is what is mentioned in Table 1. As an example, from Table 1 we observe that SCSG has an advantage over SVRG only if but, theoretically, it has the same total complexity as SVRG if .
(e) For completeness, incompatibility with assuming a bounded gradient has been discussed in (Nguyen et al., 2018a) for strongly convex objective functions.
According to the results in Table 1, we can observe that SARAH-type algorithms dominate SVRG-type algorithms. In fact this paper proves that SARAH (slightly modified as given in Algorithm 1) achieves the minimal possible total complexity among variance reduction techniques in the nonconvex case for finding a first-order stationary point based on only the smooth assumption. This closes the gap of searching for “better” algorithms since the total complexity meets the lower bound .
Contributions: We summarize our key contributions as follows.
Smooth Non-Convex. We provide a convergence analysis for the full SARAH algorithm with multiple outer iterations for nonconvex problems (unlike (Nguyen et al., 2017b), which only analyzes a single outer iteration). The convergence analysis supposes only the smoothness assumption (Lipschitz continuity of the gradient) and proves that SARAH with multiple outer loops (which has not been analyzed before) attains the asymptotic minimum possible total complexity in the non-convex case (Theorem 1). We extend these results to the mini-batch case (Theorem 2).
Smooth Convex. In order to complete the picture, we study SARAH+ (Nguyen et al., 2017a), which was designed as a variant of SARAH for convex optimization. We propose a novel variant of SARAH+ called SARAH++. Here, we study the iteration complexity measured by the total number of iterations (which counts one full gradient computation as adding one iteration to the complexity), and leave an analysis of the total complexity as an open problem. For SARAH++ we show a sublinear convergence rate in the general convex case (Theorem 3) and a linear convergence rate in the strongly convex case (Theorem 4). SARAH itself may already lead to good convergence and there may be no need to introduce SARAH++; in numerical experiments we show the advantage of SARAH++ over SARAH. We further propose a practical version called SARAH Adaptive which improves the performance of SARAH and SARAH++ for convex problems; numerical experiments on various data sets show good overall performance.
For the convergence analysis of SARAH for the non-convex case and SARAH++ for the convex case we show that the analysis generalizes the total complexity of Gradient Descent (GD) (Remarks 1 and 2), i.e., the analysis reproduces known total complexity results of GD. To the best of our knowledge, this is the first variance reduction method having this property.
2 Non-Convex Case: Convergence Analysis of SARAH
SARAH is very different from other algorithms since it has a biased estimator of the gradient. Therefore, in order to analyze SARAH’s convergence rate, it is non-trivial to use existing proof techniques from unbiased estimator algorithms such as SGD, SAGA, and SVRG.
2.1 A single batch case
We start analyzing SARAH (Algorithm 1) for the case where we choose a single sample uniformly at random from in the inner loop.
The above result is for a single outer loop iteration of SARAH, which includes a full gradient step together with the inner loop. Since the outer loop iteration concludes with , and , we have
Summing over gives
This proves our main result:
Theorem 1 (Smooth nonconvex).
The proof easily follows from (6) since is a lower bound of (that is, ). We note that the term
is simply the average of the expectation of the squared norms of the gradients of all the iteration results generated by SARAH. For nonconvex problems, our goal is to achieve
We note that, for simplicity, if is chosen uniformly at random from all the iterations generated by SARAH, we are able to have accuracy .
The total complexity can be minimized over the inner loop size . By choosing , we achieve the minimal total complexity:
The above results explain the relationship between SARAH and GD as well as the advantages of the inner loop and outer loop of SARAH. SARAH becomes more beneficial in ML applications where $n$ is large.
2.2 Mini-batch case
The above results can be extended to the mini-batch case where instead of choosing a single sample , we choose samples uniformly at random from for updating in the inner loop. We then replace in Algorithm 1 by
where we choose a mini-batch of size uniformly at random at each iteration of the inner loop. The result of Theorem 1 generalizes as follows.
Theorem 2 (Smooth nonconvex with mini-batch).
We can again derive similar corollaries as was done for Theorem 1, but this does not lead to additional insight; it results in the same minimal total complexity for -accurate solutions.
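To make the mini-batch variant concrete, here is a hedged sketch of one recursive estimator step in which the single-sample gradient difference is replaced by an average over a sampled batch. The toy least-squares components, the sampling without replacement via `random.sample`, and the names `minibatch_recursive_step` and `batch_size` are assumptions for illustration; the paper's exact sampling scheme is not recoverable from this text.

```python
import random

# Toy data (hypothetical): f_i(w) = 0.5 * (a_i . w - b_i)^2.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
b = [1.0, 2.0, 3.0, -1.0]
n = len(A)

def grad_i(i, w):
    """Gradient of the i-th toy component at w."""
    r = sum(a * x for a, x in zip(A[i], w)) - b[i]
    return [r * a for a in A[i]]

def full_gradient(w):
    """Average of all n component gradients."""
    g = [0.0] * len(w)
    for i in range(n):
        for j, gij in enumerate(grad_i(i, w)):
            g[j] += gij / n
    return g

def minibatch_recursive_step(w, w_prev, v_prev, batch_size, rng):
    """v_t = (1/batch_size) * sum_{i in batch} [grad_i(w_t) - grad_i(w_{t-1})] + v_{t-1}."""
    batch = rng.sample(range(n), batch_size)  # assumption: sample without replacement
    diff = [0.0] * len(w)
    for i in batch:
        g_new, g_old = grad_i(i, w), grad_i(i, w_prev)
        for j in range(len(w)):
            diff[j] += (g_new[j] - g_old[j]) / batch_size
    return [dj + vj for dj, vj in zip(diff, v_prev)]

w_prev = [0.0, 0.0]
v_prev = full_gradient(w_prev)                        # v_0 at the start of an outer loop
w = [x - 0.1 * vj for x, vj in zip(w_prev, v_prev)]   # one step with learning rate 0.1
v_small_batch = minibatch_recursive_step(w, w_prev, v_prev, 2, random.Random(0))
v_full_batch = minibatch_recursive_step(w, w_prev, v_prev, n, random.Random(0))
```

With the batch equal to the whole data set and sampling without replacement, the estimator coincides with the exact full gradient at the new point, which gives an easy sanity check of the recursion.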
3 Convex Case: SARAH++: A New Variant of SARAH+
Unlike SARAH, SARAH+ provides a stopping criterion for the inner loop; as soon as
the inner loop finishes. This idea originates from the property of SARAH that, for each outer loop iteration , as in the strongly convex case (Theorems 1a and 1b in (Nguyen et al., 2017a)). Therefore, it does not make any sense to update with tiny steps when is small. (We note that SVRG (Johnson & Zhang, 2013) does not have this property.) SARAH+ suggests to empirically choose parameter (Nguyen et al., 2017a) without theoretical guarantee.
and by introducing a stopping criterion for the outer loop.
3.1 Details of SARAH++ and Convergence Analysis
Before analyzing and explaining SARAH++ in detail, we introduce the following assumptions used in this section.
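Assumption 2 ($\mu$-strongly convex).
The function $P: \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex, i.e., there exists a constant $\mu > 0$ such that, $\forall w, w' \in \mathbb{R}^d$,
$$P(w) \ge P(w') + \nabla P(w')^\top (w - w') + \frac{\mu}{2} \| w - w' \|^2.$$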
We note here, for future use, that for strongly convex functions of the form (1) arising in machine learning applications, the condition number is defined as $\kappa = L/\mu$. Assumption 2 covers a wide range of problems, e.g. $\ell_2$-regularized empirical risk minimization problems with convex losses.
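We separately assume the special case of strong convexity of all $f_i$’s with $\mu = 0$, called the general convexity assumption, which we will use for convergence analysis.
Assumption 3.
Each function $f_i: \mathbb{R}^d \to \mathbb{R}$, $i \in [n]$, is convex, i.e., $\forall w, w' \in \mathbb{R}^d$,
$$f_i(w) \ge f_i(w') + \nabla f_i(w')^\top (w - w').$$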
SARAH++ is motivated by the following lemma.
where , inequality (9) implies
For this reason, we choose the stopping criteria for the inner loop in SARAH++ as with . Unlike SARAH+, for analyzing the convergence rate can be as small as .
The above discussion leads to SARAH++ (Algorithm 3). In order to analyze its convergence for convex problems, we define the random variable $T_s$ as the stopping time of the inner loop in the $s$-th outer iteration:
Note that is at least 1 since at , the condition always holds.
Let random variable be the stopping time of the outer iterations as a function of an algorithm parameter :
Notice that SARAH++ maintains a running sum against which parameter is compared in the stopping criteria of the outer loop.
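The two stopping rules discussed above can be sketched as follows. This is an illustrative reading rather than the paper's Algorithm 3: it assumes the inner loop continues while the squared norm of the current estimator is at least `gamma` times that of the full gradient, and that the outer loop stops once a running sum `G` of inner iterations reaches a budget `T`. The toy data, the function name `sarah_pp`, and the parameter names `gamma`, `m`, and `T` are hypothetical.

```python
import math
import random

# Toy data (hypothetical): f_i(w) = 0.5 * (a_i . w - b_i)^2, minimized at w = (1, 2).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
b = [1.0, 2.0, 3.0, -1.0]
n = len(A)

def grad_i(i, w):
    """Gradient of the i-th toy component at w."""
    r = sum(a * x for a, x in zip(A[i], w)) - b[i]
    return [r * a for a in A[i]]

def full_gradient(w):
    """Average of all n component gradients."""
    g = [0.0] * len(w)
    for i in range(n):
        for j, gij in enumerate(grad_i(i, w)):
            g[j] += gij / n
    return g

def sq_norm(x):
    return sum(xi * xi for xi in x)

def sarah_pp(w0, eta, gamma, m, T, seed=0):
    """Inner loop stops early when ||v_t||^2 < gamma * ||v_0||^2; outer loop stops at budget T."""
    rng = random.Random(seed)
    w_tilde, G, inner_counts = list(w0), 0, []
    while G < T:                                  # outer stopping rule on the running sum
        w = list(w_tilde)
        v = full_gradient(w)
        v0_sq = sq_norm(v)
        t = 0
        # the condition holds at t = 0 for gamma <= 1, so each outer loop takes >= 1 step
        while sq_norm(v) >= gamma * v0_sq and t < min(m, T - G):
            w_next = [x - eta * vj for x, vj in zip(w, v)]
            t += 1
            i = rng.randrange(n)
            g_new, g_old = grad_i(i, w_next), grad_i(i, w)
            v = [gn - go + vj for gn, go, vj in zip(g_new, g_old, v)]
            w = w_next
        inner_counts.append(t)                    # stopping time of this inner loop
        w_tilde = w
        G += t
    return w_tilde, inner_counts

w_out, inner_counts = sarah_pp([0.0, 0.0], eta=0.1, gamma=0.2, m=4, T=100)
grad_norm_end = math.sqrt(sq_norm(full_gradient(w_out)))
```

In this sketch the recorded stopping times always lie between 1 and `m` and sum exactly to the budget `T`, mirroring the role of the running sum described above.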
For the general convex case which supposes Assumption 3 in addition to smoothness we have the next theorem.
Theorem 3 (Smooth general convex).
The theorem leads to the next corollary about iteration complexity, i.e., we bound which is the total number of iterations performed by the inner loop across all outer loop iterations. This is different from the total complexity since does not separately count the gradient evaluations when the full gradient is computed in the outer loop.
Corollary 3 (Smooth general convex).
For the conditions in Theorem 3 with , we achieve an -accurate solution after inner loop iterations.
By supposing Assumption 2 in addition to the smoothness and general convexity assumptions, we can prove a linear convergence rate. For strongly convex objective functions we have the following result.
Theorem 4 (Smooth strongly convex).
This leads to the following iteration complexity.
Corollary 4 (Smooth strongly convex).
For the conditions in Theorem 4 with , we achieve after total iterations, where is the condition number.
An interesting open question we would like to discuss here is the total complexity of SARAH++. Although we have shown the convergence results of SARAH++ in terms of the iteration complexity, the total complexity which is computed as the total number of evaluations of the component gradient functions still remains an open question. It is clear that the total complexity must depend on the learning rate (or ) – the factor that decides when to stop the inner iterations.
We note that can be “closely” understood as the total number of updates of the algorithm. The total complexity is equal to . For the special case , , the algorithm recovers the GD algorithm with . Since each full gradient takes gradient evaluations, the total complexity for this case is equal to (in the general convex case) and (in the strongly convex case).
However, it is non-trivial to derive the total complexity of SARAH++ since it should depend on the learning rate . We leave this question as an open direction for future research.
3.2 Numerical Experiments
Paper (Nguyen et al., 2017a) provides experiments showing good overall performance of SARAH over other algorithms such as SGD (Robbins & Monro, 1951), SAG (Le Roux et al., 2012), SVRG (Johnson & Zhang, 2013), etc. For this reason, we provide experiments comparing SARAH++ directly with SARAH. We notice that SARAH (with multiple outer loops) like SARAH++ has theoretical guarantees with sublinear convergence for general convex and linear convergence for strongly convex problems as proved in (Nguyen et al., 2017a). Because of these theoretical guarantees (which SARAH+ does not have), SARAH itself may already perform well for convex problems and the question is whether SARAH++ offers an advantage.
We consider $\ell_2$-regularized logistic regression problems with
where is the training data and the regularization parameter is set to , a widely-used value in literature (Le Roux et al., 2012; Nguyen et al., 2017a). The condition number is equal to . We conducted experiments to demonstrate the advantage in performance of SARAH++ over SARAH for convex problems on popular data sets including covtype ( training data; estimated ) and ijcnn1 ( training data; estimated ) from LIBSVM (Chang & Lin, 2011).
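For readers who want to reproduce a comparable setup, the sketch below implements one component of an $\ell_2$-regularized logistic regression objective and its gradient. It assumes the common form $f_i(w) = \log(1 + \exp(-y_i x_i^\top w)) + \frac{\lambda}{2}\|w\|^2$ with labels in $\{-1, +1\}$; the function names and the sample values are hypothetical, and the regularization parameter is left as an argument.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def logistic_loss_i(w, x, y, lam):
    """log(1 + exp(-y * x.w)) + (lam / 2) * ||w||^2, computed in a numerically stable way."""
    z = y * dot(x, w)
    if z > 0:
        loss = math.log1p(math.exp(-z))
    else:
        loss = -z + math.log1p(math.exp(z))
    return loss + 0.5 * lam * dot(w, w)

def logistic_grad_i(w, x, y, lam):
    """Gradient of the component above: -y * x / (1 + exp(y * x.w)) + lam * w."""
    z = y * dot(x, w)
    if z > 0:
        s = math.exp(-z) / (1.0 + math.exp(-z))   # equals 1 / (1 + exp(z)), avoids overflow
    else:
        s = 1.0 / (1.0 + math.exp(z))
    return [-y * s * xj + lam * wj for xj, wj in zip(x, w)]

# Finite-difference check at a hypothetical point.
w, x, y, lam = [0.3, -0.2], [1.0, 2.0], -1.0, 0.5
eps = 1e-6
g = logistic_grad_i(w, x, y, lam)
g_fd = []
for j in range(len(w)):
    w_plus = list(w); w_plus[j] += eps
    w_minus = list(w); w_minus[j] -= eps
    g_fd.append((logistic_loss_i(w_plus, x, y, lam) - logistic_loss_i(w_minus, x, y, lam)) / (2 * eps))
```

A function like `logistic_grad_i` can be passed as the component gradient oracle to any of the stochastic recursive gradient loops discussed in this paper.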
Figure 1 shows comparisons between SARAH++ and SARAH for different values of learning rate $\eta$. We depicted the value of (i.e. in log scale) for the $y$-axis and “number of effective passes” (or number of epochs, where an epoch is the equivalent of $n$ component gradient evaluations or one full gradient computation) for the $x$-axis. For SARAH, we choose the outer loop size and tune the inner loop size to achieve the best performance. The optimal solution of the strongly convex problem in (11) is found by using Gradient Descent with stopping criterion . We observe that SARAH++ achieves improved overall performance compared to regular SARAH, as shown in Figure 1. From the experiments we see that the stopping criterion () of SARAH++ is indeed important: it keeps the inner loop from taking tiny redundant steps. We also provide experiments on the sensitivity to the maximum inner loop size in the supplementary material.
3.3 SARAH Adaptive: A New Practical Variant
We now propose a practical adaptive method which aims to improve performance. Although we do not have any theoretical result for this adaptive method, numerical experiments are very promising and they heuristically show the improved performance on different data sets.
The motivation of this algorithm comes from the intuition of Lemma 2 (for convex optimization). For a single outer loop with , (9) holds for SARAH (Algorithm 1). Hence, for any , we intentionally choose such that . Since , , in (Nguyen et al., 2017a) for convex problems, we have , . We also stop the inner loop by the stopping criteria for some . SARAH Adaptive is given in detail in Algorithm 4 without convergence analysis.
We have conducted numerical experiments on the same datasets and problems as introduced in the previous subsection. Figures 2 and 3 show the comparison between SARAH Adaptive and SARAH and SARAH++ for different values of . We observe that SARAH Adaptive has an improved performance over SARAH and SARAH++ (without tuning learning rate). We also present the numerical performance of SARAH Adaptive for different values of in the supplementary materials.
We note that additional experiments in this section on more data sets are performed in the supplementary material.
4 Conclusion and Future Research
We have proven how to achieve the optimal total complexity for smooth nonconvex problems in the finite-sum setting, which arises frequently in supervised learning applications; this result was not known in prior literature. For convex problems, we proposed SARAH++ with a theoretical convergence guarantee and showed improved performance over SARAH.
For future research, ideas in this paper may apply to general expectation minimization problems using an inexact version of the gradient (Nguyen et al., 2018b). It would also be noteworthy to investigate SARAH Adaptive in more detail since it has promising empirical results. Moreover, SARAH may open some new research directions because it could be reduced to Gradient Descent as shown in the paper.
- Allen-Zhu (2017a) Allen-Zhu, Z. Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. arXiv preprint arXiv:1702.00763, 2017a.
- Allen-Zhu (2017b) Allen-Zhu, Z. Natasha 2: Faster non-convex optimization than sgd. arXiv preprint arXiv:1708.08694, 2017b.
- Bottou et al. (2016) Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. arXiv:1606.04838, 2016.
- Chang & Lin (2011) Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- Defazio et al. (2014) Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pp. 1646–1654, 2014.
- Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
- Fang et al. (2018) Fang, C., Li, C. J., Lin, Z., and Zhang, T. Spider: Near-optimal non-convex optimization via stochastic path integrated differential estimator. arXiv preprint arXiv:1807.01695, 2018.
- Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- Konečný & Richtárik (2013) Konečný, J. and Richtárik, P. Semi-stochastic gradient descent methods. arXiv:1312.1666, 2013.
- Le Roux et al. (2012) Le Roux, N., Schmidt, M., and Bach, F. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pp. 2663–2671, 2012.
- Lei et al. (2017) Lei, L., Ju, C., Chen, J., and Jordan, M. I. Non-convex finite-sum optimization via SCSG methods. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 2348–2358. Curran Associates, Inc., 2017.
- Mairal (2013) Mairal, J. Optimization with first-order surrogate functions. In ICML, pp. 783–791, 2013.
- Nesterov (2004) Nesterov, Y. Introductory lectures on convex optimization : a basic course. Applied optimization. Kluwer Academic Publ., Boston, Dordrecht, London, 2004. ISBN 1-4020-7553-7.
- Nguyen et al. (2018a) Nguyen, L., Nguyen, P. H., van Dijk, M., Richtárik, P., Scheinberg, K., and Takáč, M. SGD and Hogwild! convergence without the bounded gradients assumption. In ICML, 2018a.
- Nguyen et al. (2017a) Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In ICML, 2017a.
- Nguyen et al. (2017b) Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. Stochastic recursive gradient algorithm for nonconvex optimization. CoRR, abs/1705.07261, 2017b.
- Nguyen et al. (2018b) Nguyen, L. M., Scheinberg, K., and Takáč, M. Inexact SARAH algorithm for stochastic optimization. arXiv preprint arXiv:1811.10105, 2018b.
- Reddi et al. (2016) Reddi, S. J., Hefny, A., Sra, S., Póczos, B., and Smola, A. J. Stochastic variance reduction for nonconvex optimization. In ICML, pp. 314–323, 2016.
- Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
- Schmidt et al. (2016) Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pp. 1–30, 2016.
- Shalev-Shwartz & Zhang (2013) Shalev-Shwartz, S. and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
- Shalev-Shwartz et al. (2011) Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
- Wang et al. (2018) Wang, Z., Ji, K., Zhou, Y., Liang, Y., and Tarokh, V. Spiderboost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690, 2018.
- Zhang et al. (2018) Zhang, J., Zhang, H., and Sra, S. R-spider: A fast riemannian stochastic optimization algorithm with curvature independent rate. arXiv preprint arXiv:1811.04194, 2018.
- Zhou et al. (2018) Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduction for nonconvex optimization. arXiv preprint arXiv:1806.07811, 2018.
Useful Existing Results
Lemma 3 (Theorem 2.1.5 in (Nesterov, 2004)).
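Suppose that $f$ is $L$-smooth. Then, for any $w$, $w' \in \mathbb{R}^d$,
$$f(w) \le f(w') + \nabla f(w')^\top (w - w') + \frac{L}{2} \| w - w' \|^2.$$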