We study the finite-sum optimization problem
$$\min_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad (1)$$
where each $f_i : \mathbb{R}^d \to \mathbb{R}$ can be nonconvex; our only assumption is that the gradients of $f_i$ exist and are Lipschitz continuous. We denote the class of such instances of (1) by $\mathcal{F}_n$.
Problems of the form (1) are of central importance in machine learning; they arise, e.g., in classification and regression (James et al., 2013) and in training (nonconvex) deep neural networks (Deng & Yu, 2014). Consequently, such problems have been intensively studied. A basic approach for solving (1) is gradient descent (GradDescent), described by the following update rule:
$$x^{t+1} = x^t - \eta_t \nabla f(x^t), \qquad (2)$$
where $\eta_t > 0$ is a step size.
However, iteration (2) is prohibitively expensive in large-scale settings where $n$ is very large. For such settings, stochastic and incremental methods are typical (Bertsekas, 2011; Defazio et al., 2014). These methods replace $\nabla f(x^t)$ in (2) with cheap noisy estimates of the gradient. A particularly popular approach, stochastic gradient descent (Sgd), uses $\nabla f_{i_t}(x^t)$, where $i_t$ is chosen uniformly at random from $\{1, \dots, n\}$. While the cost of each iteration is now greatly reduced, the approach is not without drawbacks. Due to the noise (also known as variance in stochastic methods) in the gradients, one typically has to use decreasing step sizes to ensure convergence, and consequently, the convergence rate is adversely affected.
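The tension described above can be sketched in a few lines. The following is an illustrative toy example (the objective and all names are ours, not the paper's): Sgd on a small nonconvex finite sum, with the decreasing step sizes $\eta_t = \eta_0/\sqrt{t+1}$ commonly used to control the gradient noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.normal(size=(n, d))          # per-component data (hypothetical)

def grad_fi(x, i):
    # gradient of a simple nonconvex, smooth component f_i(x) = sin(a_i^T x)
    return np.cos(A[i] @ x) * A[i]

def full_grad(x):
    # gradient of f(x) = (1/n) * sum_i f_i(x)
    return np.mean([grad_fi(x, i) for i in range(n)], axis=0)

x = np.zeros(d)
eta0 = 0.5
for t in range(2000):
    i = rng.integers(n)              # uniformly random index i_t
    x = x - eta0 / np.sqrt(t + 1) * grad_fi(x, i)

print(np.linalg.norm(full_grad(x)))  # stationarity gap after 2000 cheap steps
```

Each iteration touches a single component gradient, so its cost is independent of $n$; the price is the shrinking step size.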
Therefore, it is of great interest to overcome the slowdown of Sgd without giving up its scalability. Towards this end, remarkable progress has been made recently for convex instances of (1). The key realization is that if we make multiple passes through the data, we can store information that allows us to reduce the variance of the stochastic gradients. As a result, we can use constant step sizes (rather than diminishing ones) and obtain convergence faster than Sgd, both in theory and in practice (Johnson & Zhang, 2013; Schmidt et al., 2013; Defazio et al., 2014).
Nonconvex instances of (1) are also known to enjoy similar speedups (Johnson & Zhang, 2013), but existing analysis does not explain this success as it relies heavily on convexity to control variance. Since Sgd dominates large-scale nonconvex optimization (including neural network training), it is of great value to develop and analyze faster nonconvex stochastic methods.
With the above motivation, we analyze Saga (Defazio et al., 2014), an incremental aggregated gradient algorithm that extends the seminal Sag method of Schmidt et al. (2013) and has been shown to work well for convex finite-sum problems (Defazio et al., 2014). Specifically, we analyze Saga for the class $\mathcal{F}_n$ using the incremental first-order oracle (IFO) framework (Agarwal & Bottou, 2014). For $f \in \mathcal{F}_n$, an IFO is a subroutine that takes an index $i \in [n]$ and a point $x \in \mathbb{R}^d$, and returns the pair $(f_i(x), \nabla f_i(x))$.
To our knowledge, this paper presents the first analysis of fast convergence for an incremental aggregated gradient method for nonconvex problems. The attained rates improve over both Sgd and GradDescent, a benefit that is also corroborated by experiments. Before presenting our new analysis, let us briefly recall some items of related work.
1.1 Related work
A concise survey of incremental gradient methods is (Bertsekas, 2011). An accessible analysis of stochastic convex optimization is (Nemirovski et al., 2009). Classically, Sgd stems from the seminal work of Robbins & Monro (1951), and has since witnessed many developments (Kushner & Clark, 2012), including parallel and distributed variants (Bertsekas & Tsitsiklis, 1989; Agarwal & Duchi, 2011; Recht et al., 2011), though non-asymptotic convergence analysis is limited to convex setups. Faster rates for convex problems in $\mathcal{F}_n$ are attained by variance reduced stochastic methods, e.g., (Defazio et al., 2014; Johnson & Zhang, 2013; Schmidt et al., 2013; Shalev-Shwartz & Zhang, 2013; Reddi et al., 2015). Linear convergence of stochastic dual coordinate ascent when the individual $f_i$ may be nonconvex but $f$ is strongly convex is studied in (Shalev-Shwartz, 2015). Lower bounds for convex finite-sum problems are studied in (Agarwal & Bottou, 2014).
For nonconvex nonsmooth problems, the first incremental proximal-splitting method is in (Sra, 2012), though only asymptotic convergence is studied there. Hong (2014) studies convergence of a distributed nonconvex incremental ADMM algorithm. The first work to present non-asymptotic convergence rates for Sgd is (Ghadimi & Lan, 2013); it gives an iteration bound for Sgd to satisfy the approximate stationarity condition $\|\nabla f(x)\|^2 \le \epsilon$, a convergence criterion motivated by the gradient descent analysis of Nesterov (2003). The first analysis of a nonconvex variance reduced stochastic gradient method is due to Shamir (2014), who applies it to the specific problem of principal component analysis (PCA).
In this section, we further explain our assumptions and goals. We say $f$ is $L$-smooth if there is a constant $L > 0$ such that
$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|, \qquad \forall\, x, y \in \mathbb{R}^d.$$
Throughout, we assume that all $f_i$ in (1) are $L$-smooth, i.e., $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|$ for all $i \in [n]$. Such an assumption is common in the analysis of first-order methods. For ease of exposition, we assume the Lipschitz constant $L$ to be independent of $i$. For our analysis, we also discuss the class of $\tau$-gradient dominated (Polyak, 1963; Nesterov & Polyak, 2006) functions, namely functions $f$ for which
$$f(x) - f(x^*) \le \tau \|\nabla f(x)\|^2, \qquad \forall\, x, \qquad (3)$$
where $x^*$ is a global minimizer of $f$. This class of functions was originally introduced by Polyak (1963). Observe that such functions need not be convex. Also notice that gradient dominance (3) is a restriction on the overall function $f$, but not on the individual functions $f_i$.
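A concrete instance (a standard textbook example, not from this paper) is $f(x) = x^2 + 3\sin^2(x)$: it is nonconvex, yet gradient dominated on $\mathbb{R}$ with $\tau = 16$ and $x^* = 0$. The quick numerical check below verifies both claims on a grid.

```python
import numpy as np

# f(x) = x^2 + 3*sin(x)^2 satisfies (3) with tau = 16, x* = 0, f(x*) = 0.
f = lambda x: x**2 + 3.0 * np.sin(x)**2
g = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)   # f'(x)

xs = np.linspace(-10.0, 10.0, 100001)
# gradient dominance (3) holds numerically on the grid
assert np.all(f(xs) <= 16.0 * g(xs)**2 + 1e-9)

# yet f is nonconvex: f''(x) = 2 + 6*cos(2x) is negative, e.g., at x = pi/2
assert 2.0 + 6.0 * np.cos(np.pi) < 0
```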
Following Nesterov (2003) and Ghadimi & Lan (2013), we use $\|\nabla f(x)\|^2$ to judge approximate stationarity of $x$. Contrast this with Sgd for convex $f$, where one uses $f(x) - f(x^*)$ or $\|x - x^*\|^2$ as criteria for convergence analysis. Such criteria cannot be used for nonconvex functions due to the intractability of the problem.
While the quantities $\|\nabla f(x)\|^2$, $f(x) - f(x^*)$, and $\|x - x^*\|^2$ are not comparable in general, they are typically assumed to be of similar magnitude (see Ghadimi & Lan, 2013). We note that our analysis does not assume $n$ to be a constant, so we report the dependence on it in our results. Furthermore, while stating our complexity results, we assume that the initial point of the algorithms is constant, i.e., $f(x^0) - f(x^*)$ and $\|x^0 - x^*\|$ are constants. For our analysis, we will need the following definition.
A point $x$ is called $\epsilon$-accurate if $\|\nabla f(x)\|^2 \le \epsilon$. A stochastic iterative algorithm is said to achieve $\epsilon$-accuracy in $t$ iterations if $\mathbb{E}[\|\nabla f(x^t)\|^2] \le \epsilon$, where the expectation is taken over the stochasticity of the algorithm.
We start our discussion of algorithms by recalling Sgd, which performs the following update in its $t^{\text{th}}$ iteration:
$$x^{t+1} = x^t - \eta_t \nabla f_{i_t}(x^t), \qquad (4)$$
where $i_t$ is a random index chosen uniformly from $[n]$, and the gradient $\nabla f_{i_t}(x^t)$ approximates the gradient of $f$ at $x^t$, $\nabla f(x^t)$. It can be seen that the update is unbiased, i.e., $\mathbb{E}[\nabla f_{i_t}(x^t)] = \nabla f(x^t)$, since the index $i_t$ is chosen uniformly at random. Though the Sgd update is unbiased, it suffers from variance due to the aforementioned stochasticity in the chosen index. To control the variance, one has to decrease the step size $\eta_t$ in (4), which in turn leads to slow convergence. The following is a well-known result on Sgd in the context of nonconvex optimization (Ghadimi & Lan, 2013).
Suppose $\|\nabla f_i(x)\| \le \sigma$ for all $i \in [n]$ and $x \in \mathbb{R}^d$, i.e., the gradients of the individual functions are bounded; then the IFO complexity of Sgd to obtain an $\epsilon$-accurate solution is $O(1/\epsilon^2)$.
It is instructive to compare the above result with the convergence rate of GradDescent. The IFO complexity of GradDescent is $O(n/\epsilon)$. Thus, while Sgd eliminates the dependence on $n$, the convergence rate worsens from $O(1/\epsilon)$ in GradDescent to $O(1/\epsilon^2)$. In the next section, we investigate an incremental method with faster convergence.
We describe below the Saga algorithm and prove its fast convergence for nonconvex optimization. Saga is a popular incremental method in the machine learning and optimization communities. It is very effective in reducing the variance introduced by the stochasticity in Sgd. Algorithm 1 presents pseudocode for Saga. Note that the update $v^t$ (Line 5) is unbiased, i.e., $\mathbb{E}[v^t] = \nabla f(x^t)$. This is due to the uniform random selection of the index $i_t$. It can be seen in Algorithm 1 that Saga maintains gradients $\nabla f_i(\alpha_i^t)$ at points $\alpha_i^t$ for $i \in [n]$. This additional set is critical to reducing the variance of the update $v^t$. At each iteration of the algorithm, one of the $\alpha_i^t$ is updated to the current iterate. An implementation of Saga requires the storage and updating of the gradients $\nabla f_i(\alpha_i^t)$; the storage cost of the algorithm is $O(nd)$. While this leads to higher storage in comparison to Sgd, this cost is often reasonable for many applications. Furthermore, this cost can be reduced for specific models; refer to (Defazio et al., 2014) for more details.
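A minimal runnable sketch of the update loop just described, consistent with our reading of Algorithm 1 (the toy objective and all names are ours, not the paper's code): store one gradient per component, maintain their running average, and take constant-step-size updates with the variance-reduced direction $v^t$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 10
A = rng.normal(size=(n, d))

def grad_fi(x, i):
    # nonconvex, L-smooth toy component: f_i(x) = sin(a_i^T x)
    return np.cos(A[i] @ x) * A[i]

x = np.zeros(d)
alpha_grads = np.stack([grad_fi(x, i) for i in range(n)])  # grad f_i(alpha_i), alpha_i = x^0
g_avg = alpha_grads.mean(axis=0)                           # (1/n) sum_i grad f_i(alpha_i)
eta = 0.05                                                 # constant step size

for t in range(3000):
    i = rng.integers(n)
    new_g = grad_fi(x, i)
    # variance-reduced, unbiased direction: E[v] = grad f(x)
    v = new_g - alpha_grads[i] + g_avg
    x = x - eta * v
    # update the stored gradient for index i and the running average
    g_avg += (new_g - alpha_grads[i]) / n
    alpha_grads[i] = new_g

full_grad = np.mean([grad_fi(x, i) for i in range(n)], axis=0)
print(np.linalg.norm(full_grad))
```

Note that each iteration makes a single IFO call; the stored table `alpha_grads` is what costs $O(nd)$ memory.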
For ease of exposition, we introduce the following quantity:
$$R^t := \mathbb{E}\Big[f(x^t) + \frac{c_t}{n}\sum_{i=1}^{n}\|x^t - \alpha_i^t\|^2\Big],$$
where the parameters $c_t$, $\beta$, and the step size $\eta$ will be defined shortly. We start with the following set of key results that provide the convergence rate of Algorithm 1.
For $c_t, c_{t+1}, \beta > 0$, suppose we have
$$c_t = c_{t+1}\Big(1 - \frac{1}{n} + \eta\beta + 2\eta^2 L^2\Big) + \eta^2 L^3.$$
Also let $\eta$, $\beta$, and $c_{t+1}$ be chosen such that
$$\Gamma_t := \eta - \frac{c_{t+1}\eta}{\beta} - \eta^2 L - 2c_{t+1}\eta^2 > 0.$$
Then, the iterates $\{x^t\}$ of Algorithm 1 satisfy the bound
$$\mathbb{E}\big[\|\nabla f(x^t)\|^2\big] \le \frac{R^t - R^{t+1}}{\Gamma_t}.$$
The proof of this lemma is given in Section 9. Using this lemma we prove the following result on the iterates of Saga.
We apply telescoping sums to the result of Lemma 1 to obtain
$$\gamma_n \sum_{t=0}^{T-1} \mathbb{E}\big[\|\nabla f(x^t)\|^2\big] \le \sum_{t=0}^{T-1} \Gamma_t\, \mathbb{E}\big[\|\nabla f(x^t)\|^2\big] \le R^0 - R^T,$$
where $\gamma_n := \min_t \Gamma_t$. The first inequality follows from the definition of $\gamma_n$. This inequality in turn implies the bound
$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\big[\|\nabla f(x^t)\|^2\big] \le \frac{f(x^0) - f(x^*)}{T\gamma_n},$$
where we used $R^T \ge \mathbb{E}[f(x^T)] \ge f(x^*)$ and $R^0 = f(x^0)$ (since $\alpha_i^0 = x^0$ for all $i$).
Note that the $O(\cdot)$ notation above hides $\gamma_n$, since this quantity can depend on $n$. To obtain an explicit dependence on $n$, we have to use appropriate choices of $\eta$ and $\beta$. This is made precise by the following main result of the paper.
With the values of $\eta$ and $\beta$ stated in the theorem, let us first establish an upper bound on $c_t$. Let $\theta$ denote $\frac{1}{n} - \eta\beta - 2\eta^2 L^2$. Observe that $0 < \theta < 1$ and $\theta = \Theta(1/n)$. This is due to the specific values of $\eta$ and $\beta$ stated in the theorem. Also, we have the recursion $c_t = c_{t+1}(1 - \theta) + \eta^2 L^3$. Using this relationship, it is easy to see that $c_t \le \eta^2 L^3/\theta$ for all $t$. Therefore, we obtain the bound
$$\Gamma_t = \eta - \frac{c_{t+1}\eta}{\beta} - \eta^2 L - 2c_{t+1}\eta^2 \ge \frac{\nu}{L n^{2/3}}$$
for all $t$ and a universal constant $\nu > 0$, where the inequality follows from the definition of $\Gamma_t$ and the fact that $c_t \le \eta^2 L^3/\theta$. Using this lower bound on $\gamma_n = \min_t \Gamma_t$ together with the telescoped result of Lemma 1, we can conclude that
$$\min_t\, \mathbb{E}\big[\|\nabla f(x^t)\|^2\big] \le \frac{L n^{2/3}\big(f(x^0) - f(x^*)\big)}{\nu\, T}.$$
A more general result for other choices of the step size $\eta$ can be proved, but it would only lead to a theoretically suboptimal convergence result. Recall that each iteration of Algorithm 1 requires $O(1)$ IFO calls. Using this fact, we can restate Theorem 3 in terms of its IFO complexity as follows.
This corollary follows from the per-iteration cost of Algorithm 1 and the fact that $n$ IFO calls are required to compute the initial gradients $\nabla f_i(\alpha_i^0)$ at the start of the algorithm. In special cases, these initial $n$ IFO calls can be avoided (refer to (Schmidt et al., 2013; Defazio et al., 2014) for details). By comparing the IFO complexity of Saga ($O(n + n^{2/3}/\epsilon)$) with that of GradDescent ($O(n/\epsilon)$), we see that Saga is faster than GradDescent by a factor of $n^{1/3}$.
4 Finite Sums with Regularization
In this section, we study finite-sum problems with additional regularization. More specifically, we consider problems of the form
$$\min_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) + r(x), \qquad (8)$$
where $r : \mathbb{R}^d \to \mathbb{R}$ is an $L$-smooth (possibly nonconvex) function. Problems of this nature arise in machine learning, where the functions $f_i$ are loss functions and $r$ is a regularizer. Since we assumed $r$ to be smooth, (8) can be reformulated as (1) by simply folding $r$ into the functions $f_i$. However, as we will see, it is beneficial to handle the regularization separately. We call the variant of Saga with this additional regularization Reg-Saga. The key difference between Reg-Saga and Saga lies in Line 6 of Algorithm 1. In particular, for Reg-Saga, Line 6 of Algorithm 1 is replaced with the following update:
$$x^{t+1} = x^t - \eta\big(v^t + \nabla r(x^t)\big). \qquad (9)$$
Note that the primary difference is that the part of the gradient corresponding to the function $r$ is computed exactly at each iteration of the algorithm. The convergence of Reg-Saga can be proved along the lines of our analysis of Saga. Hence, we omit the details for brevity and directly state the following key result on the IFO complexity of Reg-Saga.
If the function $f$ is of the form (8), then the IFO complexity of Reg-Saga to obtain an $\epsilon$-accurate solution is $O(n + n^{2/3}/\epsilon)$.
The proof essentially follows along the lines of the proof of Theorem 3, the difference being that the update corresponding to the function $r$ is handled explicitly at each iteration. Note that the above IFO complexity is no different from that in Corollary 1. However, its main benefit comes in the form of storage efficiency for problems with more structure. To understand this, consider problems of the form
$$\min_{x \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} f_i(x) + r(x), \qquad f_i(x) = \ell(x^\top z_i), \qquad (10)$$
where $z_i \in \mathbb{R}^d$ for $i \in [n]$, while $\ell$ is a differentiable loss function. Here, $\ell$ and $r$ can in general be nonconvex. Such problems are popularly known as (regularized) empirical risk minimization in the machine learning literature. We can directly apply Saga to (10) by casting it in the form (1). However, recall that the storage cost of Saga is $O(nd)$ due to the cost of storing the gradients $\nabla f_i(\alpha_i^t)$. This storage cost can be avoided in Reg-Saga by handling the function $r$ separately. Indeed, for Reg-Saga we need to store just $\nabla f_i(\alpha_i^t)$ for all $i \in [n]$ (as $\nabla r(x^t)$ is computed at each iteration). By observing that $\nabla f_i(x) = \ell'(x^\top z_i)\, z_i$, where $\ell'$ represents the derivative of $\ell$, it is apparent that we need to store only the scalars $\ell'(\alpha_i^{t\top} z_i)$ for Reg-Saga. This reduces the storage to $O(n)$ instead of the $O(nd)$ of Saga.
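The $O(n)$ storage trick above can be sketched as follows. This is an illustrative implementation under our reading of the text (the choice of $\ell$, the regularizer, and all names are assumptions, not the paper's code): since $\nabla f_i(x) = \ell'(x^\top z_i)\, z_i$, we keep one scalar $s_i = \ell'(\alpha_i^\top z_i)$ per example instead of a $d$-dimensional gradient, and handle $\nabla r$ exactly at every step.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 20
Z = rng.normal(size=(n, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # normalized features
lam = 1e-3

def l_prime(u):
    # derivative of the logistic loss l(u) = log(1 + exp(-u))
    return -1.0 / (1.0 + np.exp(u))

def grad_r(x):
    # smooth nonconvex regularizer r(x) = lam * sum_j x_j^2 / (1 + x_j^2)
    return lam * 2.0 * x / (1.0 + x**2)**2

x = np.zeros(d)
s = l_prime(Z @ x)                   # n scalars instead of an n-by-d table
g_avg = (s[:, None] * Z).mean(axis=0)
eta = 0.1

for t in range(2000):
    i = rng.integers(n)
    s_new = l_prime(Z[i] @ x)
    # Reg-Saga step: variance-reduced loss part + exact regularizer gradient
    v = (s_new - s[i]) * Z[i] + g_avg + grad_r(x)
    x = x - eta * v
    g_avg += (s_new - s[i]) * Z[i] / n
    s[i] = s_new
```

The memory for the stored gradients is the array `s` of $n$ scalars, as claimed.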
5 Linear convergence rates for gradient dominated functions
Until now, the only assumption we used was Lipschitz continuity of the gradients. An immediate question is whether the IFO complexity can be further improved under stronger assumptions. We provide an affirmative answer by showing that for gradient dominated functions, a variant of Saga attains a linear convergence rate. Recall that a function $f$ is called $\tau$-gradient dominated if, around an optimal point $x^*$, $f$ satisfies the following growth condition:
$$f(x) - f(x^*) \le \tau \|\nabla f(x)\|^2, \qquad \forall\, x \in \mathbb{R}^d.$$
For such functions, we use the variant of Saga shown in Algorithm 2. Observe that Algorithm 2 uses Saga as a subroutine. Alternatively, one can rewrite Algorithm 2 in the form of iterations of Algorithm 1 where one restarts after every $T$ iterations from a uniformly random iterate amongst the last $T$ iterates. For this variant of Saga, we can prove the following linear convergence result.
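The restart scheme just described can be sketched as a thin wrapper around the inner solver. This is our paraphrase with hypothetical helper names (`saga_run` stands in for Algorithm 1), not the paper's pseudocode:

```python
import numpy as np

def gd_saga(x0, saga_run, T, K, rng):
    """Run K outer stages; each stage runs the inner solver for T iterations
    and restarts from a uniformly random iterate among those T."""
    x = x0
    for k in range(K):
        iterates = saga_run(x, T)       # list of the T inner iterates
        x = iterates[rng.integers(T)]   # random iterate from the last T
    return x

# toy sanity check with a trivial inner solver (plain gradient steps)
rng = np.random.default_rng(3)
grad = lambda x: 2.0 * x                # f(x) = ||x||^2 is gradient dominated

def fake_saga(x, T):
    out = []
    for _ in range(T):
        x = x - 0.1 * grad(x)           # each step contracts x by 0.8
        out.append(x)
    return out

x = gd_saga(np.ones(4), fake_saga, T=50, K=10, rng=rng)
print(np.linalg.norm(x))                # shrinks geometrically in K
```

The point of the wrapper is only the restart-from-a-random-iterate logic; the linear rate of Theorem 2 of course relies on the actual Saga subroutine and gradient dominance.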
An immediate consequence of this theorem is the following.
While we state the result in terms of $\mathbb{E}[f(x^t)] - f(x^*)$, it is not hard to see that for gradient dominated functions a similar result holds with $\mathbb{E}[\|\nabla f(x^t)\|^2]$ as the convergence criterion.
A noteworthy aspect of the above result is the linear convergence rate to a global optimum; in this sense, the result is stronger than Theorem 3. Note that throughout our analysis of gradient dominated functions, no assumptions other than Lipschitz smoothness are placed on the individual functions $f_i$. We emphasize that these results can be further improved with additional assumptions (e.g., strong convexity) on the individual functions $f_i$ and on $f$. Also note that GradDescent can achieve a linear convergence rate for gradient dominated functions (Polyak, 1963). However, the IFO complexity of GradDescent is $O(\tau n \log(1/\epsilon))$, which is strictly worse than the IFO complexity of GD-Saga (see Corollary 2).
6 Minibatch Variant
A common variant of incremental methods is to sample a set of indices $I_t$ instead of a single index $i_t$ when approximating the gradient. Such a variant is generally referred to as a "minibatch" version of the algorithm. Minibatch variants are of great practical significance since they reduce the variance of incremental methods and promote parallelism. Algorithm 3 lists the pseudocode for a minibatch variant of Saga. Algorithm 3 uses a set $I_t$ of size $|I_t| = b$ for calculating the update $v^t$, instead of the single index $i_t$ used in Algorithm 1. By using a larger $b$, one can reduce the variance due to the stochasticity in the algorithm. Such a procedure is also beneficial in parallel settings, since the calculation of the update $v^t$ can be parallelized. For this algorithm, we can prove the following convergence result.
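The minibatch update, under our reading of Algorithm 3, simply averages the variance-reduced differences over the sampled set $I_t$ (toy objective and names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, b = 100, 10, 8
A = rng.normal(size=(n, d))
grad_fi = lambda x, i: np.cos(A[i] @ x) * A[i]   # toy smooth nonconvex components

x = np.zeros(d)
alpha_grads = np.stack([grad_fi(x, i) for i in range(n)])
g_avg = alpha_grads.mean(axis=0)
eta = 0.1                                        # larger b permits a larger step

for t in range(1000):
    I = rng.choice(n, size=b, replace=False)     # minibatch I_t, |I_t| = b
    new_g = np.stack([grad_fi(x, i) for i in I])
    # average of per-index variance-reduced differences; still unbiased
    v = (new_g - alpha_grads[I]).mean(axis=0) + g_avg
    x = x - eta * v
    g_avg += (new_g - alpha_grads[I]).sum(axis=0) / n
    alpha_grads[I] = new_g
```

The $b$ gradient evaluations inside each iteration are independent of one another, which is what makes the update embarrassingly parallel.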
We omit the details of the proof since it is similar to the proof of Theorem 3. Note that the key difference in comparison to Theorem 1 is that we can now use a more aggressive step size due to the larger minibatch size $b$. An interesting aspect of the result is its dependence on the minibatch size $b$: as long as $b$ is not too large ($b \le n^{2/3}$), one can significantly improve the convergence rate to a stationary point. A restatement of the aforementioned result in terms of IFO complexity is provided below.
By comparing the above result with Corollary 1, we see that the IFO complexity of minibatch-Saga is the same as that of Saga. However, since the $b$ gradients can be computed in parallel, one can achieve a (theoretical) $b$-fold speedup in multicore and distributed settings. In contrast, the performance of Sgd degrades with the minibatch size, since the improvement in its convergence rate is typically only a factor of $\sqrt{b}$, while $b$ IFO calls are required at each iteration of minibatch-Sgd. Thus, Saga has a much more efficient minibatch version in comparison to Sgd.
Discussion of Convergence Rates
Before ending our discussion of convergence rates, it is important to compare and contrast the different rates obtained in this paper. For general smooth nonconvex problems, we observed that Saga has a low IFO complexity of $O(n + n^{2/3}/\epsilon)$ in comparison to Sgd ($O(1/\epsilon^2)$) and GradDescent ($O(n/\epsilon)$). This difference is especially significant if one requires a medium- to high-accuracy solution, i.e., $\epsilon$ is small.
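To make the comparison concrete, here is the arithmetic with illustrative values of $n$ and $\epsilon$ (constants dropped; the numbers are ours, not from the paper):

```python
# leading-order IFO-call counts, constants dropped
n, eps = 10**6, 10**-5
sgd  = 1 / eps**2            # O(1/eps^2)
gd   = n / eps               # O(n/eps)
saga = n + n**(2/3) / eps    # O(n + n^{2/3}/eps)
print(f"Sgd ~{sgd:.1e}, GradDescent ~{gd:.1e}, Saga ~{saga:.1e}")
# at this (high) accuracy, Saga's count is an order of magnitude below Sgd's
```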
Furthermore, for gradient dominated functions, Sgd obtains only a sublinear convergence rate of $O(1/\epsilon^2)$ (for Sgd, we are not aware of any better convergence rates for gradient dominated functions), as opposed to the fast linear convergence rate of a variant of Saga (see Theorem 2). It is interesting future work to explore other setups where we can achieve stronger convergence rates.
From our analysis of minibatch-Saga in Section 6, we observe that Saga profits from minibatching much more than Sgd. In particular, one can achieve a (theoretical) linear speedup using minibatching in Saga in parallel settings. On the other hand, the performance of Sgd typically degrades with minibatching. In summary, Saga enjoys all the benefits of GradDescent, such as a constant step size and efficient minibatching, with a much weaker dependence on $n$.
Notably, Saga, unlike Sgd, does not require any additional assumption of bounded gradients (see Theorem 1 and Corollary 1). Moreover, to use a constant step size for Sgd, one needs advance knowledge of the total number of iterations in order to obtain the convergence rate stated in Theorem 1.
We present our empirical results in this section. For our experiments, we study the problem of binary classification using nonconvex regularizers. The input consists of tuples $\{(z_i, y_i)\}_{i=1}^{n}$, where $z_i \in \mathbb{R}^d$ (commonly referred to as features) and $y_i \in \{-1, 1\}$ (class labels). We are interested in the empirical loss minimization setup described in Section 4. Recall that the finite-sum problem with regularization takes the form
$$\min_{x \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} f_i(x) + r(x). \qquad (12)$$
For our experiments, we use the logistic loss for the functions $f_i$, i.e., $f_i(x) = \log\big(1 + \exp(-y_i\, x^\top z_i)\big)$ for all $i \in [n]$. All $z_i$ are normalized such that $\|z_i\| = 1$. We observe that this loss function has Lipschitz continuous gradients. A nonconvex, smooth function of the form $r(x) = \lambda \sum_{j} \alpha x_j^2/(1 + \alpha x_j^2)$ is chosen as the regularizer (see (Antoniadis et al., 2009)). We use the same settings of the parameters $\lambda$ and $\alpha$ for all the datasets.
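The experimental objective can be sketched as follows, assuming the regularizer has the $\lambda \alpha x_j^2/(1+\alpha x_j^2)$ form discussed above (the parameter values and all names here are our illustrative assumptions). The finite-difference check at the end validates the analytic gradient.

```python
import numpy as np

def objective_and_grad(x, Z, y, lam=1e-3, alpha=1.0):
    """Logistic loss with a smooth nonconvex regularizer."""
    margins = y * (Z @ x)                       # y_i * <z_i, x>
    loss = np.mean(np.log1p(np.exp(-margins)))  # logistic loss
    p = -1.0 / (1.0 + np.exp(margins))          # d(loss_i)/d(margin_i)
    grad_loss = (p * y) @ Z / len(y)
    reg = lam * np.sum(alpha * x**2 / (1.0 + alpha * x**2))
    grad_reg = lam * 2.0 * alpha * x / (1.0 + alpha * x**2)**2
    return loss + reg, grad_loss + grad_reg

# finite-difference sanity check of the gradient (illustrative)
rng = np.random.default_rng(5)
Z = rng.normal(size=(20, 5)); Z /= np.linalg.norm(Z, axis=1, keepdims=True)
y = rng.choice([-1.0, 1.0], size=20)
x = rng.normal(size=5)
_, g = objective_and_grad(x, Z, y)
e = np.zeros(5); e[0] = 1e-6
fp, _ = objective_and_grad(x + e, Z, y)
fm, _ = objective_and_grad(x - e, Z, y)
print(abs((fp - fm) / 2e-6 - g[0]))             # central difference vs. analytic
```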
We compare the performance of Sgd (the de facto incremental method for nonconvex optimization) with nonconvex Reg-Saga in our experiments. The comparison is based on the following criteria: (i) the objective function value (also called the training loss in this context), which is the main goal of the paper; and (ii) the stationarity gap $\|\nabla f(x)\|^2$, the criterion used in our theoretical analysis. For the step size of Sgd, we use the popular inverse schedule $\eta_t = \eta_0/(1 + \eta'\lfloor t/n \rfloor)$, where $\eta_0$ and $\eta'$ are tuned so that Sgd gives its best performance on the training loss. In our experiments, we also consider $\eta' = 0$; this results in a fixed step size for Sgd. For Reg-Saga, a fixed step size is chosen (as suggested by our analysis) so that it gives the best performance on the objective function value, i.e., the training loss. Note that due to the structure of the problem in (12), as discussed in Section 4, the storage cost of Reg-Saga is just $O(n)$.
Initialization is critical to many incremental methods like Reg-Saga, due to the stronger dependence of their convergence on the initial point (see Theorem 3). Furthermore, one has to obtain the initial gradients $\nabla f_i(\alpha_i^0)$ for all $i \in [n]$ in the Reg-Saga algorithm (see Algorithm 1). For the initialization of both Sgd and Reg-Saga, we make one pass (without replacement) through the dataset and perform the updates of Sgd during this pass. Doing so not only provides a good initial point, but also lets us compute the initial values of $\nabla f_i(\alpha_i^0)$ for $i \in [n]$. Note that this choice results in a variant of Reg-Saga where the $\alpha_i^0$ differ across $i$ (unlike the pseudocode in Algorithm 1). The convergence rate of this variant can be shown to be similar to that of Algorithm 1.
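The initialization heuristic just described can be sketched as one Sgd pass that doubles as the gradient-table builder (toy objective and names are ours, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 100, 10
A = rng.normal(size=(n, d))
grad_fi = lambda x, i: np.cos(A[i] @ x) * A[i]   # toy smooth components

x = np.zeros(d)
alpha_grads = np.empty((n, d))
for i in rng.permutation(n):        # one pass, without replacement
    g = grad_fi(x, i)
    alpha_grads[i] = g              # alpha_i^0 = the iterate seen at this step
    x = x - 0.05 * g                # Sgd update during the pass
g_avg = alpha_grads.mean(axis=0)    # warm start x and table ready for Saga
```

The single pass costs exactly $n$ IFO calls, the same as the cold initialization in Algorithm 1, but yields a warm start for free.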
Figure 1 shows the results of our experiments on two standard datasets, 'rcv1' and 'realsim' (the datasets can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html). The plots compare the criteria mentioned above against the number of IFO calls made by the corresponding algorithm. For the objective function, we plot the difference between the objective value $f(x^t)$ and the best objective value obtained by running GradDescent for a very large number of iterations from multiple initializations (denoted by $\hat{f}$). As shown in the figure, Reg-Saga converges much faster than Sgd in terms of objective value. Furthermore, as supported by the theory, the stationarity gap for Reg-Saga is much smaller than for Sgd. Also, in our experiments, the selection of the step size was much easier for Reg-Saga than for Sgd. Overall, the empirical results for nonconvex regularized problems are promising. It will be interesting to apply the approach to other smooth nonconvex problems.
In this paper, we investigated a fast incremental method (Saga) for nonconvex optimization. Our main contribution is to show that Saga can provably perform better than both Sgd and GradDescent in the context of nonconvex optimization. We also showed that with additional assumptions on the function $f$ in (1), such as gradient dominance, Saga converges linearly to the global minimum, as opposed to the sublinear convergence of Sgd. Furthermore, for large-scale parallel settings, we proposed a minibatch variant of Saga with stronger theoretical convergence rates than Sgd, attaining linear speedups in the size of the minibatches. One of the biggest advantages of Saga is its ability to use a fixed step size. This property is important in practice, since the selection of the step size (learning rate) for Sgd is typically difficult and is one of its biggest drawbacks.
- Agarwal & Bottou (2014) Agarwal, Alekh and Bottou, Leon. A lower bound for the optimization of finite sums. arXiv:1410.0723, 2014.
- Agarwal & Duchi (2011) Agarwal, Alekh and Duchi, John C. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pp. 873–881, 2011.
- Antoniadis et al. (2009) Antoniadis, Anestis, Gijbels, Irène, and Nikolova, Mila. Penalized likelihood regression for generalized linear models with non-quadratic penalties. Annals of the Institute of Statistical Mathematics, 63(3):585–615, June 2009.
- Bertsekas & Tsitsiklis (1989) Bertsekas, D. and Tsitsiklis, J. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.
- Bertsekas (2011) Bertsekas, Dimitri P. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. In S. Sra, S. Nowozin, S. Wright (ed.), Optimization for Machine Learning. MIT Press, 2011.
- Defazio et al. (2014) Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS 27, pp. 1646–1654, 2014.
- Deng & Yu (2014) Deng, Li and Yu, Dong. Deep learning: Methods and applications. Foundations and Trends Signal Processing, 7:197–387, 2014.
- Ghadimi & Lan (2013) Ghadimi, Saeed and Lan, Guanghui. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013. doi: 10.1137/120880811.
- Hong (2014) Hong, Mingyi. A distributed, asynchronous and incremental algorithm for nonconvex optimization: An admm based approach. arXiv preprint arXiv:1412.6058, 2014.
- James et al. (2013) James, G., Witten, D., Hastie, T., and Tibshirani, R. An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics. Springer, 2013.
- Johnson & Zhang (2013) Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS 26, pp. 315–323, 2013.
- Kushner & Clark (2012) Kushner, Harold Joseph and Clark, Dean S. Stochastic approximation methods for constrained and unconstrained systems, volume 26. Springer Science & Business Media, 2012.
- Nemirovski et al. (2009) Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
- Nesterov (2003) Nesterov, Yurii. Introductory Lectures On Convex Optimization: A Basic Course. Springer, 2003.
- Nesterov & Polyak (2006) Nesterov, Yurii and Polyak, Boris T. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
- Polyak (1963) Polyak, B.T. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864–878, January 1963.
- Recht et al. (2011) Recht, Benjamin, Re, Christopher, Wright, Stephen, and Niu, Feng. Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In NIPS 24, pp. 693–701, 2011.
- Reddi et al. (2015) Reddi, Sashank, Hefny, Ahmed, Sra, Suvrit, Poczos, Barnabas, and Smola, Alex J. On variance reduction in stochastic gradient descent and its asynchronous variants. In NIPS 28, pp. 2629–2637, 2015.
- Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
- Schmidt et al. (2013) Schmidt, Mark W., Roux, Nicolas Le, and Bach, Francis R. Minimizing Finite Sums with the Stochastic Average Gradient. arXiv:1309.2388, 2013.
- Shalev-Shwartz (2015) Shalev-Shwartz, Shai. SDCA without duality. CoRR, abs/1502.06177, 2015.
- Shalev-Shwartz & Zhang (2013) Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.
- Shamir (2014) Shamir, Ohad. A stochastic PCA and SVD algorithm with an exponential convergence rate. arXiv:1409.2848, 2014.
- Sra (2012) Sra, Suvrit. Scalable nonconvex inexact proximal splitting. In NIPS, pp. 530–538, 2012.
9 Proof of Lemma 1
Since $f$ is $L$-smooth, from Lemma 3, we have
$$\mathbb{E}[f(x^{t+1})] \le \mathbb{E}\Big[f(x^t) + \langle \nabla f(x^t),\, x^{t+1} - x^t\rangle + \frac{L}{2}\|x^{t+1} - x^t\|^2\Big].$$
We first note that the update $v^t$ in Algorithm 1 is unbiased, i.e., $\mathbb{E}[v^t] = \nabla f(x^t)$. By using this property of the update on the right-hand side of the inequality above, we get the following:
$$\mathbb{E}[f(x^{t+1})] \le \mathbb{E}\Big[f(x^t) - \eta\|\nabla f(x^t)\|^2 + \frac{L\eta^2}{2}\|v^t\|^2\Big]. \qquad (13)$$
Here we used the fact that $x^{t+1} = x^t - \eta v^t$ (see Algorithm 1). Consider now the Lyapunov function
$$R^t = \mathbb{E}\Big[f(x^t) + \frac{c_t}{n}\sum_{i=1}^{n}\|x^t - \alpha_i^t\|^2\Big].$$
For bounding $R^{t+1}$ we need the following:
$$\mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^{n}\|x^{t+1}-\alpha_i^{t+1}\|^2\Big] = \frac{1}{n}\,\mathbb{E}\big[\|x^{t+1}-x^t\|^2\big] + \Big(1-\frac{1}{n}\Big)\,\mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^{n}\|x^{t+1}-\alpha_i^{t}\|^2\Big]$$
$$\le \mathbb{E}\big[\|x^{t+1}-x^t\|^2\big] + \Big(1-\frac{1}{n}\Big)\,\mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^{n}\Big((1+\eta\beta)\|x^t-\alpha_i^t\|^2 + \frac{\eta}{\beta}\|\nabla f(x^t)\|^2\Big)\Big]. \qquad (15)$$
The second equality again follows from the unbiasedness of the Saga update. The last inequality follows from a simple application of the Cauchy-Schwarz and Young inequalities. Plugging (13) and (15) into $R^{t+1}$, we obtain the following bound:
$$R^{t+1} \le R^t - \Gamma_t\, \mathbb{E}\big[\|\nabla f(x^t)\|^2\big].$$
The second inequality used here follows from the definitions of $c_t$ and $\Gamma_t$ specified in the statement of the lemma, thus concluding the proof. ∎
10 Other Lemmas
The following lemma provides a bound on the variance of the update $v^t$ used in the Saga algorithm; more specifically, it bounds the quantity $\mathbb{E}[\|v^t\|^2]$. A more general result bounding the variance of the minibatch scheme in Algorithm 3 can be proved along similar lines.
Let $v^t$ be computed by Algorithm 1. Then,
$$\mathbb{E}\big[\|v^t\|^2\big] \le 2\,\mathbb{E}\big[\|\nabla f(x^t)\|^2\big] + \frac{2L^2}{n}\sum_{i=1}^{n}\mathbb{E}\big[\|x^t-\alpha_i^t\|^2\big].$$
For ease of exposition, we use the notation
$$\zeta^t := \nabla f_{i_t}(x^t) - \nabla f_{i_t}(\alpha_{i_t}^t).$$
Using the convexity of $\|\cdot\|^2$ and the definition of $v^t$, we get
The first inequality follows from the fact that $\mathbb{E}[\zeta^t] = \nabla f(x^t) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\alpha_i^t)$ and that $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$. The second inequality is obtained by noting that for a random variable $\zeta$, $\mathbb{E}\big[\|\zeta - \mathbb{E}[\zeta]\|^2\big] \le \mathbb{E}\big[\|\zeta\|^2\big]$. Using Jensen's inequality in the inequality above, we get
The last inequality follows from the $L$-smoothness of $f_{i_t}$, thus concluding the proof. ∎
The following result provides a bound on the function value of functions with Lipschitz continuous gradients.
Suppose the function $f$ is $L$-smooth; then the following holds:
$$f(y) \le f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{L}{2}\|y - x\|^2$$
for all $x, y \in \mathbb{R}^d$.
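For completeness, a short derivation of this standard bound (our addition) via the fundamental theorem of calculus and the Lipschitz property of $\nabla f$:

```latex
f(y) - f(x) - \langle \nabla f(x),\, y - x\rangle
  = \int_0^1 \langle \nabla f\big(x + t(y-x)\big) - \nabla f(x),\; y - x\rangle\, dt
  \le \int_0^1 L\, t\, \|y - x\|^2\, dt
  = \frac{L}{2}\|y - x\|^2 .
```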