Momentum methods have a long history dating back to the 1960s. [Polyak1964] proposed the heavy-ball (HB) method, which uses the previous two iterates when computing the next one. The original motivation of momentum methods was to speed up convergence in convex optimization. For a twice continuously differentiable, strongly convex and smooth objective function, Polyak’s analysis yields an accelerated linear convergence rate over the standard gradient method. In 1983, [Nesterov1983] proposed an accelerated gradient (NAG) method, which is also regarded as a momentum method and achieves the optimal $O(1/t^2)$ convergence rate for convex smooth optimization, where $t$ is the number of iterations. This is a clear advantage over the standard gradient method, which converges at rate $O(1/t)$ on the same problem. NAG was later shown to have an accelerated linear convergence rate for smooth and strongly convex optimization problems [Nesterov2004]. Both the HB method and the NAG method use a momentum term when updating the solution, i.e., the difference between the current iterate and the previous iterate. Therefore, both methods have been referred to as momentum methods in the literature.
Due to the recent surge of interest in deep learning, the stochastic variants of the HB and NAG methods (SHB and SNAG) have been employed broadly in optimizing deep neural networks [Krizhevsky et al.2012, Hinton et al.2012]. [Sutskever et al.2013] are probably the first to study SNAG and compare it with SHB for optimizing deep neural networks. Although they report some interesting findings about these two methods in deep learning (e.g., a distinct improvement in the performance of SNAG is usually observed in their experiments), their analysis and arguments are mostly restricted to convex problems. Moreover, some questions remain unanswered, e.g., (i) Do SHB and SNAG enjoy faster convergence rates than the stochastic gradient (SG) method for deep learning (non-convex optimization problems), as in the convex and deterministic setting? (ii) If not, what is the advantage of these two methods over SG? (iii) Why does SNAG often yield an improvement over SHB?
In this paper, we propose and analyze a unified framework for stochastic momentum methods and the stochastic gradient method, with the aim of answering the above questions and providing more insight into them. We summarize our results and contributions as follows:
- We propose a unified stochastic momentum framework parameterized by a free scalar parameter. The framework reduces to SHB, SNAG and SG by setting three different values of the free parameter.
- We present a unified convergence analysis of the gradient norm of the training objective for these stochastic methods in non-convex optimization, revealing the same rate of convergence for the three variants.
- We analyze the generalization error of the unified framework through the uniform stability approach. The result exhibits a clear advantage of stochastic momentum methods, i.e., adding momentum helps generalization.
- Our empirical results for learning deep neural networks complete the unified view and analysis by showing that (i) there is no clear advantage of SHB and SNAG over SG in the convergence speed of the training error; (ii) the advantage of SHB and SNAG lies in better generalization due to greater stability; (iii) SNAG usually achieves the best tradeoff between the speed of convergence in training error and the stability of the testing error among the three stochastic methods.
2 More Related Work
There is a large body of analysis on momentum methods for deterministic optimization. Nesterov pioneered the work on accelerated gradient methods for smooth convex optimization [Nesterov2004]. The convergence analysis of HB has recently been extended to smooth functions for both convex [Ghadimi et al.2014, Ochs et al.2015] and non-convex deterministic optimization [Ochs et al.2014, Ochs2016]. With the rising popularity of deep neural networks, the stochastic variants of HB and NAG have been employed widely for training neural networks, leading to tremendous success on many problems in computer vision and speech recognition [Krizhevsky et al.2012, Hinton et al.2012, Sutskever et al.2013]. However, these stochastic variants remain under-explored in non-convex optimization.
It is worth mentioning that two recent works have established convergence results for the SG method [Ghadimi and Lan2013] and for the stochastic version of a different variant of the accelerated gradient method for non-convex optimization [Ghadimi and Lan2016]. However, the variant of the accelerated gradient method in [Ghadimi and Lan2016] is hard to interpret within the framework of momentum methods and is not widely employed for optimizing deep neural networks. Moreover, their analysis is not applicable to the SHB method. Hence, from a theoretical standpoint, it is still interesting to analyze the stochastic variants of Nesterov’s accelerated gradient method and the HB method for stochastic non-convex optimization, which are extensively employed for learning deep neural networks. Our unified analysis shows that they enjoy the same order of convergence rate as the SG method, which coincides with the results in [Ghadimi and Lan2013, Ghadimi and Lan2016].
On the other hand, few studies analyze the statistical properties (e.g., the generalization error) of the model learned by the SG method or by stochastic momentum methods for minimizing the empirical risk. Conventional studies of the statistical properties of the SG method focus on one-pass learning, i.e., the training examples are passed over only once [Cesa-Bianchi et al.2004]. Recently, several works have emerged that aim to establish the statistical properties of multiple-pass SG methods in machine learning [Lin and Rosasco2016, Hardt et al.2016]. The latter work, which is closely related to the present one, established the generalization error of the multiple-pass SG method for both convex and non-convex learning problems by analyzing its uniform stability. Nevertheless, it remains an open problem from a theoretical standpoint how the momentum term helps improve generalization, though it has been observed to yield better performance in practice for deep learning [Sutskever et al.2013]. Our unified analysis of the uniform stability of the SG method and stochastic momentum methods explicitly exhibits the advantage of stochastic momentum methods in terms of the generalization error, hence providing theoretical support for this common wisdom.
In the remainder of the paper, we first review the HB and NAG methods and present their stochastic variants. Then we present a unified view of these momentum methods. Next, we present the convergence and generalization analysis for stochastic momentum methods. In addition, we present empirical results comparing the different methods for optimizing deep neural networks. Finally, we conclude this work.
3 Momentum Methods And Their Stochastic Variants
3.1 Notations and Setup
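Let us consider a general learning setting, with deep learning as a special case. Given a set of training examples $\mathcal{S} = (z_1, \ldots, z_n)$ sampled from an unknown distribution $\mathcal{D}$, the goal of learning is to find a model $x$ that minimizes the population risk, i.e.,
$$\min_{x \in \Omega} F(x) := \mathbb{E}_{z \sim \mathcal{D}}\left[\ell(x, z)\right], \quad (1)$$
where $\ell$ is a loss function, $\ell(x, z)$ denotes the loss of the model $x$ on the example $z$, and $\Omega$ denotes the hypothesis class of the model. Since we cannot compute $F(x)$ because the distribution $\mathcal{D}$ is unknown, one usually learns a model by minimizing the empirical risk, i.e.,
$$\min_{x \in \Omega} f(x) := \frac{1}{n} \sum_{i=1}^{n} \ell(x, z_i). \quad (2)$$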
Two concerns usually arise in the above empirical risk minimization approach. First, how fast does the optimization algorithm solve Problem (2)? This is usually measured by the speed of convergence to the optimal solution. However, it is NP-hard to find the global optimal solution of a general non-convex optimization problem [Hillar and Lim2013]. As in many previous works [Ghadimi and Lan2013, Ghadimi and Lan2016, Reddi et al.2016, Zhu and Hazan2016], we study the convergence rate of an iterative algorithm to a critical point, i.e., a point $x$ at which the gradient of the empirical risk $f$ vanishes, $\nabla f(x) = 0$.
Second, how well does the model learned by solving Problem (2) generalize to different data? This is usually measured by the population risk defined in (1). Since the model is learned from random samples, with additional randomness in the optimization algorithm itself, the expected population risk is also used for the analysis, with the expectation taken over the randomness in the samples and in the algorithm. One way to assess the expected population risk is the generalization error, i.e., the expected difference between the population risk $F$ and the empirical risk $f$ at the learned model $\hat{x}$,
$$\epsilon_{\mathrm{gen}} := \mathbb{E}\left[F(\hat{x}) - f(\hat{x})\right]. \quad (3)$$
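We use $\nabla h(x)$ to denote the gradient of a smooth function $h$. A function $h$ is smooth iff there exists $L > 0$ such that
$$\|\nabla h(x) - \nabla h(y)\| \le L \|x - y\|, \quad \forall x, y \in \Omega, \quad (4)$$
where $\|\cdot\|$ denotes the Euclidean norm. Note that the above inequality does not imply convexity. Throughout the paper, we assume that $\ell(x, z)$ is a $G$-Lipschitz continuous and $L$-smooth non-convex function in $x$, and assume that $\Omega = \mathbb{R}^d$. It follows that $f(x)$ is $G$-Lipschitz continuous and $L$-smooth.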
3.2 Stochastic Momentum Methods
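We denote by $\mathcal{G}(x; \xi)$ a stochastic gradient of $f(x)$ at $x$ depending on a random variable $\xi$ such that $\mathbb{E}[\mathcal{G}(x; \xi)] = \nabla f(x)$. In the context of the empirical risk minimization (2), $\mathcal{G}(x; \xi) = \nabla \ell(x, z_i)$, where $i$ is a random index sampled from $\{1, \ldots, n\}$.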
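As a small illustration of the unbiasedness requirement, the sketch below checks numerically that averaging per-example gradients over a uniformly sampled index recovers the gradient of the empirical risk. The quadratic per-example loss, the data values and all function names are hypothetical choices made only for this example; they are not taken from the paper.

```python
# Hypothetical per-example loss l(x, z) = 0.5 * (a * x - b) ** 2 with z = (a, b).
def loss(x, z):
    a, b = z
    return 0.5 * (a * x - b) ** 2

def loss_grad(x, z):
    a, b = z
    return a * (a * x - b)

def empirical_risk(x, data):
    """f(x) = (1/n) * sum_i l(x, z_i)."""
    return sum(loss(x, z) for z in data) / len(data)

def expected_stochastic_grad(x, data):
    """Exact expectation of the stochastic gradient when the index i is uniform on the data."""
    prob = 1.0 / len(data)
    return sum(prob * loss_grad(x, z) for z in data)

def numeric_grad(f, x, h=1e-5):
    """Central finite difference, used as an independent reference for the full gradient."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

data = [(1.0, 2.0), (2.0, -1.0), (0.5, 3.0), (-1.5, 0.5)]
x = 0.7
reference = numeric_grad(lambda u: empirical_risk(u, data), x)
expectation = expected_stochastic_grad(x, data)
print(abs(reference - expectation))  # small, up to finite-difference rounding
```

A single sampled gradient can be far from the full gradient; only its expectation matches, which is all the assumption above asks for.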
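There are two variants of momentum methods for solving (2), i.e., HB and NAG. HB was originally proposed for optimizing a smooth and strongly convex objective function. Based on HB, the update of stochastic HB (SHB) is given below for $k = 0, 1, \ldots$:
$$x_{k+1} = x_k - \alpha \mathcal{G}(x_k; \xi_k) + \beta (x_k - x_{k-1}), \quad (5)$$
with $x_{-1} = x_0$, where $\beta \in [0, 1)$ is the momentum constant and $\alpha$ is the step size. Equivalently, the above update can be implemented by the following two steps for $k = 0, 1, \ldots$:
$$v_{k+1} = \beta v_k - \alpha \mathcal{G}(x_k; \xi_k), \qquad x_{k+1} = x_k + v_{k+1}, \qquad v_0 = 0. \quad (6)$$
Based on NAG [Nesterov2004], the update of stochastic NAG (SNAG) consists of the two steps below for $k = 0, 1, \ldots$:
$$y_{k+1} = x_k - \alpha \mathcal{G}(x_k; \xi_k), \qquad x_{k+1} = y_{k+1} + \beta (y_{k+1} - y_k), \quad (7)$$
with $y_0 = x_0$. By introducing $v_k = y_k - y_{k-1}$ with $v_0 = 0$, the above update can be equivalently written as
$$v_{k+1} = \beta v_k - \alpha \mathcal{G}(y_k + \beta v_k; \xi_k), \qquad y_{k+1} = y_k + v_{k+1}. \quad (8)$$
Finally, the traditional view of SG can be written as
$$x_{k+1} = x_k - \alpha \mathcal{G}(x_k; \xi_k). \quad (9)$$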
By comparing (8) to (6), one might argue that the difference between HB and NAG lies in the point at which the gradient is evaluated [Sutskever et al.2013]. We will present a different unified view of the three methods that allows us to analyze them in a unified framework. The convergence of HB and NAG has been established for convex optimization [Polyak1964, Nesterov1983, Nesterov2004, Ghadimi et al.2014, Ochs et al.2015].
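To make the updates of this subsection concrete, the following sketch implements SHB, SNAG and SG in their standard textbook forms and checks that each momentum method's iterate form agrees with its velocity form. The forms assumed are: SHB as x_{k+1} = x_k - alpha*g_k + beta*(x_k - x_{k-1}), and SNAG as y_{k+1} = x_k - alpha*g_k followed by x_{k+1} = y_{k+1} + beta*(y_{k+1} - y_k). The one-dimensional non-convex objective, the noise model and all function names are hypothetical choices for illustration.

```python
import math
import random

def stoch_grad(x, noise):
    # Gradient of the hypothetical non-convex f(x) = 0.5*x**2 + cos(2*x), plus additive noise.
    return x - 2.0 * math.sin(2.0 * x) + noise

def shb(x0, alpha, beta, noises):
    """SHB, iterate form, with x_{-1} = x_0."""
    x_prev, x, xs = x0, x0, [x0]
    for noise in noises:
        x_next = x - alpha * stoch_grad(x, noise) + beta * (x - x_prev)
        x_prev, x = x, x_next
        xs.append(x)
    return xs

def shb_velocity(x0, alpha, beta, noises):
    """SHB, velocity form, with v_0 = 0."""
    x, v, xs = x0, 0.0, [x0]
    for noise in noises:
        v = beta * v - alpha * stoch_grad(x, noise)
        x = x + v
        xs.append(x)
    return xs

def snag(x0, alpha, beta, noises):
    """SNAG, two-step form, with y_0 = x_0; returns the y iterates."""
    x, y, ys = x0, x0, [x0]
    for noise in noises:
        y_next = x - alpha * stoch_grad(x, noise)
        x = y_next + beta * (y_next - y)
        y = y_next
        ys.append(y)
    return ys

def snag_velocity(x0, alpha, beta, noises):
    """SNAG, velocity form: the gradient is evaluated at the look-ahead point y + beta*v."""
    y, v, ys = x0, 0.0, [x0]
    for noise in noises:
        v = beta * v - alpha * stoch_grad(y + beta * v, noise)
        y = y + v
        ys.append(y)
    return ys

def sg(x0, alpha, noises):
    x, xs = x0, [x0]
    for noise in noises:
        x = x - alpha * stoch_grad(x, noise)
        xs.append(x)
    return xs

def max_gap(a, b):
    return max(abs(u - v) for u, v in zip(a, b))

rng = random.Random(0)
noises = [rng.gauss(0.0, 0.1) for _ in range(40)]  # shared noise so the runs are comparable
x0, alpha, beta = 2.0, 0.01, 0.9
print(max_gap(shb(x0, alpha, beta, noises), shb_velocity(x0, alpha, beta, noises)))
print(max_gap(snag(x0, alpha, beta, noises), snag_velocity(x0, alpha, beta, noises)))
```

The only difference between the two velocity forms is the argument of the gradient (the current point for SHB, the look-ahead point for SNAG), which is the reading attributed above to [Sutskever et al.2013].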
4 A Unified View of Stochastic Momentum Methods
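In this section, we present a unified view of the two (stochastic) momentum methods and the (stochastic) gradient method. We first present the unified framework and then show that HB, NAG and the gradient method are special cases of it. Denote by $\mathcal{G}(x_k)$ either a gradient or a stochastic gradient of $f(x)$ at $x_k$.
Let $\alpha > 0$, $\beta \in [0, 1)$, and $s \ge 0$. The updates of the stochastic unified momentum (SUM) method are given by
$$y_{k+1} = x_k - \alpha \mathcal{G}(x_k), \qquad y^s_{k+1} = x_k - s \alpha \mathcal{G}(x_k), \qquad x_{k+1} = y_{k+1} + \beta (y^s_{k+1} - y^s_k), \quad (10)$$
for $k \ge 0$ with $y^s_0 = x_0$. It is notable that in the update of $x_{k+1}$, a momentum term is constructed based on the auxiliary sequence $\{y^s_k\}$, whose update is parameterized by $s$. The following proposition indicates that SUM reduces to the three special cases of interest by setting different values of $s$.
Proposition 1. SUM (10) reduces to (i) SHB when $s = 0$, (ii) SNAG when $s = 1$, and (iii) SG with step size $\alpha/(1-\beta)$ when $s = 1/(1-\beta)$.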
From the above result, we can see that SHB, SNAG and SG are three variants of SUM. Moreover, the SUM view of SG implies that SG can have a larger “effective” step size (i.e., $\alpha/(1-\beta)$) in front of the gradient than SHB and SNAG. This is an important observation about SG, since setting a smaller effective step size for SG (e.g., the same as that in SNAG) yields much worse convergence of the training error, as observed in experiments.
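The reduction of SUM to its three special cases can be checked numerically. The sketch below assumes the SUM update takes the form y_{k+1} = x_k - alpha*G(x_k), y^s_{k+1} = x_k - s*alpha*G(x_k), x_{k+1} = y_{k+1} + beta*(y^s_{k+1} - y^s_k) with y^s_0 = x_0, and compares it against direct implementations of SHB (s = 0), SNAG (s = 1) and SG with step size alpha/(1-beta) (s = 1/(1-beta)). The toy objective, noise sequence and function names are hypothetical.

```python
import math
import random

def stoch_grad(x, noise):
    # Hypothetical non-convex objective f(x) = 0.5*x**2 + cos(2*x) with additive gradient noise.
    return x - 2.0 * math.sin(2.0 * x) + noise

def sum_method(x0, alpha, beta, s, noises):
    """Assumed SUM update; ys_prev holds the auxiliary iterate y^s_k, initialised to x_0."""
    x, ys_prev, xs = x0, x0, [x0]
    for noise in noises:
        g = stoch_grad(x, noise)
        y = x - alpha * g
        ys = x - s * alpha * g
        x = y + beta * (ys - ys_prev)
        ys_prev = ys
        xs.append(x)
    return xs

def shb(x0, alpha, beta, noises):
    x_prev, x, xs = x0, x0, [x0]
    for noise in noises:
        x_next = x - alpha * stoch_grad(x, noise) + beta * (x - x_prev)
        x_prev, x = x, x_next
        xs.append(x)
    return xs

def snag(x0, alpha, beta, noises):
    """Returns the x iterates, i.e. the points at which gradients are evaluated."""
    x, y, xs = x0, x0, [x0]
    for noise in noises:
        y_next = x - alpha * stoch_grad(x, noise)
        x = y_next + beta * (y_next - y)
        y = y_next
        xs.append(x)
    return xs

def sg(x0, step, noises):
    x, xs = x0, [x0]
    for noise in noises:
        x = x - step * stoch_grad(x, noise)
        xs.append(x)
    return xs

def max_gap(a, b):
    return max(abs(u - v) for u, v in zip(a, b))

rng = random.Random(1)
noises = [rng.gauss(0.0, 0.1) for _ in range(40)]
x0, alpha, beta = 2.0, 0.01, 0.9
gap_shb = max_gap(sum_method(x0, alpha, beta, 0.0, noises), shb(x0, alpha, beta, noises))
gap_snag = max_gap(sum_method(x0, alpha, beta, 1.0, noises), snag(x0, alpha, beta, noises))
gap_sg = max_gap(sum_method(x0, alpha, beta, 1.0 / (1.0 - beta), noises),
                 sg(x0, alpha / (1.0 - beta), noises))
print(gap_shb, gap_snag, gap_sg)  # all at floating-point rounding level
```

Note that the SG comparison uses the step size alpha/(1-beta), not alpha: this is the larger effective step size discussed above.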
Then for any , we have
Remark: We note that a similar recursion in (13) with and has been observed and employed to [Ghadimi et al.2014] for deterministic convex optimization. However, the recursion in (14) for (i.e., ) is a key to our convergence analysis for non-convex optimization and importantly the generalization to any allows us to analyze SHB, SNAG and SG in a unified framework.
Finally, we present a lemma stating the cumulative effect of updates for each iterate, which will be useful for our generalization error analysis.
Given the update in (10), for any we have
Remark: The above cumulative update reduces to the following three cases for SHB ($s = 0$), SNAG ($s = 1$) and SG ($s = 1/(1-\beta)$).
From the above cumulative update, we can see that SHB and SNAG have smaller step size for each stochastic gradient. This is the main reason that SHB and SNAG are more stable than SG, and hence yield a solution with better generalization performance. We will present a more formal analysis of generalization error later.
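One way to see the smaller per-gradient step size numerically is to feed a single unit gradient into the recursion and record how far the iterate eventually moves. The sketch below does this under the assumption that the SUM update has the form y_{k+1} = x_k - alpha*G(x_k), y^s_{k+1} = x_k - s*alpha*G(x_k), x_{k+1} = y_{k+1} + beta*(y^s_{k+1} - y^s_k). Because that update is linear in the gradients, the displacement caused by a unit impulse at step 0 equals the cumulative weight placed on that gradient. The function name and parameter values are hypothetical, and the closed forms mentioned after the block are our own derivation from the assumed update rather than a quotation of the lemma.

```python
def impulse_weights(alpha, beta, s, steps):
    """Cumulative weight on the first gradient after 1, 2, ..., `steps` SUM updates."""
    x, ys_prev = 0.0, 0.0          # start at x_0 = 0 with y^s_0 = x_0
    weights = []
    for k in range(steps):
        g = 1.0 if k == 0 else 0.0  # unit impulse: only the first gradient is non-zero
        y = x - alpha * g
        ys = x - s * alpha * g
        x = y + beta * (ys - ys_prev)
        ys_prev = ys
        weights.append(-x)          # x_0 - x_{k+1} is the weight carried by the impulse
    return weights

alpha, beta, steps = 0.1, 0.9, 20
w_shb = impulse_weights(alpha, beta, 0.0, steps)
w_snag = impulse_weights(alpha, beta, 1.0, steps)
w_sg = impulse_weights(alpha, beta, 1.0 / (1.0 - beta), steps)
for m in (0, 1, 5, steps - 1):
    print(m, w_shb[m], w_snag[m], w_sg[m])
```

Under the assumed update, the weight after m+1 updates works out to alpha*(1 - beta^(m+1))/(1-beta) for SHB, alpha*(1 - beta^(m+2))/(1-beta) for SNAG, and the constant alpha/(1-beta) for SG, so the SHB weight is the smallest at every step and all three approach alpha/(1-beta) in the limit.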
5 Convergence Analysis of SUM
In this section, we present the convergence results of the SUM methods for the empirical risk minimization (2). As mentioned before, for deep learning problems the loss function is non-convex, which makes finding the global optimal solution an NP-hard problem. Instead, as in many previous works, we present the convergence rates of SUM in terms of the norm of the gradient. We present the main results first and then sketch the analysis. Detailed proofs are deferred to the supplement due to space limitations.
5.1 Main results
(Convergence of SUM) Suppose is a non-convex and -smooth function, and for any . Let update (10) run for iterations with . By setting we have
We would like to make several remarks. (i) The assumption on the magnitude of the gradient and the variance of stochastic gradient can be simply replaced by the magnitude of the stochastic gradient, which are standard assumptions in the previous analysis of stochastic gradient method[Ghadimi and Lan2013]. (ii) This is the first time that the convergence rate of SHB for non-convex optimization is established. A similar convergence rate of SG and a different stochastic variant of accelerated gradient method has been established in [Ghadimi and Lan2013] and [Ghadimi and Lan2016], respectively under similar assumptions. (iii) The unified convergence makes it clear that the difference of the convergence bounds for different variants of SUM lies at the term , which is equal to , and for SHB, SNAG and SG, respectively. (iv) The step size of different variants of SUM used in the analysis of Theorem 1 is the same value.
The above result shows that the convergence upper bounds of the three methods are of the same order for the squared norm of the gradient. In addition, when the momentum term is large, the part of the bound that depends on $s$ becomes marginal compared with the remaining term in the convergence bound. This reveals that SHB and SNAG have no advantage over SG in terms of empirical risk minimization.
Below, we present a result that uses different step sizes for different variants of SUM in the analysis, which sheds more light on the differences between the methods.
(Convergence of SUM) Suppose is a non-convex and -smooth function, and for any . Let update (10) run for iterations with . By setting we have
Remark: The above result allows us to possibly set a larger initial step size for SG (where $s = 1/(1-\beta)$) and SNAG (where $s = 1$) than for SHB (where $s = 0$). Our empirical studies for deep learning also confirm this point.
5.2 Generalization Error Analysis of SUM
In this section, we provide a unified analysis of the generalization error of the solution returned by SUM after a finite number of iterations. By employing the unified analysis, we are able to analyze the effect of the scalar $s$ on the generalization error. Our analysis is inspired by [Hardt et al.2016], which leverages the uniform stability of a randomized algorithm [Bousquet and Elisseeff2002] to bound the generalization error of the multiple-pass SG method. To this end, we first introduce uniform stability and its connection with the generalization error.
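Let $\mathcal{A}$ denote a randomized algorithm that generates a model $\mathcal{A}(\mathcal{S})$ from the set of training samples $\mathcal{S}$ of size $n$. Uniform stability measures how much the prediction of the learned model on any sample can change if one example in $\mathcal{S}$ is replaced by a different one. In particular, let $\mathcal{S}'$ denote a set of training examples that differs from $\mathcal{S}$ in one example. The algorithm $\mathcal{A}$ is said to be $\epsilon$-uniformly stable if, for all such $\mathcal{S}$ and $\mathcal{S}'$,
$$\sup_{z} \mathbb{E}_{\mathcal{A}}\left[\ell(\mathcal{A}(\mathcal{S}), z) - \ell(\mathcal{A}(\mathcal{S}'), z)\right] \le \epsilon.$$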
The following proposition states that the generalization error of is bounded by the uniform stability of .
(Theorem 2.2 in [Hardt et al.2016]) For a randomized algorithm ,
The above proposition allows us to use the uniform stability of a randomized algorithm as a proxy for its generalization error. Below, we show that SHB and SNAG are more uniformly stable than SG, which indicates that SHB and SNAG have potentially smaller generalization error than SG.
To proceed, we assume that loss function is -Lipschitz continuous, then and . To analyze the uniform stability of SUM, we will assume that there are two instances of SUM starting from the same initial solution, with one running on and the other one running on , where and differs only at one example. Let denote the -th iterate of the first instance and denote the -th iterate of the second instance. Below, we establish a result showing how grows based on the unified framework in Lemma 2.
Assume that for any and and is -smooth w.r.t . For two data sets that differs at one example, let and denote the -th iterates of running SUM for the empirical risk minimization on and , we have
with , where .
Remark: From the above result, we can easily analyze how the value of affects the growth of that implies the growth of the generalization error of . The values of for the three variants (i.e., SHB, SNAG and SG) are given by , and , respectively. It is obvious that . As a result, of SG grows faster than that of SNAG, and then followed by of SHB. Since the generalization error of is bounded by up to a constant, we can conclude that by running the same number of iterations, the generalization error of the model returned by SHB and SNAG is potentially smaller than that of SG.
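The two-instance construction used in this analysis can be mimicked numerically: run the same method on two data sets that differ in one example, with the same initial point and the same sequence of sampled indices, and track the distance between the two iterates. The sketch below is only a toy probe under stated assumptions: the SUM update is assumed to be y_{k+1} = x_k - alpha*G(x_k), y^s_{k+1} = x_k - s*alpha*G(x_k), x_{k+1} = y_{k+1} + beta*(y^s_{k+1} - y^s_k), and the scalar model, the sigmoid-based non-convex loss, the synthetic data and all names are hypothetical.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss_grad(x, z):
    # Gradient in x of the hypothetical non-convex loss (sigmoid(a*x) - b)**2, with z = (a, b).
    a, b = z
    p = sigmoid(a * x)
    return 2.0 * (p - b) * p * (1.0 - p) * a

def sum_iterates(data, indices, x0, alpha, beta, s):
    """Run the assumed SUM update on `data`, visiting examples in the order given by `indices`."""
    x, ys_prev, xs = x0, x0, [x0]
    for i in indices:
        g = loss_grad(x, data[i])
        y = x - alpha * g
        ys = x - s * alpha * g
        x = y + beta * (ys - ys_prev)
        ys_prev = ys
        xs.append(x)
    return xs

def iterate_gaps(data, data_prime, indices, x0, alpha, beta, s):
    """Distance between the k-th iterates of two runs sharing the start point and the indices."""
    run = sum_iterates(data, indices, x0, alpha, beta, s)
    run_prime = sum_iterates(data_prime, indices, x0, alpha, beta, s)
    return [abs(u - v) for u, v in zip(run, run_prime)]

rng = random.Random(0)
n = 20
data = [(rng.uniform(-2.0, 2.0), float(rng.random() < 0.5)) for _ in range(n)]
data_prime = list(data)
data_prime[0] = (1.5, 1.0 - data[0][1])  # flip one label so the sets differ in exactly one example
indices = [rng.randrange(n) for _ in range(200)]
alpha, beta = 0.05, 0.9
for name, s in [("SHB", 0.0), ("SNAG", 1.0), ("SG", 1.0 / (1.0 - beta))]:
    gaps = iterate_gaps(data, data_prime, indices, 0.0, alpha, beta, s)
    print(name, max(gaps))
```

The gap stays at zero until the differing example is first sampled and can grow afterwards. A single run on a toy problem is not evidence for or against the bound; averaging over many index sequences would be needed before reading anything into the ordering of the three printed values.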
So far, we have analyzed the convergence rate for optimizing the empirical risk and the generalization error of the models learned by different variants of SUM, which provides answers to the questions raised at the beginning except the last one (why is SNAG observed in some studies to improve prediction performance over SHB [Sutskever et al.2013]?). Next, we show how our analysis can shed light on this question. In fact, the population risk of the learned model, which is usually assessed in practice by the testing error, can be decomposed into three parts, consisting of the optimization error, the generalization error, and an optimization-independent term, i.e.,
Here the optimization error is measured relative to the optimal solution of the empirical risk minimization problem. An informal analysis follows: our Theorem 2 implies that SNAG potentially converges faster than SHB in terms of the optimization error, while Proposition 3 implies that SHB has a potentially smaller generalization error. If, compared with SHB, the optimization error of SNAG decreases faster than its generalization error increases, then SNAG could yield a solution with a smaller population risk. However, a rigorous analysis of the optimization error is complicated by the non-convexity of the problem. In the next section, we present empirical results to corroborate and complete our theoretical analysis.
6 Empirical Studies
In this section, we present empirical results on the non-convex optimization of deep neural networks. We train a deep convolutional neural network (CNN) for classification on two benchmark datasets, i.e., CIFAR-10 and CIFAR-100. Both datasets contain 50,000 training images of size 32×32 from 10 classes (CIFAR-10) or 100 classes (CIFAR-100) and 10,000 testing images of the same size. The employed CNN consists of 3 convolutional layers and 2 fully-connected layers. Each convolutional layer is followed by a max pooling layer. The output of the last fully-connected layer is fed into a 10-class or 100-class softmax loss function. We emphasize that we do not intend to obtain state-of-the-art prediction performance by trying different network structures and engineering tricks, but instead focus on verifying the theoretical analysis. We compare the three variants of SUM, i.e., SHB, SNAG, and SG, which correspond to $s = 0$, $s = 1$ and $s = 1/(1-\beta)$ in (10). The momentum constant and the regularization parameter of the weights are fixed across all methods. We use a mini-batch to compute a stochastic gradient at each iteration. All three methods use the same initialization. We follow the procedure in [Krizhevsky et al.2012] to set the step size, i.e., we start with a relatively large step size and decrease it by a constant factor after a certain number of iterations, when the performance on the testing data is observed to saturate.
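The step-size procedure described above (start large, then shrink the step once the testing performance saturates) can be sketched as a simple plateau rule. This is a minimal illustration only: the function name, the decay factor, the patience window and the improvement threshold are all hypothetical defaults, not the values used in the experiments.

```python
def decayed_step(initial_step, test_errors, factor=10.0, patience=3, min_delta=1e-3):
    """Return the step size after scanning a history of test errors.

    The step is divided by `factor` each time the test error has failed to improve
    by at least `min_delta` for `patience` consecutive evaluations.
    """
    step = initial_step
    best = float("inf")
    stale = 0
    for err in test_errors:
        if err < best - min_delta:
            best = err       # meaningful improvement: reset the plateau counter
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                step /= factor
                stale = 0    # start counting a fresh plateau at the new step size
    return step

improving = [0.50, 0.45, 0.40, 0.36, 0.33]
plateau = [0.50, 0.40, 0.40, 0.40, 0.40]
print(decayed_step(0.1, improving))  # unchanged: the error keeps improving
print(decayed_step(0.1, plateau))    # reduced once after three stale evaluations
```

In practice the evaluations would be taken every fixed number of iterations, and the same rule would be applied to all compared methods so that they differ only in the value of the SUM parameter.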
Results on CIFAR-10. We first present the convergence results of the different methods, each with its best initial step size. In particular, we search over a range of initial step sizes for each method and select the one that yields the fastest convergence in training error. The best initial step size for SHB differs from the one shared by SNAG and SG. In fact, a larger initial step size for SHB gives a divergent result. The training and testing errors of the different methods versus the number of iterations are plotted in Figure 1. This result is consistent with our convergence result in Theorem 2.
Next, we plot the performance of the different methods with the same initial step size in Figure 2. We report the training error, the testing error and their absolute difference in Figure 2(a), 2(b) and 2(c), respectively. We use the absolute difference between the training and testing errors as an estimate of the generalization error. We can see that the convergence of the training error is very close for the three methods, which is consistent with our theoretical result in Theorem 1. Moreover, the behavior of the absolute difference between the training and testing errors is also consistent with the theoretical result in Proposition 3, i.e., SG has a larger generalization error than SNAG and SHB.
Results on CIFAR-100. We plot the training error, the testing error and their absolute difference for the three methods with the same initial step size in Figure 2(d), 2(e) and 2(f), respectively. We observe similar trends in the training error and the generalization error, i.e., the three methods have similar performance (convergence speed) in training error but exhibit different degrees of generalization error. The testing error curve shows that SNAG achieves the best prediction performance on the testing data.
Finally, we present a comparison of SUM with different values of $s$, including variants besides SHB, SNAG and SG. In particular, we compare SUM with several values of $s$ and the same initial step size. Note that $s = 0$ corresponds to SHB, $s = 1$ corresponds to SNAG, $s = 1/(1-\beta)$ corresponds to SG, and the remaining value corresponds to a new variant. The results on the CIFAR-100 data are shown in Figure 3. From the results, we can observe that the convergence of the training error is similar across the different variants. For the generalization error, we observe a clear trend: it decreases as $s$ moves from the SG setting toward the SHB setting.
We have developed a unified framework of stochastic momentum methods that subsumes SHB, SNAG and SG as special cases, all of which have been widely adopted in training deep neural networks. We also analyzed the convergence of training for non-convex optimization and the generalization error of learning with the unified stochastic momentum methods. The unified framework and analysis bring more insight into the differences between the methods and help explain experimental results for optimizing deep neural networks. In particular, the momentum term helps improve the generalization performance but does not help speed up the training process.
This work is partially supported by NSF-1545995, the Data to Decisions Cooperative Research Centre and ARC DP180100106. Most work of Y. Yan was done when he was visiting the University of Iowa.
- [Bousquet and Elisseeff2002] Olivier Bousquet and André Elisseeff. Stability and generalization. J. Mach. Learn. Res., 2:499–526, 2002.
- [Cesa-Bianchi et al.2004] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Trans. on Inf. Theory, 50(9), 2004.
- [Ghadimi and Lan2013] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- [Ghadimi and Lan2016] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program., 156(1-2):59–99, 2016.
- [Ghadimi et al.2014] Euhanna Ghadimi, Hamid Reza Feyzmahdavian, and Mikael Johansson. Global convergence of the heavy-ball method for convex optimization. CoRR, 2014.
- [Hardt et al.2016] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the International Conference on Machine Learning, 2016.
- [Hillar and Lim2013] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. J. ACM, 60(6):45:1–45:39, 2013.
- [Hinton et al.2012] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29:82–97, 2012.
- [Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- [Lin and Rosasco2016] Junhong Lin and Lorenzo Rosasco. Optimal learning for multi-pass stochastic gradient methods. In NIPS, pages 4556–4564, 2016.
- [Nesterov1983] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27:372–376, 1983.
- [Nesterov2004] Yurii Nesterov. Introductory lectures on convex optimization : a basic course. Applied optimization. Kluwer Academic Publ., 2004.
- [Ochs et al.2014] Peter Ochs, Yunjin Chen, Thomas Brox, and Thomas Pock. ipiano: Inertial proximal algorithm for nonconvex optimization. SIAM J. Imaging Sciences, 7(2):1388–1419, 2014.
- [Ochs et al.2015] Peter Ochs, Thomas Brox, and Thomas Pock. ipiasco: Inertial proximal algorithm for strongly convex optimization. Journal of Mathematical Imaging and Vision, 53(2):171–181, 2015.
- [Ochs2016] Peter Ochs. Local convergence of the heavy-ball method and ipiano for non-convex optimization. CoRR, abs/1606.09070, 2016.
- [Polyak1964] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4:791–803, 1964.
- [Reddi et al.2016] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alexander J. Smola. Stochastic variance reduction for nonconvex optimization. CoRR, abs/1603.06160, 2016.
- [Sutskever et al.2013] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 1139–1147, 2013.
- [Zhu and Hazan2016] Zeyuan Allen Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. CoRR, abs/1603.05643, 2016.
Appendix A Proof of Proposition 1
We re-present Proposition 1.
We prove the result by separately discussing three different values of $s$, i.e., $s = 0$, $s = 1$ and $s = 1/(1-\beta)$. We show that these three settings in fact correspond to SHB, SNAG and SG, respectively.
1. When , then , the update of becomes
3. The last special case corresponds to using for SG. We will show that the update of is equivalent to
Since , then , thus for any
which leads to (18).
Appendix B Proof of Lemma 1
We re-present Lemma 1.
Let us first write down the updates:
We can see that
If we define and , the above equation holds for any . Similarly, we can write as
for any .
Similarly, we have
which verifies (13).
Appendix C Proof of Lemma 2
We re-present Lemma 2.
Given the update in (10), for any we have
Start with the update of (10). We can rewrite the update by
Appendix D Proof of Theorem 1
Before proving Theorem 1, we present two key lemmas.
Let . For SUM, we have for any ,
Let . Then . Beginning by exploring the smoothness of we have
Taking expectation on both sides