Many optimization questions arising in machine learning can be cast as a finite sum optimization problem of the form $\min_{x} f(x)$, where $f(x) = \frac{1}{k}\sum_{i=1}^{k} f_i(x)$. Most neural network problems also fall under a similar structure, where each function $f_i$ is typically non-convex. A well-studied algorithm to solve such problems is Stochastic Gradient Descent (SGD), which uses updates of the form $x_{t+1} = x_t - \alpha \nabla f_{i_t}(x_t)$, where $\alpha$ is a step size and $f_{i_t}$ is a function chosen randomly from $\{f_1, \ldots, f_k\}$ at time $t$. Often in neural networks, “momentum” is added to the SGD update to yield a two-step update process given as $v_{t+1} = \mu v_t - \alpha \nabla f_{i_t}(x_t)$ followed by $x_{t+1} = x_t + v_{t+1}$. This algorithm is typically called the Heavy-Ball (HB) method (or sometimes classical momentum), with $\mu > 0$ called the momentum parameter [polyak1987introduction]. In the context of neural nets, another popular variant of SGD is Nesterov’s Accelerated Gradient (NAG), which can also be thought of as a momentum method [sutskever2013importance], and has updates of the form $v_{t+1} = \mu v_t - \alpha \nabla f_{i_t}(x_t + \mu v_t)$ followed by $x_{t+1} = x_t + v_{t+1}$ (see Algorithm 1 for more details).
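The three update rules described above (SGD, HB and NAG) can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the function names, the toy quadratic objective and the values of `alpha` and `mu` are assumptions chosen for the example, and the gradient is exact here, whereas in the stochastic case `grad` would evaluate the gradient of a randomly chosen $f_{i_t}$.

```python
import numpy as np

def sgd_step(x, grad, alpha):
    # Plain (stochastic) gradient step: x <- x - alpha * grad(x)
    return x - alpha * grad(x)

def hb_step(x, v, grad, alpha, mu):
    # Heavy-Ball: the gradient is evaluated at the current iterate x
    v_new = mu * v - alpha * grad(x)
    return x + v_new, v_new

def nag_step(x, v, grad, alpha, mu):
    # NAG: the gradient is evaluated at the look-ahead point x + mu * v
    v_new = mu * v - alpha * grad(x + mu * v)
    return x + v_new, v_new

def toy_grad(x):
    # Hypothetical objective f(x) = 0.5 * ||x||^2, whose gradient is x
    return x

x_sgd = np.array([1.0, -2.0])
x_hb, v_hb = np.array([1.0, -2.0]), np.zeros(2)
x_nag, v_nag = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    x_sgd = sgd_step(x_sgd, toy_grad, alpha=0.1)
    x_hb, v_hb = hb_step(x_hb, v_hb, toy_grad, alpha=0.1, mu=0.9)
    x_nag, v_nag = nag_step(x_nag, v_nag, toy_grad, alpha=0.1, mu=0.9)
```

The only difference between `hb_step` and `nag_step` is the point at which the gradient is queried, which is why NAG can be read as a momentum method.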
Momentum methods like HB and NAG have been shown to have superior convergence properties compared to gradient descent in the deterministic setting, both for convex and non-convex functions [nesterov1983method, polyak1987introduction, zavriev1993heavy, ochs2016local, o2017behavior, jin2017accelerated]. To the best of our knowledge, there is no clear theoretical justification in the stochastic case for the benefits of NAG and HB over regular SGD in general [yuan2016influence, kidambi2018on, wiegerinck1994stochastic, orr1994momentum, yang2016unified, gadat2018stochastic], unless one considers specialized function classes [loizou2017momentum]. In practice, however, these momentum methods, and in particular NAG, have been repeatedly shown to have good convergence and generalization on a range of neural net problems [sutskever2013importance, lucas2018aggregated, kidambi2018on].
The performance of NAG (as well as HB and SGD), however, is typically quite sensitive to the selection of its hyper-parameters: step size, momentum and batch size [sutskever2013importance]. Thus, “adaptive gradient” algorithms such as RMSProp (Algorithm 2) [tieleman2012lecture] and ADAM (Algorithm 3) [kingma2014adam] have become very popular for optimizing deep neural networks [melis2017state, xu2015show, denkowski2017stronger, gregor2015draw, radford2015unsupervised, bahar2017empirical, kiros2015skip]. The reason for their widespread popularity seems to be that they are believed to be easier to tune than SGD, NAG or HB. Adaptive gradient methods use as their update direction a vector which is the image, under a linear transformation (often called the “diagonal pre-conditioner”) constructed out of the history of the gradients, of a linear combination of all the gradients seen so far. It is generally believed that this “pre-conditioning” makes these algorithms much less sensitive to the selection of their hyper-parameters. A precursor to RMSProp and ADAM was outlined in [duchi2011adaptive].
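To make the “diagonal pre-conditioner” description concrete, here is a minimal sketch of one RMSProp step and one ADAM step in their commonly used textbook form. It is an assumption that this matches the general shape of Algorithms 2 and 3; the exact placement of the small constant `eps`, the default hyper-parameter values, and the omission of ADAM's bias correction are choices made for this illustration only.

```python
import numpy as np

def rmsprop_step(x, v, g, alpha=1e-3, beta2=0.99, eps=1e-8):
    # v: running average of squared gradients (defines the diagonal pre-conditioner)
    v = beta2 * v + (1.0 - beta2) * g**2
    # Pre-conditioner applied coordinate-wise to the current gradient
    x = x - alpha * g / (np.sqrt(v) + eps)
    return x, v

def adam_step(x, m, v, g, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: exponentially weighted linear combination of all past gradients
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g**2
    # Pre-conditioner applied to the momentum term (bias correction omitted)
    x = x - alpha * m / (np.sqrt(v) + eps)
    return x, m, v

g = np.array([2.0, -4.0])
x_r, v_r = rmsprop_step(np.zeros(2), np.zeros(2), g)
x_a, m_a, v_a = adam_step(np.zeros(2), np.zeros(2), np.zeros(2), g)
```

In this toy call both coordinates move by the same magnitude even though the gradient entries differ by a factor of two, which illustrates how the per-coordinate rescaling reduces sensitivity to the raw gradient scale.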
Despite their widespread use in neural net problems, adaptive gradient methods like RMSProp and ADAM lack theoretical justification in the non-convex setting, even with exact/deterministic gradients [bernstein2018signsgd]. Further, there are important motivations to study the behavior of these algorithms in the deterministic setting because of use cases where the amount of noise is controlled during optimization, either by using larger batches [martens2015optimizing, de2017automated, babanezhad2015stop] or by employing variance-reducing techniques [johnson2013accelerating, defazio2014saga].
Further, works like [wilson2017marginal] and [keskar2017improving] have shown cases where SGD (no momentum) and HB (classical momentum) generalize much better than RMSProp and ADAM with stochastic gradients. [wilson2017marginal] also show that ADAM generalizes poorly for large enough nets and that RMSProp generalizes better than ADAM on a couple of neural network tasks (most notably in the character-level language modeling task). In general, however, it is not clear, and to the best of our knowledge no heuristics are known, for deciding whether these insights about relative performance (generalization or training) between algorithms hold for other models or carry over to the full-batch setting.
A summary of our contributions
In this work we try to shed some light on the above described open questions about adaptive gradient methods in the following two ways.
To the best of our knowledge, this work gives the first convergence guarantees for adaptive gradient based standard neural-net training heuristics. Specifically we show run-time bounds for deterministic RMSProp and ADAM to reach approximate criticality on smooth non-convex functions, as well as for stochastic RMSProp under an additional assumption.
Recently, [reddi2018convergence] have shown in the setting of online convex optimization that there are certain sequences of convex functions where ADAM and RMSProp fail to converge to asymptotically zero average regret. We contrast our findings with Theorem in [reddi2018convergence]. Their counterexample for ADAM is constructed in the stochastic optimization framework and is incomparable to our result about deterministic ADAM. Our proof of convergence to approximate critical points establishes a key conceptual point: for adaptive gradient algorithms, one cannot transfer intuitions about convergence from online setups to their more common use case in offline setups.
Our second contribution is an empirical investigation into adaptive gradient methods, where our goals are different from what our theoretical results are probing. We test the convergence and generalization properties of RMSProp and ADAM, and we compare their performance against NAG on a variety of autoencoder experiments on MNIST data, in both full and mini-batch settings. In the full-batch setting, we demonstrate that ADAM with very high values of the momentum parameter $\beta_1$ matches or outperforms carefully tuned NAG and RMSProp, in terms of getting lower training and test losses. We show that as the autoencoder size keeps increasing, RMSProp soon fails to generalize. In the mini-batch experiments we see exactly the same behaviour for large enough nets. We further validate this behavior on an image classification task on CIFAR-10 using a VGG-9 convolutional neural network, the results of which we present in Appendix LABEL:vgg9_sec.
We note that recently it has been shown by [lucas2018aggregated] that there are problems where NAG generalizes better than ADAM even after tuning $\beta_1$. In contrast, our experiments reveal controlled setups where tuning ADAM’s $\beta_1$ closer to $1$ than usual practice helps close the generalization gap with NAG and HB which exists at standard values of $\beta_1$.
Much after this work was completed, we came to know of a related paper [li2018convergence] which analyzes convergence rates of a modification of AdaGrad (not RMSProp or ADAM). After the initial version of our work was made public, a few other analyses of adaptive gradient methods have also appeared, like [chen2018convergence], [zhou2018convergence] and [zaheer2018adaptive].
2 Notations and Pseudocodes
Firstly we define the smoothness property that we assume in our proofs for all our non-convex objectives. This is a standard assumption used in the optimization literature.
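$L$-smoothness. If $f : \mathbb{R}^d \to \mathbb{R}$ is at least once differentiable, then we call it $L$-smooth for some $L > 0$ if for all $x, y \in \mathbb{R}^d$ the following inequality holds: $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$.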
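As a sanity check on the standard $L$-smoothness inequality $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$, the sketch below verifies it numerically on a convex quadratic. The quadratic, its dimension and the random seed are assumptions made purely for illustration; for $f(x) = \frac{1}{2}x^\top A x$ with symmetric positive semi-definite $A$, the largest eigenvalue of $A$ is a valid smoothness constant.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T                          # symmetric positive semi-definite matrix
L = np.linalg.eigvalsh(A).max()      # smoothness constant of the quadratic

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

def smoothness_gap(x, y):
    # Right-hand side minus left-hand side of the smoothness inequality;
    # this should be non-negative (up to rounding) for an L-smooth function.
    upper = f(x) + grad_f(x) @ (y - x) + 0.5 * L * np.sum((y - x) ** 2)
    return upper - f(y)

gaps = [smoothness_gap(rng.standard_normal(5), rng.standard_normal(5))
        for _ in range(1000)]
```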
We need one more definition, that of the square root of the inverse of diagonal matrices.
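Square root of the Penrose inverse. If $v \in \mathbb{R}^d$ and $V = \mathrm{diag}(v)$, then we define $V^{-\frac{1}{2}} := \sum_{i \in \mathrm{Support}(v)} \frac{1}{\sqrt{v_i}} e_i e_i^\top$, where $\{e_i\}_{i=1}^{d}$ is the standard basis of $\mathbb{R}^d$.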
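A minimal sketch of the square root of the Penrose inverse of a diagonal matrix: coordinates in the support of the vector are mapped to their inverse square roots, and coordinates outside the support stay zero. The function name `diag_pinv_sqrt` is hypothetical, and the code assumes the entries of the vector are non-negative (as they would be for a running average of squared gradients).

```python
import numpy as np

def diag_pinv_sqrt(v):
    # Assumes v has non-negative entries; the support is the set of non-zero coordinates.
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    support = v > 0
    out[support] = 1.0 / np.sqrt(v[support])
    # Diagonal matrix with 1/sqrt(v_i) on the support and 0 elsewhere
    return np.diag(out)

M = diag_pinv_sqrt([4.0, 0.0, 0.25])
```

Restricting to the support avoids division by zero, which is what distinguishes this Penrose-style inverse from an ordinary matrix inverse.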
Now we list out the pseudocodes used for NAG, RMSProp and ADAM in theory and experiments,
Nesterov’s Accelerated Gradient (NAG) Algorithm
3 Convergence Guarantees for ADAM and RMSProp
Previously it has been shown in [rangamani2017critical] that mini-batch RMSProp can, off the shelf, do autoencoding on depth autoencoders trained on MNIST data, while similar results using non-adaptive gradient descent methods require much tuning of the step-size schedule. Here we give the first result about convergence to criticality for stochastic RMSProp, albeit under a certain technical assumption about the training set (and hence on the first order oracle). Towards that we need the following definition.
The sign function
We define the function s.t it maps .
Stochastic RMSProp converges to criticality (Proof in subsection LABEL:sec:supp_rmspropS). Let $f$ be $L$-smooth and be of the form $f = \frac{1}{k}\sum_{i=1}^{k} f_i$ s.t. (a) each $f_i$ is at least once differentiable, (b) the gradients are s.t , , (c) is an upper bound on the norm of the gradients of $f_i$ and (d) $f$ has a minimizer, i.e., there exists $x_*$ such that $f(x_*) = \min_{x} f(x)$. Let the gradient oracle be s.t. when invoked at some $x_t$ it uniformly at random picks $i_t \in \{1, \ldots, k\}$ and returns $\nabla f_{i_t}(x_t)$. Then corresponding to any and a starting point for Algorithm 2, we can define, s.t. we are guaranteed that the iterates of Algorithm 2 using a constant step-length of, will find an $\epsilon$-critical point in at most steps in the sense that, .∎