1 Introduction
Stochastic gradient descent is the workhorse behind the recent deep learning revolution. This simple and age-old algorithm has been supplemented with a variety of enhancements to improve its practical performance, and sometimes its theoretical guarantees.
Amongst the acceleration methods there are three main categories: momentum, adaptive regularization, and variance reduction. Momentum (in its various incarnations, like heavy-ball or Nesterov acceleration) is the oldest enhancement. It has a well-developed theory, and is known to improve practical convergence in a variety of tasks, small and large. It is also easy to implement. Variance reduction is the most recent advancement; in theory and practice, it is mostly applicable to convex optimization, and is thus less influential in deep learning.
This brings us to adaptive regularization: the most sophisticated, hard to implement, and debated acceleration method. While state-of-the-art optimizers such as Adam and AdaGrad [KB14, DHS11] do use adaptive regularization, they do so in a very limited form: with diagonal matrices, often marketed as per-coordinate adaptive learning-rate methods. Despite solid theoretical guarantees, the practical value of diagonal adaptive regularization as compared to “vanilla” SGD has been the subject of much debate [WRS17]. However, the efficacy of full-matrix adaptive regularization has been relatively unexplored. This is due to the prohibitive computational cost associated with full-matrix operations: full AdaGrad requires taking the inverse square root of a large matrix.
In this paper, we present GGT, a practical solution to the computational problems plaguing full-matrix adaptive regularization, making this technique scalable for modern deep models. At the heart of our method is a simple, GPU-friendly way to apply the inverse square root of the low-rank second-moment matrix of recent gradients; see Figure 1. GGT’s running time is comparable to state-of-the-art optimizers.
We proceed to show that full-matrix preconditioning allows for much better exploitation of anisotropic curvature in loss landscapes. First, we show synthetic experiments which demonstrate clear benefits of GGT over baselines, especially when the problem is ill-conditioned. Then, we implement GGT at scale, and show that the benefits translate to faster training on standard deep learning benchmarks. Our improvement is most salient in complicated landscapes like RNN training.
Our algorithm comes with theoretical guarantees. We give the first proof of convergence to first-order critical points for an algorithm with adaptive regularization in a stochastic nonconvex setting, whose rate is dependent on an adaptivity ratio. We show examples where our bound is stronger than that for SGD, providing some theoretical basis for our empirical findings.
1.1 Related Work
Since the introduction of AdaGrad [DHS11], diagonal adaptive regularization has been a mainstay in the machine learning practitioner’s toolbox. A quick perusal of the literature shows that these methods have continued to thrive in the deep learning era, and appear in all major frameworks [AAB16, PGC17, CLL15]. By citation count (or GitHub search hits), Adam [KB14] is by far the most popular adaptive optimizer for training a variety of modern deep models. For this reason, this paper’s exposition is targeted towards a full-matrix drop-in replacement for Adam; however, our techniques extend straightforwardly to a plethora of related variants, like RMSprop [TH12], Adadelta [Zei12], Nadam [Doz16], etc.
Full-matrix adaptive regularization has existed alongside the more commonly used diagonal-matrix manifestation since their common inception in [DHS11]; however, a major obstacle to the scalability of these methods is the need for the storage and inversion of square matrices in the model dimension. This becomes prohibitively expensive in dimension greater than , while state-of-the-art models regularly exceed parameters.
Matrix sketching has been employed to approximate the AdaGrad preconditioner [KMK16b, MRVW16]; however, the sketched estimate for the matrix inverse can be sensitive to noise. In the former, the authors report a 5-10× overhead over AdaGrad, even with model parameters; we could not find a usable GPU implementation for their requisite rank-1 QR update. [GKS18] propose a way to do AdaGrad with Kronecker products of full-matrix preconditioners, a more limited setting which requires knowledge of the model’s structure. Finally, as we argue in Section 3.1, there is intrinsic value in “forgetting” past curvature using an exponential window. With this, a low-rank preconditioning matrix naturally arises, allowing us to bypass the computational need for sketching in the model dimension or architecture-dependent restriction of the preconditioner.
Our algorithm bears a superficial resemblance to L-BFGS [LN89], a version of BFGS [Bro70, Fle70, Gol70, Sha70] which uses a sliding window of gradient history. Although some are viable for large-scale implementation, these quasi-Newton methods, along with (subsampled, online, cubic-regularized) Newton methods [EM15, ABH17, LACBL16, HAK07, AAZB17, CDHS17] exhibit very different dynamics than the standard optimizers in deep learning, and thus have not seen widespread adoption. We find recent deep learning applications of second-order methods (e.g. [MG15, MBJ18]) to be intriguing, though outside the scope of this paper.
Recently, the role of adaptive regularization has been a hotly contested topic. In [WRS17], the authors suggest that properly-tuned SGD exhibits superior generalization to adaptive methods. In turn, [KS17] propose switching the optimizer from Adam to SGD at the end of training, to reap the advantages of each. Influentially, Adam’s convergence has been the object of recent scrutiny [RKK18]. However, Adam continues to enjoy successful convergence in practice; the problematic construction involves pathological outlier gradients. We do not use the analyses of Adam or AMSGrad.
2 The GGT Algorithm
Our main algorithmic contribution is GGT, an efficient first-order algorithm for full-matrix adaptive preconditioning. In brief, GGT uses the preconditioner from full-matrix AdaGrad, with gradient history attenuated exponentially as in Adam, and truncated to a window parameter . The name GGT acts as a convenient mnemonic for the gradient second-moment matrix maintained by full-matrix AdaGrad, even though we never compute this matrix.
The mathematical specification of GGT is given in Algorithm 1, in the usual model of stochastic optimization (see Section 4), with gradients . Notice that the coordinate-wise scaling of Adam is recovered by zeroing out the off-diagonal entries of .
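To make the last remark concrete, here is a minimal NumPy sketch. It is an illustration under our own assumptions rather than a transcription of Algorithm 1: the windowed (already attenuated) gradients are assumed to be stored as the columns of a matrix `G`, the damping constant `eps` and the helper names `precondition` and `diagonal_scaling` are hypothetical, and Adam's bias correction and momentum are omitted. It suggests that zeroing the off-diagonal entries of the second-moment matrix reduces the full-matrix preconditioner to a per-coordinate rescaling.

```python
import numpy as np

def precondition(M, g, eps=1e-8):
    """Apply (M^{1/2} + eps*I)^{-1} to g for a symmetric PSD matrix M (dense; small d only)."""
    evals, Q = np.linalg.eigh(M)
    root = np.sqrt(np.clip(evals, 0.0, None))   # clip tiny negative round-off
    return Q @ ((Q.T @ g) / (root + eps))

def diagonal_scaling(G, g, eps=1e-8):
    """Per-coordinate scaling: divide each coordinate by the root of its summed squared gradients."""
    return g / (np.sqrt(np.sum(G * G, axis=1)) + eps)

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 10))      # assumed layout: one windowed gradient per column
g = G[:, -1]                          # most recent gradient

second_moment = G @ G.T
full_step = precondition(second_moment, g)                      # full-matrix direction
diag_step = precondition(np.diag(np.diag(second_moment)), g)    # off-diagonals zeroed
```

In this sketch `diag_step` coincides with `diagonal_scaling(G, g)`, while `full_step` additionally accounts for correlations between coordinates.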
GGT provides the power of full-matrix adaptive regularization at a cost not much larger than SGD. This crucially exploits the fact that only a small window of historical gradients is used for preconditioning. The intuition for using a small window, as opposed to the entire history, is clear (and time-tested, by the ubiquity of Adam): the curvature of the loss surface changes, rendering previous gradient information obsolete. We expand on the benefits of forgetting gradients in Section 3.1.
The fact that the preconditioning matrix is based on a small window of gradients implies that it has low rank. GGT exploits this fact by computing the inverse square root of the empirical covariance matrix indirectly, as outlined in Figure 1. In effect, instead of inverting a full matrix in the dimension of parameters, using the special matrix structure GGT inverts a matrix of dimension window-size. The remainder of this section will discuss efficient implementation and some heuristics.
GGT has provable guarantees even for nonconvex optimization: it is guaranteed to converge to a first-order critical point. Its rate of convergence is never significantly slower than that of SGD, and in some favorable geometric conditions, can be significantly faster. These theoretical bounds are made precise in Section 4.
2.1 Fast low-rank preconditioning
The window parameter should be roughly the number of copies of the model that fit in RAM; in our large-scale experiments, we use . A pessimistic but principled choice is , which truncates on the time scale of the exponential attenuation. Our key observation, highlighted in Figure 1, is that the inversion of the large low-rank matrix can be performed by diagonalizing the small matrix , along with some extremely GPU-friendly matrix-vector operations.
The basic intuition is contained in Figure 1, but it remains to include the term. We derive the full update here. Let , be arbitrary, with . Write the singular value decomposition , with . Let , and let be its top left block. Let , so that the columns of are an orthonormal basis for the column space of , and its orthogonal component, noting that . Then, we have
The first term is none other than an SGD update step. The rest can be computed by taking the eigendecomposition , giving . We prefer this to taking the direct SVD of , which is times slower on GPU.
Using a cyclic buffer to store and update , the algorithm takes (sequential) time per iteration, and memory in total. Iterating over the model parameters to update incurs the same overhead cost as usual adaptive optimizers. The matrix multiplication and SVD operations benefit from decades of extensive hardware-level optimizations.
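The low-rank computation described in this subsection can be sketched in NumPy as follows. This is a hedged illustration, not the authors' implementation: the names `ggt_direction`, `G`, `g`, and `eps` are ours, the window of gradients is assumed to be stored as the columns of a d-by-r matrix `G`, and the damping term is assumed to be a multiple `eps` of the identity added to the square root of the second-moment matrix. Only the small r-by-r Gram matrix is diagonalized; the d-by-d matrix is never formed.

```python
import numpy as np

def ggt_direction(G, g, eps=1e-4):
    """Return [(G G^T)^{1/2} + eps*I]^{-1} g without forming the d x d matrix.

    G : (d, r) array whose columns are the windowed gradients (assumed layout).
    g : (d,) vector to precondition.
    """
    # Diagonalize the small r x r Gram matrix instead of the d x d matrix.
    evals, V = np.linalg.eigh(G.T @ G)
    # Discard numerically zero directions (rank-deficient window).
    keep = evals > 1e-12 * evals.max()
    sigma = np.sqrt(evals[keep])             # nonzero singular values of G
    U = (G @ V[:, keep]) / sigma             # matching left singular vectors
    coeff = U.T @ g                          # coordinates of g in span(G)
    # Inside span(G): scale by 1/(sigma + eps). Outside: plain scaling by 1/eps.
    return g / eps + U @ ((1.0 / (sigma + eps) - 1.0 / eps) * coeff)

rng = np.random.default_rng(0)
d, r = 1000, 20
G = rng.standard_normal((d, r))              # stand-in for a buffer of recent gradients
g = G[:, -1]                                 # most recent gradient
step = ggt_direction(G, g)
```

The term `g / eps` is a rescaled plain gradient step, and the correction only touches the r-dimensional span of the window. In a training loop, the buffer could be refreshed cyclically by overwriting column `t % r` with the newest gradient at iteration `t`.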
In the experiments in Section 3, we observed a (CNN) and (RNN) running-time overhead over SGD; we note that this ratio could be even smaller in reinforcement learning (where the environment causes the time bottleneck), or universally with a more optimized implementation.
2.2 Tweaks for GGT on deep models
Below, we list some practical suggestions for applying GGT to training large-scale models.
Momentum. In order to bring GGT closer to a drop-in replacement for Adam, we can add momentum to the gradient steps: let , and apply the preconditioner to to compute the update step. We use momentum in all large-scale experiments, with the standard . We also get a small performance boost by using instead of the gradients to update . On the other hand, as long as , it makes little difference to choose , letting the window (rather than exponential attenuation) forget stale gradient information.
Interpolation with SGD. We note the possibility of decoupling the scalars and which appear in the efficient update step. Appealingly, this allows the user to tune GGT’s behavior to be arbitrarily close to that of SGD.
Numerical concerns. For greater numerical stability, it is possible to add a small multiple of the identity matrix (we suggest ) to before computing its eigendecomposition, without noticeable differences in training.
3 Experiments
In this section, we present an empirical study of GGT. We begin with some simple experiments, showing that adaptive methods help in the presence of ill-conditioned optimization problems, as well as the value of limited gradient memory. Next, we evaluate the performance of GGT on larger-scale deep learning tasks. Finally, we present some interesting empirical insights on the training dynamics in deep learning models. Our visualizations of gradient spectra suggest that adaptive optimizers are indeed correcting for changing anisotropic curvature in the loss landscape.
3.1 Synthetic data: when do adaptivity and forgetfulness help?
The original theorems on the behavior of adaptive first-order methods are established from the perspective of online convex optimization [DHS11]. The dynamics are less understood on realistic loss landscapes in stochastic optimization. For this reason, we begin our experimental section with some simple empirical comparisons between full- and diagonal-matrix adaptive optimizers and SGD. Figure 2 summarizes our findings.
In each synthetic experiment, we generated an ill-conditioned landscape, and compared SGD with adaptive optimizers, excluding the typical accompanying heuristics (i.e. no momentum, regularization, or learning rate schedule). We tested diagonal-matrix preconditioners with and without exponential gradient attenuation (like Adam and AdaGrad, respectively), and their full-matrix analogues. The experiments were robust with respect to the choice of (we used ) and batch size.
In the first synthetic experiment (left), we exhibit an instance of logistic regression in dimension 10, with samples generated from an extremely anisotropic Gaussian distribution, and binary labels determined by a random hyperplane. SGD converges the slowest, and diagonal AdaGrad consistently accelerates optimization. Finally, full-matrix preconditioning (using cubic-time matrix inversion) converges the fastest. In this setting, adding a window improved convergence, but not drastically; we elaborate below.
Next, we show an optimization problem (right) which accentuates the utility of exponentially decaying gradient memory. We consider the problem of minimizing the logarithmic barrier function of a randomly generated anisotropic polytope, otherwise known as finding its analytic center: this replaces the logistic loss terms with , with generated the same way as above, and generated uniformly from . We observed the same ranking of convergence rates as in the first experiment, but the improvement afforded by the window was much clearer.
The primary conclusion of our synthetic experiments is to demonstrate some small-scale settings in which adaptive regularization ameliorates anisotropy in the optimization landscape. A subtler point is that the windowed variants can help with changing curvature, even for convex losses. Note that the curvature of the former landscape is constant (in that its Hessian matrix at different locations only changes by a scalar factor). The latter setting, in contrast, features a changing curvature (its Hessians do not commute in general), necessitating “forgetfulness” in adaptive curvature estimation.
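For readers who wish to experiment with a setting of this flavor, the following NumPy sketch sets up an anisotropic logistic regression problem and runs plain gradient descent, diagonal AdaGrad, and dense full-matrix AdaGrad on it. Everything specific here is an assumption made for illustration and is not taken from the experiments above: the sample count, the range of feature scales, the step sizes, the use of full-batch gradients instead of minibatches, and the absence of a window or exponential attenuation. The helper names `make_problem`, `loss_and_grad`, and `run` are hypothetical.

```python
import numpy as np

def make_problem(n=500, d=10, seed=0):
    """Anisotropic Gaussian features with labels from a random hyperplane."""
    rng = np.random.default_rng(seed)
    scales = np.logspace(0, -3, d)               # feature std from 1 down to 1e-3 (assumed)
    X = rng.standard_normal((n, d)) * scales
    y = np.sign(X @ rng.standard_normal(d))
    y[y == 0] = 1.0
    return X, y

def loss_and_grad(w, X, y):
    """Mean logistic loss and its gradient."""
    margins = y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, -margins))
    # 1 / (1 + exp(m)) written with tanh for numerical stability
    coeff = -y * 0.5 * (1.0 - np.tanh(0.5 * margins))
    return loss, X.T @ coeff / len(y)

def run(X, y, method, lr, steps=200, eps=1e-6):
    """Full-batch descent with no, diagonal, or full-matrix AdaGrad preconditioning."""
    d = X.shape[1]
    w = np.zeros(d)
    S = np.zeros((d, d))                         # running sum of g g^T (no forgetting)
    for _ in range(steps):
        _, g = loss_and_grad(w, X, y)
        S += np.outer(g, g)
        if method == "sgd":
            step = g
        elif method == "diag":
            step = g / (np.sqrt(np.diag(S)) + eps)
        else:                                    # "full": dense inverse square root
            evals, Q = np.linalg.eigh(S)
            root = np.sqrt(np.clip(evals, 0.0, None))
            step = Q @ ((Q.T @ g) / (root + eps))
        w = w - lr * step
    return loss_and_grad(w, X, y)[0]

X, y = make_problem()
final_losses = {m: run(X, y, m, lr) for m, lr in [("sgd", 1.0), ("diag", 0.1), ("full", 0.1)]}
```

The dictionary `final_losses` holds the final training loss of each method. How the methods rank depends on the step sizes and the degree of anisotropy chosen, so this sketch should be treated as a starting point for exploration, not as a reproduction of Figure 2.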
In Section 3.4, we will return to these proof-of-concept optimization instances, connecting them to an empirical study of curvature in more realistic landscapes.
3.2 GGT on deep convolutional models
We investigated the training dynamics of GGT on a typical deep architecture for computer vision. For this, we used a 26-layer 3-branch residual network with Shake-Shake regularization, recently proposed in [Gas17]. Aside from its ability to reach state-of-the-art classification accuracy, this architecture also features a relatively low parameter count (M), enabling the use of a large window parameter ().
In each experiment, we kept the cosine learning rate annealing schedule proposed in the paper, originally from [LH16]; performance degraded consistently and significantly with a fixed learning rate. For both Adam and GGT, we chose the commonly used parameters ; for SGD, we used momentum with parameter . With correctly tuned RMSprop and Adadelta, with the same window parameters, training curves were virtually identical to those for Adam. We used the standard data augmentation techniques of 4-pixel padding + random cropping and horizontal flipping.
Our results are shown in Figure 3 (top). In terms of training loss, GGT consistently dominated existing optimizers. We corroborate a number of observations from previous empirical studies of the generalization of optimizers. Most prominently, we found that SGD generalized slightly better than all others [WRS17, KS17] towards the end of training, including ours. The gap is less dramatic than that seen in [WRS17] for two reasons: we only show curves with a tuned and annealed learning rate; also, we use an architecture with powerful explicit regularization techniques which have gained attention since their publication. Our preliminary observation is that GGT shrinks this gap slightly, and we expect that there is vastly more empirical work to be done concerning architectures synergistically tuned to default optimizers.
We also verify the long-held empirical observation that the learning rate decay of AdaGrad is too aggressive (e.g. in [Zei12]), resulting in convergence to a poor solution. Finally, as noted in [WRS17], we find that using a sufficiently low learning rate for any optimizer can result in a better training loss curve, but not without significantly degrading generalization performance ().
3.3 GGT on recurrent models
Next, we move to recurrent architectures for language modeling. We train a 3-layer LSTM [HS97] with M parameters for character-level modeling of the Penn Treebank dataset [MKM94]. This is the setting in which we observe the most striking improvement over baselines. The particularities of this optimization task, and why it might be especially amenable to full-matrix regularization, remain a fruitful research direction [PMB13]. Figure 3 (bottom) shows training and validation perplexities for the first epochs; no optimizer makes significant progress afterwards.
The state of the art for character-level language modeling is less thoroughly documented than its word-level counterpart, though we note that our end-to-end result (validation perplexity after epochs) is competitive with that shown in [KMK16a]. In contrast, Adam, AdaGrad, and SGD reach , , and , respectively. Note that Adam is the de facto standard optimizer for language modeling [MDB17]. Even with iterations taking twice the time, we outperform all baselines in wall-clock time throughout training.
We also tried using GGT as a drop-in replacement for Adam in the state-of-the-art word-level language modeling code accompanying [MKS17, MKS18]. Although we were competitive with Adam, we only observed an improvement in the first epochs. We hypothesize that the advantage of full-matrix regularization in this setting is more marginal, as the gradients in the embedding layers are naturally sparse in the vocabulary (“one-hot”) basis.
3.4 Empirical insights on the spectral decay
In this section, we unify the insights gleaned from the synthetic experiments and deep learning benchmarks. Along the way, we provide some interesting anecdotal observations on the evolution of the preconditioner matrices’ singular values.
We plot the density of the spectrum of the low-rank preconditioner as training progresses. Since the fast implementation of GGT takes an eigendecomposition of , we can read off the distribution of eigenvalues during training at no additional computational cost. Figure 4 visualizes the result of this experiment for the CNN and RNN training settings from the previous two sections. In each case, we observe that has a condition number of , noting that this can be visualized as the vertical range in the logarithmic plot.
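As a rough sketch of how such a spectrum could be read off, the NumPy snippet below recovers the singular values of a window of gradients from the eigenvalues of its small Gram matrix and reports a condition number. The function names, the column layout of `G`, and the relative tolerance `tol` are our own assumptions; in an actual optimizer the eigenvalues would already be available from the update step and would not need to be recomputed.

```python
import numpy as np

def window_spectrum(G):
    """Singular values of G (descending), from the eigenvalues of the small Gram matrix G^T G."""
    evals = np.linalg.eigvalsh(G.T @ G)          # returned in ascending order
    return np.sqrt(np.clip(evals, 0.0, None))[::-1]

def condition_number(G, tol=1e-6):
    """Ratio of the largest to the smallest singular value above a relative tolerance.

    The tolerance is deliberately loose: singular values obtained through the Gram
    matrix are only accurate to roughly the square root of machine precision.
    """
    s = window_spectrum(G)
    if s[0] == 0.0:
        raise ValueError("window contains only zero gradients")
    s = s[s > tol * s[0]]
    return s[0] / s[-1]

# Toy window with a prescribed spectrum (hypothetical values).
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((30, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
singular_values = np.array([100.0, 10.0, 1.0, 0.1])
G = (U * singular_values) @ V.T
spectrum = window_spectrum(G)
kappa = condition_number(G)
```

Plotting `np.log10(spectrum)` at regular intervals during training would give a picture of the kind described here, with the condition number appearing as the vertical range.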
This visualization affords a new way to see how CNN and RNN landscapes are fundamentally different: their gradient spectra evolve in very distinct ways over the course of training. Interestingly, the condition number of the CNN landscape surges near the end, which may be related to the low-rank structure of well-trained nets noted by [AGNZ18], who derive rank-dependent generalization bounds for neural networks. On recurrent models, the rapidly evolving spectral structure at the early stage of training indicates a possibly more complex landscape. Intriguingly, the enormous condition number () correlates with the massive lead of GGT over the others, confirming our intuition that full-matrix preconditioning ameliorates anisotropy.
To our knowledge, this is the first empirical study of this kind, using the covariance matrix of recent gradients as a surrogate to examining the changing curvature of the loss landscape. In the spirit of recent empirical lenses of this flavor [RGYSD17, LXTG17], we leave this as a way to visualize deep learning dynamics, possibly of independent exploratory interest.
4 A convergence rate analysis with adaptivity
In this section we outline an idealized version of GGT, for which we can prove convergence to an approximate first-order critical point faster than SGD. As far as we know, this is the first provable guarantee for an adaptive gradient method in the nonconvex setting.
Throughout this section, we consider the setting of stochastic optimization of a differentiable nonconvex function , equipped with an unbiased variance-bounded stochastic gradient oracle; that is, given a point, an algorithm can query independent stochastic gradients such that
The objective, as is standard in the theory of nonconvex optimization (see, e.g. [GL13, AZH16]), is to find an approximate stationary point ; that is, .
4.1 A suitable abstraction for GGT
Even in the convex setting, a convergence theorem in the form of that shown in the Adam paper [KB14] is mild, and not useful for reasoning about the benefit of adaptivity or gradient memory. In particular, the bound degrades with the attenuation parameters and . Although [RKK18] fix a technical glitch in the Adam proof by prescribing a closely related algorithm, the convergence guarantees are of the same form.
Instead, we argue that it is more illuminating to analyze a somewhat idealized relative of the algorithm, in exchange for stronger bounds. For this, we move to a variant of (full-matrix) AdaGrad with “epochs”, or restarts, fully specified in the appendix as Algorithm 2. These restarts can be seen as another justification for using a window, in addition to our aforementioned experimental and intuitive arguments. To quantify the improvement of adaptive regularization, we define the adaptivity ratio as
where is the sequence of stochastic gradients, the sequence of points played by the adaptive regularization algorithm, and is a comparator. For convex optimization problems is naturally the global minimum, but for nonconvex optimization it is a subtler choice, which we detail in Appendix A.
This ratio characterizes the benefit of using adaptive regularization, and was shown in [DHS11] to be always bounded by a quantity independent of , and potentially much smaller. Specifically, it was shown to be at times inversely proportional to the dimension in certain convex optimization problems, providing a theoretical justification for the speedup of adaptive regularization algorithms. For the sake of completeness, in Appendix A.2 we restate one setting exemplifying this important fact.
4.2 Adaptive convergence rate guarantee
We informally state the main theorem below. We defer the full bound without suppressed smoothness constants and logarithmic factors, as well as all technical proofs, to Appendix A.
Theorem 4.1.
Let be a bounded, Lipschitz, and smooth function with stochastic gradient oracle , whose variance is at most . In expectation, Algorithm 2 outputs an approximate critical point of , with calls to .
This theorem matches and potentially improves the known analysis for stochastic gradient descent with the introduction of the data-dependent adaptivity constant into the leading-order term governing the rate of convergence. Since [DHS11] bounded by a quantity independent of , our theorem gives a rate of convergence.
We prove this result using two reductions. The first converts the online regret bound for the idealized algorithm to a convergence rate governed by the adaptivity ratio , for a well-conditioned convex function. This gives us an intermediate adaptive convergence result for convex optimization.
In our second reduction, using a modification of the usual descent lemma used in analyzing gradient descent in the nonconvex setting, we reduce a smooth nonconvex optimization problem to a sequence of well-conditioned convex optimization problems. We highlight the conceptual link between this two-stage analysis and the value of forgetting gradient history: the nonconvex optimization problem is decomposed into a sequence of convex “soft trust-region” problems, between which the idealized GGT algorithm restarts its adaptive regularization. In Appendix A, we translate these intuitions into the formal main convergence theorem.
5 Conclusion
This work investigates full-matrix adaptive regularization: our main contribution is to make this technique viable for large-scale optimization, by a method for efficient multiplication by the inverse square root of a full second-moment matrix over a short window of gradients. This leads to a new algorithm, GGT, a truly scalable optimization algorithm with full-matrix adaptive preconditioning.
Through synthetic experiments, we have shown that GGT accelerates optimization in ill-conditioned loss landscapes; this is supported by accompanying adaptive convergence guarantees. Preliminary experiments show accelerated convergence on standard deep learning benchmarks, with very different training dynamics from existing diagonal adaptive methods. We accompany our algorithm and experiments with the first theoretical guarantees for adaptive regularization in the nonconvex setting, giving examples of provably faster convergence to first-order critical points. We hope that GGT will be the first of a new class of algorithms for the modern large-scale optimization toolbox, and to foster new discussion towards an ever-elusive understanding of loss landscapes in deep learning.
Acknowledgments
We are grateful to Yoram Singer, Tomer Koren, Nadav Cohen, and Sanjeev Arora for helpful discussions.
References
[AAB16] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[AAZB17] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1195–1199. ACM, 2017.
[ABH17] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187, 2017.
[AGNZ18] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
[AZH16] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster nonconvex optimization. In International Conference on Machine Learning, pages 699–707, 2016.
[Bro70] Charles G Broyden. The convergence of a class of double-rank minimization algorithms: 2. the new algorithm. IMA Journal of Applied Mathematics, 6(3):222–231, 1970.
[CDHS17] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. “convex until proven guilty”: Dimension-free acceleration of gradient descent on nonconvex functions. In International Conference on Machine Learning, pages 654–663, 2017.
[CLL15] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MxNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
[DHS11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[Doz16] Timothy Dozat. Incorporating Nesterov momentum into Adam. 2016.
[EM15] Murat A Erdogdu and Andrea Montanari. Convergence rates of subsampled newton methods. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, pages 3052–3060. MIT Press, 2015.
[Fle70] Roger Fletcher. A new approach to variable metric algorithms. The Computer Journal, 13(3):317–322, 1970.
[Gas17] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
[GKS18] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. arXiv preprint arXiv:1802.09568, 2018.
[GL13] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[Gol70] Donald Goldfarb. A family of variable-metric methods derived by variational means. Mathematics of Computation, 24(109):23–26, 1970.
[HAK07] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
[HK14] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
[HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[KB14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[KMK16a] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, and Chris Pal. Zoneout: Regularizing rnns by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305, 2016.
[KMK16b] Gabriel Krummenacher, Brian McWilliams, Yannic Kilcher, Joachim M Buhmann, and Nicolai Meinshausen. Scalable adaptive stochastic optimization using random projections. In Advances in Neural Information Processing Systems, pages 1750–1758, 2016.
[KS17] Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from adam to sgd. arXiv preprint arXiv:1712.07628, 2017.
[LACBL16] Haipeng Luo, Alekh Agarwal, Nicolo Cesa-Bianchi, and John Langford. Efficient second order online learning by sketching. In Advances in Neural Information Processing Systems, pages 902–910, 2016.
[LH16] Ilya Loshchilov and Frank Hutter. Sgdr: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.
[LN89] Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.
[LXTG17] Hao Li, Zheng Xu, Gavin Taylor, and Tom Goldstein. Visualizing the loss landscape of neural nets. arXiv preprint arXiv:1712.09913, 2017.

[MBJ18]
James Martens, Jimmy Ba, and Matt Johnson.
Kroneckerfactored curvature approximations for recurrent neural networks.
2018.  [MDB17] Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589, 2017.
 [MG15] James Martens and Roger Grosse. Optimizing neural networks with kroneckerfactored approximate curvature. In International Conference on Machine Learning, pages 2408–2417, 2015.
 [MKM94] Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The penn treebank: annotating predicate argument structure. In Proceedings of the workshop on Human Language Technology, pages 114–119. Association for Computational Linguistics, 1994.
 [MKS17] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182, 2017.
 [MKS18] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240, 2018.
 [MRVW16] Nishant A Mehta, Alistair Rendell, Anish Varghese, and Christfried Webers. Compadagrad: A compressed, complementary, computationallyefficient adaptive gradient method. arXiv preprint arXiv:1609.03319, 2016.

[PGC17]
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary
DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer.
Automatic differentiation in PyTorch.
In NIPSW, 2017.  [PMB13] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
 [RGYSD17] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha SohlDickstein. Svcca: Singular vector canonical correlation analysis for deep understanding and improvement. arXiv preprint arXiv:1706.05806, 2017.
 [RKK18] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
 [Sha70] David F Shanno. Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24(111):647–656, 1970.
 [TH12] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
 [WRS17] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4151–4161, 2017.
 [Zei12] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
 [Zin03] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.
Appendix A Full adaptive convergence analysis
As in the main paper, in this section we develop an idealized version of GGT, which features full-matrix adaptive regularization and a principled choice of windowing epochs. We prove that it can converge to an approximate first-order critical point faster than SGD, with convergence rate controlled by an adaptivity ratio. To our knowledge, this is the first provable guarantee for an adaptive gradient method in the nonconvex setting.
We consider the standard setting of stochastic optimization of a differentiable nonconvex function , equipped with a bounded-variance stochastic gradient oracle; that is, given a point, we can query independent stochastic gradients such that
The objective, as is standard in nonconvex optimization, is to find a point for which . We will also assume that has a Lipschitz gradient; i.e. .
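As a rough illustration of this setting, the following sketch builds a toy stochastic gradient oracle and an approximate-criticality check. Everything named here is an assumption made for the example: the objective `f` (a sum of cosines), the Gaussian noise model, and the helper names `stochastic_grad` and `is_approx_critical` do not come from the paper.

```python
import numpy as np

def f(x):
    # Hypothetical smooth nonconvex objective, used only for illustration.
    return float(np.sum(np.cos(x)))

def grad_f(x):
    # Exact gradient of f.
    return -np.sin(x)

def stochastic_grad(x, rng, noise_std=0.5):
    # Unbiased oracle (assumed noise model): true gradient plus zero-mean
    # Gaussian noise. The noise has E||noise||^2 = noise_std**2 * x.size,
    # so the variance is bounded.
    return grad_f(x) + noise_std * rng.standard_normal(x.shape)

def is_approx_critical(x, eps):
    # First-order criterion: the gradient norm is at most eps.
    return bool(np.linalg.norm(grad_f(x)) <= eps)

rng = np.random.default_rng(0)
x = np.array([0.3, -1.2, 2.0])

# Averaging many independent queries should approach the true gradient.
samples = np.stack([stochastic_grad(x, rng) for _ in range(20000)])
mean_error = float(np.linalg.norm(samples.mean(axis=0) - grad_f(x)))
```

Here `mean_error` measures how far the empirical mean of the oracle is from the exact gradient, which is one simple way to sanity-check unbiasedness.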
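Our algorithm makes a reduction to the case of online convex optimization. The setting formally is as follows: given a convex set $\mathcal{K}$ and a class of convex functions $\mathcal{F}$, an adversary selects a sequence $f_1, \dots, f_T \in \mathcal{F}$, and the player selects a sequence of points $x_1, \dots, x_T \in \mathcal{K}$. The standard objective here is to minimize regret, defined as
$$\sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x).$$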
Two popular algorithms to minimize online regret are online gradient descent (OGD) [Zin03] and AdaGrad [DHS11]. Due to adaptive regularization, AdaGrad can often be advantageous over OGD. We capture this notion by defining the adaptivity ratio as
where and . The numerator is the regret of AdaGrad, and the denominator is proportional to the upper bound on the regret of OGD.
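To make the comparison concrete, here is a minimal sketch that measures the regret of OGD and of diagonal AdaGrad on the same loss sequence and then forms a ratio of the kind described above. The setup is assumed for illustration only: linear losses f_t(x) = ⟨g_t, x⟩ on the box [-1, 1]^d with synthetic anisotropic gradients, the diagonal (not full-matrix) variant of AdaGrad, and a denominator taken to be the initial distance to the optimum times the square root of the summed squared gradient norms. The helper `run_online` is hypothetical.

```python
import numpy as np

def run_online(grads, step, adaptive, delta=1e-8):
    """Play linear losses f_t(x) = <g_t, x> on the box [-1, 1]^d from x_1 = 0."""
    d = grads.shape[1]
    x = np.zeros(d)
    s = np.zeros(d)                     # running sum of squared gradients
    total_loss = 0.0
    for g in grads:
        total_loss += float(g @ x)      # suffer f_t(x_t) before updating
        if adaptive:
            s += g * g
            x = x - step * g / (np.sqrt(s) + delta)   # diagonal AdaGrad step
        else:
            x = x - step * g                          # OGD step
        # Clipping is the exact projection onto the box, both in the
        # Euclidean norm and in any diagonally weighted norm.
        x = np.clip(x, -1.0, 1.0)
    return total_loss

rng = np.random.default_rng(0)
T, d = 2000, 20
scales = np.logspace(-2, 0, d)          # assumed anisotropic gradient scales
grads = scales * (0.3 + rng.standard_normal((T, d)))

G = grads.sum(axis=0)
x_star = -np.sign(G)                    # best fixed point in hindsight
best_loss = float(G @ x_star)           # equals -||G||_1
S = float(np.sum(grads ** 2))           # sum over t of ||g_t||_2^2

eta_ogd = np.sqrt(d / S)                # step size tuned in hindsight
regret_ogd = run_online(grads, eta_ogd, adaptive=False) - best_loss
regret_adagrad = run_online(grads, 1.0, adaptive=True) - best_loss

# Assumed form of the denominator: ||x_1 - x*||_2 * sqrt(S), with x_1 = 0.
ogd_bound_scale = float(np.linalg.norm(x_star) * np.sqrt(S))
adaptivity_ratio = regret_adagrad / ogd_bound_scale
```

A value of `adaptivity_ratio` below one would indicate that, on this particular sequence, AdaGrad's measured regret is smaller than the scale of the OGD bound. The OGD step size uses quantities known only in hindsight, which is acceptable for a diagnostic but not for a real online algorithm.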
It follows from the bounds of [DHS11] that is in the range for the diagonal version of AdaGrad, depending on the geometry of the optimization problem. The value of for full matrix AdaGrad is unknown, but examples are known for which it is significantly smaller than one. For completeness, we conclude this section with such an example.
In the rest of this section we use AdaGrad as a subroutine in our proposed algorithms. Accordingly, when stating the bounds for our algorithms, we use as an upper bound on the adaptivity ratio of each individual run of the AdaGrad subroutine. Furthermore, our algorithms instantiate the online setting in the stochastic setting, where are picked randomly. In such settings will denote an upper bound on at each step of each run. A weak upper bound on can be obtained by where is the diameter of the underlying set.
A.1 Main Theorem
Theorem A.1.
We prove the theorem in two steps. First we prove the following theorem about Algorithm 2, which reduces a smooth nonconvex optimization problem to a sequence of well-conditioned (strongly convex and smooth) convex optimization problems.
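One way to see why such a reduction can yield well-conditioned subproblems is the following short derivation. It rests on assumptions that are not stated in the text above: that f is twice differentiable with an L-Lipschitz gradient, and that each subproblem adds a quadratic penalty around the current iterate x_t. The particular form of f_t is illustrative.

```latex
\begin{align*}
  f_t(x) &= f(x) + L\,\|x - x_t\|^2, \\
  \nabla^2 f_t(x) &= \nabla^2 f(x) + 2L\,I, \\
  -L\,I \preceq \nabla^2 f(x) \preceq L\,I
  \;&\Longrightarrow\;
  L\,I \preceq \nabla^2 f_t(x) \preceq 3L\,I .
\end{align*}
```

Under these assumptions each f_t is L-strongly convex and 3L-smooth, so its condition number is at most 3 regardless of how badly behaved f itself is.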
Theorem A.2.
Consider a nonconvex function such that and a point such that . Further suppose we are given an iterative algorithm with the following guarantee: given a smooth and strongly convex function accessible through a stochastic gradient oracle, if is run on for steps, it produces a point such that
Then the point output by Algorithm 2 is such that
Further, we propose an AdaGrad-like algorithm (Algorithm 3) that makes calls to the stochastic gradient oracle and minimizes a smooth and strongly convex function to error. We prove the following theorem regarding Algorithm 3.
Theorem A.3.
Suppose f is a strongly convex and smooth function equipped with a bounded stochastic gradient oracle. If we have initial point such that , when run with and , Algorithm 3 guarantees
in a total number of oracle calls.
Note that, due to the presence of in the above bound, the guarantee can be much better than that of OGD. Our analysis closely follows the analysis of SGD for strongly convex functions given by [HK14]. We now prove Theorem A.1 using Theorems A.2 and A.3.
Proof of Theorem A.1.
Proof of Theorem A.2.
From the statement of the theorem we have that
Now consider the following equations which hold for any and any .
where the last inequality follows from smoothness. Now setting , summing the inequality over and rearranging gives us that
which proves the theorem. ∎
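The reduction analyzed above can be sketched in code as an outer loop that repeatedly solves a regularized, well-conditioned subproblem. This is a minimal deterministic sketch under stated assumptions, not the paper's Algorithm 2: the test objective `f`, the quadratic regularizer L‖y − x_t‖², the use of plain gradient descent in place of the stochastic inner algorithm, and the names `solve_subproblem` and `reduce_nonconvex` are all illustrative choices.

```python
import numpy as np

L = 1.0  # smoothness constant of f below (its Hessian is diag(-cos x))

def f(x):
    # Hypothetical nonconvex test objective.
    return float(np.sum(np.cos(x)))

def grad_f(x):
    return -np.sin(x)

def solve_subproblem(center, inner_steps=50):
    # Gradient descent on f_t(y) = f(y) + L * ||y - center||^2.
    # f_t is L-strongly convex and 3L-smooth, so step size 1/(3L) is safe.
    y = center.copy()
    for _ in range(inner_steps):
        y = y - (grad_f(y) + 2.0 * L * (y - center)) / (3.0 * L)
    return y

def reduce_nonconvex(x0, outer_steps=60):
    # Re-center the regularizer at the latest iterate after each solve and
    # return the iterate with the smallest gradient norm seen so far.
    x = x0.copy()
    best = x.copy()
    for _ in range(outer_steps):
        x = solve_subproblem(x)
        if np.linalg.norm(grad_f(x)) < np.linalg.norm(grad_f(best)):
            best = x.copy()
    return best

x0 = np.array([0.5, -1.0, 2.0])
x_out = reduce_nonconvex(x0)
grad_norm = float(np.linalg.norm(grad_f(x_out)))
```

Returning the best iterate rather than the last one mirrors the usual form of nonconvex guarantees, which bound the smallest gradient norm along the trajectory. In the stochastic setting the exact gradient norm is unavailable, so an actual implementation would need a different selection rule.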
Proof of Theorem A.3.
Define . The first claim is that at the end of each epoch , it holds that
Note that the claim is true for as .
Additionally, for any epoch we have
Note that the minimum value of for any is , so . Therefore
where the last inequality is due to the induction hypothesis and Jensen’s inequality. As a consequence, we have
The total number of stochastic gradient oracle calls the algorithm makes is . ∎
A.2 Example: the advantage of adaptivity
This section presents an example, originally given in [DHS11], in which the adaptivity ratio of full-matrix AdaGrad is much smaller than 1 and adaptive regularization methods have a significant advantage over SGD. Consider the setting where in each iteration we receive a training example and suffer hinge loss . Let be an orthonormal matrix and denote its columns. Let and our domain be , then is an optimum. Suppose for a fixed , we receive examples in the following way:
We show that in this case . We initialize . After the first iteration, AdaGrad updates and we have zero loss until the algorithm sees . Since and are orthogonal, . Similarly, AdaGrad suffers constant loss in each dimension, and .