1 Introduction
We consider finite-sum optimization problems of the form

(1)  $\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w),$

which typically arise in neural network training, e.g. for empirical risk minimization over a set of data points $\{(x_i, y_i)\}_{i=1}^{n}$. Here, $f_i(w) = \ell(h_w(x_i), y_i)$, where $\ell$ is a convex loss and $h_w$ represents the neural network mapping parameterized by $w \in \mathbb{R}^d$, which is nonconvex due to its multiplicative nature and potentially nonlinear activation functions. We assume that $f$ is twice differentiable, i.e. $f \in C^2(\mathbb{R}^d, \mathbb{R})$. Nonconvex optimization problems are ubiquitous in machine learning. Among the most prominent examples are present-day deep neural networks, which have achieved outstanding results on core tasks such as collaborative filtering
(wang2015collaborative), sentence classification (kim2014convolutional) and image classification (krizhevsky2012imagenet).

In the era of big data and deep neural networks, stochastic gradient descent (SGD) is one of the most widely used training algorithms (bottou2010large). What makes SGD so attractive is its simplicity, its seeming universality, and a per-iteration cost that is independent of the size of the training set and scales linearly in the problem dimension $d$. However, gradient descent is known to be inadequate for optimizing functions that are not well-conditioned (nesterov2013introductory; shalev2017failures), and thus adaptive first-order methods that employ dynamic, coordinate-wise learning rates based on past gradients – including Adagrad (duchi2011adaptive), RMSprop (tieleman2012lecture) and Adam (kingma2014adam) – have become a popular alternative to SGD for training neural networks; these methods often provide significant speedups over plain SGD in practice. Yet, there exist no theoretical proofs that these methods are faster than gradient descent (li2018convergence).

From a theoretical perspective, Newton methods provide stronger convergence guarantees by appropriately transforming the gradient in ill-conditioned regions according to second-order derivatives. It is precisely this Hessian information that allows regularized Newton methods to enjoy superlinear local convergence as well as to provably escape saddle points (conn2000trust), which can slow gradient methods down significantly (du2017gradient) and are believed to be prevalent in high-dimensional problems (dauphin2014identifying). While second-order algorithms have a long-standing history even in the realm of neural network training (hagan1994training; becker1988improving), they were mostly considered too computationally and memory expensive for practical applications. Yet, the seminal work of martens2010deep renewed interest in their use in deep learning by proposing efficient Hessian-free methods that only access second-order information via matrix-vector products, which can be computed at the cost of an additional backpropagation (pearlmutter1994fast; schraudolph2002fast). Among the class of regularized Newton methods, trust region (conn2000trust) and cubic regularization algorithms (cartis2011adaptive) are the most principled approaches, as they yield the strongest convergence guarantees. Recently, stochastic extensions have emerged (xu2017newton; yao2018inexact; kohler2017sub; gratton2017complexity), which suggest their applicability for deep learning.

Yet, several practical considerations cast doubt on the theoretical superiority of second-order methods when it comes to neural network training. Answers to the following questions are particularly unclear: Are saddles even an issue in deep learning? Is superlinear local convergence a desirable feature in machine learning applications (test error)? Are second-order methods more "vulnerable" to subsampling noise? Do worst-case iteration complexities even matter in real-world settings? As a result, the value of Hessian information in neural network training is somewhat unclear a priori, and so far a conclusive empirical study is still missing. We are only aware of the work of liu2018stochastic; martens2015optimizing; martens2010deep; chapelle2011improved and xu2017second, which deliver promising first results, but their empirical studies are by no means fully encompassing.
We thus here want to shed more light on the issue of first- vs. second-order optimization for neural networks. To do so, we equip the fully stochastic STORM algorithm (chen2018stochastic; gratton2017complexity) with ellipsoidal norm constraints (Section 4) and then perform an extensive set of experiments in which we evaluate this algorithm against standard first-order methods (Section 5). Along the way, we give the following alternative interpretation of adaptive methods (Section 3): While gradient descent can be interpreted as a spherically constrained first-order TR method, preconditioned gradient methods – such as Adagrad – can be seen as first-order TR methods with an ellipsoidal trust region constraint.
This observation is particularly interesting, since spherical constraints are blind to the underlying geometry of the problem, whereas ellipsoids can adapt to local landscape characteristics, thereby allowing for more suitable steps in regions that are ill-conditioned. We will leverage this analogy and investigate the use of the Adagrad and RMSProp preconditioning matrices as ellipsoidal trust region shapes within second-order TR methods. While the theory for ellipsoidal TR methods is well-studied (e.g. conn2000trust; yuan2015recent), no ellipsoid fits all objective functions, and our contribution thus lies in the identification of adequate matrix-induced constraints for the specific task of deep learning.
To sum up, our contribution is threefold:

We provide a new perspective on adaptive gradient methods that contributes to a better understanding of their inner workings. Furthermore, we find that diagonal preconditioning is very effective, since we observe empirically that many neural network problems exhibit diagonally dominant Hessian matrices.

We investigate the first application of ellipsoidal TR methods for deep learning. In Theorem 1 we show that the RMSProp preconditioning matrix can directly be applied as an ellipsoid in second-order TR algorithms while preserving convergence guarantees.

Finally, we provide a comprehensive benchmark of first- vs. second-order methods across different real-world datasets and architectures. We benchmark adaptive gradient methods and show results in terms of both epochs and wall-clock time, a comparison we were not able to find in the literature.
Our main empirical results demonstrate that the net value of Hessian information is marginal. While our second-order methods are mostly faster in terms of epochs, they barely manage to keep pace with stochastic gradient methods in terms of time. Furthermore, they come at the cost of an increased number of hyperparameters and never converge to substantially better limit points. The latter observation suggests that neither saddles nor spurious local minima pose a major problem for first-order methods in our settings. As a result, future advances in hardware and distributed optimization are needed to give Newton methods an edge over gradient-based optimizers.
Finally – and more importantly for the work at hand – the ellipsoidal constraints we propose do prove to be a very effective modification of the trust region method, in the sense that they consistently outperform the spherical TR method, both in terms of time and asymptotic loss value, on a variety of tasks and datasets (see Section 5).
2 Related work
First-order methods
The prototypical method for optimizing Eq. (1) is SGD (robbins1985stochastic). While the practical success of SGD in nonconvex optimization is unquestioned, the theoretical foundation of this phenomenon is still rather limited. Recent findings suggest the ability of this method to escape saddle points and reach local minima in polynomial time for general nonconvex problems, but they either need to artificially add noise to the iterates (ge2015escaping; lee2016gradient) or make an assumption on the inherent noise of vanilla SGD (daneshmand2018escaping). For neural network training, a recent line of research proclaims the effectiveness of SGD, but the results usually come at the cost of fairly strong assumptions such as heavy overparametrization and Gaussian inputs (du2017gradient; brutzkus2017globally; li2017convergence; du2018power; allen2018convergence). Adaptive gradient methods (duchi2011adaptive; tieleman2012lecture; kingma2014adam) build on the intuition that larger learning rates for smaller gradient components, and smaller learning rates for larger gradient components, balance their respective influences and thereby make the methods behave as if they were optimizing a more isotropic surface. This intuition is put into practice by preconditioning the update direction with a diagonal matrix of squared past gradients. Such approaches were first suggested for use in neural networks by lecun2012efficient. Recently, convergence guarantees for such methods have started to appear (ward2018adagrad; li2018convergence). However, these are not superior to the worst-case iteration complexity of standard gradient descent (cartis2012much), and it is yet to be explained rigorously under which conditions adaptive gradient methods can provably accelerate optimization.
Regularized Newton methods
The most principled class of regularized Newton methods are trust region (TR) and adaptive cubic regularization algorithms (ARC) (conn2000trust; cartis2011adaptive), which repeatedly optimize a local Taylor model of the objective while making sure that the step does not travel too far, such that the model stays accurate. While the former finds first-order stationary points within $\mathcal{O}(\epsilon^{-2})$ iterations, ARC only takes at most $\mathcal{O}(\epsilon^{-3/2})$. However, simple modifications to the TR framework allow these methods to obtain the same accelerated rate (curtis2017trust). Both methods take at most $\mathcal{O}(\epsilon^{-3})$ iterations to find an approximate second-order stationary point (cartis2012complexity). These rates are optimal for second-order Lipschitz continuous functions (carmon2017lower; cartis2012complexity) and they can be retained even when only subsampled gradient and Hessian information is used (kohler2017sub; yao2018inexact; xu2017newton; blanchet2016convergence; liu2018stochastic; cartis2017global). Furthermore, the involved Hessian information can be computed solely based on Hessian-vector products, which can be implemented efficiently for neural networks (pearlmutter1994fast). This makes these methods particularly attractive for deep learning, but the empirical evidence of their applicability is so far very limited. We are only aware of the works of liu2018stochastic and xu2017second, which report promising first results, but these are by no means fully encompassing. While the former does not benchmark any first-order method at all, the latter uses large batch sizes and does not train convolutional networks nor benchmark any adaptive gradient method. Furthermore, the results are only reported in terms of backpropagations and not over wall-clock time or epochs.
Gauss-Newton methods
Inspired by natural gradient descent algorithms (amari1998natural), an interesting line of research proposes to replace the Hessian by (approximations of) the Generalized Gauss-Newton matrix (GGN) within a Levenberg-Marquardt framework (a simplified TR method, originally tailored for nonlinear least-squares problems (nocedal2006nonlinear)) for the optimization of neural networks (lecun2012efficient; martens2010deep; chapelle2011improved; martens2015optimizing). These methods also optimize a regularized quadratic model in each iteration, and they have been termed Hessian-free since only access to GGN-vector products is required. As the GGN matrix is always positive semi-definite, they cannot leverage negative curvature to escape saddles, and hence no second-order convergence guarantees exist. Furthermore, there are cases in neural network training where the Hessian is much better conditioned than the Gauss-Newton matrix (mizutani2008second). Nevertheless, the above works report promising preliminary results, and most notably grosse2016kronecker report that the Hessian-free KFAC method can be faster than SGD when training a convnet.
In general, this line of work is to be seen as complementary to our approach, since it is straightforward to replace the Hessian in the TR framework with the GGN matrix. Furthermore, the preconditioners used in martens2010deep and chapelle2011improved, namely diagonal estimates of the empirical Fisher and Fisher matrix respectively, can directly be used as matrix norms in our ellipsoidal TR framework. A closer examination of the difference between Hessian and GGN information for neural network training is an interesting direction of future research.
3 An alternative view on adaptive gradient methods
Adaptively preconditioned gradient methods update iterates in the following way

$w_{t+1} = w_t - \eta_t A_t^{-1} g_t,$

where $g_t$ is a stochastic estimate of $\nabla f(w_t)$ and $A_t$ is a positive definite symmetric preconditioning matrix. In Adagrad, $A_t$ is the uncentered second moment matrix of the past gradients, which we compute as follows

(2)  $A_t := \Big( \sum_{s=1}^{t} g_s g_s^\top \Big)^{1/2} + \epsilon I,$

where $g_1, \dots, g_t$ are the past stochastic gradients, $I$ is the $d \times d$ identity matrix and $\epsilon > 0$. Building up on the intuition that past gradients might become obsolete in quickly changing nonconvex landscapes, RMSprop (and Adam) introduce an exponential weight decay on these terms, leading to the preconditioning matrix

(3)  $R_t := \big( \beta R_{t-1}^2 + (1-\beta) \, g_t g_t^\top \big)^{1/2} + \epsilon I,$

where $\beta \in (0,1)$. In order to save computational effort, the diagonal versions $\mathrm{diag}(A_t)$ and $\mathrm{diag}(R_t)$ are more commonly applied in practice, which in turn gives rise to coordinate-wise adaptive stepsizes that are enlarged (reduced) in coordinates that have seen past gradient components with a smaller (larger) magnitude. In that way, the optimization methods can account for gradients of potentially different scales arising e.g. from different layers of the networks.
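To make the diagonal variants concrete, here is a minimal numpy sketch of one preconditioned step; the function name, default constants, and the toy quadratic are illustrative choices of ours, not the paper's implementation:

```python
import numpy as np

def preconditioned_step(w, g, state, eta=0.1, eps=1e-8, beta=None):
    """One diagonal preconditioned gradient step (illustrative sketch).

    beta=None     -> Adagrad-style accumulator (sum of squared gradients)
    beta in (0,1) -> RMSprop-style exponential moving average
    """
    if beta is None:
        state = state + g ** 2                         # Adagrad: accumulate squared gradients
    else:
        state = beta * state + (1 - beta) * g ** 2     # RMSprop: decay old terms
    precond = np.sqrt(state) + eps                     # diagonal of A_t (resp. R_t)
    return w - eta * g / precond, state                # coordinate-wise adaptive step

# usage on a badly scaled quadratic f(w) = 0.5 * w^T diag(100, 1) w
H = np.array([100.0, 1.0])
w, state = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    g = H * w                                          # exact gradient of f
    w, state = preconditioned_step(w, g, state, eta=0.1, beta=0.9)
```

Note how the normalization by `sqrt(state)` makes the effective stepsize nearly scale-free per coordinate, mirroring the intuition described above.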
3.1 Adaptive preconditioning as ellipsoidal Trust Region
Starting from the fact that adaptive methods employ coordinate-wise stepsizes, one can take a principled view on these methods. Namely, their update steps arise from minimizing a first-order Taylor model of the function within an ellipsoidal search space around the current iterate $w_t$, where the diameter of the ellipsoid along a certain coordinate is proportional to the associated stepsize. Correspondingly, vanilla SGD optimizes the same first-order model within a spherical constraint. Fig. 1 (top) illustrates this effect on three quadratic objectives by showing not only the iterates of GD and Adagrad but also the implicit trust regions within which the local models were optimized at each step. (For illustrative purposes we only plot every other trust region.) Since the models are linear, the constrained minimizer is always found on the boundary of the current trust region.
[Figure 1. Top: iterates and implicit trust regions of GD and Adagrad on a well-conditioned, an ill-conditioned, and a tilted quadratic. Bottom: average log suboptimality over iterations as well as 90% confidence intervals of 30 runs with random initialization.]
It is a well-known fact that GD struggles to progress towards the minimizer of quadratics along low-curvature directions, since the gradient in these directions has significantly smaller components than along high-curvature ones (see e.g. (goh2017why)). For fairly well-conditioned objectives this effect is negligible, as can be seen on the left-hand side of Fig. 1. Yet, the plots in the middle show that this effect can drastically slow GD down when the problem is ill-conditioned. Particularly, once the method has reached the bottom of the valley it struggles to make progress along the horizontal axis. This observation resembles the classical zigzagging effect (nocedal2006nonlinear) and is precisely where the advantage of adaptive stepsize methods comes into play. As illustrated by the dashed trust region lines, Adagrad's search space is damped along the direction of high curvature (vertical axis) and elongated along the low-curvature direction (horizontal axis). This preconditioning allows the method to move more horizontally early on and thus enter the valley with a smaller distance to the optimizer along the low-curvature direction, which effectively circumvents the zigzagging effect.
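The slow progress of GD along flat directions can be reproduced in a few lines. The following toy comparison is ours, not the code behind Fig. 1, and for clarity it uses the exact diagonal curvature as preconditioner rather than an adaptively estimated one:

```python
import numpy as np

# f(w) = 0.5 * w^T H w with condition number 100: a stand-in for the
# ill-conditioned quadratic in the middle column of Fig. 1.
H = np.diag([100.0, 1.0])
D = np.diag(H)                                   # ideal diagonal preconditioner
w_gd = np.array([1.0, 1.0])
w_pre = np.array([1.0, 1.0])

for _ in range(200):
    w_gd = w_gd - (1.0 / 100.0) * (H @ w_gd)     # GD with stepsize 1/L (stable)
    w_pre = w_pre - 0.5 * (H @ w_pre) / D        # equalized per-coordinate curvature

# GD solves the stiff (high-curvature) coordinate immediately but barely
# moves along the flat one, while the preconditioned run converges in both.
```

After 200 iterations GD is still a noticeable distance from the optimum along the flat coordinate, whereas the preconditioned iterate is numerically at the minimizer.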
Let us now formally establish the result that allows us to reinterpret adaptive gradient methods from the trust region perspective introduced above.
Theorem 1 (Preconditioned gradient methods as TR).
A preconditioned gradient step

(4)  $s_t = -\eta A^{-1} g_t$

with stepsize $\eta > 0$ and symmetric positive definite preconditioner $A$ minimizes a first-order model around $w_t$ in an ellipsoid given by $A$ in the sense that

(5)  $s_t = \arg\min_{s \in \mathbb{R}^d} \; g_t^\top s \quad \text{s.t.} \quad \|s\|_A \le \eta \, \|g_t\|_{A^{-1}}.$
Corollary 1 (RMSprop).
The step $s_t = -\eta R_t^{-1} g_t$ minimizes a first-order Taylor model around $w_t$ in an ellipsoid given by $R_t$ (Eq. (3)) in the sense that

(6)  $s_t = \arg\min_{s \in \mathbb{R}^d} \; g_t^\top s \quad \text{s.t.} \quad \|s\|_{R_t} \le \eta \, \|g_t\|_{R_t^{-1}}.$
Equivalent results can be established for Adam as well as for Adagrad by substituting the respective preconditioning matrix into the constraint in Eq. (6). Of course, the update procedure in Eq. (5) is merely a reinterpretation of the original preconditioned update Eq. (4), and thus the employed trust region radii are defined implicitly by the current gradient and stepsize. However, it is straightforward to embed this approach into a proper TR framework where the trust region radius is defined explicitly by the algorithm, which gives rise to first-order TR methods that behave very similarly to GD and Adagrad in practice (see Fig. 5 in Appendix).
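The argument behind Theorem 1 is a short Cauchy–Schwarz computation; with the notation above and $\|x\|_A := \sqrt{x^\top A x}$, a proof sketch reads:

```latex
% For any s with \|s\|_A \le \Delta, the generalized Cauchy--Schwarz inequality gives
\[
  g^\top s \;=\; (A^{-1/2} g)^\top (A^{1/2} s)
  \;\ge\; -\|g\|_{A^{-1}} \, \|s\|_A
  \;\ge\; -\Delta \, \|g\|_{A^{-1}},
\]
% with equality if and only if
\[
  s \;=\; -\,\frac{\Delta}{\|g\|_{A^{-1}}}\, A^{-1} g .
\]
% Choosing the implicit radius \Delta = \eta \, \|g\|_{A^{-1}} recovers the
% preconditioned step s = -\eta A^{-1} g of Eq. (4).
```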
3.2 Diagonal versus full preconditioning
A closer look at Fig. 1 reveals that the first two problems come with level sets that are perfectly axis-aligned, which makes these objectives particularly attractive for diagonal preconditioning. For comparison, on the far right of Fig. 1 we report a further quadratic problem instance, where the Hessian is no longer zero on the off-diagonals. As can be seen, the interaction between coordinates introduces a tilt in the level sets and reduces the superiority of diagonal Adagrad over plain GD. However, using the full preconditioner re-establishes the original speed-up. Yet, non-diagonal preconditioning comes at a higher computational per-iteration cost, since it involves taking the inverse square root of a large matrix, which is why this approach has remained relatively unexplored (see (agarwal2018case) for a recent exception). In the following section, we find that such full matrix preconditioning is likely not worth the effort for neural networks, since they show a surprisingly large degree of axis alignment.
3.3 Are neural networks axis-aligned?
Early results by becker1988improving on the curvature structure of neural networks report a strong diagonal dominance of the Hessian matrix $\nabla^2 f(w)$. This suggests that the loss surface is indeed somewhat axis-aligned, which would render full matrix preconditioning obsolete. However, the reported numbers are only for tiny feedforward networks of at most 256 parameters. Therefore, we generalize these findings in the following to larger networks and contrast the diagonal dominance of real Hessian matrices with the expected behaviour of random Wigner matrices. For this purpose, let $r(A)$ denote the ratio of diagonal to overall mass of a matrix $A \in \mathbb{R}^{n \times n}$, i.e.

$r(A) := \frac{\sum_{i=1}^{n} |a_{ii}|}{\sum_{i,j=1}^{n} |a_{ij}|}.$
Proposition 1.
For a random Gaussian Wigner matrix $A \in \mathbb{R}^{n \times n}$ (Eq. (29)) the expected share of diagonal mass amounts to

$\mathbb{E}\left[ r(A) \right] = \frac{\sqrt{2}}{\sqrt{2} + (n-1)} \in \mathcal{O}(1/n).$

(The argument naturally extends to any distribution with positive expected absolute values.)
Thus, if we suppose the Hessian at any given point were a random Wigner matrix, we would expect the share of diagonal mass to fall at a rate of $\mathcal{O}(1/n)$ as the network grows in size. Yet, as can be seen in Fig. 2, the diagonal mass of real-world neural networks stays well above this theoretical value at random initialization, during training and after convergence. These findings are in line with becker1988improving and thus question the effectiveness of full matrix preconditioning and full matrix ellipsoidal TR methods.
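The $\mathcal{O}(1/n)$ scaling is easy to check empirically. The sketch below is ours; the exact constant depends on the variance convention for the diagonal (here the GOE-style symmetrization, where diagonal entries have twice the off-diagonal variance):

```python
import numpy as np

rng = np.random.default_rng(0)

def diag_mass(A):
    """r(A): share of absolute mass on the diagonal, sum_i|a_ii| / sum_ij|a_ij|."""
    A = np.abs(A)
    return np.trace(A) / A.sum()

# symmetric Gaussian (Wigner-type) matrix; this symmetrization gives the
# diagonal twice the variance of the off-diagonal entries (GOE convention)
n = 500
B = rng.standard_normal((n, n))
W = (B + B.T) / np.sqrt(2)

r = diag_mass(W)   # decays like O(1/n); close to sqrt(2) / (sqrt(2) + n - 1)
```

For $n = 500$ the measured ratio lands near $\sqrt{2}/(\sqrt{2}+n-1) \approx 0.0028$, orders of magnitude below the values reported for real network Hessians in Fig. 2.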
[Figure 2: share of diagonal Hessian mass over the course of training for a Convnet and an MLP.]
4 Second-order Trust Region Methods
Cubic regularization (nesterov2006cubic; cartis2011adaptive) and trust region methods belong to the family of globalized Newton methods. Both frameworks compute their parameter updates by optimizing regularized (former) or constrained (latter) second-order Taylor models of the objective around the current iterate $w_t$. In particular, in iteration $t$ the update step of the trust region algorithm is computed as

(7)  $s_t = \arg\min_{s \in \mathbb{R}^d} \; m_t(s) := f(w_t) + g_t^\top s + \frac{1}{2} s^\top B_t s \quad \text{s.t.} \quad \|s\|_{M_t} \le \Delta_t,$

where $\Delta_t > 0$, and $g_t$ and $B_t$ are either $\nabla f(w_t)$ and $\nabla^2 f(w_t)$ or suitable approximations. The matrix $M_t$ induces the shape of the constraint set. So far, the common choice for neural networks is $M_t = I$, which gives rise to spherical trust regions (xu2017second; liu2018stochastic). By solving the constrained problem (7), TR methods overcome the problem that pure Newton steps may be ascending, attracted by saddles, or not even computable in the case of indefinite Hessians. See Section 2 and Appendix B for more details.
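To make the role of $M_t$ concrete, the following is a self-contained dense-matrix sketch of an ellipsoidal TR subproblem solver. It is illustrative only: the whitening substitution, the bisection tolerance, and the omission of the degenerate "hard case" are our simplifications; practical TR codes use Krylov-type solvers instead of eigendecompositions:

```python
import numpy as np

def solve_tr_subproblem(g, B, M, delta):
    """Minimize g^T s + 0.5 s^T B s  subject to  ||s||_M <= delta.

    Transforms the ellipsoidal constraint to a spherical one via the
    substitution s_tilde = L^T s (with M = L L^T), then runs a simple
    More-Sorensen-style search for the boundary multiplier.
    Illustrative sketch; ignores the degenerate 'hard case' for brevity.
    """
    L = np.linalg.cholesky(M)
    Li = np.linalg.inv(L)
    g_t = Li @ g                               # gradient in whitened coordinates
    B_t = Li @ B @ Li.T                        # Hessian in whitened coordinates

    evals, V = np.linalg.eigh(B_t)
    gv = V.T @ g_t
    lam_lo = max(0.0, -evals.min())            # multiplier must make B_t + lam*I PSD

    def step_norm(lam):
        return np.linalg.norm(gv / (evals + lam + 1e-12))

    if lam_lo == 0.0 and step_norm(0.0) <= delta:
        lam = 0.0                              # interior Newton step is feasible
    else:
        lo, hi = lam_lo, lam_lo + 1.0
        while step_norm(hi) > delta:           # bracket the boundary multiplier
            hi *= 2.0
        for _ in range(100):                   # bisection on ||s(lam)|| = delta
            lam = 0.5 * (lo + hi)
            if step_norm(lam) > delta:
                lo = lam
            else:
                hi = lam
        lam = hi
    s_t = V @ (-gv / (evals + lam + 1e-12))
    return Li.T @ s_t                          # map back to original coordinates
```

With $M = I$ this reduces to the familiar spherical subproblem; a non-trivial $M$ simply reweights how far the step may travel along each direction, which is exactly the mechanism the ellipsoidal constraints below exploit.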
4.1 Convergence of ellipsoidal Trust Region methods
Inspired by the success of adaptive gradient methods such as RMSProp, we investigate the use of their preconditioning matrices as norm-inducing matrices for second-order TR methods. One crucial condition for convergence is that the applied norms do not degenerate during the minimization process, in the sense that the ellipsoids do not flatten out completely along any given direction.
Definition 1 (Uniformly equivalent norms).
The norm $\|\cdot\|_{M_t}$ induced by a symmetric positive definite matrix $M_t$ is called uniformly equivalent if there exists a constant $\mu \ge 1$ such that

(8)  $\frac{1}{\mu} \|x\|_2 \le \|x\|_{M_t} \le \mu \|x\|_2 \quad \forall x \in \mathbb{R}^d, \; \forall t.$
We now establish a result which shows that the RMSProp ellipsoid is indeed uniformly equivalent.
Proposition 2 (Uniform equivalence).
The norm induced by the RMSProp preconditioning matrix $R_t$ (Eq. (3)) is uniformly equivalent in the sense of Definition 1.

Consequently, the $R_t$ ellipsoids can directly be applied in any convergent TR framework without losing convergence guarantees. Interestingly, this result cannot be established for the Adagrad ellipsoid $A_t$, which reflects the widely known vanishing stepsize problem that arises because squared gradients are continuously added to the preconditioning matrix. It is mainly this effect that eventually inspired the development of RMSprop (tieleman2012lecture) and Adadelta (zeiler2012adadelta).
Why ellipsoids?
There are many sources of ill-conditioning in neural networks, such as uncentered and correlated inputs (lecun2012efficient), saturated hidden units, and different weight scales in different layers of the network (van1998solving). While the quadratic term in the model (7) does account for such ill-conditioning to some extent, the spherical constraint is completely blind towards the loss surface; it is in general advisable to instead measure distances in norms that reflect the underlying geometry (Chapter 7.7 in conn2000trust). The ellipsoids we propose are such that they allow for longer steps along coordinates that have seen small gradient components in the past and vice versa. Thereby the TR shape is adaptively adjusted to fit the current region of the nonconvex loss landscape. Contrary to adaptive first-order methods, we do not only adapt the shape but also the corresponding diameter ($\Delta_t$), depending on whether or not the local Taylor model is an adequate approximation within the trust region at the current state of the algorithm. This procedure is not only effective when the iterates are in an ill-conditioned convex neighborhood of a minimizer (Figure 1), but it also helps to escape elongated plateaus (Sect. 5).
4.2 A stochastic TR framework for neural network training
Since neural network training often constitutes a large-scale learning problem in which the number of datapoints $n$ is very high, we here opt for a fully stochastic TR framework (chen2018stochastic) in order to circumvent memory issues and reduce computational complexity. Given that the involved function and derivative estimates are sufficiently accurate with fixed probability, such a framework has been shown to retain the convergence rate of deterministic TR methods to stationary points in expectation (blanchet2016convergence). For finite-sum objectives such as Eq. (1), the required level of accuracy can be obtained by simple mini-batching.

(9)  $\rho_t := \frac{\tilde f(w_t) - \tilde f(w_t + s_t)}{m_t(0) - m_t(s_t)},$ where $\tilde f$ denotes the mini-batch estimate of $f$.
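Within each TR iteration, the ratio $\rho_t$ then drives step acceptance and the radius update. A minimal sketch of this rule follows; the threshold and expansion constants are illustrative defaults of ours, not the paper's tuned values:

```python
def tr_radius_update(rho, delta, eta1=0.1, eta2=0.75, gamma=2.0, delta_max=10.0):
    """Standard TR radius rule driven by the agreement ratio rho of Eq. (9).

    rho compares the (subsampled) function decrease to the model decrease:
    values near 1 mean the local quadratic model is trustworthy.
    Returns the new radius and whether the step is accepted.
    """
    if rho < eta1:                              # model was poor: reject and shrink
        return delta / gamma, False
    if rho > eta2:                              # very good model: accept and expand
        return min(gamma * delta, delta_max), True
    return delta, True                          # acceptable model: keep the radius
```

Because $\rho_t$ is computed from the same mini-batch in numerator and model, a small $\rho_t$ signals an over-optimistic radius rather than sampling noise, which is exactly the disentangling effect discussed next.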
The main difference between this and most existing approaches (kohler2017sub; xu2017newton; yao2018inexact; cartis2017global) lies in the computation of $\rho_t$, which is traditionally computed as full function decrease over stochastic model decrease. Computing $\rho_t$ solely based on subsampled quantities has the nice side effect of disentangling the two potential sources of error: an over-optimistic trust region radius and insufficiently small batch sizes. Namely, the quantity $\rho_t$ from Eq. (9) is an unbiased estimate of the ratio used in fully deterministic algorithms, and hence the trust region radius is adjusted purely based on the current adequacy of the local quadratic approximation.

5 Experiments
To validate our claim that ellipsoidal TR methods yield improved performance over spherical ones, we run a large set of experiments on three image datasets and three types of network architectures. As outlined in Section 2, many theoretical arguments suggest the use of regularized Newton methods for the optimization of neural networks, but certain practical objections exist (see Appendix F.2), and so far there is no conclusive empirical evidence (see discussion in Section 2). Hence, we also present a comprehensive benchmark with state-of-the-art first-order methods. We fix the sample size for all gradient methods to 128 but grid-search the stepsize, since the ratio of these two quantities effectively determines the level of stochasticity (jastrzkebski2017three). The TR methods mostly use 512 samples, since they are more likely to overfit very small mini-batches. More details can be found in Appendix C.
Second-order methods
As can be seen in Figure 3, the ellipsoidal TR methods consistently outperform their spherical counterpart in the sense that they reach full accuracy substantially faster on all problems and in all measures (time and epochs). Furthermore, their limit points are in all cases lower than those of the uniform TR method, and especially on the autoencoders this makes an actual difference in the image reconstruction quality (see Figure 12). We thus draw the clear conclusion that the ellipsoidal trust region constraints we propose are to be preferred over their spherical counterpart when training neural networks.

[Figure 3: results for ConvNet, MLP and autoencoder architectures on FashionMNIST and CIFAR-10.]
Benchmark with SGD
While the ellipsoidal TR methods are slightly superior in terms of epochs, they only just manage to keep pace with first-order methods in terms of wall-clock time (Figure 10). Furthermore, the limit points of both first- and second-order methods yield the same order of loss in most experiments. When taking gradient norms into account (reported in Appendix J), we indeed find no spurious local minima and only one saddle point, despite potentially arbitrarily nonconvex loss landscapes. This suggests that second-order information is not needed to circumvent such obstacles in neural network optimization landscapes. Finally, we had to warm-start the TR methods with a few first-order epochs to achieve good performance in terms of time on the convnets (see Appendix I.1 for a detailed discussion).
6 Conclusion
We investigated the use of ellipsoidal trust region constraints for neural networks. We have shown that the RMSProp matrix satisfies the necessary conditions for convergence, and our experimental results demonstrate that ellipsoidal TR methods outperform their spherical counterparts significantly. We thus consider the development of further ellipsoids that can potentially adapt even better to the loss landscape, such as (block-)diagonal Hessian approximations (e.g. bekas2007estimator) or approximations of higher-order derivatives, an interesting direction of future research.

Yet, our comprehensive empirical study also highlights that the value of Hessian information for neural network optimization is limited, for mainly two reasons: 1) second-order methods rarely yield better limit points, which suggests that saddles and spurious local minima are not a major obstacle; 2) low per-iteration costs render gradient methods superior in terms of time. The latter observation suggests that advances in hardware and distributed second-order algorithms (e.g. osawa2018second; dunner2018distributed) will be needed to speed up computations before Newton-type methods can replace (stochastic) gradient methods in deep learning. Finally, since trust region methods come with a large number of hyperparameters, Bayesian hyperparameter optimization is another interesting direction of future work.
References
 (1) Leonard Adolphs, Hadi Daneshmand, Aurélien Lucchi, and Thomas Hofmann. Local saddle point optimization: A curvature exploitation approach. CoRR, abs/1805.05751, 2018.
 (2) Naman Agarwal, Brian Bullins, Xinyi Chen, Elad Hazan, Karan Singh, Cyril Zhang, and Yi Zhang. The case for fullmatrix adaptive regularization. arXiv preprint arXiv:1806.02958, 2018.

(3)
Guillaume Alain, Nicolas Le Roux, and PierreAntoine Manzagol.
Negative eigenvalues of the hessian in deep neural networks.
2018.  (4) Zeyuan AllenZhu. Natasha 2: Faster nonconvex optimization than sgd. In Advances in Neural Information Processing Systems, pages 2680–2691, 2018.
 (5) Zeyuan AllenZhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962, 2018.
 (6) ShunIchi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
 (7) Sue Becker, Yann Le Cun, et al. Improving the convergence of backpropagation learning with second order methods. In Proceedings of the 1988 connectionist models summer school, pages 29–37. San Matteo, CA: Morgan Kaufmann, 1988.
 (8) Costas Bekas, Effrosyni Kokiopoulou, and Yousef Saad. An estimator for the diagonal of a matrix. Applied numerical mathematics, 57(1112):1214–1229, 2007.
 (9) Jose Blanchet, Coralia Cartis, Matt Menickelly, and Katya Scheinberg. Convergence rate analysis of a stochastic trust region method for nonconvex optimization. arXiv preprint arXiv:1609.07428, 2016.
 (10) Leon Bottou. Largescale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
 (11) Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for largescale machine learning. SIAM Review, 60(2):223–311, 2018.
 (12) Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.
 (13) Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i. arXiv preprint arXiv:1710.11606, 2017.
 (14) Yair Carmon, Oliver Hinder, John C Duchi, and Aaron Sidford. " convex until proven guilty": Dimensionfree acceleration of gradient descent on nonconvex functions. arXiv preprint arXiv:1705.02766, 2017.
 (15) Coralia Cartis, Nicholas IM Gould, and Ph L Toint. Complexity bounds for secondorder optimality in unconstrained optimization. Journal of Complexity, 28(1):93–108, 2012.
 (16) Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.
 (17) Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. How Much Patience to You Have?: A Worstcase Perspective on Smooth Noncovex Optimization. Science and Technology Facilities Council Swindon, 2012.
 (18) Coralia Cartis and Katya Scheinberg. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, pages 1–39, 2017.
 (19) Olivier Chapelle and Dumitru Erhan. Improved preconditioner for hessian free optimization. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 201, 2011.
(20) Ruobing Chen, Matt Menickelly, and Katya Scheinberg. Stochastic optimization using a trust-region method and random models. Mathematical Programming, 169(2):447–487, 2018.
 (21) Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
 (22) Andrew R Conn, Nicholas IM Gould, and Philippe L Toint. Trust region methods. SIAM, 2000.
 (23) Frank E Curtis and Daniel P Robinson. Exploiting negative curvature in deterministic and stochastic optimization. arXiv preprint arXiv:1703.00412, 2017.
 (24) Frank E Curtis and Daniel P Robinson. How to characterize the worstcase performance of algorithms for nonconvex optimization. arXiv preprint arXiv:1802.01062, 2018.
(25) Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of O(ε^{-3/2}) for nonconvex optimization. Mathematical Programming, 162(1-2):1–32, 2017.
 (26) Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. arXiv preprint arXiv:1803.05999, 2018.
(27) Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.
 (28) Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067–1077, 2017.
 (29) Simon S Du and Jason D Lee. On the power of overparametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.
 (30) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 (31) Celestine Dünner, Aurelien Lucchi, Matilde Gargiani, An Bian, Thomas Hofmann, and Martin Jaggi. A distributed secondorder algorithm you can trust. arXiv preprint arXiv:1806.07569, 2018.
(32) Kenji Fukumizu and Shun-ichi Amari. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13(3):317–327, 2000.
(33) Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: Online stochastic gradient for tensor decomposition. In COLT, pages 797–842, 2015.
(34) Gabriel Goh. Why momentum really works. Distill, 2017.
(35) Serge Gratton, Clément W Royer, Luís N Vicente, and Zaikun Zhang. Complexity and global rates of trust-region methods based on probabilistic models. IMA Journal of Numerical Analysis, 2017.
(36) Roger Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, pages 573–582, 2016.
(37) Martin T Hagan and Mohammad B Menhaj. Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5(6):989–993, 1994.
(38) Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
(39) Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.
 (40) Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017.
(41) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
(42) Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
 (43) Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
 (44) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 (45) Jonas Moritz Kohler and Aurelien Lucchi. Subsampled cubic regularization for nonconvex optimization. In International Conference on Machine Learning, 2017.
 (46) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
(47) Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.
 (48) Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.
 (49) Xiaoyu Li and Francesco Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. arXiv preprint arXiv:1805.08114, 2018.
(50) Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.
(51) Liu Liu, Xuanqing Liu, Cho-Jui Hsieh, and Dacheng Tao. Stochastic second-order methods for nonconvex optimization with inexact Hessian and gradient. arXiv preprint arXiv:1809.09853, 2018.
(52) James Martens. Deep learning via Hessian-free optimization. In ICML, volume 27, pages 735–742, 2010.
 (53) James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
 (54) James Martens and Roger Grosse. Optimizing neural networks with kroneckerfactored approximate curvature. In International conference on machine learning, pages 2408–2417, 2015.
 (55) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182, 2017.
(56) Eiji Mizutani and Stuart E Dreyfus. Second-order stagewise backpropagation for Hessian-matrix analyses and investigation of negative curvature. Neural Networks, 21(2-3):193–203, 2008.
(57) Yu Nesterov. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159–181, 2008.
(58) Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
(59) Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
 (60) Jorge Nocedal and Stephen J Wright. Numerical optimization, 2nd Edition. Springer, 2006.
 (61) Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Secondorder optimization method for large minibatch: Training resnet50 on imagenet in 35 epochs. arXiv preprint arXiv:1811.12019, 2018.
 (62) Ioannis Panageas and Georgios Piliouras. Gradient descent only converges to minimizers: Nonisolated critical points and invariant regions. arXiv preprint arXiv:1605.00405, 2016.
 (63) Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
(64) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
(65) Barak A Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.
 (66) Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pages 314–323, 2016.
(67) Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 1951.
(68) Nicol N Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002.
(69) Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. arXiv preprint arXiv:1703.07950, 2017.
 (70) Trond Steihaug. The conjugate gradient method and trust regions in large scale optimization. SIAM Journal on Numerical Analysis, 20(3):626–637, 1983.
(71) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5 - RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
(72) Patrick Van Der Smagt and Gerd Hirzinger. Solving the ill-conditioning in neural network learning. In Neural Networks: Tricks of the Trade, pages 193–206. Springer, 1998.
(73) Hao Wang, Naiyan Wang, and Dit-Yan Yeung. Collaborative deep learning for recommender systems. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1235–1244, 2015.
 (74) Rachel Ward, Xiaoxia Wu, and Leon Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. arXiv preprint arXiv:1806.01811, 2018.
 (75) Eugene P Wigner. Characteristic vectors of bordered matrices with infinite dimensions i. In The Collected Works of Eugene Paul Wigner, pages 524–540. Springer, 1993.
(76) Peng Xu, Farbod Roosta-Khorasani, and Michael W Mahoney. Second-order optimization for non-convex machine learning: An empirical study. arXiv preprint arXiv:1708.07827, 2017.
(77) Peng Xu, Farbod Roosta-Khorasani, and Michael W Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. arXiv preprint arXiv:1708.07164, 2017.
(78) Zhewei Yao, Peng Xu, Farbod Roosta-Khorasani, and Michael W Mahoney. Inexact non-convex Newton-type methods. arXiv preprint arXiv:1802.06925, 2018.
(79) Ya-xiang Yuan. Recent advances in trust region algorithms. Mathematical Programming, 151(1):249–281, 2015.
 (80) Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Appendix A Notation
Throughout this work, scalars are denoted by regular lower-case letters, vectors by bold lower-case letters, and matrices as well as tensors by bold upper-case letters. A norm without subscript denotes an arbitrary norm. For a symmetric positive definite matrix we introduce a compact notation for the induced ellipsoidal norm.
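The compact notation itself was lost in extraction; assuming it is the standard weighted norm induced by a symmetric positive definite matrix M, a reconstruction (not the authors' verbatim definition) reads:

```latex
% Ellipsoidal norm induced by a symmetric positive definite matrix M:
\|x\|_{M} := \sqrt{x^\top M x},
% which reduces to the Euclidean norm for M = I.
```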
Appendix B Equivalence of Preconditioned Gradient Descent and First-Order Trust Region Methods
Theorem 2 (Theorem 1 restated).
A preconditioned gradient step
(10) 
with stepsize and symmetric positive definite preconditioner minimizes a first-order local model around the current iterate over the ellipsoid induced by the preconditioner, in the sense that
(11)  
Proof.
We start the proof by noting that the optimization problem (11) is convex. The constraint satisfies the Slater condition since a strictly feasible point exists. As a result, any KKT point is a feasible minimizer and vice versa.
Let denote the Lagrange dual of (11)
(12) 
Any point is a KKT point if and only if the following system of equations is satisfied
(13)  
(14)  
(15)  
(16) 
For the multiplier as given in Eq. (4) we have that
(17) 
and thus (14) and (15) hold with equality, so the candidate point is feasible. Furthermore,
(18)  
is zero for the candidate point. As a result, it is a KKT point of the convex problem (11), which proves the assertion.
∎
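A small numerical sketch of this statement, under the assumed notation s = -α A⁻¹g for the preconditioned step and ‖s‖_A = √(sᵀAs) for the ellipsoidal norm (all variable names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
g = rng.normal(size=d)                    # gradient at the current iterate
A = rng.normal(size=(d, d))
A = A @ A.T + np.eye(d)                   # symmetric positive definite preconditioner
alpha = 0.1

s_pgd = -alpha * np.linalg.solve(A, g)    # preconditioned gradient step
r = np.sqrt(s_pgd @ A @ s_pgd)            # its ellipsoidal radius ||s||_A

# The linear model m(s) = g^T s over {s : ||s||_A <= r} is minimized by
# s_pgd: no random point on the ellipsoid boundary does better.
for _ in range(1000):
    v = rng.normal(size=d)
    v = r * v / np.sqrt(v @ A @ v)        # random point with ||v||_A = r
    assert g @ s_pgd <= g @ v + 1e-12
```

This mirrors the Cauchy-Schwarz argument in the A-inner product: g·v ≥ -r·√(gᵀA⁻¹g) = g·s_pgd for all feasible v.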
To illustrate this theoretical result we run gradient descent and Adagrad as well as the two corresponding first-order TR methods (essentially Algorithm 1 with a model based on a first-order Taylor expansion) on an ill-conditioned quadratic problem. While the first TR method optimizes a linear model within a ball in each iteration, the second optimizes the same model over the ellipsoid given by the Adagrad matrix. The results in Figure 5 show that both methods behave very similarly to their constant-stepsize analogues.
[Figure 5: figures/ill_conditioned_walk_withTR.pdf | figures/ill_conditioned_loss_withTR.pdf]
Appendix C Convergence of ellipsoidal TR methods
Spherically constrained TR methods
Under standard smoothness assumptions, spherical TR algorithms reach approximate first-order criticality after O(ε^{-2}) iterations and additionally almost positive curvature within O(ε^{-3}) iterations. These rates are attained without actually knowing the Hessian's Lipschitz constant. The complete statement of the convergence results for TR methods can be found in Theorem 4.3 of [15]. Interestingly, these rates can be improved to match the (optimal) first-order worst-case complexity O(ε^{-3/2}) of cubic regularization by applying small modifications to the TR framework. As stated in Section E.2.2, the involved subproblems do not need to be solved globally in each iteration.
For both cubic regularization and trust region methods, many stochastic extensions have emerged in the literature that alleviate the need to compute exact derivative information while retaining the above-mentioned convergence guarantees with high probability [45, 76, 77, 9, 35]. For the deep learning setting, the analysis of [9] is most relevant since it allows the algorithm to run solely based on subsampled function evaluations.
Ellipsoidally constrained TR methods
In order to prove such results for ellipsoidal trust region methods, one must ensure that the applied norms remain coherent throughout the entire minimization process, in the sense that the ellipsoids do not flatten out (or blow up) completely along any given direction. This intuition is formalized in Assumption 1, which we restate here for the sake of clarity.
Definition 2 (Definition 1 restated).
There exists a constant such that
(19) 
Having uniformly equivalent norms is necessary and sufficient for ellipsoidal TR methods to enjoy the same convergence rate as classical ball-constrained trust region algorithms.
Towards this end, [22] identify the following sufficient condition, on the basis of which we will prove that our proposed ellipsoid is indeed uniformly equivalent under some mild assumptions.
Lemma 1 (Theorem 6.7.1 in [22]).
Proposition 3 (Uniform equivalence).
Proof.
The basic building block of our ellipsoid matrix consists of the current and past stochastic gradients
(21) 
We consider a matrix which is built up as follows (this is a generalization of the diagonal variant proposed by [71], which preconditions the gradient step by an element-wise division with the square root of a running estimate of the squared gradients):
(22) 
From the construction of the matrix it follows directly that for any unit-length vector we have
(23)  
which proves the lower bound. Now, let us consider the upper end of the spectrum. Towards this end, recall the geometric series expansion
(24) 
and the fact that the matrix is a sum of exponentially weighted rank-one positive semidefinite matrices of the form . Thus
where the latter inequality holds by assumption for any sample size. Combining these facts we get that
(25)  
As a result we have that
(26) 
Finally, to achieve uniform equivalence we need the r.h.s. of (26) to be suitably bounded. This gives rise to a quadratic condition, namely
(27) 
which holds for any and any as long as
(28) 
Such an always exists but one needs to choose smaller and smaller values as the upper bound on the gradient norm grows. For example, the usual value is valid for all .
All of the above arguments naturally extend to the diagonal preconditioner .
∎
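The exact recursion is elided in the extraction; the sketch below shows the exponentially weighted construction of gradient outer products that the surrounding proof describes, i.e. a full-matrix analogue of the RMSprop estimate of [71]. The decay rate `beta` and damping `eps` are illustrative values, and the two eigenvalue checks mirror the lower and upper bounds used in the proof:

```python
import numpy as np

def ellipsoid_matrix(grads, beta=0.9, eps=1e-8):
    """Exponentially weighted sum of gradient outer products g g^T,
    damped by eps * I so the matrix stays positive definite."""
    d = grads[0].size
    M = np.zeros((d, d))
    for g in grads:
        M = beta * M + (1.0 - beta) * np.outer(g, g)
    return M + eps * np.eye(d)

rng = np.random.default_rng(1)
grads = [rng.normal(size=4) for _ in range(50)]
M = ellipsoid_matrix(grads)
eigs = np.linalg.eigvalsh(M)

# Uniform equivalence requires the spectrum bounded away from zero and
# infinity: the damping gives the lower bound, and the geometric weights
# (which sum to less than one) give the upper bound via max_i ||g_i||^2.
assert eigs.min() >= 5e-9
assert eigs.max() <= max(g @ g for g in grads) + 1e-8
```

Because the weights decay geometrically, old gradients are forgotten, which is exactly what prevents the Adagrad-style blow-up discussed next.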
Interestingly, this result cannot be established for the Adagrad-inspired ellipsoid, which reflects the commonly observed effect that the stepsizes of first-order Adagrad shrink over time as squared gradients are continuously added to the preconditioning matrix. It is mainly this effect that eventually inspired the development of RMSprop [71], Adadelta [80] and similar approaches.
Appendix D Diagonal Dominance in Neural Networks
D.1 Proof of Proposition 1
Proposition 4 (Proposition 1 restated).
For a random Gaussian Wigner matrix formed as
(29) 
where the entries are i.i.d. draws [75], the expected share of diagonal mass amounts to
(30) 
Proof.
(31)  
which simplifies to the stated expression if the diagonal and off-diagonal elements come from the same Gaussian distribution. ∎
For the sake of simplicity we only consider Gaussian Wigner matrices, but the above argument naturally extends to any distribution with positive expected absolute values; i.e., we only exclude the Dirac delta as a probability density.
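A quick Monte Carlo check of our reading of the proposition (our own illustration, not from the paper): when all entries come from one Gaussian, the expected share of absolute diagonal mass is n·E|x| / (n²·E|x|) = 1/n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 200
shares = []
for _ in range(trials):
    U = rng.normal(size=(n, n))
    W = np.triu(U) + np.triu(U, 1).T      # symmetric, every entry N(0, 1)
    shares.append(np.abs(np.diag(W)).sum() / np.abs(W).sum())

# n diagonal entries out of n^2 total, all identically distributed,
# so the expected share of absolute diagonal mass is 1/n.
mean_share = float(np.mean(shares))
```

For n = 100 the empirical mean lands very close to 1/n = 0.01, illustrating how quickly the diagonal share vanishes with dimension.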
[Figure: CONV | MLP]
Appendix E Background on second-order optimization
E.1 Newton's Method
The canonical second-order method is Newton's method. This algorithm uses the inverse Hessian as a scaling matrix and thus takes updates of the form
(32) 
which is equivalent to optimizing the local quadratic model
(33) 
to first-order stationarity. Using curvature information to rescale the steepest descent direction gives Newton's method the useful property of being linearly scale invariant. This gives rise to a problem-independent local convergence rate that is superlinear, and even quadratic in the case of Lipschitz continuous Hessians (see Theorem 3.5 in [60]), whereas gradient descent at best achieves linear local convergence [58].
However, there are certain drawbacks associated with applying the classical Newton's method. First of all, the Hessian matrix may be singular and thus not invertible. Secondly, even if it is invertible, the local quadratic model (Eq. (33)) that is minimized in each iteration may simply be an inadequate approximation of the true objective. As a result, the Newton step is not necessarily a descent step; it may approach arbitrary critical points (including local maxima) or even diverge. Finally, the costs of forming and inverting the Hessian scale quadratically and cubically in the problem dimension, respectively, and are thus prohibitively high for high-dimensional applications.
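A minimal sketch of the plain Newton iteration on a strictly convex toy objective, where the drawbacks above do not bite (the objective and all names are illustrative):

```python
import numpy as np

def newton(grad, hess, x0, steps=10):
    """Plain Newton's method: x <- x - H(x)^{-1} g(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# Strictly convex toy objective f(x) = sum_i exp(x_i) + x_i^2,
# so the Hessian is diagonal and always positive definite.
grad = lambda x: np.exp(x) + 2 * x
hess = lambda x: np.diag(np.exp(x) + 2)

x = newton(grad, hess, [1.0, -1.0])
```

A handful of iterations already drives the gradient norm to machine precision, illustrating the quadratic local rate; on a nonconvex or singular problem the same loop can stall or diverge, which is what motivates the safeguards below.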
E.2 Trust Region Methods
E.2.1 Outer iterations
Trust region methods are among the most principled approaches to overcome the above-mentioned issues. These methods also construct a quadratic model but constrain the subproblem in such a way that the step is restricted to stay within a certain radius, inside which the model is trusted to be sufficiently adequate:
(34) 
Hence, contrary to line-search methods, this approach finds the step and its length simultaneously by optimizing (34). Subsequently, the actual decrease is compared to the predicted decrease, and the step is only accepted if the ratio exceeds some predefined success threshold. Furthermore, the trust region radius is decreased whenever the ratio falls below this threshold, and it is increased whenever the ratio exceeds the "very successful" threshold. Thereby, the algorithm adaptively measures the accuracy of the second-order Taylor model (note that such models assume constant curvature), which may change drastically over the parameter space depending on the behaviour of the higher-order derivatives, and adapts the effective length along which the model is trusted accordingly. See [22] for more details.
As a consequence, the plain Newton step is only taken if it lies within the trust region radius and yields a certain amount of decrease in the objective value. Since many functions look approximately quadratic close to a minimizer, the radius can be shown to grow asymptotically under mild assumptions, such that eventually full Newton steps are taken in every iteration, which retains the local quadratic convergence rate [22].
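The outer loop described above can be sketched as follows; the crude subproblem solver, the thresholds eta1/eta2, and the radius update factors are illustrative simplifications, not the authors' Algorithm 1:

```python
import numpy as np

def cauchy_or_newton(g, H, delta):
    """Crude subproblem solver for the sketch: take the Newton step when
    it fits inside the region and decreases the model, else a Cauchy step."""
    try:
        s = -np.linalg.solve(H, g)
        if np.linalg.norm(s) <= delta and g @ s + 0.5 * s @ H @ s < 0:
            return s
    except np.linalg.LinAlgError:
        pass
    gHg = g @ H @ g
    t = delta / np.linalg.norm(g)
    if gHg > 0:
        t = min(t, (g @ g) / gHg)
    return -t * g

def trust_region(f, grad, hess, x0, delta=1.0, eta1=0.1, eta2=0.75,
                 max_iter=500, tol=1e-10):
    """Sketch of the TR outer loop: accept/reject via the ratio rho of
    actual to predicted decrease, then grow or shrink the radius."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        s = cauchy_or_newton(g, H, delta)
        pred = -(g @ s + 0.5 * s @ H @ s)        # predicted decrease (> 0)
        rho = (f(x) - f(x + s)) / pred           # actual / predicted
        if rho > eta1:                           # successful: accept step
            x = x + s
        if rho > eta2:                           # very successful: grow radius
            delta *= 2.0
        elif rho <= eta1:                        # unsuccessful: shrink radius
            delta /= 2.0
    return x

# Nonconvex double well f(x, y) = x^4 - 2x^2 + y^2 with minima at (+-1, 0).
f = lambda x: x[0] ** 4 - 2 * x[0] ** 2 + x[1] ** 2
grad = lambda x: np.array([4 * x[0] ** 3 - 4 * x[0], 2 * x[1]])
hess = lambda x: np.array([[12 * x[0] ** 2 - 4, 0.0], [0.0, 2.0]])

x = trust_region(f, grad, hess, [0.5, 0.5])
```

Started in the concave region around the origin, the Newton step is rejected by the model check and a Cauchy step is taken instead; once the iterate reaches the convex basin, full Newton steps are accepted and the radius grows, matching the asymptotic behaviour described above.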
[Figure: figures/newton_ascending.pdf | figures/t_around_saddle.pdf]
E.2.2 Subproblem solver
Interestingly, there is no need to optimize (34) to global optimality to retain the remarkable global convergence properties of TR algorithms. From a practical point of view this is very good news, since globally optimizing the involved quadratics scales quadratically in the problem dimension, which can be very large for neural networks. Instead, it suffices to do better than the Cauchy point and the eigenpoint (the model minimizers along the gradient and along the eigendirection associated with the smallest eigenvalue, respectively) simultaneously. One way to ensure this is to minimize the model in nested Krylov subspaces. These subspaces naturally include the gradient direction as well as increasingly accurate estimates of the leading eigendirection
(35) 
until (for example) the stopping criterion
(36) 
is met, which requires increasing accuracy as the underlying trust region algorithm approaches criticality. Conjugate gradients and the Lanczos method are two iterative routines that implicitly build up a conjugate and an orthogonal basis for such a Krylov space, respectively; both converge linearly on quadratic objectives with a square-root dependency on the condition number of the Hessian [22]. We here employ the preconditioned Steihaug-Toint CG method [70] in order to cope with possible boundary solutions of (34); similar techniques exist for the Lanczos solver as well, for which we also provide code. As preconditioning matrix for CG we use the same matrix as for the ellipsoidal constraint.
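A sketch of the (unpreconditioned) Steihaug-Toint CG solver operating purely on Hessian-vector products; it stops on the trust region boundary when it detects negative curvature or when the iterate would leave the region (a simplified illustration, not our exact implementation):

```python
import numpy as np

def steihaug_cg(hvp, g, delta, tol=1e-10, max_iter=250):
    """Approximately minimize g^T s + 0.5 s^T H s subject to ||s|| <= delta,
    using only Hessian-vector products hvp(v)."""
    s = np.zeros_like(g)
    r, d = g.copy(), -g.copy()
    if np.linalg.norm(r) < tol:
        return s
    for _ in range(max_iter):
        Hd = hvp(d)
        dHd = d @ Hd
        if dHd <= 0:                         # negative curvature: go to boundary
            return s + _to_boundary(s, d, delta)
        alpha = (r @ r) / dHd
        if np.linalg.norm(s + alpha * d) >= delta:
            return s + _to_boundary(s, d, delta)
        s = s + alpha * d
        r_new = r + alpha * Hd
        if np.linalg.norm(r_new) < tol:
            return s
        beta = (r_new @ r_new) / (r @ r)
        d, r = -r_new + beta * d, r_new
    return s

def _to_boundary(s, d, delta):
    # positive root tau of ||s + tau * d|| = delta
    a, b, c = d @ d, 2 * s @ d, s @ s - delta ** 2
    return ((-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)) * d

# On a PD quadratic with a large radius this reproduces the Newton step.
H = np.diag([1.0, 2.0, 3.0])
g = np.ones(3)
s = steihaug_cg(lambda v: H @ v, g, delta=100.0)
```

Note that the solver never forms the Hessian, which is what makes the Hessian-free setting of the main text possible.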
E.3 Damped (Gauss-)Newton methods
An alternative to actively constraining the region within which the model is trusted is to instead penalize the step norm in each iteration in a Lagrangian manner. This is done by so-called damped Newton methods, which add a multiple of the identity matrix to the second-order term in the model, leading to the update step
(37) 
This system can also be solved Hessian-free by conjugate gradients (or other Krylov subspace methods). The penalty parameter acts inversely to the trust region radius and is often updated accordingly.
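The damped update can be sketched in a few lines; as noted above, it interpolates between a (Gauss-)Newton step for a small penalty and a short gradient step for a large one (the matrix and values below are illustrative):

```python
import numpy as np

def damped_step(g, B, lam):
    """Levenberg-Marquardt style update: solve (B + lam * I) s = -g."""
    return np.linalg.solve(B + lam * np.eye(len(g)), -g)

g = np.array([1.0, -2.0])
B = np.array([[2.0, 0.5], [0.5, 1.0]])   # PD curvature matrix (e.g. a GGN)

s_newton = damped_step(g, B, 1e-12)      # lam -> 0: plain (Gauss-)Newton step
s_grad = damped_step(g, B, 1e6)          # large lam: roughly -g / lam
```

The two limits make the inverse relationship to the trust region radius concrete: a large penalty corresponds to a tiny trusted region.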
Many algorithms in the existing literature replace the Hessian in (37) with the Generalized Gauss-Newton (GGN) matrix [52, 19] or an approximation of the latter [54]. This matrix constitutes the first part of the well-known Gauss-Newton decomposition
(38) 
where the two factors are the first and second derivative of the loss, assuming a one-dimensional network output (binary classification or regression) for simplicity here.
It is interesting to note that for convex loss functions like mean-squared error and cross-entropy, the GGN matrix of neural networks is equivalent to the Fisher matrix used in natural gradient descent [6, 63]. As can be seen in (38), the GGN matrix is positive semidefinite (and low rank if the number of samples is smaller than the dimension). As a result, there exist no second-order convergence guarantees for such methods on general non-convex problems. On the other hand, the GGN also drops possibly positive terms from the Hessian (see (38)), and hence it is not guaranteed to be an upper bound on the latter in the PSD sense. Essentially, GGN approximations assume that the network is piecewise linear, and thus the GGN and Hessian matrices only coincide in the case of linear and ReLU activations or non-curved loss functions. For any other activation the GGN matrix may approximate the Hessian only asymptotically, namely if the residual terms in (38) vanish. In nonlinear least squares such problems are called zero-residual problems, and Gauss-Newton methods can be shown to achieve quadratic local convergence there. In any other case the convergence rate does not exceed the linear local convergence bound of gradient descent. In practice, however, there are cases where deep neural nets do show negative curvature in the neighborhood of a minimizer [11]. When the penalty parameter is small, the step is similar to the plain Gauss-Newton update, whereas for a very large penalty, Eq. (37) approximates a gradient descent update with a very small stepsize. Such algorithms are commonly known as Levenberg-Marquardt algorithms; they were originally tailored towards solving nonlinear least squares problems [60] but were proposed for neural network training early on [37]. Finally, [27] propose the use of the absolute Hessian instead of the GGN matrix in a framework similar to (37). This method has been termed saddle-free Newton, even though its manifold of attraction to a given saddle is non-empty (it is the same as that of gradient descent, which renders the method unable to escape, e.g., when initialized right on a saddle point; to be fair, the manifold of attraction for gradient descent constitutes a measure-zero set [48]).
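The zero-residual remark can be made concrete on a one-parameter nonlinear least-squares toy problem (our own illustration, not from the paper): the GGN term is always nonnegative, differs from the full Hessian away from the solution, and coincides with it exactly where the residual vanishes:

```python
import numpy as np

# One-parameter nonlinear least squares: f(t) = 0.5 * (sin(t) - y)^2.
# Full Hessian:  f''(t) = cos(t)^2 + (sin(t) - y) * (-sin(t))
# GGN part:      cos(t)^2          (the residual term is dropped)
y = 0.3
r = lambda t: np.sin(t) - y      # residual
dr = np.cos                      # residual derivative
d2r = lambda t: -np.sin(t)       # residual second derivative

t = 1.0
ggn = dr(t) ** 2                          # always >= 0
hess = dr(t) ** 2 + r(t) * d2r(t)         # can differ, even in sign
gap = hess - ggn                          # nonzero away from zero residual

t_star = np.arcsin(y)                     # zero-residual point: r(t_star) = 0
```

At `t_star` the dropped residual term vanishes, so GGN and Hessian agree; at the generic point `t = 1.0` they differ noticeably, which is the regime in which Gauss-Newton loses its quadratic local rate.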
[Figure: figures/gn_around_saddle.pdf | figures/abs_H_around_saddle.pdf]
E.3.1 Comparison to trust region methods
Contrary to TR methods, though, these regularized Newton methods never take plain Newton steps since the regularization is always on. Furthermore, if a positive definite Hessian approximation like the Generalized Gauss-Newton matrix is used, the algorithm is not capable of exploiting negative curvature, and there are cases in neural network training where the Hessian is much better conditioned than the Gauss-Newton matrix [56] (also see Figure 8). While some scholars consider positive definiteness a desirable feature [52, 19], we want to point out that following negative curvature directions is needed to escape saddle points. It can also be meaningful to follow directions of negative eigenvalue outside a saddle, since they guarantee progress just as a gradient descent step does (both under certain stepsize conditions), and one cannot conclude a priori which one is better in general [23, 3]. Finally, there is a growing body of first-order algorithms that explicitly include additional negative curvature steps to derive state-of-the-art time complexities in classical minimization problems [4, 66, 14, 40] as well as in min-max problems to avoid convergence to unstructured saddle points [1].
Despite these theoretical considerations, many methods based on GGN matrices have been applied to neural network training (see [53] and references therein); in particular, the Hessian-free implementations of [52, 19] can be run very cheaply since they make use of an efficient procedure for computing GGN-vector products [68]. While these works provide promising results, [76] report superior performance of TR algorithms on the autoencoder architecture presented in Section J. Finally, [54] and [36] propose an algorithm termed KFAC that uses a non-diagonal, high-rank approximation of the Fisher information matrix, which can be incorporated into an approximate Levenberg-Marquardt scheme (when using KFAC within a damped Newton scheme, one needs to add the damping term to the GGN approximation before inverting; the diagonal blocks are then no longer expressible as Kronecker products, so [54] must instead use an approximation to the LM approach) for both feedforward and convolutional neural networks. Since the inverse is readily available, no Krylov subspace methods need to be run to find the next step.
Unfortunately, we are not aware of any comprehensive empirical comparison between GaussNewton/Fisher Information and Hessian based approaches but we consider this to be a very interesting direction of future research.
Appendix F Using Hessian information in Neural Networks
F.1 Theoretical advantages
TR methods present several theoretical advantages, which are partly illustrated in Figure 7:

Contrary to gradient descent, TR methods enjoy a superlinear local convergence rate independent of the local conditioning [60].

Contrary to gradient descent, Newton's method, and damped Newton methods (such as Levenberg-Marquardt), TR methods can leverage negative curvature information, which is believed to be helpful for nonconvex optimization [3, 23, 14]. Thereby they escape elongated plateaus and (strict) saddle points quickly and thus provably converge to second-order stationary points. In theory, saddles emerge even in shallow feedforward networks [32].

Just like damped Newton methods, TR methods update their iterates based on Hessian-vector products, which can be computed at roughly the cost of two gradient computations [65].
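To illustrate the two-gradient cost of such products (Pearlmutter's R-operator [65] computes them exactly via automatic differentiation; the central difference below is only a simple stand-in):

```python
import numpy as np

def hvp_fd(grad, x, v, eps=1e-6):
    """Hessian-vector product via central differences of the gradient:
    H v ~ (grad(x + eps*v) - grad(x - eps*v)) / (2*eps).
    Costs two gradient evaluations, like the exact R-operator."""
    return (grad(x + eps * v) - grad(x - eps * v)) / (2.0 * eps)

# Check on f(x) = 0.5 x^T A x, whose exact Hessian-vector product is A v.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x
x = np.array([0.5, -1.0])
v = np.array([1.0, 2.0])
```

In deep learning frameworks the finite difference is replaced by a second backward pass, but the cost argument is the same: the Hessian is never materialized.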
F.2 Practical objections
In practice, however, these theoretical advantages do not always play out. In particular, when training neural networks one must take the following considerations into account.

While superlinear local convergence is attractive in traditional optimization, the ultimate goal of any learning task is generalization, and thus it might not even be desirable to find a high-accuracy solution on the training data. Yet, second-order methods can also be regularized with early stopping, weight decay, dropout and similar techniques.

In practice, the value of worst-case convergence rates is questionable since they are ultimately tailored towards worst-case functions that rarely occur in real-world problems [24]. In particular, (stochastic) gradient descent often finds a stationary point much faster than its theoretical worst-case rate predicts.

Second-order methods may suffer substantially more from mini-batching than first-order methods do, since much more information is extracted from the batch in each second-order step, which leads to very noisy updates due to overfitting the batch. Yet, many real-world problems render large-batch settings impossible due to memory restrictions.

In Figure J.2 we report that small batches lead to an estimate of the Hessian's Lipschitz constant with large variance and a surprising tendency towards overestimation, which may lead to overly conservative stepsizes.

At this point we refer to the following Appendix H for a detailed conclusion regarding the above discussion on the basis of our experimental findings.
Appendix G Experimental results overview
G.1 Ellipsoidal Trust Region vs. Uniform Trust Region
[Figure grid: Convolution | Fully-Connected | Autoencoder; rows: CIFAR10, MNIST, FashionMNIST]
G.2 Ellipsoidal Trust Region vs. First-order Optimizers
[Figure grid: Convolution | Fully-Connected | Autoencoder; rows: CIFAR10, MNIST, FashionMNIST]
The above figures are intended to give the reader a quick overview of our results in terms of loss values. Please see Appendix J for more details on gradient norms as well as train and test accuracy. There, we also provide the applied parameters, curvature computations, and reconstructed autoencoder images.
Appendix H Conclusion from Experiments
Observation I: Ellipsoidal TR methods consistently perform better than their spherical counterparts.
We find that ellipsoidal TR methods consistently and significantly outperform their spherical counterparts, in the sense that at least one of them reaches the target accuracy mark faster than spherical TR in all of our tested settings, both in terms of epochs and time. Furthermore, the limit points of our proposed methods are in all cases lower than or equal to those of the uniform TR methods, and especially on the autoencoder architectures this makes an actual difference in image reconstruction quality (see Figure 12).
Observation II: TR methods mostly make faster progress than first-order methods in terms of epochs.
In summary, Figure 4 shows that TR methods outperform stochastic gradient methods in terms of epochs, in the sense that in each of our settings at least one of the ellipsoidal TR algorithms reaches the target accuracy faster than all gradient methods. One exception is the autoencoder setting, in which first-order methods are superior.
Observation III: In terms of time, the stochastic ellipsoidal TR methods only just manage to keep pace with first-order methods.
As can be seen in Figure 10, the stochastic ellipsoidal TR method can roughly keep pace with first-order methods, but the advantage seen in the epoch plots (Fig. 4) is diminished substantially. We hence believe that advances in both hardware and distributed algorithms will be needed to enable a more widespread use of second-order methods in deep learning.
Observation IV: No spurious local minima and almost no saddles. Underparametrization can lead to severe ill-conditioning.
The limit points of all methods yield the same loss in the vast majority of our experiments, which is somewhat surprising given that we use ten repeated random initializations in potentially very nonconvex and complicated landscapes. Nevertheless, no method gets stuck in a spurious local minimum or saddle point for long. The only method that at times stays above all others is TR, but a closer look at the gradient norm reveals that it has not yet converged to a critical point of any kind (see Appendix J). One exception is the CIFAR10 MLP, for which TR finds much better loss values (and much higher accuracy) than the first-order methods. Again, we note that the gradient norms stay elevated for all other methods (see Figure J.4) and thus hypothesize that this behaviour is due to the heavy underparametrization of the network. In fact, we find that the maximum curvature in this network is substantially higher than in the Fashion- and MNIST MLPs (see Figure J.2).
As a matter of fact, the only architecture that gives rise to saddle points is the autoencoder, on which the SGD and uniform TR losses plateau (Fig. 10) while the gradient norm becomes very small (e.g. Fig. J.6.3). Yet, SGD eventually escapes this plateau, which is accompanied by a sudden increase in gradient norm. We note that the ellipsoidal TR methods do find and leverage negative curvature directions to escape this area.
Observation V: Warm-starting helps TR methods on curvy convnets.
We find improved performance for our second-order algorithms on the convolutional networks when starting the TR methods after a few first-order epochs. While we use only one first-order epoch for MNIST and two for FashionMNIST, we need as many as first-order iterations to reach full accuracy within the 2000 seconds of training time on CIFAR-10. As described in Appendix I.1, warm-starting more sophisticated optimization algorithms with a few epochs of SGD is common in the literature [66, 41, 55, 27].
Appendix I Things we tried that did not work (so well)
Finally, and in order to complete the picture, we report here some approaches that did not work out as well as we expected.
i.1 Training ConvNets with Trust Region without warm-start
First off, we note that warm-starting with vanilla SGD is standard for other techniques that aim to obtain lower-variance estimates of the optimum, such as variance reduction (see [66, 41] for image classification with MLPs), iterate averaging (see e.g. [55] and references therein for learning language models with RNNs) and saddle-free Newton [27]. The motivation for this approach is to start with SGD in order to quickly forget the initialization and then fine-tune with more sophisticated algorithms.
Using such an initialization, we find significantly better performance on convolutional networks, especially for the CIFAR-10 dataset. While we use only one first-order epoch for MNIST and two for FashionMNIST, we need as many as first-order iterations to reach full accuracy within the 2000 seconds of training time on CIFAR-10.
[Table: ConvNet warm-start settings for MNIST, FashionMNIST and CIFAR]
We hypothesize that the warm start is needed because sharp landscapes can render second-order algorithms too conservative. In particular, the CIFAR-10 convnet is the smallest of the convnet architectures and yet faces the hardest dataset. We thus believe that this network is not overparameterized enough to bring about a fairly smooth and uncomplicated loss landscape. Indeed, as reported in Figure J.2, the maximum curvature in this network is about 1000 times higher than that of the other convnets, and the other convnets themselves have at least 10x higher maximum eigenvalues than the corresponding MLPs. In such a landscape, TR methods are likely to be overconservative, since the increased curvature yields very narrow quadratic approximations.¹⁰

¹⁰ Note that Figure J.2 only shows the upper end of the spectrum. However, when we examined the conjugate gradient steps in the early subproblems after initialization, we found that these were also along very curvy directions, which leads to a tiny stepsize, since CG divides the search direction by the curvature along that line. This can drastically hurt performance if the initialization is not close to any good local minimum. In this regard, note that the SGD stepsize of that we employ on the problem is way beyond the theoretical maximum of , which suggests that it is indeed meaningful to quickly navigate out of a potentially very complex initial region.
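The point that CG takes tiny steps along high-curvature directions follows directly from the CG step length α = rᵀr / (dᵀHd). A toy sketch on an explicit 2×2 quadratic (our own construction, not the paper's code):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(H, v):
    return [sum(hij * vj for hij, vj in zip(row, v)) for row in H]

def first_cg_step(H, g):
    """Length of the first CG step for H s = -g along d = -g:
    alpha = r^T r / (d^T H d), i.e. inversely proportional to the
    curvature d^T H d along the search direction."""
    r = g[:]                # residual at s = 0
    d = [-x for x in r]     # steepest-descent direction
    return dot(r, r) / dot(d, matvec(H, d))

g = [1.0, 0.0]
flat  = [[1.0, 0.0], [0.0, 1.0]]      # mild curvature along d
sharp = [[1000.0, 0.0], [0.0, 1.0]]   # 1000x the curvature along d

# Same gradient, 1000x the curvature -> 1000x smaller step.
assert abs(first_cg_step(flat, g) - 1.0) < 1e-12
assert abs(first_cg_step(sharp, g) - 1e-3) < 1e-12
```

With the maximum curvature of the CIFAR-10 convnet roughly three orders of magnitude above the others, this division alone explains much of the overconservatism.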
i.2 Lanczos subproblem solver
In theory, the Lanczos solver yields much better steps than CG whenever a direction of negative curvature is found. However, an efficient implementation of this routine needs matrix factorization and linear system solvers that can exploit the special banded structure of the involved matrices, such as tridiagonality. While such functions exist for CPUs (e.g. in SciPy), we could not find any GPU implementation for PyTorch, and using the standard routines for dense matrices rendered the method far too slow. Thus, all of our results are presented using the CG solver, but we think that even better performance can be achieved once an efficient Lanczos GPU implementation exists, particularly because we do find that the CG solver detects directions of negative curvature quite frequently. For that purpose, we provide code for a CPU-tailored Generalized Lanczos Trust Region (GLTR) method, inspired by Algorithms 7.3.4 and 7.3.6 of [22], in the hope of facilitating future development of a GPU version.
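For concreteness, here is a minimal pure-Python sketch of the Lanczos iteration that builds the tridiagonal matrix T = QᵀHQ underlying GLTR; the CPU/GPU bottleneck discussed above is the repeated factorization of and solves with this T (our own illustrative code, not the released implementation):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(A, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in A]

def lanczos(A, v0, k):
    """k steps of Lanczos: returns the diagonal (alphas) and off-diagonal
    (betas) of the tridiagonal T = Q^T A Q. Only matrix-vector products
    are needed, which is what makes the method Hessian-free; the extra
    cost of GLTR lies in repeatedly solving subproblems with T."""
    n = math.sqrt(dot(v0, v0))
    q = [x / n for x in v0]
    q_prev = [0.0] * len(v0)
    alphas, betas = [], []
    beta = 0.0
    for _ in range(k):
        w = matvec(A, q)
        alpha = dot(w, q)
        # Three-term recurrence: orthogonalize against the two last vectors.
        w = [wi - alpha * qi - beta * pi for wi, qi, pi in zip(w, q, q_prev)]
        alphas.append(alpha)
        beta = math.sqrt(dot(w, w))
        if beta < 1e-12:        # Krylov space exhausted
            break
        betas.append(beta)
        q_prev, q = q, [x / beta for x in w]
    return alphas, betas

A = [[2.0, 1.0], [1.0, 3.0]]
alphas, betas = lanczos(A, [1.0, 0.0], 2)
# After a full run, T is orthogonally similar to A: the trace is preserved.
assert abs(sum(alphas) - (2.0 + 3.0)) < 1e-9
```

On the GPU, the missing piece is not this loop (a handful of Hessian-vector products) but batched banded factorizations of the resulting T, e.g. the functionality SciPy provides on CPUs.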
i.3 Trust Region Adam
It is possible to design a second-order TR variant of the popular Adam optimizer. The main difference to RMSProp is that a moving average over past gradients is used instead of just the current gradient. For first-order methods this poses no problem, since only the gradient value of the past iterate must be stored. In second-order TR algorithms, however, we compute Hessian-vector products in each iteration, so we would need to store all past gradients including their computation graphs in order to backprop through them a second time. Besides likely running into memory problems, this also takes far too much time, since one Hv product now consists of many additional backpropagations. Alternatively, one can use only for the linear part of the model and compute Hessian-vector products based solely on the current gradient. We tested this in some settings but found no significant improvement over and . Another alternative would be to store only a small number of past gradients, as is done in Adadelta [80], but we leave an assessment of this approach for future work.
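To illustrate why averaging gradients multiplies the Hessian-vector cost: differentiation is linear, so the Hvp of an averaged gradient is the same average of the per-term Hvps, each of which requires its own backward pass. A toy finite-difference check on quadratics (our own construction; A1, A2 and beta are arbitrary):

```python
def matvec(A, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in A]

# Two per-term gradient functions g_i(x) = A_i x (gradients of quadratics).
A1 = [[2.0, 0.0], [0.0, 1.0]]
A2 = [[4.0, 1.0], [1.0, 3.0]]
beta = 0.9

def ema_grad(x):
    """Adam-style averaged gradient m = beta*g1 + (1-beta)*g2, viewed as a
    function of the current parameters x (both graphs kept alive)."""
    g1, g2 = matvec(A1, x), matvec(A2, x)
    return [beta * a + (1.0 - beta) * b for a, b in zip(g1, g2)]

def hvp_fd(grad_fn, x, v, eps=1e-6):
    """Finite-difference directional derivative of grad_fn at x along v,
    i.e. a Hessian-vector product: (grad(x + eps*v) - grad(x)) / eps."""
    xp = [xi + eps * vi for xi, vi in zip(x, v)]
    gp, g = grad_fn(xp), grad_fn(x)
    return [(a - b) / eps for a, b in zip(gp, g)]

x, v = [1.0, -2.0], [1.0, 1.0]
lhs = hvp_fd(ema_grad, x, v)
# Linearity: Hvp of the average = average of per-term Hvps, so every
# retained past gradient contributes its own (back)propagation.
rhs = [beta * a + (1.0 - beta) * b for a, b in zip(matvec(A1, v), matvec(A2, v))]
assert all(abs(a - b) < 1e-4 for a, b in zip(lhs, rhs))
```

With an exponential moving average over t iterates, the exact Hvp therefore aggregates t such terms, which is what makes the naive implementation both memory- and time-prohibitive.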
Appendix J Further Experiment Details
j.1 Default parameters, architectures and datasets
Method |  |  |  |  |  | (krylov tol.)
TR_uni | 512 | 10 | 0.95 | 1.1 | 1.5 | 0.1
TR_ada | 512 | 10 | 0.95 | 1.1 | 1.5 | 0.1
TR_rms | 512 | 10 | 0.95 | 1.1 | 1.75 | 0.1
Parameters
Table 1 reports the default parameters we consider. We grid-searched a subset of them (e.g. the learning rate of the gradient methods) and report the problem-specific values in each corresponding subsection whenever they deviate from the values stated above. For the second-order methods, we chose the values to maximize performance in terms of time. Other settings are likely to improve performance over epochs and backprops by a lot (e.g. a lower ). To compare all methods on the same basis, we run all first-order algorithms without momentum. Yet, we consider it an interesting direction for future research and note that momentum can also be applied to second-order methods (see e.g. [57, 54]).
For the sake of comparability, TR and TR are always run with the same parameters, so the only difference lies in the trust-region shape. TR shares most of these parameters, but we sometimes start with a smaller TR radius and use a stronger decrease factor (), since we observed that RMSprop usually also runs at much smaller learning rates than SGD and Adagrad.
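For reference, here is a sketch of the classical trust-region radius update in which such acceptance and increase/decrease parameters appear; reading the Table 1 columns 0.95 / 1.1 / 1.5 as acceptance threshold and radius factors is our own (hedged) interpretation, and the scheme below follows the textbook rule rather than the paper's exact implementation:

```python
def tr_radius_update(rho, radius, eta=0.95, gamma_inc=1.1, gamma_dec=1.5,
                     radius_max=1e3):
    """Classical trust-region radius update. rho is the ratio of actual to
    model-predicted reduction; eta, gamma_inc and gamma_dec mirror our
    (assumed) reading of the corresponding Table 1 columns."""
    if rho >= eta:                      # model matched reality: accept, grow
        return True, min(gamma_inc * radius, radius_max)
    if rho > 0.0:                       # mediocre step: accept, keep radius
        return True, radius
    return False, radius / gamma_dec    # bad step: reject, shrink

accepted, r = tr_radius_update(rho=0.99, radius=1.0)
assert accepted and abs(r - 1.1) < 1e-12
accepted, r = tr_radius_update(rho=-0.5, radius=1.0)
assert not accepted and abs(r - 1.0 / 1.5) < 1e-12
```

A stronger decrease factor (larger gamma_dec), as used for the RMSprop-shaped variant, shrinks the radius faster after rejected steps, playing the same role as a smaller learning rate in first-order methods.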
Datasets
We use three real-world datasets for image classification, namely CIFAR-10, MNIST and FashionMNIST.¹¹ While MNIST and FashionMNIST consist of greyscale images, CIFAR-10 consists of colored images of size . All three datasets have a fixed training-test split consisting of 50,000 (60,000 for MNIST and FashionMNIST) and 10,000 images, respectively.

¹¹ All three datasets were accessed from https://www.tensorflow.org/api_docs/python/tf/keras/datasets.
Network architectures
The MLP architectures are simple. For MNIST and FashionMNIST we use a network with tanh activations and a cross-entropy loss. The network has parameters. For the CIFAR-10 MLP we use a architecture, also with tanh activations and a cross-entropy loss. This network has parameters.
The MNIST and FashionMNIST autoencoders have the same architecture as the ones used in [38, 76, 52, 54]. The encoder structure is and the decoder is mirrored. Sigmoid activations are used in all but the central layer. The reconstructed images are fed pixel-wise into a binary cross-entropy loss. The network has a total of parameters. The CIFAR-10 autoencoder is taken from the implementation at https://github.com/jellycsc/PyTorchCIFAR10autoencoder.
We use fairly small convnets taken from official PyTorch tutorials. Specifically, the (Fashion-)MNIST network is specified in https://github.com/pytorch/examples/blob/master/mnist/main.py and the CIFAR-10 network can be found at https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html. The total number of parameters in these networks amounts to and , respectively.
In all of our experiments each method was run on one Tesla P100 GPU using the PyTorch [64] library.
j.2 Observed curvature
[Table: observed maximum curvature, convolutional vs. fully-connected networks, for CIFAR10, MNIST and FashionMNIST]
Figure: Maximum eigenvalue of the network's Hessian over time when training different networks with RMSProp. Average and 95% confidence interval over 10 random initializations.
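Curves like these can be produced from Hessian-vector products alone, e.g. via power iteration; a minimal sketch on an explicit symmetric matrix (our own illustrative code; in the experiments the matvec would be an autodiff Hvp):

```python
import math
import random

def matvec(A, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in A]

def max_eigenvalue(matvec_fn, dim, iters=200, seed=0):
    """Power iteration: estimates the largest-magnitude eigenvalue using
    only matrix-vector products, so the same loop runs on a neural
    network's Hessian via double backpropagation."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    lam = 0.0
    for _ in range(iters):
        w = matvec_fn(v)
        lam = math.sqrt(sum(x * x for x in w))   # ||Av|| -> |lambda_max|
        v = [x / lam for x in w]
    return lam

A = [[3.0, 1.0], [1.0, 3.0]]   # eigenvalues 2 and 4
lam = max_eigenvalue(lambda v: matvec(A, v), dim=2)
assert abs(lam - 4.0) < 1e-6
```

Since only matvecs are required, the per-estimate cost is a handful of gradient computations, which is why such curvature traces can be logged alongside training.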
j.3 Reconstructed Images from Autoencoders
j.4 CIFAR10
j.4.1 CIFAR10 ConvNet
Figure: CIFAR-10 ConvNet, time. TR methods use the default parameters of Table 1 but replace and take a 20-epoch warm start with RMSProp. First-order methods use 128 samples and , , .
j.4.2 CIFAR10 MLP
Figure: CIFAR-10 MLP, time: TR methods use the default parameters of Table 1. First-order methods use 128 samples and , , .
j.4.3 CIFAR10 Autoencoder
Figure: CIFAR-10 autoencoder, time: TR methods use the default parameters of Table 1. First-order methods use 128 samples and , , .
j.5 MNIST
j.5.1 MNIST ConvNet
Figure: MNIST ConvNet, time: TR methods use the default parameters of Table 1 but do one epoch of warm start with RMSProp, use 1024 samples initially and replace . First-order methods use 128 samples and , , .
j.5.2 MNIST MLP
Figure: MNIST MLP, time: TR methods use the default parameters of Table 1 but replace . First-order methods use 128 samples and , , .
j.5.3 MNIST Autoencoder
Figure: MNIST autoencoder, time: TR methods use the default parameters of Table 1 but replace . First-order methods use 128 samples and , , .
j.6 FashionMNIST
j.6.1 FashionMNIST ConvNet
Figure: FashionMNIST ConvNet, time: TR methods use the default parameters of Table 1 but do two epochs of warm start with RMSProp, start with 2048 samples and replace . TR and TR use . First-order methods use 128 samples and , , .
j.6.2 FashionMNIST MLP
Figure: FashionMNIST MLP, time: TR methods use the default parameters of Table 1 but replace . First-order methods use 128 samples and , , .
j.6.3 FashionMNIST Autoencoder
Figure: FashionMNIST autoencoder, time: TR methods use the default parameters of Table 1 but replace . TR uses . First-order methods use 128 samples and , , .