We consider finite-sum optimization problems of the form
$$\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w), \qquad (1)$$
which typically arise in neural network training, e.g. for empirical risk minimization over a set of $n$ data points. Here, each $f_i$ is the composition of a convex loss $\ell$ and the neural network mapping $h$ parameterized by $w$, which is non-convex due to its multiplicative nature and potentially non-linear activation functions. We assume that $f$ is twice differentiable, i.e. $f \in C^2(\mathbb{R}^d)$. Non-convex optimization problems are ubiquitous in machine learning. Among the most prominent examples are present-day deep neural networks, which have achieved outstanding results on core tasks such as collaborative filtering (wang2015collaborative), sentence classification (kim2014convolutional) and image classification (krizhevsky2012imagenet).
In the era of big data and deep neural networks, stochastic gradient descent (SGD) is one of the most widely used training algorithms (bottou2010large). What makes SGD so attractive is its simplicity, its seeming universality, and a per-iteration cost that is independent of the size of the training set and scales linearly in the problem dimension. However, gradient descent is known to be inadequate for optimizing functions that are not well-conditioned (nesterov2013introductory; shalev2017failures). Thus, adaptive first-order methods that employ dynamic, coordinate-wise learning rates based on past gradients – including Adagrad (duchi2011adaptive), RMSprop (tieleman2012lecture) and Adam (kingma2014adam) – have become a popular alternative to SGD for training neural networks, and these methods often provide significant speed-ups over plain SGD in practice. Yet, there exist no theoretical proofs that these methods are faster than gradient descent (li2018convergence).
From a theoretical perspective, Newton methods provide stronger convergence guarantees by appropriately transforming the gradient in ill-conditioned regions according to second-order derivatives. It is precisely this Hessian information that allows regularized Newton methods to enjoy superlinear local convergence as well as to provably escape saddle points (conn2000trust), which can slow gradient methods down significantly (du2017gradient) and are believed to be prevalent in high-dimensional problems (dauphin2014identifying). While second-order algorithms have a long-standing history even in the realm of neural network training (hagan1994training; becker1988improving), they were mostly considered too computationally and memory intensive for practical applications. Yet, the seminal work of martens2010deep renewed interest in their use in deep learning by proposing efficient Hessian-free methods (pearlmutter1994fast; schraudolph2002fast). Among the class of regularized Newton methods, trust region (conn2000trust) and cubic regularization algorithms (cartis2011adaptive) are the most principled approaches, as they yield the strongest convergence guarantees. Recently, stochastic extensions have emerged (xu2017newton; yao2018inexact; kohler2017sub; gratton2017complexity), which suggest their applicability to deep learning.
Yet, several practical considerations cast doubt on the theoretical superiority of second-order methods when it comes to neural network training. Answers to the following questions are particularly unclear: Are saddle points even an issue in deep learning? Is superlinear local convergence a desirable feature in machine learning applications, where test error is what ultimately matters? Are second-order methods more "vulnerable" to sub-sampling noise? Do worst-case iteration complexities even matter in real-world settings? As a result, the value of Hessian information in neural network training is somewhat unclear a priori, and so far a conclusive empirical study is still missing. We are only aware of the works of liu2018stochastic; martens2015optimizing; martens2010deep; chapelle2011improved and xu2017second, which deliver promising first results, but their empirical studies are by no means fully encompassing.
We thus want to shed more light on the issue of first- vs. second-order optimization for neural networks. To do so, we equip the fully stochastic STORM algorithm (chen2018stochastic; gratton2017complexity) with ellipsoidal norm constraints (Section 4) and then perform an extensive set of experiments in which we evaluate this algorithm against standard first-order methods (Section 5). Along the way, we give the following alternative interpretation of adaptive methods (Section 3): while gradient descent can be interpreted as a spherically constrained first-order TR method, preconditioned gradient methods – such as Adagrad – can be seen as first-order TR methods with an ellipsoidal trust region constraint.
This observation is particularly interesting, since spherical constraints are blind to the underlying geometry of the problem, whereas ellipsoids can adapt to local landscape characteristics, thereby allowing for more suitable steps in regions that are ill-conditioned. We will leverage this analogy and investigate the use of the Adagrad and RMSProp preconditioning matrices as ellipsoidal trust region shapes within second-order TR methods. While the theory for ellipsoidal TR methods is well-studied (e.g. conn2000trust; yuan2015recent), no ellipsoid fits all objective functions, and our contribution thus lies in identifying adequate matrix-induced constraints for the specific task of deep learning.
To sum up, our contribution is threefold:
We provide a new perspective on adaptive gradient methods that contributes to a better understanding of their inner workings. Furthermore, we find that diagonal preconditioning is very effective, since we observe empirically that many neural network problems exhibit diagonally dominant Hessian matrices.
We investigate the first application of ellipsoidal TR methods to deep learning. In Theorem 1 we show that the RMSProp preconditioning matrix can directly be applied as an ellipsoid in second-order TR algorithms while preserving convergence guarantees.
Finally, we provide a comprehensive benchmark of first- vs. second-order methods across different real-world datasets and architectures. We benchmark adaptive gradient methods and report results in terms of both epochs and wall-clock time, a comparison we were not able to find in the literature.
Our main empirical results demonstrate that the net value of Hessian information is marginal. While our second-order methods are mostly faster in terms of epochs, they only just manage to keep pace with stochastic gradient methods in terms of wall-clock time. Furthermore, they come at the cost of an increased number of hyperparameters and never converge to substantially better limit points. The latter observation suggests that neither saddles nor spurious local minima pose a major problem for first-order methods in our settings. As a result, future advances in hardware and distributed optimization are needed to give Newton methods an edge over gradient-based optimizers.
Finally – and more importantly for the work at hand – the ellipsoidal constraints we propose do prove to be a very effective modification of the trust region method in the sense that they consistently outperform the spherical TR method, both in terms of time and asymptotic loss value, on a variety of tasks and datasets (see Section 5).
2 Related work
The prototypical method for optimizing Eq. (1) is SGD (robbins1985stochastic). While the practical success of SGD in non-convex optimization is unquestioned, the theoretical foundation of this phenomenon is still rather limited. Recent findings suggest the ability of this method to escape saddle points and reach local minima in polynomial time for general non-convex problems, but they either need to artificially add noise to the iterates (ge2015escaping; lee2016gradient) or make an assumption on the inherent noise of vanilla SGD (daneshmand2018escaping). For neural network training, a recent line of research proclaims the effectiveness of SGD, but the results usually come at the cost of fairly strong assumptions such as heavy overparametrization and Gaussian inputs (du2017gradient; brutzkus2017globally; li2017convergence; du2018power; allen2018convergence). Adaptive gradient methods (duchi2011adaptive; tieleman2012lecture; kingma2014adam) build on the intuition that larger learning rates for smaller gradient components, and smaller learning rates for larger gradient components, balance their respective influences and thereby make the methods behave as if they were optimizing a more isotropic surface. This intuition is put into practice by preconditioning the update direction with a diagonal matrix of squared past gradients. Such approaches were first suggested for use in neural networks by lecun2012efficient. Convergence guarantees for such methods have recently started to appear (ward2018adagrad; li2018convergence). However, these are not superior to the worst-case iteration complexity of standard gradient descent (cartis2012much), and it is yet to be explained rigorously under which conditions adaptive gradient methods can provably accelerate optimization.
Regularized Newton methods
The most principled class of regularized Newton methods are trust region (TR) and adaptive cubic regularization algorithms (ARC) (conn2000trust; cartis2011adaptive), which repeatedly optimize a local Taylor model of the objective while making sure that the step does not travel too far, such that the model stays accurate. While the former finds first-order stationary points within $\mathcal{O}(\epsilon^{-2})$ iterations, ARC takes at most $\mathcal{O}(\epsilon^{-3/2})$. However, simple modifications to the TR framework allow these methods to obtain the same accelerated rate (curtis2017trust). Both methods take at most $\mathcal{O}(\epsilon^{-3})$ iterations to find an approximate second-order stationary point (cartis2012complexity). These rates are optimal for second-order Lipschitz continuous functions (carmon2017lower; cartis2012complexity), and they can be retained even when only sub-sampled gradient and Hessian information is used (kohler2017sub; yao2018inexact; xu2017newton; blanchet2016convergence; liu2018stochastic; cartis2017global). Furthermore, the involved Hessian information can be computed solely based on Hessian-vector products, which are implementable efficiently for neural networks (pearlmutter1994fast). This makes these methods particularly attractive for deep learning, but the empirical evidence of their applicability is so far very limited. We are only aware of the works of liu2018stochastic and xu2017second, which report promising first results, but these are by no means fully encompassing. While the former does not benchmark any first-order method at all, the latter uses large batch sizes and does not train convolutional networks nor benchmark any adaptive gradient method. Furthermore, the results are only reported in terms of backpropagations and not over wall-clock time or epochs.
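To illustrate why Hessian-vector products make these methods tractable, here is a small sketch (helper names are ours): Pearlmutter's trick computes $Hv$ exactly with a second differentiation pass; below we show the cheaper central-difference surrogate $Hv \approx (\nabla f(w + \epsilon v) - \nabla f(w - \epsilon v)) / (2\epsilon)$, which needs only two gradient evaluations and never forms the Hessian.

```python
# Illustrative sketch (helper names are ours): Hessian-vector products from
# gradient calls only, via a central finite difference.

def hvp_finite_diff(grad_f, w, v, eps=1e-5):
    """Approximate the Hessian-vector product H(w) @ v from two gradient calls."""
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad_f(w_plus), grad_f(w_minus)
    return [(gp - gm) / (2 * eps) for gp, gm in zip(g_plus, g_minus)]

# Toy objective f(w) = 0.5 * w^T A w, so that grad f(w) = A w and H = A exactly.
A = [[3.0, 1.0], [1.0, 2.0]]

def grad_f(w):
    return [sum(A[i][j] * w[j] for j in range(2)) for i in range(2)]

hv = hvp_finite_diff(grad_f, w=[1.0, -2.0], v=[1.0, 1.0])
# For a quadratic the difference quotient recovers A @ v up to round-off.
```

For neural networks the same idea applies with `grad_f` implemented by backpropagation; the memory cost stays at a constant multiple of one gradient computation.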
Inspired by natural gradient descent algorithms (amari1998natural), an interesting line of research proposes to replace the Hessian by (approximations of) the Generalized Gauss-Newton matrix (GGN) within a Levenberg-Marquardt framework (a simplified TR method, originally tailored to non-linear least-squares problems (nocedal2006nonlinear)) for the optimization of neural networks (lecun2012efficient; martens2010deep; chapelle2011improved; martens2015optimizing). These methods also optimize a regularized quadratic model in each iteration, and they have been termed Hessian-free since only access to GGN-vector products is required. As the GGN matrix is always positive semidefinite, they cannot leverage negative curvature to escape saddles, and hence there exist no second-order convergence guarantees. Furthermore, there are cases in neural network training where the Hessian is much better conditioned than the Gauss-Newton matrix (mizutani2008second). Nevertheless, the above works report promising preliminary results and, most notably, grosse2016kronecker report that the Hessian-free K-FAC method can be faster than SGD when training a convnet.
In general, this line of work is complementary to our approach, since it is straightforward to replace the Hessian in the TR framework with the GGN matrix. Furthermore, the preconditioners used in martens2010deep and chapelle2011improved, namely diagonal estimates of the empirical Fisher and Fisher matrix respectively, can directly be used as matrix norms in our ellipsoidal TR framework. A closer examination of the difference between Hessian and GGN information for neural network training is an interesting direction of future research.
3 An alternative view on adaptive gradient methods
Adaptively preconditioned gradient methods update iterates in the following way:
$$w_{t+1} = w_t - \eta_t A_t^{-1} g_t,$$
where $g_t$ is a stochastic estimate of $\nabla f(w_t)$ and $A_t$ is a positive definite symmetric preconditioning matrix. In Adagrad, $A_t = G_t^{1/2}$, where $G_t$ is the un-centered second moment matrix of the past gradients, which we compute as follows:
$$G_t = \sum_{s=1}^{t} g_s g_s^\top + \epsilon I, \qquad (2)$$
where $\epsilon > 0$, $I$ is the identity matrix and $g_s$ is the stochastic gradient at step $s$. Building up on the intuition that past gradients might become obsolete in quickly changing non-convex landscapes, RMSprop (and Adam) introduce an exponential weight decay on these terms, leading to the preconditioning matrix
$$\hat{G}_t = \beta \hat{G}_{t-1} + (1-\beta)\, g_t g_t^\top + \epsilon I, \qquad (3)$$
where $\beta \in (0,1)$. In order to save computational effort, the diagonal versions $\mathrm{diag}(G_t)$ and $\mathrm{diag}(\hat{G}_t)$ are more commonly applied in practice, which in turn gives rise to coordinate-wise adaptive stepsizes that are enlarged (reduced) in coordinates that have seen past gradient components of smaller (larger) magnitude. In that way, the optimization methods can account for gradients of potentially different scales arising, e.g., from different layers of the network.
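The diagonal accumulators described above can be sketched in a few lines (a minimal illustration; function names are ours). Both variants precondition the step by $1/(\sqrt{v} + \epsilon)$ coordinate-wise; Adagrad sums all squared past gradients, while RMSprop keeps an exponentially decaying average of them.

```python
# Minimal sketch of diagonal Adagrad and RMSprop preconditioning (names ours).

def adagrad_update(w, g, v, lr=0.01, eps=1e-8):
    v = [vi + gi * gi for vi, gi in zip(v, g)]                      # sum of g^2
    w = [wi - lr * gi / (vi ** 0.5 + eps) for wi, gi, vi in zip(w, g, v)]
    return w, v

def rmsprop_update(w, g, v, lr=0.01, beta=0.9, eps=1e-8):
    v = [beta * vi + (1 - beta) * gi * gi for vi, gi in zip(v, g)]  # decaying avg
    w = [wi - lr * gi / (vi ** 0.5 + eps) for wi, gi, vi in zip(w, g, v)]
    return w, v

# With a constant gradient of 2, Adagrad's accumulator grows linearly (so its
# steps decay like 1/sqrt(t)), while RMSprop's converges to g^2 = 4, so its
# effective step approaches the constant lr.
w_a, v_a = [0.0], [0.0]
w_r, v_r = [0.0], [0.0]
for _ in range(500):
    w_a, v_a = adagrad_update(w_a, [2.0], v_a)
    w_r, v_r = rmsprop_update(w_r, [2.0], v_r)
```

This difference between a growing and a saturating accumulator is exactly what distinguishes the two ellipsoids discussed later in Section 4.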
3.1 Adaptive preconditioning as ellipsoidal Trust Region
Starting from the fact that adaptive methods employ coordinate-wise stepsizes, one can take a principled view on these methods. Namely, their update steps arise from minimizing a first-order Taylor model of the function within an ellipsoidal search space around the current iterate $w_t$, where the diameter of the ellipsoid along a certain coordinate is proportional to the associated stepsize. Correspondingly, vanilla SGD optimizes the same first-order model within a spherical constraint. Fig. 1 (top) illustrates this effect on three quadratic objectives by showing not only the iterates of GD and Adagrad but also the implicit trust regions within which the local models were optimized at each step (for illustrative purposes we only plot every other trust region). Since the models are linear, the constrained minimizer is always found on the boundary of the current trust region.
[Figure 1. Top: iterates of GD and Adagrad on a well-conditioned, an ill-conditioned and a tilted quadratic objective, together with the implicit trust regions within which the local models were optimized. Bottom: average log suboptimality over iterations as well as 90% confidence intervals of 30 runs with random initialization.]
It is a well-known fact that GD struggles to progress towards the minimizer of a quadratic along low-curvature directions, since the gradient there has significantly smaller components than along high-curvature directions (see e.g. goh2017why). For fairly well-conditioned objectives this effect is negligible, as can be seen on the left-hand side of Fig. 1. Yet, the plots in the middle show that this effect can drastically slow GD down when the problem is ill-conditioned. In particular, once the method has reached the bottom of the valley, it struggles to make progress along the horizontal axis. This observation resembles the classical zig-zagging effect (nocedal2006nonlinear) and is precisely where the advantage of adaptive stepsize methods comes into play. As illustrated by the dashed trust region lines, Adagrad's search space is damped along the direction of high curvature (vertical axis) and elongated along the low-curvature direction (horizontal axis). This preconditioning allows the method to move more horizontally early on and thus enter the valley with a smaller distance to the optimizer along the low-curvature direction, which effectively circumvents the zig-zagging effect.
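A toy reproduction of this effect (our own illustrative setup, not the paper's experiment): on the ill-conditioned axis-aligned quadratic $f(x) = \frac{1}{2}(100\,x_1^2 + x_2^2)$, plain GD crawls along the flat $x_2$ direction, while diagonal Adagrad equalizes progress across coordinates.

```python
# Illustrative comparison (setup ours) of GD vs. diagonal Adagrad on an
# ill-conditioned quadratic with condition number 100.
from math import sqrt

A = [100.0, 1.0]                      # diagonal Hessian entries

def loss(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(A, x))

def grad(x):
    return [a * xi for a, xi in zip(A, x)]

def run_gd(x, lr=0.015, steps=100):   # lr must satisfy lr < 2/100 for stability
    for _ in range(steps):
        x = [xi - lr * gi for xi, gi in zip(x, grad(x))]
    return x

def run_adagrad(x, lr=0.5, eps=1e-8, steps=100):
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(x)
        v = [vi + gi * gi for vi, gi in zip(v, g)]
        x = [xi - lr * gi / (sqrt(vi) + eps) for xi, gi, vi in zip(x, g, v)]
    return x

x0 = [1.0, 1.0]
gd_loss = loss(run_gd(x0))
ada_loss = loss(run_adagrad(x0))
# Adagrad's per-coordinate scaling reaches a far lower loss in the same budget.
```

The GD learning rate is capped by the large curvature direction, which is what leaves it slow along the flat one; Adagrad's division by accumulated squared gradients removes exactly this coupling.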
Let us now formally establish the result that allows us to re-interpret adaptive gradient methods from the trust region perspective introduced above.
Theorem 1 (Preconditioned gradient methods as TR).
A preconditioned gradient step
$$w_{t+1} = w_t - \eta_t A^{-1} g_t \qquad (4)$$
with stepsize $\eta_t > 0$, symmetric positive definite preconditioner $A$ and gradient estimate $g_t$ minimizes a first-order model around $w_t$ in an ellipsoid given by $A$ in the sense that
$$w_{t+1} = \arg\min_{\|w - w_t\|_A \le \Delta_t} f(w_t) + g_t^\top (w - w_t), \quad \text{where } \Delta_t := \eta_t \|A^{-1} g_t\|_A. \qquad (5)$$
Corollary 1 (RMSprop).
The RMSprop step $w_{t+1} = w_t - \eta_t \hat{G}_t^{-1/2} g_t$ minimizes a first-order Taylor model around $w_t$ in an ellipsoid given by $\hat{G}_t^{1/2}$ (Eq. (3)) in the sense that
$$w_{t+1} = \arg\min_{\|w - w_t\|_{\hat{G}_t^{1/2}} \le \Delta_t} f(w_t) + g_t^\top (w - w_t), \quad \text{where } \Delta_t := \eta_t \|\hat{G}_t^{-1/2} g_t\|_{\hat{G}_t^{1/2}}. \qquad (6)$$
Equivalent results can be established for Adam using the bias-corrected matrix $\hat{G}_t$, as well as for Adagrad by replacing the matrix $\hat{G}_t^{1/2}$ with $G_t^{1/2}$ in the constraint in Eq. (6). Of course, the update procedure in Eq. (5) is merely a re-interpretation of the original preconditioned update Eq. (4), and thus the employed trust region radii are defined implicitly by the current gradient and stepsize. However, it is straightforward to embed this approach into a proper TR framework where the trust region radius is defined explicitly by the algorithm, which gives rise to first-order TR methods that behave very similarly to GD and Adagrad in practice (see Fig. 5 in the Appendix).
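The trust region interpretation above can be checked numerically (an illustrative sketch with a diagonal preconditioner of our choosing): the preconditioned step $s = -\eta A^{-1} g$ attains the smallest linear-model value $g^\top s$ among all points on the ellipsoid $\|s\|_A = \Delta$ with $\Delta = \eta\,\|A^{-1}g\|_A$.

```python
# Numerical sanity check (diagonal SPD preconditioner, values ours) that the
# preconditioned step minimizes the linear model over the induced ellipsoid.
from math import sqrt, sin, cos, pi

g = [3.0, -1.0]
A = [4.0, 0.25]                                          # diagonal SPD matrix
eta = 0.1

s = [-eta * gi / ai for gi, ai in zip(g, A)]             # preconditioned step
delta = sqrt(sum(ai * si * si for ai, si in zip(A, s)))  # ||s||_A
model_star = sum(gi * si for gi, si in zip(g, s))        # g^T s at the step

# Sample boundary points ||s'||_A = delta and compare linear-model values.
worst_gap = 0.0
for k in range(360):
    t = 2 * pi * k / 360
    u = [cos(t), sin(t)]                                 # unit direction
    s_prime = [delta * ui / sqrt(ai) for ui, ai in zip(u, A)]  # on the ellipsoid
    m = sum(gi * si for gi, si in zip(g, s_prime))
    worst_gap = min(worst_gap, m - model_star)
# worst_gap never drops below zero: no boundary point beats the step.
```

The same check works for any SPD $A$ after whitening by $A^{1/2}$; we use a diagonal one only to keep the ellipsoid parameterization explicit.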
3.2 Diagonal versus full preconditioning
A closer look at Fig. 1 reveals that the first two problems come with level sets that are perfectly axis-aligned, which makes these objectives particularly amenable to diagonal preconditioning. For comparison, on the far right of Fig. 1 we report a further quadratic problem instance, where the Hessian is no longer zero on the off-diagonals. As can be seen, the interaction between coordinates introduces a tilt in the level sets and reduces the superiority of diagonal Adagrad over plain GD. However, using the full preconditioner re-establishes the original speed-up. Yet, non-diagonal preconditioning comes at a higher per-iteration computational cost, since it involves taking the inverse square root of a large matrix, which is why this approach has remained relatively unexplored (see agarwal2018case for a recent exception). In the following section, we find that such full-matrix preconditioning is likely not worth the effort for neural networks, since they show a surprisingly large degree of axis alignment.
3.3 Are neural networks axis-aligned?
Early results by becker1988improving on the curvature structure of neural networks report a strong diagonal dominance of the Hessian matrix $\nabla^2 f(w)$. This suggests that the loss surface is indeed somewhat axis-aligned, which would render full-matrix preconditioning obsolete. However, the reported numbers are only for tiny feed-forward networks of at most 256 parameters. Therefore, we generalize these findings in the following to larger networks and contrast the diagonal dominance of real Hessian matrices with the expected behaviour of random Wigner matrices. For this purpose, let $r(M)$ denote the ratio of diagonal to overall mass of a matrix $M \in \mathbb{R}^{d \times d}$, i.e.
$$r(M) := \frac{\sum_{i=1}^{d} |M_{ii}|}{\sum_{i=1}^{d}\sum_{j=1}^{d} |M_{ij}|}.$$
For a random Gaussian Wigner matrix (Eq. (29)), the expected share of diagonal mass scales as $1/d$ (the argument naturally extends to any distribution with positive expected absolute values). Thus, if we suppose the Hessian at any given point were a random Wigner matrix, we would expect the share of diagonal mass to fall with $1/d$ as the network grows in size. Yet, as can be seen in Fig. 2, the diagonal mass of real-world neural networks stays way above this theoretical value at random initialization, during training and after convergence. These findings are in line with becker1988improving and thus question the effectiveness of full-matrix preconditioning and full-matrix ellipsoidal TR methods.
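The Wigner baseline discussed above is easy to check empirically (a small illustrative simulation of our own): for a symmetric matrix with i.i.d. Gaussian entries, the diagonal mass ratio concentrates around $1/d$.

```python
# Illustrative check of the 1/d scaling of diagonal mass for Gaussian Wigner
# matrices (simulation ours): real network Hessians are reported to stay far
# above this random-matrix baseline.
import random

def diagonal_mass_ratio(d, seed=0):
    rng = random.Random(seed)
    m = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i, d):
            m[i][j] = m[j][i] = rng.gauss(0.0, 1.0)   # symmetric random matrix
    diag = sum(abs(m[i][i]) for i in range(d))
    total = sum(abs(m[i][j]) for i in range(d) for j in range(d))
    return diag / total

r200 = diagonal_mass_ratio(200)   # expected to lie close to 1/200 = 0.005
```

Running this for growing $d$ makes the $1/d$ decay visible, which is the benchmark against which the diagonal dominance of real Hessians in Fig. 2 should be read.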
4 Second-order Trust Region Methods
Cubic regularization (nesterov2006cubic; cartis2011adaptive) and trust region methods belong to the family of globalized Newton methods. Both frameworks compute their parameter updates by optimizing regularized (former) or constrained (latter) second-order Taylor models of the objective around the current iterate $w_t$. In particular, in iteration $t$ the update step of the trust region algorithm is computed as
$$s_t = \arg\min_{s \in \mathbb{R}^d} \; m_t(s) := f(w_t) + g_t^\top s + \frac{1}{2} s^\top B_t s \quad \text{s.t.} \quad \|s\|_{M_t} \le \Delta_t, \qquad (7)$$
where $\|s\|_{M_t} := \sqrt{s^\top M_t s}$, and $g_t$ and $B_t$ are either $\nabla f(w_t)$ and $\nabla^2 f(w_t)$ or suitable approximations thereof. The matrix $M_t$ induces the shape of the constraint set. So far, the common choice for neural networks is $M_t = I$, which gives rise to spherical trust regions (xu2017second; liu2018stochastic). By solving the constrained problem (7), TR methods overcome the problem that pure Newton steps may be ascending, attracted by saddles, or not even computable in the case of indefinite Hessians. See Section 2 and Appendix B for more details.
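A minimal sketch of the ellipsoidal subproblem (7) (our own simplification: diagonal $B_t$ and $M_t$; practical TR solvers use Steihaug-CG or Moré-Sorensen iterations instead): whitening with $M^{1/2}$ turns the ellipsoidal constraint into a spherical one, which can then be solved by bisection on the Lagrange multiplier.

```python
# Sketch (diagonal B and M assumed, names ours) of the ellipsoidal TR
# subproblem: min g^T s + 0.5 s^T diag(B) s  s.t.  ||s||_diag(M) <= delta.
from math import sqrt

def solve_tr_diagonal(g, B, M, delta):
    g_hat = [gi / sqrt(mi) for gi, mi in zip(g, M)]     # whitened gradient
    b_hat = [bi / mi for bi, mi in zip(B, M)]           # whitened curvature

    def y_of(lam):
        return [-gi / (bi + lam) for gi, bi in zip(g_hat, b_hat)]

    def norm(y):
        return sqrt(sum(yi * yi for yi in y))

    lo = max(0.0, -min(b_hat)) + 1e-12
    if min(b_hat) > 0 and norm(y_of(0.0)) <= delta:
        y = y_of(0.0)                                   # interior Newton step
    else:
        hi = lo + 1.0
        while norm(y_of(hi)) > delta:                   # bracket the multiplier
            hi *= 2.0
        for _ in range(200):                            # bisect on ||y|| = delta
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if norm(y_of(mid)) > delta else (lo, mid)
        y = y_of(hi)
    return [yi / sqrt(mi) for yi, mi in zip(y, M)]      # un-whiten: s = M^{-1/2} y

s = solve_tr_diagonal(g=[1.0, -2.0], B=[10.0, 1.0], M=[4.0, 1.0], delta=0.5)
```

The whitening step is exactly why a good choice of $M_t$ matters: in the whitened coordinates, the ellipsoidal method behaves like a spherical TR method on a better-conditioned problem.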
4.1 Convergence of ellipsoidal Trust Region methods
Inspired by the success of adaptive gradient methods such as RMSProp, we investigate the use of their preconditioning matrices as norm-inducing matrices for second-order TR methods. One crucial condition for convergence is that the applied norms do not degenerate during the minimization process, in the sense that the ellipsoids do not flatten out completely along any given direction.
Definition 1 (Uniformly equivalent norms).
The norms $\|\cdot\|_{M_t}$ induced by a sequence of symmetric positive definite matrices $M_t$ are called uniformly equivalent if there exists a constant $\mu > 0$ such that
$$\frac{1}{\mu}\,\|x\| \;\le\; \|x\|_{M_t} \;\le\; \mu\,\|x\| \qquad \text{for all } x \in \mathbb{R}^d \text{ and all } t \ge 0.$$
We now establish a result which shows that the RMSProp ellipsoid is indeed uniformly equivalent.
Proposition 2 (Uniform equivalence).
The norms induced by the RMSProp matrices $\hat{G}_t^{1/2}$ (Eq. (3)) are uniformly equivalent.
Consequently, the ellipsoids $\hat{G}_t^{1/2}$ can directly be applied to any convergent TR framework without losing convergence guarantees. Interestingly, this result cannot be established for the Adagrad ellipsoid $G_t^{1/2}$, which reflects the widely known vanishing-stepsize problem that arises because squared gradients are continuously added to the preconditioning matrix. It is mainly this effect that eventually inspired the development of RMSprop (tieleman2012lecture) and Adadelta (zeiler2012adadelta).
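For intuition, here is a brief sketch of why the RMSprop ellipsoid cannot degenerate (the assumptions are ours: $\hat{G}_t$ follows the recursion of Eq. (3) with an $\epsilon I$ term, $\hat{G}_0 \preceq (c_g^2 + \epsilon/(1-\beta))\,I$, and the stochastic gradients are uniformly bounded, $\|g_t\| \le c_g$):

\begin{align}
\hat{G}_t \;\succeq\; \beta \hat{G}_{t-1} + \epsilon I \;\succeq\; \epsilon I
  &\;\Longrightarrow\; \lambda_{\min}(\hat{G}_t) \ge \epsilon, \\
\lambda_{\max}(\hat{G}_t) \;\le\; \beta\,\lambda_{\max}(\hat{G}_{t-1}) + (1-\beta)\,c_g^2 + \epsilon
  &\;\Longrightarrow\; \lambda_{\max}(\hat{G}_t) \le c_g^2 + \frac{\epsilon}{1-\beta},
\end{align}

where the second implication follows by induction on the linear recursion. Hence the eigenvalues of $\hat{G}_t^{1/2}$ lie in $[\sqrt{\epsilon},\, \sqrt{c_g^2 + \epsilon/(1-\beta)}\,]$ for all $t$, which yields the two-sided bound of Definition 1. For Adagrad, by contrast, $\lambda_{\max}(G_t)$ can grow without bound as squared gradients accumulate, so no such uniform upper bound exists.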
There are many sources of ill-conditioning in neural networks, such as un-centered and correlated inputs (lecun2012efficient), saturated hidden units, and different weight scales in different layers of the network (van1998solving). While the quadratic term in the model (7) does account for such ill-conditioning to some extent, the spherical constraint is completely blind to the loss surface; it is in general advisable to instead measure distances in norms that reflect the underlying geometry (Chapter 7.7 in conn2000trust). The ellipsoids we propose allow for longer steps along coordinates that have seen small gradient components in the past, and vice versa. Thereby the TR shape is adaptively adjusted to fit the current region of the non-convex loss landscape. Contrary to adaptive first-order methods, we do not only adapt the shape but also the corresponding diameter ($\Delta_t$), depending on whether or not the local Taylor model is an adequate approximation within the trust region at the current state of the algorithm. This procedure is not only effective when the iterates are in an ill-conditioned convex neighborhood of a minimizer (Figure 1), but it also helps to escape elongated plateaus (Section 5).
4.2 A stochastic TR framework for neural network training
Since neural network training often constitutes a large-scale learning problem in which the number of datapoints is very high, we here opt for a fully stochastic TR framework (chen2018stochastic) in order to circumvent memory issues and reduce computational complexity. Given that the involved function and derivative estimates are sufficiently accurate with fixed probability, such a framework has been shown to retain the convergence rate of deterministic TR methods to stationary points in expectation (blanchet2016convergence). For finite-sum objectives such as Eq. (1), the required level of accuracy can be obtained by simple mini-batching.
The main difference between this and most existing approaches (kohler2017sub; xu2017newton; yao2018inexact; cartis2017global) lies in the computation of $\rho_t$, which is traditionally computed as full function decrease over stochastic model decrease. Computing $\rho_t$ solely based on sub-sampled quantities has the nice side-effect of disentangling these two potential sources of error: an overoptimistic trust region radius and insufficiently large batch sizes. Namely, the quantity $\rho_t$ from Eq. (9) is an unbiased estimate of the ratio used in fully deterministic algorithms, and hence the trust region radius is adjusted purely based on the current adequacy of the local quadratic approximation.
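The $\rho$-based radius adaptation described above can be sketched schematically (the constants, the Cauchy-point step used instead of a full subproblem solver, and the helper names are all ours): accept the step and grow the radius when the quadratic model predicts the observed decrease well, shrink the radius otherwise.

```python
# Schematic TR loop (constants and step rule ours): rho compares the observed
# function decrease to the decrease predicted by the local quadratic model.
from math import sqrt

def cauchy_step(g, B, delta):
    """Minimize the quadratic model along -g inside a sphere of radius delta."""
    gnorm = sqrt(sum(gi * gi for gi in g))
    gBg = sum(gi * bi * gi for gi, bi in zip(g, B))     # diagonal B for brevity
    t = delta / gnorm
    if gBg > 0:
        t = min(t, gnorm * gnorm / gBg)
    return [-t * gi for gi in g]

def tr_minimize(f, grad, B, w, delta=1.0, eta1=0.1, steps=500):
    for _ in range(steps):
        g = grad(w)
        if sqrt(sum(gi * gi for gi in g)) < 1e-12:      # converged
            break
        s = cauchy_step(g, B, delta)
        pred = -(sum(gi * si for gi, si in zip(g, s))
                 + 0.5 * sum(si * bi * si for si, bi in zip(s, B)))
        rho = (f(w) - f([wi + si for wi, si in zip(w, s)])) / pred
        if rho > eta1:                                  # model was adequate
            w = [wi + si for wi, si in zip(w, s)]
            delta *= 2.0
        else:                                           # model overshot: shrink
            delta *= 0.25
    return w

B = [100.0, 1.0]
f = lambda x: 0.5 * (B[0] * x[0] ** 2 + B[1] * x[1] ** 2)
grad = lambda x: [B[0] * x[0], B[1] * x[1]]
w_final = tr_minimize(f, grad, B, w=[1.0, 1.0])
```

In the stochastic variant discussed above, both the function values and the model entering $\rho$ are computed on the same mini-batch, so $\rho$ measures only model adequacy and not sampling error.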
5 Experiments

To validate our claim that ellipsoidal TR methods yield improved performance over spherical ones, we run a large set of experiments on three image datasets and three types of network architectures. As outlined in Section 2, many theoretical arguments suggest the use of regularized Newton methods for optimizing neural networks, but certain practical objections exist (see Appendix F.2), and so far there is no conclusive empirical evidence (see the discussion in Section 2). Hence, we also present a comprehensive benchmark with state-of-the-art first-order methods. We fix the sample size for all gradient methods to 128 but grid-search the stepsize, since the ratio of these two quantities effectively determines the level of stochasticity (jastrzkebski2017three). The TR methods mostly use 512 samples, since they are more likely to overfit very small minibatches. More details can be found in Appendix C.
As can be seen in Figure 3
the ellipsoidal TR methods consistently outperform their spherical counterpart in the sense that they reach full accuracy substantially faster on all problems and in all measures (time and epochs). Furthermore, their limit points are in all cases lower than those of the uniform TR method, and especially on the autoencoders this makes an actual difference in the image reconstruction quality (see Figure 12). We thus draw the clear conclusion that the ellipsoidal trust region constraints we propose are to be preferred over their spherical counterpart when training neural networks.
Benchmark with SGD
While the ellipsoidal TR methods are slightly superior in terms of epochs, they only just manage to keep pace with first-order methods in terms of wall-clock time (Figure 10). Furthermore, the limit points of both first- and second-order methods yield the same order of loss in most experiments. When taking gradient norms into account (reported in Appendix J), we indeed find no spurious local minima and only one saddle point, despite potentially arbitrarily non-convex loss landscapes. This suggests that second-order information is not needed to circumvent such obstacles in neural network optimization landscapes. Finally, we had to warm-start the TR methods with a few first-order epochs to achieve good performance in terms of time on the convnets (see Appendix I.1 for a detailed discussion).
6 Conclusion

We investigated the use of ellipsoidal trust region constraints for neural networks. We have shown that the RMSProp matrix satisfies the necessary conditions for convergence, and our experimental results demonstrate that ellipsoidal TR methods significantly outperform their spherical counterparts. We thus consider the development of further ellipsoids that can potentially adapt even better to the loss landscape, such as (block-)diagonal Hessian approximations (e.g. bekas2007estimator) or approximations of higher-order derivatives, an interesting direction of future research.
Yet, our comprehensive empirical study also highlights that the value of Hessian information for neural network optimization is limited, for mainly two reasons: 1) second-order methods rarely yield better limit points, which suggests that saddles and spurious local minima are not a major obstacle; 2) low per-iteration costs render gradient methods superior in terms of time. The latter observation suggests that advances in hardware and distributed second-order algorithms (e.g. osawa2018second; dunner2018distributed) will be needed to speed up computations before Newton-type methods can replace (stochastic) gradient methods in deep learning. Finally, since trust region methods come with a large number of hyperparameters, Bayesian hyperparameter optimization is another interesting direction of future work.
- (1) Leonard Adolphs, Hadi Daneshmand, Aurélien Lucchi, and Thomas Hofmann. Local saddle point optimization: A curvature exploitation approach. CoRR, abs/1805.05751, 2018.
- (2) Naman Agarwal, Brian Bullins, Xinyi Chen, Elad Hazan, Karan Singh, Cyril Zhang, and Yi Zhang. The case for full-matrix adaptive regularization. arXiv preprint arXiv:1806.02958, 2018.
- (3) Guillaume Alain, Nicolas Le Roux, and Pierre-Antoine Manzagol. Negative eigenvalues of the hessian in deep neural networks. 2018.
- (4) Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than sgd. In Advances in Neural Information Processing Systems, pages 2680–2691, 2018.
- (5) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
- (6) Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
- (7) Sue Becker, Yann Le Cun, et al. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 connectionist models summer school, pages 29–37. San Matteo, CA: Morgan Kaufmann, 1988.
- (8) Costas Bekas, Effrosyni Kokiopoulou, and Yousef Saad. An estimator for the diagonal of a matrix. Applied numerical mathematics, 57(11-12):1214–1229, 2007.
- (9) Jose Blanchet, Coralia Cartis, Matt Menickelly, and Katya Scheinberg. Convergence rate analysis of a stochastic trust region method for nonconvex optimization. arXiv preprint arXiv:1609.07428, 2016.
- (10) Leon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
- (11) Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
- (12) Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.
- (13) Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i. arXiv preprint arXiv:1710.11606, 2017.
- (14) Yair Carmon, Oliver Hinder, John C Duchi, and Aaron Sidford. "Convex until proven guilty": Dimension-free acceleration of gradient descent on non-convex functions. arXiv preprint arXiv:1705.02766, 2017.
- (15) Coralia Cartis, Nicholas IM Gould, and Ph L Toint. Complexity bounds for second-order optimality in unconstrained optimization. Journal of Complexity, 28(1):93–108, 2012.
- (16) Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.
- (17) Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. How Much Patience Do You Have?: A Worst-case Perspective on Smooth Nonconvex Optimization. Science and Technology Facilities Council Swindon, 2012.
- (18) Coralia Cartis and Katya Scheinberg. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, pages 1–39, 2017.
- (19) Olivier Chapelle and Dumitru Erhan. Improved preconditioner for hessian free optimization. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 201, 2011.
- (20) Ruobing Chen, Matt Menickelly, and Katya Scheinberg. Stochastic optimization using a trust-region method and random models. Mathematical Programming, 169(2):447–487, 2018.
- (21) Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
- (22) Andrew R Conn, Nicholas IM Gould, and Philippe L Toint. Trust region methods. SIAM, 2000.
- (23) Frank E Curtis and Daniel P Robinson. Exploiting negative curvature in deterministic and stochastic optimization. arXiv preprint arXiv:1703.00412, 2017.
- (24) Frank E Curtis and Daniel P Robinson. How to characterize the worst-case performance of algorithms for nonconvex optimization. arXiv preprint arXiv:1802.01062, 2018.
- (25) Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of $\mathcal{O}(\epsilon^{-3/2})$ for nonconvex optimization. Mathematical Programming, 162(1-2):1–32, 2017.
- (26) Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. arXiv preprint arXiv:1803.05999, 2018.
- (27) Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, pages 2933–2941, 2014.
- (28) Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067–1077, 2017.
- (29) Simon S Du and Jason D Lee. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.
- (30) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- (31) Celestine Dünner, Aurelien Lucchi, Matilde Gargiani, An Bian, Thomas Hofmann, and Martin Jaggi. A distributed second-order algorithm you can trust. arXiv preprint arXiv:1806.07569, 2018.
- (32) Kenji Fukumizu and Shun-ichi Amari. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13(3):317–327, 2000.
- (33) Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points – online stochastic gradient for tensor decomposition. In COLT, pages 797–842, 2015.
- (34) Gabriel Goh. Why momentum really works. Distill, 2017.
- (35) Serge Gratton, Clément W Royer, Luís N Vicente, and Zaikun Zhang. Complexity and global rates of trust-region methods based on probabilistic models. IMA Journal of Numerical Analysis, 2017.
- (36) Roger Grosse and James Martens. A kronecker-factored approximate fisher matrix for convolution layers. In International Conference on Machine Learning, pages 573–582, 2016.
- (37) Martin T Hagan and Mohammad B Menhaj. Training feedforward networks with the marquardt algorithm. IEEE transactions on Neural Networks, 5(6):989–993, 1994.
- (38) Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
- (39) Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623, 2017.
- (40) Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017.
- (41) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
- (42) Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
- (43) Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
- (44) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- (45) Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In International Conference on Machine Learning, 2017.
- (46) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- (47) Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer, 2012.
- (48) Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.
- (49) Xiaoyu Li and Francesco Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. arXiv preprint arXiv:1805.08114, 2018.
- (50) Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.
- (51) Liu Liu, Xuanqing Liu, Cho-Jui Hsieh, and Dacheng Tao. Stochastic second-order methods for non-convex optimization with inexact hessian and gradient. arXiv preprint arXiv:1809.09853, 2018.
- (52) James Martens. Deep learning via hessian-free optimization. In ICML, volume 27, pages 735–742, 2010.
- (53) James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
- (54) James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning, pages 2408–2417, 2015.
- (55) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182, 2017.
- (56) Eiji Mizutani and Stuart E Dreyfus. Second-order stagewise backpropagation for hessian-matrix analyses and investigation of negative curvature. Neural Networks, 21(2-3):193–203, 2008.
- (57) Yu Nesterov. Accelerating the cubic regularization of newton’s method on convex problems. Mathematical Programming, 112(1):159–181, 2008.
- (58) Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
- (59) Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
- (60) Jorge Nocedal and Stephen J Wright. Numerical optimization, 2nd Edition. Springer, 2006.
- (61) Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Second-order optimization method for large mini-batch: Training resnet-50 on imagenet in 35 epochs. arXiv preprint arXiv:1811.12019, 2018.
- (62) Ioannis Panageas and Georgios Piliouras. Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. arXiv preprint arXiv:1605.00405, 2016.
- (63) Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
- (64) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
- (65) Barak A Pearlmutter. Fast exact multiplication by the hessian. Neural computation, 6(1):147–160, 1994.
- (66) Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pages 314–323, 2016.
- (67) Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
- (68) Nicol N Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural computation, 14(7):1723–1738, 2002.
- (69) Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. arXiv preprint arXiv:1703.07950, 2017.
- (70) Trond Steihaug. The conjugate gradient method and trust regions in large scale optimization. SIAM Journal on Numerical Analysis, 20(3):626–637, 1983.
- (71) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
- (72) Patrick Van Der Smagt and Gerd Hirzinger. Solving the ill-conditioning in neural network learning. In Neural networks: tricks of the trade, pages 193–206. Springer, 1998.
- (73) Hao Wang, Naiyan Wang, and Dit-Yan Yeung. Collaborative deep learning for recommender systems. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1235–1244, 2015.
- (74) Rachel Ward, Xiaoxia Wu, and Leon Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. arXiv preprint arXiv:1806.01811, 2018.
- (75) Eugene P Wigner. Characteristic vectors of bordered matrices with infinite dimensions i. In The Collected Works of Eugene Paul Wigner, pages 524–540. Springer, 1993.
- (76) Peng Xu, Farbod Roosta-Khorasani, and Michael W Mahoney. Second-order optimization for non-convex machine learning: An empirical study. arXiv preprint arXiv:1708.07827, 2017.
- (77) Peng Xu, Farbod Roosta-Khorasani, and Michael W Mahoney. Newton-type methods for non-convex optimization under inexact hessian information. arXiv preprint arXiv:1708.07164, 2017.
- (78) Zhewei Yao, Peng Xu, Farbod Roosta-Khorasani, and Michael W Mahoney. Inexact non-convex newton-type methods. arXiv preprint arXiv:1802.06925, 2018.
- (79) Ya-xiang Yuan. Recent advances in trust region algorithms. Mathematical Programming, 151(1):249–281, 2015.
- (80) Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Appendix A Notation
Throughout this work, scalars are denoted by regular lower-case letters, vectors by bold lower-case letters, and matrices as well as tensors by bold upper-case letters. By $\|\cdot\|$ we denote an arbitrary norm. For a symmetric positive definite matrix $\mathbf{A}$ we introduce the compact notation $\|\mathbf{x}\|_{\mathbf{A}} := \sqrt{\mathbf{x}^\top \mathbf{A}\mathbf{x}}$.
Appendix B Equivalence of Preconditioned Gradient Descent and First Order Trust Region Methods
Theorem 2 (Theorem 1 restated).
A preconditioned gradient step
with stepsize , symmetric positive definite preconditioner and minimizes a first order local model around in an ellipsoid given by in the sense that
We start the proof by noting that the optimization problem (11) is convex. Furthermore, the constraint satisfies the Slater condition since a strictly feasible point exists (e.g. the zero step). As a result, any KKT point is a feasible minimizer and vice versa.
Let denote the Lagrange dual of (5)
Any point is a KKT point if and only if the following system of equations is satisfied
For as given in Eq. (4) we have that
is zero for . As a result, is a KKT point of the convex Problem (5) which proves the assertion.
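As a sanity check of this equivalence, the following sketch verifies numerically that the preconditioned step $-\alpha \mathbf{A}^{-1}\mathbf{g}$ lies on the boundary of the ellipsoid of radius $\alpha\|\mathbf{g}\|_{\mathbf{A}^{-1}}$ and attains a model value no larger than that of random feasible points (the names `A`, `g` and `alpha` are illustrative stand-ins for the theorem's preconditioner, gradient and stepsize):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
# Symmetric positive definite preconditioner A and an arbitrary gradient g.
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)
g = rng.standard_normal(d)
alpha = 0.1

# Preconditioned gradient step s = -alpha * A^{-1} g.
s = -alpha * np.linalg.solve(A, g)

# Ellipsoid radius r = alpha * ||g||_{A^{-1}} puts the step on the boundary.
r = alpha * np.sqrt(g @ np.linalg.solve(A, g))
assert np.isclose(np.sqrt(s @ A @ s), r)

# The linear model m(s) = g^T s is minimized by s over {||s||_A <= r}:
# compare against many random feasible points.
for _ in range(1000):
    z = rng.standard_normal(d)
    z = z / np.sqrt(z @ A @ z) * r * rng.uniform()  # random point in the ellipsoid
    assert g @ s <= g @ z + 1e-10
```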
To illustrate this theoretical result we run gradient descent and Adagrad as well as the two corresponding first-order TR methods (essentially Algorithm 1 with a model based on a first-order Taylor expansion) on an ill-conditioned quadratic problem. While the uniform first-order TR method optimizes a linear model within a ball in each iteration, the ellipsoidal variant optimizes the same model over the ellipsoid given by the Adagrad matrix. The results in Figure 5 show that the methods behave very similarly to their constant-stepsize analogues.
[Figure 5: ill_conditioned_walk_withTR.pdf and ill_conditioned_loss_withTR.pdf]
Appendix C Convergence of ellipsoidal TR methods
Spherical constrained TR methods
Under standard smoothness assumptions, spherical TR algorithms achieve approximate first-order criticality within $\mathcal{O}(\epsilon^{-2})$ iterations and additionally almost positive curvature within $\mathcal{O}(\epsilon^{-3})$ iterations. These rates are attained without requiring knowledge of the Hessian's Lipschitz constant. The complete statement of the convergence results for TR methods can be found in Theorem 4.3 of . Interestingly, these rates can be improved to match the (optimal) $\mathcal{O}(\epsilon^{-3/2})$ first-order worst-case complexity of cubic regularization by applying small modifications to the TR framework. As stated in Section E.2.2, the involved subproblems do not need to be solved globally in each iteration.
For both cubic regularization and trust region methods, many stochastic extensions have emerged in the literature that alleviate the need to compute exact derivative information without losing the above-mentioned convergence guarantees with high probability [45, 76, 77, 9, 35]. For the deep learning setting, the analysis of  is most relevant since it also allows the algorithm to run solely based on sub-sampled function evaluations.
Ellipsoidal constrained TR methods
In order to prove such results for ellipsoidal trust region methods, one must ensure that the applied norms stay coherent during the entire minimization process in the sense that the ellipsoids do not flatten out (or blow up) completely along any given direction. This intuition is formalized in Assumption 1, which we restate here for the sake of clarity.
Definition 2 (Definition 1 restated).
There exists a constant such that
Having uniformly equivalent norms is necessary and sufficient to prove that ellipsoidal TR methods enjoy the same convergence rate as classical ball constrained Trust Region algorithms.
Towards this end,  identify the following sufficient condition on the basis of which we will prove that our proposed ellipsoid is indeed uniformly equivalent under some mild assumptions.
Lemma 1 (Theorem 6.7.1 in ).
Suppose that there exists a constant such that
then Definition 1 holds.
Proposition 3 (Uniform equivalence).
The basic building block of our ellipsoid matrix consists of the current and past stochastic gradients
We consider which is built up as follows (this is a generalization of the diagonal variant proposed by , which preconditions the gradient step by an elementwise division with the square root of the following estimate).
From the construction of it directly follows that for any unit length vector we have
which proves the lower bound for . Now, let us consider the upper end of the spectrum of . Towards this end, recall the geometric series expansion
and the fact that is a sum of exponentially weighted rank-one positive semi-definite matrices of the form . Thus
where the latter inequality holds per assumption for any sample size . Combining these facts we get that
As a result we have that
Finally, to achieve uniform equivalence we need the r.h.s. of (26) to be bounded by . This gives rise to a quadratic equation in , namely
which holds for any and any as long as
Such an always exists but one needs to choose smaller and smaller values as the upper bound on the gradient norm grows. For example, the usual value is valid for all .
All of the above arguments naturally extend to the diagonal preconditioner .
Interestingly, this result cannot be established for the Adagrad-inspired ellipsoid , which reflects the commonly noticed effect that the stepsizes of first-order Adagrad shrink over time as squared gradients are continuously added to the preconditioning matrix. It is mainly this effect that eventually inspired the development of RMSprop , Adadelta  and similar approaches.
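The contrast between the two preconditioners can be illustrated with a short numerical sketch (variable names are ours): the exponential moving average keeps its spectrum bounded by the largest observed squared gradient norm, in line with the geometric series argument above, while the Adagrad-style sum grows without bound.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, beta = 4, 500, 0.9

adagrad = np.zeros((d, d))
ema = np.zeros((d, d))
for t in range(T):
    g = rng.standard_normal(d)          # stand-in for a stochastic gradient
    G = np.outer(g, g)                  # rank-one PSD building block
    adagrad += G                        # Adagrad: plain sum, grows without bound
    ema = beta * ema + (1 - beta) * G   # RMSprop-style exponential moving average

# The EMA spectrum stays bounded by sup_t ||g_t||^2 (geometric series argument),
# while the largest eigenvalue of the Adagrad sum keeps growing with t.
assert np.linalg.eigvalsh(adagrad).max() > np.linalg.eigvalsh(ema).max()
```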
Appendix D Diagonal Dominance in Neural Networks
D.1 Proof of Proposition 1
Proposition 4 (Proposition 1 restated).
For a random Gaussian Wigner matrix formed as
where stands for i.i.d. draws , the expected share of diagonal mass amounts to
which simplifies to
if the diagonal and off-diagonal elements come from the same Gaussian distribution (). ∎
For the sake of simplicity we only consider Gaussian Wigner matrices but the above argument naturally extends to any distribution with positive expected absolute values, i.e. we only exclude the Dirac delta function as probability density.
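A quick Monte Carlo sketch (our own construction, assuming all entries share the same standard Gaussian marginal) illustrates the equal-variance case: the expected share of absolute mass on the diagonal is roughly $n/n^2 = 1/n$ and thus vanishes with the dimension.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 50, 200

shares = []
for _ in range(trials):
    # Symmetric matrix whose diagonal and off-diagonal entries all share the
    # same standard Gaussian marginal (the equal-variance case).
    U = np.triu(rng.standard_normal((n, n)), 1)
    W = U + U.T + np.diag(rng.standard_normal(n))
    shares.append(np.abs(np.diag(W)).sum() / np.abs(W).sum())

# n diagonal entries out of n^2 total, all with equal expected absolute value:
# the diagonal share concentrates around 1/n.
assert abs(np.mean(shares) - 1 / n) < 0.005
```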
Appendix E Background on second-order optimization
E.1 Newton's Method
The canonical second-order method is Newton's method. This algorithm uses the inverse Hessian as a scaling matrix and thus has updates of the form $\mathbf{w}_{t+1} = \mathbf{w}_t - \left(\nabla^2 f(\mathbf{w}_t)\right)^{-1}\nabla f(\mathbf{w}_t)$,
which is equivalent to optimizing the local quadratic model
to first-order stationarity. Using curvature information to rescale the steepest descent direction gives Newton's method the useful property of being linearly scale-invariant. This gives rise to a problem-independent local convergence rate that is superlinear, and even quadratic in the case of Lipschitz continuous Hessians (see Theorem 3.5 in ), whereas gradient descent at best achieves linear local convergence.
However, there are certain drawbacks associated with applying the classical Newton method. First of all, the Hessian matrix may be singular and thus not invertible. Secondly, even if it is invertible, the local quadratic model (Eq. (33)) that is minimized in each Newton iteration may simply be an inadequate approximation of the true objective. As a result, the Newton step is not necessarily a descent step; it may approximate arbitrary critical points (including local maxima) or even diverge. Finally, the costs of forming and inverting the Hessian amount to $\mathcal{O}(d^2)$ memory and $\mathcal{O}(d^3)$ time, which is prohibitively expensive for high-dimensional problems.
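For concreteness, here is a minimal sketch of the plain Newton iteration on a strictly convex toy function of our own choosing (not from the paper), illustrating the fast local convergence discussed above:

```python
import numpy as np

# Illustrative strictly convex test function (ours, not from the paper):
# f(w) = sum_i exp(w_i) - w_i, with unique minimizer w* = 0.
def grad(w):
    return np.exp(w) - 1.0

def hess(w):
    return np.diag(np.exp(w))

w = np.full(3, 2.0)
errors = []
for _ in range(7):
    # Plain Newton update: w <- w - H(w)^{-1} g(w)
    w = w - np.linalg.solve(hess(w), grad(w))
    errors.append(np.linalg.norm(w))

# Local convergence is quadratic: the error is roughly squared each iteration.
assert errors[-1] < 1e-10
```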
E.2 Trust Region Methods
E.2.1 Outer iterations
Trust region methods are among the most principled approaches to overcome the above-mentioned issues. These methods also construct a quadratic model but constrain the subproblem in such a way that the step is restricted to stay within a certain radius around the current iterate, inside of which the model is trusted to be sufficiently adequate.
Hence, contrary to line-search methods, this approach finds the step and its length simultaneously by optimizing (34). Subsequently, the actual decrease is compared to the predicted decrease and the step is only accepted if the ratio exceeds some predefined success threshold . Furthermore, the trust region radius is decreased whenever falls below and it is increased whenever exceeds the "very successful" threshold . Thereby, the algorithm adaptively measures the accuracy of the second-order Taylor model – which may change drastically over the parameter space depending on the behaviour of the higher-order derivatives (note that second-order Taylor models assume constant curvature) – and adapts the effective length along which the model is trusted accordingly. See  for more details.
As a consequence, the plain Newton step is only taken if it lies within the trust region radius and yields a certain amount of decrease in the objective value. Since many functions are well approximated by a quadratic close to a minimizer, the radius can be shown to grow asymptotically under mild assumptions, such that eventually full Newton steps are taken in every iteration, which retains the local quadratic convergence rate.
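The outer TR loop described above can be sketched as follows; the thresholds and update factors are illustrative defaults, not the paper's settings, and the subproblem is solved naively since the toy objective is convex:

```python
import numpy as np

def trust_region(f, grad, hess, w, radius=1.0, max_iter=50,
                 eta1=0.1, eta2=0.75):
    """Basic TR loop sketch with a spherical constraint; thresholds are
    illustrative defaults, not the paper's settings."""
    for _ in range(max_iter):
        g, H = grad(w), hess(w)
        # Crude subproblem solver: exact minimizer of the quadratic model,
        # clipped to the trust region boundary (fine for convex toy problems).
        s = -np.linalg.solve(H + 1e-8 * np.eye(len(w)), g)
        if np.linalg.norm(s) > radius:
            s *= radius / np.linalg.norm(s)
        pred = -(g @ s + 0.5 * s @ H @ s)            # predicted decrease
        actual = f(w) - f(w + s)                     # actual decrease
        rho = actual / pred if pred > 0 else -1.0
        if rho > eta1:                               # successful: accept the step
            w = w + s
        if rho < eta1:                               # unsuccessful: shrink radius
            radius *= 0.5
        elif rho > eta2:                             # very successful: grow radius
            radius *= 2.0
    return w

# Usage on a simple convex toy objective:
f = lambda w: np.sum(w ** 4 + w ** 2)
grad = lambda w: 4 * w ** 3 + 2 * w
hess = lambda w: np.diag(12 * w ** 2 + 2)
w_star = trust_region(f, grad, hess, np.ones(3))
assert np.linalg.norm(w_star) < 1e-6
```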
[Figure: newton_ascending.pdf and t_around_saddle.pdf]
E.2.2 Subproblem solver
Interestingly, there is no need to optimize (34) to global optimality to retain the remarkable global convergence properties of TR algorithms. From a practical point of view this is very good news, since globally optimizing the involved quadratics scales quadratically in the problem dimension, which can be very large for neural networks. Instead, it suffices to simultaneously do better than the Cauchy point and the eigenpoint, which are the model minimizers along the gradient and the eigendirection associated with the smallest eigenvalue, respectively. One way to ensure this is to minimize in nested Krylov subspaces. These subspaces naturally include the gradient direction as well as increasingly accurate estimates of the leading eigendirection
until (for example) the stopping criterion
is met, which requires increased accuracy as the underlying trust region algorithm approaches criticality. Conjugate gradients and the Lanczos method are two iterative routines that implicitly build up a conjugate and an orthogonal basis for such a Krylov space, respectively, and they converge linearly on quadratic objectives with a square-root dependency on the condition number of the Hessian . We here employ the preconditioned Steihaug-Toint CG method  in order to cope with possible boundary solutions of (34), but similar techniques exist for the Lanczos solver as well, for which we also provide code. As preconditioning matrix for CG we use the same matrix as for the ellipsoidal constraint.
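A minimal sketch of the (unpreconditioned) Steihaug-Toint CG routine may be helpful; it relies only on Hessian-vector products and exits at the trust region boundary upon encountering negative curvature or an over-long step. Function names are ours:

```python
import numpy as np

def steihaug_cg(hvp, g, radius, tol=1e-10, max_iter=100):
    """Steihaug-Toint CG sketch: minimize g^T s + 0.5 s^T H s over ||s|| <= radius,
    using only Hessian-vector products `hvp`. Stops at the boundary when it
    encounters negative curvature or would leave the trust region."""
    s = np.zeros_like(g)
    r, d = g.copy(), -g.copy()
    for _ in range(max_iter):
        Hd = hvp(d)
        dHd = d @ Hd
        if dHd <= 0:                        # negative curvature: go to the boundary
            return to_boundary(s, d, radius)
        alpha = (r @ r) / dHd
        if np.linalg.norm(s + alpha * d) >= radius:
            return to_boundary(s, d, radius)
        s = s + alpha * d
        r_new = r + alpha * Hd              # residual of the model gradient H s + g
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        d = -r_new + beta * d
        r = r_new
    return s

def to_boundary(s, d, radius):
    # Positive root tau of ||s + tau * d|| = radius.
    a, b, c = d @ d, 2 * s @ d, s @ s - radius ** 2
    tau = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
    return s + tau * d

# Interior case: for a PD Hessian and large radius we recover the Newton step.
H = np.diag([1.0, 10.0, 100.0])
g = np.array([1.0, 1.0, 1.0])
s = steihaug_cg(lambda v: H @ v, g, radius=10.0)
assert np.allclose(s, -np.linalg.solve(H, g), atol=1e-8)
```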
E.3 Damped (Gauss-)Newton methods
An alternative approach to actively constraining the region within which the model is trusted is to instead penalize the step norm in each iteration in a Lagrangian manner. This is done by so-called damped Newton methods that add a multiple of the identity matrix to the second-order term in the model, which leads to the update step (37): $\mathbf{s}_t = -\left(\mathbf{H}_t + \lambda \mathbf{I}\right)^{-1}\mathbf{g}_t$, where $\mathbf{H}_t$ and $\mathbf{g}_t$ denote the (approximate) Hessian and the gradient.
This can also be solved Hessian-free by conjugate gradients (or other Krylov subspace methods). The penalty parameter acts inversely to the trust region radius and is often updated accordingly.
Many algorithms in the existing literature replace the Hessian in (37) with the Generalized Gauss-Newton (GGN) matrix [52, 19] or an approximation of the latter . This matrix constitutes the first part of the well-known Gauss-Newton decomposition
where the first and second derivatives of the loss enter, and where we assume a scalar network output (binary classification and regression tasks) for simplicity.
It is interesting to note that the GGN matrix of neural networks is equivalent to the Fisher matrix used in natural gradient descent for convex loss functions like the mean-squared-error and cross-entropy loss. As can be seen in (38), the matrix is positive semidefinite (and low-rank if the number of samples is smaller than the dimension). As a result, there exist no second-order convergence guarantees for such methods on general non-convex problems. On the other end of the spectrum, the GGN also drops possibly positive terms from the Hessian (see (38)); hence it is not guaranteed to be an upper bound on the latter in the PSD sense. Essentially, GGN approximations assume that the network is piece-wise linear, and thus the GGN and Hessian matrices only coincide in the case of linear and ReLU activations or non-curved loss functions. For any other activation, the GGN matrix may approximate the Hessian only asymptotically, and only if the terms in (38) go to zero for all data points. In non-linear least squares such problems are called zero-residual problems, and GN methods can be shown to have quadratic local convergence there. In any other case the convergence rate does not exceed the linear local convergence bound of gradient descent. In practice, however, there are cases where deep neural nets do show negative curvature in the neighborhood of a minimizer .
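The decomposition can be made concrete on a one-neuron toy model with squared loss (all names are illustrative): the GGN term is always PSD, while the full Hessian carries an additional residual-weighted curvature term and coincides with the GGN only at zero residual.

```python
import numpy as np

# Toy model: prediction phi(w^T x) with tanh activation and squared loss
# 0.5 * (phi - y)^2. All names here are illustrative.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.3, 0.1, -0.2])
z = w @ x
phi = np.tanh(z)
dphi = 1 - np.tanh(z) ** 2                          # phi'
ddphi = -2 * np.tanh(z) * (1 - np.tanh(z) ** 2)     # phi''

def curvature(y):
    residual = phi - y
    # Gauss-Newton decomposition of the Hessian w.r.t. w:
    # H = l'' * J^T J  +  l' * (second derivative of the network output)
    ggn = dphi ** 2 * np.outer(x, x)                 # l'' = 1 for squared loss
    hess = ggn + residual * ddphi * np.outer(x, x)   # extra residual-weighted term
    return ggn, hess

# The GGN is always PSD; the full Hessian agrees with it only at zero residual.
ggn, hess = curvature(y=phi)             # zero-residual case
assert np.allclose(ggn, hess)
ggn2, hess2 = curvature(y=1.5)           # non-zero residual: the two differ
assert not np.allclose(ggn2, hess2)
assert np.linalg.eigvalsh(ggn2).min() >= -1e-12
```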
When the penalty parameter is small, the step is similar to the plain Gauss-Newton update, whereas for very large values Eq. (37) approximates a gradient descent update with a very small stepsize. Such algorithms are commonly known as Levenberg-Marquardt algorithms. They were originally tailored towards solving non-linear least squares problems  but have been proposed for neural network training early on . Finally,  propose the use of the absolute Hessian instead of the GGN matrix in a framework similar to (37). This method has been termed saddle-free Newton, even though its manifold of attraction to a given saddle is non-empty: it is the same as that of gradient descent, which renders the method unable to escape when initialized exactly on a saddle point (to be fair, this manifold of attraction constitutes a measure-zero set).
[Figure: gn_around_saddle.pdf and abs_H_around_saddle.pdf]
E.3.1 Comparison to trust region
Contrary to TR methods, however, these regularized Newton methods never take plain Newton steps since the regularization is always active. Furthermore, if a positive-definite Hessian approximation like the Generalized Gauss-Newton matrix is used, the algorithm is not capable of exploiting negative curvature, and there are cases in neural network training where the Hessian is much better conditioned than the Gauss-Newton matrix  (also see Figure 8). While some scholars believe that positive-definiteness is a desirable feature [52, 19], we want to point out that following negative curvature directions is necessary to escape saddle points. It can also be meaningful to follow directions of negative curvature outside a saddle, since they guarantee progress, whereas a gradient descent step also yields guaranteed progress (both under certain stepsize conditions), and one cannot conclude a priori which one is larger in general [23, 3]. Finally, there is a growing body of first-order algorithms that explicitly include additional negative curvature steps to derive state-of-the-art time complexities in classical minimization problems [4, 66, 14, 40] as well as in min-max problems to avoid convergence to unstructured saddle points .
Despite these theoretical considerations, many methods based on GGN matrices have been applied to neural network training (see  and references therein), and particularly the Hessian-free implementations of [52, 19] can be run very cheaply since they make use of an efficient procedure for computing GGN-vector products . While these works provide promising results,  report superior performance of TR algorithms on the autoencoder architecture presented in Section J. Finally,  and  propose an algorithm termed K-FAC that uses a non-diagonal, high-rank approximation of the Fisher information matrix, which can be incorporated into an approximate Levenberg-Marquardt scheme (when using K-FAC within a damped Newton scheme, one needs to add the damping term to the GGN approximation before inverting; however, the diagonal blocks are then no longer expressible as Kronecker products, so  must instead use an approximation to the LM approach) for both feed-forward and convolutional neural networks, since the inverse is readily available and thus no Krylov subspace methods need to be run to find the next step.
Unfortunately, we are not aware of any comprehensive empirical comparison between Gauss-Newton/Fisher Information and Hessian based approaches but we consider this to be a very interesting direction of future research.
Appendix F Using Hessian information in Neural Networks
F.1 Theoretical advantages
TR methods present several theoretical advantages which are partly illustrated in Figure 7:
Contrary to gradient descent, TR methods enjoy a superlinear local convergence rate independent of the local conditioning.
Contrary to gradient descent as well as Newton and damped Newton methods (such as Levenberg-Marquardt), TR methods can leverage negative curvature information, which is believed to be helpful for non-convex optimization [3, 23, 14]. Thereby they escape elongated plateaus and (strict) saddle points quickly and thus provably converge to second-order stationary points. In theory, saddles emerge even in shallow feed-forward networks.
Just as damped Newton methods, TR methods update their iterates based on Hessian-vector products, which can be computed at roughly the cost of two gradient computations.
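As a sketch of such Hessian-vector products, one can approximate $\mathbf{H}\mathbf{v}$ from two gradient evaluations via central differences (Pearlmutter's R-operator yields the exact product at comparable cost in autodiff frameworks); the test function below is our own:

```python
import numpy as np

def hvp(grad, w, v, eps=1e-6):
    """Approximate Hessian-vector product H(w) v from two gradient calls
    (central differences); Pearlmutter's R-operator gives the exact analogue
    at similar cost in autodiff frameworks."""
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

# Check against the analytic Hessian of f(w) = 0.5 * w^T A w (illustrative).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda w: A @ w
w = np.array([0.4, -0.7])
v = np.array([1.0, 2.0])
assert np.allclose(hvp(grad, w, v), A @ v, atol=1e-4)
```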
F.2 Practical objections
In practice, however, these theoretical advantages do not always play out. Particularly, when training neural networks one must take the following considerations into account.
While superlinear local convergence is desirable in classical optimization, the ultimate goal of any learning task is generalization, and it might thus no longer be desirable to even find a high-accuracy solution on the training data. Yet, second-order methods can also be regularized with early stopping, weight decay, dropout and similar techniques.
In practice, the value of worst-case convergence rates is questionable since they are ultimately tailored towards worst-case functions that rarely occur in real-world problems . In particular, (stochastic) gradient descent often finds a stationary point much faster than its theoretical worst-case rate predicts.
Finally, second-order methods may suffer substantially more from mini-batching than first-order methods do, since much more information is extracted from the batch in each second-order step, which leads to very noisy updates due to overfitting the batch. Yet, many real-world problems render large-batch settings impossible due to memory restrictions.
In Figure J.2 we report that small batches lead to an estimator of the Hessian's Lipschitz constant with large variance and a surprising tendency towards overestimation, which may lead to overly conservative stepsizes.
At this point we refer to Appendix H for a detailed conclusion regarding the above discussion on the basis of our experimental findings.
Appendix G Experimental results overview
G.1 Ellipsoidal Trust Region vs. Uniform Trust Region
G.2 Ellipsoidal Trust Region vs. First-order Optimizer
The above figures are intended to give the reader a quick overview of our results in terms of the loss values. Please see Appendix J for more details about gradient norm as well as train and test accuracy. There, we also provide the applied parameters, curvature computations, and reconstructed autoencoder images.
Appendix H Conclusion from Experiments
Observation I: Ellipsoidal TR methods consistently perform better than their spherical counterpart.
We find that ellipsoidal TR methods consistently and significantly outperform their spherical counterparts, in the sense that at least one of them reaches the accuracy mark faster than spherical TR in all of our tested settings, both in terms of epochs and time. Furthermore, the limit points of our proposed methods are in all cases lower than or equal to those of the uniform TR methods, and especially on the autoencoder architectures this makes an actual difference in the image reconstruction quality (see Figure 12).
Observation II: TR methods make mostly faster progress than first-order methods in terms of epochs.
In summary, Figure 4 shows that TR methods outperform stochastic gradient descent methods in terms of epochs in the sense that in each of our settings at least one of the ellipsoidal TR algorithms achieves accuracy faster than all gradient methods. One exception is the autoencoder setting in which first-order methods are superior.
Observation III: In terms of time, the stochastic ellipsoidal TR methods only just manage to keep pace with first-order methods.
As can be seen in Figure 10, the stochastic ellipsoidal TR method can roughly keep pace with first order methods but the advantage seen in the epoch plots (Fig. 4) is diminished substantially. We hence believe that advances in both hardware and distributed algorithms will be needed to enable a more widespread use of second-order methods in deep learning.
Observation IV: No spurious local minima and almost no saddles. Underparametrization can lead to severe ill-conditioning.
The limit points of all methods yield the same loss in the vast majority of our experiments, which is somewhat surprising given that we use repeated (ten) random initializations in potentially very non-convex and complicated landscapes. Nevertheless, no method gets stuck in a spurious local minimum or saddle point for long. The only method that stays above all others at times is TR , but a closer look at the gradient norm reveals that it has not yet converged to a critical point of any kind (see Appendix D). One exception is the CIFAR-10 MLP, for which TR finds much better loss values (and much higher accuracy) than the first-order methods. Again, we note that the gradient norms stay elevated for all other methods (see Figure J.4) and we thus hypothesize that this behaviour is due to the heavy under-parametrization of the network. In fact, we find that the maximum curvature in this network is substantially higher than in the Fashion-MNIST and MNIST MLPs (see Figure J.2).
As a matter of fact, the only architecture which gives rise to saddle points is the autoencoder, on which the SGD and uniform TR losses plateau (Fig. 10) while the gradient norm gets very small (e.g. Fig. J.6.3). Yet, SGD eventually escapes this plateau, which is accompanied by a sudden increase in gradient norm. We note that the ellipsoidal TR methods do find and leverage negative curvature directions to escape this area.
Observation V: Warmstart helps TR methods on curvy convnets.
We find improved performance for our second-order algorithms on the convolutional networks when starting the TR methods after a few first-order epochs. While one first-order epoch suffices for MNIST and two for Fashion-MNIST, we need as many as first-order iterations to reach full accuracy within the 2000 seconds of training time on CIFAR-10. As described in Appendix I.1, warm starting more sophisticated optimization algorithms with a few epochs of SGD is common in the literature [66, 41, 55, 27].
Appendix I Things we tried that did not work (so well)
i.1 Training ConvNets with Trust Region without warmstart
First off, we note that warm starting with vanilla SGD is standard for other techniques that aim to obtain lower-variance estimates of the optimum, e.g. variance reduction (see [66, 41] for image classification with MLPs), iterate averaging (see e.g. and references therein for learning language models with RNNs) and saddle-free Newton. The motivation behind this approach is to start with SGD in order to quickly forget the initialization and then fine-tune with more sophisticated algorithms.
Using such an initialization, we find significantly better performance on the convolutional networks, especially for the CIFAR-10 dataset. While one first-order epoch suffices for MNIST and two for Fashion-MNIST, we need as many as first-order iterations to reach full accuracy within the 2000 seconds of training time on CIFAR-10.
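The two-phase schedule can be sketched as follows; this is a minimal, self-contained illustration on a toy regression problem, where LBFGS stands in for the TR method (which PyTorch does not ship) and the model, data, and learning rate are illustrative, not the paper's settings:

```python
import torch

# Toy regression problem; the warm-start schedule, not the model, is the point.
torch.manual_seed(0)
X, y = torch.randn(64, 4), torch.randn(64, 1)
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()
init_loss = loss_fn(model(X), y).item()

# Phase 1: a few epochs of plain SGD to quickly forget the initialization
warmstart_epochs = 2                       # e.g. 1 for MNIST, 2 for Fashion-MNIST
sgd = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(warmstart_epochs):
    sgd.zero_grad()
    loss_fn(model(X), y).backward()
    sgd.step()

# Phase 2: hand the warm iterate to the more sophisticated optimizer
lbfgs = torch.optim.LBFGS(model.parameters(), max_iter=10)

def closure():
    lbfgs.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    return loss

lbfgs.step(closure)
final_loss = loss_fn(model(X), y).item()
```

The handover is trivial because both optimizers operate on the same `model.parameters()`; only the optimizer state starts fresh.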
We hypothesize that the warm start is needed because sharp landscapes can render second-order algorithms too conservative. In particular, the CIFAR-10 convnet is the smallest of the convnet architectures and yet faces the hardest dataset. We thus believe that this network is not over-parameterized enough to bring about a fairly smooth and uncomplicated loss landscape. Indeed, as reported in Figure J.2, the maximum curvature in this network is about 1000 times higher than that of the other convnets, which in turn have at least 10x higher maximum eigenvalues than the corresponding MLPs. In such a landscape, TR methods are likely to be over-conservative, since the increased curvature yields very narrow quadratic approximations (note that Figure J.2 only shows the upper end of the spectrum). However, when we inspected the conjugate gradient steps in the early subproblems after initialization, we found that those were also along very curvy directions, which leads to a tiny stepsize since CG divides the search direction by the curvature along that line. This can drastically hurt performance if the initialization is not close to any good local minimum. In this regard, note that the SGD stepsize of that we employ on the problem is way beyond the theoretical maximum of , which suggests that it is indeed meaningful to quickly navigate out of a potentially very complex initial region.
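The division by curvature is visible directly in the CG recursion, where the step length along each search direction p is alpha = r^T r / (p^T H p). A minimal NumPy sketch on a toy quadratic with one very curvy direction (the matrix and values are illustrative, not the paper's Hessians):

```python
import numpy as np

# Conjugate gradient for H x = g, i.e. minimizing 0.5 x^T H x - g^T x.
def cg(H, g, iters=10):
    x = np.zeros_like(g)
    r = g - H @ x                          # residual
    p = r.copy()                           # search direction
    for _ in range(iters):
        curv = p @ H @ p                   # curvature along p
        alpha = (r @ r) / curv             # tiny step if curv is huge
        x = x + alpha * p
        r_new = r - alpha * (H @ p)
        if np.linalg.norm(r_new) < 1e-10:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x

H = np.diag([1.0, 1000.0])                 # one ill-conditioned direction
g = np.array([1.0, 1.0])
x = cg(H, g)                               # solution H^{-1} g = [1, 0.001]
```

Along the curvy coordinate the solver barely moves, which mirrors the over-conservative behaviour described above.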
i.2 Lanczos subproblem solver
In theory, the Lanczos solver yields much better steps than CG whenever a direction of negative curvature is found. However, an efficient implementation of this routine needs matrix factorization and linear-system routines that can exploit the special banded structure of the matrices involved, such as tridiagonality. While such functions exist for CPUs (e.g. in SciPy), we could not find any GPU implementation for PyTorch, and using the standard functions for dense matrices rendered the method far too slow. Thus, all of our results are presented using the CG solver, but we think that even better performance can be achieved once an efficient Lanczos GPU implementation exists, particularly because we find that the CG solver detects directions of negative curvature quite frequently. For that purpose, we provide code of a CPU-tailored Generalized Lanczos Trust Region (GLTR) method inspired by Algorithms 7.3.4 and 7.3.6, in the hope of facilitating future development of a GPU version.
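The CPU path referred to above can be sketched as follows: the Lanczos iteration needs only Hessian-vector products and produces a small tridiagonal matrix T, whose eigenvalues (including negative curvature) can then be obtained with SciPy's banded routine. The toy Hessian is illustrative:

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

def lanczos(hvp, dim, iters=20, seed=0):
    """Lanczos iteration: build the tridiagonal T using only H-vector products."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(dim)
    q /= np.linalg.norm(q)
    q_prev, beta = np.zeros(dim), 0.0
    alphas, betas = [], []
    for _ in range(iters):
        w = hvp(q) - beta * q_prev         # three-term recurrence
        alpha = q @ w
        w -= alpha * q
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        if beta < 1e-10:                   # invariant subspace found
            break
        betas.append(beta)
        q_prev, q = q, w / beta
    return np.array(alphas), np.array(betas[:len(alphas) - 1])

H = np.diag([1.0, 2.0, 3.0, -4.0])         # toy Hessian with negative curvature
alphas, betas = lanczos(lambda v: H @ v, dim=4, iters=4)
# CPU-only step: eigenvalues of tridiagonal T via a banded routine
ritz = eigh_tridiagonal(alphas, betas, eigvals_only=True)
```

Here the Ritz values recover the spectrum of H, including the negative eigenvalue; it is exactly this banded eigen-step for which we found no GPU counterpart.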
i.3 Trust Region Adam
It is possible to design a second-order TR variant of the popular Adam optimizer. The main difference to RMSProp is that a moving average over past gradients is used instead of just the current gradient. For first-order methods this poses no problem, since only the gradient values of past iterates must be stored. In second-order TR algorithms, however, we compute Hessian-vector products in each iteration, so we would need to store all past gradients including their computation graphs in order to backprop through them a second time. While this is likely to run into memory problems, it also takes far too much time, since one Hv product would then consist of many additional backpropagations. Alternatively, one can use only for the linear part of the model and compute Hessian-vector products based solely on the current gradient. We tested this in some settings but found no significant improvement over and . Another alternative would be to store only a small number of past gradients, as is done in Adadelta, but we leave an assessment of this approach for future work.
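The memory issue stems from how Hessian-vector products are formed: a Pearlmutter-style Hv product differentiates through the gradient's own graph, so that graph must be kept alive (`create_graph=True`). A minimal sketch with a toy loss (the function is illustrative):

```python
import torch

# Hessian-vector product from the *current* gradient only (Pearlmutter's
# trick): one extra backprop per product, no explicit Hessian.
x = torch.tensor([1.0, 2.0], requires_grad=True)
loss = x[0] ** 2 * x[1]                    # toy non-convex loss

# First backprop: keep the graph so the gradient itself is differentiable
grad, = torch.autograd.grad(loss, x, create_graph=True)

v = torch.tensor([1.0, 0.0])
# Second backprop through the gradient's graph yields H v
hv, = torch.autograd.grad(grad @ v, x)
# Hessian of x0^2 * x1 at (1, 2) is [[4, 2], [2, 0]], so H v = [4, 2]
```

Averaging many past gradients would require retaining one such graph per stored gradient, which is exactly the memory and time blow-up described above.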
Appendix J Further Experiment Details
j.1 Default parameters, architectures and datasets
Table 1 reports the default parameters we consider. We grid-searched a subset of them (e.g. the learning rate of the gradient methods) and report the problem-specific values in each corresponding subsection whenever they deviate from the stated defaults. For the second-order methods, we chose the values to maximize performance in terms of time; other settings are likely to improve performance over epochs and backprops considerably (e.g. a lower ). To compare all methods on the same basis, we run all first-order algorithms without momentum. Yet, we consider momentum an interesting direction of future research and note that it can also be applied to second-order methods (see e.g. [57, 54]).
For the sake of comparability, TR and TR are always run with the same parameters, so the only difference lies in the trust-region shape. TR shares most of these parameters, but we sometimes start with a smaller TR radius and use a stronger decrease factor (), since we observed that RMSprop usually also requires much smaller learning rates than SGD and Adagrad.
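The radius update that these parameters control follows the standard trust-region scheme (cf. Conn et al., 2000); the thresholds and factors below are illustrative textbook defaults, not our tuned values:

```python
# Standard trust-region radius update based on the ratio
#   rho = actual reduction / predicted reduction of the quadratic model.
def update_radius(rho, radius, gamma_dec=0.5, gamma_inc=2.0, radius_max=100.0):
    if rho < 0.25:                          # poor model fit: shrink the region
        return gamma_dec * radius           # a stronger decrease means smaller gamma_dec
    if rho > 0.75:                          # very good fit: allow larger steps
        return min(gamma_inc * radius, radius_max)
    return radius                           # otherwise keep the radius unchanged

shrunk = update_radius(0.1, 1.0)            # 0.5
grown = update_radius(0.9, 1.0)             # 2.0
kept = update_radius(0.5, 1.0)              # 1.0
```

A stronger decrease factor, as used for the RMSProp-like variant, simply shrinks the radius more aggressively after a rejected or poorly predicted step.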
We use three real-world datasets for image classification, namely CIFAR-10, MNIST and Fashion-MNIST (all three datasets were accessed from https://www.tensorflow.org/api_docs/python/tf/keras/datasets). While MNIST and Fashion-MNIST are greyscale images, CIFAR-10 consists of colored images of size . All three datasets have a fixed training-test split consisting of 50,000 (60,000 for Fashion-MNIST) and 10,000 images, respectively.
The MLP architectures are simple. For MNIST and Fashion-MNIST we use a network with tanh activations and a cross-entropy loss; this network has parameters. For the CIFAR-10 MLP we use a architecture, also with tanh activations and cross-entropy loss; this network has parameters.
The MNIST and Fashion-MNIST autoencoders have the same architecture as the ones used in [38, 76, 52, 54]. The encoder structure is and the decoder is mirrored. Sigmoid activations are used in all but the central layer. The reconstructed images are fed pixelwise into a binary cross-entropy loss. The network has a total of parameters. The CIFAR-10 autoencoder is taken from the implementation at https://github.com/jellycsc/PyTorch-CIFAR-10-autoencoder.
We use fairly small convnets taken from official PyTorch tutorials. Specifically, the (Fashion-)MNIST network is specified in https://github.com/pytorch/examples/blob/master/mnist/main.py and the CIFAR-10 network can be found under https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html. The total numbers of parameters in these networks amount to and , respectively.
In all of our experiments, each method was run on one Tesla P100 GPU using the PyTorch library.
j.2 Observed curvature
Figure J.2: Maximum eigenvalue of the network's Hessian over time when training different networks with RMSProp. Average and 95% confidence interval over 10 random initializations.
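The maximum Hessian eigenvalue tracked in this figure can be estimated without forming the Hessian, via power iteration on Hessian-vector products; a minimal sketch on a toy quadratic whose Hessian is known (the matrix and iteration count are illustrative):

```python
import torch

# Power iteration on the Hessian via Hessian-vector products: estimates the
# maximum-magnitude eigenvalue; the quadratic loss stands in for a network loss.
torch.manual_seed(0)
A = torch.diag(torch.tensor([5.0, 2.0, 1.0]))   # Hessian of the loss below
x = torch.randn(3, requires_grad=True)
loss = 0.5 * x @ A @ x

grad, = torch.autograd.grad(loss, x, create_graph=True)

v = torch.randn(3)
v = v / v.norm()
for _ in range(50):                              # power iteration on H
    hv, = torch.autograd.grad(grad @ v, x, retain_graph=True)
    v = hv / hv.norm()
hv, = torch.autograd.grad(grad @ v, x, retain_graph=True)
lam_max = (v @ hv).item()                        # Rayleigh quotient, ≈ 5.0 here
```

For a real network one replaces the toy loss by a (mini-batch) training loss and `x` by the flattened parameters; each iteration then costs one extra backpropagation.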
j.3 Reconstructed Images from Autoencoders
j.4.1 CIFAR-10 ConvNet
Figure: CIFAR-10 ConvNet (time). TR methods use the default parameters of Table 1 but replace and take 20 epochs of warmstart with RMSProp. First-order methods use 128 samples and , , .
j.4.2 CIFAR-10 MLP
Figure: CIFAR-10 MLP (time). TR methods use the default parameters of Table 1. First-order methods use 128 samples and , , .
j.4.3 CIFAR-10 Autoencoder
Figure: CIFAR-10 autoencoder (time). TR methods use the default parameters of Table 1. First-order methods use 128 samples and , , .
j.5.1 MNIST ConvNet
Figure: MNIST ConvNet (time). TR methods use the default parameters of Table 1 but do one epoch of warmstart with RMSProp, use 1024 samples initially and replace . First-order methods use 128 samples and , , .
j.5.2 MNIST MLP
Figure: MNIST MLP (time). TR methods use the default parameters of Table 1 but replace . First-order methods use 128 samples and , , .
j.5.3 MNIST Autoencoder
Figure: MNIST autoencoder (time). TR methods use the default parameters of Table 1 but replace . First-order methods use 128 samples and , , .
j.6.1 Fashion-MNIST ConvNet
Figure: Fashion-MNIST ConvNet (time). TR methods use the default parameters of Table 1 but do two epochs of warmstart with RMSProp, start with 2048 samples and replace . TR and TR use . First-order methods use 128 samples and , , .
j.6.2 Fashion-MNIST MLP
Figure: Fashion-MNIST MLP (time). TR methods use the default parameters of Table 1 but replace . First-order methods use 128 samples and , , .
j.6.3 Fashion-MNIST Autoencoder
Figure: Fashion-MNIST autoencoder (time). TR methods use the default parameters of Table 1 but replace . TR uses . First-order methods use 128 samples and , , .