Ellipsoidal Trust Region Methods and the Marginal Value of Hessian Information for Neural Network Training

05/22/2019 · Leonard Adolphs et al. · ETH Zurich

We investigate the use of ellipsoidal trust region constraints for second-order optimization of neural networks. This approach can be seen as a higher-order counterpart of adaptive gradient methods, which we here show to be interpretable as first-order trust region methods with ellipsoidal constraints. In particular, we show that the preconditioning matrix used in RMSProp and Adam satisfies the necessary conditions for convergence of (first- and) second-order trust region methods and report that this ellipsoidal constraint consistently outperforms its spherical counterpart in practice. We furthermore set out to clarify the long-standing question of the potential superiority of Newton methods in deep learning. In this regard, we run extensive benchmarks across different datasets and architectures and find that performance comparable to gradient descent algorithms can be achieved, but that using Hessian information does not give rise to better limit points and comes at the cost of increased hyperparameter tuning.


1 Introduction

We consider finite-sum optimization problems of the form

$$\min_{\mathbf{w}\in\mathbb{R}^d} f(\mathbf{w}) := \frac{1}{n}\sum_{i=1}^{n} \ell\big(h_{\mathbf{w}}(\mathbf{x}_i), y_i\big), \qquad (1)$$

which typically arise in neural network training, e.g. for empirical risk minimization over a set of data points $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$. Here, $\ell$ is a convex loss and $h_{\mathbf{w}}$ represents the neural network mapping parameterized by $\mathbf{w}$, which is non-convex due to its multiplicative nature and potentially non-linear activation functions. We assume that $f$ is twice differentiable, i.e. $f \in C^2(\mathbb{R}^d, \mathbb{R})$. Non-convex optimization problems are ubiquitous in machine learning. Among the most prominent examples are present-day deep neural networks, which have achieved outstanding results on core tasks such as collaborative filtering (wang2015collaborative), sentence classification (kim2014convolutional) and image classification (krizhevsky2012imagenet).

In the era of big data and deep neural networks, stochastic gradient descent (SGD) is one of the most widely used training algorithms (bottou2010large). What makes SGD so attractive is its simplicity, its seeming universality, and a per-iteration cost that is independent of the size of the training set and scales linearly in the problem dimension $d$. However, gradient descent is known to be inadequate for optimizing functions that are not well-conditioned (nesterov2013introductory; shalev2017failures), and thus adaptive first-order methods that employ dynamic, coordinate-wise learning rates based on past gradients – including Adagrad (duchi2011adaptive), RMSprop (tieleman2012lecture) and Adam (kingma2014adam) – have become a popular alternative to SGD for training neural networks, often providing significant speed-ups over plain SGD in practice. Yet, there exists no theoretical proof that these methods are faster than gradient descent (li2018convergence).

From a theoretical perspective, Newton methods provide stronger convergence guarantees by appropriately transforming the gradient in ill-conditioned regions according to second-order derivatives. It is precisely this Hessian information that allows regularized Newton methods to enjoy superlinear local convergence as well as to provably escape saddle points (conn2000trust), which can slow gradient methods down significantly (du2017gradient) and are believed to be prevalent in high-dimensional problems (dauphin2014identifying). While second-order algorithms have a long-standing history even in the realm of neural network training (hagan1994training; becker1988improving), they were mostly considered too expensive in computation and memory for practical applications. Yet, the seminal work of martens2010deep renewed interest in their use in deep learning by proposing efficient Hessian-free methods that only access second-order information via matrix-vector products, which can be computed at the cost of an additional backpropagation (pearlmutter1994fast; schraudolph2002fast). Among the class of regularized Newton methods, trust region (conn2000trust) and cubic regularization algorithms (cartis2011adaptive) are the most principled approaches, as they yield the strongest convergence guarantees. Recently, stochastic extensions have emerged (xu2017newton; yao2018inexact; kohler2017sub; gratton2017complexity), which suggest their applicability for deep learning.
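To make the Hessian-free access pattern concrete, the following is a minimal PyTorch sketch of a Hessian-vector product via double backpropagation (the Pearlmutter trick). The toy model, tensor names and shapes are illustrative, not the setup used in our experiments.

```python
# Minimal sketch of a Hessian-vector product via double backpropagation
# (Pearlmutter, 1994). Names (`params`, `v`) and the toy model are illustrative.
import torch

def hessian_vector_product(loss, params, v):
    """Return H v, where H is the Hessian of `loss` w.r.t. `params`.

    `params` is a list of tensors with requires_grad=True and `v` is a
    list of tensors with matching shapes.
    """
    # First backward pass: keep the graph so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Inner product <g, v>, then a second backward pass yields H v.
    gv = sum((g * vi).sum() for g, vi in zip(grads, v))
    return torch.autograd.grad(gv, params)

# Toy usage: squared loss of a tiny linear model.
w = torch.randn(5, requires_grad=True)
x, y = torch.randn(8, 5), torch.randn(8)
loss = ((x @ w - y) ** 2).mean()
v = [torch.randn(5)]
hv = hessian_vector_product(loss, [w], v)   # tuple with one tensor of shape (5,)
```

Each such product costs roughly one extra backward pass, which is what makes Krylov-subspace solvers for the trust region subproblem affordable in this setting.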

Yet, several practical considerations cast doubt on the theoretical superiority of second-order methods when it comes to neural network training. Answers to the following questions are particularly unclear: Are saddles even an issue in deep learning? Is superlinear local convergence a desirable feature in machine learning applications (test error)? Are second-order methods more "vulnerable" to sub-sampling noise? Do worst-case iteration complexities even matter in real-world settings? As a result, the value of Hessian information in neural network training is somewhat unclear a priori, and so far a conclusive empirical study is still missing. We are only aware of the works of liu2018stochastic; martens2015optimizing; martens2010deep; chapelle2011improved and xu2017second, which deliver promising first results, but their empirical studies are by no means fully encompassing.

We thus here want to shed more light on the issue of first- vs. second-order optimization for neural networks. To do so, we equip the fully stochastic STORM algorithm chen2018stochastic ; gratton2017complexity with ellipsoidal norm constraints (Section 4) and then perform an extensive set of experiments where we evaluate this algorithm against standard first-order methods (Section 5). Along the way, we give the following alternative interpretation of adaptive methods (Section 3): While gradient descent can be interpreted as a spherically constrained first-order TR method, preconditioned gradient methods – such as Adagrad – can be seen as first-order TR methods with ellipsoidal trust region constraint.

This observation is particularly interesting, since spherical constraints are blind to the underlying geometry of the problem but ellipsoids can adapt to local landscape characteristics, thereby allowing for more suitable steps in regions that are ill-conditioned. We will leverage this analogy and investigate the use of the Adagrad and RMSProp preconditioning matrices as ellipsoidal trust region shapes within second-order TR methods. While the theory for ellipsoidal TR methods is well-studied (e.g. conn2000trust ; yuan2015recent ), no ellipsoid fits all objective functions and our contribution thus lies in the identification of adequate matrix-induced constraints for the specific task of deep learning.

To sum up, our contribution is threefold:

  • We provide a new perspective on adaptive gradient methods that contributes to a better understanding of their inner workings. Furthermore, we find that diagonal preconditioning is very effective, since we observe empirically that many neural network problems exhibit diagonally dominant Hessian matrices.

  • We investigate the first application of ellipsoidal TR methods to deep learning. We show (Proposition 2) that the RMSProp preconditioning matrix can directly be applied as an ellipsoid in second-order TR algorithms while preserving convergence guarantees.

  • Finally, we provide a comprehensive benchmark of first- vs. second-order methods across different real-world datasets and architectures. We benchmark adaptive gradient methods and show results in terms of both epochs and wall-clock time, a comparison we were not able to find in the literature.

Our main empirical results demonstrate that the net value of Hessian information is marginal. While our second-order methods are mostly faster in terms of epochs, they only just manage to keep pace with stochastic gradient methods in terms of wall-clock time. Furthermore, they come at the cost of an increased number of hyperparameters and never converge to substantially better limit points. The latter observation suggests that neither saddles nor spurious local minima pose a major problem for first-order methods in our settings. As a result, future advances in hardware and distributed optimization are needed to give Newton methods an edge over gradient-based optimizers.

Finally – and more importantly for the work at hand – the ellipsoidal constraints we propose do prove to be a very effective modification of the trust region method in the sense that they consistently outperform the spherical TR method, both in terms of time and asymptotic loss value, on a variety of tasks and datasets (see Section 5).

2 Related work

First-order methods

The prototypical method for optimizing Eq. (1) is SGD (robbins1985stochastic). While the practical success of SGD in non-convex optimization is unquestioned, the theoretical foundation of this phenomenon is still rather limited. Recent findings suggest the ability of this method to escape saddle points and reach local minima in polynomial time for general non-convex problems, but they either need to artificially add noise to the iterates (ge2015escaping; lee2016gradient) or make an assumption on the inherent noise of vanilla SGD (daneshmand2018escaping). For neural network training, a recent line of research proclaims the effectiveness of SGD, but the results usually come at the cost of fairly strong assumptions such as heavy overparametrization and Gaussian inputs (du2017gradient; brutzkus2017globally; li2017convergence; du2018power; allen2018convergence). Adaptive gradient methods (duchi2011adaptive; tieleman2012lecture; kingma2014adam) build on the intuition that larger learning rates for smaller gradient components, and smaller learning rates for larger gradient components, balance their respective influences and thereby make the methods behave as if they were optimizing a more isotropic surface. This intuition is put into practice by pre-conditioning the update direction with a diagonal matrix of squared past gradients. Such approaches were first suggested for use in neural networks by lecun2012efficient. Recently, convergence guarantees for such methods have started to appear (ward2018adagrad; li2018convergence). However, these do not improve on the worst-case iteration complexity of standard gradient descent (cartis2012much), and it is yet to be explained rigorously under which conditions adaptive gradient methods can provably accelerate optimization.

Regularized Newton methods

The most principled classes of regularized Newton methods are trust region (TR) and adaptive cubic regularization algorithms (ARC) (conn2000trust; cartis2011adaptive), which repeatedly optimize a local Taylor model of the objective while making sure that the step does not travel too far, so that the model stays accurate. While the former finds first-order stationary points within $\mathcal{O}(\epsilon^{-2})$ iterations, ARC needs at most $\mathcal{O}(\epsilon^{-3/2})$. However, simple modifications to the TR framework allow these methods to obtain the same accelerated rate (curtis2017trust). Both methods take at most $\mathcal{O}(\epsilon^{-3})$ iterations to find an approximate second-order stationary point (cartis2012complexity). These rates are optimal for second-order Lipschitz continuous functions (carmon2017lower; cartis2012complexity), and they can be retained even when only sub-sampled gradient and Hessian information is used (kohler2017sub; yao2018inexact; xu2017newton; blanchet2016convergence; liu2018stochastic; cartis2017global). Furthermore, the involved Hessian information can be computed solely based on Hessian-vector products, which can be implemented efficiently for neural networks (pearlmutter1994fast). This makes these methods particularly attractive for deep learning, but the empirical evidence of their applicability is so far very limited. We are only aware of the works of liu2018stochastic and xu2017second, which report promising first results, but these are by no means fully encompassing. While the former does not benchmark any first-order method at all, the latter uses large batch sizes and does not train convolutional networks nor benchmark any adaptive gradient method. Furthermore, the results are only reported in terms of backpropagations and not over wall-clock time or epochs.

Gauss-Newton methods

Inspired by natural gradient descent algorithms (amari1998natural), an interesting line of research proposes to replace the Hessian by (approximations of) the Generalized Gauss-Newton matrix (GGN) within a Levenberg-Marquardt framework¹ for the optimization of neural networks (lecun2012efficient; martens2010deep; chapelle2011improved; martens2015optimizing). These methods also optimize a regularized quadratic model in each iteration and have been termed Hessian-free since only access to GGN-vector products is required. As the GGN matrix is always positive semidefinite, they cannot leverage negative curvature to escape saddles, and hence there exist no second-order convergence guarantees. Furthermore, there are cases in neural network training where the Hessian is much better conditioned than the Gauss-Newton matrix (mizutani2008second). Nevertheless, the above works report promising preliminary results, and most notably grosse2016kronecker report that the Hessian-free K-FAC method can be faster than SGD when training a convnet.

¹ This algorithm is a simplified TR method, originally tailored to non-linear least squares problems (nocedal2006nonlinear).

In general, this line of work is to be seen as complementary to our approach, since it is straightforward to replace the Hessian in the TR framework with the GGN matrix. Furthermore, the preconditioners used in martens2010deep and chapelle2011improved, namely diagonal estimates of the empirical Fisher and Fisher matrix respectively, can directly be used as matrix norms in our ellipsoidal TR framework. A closer examination of the difference between Hessian and GGN information for neural network training is an interesting direction of future research.

3 An alternative view on adaptive gradient methods

Adaptively preconditioned gradient methods update their iterates as

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\, \mathbf{A}_t^{-1} \mathbf{g}_t,$$

where $\mathbf{g}_t$ is a stochastic estimate of $\nabla f(\mathbf{w}_t)$ and $\mathbf{A}_t$ is a positive definite symmetric pre-conditioning matrix. In Adagrad, $\mathbf{A}_t$ is the un-centered second moment matrix of the past gradients, which we compute as follows

$$\mathbf{A}_t = \Big(\sum_{\tau=1}^{t} \mathbf{g}_\tau \mathbf{g}_\tau^\top\Big)^{1/2} + \epsilon\,\mathbf{I}, \qquad (2)$$

where $\epsilon > 0$, $\mathbf{I}$ is the identity matrix and $\mathbf{g}_\tau$ denotes the stochastic gradient at iteration $\tau$. Building up on the intuition that past gradients might become obsolete in quickly changing non-convex landscapes, RMSprop (and Adam) introduce an exponential weight decay on these terms, leading to the preconditioning matrix

$$\mathbf{A}_t = \Big((1-\beta)\sum_{\tau=1}^{t} \beta^{t-\tau}\, \mathbf{g}_\tau \mathbf{g}_\tau^\top\Big)^{1/2} + \epsilon\,\mathbf{I}, \qquad (3)$$

where $\beta \in (0,1)$. In order to save computational effort, the diagonal versions $\mathrm{diag}(\mathbf{A}_t)$ are more commonly applied in practice, which in turn gives rise to coordinate-wise adaptive stepsizes that are enlarged (reduced) in coordinates that have seen past gradient components of smaller (larger) magnitude. In that way, the optimization method can account for gradients of potentially different scales arising e.g. from different layers of the network.
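For concreteness, the following sketch shows the diagonal variants of the two accumulators described above. The constants (learning rate, beta, eps) are illustrative defaults rather than the settings used in this work, and the state handling is simplified.

```python
# Sketch of the *diagonal* Adagrad and RMSProp preconditioners discussed above.
# The accumulators mirror Eqs. (2) and (3) in their diagonal form; constants
# (lr, beta, eps) are illustrative defaults, not the paper's settings.
import numpy as np

def adagrad_step(w, grad, state, lr=0.1, eps=1e-8):
    state["sq_sum"] = state.get("sq_sum", np.zeros_like(w)) + grad ** 2
    precond = np.sqrt(state["sq_sum"]) + eps           # diag(A_t), Adagrad
    return w - lr * grad / precond

def rmsprop_step(w, grad, state, lr=0.01, beta=0.9, eps=1e-8):
    state["sq_avg"] = beta * state.get("sq_avg", np.zeros_like(w)) \
                      + (1.0 - beta) * grad ** 2        # exponential average
    precond = np.sqrt(state["sq_avg"]) + eps            # diag(A_t), RMSProp
    return w - lr * grad / precond
```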

3.1 Adaptive preconditioning as ellipsoidal Trust Region

Starting from the fact that adaptive methods employ coordinate-wise stepsizes, one can take a principled view on these methods. Namely, their update steps arise from minimizing a first-order Taylor model of the function within an ellipsoidal search space around the current iterate $\mathbf{w}_t$, where the diameter of the ellipsoid along a certain coordinate is proportional to the associated stepsize. Correspondingly, vanilla SGD optimizes the same first-order model within a spherical constraint. Fig. 1 (top) illustrates this effect on three quadratic objectives by showing not only the iterates of GD and Adagrad but also the implicit trust regions within which the local models were optimized at each step.² Since the models are linear, the constrained minimizer is always found on the boundary of the current trust region.

² For illustrative purposes we only plot every other trust region.

Figure 1: Top: Iterates and implicit trust regions of GD and Adagrad on quadratic objectives with different condition numbers. Bottom: Average log suboptimality over iterations as well as 90% confidence intervals of 30 runs with random initialization.

It is a well-known fact that GD struggles to progress towards the minimizer of quadratics along low-curvature directions, since the gradient has significantly smaller components along these directions than along high-curvature ones (see e.g. goh2017why). For fairly well-conditioned objectives this effect is negligible, as can be seen on the left-hand side of Fig. 1. Yet, the plots in the middle show that this effect can drastically slow GD down when the problem is ill-conditioned. In particular, once the method has reached the bottom of the valley it struggles to make progress along the horizontal axis. This observation resembles the classical zig-zagging effect (nocedal2006nonlinear) and is precisely where the advantage of adaptive stepsize methods comes into play. As illustrated by the dashed trust region lines, Adagrad's search space is damped along the direction of high curvature (vertical axis) and elongated along the low-curvature direction (horizontal axis). This preconditioning allows the method to move more horizontally early on and thus enter the valley with a smaller distance to the optimizer along the low-curvature direction, which effectively circumvents the zig-zagging effect.
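The effect is easy to reproduce on a toy problem. The sketch below runs plain GD and diagonal Adagrad on a small ill-conditioned, axis-aligned quadratic; the condition number and stepsizes are arbitrary illustrative choices, not the ones behind Fig. 1.

```python
# Toy illustration of the effect in Fig. 1: GD vs. diagonal Adagrad on an
# ill-conditioned, axis-aligned quadratic. Constants are illustrative.
import numpy as np

H = np.diag([100.0, 1.0])                       # ill-conditioned Hessian, f(w) = 0.5 w^T H w
f = lambda w: 0.5 * w @ H @ w
grad = lambda w: H @ w

w0 = np.array([1.0, 1.0])
w_gd, w_ada = w0.copy(), w0.copy()
acc = np.zeros(2)                               # Adagrad accumulator of squared gradients
for _ in range(200):
    w_gd = w_gd - (0.9 / 100.0) * grad(w_gd)    # GD with a stable stepsize below 2/L
    g = grad(w_ada)
    acc += g ** 2
    w_ada = w_ada - 0.5 * g / (np.sqrt(acc) + 1e-8)   # diagonal Adagrad step

print(f(w_gd), f(w_ada))   # Adagrad ends up far closer to the optimum along the flat axis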

Let us now formally establish the result that allows us to re-interpret adaptive gradient methods from the trust region perspective introduced above.

Theorem 1 (Preconditioned gradient methods as TR).

A preconditioned gradient step

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\,\mathbf{A}^{-1}\mathbf{g}_t \qquad (4)$$

with stepsize $\eta > 0$, symmetric positive definite preconditioner $\mathbf{A}$ and gradient estimate $\mathbf{g}_t$ minimizes a first-order model around $\mathbf{w}_t$ in an ellipsoid given by $\mathbf{A}$ in the sense that

$$\mathbf{w}_{t+1} = \operatorname*{arg\,min}_{\mathbf{w}\in\mathbb{R}^d} \; \mathbf{g}_t^\top(\mathbf{w}-\mathbf{w}_t) \quad \text{s.t.}\quad \|\mathbf{w}-\mathbf{w}_t\|_{\mathbf{A}} \le \eta\,\|\mathbf{g}_t\|_{\mathbf{A}^{-1}}. \qquad (5)$$

Corollary 1 (RMSProp).

The RMSProp step $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\,\mathbf{A}_t^{-1}\mathbf{g}_t$ minimizes a first-order Taylor model around $\mathbf{w}_t$ in an ellipsoid given by $\mathbf{A}_t$ (Eq. (3)) in the sense that

$$\mathbf{w}_{t+1} = \operatorname*{arg\,min}_{\mathbf{w}\in\mathbb{R}^d} \; \mathbf{g}_t^\top(\mathbf{w}-\mathbf{w}_t) \quad \text{s.t.}\quad \|\mathbf{w}-\mathbf{w}_t\|_{\mathbf{A}_t} \le \eta\,\|\mathbf{g}_t\|_{\mathbf{A}_t^{-1}}. \qquad (6)$$

Equivalent results can be established for Adam as well as for Adagrad by substituting the respective preconditioning matrix into the constraint in Eq. (6). Of course, the update procedure in Eq. (5) is merely a re-interpretation of the original preconditioned update Eq. (4), and thus the employed trust region radii are defined implicitly by the current gradient and stepsize. However, it is straightforward to embed this approach into a proper TR framework in which the trust region radius is defined explicitly by the algorithm, which gives rise to first-order TR methods that behave very similarly to GD and Adagrad in practice (see Fig. 5 in the Appendix).
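This re-interpretation can also be checked numerically: under the constraint form reconstructed above, the preconditioned step is feasible, lies on the boundary of the ellipsoid, and no other feasible point attains a smaller value of the linear model. The following sketch performs this sanity check on random data; all quantities are synthetic.

```python
# Numerical sanity check of the trust-region reading of a preconditioned step
# (cf. Theorem 1), assuming the constraint ||w - w_t||_A <= eta * ||g||_{A^{-1}}:
# the step lies on the ellipsoid boundary and minimizes the linear model there.
import numpy as np

rng = np.random.default_rng(0)
d, eta = 5, 0.1
A = rng.standard_normal((d, d)); A = A @ A.T + np.eye(d)   # random SPD preconditioner
g = rng.standard_normal(d)

step = -eta * np.linalg.solve(A, g)                        # preconditioned step -eta A^{-1} g
norm_A = lambda x: np.sqrt(x @ A @ x)
radius = norm_A(step)                                      # implicit trust region radius

assert np.isclose(norm_A(step), radius)                    # the step sits on the boundary
for _ in range(1000):                                      # random feasible points
    u = rng.standard_normal(d)
    u = u / norm_A(u) * radius * rng.random()              # rescale to lie inside the ellipsoid
    assert g @ step <= g @ u + 1e-12                       # no feasible point beats the step
print("preconditioned step minimizes the linear model over the ellipsoid")
```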

3.2 Diagonal versus full preconditioning

A closer look at Fig. 1 reveals that the first two problems come with level sets that are perfectly axis-aligned, which makes these objectives particularly attractive for diagonal preconditioning. For comparison, the far right of Fig. 1 reports a further quadratic problem instance in which the Hessian is no longer zero on the off-diagonals. As can be seen, the interaction between coordinates introduces a tilt in the level sets and reduces the superiority of diagonal Adagrad over plain GD. However, using the full preconditioner re-establishes the original speed-up. Yet, non-diagonal preconditioning comes at a higher computational per-iteration cost, since it involves taking the inverse square root of a large matrix, which is why this approach has remained relatively unexplored (see agarwal2018case for a recent exception). In the following section, we find that such full-matrix preconditioning is likely not worth the effort for neural networks, since they show a surprisingly large degree of axis alignment.

3.3 Are neural networks axis-aligned?

Early results by becker1988improving on the curvature structure of neural networks report a strong diagonal dominance of the Hessian matrix $\nabla^2 f(\mathbf{w})$. This suggests that the loss surface is indeed somewhat axis-aligned, which would render full-matrix preconditioning obsolete. However, the reported numbers are only for tiny feed-forward networks of at most 256 parameters. Therefore, we generalize these findings in the following to larger networks and contrast the diagonal dominance of real Hessian matrices with the expected behaviour of random Wigner matrices. For this purpose, we define the diagonal mass of a matrix $\mathbf{B}$ as the ratio of diagonal to overall absolute mass, i.e. $\sum_i |B_{ii}| \,/\, \sum_{i,j} |B_{ij}|$.

Proposition 1.

For a random Gaussian³ Wigner matrix (Eq. (29)), the expected share of diagonal mass is of order $1/d$, where $d$ denotes the matrix dimension.

³ The argument naturally extends to any distribution with positive expected absolute values.

Thus, if we supposed the Hessian at any given point were a random Wigner matrix, we would expect the share of diagonal mass to fall as $1/d$ as the network grows in size. Yet, as can be seen in Fig. 2, the diagonal mass of real-world neural networks stays well above this theoretical value at random initialization, during training and after convergence. These findings are in line with becker1988improving and thus question the effectiveness of full-matrix preconditioning and full-matrix ellipsoidal TR methods.
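The diagonal-mass statistic is straightforward to compute. The sketch below evaluates it for random symmetric Gaussian matrices and compares it against a $1/d$ reference, illustrating the decay predicted by Proposition 1; the chosen diagonal versus off-diagonal variances are an arbitrary convention and may differ from the exact Wigner model of Eq. (29).

```python
# Sketch of the diagonal-mass ratio from Section 3.3 and a Monte Carlo estimate
# of its value for random symmetric Gaussian matrices (roughly O(1/d)).
import numpy as np

def diagonal_mass(B):
    """Share of absolute mass on the diagonal of a square matrix B."""
    B = np.abs(B)
    return np.trace(B) / B.sum()

rng = np.random.default_rng(0)
for d in (16, 64, 256):
    samples = []
    for _ in range(50):
        W = rng.standard_normal((d, d))
        W = (W + W.T) / 2.0                      # symmetrize into a Wigner-type matrix
        samples.append(diagonal_mass(W))
    print(d, np.mean(samples), 1.0 / d)          # empirical share vs. 1/d reference
```

For a trained network one would apply `diagonal_mass` to (an estimate of) the Hessian instead of a random matrix, which is how the curves in Figure 2 are to be read.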

Figure 2 (panels: Convnet, MLP): Diagonal mass of the Hessian relative to that of a corresponding Wigner matrix at random initialization, in the middle and at the end of training with RMSProp on CIFAR-10. Average and 95% confidence interval over 10 independent runs. See Figure 6 for MNIST results.

4 Second-order Trust Region Methods

Cubic regularization (nesterov2006cubic; cartis2011adaptive) and trust region methods belong to the family of globalized Newton methods. Both frameworks compute their parameter updates by optimizing regularized (former) or constrained (latter) second-order Taylor models of the objective around the current iterate $\mathbf{w}_t$. In particular, in iteration $t$ the update step of the trust region algorithm is computed as

$$\mathbf{s}_t = \operatorname*{arg\,min}_{\mathbf{s}\in\mathbb{R}^d}\; m_t(\mathbf{s}) := f(\mathbf{w}_t) + \mathbf{g}_t^\top\mathbf{s} + \tfrac{1}{2}\,\mathbf{s}^\top\mathbf{B}_t\mathbf{s} \quad \text{s.t.}\quad \|\mathbf{s}\|_{\mathbf{M}_t} \le \Delta_t, \qquad (7)$$

where $\Delta_t > 0$ is the trust region radius and $\mathbf{g}_t$ and $\mathbf{B}_t$ are either $\nabla f(\mathbf{w}_t)$ and $\nabla^2 f(\mathbf{w}_t)$ or suitable approximations thereof. The matrix $\mathbf{M}_t$ induces the shape of the constraint set. So far, the common choice for neural networks is $\mathbf{M}_t = \mathbf{I}$, which gives rise to spherical trust regions (xu2017second; liu2018stochastic). By solving the constrained problem (7), TR methods overcome the problem that pure Newton steps may be ascending, attracted by saddles, or not even computable in the case of indefinite Hessians. See Section 2 and Appendix B for more details.

4.1 Convergence of ellipsoidal Trust Region methods

Inspired by the success of adaptive gradient methods such as RMSProp, we investigate the use of their preconditioning matrices as norm inducing matrices for second-order TR methods. One crucial condition for convergence is that the applied norms are not degenerate during the minimization process in the sense that the ellipsoids do not flatten out completely along any given direction.

Definition 1 (Uniformly equivalent norms).

The norm induced by a symmetric positive definite matrix $\mathbf{M}_t$ is called uniformly equivalent if there exists a constant $\mu \ge 1$ such that

$$\frac{1}{\mu}\,\|\mathbf{x}\|_2 \;\le\; \|\mathbf{x}\|_{\mathbf{M}_t} \;\le\; \mu\,\|\mathbf{x}\|_2 \qquad \forall\,\mathbf{x}\in\mathbb{R}^d,\ \forall\,t. \qquad (8)$$

We now establish a result which shows that the RMSProp ellipsoid is indeed uniformly equivalent.

Proposition 2 (Uniform equivalence).

Suppose that the stochastic gradients are bounded, i.e. $\|\mathbf{g}_t\|_2 \le g_{\max}$ for all $t$, and that $\epsilon > 0$. Then there always exists a $\beta \in (0,1)$ such that the proposed preconditioning matrix (Eq. (3)) is uniformly equivalent, i.e. Def. 1 holds. The same holds for the diagonal variant.

Consequently, these ellipsoids can directly be applied within any convergent TR framework without losing convergence guarantees. Interestingly, this result cannot be established for the Adagrad ellipsoid (Eq. (2)), which reflects the widely known vanishing-stepsize problem that arises because squared gradients are continuously added to the preconditioning matrix. It is mainly this effect that eventually inspired the development of RMSprop (tieleman2012lecture) and Adadelta (zeiler2012adadelta).
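The intuition behind Proposition 2 can be probed numerically: with bounded gradients and an exponential average, the spectrum of an RMSProp-style second-moment matrix stays between (roughly) $\epsilon$ and $g_{\max}^2 + \epsilon$. The recursion below is a simplified stand-in for Eq. (3) – it omits the matrix square root, which only maps the spectrum monotonically – so it is an illustration rather than the exact construction used in this work.

```python
# Quick numerical check of the uniform-equivalence intuition behind Prop. 2:
# an exponentially averaged second-moment matrix built from bounded gradients
# keeps its eigenvalues bounded away from 0 (by eps) and from above.
# The recursion is a simplified assumption modeled on Eq. (3), without sqrt.
import numpy as np

rng = np.random.default_rng(0)
d, beta, eps, g_max = 10, 0.9, 1e-3, 5.0

M = eps * np.eye(d)                                       # epsilon-regularized start
lo, hi = [], []
for _ in range(500):
    g = rng.standard_normal(d)
    g = g / np.linalg.norm(g) * rng.uniform(0, g_max)     # bounded stochastic gradient
    M = beta * M + (1 - beta) * (np.outer(g, g) + eps * np.eye(d))
    eig = np.linalg.eigvalsh(M)
    lo.append(eig[0]); hi.append(eig[-1])

print(min(lo) >= eps - 1e-12, max(hi) <= g_max ** 2 + eps + 1e-12)  # spectrum stays bounded
```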

Why ellipsoids?

There are many sources of ill-conditioning in neural networks, such as un-centered and correlated inputs (lecun2012efficient), saturated hidden units, and different weight scales in different layers of the network (van1998solving). While the quadratic term in the model (7) does account for such ill-conditioning to some extent, the spherical constraint is completely blind towards the loss surface; it is in general advisable to instead measure distances in norms that reflect the underlying geometry (Chapter 7.7 in conn2000trust). The ellipsoids we propose allow for longer steps along coordinates that have seen small gradient components in the past, and vice versa. Thereby the TR shape is adaptively adjusted to fit the current region of the non-convex loss landscape. Contrary to adaptive first-order methods, we do not only adapt the shape but also the corresponding diameter ($\Delta_t$), depending on whether or not the local Taylor model is an adequate approximation within the trust region at the current state of the algorithm. This procedure is not only effective when the iterates are in an ill-conditioned convex neighborhood of a minimizer (Figure 1), but it also helps to escape elongated plateaus (Sect. 5).

4.2 A stochastic TR framework for neural network training

Since neural network training often constitutes a large-scale learning problem in which the number of datapoints is very high, we here opt for a fully stochastic TR framework (chen2018stochastic) in order to circumvent memory issues and reduce computational complexity. Given that the involved function and derivative estimates are sufficiently accurate with fixed probability, such a framework has been shown to retain the convergence rate of deterministic TR methods to stationary points in expectation (blanchet2016convergence). For finite-sum objectives such as Eq. (1), the required level of accuracy can be obtained by simple mini-batching.

1:  Input: $\mathbf{w}_0$, $\Delta_0 > 0$, $\gamma > 1$, $0 < \eta_1 \le \eta_2 < 1$, $\epsilon > 0$
2:  for $t = 0, 1, 2, \ldots$ do
3:     Sample $\mathbf{g}_t$, $\mathbf{B}_t$ and $f_t$ with batch sizes $|\mathcal{S}_g|$, $|\mathcal{S}_B|$, $|\mathcal{S}_f|$
4:     Compute preconditioner $\mathbf{M}_t$ s.t. D1 holds
5:     Obtain $\mathbf{s}_t$ by solving $m_t(\mathbf{s})$ (Eq. (7)) s.t. Eq. (36) holds
6:     Compute actual over predicted decrease on batch $\mathcal{S}_f$
        $\rho_t = \dfrac{f_t(\mathbf{w}_t) - f_t(\mathbf{w}_t + \mathbf{s}_t)}{m_t(\mathbf{0}) - m_t(\mathbf{s}_t)}$   (9)
7:     Set $\mathbf{w}_{t+1} = \mathbf{w}_t + \mathbf{s}_t$ if $\rho_t \ge \eta_1$, else $\mathbf{w}_{t+1} = \mathbf{w}_t$; increase $\Delta_{t+1}$ if $\rho_t \ge \eta_2$, decrease it if $\rho_t < \eta_1$
8:  end for
Algorithm 1 Stochastic Ellipsoidal Trust Region Method
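The following Python sketch mirrors the outer loop of Algorithm 1. The subproblem solver, sampling routines, preconditioner construction and the success/radius constants are all placeholder names for whatever a concrete implementation provides; in particular, `solve_tr_subproblem` stands in for an (approximate) solver of Eq. (7), e.g. Steihaug-Toint CG in the ellipsoidal norm (cf. Appendix E.2.2).

```python
# Minimal sketch of the outer loop of Algorithm 1. All callables and constants
# (eta_1, eta_2, gamma) are illustrative placeholders, not the paper's settings.
import numpy as np

def stochastic_ellipsoidal_tr(w, sample_batch, solve_tr_subproblem, precondition,
                              radius=1.0, eta_1=0.1, eta_2=0.75, gamma=2.0,
                              max_iter=100):
    for t in range(max_iter):
        g, B, f_batch = sample_batch(w)             # stochastic g_t, B_t and batch loss f_t(.)
        M = precondition(g)                         # ellipsoid matrix satisfying Def. 1
        s = solve_tr_subproblem(g, B, M, radius)    # approximately solve Eq. (7)

        # Actual vs. predicted decrease, both evaluated on the *same* mini-batch (Eq. (9)).
        actual = f_batch(w) - f_batch(w + s)
        predicted = -(g @ s + 0.5 * s @ (B @ s))
        rho = actual / max(predicted, 1e-12)

        if rho >= eta_1:                            # successful step: accept it
            w = w + s
            if rho >= eta_2:                        # very successful: enlarge the region
                radius *= gamma
        else:                                       # unsuccessful: reject and shrink
            radius /= gamma
    return w
```

Evaluating both the numerator and denominator of $\rho_t$ on the same mini-batch is what lets the radius react only to model inadequacy, as discussed next.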

The main difference between this and most existing approaches (kohler2017sub; xu2017newton; yao2018inexact; cartis2017global) lies in the computation of $\rho_t$, which is traditionally computed as full function decrease over stochastic model decrease. Computing $\rho_t$ solely based on sub-sampled quantities has the nice side effect of disentangling two potential sources of error: an overoptimistic trust region radius and insufficiently small batch sizes. Namely, the quantity from Eq. (9) is an unbiased estimate of the ratio used in fully deterministic algorithms, and hence the trust region radius is adjusted purely based on the current adequacy of the local quadratic approximation.

5 Experiments

To validate our claim that ellipsoidal TR methods yield improved performance over spherical ones, we run a large set of experiments on three image datasets and three types of network architectures. As outlined in Section 2, many theoretical arguments suggest the use of regularized Newton methods for optimization of neural networks but certain practical objections exist (see Appendix F.2) and so far there is no conclusive empirical evidence (see discussion in Section 2). Hence, we also present a comprehensive benchmark with state of the art first-order methods. We fix the sample size for all gradient methods to 128 but grid search the stepsize since the ratio of these two quantities effectively determines the level of stochasticity jastrzkebski2017three . The TR methods mostly use 512 samples, since they are more likely to overfit very small minibatches. More details can be found in Appendix C.

Second-order methods

As can be seen in Figure 3, the ellipsoidal TR methods consistently outperform their spherical counterpart in the sense that they reach full accuracy substantially faster on all problems and in all measures (time and epochs). Furthermore, their limit points are in all cases lower than those of the uniform TR method, and especially on the autoencoders this makes an actual difference in the image reconstruction quality (see Figure 12). We thus draw the clear conclusion that the ellipsoidal trust region constraints we propose are to be preferred over their spherical counterpart when training neural networks.

Figure 3 (rows: Fashion-MNIST, CIFAR-10; columns: ConvNet, MLP, autoencoder): Log loss over epochs. Average and CI of 10 independent runs. Green dotted line indicates accuracy, black vertical line indicates where TR methods start. See Figure 9 for results over time.
Benchmark with SGD

While the ellipsoidal TR methods are slightly superior in terms of epochs, they only just manage to keep pace with first-order methods in terms of wall-clock time (Figure 10). Furthermore, the limit points of both first- and second-order methods yield the same order of loss in most experiments. When taking gradient norms into account (reported in Appendix J) we indeed find no spurious local minima and only one saddle point despite potentially arbitrarily non-convex loss landscapes. This suggests that second-order information is not needed to circumvent such obstacles in neural network optimization landscapes. Finally, we had to warm start the TR methods with a few first-order epochs to achieve good performance in terms of time on the convnets (see Appendix I.1 for a detailed discussion).

Figure 4 (rows: Fashion-MNIST, CIFAR-10; columns: ConvNet, MLP, autoencoder): Log loss over epochs. Same setting as Figure 3. See Figure 10 for results over time.

6 Conclusion

We investigated the use of ellipsoidal trust region constraints for neural networks. We have shown that the RMSProp matrix satisfies the necessary conditions for convergence, and our experimental results demonstrate that ellipsoidal TR methods outperform their spherical counterparts significantly. We thus consider the development of further ellipsoids that can potentially adapt even better to the loss landscape – such as (block-)diagonal Hessian approximations (e.g. bekas2007estimator) or approximations of higher-order derivatives – an interesting direction of future research.

Yet, our comprehensive empirical study also highlights that the value of Hessian information for neural network optimization is limited, for mainly two reasons: 1) second-order methods rarely yield better limit points, which suggests that saddles and spurious local minima are not a major obstacle; 2) low per-iteration costs render gradient methods superior in terms of time. The latter observation suggests that advances in hardware and distributed second-order algorithms (e.g. osawa2018second; dunner2018distributed) will be needed to speed up computations before Newton-type methods can replace (stochastic) gradient methods in deep learning. Finally, since trust region methods come with a large number of hyperparameters, Bayesian hyperparameter optimization is another interesting direction of future work.

References

Appendix A Notation

Throughout this work, scalars are denoted by regular lower-case letters, vectors by bold lower-case letters, and matrices as well as tensors by bold upper-case letters. By $\|\cdot\|$ we denote an arbitrary norm. For a symmetric positive definite matrix $\mathbf{M}$ we introduce the compact notation $\|\mathbf{x}\|_{\mathbf{M}} := \sqrt{\mathbf{x}^\top\mathbf{M}\,\mathbf{x}}$, where $\mathbf{x}\in\mathbb{R}^d$.

Appendix B Equivalence of Preconditioned Gradient Descent and First Order Trust Region Methods

Theorem 2 (Theorem 1 restated).

A preconditioned gradient step

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\,\mathbf{A}^{-1}\mathbf{g}_t \qquad (10)$$

with stepsize $\eta > 0$, symmetric positive definite preconditioner $\mathbf{A}$ and gradient estimate $\mathbf{g}_t$ minimizes a first-order local model around $\mathbf{w}_t$ in an ellipsoid given by $\mathbf{A}$ in the sense that

$$\mathbf{w}_{t+1} = \operatorname*{arg\,min}_{\mathbf{w}\in\mathbb{R}^d} \; \mathbf{g}_t^\top(\mathbf{w}-\mathbf{w}_t) \quad \text{s.t.}\quad \|\mathbf{w}-\mathbf{w}_t\|_{\mathbf{A}} \le \eta\,\|\mathbf{g}_t\|_{\mathbf{A}^{-1}}. \qquad (11)$$
Proof.

We start by noting that the optimization problem (11) is convex. For $\eta > 0$ the constraint satisfies the Slater condition since $\mathbf{w}_t$ is a strictly feasible point. As a result, any KKT point is a feasible minimizer and vice versa.

Let $\mathcal{L}$ denote the Lagrangian of (11),

$$\mathcal{L}(\mathbf{w},\lambda) = \mathbf{g}_t^\top(\mathbf{w}-\mathbf{w}_t) + \lambda\left(\|\mathbf{w}-\mathbf{w}_t\|^2_{\mathbf{A}} - \eta^2\|\mathbf{g}_t\|^2_{\mathbf{A}^{-1}}\right). \qquad (12)$$

A point $(\mathbf{w},\lambda)$ is a KKT point if and only if the following system of equations is satisfied:

$$\nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w},\lambda) = \mathbf{g}_t + 2\lambda\,\mathbf{A}(\mathbf{w}-\mathbf{w}_t) = \mathbf{0}, \qquad (13)$$
$$\|\mathbf{w}-\mathbf{w}_t\|^2_{\mathbf{A}} - \eta^2\|\mathbf{g}_t\|^2_{\mathbf{A}^{-1}} \le 0, \qquad (14)$$
$$\lambda \ge 0, \qquad (15)$$
$$\lambda\left(\|\mathbf{w}-\mathbf{w}_t\|^2_{\mathbf{A}} - \eta^2\|\mathbf{g}_t\|^2_{\mathbf{A}^{-1}}\right) = 0. \qquad (16)$$

For $\mathbf{w}_{t+1}$ as given in Eq. (10) we have that

$$\|\mathbf{w}_{t+1}-\mathbf{w}_t\|^2_{\mathbf{A}} = \eta^2\,\mathbf{g}_t^\top\mathbf{A}^{-1}\mathbf{A}\,\mathbf{A}^{-1}\mathbf{g}_t = \eta^2\,\|\mathbf{g}_t\|^2_{\mathbf{A}^{-1}}, \qquad (17)$$

and thus (14) and (16) hold with equality such that any $\lambda \ge 0$ is feasible. Furthermore,

$$\nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w}_{t+1},\lambda) = \mathbf{g}_t - 2\lambda\eta\,\mathbf{g}_t = (1-2\lambda\eta)\,\mathbf{g}_t \qquad (18)$$

is zero for $\lambda = \tfrac{1}{2\eta}$. As a result, $(\mathbf{w}_{t+1}, \tfrac{1}{2\eta})$ is a KKT point of the convex problem (11), which proves the assertion. ∎

To illustrate this theoretical result, we run gradient descent and Adagrad as well as the two corresponding first-order TR methods⁴ on an ill-conditioned quadratic problem. While the spherically constrained first-order TR method optimizes a linear model within a ball in each iteration, its ellipsoidal counterpart optimizes the same model over the ellipsoid given by the Adagrad matrix (Eq. (2)). The results in Figure 5 show that the methods behave very similarly to their constant-stepsize analogues.

⁴ Essentially Algorithm 1 with $m_t$ based on a first-order Taylor expansion.

Figure 5: Iterates (left) and log suboptimality (right) of GD, Adagrad and two full-featured first-order TR algorithms, of which one is spherically constrained and the other uses the Adagrad matrix as ellipsoid.

Appendix C Convergence of ellipsoidal TR methods

Spherically constrained TR methods

Under standard smoothness assumptions, spherical TR algorithms achieve $\epsilon$-criticality after at most $\mathcal{O}(\epsilon^{-2})$ iterations and additionally almost positive curvature within $\mathcal{O}(\epsilon^{-3})$ iterations. These rates are attained without the need of actually knowing the Hessian's Lipschitz constant. The complete statement of the convergence results for TR methods can be found in Theorem 4.3 of [15]. Interestingly, these rates can be improved to match the (optimal) first-order worst-case complexity of cubic regularization by applying small modifications to the TR framework. As stated in Section E.2.2, the involved subproblems do not need to be solved globally in each iteration.

For both cubic regularization and trust region methods, many stochastic extensions have emerged in the literature that alleviate the need to compute exact derivative information without losing the above-mentioned convergence guarantees with high probability [45, 76, 77, 9, 35]. For the deep learning setting, the analysis of [9] is most relevant since it also allows the algorithm to run solely based on sub-sampled function evaluations.

Ellipsoidal constrained TR methods

In order to prove such results for ellipsoidal trust region methods, one must ensure that the applied norms are coherent during the complete minimization process in the sense that the ellipsoids do not flatten out (or blow up) completely along any given direction. This intuition is formalized in Definition 1, which we restate here for the sake of clarity.

Definition 2 (Definition 1 restated).

There exists a constant $\mu \ge 1$ such that

$$\frac{1}{\mu}\,\|\mathbf{x}\|_2 \;\le\; \|\mathbf{x}\|_{\mathbf{M}_t} \;\le\; \mu\,\|\mathbf{x}\|_2 \qquad \forall\,\mathbf{x}\in\mathbb{R}^d,\ \forall\,t. \qquad (19)$$

Having uniformly equivalent norms is necessary and sufficient to prove that ellipsoidal TR methods enjoy the same convergence rate as classical ball constrained Trust Region algorithms.

Towards this end, [22] identify the following sufficient condition on the basis of which we will prove that our proposed ellipsoid is indeed uniformly equivalent under some mild assumptions.

Lemma 1 (Theorem 6.7.1 in [22]).

Suppose that there exists a constant $c \ge 1$ such that

$$\frac{1}{c} \;\le\; \lambda_{\min}(\mathbf{M}_t) \;\le\; \lambda_{\max}(\mathbf{M}_t) \;\le\; c \qquad \forall\,t, \qquad (20)$$

then Definition 1 holds.

Proposition 3 (Proposition 2 restated).

Suppose that $\|\mathbf{g}_t\|_2 \le g_{\max}$ for all $t$ and that $\epsilon > 0$. Then there always exists a $\beta \in (0,1)$ such that the proposed preconditioning matrix (Eq. (3)) is uniformly equivalent, i.e. Definition 1 is satisfied. The same holds for the diagonal variant.

Proof.

The basic building block of our ellipsoid matrix consists of the current and past stochastic gradients

(21)

We consider the preconditioning matrix, which is built up as follows⁵

(22)

⁵ This is a generalization of the diagonal variant proposed by [71], which preconditions the gradient step by an elementwise division with the square root of the corresponding diagonal estimate.

From the construction of it directly follows that for any unit length vector we have

(23)

which proves the lower bound for . Now, let us consider the upper end of the spectrum of . Towards this end, recall the geometric series expansion

(24)

and the fact that is a sum of exponentially weighted rank-one positive semi-definite matrices of the form . Thus

where the latter inequality holds by assumption for any sample size. Combining these facts we get that

(25)

As a result we have that

(26)

Finally, to achieve uniform equivalence we need the r.h.s. of (26) to be bounded by . This gives rise to a quadratic equation in , namely

(27)

which holds for any and any as long as

(28)

Such a $\beta$ always exists, but one needs to choose smaller and smaller values as the upper bound on the gradient norm grows. For example, the usual default value of $\beta$ is valid as long as the gradient norms stay below the corresponding bound in Eq. (28).

All of the above arguments naturally extend to the diagonal preconditioner. ∎

Interestingly, this result cannot be established for the Adagrad-inspired ellipsoid (Eq. (2)), which reflects the commonly noticed effect that the stepsizes of first-order Adagrad shrink over time as squared gradients are continuously added to the preconditioning matrix. It is mainly this effect that eventually inspired the development of RMSprop [71], Adadelta [80] and similar approaches.

Appendix D Diagonal Dominance in Neural Networks

D.1 Proof of Proposition 1

Proposition 4 (Proposition 1 restated).

For a random Gaussian Wigner matrix formed as in

(29)

where the entries are i.i.d. draws [75], the expected share of diagonal mass amounts to

(30)
Proof.
(31)

which simplifies to

if the diagonal and off-diagonal elements come from the same Gaussian distribution (

). ∎

For the sake of simplicity we only consider Gaussian Wigner matrices, but the above argument naturally extends to any distribution with positive expected absolute values; i.e., we only exclude the Dirac delta as a probability density.

Figure 6 (panels: CONV, MLP): Share of diagonal mass of the Hessian relative to that of the corresponding Wigner matrix at random initialization, after 50% of the iterations and at the end of training with RMSprop on MNIST. Average and 95% confidence interval over 10 runs. See Figure 2 for CIFAR-10 results.

Appendix E Background on second-order optimization

E.1 Newton's Method

The canonical second-order method is Newton's method. This algorithm uses the inverse Hessian as a scaling matrix and thus has updates of the form

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \nabla^2 f(\mathbf{w}_t)^{-1}\,\nabla f(\mathbf{w}_t), \qquad (32)$$

which is equivalent to optimizing the local quadratic model

$$m_t(\mathbf{s}) = f(\mathbf{w}_t) + \nabla f(\mathbf{w}_t)^\top\mathbf{s} + \tfrac{1}{2}\,\mathbf{s}^\top\nabla^2 f(\mathbf{w}_t)\,\mathbf{s} \qquad (33)$$

to first-order stationarity. Using curvature information to rescale the steepest descent direction gives Newton's method the useful property of being linearly scale invariant. This gives rise to a problem-independent local convergence rate that is superlinear, and even quadratic in the case of Lipschitz continuous Hessians (see [60], Theorem 3.5), whereas gradient descent at best achieves linear local convergence [58].

However, there are certain drawbacks associated with applying the classical Newton's method. First of all, the Hessian matrix may be singular and thus not invertible. Secondly, even if it is invertible, the local quadratic model (Eq. (33)) that is minimized in each Newton iteration may simply be an inadequate approximation of the true objective. As a result, the Newton step is not necessarily a descent step. It may hence approach arbitrary critical points (including local maxima) or even diverge. Finally, the costs of forming and inverting the Hessian amount to $\mathcal{O}(d^2)$ memory and $\mathcal{O}(d^3)$ time and are thus prohibitively high for large-dimensional problems.
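A two-dimensional toy example makes the saddle issue concrete: at an indefinite Hessian, the plain Newton step jumps straight to the critical point of the quadratic model, which here is a saddle rather than a minimizer. The objective below is purely illustrative.

```python
# Tiny illustration of the drawbacks listed above: with an indefinite Hessian,
# the plain Newton step targets the model's critical point, here a saddle.
import numpy as np

f = lambda w: 0.5 * (w[0] ** 2 - w[1] ** 2)          # saddle at the origin
grad = lambda w: np.array([w[0], -w[1]])
hess = lambda w: np.diag([1.0, -1.0])                # indefinite Hessian

w = np.array([1.0, 0.5])
newton_step = -np.linalg.solve(hess(w), grad(w))     # = -w: jumps straight to the saddle
print(w + newton_step)                               # [0. 0.]
print(np.linalg.norm(grad(w + newton_step)))         # 0.0: a critical point, yet not a minimizer
print(f(np.array([0.0, 10.0])))                      # -50.0: the objective is unbounded below
```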

E.2 Trust Region Methods

E.2.1 Outer iterations

Trust region methods are among the most principled approaches to overcome the above-mentioned issues. These methods also construct a quadratic model but constrain the subproblem in such a way that the step is restricted to stay within a certain radius within which the model is trusted to be sufficiently adequate:

$$\mathbf{s}_t = \operatorname*{arg\,min}_{\|\mathbf{s}\|\le\Delta_t}\; m_t(\mathbf{s}) = f(\mathbf{w}_t) + \mathbf{g}_t^\top\mathbf{s} + \tfrac{1}{2}\,\mathbf{s}^\top\mathbf{B}_t\mathbf{s}. \qquad (34)$$

Hence, contrary to line-search methods, this approach finds the step and its length simultaneously by optimizing (34). Subsequently, the actual decrease is compared to the predicted decrease, and the step is only accepted if the ratio $\rho_t$ exceeds some predefined success threshold $\eta_1$. Furthermore, the trust region radius is decreased whenever $\rho_t$ falls below $\eta_1$ and increased whenever $\rho_t$ exceeds the "very successful" threshold $\eta_2$. Thereby, the algorithm adaptively measures the accuracy of the second-order Taylor model – which may change drastically over the parameter space depending on the behaviour of the higher-order derivatives⁶ – and adapts the effective length along which the model is trusted accordingly. See [22] for more details.

⁶ Note that second-order Taylor models assume constant curvature.

As a consequence, the plain Newton step is only taken if it lies within the trust region radius and yields a certain amount of decrease in the objective value. Since many functions look approximately quadratic close to a minimizer, the radius can be shown to grow asymptotically under mild assumptions, such that eventually full Newton steps are taken in every iteration, which retains the local quadratic convergence rate [22].

Figure 7: Level sets of a non-convex, coercive objective function. Newton's method builds a local quadratic model (blue dashed lines) and steps to its critical point. It may thus be ascending (left) or attracted by a saddle point (right). TR methods alleviate this issue by stepping to the minimizer of that model within a certain region (green dashed line).
E.2.2 Subproblem solver

Interestingly, there is no need to optimize (34) to global optimality to retain the remarkable global convergence properties of TR algorithms. From a practical point of view this is very good news, since globally optimizing the involved quadratics scales quadratically in the problem dimension, which can be very large for neural networks. Instead, it suffices to do better than the Cauchy point and the eigenpoint⁷ simultaneously. One way to ensure this is to minimize $m_t$ in nested Krylov subspaces, which naturally include the gradient direction as well as increasingly accurate estimates of the leading eigendirection,

$$\mathcal{K}_i := \mathrm{span}\{\mathbf{g}_t,\, \mathbf{B}_t\mathbf{g}_t,\, \mathbf{B}_t^2\mathbf{g}_t,\, \ldots,\, \mathbf{B}_t^{i-1}\mathbf{g}_t\}, \qquad (35)$$

until (for example) the stopping criterion

(36)

is met, which requires increasing accuracy as the underlying trust region algorithm approaches criticality. Conjugate gradients and the Lanczos method are two iterative routines that implicitly build up a conjugate and an orthogonal basis for such a Krylov space, respectively, and they converge linearly on quadratic objectives with a square-root dependency on the condition number of the Hessian [22]. We here employ the preconditioned Steihaug-Toint CG method [70] in order to cope with possible boundary solutions of (34), but similar techniques exist for the Lanczos solver as well, for which we also provide code. As preconditioning matrix for CG we use the same matrix as for the ellipsoidal constraint.

⁷ These are the model minimizers along the gradient and along the eigendirection associated with the smallest eigenvalue, respectively.
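For reference, here is a compact sketch of the plain (unpreconditioned, Euclidean-norm) Steihaug-Toint CG loop. It accesses the Hessian only through a matrix-vector callback and stops at the trust region boundary whenever it encounters negative curvature or an over-long step; the preconditioning and the ellipsoidal norm used in our solver are omitted here for brevity.

```python
# Sketch of Steihaug-Toint CG for min_s g^T s + 0.5 s^T B s  s.t. ||s|| <= radius,
# accessing B only through the matrix-vector product callback `Bv`.
import numpy as np

def steihaug_cg(g, Bv, radius, tol=1e-6, max_iter=250):
    s = np.zeros_like(g)
    r, d = g.copy(), -g.copy()                   # residual g + Bs and search direction
    if np.linalg.norm(r) < tol:
        return s
    for _ in range(max_iter):
        Bd = Bv(d)
        dBd = d @ Bd
        if dBd <= 0:                             # negative curvature: follow d to the boundary
            return s + _boundary_tau(s, d, radius) * d
        alpha = (r @ r) / dBd
        if np.linalg.norm(s + alpha * d) >= radius:   # step would leave the trust region
            return s + _boundary_tau(s, d, radius) * d
        s = s + alpha * d
        r_new = r + alpha * Bd
        if np.linalg.norm(r_new) < tol:
            return s
        beta = (r_new @ r_new) / (r @ r)
        d, r = -r_new + beta * d, r_new
    return s

def _boundary_tau(s, d, radius):
    """Positive root tau of ||s + tau d|| = radius (standard simplification)."""
    a, b, c = d @ d, 2 * s @ d, s @ s - radius ** 2
    return (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
```

In a neural network setting, `Bv` would wrap a Hessian-vector (or GGN-vector) product computed via automatic differentiation, as sketched in the Introduction.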

E.3 Damped (Gauss-)Newton methods

An alternative approach to actively constraining the region within which the model is trusted is to instead penalize the step norm in each iteration in a Lagrangian manner. This is done by so-called damped Newton methods, which add a multiple of the identity matrix to the second-order term in the model, leading to the update step

$$\mathbf{s}_t = -\big(\mathbf{B}_t + \lambda\,\mathbf{I}\big)^{-1}\mathbf{g}_t. \qquad (37)$$

This can also be computed Hessian-free by conjugate gradients (or other Krylov subspace methods). The penalty parameter $\lambda$ acts inversely to the trust region radius and is often updated accordingly.

Many algorithms in the existing literature replace the use of $\mathbf{B}_t = \nabla^2 f(\mathbf{w}_t)$ in (37) with the Generalized Gauss-Newton matrix [52, 19] or an approximation of the latter [54]. This matrix constitutes the first part of the well-known Gauss-Newton decomposition

$$\nabla^2 f(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\Big[\ell''\big(h_{\mathbf{w}}(\mathbf{x}_i)\big)\,\nabla h_{\mathbf{w}}(\mathbf{x}_i)\,\nabla h_{\mathbf{w}}(\mathbf{x}_i)^\top + \ell'\big(h_{\mathbf{w}}(\mathbf{x}_i)\big)\,\nabla^2 h_{\mathbf{w}}(\mathbf{x}_i)\Big], \qquad (38)$$

where $\ell'$ and $\ell''$ are the first and second derivative of the loss, assuming a scalar network output $h_{\mathbf{w}}(\mathbf{x}_i)\in\mathbb{R}$ (binary classification and regression tasks) for simplicity.
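GGN-vector products are similarly cheap to obtain with automatic differentiation: a forward-mode pass gives $J\mathbf{v}$ and a reverse-mode pass pulls it back through $J^\top$. The sketch below does this for a scalar-output toy "network" under squared loss (so $\ell''=1$ and the GGN reduces to $J^\top J$); the function and tensor names are illustrative, not the implementation used in the works cited above.

```python
# Sketch of a Generalized-Gauss-Newton vector product for a scalar-output model
# under squared loss, where the GGN reduces to J^T J (since l'' = 1).
import torch
from torch.autograd.functional import jvp, vjp

x = torch.randn(16, 4)                         # mini-batch of inputs
W = torch.randn(4)                             # toy "network" with a single weight vector

def outputs(w):
    return x @ w                               # h_w(x): one scalar output per example

v = torch.randn(4)
_, Jv = jvp(outputs, (W,), (v,))               # forward sensitivity: J v   (shape: batch)
_, Gv = vjp(outputs, (W,), Jv)                 # pull back: J^T (l'' * J v) with l'' = 1
print(Gv[0])                                   # equals (x^T x) v here
```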

It is interesting to note that the GGN matrix of neural networks is equivalent to the Fisher matrix used in natural gradient descent [6] for convex loss functions such as the mean-squared-error and cross-entropy losses [63]. As can be seen in (38), the GGN matrix is positive semidefinite (and of rank at most $n$, hence low-rank whenever $n < d$). As a result, there exist no second-order convergence guarantees for such methods on general non-convex problems. On the other hand, the GGN also drops possibly positive terms from the Hessian (see (38)), and hence it is not guaranteed to be an upper bound on the latter in the PSD sense. Essentially, GGN approximations assume that the network is piece-wise linear, and thus the GGN and Hessian matrices only coincide in the case of linear and ReLU activations or non-curved loss functions. For any other activation, the GGN matrix may approximate the Hessian only asymptotically, and only if the second terms in (38) go to zero for all data points. In non-linear least squares such problems are called zero-residual problems, and GN methods can be shown to have quadratic local convergence there. In any other case the convergence rate does not exceed the linear local convergence bound of gradient descent. In practice, however, there are cases where deep neural nets do show negative curvature in the neighborhood of a minimizer [11].

When $\lambda$ is small the step is similar to the plain Gauss-Newton update, whereas for very large $\lambda$, Eq. (37) approximates a gradient descent update with a very small stepsize. Such algorithms are commonly known as Levenberg-Marquardt algorithms; they were originally tailored towards solving non-linear least squares problems [60] but were proposed for neural network training already early on [37]. Finally, [27] propose the use of the absolute Hessian instead of the GGN matrix in a framework similar to (37). This method has been termed saddle-free Newton, even though its manifold of attraction to a given saddle is non-empty⁸.

⁸ It is the same as that of gradient descent, which renders the method unable to escape e.g. when initialized right on a saddle point. To be fair, the manifold of attraction for gradient descent constitutes a measure-zero set [48].

Figure 8: Both the GGN method and the saddle-free Newton method build a positive definite quadratic model around the current iterate and thereby overcome the attraction of pure Newton towards the saddle (compare Figure 7). However, (i) none of these methods can escape the saddle once they are in the gradient manifold of attraction, and (ii) as reported in [56], the GN matrix can be significantly less well conditioned than the absolute Hessian (here we had to add a damping factor to make the GN step fit the plot).
E.3.1 Comparison to trust region

Contrary to TR methods, though, these regularized Newton methods never take plain Newton steps since the regularization is always active ($\lambda > 0$). Furthermore, if a positive-definite Hessian approximation like the Generalized Gauss-Newton matrix is used, the algorithm is not capable of exploiting negative curvature, and there are cases in neural network training where the Hessian is much better conditioned than the Gauss-Newton matrix [56] (also see Figure 8). While some scholars believe that positive-definiteness is a desirable feature [52, 19], we want to point out that following negative curvature directions is necessary to escape saddle points. It can also be meaningful to follow directions of negative curvature outside of a saddle, since they guarantee a certain amount of progress, whereas a gradient descent step guarantees a different amount (both under certain stepsize conditions), and one cannot conclude a priori which one is better in general [23, 3]. Finally, there is a growing body of first-order algorithms that explicitly include additional negative curvature steps to derive state-of-the-art time complexities in classical minimization problems [4, 66, 14, 40] as well as in min-max problems to avoid convergence to unstructured saddle points [1].

Despite these theoretical considerations, many methods based on GGN matrices have been applied to neural network training (see [53] and references therein), and particularly the Hessian-free implementations of [52, 19] can be run very cheaply since they make use of an efficient procedure for computing GGN-vector products [68]. While these works provide promising results, [76] report superior performance of TR algorithms on the autoencoder architecture presented in Section J. Finally, [54] and [36] propose an algorithm termed K-FAC that uses a non-diagonal, high-rank approximation of the Fisher information matrix, which can be incorporated into an approximate⁹ Levenberg-Marquardt scheme for both feed-forward and convolutional neural networks, since the inverse is readily available and thus no Krylov subspace methods need to be run to find the next step.

⁹ When using K-FAC within a damped Newton scheme, one needs to add the damping term $\lambda\mathbf{I}$ to the GGN approximation before inverting. However, the diagonal blocks are then no longer expressible as Kronecker products, so [54] must instead use an approximation to the LM approach.

Unfortunately, we are not aware of any comprehensive empirical comparison between Gauss-Newton/Fisher-information-based and Hessian-based approaches, but we consider this to be a very interesting direction of future research.

Appendix F Using Hessian information in Neural Networks

F.1 Theoretical advantages

TR methods present several theoretical advantages, which are partly illustrated in Figure 7:

  • Contrary to gradient descent, TR methods enjoy a superlinear local convergence rate independent of the local conditioning [60].

  • Contrary to gradient descent, Newton and damped Newton methods (such as Levenberg-Marquardt), TR methods can leverage curvature information, which is believed to be helpful for non-convex optimization [3, 23, 14]. Thereby they escape elongated plateaus and (strict) saddle points quickly and thus provably converge to second-order stationary points. In theory, saddles emerge even in shallow feed-forward networks [32].

  • Just as damped Newton methods, TR methods update their iterates based on Hessian-vector products, which can be computed roughly at the cost of two gradient computations [65].

  • Finally, slight modifications of the TR framework suffice to prove the same worst-case iteration complexity as cubic regularization methods [25], which is optimal for the class of second-order Lipschitz smooth functions [17, 13] and better than that of GD [58].

F.2 Practical objections

In practice, however, these theoretical advantages do not always play out. Particularly, when training neural networks one must take the following considerations into account.

  • While superlinear local convergence is nice for traditional optimization, the ultimate goal of any learning task is generalization and thus it might no longer be desirable to even find a high-accuracy solution on the training data. Yet, second-order methods can also be regularized with early-stopping, weight decay, dropout and similar techniques.

  • While saddle points are believed to be ubiquitous in high dimensions [21, 27], they do not seem to pose a serious problem when training neural networks.

    • First of all, saddles are – in our experience – hard to find in practice when using random initialization. In our set of experiments, only the Autoencoder architecture gave rise to potential saddle points. ([27] and [76] report the same saddle.)

    • Secondly, even vanilla GD is highly unlikely to ever converge to a saddle [62] and vanilla SGD may escape saddles even in polynomial time due to its inherent noise [26].

    • Third, second-order information is only useful to escape so-called strict-saddles. While linear neural networks (with quadratic loss) are indeed strict-saddles [42], this is generally not the case for non-linear nets [32] and particularly not for networks with non-curved activations such as ReLUs.

  • In practice, the value of worst-case convergence rates is questionable since they are ultimately tailored towards worst-case functions that rarely occur in real-world problems [24]. In particular:

    • TR methods are often observed to perform just as well as or even better than cubic regularization [45, 76, 51].

    • (Stochastic) gradient descent often finds a stationary point much faster than its theoretical worst-case rate predicts.

  • Finally, second-order methods may suffer substantially more from mini-batching than first-order methods do, since much more information is extracted from the batch in each second-order step which leads to very noisy updates due to overfitting the batch. Yet, many real-world problems render large batch settings impossible due to memory restrictions.

    • In Figure J.2 we report that small batches lead to a Hessian Lipschitzness estimator with large variance and a surprising tendency towards overestimation which may lead to overly conservative stepsizes.

At this point we refer to Appendix H for a detailed conclusion regarding the above discussion on the basis of our experimental findings.

Appendix G Experimental results overview

G.1 Ellipsoidal Trust Region vs. Uniform Trust Region

Figure 9 (rows: CIFAR-10, MNIST, Fashion-MNIST; columns: Convolution, Fully-Connected, Autoencoder): Overview of the experiment with different trust region shapes. Average log loss as well as confidence interval over wall-clock time on one Tesla GPU.

G.2 Ellipsoidal Trust Region vs. First-order Optimizer

Figure 10 (rows: CIFAR-10, MNIST, Fashion-MNIST; columns: Convolution, Fully-Connected, Autoencoder): Experiment comparing ellipsoidal Trust Region with several first-order methods. Average log loss as well as confidence interval over wall-clock time on one Tesla P100 GPU.

The above figures are intended to give the reader a quick overview of our results in terms of the loss values. Please see Appendix J for more details about gradient norm as well as train and test accuracy. There, we also provide the applied parameters, curvature computations, and reconstructed autoencoder images.

Appendix H Conclusion from Experiments

Observation I: Ellipsoidal TR methods consistently perform better than their spherical counterpart.

We find that ellipsoidal TR methods consistently and significantly outperform their spherical counterparts in the sense that at least one of them reaches the full-accuracy mark faster than spherical TR in all of our tested settings, both in terms of epochs and time. Furthermore, the limit points of our proposed methods are in all cases lower than or equal to those of the uniform TR methods, and especially on the autoencoder architectures this makes an actual difference in the image reconstruction quality (see Figure 12).

Observation II: TR methods mostly make faster progress than first-order methods in terms of epochs.

In summary, Figure 4 shows that TR methods outperform stochastic gradient methods in terms of epochs in the sense that, in each of our settings, at least one of the ellipsoidal TR algorithms achieves full accuracy faster than all gradient methods. One exception is the autoencoder setting, in which first-order methods are superior.

Observation III: In terms of time, the stochastic ellipsoidal TR methods only just manage to keep pace with first-order methods.

As can be seen in Figure 10, the stochastic ellipsoidal TR method can roughly keep pace with first order methods but the advantage seen in the epoch plots (Fig. 4) is diminished substantially. We hence believe that advances in both hardware and distributed algorithms will be needed to enable a more widespread use of second-order methods in deep learning.

Observation IV: No spurious local minima and almost no saddles. Underparametrization can lead to severe ill-conditioning.

The limit points of all methods yield the same loss in the vast majority of our experiments which is somewhat surprising given that we use repeated (ten) random initializations in potentially very non-convex and complicated landscapes. Nevertheless, no method gets stuck in a spurious local minima or saddle point for long. The only method that stays above all others at times is TR but a closer look at the gradient norm reveals that it has not yet converged to a critical point of any kind (see Appendix D). One exception is the CIFAR-10 MLP for which TR finds much better loss values (and much higher accuracy) than the first-order methods. Again, we note that the gradient norms stay elevated for all other methods (see Figure J.4) and thus hypothesize that this behaviour is due to the heavy under-parametrization of the network. In fact, we find that the maximum curvature in this network is at least higher than in the Fashion- and MNIST MLP (see Figure J.2).

As a matter of fact, the only architecture which gives rise to saddle points is the autoencoder, on which the SGD and uniform TR losses plateau (Fig. 10) while the gradient norm gets very small (e.g. Fig. J.6.3). Yet, SGD eventually escapes this plateau, which is accompanied by a sudden increase in gradient norm. We note that the ellipsoidal TR methods do find and leverage negative curvature directions to escape this area.

Observation V: Warmstart helps TR methods on curvy convnets.

We find improved performance for our second-order algorithms on the convolutional networks when starting the TR methods after a few first-order epochs. While we use only one first-order epoch for MNIST and two epochs for Fashion-MNIST, we need as many as first-order iterations to reach full accuracy within the 2000 seconds of training time on CIFAR-10. As described in Appendix I.1, warm starting more sophisticated optimization algorithms with a few epochs of SGD is common in the literature [66, 41, 55, 27].

Appendix I Things we tried that did not work (so well)

Finally, and in order to complete the picture, we here report some approaches that did not work out as well as we expected.

I.1 Training ConvNets with Trust Region without warmstart

First off, we note that warm starting with vanilla SGD is standard for other techniques that aim to obtain lower-variance estimators of the optimum, such as variance reduction (see [66, 41] for image classification with MLPs), iterate averaging (see e.g. [55] and references therein for learning language models with RNNs), and saddle-free Newton [27]. The motivation for this approach is to start with SGD in order to quickly forget the initialization and then fine-tune with more sophisticated algorithms.

Using such an initialization, we find significantly better performance in convolutional networks and especially on the CIFAR-10 dataset. While we use only one first-order epoch for MNIST and two epochs for Fashion-MNIST, we need as many as first-order iterations to reach full accuracy within the 2000 seconds of training time on CIFAR-10.

Figure 11: ConvNet results without warmstart on MNIST, Fashion-MNIST, and CIFAR (left to right). With warm start, all methods achieve at least a log loss value of . A look at the vertical axis reveals that the performance stays almost the same for MNIST, drops a bit for Fashion-MNIST, and decays a lot for CIFAR. Interestingly, this pattern goes hand in hand with an increase in curvature across the different datasets, rising from left to right (see Figure J.2).

We hypothesize that the warm start is needed because sharp landscapes can render second-order algorithms too conservative. In particular, the CIFAR-10 convnet is the smallest of the convnet architectures and yet faces the hardest dataset. We thus believe that this network is not over-parameterized enough to bring about a fairly smooth and uncomplicated loss landscape. Indeed, as reported in Figure J.2, the maximum curvature in this network is about 1000 times higher than that of the other convnets, and the other convnets themselves have again at least 10x higher maximum eigenvalues than the corresponding MLPs. In such a landscape, TR methods are likely to be over-conservative since the increased curvature yields very narrow quadratic approximations (note that Figure J.2 only shows the upper end of the spectrum). Indeed, when we inspected the conjugate gradient steps in the early subproblems after initialization, we found that those were also along very curvy directions, which leads to a tiny stepsize since CG divides the search direction by the curvature along that line. This can drastically hurt performance if the initialization is not close to any good local minimum. In this regard, note that the SGD stepsize of that we employ on the problem is way beyond the theoretical maximum of , which suggests that it is indeed meaningful to quickly navigate out of a potentially very complex initial region.
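To make the last point concrete, the following minimal sketch of a (spherical) truncated-CG subproblem solver shows where the curvature enters: the step length alpha = (r^T r) / (p^T H p) shrinks as the curvature along the search direction grows. This is a generic textbook variant given for illustration only; the function and variable names are ours and it is not a verbatim excerpt of our implementation.

import numpy as np

def steihaug_cg(hvp, g, radius, tol=0.1, max_iter=250):
    """Approximately minimize m(s) = g^T s + 0.5 s^T H s subject to ||s|| <= radius.
    hvp(v) returns the Hessian-vector product H v; tol is a relative residual
    threshold playing the role of a Krylov tolerance (cf. Table 1)."""
    s = np.zeros_like(g)
    r, p = g.copy(), -g.copy()
    for _ in range(max_iter):
        Hp = hvp(p)
        curv = p @ Hp                       # curvature along the search direction
        if curv <= 0:                       # negative curvature: follow p to the boundary
            return _to_boundary(s, p, radius)
        alpha = (r @ r) / curv              # large curvature  =>  tiny step
        if np.linalg.norm(s + alpha * p) >= radius:
            return _to_boundary(s, p, radius)
        s = s + alpha * p
        r_new = r + alpha * Hp
        if np.linalg.norm(r_new) <= tol * np.linalg.norm(g):
            return s
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p
        r = r_new
    return s

def _to_boundary(s, p, radius):
    # Solve ||s + tau p|| = radius for tau >= 0.
    a, b, c = p @ p, 2 * (s @ p), s @ s - radius**2
    tau = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return s + tau * p

In the ellipsoidal case the same routine runs in the space transformed by the square root of the preconditioning matrix, so the effect of high curvature on the step length is identical.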

I.2 Lanczos subproblem solver

In theory, the Lanczos solver brings about much better steps than CG whenever a direction of negative curvature is found. However, an efficient implementation of this routine needs matrix factorization routines and linear system solvers that can exploit the special banded (tridiagonal) structure of the matrices involved. While such functions exist for CPUs (e.g. in SciPy), we could not find any GPU implementation for PyTorch, and using the standard functions for dense matrices rendered the method far too slow. Thus, all of our results are presented using the CG solver, but we think that even better performance can be achieved once an efficient Lanczos GPU implementation exists, particularly because we do find that the CG solver detects directions of negative curvature quite frequently. For that purpose, we provide code of a CPU-tailored Generalized Lanczos Trust Region (GLTR) method inspired by Algorithms 7.3.4 and 7.3.6 in [22], in the hope of facilitating future development of a GPU version.
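For illustration, the sketch below shows the Lanczos tridiagonalization that such a solver builds on, together with SciPy's tridiagonal eigensolver for detecting negative curvature. It is a simplified stand-in with names of our own choosing, not the released GLTR code.

import numpy as np
from scipy.linalg import eigh_tridiagonal

def lanczos_tridiag(hvp, g, k=50, reorth=True):
    """Run up to k Lanczos steps with H on the Krylov space spanned by g.
    Returns the diagonal (alphas), off-diagonal (betas) and the Lanczos basis Q."""
    n = g.shape[0]
    Q = np.zeros((n, k))
    alphas, betas = [], []
    q = g / np.linalg.norm(g)
    beta, q_prev = 0.0, np.zeros(n)
    for j in range(k):
        Q[:, j] = q
        w = hvp(q) - beta * q_prev
        alpha = q @ w
        w = w - alpha * q
        if reorth:                          # full reorthogonalization for numerical stability
            w = w - Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        if beta < 1e-10 or j == k - 1:      # invariant subspace found or budget exhausted
            break
        betas.append(beta)
        q_prev, q = q, w / beta
    return np.array(alphas), np.array(betas), Q[:, :len(alphas)]

# Smallest Ritz value -- a negative value flags a direction of negative curvature:
# alphas, betas, Q = lanczos_tridiag(hvp, gradient)
# lam_min = eigh_tridiagonal(alphas, betas, select='i', select_range=(0, 0))[0][0]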

I.3 Trust Region Adam

It is possible to design a second-order TR variant of the popular Adam optimizer. The main difference to RMSProp is that a moving average over past gradients is used instead of just the current gradient. For first-order methods this poses no problem, since only the gradient values of past iterates must be stored. In second-order TR algorithms, however, we compute Hessian-vector products in each iteration, and so we would need to store all past gradients including their computation graphs such that we can backprop through them a second time. While this is likely to run into memory problems, it also takes far too much time, since one Hv product would then consist of many additional backpropagations. Yet, one can also use the moving average only for the linear part of the model and compute Hessian-vector products based solely on the current gradient. We tested this in some settings but found no significant improvement over the other ellipsoidal TR methods. One alternative would be to store only a small number of past gradients, as is done in Adadelta [80], but we leave an assessment of this approach for future work.
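A minimal PyTorch sketch of such a double-backprop Hessian-vector product is given below (helper names are ours); the create_graph flag in the first call is what forces the gradient's graph to be kept, which is exactly the memory issue discussed above when many past gradients would have to be retained.

import torch

def hvp(loss, params, v):
    # First backprop: gradients with their graph retained so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Scalar g^T v, still attached to the graph of the gradients.
    dot = sum((g * vi).sum() for g, vi in zip(grads, v))
    # Second backprop yields the Hessian-vector product H v; retain the graph so that
    # further Hv products with the same loss are possible within one subproblem.
    return torch.autograd.grad(dot, params, retain_graph=True)

# Example usage:
#   model = torch.nn.Linear(10, 1)
#   loss = torch.nn.functional.mse_loss(model(x), y)
#   params = list(model.parameters())
#   Hv = hvp(loss, params, [torch.randn_like(p) for p in params])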

Appendix J Further Experiment Details

J.1 Default parameters, architectures and datasets

                                           (krylov tol.)
TR_uni    512    10    0.95    1.1    1.5     0.1
TR_ada    512    10    0.95    1.1    1.5     0.1
TR_rms    512    10    0.95    1.1    1.75    0.1
Table 1: Default parameters
Parameters

Table 1 reports the default parameters we consider. Yet, we grid-searched a subset of them (e.g. the learning rate of the gradient methods) and report the problem-specific values in each corresponding subsection whenever they deviate from the above-stated values. For the second-order methods we chose the values so as to maximize performance in terms of time; other settings are likely to improve performance over epochs and backprops by a lot (e.g. a lower ). To compare all methods on the same basis, we run all first-order algorithms without momentum. Yet, we consider it an interesting direction of future research and note that momentum can also be applied to second-order methods (see e.g. [57, 54]).

For the sake of comparability, TR_uni and TR_ada are always run with the same parameters, hence the only difference between them is the trust region shape. TR_rms shares most of these parameters, but we sometimes start with a smaller TR radius and use a stronger decrease factor (), since we observed that RMSprop usually also runs with much smaller learning rates than SGD and Adagrad.
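For reference, acceptance-threshold and radius-adaptation parameters of the kind listed in Table 1 enter a trust region method through an update of roughly the following form. The sketch below is generic; the symbol names and default values are ours and do not correspond one-to-one to the columns of Table 1.

def tr_update(f_old, f_new, predicted_decrease, radius,
              eta=0.1, shrink=0.95, grow=1.5, rho_good=0.75):
    """Accept or reject a trial step and adapt the trust region radius.
    predicted_decrease = m(0) - m(s) is the decrease promised by the quadratic model."""
    rho = (f_old - f_new) / max(predicted_decrease, 1e-12)  # actual vs. predicted decrease
    accept = rho >= eta                                      # take the step only if the model was trustworthy
    if rho >= rho_good:
        radius = grow * radius                               # very successful step: enlarge the region
    elif rho < eta:
        radius = shrink * radius                             # poor step: shrink the region
    return accept, radius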

Datasets

We use three real-world datasets for image classification, namely CIFAR-10, MNIST and Fashion-MNIST (all three datasets were accessed from https://www.tensorflow.org/api_docs/python/tf/keras/datasets). While MNIST and Fashion-MNIST consist of 28x28 greyscale images, CIFAR-10 consists of 32x32 colored images. All three datasets come with a fixed training-test split of 50,000 training images for CIFAR-10 (60,000 for MNIST and Fashion-MNIST) and 10,000 test images each.
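For completeness, the datasets can be pulled directly with the loader referenced above; a minimal example (TensorFlow, independent of the PyTorch training code):

import tensorflow as tf

# Each loader returns the fixed training/test split as NumPy arrays.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
# Analogously: tf.keras.datasets.mnist.load_data() and tf.keras.datasets.fashion_mnist.load_data()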

Network architectures

The MLP architectures are simple. For MNIST and Fashion-MNIST we use a network with tanh activations and a cross-entropy loss; the network has parameters. For the CIFAR-10 MLP we use a architecture, also with tanh activations and cross-entropy loss; this network has parameters.

The MNIST and Fashion-MNIST autoencoders have the same architecture as the ones used in [38, 76, 52, 54]. The encoder structure is and the decoder is mirrored. Sigmoid activations are used in all but the central layer. The reconstructed images are fed pixelwise into a binary cross entropy loss. The network has a total of parameters. The CIFAR-10 autoencoder is taken from the implementation of https://github.com/jellycsc/PyTorch-CIFAR-10-autoencoder.
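For orientation, a PyTorch sketch of such an autoencoder is shown below. The layer widths are placeholders taken from the classic deep-autoencoder layout common in this line of work and are not the exact sizes of our networks; the structural ingredients (mirrored decoder, sigmoids everywhere except the central layer, pixelwise binary cross entropy) match the description above.

import torch.nn as nn

widths = [784, 1000, 500, 250, 30]   # placeholder encoder widths, not the exact sizes used here

def stack(sizes, sigmoid_on_last):
    layers = []
    for i, (d_in, d_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        layers.append(nn.Linear(d_in, d_out))
        if sigmoid_on_last or i < len(sizes) - 2:   # no activation on the central (code) layer
            layers.append(nn.Sigmoid())
    return layers

encoder = stack(widths, sigmoid_on_last=False)
decoder = stack(widths[::-1], sigmoid_on_last=True)   # mirrored; sigmoid output for BCE
autoencoder = nn.Sequential(*encoder, *decoder)
criterion = nn.BCELoss()                              # pixelwise binary cross entropy
# recon = autoencoder(x.view(-1, widths[0])); loss = criterion(recon, x.view(-1, widths[0]))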

We use fairly small convnets that are taken from official PyTorch tutorials. Specifically, the (Fashion-)MNIST network is specified in https://github.com/pytorch/examples/blob/master/mnist/main.py and the CIFAR-10 network can be found under https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html. The total number of parameters in these networks amount to and respectively.

In all of our experiments each method was run on one Tesla P100 GPU using the PyTorch [64] library.

J.2 Observed curvature

Figure J.2: Maximum eigenvalue of the network's Hessian over time when training the convolutional and fully-connected networks on CIFAR-10, MNIST, and Fashion-MNIST with RMSProp. Average and 95% confidence interval over 10 random initializations.
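Such an estimate can be obtained from Hessian-vector products alone, e.g. via power iteration as sketched below. This is an illustrative routine, not necessarily the estimator used to produce the figure; hvp denotes a closure mapping v to the Hessian-vector product Hv.

import torch

def max_eigenvalue(hvp, params, iters=20):
    # Power iteration on H: converges to the eigenvalue of largest magnitude
    # (a spectral shift would be needed if a negative eigenvalue dominated).
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        Hv = hvp(v)
        norm = torch.sqrt(sum((h * h).sum() for h in Hv))
        v = [h / norm for h in Hv]
    Hv = hvp(v)
    # Rayleigh quotient v^T H v with ||v|| = 1.
    return sum((h * vi).sum() for h, vi in zip(Hv, v)).item()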

J.3 Reconstructed Images from Autoencoders

Figure 12: Original and reconstructed MNIST digits (left), Fashion-MNIST items (middle), and CIFAR-10 classes (right) after convergence, shown for the original images and the reconstructions of SGD, Adagrad, RMSprop, TR Uniform, TR Adagrad, and TR RMSprop. Compare Figures 9 and 10 for the corresponding losses.

J.4 CIFAR-10

J.4.1 CIFAR-10 ConvNet

Figure: CIFAR-10 ConvNet over time. TR methods use the default parameters of Table 1 but replace and take a 20-epoch warmstart with RMSProp. First-order methods use 128 samples and , , .

J.4.2 CIFAR-10 MLP

Figure: CIFAR-10 MLP over time. TR methods use the default parameters of Table 1. First-order methods use 128 samples and , , .

J.4.3 CIFAR-10 Autoencoder

Figure: CIFAR-10 autoencoder over time. TR methods use the default parameters of Table 1. First-order methods use 128 samples and , , .

J.5 MNIST

J.5.1 MNIST ConvNet

Figure: MNIST ConvNet over time. TR methods use the default parameters of Table 1 but do one epoch of warmstart with RMSProp, use 1024 samples initially and replace . First-order methods use 128 samples and , , .

J.5.2 MNIST MLP

Figure: MNIST MLP over time. TR methods use the default parameters of Table 1 but replace . First-order methods use 128 samples and , , .

J.5.3 MNIST Autoencoder

Figure: MNIST autoencoder over time. TR methods use the default parameters of Table 1 but replace . First-order methods use 128 samples and , , .

J.6 Fashion-MNIST

J.6.1 Fashion-MNIST ConvNet

Figure: Fashion-MNIST ConvNet over time. TR methods use the default parameters of Table 1 but do two epochs of warmstart with RMSProp, start with 2048 samples and replace . TR and TR use . First-order methods use 128 samples and , , .

J.6.2 Fashion-MNIST MLP

Figure: Fashion-MNIST MLP over time. TR methods use the default parameters of Table 1 but replace . First-order methods use 128 samples and , , .

J.6.3 Fashion-MNIST Autoencoder

Figure: Fashion-MNIST autoencoder over time. TR methods use the default parameters of Table 1 but replace . TR uses . First-order methods use 128 samples and , , .