1 Introduction
Large deep neural networks trained on massive data sets have led to major advances in machine learning performance (
[22]). Current practice is to train networks using gradient descent (SGD) and momentum optimizers, along with naturalgradientlike methods ([17, 43, 11, 21]). As distributed computation availability increases, total walltime to train large models has become a substantial bottleneck, and approaches that decrease total walltime without sacrificing model generalization are very valuable.In the simplest version of minibatch SGD, one computes the average gradient of the loss over a small set of examples, and takes a step in the direction of the negative gradient. It is well known that the convergence of the original SGD algorithm ([29]
) has two terms, one of which depends on the variance of the gradient estimate. In practice, decreasing the variance by increasing the batch size suffers from diminishing returns, often resulting in speedups that are sublinear in batch size, and even worse, in degraded generalization performance (
[20]). Some recent work ([13, 41, 42]) suggests that by carefully tuning learning rates and other hyperparameter schedules, it is possible to train architectures like ResNets and AlexNet on Imagenet with large minibatches of up to 8192 with no loss of accuracy, shortening training time to hours instead of days or weeks.There have been many attempts to incorporate secondorder Hessian information into stochastic optimizers (see related work below). Such algorithms either explicitly approximate the Hessian (or its inverse), or exploit the use of Hessianvector products. Unfortunately, the additional computational cost and implementation complexity often outweigh the benefit of improved descent directions. Consequently, their adoption has been limited, and it has largely been unclear whether such algorithms would be successful on large modern machine learning tasks.
In this work, we attack the problem of training with reduced walltime via a novel stochastic optimization algorithm that uses (limited) second order information without explicit approximations of Hessian matrices or even Hessianvector products. On each minibatch, our algorithm computes a descent direction by solving an intermediate optimization problem, and inverting the Hessian of the minibatch. Explicit computations with Hessian matrices are extremely expensive, so we develop an inner loop iteration that applies the Hessian inverse without explicitly representing the Hessian, or computing a Hessian vector product. The key ingredients in this iteration are the Neumann series expansion for the matrix inverse, and an observation that allows us to replace each occurrence of the Hessian with a single gradient evaluation.
We conduct largescale experiments using real models (InceptionV3, Resnet50, Resnet101, InceptionResnetV2) on the ImageNet dataset. Compared to recent work, our algorithm has favourable scaling properties; we are able to obtain linear speedup up to a batch size of 32000, while maintaining or even improving model quality compared to the baseline. Additionally, our algorithm when run using smaller minibatches is able to improve the validation error by 0.80.9% across all the models we try; alternatively, we can maintain baseline model quality and obtain a 1030% decrease in the number of steps. Our algorithm is easy to use in practice, with the learning rate as the sole hyperparameter.
1.1 Related Work
There has been an explosion of research in developing faster stochastic optimization algorithms: there are any number of secondorder algorithms that represent and exploit curvature information ([31, 3, 24, 38, 33, 25, 26, 19, 5, 8, 39, 23, 14, 4]). An alternative line of research has focused on variance reduction ([18, 32, 10]), where very careful model evaluations are chosen to decrease the variance in the stochastic gradient. Despite the proliferation of new ideas, none of these optimizers have become very popular: the added computational cost and implementation complexity, along with the lack of largescale experimental evaluations (results are usually reported on small datasets like MNIST or CIFAR10), have largely failed to convince practitioners that real improvements can be had on large models and datasets. Recent work has focused on using very large batches ([13, 41]
). These papers rely on careful hyperparameter selection and hardware choices to scale up the minibatch size to 8192 without degradation in the evaluation metrics.
2 Algorithmic Ideas
Let be the inputs to a neural net with some weights : we want the neural net to learn to predict a target
which may be discrete or continuous. We will do so by minimizing the loss function
where is drawn from the data distribution, and is a per sample loss function. Thus, we want to solve the optimization problemIf the true data distribution is not known (as is the case in practice), the expected loss is replaced with an empirical loss. Given a set of training samples , let be the loss for a particular sample . Then the problem we want to solve is
(1) 
Consider a regularized first order approximation of around the point :
Minimizing leads to the familiar rule for gradient descent, . If the loss function is convex, we could instead compute a local quadratic approximation of the loss as
(2) 
where is the (positive definite) Hessian of the empirical loss. Minimizing gives the Newton update rule . This involves solving a linear system:
(3) 
Our algorithm works as follows: on each minibatch, we will form a separate quadratic subproblem as in Equation (2). We will solve these subproblems using an iteration scheme we describe in Section 2.1. Unfortunately, the naive application of this iteration scheme requires a Hessian matrix; we show how to avoid this in Section 2.2. We make this algorithm practical in Section 3.
2.1 Neumann Series
There are many way to solve the linear system in Equation (3). An explicit representation of the Hessian matrix is prohibitively expensive; thus a natural first attempt is to use Hessianvector products instead. Such a strategy might apply a conjugate gradient or Lanczos type iteration using efficiently computed Hessianvector products via the Pearlmutter trick ([28]) to directly minimize the quadratic form. In our preliminary experiments with this idea, the cost of the Hessianvector products overwhelms any improvements from a better descent direction (see also Appendix A). We take an even more indirect approach, eschewing even Hessianvector products.
At the heart of our method lies a power series expansion of the approximate inverse for solving linear systems; this idea is well known, and it appears in various guises as Chebyshev iteration, the conjugate gradient method, the Lanczos algorithm, and Nesterov accelerated gradient methods. In our case, we use the Neumann power series for matrix inverses – given a matrix
whose eigenvalues,
satisfy , the inverse is given by:This is the familiar geometric series with the substitution . Using this, we can solve the linear system via a recurrence relation
(4) 
where we can easily show that . This is the well known Richardson iteration ([37]), and is equivalent to gradient descent on the quadratic objective.
2.2 Quadratic approximations for minibatches
A full batch method is impractical for even moderately large networks trained on modest amounts of data. The usual practice is to obtain an unbiased estimate of the loss by using a minibatch. Given a minibatch from the training set
of size , let(5) 
be the function that we optimize at a particular step. Similar to Equation (2), we form the stochastic quadratic approximation for the minibatch as:
(6) 
As before, we compute a descent direction by solving a linear system, , but now, the linear system is only over the minibatch. To do so, we use the Neumann series in Equation (4). Let us assume that the Hessian is positive definite^{1}^{1}1We will show how to remove the positive definite assumption in Section 3.1, with an operator norm bound . Setting , we define the Neumann iterates by making the substitutions , , and into Equation (4):
(7) 
The above sequence of reductions is justified by the following crucial observation: the bold term on the second line is a first order approximation to for sufficiently small via the Taylor series:
By using first order only information at a point that is not the current weights, we have been able to incorporate curvature information in a matrixfree fashion. This approximation is the sole reason that we pick the slowly converging Neumann series – it allows for extremely cheap incorporation of secondorder information. We are now ready to state our idealized Neumann algorithm:
The practical solution of Equation (6) occupies the rest of this paper, but let us pause to reflect on what we have done so far. The difference between our technique and the typical stochastic quasiNewton algorithm is as follows: in an idealized stochastic quasiNewton algorithm, one hopes to approximate the Hessian of the total loss and then to invert it to obtain the descent direction . We, on the other hand, are content to approximate the Hessian only on the minibatch to obtain the descent direction . These two quantities are fundamentally different, even in expectation, as the presence of the batch in both the Hessian and gradient estimates leads to a product that does not factor. One can think of stochastic quasiNewton algorithms as trying to find the best descent direction by using secondorder information about the total objective, whereas our algorithm tries to find a descent direction by using secondorder information implied by the minibatch. While it is well understood in the literature that trying to use curvature information based on a minibatch is inadvisable, we justify this by noting that our curvature information arises solely from gradient evaluations, and that in the large batch setting, gradients have much better concentration properties than Hessians.
The two loop structure of Algorithm 1 is a common idea in the literature (for example, [6, 2, 39]): typically though, one solves a difficult convex optimization problem in the innerloop. In contrast, we solve a much easier linear system in the innerloop: this idea is also found in ([24, 38, 5]), where the curvature information is derived from more expensive Hessianvector products.
Here, we diverge from typical optimization papers for machine learning: instead of deriving a rate of convergence using standard assumptions on smoothness and strong convexity, we move onto the much more poorly defined problem of building an optimizer that actually works for largescale deep neural nets.
3 An optimizer for neural networks
Our idealized Neumann optimizer algorithm is deeply impractical. The main problems are:

We assumed that the expected Hessian is positive definite, and furthermore that the Hessian on each minibatch is also positive definite.

There are four hyperparameters that significantly affect optimization – , inner loop iterations and batch size.
We shall introduce two separate techniques for convexifying the problem – one for the total Hessian and one for minibatch Hessians, and we will reduce the number of hyperparameters to just a single learning rate.
3.1 Convexification
In a deterministic setting, one of the best known techniques for dealing with nonconvexity in the objective is cubic regularization ([27]): adding a regularization term of to the objective function, where is a scalar hyperparameter weight. This is studied in [6], where it is shown that under mild assumptions, gradient descent on the regularized objective converges to a secondorder stationary point (i.e., Theorem 3.1). The cubic regularization method falls under a broad class of trust region methods. This term is essential to theoretically guarantee convergence to a critical point. We draw inspiration from this work and add two regularization terms – a cubic regularizer, , and a repulsive regularizer, to the objective, where is an exponential moving average of the parameters over the course of optimization. The two terms oppose each other  the cubic term is attractive and prevents large updates to the parameters especially when the learning rate is high (in the initial part of the training), while the second term adds a repulsive potential and starts dominating when the learning rate becomes small (at the end of training). The regularized objective is and its gradient is
(8) 
Even if the expected Hessian is positive definite, this does not imply that the Hessians of individual batches themselves are also positive definite. This poses substantial difficulties since the intermediate quadratic forms become unbounded, and have an arbitrary minimum in the span of the subspace of negative eigenvalues. Suppose that the eigenvalues of the Hessian, , satisfy , then define the coefficients:
In this case, the matrix is a positive definite matrix. If we use this matrix instead of in the inner loop, we obtain updates to the descent direction:
(9) 
It is not clear a priori that the matrix will yield good descent directions, but if is small compared to , then the perturbation does not affect the Hessian beyond a simple scaling. This is the case later in training ([30, 7, 9]), but to validate it, we conducted an experiment (see Appendix A), where we compute the extremal minibatch Hessian eigenvalues using the Lanczos algorithm. Over the trajectory of training, the following qualitative behaviour emerges:

Initially, there are many large negative eigenvalues.

During the course of optimization, these large negative eigenvalues decrease in magnitude towards zero.

Simultaneously, the largest positive eigenvalues continuously increase (almost linearly) over the course of optimization.
This validates our minibatch convexification routine. In principle, the cubic regularizer is redundant – if each minibatch is convex, then the overall problem is also convex. But since we only crudely estimate and , the cubic regularizer ensures convexity without excessively large distortions to the Hessian in . Based on the findings in our experimental study, we set , and .
3.2 Running the optimizer: SGD Burn in and Inner Loop Iterations
We now make some adjustments to the idealized Neumann algorithm to improve performance and stability in training. The first change is trivial – we add a very short phase of vanilla SGD at the start. SGD is typically more robust to the pathologies of initialization than other optimization algorithms, and a “warmup” phase is not uncommon (even in a momentum method, the initial steps are dampened by virtue of using exponential weighted averages starting at 0).
Next, there is an open question of how many inner loop iterations to take. Our experience is that there are substantial diminishing marginal returns to reusing a minibatch. A deep net has on the order of millions of parameters, and even the largest minibatch size is less than fifty thousand examples. Thus, we can not hope to rely on very finegrained information from each minibatch. From an efficiency perspective, we need to keep the number of inner loop iterations very low; on the other hand, this leads to the algorithm degenerating into an SGDesque iteration, where the inner loop descent directions are never truly useful. We solve this problem as follows: instead of freezing a minibatch and then computing gradients with respect to this minibatch at every iteration of the inner loop, we compute a stochastic gradient at every iteration of the inner loop. One can think of this as solving a stochastic optimization subproblem in the inner loop instead of solving a deterministic optimization problem. This small change is effective in practice, and also frees us from having to carefully pick the number of inner loop iterations – instead of having to carefully balance considerations of optimization quality in the inner loop with overfitting on a particular minibatch, the optimizer now becomes relatively insensitive to the number of inner loop iterations; we pick a doubling schedule for our experiments, but a linear one (as presented in Algorithm 2) works equally well. Additionally, since the inner and outer loop updates are now identical, we simply apply a single learning rate instead of two.
Finally, there is the question of how to set the minibatch size for our algorithm. Since we are trying to extract secondorder information from the minibatch, we hypothesize that Neumann optimizer is better suited to the large batch setting, and that one should pick the minibatch size as large as possible. We provide experimental evidence for this hypothesis in Section 4.
As an implementation simplification, the maintained in Algorithm 2 are actually the displaced parameters in Equation (2.2). This slight notational shift then allows us to “flatten” the two loop structure with no change in the underlying iteration. In Table 1, we compile a list of hyperparameters that work across a wide range of models (all our experiments, on both large and small models, used these values): the only one that the user has to select is the learning rate.
Hyperparameter  Setting 

Cubic Regularizer  
Repulsive Regularizer  
Moving Average  
Momentum  , starting at and peaking at . 
Number of SGD warmup steps  
Number of reset steps  K, starts at epochs, and doubles after every reset. 
4 Experiments
We experimentally evaluated our optimizer on several large convolutional neural networks for image classification. While our experiments were successful on smaller datasets (CIFAR10 and CIFAR100) without any hyperparameter modifications, we shall only report results on the ImageNet dataset.
Our experiments were run in Tensorflow (
[1]), on Tesla P100 GPUs, in our distributed infrastructure. To abstract away the variability inherent in a distributed system such as network traffic, job loads, preemptions etc, we use training epochs as our notion of time. Since we use the same amount of computation and memory as an Adam optimizer ([21]), our step times are on par with commonly used optimizers. We used the standard Inception data augmentation ([12]) for all models. We used an input image size of for the InceptionV3 and InceptionResnetV2 models, and for all Resnet models, and measured the evaluation metrics using a single crop. We intend to open source our code at a later date.Neumann optimizer seems to be robust to different initializations and trajectories (see Appendix ). In particular, the final evaluation metrics are stable do not vary significantly from run to run, so we present results from single runs throughout our experimental section.
4.1 Fixed Minibatch Size: Better Accuracy or Faster Training
First, we compared our Neumann optimizer to standard optimization algorithms fixing the minibatch size. To this end, for the baselines we trained an InceptionV3 model ([35]), a Resnet50 and Resnet101 ([15, 16]), and finally an InceptionResnetV2 ([34]
). The InceptionV3 and InceptionResnetV2 models were trained as in their respective papers, using the RMSProp optimizer (
[17]) in a synchronous fashion, additionally increasing the minibatch size to 64 (from 32) to account for modern hardware. The Resnet50 and Resnet101 models were trained with a minibatch size of 32 in an asynchronous fashion using SGD with momentum 0.9, and a learning rate of 0.045 that decayed every 2 epochs by a factor of 0.94. In all cases, we used 50 GPUs. When training synchronously, we scale the learning rate linearly after an initial burnin period of 5 epochs where we slowly ramp up the learning rate as suggested by [13], and decay every 40 epochs by a factor of 0.3 (this is a similar schedule to the asynchronous setting because ). Additionally, we run Adam to compare against a popular baseline algorithm.Baseline  Neumann  Improvement  

InceptionV3  21.7 %  20.8 %  0.91% 
Resnet50  23.9 %  23.0 %  0.94 % 
Resnet101  22.6 %  21.7%  0.86 % 
InceptionResnetV2  20.3 %  19.5 %  0.84 % 
We evaluate our optimizer in terms of final test accuracy (top1 validation error), and the number of epochs needed to achieve a fixed accuracy. In Figure 2, we can see the training curves and the test error for Inception V3 as compared to the baseline RMSProp.
The salient characteristics are as follows: first, the classification loss (the sum of the main cross entropy loss and the auxiliary head loss) is not improved, and secondly there are oscillations early in training that also manifest in the evaluation. The oscillations are rather disturbing, and we hypothesize that they stem from slight misspecification of the hyperparameter , but all the models we train appear to be robust to these oscillations. The lack of improvement in classification loss is interesting, especially since the evaluation error is improved by a nontrivial increment of 0.80.9 %. This improvement is consistent across all our models (see Table 2 and Figure 2). As far as we know, it is unusual to obtain an improvement of this quality when changing from a welltuned optimizer. We discuss the open problems raised by these results in the discussion section.
This improvement in generalization can also tradedoff for faster training: if one is content to obtain the previous baseline validation error, then one can simply run the Neumann optimizer for fewer steps. This yields a 1030% speedup whilst maintaining the current baseline accuracy.
On these large scale image classification models, Adam shows substantially inferior performance compared to both Neumann optimizer and RMSProp. This reflects our understanding that architectures and algorithms are tuned to each other for optimal performance. For the rest of this paper, we will compare Neumann optimizer with RMSProp only.
4.2 Linearscaling at very large batch sizes
Earlier, we hypothesized that our method is able to efficiently use large batches. We study this by training a Resnet50 on increasingly large batches (using the same learning rate schedule as in Section 4.1) as shown in Figure 3 and Table 3. Each GPU can handle a minibatch of 32 examples, so for example, a batch size of 8000 implies 250 GPUs. For batch sizes of 16000 and 32000, we used 250 GPUs, each evaluating the model and its gradient multiple times before applying any updates.
Our algorithm scales to very large minibatches: up to minibatches of size 32000, we are still better than the baseline. To our knowledge, our Neumann Optimizer is a new stateoftheart in taking advantage of large minibatch sizes while maintaining model quality. Compared to [13], it can take advantage of 4x larger minibatches; compared to [42, 41] it uses the same minibatch size but matches baseline accuracy while [42, 41] suffers from a 0.40.7% degradation.
Batch Size  Top1 Validation Error  # Epochs 

1600  23.0 %  226 
4000  23.0 %  230 
8000  23.1 %  258 
16000  23.5 %  210 
32000  24.0 %  237 
4.3 Effect of Regularization
We studied the effect of regularization by performing an ablation experiment (setting and to 0). Our main findings are summarized in Table 4 (and Figure 6 in Appendix B). We can see that regularization improves validation performance, but even without it, there is a performance improvement from just running the Neumann optimizer.
Method  Top1 Error 

Baseline  24.3 % 
Neumann (without regularization)  23.5 % 
Neumann (with regularization)  23.0 % 
4.4 Negative result for sequencetosequence RNNs
We also tried our algorithm on a largescale sequencetosequence speechsynthesis model called Tacotron ([40]
), where we were unable to obtain any speedup or quality improvements. Training this model requires aggressive gradient clipping; we suspect the Neumann optimizer responds poorly to this, as our approximation of the Hessian in Equation (
2.2) breaks down.5 Discussion
In this paper, we have presented a large batch optimization algorithm for training deep neural nets; roughly speaking, our algorithm implicitly inverts the Hessian of individual minibatches. Our algorithm is practical, and the only hyperparameter that needs tuning is the learning rate. Experimentally, we have shown the optimizer is able to handle very large minibatch sizes up to 32000 without any degradation in quality relative to current baseline models. Intriguingly, at smaller minibatch sizes, the optimizer is able to produce models that generalize better, and improve top1 validation error by 0.80.9% across a variety of architectures with no attendant drop in the classification loss.
We believe the latter phenomenon is worth further investigation, especially since the Neumann optimizer does not improve the training loss. This indicates that, somehow, the optimizer has found a different local optimum. We think that this confirms the general idea that optimization and generalization can not be decoupled in deep neural nets.
Acknowledgments
We would like to thank RJ SkerryRyan for his help in writing the Tensorflow GPU kernel for the optimizer, and Matt Hoffman for his comennts and feedback.
References
 [1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. Tensorflow: Largescale machine learning on heterogeneous systems, 2015. url h ttp. tensorflow. org/. So ftware available from tensorflow. org.
 [2] Agarwal, N., Bullins, B., and Hazan, E. Second order stochastic optimization in linear time. Optimization Methods for the Next Generation of Machine Learning workshop, ICML 2016 (2016).

[3]
Bordes, A., Bottou, L., and Gallinari, P.
Sgdqn: Careful quasinewton stochastic gradient descent.
Journal of Machine Learning Research 10, Jul (2009), 1737–1754.  [4] Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for largescale machine learning. arXiv preprint arXiv:1606.04838 (2016).
 [5] Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A stochastic quasinewton method for largescale optimization. SIAM Journal on Optimization 26, 2 (2016), 1008–1031.
 [6] Carmon, Y., and Duchi, J. C. Gradient descent efficiently finds the cubicregularized nonconvex newton step. arXiv preprint arXiv:1612.00547 (2016).
 [7] Chaudhari, P., Choromanska, A., Soatto, S., and LeCun, Y. Entropysgd: Biasing gradient descent into wide valleys. CoRR abs/1611.01838 (2016).
 [8] Curtis, F. A selfcorrecting variablemetric algorithm for stochastic optimization. In International Conference on Machine Learning (2016), pp. 632–641.
 [9] Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in highdimensional nonconvex optimization. In Advances in neural information processing systems (2014), pp. 2933–2941.
 [10] Defazio, A., Bach, F., and LacosteJulien, S. Saga: A fast incremental gradient method with support for nonstrongly convex composite objectives. In Advances in Neural Information Processing Systems (2014), pp. 1646–1654.
 [11] Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, Jul (2011), 2121–2159.
 [12] Github. Tensorflow models. https://github.com/tensorflow/models/blob/master/research/inception/, 2017.
 [13] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).
 [14] Grosse, R., and Martens, J. A kroneckerfactored approximate fisher matrix for convolution layers. In International Conference on Machine Learning (2016), pp. 573–582.

[15]
He, K., Zhang, X., Ren, S., and Sun, J.
Deep residual learning for image recognition.
In
Proceedings of the IEEE conference on computer vision and pattern recognition
(2016), pp. 770–778.  [16] He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision (2016), Springer, pp. 630–645.
 [17] Hinton, G., Srivastava, N., and Swersky, K. Rmsprop: Divide the gradient by a running average of its recent magnitude. Neural networks for machine learning, Coursera lecture 6e (2012).
 [18] Johnson, R., and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems (2013), pp. 315–323.
 [19] Keskar, N. S., and Berahas, A. S. adaqn: An adaptive quasinewton algorithm for training rnns. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2016), Springer, pp. 1–16.
 [20] Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On largebatch training for deep learning: Generalization gap and sharp minima. International Conference for Learning Representations (2017).
 [21] Kingma, D., and Ba, J. Adam: A method for stochastic optimization. International Conference for Learning Representations (2015).
 [22] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature 521, 7553 (2015), 436–444.
 [23] Martens, J., and Grosse, R. Optimizing neural networks with kroneckerfactored approximate curvature. In International Conference on Machine Learning (2015), pp. 2408–2417.
 [24] Martens, J., and Sutskever, I. Training deep and recurrent networks with hessianfree optimization. In Neural networks: Tricks of the trade. Springer, 2012, pp. 479–535.
 [25] Mokhtari, A., and Ribeiro, A. Res: Regularized stochastic bfgs algorithm. IEEE Transactions on Signal Processing 62, 23 (2014), 6089–6104.
 [26] Mokhtari, A., and Ribeiro, A. Global convergence of online limited memory bfgs. Journal of Machine Learning Research 16, 1 (2015), 3151–3181.
 [27] Nesterov, Y., and Polyak, B. T. Cubic regularization of newton method and its global performance. Mathematical Programming 108, 1 (2006), 177–205.
 [28] Pearlmutter, B. A. Fast exact multiplication by the hessian. Neural computation 6, 1 (1994), 147–160.
 [29] Robbins, H., and Monro, S. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22, 3 (1951), 400–407.
 [30] Sagun, L., Bottou, L., and LeCun, Y. Eigenvalues of the hessian in deep learning: Singularity and beyond.
 [31] Schraudolph, N. N., Yu, J., and Günter, S. A stochastic quasinewton method for online convex optimization. In Artificial Intelligence and Statistics (2007), pp. 436–443.
 [32] ShalevShwartz, S., and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research 14, Feb (2013), 567–599.
 [33] SohlDickstein, J., Poole, B., and Ganguli, S. Fast largescale optimization by unifying stochastic gradient and quasinewton methods. In International Conference on Machine Learning (2014), pp. 604–612.

[34]
Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A.
Inceptionv4, inceptionresnet and the impact of residual connections on learning.
In AAAI (2017), pp. 4278–4284.  [35] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 2818–2826.
 [36] Trefethen, L. N., and Bau, D. Numerical linear algebra, vol. 50. Siam, 1997.
 [37] Varga, R. S. Matrix iterative analysis, vol. 27. Springer Science & Business Media, 2009.
 [38] Vinyals, O., and Povey, D. Krylov subspace descent for deep learning. In Artificial Intelligence and Statistics (2012), pp. 1261–1268.
 [39] Wang, X., Ma, S., Goldfarb, D., and Liu, W. Stochastic quasinewton methods for nonconvex stochastic optimization. SIAM Journal on Optimization 27, 2 (2017), 927–956.
 [40] Wang, Y., SkerryRyan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., et al. Tacotron: Towards endtoend speech synthesis. Interspeech (2017).
 [41] You, Y., Gitman, I., and Ginsburg, B. Scaling sgd batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888 (2017).
 [42] You, Y., Zhang, Z., Demmel, J., Keutzer, K., and Hsieh, C.J. Imagenet training in 24 minutes. arXiv preprint arXiv:1709.05011 (2017).
 [43] Zeiler, M. D. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012).
Appendix A Minibatch Krylov algorithms, and Hessian Eigenvalue Estimates
There are many possible strategies for solving the quadratic minibatch optimization problem. In particular, various Krylov subspace methods ([24, 38]), such as conjugate gradient, are very appealing because of their fast convergence and ability to solve the linear system in Equation (3) using Hessianvector products. Unfortunately, in our preliminary experiments, none of these Krylov methods gave better optimizers – the Hessianvector product was simply too expensive relative to the quality of the descent directions.
On the other hand, the closelyrelated idea of running a Lanczos algorithm on the minibatch gives us excellent information about the eigenvalues of the minibatch Hessian. The Lanczos algorithm is a Krylov subspace method for computing the eigenvalues of a Hermitian matrix (see [36] for a detailed exposition). After iterations, the Lanczos algorithm outputs a tridiagonal matrix whose eigenvalues, known as Ritz values, typically are close to the extreme (largest magnitude) eigenvalues of the original matrix. Crucially, the Lanczos algorithm requires only the ability to perform matrixvector products; in our setting, one can compute Hessianvector products almost as cheaply as the gradient using the Pearlmutter trick ([28]), and thus we can use the Lanczos algorithm to compute estimates of the extremal eigenvalues of the Hessian.
Supposing that we have an upper bound on the most positive eigenvalue , then by applying a shift operation to the Hessian of the form , we can compute the most negative eigenvalue . This is useful when for example.
The following is an experiment that we ran on a CIFAR10 model: we trained the model as per usual using SGD. Along the trajectory of optimization, we ran a Lanczos algorithm to estimate the most positive and most negative eigenvalues. Figure 4 depicts these eigenvalues. Although the estimates of the minibatch eigenvalues are very noisy, the qualitative behaviour is still clear:

The maximum eigenvalue increases (almost) linearly over the course of optimization.

The most negative eigenvalue decays towards 0 (from below) over the course of optimization.
This is consistent with the existing results in the literature ([30, 7, 9]), and we use these observations to specify a parametric form for the parameter.
Appendix B Effect of Regularization
In this appendix, we study the effects of removing the cubic and repulsive regularizer terms in the objective. The output models are of lower quality, though the final evaluation metrics are still better than a baseline RMSProp.
Appendix C Robustness to initializations and randomness
In this appendix, we compare two different initializations and trajectories of the Neumann optimizer. Although the intermediate training losses and evaluation metrics are different, the final output model quality is the same, and are substantially better than RMSProp.