
Research of Damped Newton Stochastic Gradient Descent Method for Neural Network Training
First-order methods like stochastic gradient descent (SGD) are recently t...

On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs
Following early work on Hessian-free methods for deep learning, we study...

A Gram-Gauss-Newton Method Learning Overparameterized Deep Neural Networks for Regression Problems
First-order methods such as stochastic gradient descent (SGD) are curren...

SPAN: A Stochastic Projected Approximate Newton Method
Second-order optimization methods have desirable convergence properties....

Understanding Impacts of High-Order Loss Approximations and Features in Deep Learning Interpretation
Current methods to interpret deep learning models by generating saliency...

SHINE: SHaring the INverse Estimate from the forward pass for bilevel optimization and implicit models
In recent years, implicit deep learning has emerged as a method to incre...

WoodFisher: Efficient second-order approximations for model compression
Second-order information, in the form of Hessian- or Inverse-Hessian-vec...
Small steps and giant leaps: Minimal Newton solvers for Deep Learning
We propose a fast second-order method that can be used as a drop-in replacement for current deep learning solvers. Compared to stochastic gradient descent (SGD), it only requires two additional forward-mode automatic differentiation operations per iteration, which has a computational cost comparable to two standard forward passes and is easy to implement. Our method addresses long-standing issues with current second-order solvers, which invert an approximate Hessian matrix every iteration, either exactly or by conjugate-gradient methods, a procedure that is both costly and sensitive to noise. Instead, we propose to keep a single estimate of the gradient projected by the inverse Hessian matrix, and update it once per iteration. This estimate has the same size and is similar to the momentum variable that is commonly used in SGD. No estimate of the Hessian is maintained. We first validate our method, called CurveBall, on small problems with known closed-form solutions (noisy Rosenbrock function and degenerate 2-layer linear networks), where current deep learning solvers seem to struggle. We then train several large models on CIFAR and ImageNet, including ResNet and VGG-f networks, where we demonstrate faster convergence with no hyperparameter tuning. Code is available.
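The update the abstract describes can be sketched on a toy quadratic, where the Hessian-vector product is exact and cheap. This is a minimal illustration, not the authors' implementation: the fixed values of `rho` and `beta` below are illustrative assumptions (the paper adapts such hyperparameters automatically), and the exact Hessian-vector product stands in for the forward-mode AD passes used on real networks.

```python
import numpy as np

# Toy quadratic f(x) = 0.5 x^T A x - b^T x, so the gradient is A x - b
# and the Hessian-vector product is simply A v.
A = np.array([[1.0, 0.0], [0.0, 10.0]])
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b

def hvp(x, v):
    # On a neural network this would be computed by forward-mode AD,
    # never by forming the Hessian explicitly.
    return A @ v

# CurveBall-style loop: keep a single estimate z of the Newton step
# (the gradient projected by the inverse Hessian) and refresh it once
# per iteration; no Hessian is ever stored or inverted.
rho, beta = 0.9, 0.05   # illustrative fixed values, chosen ad hoc
x = np.zeros(2)
z = np.zeros(2)
for _ in range(500):
    delta = hvp(x, z) + grad(x)  # H z + g, one Hessian-vector product
    z = rho * z - beta * delta   # update the projected-gradient estimate
    x = x + z                    # apply the step like a momentum update

# The exact minimizer of this quadratic is A^{-1} b = [1.0, 0.1],
# which the iterate x approaches.
```

Note how `z` plays the same role as SGD's momentum buffer (same size, updated once per step), which is what makes the method a drop-in replacement.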