mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural Network Optimization

by   Yue Niu, et al.

Quasi-Newton methods still face significant challenges in training large-scale neural networks due to additional compute costs in the Hessian related computations and instability issues in stochastic training. A well-known method, L-BFGS that efficiently approximates the Hessian using history parameter and gradient changes, suffers convergence instability in stochastic training. So far, attempts that adapt L-BFGS to large-scale stochastic training incur considerable extra overhead, which offsets its convergence benefits in wall-clock time. In this paper, we propose mL-BFGS, a lightweight momentum-based L-BFGS algorithm that paves the way for quasi-Newton (QN) methods in large-scale distributed deep neural network (DNN) optimization. mL-BFGS introduces a nearly cost-free momentum scheme into L-BFGS update and greatly reduces stochastic noise in the Hessian, therefore stabilizing convergence during stochastic optimization. For model training at a large scale, mL-BFGS approximates a block-wise Hessian, thus enabling distributing compute and memory costs across all computing nodes. We provide a supporting convergence analysis for mL-BFGS in stochastic settings. To investigate mL-BFGS potential in large-scale DNN training, we train benchmark neural models using mL-BFGS and compare performance with baselines (SGD, Adam, and other quasi-Newton methods). Results show that mL-BFGS achieves both noticeable iteration-wise and wall-clock speedup.


page 1

page 2

page 3

page 4


Trust-Region Algorithms for Training Responses: Machine Learning Methods Using Indefinite Hessian Approximations

Machine learning (ML) problems are often posed as highly nonlinear and n...

Nys-Curve: Nyström-Approximated Curvature for Stochastic Optimization

The quasi-Newton methods generally provide curvature information by appr...

Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization

In this paper, we introduce Apollo, a quasi-Newton method for nonconvex ...

Deep Neural Network Learning with Second-Order Optimizers – a Practical Study with a Stochastic Quasi-Gauss-Newton Method

Training in supervised deep learning is computationally demanding, and t...

On the efficiency of Stochastic Quasi-Newton Methods for Deep Learning

While first-order methods are popular for solving optimization problems ...

One Epoch Is All You Need

In unsupervised learning, collecting more data is not always a costly pr...

Stochastic quasi-Newton with adaptive step lengths for large-scale problems

We provide a numerically robust and fast method capable of exploiting th...

Please sign up or login with your details

Forgot password? Click here to reset