Accelerating Hessian-free optimization for deep neural networks by implicit preconditioning and sampling

09/05/2013
by Tara N. Sainath, et al.

Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims to speed up Hessian-free training both by reducing the amount of data used for training and by reducing the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. First, we develop an L-BFGS-based preconditioning scheme that avoids the need to access the Hessian explicitly. Because L-BFGS cannot be regarded as a fixed-point iteration, we further propose using flexible Krylov subspace solvers that retain the theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm that geometrically increases the amount of data used for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, these techniques provide roughly a 1.5x speedup, while on a 300-hr Switchboard task they provide over a 2.3x speedup, with no loss in WER. These results suggest that even larger speedups can be expected as problem scale and complexity grow.
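The abstract combines three ingredients: an L-BFGS preconditioner applied implicitly (without ever forming a matrix), a flexible Krylov solver that tolerates a preconditioner which is not a fixed linear operator, and a geometrically growing data sample per Hessian-free iteration. The sketch below illustrates these ideas in Python/NumPy under stated assumptions; the function names (lbfgs_precondition, flexible_pcg, geometric_sample_size), the matvec callback (assumed to return damped Gauss-Newton or Hessian-vector products, e.g. via the R-operator), the source of the curvature pairs (s, y), and all numeric constants are illustrative assumptions, not the authors' actual implementation.

# A minimal sketch (not the authors' implementation) of the ideas above:
# (1) an L-BFGS inverse-Hessian approximation applied implicitly via the
#     two-loop recursion, so no preconditioner matrix is ever formed,
# (2) a "flexible" preconditioned CG iteration that tolerates a changing
#     preconditioner, and
# (3) a geometric schedule for the amount of data per HF iteration.
import numpy as np

def lbfgs_precondition(v, s_hist, y_hist):
    # Apply the L-BFGS inverse-Hessian approximation to v using the
    # standard two-loop recursion over stored curvature pairs (s, y).
    q = v.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_hist, y_hist)]
    alphas = []
    for s, y, rho in reversed(list(zip(s_hist, y_hist, rhos))):
        a = rho * np.dot(s, q)
        alphas.append(a)
        q -= a * y
    gamma = (np.dot(s_hist[-1], y_hist[-1]) / np.dot(y_hist[-1], y_hist[-1])
             if s_hist else 1.0)
    r = gamma * q
    for (s, y, rho), a in zip(zip(s_hist, y_hist, rhos), reversed(alphas)):
        b = rho * np.dot(y, r)
        r += (a - b) * s
    return r

def flexible_pcg(matvec, b, s_hist, y_hist, max_iter=50, tol=1e-6):
    # Solve matvec(x) = b (e.g. a damped Gauss-Newton system) with
    # preconditioned CG.  Because the L-BFGS preconditioner is not a
    # fixed linear operator, beta uses the flexible (Polak-Ribiere-type)
    # update instead of the classical Fletcher-Reeves one.
    x = np.zeros_like(b)
    r = b - matvec(x)
    z = lbfgs_precondition(r, s_hist, y_hist)
    p = z.copy()
    rz = np.dot(r, z)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rz / np.dot(p, Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        z_new = lbfgs_precondition(r_new, s_hist, y_hist)
        beta = np.dot(z_new, r_new - r) / rz      # flexible update
        p = z_new + beta * p
        r, z, rz = r_new, z_new, np.dot(r_new, z_new)
    return x

def geometric_sample_size(hf_iteration, base=1000, growth=1.5, full_size=300000):
    # Geometrically growing number of training frames used for the
    # gradient and curvature (matvec) estimates at each HF iteration.
    return int(min(full_size, base * growth ** hf_iteration))

In a Hessian-free loop, one might pick n = geometric_sample_size(k) at iteration k, build matvec from curvature-vector products on that sample, and call flexible_pcg with curvature pairs gathered from earlier iterations; this wiring is likewise an assumption for illustration rather than the paper's exact procedure.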


