A Mini-Block Natural Gradient Method for Deep Neural Networks

02/08/2022
by Achraf Bahamou, et al.

The training of deep neural networks (DNNs) is currently predominantly done using first-order methods. Some of these methods (e.g., Adam, AdaGrad, RMSprop, and their variants) incorporate a small amount of curvature information by using a diagonal matrix to precondition the stochastic gradient. Recently, effective second-order methods, such as KFAC, K-BFGS, Shampoo, and TNT, have been developed for training DNNs by preconditioning the stochastic gradient with layer-wise block-diagonal matrices. Here we propose, and analyze the convergence of, an approximate natural gradient method, mini-block Fisher (MBF), that lies between these two classes of methods. Specifically, our method uses a block-diagonal approximation to the Fisher matrix in which, for each layer of the DNN, whether convolutional or fully connected, the associated diagonal block is itself block-diagonal and is composed of a large number of mini-blocks of modest size. Our approach exploits the parallelism of GPUs to efficiently perform computations on the large number of small matrices in each layer, so MBF's per-iteration computational cost is only slightly higher than that of first-order methods. Finally, the performance of MBF is compared to that of several baseline methods, on both autoencoder and CNN problems, to validate its effectiveness in terms of both time efficiency and generalization power.
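To make the mini-block idea concrete, the sketch below shows, under stated assumptions, how a per-layer preconditioner built from many small blocks could be applied with batched GPU operations. It is not the authors' implementation: the class name MiniBlockFisherPrecond, the row-wise partition of a fully connected layer's gradient into one mini-block per output neuron, and the hyperparameters damping and ema are illustrative choices made only for this example.

```python
import torch


class MiniBlockFisherPrecond:
    """Illustrative mini-block preconditioner for one fully connected layer.

    The layer gradient G has shape (out_features, in_features). Treating each
    output row as its own mini-block, we keep one small (in_features x
    in_features) curvature estimate per row and solve against all of them with
    a single batched call, which is where the GPU parallelism comes from.
    """

    def __init__(self, out_features, in_features, damping=1e-3, ema=0.95,
                 device="cpu"):
        self.damping = damping
        self.ema = ema
        # One mini-block estimate per output neuron (row of the gradient).
        self.F = torch.zeros(out_features, in_features, in_features,
                             device=device)

    def update(self, per_example_grads):
        """Accumulate an EMA of per-row empirical Fisher mini-blocks.

        per_example_grads: tensor of shape (batch, out_features, in_features).
        """
        g = per_example_grads
        # Average of g_o g_o^T over the batch, computed for every row o at once.
        blocks = torch.einsum("boi,boj->oij", g, g) / g.shape[0]
        self.F.mul_(self.ema).add_(blocks, alpha=1.0 - self.ema)

    def precondition(self, grad):
        """Precondition a (out_features, in_features) gradient."""
        eye = torch.eye(grad.shape[1], device=grad.device)
        # Solve (F_o + damping * I) d_o = grad_o for every mini-block o in one
        # batched call instead of inverting a single large layer-wise matrix.
        damped = self.F + self.damping * eye
        return torch.linalg.solve(damped, grad.unsqueeze(-1)).squeeze(-1)
```

Because every mini-block is tiny, one batched torch.linalg.solve over all blocks replaces a large dense inversion per layer; this is the kind of structure that keeps the per-iteration cost close to that of first-order methods, as the abstract describes.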

Related research

06/16/2020
Practical Quasi-Newton Methods for Training Deep Neural Networks
We consider the development of practical stochastic quasi-Newton, and in...

03/31/2023
Analysis and Comparison of Two-Level KFAC Methods for Training Deep Neural Networks
As a second-order method, the Natural Gradient Descent (NGD) has the abi...

06/05/2021
Tensor Normal Training for Deep Learning Models
Despite the predominant use of first-order methods for training deep lea...

08/22/2018
Fisher Information and Natural Gradient Learning of Random Deep Networks
A deep neural network is a hierarchical nonlinear model transforming inp...

01/01/2021
An iterative K-FAC algorithm for Deep Learning
Kronecker-factored Approximate Curvature (K-FAC) method is a high effici...

11/21/2016
Scalable Adaptive Stochastic Optimization Using Random Projections
Adaptive stochastic gradient methods such as AdaGrad have gained popular...

05/26/2019
Stochastic Gradient Methods with Block Diagonal Matrix Adaptation
Adaptive gradient approaches that automatically adjust the learning rate...
