Tensor Normal Training for Deep Learning Models

06/05/2021
by Yi Ren, et al.

Despite the predominant use of first-order methods for training deep learning models, second-order methods, and in particular natural gradient methods, remain of interest because of their potential for accelerating training through the use of curvature information. Several methods with non-diagonal preconditioning matrices, including KFAC and Shampoo, have been proposed and shown to be effective. Based on the so-called tensor normal (TN) distribution, we propose and analyze a new approximate natural gradient method, Tensor Normal Training (TNT), which, like Shampoo, requires knowledge only of the shapes of the training parameters. By approximating the probabilistically based Fisher matrix, as opposed to the empirical Fisher matrix, our method uses the layer-wise covariance of the sampling-based gradient as the preconditioning matrix. Moreover, the assumption that the sampling-based (tensor) gradient follows a TN distribution ensures that its covariance has a Kronecker-separable structure, which leads to a tractable approximation to the Fisher matrix. Consequently, TNT's memory requirements and per-iteration computational costs are only slightly higher than those of first-order methods. In our experiments, TNT exhibited superior optimization performance to KFAC and Shampoo as well as to state-of-the-art first-order methods. Moreover, TNT generalized as well as these first-order methods while using fewer epochs.
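To make the Kronecker-separable preconditioning concrete, below is a minimal NumPy sketch of a TNT-style step for a single matrix-shaped layer. The factor names (R, C), the exponential-moving-average decay, and the damping term are illustrative assumptions for exposition, not the authors' exact implementation: under the TN assumption, Cov(vec(G)) is approximately C ⊗ R, so the approximate natural gradient is R^{-1} G C^{-1}.

```python
import numpy as np

def tnt_style_update(W, G, R, C, lr=1e-2, decay=0.9, damping=1e-4):
    """One illustrative TNT-style step for weight matrix W with gradient G.

    R (m x m) and C (n x n) are running estimates of the row and column
    covariances of the sampling-based gradient; their Kronecker product
    approximates the full covariance Cov(vec(G)) ~ C (x) R, so the
    preconditioned gradient is R^{-1} G C^{-1} (hypothetical hyperparameters).
    """
    m, n = G.shape
    # Update the Kronecker factors with an exponential moving average.
    R = decay * R + (1 - decay) * (G @ G.T) / n
    C = decay * C + (1 - decay) * (G.T @ G) / m
    # Damped inverses keep the factors well conditioned:
    # precond_G = (R + dI)^{-1} G (C + dI)^{-1}.
    precond_G = np.linalg.solve(
        R + damping * np.eye(m),
        np.linalg.solve(C + damping * np.eye(n), G.T).T,
    )
    return W - lr * precond_G, R, C

# Usage: one step on a small random layer.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
G = rng.standard_normal((4, 3))
R, C = np.eye(4), np.eye(3)
W, R, C = tnt_style_update(W, G, R, C)
```

Because the factors are only m x m and n x n rather than mn x mn, storing and inverting them is cheap relative to the full Fisher matrix, which is what keeps TNT's memory and per-iteration costs close to those of first-order methods; in practice the factor inversions can also be amortized over several iterations.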


Related research

02/08/2022
A Mini-Block Natural Gradient Method for Deep Neural Networks
The training of deep neural networks (DNNs) is currently predominantly d...

03/26/2018
A Common Framework for Natural Gradient and Taylor based Optimisation using Manifold Theory
This technical report constructs a theoretical framework to relate stand...

08/22/2018
Fisher Information and Natural Gradient Learning of Random Deep Networks
A deep neural network is a hierarchical nonlinear model transforming inp...

06/18/2019
Information matrices and generalization
This work revisits the use of information criteria to characterize the g...

01/01/2021
An iterative K-FAC algorithm for Deep Learning
Kronecker-factored Approximate Curvature (K-FAC) method is a high effici...

06/14/2021
NG+ : A Multi-Step Matrix-Product Natural Gradient Method for Deep Learning
In this paper, a novel second-order method called NG+ is proposed. By fo...

02/26/2018
Shampoo: Preconditioned Stochastic Tensor Optimization
Preconditioned gradient methods are among the most general and powerful ...
