KAISA: An Adaptive Second-order Optimizer Framework for Deep Neural Networks

07/04/2021
by   J. Gregory Pauloski, et al.

Kronecker-factored Approximate Curvature (K-FAC) has recently been shown to converge faster in deep neural network (DNN) training than stochastic gradient descent (SGD); however, K-FAC's larger memory footprint hinders its applicability to large models. We present KAISA, a K-FAC-enabled, Adaptable, Improved, and ScAlable second-order optimizer framework that adapts the memory footprint, communication, and computation given specific models and hardware to achieve maximized performance and enhanced scalability. We quantify the tradeoffs between memory and communication cost and evaluate KAISA on large models, including ResNet-50, Mask R-CNN, U-Net, and BERT, on up to 128 NVIDIA A100 GPUs. Compared to the original optimizers, KAISA converges 18.1-36.3% faster across applications with the same global batch size. Under a fixed memory budget, KAISA converges 32.5% and 41.6% faster in ResNet-50 and BERT-Large, respectively. KAISA can balance memory and communication to achieve scaling efficiency equal to or better than the baseline optimizers.

06/30/2022

Scalable K-FAC Training for Deep Neural Networks with Distributed Preconditioning

The second-order optimization methods, notably the D-KFAC (Distributed K...
07/01/2020

Convolutional Neural Network Training with Distributed K-FAC

Training neural networks with many processors can reduce time-to-solutio...
01/21/2019

AccUDNN: A GPU Memory Efficient Accelerator for Training Ultra-deep Deep Neural Networks

Typically, Ultra-deep neural network(UDNN) tends to yield high-quality m...
05/16/2022

Optimizing the optimizer for data driven deep neural networks and physics informed neural networks

We investigate the role of the optimizer in determining the quality of t...
01/30/2019

Memory-Efficient Adaptive Optimization for Large-Scale Learning

Adaptive gradient-based optimizers such as AdaGrad and Adam are among th...
06/04/2020

Scaling Distributed Training with Adaptive Summation

Stochastic gradient descent (SGD) is an inherently sequential training a...
06/15/2021

Scalable Second Order Optimization for Deep Learning

Optimization in machine learning, both theoretical and applied, is prese...

Code Repositories

kfac_pytorch

Distributed K-FAC Preconditioner for PyTorch

