A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale

09/12/2023
by Hao-Jun Michael Shi, et al.

Shampoo is an online and stochastic optimization algorithm belonging to the AdaGrad family of methods for training neural networks. It constructs a block-diagonal preconditioner where each block consists of a coarse Kronecker product approximation to full-matrix AdaGrad for each parameter of the neural network. In this work, we provide a complete description of the algorithm as well as the performance optimizations that our implementation leverages to train deep networks at-scale in PyTorch. Our implementation enables fast multi-GPU distributed data-parallel training by distributing the memory and computation associated with blocks of each parameter via PyTorch's DTensor data structure and performing an AllGather primitive on the computed search directions at each iteration. This major performance enhancement enables us to achieve at most a 10% performance reduction in per-step wall-clock time compared against standard diagonal-scaling-based adaptive gradient methods. We validate our implementation by performing an ablation study on training ImageNet ResNet50, demonstrating Shampoo's superiority over standard training recipes with minimal hyperparameter tuning.
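To make the update concrete, below is a minimal, single-GPU sketch of a Shampoo-style step for one matrix-shaped parameter, assuming PyTorch. The function and variable names (shampoo_step, L, R) are illustrative and are not the API of the released implementation; blocking of large dimensions, grafting, learning-rate schedules, and the DTensor/AllGather distribution described in the abstract are omitted.

```python
# A minimal sketch of a Shampoo-style preconditioned step for a single 2-D
# parameter. L and R accumulate the Kronecker factors G G^T and G^T G of
# full-matrix AdaGrad; the search direction is L^{-1/4} G R^{-1/4}.
import torch


def shampoo_step(param, grad, L, R, lr=0.01, eps=1e-12, exponent=-1.0 / 4.0):
    """Apply one Shampoo update to a matrix-shaped parameter in place."""
    # Accumulate the left and right Kronecker factors.
    L.add_(grad @ grad.T)
    R.add_(grad.T @ grad)

    # Inverse matrix roots of the symmetric factors via eigendecomposition.
    def matrix_power(sym, p):
        evals, evecs = torch.linalg.eigh(sym)
        evals = torch.clamp(evals, min=eps)
        return evecs @ torch.diag(evals.pow(p)) @ evecs.T

    search_direction = matrix_power(L, exponent) @ grad @ matrix_power(R, exponent)
    param.sub_(lr * search_direction)


# Toy usage: a single 4x3 weight matrix with a random gradient.
torch.manual_seed(0)
W = torch.randn(4, 3)
G = torch.randn(4, 3)
L = 1e-3 * torch.eye(4)  # epsilon * I initialization of the factors
R = 1e-3 * torch.eye(3)
shampoo_step(W, G, L, R)
print(W)
```

In the distributed implementation described above, each parameter block's factors and root computations would live on one rank, and the resulting per-block search directions would be exchanged with an AllGather before the parameter update; this sketch only shows the local math.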

Related research

Analysis and Comparison of Two-Level KFAC Methods for Training Deep Neural Networks (03/31/2023)
As a second-order method, the Natural Gradient Descent (NGD) has the abi...

Convolutional Neural Network Training with Distributed K-FAC (07/01/2020)
Training neural networks with many processors can reduce time-to-solutio...

Diagonal Rescaling For Neural Networks (05/25/2017)
We define a second-order neural network stochastic gradient training alg...

Horn: A System for Parallel Training and Regularizing of Large-Scale Neural Networks (08/02/2016)
I introduce a new distributed system for effective training and regulari...

Scalable Adaptive Stochastic Optimization Using Random Projections (11/21/2016)
Adaptive stochastic gradient methods such as AdaGrad have gained popular...

Iterative training of neural networks for intra prediction (03/15/2020)
This paper presents an iterative training of neural networks for intra p...

Randomized Block-Diagonal Preconditioning for Parallel Learning (06/24/2020)
We study preconditioned gradient-based optimization methods where the pr...
