Convolutional Neural Network Training with Distributed K-FAC

07/01/2020
by J. Gregory Pauloski, et al.

Training neural networks with many processors can reduce time-to-solution; however, it is challenging to maintain convergence and efficiency at large scales. The Kronecker-factored Approximate Curvature (K-FAC) method was recently proposed as an approximation of the Fisher Information Matrix that can be used in natural gradient optimizers. We investigate here a scalable K-FAC design and its applicability to convolutional neural network (CNN) training at scale. We study optimization techniques such as layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling that reduce training time while preserving convergence. We use residual neural networks (ResNet) applied to the CIFAR-10 and ImageNet-1k datasets to evaluate the correctness and scalability of our K-FAC gradient preconditioner. With ResNet-50 on the ImageNet-1k dataset, our distributed K-FAC implementation converges to the 75.9% MLPerf baseline in 18-25% less time than does the classic stochastic gradient descent (SGD) optimizer across scales on a GPU cluster.
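For readers unfamiliar with how a K-FAC gradient preconditioner works, the sketch below illustrates the core idea for a single fully connected layer: the layer's Fisher block is approximated as the Kronecker product of an input-activation covariance A and a pre-activation-gradient covariance G, and the gradient is preconditioned by the damped inverses of those factors. This is a minimal, illustrative sketch only, not the authors' distributed or inverse-free implementation; the function name kfac_precondition and the damping value are assumptions made for the example.

```python
# Minimal single-layer K-FAC preconditioning sketch (illustrative only;
# the paper's distributed, inverse-free implementation differs).
import numpy as np

def kfac_precondition(grad_W, acts, pre_act_grads, damping=1e-2):
    """Return the K-FAC preconditioned gradient for one dense layer.

    grad_W:        gradient of the loss w.r.t. the weights, shape (out_dim, in_dim)
    acts:          layer inputs for the mini-batch, shape (batch, in_dim)
    pre_act_grads: gradients w.r.t. the layer's pre-activations, shape (batch, out_dim)
    damping:       Tikhonov damping added to each Kronecker factor
    """
    batch = acts.shape[0]
    A = acts.T @ acts / batch                    # Kronecker factor A = E[a a^T]
    G = pre_act_grads.T @ pre_act_grads / batch  # Kronecker factor G = E[g g^T]

    # For symmetric A and G: (A kron G)^{-1} vec(grad_W) == vec(G^{-1} grad_W A^{-1})
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    return G_inv @ grad_W @ A_inv
```

Because the factors are computed and inverted per layer, they can be assigned to different workers and refreshed less often than the gradients themselves, which is the intuition behind the layer-wise distribution and update-decoupling techniques described in the abstract.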


research · 02/03/2016
A Kronecker-factored approximate Fisher matrix for convolution layers
Second-order optimization methods such as natural gradient descent have ...

research · 07/04/2021
KAISA: An Adaptive Second-order Optimizer Framework for Deep Neural Networks
Kronecker-factored Approximate Curvature (K-FAC) has recently been shown...

research · 09/12/2023
A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale
Shampoo is an online and stochastic optimization algorithm belonging to ...

research · 08/15/2019
Accelerated CNN Training Through Gradient Approximation
Training deep convolutional neural networks such as VGG and ResNet by gr...

research · 03/16/2019
swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight
This paper reports our efforts on swCaffe, a highly efficient parallel f...

research · 06/11/2021
Decoupled Greedy Learning of CNNs for Synchronous and Asynchronous Distributed Learning
A commonly cited inefficiency of neural network training using back-prop...

research · 03/04/2013
Riemannian metrics for neural networks I: feedforward networks
We describe four algorithms for neural network training, each adapted to...
