Scalable K-FAC Training for Deep Neural Networks with Distributed Preconditioning

06/30/2022
by Lin Zhang, et al.

Second-order optimization methods, notably the D-KFAC (Distributed Kronecker Factored Approximate Curvature) algorithms, have gained traction for accelerating deep neural network (DNN) training on GPU clusters. However, existing D-KFAC algorithms require computing and communicating a large volume of second-order information, i.e., Kronecker factors (KFs), before preconditioning gradients, resulting in large computation and communication overheads as well as a high memory footprint. In this paper, we propose DP-KFAC, a novel distributed preconditioning scheme that distributes the KF construction tasks of different DNN layers to different workers. DP-KFAC not only retains the convergence property of existing D-KFAC algorithms but also enables three benefits: reduced computation overhead in constructing KFs, no communication of KFs, and a low memory footprint. Extensive experiments on a 64-GPU cluster show that DP-KFAC reduces the computation overhead by 1.55x-1.65x, the communication cost by 2.79x-3.15x, and the memory footprint by 1.14x-1.47x in each second-order update compared to the state-of-the-art D-KFAC methods.
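The following is a minimal, self-contained sketch of the distributed-preconditioning idea described above, not the authors' implementation: each worker owns a disjoint subset of layers, builds the Kronecker factors (KFs) only for those layers, inverts them locally, and preconditions the corresponding gradients, so the KFs themselves are never communicated. The round-robin layer assignment, damping value, and toy tensor shapes are illustrative assumptions.

import torch

def owns(layer_idx: int, rank: int, world_size: int) -> bool:
    """Round-robin ownership: worker `rank` handles every world_size-th layer."""
    return layer_idx % world_size == rank

def kfac_precondition(a: torch.Tensor, g: torch.Tensor,
                      grad: torch.Tensor, damping: float = 1e-3) -> torch.Tensor:
    """Precondition one linear layer's gradient with its Kronecker factors.

    a:    layer inputs,      shape [batch, in_features]
    g:    output gradients,  shape [batch, out_features]
    grad: weight gradient,   shape [out_features, in_features]
    """
    batch = a.shape[0]
    A = a.t() @ a / batch                      # input covariance factor  [in, in]
    G = g.t() @ g / batch                      # grad covariance factor   [out, out]
    A_inv = torch.inverse(A + damping * torch.eye(A.shape[0]))
    G_inv = torch.inverse(G + damping * torch.eye(G.shape[0]))
    return G_inv @ grad @ A_inv                # Kronecker-factored preconditioned gradient

# Toy usage for one worker: it only builds and inverts KFs for the layers it
# owns; in a real cluster the (small) preconditioned gradient of each layer
# would then be broadcast from its owning worker to all others.
rank, world_size = 0, 4
layers = [(torch.randn(32, 64), torch.randn(32, 16), torch.randn(16, 64))
          for _ in range(8)]                   # (inputs, out-grads, grad) per layer
for idx, (a, g, grad) in enumerate(layers):
    if owns(idx, rank, world_size):
        precond_grad = kfac_precondition(a, g, grad)
        # torch.distributed.broadcast(precond_grad, src=rank)  # in an actual multi-GPU run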

