Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks

07/14/2021
by Shaohuai Shi, et al.

Distributed training with synchronous stochastic gradient descent (SGD) on GPU clusters has been widely used to accelerate the training of deep models. However, SGD utilizes only the first-order gradient in model parameter updates, so training may still take days or weeks. Recent studies have successfully exploited approximate second-order information to speed up the training process, among which the Kronecker-Factored Approximate Curvature (KFAC) method has emerged as one of the most efficient approximation algorithms for training deep models. Yet, when leveraging GPU clusters to train models with distributed KFAC (D-KFAC), it incurs extensive computation and introduces extra communication during each iteration. In this work, we propose D-KFAC with smart parallelism of computing and communication tasks (SPD-KFAC) to reduce the iteration time. Specifically, 1) we first characterize the performance bottlenecks of D-KFAC, 2) we design and implement a pipelining mechanism for Kronecker factor computation and communication with dynamic tensor fusion, and 3) we develop a load-balancing placement scheme for inverting multiple matrices on GPU clusters. We conduct real-world experiments on a 64-GPU cluster with 100Gb/s InfiniBand interconnect. Experimental results show that our proposed SPD-KFAC training scheme can achieve 10%-35% improvement over state-of-the-art algorithms.
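The pipelining mechanism described above overlaps the communication of Kronecker factors with the remaining factor computation, and dynamic tensor fusion batches small factors into larger messages. The sketch below is only a minimal illustration of that general idea, not the authors' implementation: it assumes PyTorch with an already-initialized torch.distributed process group, and the function name allreduce_factors_pipelined and the bucket_bytes threshold are hypothetical.

```python
# Illustrative sketch (not the paper's code): fuse small Kronecker factors
# into buckets and launch asynchronous all-reduces so communication of
# earlier buckets overlaps with computation of later factors.
import torch
import torch.distributed as dist

def allreduce_factors_pipelined(factors, bucket_bytes=4 * 1024 * 1024):
    """factors: per-layer Kronecker factor tensors. In a real system they
    would stream in from layer hooks during the backward pass; here we
    simply iterate over a list to keep the sketch self-contained."""
    handles, bucket, bucket_size = [], [], 0

    def flush():
        nonlocal bucket, bucket_size
        if not bucket:
            return
        flat = torch.cat([t.reshape(-1) for t in bucket])  # fuse the bucket
        work = dist.all_reduce(flat, async_op=True)         # non-blocking
        handles.append((work, flat, bucket))
        bucket, bucket_size = [], 0

    for t in factors:
        bucket.append(t)
        bucket_size += t.numel() * t.element_size()
        if bucket_size >= bucket_bytes:
            flush()  # start communication, keep "computing" the next factors
    flush()

    # Wait for outstanding all-reduces, average, and copy results back.
    world = dist.get_world_size()
    for work, flat, tensors in handles:
        work.wait()
        flat.div_(world)
        offset = 0
        for t in tensors:
            n = t.numel()
            t.copy_(flat[offset:offset + n].view_as(t))
            offset += n
```

Fusing several small factors into one flat buffer amortizes the per-message latency of all-reduce, while async_op=True lets earlier buckets travel over the network while later layers' factors are still being produced, which is the overlap the abstract refers to.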
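For the load-balancing placement of matrix inversions, one simple way to picture the idea is a greedy longest-processing-time assignment: inverting an n-by-n factor costs roughly O(n^3), and each job is handed to the currently least-loaded GPU. This is again a hedged sketch under assumed names (place_inversions, factor_dims), not the exact scheme from the paper.

```python
# Illustrative sketch: balance estimated inversion work across GPUs with a
# greedy longest-processing-time assignment instead of a round-robin mapping.
import heapq

def place_inversions(factor_dims, num_workers):
    """factor_dims: sizes n_i of the factor matrices (cost model ~ n_i^3).
    Returns worker_of[i] = rank assigned to invert matrix i."""
    heap = [(0.0, rank) for rank in range(num_workers)]  # (accumulated cost, rank)
    heapq.heapify(heap)
    worker_of = [None] * len(factor_dims)
    # Largest jobs first, each to the currently least-loaded worker.
    for i in sorted(range(len(factor_dims)), key=lambda i: -factor_dims[i] ** 3):
        cost, rank = heapq.heappop(heap)
        worker_of[i] = rank
        heapq.heappush(heap, (cost + factor_dims[i] ** 3, rank))
    return worker_of

# Example: placing 6 factor inversions on 4 GPUs.
print(place_inversions([4096, 1024, 1024, 512, 256, 128], 4))
```

In a D-KFAC setting, each rank would then typically invert only its assigned factors and broadcast or all-gather the inverses to the other workers, so that no single GPU becomes the bottleneck for the inversion step.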


