GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent

03/15/2018
by Jeff Daily, et al.

In this paper, we present GossipGraD, a gossip-communication-protocol-based Stochastic Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale systems. The salient features of GossipGraD are: 1) reduction in overall communication complexity from Θ(log(p)) for p compute nodes in well-studied SGD to O(1); 2) model diffusion, such that compute nodes exchange their updates (gradients) indirectly after every log(p) steps; 3) rotation of communication partners to facilitate direct diffusion of gradients; 4) asynchronous distributed shuffling of samples during the feedforward phase of SGD to prevent over-fitting; and 5) asynchronous communication of gradients to further reduce the communication cost of SGD and GossipGraD. We implement GossipGraD for GPU and CPU clusters, using NVIDIA Pascal P100 GPUs connected with InfiniBand and Intel Knights Landing (KNL) processors connected with the Aries network. We evaluate GossipGraD on the well-studied ImageNet-1K dataset (~250 GB) and widely studied neural network topologies such as GoogLeNet and ResNet50 (winner of the ImageNet Large Scale Visual Recognition Challenge, ILSVRC). Our performance evaluation using both KNL and Pascal GPUs indicates that GossipGraD can achieve perfect efficiency for these datasets and their associated neural network topologies. Specifically, for ResNet50, GossipGraD is able to achieve close to 100% compute efficiency on Pascal P100 GPUs while matching the top-1 classification accuracy published in the literature.
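To make the gossip exchange concrete, below is a minimal sketch of a gossip-SGD loop with rotating partners, written with mpi4py and NumPy. It is illustrative rather than the authors' implementation: the XOR (virtual hypercube) partner schedule, the gossip_average helper, and the placeholder gradients are assumptions chosen so that each rank communicates with exactly one partner per step (O(1) messages) while an update can diffuse to all p ranks within log2(p) steps, mirroring features 1-3 above.

```python
# Minimal, illustrative sketch (not the authors' code): gossip-style SGD where
# each rank exchanges gradients with a single rotating partner per step.
# Assumes mpi4py + NumPy and a power-of-two number of ranks.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
log_p = max(1, int(np.log2(size)))  # number of distinct partners in the rotation

def gossip_average(grad, step):
    """Send/receive gradients with one partner (O(1) messages) and average."""
    if size == 1:
        return grad
    partner = rank ^ (1 << (step % log_p))  # hypercube-style rotation (assumed schedule)
    recv = np.empty_like(grad)
    comm.Sendrecv(sendbuf=grad, dest=partner, recvbuf=recv, source=partner)
    return 0.5 * (grad + recv)

# Toy training loop; a real setup would plug in a DL framework's gradients.
weights = np.zeros(1_000_000, dtype=np.float32)
lr = 0.01
rng = np.random.default_rng(seed=rank)
for step in range(100):
    grad = rng.standard_normal(weights.size).astype(np.float32)  # placeholder gradient
    grad = gossip_average(grad, step)  # one partner per step; diffusion over log2(p) steps
    weights -= lr * grad
```

Run with, e.g., `mpirun -np 8 python gossip_sketch.py` (hypothetical file name). The asynchronous gradient communication and distributed sample shuffle described in the abstract would replace the blocking Sendrecv and the placeholder gradient generation, respectively.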


Related research

- AET-SGD: Asynchronous Event-triggered Stochastic Gradient Descent (12/27/2021)
  Communication cost is the main bottleneck for the design of effective di...
- O(1) Communication for Distributed SGD through Two-Level Gradient Averaging (06/12/2020)
  Large neural network models present a hefty communication challenge to d...
- Understanding Top-k Sparsification in Distributed Deep Learning (11/20/2019)
  Distributed stochastic gradient descent (SGD) algorithms are widely depl...
- Efficient Training of Convolutional Neural Nets on Large Distributed Systems (11/02/2017)
  Deep Neural Networks (DNNs) have achieved impressive accuracy in many ...
- Analyzing the benefits of communication channels between deep learning models (04/19/2019)
  As artificial intelligence systems spread to more diverse and larger tas...
- Throughput Prediction of Asynchronous SGD in TensorFlow (11/12/2019)
  Modern machine learning frameworks can train neural networks using multi...
- Distributed Learning and its Application for Time-Series Prediction (06/06/2021)
  Extreme events are occurrences whose magnitude and potential cause exten...
