Elastic Gossip: Distributing Neural Network Training Using Gossip-like Protocols

12/06/2018
by Siddharth Pramod, et al.

Distributing Neural Network training is of interest for several reasons, including scaling using computing clusters, training at data sources such as IoT devices and edge servers, and utilizing underutilized resources across heterogeneous environments. Most contemporary approaches primarily address scaling using computing clusters and require high network bandwidth and frequent communication. This thesis presents an overview of standard approaches to distributing training and proposes a novel technique involving pairwise communication using Gossip-like protocols, called Elastic Gossip. This approach builds upon an existing technique known as Elastic Averaging SGD (EASGD), and is similar to another technique called Gossiping SGD, which also uses Gossip-like protocols. Elastic Gossip is empirically evaluated against Gossiping SGD on the MNIST digit recognition and CIFAR-10 classification tasks, using commonly used Neural Network architectures spanning Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs). It is found that Elastic Gossip, Gossiping SGD, and All-reduce SGD perform quite comparably, even though the latter entails a substantially higher communication cost. While Elastic Gossip performs better than Gossiping SGD in these experiments, it is possible that a more thorough search over the hyper-parameter space, specific to a given application, may yield configurations of Gossiping SGD that work better than Elastic Gossip.

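To make the pairwise-communication idea more concrete, below is a minimal sketch of what an elastic-gossip-style update might look like: each worker takes local SGD steps and periodically performs an EASGD-style elastic pull with one randomly chosen peer. The toy objective, hyper-parameter names and values (lr, alpha, gossip_period), and the exact update rule are illustrative assumptions for this sketch, not the thesis' reference implementation.

```python
# Illustrative sketch of an elastic-gossip-style update (assumptions noted above;
# not the thesis' reference implementation).
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic objective f(w) = 0.5 * ||w - w_star||^2 so the sketch is runnable.
w_star = rng.normal(size=10)

def grad(w):
    return w - w_star

n_workers = 4
workers = [rng.normal(size=10) for _ in range(n_workers)]

lr = 0.1            # local SGD learning rate (assumed value)
alpha = 0.5         # elastic "pull" strength for the pairwise update (assumed value)
gossip_period = 5   # local steps between gossip rounds (assumed value)
steps = 200

for t in range(steps):
    # Each worker takes a local SGD step on its own parameters.
    for i in range(n_workers):
        workers[i] -= lr * grad(workers[i])

    # Periodically, a randomly chosen pair of workers gossips: both pull their
    # parameters toward each other by an elastic term, in the spirit of EASGD's
    # attraction to a center variable, but applied pairwise so no central
    # parameter server is needed.
    if t % gossip_period == 0:
        i, j = rng.choice(n_workers, size=2, replace=False)
        diff = workers[i] - workers[j]
        workers[i] -= alpha * diff
        workers[j] += alpha * diff

print("mean distance to optimum:",
      np.mean([np.linalg.norm(w - w_star) for w in workers]))
```

The key design point this sketch tries to convey is that communication happens only between pairs of workers at gossip rounds, rather than through frequent all-to-all or server-mediated synchronization, which is what keeps the communication cost low relative to All-reduce SGD.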

