Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity

01/29/2022
by Minghao Yan, et al.

More than 70% of cloud computing is paid for but sits idle. A large fraction of this idle compute consists of cheap CPUs with few cores that are not utilized during the less busy hours. This paper aims to enable those CPU cycles to train heavyweight AI models. Our goal runs counter to mainstream frameworks, which focus on leveraging expensive, specialized, ultra-high-bandwidth interconnects to address the communication bottleneck in distributed neural network training. This paper presents a distributed model-parallel training framework that enables training large neural networks on small CPU clusters with low Internet bandwidth. We build upon the adaptive sparse training framework introduced by the SLIDE algorithm. By carefully deploying sparsity over distributed nodes, we demonstrate model-parallel training that is several orders of magnitude faster than Horovod, the main engine behind most commercial software. We show that with the reduced communication afforded by sparsity, we can train a model with close to a billion parameters on simple 4- to 16-core CPU nodes connected by basic low-bandwidth interconnect. Moreover, the training time is on par with some of the best hardware accelerators.
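To make the sparsity argument concrete, below is a minimal, hypothetical sketch (not the authors' code) of the locality-sensitive-hashing idea SLIDE builds on. SimHash (signed random projections) with a single hash table stands in for SLIDE's hash family and multiple tables, and all layer sizes and names are illustrative. For each input, only the neurons whose weight vectors collide with the input's hash bucket are computed, so in a model-parallel setting only that small active set of activations (and, later, gradients) would need to cross the interconnect.

```python
# Minimal sketch of LSH-based sparse neuron selection in the spirit of SLIDE.
# SimHash stands in for SLIDE's hash family; one hash table is used for brevity;
# sizes are purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 128, 20_000          # hypothetical layer dimensions
n_bits = 10                        # hash length -> 2**n_bits buckets

W = rng.standard_normal((d_out, d_in)).astype(np.float32)   # layer weights
H = rng.standard_normal((n_bits, d_in)).astype(np.float32)  # shared random projections
powers = 1 << np.arange(n_bits)                             # bit weights for bucket ids

# Preprocessing: hash every neuron's weight vector into a bucket.
neuron_codes = ((W @ H.T) > 0).astype(int) @ powers
buckets = {}
for j, code in enumerate(neuron_codes):
    buckets.setdefault(int(code), []).append(j)

def sparse_forward(x):
    """Compute activations only for neurons that collide with the input's bucket."""
    active = buckets.get(int(((H @ x) > 0).astype(int) @ powers), [])
    acts = W[active] @ x           # small dense product over the active set only
    return active, acts            # this sparse (ids, values) pair is all that
                                   # would have to be sent to the next node

x = rng.standard_normal(d_in).astype(np.float32)
active, acts = sparse_forward(x)
print(f"computed {len(active)} of {d_out} neurons "
      f"({100 * len(active) / d_out:.2f}% of the layer)")
```

In a distributed deployment of this kind, each node would hold only a shard of the weights plus its hash tables, and only the active ids and their values are exchanged between nodes, which is what keeps the bandwidth requirement low.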

Related research

04/14/2021 · DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks
Full-batch training on Graph Neural Networks (GNN) to learn the structur...

04/12/2021 · Accelerating Neural Network Training with Distributed Asynchronous and Selective Optimization (DASO)
With increasing data and model complexities, the time required to train ...

03/30/2023 · BOLT: An Automated Deep Learning Framework for Training and Deploying Large-Scale Neural Networks on Commodity CPU Hardware
Efficient large-scale neural network training and inference on commodity...

10/18/2020 · Fast Distributed Training of Deep Neural Networks: Dynamic Communication Thresholding for Model and Data Parallelism
Data Parallelism (DP) and Model Parallelism (MP) are two common paradigm...

05/22/2018 · Sparse Binary Compression: Towards Distributed Deep Learning with minimal Communication
Currently, progressively larger deep neural networks are trained on ever...

10/18/2021 · EmbRace: Accelerating Sparse Communication for Distributed Training of NLP Neural Networks
Distributed data-parallel training has been widely used for natural lang...

03/07/2019 · SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems
Deep Learning (DL) algorithms are the central focus of modern machine le...
