Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

10/20/2020
by Shaohuai Shi, et al.

Distributed training techniques have been widely deployed in large-scale deep neural network (DNN) training on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances, traditional state-of-the-art distributed training systems cannot scale well when training large-scale models. In this paper, we propose a new computation- and communication-efficient top-k sparsification communication library for distributed training. To further improve the system scalability, we optimize I/O by proposing a simple yet efficient multi-level data caching mechanism and optimize the update operation by introducing a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system achieves 25%-40% faster training than existing state-of-the-art systems on CNNs and Transformer. We finally break the record on DAWNBench by training ResNet-50 to 93% top-5 accuracy.
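
For readers unfamiliar with gradient sparsification, the sketch below illustrates the general idea of top-k sparsification in PyTorch: each worker communicates only the k largest-magnitude gradient values (plus their indices) instead of the full dense gradient. The helper names `topk_sparsify` and `desparsify` and the `density` parameter are illustrative assumptions, not the API of the communication library proposed in the paper.

```python
import torch

def topk_sparsify(grad: torch.Tensor, density: float = 0.01):
    """Illustrative top-k gradient sparsification (not the paper's library).

    Returns the k largest-magnitude values and their flat indices, which a
    worker would send instead of the full dense gradient.
    """
    flat = grad.flatten()
    k = max(1, int(flat.numel() * density))
    # Pick the k entries with the largest absolute value.
    _, indices = torch.topk(flat.abs(), k, sorted=False)
    values = flat[indices]  # keep the original signs
    return values, indices, flat.numel()

def desparsify(values, indices, numel, shape):
    """Rebuild a dense gradient from the sparse (values, indices) pair."""
    dense = torch.zeros(numel, device=values.device, dtype=values.dtype)
    dense[indices] = values
    return dense.view(shape)
```

In a data-parallel setting, each worker would typically exchange only the (values, indices) pairs with its peers, keep a local residual of the values it did not send, and add that residual back to the next iteration's gradient; those details are omitted from this sketch.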
