Analysis of Distributed Deep Learning in the Cloud

08/30/2022
by Aakash Sharma, et al.

We aim to resolve this problem by introducing a comprehensive distributed deep learning (DDL) profiler, which can determine the various execution "stalls" that DDL suffers from while running on a public cloud. We have implemented the profiler by extending prior work to additionally estimate two types of communication stalls: interconnect stalls and network stalls. We train popular DNN models using the profiler to characterize various AWS GPU instances and list their advantages and shortcomings so that users can make an informed decision. We observe that the more expensive GPU instances may not be the most performant for all DNN models, and that AWS may sub-optimally allocate hardware interconnect resources. Specifically, the intra-machine interconnect can introduce communication overheads of up to 90%, and network-connected instances can suffer from up to 5x slowdown compared to training on a single instance. Further, we model the impact of macroscopic DNN features, such as the number of layers and the number of gradients, on communication stalls. Finally, we propose a measurement-based recommendation model that helps users lower their public-cloud monetary costs for DDL, given a time budget.
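The final contribution, a measurement-based recommendation model under a time budget, can be read as a selection over profiled (time, cost) measurements for candidate instance configurations. Below is a minimal sketch of that idea, assuming hypothetical per-epoch times, prices, and a recommend() helper; it illustrates the selection logic only and is not the authors' profiler or recommendation model.

```python
# Minimal sketch (illustrative, not the paper's implementation): given measured
# per-epoch training times and on-demand prices for a few candidate AWS GPU
# instance configurations, recommend the cheapest configuration that finishes
# training within a user-supplied time budget. All numbers below are hypothetical.

from dataclasses import dataclass

@dataclass
class Config:
    name: str            # instance type and count (hypothetical labels)
    epoch_time_s: float  # measured seconds per epoch from profiling runs
    price_per_h: float   # total on-demand price (USD/hour) for the configuration

def recommend(configs, epochs, budget_h):
    """Return (config, cost, hours) for the cheapest configuration within the budget."""
    best = None
    for c in configs:
        train_h = c.epoch_time_s * epochs / 3600.0
        if train_h > budget_h:
            continue  # violates the time budget
        cost = train_h * c.price_per_h
        if best is None or cost < best[1]:
            best = (c, cost, train_h)
    return best  # None if no configuration meets the budget

if __name__ == "__main__":
    # Hypothetical measurements; real values would come from the profiler.
    candidates = [
        Config("1x g4dn.12xlarge", epoch_time_s=540.0, price_per_h=3.91),
        Config("1x p3.8xlarge",    epoch_time_s=310.0, price_per_h=12.24),
        Config("2x p3.8xlarge",    epoch_time_s=290.0, price_per_h=24.48),
    ]
    choice = recommend(candidates, epochs=90, budget_h=12.0)
    if choice:
        cfg, cost, hours = choice
        print(f"{cfg.name}: ~{hours:.1f} h, ~${cost:.2f}")
    else:
        print("No candidate meets the time budget.")
```

In practice, the per-epoch times would be measured by profiling runs on each candidate configuration, and the same loop can be flipped to minimize training time under a cost budget instead.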

Related research

10/16/2019  Hyper: Distributed Cloud Processing for Large-Scale Deep Learning Tasks
Training and deploying deep learning models in real-world applications r...

11/29/2020  Srift: Swift and Thrift Cloud-Based Distributed Training
Cost-efficiency and training time are primary concerns in cloud-based di...

04/30/2022  MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
Existing general purpose frameworks for gigantic model training, i.e., m...

12/27/2018  Stanza: Distributed Deep Learning with Small Communication Footprint
The parameter server architecture is prevalently used for distributed de...

10/20/2020  Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters
Distributed training techniques have been widely deployed in large-scale...

03/11/2023  OCCL: a Deadlock-free Library for GPU Collective Communication
Various distributed deep neural network (DNN) training technologies lead...

04/26/2022  Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
DNN models across many domains continue to grow in size, resulting in hi...
