Srift: Swift and Thrift Cloud-Based Distributed Training

11/29/2020
by Liang Luo, et al.

Cost-efficiency and training time are primary concerns in cloud-based distributed training today. With many VM configurations to choose from, which configuration achieves the lowest cost under a given time constraint? Or, under a given cost budget, which configuration yields the highest throughput? We present a comprehensive throughput and cost-efficiency study across a wide array of instance choices in the cloud. Using the insights from this study, we build Srift, a system that combines runtime instrumentation and learned performance models to accurately predict training performance and find the best choice of VMs, improving throughput and lowering cost while satisfying user constraints. With PyTorch and EC2, we show that Srift's choices of VM instances can lead to up to 2x better throughput and 1.6x lower cost per iteration compared to baseline choices across various DNN models in real-world scenarios, leveraging heterogeneous setups and spot instances.
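
The two questions posed in the abstract can be illustrated with a minimal sketch. This is not Srift's algorithm: Srift predicts throughput with runtime instrumentation and learned performance models, whereas the sketch below assumes a throughput estimate is already available for each candidate. The `Candidate` type, the configuration names, and all throughput and price figures are hypothetical values chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """A hypothetical VM configuration with an assumed throughput estimate."""
    name: str
    iters_per_sec: float  # predicted training throughput (assumed to be given)
    usd_per_hour: float   # total hourly price of the whole configuration

    def hours_for(self, iterations: int) -> float:
        # Wall-clock hours needed to run the requested number of iterations.
        return iterations / self.iters_per_sec / 3600.0

    def cost_for(self, iterations: int) -> float:
        # Total price of the job at this configuration's hourly rate.
        return self.hours_for(iterations) * self.usd_per_hour


def cheapest_within_deadline(
    cands: Iterable[Candidate], iterations: int, max_hours: float
) -> Optional[Candidate]:
    """Lowest-cost candidate that finishes within the time constraint."""
    feasible = [c for c in cands if c.hours_for(iterations) <= max_hours]
    return min(feasible, key=lambda c: c.cost_for(iterations), default=None)


def fastest_within_budget(
    cands: Iterable[Candidate], iterations: int, max_usd: float
) -> Optional[Candidate]:
    """Highest-throughput candidate whose total cost fits the budget."""
    feasible = [c for c in cands if c.cost_for(iterations) <= max_usd]
    return max(feasible, key=lambda c: c.iters_per_sec, default=None)


# Made-up candidates; names, throughputs, and prices are illustrative only.
CANDIDATES = [
    Candidate("small-x4", iters_per_sec=2.0, usd_per_hour=4.0),
    Candidate("large-x2", iters_per_sec=5.0, usd_per_hour=12.0),
    Candidate("large-x4", iters_per_sec=8.0, usd_per_hour=24.0),
]

if __name__ == "__main__":
    job = 72_000  # iterations to train
    print(cheapest_within_deadline(CANDIDATES, job, max_hours=5.0))
    print(fastest_within_budget(CANDIDATES, job, max_usd=45.0))
```

Both helpers return `None` when no candidate satisfies the constraint. An exhaustive scan like this is only practical for a small candidate list, and its answers are only as good as the throughput estimates fed into it.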


research
03/12/2023

Scavenger: A Cloud Service for Optimizing Cost and Performance of ML Training

While the pay-as-you-go nature of cloud virtual machines (VMs) makes it ...
research
06/05/2023

How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study

Training deep learning models in the cloud or on dedicated hardware is e...
research
08/30/2022

Analysis of Distributed Deep Learning in the Cloud

We aim to resolve this problem by introducing a comprehensive distribute...
research
07/02/2023

Automatic MILP Solver Configuration By Learning Problem Similarities

A large number of real-world optimization problems can be formulated as ...
research
03/12/2020

Machine Learning on Volatile Instances

Due to the massive size of the neural network models and training datase...
research
04/20/2022

Search-based Methods for Multi-Cloud Configuration

Multi-cloud computing has become increasingly popular with enterprises l...
research
12/27/2021

Automatic Configuration for Optimal Communication Scheduling in DNN Training

ByteScheduler partitions and rearranges tensor transmissions to improve ...
