The process of training deep neural networks (DNNs) has evolved from using single-GPU servers [shi2018performance] to distributed GPU clusters [firecaffe, geeps] that can support larger and more complex DNNs. Cloud computing, providing on-demand access to these critical yet expensive GPU resources, has become a popular option for practitioners. Today’s cloud provides its customers abundant options to configure the training clusters, presenting opportunities for tailoring resource acquisition to the specific training workload. When using cloud-based GPU servers to train deep learning models, one can choose the server’s CPU and memory, specify the GPU type, decide the number of servers, as well as pick the desired datacenter location. However, this configuration flexibility also imposes additional complexity upon deep learning practitioners.
Concurrently, to lower the monetary cost of training, one could also consider using a special type of cloud server, referred to as a transient server, which has a lower unit cost with the caveat that it can be revoked at any time [ec2_spot, gce_preemptible]. The revocation of a GPU server often means significant loss of work and requires manual effort by the practitioner to request new servers, to reconfigure the training cluster, and even to diagnose potential performance bottlenecks. Concretely, when a GPU server is revoked, all its local training progress will disappear and, in the worst case, the revocation will also impede the functionality of saving the trained model [tensorflow, 2019icac:speedup].
In this work, we set out to characterize and predict the impact of cluster configuration on distributed training, in the context of transient and traditional on-demand cloud servers. We measured and characterized several key factors that impact distributed training on transient servers and evaluated regression-based models for predicting training throughput and fault-tolerance overhead.
To streamline measurement and data collection on distributed training, we designed and built a framework called CM-DARE. It allows us to measure, monitor, and collect metrics such as training speed and revocation time, which supports our performance characterization and modeling and enables use cases such as performance bottleneck detection. We built CM-DARE on top of an existing distributed training framework (TensorFlow [tensorflow]) and library (Tensor2Tensor [tensor2tensor]), with transient-specific optimizations that mitigate the impact of revocation and improve fault-tolerance. Though we exclusively used TensorFlow and Google Cloud in this work, we argue that our measurement methodology (e.g., the use of custom convolutional neural networks) can be extended to other deep learning frameworks and cloud providers.
Our work differs from prior work in distributed training performance modeling in three key aspects. First, it consists of large-scale, cloud-based measurement and data-driven performance modeling rather than theoretical modeling and on-premise measurement [dl_perf1, qi:iclr17:paleo, shi2018performance]. Second, we identified use cases that benefit from having access to the raw measurement data, performance models, and CM-DARE measurement infrastructure. Finally, we are the first to characterize and model performance of distributed training with transient servers. In short, we make the following contributions.
We conducted a large-scale measurement study that includes twenty convolutional neural networks on three types of Google Cloud GPU servers. We observe, for example, that the training speed of heterogeneous clusters—i.e., clusters consisting of different GPU hardware—is approximately the sum of individual server speeds. Our dataset and CM-DARE are available in the project GitHub repository (https://github.com/cake-lab/CM-DARE).
We built and evaluated performance models that predict the training speed and fault-tolerance overhead of GPU clusters with as low as 3.4% mean absolute percentage error. Such models serve as the building blocks for predicting heterogeneous cluster training performance. More importantly, we identified appropriate deployment scenarios for each performance model.
We identified use cases, such as detecting and mitigating distributed training performance bottlenecks, that would benefit from our prediction models.
We designed and implemented a measurement and training framework called CM-DARE, which simplifies distributed training on transient servers and improves the robustness of existing fault-tolerance mechanisms.
II Overview of the CM-DARE Framework
CM-DARE is a measurement framework we built and used to characterize and predict the performance of training convolutional neural networks (CNNs) on clusters of cloud-based GPU servers, i.e., distributed training.
Specifically, we focus on asynchronous training with parameter servers, a popular distributed training architecture implemented by Google’s TensorFlow [tensorflow] and commonly used for models that can fit into the memory of a discrete GPU. In this architecture, servers are separated logically into two categories: parameter servers and workers. Parameter servers update the deep learning model parameters after each GPU server (i.e., worker) generates the gradients. Each worker holds its own copy of the entire deep learning model and works on subsets of the training dataset. The training is asynchronous because each worker communicates with the parameter servers at its own pace. One worker is designated as the chief worker and is given additional responsibilities, including periodically saving model parameters to cloud storage, i.e., checkpointing.
The asynchronous nature of this architecture offers two key benefits for transient distributed training. First, it is resilient to transient revocations because the cluster can continue training even if a worker is revoked. Second, it reduces the impact of hardware differences in heterogeneous clusters because slower workers do not impede others.
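As a concrete illustration of this architecture, the following is a minimal, self-contained sketch of asynchronous parameter-server training. The `ParameterServer` and `worker` names are hypothetical; real systems such as TensorFlow exchange tensors over RPC rather than through a shared in-process lock.

```python
import threading

class ParameterServer:
    """Holds the shared model parameters and applies gradient updates."""
    def __init__(self, dim, lr=0.1):
        self.params = [0.0] * dim
        self.lr = lr
        self.lock = threading.Lock()

    def apply_gradients(self, grads):
        # Updates are applied as each worker reports them, with no barrier:
        # this lack of synchronization is what makes the training asynchronous.
        with self.lock:
            for i, g in enumerate(grads):
                self.params[i] -= self.lr * g

def worker(ps, steps, grad_fn):
    for _ in range(steps):
        snapshot = list(ps.params)             # pull current parameters
        ps.apply_gradients(grad_fn(snapshot))  # push gradients at its own pace

ps = ParameterServer(dim=2)
# Toy objective: each "gradient" nudges the parameters toward (1, 1).
grad_fn = lambda p: [p[0] - 1.0, p[1] - 1.0]
threads = [threading.Thread(target=worker, args=(ps, 200, grad_fn))
           for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(ps.params)  # both parameters end up close to 1.0
```

Because no worker waits on another, a revoked worker's threads simply stop pushing gradients while the rest continue, mirroring the resilience property described above.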
At the core of CM-DARE, depicted in Figure 1, are the transient-aware performance models, which are powered by a performance profiler that continuously monitors training performance and transient server revocations. In addition, CM-DARE includes transient-TensorFlow, a modified version of TensorFlow that handles worker revocations by notifying the parameter server and by supporting checkpointing even when the chief worker is revoked.
To collect measurements with CM-DARE, (1) we provide a training script with information such as cluster configuration, which (2) the resource manager uses for setting up the cloud training cluster. (3) All training servers, including on-demand parameter servers and transient GPU workers, will run transient-TensorFlow which establishes RPC connections between parameter servers and workers and (4) the performance tracker that sends training performance to the performance profiler. (5) After the specified checkpoint interval, the chief worker saves the current model parameters to cloud storage. (6) In the case that the chief worker is revoked, (7) the chief will notify the parameter server, as well as the controller, about its revocation. (8) The parameter server will then select one GPU worker to take over checkpointing, (9) and the worker will save the checkpoint to the same cloud storage at the specified interval. (10) The resource manager fulfills cluster configuration changes that are determined by the controller based on the specified use cases, performance models, and online measurement. Finally, we obtain the trained models and measurement data once the training is completed.
Currently, CM-DARE runs on Google Cloud. We chose Google Cloud because it allows customization of GPU servers, which provides better control and flexibility for training deep learning models with different resource requirements. Further, Google’s transient servers, called preemptible VMs in the Google Cloud argot, have a maximum lifetime of 24 hours and are offered at fixed prices that are significantly lower than their on-demand counterparts.
We characterize and predict distributed training performance in the context of training speed in Section III, fault-tolerance overhead in Section IV, and revocation overhead in Section V. Finally, in Section VI, we explore two potential use cases that could benefit from our study: predicting training speed of heterogeneous clusters and detecting training bottlenecks.
III Understanding and Predicting Training Speed
Understanding how training speed varies based on key factors, such as GPU server type and model characteristics, is the first step toward predicting distributed training performance. In this section, we quantify such relationships with CM-DARE-enabled empirical measurements. In summary, we find that regression-based prediction is a promising approach due to the strong correlation between training speed, GPU computational capacity, and model complexity. Further, the limited selection of available cloud GPUs makes it feasible to build predictive models for individual GPU types and thus achieve higher prediction accuracy. Moreover, the training speed of an entire cluster is approximately the sum of individual worker speeds until a parameter-server-based bottleneck is reached. Finally, compared to prior approaches that do not consider transient server revocations and assume a stable training environment [peng2018optimus, lin2018model, zheng2019cynthia], our data-driven approach achieves a low prediction error of 9%.
Table I: Average training speed (steps/second ± standard deviation) for each CNN model and GPU type.

| GPU (TFLOPS) | ResNet-15 | ResNet-32 | Shake Shake Small | Shake Shake Big |
|---|---|---|---|---|
| K80 (4.11) | 9.46 ± 0.19 | 4.56 ± 0.08 | 2.58 ± 0.02 | 0.70 ± 0.002 |
| P100 (9.53) | 21.16 ± 0.47 | 12.19 ± 0.41 | 6.99 ± 0.35 | 1.98 ± 0.03 |
| V100 (14.13) | 27.38 ± 0.88 | 15.61 ± 0.38 | 8.80 ± 0.24 | 2.18 ± 0.04 |
III-A Measurement Methodology
We chose CIFAR-10, one of the most widely used datasets in deep learning research, as the training dataset [cifar10]. CIFAR-10 contains a total of 60K images with dimensions of 32×32 pixels. The training workload is specified by practitioners as a number of steps, where each step processes a mini-batch of images. Larger-scale datasets, such as ImageNet, that are commonly used to improve real-world model accuracy, were unnecessary as our measurements focus on training speed.
We used two ResNet [resnet] and two Shake Shake [gastaldi2017shake] implementations from the Tensor2Tensor framework. These four CNN models are popular for image classification and have different characteristics such as model complexity that are useful for our study. Model complexity is defined as the number of floating point operations (FLOPs) required by the CNN model to train on one image. We further generated an additional 16 variants of CNN models by varying the number of hidden layers and the size of each hidden layer; these custom models allowed us to better observe how model complexity impacts training time. We used the built-in TensorFlow profiler tool to calculate the FLOPs for each model.
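To illustrate how model complexity in FLOPs relates to a CNN's layer configuration, here is a rough back-of-the-envelope sketch. The layer shapes are hypothetical; our actual measurements used TensorFlow's built-in profiler on the real graph.

```python
# Rough FLOPs estimate for a stack of convolutional layers, illustrating
# how varying layer count and width changes model complexity.
def conv_flops(h, w, c_in, c_out, k):
    # ~2 FLOPs (multiply + add) per weight per output position
    return 2 * h * w * c_in * c_out * k * k

def model_flops(layers):
    """layers: list of (height, width, c_in, c_out, kernel_size)."""
    return sum(conv_flops(*layer) for layer in layers)

# Hypothetical 3-layer CNN operating on 32x32 CIFAR-10 images.
layers = [(32, 32, 3, 16, 3), (16, 16, 16, 32, 3), (8, 8, 32, 64, 3)]
print(model_flops(layers) / 1e9, "GFLOPs per image")
```

Widening or deepening such a stack is exactly how the 16 custom model variants change complexity while keeping the overall architecture family fixed.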
We used three GPU types offered by Google Cloud: Nvidia Tesla K80, P100, and V100. These GPUs used PCIe and had 12GB, 16GB, and 16GB of memory, respectively. They had computational capacity of 4.11, 9.53, and 14.13 teraflops. We chose these GPU types because they are the only three offered by Google Cloud that are commonly used for training. We refer to a server with access to a GPU as a GPU server. Each GPU server was configured with 4 vCPUs and 52GB of main memory. During our experiments, neither the CPU nor the main memory were saturated.
For measuring the impact of model and GPU type, we used a simple cluster consisting of one GPU server and one parameter server, with both servers residing in the same data center. We ran the parameter server on a non-revocable server with 4 vCPUs, 16GB of main memory, and Ubuntu 18 LTS. GPUs were not needed for the parameter server, as its primary tasks, aggregating gradients and updating parameters, are less computation-intensive and are often bound by network communication [jeffdean]. We also evaluated different cluster configurations by mixing GPU server types and varying their number.
Measuring Training Speed.
We utilized built-in TensorFlow functionality to log training speed of the entire cluster. Training speed is defined as steps per second where each step involves the generation of gradients based on the new model parameters using a batch of images. Unless otherwise specified, we averaged the training speed every 100 steps. For each cluster, we trained and recorded for 4000 steps. We used the same training workload for all clusters and set the checkpoint interval to be larger than our measurement duration to avoid measuring checkpoint overhead, which we consider in Section IV. To measure the training speed of individual workers, without incurring logging overhead associated with hook functions, we used the TensorFlow TFProf tool.
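The windowed averaging described above can be sketched as follows; the timestamps are synthetic and the function is illustrative rather than part of CM-DARE.

```python
# Deriving training speed (steps/second) from per-step completion
# timestamps, averaged over 100-step windows as in our measurements.
def windowed_speed(timestamps, window=100):
    """timestamps[i] = wall-clock time at which step i completed."""
    speeds = []
    for start in range(0, len(timestamps) - window, window):
        elapsed = timestamps[start + window] - timestamps[start]
        speeds.append(window / elapsed)  # steps per second in this window
    return speeds

# Synthetic example: a worker taking 0.25 s per step, i.e., 4 steps/s.
ts = [0.25 * i for i in range(401)]
print(windowed_speed(ts))  # -> [4.0, 4.0, 4.0, 4.0]
```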
III-B Impact of Model and GPU Type
Table I shows the average training speed (and standard deviation) for different combinations of model and GPU type. To avoid including noisy data, we discarded the measurements associated with the first 100 steps. As expected, the higher the computational capacity of the GPU, the faster the training speed. For example, the V100 server has the highest training speed for all four CNN models. Further, the training speed drops as model complexity increases. For instance, training ResNet-32 at 1.54 GFLOPs is almost 2X slower than training ResNet-15 at 0.59 GFLOPs using the same K80 GPU server.
Another important observation, visualized in Figure 2 for a K80 server, is that training speed was stable after the warm-up period, with a maximum coefficient of variation of 0.02. We observed similar behavior for the other two types of GPU servers. This training speed consistency has several important implications, namely the feasibility of predicting the speed using historical data and the possibility to quickly detect (and address) under-performing workers.
III-B1 Predicting the Impact
The next question we explore is how to leverage the above observations to predict the training speed of an individual worker, especially when training a previously unobserved CNN model. In Section III-D, we further investigate the question of how to predict the training speed of an entire cluster.
Figures 2(a) and 2(b) show the relationship between step time and normalized computation ratio and normalized model complexity, respectively. The computation ratio is defined as model complexity divided by GPU computational capacity, and step time is the inverse of training speed. The computation ratio and model complexity were normalized using min-max normalization. (We also considered z-score standardization for preprocessing; however, as our data does not follow a Gaussian distribution, it would be less beneficial to apply this technique.) Each dot represents the observed step time, averaged over 1400 steps, from training a CNN model. We collected data for a set of twenty CNN models, comprising the 4 models used for the observations in the previous section and the 16 custom models mentioned in the methodology.
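A minimal sketch of the min-max normalization used here, with illustrative GFLOPs values:

```python
# Min-max normalization, mapping each feature to [0, 1]; used to
# preprocess computation ratio and model complexity before regression.
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative model complexities (GFLOPs) for four CNNs.
gflops = [0.59, 1.54, 4.0, 10.0]
print(min_max(gflops))  # smallest maps to 0.0, largest to 1.0
```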
We make two key observations. First, the step times of different GPUs form a single trend line when plotted against the normalized computation ratio, but separate into distinct trends when plotted against normalized model complexity. This suggests that both features are useful for predicting training speed, and further implies the benefit of building separate performance models for different GPU types. Second, the shapes of the trend lines indicate that linear functions might be fitted for predicting the step time.
Table II: Regression models for predicting step time; MAE in seconds.

| Regression Model | Input Feature | K-fold MAE (s) | Test MAE (s) |
|---|---|---|---|
| Univariate, GPU-agnostic | computation ratio | 0.072 ± 0.015 | 0.068 |
| Multivariate, GPU-agnostic | computation ratio, model complexity | 0.103 ± 0.026 | 0.093 |
| Univariate, K80 | model complexity | 0.065 ± 0.013 | 0.068 |
| SVR Polynomial Kernel, K80 | model complexity | 0.035 ± 0.014 | 0.041 |
| SVR RBF Kernel, K80 | model complexity | 0.026 ± 0.012 | 0.031 |
| Univariate, P100 | model complexity | 0.029 ± 0.008 | 0.031 |
| SVR Polynomial Kernel, P100 | model complexity | 0.019 ± 0.007 | 0.020 |
| SVR RBF Kernel, P100 | model complexity | 0.012 ± 0.008 | 0.016 |
Based on the observations above, we evaluated eight regression models, listed in Table II, for predicting training speed. We chose a mix of univariate, multivariate, and support vector regression (SVR) models because the former two are simple and commonly used, and the latter has been shown to work well in modeling performance in cloud environments [kundu2012modeling]. These models can be divided into two categories: GPU-agnostic and GPU-specific. The GPU-agnostic univariate regression is modeled as $T = \beta_0 + \beta_1 r$, while the GPU-agnostic multivariate regression is modeled as $T = \beta_0 + \beta_1 r + \beta_2 c$, where $T$ denotes step time, $r$ the normalized computation ratio, $c$ the normalized model complexity, and $\beta_0, \beta_1, \beta_2$ are learned parameters.
Training GPU-specific prediction models is feasible because the selection of cloud GPUs is often limited and usually not customizable. Specifically, we considered the following three GPU-specific regression models: (i) $T_g = \beta_0 + \beta_1 c$; (ii) $T_g = \sum_i (\alpha_i - \alpha_i^*) K_{poly}(c_i, c) + b$; and (iii) $T_g = \sum_i (\alpha_i - \alpha_i^*) K_{rbf}(c_i, c) + b$, where $T_g$ denotes the step time of one specific GPU; $\alpha_i$ and $\alpha_i^*$ are Lagrange multipliers used in SVR to determine support vectors; and $K_{poly}$ and $K_{rbf}$ are two-degree polynomial and RBF kernel functions, respectively.
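To make the SVR prediction form concrete, the following sketch evaluates an RBF-kernel SVR prediction as a weighted sum of kernel evaluations against the support vectors. The support vectors and dual coefficients are illustrative values, not parameters learned from our dataset.

```python
import math

def rbf_kernel(x, x_i, gamma=0.5):
    return math.exp(-gamma * (x - x_i) ** 2)

def svr_predict(x, support_vectors, dual_coefs, b, gamma=0.5):
    # dual_coefs[i] corresponds to (alpha_i - alpha_i*) in the dual form.
    return sum(c * rbf_kernel(x, sv, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + b

# Predict step time (s) from normalized model complexity (illustrative).
print(svr_predict(0.3, support_vectors=[0.1, 0.5, 0.9],
                  dual_coefs=[0.2, 0.4, -0.1], b=0.05))
```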
For training each regression model, we randomly split the dataset into training data and test data with a 4:1 ratio. We conducted k-fold cross validation on the training data and evaluated the performance of the resulting regression models using mean absolute error (MAE) for both training and test data. We chose MAE because it provides a more natural and unambiguous measurement compared to other metrics such as root mean square error (RMSE) [willmott2005advantages]. Further, k-fold MAE allows us to compare different regression models, while test MAE provides insight regarding the robustness of each regression model. For training SVR-based models, we used grid search cross validation to find the set of hyperparameters, i.e., the penalty $C$ and kernel coefficient $\gamma$, that yield the best MAE. We followed common practice when setting the search ranges, stepping $C$ in increments of 10 and $\gamma$ in increments of 0.01.
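The k-fold MAE evaluation can be sketched as follows for a univariate linear model; this is a simplified pure-Python illustration, not the implementation used in our experiments.

```python
# K-fold cross validation computing MAE for a least-squares linear fit.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1  # intercept, slope

def kfold_mae(xs, ys, k=5):
    folds = [list(range(i, len(xs), k)) for i in range(k)]
    maes = []
    for test_idx in folds:
        train_idx = [i for i in range(len(xs)) if i not in test_idx]
        b0, b1 = fit_line([xs[i] for i in train_idx],
                          [ys[i] for i in train_idx])
        maes.append(sum(abs(ys[i] - (b0 + b1 * xs[i])) for i in test_idx)
                    / len(test_idx))
    return sum(maes) / k

# Noiseless linear data: every fold predicts perfectly, so MAE ~ 0.
xs = [float(i) for i in range(20)]
ys = [0.1 + 0.05 * x for x in xs]
print(kfold_mae(xs, ys))
```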
As shown in Table II, the GPU-specific regression models achieved lower MAE than the GPU-agnostic predictive models. For example, all six GPU-specific models had an MAE of less than 0.07 seconds on the test dataset; relative to the average step time across the different CNN models, we believe models with such MAEs can produce reasonable predictions. In comparison, the GPU-agnostic regression models had up to 0.093 seconds MAE on the test dataset. Furthermore, the SVR models with the non-linear RBF kernel provided a better fit than those with the polynomial kernel function, yielding the best MAE for both k-fold cross validation and the test dataset. The mean absolute percentage error (MAPE) on the test dataset was 9.02% for the K80-specific SVR model with the RBF kernel, compared to 13.79% for the P100-specific SVR model with the polynomial kernel.
III-C Impact of Cluster Size and Heterogeneity on Worker Speed
Table III: Average step time (ms ± standard deviation) of individual workers in homogeneous clusters of increasing size and in a heterogeneous (2, 1, 1) cluster.

| GPU | 1 worker | 2 workers | 4 workers | 8 workers | Heterogeneous (2, 1, 1) |
|---|---|---|---|---|---|
| K80 | 229.85 ± 3.04 | 232.08 ± 2.22 | 229.57 ± 3.15 | 227.46 ± 5.06 | 221.16 ± 2.66 |
| P100 | 105.45 ± 1.99 | 105.27 ± 1.45 | 112.73 ± 6.52 | 198.11 ± 18.65 | 107.61 ± 2.13 |
| V100 | 92.38 ± 3.64 | 95.90 ± 4.07 | 106.36 ± 6.16 | 191.72 ± 26.38 | 93.52 ± 4.58 |
To predict training speed for an entire cluster, we must first understand the impact of cluster size and mixing GPU types on the training speed of an individual worker. Table III shows the average step time for individual K80, P100, and V100 workers when used as part of both homogeneous and heterogeneous clusters. The baseline column shows the average step time for a cluster consisting of a single worker.
We make three key observations. First, for homogeneous clusters, the average training speed of an individual worker was roughly the same until the cluster became large enough to encounter a parameter server bottleneck. This bottleneck arises when the rate of workers’ output (i.e., computed gradients) exceeds the parameter server’s capacity. Consequently, the training is bounded by how fast the parameter server can update model parameters. Notice that the K80 workers, with the least powerful GPU, did not reach this bottleneck in our experiments and the average step time was within 1% for all tested cluster sizes. In contrast, workers with the more powerful GPUs hit this bottleneck at smaller cluster sizes (8 for P100 and 4 for V100). We discuss how to mitigate the impact of parameter server bottlenecks in Section VI-B.
Second, as the cluster size increases, we observe higher variation in the average step time. For example, the coefficient of variation for P100 clusters increases from 0.02 with a single worker to 0.09 with eight workers. Third, the use of heterogeneous clusters does not appear to impact the training speed of an individual worker. For instance, the average step time of a V100 worker is 92.38ms in the baseline cluster and 93.52ms in the heterogeneous cluster.
III-D Impact of Cluster Size on Cluster Training Speed
To understand the impact of the number of GPU servers on the cluster training speed, we trained the four Tensor2Tensor models with clusters comprised of an increasing number of P100 GPU servers. Figure 4 shows the average training speed for each cluster.
We make three key observations. First, the cluster training speed increases as the cluster size grows. The upward trend is most obvious for ResNet-15, the least computationally-intensive model of the four. Second, for both ResNet-32 and Shake Shake Small models, the training speed starts to plateau after more than four GPU servers in the cluster, caused by the parameter server bottleneck discussed previously. Third, the lack of training speed improvement for Shake Shake Big, the most complex of the four models, suggests that the computational capacity of the P100 GPU was insufficient for the model. In a separate experiment, not shown, we observed a positive correlation between the training speed and cluster size for Shake Shake Big after switching from P100 to the more powerful V100 GPU.
All the observations in this section indicate that we can effectively predict an unknown cluster's training speed by leveraging our understanding of individual workers' performance and composing from our previously built performance models. Further, if the predicted performance deviates from the online measurement, CM-DARE can flag the parameter servers as the bottleneck and start provisioning additional parameter servers.
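A sketch of this composition logic, assuming a hypothetical `ps_capacity` value for the aggregate rate of gradient updates the parameter server can absorb:

```python
# Cluster speed is approximately the sum of per-worker predicted speeds,
# capped by the parameter server bottleneck observed for larger clusters.
def predict_cluster_speed(worker_speeds, ps_capacity):
    """worker_speeds: predicted steps/s per worker;
    ps_capacity: max aggregate steps/s the parameter server can absorb
    (an illustrative assumption, not a measured value)."""
    return min(sum(worker_speeds), ps_capacity)

# Heterogeneous cluster: two K80s, one P100, one V100 (illustrative speeds).
speeds = [2.58, 2.58, 6.99, 8.80]
print(predict_cluster_speed(speeds, ps_capacity=25.0))      # sum ~ 20.95
print(predict_cluster_speed(speeds * 2, ps_capacity=25.0))  # capped at 25.0
```

A deviation between this prediction and the measured cluster speed is the signal CM-DARE uses to flag a parameter-server bottleneck.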
IV Modeling Fault-tolerance Overhead
Current deep learning frameworks, such as TensorFlow, often provide basic fault-tolerance mechanisms. For example, TensorFlow allows deep learning practitioners to periodically save the most recent model parameters to remote storage. These model files serve as an intermediate result, and allow resuming the training from the checkpoint in case of a failed training session. Fault-tolerance mechanisms are especially important when using transient servers for distributed training, as these mechanisms can reduce the amount of work loss when a worker is revoked.
In this section, we study how fault-tolerance mechanisms, specifically checkpointing CNN models, impact distributed training time. Our observations indicate that the tasks of training and checkpointing happen sequentially and that one can take into account the checkpoint overhead by directly adding to the predicted training time. Our checkpoint prediction models yield only 5.38% mean absolute percentage error and our analysis suggests the value of using different prediction models in different deployment scenarios.
IV-A Measurement Methodology
TensorFlow generates three types of files, i.e., data, index, and meta files, when checkpointing deep learning models. Both the index file and meta file sizes are highly correlated with the number of tensors, e.g., vectors or matrices, in the CNN model. We denote the sizes of the data, meta, and index files with $S_{data}$, $S_{meta}$, and $S_{index}$, respectively, and use $S_{all}$ to denote the sum of these three files.
Measuring Checkpoint Time.
We instrumented the checkpointing function used by TensorFlow and measured the time to checkpoint all twenty CNN models described in Section III-A. In TensorFlow, the chief worker is responsible for checkpointing for the entire cluster. Further, checkpointing does not run on the GPU. Consequently, we measured the checkpointing time using a cluster consisting of a parameter server and a single K80 worker, i.e., the chief worker. To minimize the network impact on the measured checkpointing time, we configured the worker to save checkpoints to remote storage in the same data center as the training cluster.
IV-B Understanding Checkpoint Time
Figure 5 shows the checkpoint time, averaged over five checkpoints, for all twenty CNN models. We observed a low coefficient of variation for all models, ranging from 0.018 to 0.073, and a positive correlation between checkpoint size and time. By cross-examining the training speed with and without checkpointing, we confirmed that the tasks of checkpointing and training are conducted in sequence. For example, when training ResNet-32, the difference in the average time to finish 100 steps with and without checkpointing is consistent with the measured ResNet-32 checkpoint time. This indicates that we can directly add the checkpoint overhead to the distributed training time modeled without checkpointing. Finally, recall that only one worker performs the checkpointing; as such, we need only account for one interrupted worker when predicting the overhead.
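Because checkpointing and training run sequentially, the predicted total time can be composed additively. A sketch with illustrative (not measured) numbers:

```python
# Total training time = compute time + checkpoint overhead, since the
# two tasks are sequential and only the chief worker checkpoints.
def total_training_time(steps, step_time, ckpt_interval, ckpt_time):
    n_checkpoints = steps // ckpt_interval
    return steps * step_time + n_checkpoints * ckpt_time

# e.g., 4000 steps at 0.5 s/step, checkpointing every 1000 steps at 3 s each.
print(total_training_time(4000, 0.5, 1000, 3.0))  # 2000 + 12 = 2012.0 seconds
```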
IV-C Predicting Checkpoint Time
We considered the four regression models listed in Table IV for predicting checkpoint time. Further, given that the index and meta file sizes are both correlated with the number of tensors, we use principal component analysis (PCA) to preprocess the input features and automatically reduce the variable dimensions to two components. Similar to predicting training speed (Section III-B), we considered the following models: (i) $T_c = \beta_0 + \beta_1 S_{all}$, where $S_{all}$ is the total checkpoint size; (ii) $T_c = \beta_0 + \beta_1 p_1 + \beta_2 p_2$, over the two PCA components $p_1$ and $p_2$; (iii) $T_c = \sum_i (\alpha_i - \alpha_i^*) K_{poly}(S_i, S) + b$; and (iv) $T_c = \sum_i (\alpha_i - \alpha_i^*) K_{rbf}(S_i, S) + b$, where $T_c$ denotes checkpoint time; $\alpha_i$ and $\alpha_i^*$ are Lagrange multipliers used in SVR to determine support vectors; and $K_{poly}$ and $K_{rbf}$ are polynomial and RBF kernel functions, respectively.
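The two-component PCA preprocessing can be sketched as follows; the file sizes below are illustrative values, not our measured checkpoints.

```python
import numpy as np

# Reduce the three checkpoint-file sizes (data, meta, index) to two
# principal components before regression.
def pca_two_components(X):
    Xc = X - X.mean(axis=0)                      # center each feature
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:2].T                         # project onto top-2 components

# Rows: CNN models; columns: data, meta, index file sizes (MB, illustrative).
sizes = np.array([[12.0, 0.8, 0.1],
                  [45.0, 2.1, 0.3],
                  [90.0, 4.0, 0.5],
                  [130.0, 5.9, 0.8]])
components = pca_two_components(sizes)
print(components.shape)  # (4, 2)
```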
Table IV: Regression models for predicting checkpoint time; MAE in seconds.

| Regression Model | Input Feature | K-fold MAE (s) | Test MAE (s) |
|---|---|---|---|
| Univariate | total checkpoint size | 0.345 ± 0.099 | 0.356 |
| Multivariate, Two-Component PCA | data, meta, index file sizes | 0.286 ± 0.142 | 0.354 |
| SVR RBF Kernel | total checkpoint size | 0.198 ± 0.135 | 0.245 |
As shown in Table IV, the SVR model with the RBF kernel yielded the best MAEs for both k-fold cross validation and the test dataset. The mean absolute percentage error of the SVR model with the RBF kernel on the test dataset is 5.38%. The other three models have up to 1.74X higher k-fold MAEs and around 1.45X higher test MAEs. Still, all four models would have reasonable utility in predicting total training time. For example, for ResNet-32 trained with periodic checkpointing, the difference between the actual and the predicted total checkpoint time under the linear regression model was only 3.4%. Even though the prediction error accumulates across checkpoints, it has minimal impact on the final training time, which is on the order of hours.
Finally, practitioners might decide to choose a prediction model based on factors other than prediction accuracy, such as the time to retrain the model. For instance, if a practitioner monitoring a running cluster observes variable performance, then the prediction model needs to be retrained with new measurement data. In that case, it might be better to choose models that can be retrained faster, e.g., multivariate models instead of SVR models, as the latter require hyperparameter tuning.
V Characterizing Revocation Overhead
One of the key challenges of using transient servers for distributed training is that they can be revoked at any time. Even the revocation of a single worker can lead to significant performance degradation [2019icac:speedup]. In this section, we characterize the revocation patterns of Google Cloud's transient servers. In summary, we observed that cloud region, GPU type, and time-of-day are important factors for understanding revocation patterns. Further, we found that immediately requesting a replacement worker after a revocation is a valid strategy, as the time to request transient GPU servers is not impacted by revocations. Lastly, the workload of a transient server does not appear to impact its likelihood of revocation.
V-A Measurement Methodology
CM-DARE Measurement Infrastructure.
To measure the revocation of Google transient servers, we implemented a hook function in TensorFlow in conjunction with startup and shutdown scripts provided by Google Cloud. Each GPU worker in the training cluster connected to the CM-DARE controller running in the parameter server via RPC. Transient-TensorFlow, running on the GPU workers, monitored the triggering of each script and forwarded the corresponding timestamped signals to the controller.
Measuring Transient Startup Time.
Transient server startup time is defined as the time between when the cloud customer requested the transient server and when the transient server became available in the training cluster. For each transient server, we measured the time for three consecutive stages [gce_life_cycle]. First, resources are allocated for the server during the provisioning stage. Second, after resource acquisition, the instance is prepared for booting in the staging stage. Third, once the server boots up, it enters the running stage. We used the Google Cloud API in conjunction with the startup script to request servers and measured the duration of each stage by periodically querying the cloud-returned state information. For each GPU-region combination, we requested transient servers and equivalent on-demand servers for comparison. To quantify availability-related startup overheads, we measured the time to start different transient GPU servers after a predefined time window following a revocation event, through CM-DARE.
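The stage-duration measurement can be sketched as a polling loop. Here, `get_server_state` is a stub standing in for the cloud API status query (Google Cloud reports PROVISIONING, STAGING, then RUNNING); the stub advances on a fixed schedule purely for illustration.

```python
import time

STAGES = ["PROVISIONING", "STAGING", "RUNNING"]

def make_stub(schedule):
    """Stub for the cloud status query: enters each stage at a fixed
    offset (seconds) after creation. Illustrative only."""
    start = time.monotonic()
    def get_server_state():
        elapsed = time.monotonic() - start
        for stage, begins_at in reversed(list(zip(STAGES, schedule))):
            if elapsed >= begins_at:
                return stage
        return STAGES[0]
    return get_server_state

def measure_stage_durations(get_state, poll_interval=0.01):
    """Poll the server state and record how long each stage lasted."""
    durations, t0 = {}, time.monotonic()
    current = get_state()
    while current != "RUNNING":
        state = get_state()
        if state != current:
            durations[current] = time.monotonic() - t0
            current, t0 = state, time.monotonic()
        time.sleep(poll_interval)
    return durations

# Simulated server: staging begins at 0.05 s, running at 0.15 s.
print(measure_stage_durations(make_stub([0.0, 0.05, 0.15])))
```

The measured durations are accurate only up to the polling interval, which is why a short interval matters when stages last tens of seconds.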
We requested transient GPU servers in batches. For each batch, we requested the maximum number of servers allowed for our account. We let these servers run for their maximum lifetime of 24 hours and recorded any revocations that occurred prior to the 24-hour cutoff. We repeated this process for a total of twelve non-consecutive days. We divided the transient servers into two equally-sized groups: the first group contained idle servers and the second group consisted of servers that were stressed in CPU, memory, and GPU resources. For stressing CPU and memory resources, we used a popular benchmark [stress_ng], and for stressing the GPU, we used built-in TensorFlow tasks that performed operations similar to distributed training workloads. We repeated the above measurements for the three GPU types described previously in six geographically distributed regions—three US-based regions, two Europe-based regions, and one Asia-based region—to study the impact of the region and time-of-day on revocations.
Measuring Worker Replacement Overhead.
Worker replacement overhead denotes the time to configure the environment for distributed training after a worker replacement. This includes starting the deep learning framework, joining the existing training session, downloading the training dataset that the revoked server held, and recomputing from the last checkpoint if needed. We measured the cold start and warm start worker replacement overhead with a cluster comprised of one K80 GPU worker and one parameter server. Cold start refers to the overhead when using a newly requested GPU server, while warm start uses an existing GPU server.
Measuring Recomputation Overhead.
We trained ResNet-15 with 2-worker clusters and configured the checkpoint interval to be K steps. We manually revoked the chief worker at K steps since the last checkpoint, and added a new worker to the training session at a specified interval. In particular, recomputation overhead denotes the time difference between adding a replacement worker with the chief’s old IP address and adding a replacement worker with a new IP address.
Table V: Transient GPU servers launched per region, with the fraction revoked in parentheses.
|Region|K80|P100|V100|
|us-east1|30 (46.67%)|30 (70%)|N/A|
|us-central1|48 (56.25%)|30 (53.33%)|30 (66.67%)|
|us-west1|48 (22.92%)|30 (66.67%)|30 (73.33%)|
|europe-west1|30 (66.67%)|30 (26.67%)|N/A|
|Total|156 (46.15%)|120 (54.17%)|120 (57.5%)|
V-B Breaking Down Transient Startup Time
Intuitively, transient startup time impacts transient distributed training because it determines how long the training cluster has to run with fewer GPU workers after a revocation. In this subsection, we quantify the transient startup time under different scenarios, such as immediately after server revocations. Our findings can help deep learning practitioners make informed decisions about provisioning transient GPU servers.
Figure 6 shows the average startup time of transient and on-demand GPU servers in two cloud regions. Our first observation is that it takes less than 100 seconds to start up transient GPU servers. This short startup time makes it feasible for practitioners to react quickly to a training slowdown by requesting and adding transient servers to the ongoing training session. Second, it is on average 8.7% slower to start up the more powerful transient P100 GPU servers than K80 GPU servers, with the staging time contributing most to the difference. The longer and more variable staging time for transient K80 might be an indication of higher demand and lower availability of K80 GPUs. Third, compared to their on-demand counterparts, transient startup time was only 11.14 seconds longer on average for K80 servers and 21.38 seconds for P100 servers. Such differences are negligible for distributed training workloads, which often last hours if not days [coleman2017dawnbench, jeffdean].
Figure 7 shows the impact of recent revocations on transient startup time. In particular, we studied immediate requests and delayed requests. For the former, we immediately requested a K80, a P100, and a V100 GPU server after one of our K80 servers was revoked. Delayed requests are the same as immediate requests except that we waited for at least an hour before requesting.
We observed little impact, up to 4 seconds in the case of V100 GPU servers, of revocation events on transient startup time. These results are counter-intuitive, as one of the potential reasons for revocation is higher demand for a given resource [spotlight]. They suggest that deep learning practitioners do not need to budget extra startup overhead for low availability after a revocation. Further, the average startup time for immediate requests for both P100 and V100 is within 3 seconds of that for K80, which suggests that any GPU type can be requested as a replacement for the revoked server. The average startup times for immediate and delayed requests are within 4 seconds for all GPU types. However, for immediate requests, we observed a 4X higher coefficient of variation (12% compared to 3%): startup time is more variable immediately after a server revocation.
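The coefficient-of-variation comparison above is straightforward to reproduce from raw startup-time samples; the sketch below uses made-up sample values, not our actual measurements:

```python
import statistics

def coefficient_of_variation(samples):
    """Ratio of the standard deviation to the mean, as a percentage."""
    return 100.0 * statistics.pstdev(samples) / statistics.mean(samples)

# Hypothetical startup-time samples (seconds); not actual measurements.
immediate = [62, 75, 88, 70, 95, 58, 81]   # requested right after a revocation
delayed   = [71, 73, 69, 74, 72, 70, 75]   # requested at least an hour later

cv_immediate = coefficient_of_variation(immediate)
cv_delayed = coefficient_of_variation(delayed)
# A higher CV for immediate requests indicates more variable startup time.
```

The population standard deviation (`pstdev`) is used here because each batch of samples is treated as the full set of observations for its scenario.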
V-C Understanding Transient Revocations
Next, we looked at the different factors that impact the revocation frequency. Table V summarizes the 206 revocations for 396 transient GPU servers launched throughout twelve non-consecutive days, in six different data centers. Our first observation is that the workload of transient servers does not seem to impact the revocation frequency; roughly half of all observed revocations were for unstressed servers, i.e., idle servers. Our second observation is that different regions can lead to different revocation frequencies. For example, europe-west1 has the lowest revocation frequency for P100 while us-west1 region has the highest revocation frequency for P100 and V100 GPU servers. As a simple strategy, deep learning practitioners can avoid high revocation regions to mitigate the impact on distributed training. Third, more expensive GPU servers, i.e., V100, are more likely to be revoked compared to cheaper GPU servers. This suggests the need to balance computation needs and revocations when choosing GPU servers.
Figure 8 shows that different GPU servers in different regions tend to have distinct lifetime characteristics. For example, more than 50% of K80 servers from europe-west1 were revoked in the first two hours, compared to less than 5% from us-west1. The mean time to revocation for K80 ranges from 10.6 hours to 19.8 hours. This suggests the benefit of launching training clusters in regions such as us-central1 when using K80. In addition, more powerful GPU servers tend to have a shorter mean time to revocation, e.g., V100 servers in us-central1 had a mean time to revocation of 7.7 hours. Combined, these observations also indicate the challenge of selecting the initial cluster configuration: a region that provides more stable K80 servers might have volatile V100 servers.
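Querying an empirical lifetime CDF, as we do when estimating revocation probabilities, can be sketched with the standard library alone; the lifetime samples below are hypothetical, not from our traces:

```python
import bisect

def empirical_cdf(lifetimes_hours):
    """Return F(t) = fraction of observed servers revoked within t hours."""
    data = sorted(lifetimes_hours)
    n = len(data)
    def cdf(t):
        # Count samples <= t via binary search on the sorted lifetimes.
        return bisect.bisect_right(data, t) / n
    return cdf

# Hypothetical observed lifetimes (hours) for one GPU type in one region;
# servers that survived the full 24-hour limit are recorded as 24.
lifetimes = [1.5, 2.0, 6.3, 10.1, 14.8, 24, 24, 24, 24, 24]
F = empirical_cdf(lifetimes)

p_revoked_2h = F(2)    # fraction revoked within the first two hours
p_revoked_12h = F(12)
```

Note that servers recorded at the 24-hour cutoff are censored observations, so F(24) counts them as "revoked by the lifetime limit" rather than by the provider.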
Figure 9 illustrates the hour of the day when revocations occurred, represented in each region’s local time. Each GPU type exhibited different revocation patterns. For example, K80 servers had the highest number of revocations at 10AM, perhaps caused by a surge of demand, while no revocations were observed for V100 servers between 4PM and 8PM.
Finally, our observations suggest an avenue for future work: investigating how strategically launching transient clusters at different times of day and different data center locations can help mitigate revocation impacts.
V-D Worker Replacement Overhead
Figure 10 compares the cold start and warm start worker replacement time. We make two key observations. First, requesting new workers after revocations (cold start) is much more costly than scenarios where only restarting the training framework (warm start) is needed. For example, in the case of ResNet-15, it took about seconds compared to seconds. Second, both the cold and warm start times increase with model size and complexity. For instance, the worker replacement overhead for Shake Shake Big was seconds longer than for ResNet-15, with most of the overhead coming from setting up the training computation graph.
We expect to observe similar overheads for P100 and V100 clusters given that such overheads are not GPU-dependent.
V-E TensorFlow-specific Recomputation Overhead
In unmodified TensorFlow, we observed the following phenomenon: when the chief worker is revoked and a replacement worker is assigned the chief's previous IP address, the cluster will recompute from the last checkpoint. In other words, the cluster will discard any progress made since the last checkpoint. By design, the IP address is bound to the role of chief; the replacement worker therefore effectively becomes the new chief. As the chief worker is responsible for saving the checkpoint, the recomputation overhead can be high. Note that CM-DARE's transient-tensorflow avoids such overhead; consequently, we do not consider this overhead in modeling distributed transient training.
Figure 11 shows the recomputation overhead of training ResNet-15 using a two-K80 GPU cluster. We configured the checkpoint interval to be K steps and manually revoked the chief worker K steps after the last checkpoint. We evaluated the impact of the replacement timing, i.e., when the replacement worker is added and starts training. For each replacement timing, we measured the total time to reach the next designated checkpoint, with and without reusing the chief worker's IP, and calculated the time difference (i.e., the recomputation overhead). When using CM-DARE, an existing worker in the training session is assigned the responsibility of checkpointing, and therefore the recomputation overhead is bounded by the checkpoint interval. In Figure 11, such overhead is up to seconds with a K-step checkpoint interval.
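The bound on recomputation overhead follows directly from the checkpoint interval; a minimal sketch with hypothetical numbers:

```python
def recomputation_steps(checkpoint_interval, revocation_step):
    """Steps lost when training restarts from the last checkpoint.

    revocation_step is the global step at which the chief was revoked;
    all work since the most recent checkpoint is discarded.
    """
    return revocation_step % checkpoint_interval

def recomputation_seconds(checkpoint_interval, revocation_step, secs_per_step):
    # The overhead is bounded by checkpoint_interval * secs_per_step.
    return recomputation_steps(checkpoint_interval, revocation_step) * secs_per_step

# Hypothetical numbers: 500-step checkpoint interval, revocation at
# global step 1337, 0.4 seconds per training step.
lost = recomputation_seconds(500, 1337, 0.4)   # 337 recomputed steps
```

This also shows why shorter checkpoint intervals trade higher steady-state checkpoint overhead for a tighter bound on recomputation after a revocation.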
VI Use Cases of Performance Modeling
Finally, we discuss how practitioners might leverage the findings and insights of our work for (i) predicting cluster training speed and (ii) detecting training bottlenecks. These use cases represent promising extensions to the measurement study presented in this work. Below we describe our preliminary evaluations but leave a comprehensive analysis for future work.
VI-A Heterogeneous Training Prediction
To predict the speed of heterogeneous clusters, i.e., clusters that consist of different types of GPU servers, we can leverage the GPU worker and parameter server performance models described in Section III. These models can be built offline using historical measurement data and retrained with continuous monitored data.
In Section III, we observed that individual server training speed can be predicted using CNN model complexity and the computational capacity of the server's GPU. Further, we observed that adding GPU servers of different types to an asynchronous training session will not impact existing GPU workers' training speed. Therefore, we can predict the cluster training speed as $S = \sum_{i=1}^{n} s_i$ for a cluster of $n$ GPU servers, where $s_i$ denotes the training speed of GPU server $i$. The predicted training time for $W$ amount of training work, measured in number of training steps, is then:

$T = \frac{W}{S} + \frac{W}{c}\,t_c + E[R]\,(t_p + t_r),$

where $c$, $t_c$, $t_p$, and $t_r$ denote the checkpoint interval (number of steps), checkpoint time, time to provision a new GPU server, and worker replacement time, respectively. We assume that $c$ and $W$ are user-specified values, $S$ and $t_c$ are predicted for CNN models given their FLOPs, and $t_p$ and $t_r$ are running averages based on historical measurements.
The expected number of revocations $E[R]$ is calculated as the sum of the probabilities that each worker will be revoked during the training. We obtain these probabilities by querying the empirical CDFs, e.g., Figure 8. For simplicity, we do not consider the impact of newly added transient servers on the number of expected revocations. However, we have additional empirical data for supporting other more complicated modeling scenarios.
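The prediction above can be sketched in a few lines; every input below (speeds, intervals, probabilities) is a hypothetical placeholder for a model-predicted or historically measured value:

```python
def predict_training_time(steps, worker_speeds, ckpt_interval, ckpt_time,
                          provision_time, replacement_time, revoke_probs):
    """Predicted wall-clock training time (seconds) for an asynchronous cluster.

    worker_speeds: predicted steps/second per GPU server; speeds sum under
    asynchronous training (Section III).
    revoke_probs: per-worker probability of revocation during the training,
    obtained by querying empirical lifetime CDFs (e.g., Figure 8).
    """
    cluster_speed = sum(worker_speeds)            # S = sum of s_i
    compute_time = steps / cluster_speed          # W / S
    checkpoint_overhead = (steps / ckpt_interval) * ckpt_time
    expected_revocations = sum(revoke_probs)      # E[R]
    revocation_overhead = expected_revocations * (provision_time + replacement_time)
    return compute_time + checkpoint_overhead + revocation_overhead

# Hypothetical heterogeneous cluster: two K80-class and one P100-class worker.
t = predict_training_time(
    steps=100_000,
    worker_speeds=[2.1, 2.1, 5.4],   # steps/second, model-predicted
    ckpt_interval=500, ckpt_time=6.0,
    provision_time=90.0, replacement_time=120.0,
    revoke_probs=[0.3, 0.3, 0.5],
)
```

The additive structure means each overhead term can be refined independently, e.g., swapping the empirical CDF source for the revocation probabilities without touching the compute or checkpoint terms.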
VI-B Detecting Training Bottlenecks
Troubleshooting distributed training performance is challenging as bottlenecks can be caused by a plethora of factors such as network variations between parameter servers and GPU workers, and cloud server performance fluctuations. We illustrate detecting one such bottleneck caused by overloaded parameter servers. However, we believe that our method and CM-DARE are extendable to detect and resolve other bottlenecks.
Figure 12 compares the training speed for clusters with one parameter server and ones with two parameter servers. When training ResNet models using one parameter server, we observed that larger clusters, e.g., with six P100 servers, do not yield reasonable speedup compared to smaller clusters. Although sublinear scalability in distributed training is not a myth [sergeev2018horovod, dl_perf2], CM-DARE allows one to detect when such bottlenecks arise during training. For example, if the predicted theoretical training speed (as described in Section VI-A) and the measured one differ by a configurable threshold, CM-DARE will flag the bottleneck. Currently, we use a warmup period of seconds and a threshold of 6.7% based on empirical observation. Similar approaches can be used to detect slower GPU workers as well.
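The detection rule reduces to comparing measured against predicted speed with a relative threshold; a minimal sketch, where the 6.7% threshold follows the text and the speed values are illustrative:

```python
def is_bottlenecked(predicted_speed, measured_speed, threshold=0.067):
    """Flag a bottleneck when measured training speed falls short of the
    predicted theoretical speed by more than the relative threshold."""
    shortfall = (predicted_speed - measured_speed) / predicted_speed
    return shortfall > threshold

# Hypothetical six-P100 cluster: a near-linear predicted speed versus a
# measured speed capped by an overloaded parameter server.
flagged = is_bottlenecked(predicted_speed=32.4, measured_speed=24.0)  # True
ok = is_bottlenecked(predicted_speed=10.0, measured_speed=9.8)        # False
```

In practice the check would run only after the warmup period, so transient startup noise does not trigger false positives.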
One potential way to resolve this parameter-server-based bottleneck is to increase the number of parameter servers to two. This improved the training speed of all clusters by up to 70.6%. However, currently deep learning frameworks such as TensorFlow do not support dynamically adding parameter servers while training is ongoing—one has to restart the training session which incurs an overhead of about seconds. We leave overhead-aware bottleneck mitigation as future work.
VII Related Work
Cloud computing has become the de facto platform for hosting a plethora of modern applications, and deep learning as an emerging workload is no exception [strom2015scalable]. Popular deep learning frameworks [caffe2, tensorflow, cntk, mxnet] provide distributed SGD-based algorithms [sgd1, stale4] to train increasingly large models on larger datasets. Existing work towards understanding distributed training workloads can be broadly categorized into performance modeling [dl_perf1, qi:iclr17:paleo, shi2018performance] and empirical studies [zou2017distributed, coleman2017dawnbench, shi2018performance, 2019icac:speedup, Jeon:atc2019:analysis]. In contrast to prior model-driven performance modeling studies [lin2018model, zheng2019cynthia, qi:iclr17:paleo], where a static end-to-end training time prediction is the main focus, our work leverages data-driven modeling powered by a large-scale empirical measurement on a popular cloud platform. The insights provided by both theoretical and empirical characterizations of distributed training have led to numerous system-level optimizations. For example, prior work [jiang2017heterogeneity, zhang:atc17:poseidon, Luo:socc2018:parameterhub, Xie:socc2018:Orpheus] designed heterogeneity-aware distributed training systems for handling shifted bottlenecks or identifying remaining training workload. Our work adds unique knowledge of distributed training with transient servers and framework modifications for transient-aware training, which can be valuable for resource managers.
Optimization for Transient Servers.
Researchers have proposed various system-level techniques, such as dynamic checkpointing [spotcheck, flint, spoton], to exploit the economic benefit brought by cloud transient servers. Additionally, prior work also accounted for application-specific requirements, such as interactivity, when designing transient-aware mechanisms [spotcheck, tributary]. As a promising and cheap way to provide parallelism, transient servers have garnered a lot of interest for big data analytics [See_spotrun, flint, spoton, ambati2019optimizing], memory-intensive applications [spot_burstable], cluster resource managers [portfolio-driven, proteus], and most recently deep learning [2019icac:speedup]. Our work provides a new perspective with a focus on characterizing and modeling distributed training on transient servers.
VIII Conclusion
We explored the characteristics of, and key factors impacting, distributed training on transient servers. We chose three commonly used GPU types across six data center locations for measuring and modeling the performance of twenty CNN models. We found that simple regression-based models have adequate prediction accuracy, even for heterogeneous clusters and when training CNN models with diverse characteristics. Additionally, we demonstrated that the overhead of commonly used fault-tolerance mechanisms (i.e., model checkpointing) can be predicted with high accuracy and the associated impact can be directly added to the predicted training time. Lastly, we explored potential use cases of our performance modeling, including detecting and mitigating performance bottlenecks. We envision that our study, together with our open-source data, lays the foundation for future research in optimizing transient distributed training.
We would like to first thank all anonymous reviewers for their insightful comments. This work is supported in part by National Science Foundation grants #1755659 and #1815619, and Google Cloud Platform Research credits.