Speeding up Deep Learning with Transient Servers

02/28/2019
by Shijian Li, et al.

Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable---e.g., for rapidly evaluating new model designs---they often come with significantly higher monetary costs due to sublinear scalability. In this paper, we investigate the feasibility of using training clusters composed of cheaper transient GPU servers to get the benefits of distributed training without the high costs. We conduct the first large-scale empirical analysis, launching more than a thousand GPU servers of various capacities, aimed at understanding the characteristics of transient GPU servers and their impact on distributed training performance. Our study demonstrates the potential of transient servers with a speedup of 7.7X and more than 62.9% monetary savings for some cluster configurations. We also identify a number of important challenges and opportunities for redesigning distributed training frameworks to be transient-aware. For example, the dynamic cost and availability characteristics of transient servers suggest the need for frameworks to dynamically change cluster configurations to best take advantage of current conditions.
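To make the setting concrete, the sketch below shows what data-parallel training across a cluster of GPU servers looks like in TensorFlow. This is an illustration, not the authors' code: the worker addresses are hypothetical placeholders, and the `tf.distribute.MultiWorkerMirroredStrategy` API shown is the modern TF 2.x interface, which postdates the paper's experiments.

```python
# Minimal sketch (assumed setup, not the paper's code) of data-parallel
# training across multiple GPU servers with TensorFlow 2.x.
# Run one copy of this script per worker, each with its own task index.
import json
import os

import tensorflow as tf

# TF_CONFIG tells each worker who its peers are. The cluster membership
# is fixed here, which is exactly what makes such frameworks
# transient-unaware: if a transient server is revoked, this spec must be
# rebuilt and training resumed from a checkpoint.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.0.0.1:2222", "10.0.0.2:2222"]},  # placeholder IPs
    "task": {"type": "worker", "index": 0},  # this process's role in the cluster
})

# Synchronous all-reduce data parallelism across the workers above.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created in this scope are replicated on every worker,
    # and gradients are all-reduced across the cluster each step.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=1, batch_size=64)
```

A transient-aware redesign of the kind the paper argues for would treat the cluster spec above as mutable state, reacting to revocations and price changes rather than assuming a static set of workers for the whole training run.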
