Machine Learning on Volatile Instances

03/12/2020
by Xiaoxi Zhang, et al.

Due to the massive size of the neural network models and training datasets used in machine learning today, it is imperative to distribute stochastic gradient descent (SGD) by splitting up tasks such as gradient evaluation across multiple worker nodes. However, running distributed SGD can be prohibitively expensive because it may require specialized computing resources such as GPUs for extended periods of time. We propose cost-effective strategies that exploit volatile cloud instances, which are cheaper than standard instances but may be interrupted by higher-priority workloads. To the best of our knowledge, this work is the first to quantify how variations in the number of active worker nodes (as a result of preemption) affect SGD convergence and the time to train the model. By characterizing the trade-offs among instance preemption probability, model accuracy, and training time, we derive practical strategies for configuring distributed SGD jobs on volatile instances such as Amazon EC2 spot instances and other preemptible cloud resources. Experimental results show that our strategies achieve good training performance at substantially lower cost.
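The setting the abstract describes can be illustrated with a small simulation. The sketch below is an illustrative assumption, not the paper's algorithm or code: it runs synchronous distributed SGD on a synthetic least-squares problem in which each worker is independently preempted in a given step with probability p_preempt, so only the surviving workers contribute gradients and the effective mini-batch size fluctuates from step to step. All names and parameter values (num_workers, p_preempt, batch_per_worker, lr) are made up for illustration.

# Minimal sketch (assumed setup, not the authors' implementation):
# synchronous distributed SGD with volatile workers that may be preempted.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data, chosen only for illustration.
d, n = 20, 2000
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)

num_workers = 8          # worker nodes holding disjoint data shards
p_preempt = 0.3          # per-step probability that a worker is unavailable
batch_per_worker = 32
lr = 0.05
shards = np.array_split(rng.permutation(n), num_workers)

w = np.zeros(d)
for step in range(500):
    # Sample which workers survive this step (volatile instances).
    active = [k for k in range(num_workers) if rng.random() > p_preempt]
    if not active:
        continue  # no gradients reported this step
    grads = []
    for k in active:
        idx = rng.choice(shards[k], size=batch_per_worker, replace=False)
        residual = X[idx] @ w - y[idx]
        grads.append(X[idx].T @ residual / batch_per_worker)
    # Average only over the workers that actually reported a gradient.
    w -= lr * np.mean(grads, axis=0)

print("final loss:", 0.5 * np.mean((X @ w - y) ** 2))

Raising p_preempt in this toy setup shrinks the average number of gradients aggregated per step, which is the kind of variation in active workers whose effect on convergence time and accuracy the paper quantifies.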

Related research

08/16/2017  Weighted parallel SGD for distributed unbalanced-workload training system
            Stochastic gradient descent (SGD) is a popular stochastic optimization m...

06/12/2020  O(1) Communication for Distributed SGD through Two-Level Gradient Averaging
            Large neural network models present a hefty communication challenge to d...

06/20/2019  Data Cleansing for Models Trained with SGD
            Data cleansing is a typical approach used to improve the accuracy of mac...

05/14/2020  OD-SGD: One-step Delay Stochastic Gradient Descent for Distributed Training
            The training of modern deep learning neural network calls for large amou...

03/16/2021  Distributed Deep Learning Using Volunteer Computing-Like Paradigm
            Use of Deep Learning (DL) in commercial applications such as image class...

10/01/2020  PipeTune: Pipeline Parallelism of Hyper and System Parameters Tuning for Deep Learning Clusters
            DNN learning jobs are common in today's clusters due to the advances in ...

11/29/2020  Srift: Swift and Thrift Cloud-Based Distributed Training
            Cost-efficiency and training time are primary concerns in cloud-based di...
