How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study

06/05/2023
by Alexander Isenko, et al.

Training deep learning models in the cloud or on dedicated hardware is expensive. A more cost-efficient option is to use spot instances from hyperscale clouds, a cheap but ephemeral alternative to on-demand resources. Because spot instance availability varies with the time of day, continent, and cloud provider, distributing resources across the world could be more cost-efficient. However, it has not yet been investigated whether geo-distributed, data-parallel deep learning training on spot instances could be a more cost-efficient alternative to centralized training. This paper aims to answer the question: Can deep learning models be cost-efficiently trained on a global market of spot VMs spanning different data centers and cloud providers? To provide guidance, we extensively evaluate the cost and throughput implications of training in different zones, continents, and clouds for representative computer vision (CV) and natural language processing (NLP) models. To expand the current training options further, we compare the scalability potential of hybrid-cloud scenarios, in which cloud resources are added to on-premise hardware to improve training throughput. Finally, we show how spot instance pricing enables a new cost-efficient way to train models with multiple cheap VMs, outperforming both more centralized and powerful hardware and even on-demand cloud offerings at competitive prices.
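The cost and throughput trade-off described above can be made concrete with a small calculation. The following is a minimal sketch, not the paper's methodology: the `Setup` class, the `cost_per_million_samples` helper, and every price and throughput figure are hypothetical placeholders chosen only to illustrate the arithmetic, not measurements from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setup:
    """One training configuration. All values used below are hypothetical."""
    name: str
    vm_count: int
    price_per_vm_hour: float   # USD per VM per hour (placeholder, not a real quote)
    samples_per_second: float  # aggregate throughput of the whole setup (placeholder)


def cost_per_million_samples(setup: Setup) -> float:
    """USD spent to process one million training samples."""
    hourly_cost = setup.vm_count * setup.price_per_vm_hour
    samples_per_hour = setup.samples_per_second * 3600
    return hourly_cost / samples_per_hour * 1_000_000


# Hypothetical comparison: one powerful on-demand VM vs. several cheap spot VMs.
setups = [
    Setup("on-demand, 1 large VM", vm_count=1, price_per_vm_hour=4.00, samples_per_second=400.0),
    Setup("spot, 4 small VMs", vm_count=4, price_per_vm_hour=0.50, samples_per_second=500.0),
]

for s in setups:
    print(f"{s.name}: {cost_per_million_samples(s):.2f} USD per 1M samples")
```

With these placeholder numbers the spot setup is cheaper per sample, but the outcome depends entirely on the actual prices and on how well throughput scales across VMs, which is what the paper evaluates experimentally. The sketch also ignores spot interruptions and data transfer costs between zones.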

