Distributed Deep Learning Using Volunteer Computing-Like Paradigm

03/16/2021
by Medha Atre, et al.

The use of Deep Learning (DL) in commercial applications such as image classification, sentiment analysis, and speech recognition is increasing. When training DL models with a large number of parameters and/or large datasets, the cost and speed of training can become prohibitive. Distributed DL training solutions that split a training job into subtasks and execute them over multiple nodes can decrease training time. However, the cost of current solutions, built predominantly for cluster computing systems, can still be an issue. In contrast to cluster computing systems, Volunteer Computing (VC) systems can lower the cost of computing, but applications running on VC systems have to handle fault tolerance, variable network latency, and heterogeneity of compute nodes, and current solutions are not designed to do so. We design a distributed solution that can run DL training on a VC system by using a data parallel approach. We implement a novel asynchronous SGD scheme called VC-ASGD suited for VC systems. In contrast to traditional VC systems that lower cost by using untrustworthy volunteer devices, we lower cost by leveraging preemptible computing instances on commercial cloud platforms. By using preemptible instances that require applications to be fault tolerant, we lower cost by 70-90% while improving data security.
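
The abstract does not spell out the details of the VC-ASGD scheme, so the sketch below only illustrates the general pattern it builds on: an asynchronous, data-parallel SGD step in which each (possibly preemptible) worker pulls the latest global weights, computes a gradient on its local data shard, and pushes the update back without waiting for other workers. The ParameterStore class and worker_step function are hypothetical names used for illustration, not the paper's implementation.

    import torch
    import torch.nn as nn

    class ParameterStore:
        """Stand-in for a fault-tolerant store of the global weights (hypothetical API)."""

        def __init__(self, model: nn.Module):
            # Keep a detached copy of the global model weights.
            self._state = {k: v.clone() for k, v in model.state_dict().items()}

        def pull(self):
            # Workers fetch the latest global weights before each step.
            return {k: v.clone() for k, v in self._state.items()}

        def push(self, grads, lr=0.01):
            # Apply a (possibly stale) gradient immediately: asynchronous SGD
            # uses no barrier and never waits for other workers.
            for name, grad in grads.items():
                self._state[name] -= lr * grad

    def worker_step(store, model, batch, loss_fn):
        # One step on a preemptible worker: pull, compute a local gradient, push.
        model.load_state_dict(store.pull())
        inputs, targets = batch
        loss = loss_fn(model(inputs), targets)
        model.zero_grad()
        loss.backward()
        grads = {name: p.grad.detach().clone()
                 for name, p in model.named_parameters() if p.grad is not None}
        store.push(grads)
        return loss.item()

    if __name__ == "__main__":
        model = nn.Linear(10, 2)
        store = ParameterStore(model)
        batch = (torch.randn(32, 10), torch.randint(0, 2, (32,)))
        print(worker_step(store, model, batch, nn.CrossEntropyLoss()))

Because a worker holds no state the system depends on, it can be preempted at any point: its in-flight gradient is simply lost while other workers keep pushing updates, which is why this style of training fits preemptible cloud instances.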

Related research

12/01/2020 - A Study of Checkpointing in Large Scale Training of Deep Neural Networks
Deep learning (DL) applications are increasingly being deployed on HPC s...

04/13/2020 - Deep-Edge: An Efficient Framework for Deep Learning Model Update on Heterogeneous Edge
Deep Learning (DL) model-based AI services are increasingly offered in a...

10/01/2018 - Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication
This paper presents FT-GAIA, a software-based fault-tolerant parallel an...

03/12/2020 - Machine Learning on Volatile Instances
Due to the massive size of the neural network models and training datase...

11/30/2022 - COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training
Modern Deep Learning (DL) models have grown to sizes requiring massive c...

11/24/2018 - Hydra: A Peer to Peer Distributed Training & Data Collection Framework
The world needs diverse and unbiased data to train deep learning models....

06/01/2022 - Distributed Training for Deep Learning Models On An Edge Computing Network Using Shielded Reinforcement Learning
Edge devices with local computation capability have made distributed deep...
