Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs

04/26/2022
by   John Thorpe, et al.

DNN models across many domains continue to grow in size, resulting in high resource requirements for effective training and unpalatable (and often unaffordable) costs for organizations and research labs across scales. This paper aims to significantly reduce training costs through effective use of preemptible instances, i.e., those that can be obtained at a much cheaper price while idle, but may be preempted whenever requested by priority users. Doing so, however, requires new forms of resiliency and efficiency to cope with the possibility of frequent preemptions - a failure model drastically different from the occasional failures in normal cluster settings that existing checkpointing techniques target. We present Bamboo, a distributed system that tackles these challenges by introducing redundant computations into the training pipeline, whereby one node performs computations not only over its own layers but also over some layers of its neighbor. Our key insight is that training large models often requires pipeline parallelism, in which "pipeline bubbles" naturally exist. Bamboo carefully fills these bubbles with redundant computations, providing resilience at a low cost. Across a variety of widely used DNN models, Bamboo outperforms traditional checkpointing by 3.7x in training throughput, and reduces costs by 2.4x compared to a setting where on-demand instances are used.
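To illustrate the core idea, here is a minimal sketch (not Bamboo's actual implementation; all names are hypothetical) of redundant computation in a pipeline: each stage keeps a shadow copy of its predecessor's layer and, during an otherwise idle "bubble" slot, recomputes the predecessor's forward pass from the same input. If the predecessor node is preempted, its output for that micro-batch survives on its neighbor.

```python
def layer(stage_id, x):
    """Stand-in for one pipeline stage's computation (a toy affine op)."""
    return x * 2 + stage_id

class Stage:
    def __init__(self, stage_id):
        self.stage_id = stage_id
        # Shadow the *previous* stage's layer; used only during bubbles.
        self.shadow_id = stage_id - 1
        self.shadow_outputs = {}  # micro-batch id -> redundantly computed output

    def forward(self, x):
        return layer(self.stage_id, x)

    def fill_bubble(self, mb_id, pred_input):
        # During a pipeline bubble, redundantly recompute the predecessor's
        # forward pass so its activation survives a preemption of that node.
        self.shadow_outputs[mb_id] = layer(self.shadow_id, pred_input)

def run_pipeline(inputs, num_stages=3):
    """Run micro-batches through the stages, shadowing each predecessor."""
    stages = [Stage(i) for i in range(num_stages)]
    outputs = []
    for mb_id, x in enumerate(inputs):
        prev_input = None
        h = x
        for s in stages:
            cur_input = h
            h = s.forward(cur_input)
            if prev_input is not None:
                # Each stage also receives its predecessor's input and
                # replays the predecessor's work in its bubble slot.
                s.fill_bubble(mb_id, prev_input)
            prev_input = cur_input
        outputs.append(h)
    return stages, outputs

# If stage 1 is preempted, its output for micro-batch 0 can be recovered
# from stage 2's shadow_outputs instead of restarting from a checkpoint.
```

The trade-off sketched here mirrors the paper's claim: the redundant forward passes occupy time the pipeline would have spent idle anyway, so resilience comes at little extra cost.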

research
09/15/2023

Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates

Oobleck enables resilient distributed training of large DNN models with ...
research
07/02/2020

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

It is a challenging task to train large DNN models on sophisticated GPU ...
research
10/09/2019

PipeMare: Asynchronous Pipeline Parallel DNN Training

Recently there has been a flurry of interest around using pipeline paral...
research
02/01/2019

Dataset Culling: Towards Efficient Training Of Distillation-Based Domain Specific Models

Real-time CNN based object detection models for applications like survei...
research
12/19/2019

Scalable Resilience Against Node Failures for Communication-Hiding Preconditioned Conjugate Gradient and Conjugate Residual Methods

The observed and expected continued growth in the number of nodes in lar...
research
08/30/2022

Analysis of Distributed Deep Learning in the Cloud

We aim to resolve this problem by introducing a comprehensive distribute...
research
10/23/2017

"Birds in the Clouds": Adventures in Data Engineering

Leveraging their eBird crowdsourcing project, the Cornell Lab of Ornitho...
