Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates

09/15/2023
by Insu Jang, et al.

Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least f+1 logically equivalent pipeline replicas to tolerate any f simultaneous failures. During execution, it relies on already-replicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can be used to cover all available resources after f or fewer simultaneous failures, so that resources never sit idle. Evaluation on large DNN models with billions of parameters shows that Oobleck provides consistently high throughput, and it outperforms state-of-the-art fault-tolerance solutions like Bamboo and Varuna by up to 13.9x.
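The coverage guarantee in the abstract is easy to see with a small sketch. The Python snippet below is a minimal illustration and not Oobleck's actual planner: the function names (generate_template_sizes, cover_nodes) and the range-of-sizes heuristic are assumptions made for exposition. It relies on a simple arithmetic fact: if templates exist for every pipeline size from the smallest feasible size s through 2s - 1, then any surviving node count n >= s can be written as a sum of template sizes, so some combination of pre-generated templates always covers all remaining nodes.

```python
# Minimal sketch of the pipeline-template idea (illustrative only, not
# Oobleck's implementation; all names here are hypothetical).

def generate_template_sizes(smallest_feasible: int) -> list[int]:
    """Return template sizes s..2s-1.

    Any integer n >= s is a sum of values in this range: if s <= n <= 2s-1
    a single template suffices, and if n >= 2s then n - s >= s, so peel off
    one size-s template and recurse.
    """
    s = smallest_feasible
    return list(range(s, 2 * s))

def cover_nodes(sizes: list[int], available: int) -> list[int] | None:
    """Find a multiset of template sizes summing exactly to `available`.

    Simple dynamic program: best[n] holds one valid combination for n nodes.
    Returns None if `available` is below the smallest template size.
    """
    best: dict[int, list[int]] = {0: []}
    for n in range(1, available + 1):
        for s in sizes:
            if n - s in best:
                best[n] = best[n - s] + [s]
                break
    return best.get(available)

if __name__ == "__main__":
    sizes = generate_template_sizes(4)   # templates of 4..7 nodes
    print(sizes)                         # [4, 5, 6, 7]
    # Suppose failures leave 13 healthy nodes: re-instantiate pipelines
    # from the precomputed templates so every node is used.
    print(cover_nodes(sizes, 13))        # [5, 4, 4]
```

Because the templates are generated once before training starts, recovery after a failure reduces to picking a new combination of existing templates and reloading already-replicated model states, rather than re-running an expensive planning step while GPUs sit idle.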

Related research

11/12/2018 · On the Performance and Convergence of Distributed Stream Processing via Approximate Fault Tolerance
Fault tolerance is critical for distributed stream processing systems, y...

04/26/2022 · Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
DNN models across many domains continue to grow in size, resulting in hi...

04/05/2021 · ECRM: Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding
Deep-learning-based recommendation models (DLRMs) are widely deployed to...

06/16/2021 · A Survey on Fault-tolerance in Distributed Optimization and Machine Learning
The robustness of distributed optimization is an emerging field of study...

02/13/2023 · SWIFT: Expedited Failure Recovery for Large-scale DNN Training
As the size of deep learning models gets larger and larger, training tak...

05/16/2018 · Verifying Programs Under Custom Application-Specific Execution Models
Researchers have recently designed a number of application-specific faul...

12/06/2021 · End-to-end Adaptive Distributed Training on PaddlePaddle
Distributed training has become a pervasive and effective approach for t...
