Fault Tolerance in Iterative-Convergent Machine Learning

10/17/2018
by Aurick Qiao, et al.

Machine learning (ML) training algorithms often possess an inherent self-correcting behavior due to their iterative-convergent nature. Recent systems exploit this property to achieve adaptability and efficiency in unreliable computing environments by relaxing the consistency of execution and allowing calculation errors to be self-corrected during training. However, the behavior of such systems is only well understood for specific types of calculation errors, such as those caused by staleness, reduced precision, or asynchronicity, and for specific types of training algorithms, such as stochastic gradient descent. In this paper, we develop a general framework to quantify the effects of calculation errors on iterative-convergent algorithms and use this framework to design new strategies for checkpoint-based fault tolerance. Our framework yields a worst-case upper bound on the iteration cost of arbitrary perturbations to model parameters during training. Our system, SCAR, employs strategies which reduce the iteration cost upper bound due to perturbations incurred when recovering from checkpoints. We show that SCAR can reduce the iteration cost of partial failures by 78% compared with traditional checkpoint-based fault tolerance across a variety of ML models and training algorithms.
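The self-correcting behavior the abstract relies on can be illustrated with a minimal sketch (our own toy example, not the paper's system): gradient descent on a strongly convex quadratic converges to the same optimum even if the parameters are perturbed mid-run, e.g. by restoring a stale checkpoint after a partial failure. The function name `train` and all constants below are illustrative assumptions.

```python
import numpy as np

def train(A, b, steps, lr=0.05, perturb_at=None, rng=None):
    """Gradient descent on f(x) = 0.5 * x^T A x - b^T x.

    If perturb_at is set, the parameters are randomly perturbed at that
    iteration, simulating recovery from an old checkpoint after a partial
    failure: a perturbation of the model state rather than a full restart.
    """
    x = np.zeros(len(b))
    for t in range(steps):
        if t == perturb_at:
            x = x + rng.normal(0.0, 1.0, len(b))  # injected calculation error
        x = x - lr * (A @ x - b)                  # exact gradient step
    return x

rng = np.random.default_rng(0)
A = np.diag([1.0, 2.0, 3.0])
b = np.array([1.0, 1.0, 1.0])
x_star = np.linalg.solve(A, b)  # unique minimizer

clean = train(A, b, steps=800, rng=rng)
perturbed = train(A, b, steps=800, perturb_at=400, rng=rng)
# Both runs reach the same optimum; the perturbation only costs extra
# iterations of convergence, which is the quantity the paper's
# iteration-cost upper bound is designed to measure.
```

Here the perturbed run still converges because each gradient step contracts the error toward `x_star`; the paper's framework bounds how many additional iterations such a perturbation can cost in the worst case.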


