Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method

07/08/2020
by   Carlos Pachajoa, et al.
0

As computers reach exascale and beyond, the incidence of faults will increase. Solutions to this problem are an active research topic. We focus on strategies to make the preconditioned conjugate gradient (PCG) solver resilient against node failures, specifically, the exact state reconstruction (ESR) method, which exploits redundancies in PCG. Reducing the frequency at which redundant information is stored lessens the runtime overhead. However, after the node failure, the solver must restart from the last iteration for which redundant information was stored, which increases recovery overhead. This formulation highlights the method's similarities to checkpoint-restart (CR). Thus, this method, which we call ESR with periodic storage (ESRP), can be considered a form of algorithm-based checkpoint-restart. The state is stored implicitly, by exploiting redundancy inherent to the algorithm, rather than explicitly as in CR. We also minimize the amount of data to be stored and retrieved compared to CR, but additional computation is required to reconstruct the solver's state. In this paper, we describe the necessary modifications to ESR to convert it into ESRP, and perform an experimental evaluation. We compare ESRP experimentally with previously-existing ESR and application-level in-memory CR. Our results confirm that the overhead for ESR is reduced significantly, both in the failure-free case, and if node failures are introduced. In the former case, the overhead of ESRP is usually lower than that of CR. However, CR is faster if node failures happen. We claim that these differences can be alleviated by the implementation of more appropriate preconditioners.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/30/2019

How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures

We study algorithmic approaches for recovering from the failure of sever...
research
12/19/2019

Scalable Resilience Against Node Failures for Communication-Hiding Preconditioned Conjugate Gradient and Conjugate Residual Methods

The observed and expected continued growth in the number of nodes in lar...
research
01/23/2021

Resilient Virtualized Systems Using ReHype

System-level virtualization introduces critical vulnerabilities to failu...
research
10/11/2019

Enabling Failure-resilient Intermittent Systems Without Runtime Checkpointing

Self-powered intermittent systems typically adopt runtime checkpointing ...
research
02/13/2023

SWIFT: Expedited Failure Recovery for Large-scale DNN Training

As the size of deep learning models gets larger and larger, training tak...
research
12/02/2018

Double and Triple Erasure-Correcting-Codes over Graphs

In this paper we study array-based codes over graphs for correcting mult...
research
11/05/2020

CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery

The paper proposes and optimizes a partial recovery training system, CPR...

Please sign up or login with your details

Forgot password? Click here to reset