Near-zero Downtime Recovery from Transient-error-induced Crashes

03/09/2021
by Chao Chen, et al.

As systems scale, transient errors caused by external noise, e.g., heat fluxes and particle strikes, have become a growing concern for current and upcoming extreme-scale high-performance computing (HPC) systems. However, because such errors remain rare compared with fault-free execution, desirable solutions are low- or no-overhead systems that do not compromise performance under fault-free conditions and that recover quickly enough to minimize downtime. In this paper, we present IterPro, a lightweight compiler-assisted resilience technique that quickly and accurately recovers processes from transient-error-induced crashes. IterPro repairs corrupted process state on the fly when an error occurs, enabling applications to continue executing instead of being terminated. IterPro also exploits side effects introduced by induction-variable-based code optimizations to improve its recovery capability. To this end, two new code transformation passes are introduced to expose these side effects for resilience purposes. We evaluated IterPro with 4 scientific workloads as well as the NPB benchmark suite. During normal execution, IterPro incurs almost zero runtime overhead and a small, fixed 27 MB memory overhead. Meanwhile, IterPro recovers in 83.55 milliseconds on average, resulting in negligible downtime. With such an effective recovery mechanism, IterPro could greatly reduce the overheads and resource requirements of the resilience subsystem in future extreme-scale systems.
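The abstract does not spell out the recovery mechanism, so the following is only a toy illustration of the general idea that induction-variable optimizations leave redundant state behind. It assumes a loop in which strength reduction has produced a derived address variable `addr = base + i * stride` alongside the index `i`; if `i` is corrupted and the access faults, `i` can be recomputed from `addr`. The names `recover_index` and `sum_with_repair`, the base address, the stride, and the simulated bit flip are all hypothetical and are not taken from IterPro.

```python
def recover_index(base: int, stride: int, derived_addr: int) -> int:
    """Recompute loop index i from a derived address base + i * stride."""
    offset = derived_addr - base
    if stride == 0 or offset % stride != 0:
        raise ValueError("derived address is inconsistent with base/stride")
    return offset // stride


def sum_with_repair(values, corrupt_at=None):
    """Sum `values`, repairing the index once if a simulated fault corrupts it.

    Returns (total, number_of_repairs).
    """
    base, stride = 0x1000, 8      # hypothetical array base address and element size
    total = 0
    repairs = 0
    i = 0                         # basic induction variable
    addr = base                   # derived induction variable (redundant with i)
    end = base + len(values) * stride
    while addr < end:
        if corrupt_at is not None and i == corrupt_at and repairs == 0:
            i ^= 1 << 20          # simulated transient bit flip in the index
        try:
            total += values[i]
        except IndexError:        # stands in for a crash on an invalid access
            i = recover_index(base, stride, addr)  # repair i from redundant state
            repairs += 1
            total += values[i]    # retry the faulting access
        i += 1
        addr += stride
    return total, repairs


if __name__ == "__main__":
    print(sum_with_repair([1, 2, 3, 4, 5], corrupt_at=2))
```

Here the Python `IndexError` merely plays the role of a crash; a real system would have to intercept the fault at the process level and repair machine state, which this sketch does not attempt.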


