Doubt and Redundancy Kill Soft Errors – Towards Detection and Correction of Silent Data Corruption in Task-based Numerical Software

10/18/2021
by Philipp Samfass, et al.

Resilient algorithms in high-performance computing are subject to rigorous non-functional constraints: resiliency must not significantly increase the runtime, memory footprint or I/O demands. We propose a task-based soft error detection scheme that relies on error criteria evaluated per task outcome. These criteria formalise how “dubious” an outcome is, i.e. how likely it is to contain an error. Our whole simulation is replicated once, forming two teams of MPI ranks that share their task results. Thus, ideally, each team handles only around half of the workload. If a task yields large error criteria values, i.e. is dubious, we compute the task redundantly and compare the outcomes. Whenever they disagree, the task result with the lower error likelihood is accepted. We obtain a self-healing, resilient algorithm which can compensate for silent floating-point errors without a significant performance, I/O or memory footprint penalty. Case studies, however, suggest that a careful, domain-specific tailoring of the error criteria remains essential.
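The decision logic described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `max_abs_criterion` and `accept_outcome`, the threshold parameter, and the use of the largest absolute value as an error criterion are all assumptions made for illustration, and the exchange of outcomes between the two MPI teams is reduced to a plain function argument.

```python
import math
from typing import Callable, List

Outcome = List[float]


def max_abs_criterion(outcome: Outcome) -> float:
    """Hypothetical error criterion: larger values mean a more dubious outcome.

    Non-finite entries are treated as maximally dubious; otherwise the
    largest magnitude in the outcome is used as the score.
    """
    if any(not math.isfinite(v) for v in outcome):
        return math.inf
    return max((abs(v) for v in outcome), default=0.0)


def accept_outcome(
    shared: Outcome,
    recompute: Callable[[], Outcome],
    criterion: Callable[[Outcome], float],
    threshold: float,
) -> Outcome:
    """Decide which outcome of a task to accept.

    shared    -- outcome received from the other team (assumed already communicated)
    recompute -- callable that computes the same task locally
    threshold -- assumed, domain-specific bound above which an outcome is dubious
    """
    # Outcome is not dubious: trust it and skip the redundant computation.
    if criterion(shared) <= threshold:
        return shared

    # Dubious outcome: compute the task redundantly and compare.
    local = recompute()
    if local == shared:  # exact comparison is a simplifying assumption
        return shared

    # Outcomes disagree: accept the one with the lower criterion value.
    return local if criterion(local) <= criterion(shared) else shared
```

In this sketch, a team only pays for the redundant computation when the shared outcome exceeds the threshold, which mirrors the abstract's point that each team ideally handles around half of the workload. The choice of criterion and threshold is where the domain-specific tailoring mentioned above would enter.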

research
01/13/2020

SERAD: Soft Error Resilient Asynchronous Design using a Bundled Data Protocol

The risk of soft errors due to radiation continues to be a significant c...
research
10/07/2020

SDC Resilient Error-bounded Lossy Compressor

Lossy compression is one of the most important strategies to resolve the...
research
05/09/2023

Upgrade error detection to prediction with GRAND

Guessing Random Additive Noise Decoding (GRAND) is a family of hard- and...
research
03/21/2020

Reliability Assessment and Quantitative Evaluation of Soft-Error Resilient 3D Network-on-Chip Systems

Three-Dimensional Networks-on-Chips (3D-NoCs) have been proposed as an a...
research
09/26/2020

Lossy Checkpoint Compression in Full Waveform Inversion

This paper proposes a new method that combines check-pointing methods wi...
research
04/21/2022

Resilient robot teams: a review integrating decentralised control, change-detection, and learning

Purpose of review: This paper reviews opportunities and challenges for d...
research
03/04/2019

CodeNet: Training Large Scale Neural Networks in Presence of Soft-Errors

This work proposes the first strategy to make distributed training of ne...
