FT-GCR: a fault-tolerant generalized conjugate residual elliptic solver

03/12/2021
by Mike Gillard, et al.

With the steady advance of high performance computing systems featuring smaller and smaller hardware components, the systems and algorithms used for numerical simulations increasingly contend with disruptions caused by hardware failures and bit-level misrepresentations of computing data. In numerical frameworks exploiting massive processing power, the solution of linear systems often represents the most computationally intensive component. Given the large number of repeated operations involved, iterative solvers are particularly vulnerable to bit-flips. A new method named FT-GCR is proposed here that supplies the preconditioned Generalized Conjugate Residual Krylov solver with detection of, and recovery from, soft faults. The algorithm tests for the monotonic decrease of the residual norm and, upon failure, restarts the iteration within the local Krylov space. Numerical experiments on the solution of an elliptic problem arising from a stationary flow over an isolated hill on the sphere show the skill of the method in addressing bit-flips on a range of grid sizes and data loss scenarios, with best returns and detection rates obtained for larger corruption events. The simplicity of the method makes it easily extendable to other solvers and an ideal candidate for algorithmic fault tolerance within integrated model resilience strategies.
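The detection-and-restart idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ft_gcr_sketch`, the `inject_fault` hook, the relative slack of `1e-12` in the monotonicity test, the identity preconditioner, and the recovery rule (discard the trial step, recompute the residual from the last accepted iterate, and clear the stored search directions) are all assumptions made for illustration. It uses a small dense 1D Poisson-type matrix as a stand-in for an elliptic operator.

```python
import math

def matvec(A, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def axpy(alpha, u, v):
    """Return v + alpha * u as a new list."""
    return [vi + alpha * ui for ui, vi in zip(u, v)]

def ft_gcr_sketch(A, b, tol=1e-10, max_iter=200, restart=10, inject_fault=None):
    """GCR with a residual-monotonicity check (illustrative assumption,
    not the published FT-GCR). Returns (x, residual norm, detections)."""
    x = [0.0] * len(b)
    r = list(b)                      # residual of the zero initial guess
    r_norm = math.sqrt(dot(r, r))
    b_norm = r_norm
    ps, aps = [], []                 # search directions p_j and A p_j
    detections = 0
    for k in range(max_iter):
        if r_norm <= tol * b_norm:
            break
        p = list(r)                  # identity preconditioner (assumption)
        ap = matvec(A, p)
        # Make A p orthogonal to the stored A p_j (modified Gram-Schmidt).
        for pj, apj in zip(ps, aps):
            beta = dot(ap, apj) / dot(apj, apj)
            p = axpy(-beta, pj, p)
            ap = axpy(-beta, apj, ap)
        alpha = dot(r, ap) / dot(ap, ap)
        x_new = axpy(alpha, p, x)
        r_new = axpy(-alpha, ap, r)
        if inject_fault is not None:
            inject_fault(k, r_new)   # hypothetical hook; may corrupt r_new in place
        new_norm = math.sqrt(dot(r_new, r_new))
        # In exact arithmetic GCR never increases the residual norm, so an
        # increase (or a NaN) is treated as evidence of a soft fault.
        if not new_norm <= r_norm * (1.0 + 1e-12):
            detections += 1
            r = axpy(-1.0, matvec(A, x), b)   # recompute b - A x from accepted x
            r_norm = math.sqrt(dot(r, r))
            ps, aps = [], []                  # restart the Krylov space
            continue
        x, r, r_norm = x_new, r_new, new_norm
        ps.append(p)
        aps.append(ap)
        if len(ps) >= restart:
            ps, aps = [], []
    return x, r_norm, detections

# Usage: 1D Poisson-type matrix, with one large corruption at iteration 1.
n = 8
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n

def flip(k, r):
    if k == 1:
        r[0] += 1.0e6                # crude stand-in for a high-order bit-flip

x, r_norm, detections = ft_gcr_sketch(A, b, inject_fault=flip)
true_res = math.sqrt(sum((bi - yi) ** 2 for bi, yi in zip(b, matvec(A, x))))
```

A check of this kind only catches corruptions that increase the residual norm, which is consistent with the abstract's remark that detection rates are best for larger corruption events. A fault that happens to shrink the stored residual would pass unnoticed in this sketch.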


research
10/16/2018

Influence of A-Posteriori Subcell Limiting on Fault Frequency in Higher-Order DG Schemes

Soft error rates are increasing as modern architectures require increasi...
research
03/21/2020

A low-overhead soft-hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems

The Network-on-Chip (NoC) paradigm has been proposed as a favorable solu...
research
04/01/2004

On the Practicality of Intrinsic Reconfiguration As a Fault Recovery Method in Analog Systems

Evolvable hardware combines the powerful search capability of evolutiona...
research
03/21/2020

Soft-Error and Hard-fault Tolerant Architecture and Routing Algorithm for Reliable 3D-NoC Systems

Network-on-Chip (NoC) paradigm has been proposed as an auspicious soluti...
research
09/21/2017

Convergence characteristics of the generalized residual cutting method

The residual cutting (RC) method has been proposed for efficiently solvi...
research
07/30/2019

How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures

We study algorithmic approaches for recovering from the failure of sever...
research
05/14/2020

Reproducibility of Parallel Preconditioned Conjugate Gradient in Hybrid Programming Environments

The Preconditioned Conjugate Gradient method is often employed for the s...
