Adaptive control in rollforward recovery for extreme scale multigrid

04/17/2018
by Markus Huber, et al.

With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is reconstructed by an asynchronous on-line recovery. The computations in the faulty and healthy subdomains must be carefully coordinated; in particular, both under- and over-solving must be avoided, since either wastes computational resources and therefore increases the overall time-to-solution. To control the local recovery and guarantee an optimal re-coupling, we introduce a stopping criterion based on a mathematical error estimator. The estimator involves hierarchical weighted sums of residuals on uniformly refined meshes and is well suited to parallel high-performance computing. The re-coupling process is steered by local contributions of the error estimator. We propose and compare two criteria that differ in their weights. Failure scenarios with up to 6.9·10^11 unknowns on more than 245 766 parallel processes are reported on a state-of-the-art peta-scale supercomputer, demonstrating the robustness of the method.
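
The stopping idea in the abstract can be illustrated with a toy sketch. This is not the authors' estimator or implementation: the function names, the form of the estimate (square root of a weighted sum of squared per-level residual norms), the stopping rule (stop once the faulty-domain estimate no longer exceeds the healthy-domain estimate), and the fixed contraction factor standing in for a local multigrid cycle are all assumptions made for illustration.

```python
from typing import Sequence


def weighted_residual_estimate(residual_norms: Sequence[float],
                               weights: Sequence[float]) -> float:
    """Hypothetical estimator: sqrt of a weighted sum of squared
    per-level residual norms (one weight per mesh level)."""
    if len(residual_norms) != len(weights):
        raise ValueError("one weight per level is required")
    return sum(w * r * r for w, r in zip(weights, residual_norms)) ** 0.5


def recovery_should_stop(faulty_estimate: float,
                         healthy_estimate: float,
                         safety: float = 1.0) -> bool:
    """Assumed rule: stop local recovery once the faulty-domain estimate
    is no larger than (safety times) the healthy-domain estimate."""
    return faulty_estimate <= safety * healthy_estimate


def count_recovery_steps(faulty_norms: Sequence[float],
                         healthy_norms: Sequence[float],
                         weights: Sequence[float],
                         contraction: float = 0.5,
                         max_steps: int = 100) -> int:
    """Toy recovery loop: each step scales every faulty-domain residual
    norm by `contraction` (a stand-in for one local solver cycle) and
    stops as soon as the assumed criterion is met."""
    healthy = weighted_residual_estimate(healthy_norms, weights)
    norms = list(faulty_norms)
    steps = 0
    while steps < max_steps and not recovery_should_stop(
            weighted_residual_estimate(norms, weights), healthy):
        norms = [contraction * r for r in norms]
        steps += 1
    return steps


if __name__ == "__main__":
    # Two levels, equal weights; the faulty domain starts 8x worse.
    print(count_recovery_steps([8.0, 8.0], [1.0, 1.0], [1.0, 1.0]))
```

In this sketch, stopping earlier than the criterion allows would correspond to under-solving (the re-coupled solution is polluted by the faulty domain), while continuing past it would correspond to over-solving (extra local work with no gain in global accuracy). Changing `weights` is the toy analogue of comparing criteria that differ in their weights.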

research
08/07/2022

Asynchronous scalable version of the Global-Local non-invasive coupling

The Global-Local non-invasive coupling is an improvement of the submodel...
research
11/05/2019

Soft Error Resilience and Failure Recovery for Continuum Dynamics Applications

The persistently growing resilience concerns of large-scale computing sy...
research
03/09/2021

Near-zero Downtime Recovery from Transient-error-induced Crashes

Due to the system scaling, transient errors caused by external noises, e...
research
11/05/2019

Failure Analysis and Quantification for Contemporary and Future Supercomputers

Large-scale computing systems today are assembled by numerous computing ...
research
11/25/2022

Efficient a Posteriori Error Control of a Consistent Atomistic/Continuum Coupling Method for Two Dimensional Crystalline Defects

Adaptive atomistic/continuum (a/c) coupling method is an important metho...
research
05/18/2023

Stopping Criteria for the Conjugate Gradient Algorithm in High-Order Finite Element Methods

We introduce three new stopping criteria that balance algebraic and disc...
research
04/25/2022

NVM-ESR: Using Non-Volatile Memory in Exact State Reconstruction of Preconditioned Conjugate Gradient

HPC systems are a critical resource for scientific research and advanced...