Adaptive control in rollforward recovery for extreme scale multigrid

04/17/2018
by Markus Huber, et al.

With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is reconstructed by an asynchronous on-line recovery. The computations in the faulty and healthy subdomains must be coordinated carefully; in particular, both under-solving and over-solving must be avoided, since either one wastes computational resources and therefore increases the overall time-to-solution. To control the local recovery and guarantee an optimal re-coupling, we introduce a stopping criterion based on a mathematical error estimator. The estimator involves hierarchical weighted sums of residuals on uniformly refined meshes and is well suited to parallel high-performance computing. The re-coupling process is steered by local contributions of the error estimator. We propose and compare two criteria, which differ in their weights. We report failure scenarios when solving systems with up to 6.9·10^11 unknowns on more than 245 766 parallel processes of a state-of-the-art peta-scale supercomputer, demonstrating the robustness of the method.
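
The stopping criterion described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the estimator from the paper: the function names, the two weight choices ("uniform" and "mesh", the latter using h_l^2 with h_l = 2^-l), and the re-coupling rule that compares the local estimator contribution of the faulty subdomain with that of the healthy subdomain are all hypothetical.

```python
from typing import List, Sequence


def level_weights(num_levels: int, scaling: str) -> List[float]:
    """Hypothetical per-level weights for the residual sum.

    'uniform' weights every level by 1; 'mesh' weights level l by
    h_l^2 with h_l = 2^-l (an assumed choice for uniform refinement).
    """
    if scaling == "uniform":
        return [1.0] * num_levels
    if scaling == "mesh":
        return [4.0 ** (-level) for level in range(num_levels)]
    raise ValueError(f"unknown scaling: {scaling!r}")


def weighted_residual_estimate(
    residual_norms: Sequence[float], weights: Sequence[float]
) -> float:
    """Square root of the weighted sum of squared per-level residual norms."""
    if len(residual_norms) != len(weights):
        raise ValueError("need exactly one weight per level")
    total = sum(w * r * r for w, r in zip(weights, residual_norms))
    return total ** 0.5


def recovery_done(
    eta_faulty: float, eta_healthy: float, factor: float = 1.0
) -> bool:
    """Assumed re-coupling rule: stop the local recovery once the faulty
    subdomain's estimator contribution is no larger than the healthy one
    (scaled by a hypothetical safety factor)."""
    return eta_faulty <= factor * eta_healthy


if __name__ == "__main__":
    weights = level_weights(3, "mesh")
    eta_f = weighted_residual_estimate([1e-2, 5e-3, 2e-3], weights)
    eta_h = weighted_residual_estimate([2e-2, 1e-2, 4e-3], weights)
    print("re-couple now:", recovery_done(eta_f, eta_h))
```

Stopping as soon as the two contributions balance is one way to express the goal of avoiding both under-solving (re-coupling while the faulty subdomain still dominates the error) and over-solving (continuing the recovery after it has caught up).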

