Adaptive control in rollforward recovery for extreme scale multigrid

by Markus Huber, et al.

With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is reconstructed by an asynchronous on-line recovery. The computations in the faulty and healthy subdomains must be carefully coordinated; in particular, both under-solving and over-solving must be avoided, since either wastes computational resources and therefore increases the overall time-to-solution. To control the local recovery and guarantee an optimal re-coupling, we introduce a stopping criterion based on a mathematical error estimator. The estimator involves hierarchical weighted sums of residuals on uniformly refined meshes and is well suited to parallel high-performance computing. The re-coupling process is steered by local contributions of the error estimator. We propose and compare two criteria which differ in their weights. We report failure scenarios when solving systems with up to 6.9·10^11 unknowns on more than 245 766 parallel processes on a state-of-the-art peta-scale supercomputer, demonstrating the robustness of the method.
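
The abstract does not give the estimator's formula, so the following is only a minimal sketch of how a stopping criterion built on a hierarchical weighted sum of residuals might look. The function names, the per-level weights, the `safety` factor, and the rule "stop the recovery once the faulty subdomain's estimate is no larger than the healthy one's" are all assumptions made for illustration, not the authors' method.

```python
import math


def weighted_residual_estimate(residuals_per_level, weights):
    """Hypothetical estimator: sqrt of sum_l w_l * ||r_l||_2^2 over mesh levels.

    residuals_per_level: one list of residual entries per refinement level.
    weights: one non-negative weight per level (assumed, not from the paper).
    """
    if len(residuals_per_level) != len(weights):
        raise ValueError("need exactly one weight per mesh level")
    total = 0.0
    for weight, residual in zip(weights, residuals_per_level):
        total += weight * sum(x * x for x in residual)
    return math.sqrt(total)


def recovery_should_stop(eta_faulty, eta_healthy, safety=1.0):
    """Assumed re-coupling rule: stop the local recovery once the faulty
    subdomain's estimate has dropped to the level of the healthy one.
    Stopping earlier would under-solve; continuing longer would over-solve."""
    return eta_faulty <= safety * eta_healthy


# Two levels with illustrative residual entries.
eta = weighted_residual_estimate([[3.0, 4.0], [0.0]], [1.0, 1.0])
```

Comparing two sets of weights, as the abstract describes, would amount to calling `weighted_residual_estimate` with different `weights` arguments and observing when `recovery_should_stop` first returns true.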




Asynchronous scalable version of the Global-Local non-invasive coupling

The Global-Local non-invasive coupling is an improvement of the submodel...

Soft Error Resilience and Failure Recovery for Continuum Dynamics Applications

The persistently growing resilience concerns of large-scale computing sy...

Near-zero Downtime Recovery from Transient-error-induced Crashes

Due to the system scaling, transient errors caused by external noises, e...

Failure Analysis and Quantification for Contemporary and Future Supercomputers

Large-scale computing systems today are assembled by numerous computing ...

Error estimation and adaptivity for differential equations with multiple scales in time

We consider systems of ordinary differential equations with multiple sca...

NVM-ESR: Using Non-Volatile Memory in Exact State Reconstruction of Preconditioned Conjugate Gradient

HPC systems are a critical resource for scientific research and advanced...

A virtual element-based flux recovery on quadtree

In this paper, we introduce a simple local flux recovery for 𝒬_k finite ...