Energy-efficient localised rollback after failures via data flow analysis

06/05/2018
by Kiril Dichev, et al.

Exascale systems will suffer failures hourly. HPC programmers rely mostly on application-level checkpointing and a global rollback to recover. In recent years, techniques that reduce the number of processes rolling back have been implemented via message logging. However, log-based approaches have weaknesses: they depend on complex modifications within an MPI implementation, and a full restart may still be required in the general case. To address the limitations of all log-based mechanisms, we return to checkpoint-only mechanisms, but advocate data-flow-driven recovery (DFR), a fundamentally different approach that relies on analysis of the data flow of iterative codes and on the well-known concept of data-flow graphs. We demonstrate the effectiveness of DFR for an MPI stencil code to optimise rollback and reduce the overall energy consumption by 10-12%. We also provide large-scale estimates for the energy savings of DFR compared to global rollback, which for stencil codes increase as n squared for a process count n.
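The idea of limiting rollback through data-flow analysis can be pictured with a small sketch. This is not the authors' algorithm; it is a minimal illustration under stated assumptions: a non-periodic 1D chain of processes, a nearest-neighbour stencil with one halo exchange per iteration, a checkpoint taken by every process at the same iteration, and surviving processes that keep their current state while recomputing older values. The function name `recompute_counts` and its parameters are hypothetical, and the count of recomputed iterations is only a crude proxy for energy.

```python
def recompute_counts(n_procs, failed, ckpt_iter, fail_iter):
    """Iterations each rank would recompute after `failed` loses its state.

    Assumption: the failed rank needs neighbour data for every lost
    iteration, a neighbour at distance d needs data from distance d + 1
    for one iteration fewer, and so on (a dependency cone).
    """
    lost = fail_iter - ckpt_iter  # iterations the failed rank must redo
    counts = []
    for rank in range(n_procs):
        distance = abs(rank - failed)  # hops from the failed rank
        counts.append(max(0, lost - distance))
    return counts


if __name__ == "__main__":
    n_procs, failed, ckpt_iter, fail_iter = 8, 3, 10, 13
    counts = recompute_counts(n_procs, failed, ckpt_iter, fail_iter)
    localised = sum(counts)                      # work with localised rollback
    global_rb = n_procs * (fail_iter - ckpt_iter)  # work if every rank rolls back
    print("per-rank recomputed iterations:", counts)
    print("localised:", localised, "global:", global_rb)
```

In this toy model, ranks far enough from the failure recompute nothing, whereas a global rollback makes every rank redo all lost iterations. The actual savings reported in the paper come from its own analysis and measurements, not from this model.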


