Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets

06/12/2019
by Kai Keller, et al.

High-performance computing (HPC) requires resilience techniques such as checkpointing to tolerate failures in supercomputers. As the number of nodes and the amount of memory in supercomputers keep increasing, the size of checkpoint data also grows dramatically, sometimes causing an I/O bottleneck. Differential checkpointing (dCP) aims to minimize the checkpointing overhead by writing only the data that has changed since the previous checkpoint. This is typically implemented at the memory-page level, sometimes complemented with hashing algorithms. However, such a technique cannot cope with datasets whose size changes at runtime. In this work, we present a novel dCP implementation with a new file format that allows fragmentation of protected datasets in order to support dynamic sizes. We identify dirty data blocks using hash algorithms. To evaluate the dCP performance, we ported the HPC applications xPic, LULESH 2.0 and Heat2D and analyzed their potential for reducing I/O with dCP and how this data reduction influences the checkpoint performance. In our experiments, we achieve reductions of up to 62
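The idea of identifying dirty blocks by hashing can be illustrated with a minimal Python sketch. It is not the implementation or file format described in the paper: the block size, the choice of SHA-256, and the names `block_hashes` and `dirty_blocks` are assumptions made purely for illustration. The dataset is split into fixed-size blocks, each block is hashed, and a block is treated as dirty if its hash differs from the one recorded at the previous checkpoint or if the block did not exist then (the dataset grew).

```python
import hashlib

# Hypothetical block size, kept tiny so the example is easy to follow.
BLOCK_SIZE = 16


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def dirty_blocks(data: bytes, old_hashes: list[bytes],
                 block_size: int = BLOCK_SIZE) -> tuple[list[int], list[bytes]]:
    """Compare the current block hashes with those of the previous checkpoint.

    Returns the indices of blocks that must be written, together with the
    new hash list to keep for the next checkpoint. Blocks beyond the end of
    `old_hashes` are new (the dataset grew) and are always reported as dirty.
    """
    new_hashes = block_hashes(data, block_size)
    dirty = [
        i for i, h in enumerate(new_hashes)
        if i >= len(old_hashes) or h != old_hashes[i]
    ]
    return dirty, new_hashes
```

On the first checkpoint `old_hashes` is empty, so every block is reported as dirty and written in full; later checkpoints write only the blocks whose indices are returned. A dataset that shrinks simply yields a shorter hash list, and how the checkpoint file records that is left out of this sketch.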


research
05/27/2021

Characterizing Impacts of Storage Faults on HPC Applications: A Methodology and Insights

In recent years, the increasing complexity in scientific simulations and...
research
04/15/2019

Performance Models for Data Transfers: A Case Study with Molecular Chemistry Kernels

With increasing complexity of hardwares, systems with different memory n...
research
01/10/2023

Exploring the Use of WebAssembly in HPC

Containerization approaches based on namespaces offered by the Linux ker...
research
06/14/2017

Towards Adaptive Resilience in High Performance Computing

Failure rates in high performance computers rapidly increase due to the ...
research
06/24/2019

EasyCrash: Exploring Non-Volatility of Non-Volatile Memory for High Performance Computing Under Failures

Emerging non-volatile memory (NVM) is promising for building future HPC....
research
02/01/2020

SciChain: Trustworthy Scientific Data Provenance

The state-of-the-art for auditing and reproducing scientific application...
research
03/10/2020

The Locus Algorithm IV: Performance metrics of a grid computing system used to create catalogues of optimised pointings

This paper discusses the requirements for and performance metrics of the...
