
Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets

by Kai Keller, et al.
Barcelona Supercomputing Center

High-performance computing (HPC) requires resilience techniques such as checkpointing to tolerate failures in supercomputers. As the number of nodes and the amount of memory in supercomputers keep increasing, the size of checkpoint data also grows dramatically, sometimes causing an I/O bottleneck. Differential checkpointing (dCP) aims to minimize the checkpointing overhead by writing only data differences. This is typically implemented at the memory-page level, sometimes complemented with hashing algorithms. However, such a technique cannot cope with datasets whose size changes dynamically. In this work, we present a novel dCP implementation with a new file format that allows fragmentation of protected datasets in order to support dynamic sizes. We identify dirty data blocks using hash algorithms. To evaluate the dCP performance, we ported the HPC applications xPic, LULESH 2.0 and Heat2D and analyzed both their potential for reducing I/O with dCP and how this data reduction influences checkpoint performance. In our experiments, we achieve reductions of up to 62
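The hash-based identification of dirty data blocks mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation or file format: the block size, the choice of SHA-256, and the function names `block_hashes` and `dirty_blocks` are assumptions made for illustration only. The sketch splits a protected buffer into fixed-size blocks, hashes each block, and reports the indices whose hash differs from the previous checkpoint. Blocks that did not exist before (because the dataset grew) are also reported as dirty.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size in bytes, not taken from the paper


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def dirty_blocks(data: bytes, old_hashes: list, block_size: int = BLOCK_SIZE):
    """Compare `data` against the hashes stored at the previous checkpoint.

    Returns (dirty, new_hashes): `dirty` lists the indices of blocks that
    changed or that are new because the dataset grew; `new_hashes` is the
    hash list to keep for the next checkpoint.
    """
    new_hashes = block_hashes(data, block_size)
    dirty = [
        i for i, h in enumerate(new_hashes)
        if i >= len(old_hashes) or h != old_hashes[i]
    ]
    return dirty, new_hashes
```

In this sketch only the blocks listed in `dirty` would need to be written at the next checkpoint. A dataset that shrinks simply yields a shorter `new_hashes` list; how the checkpoint file records such size changes is outside the scope of the example.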



