Towards Aggregated Asynchronous Checkpointing

12/04/2021
by   Mikaila J. Gossman, et al.

High-Performance Computing (HPC) applications need to checkpoint massive amounts of data at scale. Multi-level asynchronous checkpoint runtimes like VELOC (Very Low Overhead Checkpoint Strategy) are gaining popularity among application scientists for their ability to leverage fast node-local storage and to flush checkpoints independently to stable, external storage (e.g., parallel file systems) in the background. Currently, VELOC adopts a one-file-per-process flush strategy. This writes a large number of files to external storage, which overwhelms metadata servers and makes it difficult to transfer and access a checkpoint as a whole. This paper discusses the viability and challenges of designing aggregation techniques for asynchronous multi-level checkpointing. To this end, we implement two aggregation strategies, study their limitations, and propose a new aggregation strategy designed specifically for asynchronous multi-level checkpointing.
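To make the contrast with one-file-per-process flushing concrete, the following is a minimal sketch of one generic way to aggregate per-process checkpoints into a single shared file: each process is assigned a byte offset via an exclusive prefix sum over the checkpoint sizes. This is an illustration only, not VELOC's API and not any of the strategies studied in the paper. The names `exclusive_offsets`, `write_aggregated`, and `read_rank` are hypothetical, and the processes are simulated by a list of byte strings written sequentially, whereas a real runtime would have each process write its own region concurrently.

```python
from itertools import accumulate


def exclusive_offsets(sizes):
    """Exclusive prefix sum: the byte offset at which each rank's data starts."""
    return list(accumulate(sizes, initial=0))[:-1]


def write_aggregated(path, blobs):
    """Write all per-rank checkpoints (hypothetical byte blobs) into one file.

    Each rank's data lands in its own non-overlapping region, so the
    external storage sees a single file instead of one file per process.
    """
    offsets = exclusive_offsets([len(b) for b in blobs])
    with open(path, "wb") as f:
        for blob, offset in zip(blobs, offsets):
            f.seek(offset)
            f.write(blob)
    return offsets


def read_rank(path, offsets, sizes, rank):
    """Read back a single rank's checkpoint from the aggregated file."""
    with open(path, "rb") as f:
        f.seek(offsets[rank])
        return f.read(sizes[rank])
```

In this sketch the offsets must be known before any data is written, which implies a coordination step across processes. That kind of coordination is one reason aggregation is harder to combine with independent, background flushing than the one-file-per-process approach.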

research
03/03/2021

VELOC: VEry Low Overhead Checkpointing in the Age of Exascale

Checkpointing large amounts of related data concurrently to stable stora...
research
02/14/2020

Deploying large fixed file datasets with SquashFS and Singularity

Shared high-performance computing (HPC) platforms, such as those provide...
research
12/06/2022

DisTRaC: Accelerating High Performance Compute Processing for Temporary Data Storage

High Performance Compute (HPC) clusters often produce intermediate files...
research
05/14/2019

Knowledge-based multi-level aggregation for decision aid in the machining industry

In the context of Industry 4.0, data management is a key point for decis...
research
10/26/2021

BuffetFS: Serve Yourself Permission Checks without Remote Procedure Calls

The remote procedure call (a.k.a. RPC) latency becomes increasingly sign...
research
10/03/2022

HPC Storage Service Autotuning Using Variational-Autoencoder-Guided Asynchronous Bayesian Optimization

Distributed data storage services tailored to specific applications have...
research
08/03/2021

Energy Management in Data Centers with Server Setup Delay: A Semi-MDP Approximation

The energy management schemes in multi-server data centers with setup ti...
