JASS: A Flexible Checkpointing System for NVM-based Systems

01/27/2023
by   Akshin Singh, et al.

NVM-based systems are natural candidates for periodic checkpointing (or snapshotting). Checkpointing increases the reliability of the system, makes it more resilient to power failures, and reduces wasted work, especially in an HPC setup. The traditional line of thinking is to design a system that is conceptually similar to transactional memory, where updates are logged continuously to minimize the wasted work or, alternatively, the MTTR (mean time to recovery). Such “instant recovery” systems allow the system to recover from a point that is quite close to the point of failure. The penalty is a prohibitive number of additional writes to the NVM. In this paper we propose a paradigmatically different approach: we argue that in most practical settings, such as regular HPC workloads or neural network training, there is no need for such instant recovery. We can therefore afford to lose some work, take periodic software-initiated checkpoints, and still meet the goals of the application. The key benefit of our scheme is that it reduces write amplification (WA) substantially; this extends the life of NVMs by roughly the same factor. We go a step further and design an adaptive system that minimizes the WA for a given target checkpoint latency, and show that our control algorithm almost always performs near-optimally. Our scheme reduces the WA by 2.3-96% as compared to the nearest competing work.


