EasyCrash: Exploring Non-Volatility of Non-Volatile Memory for High Performance Computing Under Failures

06/24/2019
by Jie Ren, et al.

Emerging non-volatile memory (NVM) is promising for building future HPC systems. By leveraging the non-volatility of NVM used as main memory, we can restart an application from the data objects that remain on NVM after the application crashes. This paper explores this approach to handling failures in HPC, based on the observation that many HPC applications have good enough intrinsic fault tolerance. To improve the likelihood of successful recomputation with correct outcomes and negligible performance loss, we introduce EasyCrash, a framework that decides how to selectively persist application data objects during execution. Our evaluation shows that EasyCrash transforms 54% [...] cannot correctly recompute into the correct computation while incurring a negligible performance overhead (1.5% [...] application intrinsic fault tolerance, 82% [...] recompute. When EasyCrash is used with a traditional checkpoint scheme, it enables up to 24% [...]
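The idea in the abstract, restarting from whatever data objects were persisted and relying on the application's intrinsic fault tolerance to absorb the staleness, can be illustrated with a toy sketch. This is not EasyCrash's implementation: the Jacobi solver, the file used as a stand-in for NVM, and the names `persist`, `load`, `run`, and `demo` are all assumptions made for illustration. The state vector is persisted only every few iterations, a crash is simulated between two persist points, and the restarted run still converges because the iteration tolerates a slightly stale starting point.

```python
import os
import struct
import tempfile


def jacobi_step(x):
    # One Jacobi sweep for the diagonally dominant system
    #   4*x0 - x1 = 3,  -x0 + 4*x1 = 3   (exact solution: x0 = x1 = 1)
    return [(3 + x[1]) / 4, (3 + x[0]) / 4]


def persist(path, x):
    # Hypothetical stand-in for flushing a data object to NVM:
    # write the two doubles to a file and force them to stable storage.
    with open(path, "wb") as f:
        f.write(struct.pack("2d", *x))
        f.flush()
        os.fsync(f.fileno())


def load(path):
    # Read back the last persisted copy of the data object.
    with open(path, "rb") as f:
        return list(struct.unpack("2d", f.read()))


def run(path, x, iters, persist_every, crash_at=None):
    # Iterate, persisting x only every `persist_every` iterations.
    for i in range(1, iters + 1):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated crash")
        x = jacobi_step(x)
        if i % persist_every == 0:
            persist(path, x)
    return x


def demo():
    path = os.path.join(tempfile.mkdtemp(), "x.bin")
    persist(path, [0.0, 0.0])
    try:
        # Crash at iteration 13; the last persisted state is from iteration 10.
        run(path, [0.0, 0.0], iters=60, persist_every=5, crash_at=13)
    except RuntimeError:
        pass
    stale = load(path)  # restart from the stale persisted object
    return run(path, stale, iters=60, persist_every=5)


if __name__ == "__main__":
    print(demo())
```

Persisting less often lowers runtime overhead but leaves an older copy to restart from; deciding which objects to persist, and when, is the trade-off the abstract attributes to EasyCrash.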

research
01/14/2018

Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery

Efficient utilization of today's high-performance computing (HPC) system...
research
10/18/2020

Fault Tolerance for Remote Memory Access Programming Models

Remote Memory Access (RMA) is an emerging mechanism for programming high...
research
01/20/2020

BAASH: Enabling Blockchain-as-a-Service on High-Performance Computing Systems

The state-of-the-art approach to manage blockchains is to process blocks...
research
10/26/2018

Online Fault Classification in HPC Systems through Machine Learning

As High-Performance Computing (HPC) systems strive towards exascale goal...
research
09/02/2019

Algorithm-Based Fault Tolerance for Parallel Stencil Computations

The increase in HPC systems size and complexity, together with increasin...
research
06/12/2019

Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets

High-performance computing (HPC) requires resilience techniques such as ...
research
04/28/2020

Enabling EASEY deployment of containerized applications for future HPC systems

The upcoming exascale era will push the changes in computing architectur...
