Characterizing Impacts of Storage Faults on HPC Applications: A Methodology and Insights

05/27/2021
by Bo Fang, et al.

In recent years, increasingly complex scientific simulations and the demands of training large artificial intelligence models have required massive, fast data access, pushing high-performance computing (HPC) platforms to adopt more advanced storage infrastructure such as solid-state disks (SSDs). While SSDs offer high-performance I/O, the reliability challenges that HPC applications face under SSD-related failures remain unclear, particularly for failures that result in data corruption. The goal of this paper is to understand the impact of SSD-related faults on the behavior of complex HPC applications. To this end, we propose FFIS, a FUSE-based fault injection framework that systematically introduces storage faults at the application layer to model errors originating from SSDs. FFIS can plant different I/O-related faults into the data returned from the underlying file system, which enables investigation of the error resilience characteristics of scientific file formats. We demonstrate the use of FFIS with three representative real HPC applications, showing how each application reacts to data corruption, and provide insights into the error resilience of HDF5, a file format widely adopted by HPC applications.
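
The idea of planting a fault into data returned from a read can be illustrated with a minimal sketch. This is not the FFIS implementation: the abstract does not specify the fault models or interfaces, so the single bit-flip fault model, the function names `flip_bit` and `faulty_read`, and the `fault_rate` parameter are all assumptions chosen for illustration. The sketch wraps an ordinary file-like object rather than a FUSE file system.

```python
import random


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted (hypothetical helper)."""
    if not 0 <= bit_index < len(data) * 8:
        raise ValueError("bit_index out of range")
    buf = bytearray(data)
    # Locate the byte holding the target bit, then toggle that bit with XOR.
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def faulty_read(f, size: int, rng: random.Random, fault_rate: float) -> bytes:
    """Read up to `size` bytes from the file-like object `f`.

    With probability `fault_rate`, corrupt the returned buffer by flipping
    one randomly chosen bit. The assumed fault model is a single bit flip;
    a real injector would likely support several fault types.
    """
    data = f.read(size)
    if data and rng.random() < fault_rate:
        data = flip_bit(data, rng.randrange(len(data) * 8))
    return data
```

Passing a seeded `random.Random` makes an injection campaign repeatable, which helps when comparing how an application behaves with and without a given corruption.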

research
10/25/2017

A Pattern Language for High-Performance Computing Resilience

High-performance computing systems (HPC) provide powerful capabilities f...
research
06/12/2019

Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets

High-performance computing (HPC) requires resilience techniques such as ...
research
04/25/2022

NVM-ESR: Using Non-Volatile Memory in Exact State Reconstruction of Preconditioned Conjugate Gradient

HPC systems are a critical resource for scientific research and advanced...
research
10/24/2020

LCFI: A Fault Injection Tool for Studying Lossy Compression Error Propagation in HPC Programs

Error-bounded lossy compression is becoming more and more important to t...
research
10/26/2018

Online Fault Classification in HPC Systems through Machine Learning

As High-Performance Computing (HPC) systems strive towards exascale goal...
research
07/27/2020

A Machine Learning Approach to Online Fault Classification in HPC Systems

As High-Performance Computing (HPC) systems strive towards the exascale ...
research
12/22/2021

Survey the storage systems used in HPC and BDA ecosystems

The advancement in HPC and BDA ecosystem demands a better understanding ...