LCFI: A Fault Injection Tool for Studying Lossy Compression Error Propagation in HPC Programs

10/24/2020
by Baodi Shan, et al.

Error-bounded lossy compression is becoming increasingly important to today's extreme-scale HPC applications because of the ever-increasing volume of data they generate. It has been widely used for in-situ visualization, data stream intensity reduction, storage reduction, I/O performance improvement, checkpoint/restart acceleration, memory footprint reduction, etc. Although many works have optimized the compression ratio, quality, and performance of different error-bounded lossy compressors, no existing work attempts to systematically understand the impact of lossy compression errors on HPC applications due to error propagation. In this paper, we propose and develop a lossy compression fault injection tool called LCFI. To the best of our knowledge, this is the first fault injection tool that helps both lossy compressor developers and users systematically and comprehensively understand the impact of lossy compression errors on HPC programs. The contributions of this work are threefold: (1) We propose an efficient approach to inject lossy compression errors according to a statistical analysis of compression errors for different state-of-the-art compressors. (2) We build a fault injector that is highly applicable, customizable, and easy to use in generating top-down comprehensive results, and we demonstrate the use of LCFI. (3) We evaluate LCFI on four representative HPC benchmarks with different abstracted fault models and make several observations about error propagation and its impact on program outputs.
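To make the idea of injecting error-bounded compression errors concrete, the following is a minimal sketch and not LCFI's actual implementation. It assumes an absolute error bound and a uniform error distribution within that bound; both are illustrative assumptions, as are the function names `inject_bounded_error` and `max_abs_error`, which are hypothetical and do not come from the paper.

```python
import random


def inject_bounded_error(values, error_bound, rng=None):
    """Return a copy of `values` with each element perturbed by a random
    error drawn uniformly from [-error_bound, +error_bound].

    Assumption: a uniform distribution stands in for whatever error
    distribution a real compressor produces.
    """
    if error_bound < 0:
        raise ValueError("error_bound must be non-negative")
    rng = rng or random.Random()
    return [v + rng.uniform(-error_bound, error_bound) for v in values]


def max_abs_error(original, perturbed):
    """Largest pointwise absolute difference between two sequences."""
    return max(abs(a - b) for a, b in zip(original, perturbed))


if __name__ == "__main__":
    rng = random.Random(42)                # fixed seed for repeatability
    data = [0.5 * i for i in range(10)]    # stand-in for a program variable
    noisy = inject_bounded_error(data, 1e-3, rng)
    print("max abs error:", max_abs_error(data, noisy))
```

Perturbing a variable this way at a chosen point in a program, then comparing the final output against an unperturbed run, is one simple way to observe how a bounded error propagates.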

