Characterization and Comparison of Application Resilience for Serial and Parallel Executions

08/03/2018
by Kai Wu, et al.

Soft errors in exascale applications are a challenging problem in modern HPC. To quantify an application's resilience and vulnerability, HPC users widely adopt application-level fault injection. However, this is not easy, because users must inject a large number of faults to ensure statistical significance, especially for the parallel version of a program. A parallel execution is normally more complex and requires more hardware resources than the corresponding serial execution. It is therefore essential to be able to predict the error rate of a parallel application from its serial version. In this poster, we characterize fault patterns in serial and parallel executions. We find, first, that serial and parallel executions share common fault sources. Second, parallel execution also has some unique fault sources that serial execution lacks. These unique fault sources are important for understanding how fault patterns differ between serial and parallel executions.

