Visilence: An Interactive Visualization Tool for Error Resilience Analysis

01/02/2022
by   Shaolun Ruan, et al.
Singapore Management University

Soft errors have become one of the major concerns for HPC applications, as those errors can result in seriously corrupted outcomes, such as silent data corruptions (SDCs). Prior studies on error resilience have studied the robustness of HPC applications. However, it is still difficult for program developers to identify potential vulnerability to soft errors. In this paper, we present Visilence, a novel visualization tool to visually analyze error vulnerability based on the control-flow graph generated from HPC applications. Visilence efficiently visualizes the affected program states under injected errors and presents the visual analysis of the most vulnerable parts of an application. We demonstrate the effectiveness of Visilence through a case study.

1 Visilence

We introduce Visilence in two parts: its overall workflow and the generation pipeline of its visualization.

1.1 Overall workflow of Visilence

At a high level, Visilence needs three levels of abstraction: (a) a model that keeps the static and dynamic program states, (b) a format that allows systematic analysis of those states, and (c) a visualization tool with a friendly interface that lets users identify the code regions that are sensitive to errors. We define the Loop Sensitive Graph (LSG), generated from the dynamic traces, and the Critical Vector Graph (CVG), generated by accumulating multiple LSGs.

The workflow of Visilence proceeds as follows: (i) it takes an HPC program as input and conducts a statistical fault injection campaign on the application to generate a set of dynamic execution traces; (ii) it creates LSGs/CVGs from the obtained dynamic traces; and (iii) it implements a novel visualization system that takes the LSGs/CVGs as its data source and provides a fine-grained representation of the application's error propagation and resilience characteristics.
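Step (ii) can be illustrated with a minimal sketch. It assumes that a trace is a list of basic-block labels in execution order, that an LSG edge weight is the absolute difference in edge execution counts between a faulty run and the golden run, and that a CVG sums the weights of several LSGs. The function names and the summation rule are hypothetical choices for illustration, not the tool's actual implementation.

```python
from collections import Counter

def edge_counts(trace):
    """Count how often each control-flow edge (src, dst) occurs in a trace."""
    return Counter(zip(trace, trace[1:]))

def build_lsg(golden_trace, faulty_trace):
    """Assumed LSG: per-edge absolute difference in execution counts."""
    golden = edge_counts(golden_trace)
    faulty = edge_counts(faulty_trace)
    edges = set(golden) | set(faulty)
    return {edge: abs(faulty[edge] - golden[edge]) for edge in edges}

def build_cvg(lsgs):
    """Assumed CVG: edge-wise sum of weights over several LSGs."""
    cvg = Counter()
    for lsg in lsgs:
        cvg.update(lsg)  # Counter.update adds the counts of a mapping
    return dict(cvg)

# Illustrative traces: the faulty runs execute the loop edge less often.
golden = ["head", "b1", "b1", "b1", "tail"]
faulty_1 = ["head", "b1", "b1", "tail"]
faulty_2 = ["head", "b1", "tail"]

lsg_1 = build_lsg(golden, faulty_1)
lsg_2 = build_lsg(golden, faulty_2)
cvg = build_cvg([lsg_1, lsg_2])
```

In this toy input only the loop edge `("b1", "b1")` diverges, so it is the only edge with a non-zero weight in either LSG and in the accumulated CVG.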

1.2 Generation Pipeline of Visualization

Visilence has two modules, the function-selection module and the graph module, which work together to support a basic-block-level visualization. The pipeline is shown in Fig. 1. The visualization system generates the resilience graph in two separate stages: layout simulation and anomaly mapping.

Figure 1: The workflow of our visualization system

We implement a user-friendly interface to visualize error propagation and functions interactively (see the overview figure of the interface). The interface consists of four parts:

  • Function View (a) is a sequence of functions, each represented by a dot. The functions are placed in the order in which they are defined. A dot is green when the function's graph matches that of the golden run exactly, and red when their edge weights differ. The triangle on the sequence marks the function where the fault is injected.

  • Graph View (b) shows the Loop Sensitive Graph/Critical Vector Graph. The vertices of the graph are basic blocks, and the head (in yellow) and tail (in red) nodes are the entry and exit of the function, respectively. Each edge represents a connection between two basic blocks in the CFG, and its weight is the absolute difference in the number of times the edge is executed between the faulty run and the golden run. An edge is gray when its weight is zero and red otherwise. Two options are available above the graph: Global view and Filter.

  • Weight Threshold (c) is used to filter the edges. Sliding the bar adjusts the threshold value, and edges whose weights fall below the threshold are rendered in gray.

  • Function List (d) lists all the functions in the program by name, in the same order as in Function View. Clicking on an entry selects the function to be shown in Graph View.
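The coloring rules described in the list above can be sketched as follows. This is a minimal illustration that assumes an LSG is stored as a mapping from edges to weights; the function names, the default threshold, and the sample weights are hypothetical.

```python
def edge_color(weight, threshold=0):
    """Gray when the weight is zero or below the threshold, red otherwise."""
    if weight == 0 or weight < threshold:
        return "gray"
    return "red"

def function_dot_color(lsg):
    """Green when every edge weight matches the golden run (all zero)."""
    return "green" if all(w == 0 for w in lsg.values()) else "red"

# Illustrative graph for one function (weights are made up).
lsg = {
    ("head", "bb1"): 0,
    ("bb1", "bb2"): 12,
    ("bb2", "tail"): 3,
}

colors_unfiltered = {edge: edge_color(w) for edge, w in lsg.items()}
colors_filtered = {edge: edge_color(w, threshold=5) for edge, w in lsg.items()}
dot = function_dot_color(lsg)
```

With the threshold raised to 5, the edge of weight 3 falls back to gray while the edge of weight 12 stays red; the function's dot remains red because the dot color in this sketch does not depend on the threshold.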

2 Case study

When a soft error occurs while a program is running, it may affect the subsequent control flow. Our tool can intuitively show how such an error propagates.
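As a toy illustration of how a soft error can alter control flow, the sketch below assumes a single-bit-flip fault model (a common way to model soft errors, though the text does not state which model Visilence uses) and a hypothetical loop whose trip count is the corrupted value.

```python
def flip_bit(value, bit):
    """Return value with the given bit position inverted."""
    return value ^ (1 << bit)

def run_loop(n):
    """Record the basic blocks visited by a loop that iterates n times."""
    trace = ["head"]
    for _ in range(n):
        trace.append("body")
    trace.append("tail")
    return trace

golden_trace = run_loop(4)
# Flipping bit 0 of the loop bound changes it from 4 to 5.
faulty_trace = run_loop(flip_bit(4, 0))
extra_iterations = len(faulty_trace) - len(golden_trace)
```

Here the corrupted bound makes the loop body execute one extra time, so the faulty trace differs from the golden trace even though no instruction was changed.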

The overview figure of the interface shows the error propagation pattern across the basic blocks of an example faulty run of CoMD [comd]. The series of dots at the top represents all 157 functions of CoMD. A green dot indicates that the LSG generated for that function is consistent with the golden run, while a red dot indicates that it is not. The “marker of fault injection” indicates the function in which the fault was injected.

The overview figure of the interface also presents an example LSG for the function ‘setVcm_omp_fn.o’ in the benchmark program CoMD. The function starts at the ‘head’ basic block and ends at the ‘tail’ basic block, with 12 basic blocks in total. The weights are the differences in execution counts between the golden run and the faulty run. The largest difference in this function is 351, found on the edges between two of its basic blocks. The affected path maps to Lines 126 to 129 of ‘initAtoms.c’, inside a for loop. We observed that 64 functions were affected by the injected fault.
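The two summary numbers reported in this case study, the count of affected functions and the largest edge weight within a function, could be derived from per-function LSGs along the following lines. This is a sketch under the assumption that each LSG is a mapping from edges to weights; the function names and data are placeholders, not CoMD results.

```python
def affected_functions(lsgs_by_function):
    """Names of functions whose LSG has at least one non-zero edge weight."""
    return [
        name
        for name, lsg in lsgs_by_function.items()
        if any(weight != 0 for weight in lsg.values())
    ]

def most_divergent_edge(lsg):
    """Return (edge, weight) for the edge with the largest weight."""
    return max(lsg.items(), key=lambda item: item[1])

# Placeholder data: two hypothetical functions, one unaffected.
lsgs_by_function = {
    "func_a": {("head", "bb1"): 0, ("bb1", "tail"): 0},
    "func_b": {("head", "bb1"): 0, ("bb1", "bb1"): 7, ("bb1", "tail"): 2},
}

affected = affected_functions(lsgs_by_function)
worst_edge, worst_weight = most_divergent_edge(lsgs_by_function["func_b"])
```

In this placeholder data only `func_b` is reported as affected, and its loop edge carries the largest weight.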

3 Conclusion

We proposed Visilence, a control-flow-graph-based visualization tool for error resilience analysis that gives human analysts a detailed view of error propagation to support further decision making. Visilence helps explain how applications are affected by errors through a graph-based abstraction that represents the affected program states and the reasons for error propagation across different error scenarios.

References