Modern scientific discovery is increasingly being driven by computation. In a growing number of areas where experimentation is either impossible, dangerous or costly, computing is often the only viable alternative for confirming existing theories or devising new ones. As such, High-Performance Computing (HPC) systems have become fundamental “instruments” for driving scientific discovery and industrial competitiveness. Exascale (10^18 operations per second) is the moonshot for HPC systems. Reaching this goal is bound to produce significant advances in science and technology through higher-fidelity simulations, better predictive models and analysis of greater quantities of data, leading to vastly improved manufacturing processes and breakthroughs in fundamental sciences ranging from particle physics to cosmology.
Future HPC systems will achieve exascale performance through a combination of faster processors and massive parallelism. With Moore’s Law having almost reached its limit, the only viable path towards higher performance is to switch from increased transistor density to increased core count, and thus increased socket count. Achieving exascale performance by increasing the number of cores (and consequently, the number of sockets) presents several major obstacles. With everything else being equal, the fault rate of a system is directly proportional to the number of sockets used in its construction. But everything else is not equal: exascale HPC systems will not only have many more sockets, they will also use advanced low-voltage technologies that are much more prone to aging effects, together with system-level performance and power modulation techniques, such as dynamic voltage frequency scaling, all of which tend to increase fault rates. Economic forces that push for building HPC systems out of commodity components aimed at mass markets only add to the likelihood of more frequent unmasked hardware faults. Finally, complex system software, often built using open-source components, to deal with more complex and heterogeneous hardware, fault masking and energy management, coupled with legacy applications, will significantly increase the potential for faults. It is estimated that large parallel jobs will encounter a wide range of failures as frequently as once every 30 minutes on exascale platforms. At these rates, failures will prevent applications from making progress. Consequently, exascale performance, although achieved nominally, cannot be sustained for the duration of most applications, which often run for long periods.
In the rest of the paper, we adopt the following terminology. A fault is defined as an anomalous behavior at the hardware or software level that can lead to illegal system states, called errors, and, in the worst case, to service interruptions, called failures. Future exascale HPC systems must include advanced automated mechanisms for either masking faults or recovering from them after they occur, so that computations can continue with minimal disruptions. This in turn requires detecting and classifying faults as soon as possible, since they are the root causes of errors and failures.
I-B Related Work
Automated fault detection through system performance metrics and fault injection have been the subject of numerous studies. Machine learning-based methods using fine-grain (i.e. sampling once per second) monitored data are however more recent. Tuncer et al. propose a machine learning-based framework for the diagnosis of performance anomalies in HPC systems; however, this work does not deal with faults that lead to errors and failures, which cause a disruption in the computation, but only with performance anomalies that result in longer runtimes for applications. Moreover, the data used to build the test dataset was not acquired continuously, but rather in small chunks related to single application runs. With such approaches, it is not possible to determine the feasibility of this method when dealing with streamed, continuous data from an online HPC system. A similar work is proposed by Baseman et al., which focuses on identifying faults in HPC systems through sensor data, specifically temperature sensors. Ferreira et al. instead analyze the impact of CPU interference on HPC applications by using a kernel-level noise injection framework. Both works deal with specific fault types, and are therefore limited in scope.
Other authors have focused on using coarser-grain (i.e. sampling once per minute) data or on reducing the dimension of data collected by monitoring frameworks, while retaining good detection accuracy. Bodik et al. [12, 13, 14] propose works focused on finding the correlations between performance metrics and fault types through a most relevant principal components method. Wang et al. propose a similar entropy-based outlier detection framework suitable for use in online systems. These frameworks, which are very similar to threshold-based methods, are not suitable for detecting the complex relationships that may exist between different performance metrics under certain fault scenarios.
One of the most notable works in threshold-based fault detection is the one proposed by Cohen et al., in which probabilistic models are used to estimate threshold values for system performance metrics in order to detect outliers. This type of approach requires constant human intervention to tune threshold values, and lacks flexibility.
In this paper, we propose and evaluate a supervised machine learning fault classification method suitable for online deployment in HPC systems. Our approach relies on a collection of performance metrics that are readily available in most HPC systems; from these metrics we extract a set of features that serve as input to our classifiers. The experimental results show that our method can classify almost perfectly several types of faults, ranging from hardware malfunctions and misconfigurations to software issues and bugs. Furthermore, classification can be achieved with little computational overhead and with minimal delay, thus meeting real time requirements. In our experiments we reproduce realistic operating conditions found in HPC systems where streamed live data is fed to online fault classifiers, both for training and for detection.
Our evaluation is based on a dataset that we acquired from an experimental HPC system (called Antarex) where we injected faults using a tool we previously developed. Making the Antarex dataset publicly available to other researchers is another contribution of this paper. Acquiring our own dataset for this study was made necessary by the fact that commercial HPC system operators are very reluctant to share trace data containing information about faults in their systems.
The remainder of the paper is organized as follows. In Section II, we describe the Antarex dataset that forms the basis for our study. In Section III, we discuss how we extracted features suitable for machine learning-based fault detection from the acquired data. In Section IV, we present our experimental results, while Section V concludes the paper.
II The Antarex Dataset
The Antarex dataset contains trace data collected from a homonymous experimental HPC system located at ETH Zurich while it was subjected to fault injections. The dataset is publicly available (https://zenodo.org/record/1453949) for use by the community. In this section, we first give an overview of the dataset, then describe the experimental setup associated with data acquisition, and finally discuss the features of the dataset.
II-A Dataset Overview
In order to acquire data, we executed benchmark applications and at the same time injected faults in the system at specific times via dedicated programs, so as to trigger anomalies in the behaviour of the applications. One type of data in the dataset refers to a series of CSV files, each containing a set of system performance metrics sampled through an HPC monitoring framework. Another type refers to the log files detailing the status of the system (i.e. currently running benchmark applications or injected fault programs) at each time point in the dataset. Such a structure enables researchers to perform a wide range of studies on the dataset. Moreover, since we collected the dataset by streaming continuous data, any study based on it will easily be reproducible on a real HPC system, in an online way.
The dataset is divided in two parts. The first part includes only the CPU and memory-related benchmark applications and fault programs, while the second is strictly hard drive-related. We executed each part in both single-core and multi-core variants. In the former, we executed all benchmark applications and fault programs on one specific core with one thread. In the latter, conversely, we executed benchmark applications with multiple threads on 8 of the 16 cores of the system, and executed fault programs freely on any of them. This structure resulted in 4 blocks of nearly 20GB of data in total, each block being obtained at different execution times, during an acquisition period of 32 days. The dataset structure is summarized in Table I. The related benchmark applications and fault programs will be explained in the following subsections.
Table I (dataset structure):
| | Block I | Block III | Block II | Block IV |
|---|---|---|---|---|
| Duration | 12 days | 12 days | 4 days | 4 days |
II-B Experimental Setup for Data Acquisition
The Antarex HPC node used for data acquisition consists of two Intel Xeon E5-2630 v3 CPUs, 128GB of RAM, a Seagate ST1000NM0055-1V4 1TB hard drive and runs the CentOS 7.3 operating system. The node has a default Tier-1 computing system configuration. In order to schedule the execution of the benchmark applications and to inject faults, we used the FINJ tool in a Python 3.4 environment, with its fault-injecting engine running on the target machine itself, and its orchestrating controller running on a remote host. In order to collect performance metrics from the target system for the duration of the experiment, we used the Lightweight Distributed Metric Service (LDMS) framework. Like FINJ, the sampler component of LDMS was running on the target node, while the collector component was running on a remote host. We configured LDMS to sample a variety of metrics once per second, which come from the following seven plug-ins:
meminfo collects general information on RAM usage;
perfevent collects CPU performance counters;
procinterrupts collects information on hardware and software interrupts;
procdiskstats collects statistics about hard drive usage;
procsensors collects metrics about CPU temperature and frequency;
procstat collects general metrics about CPU usage;
vmstat collects information about virtual memory usage.
This configuration resulted in a total of 2094 metrics collected at each second. Some of the metrics are system-wide, and describe the status of the system as a whole, others instead are core-specific and describe the status of a specific CPU core. Since there are 16 cores in our system, these metrics will have 16 instances as well, one for each core.
In order to minimize noise and bias in the sampled data, we chose to analyze, execute benchmarks and inject faults into only 8 of the 16 cores available in the system, and therefore used only one CPU. On the other CPU of the system, instead, we executed the FINJ and LDMS tools, which rendered their CPU overhead negligible.
II-C Features of the Dataset
The FINJ tool orchestrates the execution of benchmark applications and the injection of faults by means of a workload file, which contains a list of benchmark and fault-triggering tasks to be executed at certain times, on certain cores, for certain durations. For this purpose, we used several FINJ workload files, which were generated using FINJ’s built-in workload generator, one for each block of the dataset.
We used two statistical distributions in the workload generator to create the durations and inter-arrival times of the benchmark and fault-triggering tasks. We define the inter-arrival time as the interval between the start of two consecutive tasks. The benchmark tasks are characterized by rather simple duration and inter-arrival features. By using normal distributions, we ensured that 75% of the dataset’s duration is spent running benchmark applications. This resulted in regular benchmark tasks, having an average duration of 30 minutes and average inter-arrival times of nearly 40 minutes.
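The relation between the two averages and the 75% figure can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not FINJ's workload generator: the function names, the standard deviations of 2 minutes, and the rule capping each duration at the following gap are all hypothetical choices made for illustration.

```python
import random

def generate_benchmark_schedule(n_tasks, mean_duration=30.0, mean_interarrival=40.0,
                                std_duration=2.0, std_interarrival=2.0, seed=0):
    """Return a list of (start, duration) pairs, in minutes.

    Means follow the paper (30 and 40 minutes); the standard deviations
    are assumed values, since the paper does not report them.
    """
    rng = random.Random(seed)
    schedule = []
    start = 0.0
    for _ in range(n_tasks):
        # Inter-arrival time: interval between the starts of consecutive tasks.
        gap = max(1.0, rng.gauss(mean_interarrival, std_interarrival))
        # Assumed rule: cap the duration at the gap so tasks never overlap.
        duration = min(gap, max(0.0, rng.gauss(mean_duration, std_duration)))
        schedule.append((start, duration))
        start += gap
    return schedule

def busy_fraction(schedule):
    """Fraction of the covered time span during which a benchmark is running."""
    first_start = schedule[0][0]
    last_end = schedule[-1][0] + schedule[-1][1]
    busy = sum(duration for _, duration in schedule)
    return busy / (last_end - first_start)

if __name__ == "__main__":
    sched = generate_benchmark_schedule(2000)
    print(round(busy_fraction(sched), 3))
```

With these averages the busy fraction approaches 30/40 = 0.75 as the number of tasks grows, which is consistent with the figure quoted above.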
The fault-triggering tasks are modeled in a more complex way. In order to achieve a realistic behavior, we chose to generate faults in the workload using statistical distributions fitted from real historical data, rather than specifying them analytically. For this purpose, we used the Grid5000 trace available on the Failure Trace Archive, which includes the host failure records of the Grid5000 large-scale cluster for the period from May 2005 to November 2006. We extracted from this trace the inter-arrival times of the host failures. Such data was then scaled and shifted to obtain an average of 10 minutes, while retaining the shape of the distribution. We then fitted the data using an exponentiated Weibull distribution, which is commonly used to characterize failure inter-arrival times. To model durations, we extracted for all hosts the time intervals between successive absent and alive records. We then fitted a Johnson SU distribution over a cluster of the data present at the 5 minutes point, which required no alteration in the original data. This particular type of distribution was chosen because of the quality of the fitting.
In Figure 1, we show the histograms for the durations (a) and inter-arrival times (b) of the fault tasks in the workload files, together with the original distributions fitted from the Grid5000 data. We observe that the histograms differ slightly at the peaks, compared to the respective reference distributions. This is because the workload generator is allowed to manipulate the durations and inter-arrival times to ensure that faults cannot overlap in time.
We used a series of well-known benchmark applications to load the Antarex HPC node while acquiring the dataset, stressing different parts of the system and providing a diverse environment for fault injection. Since we limit our analysis to a single machine, we use versions of the benchmarks that rely on shared-memory parallelism, for example through the OpenMP library. The benchmark applications are the following:
DGEMM measures matrix-to-matrix multiplication performance;
HPC Challenge (HPCC) is a collection of benchmarks that stress both the CPU and memory bandwidth of an HPC system;
Intel distribution for High-Performance Linpack (HPL) measures performance in solving a system of linear equations;
STREAM measures a system’s memory bandwidth;
Bonnie++ measures HDD read-write performance;
IOZone measures HDD read-write performance.
We now discuss the fault programs that we implemented and used to reproduce anomalous conditions in the analyzed HPC system, which are available at the FINJ GitHub repository. As in , each fault program can operate in a high or low-intensity mode, thus doubling the number of possible fault conditions. The programs, together with the 8 distinct faults they generate and their effects, are the following:
leak periodically allocates 16MB arrays which are never released. In low-intensity mode, 4MB arrays are allocated. This program produces a memory leak fault, which leads to memory fragmentation and severe system slowdown when memory saturation is reached;
memeater allocates a 36MB array which is filled with integers. The size of the array is then periodically increased and new elements are filled in. The application restarts after 10 iterations. In low-intensity mode, an 18MB array is used. This program produces a memory interference fault by saturating memory bandwidth, resulting in degraded performance for running applications;
ddot repeatedly calculates the dot product between two equal-size matrices. The sizes of the matrices change periodically between 0.9, 5 and 10 times the cache’s size. In low-intensity mode, the size of the matrices is halved. This program produces a CPU and cache interference fault, resulting in degraded performance for all applications running on the same CPU as the program;
dial repeatedly generates random floating-point numbers and performs numerical operations over them. In low-intensity mode, the program sleeps for 0.5 seconds for each second of operation. This program produces an ALU interference fault, resulting in degraded performance for applications running on the same core as the program;
cpufreq decreases the maximum allowed CPU frequency by 50% of its original value through the Linux Intel P-State driver. In low-intensity mode, the maximum frequency is reduced by 30%. This program simulates a system misconfiguration or failing CPU fault, resulting in degraded performance for running applications;
pagefail makes any page allocation request fail with 50% probability, by using the Linux kernel’s fault injection framework. In low-intensity mode, page allocations fail with 25% probability. This program simulates a system misconfiguration or hardware malfunction fault, causing performance degradation and stalling of running applications;
ioerr triggers errors upon hard-drive I/O operations, again using the Linux kernel’s fault injection framework. One out of 500 I/O operations fails with 20% probability in high-intensity mode, and with 10% probability in low-intensity mode. This program simulates a failing hard drive fault, causing degraded performance for I/O-bound applications, as well as potential errors and crashes;
copy repeatedly writes and then reads back a 400MB file from a hard drive. After such a cycle, the program sleeps for 2 seconds. In low-intensity mode, a 200MB file is used. This program simulates an I/O interference or failing hard drive fault by saturating I/O bandwidth, and results in degraded performance for I/O-bound applications. Unlike ioerr, copy does not cause any I/O operations to fail and cause errors, but only slows them down, thus reproducing a different anomalous condition.
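To make the behavior of an interference fault concrete, the leak program described in the list above can be sketched as follows. This is a hedged illustration and not the code in the FINJ repository: the function signature, the 1-second allocation period and the `max_chunks` bound (added so the sketch can terminate) are assumptions.

```python
import time

def leak(chunk_mb=16, period_s=1.0, max_chunks=None):
    """Periodically allocate chunk_mb-megabyte buffers and never release them.

    chunk_mb=16 mirrors the high-intensity mode described in the text and
    chunk_mb=4 the low-intensity mode. period_s and max_chunks are assumed
    parameters: the paper only says that allocation is periodic.
    """
    leaked = []  # holding references keeps the buffers from being freed
    while max_chunks is None or len(leaked) < max_chunks:
        # Fill with a non-zero byte so that every page is actually written.
        leaked.append(bytearray(b"\x01") * (chunk_mb * 1024 * 1024))
        time.sleep(period_s)
    return leaked

if __name__ == "__main__":
    # Bounded demonstration: three small allocations, no waiting.
    buffers = leak(chunk_mb=1, period_s=0.0, max_chunks=3)
    print(sum(len(b) for b in buffers))
```

Leaving `max_chunks` unset makes the loop run until the process is killed, which is the situation in which memory saturation would eventually be reached.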
The faults triggered by our programs can be grouped in three categories according to their nature. The interference faults (i.e. leak, memeater, ddot, dial and copy) occur when orphan processes are left running in the system, saturating resources and slowing down the other processes. The misconfiguration faults occur when a component’s behavior is outside of its specification, due to a misconfiguration by the users or administrators (i.e. cpufreq). Finally, the hardware faults are related to physical components in the system that are about to fail, and trigger various kinds of errors (i.e. pagefail or ioerr). We note that some faults may belong to multiple categories, as they can be triggered by different factors in the system.
III Creation of Features
In this section, we describe the post-processing of the Antarex dataset in order to create features suitable for fault detection purposes. These features were fed to a machine learning-based fault detection system, whose architecture will also be defined here. All the scripts used to process the data are available on the FINJ GitHub repository.
III-A Post-Processing of Data and Structure of Feature Vectors
A set of features describing the state of the system for classification purposes was obtained from the metrics collected by LDMS. Firstly, a post-processing step was required. We removed all constant metrics: many acquired metrics never change (e.g., the amount of total memory in the system), and are therefore redundant. Additionally, we replaced the metrics captured by the perfevent and procinterrupts plug-ins with their first-order derivatives: since these metrics correspond to counters that can only increase over time, their derivatives indicate how many instances of the counted event occurred over the sampling period, resulting in a better representation. Furthermore, we created an allocated metric, at both CPU-core and node level, and integrated it in the original set. This metric has a binary value, and defines whether there is a benchmark allocated on the system or not. We name the metric allocated, as we cannot know whether the benchmark is actively using any resource or not. Using such a metric is reasonable, since in any HPC system there is always knowledge of which jobs have computational resources currently allocated to them. Lastly, for each metric above, at each time point, we added its first-order derivative to the dataset, similar to .
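The post-processing steps above can be sketched as follows. This is an illustrative sketch rather than the scripts in the FINJ repository: the function names, the `_deriv` suffix and the dictionary-of-lists data layout are assumptions, and the allocated metric is omitted for brevity. Since samples are taken once per second, the first-order derivative is approximated by the difference between consecutive samples.

```python
def drop_constant_metrics(samples):
    """Remove metrics whose value never changes.

    samples: dict mapping a metric name to its list of per-second values.
    """
    return {name: values for name, values in samples.items()
            if len(set(values)) > 1}

def first_order_derivative(values):
    """Difference between consecutive samples (0 for the first sample)."""
    if not values:
        return []
    return [0] + [b - a for a, b in zip(values, values[1:])]

def postprocess(samples, counter_metrics):
    """Apply the three steps: drop constants, convert counters, add derivatives.

    counter_metrics: names of monotonically increasing counters, such as
    those produced by the perfevent and procinterrupts plug-ins.
    """
    processed = {}
    for name, values in drop_constant_metrics(samples).items():
        if name in counter_metrics:
            # Counters are replaced by the number of events per sample.
            values = first_order_derivative(values)
        processed[name] = values
        # Every retained metric also gets its derivative as a new series.
        processed[name + "_deriv"] = first_order_derivative(values)
    return processed

if __name__ == "__main__":
    raw = {"mem_total": [8, 8, 8], "cycles": [10, 15, 25], "temp": [50, 52, 51]}
    print(postprocess(raw, counter_metrics={"cycles"}))
```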
Feature vectors were then created by aggregating the post-processed LDMS metrics. Each feature vector corresponds to a 60-second aggregation window and is related to a specific CPU core. The step between consecutive feature vectors is 10 seconds. Such a short aggregation window allows for high granularity and quick response times to faults. For each metric, we computed several indicators of the distribution of the values measured within the aggregation window, similar to. These are the average, standard deviation, median, minimum, maximum, skewness, kurtosis, and finally the 5th, 25th, 75th and 95th percentiles. This results in a total of 22 statistical features for each metric in the dataset, including those related to the first-order derivatives. The final feature vectors thus contain a total of 3168 elements. It should be noted that this number does not include the metrics collected by the procinterrupts plug-in. After a preliminary test, we found that these metrics are irrelevant for the detection of our faults, and therefore we decided to exclude them from the feature vectors.
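The aggregation can be sketched with the standard library alone. The following is a minimal sketch under assumptions, not the original processing scripts: the function names are hypothetical, and the population variance, the excess-kurtosis convention and linear percentile interpolation are choices the paper does not specify. Each input series yields 11 indicators, so a metric and its derivative together yield the 22 features mentioned above; dividing 3168 by 22 suggests 144 metrics per feature vector.

```python
import math

def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def indicators(window):
    """Return the 11 distribution indicators for one aggregation window."""
    n = len(window)
    s = sorted(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    if std > 0:
        skew = sum((x - mean) ** 3 for x in window) / n / std ** 3
        kurt = sum((x - mean) ** 4 for x in window) / n / std ** 4 - 3.0
    else:
        skew = kurt = 0.0  # assumed convention for constant windows
    return [mean, std, percentile(s, 50), s[0], s[-1], skew, kurt,
            percentile(s, 5), percentile(s, 25),
            percentile(s, 75), percentile(s, 95)]

def feature_vectors(metrics, window=60, step=10):
    """Slide a window over per-second series and concatenate the indicators.

    metrics: dict mapping a series name to a list of values; all lists are
    assumed to have the same length and a 1-second sampling period.
    """
    names = sorted(metrics)
    length = len(metrics[names[0]])
    vectors = []
    for end in range(window, length + 1, step):
        vec = []
        for name in names:
            vec.extend(indicators(metrics[name][end - window:end]))
        vectors.append(vec)
    return vectors

if __name__ == "__main__":
    demo = {"a": list(range(100)), "a_deriv": [0] + [1] * 99}
    fv = feature_vectors(demo)
    print(len(fv), len(fv[0]))
```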
In order to be able to train classifiers on the task of distinguishing between faulty and normal system states, we needed to assign labels to each feature vector above. For this, we used the FINJ output log, which includes status changes in the target system (e.g. a task starts or ends). This was converted into a label file mapping each timestamp in the dataset to the fault and benchmark tasks that were running on each core of the system at that moment. We then labeled the generated feature vectors according to the fault program running on the system within the respective aggregation window, or with the “healthy” label if no fault was running. Since the Antarex dataset is continuous, with faults and benchmarks executing freely, a single aggregation window may capture multiple system states, making labeling non-trivial. For example, a feature vector may contain “healthy” time points that are before and after the start of a fault, or could even include two different fault types. We define these feature vectors as ambiguous. By using a short aggregation window of 60 seconds, we aim to minimize the impact of such ambiguous system states on fault detection. However, these cannot be completely removed, so we experiment with the following two labeling methods for our feature vectors.
mode: all the labels from the label file that appear in the 60-second time window are considered. Their distribution is examined and the label appearing the most is used as the label for the feature vector. This leads to robust feature vectors, whose label is always representative of the aggregated data;
recent: the label is given by the state of the system at the most recent time point in the aggregation window. This could correspond to a fault type or could be “healthy”. Such an approach may lead to a more responsive fault detection system, where what is detected is the system state at the moment, rather than the state over the last 60 seconds.
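The two labeling methods, together with the notion of an ambiguous feature vector, can be sketched as follows. The function names are hypothetical, and the tie-breaking rule in the mode method (the label encountered first wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter

def label_mode(window_labels):
    """Most frequent label among the per-second labels of the window."""
    # most_common orders equal counts by first occurrence (assumed tie rule).
    return Counter(window_labels).most_common(1)[0][0]

def label_recent(window_labels):
    """Label of the most recent time point in the window."""
    return window_labels[-1]

def is_ambiguous(window_labels):
    """A window is ambiguous if it captures more than one system state."""
    return len(set(window_labels)) > 1

if __name__ == "__main__":
    # A fault starting 15 seconds before the end of a 60-second window.
    window = ["healthy"] * 45 + ["leak"] * 15
    print(label_mode(window), label_recent(window), is_ambiguous(window))
```

In this example the two methods disagree: mode still reports the window as healthy, whereas recent already reports the fault, which illustrates the trade-off between robustness and responsiveness described above.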
III-B Detection System Architecture
For our fault detection system, we adopted an architecture based on an array of classifiers, as displayed in Figure 2. Each classifier corresponds to a specific computing resource type in the system, such as CPU cores, GPUs, MICs, etc. Each classifier is then trained with feature vectors related to all resource units of that type, and is able to perform fault diagnoses for all of them. To achieve this, the feature vectors for each classifier contain all system-wide metrics for the system, together with resource-specific metrics for the resource unit being considered. This kind of architecture relies on the assumption that resource units of the same type behave in the same way, and that the respective feature vectors can be combined in a coherent set. However, users can also opt to use separate classifiers for each resource unit of the same type, overcoming this limitation, without any alteration to the feature vectors themselves.
Using an array of classifiers rather than one single classifier for the entire system gives us much more granularity and detail regarding the status of the system. In fact, having detailed information about the status of single components in an HPC node allows for much more intelligent management decisions, making possible, for example, fault-aware dispatching of jobs. In our case, the system only contains CPU cores. Therefore, we have one classifier trained with feature vectors that contain system-wide data together with core-level data, for one core at a time.
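The way feature vectors are assembled for a per-resource-type classifier can be sketched as follows. This is only an illustration of the idea of combining system-wide and core-level data; the function names and the data layout (a list of system-wide features plus a dictionary keyed by core identifier) are assumptions.

```python
def core_feature_vector(system_features, per_core_features, core_id):
    """System-wide features followed by the features of one specific core."""
    return list(system_features) + list(per_core_features[core_id])

def training_set(system_features, per_core_features, per_core_labels):
    """Build one (vector, label) pair per core for a single time point.

    All cores contribute to the same training set, reflecting the
    assumption that resource units of the same type behave alike.
    """
    X, y = [], []
    for core_id in sorted(per_core_features):
        X.append(core_feature_vector(system_features, per_core_features, core_id))
        y.append(per_core_labels[core_id])
    return X, y

if __name__ == "__main__":
    system = [0.7, 12.0]                  # e.g. node-level memory indicators
    cores = {0: [0.9, 3.1], 1: [0.1, 2.8]}  # e.g. per-core CPU indicators
    labels = {0: "dial", 1: "healthy"}
    print(training_set(system, cores, labels))
```

Using separate classifiers for each core, as mentioned above, would only change how the resulting pairs are grouped, not how each vector is built.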
IV Experimental Results
In this section, we present our experimental results for fault classification using the Antarex dataset. We first provide insight over the experimental methodology, and then present results in a variety of conditions.
IV-A Experimental Methodology
In order to understand the effectiveness of our fault detection method, we perform tests on the Antarex dataset with a variety of classifiers, trying to correctly detect faults that were injected in the system. The environment we used is Python 3.4, with the Scikit-learn package. We built the test dataset by combining the sets of feature vectors from different cores in a single set, which by default is in time-stamp order. In particular, to obtain feasible sizes, for each time point we pick the feature vector of only one randomly-selected core. The classifier will thus be trained with data from all cores, and will be able to compute fault diagnoses for any of them.
We chose the cross-validation methodology for the training and evaluation of classifiers. The dataset is split into 5 continuous folds of data, one of which is used for testing and the others for training. The process is repeated for all possible combinations of test and training sets, and average results are returned. The trained classifiers are evaluated for each class in terms of F-score, which corresponds to the harmonic mean between the precision and recall metrics.
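The per-class F-score used for evaluation can be written out explicitly. The sketch below is a plain implementation of the harmonic mean of precision and recall; the function name and the convention of returning 0 when a denominator vanishes are assumptions.

```python
def f_score(y_true, y_pred, cls):
    """F-score of class cls: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0  # assumed convention when the class is never matched
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    truth = ["healthy", "healthy", "leak", "leak", "leak"]
    predicted = ["healthy", "leak", "leak", "leak", "healthy"]
    print(f_score(truth, predicted, "leak"), f_score(truth, predicted, "healthy"))
```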
When not specified, feature vectors are read in time-stamp order, and only a small subset of the tests uses data shuffling for comparative purposes. Shuffling is widely used in machine learning as it can improve the quality of training data by removing the temporal correlations between successive samples. Moreover, performing shuffling often improves class balance among the different data folds. However, shuffling is not well suited to our fault detection framework, which is designed for an online HPC system. Training of classifiers in an online system can often be performed using only continuous, streamed, and potentially unbalanced data as it is acquired; at the same time, the training must be robust enough to correctly detect faults in the near future; hence it is very important to assess the detection accuracy without data shuffling. We reproduce this realistic, online scenario by performing cross-validation on the Antarex dataset using feature vectors in time-stamp order. Most importantly, time-stamp ordering results in folds each containing data from only one specific time frame. The impact of the ordering criteria for feature vectors on the generation of data folds used by the cross-validation algorithm is depicted in Figure 3.
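The difference between the two orderings can be made explicit by generating the fold indices by hand. The following sketch uses only the standard library and hypothetical function names; Scikit-learn's KFold class offers comparable behavior through its shuffle parameter, but the code below does not claim to reproduce the exact procedure used in the experiments.

```python
import random

def contiguous_folds(n_samples, n_folds=5):
    """Split indices 0..n_samples-1 into contiguous blocks (time-stamp order)."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def shuffled_folds(n_samples, n_folds=5, seed=0):
    """Same fold sizes, but indices are shuffled first, mixing time frames."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [[indices[j] for j in fold]
            for fold in contiguous_folds(n_samples, n_folds)]

def train_test_splits(folds):
    """Yield (train, test) index lists, using each fold once for testing."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

if __name__ == "__main__":
    for train, test in train_test_splits(contiguous_folds(10)):
        print(test, train)
```

With contiguous folds, each test set covers a single time frame that the classifier has never seen during training, which is the harder and more realistic condition for an online system.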
IV-B Comparison of Classifiers
We first compare different classifiers. For this experiment, we preserved the time-stamp order of the feature vectors and used the mode labeling method. The classifiers included in the comparison were Random Forest (RF), Decision Tree (DT), Linear Support Vector Classifier (SVC) and Neural Network (MLP) with two hidden layers, each having 1000 neurons. The results for each classifier and for each class are presented in Figure 4. In addition, the overall F-score is highlighted for each classifier. It can be seen that all classifiers show very good performance, with F-scores that are above the 0.9 threshold. RF is the best classifier, with an overall F-score of 0.98, followed by MLP and SVC scoring 0.93, then by DT with an F-score of 0.92. The critical point for all classifiers is represented by the pagefail and ioerr faults, which have substantially worse scores than the others.
It is clear from this experiment that RF would be the ideal classifier for an online fault detection system, due to its 5% better detection accuracy, in terms of F-score, over the other classifiers. There are other reasons as well motivating this choice: first, random forests are computationally efficient, and therefore would be suitable for use in online environments with strict overhead requirements. Moreover, it should be noted that unlike the MLP and SVC classifiers, RF and DT did not require data normalization. Normalization in an online environment is hard to achieve, as many metrics do not have well-defined upper bounds. To address this issue, a rolling window-based dynamic normalization approach was used in . This approach is infeasible for machine learning classification, as it can lead to quickly degrading detection accuracy and to the need for frequent retraining. Hence, in the following experiments we will use the RF classifier.
IV-C Comparison of Labeling Methods and Impact of Shuffling
Here we evaluate the two different labeling methods we implemented by using the RF classifier. The results for classification without data shuffling can be seen in Figures 5a for mode and 5b for recent, with overall F-scores of 0.98 and 0.96 respectively, being close to the ideal values. Once again, in both cases the ioerr and pagefail faults perform substantially worse than the others. This is probably because both faults have an intermittent nature, with their effects depending on the hard drive I/O (ioerr) and memory allocation (pagefail) patterns of the underlying applications, and are therefore more difficult to detect than the other faults.
An interesting behavior can be observed with the copy fault program, which shows worse F-scores when using the recent method in Figure 5b. As shown in Section IV-E, this fault is detected through the hard drive read rate metric, which is a comparatively slowly-changing metric. For this reason, a feature vector may be labeled as copy as soon as the program is started, before the metric has been updated to reflect the new system status. This in turn makes classification more difficult and leads to degraded accuracy. We conclude that recent may not be well suited for faults whose effects cannot be detected immediately.
In Figures 5c and 5d, the results with data shuffling enabled are presented for the mode and recent methods, respectively. Adding data shuffling produces a noticeable improvement in detection accuracy for both of the labeling methods, which show almost ideal performance for all fault programs, and overall F-scores of 0.99. Similar results were observed with the other classifiers presented in Section IV-B, not shown here for space reasons. It can also be seen that in this scenario, recent labeling performs slightly better for some fault types. This is likely due to the highly reactive nature of this labeling method, which can capture system status changes more quickly than the mode method. The greater accuracy (higher F-score) improvement obtained with data shuffling with recent labeling, compared to mode, indicates that the former is more sensitive to temporal correlations in the data, which may lead to erroneous classifications.
IV-D Impact of Ambiguous Feature Vectors
Here we examine the impact of ambiguous feature vectors in the dataset on the classification process, again using the RF classifier. As mentioned in Section III-A, ambiguous feature vectors are computed from aggregation windows that capture multiple system states, for example before and after the start of a fault program or an HPC application. Figure 6 shows the results obtained when excluding ambiguous feature vectors from the training and test sets, using both time-stamp (a) and shuffled ordering (b). In this case the labeling method used is irrelevant, since each feature vector's label is uniquely defined. Excluding ambiguous feature vectors leads to slightly better classification performance compared to the full dataset, reaching an overall F-score of 0.99. In particular, data shuffling enables ideal classification accuracy for most fault programs.
In the Antarex dataset, around 20% of the feature vectors are ambiguous. Given this relatively large proportion, the performance gap described above is small, which supports the robustness of our detection method. In general, the proportion of ambiguous feature vectors in a dataset depends primarily on the length of the aggregation window and on the frequency of state changes in the underlying HPC system. As the aggregation window's length increases, a growing share of feature vectors becomes ambiguous, leading to more pronounced adverse effects on the classification accuracy. Thus, as a practical guideline, we advise using short aggregation windows, such as the 60-second one we employed, to perform online fault detection.
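The dependence on window length can be sketched as follows. The example treats a window as ambiguous when it contains more than one distinct state, and the per-second state timeline, the 240-second comparison window and the function name are hypothetical; the resulting fractions are not meant to reproduce the proportion observed in the Antarex dataset.

```python
def ambiguous_fraction(states, window, step):
    """Fraction of sliding windows that contain more than one state."""
    starts = range(0, len(states) - window + 1, step)
    flags = [len(set(states[s:s + window])) > 1 for s in starts]
    return sum(flags) / len(flags)

# Hypothetical one-hour timeline with one sample per second,
# alternating between 300 s of healthy and 300 s of faulty operation.
states = (["healthy"] * 300 + ["fault"] * 300) * 6

frac_60 = ambiguous_fraction(states, window=60, step=10)
frac_240 = ambiguous_fraction(states, window=240, step=10)
print(f"60 s window: {frac_60:.2f}, 240 s window: {frac_240:.2f}")
```

For a fixed rate of state changes, the longer window straddles a state boundary much more often, which is the effect described above.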
A more concrete example of the behavior of ambiguous feature vectors can be seen in Figure 7, where we show the scatter plots of two important metrics (which will be discussed in Section IV-E) for the feature vectors related to the ddot, cpufreq and memeater fault programs, respectively. The “healthy” points, marked in green, and the fault-affected points, marked in red, are distinctly clustered in all cases. On the other hand, the points representing the ambiguous feature vectors, marked in blue, are sparse, and often fall right between the “healthy” and faulty clusters. This is particularly evident with the cpufreq fault program in Figure 7b, where the ambiguous points form a sparse cloud, outlining the transitions between the “healthy” and faulty states.
IV-E Estimation of the Most Important Metrics
We now give insights on the most important system performance metrics for the detection of fault programs. As shown in prior work, methods like principal component analysis, which keep only the metrics that account for most of the variance in the data, may leave out certain metrics that are important for fault detection. For this reason, we identify such metrics by using a DT classifier trained on the Antarex dataset and show them in Table II, along with their source LDMS plug-ins. The metrics marked in bold are per-core, while the others are system-wide. We chose a DT over an RF classifier because the latter is prone to reporting the same metric as important many times, with different statistical indicators. This can be attributed to its ensemble nature and to the subtle differences in the estimators that compose it.
Metrics from most of the available plug-ins are used, and some of these can be immediately associated with the effects of our faults. For instance, the context_switches metric is tied to the dial and ddot programs, as CPU interference results in an anomalous number of context switches. In general, we observe that the first-order derivatives (marked with the “der” suffix) are widely used by the classifier, which confirms their usefulness, and that statistical indicators like the skewness and kurtosis never appear, leading us to conclude that simple features may be sufficient for machine learning-based fault detection.
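As a rough illustration of such simple features, the sketch below computes an average over the raw samples of one metric and over their first-order differences. The feature names, the choice of indicator and the sample values are hypothetical and do not reproduce our actual feature set.

```python
from statistics import mean

def window_features(name, samples):
    """Simple per-window features for one metric (illustrative only)."""
    # First-order derivative, approximated by consecutive differences.
    diffs = [b - a for a, b in zip(samples, samples[1:])]
    return {
        f"{name}_avg": mean(samples),
        f"{name}_der_avg": mean(diffs),
    }

# Hypothetical context switch counts that jump when interference starts.
features = window_features("context_switches", [100, 100, 100, 400, 400, 400])
print(features)
```

The derivative-based feature reacts to the jump itself rather than to the absolute level, which is one plausible reason why such features are informative for a classifier.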
IV-F Remarks on Overhead
Quantifying the overhead of our fault detection framework is fundamental to proving its feasibility on a real online HPC system. The LDMS tool has been shown to have a very low overhead at high sampling rates, and is therefore expected to work well in an online context. We also assume that the generation of feature vectors and the classification are performed locally in each node, and that only the resulting fault diagnoses are sent externally.
We calculated that generating a set of feature vectors, one for each of the 8 analyzed cores in our test system, at a given time point for an aggregation window of 60 seconds takes on average 340 milliseconds by using a single thread. This value includes the I/O overhead related to reading and parsing LDMS CSV files, and to writing the output feature vectors, and is therefore expected to be even lower in a real system with direct access to streamed data. In our case, we consider a step of 10 seconds between feature vectors, but users can tune this value to obtain finer granularity, given the low overhead.
Performing classification for one feature vector using the RF classifier takes on average 2 milliseconds. This results in a total overhead of 342 milliseconds for generating and classifying feature vectors for each 60-second aggregation window, using a single thread. This value is extremely low and compatible with use in an online system, without altering the computation of HPC applications.
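To put these figures in perspective, the following sketch relates the reported timings to the 10-second step between feature vectors. The resulting busy fraction is a derived quantity that assumes one generation and classification pass per step on a single thread; it is not a separately measured result.

```python
# Timings reported above, in milliseconds.
GENERATION_MS = 340      # generating the feature vectors for one time point
CLASSIFICATION_MS = 2    # classifying one feature vector with the RF classifier
STEP_S = 10              # step between consecutive feature vectors

# Assumption: one generation + classification pass per step, on one thread.
total_ms = GENERATION_MS + CLASSIFICATION_MS
busy_fraction = total_ms / (STEP_S * 1000)

print(f"{total_ms} ms per step, {busy_fraction:.2%} of a single thread")
```

Under these assumptions the framework keeps a single thread busy for roughly 3.4% of the time, and a shorter step would increase this share proportionally.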
We have presented a fault detection and classification method based on machine learning techniques, targeted at HPC systems. Our method is designed for streamed, online data obtained from a monitoring framework, which is then processed and fed to classifiers. Due to the scarcity of public datasets containing detailed information about faults in HPC systems, we acquired the Antarex dataset and evaluated our method based on it. The dataset was generated by using the FINJ tool, which injected a wide variety of faults in an HPC node, including misconfiguration, hardware and software issues. The HPC node was at the same time monitored using the LDMS framework. The Antarex dataset is made publicly available with this paper, and can thus be used in future resiliency studies.
Results of our study show almost perfect classification accuracy for all injected fault types, with negligible computational overhead for HPC nodes. The detection delay is lower than one minute, thus allowing for quick corrective actions before the faults translate into system failures. Moreover, our study reproduces the operating conditions that could be found in a real online system, in particular those related to ambiguous system states and data imbalance in the training and test sets. We show that, while the impact of such factors on the classification accuracy is not negligible, it is small enough to warrant the feasibility of our machine learning method in a real, online HPC system.
As future work, we plan to deploy our fault detection framework in a large-scale real HPC system. This will involve the development of tools to aid online training of machine learning models, as well as integration into a lightweight and holistic monitoring framework such as Examon. We also need to better understand the behavior of our system in an online scenario. Specifically, since training is performed before the HPC nodes move into production (i.e. in a test environment), we need to characterize how often re-training is needed and devise a procedure to perform it, for example when the nodes are taken offline for maintenance.
Acknowledgements. A. Netti has been supported by a research fellowship from the Oprecomp-Open Transprecision Computing project. A. Sîrbu has been partially funded by the EU project SoBigData Research Infrastructure — Big Data and Social Mining Ecosystem (grant agreement 654024). We thank the Integrated Systems Laboratory of ETH Zurich for granting us control of their Antarex HPC node during this study.
-  O. Villa, D. R. Johnson, M. Oconnor, E. Bolotin, D. Nellans, J. Luitjens et al., “Scaling the power wall: a path to exascale,” in Proc. of SC 2014. IEEE, 2014, pp. 830–841.
-  F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer, and M. Snir, “Toward exascale resilience: 2014 update,” Supercomputing frontiers and innovations, vol. 1, no. 1, pp. 5–28, 2014.
-  K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller et al., “Exascale computing study: Technology challenges in achieving exascale systems,” Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep, vol. 15, 2008.
-  C. Engelmann and S. Hukerikar, “Resilience design patterns: A structured approach to resilience at extreme scale,” Supercomputing Frontiers and Innovations, vol. 4, no. 3, 2017.
-  W. M. Jones, J. T. Daly, and N. DeBardeleben, “Application monitoring and checkpointing in hpc: looking towards exascale systems,” in Proc. of ACM-SE 2012. ACM, 2012, pp. 262–267.
-  M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji, J. Belak, P. Bose, F. Cappello, B. Carlson et al., “Addressing failures in exascale computing,” The International Journal of High Performance Computing Applications, vol. 28, no. 2, pp. 129–173, 2014.
-  A. Gainaru and F. Cappello, “Errors and faults,” in Fault-Tolerance Techniques for High-Performance Computing. Springer, 2015, pp. 89–144.
-  O. Tuncer, E. Ates, Y. Zhang, A. Turk, J. Brandt, V. J. Leung, M. Egele, and A. K. Coskun, “Diagnosing performance variations in hpc applications using machine learning,” in Proc. of ISC 2017. Springer, 2017, pp. 355–373.
-  E. Baseman, S. Blanchard, N. DeBardeleben, A. Bonnie, and A. Morrow, “Interpretable anomaly detection for monitoring of high performance computing systems,” in Proc. of ACM SIGKDD Workshop 2016, 2016.
-  K. B. Ferreira, P. Bridges, and R. Brightwell, “Characterizing application sensitivity to os interference using kernel-level noise injection,” in Proc. of SC 2008. IEEE Press, 2008, p. 19.
-  P. Bodik, M. Goldszmidt, A. Fox, D. B. Woodard, and H. Andersen, “Fingerprinting the datacenter: automated classification of performance crises,” in Proc. of EuroSys 2010. ACM, 2010, pp. 111–124.
-  Z. Lan, Z. Zheng, and Y. Li, “Toward automated anomaly identification in large-scale systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 2, pp. 174–187, 2010.
-  Q. Guan, C.-C. Chiu, and S. Fu, “Cda: A cloud dependability analysis framework for characterizing system dependability in cloud computing infrastructures,” in Proc. of PRDC 2012. IEEE, 2012, pp. 11–20.
-  Q. Guan and S. Fu, “Adaptive anomaly identification by exploring metric subspace in cloud computing infrastructures,” in Proc. of SRDS 2013. IEEE, 2013, pp. 205–214.
-  C. Wang, V. Talwar, K. Schwan, and P. Ranganathan, “Online detection of utility cloud anomalies using metric distributions,” in Proc. of NOMS 2010. IEEE, 2010, pp. 96–103.
-  I. Cohen, J. S. Chase, M. Goldszmidt, T. Kelly, and J. Symons, “Correlating instrumentation data to system states: A building block for automated diagnosis and control,” in OSDI, vol. 4, 2004, pp. 16–16.
-  A. Netti, Z. Kiziltan, O. Babaoglu, A. Sirbu, A. Bartolini, and A. Borghesi, “FINJ: A fault injection tool for HPC systems,” in Proc. of Resilience Workshop 2018. Springer, 2018. [Online]. Available: https://github.com/AlessioNetti/fault_injector
-  D. Kondo, B. Javadi, A. Iosup, and D. Epema, “The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems,” in Proc. of CCGRID 2010. IEEE, 2010, pp. 398–407.
-  A. Agelastos, B. Allan, J. Brandt, P. Cassella, J. Enos, J. Fullop, A. Gentile, S. Monk, N. Naksinehaboon, J. Ogden et al., “The lightweight distributed metric service: a scalable infrastructure for continuous monitoring of large scale computing systems and applications,” in Proc. of SC 2014. IEEE, 2014, pp. 154–165.
-  “The fault trace archive.” [Online]. Available: http://fta.scem.uws.edu.au/
-  R. Bolze, F. Cappello, E. Caron, M. Daydé, F. Desprez, E. Jeannot et al., “Grid’5000: A large scale and highly reconfigurable experimental grid testbed,” The International Journal of High Performance Computing Applications, vol. 20, no. 4, pp. 481–494, 2006.
-  “The DGEMM benchmark.” [Online]. Available: http://www.nersc.gov/research-and-development/apex/apex-benchmarks/dgemm/
-  “The HPC challenge (HPCC) benchmark.” [Online]. Available: http://icl.cs.utk.edu/hpcc/
-  “The intel distribution for the LINPACK benchmark.” [Online]. Available: https://software.intel.com/en-us/mkl-windows-developer-guide-overview-of-the-intel-distribution-for-linpack-benchmark
-  “The STREAM benchmark.” [Online]. Available: https://www.cs.virginia.edu/stream/
-  “The bonnie++ benchmark.” [Online]. Available: https://www.coker.com.au/bonnie++/
-  “The IOzone benchmark.” [Online]. Available: http://www.iozone.org
-  “The intel p-state driver for linux.” [Online]. Available: https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt
-  “The linux fault injection infrastructure.” [Online]. Available: https://www.kernel.org/doc/Documentation/fault-injection/fault-injection.txt
-  F. Beneventi, A. Bartolini, C. Cavazzoni, and L. Benini, “Continuous learning of hpc infrastructure models using big data analytics and in-memory processing tools,” in Proc. of DATE 2017. IEEE, 2017, pp. 1038–1043.