
RUAD: unsupervised anomaly detection in HPC systems

08/28/2022
by   Martin Molan, et al.

The increasing complexity of modern high-performance computing (HPC) systems necessitates the introduction of automated and data-driven methodologies to support system administrators' effort toward increasing the system's availability. Anomaly detection is an integral part of improving the availability as it eases the system administrator's burden and reduces the time between an anomaly and its resolution. However, current state-of-the-art (SoA) approaches to anomaly detection are supervised and semi-supervised, so they require a human-labelled dataset with anomalies - this is often impractical to collect in production HPC systems. Unsupervised anomaly detection approaches based on clustering, aimed at alleviating the need for accurate anomaly data, have so far shown poor performance. In this work, we overcome these limitations by proposing RUAD, a novel Recurrent Unsupervised Anomaly Detection model. RUAD achieves better results than the current semi-supervised and unsupervised SoA approaches. This is achieved by considering temporal dependencies in the data and including long short-term memory cells in the model architecture. The proposed approach is assessed on a complete ten-month history of a Tier-0 system (Marconi100 from CINECA with 980 nodes). RUAD achieves an area under the curve (AUC) of 0.763 in semi-supervised training and an AUC of 0.767 in unsupervised training, which improves upon the SoA approach that achieves an AUC of 0.747 in semi-supervised training and an AUC of 0.734 in unsupervised training. It also vastly outperforms the current SoA unsupervised anomaly detection approach based on clustering, which achieves an AUC of 0.548.


1 Introduction

Recent trends in the development of high-performance computing (HPC) systems (such as heterogeneous architectures and higher-power integration density) have increased the complexity of their management and maintenance Shin et al. (2021). A typical contemporary HPC system consists of thousands of interconnected nodes; each node usually contains multiple accelerators such as graphical processors, FPGAs, and tensor cores Milojicic et al. (2021). Monitoring the health of all those subsystems is an increasingly daunting task for system administrators. To simplify this monitoring task and reduce the time between anomaly insurgency and response by the administrators, automatic anomaly detection systems have been introduced in recent years Netti et al. (2021).

Anomalies that result in downtime or unavailability of the system are expensive events. Their cost is primarily associated with the time when the HPC system cannot accept new compute jobs. Since HPC systems are costly and have a limited service lifespan Parnell et al. (2019), it is in the interest of the system’s operator to reduce unavailability times. Anomaly detection helps in this regard as it can significantly reduce the time between the fault and the response by the system administrator, compared to manual reporting of faulty nodes Borghesi et al. (2022).

Modern supercomputers are endowed with monitoring systems that give the system administrators a holistic view of the system Netti et al. (2021). Data collected by these monitoring systems, together with historical data describing system availability, are the basis for machine learning anomaly detection approaches Borghesi et al. (2019a, e); Netti et al. (2019a, b); Du et al. (2017), which build data-driven models of the supercomputer and its computing nodes. In this work, we focus on the CINECA Tier-0 HPC system (Marconi100 Iannone et al. (2018); Beske (2020), ranked 9th in the June 2020 Top500 list [51]), which employs a holistic monitoring system called EXAMON Bartolini et al. (2019).

Production HPC systems are reliable machines that generally have very few downtime events - for instance, in Marconi100 at CINECA, timestamps corresponding to faulty events represent, on average, only a small fraction of all data. However, although anomalies are rare events, they still significantly impact the system's overall availability - during the observation period, there was at least one active anomaly (unavailable node) for a considerable share of the time. State-of-the-art (SoA) methods for anomaly detection on HPC systems are based on supervised and semi-supervised approaches from the Deep Learning (DL) field Borghesi et al. (2022); for this reason, these methods require a training set with accurately annotated periods of downtime (or anomalies). In turn, this requires the monitoring infrastructure to track downtime events; in some instances, this can be done with specific software tools (e.g., Nagios Barth (2008)), but properly configuring these tools is a complex and time-consuming task for system administrators.

So far, the challenges of anomaly detection on HPC systems have been approached by deploying anomaly reporting tools and by training the models in a supervised or semi-supervised fashion Borghesi et al. (2022); Molan et al. (2021); Tuncer et al. (2018); Netti et al. (2019a). The need for an accurately labelled training set is the main limitation of current approaches, as collecting such labels is expensive in terms of system administrators' time and effort. Downtime tracking also has to be able to record failures with the same granularity as the other monitoring services; some setups in production HPC systems record downtime events only by date Shin et al. (2021); Milojicic et al. (2021); Netti et al. (2021). In most production HPC systems, accurate anomaly labelling is thus not readily achievable. For this reason, the majority of the methods from the literature were tested on historical or synthetic data, or on supercomputers where faults were injected in a carefully controlled fashion Netti et al. (2018). Another obstacle to the curation of an accurately labelled anomaly dataset is the short lifetime of most HPC systems: in the HPC sector, a given computing node and system technology have a lifetime of between three and five years. In practice, this short lifetime means that the vendor has no time to create a dataset for training an anomaly detection model before the system is deployed to the customer site.

A completely unsupervised anomaly detection approach could be deployed on a new node or even on an entirely new HPC system. It would then learn online and without any interaction with the system administrators. Additionally, such a system would be easier to deploy as it would require no additional framework to report and record anomalous events (in addition to the monitoring infrastructure needed to build the data-driven model of the target supercomputer - a type of infrastructure which is becoming more and more widespread in current HPC facilities Netti et al. (2021)).

Unsupervised anomaly detection approaches for HPC systems exist, such as Dani et al. (2017); Morrow et al. (2016); Bursic et al. (2019); they work on either log or sensor data. Approaches based on log data Dani et al. (2017); Bursic et al. (2019), while useful, can only offer a post-mortem and restricted view of the supercomputer state. The SoA for anomaly detection on sensor data Morrow et al. (2016) is based on clustering, which requires a degree of manual analysis from system administrators and offers poor performance compared to semi-supervised methods. The semi-supervised methods Borghesi et al. (2022, 2019a, 2019b), based on dense autoencoders trained to reproduce their input, could in principle be trained in an unsupervised fashion. However, none of the presented works has explored this possibility; according to the SoA, the models would perform worse, as the dense autoencoder is also capable of learning the characteristics of the anomalies Borghesi et al. (2022, 2019a, 2019b).

The primary motivation for this work is to propose a novel approach that relies only on the fact that anomalies are rare events and that works at least as well when trained in an unsupervised manner as when trained in a semi-supervised manner - this has not been the case in the current SoA. In this work, we propose an unsupervised approach, RUAD (Recurrent Unsupervised Anomaly Detection), that works on sensor data and outperforms all other approaches, including the current SoA semi-supervised approach Borghesi et al. (2022) and the SoA unsupervised approach Morrow et al. (2016). RUAD achieves this by taking into account temporal dependencies in the data: we use Long Short-Term Memory (LSTM) cells in the proposed neural network model structure, which explicitly consider the temporal dimension of the observed phenomena. We also show that the RUAD model, comprising LSTM layers, is capable of learning the characteristics of normal operation even if anomalous data is present in the training set - the RUAD model can thus be trained in an unsupervised manner. RUAD targets single HPC computing nodes: we train a different anomaly detection model for each computing node. The motivation behind this is scalability: each node can train its own model with minimal overhead; moreover, this strategy also works in larger supercomputers, as when the number of nodes increases, we just add new detection models.

1.1 Contributions of the paper

To recap, in this paper, we propose an anomaly detection framework that can handle complex system monitoring data, scale to large-scale HPC systems, and be trained even if no labelled dataset is available. The key contributions presented in this paper are:

  • We propose a completely unsupervised anomaly detection approach (RUAD) that exploits the fact that anomalies are rare and explicitly considers the temporal dependencies in the data by using LSTM cells in an autoencoder network. The resulting deep learning model outperforms the previous state-of-the-art semi-supervised approach Borghesi et al. (2022), which is based on time-unaware autoencoder networks. On the dataset presented and analysed in this paper (collected from the Marconi100 supercomputer), the previous approach achieves an Area-Under-the-Curve (AUC) test set score of 0.747. In contrast, our unsupervised approach achieves the best test set AUC score of 0.767. To the best of our knowledge, this is the first time such an approach has been applied to the field of HPC system monitoring and anomaly detection.

  • We have conducted a very large-scale experimental evaluation of our methods. We have trained four different deep learning models for each of the 980+ nodes of Marconi100. To the best of our knowledge, this is the largest scale experiment relating to anomaly detection in HPC systems, both in terms of the number of considered nodes and the length of time. Previous works only evaluate the models on a subset of nodes with a short observation time (Borghesi et al. (2022), for instance, only analyzed 20 nodes of the HPC system over two months). Per-node training also demonstrates the feasibility of per-node models for large HPC systems: the training time for an individual model was on the order of minutes on a single NVIDIA Volta V100 GPU.

1.2 Structure of the paper

We present the current state-of-the-art and position our paper in Section 2. The machine learning approaches used for anomaly detection, including our novel approach, are described in Section 3. The experimental setting for empirical validation of our results is detailed in Section 4.1, and our results are discussed in the rest of Section 4. Finally, Section 5 offers some concluding remarks.

2 Related Works

The drive to detect events or instances that deviate from the norm (i.e., operational anomalies) is present across many industrial applications. One of the earliest applications of anomaly detection models was credit card fraud detection in the financial industry Moschini et al. (2020); Ahmed et al. (2016). Recently, anomaly detection (and the associated predictive maintenance) has become relevant in manufacturing industries Lee et al. (2017); Rosa et al. (2021), the internet of things (IoT) Martins et al. (2022); Cauteruccio et al. (2021); Xu et al. (2020), the energy sector Fu et al. (2021), medical diagnostics Zhang et al. (2018); Astillo et al. (2022), IT security Salman et al. (2017), and even complex physics experiments Molan (2020).

Typically, anomalies in an HPC system refer to periods of (and leading to) suboptimal operating modes, faults that lead to failed or incorrectly completed jobs, or node and other component hardware failures. While HPC systems have several possible failure mitigation strategies Gamell et al. (2017) and fault tolerance strategies Meneses et al. (2015), anomalies of this type still significantly reduce the amount of compute time available to users Boixaderas et al. (2020). The transition towards exascale and the increasing heterogeneity of hardware components will only exacerbate the issues stemming from failures and anomalous conditions that already plague HPC machines Shin et al. (2021); Netti et al. (2021); Iuhasz and Petcu (2019). A DARPA study estimates that failures in future exascale HPC systems could occur as frequently as once every 35-39 minutes Bergman et al. (2008), significantly impacting supercomputer availability and system administrator load.

However, when looking at specific components rather than the entire HPC system (e.g., a single computing node), faults remain very rare events; detecting them thus falls under the area of anomaly detection, which can be seen as an extreme case of supervised learning on unbalanced classes Pang et al. (2020). Because data regarding normal operation far exceeds data regarding anomalies, classical supervised learning approaches tend to overfit the normal data and deliver sub-optimal performance on the anomalous data Pang et al. (2021). To mitigate the problem of unbalanced classes, the anomaly detection problem is typically approached from two angles: approaches found in the state-of-the-art (SoA) either modify the data Lemaître et al. (2017) or use specialized techniques that work well on anomaly detection problems Borghesi et al. (2022). Data manipulation approaches address the dataset imbalance either by decreasing the data belonging to normal operation (undersampling the majority class) or by oversampling or even generating anomalous data (oversampling the minority class) Lemaître et al. (2017). Data manipulation for anomaly detection in HPC systems has not yet been thoroughly studied; most existing approaches rely on synthetic data generation, e.g., injection of anomalies in real (non-production) supercomputers or HPC simulators Borghesi et al. (2022).

Another research avenue exploits the abundance of normal data from HPC systems using a different learning strategy, namely semi-supervised ML models. Instead of learning on a dataset containing multiple classes – and consequently learning the characteristics of all classes – semi-supervised models are trained only on the normal data. Hence, they learn the characteristics of the normal class (the majority class in the dataset). Anomalies are then recognized as anything that does not correspond to the learned characteristics of the normal class Pang et al. (2020); Borghesi et al. (2019a, d, b); Wu et al. (2021).

Regarding the type of data used to develop and deploy anomaly detection systems, we can identify two macro-classes: system monitoring data collected by holistic monitoring systems (e.g., Examon Bartolini et al. (2019)) and log data. This data is then annotated with information about system or node-level availability, thus creating a label associated with the data points; the label encodes whether the system is operating normally or experiencing an anomaly. Since it is expensive and time-consuming to obtain labelled system monitoring data, a labelled dataset for supervised learning can be obtained by "injecting" anomalies into the HPC system (as in Netti et al. (2018)). Labels are important for supervised, semi-supervised, and unsupervised approaches alike: in the first case, they are used to compute the loss; in the second, to identify the training and validation datasets; and in the third, only for validation. This data can then be used in a supervised learning task directly or after constructing new features (feature construction). Examples of this approach are Tuncer et al. (2017, 2018); Aksar et al. (2021a), where the authors use supervised ML approaches to classify performance variations and job-level faults in HPC systems. For fault detection, Netti et al. (2019a, 2018) propose a supervised approach based on Random Forest (an ensemble method based on decision trees) to classify faults in an HPC system. All the mentioned approaches use synthetic anomalies injected into the HPC system to train a supervised classification model. Borghesi et al. (2022) and Molan et al. (2021) are among the few that leverage real anomalies collected from production HPC systems (as opposed to injected anomalies). In this paper, we are interested in real anomalies; thus, we do not include methods using synthetic/simulated data or injected anomalies in our quantitative comparisons.

None of the mentioned approaches takes into account temporal dependencies in the data (models are not trained on time series but on tabular data containing no temporal information). Among system monitoring data approaches, Aksar et al. (2021c) is the first to take into account temporal dependencies in the data by calculating statistical features on the temporal dimension (aggregation, sliding-window statistics, lag features). Most approaches that deal with time series anomaly detection do so on system log data. Labelled anomalies are either analyzed with log parsers Baseman et al. (2016) or detected with deep learning methods; deep learning methods for anomaly detection are based on LSTM neural networks, as they are a proven approach in other text processing fields.

Compared to labelled training sets, much less work has been done on unlabelled datasets - despite this case being much more common in practice. So far, all research on unlabelled datasets has focused on system log data. Dani et al. (2017) propose a k-means based unsupervised learning approach that does not take into account the temporal dynamics of the log data. A clustering-based approach on sensor data is proposed by Morrow et al. (2016); this approach will serve as one of the baselines in the experimental section, as it is the only unsupervised approach working on sensor data rather than log data. The approach of Bursic et al. (2019) works on time series data in an unsupervised manner: it uses an LSTM-based autoencoder, is trained on an existing log dataset, and its performance is reported as the AUC (area under the receiver-operator characteristic curve). Although it works on a drastically different type of dataset (log data as opposed to system monitoring data), it is the closest existing work to the scope of the research presented in this paper. As we show later in the paper, we achieve much better results than those reported for the log data models Bursic et al. (2019) by deploying an unsupervised anomaly detection approach on system monitoring data on a per-node basis. Table 1 summarizes the most relevant approaches described in this section, focusing on the training set and temporal dependencies.

                | Tabular data                                | Time series
Supervised      | Aksar et al. (2021b); Netti et al. (2019b)  | Aksar et al. (2021c); Baseman et al. (2016); Du et al. (2017)
Semi-supervised | Borghesi et al. (2022, 2019a, 2019d, 2019b) | –
Unsupervised    | Dani et al. (2017); Morrow et al. (2016)    | Bursic et al. (2019)
Table 1: Summary of anomaly detection approaches on HPC systems

In relation to the existing works, the novelty of this paper is threefold:

  • it introduces an unsupervised time-series based anomaly detection model named RUAD;

  • it proposes a deep learning architecture that captures time dependency;

  • the approach is evaluated on a large scale production dataset with real anomalies – this is the largest scale evaluation ever conducted on this kind of problem, to the best of our knowledge.

3 Methodology

In this section, we describe the proposed approach for unsupervised anomaly detection. We do not directly introduce the proposed method (the LSTM autoencoder deep network), as we want to show how it is a significant extension of the current state-of-the-art; thus, we start by introducing three baseline methods: i) exponential smoothing (serving as the most basic method for comparison), ii) unsupervised clustering, and iii) the dense autoencoder used in Borghesi et al. (2022). We then describe our approach in detail and highlight its key strengths (the unsupervised training regime and the explicit inclusion of the temporal dimension).

3.1 Node anomaly labeling

We aim to recognize severe malfunctioning of a node that prevents it from executing regular compute jobs. This malfunctioning does not necessarily coincide with the removal of a node from production, as reported by Nagios. In our discussions with system administrators at CINECA, we have concluded that the best proxy for node availability is the most critical state reported by Nagios. For this reason, we have created a new label called node anomaly that has value 1 if any subsystem reported by Nagios reports a critical state. From these events (reported anomalies), we then filter out known false-positive events based on tests or configurations reported in Jira Wikipedia (2021b); Jira logs are supplied by CINECA. The labels used in our previous work Borghesi et al. (2022) do not apply to M100, as they were extensively used to denote nodes being removed from production for testing and calibration. In this work, we are examining the early period of the HPC machine life-cycle, when several rounds of re-configuration were performed, thus partially disrupting the normal production flow of the system. Comparing the two labelling strategies in Table 2, we can see that the overlap between them is minimal. Additionally, far fewer anomalies are reported by the node anomaly label, mainly because M100 went through substantial testing periods in its first ten months of operation, during which nodes were marked as removed from production while still functioning normally. In the remainder of the paper, class 0 or class 1 will always refer to the value of node anomaly being 0 or 1, respectively: normal data is all data where node anomaly has value 0, and anomalies are instances where node anomaly has value 1.

                                 | Node anomaly: 0 | Node anomaly: 1
Removed from production: False   | 12 139 560      | 4 280
Removed from production: True    | 15 783          | 12
Table 2: Comparison between removed from production and node anomaly. The anomalies studied in this work (node anomaly) differ significantly from (and are more reliable than) the anomalies studied in previous works. The new labels also mark far fewer events as anomalous.

3.2 Reconstruction error and result evaluation

The problem of anomaly detection can be formally stated as the problem of training a model \(M\) that estimates the probability \(\hat{p}\) that a sequence of feature vectors of length \(T\), ending at time \(t\), represents an anomaly at time \(t\):

\[ \hat{p}(\text{anomaly at } t) = M([\mathbf{x}_{t-T+1}, \dots, \mathbf{x}_{t}]) \qquad (1) \]

Vector \(\mathbf{x}_t\) collects all feature values at time \(t\); the features are the sensor measurements collected from the computing nodes. \(T\) is the size of the past window that the model takes as input. If the model does not take past values into account - like the dense model implemented as a baseline Borghesi et al. (2022) - the window size is \(T = 1\), and the problem simplifies to estimating:

\[ \hat{p}(\text{anomaly at } t) = M(\mathbf{x}_{t}) \qquad (2) \]

In the case of autoencoders, model \(M\) is composed of two parts: the autoencoder itself (a neural network \(AE\)) and the anomaly score, which is computed from the reconstruction error of the autoencoder. The reconstruction error is calculated by comparing the output of the autoencoder and the real value vector \(\mathbf{x}_t\). The task of the autoencoder is to reconstruct the last element of its input sequence:

\[ \hat{\mathbf{x}}_{t} = AE([\mathbf{x}_{t-T+1}, \dots, \mathbf{x}_{t}]) \qquad (3) \]

Vector \(\hat{\mathbf{x}}_t\) is the reconstruction of vector \(\mathbf{x}_t\). As in Eq. 2, the window size can be \(T = 1\). The model operates on normalized data. The reconstruction error is calculated as the sum of the absolute differences between the output of the autoencoder and the normalized input value for each feature: \(err_t = \sum_{i=1}^{F} |\hat{x}_{t,i} - x_{t,i}|\), where \(F\) is the number of features and \(\hat{x}_{t,i}\) is the \(i\)-th component of the autoencoder output. The error is then normalized by dividing it by the maximum error on the training set: \(\widetilde{err}_t = err_t / err_{max}\). We estimate the probability of class 1 (anomaly) as

\[ \hat{p}(\text{class } 1) = \widetilde{err}_t \qquad (4) \]

Based on the probability \(\hat{p}\), the classifier predicts whether the sequence belongs to class 1 (anomaly) or class 0 (normal operation). This prediction depends on a threshold \(\epsilon\), which is a tunable parameter:

\[ \text{class} = \begin{cases} 1 & \hat{p} \ge \epsilon \\ 0 & \hat{p} < \epsilon \end{cases} \qquad (5) \]

To avoid selecting a specific threshold \(\epsilon\), we use the Receiver-Operator Characteristic (ROC) curve as a performance metric. It allows us to evaluate the performance of the classification approach over all possible decision thresholds [46]. The ROC curve plots the true-positive rate against the false-positive rate. A random decision corresponds to a linear relationship between the two – for a classifier to be useful, its ROC curve needs to lie above the diagonal line. At each specific point on the curve, the better classifier is the one whose ROC curve lies above the other. The overall performance of the classifier can be quantified as the Area Under the ROC Curve (AUC); a classifier making random decisions has an AUC of 0.5, and AUC scores below 0.5 designate classifiers worse than random choice. The best possible AUC score is 1, achieved by a classifier with a true-positive rate of 1 at a false-positive rate of 0 (broadly speaking, this is only achievable on trivial datasets or very simple learning tasks).
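As a concrete illustration, the following sketch (ours, not from the paper) shows how this threshold-free evaluation can be computed with scikit-learn; the helper name and toy data are illustrative assumptions.

```python
# Minimal sketch (assumed helper, not from the paper): evaluating anomaly
# scores with the threshold-free ROC / AUC metric described above.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate_anomaly_scores(scores, labels):
    """scores: normalized reconstruction errors in [0, 1];
    labels: 1 for anomaly, 0 for normal operation."""
    auc = roc_auc_score(labels, scores)            # area under the ROC curve
    fpr, tpr, thresholds = roc_curve(labels, scores)
    return auc, fpr, tpr

# Toy usage: scores that separate the two classes well
scores = np.array([0.10, 0.20, 0.15, 0.90, 0.85])
labels = np.array([0, 0, 0, 1, 1])
auc, fpr, tpr = evaluate_anomaly_scores(scores, labels)
print(f"AUC = {auc:.3f}")  # 1.000 on this trivial example
```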

3.3 Trivial baseline: exponential smoothing

Exponential smoothing is implemented as a trivial baseline for comparison. It is a simple and computationally inexpensive method that detects rapid changes (jumps) in values and requires no training set. If the anomalies were simply rapid changes in values, with no correlation between features, a simple exponential smoothing method would be able to discriminate them; we therefore chose it as the first baseline. Additionally, if exponential smoothing performs poorly, this underlines that we are indeed solving a non-trivial anomaly detection problem, for which more powerful models are needed.

For this baseline, we apply exponential smoothing to each feature independently. The smoothed estimate for feature \(f\) at time \(t\) is calculated as:

\[ s_{t,f} = \alpha \, x_{t,f} + (1 - \alpha) \, s_{t-1,f} \qquad (6) \]

where \(s_{t,f}\) is the estimate of \(x_{t,f}\) at time \(t\) and \(\alpha\) is the smoothing parameter of the method. We do this for all features \(f\) in the feature set. The estimate at the beginning of the observation is equal to the actual value at time \(0\): \(s_{0,f} = x_{0,f}\).
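A minimal sketch of this baseline follows (our illustration; the \(\alpha\) value shown is an assumption, as the tuned value is discussed in Section 4.3):

```python
# Minimal sketch of the exponential-smoothing baseline (Eq. 6), applied
# per feature; alpha = 0.9 is an assumed value, not the paper's tuned one.
import numpy as np

def exp_smoothing_score(X, alpha=0.9):
    """X: array of shape (T, F) with T timestamps and F features.
    Returns an anomaly score per timestamp: the absolute deviation of
    each observation from the previous smoothed estimate, summed over
    features and normalized to [0, 1]."""
    s = np.empty_like(X, dtype=float)
    s[0] = X[0]                        # s_0 = x_0 (initial estimate)
    for t in range(1, len(X)):
        s[t] = alpha * X[t] + (1.0 - alpha) * s[t - 1]
    # deviation of each observation from the one-step-behind estimate
    err = np.abs(X[1:] - s[:-1]).sum(axis=1)
    return err / err.max()
```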

3.4 Unsupervised baseline: clustering

A possible approach to unsupervised anomaly detection is to use standard unsupervised machine learning techniques such as the k-means clustering proposed by Morrow et al. (2016). The clusters are determined on the train set; each new instance belonging to the test set is associated with one of the pre-trained clusters. We opted for this particular technique for comparison as it is, to the best of our knowledge, the only unsupervised method in the literature that uses sensor data rather than logs - thus guaranteeing a fair comparison. It has to be noted, however, that clustering, while belonging to the field of unsupervised machine learning, cannot detect anomalies in a fully unsupervised manner: for each of the clusters determined on the train set, the probability of an anomaly has to be calculated, and this probability can only be calculated using the labels.

In this work, the clustering approach inspired by Morrow et al. (2016) is implemented to validate the obtained results. We use k-means clustering Dani et al. (2017), as proposed in Morrow et al. (2016), and train the clusters on the train set. Based on the silhouette score on the train set (the silhouette score measures how similar an instance is to the others in its own cluster compared to instances from the other clusters Shahapure and Nicholas (2020); it is calculated as \(s = (b - a) / \max(a, b)\), where \(a\) is the mean intra-cluster distance and \(b\) is the mean nearest-cluster distance for each sample), we determine the optimal number of clusters for each node, i.e., the number of clusters that produces the highest silhouette score on the train set. The percentage of instances belonging to class 1 is then calculated for each of the determined clusters; we use this percentage of anomalous instances as the anomaly probability for every instance assigned to that cluster. The train and test split is the same as in all other evaluated methods.
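A compact sketch of this baseline (our illustration; the candidate range for \(k\) is an assumption), using scikit-learn:

```python
# Minimal sketch of the clustering baseline inspired by Morrow et al. (2016):
# choose k via the silhouette score on the train set, then use the per-cluster
# anomaly fraction (computed from the labels) as the anomaly probability.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def fit_clustering_baseline(X_train, y_train, k_candidates=range(2, 9)):
    best_km, best_sil = None, -1.0
    for k in k_candidates:                          # assumed candidate range
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
        sil = silhouette_score(X_train, km.labels_)
        if sil > best_sil:
            best_km, best_sil = km, sil
    # anomaly probability of each cluster = fraction of class-1 train points
    probs = np.array([y_train[best_km.labels_ == c].mean()
                      for c in range(best_km.n_clusters)])
    return best_km, probs

def predict_anomaly_prob(km, probs, X_test):
    # each test instance inherits the probability of its assigned cluster
    return probs[km.predict(X_test)]
```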

3.5 Semi-supervised baseline: dense autoencoder

The competitive baseline method is the current state-of-the-art dense autoencoder model proposed by Borghesi et al. (2022). Autoencoders are a type of neural network (NN) trained to reproduce their input. The network is split into two (most often symmetric) parts: an encoder and a decoder. The role of the encoder is to compress the input into a more condensed representation, called the latent layer. To prevent the network from learning a simple identity function, we choose the latent layer to be smaller than the original input size (the number of input features) Borghesi et al. (2019a). The role of the decoder is to reconstruct the original input from the latent representation.

Dense autoencoders are a common choice for anomaly detection since we can restrict their expressive power by acting on the size of the latent layer. Compressing the latent dimension forces the encoder to extract the most salient characteristics from the input data; unless the input data is highly redundant, the autoencoder cannot correctly learn to recreate its input beyond a certain latent size reduction. In the current state-of-the-art for anomaly detection in production supercomputers (Borghesi et al. (2022)), the dense autoencoder is used in a semi-supervised fashion, meaning that the network is trained using only data points corresponding to the normal operation of the supercomputer nodes (class 0). Semi-supervised training is doable as the normal points are the vast majority and thus readily available; however, it requires having labelled data, or at least certainty that the HPC system was operating in normal conditions for a sufficiently long period of time. Once the autoencoder has been trained using only normal data, it will be able to recognize similar but previously unseen points. Conversely, it will struggle to reconstruct new points which do not follow the learned normal behaviour - that is, the anomalies we are looking for - and hence the reconstruction error will be higher. The structure of the autoencoder model is presented in Figure 1(a). The dense autoencoder does not take into account the temporal dynamics of the data – its input and target output are the same vector:

\[ \hat{\mathbf{x}}_{t} = AE(\mathbf{x}_{t}) \qquad (7) \]
(a) Structure of the baseline model - the dense autoencoder.
(b) Structure of the proposed RUAD model, consisting of an LSTM encoder and a dense decoder.
Figure 1: The proposed approach replaces the encoder of the baseline model (1(a)) with an LSTM encoder (1(b)). The last layer of the LSTM encoder returns a vector (not a temporal sequence), which is then passed to the fully connected decoder. The symbols in the figure denote the window size \(W\), the size of the input data, the size of the latent layer, and the sizes of the encoder and decoder layers, respectively; the chosen values for these parameters are listed in Section 4.3.
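A minimal PyTorch sketch of such a dense autoencoder, using the layer sizes listed in Section 4.3 (462, 16, 8, 16, 462); the framework and activation choices are our assumptions, not the paper's:

```python
# Sketch of the dense autoencoder baseline; layer sizes follow Section 4.3,
# while the framework (PyTorch) and ReLU activations are assumptions.
import torch
import torch.nn as nn

class DenseAutoencoder(nn.Module):
    def __init__(self, n_features=462, enc_size=16, latent_size=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, enc_size), nn.ReLU(),
            nn.Linear(enc_size, latent_size), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_size, enc_size), nn.ReLU(),
            nn.Linear(enc_size, n_features),   # reconstruct the input vector
        )

    def forward(self, x):                      # x: (batch, n_features)
        return self.decoder(self.encoder(x))

# Anomaly score = L1 reconstruction error (Section 3.2), before normalization
def reconstruction_error(model, x):
    with torch.no_grad():
        return (model(x) - x).abs().sum(dim=1)
```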

3.6 Recurrent unsupervised anomaly detection: RUAD

Moving beyond the state-of-the-art model, we propose a different approach, RUAD. It takes as input a sequence of vectors and tries to reconstruct only the last vector in the sequence:

\[ \hat{\mathbf{x}}_{t} = M([\mathbf{x}_{t-W+1}, \dots, \mathbf{x}_{t}]) \qquad (8) \]

The input sequence length is a tunable parameter that specifies the size of the observation window \(W\). The idea of the proposed approach is similar in principle to the dense autoencoder, but with a couple of significant extensions: 1) we encode an input sequence into a more efficient representation (latent layer), and 2) we train the autoencoder in an unsupervised fashion (thus removing the requirement of labelled data). The key insight behind the first innovation is that while the data describing supercomputing nodes is composed of multivariate time series, the state-of-the-art does not explicitly consider the temporal dimension – the dense autoencoder has no notion of time or of sequences of data points. To overcome this limitation, our approach encodes the sequence of values leading up to the anomaly. The encoder network is composed of Long Short-Term Memory (LSTM) layers, which have often proven well suited to contexts where the temporal dimension is relevant Lindemann et al. (2021). An LSTM layer consists of recurrent cells that receive input from the previous timestamp and from the long-term memory.

To address the scale of current pre-exascale and future exascale HPC systems, which will consist of thousands of nodes Netti et al. (2021), we want a scalable anomaly detection approach. The most scalable approach for anomaly detection on a whole supercomputer is currently a node-specific one, as each compute node can train its own model. Still, we want to achieve this while minimally impacting the regular operation of the HPC system, so it is important for the proposed solution to have a small overhead. Additionally, since we train a per-node model, we want the method to be data-efficient. To address these requirements, we choose not to make the decoder symmetric to the encoder: the proposed model comprises an LSTM encoder and a dense decoder. The LSTM encoder output is passed into a dense decoder trained to reproduce the final vector of the input sequence; the decoder network is thus composed of fully connected dense layers. The architecture of the proposed approach is compared to the state-of-the-art approach in Figure 1.
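A minimal PyTorch sketch of this architecture (our illustration; layer sizes follow Section 4.3, while the framework and activation choices are assumptions):

```python
# Sketch of the RUAD architecture: an LSTM encoder whose final hidden state
# feeds a dense decoder that reconstructs only the last vector of the input
# window (Figure 1(b)). Layer shapes follow Section 4.3:
# (batch, W, 462) -> (batch, W, 16) -> (batch, W, 8) -> (batch, 16) -> (batch, 462)
import torch
import torch.nn as nn

class RUAD(nn.Module):
    def __init__(self, n_features=462, hidden=16, latent=8, dec_size=16):
        super().__init__()
        self.lstm1 = nn.LSTM(n_features, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden, latent, batch_first=True)
        self.decoder = nn.Sequential(           # dense, asymmetric decoder
            nn.Linear(latent, dec_size), nn.ReLU(),
            nn.Linear(dec_size, n_features),
        )

    def forward(self, x):                       # x: (batch, W, n_features)
        h, _ = self.lstm1(x)
        h, _ = self.lstm2(h)
        z = h[:, -1, :]                         # keep only the final time step
        return self.decoder(z)                  # reconstruction of x[:, -1, :]

# Training target is the last vector of each input sequence, e.g.:
# loss = nn.L1Loss()(model(batch), batch[:, -1, :])
```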

The reduced complexity of training allows us to train a separate model for each compute node. As shown previously Borghesi et al. (2019c), node-specific models provide better results than a single model trained on all data. We adopted this scheme (one model per node) after a preliminary empirical analysis showed no significant accuracy loss while the training time was vastly reduced; this is very important in our case, as we trained one DL model for each of the 980+ nodes of Marconi100 - definitely a non-negligible computational effort.

3.7 Data pre-processing

As introduced in Section 3.6, our proposed methodology consists of training a model for each node. Thus, the data from each node is first split into training and test sets: the training set contains the first 80% of the data, and the test set the remaining 20% (roughly the last two months of data). It is important to stress that the training and test datasets do not overlap. This avoids cross-transferring of information when dealing with sequences; moreover, the causality of the testing is preserved (no future data is used to train a model), which makes the results valid for in-practice usage.

For semi-supervised training, the training set is filtered by removing anomalous events (identified by the node anomaly label described in Section 3.1); we name this filter the semi-supervised filter, as depicted in Figure 2. For unsupervised learning, the training set is not filtered. In both cases (unsupervised and semi-supervised learning), labels are used to evaluate the results. After filtering, a scaler is fitted to the training data. A scaler is a transformer that scales the data to the [0, 1] interval; in the experimental part, a min/max scaler is applied to each feature Pedregosa et al. (2011). After fitting to the training data, the scaler is applied to the test data - for rescaling the test set, the min and max values of the training set are used (as is standard practice in DL methods). After scaling, both training and test sets are filtered to ensure time consistency: the data is split into sequences without missing chunks (missing chunks are the result of the semi-supervised filter). Sequences shorter than \(W\) are dropped. Finally, sequences are transformed into batches of sequences of length \(W\). Figure 2 describes the whole data pre-processing pipeline.
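The pipeline can be sketched as follows (our illustration: the chronological split, the min/max scaling fitted on the training set only, and the time-consistency filter follow the text, while the function names and the timestamp spacing `step` are assumptions):

```python
# Sketch of the per-node pre-processing pipeline of Figure 2.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def chronological_split(X, y, ts, train_frac=0.8):
    s = int(train_frac * len(X))
    return (X[:s], y[:s], ts[:s]), (X[s:], y[s:], ts[s:])

def semi_supervised_filter(X, y, ts):
    """Drop anomalous rows (node anomaly == 1); the gaps this creates are
    later excluded by the time-consistency filter below."""
    mask = y == 0
    return X[mask], y[mask], ts[mask]

def time_consistent_windows(X, ts, W, step):
    """Keep only sequences of W consecutive samples with no missing chunks,
    i.e., whose timestamps are exactly `step` apart."""
    seqs = [X[i:i + W] for i in range(len(X) - W + 1)
            if np.all(np.diff(ts[i:i + W]) == step)]
    return np.stack(seqs) if seqs else np.empty((0, W, X.shape[1]))

# Usage: the scaler is fitted on the (optionally filtered) training data only,
# then applied unchanged to the test data to avoid contaminating the test set.
# (X_tr, y_tr, ts_tr), (X_te, y_te, ts_te) = chronological_split(X, y, ts)
# X_tr, y_tr, ts_tr = semi_supervised_filter(X_tr, y_tr, ts_tr)  # optional
# scaler = MinMaxScaler().fit(X_tr)
# train_seqs = time_consistent_windows(scaler.transform(X_tr), ts_tr, W=10, step=900)
# test_seqs  = time_consistent_windows(scaler.transform(X_te), ts_te, W=10, step=900)
```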

Figure 2: Data processing schema. Data flow is represented in green (training set) and orange (test set). The scaler is fitted on the training set and then applied to the test set, to avoid contaminating the test set. The semi-supervised and time consistency filters are optional and applied only when required by the modelling approach, as indicated in Table 3.

3.8 Summary of evaluated methods

We compare our proposed approach, RUAD, against established semi-supervised and unsupervised baselines. A summary of the pre-processing filters is presented in Table 3. The semi-supervised filter is applied to all semi-supervised approaches. The time consistency filter is applied to the methods that explicitly consider the temporal dimension of the data: exponential smoothing and RUAD. RUAD and the current SoA anomaly detection approach based on dense autoencoders (Borghesi et al. (2022)) are evaluated in both semi-supervised and unsupervised versions.

Model                                       | Semi-supervised filter | Time consistency filter
Trivial baseline: exponential smoothing     | NO                     | YES
Unsupervised baseline: clustering           | NO                     | NO
Dense autoencoder baseline, semi-supervised | YES                    | NO
Dense autoencoder baseline, unsupervised    | NO                     | NO
RUAD, semi-supervised                       | YES                    | YES
RUAD, unsupervised                          | NO                     | YES
Table 3: Training strategies for the examined methods. The semi-supervised dense autoencoder baseline is the current SoA Borghesi et al. (2022).
Method                 | Training set required | Post-training
Exponential smoothing  | Unlabeled dataset     | No action required
Morrow et al. (2016)   | Unlabeled dataset     | Assigning anomaly probability to clusters
Borghesi et al. (2022) | Labeled dataset       | No action required
RUAD (this work)       | Unlabeled dataset     | No action required
Table 4: Comparison of the implemented approaches with respect to their training set requirements.

We wish to highlight that, unlike the unsupervised learning baseline Morrow et al. (2016), our proposed method RUAD requires no additional action after the model is trained: it works on an unlabeled dataset and requires no post-training analysis. A summary of the approaches' training set requirements is presented in Table 4.

4 Experimental results

4.1 Experimental setting

The focus of the experimental part of this work is the Marconi100 (M100) HPC system, located in the CINECA supercomputing centre. It is a Tier-0 HPC system consisting of 980 compute nodes organized into three rows of racks; each compute node has a 32-core CPU, 256 GB of RAM, and 4 NVIDIA V100 GPUs. In this work, the nodes of the HPC system are considered independent. This is in line with the current SoA works Netti et al. (2018); Borghesi et al. (2019a, 2022), where anomaly detection is performed per node. Future work will investigate inter-node dependencies in the anomaly detection task.

The monitoring system in an HPC setting typically consists of hardware monitoring sensors, job status and availability monitoring, and server room sensors. In the case of M100, hardware monitoring is performed by Examon Bartolini et al. (2019), and system availability is reported by Nagios, the tool used by the system administrators Barth (2008). This raw information provided by Nagios, however, contains many false-positive anomalies; for this reason, we have constructed a new anomaly label called node anomaly, described in Section 3.1.

For each of the 980 nodes of M100, a separate dataset was created (dataset details are given in Section 4.2), and models were trained and evaluated on the node-specific training and test sets of each node. The training set consisted of the first eight months of system operation, and the test set comprised the remaining two months; this split ensures a fair evaluation of the model, as described in Section 4.2. For the baseline, the exponential smoothing operation (defined in Equation (6)) was applied only over the test set (as the approach requires no training). For each node, the scaler (for min/max scaling) was fitted on the training data and applied to the test data. All results discussed in this section are combined results from all 980 nodes of M100.

The dense autoencoder and the RUAD model were trained in two different regimes: semi-supervised and unsupervised. For semi-supervised training, the semi-supervised filter was applied to remove all data points corresponding to anomalies; in the unsupervised case, no such filtering was performed. This highlights one of the key advantages of the unsupervised approach: no data pre-processing needs to be done, and no preliminary knowledge about the computing nodes' condition is required.

For all three approaches (exponential smoothing, the dense autoencoder, and RUAD), the probability of an anomaly (class 1) was estimated from the reconstruction error as explained in Section 3.2. The probabilities from the test sets of all nodes for a single modelling approach (e.g., RUAD with an observation window of length \(W\)) were collected together to plot the Receiver Operator Characteristic (ROC) curve characteristic of that modelling approach across all nodes. For the clustering baseline and exponential smoothing (the worst performing baselines), the ROC curve is compared against a dummy classifier that randomly chooses the class.

4.2 Dataset

Source              | Features
Hardware monitoring | ambient temp., dimm[0-15] temp., fan[0-7] speed, fan disk power, GPU[0-3] core temp., GPU[0-3] mem temp., gv100card[0-3], core[0-3] temp., p[0-1] io power, p[0-1] mem power, p[0-1] power, p[0-1] vdd temp., part max used, ps[0-1] input power, ps[0-1] input voltage, ps[0-1] output current, ps[0-1] output voltage, total power
System monitoring   | CPU system, bytes out, CPU idle, proc. run, mem. total, pkts. out, bytes in, boot time, CPU steal, mem. cached, stamp, CPU speed, mem. free, CPU num., swap total, CPU user, proc. total, pkts. in, mem. buffers, CPU idle, CPU nice, mem. shared, PCIe, CPU wio, swap free
Table 5: The anomaly detection model is created only on hardware and application monitoring features. More granular information regarding individual jobs is not collected, to ensure the privacy of the HPC system users.

The dataset used in this work combines information recorded by Nagios (the system administration tool used to visually check the health status of the computing nodes) with data from the Examon monitoring system; it encompasses the first ten months of operation of the M100 system. The procedure for obtaining the node anomaly label is described in Section 3.1, and the features collected in the dataset are listed in Table 5. The data covers 980 compute nodes and five login nodes. Login nodes have the same hardware as the compute nodes but are reserved primarily for job submission and accounting; we therefore removed them from our analysis. The data is collected by the University of Bologna with approval from CINECA (a public university consortium and the main supercomputing centre in Italy Wikipedia (2021a)).

In order to align the different sampling rates of the various reporting services (each sensor has its own sampling frequency), 15-minute aggregates of the data points were created. The 15-minute interval was chosen as it is the native sampling frequency of the Nagios monitoring service (the source of our labels). Four values were calculated for each 15-minute period and each feature: minimum, maximum, average, and variance.
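This alignment step can be sketched as follows (our illustration, assuming the raw readings are available as a timestamp-indexed table; the function name and column layout are illustrative):

```python
# Sketch of the 15-minute alignment step: per-sensor min/max/mean/variance
# aggregates, matching the four statistics described in the text.
import pandas as pd

def aggregate_15min(raw: pd.DataFrame) -> pd.DataFrame:
    """raw: DataFrame indexed by timestamp with one column per sensor.
    Returns min/max/mean/variance aggregates per 15-minute interval."""
    agg = raw.resample("15min").agg(["min", "max", "mean", "var"])
    # flatten the (sensor, statistic) column MultiIndex into single names
    agg.columns = [f"{sensor}_{stat}" for sensor, stat in agg.columns]
    return agg
```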

4.3 Hyperparameters

The hyper-parameters for all methods discussed in this paper were determined through an initial exploration on a small set of nodes; the chosen parameters were those that performed best (achieved the highest AUC score) on the test sets of the exploration nodes. Results from the initial exploration set are excluded from the results discussed in the rest of this section. The tuned hyperparameters include the structure of the neural networks (number and size of layers) and the smoothing factor of the exponential smoothing:

  • Exponential smoothing: smoothing factor \(\alpha\).

  • Clustering: the hyper-parameter \(k\) (number of clusters) is determined on the train set for each node independently.

  • Dense autoencoder: the network consists of layers of shapes (*, 462), (*, 16), (*, 8), (*, 16), (*, 462).

  • RUAD (LSTM encoder, dense decoder): the network consists of layers of shapes (*, W, 462), (*, W, 16), (*, W, 8), (*, 16), (*, 462), where \(W\) is the length of the observation window. The chosen window lengths were 5, 10, 20, and 40.

4.4 Exponential smoothing

As mentioned in the methodology, exponential smoothing (EXP) is implemented to demonstrate that the anomalies we observe are not simply unexpected spikes in the data signal; it is applied to each feature independently of the other features. As shown in Figure 3, exponential smoothing performs even worse than a dummy classifier (random choice). This poor performance shows that the anomalies we are searching for are more complex than simple jumps in the values of individual features.

Figure 3: Combined ROC curve from all 980 nodes of M100 for the exponential smoothing baseline. Exponential smoothing performs even worse than the dummy classifier - anomaly detection based on exponential smoothing is completely unusable.

4.5 Clustering

The simple clustering baseline performs better than the exponential smoothing baseline and better than the dummy classifier, as seen in Figure 4. However, as we illustrate in the following sections, it performs worse than all of the autoencoder-based methods. This demonstrates that the problem we are addressing (anomaly detection on an HPC system) requires more advanced methodologies such as semi-supervised and unsupervised autoencoders.

Figure 4: Combined ROC curve from all 980 nodes of M100 for the simple clustering baseline. This baseline performs only marginally better than the dummy classifier.

4.6 Dense autoencoder

Figure 5: Combined ROC curve from all 980 nodes of M100 for the dense autoencoder model. In the region of true positive rates most relevant for practical application, the semi-supervised approach outperforms the unsupervised approach.

We now consider the dense autoencoder, training a different network for each computing node of Marconi100. The optimal network topology was determined during a preliminary exploration on a sub-sample of the system's nodes, following the guidelines provided by Borghesi et al. (2019c). In line with the existing work Borghesi et al. (2022), the semi-supervised learning approach slightly outperforms the unsupervised learning approach, as seen in Figure 5. The better performance in the semi-supervised case is due to the nature of the autoencoder learning model - its capability to reconstruct its input. If the autoencoder is fed anomalous input during the training phase, as in the unsupervised case, the anomalous examples in the training data constitute a type of "noise" that renders the autoencoder partially capable of reconstructing the anomalous examples in the test set.

4.7 RUAD

Figure 6: Combined results from all 980 nodes of M100, comparing window lengths 5, 10, 20, and 40 for the RUAD model (one panel per window length). For all window lengths, the performance of the semi-supervised and unsupervised approaches is similar. The performance of the proposed model (red and blue lines) is compared to the state-of-the-art semi-supervised dense autoencoder baseline proposed by Borghesi et al. (2022).

This section examines the experimental results obtained with the RUAD model (the LSTM autoencoder). The most important parameter is the length of the input sequence passed to the model; it encodes our expectation about the length of the dependencies within the data. Since each data point represents 15 minutes of node operation, the actual observed period is \(W \times 15\) minutes. In this set of experiments, we selected the following time window sizes: 5 (75 minutes), 10 (2h30), 20 (5h), and 40 (10h). These period lengths were chosen after a preliminary empirical evaluation; moreover, these time frames are in line with the typical duration of HPC workloads, which tend to span from dozens of minutes to a few hours Calzarossa et al. (2016). We trained the model in both semi-supervised and unsupervised fashion for each selected window length. Results across all nodes are collected in Figure 6.

4.8 Comparison of all approaches

The main metric for evaluating model performance is the area under the ROC curve (AUC). This metric estimates a classifier's overall performance without the need to set a discrimination threshold [46]; the closer the AUC value is to 1, the better the classifier performs. AUC scores for the implemented methods are collected in Table 6. From the lower part of Table 6 (rows correspond to the different training regimes, columns to the window size of the RUAD network) and its upper part (rows correspond to the different implemented baselines), we see that the proposed approach outperforms the existing baselines. The highest AUC achieved by the previous baselines is 0.7470 (achieved by the semi-supervised dense autoencoder); this is outperformed by RUAD for all window sizes. The best performance of RUAD is achieved with window size 10, where it reaches an AUC of 0.7672. This result clearly shows that some temporal dynamics contribute to the appearance of anomalies.

The final consideration is the impact of the observation window length on the performance of the RUAD model. One might expect that considering longer time sequences would bring benefits, as more information is provided to the model to recreate the time series. This is, however, not the case (as seen in Table 6): RUAD achieves its best performance, an AUC of 0.7672, with window size 10, and performance then drops sharply for window size 40, reaching only an AUC of 0.7473. Several factors might explain this phenomenon. For instance, over tens of hours the workload on a given node might change drastically; considering longer time series might thus force the RUAD model to attend to multiple workloads, hindering its learning task. Another issue stems from the fact that there are gaps (periods of missing measurements) in the collected data - a very likely problem in many real-world scenarios. Longer sequences mean that more data has to be cut from the training set to ensure time-consistent sequences; since we do not apply gap-filling techniques (we decided not to consider such techniques for the moment, as we wanted to focus on the modelling approach, and gap-filling methods tend to require additional assumptions and to introduce noise in the data), sub-sequences missing some points have to be removed from the data set. The combination of these two factors contributes to the model's decline in performance with longer observation periods.

Considering all the discussed factors, the optimal approach is to use the proposed model architecture with window size 10 (i.e., 2 hours and 30 minutes), trained in an unsupervised manner. This configuration outperforms the semi-supervised variants as well as the dense autoencoder. As mentioned in the related work (Section 2), labelled datasets are expensive to obtain in the HPC setting; the good unsupervised performance is what makes this result promising - it shows that, as long as anomalies represent a small fraction of all data, we can train an anomaly detection model even on an unlabeled dataset (in an unsupervised manner). Such a model not only achieves state-of-the-art performance but outperforms the semi-supervised approaches: the best AUC achieved by the previous SoA (the semi-supervised dense autoencoder) is 0.7470, while the best AUC score achieved by RUAD is 0.7672. Moreover, unsupervised training makes this anomaly detection model more readily applicable to a typical HPC (or even datacentre) system.

Method                             | Combined AUC score
Exponential smoothing              | 0.4276
Clustering                         | 0.5478
Dense autoencoder, semi-supervised | 0.7470
Dense autoencoder, unsupervised    | 0.7344

Sequence length        | 5      | 10     | 20     | 40
RUAD, semi-supervised  | 0.7632 | 0.7582 | 0.7602 | 0.7446
RUAD, unsupervised     | 0.7651 | 0.7672 | 0.7655 | 0.7473
Table 6: As expected, the semi-supervised dense autoencoder outperforms the unsupervised dense one (higher AUC score). Both RUAD variants outperform all previous baselines; in contrast to the dense autoencoders, the proposed approach performs best when trained in an unsupervised manner.

5 Conclusions

The paper presents an anomaly detection approach for HPC systems (RUAD) that outperforms the current state-of-the-art approach based on dense autoencoders Borghesi et al. (2022). The improvement over the state-of-the-art is achieved by deploying a neural network architecture that considers the temporal dependencies within the data. The proposed model architecture achieves a highest AUC of 0.7672, compared to 0.7470, the highest AUC achieved by the dense autoencoders (on our dataset).

Another contribution of this paper is that the proposed method – unlike previous work Borghesi et al. (2022); Molan et al. (2021); Tuncer et al. (2018); Netti et al. (2019a) – achieves its best results in the unsupervised training case. Unsupervised training is instrumental as it makes it possible to deploy an anomaly detection model in cases where an (accurately) labelled dataset is unavailable. The only stipulation for the deployment of unsupervised anomaly detection models is that anomalies are rare – in our work, the anomalies accounted for only a small fraction of the data. The necessity of having few anomalies in the training set, however, is not a significant limitation, as HPC systems are already highly reliable machines with low anomaly rates Dongarra (2020); Shin et al. (2021).

To illustrate the capabilities of the proposed approach, we have collected an extensive and accurately labelled dataset describing the first ten months of operation of the Marconi100 system at CINECA Wikipedia (2021a). The creation of an accurately labelled dataset was necessary to rigorously compare the performance of the different models. Because of the high quality and large scale of the available dataset, we can conclude that, for the model proposed in this paper, the unsupervised variant outperforms the semi-supervised one even when accurate anomaly labels are available. This is the first experiment of this type and magnitude conducted on a real, in-production datacentre (both in terms of the number of computing nodes considered and the length of the observation period).

In future work, we will further explore the problem of anomaly detection in HPC systems, in particular discovering the root causes of anomalies - e.g., understanding why a computing node enters a failure state. We also plan to further extend and refine the collected dataset and make it available to the public, in accordance with the facility owners and the regulations about users' personal data (albeit not considered in this work, information about the users submitting jobs to the HPC system can indeed be collected). Moreover, in this work we focused on node-level anomalies; this was done for comparability with the state-of-the-art and for scalability purposes. In the future, we will explore the possibility of detecting systemic anomalies as well, i.e., anomalies involving multiple nodes at the same time. In this direction, the natural follow-up to the present work is to build hierarchical approaches which generate anomaly signals based on the composition of the signals generated by the node-specific detection models.

6 Acknowledgments

This research was partly supported by the EuroHPC EU PILOT project (g.a. 101034126), the EuroHPC EU Regale project (g.a. 956560), the EU H2020-ICT-11-2018-2019 IoTwins project (g.a. 857191), and the EU Pilot for exascale EuroHPC EUPEX project (g.a. 101033975). We also thank CINECA for the collaboration and access to their machines, and Francesco Beneventi for maintaining Examon.

References

  • M. Ahmed, A. N. Mahmood, and Md. R. Islam (2016) A survey of anomaly detection techniques in the financial domain. Future Generation Computer Systems 55, pp. 278–288.
  • B. Aksar, B. Schwaller, O. Aaziz, V. J. Leung, J. Brandt, M. Egele, and A. K. Coskun (2021a) E2EWatch: an end-to-end anomaly diagnosis framework for production HPC systems. In European Conference on Parallel Processing, pp. 70–85.
  • B. Aksar, B. Schwaller, O. Aaziz, V. J. Leung, J. Brandt, M. Egele, and A. K. Coskun (2021b) E2EWatch: an end-to-end anomaly diagnosis framework for production HPC systems. In Euro-Par 2021: Parallel Processing, L. Sousa, N. Roma, and P. Tomás (Eds.), Cham, pp. 70–85.
  • B. Aksar, Y. Zhang, E. Ates, B. Schwaller, O. Aaziz, V. J. Leung, J. Brandt, M. Egele, and A. K. Coskun (2021c) Proctor: a semi-supervised performance anomaly diagnosis framework for production HPC systems. In High Performance Computing, B. L. Chamberlain, A. Varbanescu, H. Ltaief, and P. Luszczek (Eds.), Cham, pp. 195–214.
  • P. V. Astillo, D. G. Duguma, H. Park, J. Kim, B. Kim, and I. You (2022) Federated intelligence of anomaly detection agent in IoTMD-enabled diabetes management control system. Future Generation Computer Systems 128, pp. 395–405.
  • W. Barth (2008) Nagios: system and network monitoring. No Starch Press.
  • A. Bartolini, F. Beneventi, A. Borghesi, D. Cesarini, A. Libri, L. Benini, and C. Cavazzoni (2019) Paving the way toward energy-aware and automated datacentre. In Proceedings of the 48th International Conference on Parallel Processing: Workshops, ICPP 2019, New York, NY, USA, pp. 1–8.
  • E. Baseman, S. Blanchard, N. DeBardeleben, A. Bonnie, and A. Morrow (2016) Interpretable anomaly detection for monitoring of high performance computing systems. In Outlier Definition, Detection, and Description on Demand Workshop at ACM SIGKDD, San Francisco, pp. 1–27.
  • K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, et al. (2008) Exascale computing study: technology challenges in achieving exascale systems. Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep. 15.
  • N. Beske (2020) UG3.2: Marconi100 user guide. Accessed: 2020-08-17.
  • I. Boixaderas, D. Zivanovic, et al. (2020) Cost-aware prediction of uncorrected DRAM errors in the field. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, Los Alamitos, CA, USA, pp. 1–15.
  • A. Borghesi, A. Bartolini, et al. (2019a) Anomaly detection using autoencoders in HPC systems. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 24–32.
  • A. Borghesi, A. Libri, et al. (2019b) Online anomaly detection in HPC systems. In 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems, pp. 229–233.
  • A. Borghesi, A. Bartolini, M. Lombardi, M. Milano, and L. Benini (2019c) A semisupervised autoencoder-based approach for anomaly detection in high performance computing systems. Engineering Applications of Artificial Intelligence 85, pp. 634–644.
  • A. Borghesi, A. Bartolini, M. Lombardi, M. Milano, and L. Benini (2019d) A semisupervised autoencoder-based approach for anomaly detection in high performance computing systems. Engineering Applications of Artificial Intelligence 85, pp. 634–644.
  • A. Borghesi, M. Milano, and L. Benini (2019e) Frequency assignment in high performance computing systems. In International Conference of the Italian Association for Artificial Intelligence, pp. 151–164.
  • A. Borghesi, M. Molan, M. Milano, and A. Bartolini (2022) Anomaly detection and anticipation in high performance computing systems. IEEE Transactions on Parallel and Distributed Systems 33 (4), pp. 739–750.
  • S. Bursic, A. D'Amelio, and V. Cuculo (2019) Anomaly detection from log files using unsupervised deep learning.
  • M. C. Calzarossa, L. Massari, and D. Tessera (2016) Workload characterization: a survey revisited. ACM Computing Surveys (CSUR) 48 (3), pp. 1–43.
  • F. Cauteruccio, L. Cinelli, E. Corradini, G. Terracina, D. Ursino, L. Virgili, C. Savaglio, A. Liotta, and G. Fortino (2021) A framework for anomaly detection and classification in multiple IoT scenarios. Future Generation Computer Systems 114, pp. 322–335.
  • M. Dani, H. Doreau, and S. Alt (2017) K-means application for anomaly detection and log classification in HPC. In Lecture Notes in Computer Science (LNAI, volume 10351), pp. 201–210.
  • J. Dongarra (2020) Report on the Fujitsu Fugaku system. University of Tennessee-Knoxville Innovative Computing Laboratory, Tech. Rep. ICL-UT-20-06.
  • M. Du, F. Li, G. Zheng, and V. Srikumar (2017) DeepLog: anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS '17, New York, NY, USA, pp. 1285–1298.
  • S. Fu, S. Zhong, L. Lin, and M. Zhao (2021) A re-optimized deep auto-encoder for gas turbine unsupervised anomaly detection. Engineering Applications of Artificial Intelligence 101, 104199.
  • M. Gamell, K. Teranishi, et al. (2017) Modeling and simulating multiple failure masking enabled by local recovery for stencil-based applications at extreme scales. IEEE Transactions on Parallel and Distributed Systems 28 (10).
  • F. Iannone, G. Bracco, C. Cavazzoni, et al. (2018) MARCONI-FUSION: the new high performance computing facility for European nuclear fusion modelling. Fusion Engineering and Design 129, pp. 354–358.
  • G. Iuhasz and D. Petcu (2019) Monitoring of exascale data processing. In 2019 IEEE International Conference on Advanced Scientific Computing (ICASC), pp. 1–5.
  • K. B. Lee, S. Cheon, and C. O. Kim (2017) A convolutional neural network for fault classification and diagnosis in semiconductor manufacturing processes. IEEE Transactions on Semiconductor Manufacturing 30 (2), pp. 135–142.
  • G. Lemaître, F. Nogueira, and C. K. Aridas (2017) Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research 18 (17), pp. 1–5.
  • B. Lindemann, T. Müller, H. Vietz, N. Jazdi, and M. Weyrich (2021) A survey on long short-term memory networks for time series prediction. Procedia CIRP 99, pp. 650–655.
  • I. Martins, J. S. Resende, P. R. Sousa, S. Silva, L. Antunes, and J. Gama (2022) Host-based IDS: a review and open issues of an anomaly detection system in IoT. Future Generation Computer Systems 133, pp. 95–113.
  • E. Meneses, X. Ni, et al. (2015) Using migratable objects to enhance fault tolerance schemes in supercomputers. IEEE Transactions on Parallel and Distributed Systems 26 (7), pp. 2061–2074.
  • D. Milojicic, P. Faraboschi, N. Dube, and D. Roweth (2021) Future of HPC: diversifying heterogeneity. In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 276–281.
  • M. Molan, A. Borghesi, F. Beneventi, M. Guarrasi, and A. Bartolini (2021) An explainable model for fault detection in HPC systems. In High Performance Computing, H. Jagode, H. Anzt, H. Ltaief, and P. Luszczek (Eds.), Cham, pp. 378–391.
  • M. Molan (2020) Pre-processing for anomaly detection on a linear accelerator. CERN openlab online summer intern project presentations.
  • A. Morrow, E. Baseman, and S. Blanchard (2016) Ranking anomalous high performance computing sensor data using unsupervised clustering. In 2016 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 629–632.
  • G. Moschini, R. Houssou, J. Bovay, and S. Robert-Nicoud (2020) Anomaly and fraud detection in credit card transactions using the ARIMA model. arXiv:2009.07578.
  • A. Netti, Z. Kiziltan, O. Babaoglu, A. Sîrbu, A. Bartolini, and A. Borghesi (2019a) A machine learning approach to online fault classification in HPC systems. Future Generation Computer Systems.
  • A. Netti, Z. Kiziltan, O. Babaoglu, A. Sîrbu, A. Bartolini, and A. Borghesi (2019b) Online fault classification in HPC systems through machine learning. In European Conference on Parallel Processing, pp. 3–16.
  • A. Netti, Z. Kiziltan, et al. (2018) FINJ: a fault injection tool for HPC systems. In European Conference on Parallel Processing, pp. 800–812.
  • A. Netti, W. Shin, M. Ott, T. Wilde, and N. Bates (2021) A conceptual framework for HPC operational data analytics. In 2021 IEEE International Conference on Cluster Computing (CLUSTER), pp. 596–603.
  • G. Pang, C. Shen, L. Cao, and A. V. D. Hengel (2020) Deep learning for anomaly detection: a review. ACM Computing Surveys.
  • G. Pang, C. Shen, L. Cao, and A. V. D. Hengel (2021) Deep learning for anomaly detection: a review. ACM Computing Surveys 54 (2).
  • L. A. Parnell, D. W. Demetriou, V. Kamath, and E. Y. Zhang (2019) Trends in high performance computing: exascale systems and facilities beyond the first wave. In 2019 18th IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm), pp. 167–176.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
  • Wikipedia (2021) Receiver operating characteristic. Wikimedia Foundation.
  • L. Rosa, T. Cruz, M. B. de Freitas, P. Quitério, J. Henriques, F. Caldeira, E. Monteiro, and P. Simões (2021) Intrusion and anomaly detection for the next-generation of industrial automation and control systems. Future Generation Computer Systems 119, pp. 50–67.
  • T. Salman, D. Bhamare, A. Erbad, R. Jain, and M. Samaka (2017) Machine learning for anomaly detection and categorization in multi-cloud environments. In 2017 IEEE 4th International Conference on Cyber Security and Cloud Computing (CSCloud).
  • K. R. Shahapure and C. Nicholas (2020) Cluster quality analysis using silhouette score. In 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), pp. 747–748.
  • W. Shin, V. Oles, A. M. Karimi, J. A. Ellis, and F. Wang (2021) Revealing power, energy and thermal dynamics of a 200PF pre-exascale supercomputer. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '21, New York, NY, USA, pp. 1–14.
  • Top500.org (2020) Top500 list. https://www.top500.org/lists/top500/2020/06/
  • O. Tuncer, E. Ates, et al. (2017) Diagnosing performance variations in HPC applications using machine learning. In International Supercomputing Conference, pp. 355–373.
  • O. Tuncer, E. Ates, Y. Zhang, et al. (2018) Online diagnosis of performance variation in HPC systems using machine learning. IEEE Transactions on Parallel and Distributed Systems.
  • Wikipedia (2021a) CINECA — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=CINECA&oldid=954269846 [Online; accessed 04-December-2021].
  • Wikipedia (2021b) Jira (software) — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Jira%20(software)&oldid=1052315603 [Online; accessed 04-December-2021].
  • P. Wu, C. A. Harris, G. Salavasidis, A. Lorenzo-Lopez, I. Kamarudzaman, A. B. Phillips, G. Thomas, and E. Anderlini (2021) Unsupervised anomaly detection for underwater gliders using generative adversarial networks. Engineering Applications of Artificial Intelligence 104, 104379.
  • R. Xu, Y. Cheng, Z. Liu, Y. Xie, and Y. Yang (2020) Improved long short-term memory based anomaly detection with concept drift adaptive method for supporting IoT services. Future Generation Computer Systems 112, pp. 228–242.
  • C. Zhang, D. Song, Y. Chen, X. Feng, C. Lumezanu, W. Cheng, J. Ni, B. Zong, H. Chen, and N. V. Chawla (2018) A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. CoRR abs/1811.08055.