Anomaly Detection in High Performance Computers: A Vicinity Perspective

06/11/2019
by Siavash Ghiasvand, et al.

In response to the demand for higher computational power, the number of computing nodes in high performance computers (HPC) is increasing rapidly. Exascale HPC systems are expected to arrive by 2020. With the drastic increase in the number of HPC system components, a sudden increase in the number of failures is expected, which in turn threatens the continuous operation of HPC systems. Detecting failures as early as possible and, ideally, predicting them is a necessary step toward avoiding interruptions in HPC system operation. Anomaly detection is a well-known general-purpose approach to failure detection in computing systems. The majority of existing methods are designed for specific architectures, require adjustments to the hardware and software of the computing system, need excessive information, or pose a threat to the privacy of users and systems. This work proposes a node failure detection mechanism based on a vicinity-based statistical anomaly detection approach using passively collected and anonymized system log entries. Application of the proposed approach to system logs collected over 8 months indicates an anomaly detection precision between 62
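To make the idea of vicinity-based statistical anomaly detection more concrete, here is a minimal sketch in Python. It is not the authors' method: the abstract does not say which statistic, features, or vicinity definition the paper uses. The sketch assumes that each node has a per-window count of log events, that a "vicinity" is simply a list of peer nodes, and that a robust z-score (median and median absolute deviation) with a cutoff of 3.5 is an acceptable outlier test. The function name `vicinity_outliers` and all parameters are hypothetical.

```python
from statistics import median


def vicinity_outliers(counts, vicinity, threshold=3.5):
    """Flag nodes whose event count deviates strongly from their vicinity.

    counts:    dict mapping node name -> number of log events in a window
               (assumed feature; the paper may use something else)
    vicinity:  dict mapping node name -> list of peer node names
               (assumed definition of "vicinity")
    threshold: cutoff on the robust z-score (assumed value)
    """
    flagged = []
    for node, peers in vicinity.items():
        if node not in counts:
            continue
        # Compare the node only against its peers, never against itself.
        values = [counts[p] for p in peers if p in counts and p != node]
        if len(values) < 3:
            continue  # too few peers for a meaningful comparison
        med = median(values)
        mad = median(abs(v - med) for v in values)
        # 1.4826 scales the MAD to match a standard deviation for normal data.
        # Fall back to 1.0 when all peers agree, to avoid dividing by zero.
        scale = 1.4826 * mad if mad > 0 else 1.0
        score = abs(counts[node] - med) / scale
        if score > threshold:
            flagged.append(node)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical event counts for five nodes sharing one vicinity.
    counts = {"n1": 10, "n2": 12, "n3": 11, "n4": 95, "n5": 9}
    vicinity = {name: list(counts) for name in counts}
    print(vicinity_outliers(counts, vicinity))
```

The median and MAD are used here instead of the mean and standard deviation so that a single misbehaving node does not distort the baseline of its own vicinity.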


