Online Anomaly Detection in HPC Systems

02/22/2019
by   Andrea Borghesi, et al.

Reliability is a cumbersome problem in the evolution of High Performance Computing systems and data centers. During operation, several types of fault conditions or anomalies can arise, ranging from malfunctioning hardware to improper configurations or imperfect software. Currently, system administrators and end users have to discover them manually. Clearly, this approach does not scale to large-scale supercomputers and facilities: automated methods to detect faults and unhealthy conditions are needed. Our method uses a type of neural network called an autoencoder, trained to learn the normal behavior of a real, in-production HPC system and deployed on the edge of each computing node. We obtain a very good accuracy (values ranging between 90 [...] that the approach can be deployed on the supercomputer nodes without negatively affecting the performance of the computing units.
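
To illustrate the general idea of detecting anomalies with an autoencoder trained only on normal behavior, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the network architecture, the monitored features, or how the decision threshold is chosen. The two synthetic "metrics", the one-unit linear autoencoder, the learning rate, and the max-error threshold are all assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def make_normal_sample():
    # Hypothetical "healthy node" reading: two strongly correlated metrics.
    t = random.uniform(-1.0, 1.0)
    return [t, 2.0 * t + random.gauss(0.0, 0.05)]


def reconstruct(x, w_enc, w_dec):
    # Encode the 2 features into one latent value, then decode back.
    z = w_enc[0] * x[0] + w_enc[1] * x[1]
    return z, [w_dec[0] * z, w_dec[1] * z]


def reconstruction_error(x, w_enc, w_dec):
    # Squared distance between the input and its reconstruction.
    _, x_hat = reconstruct(x, w_enc, w_dec)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def train(samples, epochs=300, lr=0.02):
    # Plain stochastic gradient descent on the squared reconstruction error.
    w_enc = [random.uniform(-0.1, 0.1) for _ in range(2)]
    w_dec = [random.uniform(-0.1, 0.1) for _ in range(2)]
    for _ in range(epochs):
        for x in samples:
            z, x_hat = reconstruct(x, w_enc, w_dec)
            err = [x_hat[i] - x[i] for i in range(2)]
            # Gradient of the loss with respect to the latent value z.
            grad_z = sum(2.0 * err[i] * w_dec[i] for i in range(2))
            for i in range(2):
                w_dec[i] -= lr * 2.0 * err[i] * z
                w_enc[i] -= lr * grad_z * x[i]
    return w_enc, w_dec


# Train on normal data only, as the abstract describes.
normal_data = [make_normal_sample() for _ in range(200)]
w_enc, w_dec = train(normal_data)

# Assumed decision rule: flag anything reconstructed worse than
# the worst-reconstructed normal training sample.
threshold = max(reconstruction_error(x, w_enc, w_dec) for x in normal_data)


def is_anomalous(x):
    return reconstruction_error(x, w_enc, w_dec) > threshold


print(is_anomalous([0.5, 1.0]))   # reading that follows the normal pattern
print(is_anomalous([1.0, -2.0]))  # reading that breaks the learned correlation
```

Because the model only ever sees healthy readings, it learns to reconstruct those well; a reading that violates the learned relationship between metrics is reconstructed poorly, and the large error is what signals the anomaly.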

