Online Fault Classification in HPC Systems through Machine Learning

10/26/2018
by Alessio Netti, et al.

As High-Performance Computing (HPC) systems strive towards exascale goals, studies suggest that they will experience excessive failure rates, mainly due to the massive parallelism that they require. Long-running exascale computations would be severely affected by a variety of failures, which could occur as often as every few minutes. Therefore, detecting and classifying faults in HPC systems as they occur, and initiating corrective actions through appropriate resiliency techniques before those faults turn into failures, will be essential for operating them. In this paper, we propose a fault classification method for HPC systems based on machine learning and designed for live streamed data. Our solution is cast within realistic operating constraints, especially those deriving from the desire to operate the classifier in an online manner. Our results show that almost perfect classification accuracy can be reached for different fault types with low computational overhead and minimal delay. Our study is based on a dataset, now publicly available, that was acquired by injecting faults into an in-house experimental HPC system.
