dynamicMF: A Matrix Factorization Approach to Monitor Resource Usage in High Performance Computing Systems

09/26/2018
by Niyazi Sorkunlu, et al.

High performance computing (HPC) facilities consist of a large number of interconnected computing units (or nodes) that execute highly complex simulations in support of scientific research. Monitoring such facilities in real time is essential to ensure that the system operates at peak efficiency. Such systems are typically monitored using a variety of measurement and log data that capture the state of the various components within the system at regular time intervals. As modern HPC systems grow in capacity and complexity, the data produced by current resource monitoring tools has reached a scale at which visual monitoring by analysts is no longer feasible. We propose a method that transforms the multi-dimensional output of resource monitoring tools into a low-dimensional representation that facilitates understanding of the behavior of an HPC system. The proposed method automatically extracts the low-dimensional signal in the data, which can be used to track system efficiency and identify performance anomalies. The method models the resource usage data as a three-dimensional tensor (capturing the usage of different resources by all compute nodes over time). A dynamic matrix factorization algorithm, called dynamicMF, is proposed to extract a low-dimensional temporal signal for each node, which is subsequently fed into an anomaly detector. Results on resource usage data collected from the Lonestar 4 system at the Texas Advanced Computing Center show that the identified anomalies are correlated with actual anomalous events reported in the system log messages.
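The pipeline described in the abstract (usage tensor, low-dimensional per-node signal, anomaly detector) can be illustrated with a small sketch. This is not the dynamicMF algorithm, whose details the abstract does not give: it substitutes a plain truncated SVD on the unfolded tensor and a median-based threshold as the detector. The function names `node_residual_signals` and `flag_anomalies`, the rank, the threshold, and the synthetic data are all assumptions made for illustration, and real monitoring data would likely need per-resource normalization first.

```python
import numpy as np


def node_residual_signals(usage, rank=2):
    """Return one temporal signal per node from a (nodes, resources, steps) tensor.

    Stand-in for dynamicMF: fits a single rank-`rank` resource basis with a
    truncated SVD and reports how poorly each (node, step) row fits it.
    """
    n_nodes, n_resources, n_steps = usage.shape
    # Unfold so every row is the resource vector of one node at one time step.
    rows = usage.transpose(0, 2, 1).reshape(-1, n_resources)
    _, _, vt = np.linalg.svd(rows, full_matrices=False)
    basis = vt[:rank]  # shape (rank, n_resources)
    residual = rows - (rows @ basis.T) @ basis
    return np.linalg.norm(residual, axis=1).reshape(n_nodes, n_steps)


def flag_anomalies(signals, threshold=3.5):
    """Flag (node, step) entries whose robust z-score exceeds `threshold`."""
    median = np.median(signals, axis=1, keepdims=True)
    mad = np.median(np.abs(signals - median), axis=1, keepdims=True) + 1e-12
    return 0.6745 * (signals - median) / mad > threshold


# Synthetic data (assumed, not from the paper): two latent usage patterns
# shared by all nodes, small noise, and one injected off-pattern event.
rng = np.random.default_rng(0)
n_nodes, n_resources, n_steps = 20, 6, 60
loadings = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
activity = rng.normal(size=(n_nodes, n_steps, 2))
usage = (activity @ loadings).transpose(0, 2, 1)  # (nodes, resources, steps)
usage += 0.01 * rng.normal(size=usage.shape)
usage[3, :, 40] += np.array([3.0, -3.0, 0.0, 0.0, 0.0, 0.0])

signals = node_residual_signals(usage)
flags = flag_anomalies(signals)
print("largest residual at (node, step):",
      np.unravel_index(np.argmax(signals), signals.shape))
```

The residual norm is used as the per-node signal here only because it is simple to compute and inspect; a factorization that is updated over time, as the abstract describes, would track slow changes in usage patterns that a single fixed basis cannot.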

research
05/30/2017

Tracking System Behaviour from Resource Usage Data

Resource usage data, collected using tools such as TACC Stats, capture t...
research
06/13/2018

The importance and need for system monitoring and analysis in HPC operations and research

In this work, system monitoring and analysis are discussed in terms of t...
research
09/25/2019

MPCDF HPC Performance Monitoring System: Enabling Insight via Job-Specific Analysis

This paper reports on the design and implementation of the HPC performan...
research
07/06/2021

Sustaining Performance While Reducing Energy Consumption: A Control Theory Approach

Production high-performance computing systems continue to grow in comple...
research
02/22/2019

Online Anomaly Detection in HPC Systems

Reliability is a cumbersome problem in High Performance Computing System...
research
06/15/2023

A Multi-Level, Multi-Scale Visual Analytics Approach to Assessment of Multifidelity HPC Systems

The ability to monitor and interpret of hardware system events and behav...