A Multi-Level, Multi-Scale Visual Analytics Approach to Assessment of Multifidelity HPC Systems

06/15/2023
by Shilpika, et al.

The ability to monitor and interpret hardware system events and behaviors is crucial to improving the robustness and reliability of these systems, especially in a supercomputing facility. The growing complexity and scale of these systems demand an increase in monitoring data collected at multiple fidelity levels and varying temporal resolutions. In this work, we aim to build a holistic analytical system that helps make sense of such massive data, mainly the hardware logs, job logs, and environment logs collected from disparate subsystems and components of a supercomputer system. This end-to-end log analysis system, coupled with visual analytics support, allows users to promptly extract supercomputer usage and error patterns at varying temporal and spatial resolutions. We use multiresolution dynamic mode decomposition (mrDMD), a technique that depicts high-dimensional data as correlated spatial-temporal variation patterns, or modes, to extract variation patterns isolated at specified frequencies. Our improvements to the mrDMD algorithm help promptly reveal useful information in the massive environment log dataset, which is then associated with the processed hardware and job log datasets using our visual analytics system. Furthermore, our system can identify usage and error patterns filtered at the user, project, and subcomponent levels. We demonstrate the effectiveness of our approach with two use scenarios on the Cray XC40 supercomputer.
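To make the idea of "modes isolated at specified frequencies" more concrete, the following is a minimal sketch of plain (single-level) dynamic mode decomposition in NumPy. It is not the authors' improved mrDMD algorithm; mrDMD applies a decomposition of this kind recursively over shorter and shorter time windows, which is omitted here. The function names `dmd` and `mode_frequencies`, the synthetic 10-sensor signal, and all parameter values are assumptions made purely for illustration.

```python
import numpy as np


def dmd(X, r):
    """Plain DMD of a snapshot matrix X (sensors x time), truncated to rank r.

    Returns the DMD eigenvalues and the corresponding spatial modes.
    """
    X1, X2 = X[:, :-1], X[:, 1:]            # time-shifted snapshot pairs
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r, :]   # rank-r truncation
    V = Vh.conj().T
    # Reduced linear operator that advances X1 to X2 in the SVD basis.
    A_tilde = (U.conj().T @ X2 @ V) / s     # dividing by s scales each column
    eigvals, W = np.linalg.eig(A_tilde)
    modes = ((X2 @ V) / s) @ W              # spatial patterns, one per eigenvalue
    return eigvals, modes


def mode_frequencies(eigvals, dt):
    """Oscillation frequency (in Hz) of each mode, given the sampling interval dt."""
    return np.abs(np.log(eigvals).imag) / (2.0 * np.pi * dt)


# Hypothetical data: 10 sensors sharing one 2 Hz oscillation, sampled every 10 ms.
rng = np.random.default_rng(0)
dt, f = 0.01, 2.0
t = np.arange(200) * dt
a, b = rng.standard_normal(10), rng.standard_normal(10)
X = np.outer(a, np.cos(2 * np.pi * f * t)) + np.outer(b, np.sin(2 * np.pi * f * t))

eigvals, modes = dmd(X, r=2)
freqs = mode_frequencies(eigvals, dt)
print(freqs)
```

In this sketch each eigenvalue carries the frequency of one mode, and the matching column of `modes` shows how strongly each sensor takes part in that oscillation. Selecting modes by frequency band is what lets a multiresolution variant separate slow trends from fast fluctuations.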

research
08/23/2017

Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale

Today's high-performance computing (HPC) systems are heavily instrumente...
research
04/12/2012

Enabling Semantic Analysis of User Browsing Patterns in the Web of Data

A useful step towards better interpretation and analysis of the usage pa...
research
09/25/2019

MPCDF HPC Performance Monitoring System: Enabling Insight via Job-Specific Analysis

This paper reports on the design and implementation of the HPC performan...
research
08/14/2020

Loghub: A Large Collection of System Log Datasets towards Automated Log Analytics

Logs have been widely adopted in software system development and mainten...
research
09/26/2018

dynamicMF: A Matrix Factorization Approach to Monitor Resource Usage in High Performance Computing Systems

High performance computing (HPC) facilities consist of a large number of...
research
06/25/2020

Survey on Visual Analysis of Event Sequence Data

Event sequence data record series of discrete events in the time order o...
research
11/06/2019

Reducing Honeypot Log Storage Capacity Consumption – Cron Job with Perl-Script Approach

Honeypot is a decoy computer system that is used to attract and monitor ...
