MPCDF HPC Performance Monitoring System: Enabling Insight via Job-Specific Analysis

09/25/2019
by Luka Stanisic, et al.

This paper reports on the design and implementation of the HPC performance monitoring system deployed to continuously monitor performance metrics of all jobs on the HPC systems at the Max Planck Computing and Data Facility (MPCDF). The system provides performance information to several groups of stakeholders, in particular users, application support, system administrators, and management. On each compute node, hardware and software performance monitoring data is collected by our newly developed lightweight open-source hpcmd middleware, which builds on standard Linux tools. The data is transported via rsyslog, then aggregated and processed by a Splunk system, enabling detailed per-cluster and per-job interactive analysis in a web browser. Additionally, performance reports are provided to users as PDF files. Finally, we report on practical experience and benefits from large-scale deployments on MPCDF HPC systems, demonstrating how our solution can be useful to any HPC center.
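The collect-and-transport step described in the abstract can be illustrated with a minimal sketch. It is not the hpcmd implementation: the function names, the field names (jobid, node, gflops, mem_bw_gbs) and the key=value line format are all assumptions made for illustration. The sketch tags one sample with a job ID, renders it as a single line, and optionally hands that line to the local syslog daemon, from which a forwarder such as rsyslog could pick it up.

```python
import logging
import logging.handlers


def format_sample(job_id, node, metrics):
    """Render one sample as a key=value line (hypothetical format, not hpcmd's)."""
    fields = {"jobid": job_id, "node": node, **metrics}
    return " ".join(f"{key}={value}" for key, value in fields.items())


def make_syslog_logger(address="/dev/log"):
    """Return a logger that passes lines to the local syslog socket.

    The socket path is an assumption; it is the usual location on Linux.
    """
    logger = logging.getLogger("perfmon_sketch")
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        logger.addHandler(logging.handlers.SysLogHandler(address=address))
    return logger


if __name__ == "__main__":
    # Made-up values, purely to show the shape of a line.
    line = format_sample("123456", "node001", {"gflops": 12.5, "mem_bw_gbs": 48.0})
    print(line)
    try:
        make_syslog_logger().info(line)
    except OSError:
        # No local syslog socket available; the sketch only prints the line.
        pass
```

Putting the job ID on every line is what would allow a downstream indexer to group samples per job. How the deployed system actually encodes its messages is described in the full paper, not here.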


