Why do Users Kill HPC Jobs?

Given the cost of HPC clusters, making the best use of them is crucial to improving infrastructure ROI. Likewise, reducing failed HPC jobs and the related waste in terms of user wait times is crucial to improving HPC user productivity (aka human ROI). While most efforts (e.g., debugging HPC programs) explore technical aspects to improve the ROI of HPC clusters, we hypothesize that non-technical (human) aspects are worth exploring to make non-trivial ROI gains; specifically, understanding non-technical aspects and how they contribute to the failure of HPC jobs. In this regard, we conducted a case study in the context of the Beocat cluster at Kansas State University. The purpose of the study was to learn the reasons why users terminate jobs and to quantify wasted computation in such jobs in terms of system utilization and user wait time. The data from the case study helped identify interesting and actionable reasons why users terminate HPC jobs. It also helped confirm that user-terminated jobs may be associated with a non-trivial amount of wasted computation, which, if reduced, can help improve the ROI of HPC clusters.


