The MIT Supercloud Workload Classification Challenge

04/12/2022
by Benny J. Tang, et al.
MIT

High-Performance Computing (HPC) centers and cloud providers support an increasingly diverse set of applications on heterogeneous hardware. As Artificial Intelligence (AI) and Machine Learning (ML) workloads have become an increasingly large share of compute workloads, new approaches are needed to optimize resource usage, allocation, and the deployment of new AI frameworks. By identifying compute workloads and their utilization characteristics, HPC systems may be able to better match available resources with application demand. By leveraging datacenter instrumentation, it may be possible to develop AI-based approaches that identify workloads and provide feedback to researchers and datacenter operators for improving operational efficiency. To enable this research, we released the MIT Supercloud Dataset, which provides detailed monitoring logs from the MIT Supercloud cluster. This dataset includes CPU and GPU usage by job, memory usage, and file system logs. In this paper, we present a workload classification challenge based on this dataset. We introduce a labelled dataset that can be used to develop new approaches to workload classification and present initial results based on existing approaches. The goal of this challenge is to foster algorithmic innovations in the analysis of compute workloads that can achieve higher accuracy than existing methods. Data and code will be made publicly available via the Datacenter Challenge website: https://dcc.mit.edu.


