The MIT Supercloud Dataset

08/04/2021
by Siddharth Samsi, et al.

Artificial intelligence (AI) and machine learning (ML) workloads account for a growing share of the compute workloads in traditional High-Performance Computing (HPC) centers and commercial cloud systems. This has led to changes in how HPC clusters and the commercial cloud are deployed, as well as a new focus on optimizing resource usage and allocations, deploying new AI frameworks, and providing capabilities such as Jupyter notebooks to enable rapid prototyping and deployment. With these changes, there is a need to better understand cluster/datacenter operations, with the goal of developing improved scheduling policies, identifying inefficiencies in resource utilization and energy/power consumption, predicting failures, and identifying policy violations. In this paper we introduce the MIT Supercloud Dataset, which aims to foster innovative AI/ML approaches to the analysis of large-scale HPC and datacenter/cloud operations. We provide detailed monitoring logs from the MIT Supercloud system, which include CPU and GPU usage by jobs, memory usage, file system logs, and physical monitoring data. This paper discusses the details of the dataset, the collection methodology, and data availability, as well as potential challenge problems being developed using this data. Datasets and future challenge announcements will be available via https://dcc.mit.edu.

research
04/12/2022

The MIT Supercloud Workload Classification Challenge

High-Performance Computing (HPC) centers and cloud providers support an ...
research
06/22/2020

Multiverse: Dynamic VM Provisioning for Virtualized High Performance Computing Clusters

Traditionally, HPC workloads have been deployed in bare-metal clusters; ...
research
01/12/2023

Analyzing Resource Utilization in an HPC System: A Case Study of NERSC Perlmutter

The resource demands of HPC applications vary significantly. However, it...
research
08/26/2020

Optimising AI Training Deployments using Graph Compilers and Containers

Artificial Intelligence (AI) applications based on Deep Neural Networks ...
research
05/10/2020

Optimizing Deep Learning Recommender Systems' Training On CPU Cluster Architectures

During the last two years, the goal of many researchers has been to sque...
research
02/25/2021

TELESTO: A Graph Neural Network Model for Anomaly Classification in Cloud Services

Deployment, operation and maintenance of large IT systems becomes increa...
research
08/17/2020

AIPerf: Automated machine learning as an AI-HPC benchmark

The plethora of complex artificial intelligence (AI) algorithms and avai...
