A Simulation Platform for Multi-tenant Machine Learning Services on Thousands of GPUs

01/10/2022
by Ruofan Liang, et al.

Multi-tenant machine learning services have become an emerging class of data-intensive workloads in data centers and make heavy use of GPU resources. Because of their large scale, many tuning parameters, and heavy resource usage, it is usually impractical to evaluate and benchmark these services on real clusters. In this demonstration, we present AnalySIM, a cluster simulator that enables efficient design exploration for multi-tenant machine learning services. Specifically, through trace-driven simulation of cluster workloads, AnalySIM makes it easy to test and analyze various scheduling policies against a number of performance metrics, such as GPU resource utilization. AnalySIM models the cluster's computational resources according to both their physical topology and their logical partitioning. The tool has been used at SenseTime to understand the impact of different scheduling policies, using a trace from a real production cluster of over 1000 GPUs. We find that preemption and migration can significantly reduce average job completion time and mitigate resource fragmentation.
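
To make the idea of trace-driven scheduling simulation concrete, here is a minimal sketch in Python. It is not AnalySIM's implementation or API: the `Job` record, the `simulate_fifo` function, and the strict first-in-first-out policy are all assumptions chosen for illustration. The sketch treats the cluster as a flat pool of GPUs and ignores topology, logical partitions, preemption, and migration. It replays a job trace and reports two of the metrics mentioned above, average job completion time and GPU utilization.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    """One trace record (hypothetical schema, not AnalySIM's trace format)."""
    job_id: str
    submit: float    # submission time
    gpus: int        # number of GPUs requested
    duration: float  # run time once started


def simulate_fifo(trace, total_gpus):
    """Replay a trace under strict FIFO on a flat pool of GPUs.

    Returns (average job completion time, GPU utilization), where
    utilization is busy GPU-time divided by the cluster's GPU-time
    between the first submission and the last completion.
    """
    if not trace:
        return 0.0, 0.0

    running = []          # min-heap of (finish_time, gpus) for running jobs
    free = total_gpus     # GPUs currently idle
    now = 0.0             # simulation clock
    completion_times = []
    busy_gpu_time = 0.0
    last_finish = 0.0

    jobs = sorted(trace, key=lambda j: j.submit)
    first_submit = jobs[0].submit

    for job in jobs:
        if job.gpus > total_gpus:
            raise ValueError(f"job {job.job_id} can never fit in the cluster")

        # The clock never moves backwards: a job cannot start before
        # it is submitted or before the job ahead of it has started.
        now = max(now, job.submit)

        # Reclaim GPUs from jobs that have already finished.
        while running and running[0][0] <= now:
            free += heapq.heappop(running)[1]

        # Head-of-line blocking: wait for completions until the job fits.
        while free < job.gpus:
            finish, gpus = heapq.heappop(running)
            now = max(now, finish)
            free += gpus

        free -= job.gpus
        finish = now + job.duration
        heapq.heappush(running, (finish, job.gpus))

        completion_times.append(finish - job.submit)
        busy_gpu_time += job.gpus * job.duration
        last_finish = max(last_finish, finish)

    avg_jct = sum(completion_times) / len(completion_times)
    span = last_finish - first_submit
    utilization = busy_gpu_time / (total_gpus * span) if span > 0 else 0.0
    return avg_jct, utilization


# Example: three jobs on a 4-GPU cluster.
example_trace = [
    Job("a", submit=0.0, gpus=4, duration=10.0),
    Job("b", submit=1.0, gpus=2, duration=5.0),
    Job("c", submit=2.0, gpus=2, duration=5.0),
]
avg_jct, utilization = simulate_fifo(example_trace, total_gpus=4)
print(f"average JCT: {avg_jct:.2f}, GPU utilization: {utilization:.2%}")
```

In this toy trace, jobs "b" and "c" queue behind the 4-GPU job "a" even though they arrive shortly after it. That queueing delay is the kind of cost that policies such as preemption or migration aim to reduce. Comparing policies in a sketch like this would mean replacing the FIFO admission loop while keeping the trace and the metrics fixed.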


research
01/17/2019

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

With widespread advances in machine learning, a number of large enterpri...
research
08/10/2023

Isolated Scheduling for Distributed Training Tasks in GPU Clusters

Distributed machine learning (DML) technology makes it possible to train...
research
06/22/2020

PipeSim: Trace-driven Simulation of Large-Scale AI Operations Platforms

Operationalizing AI has become a major endeavor in both research and ind...
research
04/20/2018

Bayesian Admission Policies for Cloud Computing Clusters

Cloud computing providers must handle customer workloads that wish to sc...
research
12/22/2018

Bioinformatics Computational Cluster Batch Task Profiling with Machine Learning for Failure Prediction

Motivation: Traditional computational cluster schedulers are based on us...
research
04/20/2018

The Power of Machine Learning and Market Design for Cloud Computing Admission Control

Cloud computing providers must handle customer workloads that wish to sc...
research
07/23/2022

MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant Systems for Machine Learning

GPU technology has been improving at an expedited pace in terms of size ...
