MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant Systems for Machine Learning

07/23/2022
by Baolin Li, et al.

GPU technology has been improving rapidly in size and performance, empowering HPC and AI/ML researchers to advance scientific discovery. However, this also leads to inefficient resource usage: most GPU workloads, including complicated AI/ML models, cannot utilize the GPU's resources to their fullest extent, which motivates support for GPU multi-tenancy. We propose MISO, a technique that exploits the Multi-Instance GPU (MIG) capability of the latest NVIDIA datacenter GPUs (e.g., A100, H100) to dynamically partition GPU resources among co-located jobs. MISO's key insight is to use the lightweight, more flexible Multi-Process Service (MPS) capability to predict the best MIG partition allocation for different jobs, without incurring the overhead of actually applying candidate MIG partitions during exploration. Because it utilizes GPU resources more efficiently, MISO achieves 49% and 16% lower average job completion time than the unpartitioned and optimal static GPU partition schemes, respectively.
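To make the partition-selection idea concrete, here is a minimal sketch of choosing a MIG allocation from predicted per-job performance. It is not MISO's actual algorithm. It assumes each job already has an estimated relative speed for every slice size (the hypothetical `predicted` input, which in MISO's setting would come from MPS-based profiling), it scores an allocation by the sum of those speeds, and it checks only that the total fits in the seven compute slices of an A100, ignoring the real MIG placement rules. The function name `best_partition` is illustrative.

```python
from itertools import product

# Compute-slice sizes offered by MIG on an A100 (1g, 2g, 3g, 4g, 7g).
SLICE_SIZES = (1, 2, 3, 4, 7)
TOTAL_SLICES = 7  # an A100 exposes seven compute slices


def best_partition(predicted):
    """Pick one slice size per job to maximize summed predicted speed.

    predicted: list of dicts, one per job, mapping slice size to the job's
    estimated speed relative to running alone on the full GPU (assumed input).
    Simplification: only the total slice count is checked, not MIG's
    placement constraints.
    """
    best_score, best_alloc = float("-inf"), None
    for alloc in product(SLICE_SIZES, repeat=len(predicted)):
        if sum(alloc) > TOTAL_SLICES:
            continue  # allocation does not fit on one GPU
        score = sum(job[size] for job, size in zip(predicted, alloc))
        if score > best_score:
            best_score, best_alloc = score, alloc
    return best_alloc, best_score


# Hypothetical estimates: job_a keeps scaling, job_b saturates early.
job_a = {1: 0.2, 2: 0.4, 3: 0.5, 4: 0.7, 7: 1.0}
job_b = {1: 0.5, 2: 0.8, 3: 0.9, 4: 0.95, 7: 1.0}
allocation, score = best_partition([job_a, job_b])
```

With these made-up estimates the search gives the larger slice to the job that still benefits from it. Exhaustive enumeration is practical here only because a single GPU hosts at most seven jobs.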

research
09/13/2022

Deep Learning Training on Multi-Instance GPUs

Deep learning training is an expensive process that extensively uses GPU...
research
10/12/2021

Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters

Training Deep Neural Networks (DNNs) is a widely popular workload in bot...
research
01/10/2022

A Simulation Platform for Multi-tenant Machine Learning Services on Thousands of GPUs

Multi-tenant machine learning services have become emerging data-intensi...
research
09/26/2019

Elastic deep learning in multi-tenant GPU cluster

Multi-tenant GPU clusters are common nowadays due to the huge success of...
research
02/07/2018

Zorua: Enhancing Programming Ease, Portability, and Performance in GPUs by Decoupling Programming Models from Resource Management

The application resource specification--a static specification of severa...
research
03/11/2019

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

High performance multi-GPU computing becomes an inevitable trend due to ...
research
10/03/2018

Simulating the weak death of the neutron in a femtoscale universe with near-Exascale computing

The fundamental particle theory called Quantum Chromodynamics (QCD) dict...
