Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

01/17/2019
by Myeongjae Jeon, et al.

With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across many of their products. These models are typically trained on shared, multi-tenant GPU clusters. Similar to existing cluster computing workloads, scheduling frameworks aim to provide features such as high efficiency, resource isolation, and fair sharing across users. However, Deep Neural Network (DNN)-based workloads, predominantly trained on GPUs, differ in two significant ways from traditional big data analytics workloads. First, from a cluster utilization perspective, GPUs represent a monolithic resource that cannot be shared at a fine granularity across users. Second, from a workload perspective, deep learning frameworks require gang scheduling, which reduces scheduling flexibility and makes the jobs themselves inelastic to failures at runtime. In this paper, we present a detailed workload characterization of a two-month-long trace from a multi-tenant GPU cluster in a large enterprise. By correlating scheduler logs with logs from individual jobs, we study three distinct issues that affect cluster utilization for DNN training workloads on multi-tenant clusters: (1) the effect of gang scheduling and locality constraints on queuing, (2) the effect of locality on GPU utilization, and (3) failures during training. Based on our experience running a large-scale operation, we provide design guidelines pertaining to next-generation cluster schedulers for DNN training workloads.
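To make the gang-scheduling and locality trade-off described above concrete, here is a minimal, hypothetical sketch; it is not the paper's scheduler, and all names (Job, Server, try_place) are illustrative. It shows an all-or-nothing (gang) placement policy with a locality preference: a job is either granted every GPU it asked for or it keeps queuing, and packing onto one server is preferred over fragmenting across servers.

```python
# Illustrative sketch only -- not the scheduler studied in the paper.
from dataclasses import dataclass

@dataclass
class Job:
    job_id: str
    gpus_needed: int  # gang scheduling: all GPUs must be granted at once

@dataclass
class Server:
    server_id: str
    free_gpus: int

def try_place(job, servers, relax_locality):
    """Return a placement [(server_id, gpu_count), ...], or None to keep queuing.

    Gang scheduling: either every requested GPU is allocated now, or none are.
    Locality: prefer packing the whole job onto one server; only spread it
    across servers when locality is relaxed.
    """
    # Strict locality: look for a single server that can host the whole gang.
    for s in servers:
        if s.free_gpus >= job.gpus_needed:
            s.free_gpus -= job.gpus_needed
            return [(s.server_id, job.gpus_needed)]
    if not relax_locality:
        return None  # keep the job queued rather than fragment it

    # Relaxed locality: spread across servers, but still all-or-nothing.
    if sum(s.free_gpus for s in servers) < job.gpus_needed:
        return None
    placement, remaining = [], job.gpus_needed
    for s in sorted(servers, key=lambda s: -s.free_gpus):
        take = min(s.free_gpus, remaining)
        if take:
            s.free_gpus -= take
            placement.append((s.server_id, take))
            remaining -= take
        if remaining == 0:
            break
    return placement

# Example: two servers each have 2 free GPUs; a 4-GPU job queues under
# strict locality but is placed (fragmented) once locality is relaxed.
servers = [Server("s0", 2), Server("s1", 2)]
print(try_place(Job("j0", 4), servers, relax_locality=False))  # None -> queuing delay
print(try_place(Job("j0", 4), servers, relax_locality=True))   # [('s0', 2), ('s1', 2)]
```

Under strict locality the 4-GPU job keeps waiting even though four GPUs are free in aggregate; relaxing locality admits it immediately but fragments it across servers. This is the queuing-delay-versus-utilization tension the paper's trace analysis examines.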
