Themis: Fair and Efficient GPU Cluster Scheduling for Machine Learning Workloads

07/02/2019
by Kshiteej Mahajan, et al.

Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, significant contention ensues when multiple such workloads are run atop a shared cluster of GPUs. A key question is how to fairly apportion GPUs across workloads while ensuring overall cluster efficiency. We find that established cluster scheduling disciplines that provide instantaneous fair share of resources are a poor fit because of ML workloads' unique attributes: ML jobs are typically long-running, have coarse-grained tasks that need to be gang-scheduled, and their performance is sensitive to tasks' relative placement. These properties cannot be captured by existing fair-sharing schemes. We propose Themis, a new scheduling framework for ML training workloads. Its GPU allocation policy enforces that ML workloads complete in a finish-time fair manner, a new notion we introduce. To capture placement sensitivity and ensure efficiency, Themis uses a two-level scheduling architecture where ML workloads bid on available resources that are offered in an auction run by a central arbiter. Our auction design allocates GPUs to winning bids by trading off efficiency for fairness in the short term but compensating for finish-time fairness in the long term. Our evaluation on a number of machine learning models shows that Themis can ensure greater fairness while providing more efficient allocations compared to state-of-the-art schedulers.
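Concretely, finish-time fairness can be read as a ratio rho = T_shared / T_independent: a workload's estimated finish time under its current shared-cluster allocation versus its estimated finish time with a dedicated 1/N share of the cluster. The sketch below is a minimal illustration of how such a metric could drive one offer/bid round of the two-level architecture described above; it is not the paper's implementation, and the App class, offer_round function, placement_speedup field, and fairness_knob parameter are illustrative assumptions rather than any actual Themis interface.

```python
# Hedged sketch (not the authors' implementation): finish-time fairness rho
# and one simplified round of the offer/bid loop sketched in the abstract.
# All names here are illustrative, not part of any Themis API.
from dataclasses import dataclass

@dataclass
class App:
    name: str
    remaining_work: float          # e.g., remaining iterations x time per iteration on 1 GPU
    gpus_held: int                 # GPUs currently allocated in the shared cluster
    placement_speedup: float = 1.0 # <1.0 if the current placement is poor (e.g., cross-rack)

    def shared_finish_time(self) -> float:
        # Estimated finish time under the current shared allocation (T_shared).
        effective = max(self.gpus_held, 1e-9) * self.placement_speedup
        return self.remaining_work / effective

    def ideal_finish_time(self, cluster_gpus: int, num_apps: int) -> float:
        # Estimated finish time with an exclusive 1/N share of the cluster (T_independent).
        fair_share = cluster_gpus / num_apps
        return self.remaining_work / fair_share

    def rho(self, cluster_gpus: int, num_apps: int) -> float:
        # Finish-time fairness: rho > 1 means the app is worse off than it
        # would be running alone on its 1/N share.
        return self.shared_finish_time() / self.ideal_finish_time(cluster_gpus, num_apps)

def offer_round(apps, cluster_gpus, free_gpus, fairness_knob=0.5):
    """One simplified arbiter round: offer free GPUs to the apps that are
    furthest behind on finish-time fairness, then grant them to the best bid."""
    n = len(apps)
    # Offer only to the most unfairly treated fraction of apps; 'fairness_knob'
    # is an assumed parameter controlling the short-term fairness/efficiency trade-off.
    ranked = sorted(apps, key=lambda a: a.rho(cluster_gpus, n), reverse=True)
    eligible = ranked[: max(1, int(fairness_knob * n))]

    # Each eligible app "bids" its rho improvement if granted the free GPUs;
    # the arbiter grants them to the bid that improves fairness the most.
    def improvement(a):
        before = a.rho(cluster_gpus, n)
        after = App(a.name, a.remaining_work, a.gpus_held + free_gpus,
                    a.placement_speedup).rho(cluster_gpus, n)
        return before - after

    winner = max(eligible, key=improvement)
    winner.gpus_held += free_gpus
    return winner

apps = [App("resnet", 100.0, 2), App("bert", 300.0, 2, 0.8), App("gan", 50.0, 2)]
winner = offer_round(apps, cluster_gpus=8, free_gpus=2)
print(winner.name, [round(a.rho(8, len(apps)), 2) for a in apps])
```

In this toy round, offering GPUs only to the apps with the largest rho trades some short-term efficiency for fairness, while repeated rounds pull every workload's rho back toward 1 over time, mirroring the long-term compensation described in the abstract.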


Related research

01/17/2019  Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
With widespread advances in machine learning, a number of large enterpri...

09/30/2022  Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning
Dynamic adaptation has become an essential technique in accelerating dis...

02/01/2023  Task Placement and Resource Allocation for Edge Machine Learning: A GNN-based Multi-Agent Reinforcement Learning Paradigm
Machine learning (ML) tasks are one of the major workloads in today's ed...

02/13/2018  SLAQ: Quality-Driven Scheduling for Distributed Machine Learning
Training machine learning (ML) models with large datasets can incur sign...

12/07/2019  BoPF: Mitigating the Burstiness-Fairness Tradeoff in Multi-Resource Clusters
Simultaneously supporting latency- and throughput-sensitive workloads in...

08/24/2017  Ease.ml: Towards Multi-tenant Resource Sharing for Machine Learning Workloads
We present ease.ml, a declarative machine learning service platform we b...

08/20/2020  Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads
Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs hav...
