Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning

09/30/2022
by   Pengfei Zheng, et al.

Dynamic adaptation has become an essential technique in accelerating distributed machine learning (ML) training. Recent studies have shown that dynamically adjusting model structure (e.g., lottery ticket hypothesis) or hyperparameters (e.g., batch size) can significantly accelerate training without sacrificing accuracy. However, existing ML cluster schedulers are not designed to handle dynamic adaptation. We show that existing schemes fail to provide fairness and degrade system efficiency when the training throughput changes over time under dynamic adaptation. We design Shockwave, a scheduler with future planning that builds on two key ideas. First, Shockwave extends classic market theory from static settings to dynamic settings to co-optimize efficiency and fairness. Second, Shockwave utilizes stochastic dynamic programming to handle dynamic changes. We build a system for Shockwave and validate its performance with both trace-driven simulation and cluster experiments. Results show that for traces of ML jobs with dynamic adaptation, Shockwave improves makespan by 1.3X and fairness by 2X when compared with existing fair scheduling schemes.
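To illustrate why changing throughput matters to a scheduler, here is a minimal sketch that compares two estimates of a job's remaining run time: one that assumes the current throughput persists, and one that accounts for a known future change (for example, a batch-size increase). This is not Shockwave's algorithm; the function names, the phase representation, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def static_estimate(remaining_samples: float, current_throughput: float) -> float:
    """Remaining run time (seconds) if throughput never changed."""
    return remaining_samples / current_throughput


def dynamic_estimate(remaining_samples: float,
                     phases: List[Tuple[float, float]]) -> float:
    """Remaining run time (seconds) when throughput changes across phases.

    Each phase is (samples_processed_in_phase, throughput_in_samples_per_sec).
    """
    elapsed = 0.0
    left = remaining_samples
    for samples, throughput in phases:
        work = min(samples, left)      # do not process more than what is left
        elapsed += work / throughput
        left -= work
        if left <= 0:
            break
    if left > 0:
        raise ValueError("phases do not cover the remaining work")
    return elapsed


# Hypothetical job: 1,000,000 samples left. After 250,000 samples the batch
# size grows and throughput doubles from 1000 to 2000 samples/s.
phases = [(250_000, 1000.0), (750_000, 2000.0)]
static = static_estimate(1_000_000, 1000.0)
dynamic = dynamic_estimate(1_000_000, phases)
print(f"static estimate:  {static:.0f} s")
print(f"dynamic estimate: {dynamic:.0f} s")
```

In this made-up case the static estimate overstates the remaining time (1000 s versus 625 s), so a scheduler that relies on it would misjudge how much of the cluster the job needs to finish on time. That gap is the kind of error a scheduler with future planning aims to avoid.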

research
07/02/2019

Themis: Fair and Efficient GPU Cluster Scheduling for Machine Learning Workloads

Modern distributed machine learning (ML) training workloads benefit sign...
research
09/13/2019

DL2: A Deep Learning-driven Scheduler for Deep Learning Clusters

More and more companies have deployed machine learning (ML) clusters, wh...
research
02/13/2018

SLAQ: Quality-Driven Scheduling for Distributed Machine Learning

Training machine learning (ML) models with large datasets can incur sign...
research
08/19/2023

Revitalising the Single Batch Environment: A 'Quest' to Achieve Fairness and Efficiency

In the realm of computer systems, efficient utilisation of the CPU (Cent...
research
06/09/2020

Fair Bayesian Optimization

Given the increasing importance of machine learning (ML) in our lives, a...
research
06/22/2020

PipeSim: Trace-driven Simulation of Large-Scale AI Operations Platforms

Operationalizing AI has become a major endeavor in both research and ind...
research
12/19/2013

Structure-Aware Dynamic Scheduler for Parallel Machine Learning

Training large machine learning (ML) models with many variables or param...
