SLAQ: Quality-Driven Scheduling for Distributed Machine Learning

02/13/2018
by Haoyu Zhang, et al.

Training machine learning (ML) models with large datasets can incur significant resource contention on shared clusters. This training typically involves many iterations that continually improve the quality of the model. Yet in exploratory settings, better models can be obtained faster by directing resources to jobs with the most potential for improvement. We describe SLAQ, a cluster scheduling system for approximate ML training jobs that aims to maximize the overall job quality. When allocating cluster resources, SLAQ explores the quality-runtime trade-offs across multiple jobs to maximize system-wide quality improvement. To do so, SLAQ leverages the iterative nature of ML training algorithms by collecting quality and resource usage information from concurrent jobs, and then generating highly tailored quality-improvement predictions for future iterations. Experiments show that SLAQ achieves an average quality improvement of up to 73% and an average delay reduction of up to 44% on a large set of ML training jobs, compared to resource fairness schedulers.
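
The idea of directing resources to the jobs with the most potential for improvement can be illustrated with a small sketch. This is not SLAQ's implementation: the predictor `predicted_loss_drop`, its geometric-decay model, the `(loss, floor, rate)` job description, and the function `allocate` are all hypothetical names and assumptions introduced here for illustration. The sketch hands out cores one at a time to whichever job is predicted to gain the most from one more core.

```python
import heapq


def predicted_loss_drop(current_loss, floor, rate, cores):
    """Hypothetical predictor (an assumption, not SLAQ's model).

    Returns the loss reduction expected over the next scheduling epoch
    if the job runs on `cores` cores. The gap between the current loss
    and its floor is assumed to shrink geometrically with each core.
    """
    return (current_loss - floor) * (1.0 - (1.0 - rate) ** cores)


def allocate(jobs, total_cores):
    """Greedily assign each core to the job with the largest marginal gain.

    `jobs` maps a job name to a (current_loss, floor, rate) tuple.
    Returns a dict mapping each job name to its number of cores.
    """
    alloc = {name: 0 for name in jobs}

    # heapq is a min-heap, so gains are stored negated to pop the largest.
    heap = []
    for name, (loss, floor, rate) in jobs.items():
        first_gain = predicted_loss_drop(loss, floor, rate, 1)
        heapq.heappush(heap, (-first_gain, name))

    for _ in range(total_cores):
        if not heap:
            break
        _, name = heapq.heappop(heap)
        alloc[name] += 1

        # Marginal gain of giving this job one more core than it now has.
        loss, floor, rate = jobs[name]
        n = alloc[name]
        next_gain = (predicted_loss_drop(loss, floor, rate, n + 1)
                     - predicted_loss_drop(loss, floor, rate, n))
        heapq.heappush(heap, (-next_gain, name))

    return alloc


if __name__ == "__main__":
    # A freshly started job with a large loss and a nearly converged one.
    jobs = {"new": (1.0, 0.0, 0.3), "converged": (0.05, 0.0, 0.3)}
    print(allocate(jobs, 4))
```

Under this assumed model the job that has just started receives the cores, because each extra core reduces its loss far more than it would for the nearly converged job. A fair scheduler would instead split the cores evenly, regardless of how much each job can still improve.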


