Toward Efficient Online Scheduling for Distributed Machine Learning Systems

08/06/2021
by Menglu Yu et al.

Recent years have witnessed a rapid growth of distributed machine learning (ML) frameworks, which exploit the massive parallelism of computing clusters to expedite ML training. However, the proliferation of distributed ML frameworks also introduces many unique technical challenges in computing system design and optimization. In a networked computing cluster that supports a large number of training jobs, a key question is how to design efficient scheduling algorithms to allocate workers and parameter servers across different machines to minimize the overall training time. Toward this end, in this paper, we develop an online scheduling algorithm that jointly optimizes resource allocation and locality decisions. Our main contributions are three-fold: i) We develop a new analytical model that considers both resource allocation and locality; ii) Based on an equivalent reformulation and observations on the worker-parameter server locality configurations, we transform the problem into a mixed packing and covering integer program, which enables approximation algorithm design; iii) We propose a meticulously designed approximation algorithm based on randomized rounding and rigorously analyze its performance. Collectively, our results contribute to the state of the art of distributed ML system optimization and algorithm design.
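The third contribution rounds a fractional solution of the mixed packing and covering integer program into an integral schedule. As a minimal sketch of the randomized-rounding idea (not the paper's actual algorithm), the snippet below assumes a hypothetical fractional LP solution and illustrative packing (capacity) and covering (demand) constraints, rounds each variable to 1 with probability equal to its fractional value, and retries until the rounded solution is feasible:

```python
import random

# Hypothetical data, for illustration only (not from the paper).
# Packing constraint rows A, bounds b:  A @ x <= b   (e.g., machine capacities)
# Covering constraint rows C, bounds d: C @ x >= d   (e.g., job demands)
x_frac = [0.7, 0.4, 0.9, 0.2]      # assumed fractional LP-relaxation solution
A, b = [[1, 1, 0, 1]], [2]         # x0 + x1 + x3 <= 2
C, d = [[0, 1, 1, 1]], [1]         # x1 + x2 + x3 >= 1

def feasible(x):
    """Check all packing and covering constraints for a 0/1 vector x."""
    pack_ok = all(sum(a * v for a, v in zip(row, x)) <= bi
                  for row, bi in zip(A, b))
    cover_ok = all(sum(c * v for c, v in zip(row, x)) >= di
                   for row, di in zip(C, d))
    return pack_ok and cover_ok

def randomized_round(x_frac, max_tries=1000, seed=0):
    """Round each x_j to 1 with probability x_frac[j]; retry until feasible."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        x = [1 if rng.random() < xj else 0 for xj in x_frac]
        if feasible(x):
            return x
    return None  # no feasible rounding found within the try budget

print(randomized_round(x_frac))
```

In the paper's setting the rounded variables would encode worker and parameter-server placements, and the analysis bounds how often such rounding violates capacities; this toy version simply resamples until feasibility holds.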


Related research

- 07/16/2022: On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention
  Powered by advances in deep learning (DL) techniques, machine learning a...
- 05/28/2021: A Sum-of-Ratios Multi-Dimensional-Knapsack Decomposition for DNN Resource Scheduling
  In recent years, to sustain the resource-intensive computational needs f...
- 02/02/2022: GADGET: Online Resource Optimization for Scheduling Ring-All-Reduce Learning Jobs
  Fueled by advances in distributed deep learning (DDL), recent years have...
- 01/03/2018: Online Job Scheduling in Distributed Machine Learning Clusters
  Nowadays large-scale distributed machine learning systems have been depl...
- 02/03/2020: Dynamic Parameter Allocation in Parameter Servers
  To keep up with increasing dataset sizes and model complexity, distribut...
- 01/09/2019: Interim Report on Adaptive Event Dispatching in Serverless Computing Infrastructures
  Serverless computing is an emerging service model in distributed computi...
- 08/10/2023: Isolated Scheduling for Distributed Training Tasks in GPU Clusters
  Distributed machine learning (DML) technology makes it possible to train...
