Scheduling Multi-Server Jobs with Sublinear Regrets via Online Learning

05/11/2023
by Hailiang Zhao, et al.

Multi-server jobs, which request multiple computing devices and hold onto them throughout their execution, now dominate modern computing clusters. When allocating computing devices to such jobs, it is difficult to balance the parallel computation gains against the internal communication overheads: first, the computation gain does not grow linearly with the number of allocated devices; second, the device type that dominates the communication overhead varies across job types. To achieve a better gain-overhead tradeoff, we formulate a cumulative reward maximization program and design an online algorithm, OGASched, to schedule multi-server jobs. The reward of a job is defined as the parallel computation gain aggregated over the allocated devices minus a penalty on the dominant communication overhead. OGASched allocates computing devices to each arriving job along the ascending direction of the reward gradient. With concave rewards, OGASched achieves a regret bound that is the best known so far, growing sublinearly with both the number of job types and the length of the time horizon. OGASched also runs several parallel sub-procedures to accelerate its computation, which greatly reduces its complexity. We conduct extensive trace-driven simulations to validate the performance of OGASched. The results demonstrate that it outperforms four widely used heuristics by 11.33%, 7.75%, 13.89%, and 13.44%, respectively.
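The core idea of allocating devices along the ascending direction of the reward gradient can be sketched as projected online gradient ascent. The reward form (diminishing square-root gain minus a linear communication penalty), the parameter values, and the box projection below are illustrative assumptions, not the paper's exact model:

```python
import numpy as np

def grad_reward(x, gain, penalty):
    """Gradient of a hypothetical concave reward
    sum_d gain[d] * sqrt(x[d]) - penalty . x, i.e. diminishing
    parallel computation gain minus a linear communication penalty."""
    return gain / (2.0 * np.sqrt(x + 1e-9)) - penalty

def project(x, cap):
    """Project the allocation back to the feasible box [0, cap]
    (a simple stand-in for the paper's feasibility constraints)."""
    return np.clip(x, 0.0, cap)

def oga_step(x, gain, penalty, cap, eta):
    """One online-gradient-ascent step: move the allocation along the
    ascending direction of the reward gradient, then project."""
    return project(x + eta * grad_reward(x, gain, penalty), cap)

# Toy run: 3 device types, per-type capacity 8.
gain = np.array([4.0, 2.0, 1.0])      # marginal computation gains
penalty = np.array([0.1, 0.3, 0.5])   # communication penalties
x = np.ones(3)                        # initial allocation
for _ in range(1000):
    x = oga_step(x, gain, penalty, cap=8.0, eta=0.5)
```

With a concave reward, repeating this step for each arriving job yields the sublinear-regret behavior described above; here the high-gain, low-penalty device types end up with larger allocations.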

