Rosella: A Self-Driving Distributed Scheduler for Heterogeneous Clusters

10/28/2020
by Qiong Wu, et al.

Large-scale interactive web services and advanced AI applications make sophisticated decisions in real time by executing a massive number of computation tasks on thousands of servers. Task schedulers, which often operate in heterogeneous and volatile environments, require high throughput, i.e., scheduling millions of tasks per second, and low latency, i.e., incurring minimal scheduling delays for millisecond-level tasks. Scheduling is further complicated by other users' workloads in a shared system, other background activities, and the diverse hardware configurations inside datacenters. We present Rosella, a new self-driving, distributed approach for task scheduling in heterogeneous clusters. Our system automatically learns the compute environment and adjusts its scheduling policy in real time. The solution provides high throughput and low latency simultaneously, because it runs in parallel on multiple machines with minimal coordination and performs only simple operations for each scheduling decision. Our learning module monitors total system load and uses this information to dynamically determine the optimal strategy for estimating the backends' compute power. Our scheduling policy generalizes power-of-two-choice algorithms to handle heterogeneous workers, reducing the maximum queue length from the O(log n) obtained by prior algorithms to O(log log n). We implement a Rosella prototype and evaluate it with a variety of workloads. Experimental results show that Rosella significantly reduces task response times and adapts quickly to environment changes.
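
The idea of extending power-of-two-choices to heterogeneous workers can be illustrated with a minimal sketch. This is not the Rosella implementation: it assumes one plausible generalization, in which two candidate workers are sampled with probability proportional to their estimated speed and the task goes to the candidate with the shorter queue. The names `pick_worker`, `assign_tasks`, `queue_lengths`, and `speeds` are hypothetical, and task departures are ignored for simplicity.

```python
import random


def pick_worker(queue_lengths, speeds, rng=random):
    """Pick a worker index for one task (illustrative assumption, not Rosella's code).

    Two candidates are drawn with replacement, weighted by estimated speed,
    and the candidate with the shorter queue is returned.
    """
    workers = range(len(speeds))
    a, b = rng.choices(workers, weights=speeds, k=2)
    return a if queue_lengths[a] <= queue_lengths[b] else b


def assign_tasks(num_tasks, speeds, rng=random):
    """Place num_tasks tasks one by one and return the resulting queue lengths.

    Tasks never complete in this toy model, so queues only grow.
    """
    queues = [0] * len(speeds)
    for _ in range(num_tasks):
        queues[pick_worker(queues, speeds, rng)] += 1
    return queues


if __name__ == "__main__":
    # Hypothetical cluster: one slow, one medium, and one fast worker.
    print(assign_tasks(1000, [1.0, 2.0, 4.0], random.Random(0)))
```

Each decision touches only two queues and needs no global coordination, which is the property that keeps per-task scheduling cost low in schemes of this kind.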


research
10/12/2020

RackSched: A Microsecond-Scale Scheduler for Rack-Scale Computers (Technical Report)

Low-latency online services have strict Service Level Objectives (SLOs) ...
research
07/22/2022

A Hardware-based HEFT Scheduler Implementation for Dynamic Workloads on Heterogeneous SoCs

Non-uniform performance and power consumption across the processing elem...
research
07/22/2023

Online Container Scheduling for Low-Latency IoT Services in Edge Cluster Upgrade: A Reinforcement Learning Approach

In Mobile Edge Computing (MEC), Internet of Things (IoT) devices offload...
research
01/31/2023

Scheduling Inference Workloads on Distributed Edge Clusters with Reinforcement Learning

Many real-time applications (e.g., Augmented/Virtual Reality, cognitive ...
research
11/09/2020

Dynamic Power Control for Time-Critical Networking with Heterogeneous Traffic

Future wireless networks will be characterized by heterogeneous traffic ...
research
06/02/2023

ODIN: Overcoming Dynamic Interference in iNference pipelines

As an increasing number of businesses becomes powered by machine-learnin...
research
05/23/2022

Lotaru: Locally Estimating Runtimes of Scientific Workflow Tasks in Heterogeneous Clusters

Many scientific workflow scheduling algorithms need to be informed about...
