A Case for Sampling Based Learning Techniques in Coflow Scheduling

08/25/2021
by   Akshay Jajoo, et al.

Coflow scheduling improves the performance of data-intensive applications by improving their network performance. State-of-the-art online coflow schedulers in essence approximate the classic Shortest-Job-First (SJF) scheduling by learning the coflow size online. In particular, they use multiple priority queues to accomplish two goals simultaneously: to sieve long coflows from short coflows, and to schedule short coflows with high priority. Such a mechanism incurs high overhead in learning the coflow size: moving a large coflow across the queues delays small coflows and other large coflows, and moving similar-sized coflows across the queues results in inadvertent round-robin scheduling. We propose Philae, a new online coflow scheduler that exploits the spatial dimension of coflows, i.e., the fact that a coflow has many flows, to drastically reduce the overhead of coflow size learning. Philae pre-schedules sampled flows of each coflow and uses their sizes to estimate the average flow size of the coflow. It then resorts to Shortest Coflow First, where the notion of shortest is determined using the learned coflow sizes and coflow contention. We show that the sampling-based learning is robust to flow size skew and has the added benefit of much improved scalability due to reduced interactions between the coordinator and local agents. Our evaluation on an Azure testbed, using a publicly available production cluster trace from Facebook, shows that compared to the prior art Aalo, Philae reduces the coflow completion time (CCT) in the average (P90) case by 1.50x (8.00x) on a 150-node testbed and 2.72x (9.78x) on a 900-node testbed. Evaluation using additional traces further demonstrates Philae's robustness to flow size skew.
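
The sampling idea described above can be illustrated with a minimal Python sketch. It is not Philae's implementation: the names `Coflow`, `estimate_coflow_size`, and `shortest_coflow_first` are hypothetical, the sample count is an arbitrary choice, the coflow size is assumed to be the estimated average flow size multiplied by the number of flows, and the coflow contention term mentioned in the abstract is left out entirely.

```python
import random
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Coflow:
    """Hypothetical coflow record; flow sizes are only known once a flow finishes."""
    name: str
    flow_sizes: List[float]


def estimate_coflow_size(coflow: Coflow, num_samples: int, rng: random.Random) -> float:
    """Estimate total coflow size from a few sampled ("pre-scheduled") flows.

    Assumption for this sketch: size = mean sampled flow size * number of flows.
    """
    k = min(num_samples, len(coflow.flow_sizes))
    if k == 0:
        return 0.0
    sampled = rng.sample(coflow.flow_sizes, k)  # stands in for running pilot flows
    return mean(sampled) * len(coflow.flow_sizes)


def shortest_coflow_first(coflows: List[Coflow], num_samples: int = 2, seed: int = 0) -> List[Coflow]:
    """Order coflows by estimated size, smallest first (contention is ignored here)."""
    rng = random.Random(seed)
    estimates = {c.name: estimate_coflow_size(c, num_samples, rng) for c in coflows}
    return sorted(coflows, key=lambda c: estimates[c.name])
```

In this sketch each coflow is sized once from its sampled flows, rather than being demoted step by step through priority queues as it sends more data, which is the learning overhead the abstract attributes to queue-based schedulers.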

research
11/16/2021

Saath: Speeding up CoFlows by Exploiting the Spatial Dimension

Coflow scheduling improves data-intensive application performance by imp...
research
08/24/2021

The Case for Task Sampling based Learning for Cluster Job Scheduling

The ability to accurately estimate job runtime properties allows a sched...
research
07/10/2019

Scheduling With Inexact Job Sizes: The Merits of Shortest Processing Time First

It is well known that size-based scheduling policies, which take into ac...
research
06/26/2020

QCluster: Clustering Packets for Flow Scheduling

Flow scheduling is crucial in data centers, as it directly influences us...
research
03/01/2022

An Adaptable and Agnostic Flow Scheduling Approach for Data Center Networks

Cloud applications have reshaped the model of services and infrastructur...
research
03/24/2022

Size-based scheduling vs fairness for datacenter flows: a queuing perspective

Contrary to the conclusions of a recent body of work where approximate s...
research
08/20/2023

Eventually-Consistent Federated Scheduling for Data Center Workloads

Data center schedulers operate at unprecedented scales today to accommod...
