Hugo: A Cluster Scheduler that Efficiently Learns to Select Complementary Data-Parallel Jobs

02/14/2021
by Lauritz Thamsen, et al.

Distributed data processing systems like MapReduce, Spark, and Flink are popular tools for analyzing large datasets with cluster resources. Yet users often overprovision resources for their data processing jobs, and the resource usage of these jobs typically fluctuates considerably. Therefore, multiple jobs usually get scheduled onto the same shared resources to increase the resource utilization and throughput of clusters. However, job runtimes and the utilization of shared resources can vary significantly depending on the specific combinations of co-located jobs. This paper presents Hugo, a cluster scheduler that continuously learns how efficiently jobs share resources, considering metrics for the resource utilization and interference among co-located jobs. The scheduler combines offline grouping of jobs with online reinforcement learning to provide a scheduling mechanism that efficiently generalizes from specific monitored job combinations yet also adapts to changes in workloads. Our evaluation of a prototype shows that the approach can reduce the runtimes of exemplary Spark jobs on a YARN cluster by up to 12.5%, and waiting times can be bounded.
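To make the combination of offline grouping and online learning more concrete, the following is a minimal sketch and not Hugo's actual algorithm. It assumes jobs are grouped offline by their dominant resource and that co-location quality is learned per pair of groups with an epsilon-greedy incremental update. All names (`dominant_resource_group`, `CoLocationLearner`, `colocation_reward`) and the reward formula are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def dominant_resource_group(profile):
    """Offline grouping (assumed): label a job by its most-used resource.

    `profile` maps resource names to average utilization in [0, 1].
    """
    return max(profile, key=profile.get)


def colocation_reward(utilization, interference, weight=1.0):
    """Hypothetical reward: high utilization is good, interference is bad."""
    return utilization - weight * interference


class CoLocationLearner:
    """Learns how well pairs of job groups share resources (illustrative only)."""

    def __init__(self, epsilon=0.1, step_size=0.2, seed=0):
        self.epsilon = epsilon          # probability of exploring a random job
        self.step_size = step_size      # learning rate of the incremental update
        self.rng = random.Random(seed)
        self.value = defaultdict(float)  # (group_a, group_b) -> estimated reward

    @staticmethod
    def _key(a, b):
        # Co-location is symmetric, so store each pair in sorted order.
        return (a, b) if a <= b else (b, a)

    def select(self, running_groups, queued):
        """Pick the next job from `queued`, a list of (job_id, group) tuples."""
        if not queued:
            return None
        if self.rng.random() < self.epsilon:
            return self.rng.choice(queued)[0]

        def score(item):
            _, group = item
            if not running_groups:
                return 0.0
            total = sum(self.value[self._key(group, g)] for g in running_groups)
            return total / len(running_groups)

        return max(queued, key=score)[0]

    def update(self, group_a, group_b, reward):
        """Move the estimate for a group pair toward the observed reward."""
        k = self._key(group_a, group_b)
        self.value[k] += self.step_size * (reward - self.value[k])


if __name__ == "__main__":
    learner = CoLocationLearner(epsilon=0.0)
    learner.update("cpu", "io", colocation_reward(0.9, 0.1))
    learner.update("cpu", "cpu", colocation_reward(0.6, 0.8))
    print(learner.select(["cpu"], [("job-a", "cpu"), ("job-b", "io")]))
```

Because estimates are kept per group pair rather than per job pair, an observation of one monitored combination carries over to other jobs in the same groups, while the constant step size lets the estimates drift as workloads change.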

research
06/01/2022

Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview

Many organizations routinely analyze large datasets using systems for di...
research
05/22/2019

Two stage cluster for resource optimization with Apache Mesos

As resource estimation for jobs is difficult, users often overestimate t...
research
01/31/2018

Henge: Intent-driven Multi-Tenant Stream Processing

We present Henge, a system to support intent-based multi-tenancy in mode...
research
07/30/2019

DeepPlace: Learning to Place Applications in Multi-Tenant Clusters

Large multi-tenant production clusters often have to handle a variety of...
research
05/12/2020

DMR API: Improving cluster productivity by turning applications into malleable

Adaptive workloads can change on-the-fly the configuration of their jobs...
research
11/14/2018

Anomaly Analysis for Co-located Datacenter Workloads in the Alibaba Cluster

In warehouse-scale cloud datacenters, co-locating online services and of...
research
04/25/2016

Do the Hard Stuff First: Scheduling Dependent Computations in Data-Analytics Clusters

We present a scheduler that improves cluster utilization and job complet...
