Tarema: Adaptive Resource Allocation for Scalable Scientific Workflows in Heterogeneous Clusters

11/09/2021
by Jonathan Bader, et al.

Scientific workflow management systems like Nextflow support large-scale data analysis by abstracting away the details of scientific workflows. In these systems, workflows consist of several abstract tasks, whose instances run in parallel and transform input partitions into output partitions. Resource managers like Kubernetes execute such workflow tasks on cluster infrastructures. However, these resource managers only consider the number of CPUs and the amount of available memory when assigning tasks to resources; they do not consider hardware differences beyond these numbers, even though computational speed and memory access rates can differ significantly. We propose Tarema, a system for allocating task instances to heterogeneous cluster resources during the execution of scalable scientific workflows. First, Tarema profiles the available infrastructure with a set of benchmark programs and groups cluster nodes with similar performance. Second, Tarema uses online monitoring data of tasks, assigning labels to tasks depending on their resource usage. Third, Tarema uses the node groups and task labels to dynamically assign task instances evenly to resources based on resource demand. Our evaluation of a prototype implementation for Kubernetes, using five real-world Nextflow workflows from the popular nf-core framework and two 15-node clusters consisting of different virtual machines, shows a mean reduction of isolated job runtimes by 19.8% [...] managers and 4.54% [...] cluster usage. Moreover, executing two long-running workflows in parallel and on restricted resources shows that Tarema is able to reduce the runtimes even more while providing a fair cluster usage.
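The three steps named in the abstract (group nodes by benchmarked performance, label tasks by monitored resource usage, then assign instances evenly within the matching group) can be illustrated with a minimal Python sketch. This is not Tarema's implementation: the abstract does not specify the grouping method, the labeling rule, or the balancing rule, so the sketch assumes a simple rank-based split of nodes, fixed usage thresholds, and a least-loaded pick. All names (`Node`, `cpu_score`, `group_nodes`, `label_task`, `assign`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_score: float  # hypothetical benchmark result, higher means faster
    load: int = 0     # number of task instances currently assigned


def group_nodes(nodes, n_groups):
    """Rank nodes by benchmark score and split them into n_groups groups.

    Group 1 holds the slowest nodes, group n_groups the fastest.
    Assumption: an equal-size rank split stands in for whatever
    grouping of similar nodes the real system performs.
    """
    ranked = sorted(nodes, key=lambda n: n.cpu_score)
    groups = {g: [] for g in range(1, n_groups + 1)}
    for i, node in enumerate(ranked):
        groups[i * n_groups // len(ranked) + 1].append(node)
    return groups


def label_task(cpu_usage, thresholds):
    """Map a monitored CPU usage value to a label in 1..len(thresholds)+1.

    Assumption: fixed thresholds; a real system could derive them
    from the monitoring data of earlier task instances.
    """
    return 1 + sum(1 for t in thresholds if cpu_usage > t)


def assign(task_label, groups):
    """Pick the least-loaded node in the group matching the task label."""
    node = min(groups[task_label], key=lambda n: (n.load, n.name))
    node.load += 1
    return node.name
```

With six nodes split into three groups, a task whose monitored usage exceeds both thresholds receives the highest label and is placed on the fastest group, alternating between that group's nodes as their load grows. Matching labels to groups one-to-one keeps the sketch short; a scheduler under contention would also need a fallback when the preferred group is full.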


research
08/16/2022

Reshi: Recommending Resources for Scientific Workflow Tasks on Heterogeneous Infrastructures

Scientific workflows typically comprise a multitude of different process...
research
12/05/2018

ADARES: Adaptive Resource Management for Virtual Machines

Virtual execution environments allow for consolidation of multiple appli...
research
08/15/2023

Prism: Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems

Ensuring the reliability of cloud systems is critical for both cloud ven...
research
11/22/2022

Leveraging Reinforcement Learning for Task Resource Allocation in Scientific Workflows

Scientific workflows are designed as directed acyclic graphs (DAGs) and ...
research
12/22/2021

Rightsizing Clusters for Time-Limited Tasks

In conventional public clouds, designing a suitable initial cluster for ...
research
05/23/2022

Lotaru: Locally Estimating Runtimes of Scientific Workflow Tasks in Heterogeneous Clusters

Many scientific workflow scheduling algorithms need to be informed about...
research
07/06/2022

A Kubernetes 'Bridge' operator between cloud and external resources

Many scientific workflows require dedicated compute resources, including...
