Leveraging Reinforcement Learning for Task Resource Allocation in Scientific Workflows

11/22/2022
by Jonathan Bader, et al.

Scientific workflows are designed as directed acyclic graphs (DAGs) and consist of multiple dependent task definitions. They are executed over large amounts of data, often resulting in thousands of tasks with heterogeneous compute requirements and long runtimes, even on cluster infrastructures. To optimize workflow performance, sufficient resources, e.g., CPU and memory, need to be provisioned for the respective tasks. Typically, workflow systems rely on user-provided resource estimates, which are known to be highly error-prone and can result in over- or underprovisioning. While overprovisioning leads to high resource wastage, underprovisioning can result in long runtimes or even failed tasks. In this paper, we propose two reinforcement learning approaches, based on gradient bandits and Q-learning, respectively, that minimize resource wastage by selecting suitable CPU and memory allocations. We provide a prototypical implementation in the well-known scientific workflow management system Nextflow, evaluate our approaches with five workflows, and compare them against the default resource configurations and a state-of-the-art feedback loop baseline. The evaluation shows that our reinforcement learning approaches significantly reduce resource wastage compared to the default configuration. Further, our approaches also reduce the allocated CPU hours compared to the state-of-the-art feedback loop by 6.79
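To make the gradient bandit idea mentioned in the abstract more concrete, the following is a minimal sketch of the textbook gradient bandit algorithm (softmax over action preferences with an average-reward baseline) applied to choosing a memory allocation for a single task type. It is not the authors' implementation: the candidate sizes, the `toy_reward` function, the assumed peak usage of 3.5 GB, and all class and parameter names are hypothetical and chosen only for illustration. The abstract does not specify the reward design actually used.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into selection probabilities."""
    m = max(prefs)  # subtract the maximum for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


class GradientBandit:
    """Textbook gradient bandit with an average-reward baseline."""

    def __init__(self, actions, step_size=0.1, seed=0):
        self.actions = list(actions)      # hypothetical candidate allocations
        self.step_size = step_size
        self.prefs = [0.0] * len(self.actions)
        self.avg_reward = 0.0
        self.steps = 0
        self.rng = random.Random(seed)

    def select(self):
        """Sample an action index according to the softmax policy."""
        probs = softmax(self.prefs)
        return self.rng.choices(range(len(self.actions)), weights=probs, k=1)[0]

    def update(self, chosen, reward):
        """Raise or lower preferences depending on reward versus baseline."""
        self.steps += 1
        self.avg_reward += (reward - self.avg_reward) / self.steps
        probs = softmax(self.prefs)
        advantage = reward - self.avg_reward
        for i in range(len(self.prefs)):
            indicator = 1.0 if i == chosen else 0.0
            self.prefs[i] += self.step_size * advantage * (indicator - probs[i])


def toy_reward(allocated_gb, peak_usage_gb):
    """Assumed reward: penalize failures and the wasted memory fraction."""
    if allocated_gb < peak_usage_gb:
        return -1.0  # underprovisioned: treat the task as failed
    return -(allocated_gb - peak_usage_gb) / allocated_gb


# Simulate repeated executions of one task type with a fixed peak usage.
bandit = GradientBandit(actions=[2, 4, 8, 16], step_size=0.2, seed=42)
for _ in range(2000):
    idx = bandit.select()
    bandit.update(idx, toy_reward(bandit.actions[idx], peak_usage_gb=3.5))

best_idx = max(range(len(bandit.prefs)), key=bandit.prefs.__getitem__)
best = bandit.actions[best_idx]
print("preferred allocation (GB):", best)
```

In this toy setting the smallest allocation that still covers the assumed peak usage receives the highest reward, so the policy shifts its probability mass toward it. A real deployment would have to derive rewards from measured task metrics and handle usage that varies with input size, which this sketch ignores.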

research
09/13/2023

Lotaru: Locally Predicting Workflow Task Runtimes for Resource Management on Heterogeneous Infrastructures

Many resource management techniques for task scheduling, energy and carb...
research
11/09/2021

Tarema: Adaptive Resource Allocation for Scalable Scientific Workflows in Heterogeneous Clusters

Scientific workflow management systems like Nextflow support large-scale...
research
01/20/2023

Adaptive Resource Allocation for Workflow Containerization on Kubernetes

In a cloud-native era, the Kubernetes-based workflow engine enables work...
research
05/04/2020

MARS: Multi-Scalable Actor-Critic Reinforcement Learning Scheduler

In this paper, we introduce a new scheduling algorithm MARS based on a c...
research
08/16/2022

Reshi: Recommending Resources for Scientific Workflow Tasks on Heterogeneous Infrastructures

Scientific workflows typically comprise a multitude of different process...
research
03/04/2019

Resource-sharing Policy in Multi-tenant Scientific Workflow as a Service Platform

Increasing adoption of scientific workflows in the community has urged f...
research
02/11/2022

Global Optimization of Data Pipelines in Heterogeneous Cloud Environments

Modern production data processing and machine learning pipelines on the ...
