
Leveraging Reinforcement Learning for Task Resource Allocation in Scientific Workflows

by Jonathan Bader, et al.
Berlin Institute of Technology (Technische Universität Berlin)

Scientific workflows are designed as directed acyclic graphs (DAGs) and consist of multiple dependent task definitions. They are executed over large amounts of data, often resulting in thousands of tasks with heterogeneous compute requirements and long runtimes, even on cluster infrastructures. To optimize workflow performance, sufficient resources, e.g., CPU and memory, need to be provisioned for the respective tasks. Typically, workflow systems rely on user-provided resource estimates, which are known to be highly error-prone and can result in over- or underprovisioning. While overprovisioning leads to high resource wastage, underprovisioning can result in long runtimes or even failed tasks. In this paper, we propose two reinforcement learning approaches, based on gradient bandits and Q-learning, respectively, to minimize resource wastage by selecting suitable CPU and memory allocations. We provide a prototypical implementation in the well-known scientific workflow management system Nextflow, evaluate our approaches with five workflows, and compare them against the default resource configurations and a state-of-the-art feedback loop baseline. The evaluation shows that our reinforcement learning approaches significantly reduce resource wastage compared to the default configuration. Further, our approaches also reduce the allocated CPU hours compared to the state-of-the-art feedback loop by 6.79
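
To make the gradient bandit idea from the abstract more concrete, the following is a minimal sketch of a textbook gradient bandit choosing among a few candidate memory allocations for a single task type. It is not the authors' Nextflow implementation: the class `GradientBandit`, the candidate allocations, the assumed peak usage of 3 GB, and the reward function `toy_reward` are all hypothetical and chosen only for illustration.

```python
import math
import random


class GradientBandit:
    """Textbook gradient bandit over a fixed set of candidate allocations (arms)."""

    def __init__(self, arms, step_size=0.1, seed=0):
        self.arms = list(arms)
        self.step_size = step_size
        self.preferences = [0.0] * len(self.arms)  # one preference per arm
        self.mean_reward = 0.0                     # running average used as baseline
        self.steps = 0
        self.rng = random.Random(seed)

    def policy(self):
        """Softmax over the preferences (shifted by the max for numerical stability)."""
        peak = max(self.preferences)
        weights = [math.exp(h - peak) for h in self.preferences]
        total = sum(weights)
        return [w / total for w in weights]

    def select(self):
        """Sample an arm index according to the current softmax policy."""
        return self.rng.choices(range(len(self.arms)), weights=self.policy())[0]

    def update(self, chosen, reward):
        """Raise the chosen arm's preference if the reward beats the baseline."""
        probs = self.policy()
        self.steps += 1
        self.mean_reward += (reward - self.mean_reward) / self.steps
        advantage = reward - self.mean_reward
        for i in range(len(self.arms)):
            indicator = 1.0 if i == chosen else 0.0
            self.preferences[i] += self.step_size * advantage * (indicator - probs[i])


def toy_reward(allocated_gb, peak_usage_gb):
    """Hypothetical reward: fixed penalty for a failed task, else minus the wasted fraction."""
    if allocated_gb < peak_usage_gb:
        return -1.0  # assumption: underprovisioned tasks fail
    return -(allocated_gb - peak_usage_gb) / allocated_gb


def train(peak_usage_gb=3.0, arms=(1, 2, 4, 8, 16), rounds=5000, seed=0):
    """Run the bandit against a simulated task with a constant peak memory usage."""
    bandit = GradientBandit(arms, seed=seed)
    for _ in range(rounds):
        i = bandit.select()
        bandit.update(i, toy_reward(bandit.arms[i], peak_usage_gb))
    return bandit


if __name__ == "__main__":
    trained = train()
    for arm, prob in zip(trained.arms, trained.policy()):
        print(f"{arm:>2} GB: {prob:.3f}")
```

In this toy setting the reward favors the smallest allocation that still covers the task's peak usage, so the policy should shift probability toward that arm over time. A Q-learning variant would additionally condition the choice on a state, for example the outcome of previous executions, which this sketch omits.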

