MARS: Multi-Scalable Actor-Critic Reinforcement Learning Scheduler

by Betis Baheri, et al.

In this paper, we introduce MARS, a new scheduling algorithm based on a cost-aware, multi-scalable reinforcement learning approach. MARS serves as an intermediate layer between the HPC resource manager and the user application workflow: it ensembles the pre-generated models from users' workflows and decides on the most suitable optimization strategy. A whole workflow application is split into several optimized subtasks. Then, based on a pre-defined resource management plan, a reward is generated after executing a scheduled task. Lastly, MARS updates the Deep Neural Network (DNN) model for future use. MARS is designed to optimize the existing models through the reinforcement mechanism. It can adapt to a shortage of training samples and optimize performance by itself, especially by combining small tasks together or by switching between pre-built scheduling strategies such as Backfilling and SJF and then choosing the most suitable approach. We tested MARS using different real-world workflow traces. MARS can achieve between 5 better performance while comparing to the other approaches.




I Introduction

Over the last few years, Scientific Workflow Management Systems (SWMSs) have become valuable tools for carrying out complex scientific experiments [WorkflowManagementForWhom]. Active research in scientific workflow management has produced a large number of systems that scientists can use in practice, addressing a wide variety of their needs. Current workflow management systems, combined with a resource management system, offer generic services for task management, distribution, monitoring, and failure management on various types of platforms [Distributed4683116, 6655672]. Although executing workflow systems on cloud and HPC infrastructures has been studied, and many services offer various capabilities, optimized and sophisticated scheduler systems that can collaborate with existing resource management on HPC and multiple cloud systems have yet to be investigated.

Different workflows require different optimizations. For instance, CPU-intensive tasks need to be optimized to enhance instruction throughput, whereas I/O-intensive tasks should be scheduled to minimize data transfer between different infrastructures. These challenges can be solved partially through meticulously designed heuristics. In recent research on HPC scheduling algorithms, the most common designs either apply an optimal solution for heuristic models or require changes at the system level and replace the existing resource manager [HPC_Scheduling]. This process often has to be repeated if the workload or the metric of interest changes. Besides, replacing the entire resource manager is not practical, especially if the resource manager has to be changed every time researchers find a more efficient solution. Similarly, existing users should not have to change their code and upgrade an entire workflow to obtain optimized performance; instead, this can be achieved by simply choosing a different configuration.

In summary, the major challenges in existing scheduling systems are:

  • The underlying systems are complex, and low-level scheduling is done by the resource manager of each particular system. For instance, in cluster scheduling, the running time of a task varies with data locality, server characteristics, interactions with other tasks, and interference on shared resources such as CPU caches and network bandwidth [Etsion2005ASS].

  • HPC systems usually have their own resource management tool to utilize the entire system, e.g., [slurm_ref]. However, these tools are not optimized for dynamic changes or for optimization based on prediction and cost.

  • Practical resource managers have to make online decisions with noisy inputs and work well under diverse conditions. Trade-offs between CPU, memory, I/O, and cost can mean different things for individual workflows.

  • Switching between different systems can be challenging, because requirements and configuration differ from system to system.

  • Lastly, HPC performance metrics of interest, such as CPU optimization at the scale of the entire system, can be extremely hard to optimize in a principled manner [10.1145/3005745.3005750].

To overcome these issues, we aim to design a generic scheduling system, based on state-of-the-art algorithms, that works with the existing resource manager on an HPC system. Later on, this method can be expanded to support various cloud infrastructures such as Google Cloud [GoogleCloud], Amazon AWS [Amazon_AWS], and Microsoft Azure [MS_Azure]. Regardless of the targeted system, MARS can calculate the best solution and assign it to the targeted resource manager on top of heterogeneous hardware infrastructure and complex, varied workflow patterns. Instead of replacing the entire resource manager, using an extension that communicates with the existing one can be more efficient and more practical.

In this paper we introduce a multi-scalable actor-critic reinforcement learning scheduler (MARS) to address the following:

  • MARS presents a multi-scalable scheduling policy that ensembles A3C reinforcement learning [2019arXiv190906040P] and heuristic policies.

  • MARS optimizes scheduling performance through task parallelism and classification, using DAGs and graph comparison.

  • MARS requires minimal changes to the existing resource manager on HPC systems, such as Slurm [slurm_ref].

The rest of the paper is organized as follows. Section II discusses HPC workflow requirements and descriptions, along with server parameters and our motivation. Section III explains how MARS integrates previous heuristic algorithms with asynchronous actor-critic reinforcement learning, gives a detailed explanation of the reinforcement learning approach, and describes how we select the most suitable scheduling algorithm. Section IV discusses our implementation methods. Section V presents our results, compares them to previous work, discusses our observations, and explains why MARS is the most suitable scheduler compared to other approaches. Section VI reviews previous literature in this area, and Section VII concludes.

II Background and Motivation

II-A Background

A workflow application is a set of tasks or instructions executed on some arbitrary input in a particular order as steps. Figure 1 shows a simple workflow. Although a workflow description can take any form, we chose to comply with the Simple Workflow Service (SWF) [Amazon_AWS] in order to use a more standard version. Each task may depend on others, or it may be executed individually and independently.

Fig. 1: Workflow Description

To improve the performance of workflows and create more meaningful relations between tasks, steps, and requirements, we use a directed acyclic graph (DAG) built from each component. Moreover, a second graph can be built from the requirements of the entire workflow. A workflow can have one or multiple requirement DAGs, depending on its complexity. Similarly, two DAGs represent the target system resources and the scheduling requirements. System resources are queried from the existing resource manager, as explained in detail in section III-A1. Scheduling requirements are the number of CPUs per node and/or for the entire workflow, the amount of memory, I/O, and the cost, based on desired parameters such as I/O throughput, CPU usage, and GPU usage. More details are provided in section III-B1.

Mu'alem et al. [932708] introduced backfilling on top of the well-known first come first serve (FCFS) algorithm to overcome the fragmentation problem. Backfilling uses dynamic partitioning to schedule tasks on a distributed system so as to maximize performance. There are two major implementations, conservative backfilling and EASY backfilling. Both can cause starvation. However, by defining a window size interval and assuming a normal distribution of input, better performance can be achieved [932708]. Furthermore, Fan et al. [10.1145/3307681.3325401] introduced multi-objective selection scheduling based on fixed window sizes. Their method uses FCFS with a fixed interval to avoid starvation, an approach adopted in a couple of reinforcement learning proposals [Domeniconi2019CuSHCS] in which training time grows exponentially with the size of the dataset.
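To make the backfilling idea concrete, the following is a minimal sketch of one scheduling pass in the EASY style: only the blocked head of the queue gets a reservation, and later jobs may start early if they do not delay it. The `Job` fields, the function name, and the single-resource (node count) model are assumptions made for illustration; this is not the implementation from [932708] or from MARS.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int    # nodes requested
    runtime: int  # user-estimated runtime

def easy_backfill(queue, free_nodes, running, now=0):
    """Return the names of the jobs that can start at time `now`.

    queue:   jobs in arrival (FCFS) order
    running: (end_time, nodes) pairs for jobs already executing
    """
    queue, running, started = list(queue), list(running), []
    # Plain FCFS: start jobs from the head of the queue while they fit.
    while queue and queue[0].nodes <= free_nodes:
        job = queue.pop(0)
        free_nodes -= job.nodes
        running.append((now + job.runtime, job.nodes))
        started.append(job.name)
    if not queue:
        return started
    # Reservation for the blocked head job: the earliest time (shadow time)
    # at which enough nodes will be free for it.
    head, avail, shadow_time = queue[0], free_nodes, None
    for end_time, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            shadow_time = end_time
            break
    if shadow_time is None:  # the head job can never fit on this cluster
        return started
    extra = avail - head.nodes  # nodes still free at shadow time once the head starts
    # Backfill: a later job may start now if it fits and does not delay the head.
    for job in queue[1:]:
        if job.nodes > free_nodes:
            continue
        done_in_time = now + job.runtime <= shadow_time
        if done_in_time or job.nodes <= extra:
            free_nodes -= job.nodes
            if not done_in_time:
                extra -= job.nodes
            started.append(job.name)
    return started

# 10-node cluster, 6 nodes busy until t=10, so 4 nodes are free now.
queue = [Job("A", 8, 5), Job("B", 2, 5), Job("C", 4, 20), Job("D", 2, 20)]
started = easy_backfill(queue, free_nodes=4, running=[(10, 6)])
```

In this example job A is blocked until t=10; B finishes before then and D only uses nodes that A does not need, so both are backfilled, while C would delay A and has to wait.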

On the other hand, based on recent research [2019arXiv190906040P], a reinforcement learning network requires at least 512 samples to train and test, so for small workflows HPC administrators need to combine small tasks together or use heuristic algorithms. Because reinforcement learning uses a vector-based image transition, workflows need to be the same size; otherwise the algorithm is incapable of training. Padding missing data with zeros is a possible solution. However, considering the high cost and low performance, it is better to use backfilling instead of reinforcement learning for simple workflows and small tasks. To avoid long and expensive training in RL, researchers [Job_scheduling] introduced a fixed window size and a time limit on the training set. Limiting the training size avoids complications in model and workflow comparison. The downside is that the model is not trained properly. To minimize this limitation, the size of the window interval must be chosen accordingly.

II-B Motivation

Conventional Reinforcement Learning (RL) schedulers require replacing the existing HPC resource management tools, and in most cases users have to adapt and change their workflows to satisfy the new system's requirements. Some workflows require complicated RL scheduling, while others are simple enough for traditional scheduling algorithms.

One limitation of HPC reinforcement learning algorithms is that the agent only has control over one action at a time, and its knowledge of environment variables is limited. Another limitation is that the entire training set has to be specific to the HPC system; otherwise, the trained model would not be optimized. To address these issues, we take a subset of a heuristic dataset and a synthetic dataset to create a more suitable network model for training. Complementary to previous approaches, we also read the actual results from the HPC resource manager Slurm [slurm_ref] to validate our results. By combining these two approaches, we generate a more accurate model, and we keep the two previous models so that we can roll back in case of a negative reward.

The agent's decision is based on the policy $\pi$, which is defined as a probability distribution over actions, $\pi : \pi(s, a) \rightarrow [0, 1]$; $\pi(s, a)$ is the probability that action $a$ is taken in state $s$. In general, there are many possible {state, action} pairs. Because of limited resources in practice, it is impossible to store the policy in tabular form. However, we can use a function approximator. Many forms of function approximators can be used to represent the policy; for example, a linear combination of state and action features is a popular choice. Deep Neural Networks (DNNs) [10.1145/3335484.3335513] are a better choice for function approximation because they automate feature extraction and have a record of previous successes [2019arXiv190906040P].
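As a small illustration of a function approximator replacing a policy table, the sketch below scores each action with a linear combination of features and turns the scores into probabilities with a softmax. The feature vectors and the parameter vector `theta` are made-up values for illustration only; MARS itself uses a deep neural network rather than this linear form.

```python
import math

def softmax_policy(theta, features):
    """Return pi(s, a) for every action a, using linear scores theta . phi(s, a)."""
    scores = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical feature vectors phi(s, a) for three scheduling actions in one state.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.5, -0.5]  # policy parameters (would be learned in practice)
probs = softmax_policy(theta, features)
```

The policy is now described by the two numbers in `theta` instead of one table entry per state and action pair, which is what makes large state spaces tractable.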

Existing HPC resource managers struggle with large numbers of tasks. For better optimization, we need either to replace them with a more specific algorithm depending on the workflow or to use an intermediate layer that communicates with the resource manager. Replacing the resource manager is not only time consuming but also requires knowledge of the workflows. Since replacing it is costly and compromises support for legacy workflows, our approach does not change the lower levels and instead uses the existing tools to increase performance [HPC_RM_Book].

In our approach, we introduce an intermediate layer on top of the existing HPC resource manager, in order to avoid replacing the entire system and to avoid depending on one solution for all possible cases. The user specifies how many parameter servers and nodes to use, including the amount of required resources (e.g. CPU, memory, GPU, I/O), and then submits the workflow to MARS. MARS chooses between simple backfilling and an advanced reinforcement learning algorithm, then assigns tasks by executing the appropriate default command on the HPC system.

III MARS Design and Implementation

III-A MARS System Overview

Fig. 2: MARS System Overview

Figure 2 shows the overall system structure of MARS. We assume that the workflow description and the generated DAG are provided to the scheduling system. One example of previous work is BEEFlow [841636600], an in-situ-analysis-enabled workflow management system that supports multiple platforms using HPC containers [PriedhorskyR17].

The reinforcement learning module in MARS contains a scheduler agent, an environment, and a neural network based on server parameter inputs and the reward value from the HPC environment. At each time step $t$ the agent observes the parameters of the HPC state $s_t$, then chooses an action $a_t$. Following that action, the environment's state proceeds to $s_{t+1}$ and the agent receives reward $r_t$. The state transitions and rewards are stochastic and are assumed to have the Markov property; i.e. the state transition probabilities and rewards depend only on the state of the environment and the action taken by the agent [1996cs5103K].

In general, we take the steps shown in Algorithm 1 to optimize each workflow.

Result: Saved Model M
Input: Created DAG from workflows
Input: Decision D
Input: Policy
Input: Available HPC Resources from Existing Resource Manager (SLURM)
if Task and dependency  then
      Compare and Parallel Tasks
      D = ()
      = (D)
      M = MARS(,,)
      return M
end if
Algorithm 1 MARS Overall Algorithm

The corresponding benefits of our design are: 1) each workflow can be executed independently from the others; 2) the HPC system does not depend on one algorithm; 3) workflows can run simultaneously alongside each other. From the perspective of users' workflows, there is no restriction on the workflow description when supporting new algorithms: since we generate a DAG model from the workflow description, we can easily change the algorithm based on values returned from the resource manager. On top of that, because the entire optimization is done independently of the actual HPC system, we can update saved models with the most suitable parameters.

III-A1 Algorithm Selection

Resource management in an HPC system is based on the utilization of CPU, memory, and I/O. In addition, taking into consideration the cost of each task execution on different cloud infrastructures can help scientists minimize the overall cost.

Traditionally, schedulers optimize tasks along only one dimension. A simple back-filling scheduler is an example: it tries to maximize CPU utilization [932708].

As a next step, more sophisticated schedulers use modern machine learning algorithms to optimize tasks based on CPU, memory, and I/O. However, in most of those methods the scheduler either sacrifices one dimension for another or finds an average solution. More advanced schedulers use a reward function to update the trained model; the updated model adapts to recent input and learns from previous executions [Delimitrou2014QuasarRA].

In our proposal, MARS decides between a simple algorithm such as back-filling and a more complicated online algorithm such as asynchronous actor-critic reinforcement learning to execute tasks. By creating a model based on the RL-A3C algorithm and updating that model with a technique similar to the one previously introduced by D. Zhang et al. [2019arXiv191008925Z], we can reuse a properly trained model for similar workflows. However, training the system is highly correlated with the size and number of tasks in a given workflow.

Based on our observation, small workflows such as a simple RNA search, 3 would result in an inefficient model. On the other hand, complex and large workflows in RL such as Blast, 3 would cause over-fitting of the network [2018arXiv180406893Z]. This phenomenon would result in an inefficient reward value and model. In our approach, by combining a time window with a custom loss function, the reward value and the model generated from a workflow are more accurate than in previous approaches.


III-B Policy Model and Algorithm

Our policy model depends on the size of the workflow. For small workflows that can be optimized with a simple FCFS algorithm, MARS bypasses the RL algorithm and creates simple scheduling tasks ready to be executed on the HPC system. On the other hand, when a workflow contains large subsections of tasks and its running time is hours to days, MARS selects an RL algorithm based on previously saved models.

The reinforcement learning module in MARS contains a scheduler agent, an environment, and a neural network based on server parameter inputs and the reward value from the HPC environment. At each time step $t$ the agent observes the parameters of the HPC state $s_t$, then chooses an action $a_t$. Following that action, the environment's state proceeds to $s_{t+1}$ and the agent receives reward $r_t$. The state transitions and rewards are stochastic and are assumed to have the Markov property; i.e. the state transition probabilities and rewards depend only on the state of the environment and the action taken by the agent [1996cs5103K].

In most RL approaches, learning is done by performing gradient descent on the policy parameters. The key idea in policy gradient methods is to estimate the gradient by observing the trajectories of executions obtained by following the policy [2016arXiv161106256B]. Similar to the Monte Carlo method [4736059], the agent samples multiple trajectories and uses the empirically computed cumulative discounted reward. However, this naive approach usually finds a local maximum instead of the global maximum. To overcome this limitation, we follow other researchers [2016arXiv161106256B] and use RL with the Actor Critic Algorithm (ACA) in MARS.
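The empirically computed cumulative discounted reward mentioned above can be sketched in a few lines. The discount factor `gamma` and the reward lists are illustrative assumptions, not values from the paper.

```python
def discounted_returns(rewards, gamma=0.99):
    """Cumulative discounted reward G_t for every step t of one trajectory."""
    returns, g = [], 0.0
    for r in reversed(rewards):  # accumulate from the last step backwards
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def monte_carlo_value(trajectories, gamma=0.99):
    """Estimate the start-state value as the mean return over sampled trajectories."""
    starts = [discounted_returns(t, gamma)[0] for t in trajectories]
    return sum(starts) / len(starts)

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
value = monte_carlo_value([[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]], gamma=0.5)
```

Averaging over more trajectories lowers the variance of the estimate, which is the property the Monte Carlo approach relies on.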

III-B1 Reinforcement Learning Objects

The objective function for policy gradients is defined as $J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T-1} r_t\right]$. More specifically [2015arXiv150902971L], the objective is to learn a policy that maximizes the cumulative future reward received from any given time $t$ until the terminal time $T$. In our approach these parameters are read from the existing resource manager, and the action that optimizes task execution is taken by MARS [2018arXiv180100690T]. Since we want to optimize the policy towards CPU, memory, and cost utilization [2019arXiv191008925Z], we take the derivative of the objective with respect to the policy parameter $\theta$ [10.5555/3044805.3044850, 2017arXiv170303864S]: $\nabla_\theta J(\theta) = \nabla_\theta\, \mathbb{E}\left[\sum_{t=0}^{T-1} r_t\right]$. The policy function $\pi_\theta$ is a neural network based on the chosen parameters. Most often we consider the expectation $\mathbb{E}[f(x)] = \sum_x P(x) f(x)$ to optimize the entire workflow, where $P(x)$ represents the probability of occurrence of the expected value of finishing tasks on the HPC system and $f(x)$ is a function denoting the value of $x$. In order to derive the policy we define the function as $J(\theta) = \sum_{t=i}^{T-1} P(s_t, a_t \mid \tau)\, r_t$, where $i$ is an arbitrary starting point in a trajectory [2016arXiv160202867T] and $P(s_t, a_t \mid \tau)$ is the probability of the occurrence of $s_t, a_t$ given the trajectory $\tau$. Using well-known machine learning techniques [2015arXiv150205477S, 2017arXiv170706347S, simple_gradient, 10.5555/3009657.3009806] and a mapping between HPC server parameters and RL properties, we can redesign reinforcement learning to support HPC systems. Instead of replacing the entire system [Mujoco] at the bare-metal level, we read the parameter servers from the resource manager.

III-B2 Reinforcement Learning Using Actor Critic

As mentioned before, the high variability in log probabilities and cumulative reward values can produce noisy gradients, causing unstable learning and skewing the policy distribution in a non-optimal direction. On top of that, when trajectories have a cumulative reward of 0, the gradient carries no information about which actions should become more or less probable. This causes instability and slow convergence of the policy gradient method. To improve policy gradients, the first method is to subtract a baseline $b(s_t)$ from the cumulative reward in $\nabla_\theta J(\theta)$, which makes the cumulative reward term smaller and yields smaller gradients and more stable updates. The baseline functions are summarized in [10.1137/S0363012901385691, PETERS20081180].
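For reference, the baseline-subtracted policy gradient and its common variants can be written as follows. These are the standard forms from the policy gradient literature, given here as an assumption about what the summary refers to; the symbols $G_t$, $Q_w$, $A_w$, $V_w$, and $\delta_t$ are notation introduced for this sketch.

```latex
\begin{align}
\nabla_\theta J(\theta)
  &= \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\Big]
  && \text{(REINFORCE with baseline)} \\
  &= \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_w(s_t, a_t)\Big]
  && \text{(Q actor-critic)} \\
  &= \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_w(s_t, a_t)\Big]
  && \text{(advantage actor-critic)} \\
  &= \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \delta_t\Big]
  && \text{(TD actor-critic)}
\end{align}
% G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k is the cumulative discounted reward from step t,
% and \delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t) is the temporal-difference error.
```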

The Q value can be learned by parameterizing the Q function with a neural network. With this we can define the Actor Critic method, in which the Critic estimates the value function, which can be the Q-value or the state value (V-value). In our approach we take the state value from an existing resource manager such as Slurm: MARS communicates with Slurm CTL to calculate the reward values based on available resources such as CPU, memory, and I/O. Algorithm 2 shows the MARS RL-A3C policy [10.1137/S0363012901385691].

Result: HPC Reward Estimation
Input: HPC Scheduling Action based on State Parameters $\pi(a \mid s, \theta)$
Input: HPC CPU, Memory, I/O, Cost Values $\check{v}(s, w)$
Algorithm parameters: step sizes $\alpha^{\theta} > 0$, $\alpha^{w} > 0$
Initialize policy parameter $\theta$ and state-value weights $w$ (e.g., to 0)
Set weights to 0 at beginning, Initializing C as the Cost Probability added to evaluation;
while for each epochs do
       Initialize $S$ (first state of episode);
       while $S$ is not terminal (for each time step) do
             Take action $A$, Observe $S'$, $R$;
             $\delta \leftarrow R + \gamma \check{v}(S', w) - \check{v}(S, w)$ (if $S'$ is terminal, then $\check{v}(S', w) = 0$);
       end while
end while
Algorithm 2 MARS RL-A3C Policy
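The loop structure of Algorithm 2 matches the textbook one-step actor-critic method, which can be sketched in tabular form as below. This is a single-threaded illustration under stated assumptions: the toy environment `toy_step`, the tabular softmax policy, the step sizes, and the omission of the cost probability C are all choices made for this sketch, not part of MARS, which uses neural networks and asynchronous workers.

```python
import math
import random

def softmax(prefs):
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def one_step_actor_critic(step, n_states, n_actions, start, terminal,
                          episodes=500, gamma=0.9,
                          alpha_theta=0.1, alpha_w=0.1, max_steps=100):
    theta = [[0.0] * n_actions for _ in range(n_states)]  # actor: action preferences
    w = [0.0] * n_states                                   # critic: state values
    for _ in range(episodes):
        s, discount = start, 1.0
        for _ in range(max_steps):
            if s == terminal:
                break
            probs = softmax(theta[s])
            a = random.choices(range(n_actions), weights=probs)[0]
            s_next, r = step(s, a)
            v_next = 0.0 if s_next == terminal else w[s_next]
            delta = r + gamma * v_next - w[s]   # TD error
            w[s] += alpha_w * delta             # critic update
            for b in range(n_actions):          # actor update: gradient of log softmax
                grad = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += alpha_theta * discount * delta * grad
            discount *= gamma
            s = s_next
    return theta, w

def toy_step(state, action):
    """Hypothetical environment: action 0 runs the next task (reward -1),
    action 1 idles for one time slot (reward -2) without making progress."""
    if action == 0:
        return state + 1, -1.0
    return state, -2.0

random.seed(0)
theta, w = one_step_actor_critic(toy_step, n_states=3, n_actions=2, start=0, terminal=2)
run_probability = softmax(theta[0])[0]
```

Because idling always costs more than running, the TD error penalizes the idle action more strongly, and the learned policy shifts its probability toward action 0.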

To improve on the existing Actor Critic method, we can compare the value of taking a specific action with the average value over actions at the given state. This defines the Advantage value in A2C, $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$. To avoid constructing two additional neural networks, we can use the relationship between the Q and V values from the Bellman optimality equation, which gives $Q(s_t, a_t) = \mathbb{E}[r_t + \gamma V(s_{t+1})]$; rewriting the advantage gives $A(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$. Applying the asynchronous method and putting everything together, we have $\nabla_\theta J(\theta) \sim \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)$ as the Asynchronous Advantage Actor Critic (A3C) [10.1137/S0363012901385691, 2019arXiv191008925Z, 2019arXiv190906040P].

As explained earlier, the computed reward value can have different meanings. The critic is a state-value function; MARS can optimize based on parameter server values read from Slurm or any other resource manager, and the final values can be used to determine whether there was an improvement.

Figure 3 shows the policy structure of MARS. The user's workflow description can be in any standard format, such as the Common Workflow Language (CWL) [CWL], the Workflow Description Language (WDL) [WDL], or the Standard Workload Format (SWF) [FEITELSON20142967]. As mentioned in section II, the DAG is first generated from the workflow description, which contains the tasks to execute and the dependencies between them. In our examples, a workflow can be as simple as one task, have multiple dependent parts as in the Blast example, or be a linear search workflow. The generated data is then fed to our categorizing module, which determines the depth of the workflow based on the description, the graph comparison algorithm, and heuristic-generated models.

The algorithm selector module decides whether to use RL-A3C or basic FCFS. As mentioned before, for simple workflows that require only limited execution time, if no other workflows are running and the description requires most of the system resources, running RL-A3C would only add overhead. However, if MARS can combine multiple independent workflows, it switches back to RL-A3C and builds the most suitable model for that specific type. We kept traditional algorithms such as FCFS and Backfilling.

To support legacy workflows, to save training time, and to cover HPC systems that are not equipped with GPUs, a small optimization based on known graph combining algorithms [8723466, 298205] runs next to combine parallel tasks. Compared to standard reinforcement learning techniques, we use a graph search algorithm to identify the best possible model for an optimal outcome, along with user input as a variable to differentiate between the CPU, memory, I/O, and cost of each task. The generated model is used to train the system for optimization and to feed back output.

Next, MARS queries the available resources from Slurm. Knowing the current state of the system and the workflow description, MARS creates a state description based on the job type, the number of time slots run, the remaining epochs, the allocated resources on the HPC system, the allocated number of workers from the workflow description, and the allocated number of parameter servers from the HPC system. Based on the previous discussion, we build the policy and value networks, calculate a baseline, and select an action; then, using Slurm CTL, we launch a batch of tasks on the HPC system (the action).
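One way to picture the state description and the resulting action is the sketch below. The class name, field names, and vector encoding are hypothetical and only mirror the items listed above; the command is built as an argument list for Slurm's `sbatch` using its standard `--job-name`, `--ntasks`, `--cpus-per-task`, and `--mem` options, and is not executed here. How MARS actually encodes the state or invokes Slurm is not specified in the text.

```python
from dataclasses import dataclass

@dataclass
class SchedulerState:
    job_type: str
    time_slots_run: int
    remaining_epochs: int
    cpus_per_worker: int
    memory_mb: int
    num_workers: int
    num_parameter_servers: int

    def as_vector(self, job_types):
        """Encode the state as a flat numeric vector (one-hot job type + counters)."""
        one_hot = [1.0 if self.job_type == t else 0.0 for t in job_types]
        return one_hot + [float(self.time_slots_run), float(self.remaining_epochs),
                          float(self.cpus_per_worker), float(self.memory_mb),
                          float(self.num_workers), float(self.num_parameter_servers)]

def sbatch_command(script, state, job_name="mars-task"):
    """Build (but do not run) an sbatch invocation for the chosen action."""
    return ["sbatch",
            f"--job-name={job_name}",
            f"--ntasks={state.num_workers}",
            f"--cpus-per-task={state.cpus_per_worker}",
            f"--mem={state.memory_mb}M",
            script]

state = SchedulerState("blast", 3, 10, 4, 8192, 2, 1)
vector = state.as_vector(["blast", "rna"])
command = sbatch_command("run.sh", state)
```

The returned list could be handed to `subprocess.run` on a system where Slurm is installed; keeping command construction separate from execution makes the action easy to log and test.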

In addition, MARS needs to decide the best split between tasks and parallelism based on available resources. Knowing that each workflow can be divided into sub-workflows along search paths, MARS categorizes tasks into groups and, after this separation, generates a deep neural network based on user input along with CPU, memory, and I/O values.

Consider a cluster with $l$ resource types (e.g., CPU, memory, I/O, cost). Each separated branch of tasks from a workflow is an input to the MARS scheduling system. Similar to prior work [Multi-resource], the execution time is known before the scheduling period; in more detail, the resource profile of each task $j$ is given by the vector $r_j = (r_{j,1}, \ldots, r_{j,l})$ of resource requirements, together with the duration of each task execution.
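A resource profile of this kind can be checked against the cluster's remaining capacity with a component-wise comparison. The resource order, units, and helper names below are assumptions for illustration.

```python
RESOURCES = ("cpu", "memory_gb", "io_mbps", "cost")  # l = 4 resource types (assumed order)

def fits(demand, available):
    """True if every component of the task's profile r_j fits in what is available."""
    return all(d <= a for d, a in zip(demand, available))

def allocate(demand, available):
    """Return the remaining capacity after placing the task, or raise if it does not fit."""
    if not fits(demand, available):
        raise ValueError("task does not fit in the available resources")
    return tuple(a - d for d, a in zip(demand, available))

available = (16, 64, 500, 100)
task_profile = (4, 16, 100, 10)  # r_j for one hypothetical task
remaining = allocate(task_profile, available)
```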

Finally, using Slurm CTL, MARS queries the remaining available resources, the currently executing tasks, previous execution times, and previously corrupted tasks. Based on the returned values, MARS calculates the reward value and, using the baseline, updates the neural network. To overcome training overhead and inefficient models, MARS creates a base network from heuristic workflow data; if that data is absent from the database, we simply generate a similar workflow with smaller tasks to train the network.

Fig. 3: MARS Policy Network

III-B3 Graph Comparison and Parallel Optimizer

In most RL-based schedulers, the generated workflow graph and cost are not considered. The deep neural network is based purely on workflow input data or previous executions. However, if we consider the graph generated from the workflows and use search algorithms to find similarities between individual tasks, we can predict and categorize each task by its CPU, memory, and I/O intensity. In addition, we can also consider the cost of each execution: based on a predefined table, we can calculate how much each individual task would cost to run on a given cloud infrastructure.

As studied before, Directed Acyclic Graphs (DAGs) in practice have tens or hundreds of stages with different requirements and execution times. Depending on dependencies and requirements, each task can either be executed in parallel or must wait for other tasks to complete. This complexity makes scheduling challenging; to address it, MARS needs to execute tasks in parallel as much as possible without wasting CPU or memory [Graph_Example].

Each workflow can be defined in CWL [CWL], WDL [WDL], SWF [Amazon_AWS], or a similar format, and we can generate DAGs from the requirements and dependencies in any of these standards. A workflow's requirements can vary with the HPC system, but its dependencies and sub-task order do not change, so we can compare generated DAGs with each other based on sub-workflow order and dependencies.

As mentioned before, graph comparison is algorithmically hard. Similar to the approach of C. Delimitrou et al. [298205], we use scale-up and scale-out methods to achieve the categorization. Assuming that workflow DAGs can be categorized and compared with each other based on size and resources, MARS tries to combine independent tasks into a single parallel task.
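Combining independent tasks into parallel groups can be illustrated with a level-by-level topological sort (Kahn's algorithm): every task in a level has all of its dependencies in earlier levels, so tasks within one level can run together. The task names and the function name are hypothetical, and this sketch ignores resource limits, which MARS would also have to respect.

```python
from collections import defaultdict

def parallel_groups(tasks, deps):
    """Split a DAG into levels of mutually independent tasks.

    tasks: iterable of task names
    deps:  (parent, child) pairs meaning child must wait for parent
    """
    indegree = {t: 0 for t in tasks}
    children = defaultdict(list)
    for parent, child in deps:
        children[parent].append(child)
        indegree[child] += 1
    level = sorted(t for t in indegree if indegree[t] == 0)
    groups, seen = [], 0
    while level:
        groups.append(level)
        seen += len(level)
        ready = []
        for task in level:
            for child in children[task]:
                indegree[child] -= 1
                if indegree[child] == 0:  # all parents are already scheduled
                    ready.append(child)
        level = sorted(ready)
    if seen != len(indegree):
        raise ValueError("dependency graph contains a cycle")
    return groups

tasks = ["split", "blast1", "blast2", "merge"]
deps = [("split", "blast1"), ("split", "blast2"),
        ("blast1", "merge"), ("blast2", "merge")]
groups = parallel_groups(tasks, deps)
```

Here the two search tasks end up in the same group and could be submitted as a single parallel task, while `split` and `merge` remain sequential steps.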

III-B4 Decision Making

MARS bases its decisions on the compared DAGs and heuristic data. Using the heuristic data model, the DAG classification, or the size of the workflow, MARS chooses the most suitable algorithm, from basic back-filling to RL-A3C, to execute an arbitrary workflow 3. In addition to combining CPU, memory, and I/O into a general neural network, we generate an individual network based on graph comparison and user input for RL-A3C candidate workflows. Complementary to the previous method, a user-supplied variable determines the intensity of requirements, and to achieve a better result the logs from the target HPC system are used in the evaluation.

For RL-A3C workflows, the first initialization and task execution have to use a more general deep neural network and a simpler reward function because of the lack of training data; after a couple of executions, a more detailed network can replace it. After that, MARS takes the output from the HPC system and calculates the overall mean reward. A positive reward value identifies the desired settings and causes MARS to continue optimizing on the same network for similar workflows. On the other hand, a cumulative negative reward causes a feature selection change in the network and an update of the loss function.

Algorithm 3 shows the basic decision making of the MARS scheduler. Our design uses workflow size and configuration to decide on the algorithm policy. In our experiments we observed that workflows smaller than 512 are not sufficient to run directly on RL-A3C; to address this, we either combine the next workflow with the previous one or run the heuristic algorithm. In the first part of the algorithm we try to combine the next workflow with the current one; if the dimensions are not compatible or there is no next workflow, MARS chooses the heuristic algorithm. For large workflows, to avoid over-fitting the network, we split them into sub-workflows and execute the RL-A3C algorithm. In each step we save the RL-A3C model for future use.

Result: Best Suitable Action
Input: Workflow & Workflow size
Initializing workflow task size, Queue, Task, Model: , Q, , M
if  <MEDIAN then
       if  & is compatible (RL-A3C vector dimensions) with  then
             >MEDIAN ; ; ;
             if  <MIN  then
                   ; SJF(Q);
                   ; UNICEF(Q);
             end if
       end if
       while  >MAX do
             ; ;
       end while
      ; ;
end if
Algorithm 3 MARS Decision Making Policy
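
The decision policy described in the prose above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `decide` and its parameters are hypothetical, the thresholds 512 and 20000 are taken from the sizes reported in this paper, the separate MEDIAN and MIN thresholds of Algorithm 3 are collapsed into one lower bound, and large workflows are split into equal parts.

```python
from typing import List, Optional, Tuple

MIN_SIZE = 512     # below this, RL-A3C training is not worthwhile (from the text)
MAX_SIZE = 20000   # above this, the workflow is split (assumed upper bound)


def decide(size: int,
           next_size: Optional[int] = None,
           compatible: bool = False,
           min_size: int = MIN_SIZE,
           max_size: int = MAX_SIZE) -> Tuple[str, List[int]]:
    """Return (policy, sub-workflow sizes) for one incoming workflow.

    `compatible` stands for the check that the next workflow has the same
    RL-A3C vector dimensions as the current one.
    """
    if size < min_size:
        # Small workflow: try to merge it with the next one.
        if next_size is not None and compatible:
            size += next_size
        # No partner, incompatible partner, or still too small: use a heuristic.
        if size < min_size:
            return "heuristic", [size]

    # Large workflow: split into equally sized sub-workflows of at most max_size.
    parts = -(-size // max_size)          # ceiling division
    base, extra = divmod(size, parts)
    chunks = [base + 1] * extra + [base] * (parts - extra)
    return "rl-a3c", chunks


if __name__ == "__main__":
    print(decide(300))                                   # too small, no partner
    print(decide(300, next_size=400, compatible=True))   # merged, then RL-A3C
    print(decide(45000))                                 # split into three parts
```

In this sketch the caller would run a heuristic such as SJF when the policy is "heuristic", and otherwise train or reuse the saved RL-A3C model on each returned sub-workflow.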

III-B5 Cost Consideration

By introducing an individual reward function for each layer of the neural network, we can optimize on each dimension and update our network based on that particular part. Deep neural network training uses projection points to determine the weights between steps; in our model, training on predicted categories not only increases the speed of training but also gives a better result.

As mentioned before, training a model on simple workflows not only requires system administrators to change the entire backbone of the scheduling system, but the training time is also usually longer than the workflow itself. To resolve these issues, MARS either combines the input from small workflows or runs the heuristic algorithm. Because RL-A3C uses the Adam optimizer and soft-max to calculate the probability, adding the cost factor as a dimension would complicate optimizing the workflow. To solve this issue, we apply the cost as a probability after training the model values. To re-arrange the scheduled tasks, we take the mean of the first epoch's batch, multiply it by the cost factor, and compare it to the next iteration. Adding the cost factor in the last step of the neural network gives an extra advantage over optimization.

From another point of view, using the cost probability in the training model and modifying the loss function is another solution; however, the model generated from the workflow would then be specific to the user's cost preference instead of presenting a more general solution. In that case, the next workflow similar to the previous one would have to be considered with the same cost factors.
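
The idea of applying cost after training, as a probability on top of the soft-max output, can be illustrated with a short Python sketch. The exact combination rule is not given in the text, so the element-wise product followed by renormalization below is an assumption, and the names `softmax` and `apply_cost` are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable soft-max over the policy network's output logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_cost(action_probs: List[float], cost_probs: List[float]) -> List[float]:
    """Reweight trained action probabilities by per-action cost probabilities.

    cost_probs holds values in [0, 1]; a lower value marks a less preferred
    (more expensive) action. The trained model itself is left unchanged.
    """
    weighted = [p * c for p, c in zip(action_probs, cost_probs)]
    total = sum(weighted)
    if total == 0.0:
        return list(action_probs)   # every action was ruled out: fall back
    return [w / total for w in weighted]


if __name__ == "__main__":
    probs = softmax([2.0, 1.0, 0.0])             # the policy prefers action 0
    adjusted = apply_cost(probs, [0.1, 1.0, 1.0])
    print(adjusted)                               # action 1 is now preferred
```

Because the cost enters only at this last step, the same trained model can be reused with a different cost vector for another user, which matches the generality argument made above.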

IV Implementation


The MARS algorithms are implemented using TensorFlow [45381] and OpenAI Gym [gym]. For the training process we used the Proximal Policy Optimization (PPO) algorithm from the OpenAI Spinning Up library [SpinningUp2018, 2017arXiv170706347S].

We used both a randomly generated data set based on real workflows and actual real-world data from different sources to evaluate the proposed solution. The real-world workflows are based on the SWF archive [FEITELSON20142967], as shown in Table I.

Name  CPU Month(s) Date
SDSC IBM-SP2 128 24 1998
SDSC IBM-Blue 1152 32 2000
High Performance Computing Center 240 42 2002
Argonne National Laboratory Intrepid 163840 8 2009
Synthetic_G001 256 12 2019
Synthetic_G002 1024 6 2019
TABLE I: List of Workload Traces

In our experiments we compare MARS with previous work, starting with the heuristic job scheduling algorithms shown in Table II. Table II lists the heuristic scheduling policies integrated into MARS, which can improve the performance of legacy and modern workflows. MARS is compared with two well-known policies: First Come First Served (FCFS) [fcfs1989], where tasks are scheduled in arrival order, and Shortest Job First (SJF) [RemziBook], where tasks with smaller processing times are scheduled ahead of the other tasks. Two other comparative policies are WFP3 and UNICEF [5289206], which are based on the processing time, the requested number of cores, and the waiting time of the tasks. WFP3 favors shorter and older tasks over large ones without starvation, and UNICEF favors small tasks by using a fast-turnaround policy for performance enhancement. Policies F1, F2, F3, and F4 [10.1145/3126908.3126955] represent the nonlinear machine-learning-based scheduling algorithms for minimizing the average bounded slowdown of tasks. Based on our observations, switching to a known heuristic algorithm alongside RL-A3C increases performance and saves a noticeable amount of training time for basic legacy workflows.

Name Function
TABLE II: Heuristic Scheduling Policy Used
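
As a small illustration of the two simplest policies described above, FCFS and SJF can each be written as an ordering of the waiting queue. The sketch below is illustrative only: the `Task` record and its field names are hypothetical, and WFP3, UNICEF, and F1 to F4 are omitted because their scoring functions are not reproduced in this text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    submit: float    # arrival (submission) time
    runtime: float   # requested processing time


def fcfs(queue: List[Task]) -> List[Task]:
    """First Come First Served: order tasks by arrival time."""
    return sorted(queue, key=lambda t: t.submit)


def sjf(queue: List[Task]) -> List[Task]:
    """Shortest Job First: order tasks by processing time."""
    return sorted(queue, key=lambda t: t.runtime)


if __name__ == "__main__":
    queue = [Task("a", 0.0, 50.0), Task("b", 1.0, 5.0), Task("c", 2.0, 20.0)]
    print([t.name for t in fcfs(queue)])
    print([t.name for t in sjf(queue)])
```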

In an HPC system, workflow tasks may arrive continuously. To train the model using RL-A3C, we save the training results after a predefined time window and then leave the actor-critic algorithm to improve the model. After a basic model has been built with the RL algorithm, the actor-critic part starts evaluating the network. This strategy creates a training batch for the workflow. If the batch size is too small, MARS's decision module has two options: if the remaining workflow size is sufficient, MARS combines sub-workflows; otherwise, MARS switches back to the back-filling or FCFS algorithm.

In our experiments, running basic workflows on RL-A3C takes a significant amount of time to train and causes inefficiency in the HPC system. To overcome this issue, a combination of legacy and RL-A3C algorithms is more appropriate. Another issue in RL-A3C is over-fitting the model because of a large batch size and the exponential growth of the number of possible tasks. To solve this issue, we introduce a median layer to create sub-workflows. Based on our observations, the best training set sizes are between 512 and 20000, running for 2000 to 4000 epochs of RL-A3C. Because smaller or larger batch sizes could introduce an issue, the MARS decision module combines or splits the sub-workflows.

As described in Section III, in RL-A3C the state is the input of the DNN agent and is represented as a vector containing the available resources and the pending tasks. In HPC the number of pending and arriving tasks can vary, but the DNN requires a fixed-size input vector. To overcome this issue we took the same approach as previous work and append zeros to the end of the vector [2018arXiv181001963M].
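
The zero-padding step can be sketched as follows. The layout of the state vector (one entry for free processors followed by two features per pending task) and the truncation of queues longer than `max_pending` are assumptions made for illustration; the text only states that zeros are appended to reach a fixed size.

```python
from typing import List, Sequence, Tuple


def build_state(free_procs: int,
                pending: Sequence[Tuple[float, int]],
                max_pending: int) -> List[float]:
    """Build a fixed-length state vector for the DNN agent.

    pending holds (requested runtime, requested processors) per waiting task.
    The result always has 1 + 2 * max_pending entries.
    """
    state = [float(free_procs)]
    for runtime, procs in pending[:max_pending]:   # assumed truncation
        state.extend([float(runtime), float(procs)])
    # Append zeros so that every state has the same length.
    state.extend([0.0] * (1 + 2 * max_pending - len(state)))
    return state


if __name__ == "__main__":
    print(build_state(8, [(10.0, 2)], max_pending=3))
```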

V Evaluation

Fig. 4: MARS Policy Algorithm Comparison

In this section we present the results obtained by running the MARS scheduler in a simulated environment using data traces generated from HPC data centers. First we describe the environment setup and the workflow traces used in our experiments; next we evaluate different algorithms and compare them to our approach. We discuss the performance evaluation under different conditions and workloads of the HPC environment. Our simulator was inspired by the similar method of D. Zhang et al. [2019arXiv191008925Z]. However, to fit our approach we extended the simulator with OpenAI Gym to return the proper reward values from the environment. Running the training set on an actual HPC environment requires an enormous number of iterations to learn. Because most HPC environments cannot run the RL-A3C algorithm, either for lack of GPU capability or for lack of resources available to non-HPC applications, the best approach is either to dedicate an external server to training the model or to run the simulation in a local environment.

V-A Simulation Environment

We simulate a homogeneous HPC environment that executes tasks by moving the timestamp forward instead of actually running them. All workflows are based on actual traces collected from real systems, and we use the CWL and SWF workflow formats to guarantee compatibility. When a workflow is generated, if the resources required to run a task belonging to that workflow are not available, the simulator uses the back-filling method to run smaller tasks first.
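
A timestamp-driven simulator of this kind can be sketched in a few lines of Python. This is a simplified illustration, not the simulator used in the paper: the `Task` record and the `simulate` function are hypothetical, and the back-filling shown here simply lets any later task that fits start immediately, without the reservations that production back-filling schedulers use to protect the head of the queue.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Task:
    name: str
    submit: float    # submission timestamp from the trace
    runtime: float   # run time from the trace
    procs: int       # processors requested


def simulate(tasks: List[Task], total_procs: int) -> Dict[str, float]:
    """Return the start time of each task on a homogeneous cluster."""
    pending = sorted(tasks, key=lambda t: t.submit)
    queue: List[Task] = []
    running: list = []                 # heap of (finish time, name, procs)
    starts: Dict[str, float] = {}
    free, clock, i = total_procs, 0.0, 0

    while i < len(pending) or queue or running:
        # Admit every task that has arrived by the current timestamp.
        while i < len(pending) and pending[i].submit <= clock:
            queue.append(pending[i])
            i += 1
        # Start tasks in queue order; a task that does not fit is skipped
        # so that smaller tasks behind it can run (simplified back-filling).
        for task in list(queue):
            if task.procs <= free:
                free -= task.procs
                starts[task.name] = clock
                heapq.heappush(running, (clock + task.runtime, task.name, task.procs))
                queue.remove(task)
        # Move the timestamp forward to the next arrival or completion.
        next_arrival = pending[i].submit if i < len(pending) else float("inf")
        next_finish = running[0][0] if running else float("inf")
        if next_arrival == float("inf") and next_finish == float("inf"):
            break                      # remaining tasks can never fit
        clock = min(next_arrival, next_finish)
        # Release the processors of every task that has finished by now.
        while running and running[0][0] <= clock:
            _, _, procs = heapq.heappop(running)
            free += procs
    return starts


if __name__ == "__main__":
    trace = [Task("a", 0, 10, 3), Task("b", 1, 5, 4), Task("c", 2, 2, 1)]
    print(simulate(trace, total_procs=4))
```

In the example, task "c" is back-filled ahead of the wider task "b", which has to wait until "a" releases its processors.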

Fig. 5: MARS Learning Rate Ratio

V-B HPC Reward and Metrics

HPC scheduling metrics are mostly based on response time, which is defined as the mean of the total wall-clock time from the instant at which a task is submitted to the system until it finishes its run. The most basic metric that accounts for both the running time and the wait time of a task is slowdown, the ratio of the task's response time (wait time plus run time) to its run time. A more sophisticated method is to take the average slowdown in order to minimize the wait time [10.5555/646382.689681]. Table III shows different evaluation metrics. The problem with the slowdown metric is that it overemphasizes the importance of very short jobs; to overcome this issue, Feitelson et al. [bounded-slowdown] suggested bounded slowdown. The behavior of this metric depends on the choice of the threshold value. Zotkin et al. [10.5555/822084.823251] identified a further problem: tasks that do the same amount of work with the same response time may lead to different slowdown results because of their shape, which is the ratio of processors to time. This led them to introduce another metric known as per-processor slowdown. We used average bounded slowdown instead of per-processor slowdown because, in our workflow examples, the shapes of our test systems are identical to each other.

Metric Formula
TABLE III: Scheduling Metrics

In our approach the goal is to minimize the average bounded slowdown. At the beginning of the algorithm the average cannot be calculated, so we return 0 as the reward. After the entire task sequence has finished, the RL-A3C agent receives the reward computed from the average.
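
The metrics and the delayed reward can be sketched as follows. The formulas are the standard definitions of slowdown and bounded slowdown; the threshold of 10 seconds and the negation of the final average (so that an agent maximizing reward minimizes slowdown) are assumptions, since the text does not state the threshold value or the sign of the reward.

```python
from typing import List, Sequence, Tuple


def slowdown(wait: float, run: float) -> float:
    """Response time (wait + run) divided by run time."""
    return (wait + run) / run


def bounded_slowdown(wait: float, run: float, tau: float = 10.0) -> float:
    """Slowdown with run times below tau clamped, and a lower bound of 1."""
    return max((wait + run) / max(run, tau), 1.0)


def episode_rewards(tasks: Sequence[Tuple[float, float]],
                    tau: float = 10.0) -> List[float]:
    """Reward per scheduling step for a sequence of (wait, run) pairs.

    Every step returns 0 except the last, which returns the negated
    average bounded slowdown of the whole sequence (assumed sign).
    """
    if not tasks:
        return []
    values = [bounded_slowdown(w, r, tau) for w, r in tasks]
    average = sum(values) / len(values)
    return [0.0] * (len(tasks) - 1) + [-average]


if __name__ == "__main__":
    print(slowdown(30.0, 10.0))
    print(bounded_slowdown(30.0, 1.0))
    print(episode_rewards([(30.0, 10.0), (0.0, 1.0)]))
```

The second call shows why the bound matters: a one-second job that waited 30 seconds has a plain slowdown of 31, whereas its bounded slowdown is far smaller.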

V-C Results

In this section we show that MARS, by combining heuristic and RL-A3C algorithms, can improve performance and time and avoid over-fitting the network when scheduling tasks on an HPC system. Most reinforcement learning algorithms need to be configured with proper parameters from the HPC system. Figure 4 shows the different policies under different configurations; the y-axis is the average bounded slowdown and the x-axis lists the scheduling policies.

The ratio of training to testing data for our scheduler was 70% to 30%, similar to most RL algorithms. We categorized our test bed into three configurations and sizes: the small data-set contained between 512 and 2000 tasks, the medium data-set between 2000 and 9000 tasks, and the large data-set between 10000 and 25000 tasks. We randomly selected tasks from the different data-sets and performed experiments with different configurations. The other factors we considered are the number of iterations per task in the DNN and the delay between task arrivals. By experimenting with different configurations, we showed that for both reinforcement learning and heuristic algorithms the proper configuration makes a significant difference in the result. Lastly, we added the cost-aware probabilities after creating the RL-A3C model.

In Figure 4 part (a) we chose a large data-set from SDSC IBM-Blue with 20000 tasks to train and 6000 tasks to test; however, because the data configuration was chosen randomly, the reinforcement learning algorithm performs worse than MARS. Similarly, in part (b) we selected 15000 random tasks and observed the same result. However, if the workflow size is large enough and the data is consistent with the configuration of the DNN, the RL-A3C algorithm improves. Figure 4 part (c) uses the HPC2N data-set with 4000 selected tasks, and Figure 4 part (d) uses a small selection of tasks from the ANL Intrepid data-set; the configurations of all three experiments were chosen randomly.

As discussed, the MARS scheduler tries to solve this issue in two ways: it either combines tasks to generate a proper size for training and testing in RL-A3C or switches back to a heuristic algorithm. Our experiments showed that with a proper and ideal configuration, as in Figure 4 part (e), RL-A3C performs better than MARS. However, because achieving the ideal configuration in HPC is rather difficult, in other cases such as Figure 4 part (f) the suggested method yields better performance. Our experiments show that MARS on average achieves between 5% and 60% better performance than the other policies.

Another issue to consider in reinforcement learning is over-fitting the network. Figure 5 shows that, depending on the data-set configuration and the ratio of learning interaction with the HPC system, we achieve different performance. Figure 5 part (a) is a large data-set with 50000 iterations per task, which causes RL-A3C learning to interact frequently with the HPC system.

Figure 5 part (b) is the optimal configuration with a properly sized data-set, whereas in part (c) the configuration and HPC parameters change randomly, which causes the RL-A3C agent to interact with the HPC system more often. Figure 5 parts (d) and (e) compare different experiments, and part (f) shows a data-set that is too small to train on. To resolve these issues, MARS updates the reward values from the HPC system after each iteration, and by selecting a heuristic algorithm for small data-set sizes we bypass the inefficient training model.

In our test experiment, the cost of each task was randomly generated, and after the RL-A3C soft-max values we incorporated the costs as another probability function with values between 0 and 1. We used a Gaussian distribution to add the cost factor to the final step of the DNN soft-max calculation. As discussed before, adding the cost to the training model would result in a unique data model. To keep the model general, the cost is therefore incorporated after creating the DNN network. By calculating the cost with each action taken by the agent, a more specific reward value can be derived from the HPC system. As shown in Figure 4, with a random configuration for RL-A3C the performance decreases by between 5% and 60%. However, by using the MARS policy and combining heuristic and RL-A3C algorithms with cost-awareness, the performance improves back to the optimal solution.

VI Related Works

HPC task scheduling has long been a research topic. Many studies have been done, including heuristic algorithms such as First Come First Served (FCFS) and Shortest Job First (SJF), more sophisticated policies such as WFP3 and UNICEF, and machine learning approaches. MARS differs from the existing studies in that it takes advantage of the existing resource management on the HPC system and combines it with the most suitable algorithm to maximize performance and reduce training time [5289206, 10.1145/3126908.3126955, 932708, heristic, AKYOL200795, 265940, Singh1996MappingAS].

Mirhoseini et al. [46646, 2017arXiv170604972M] use DRL to optimize the placement of computation graphs, Xu et al. [2018arXiv180105757X] use the same method to select routing paths between network nodes for traffic, and Mao et al. [10.1145/3098822.3098843] use the same principle to dynamically select video stream rates.

Recently, several studies have also started to leverage deep reinforcement learning for resource allocation and job scheduling in distributed environments, such as DeepRM [10.1145/3005745.3005750] and Decima [2018arXiv181001963M]; however, none of them uses the existing HPC resource management or combines heuristic algorithms with deep reinforcement learning.

Although they use DRL methods similar to those in MARS, these studies are not designed for scheduling HPC tasks, which are fixed, rigid, and non-preemptable.

These differences lead to different designs and optimizations in MARS, detailed in Section III-A. The most recent HPC task scheduling work [10.1145/3126908.3126955] uses brute-force simulations to generate a large number of data samples, each of which shows the best scheduling decision for a random job sequence. It then applies machine learning methods to these data samples to build scheduling functions that best fit them. MARS instead selects the most suitable algorithm, from heuristics to deep reinforcement learning, to increase optimization and performance.

VII Conclusion

In this study, we proposed a new cost-aware reinforcement learning policy for task scheduling on HPC systems that uses the existing resource manager and enables system administrators and users to optimize task scheduling based on any preferred algorithm and on cost effectiveness. We also showed that, by combining heuristic algorithms with a deep reinforcement learning actor-critic algorithm, MARS can optimize an HPC system for both legacy and complex workflows. By evaluating different RL-A3C configurations and switching between heuristic and RL-A3C algorithms, we achieved better results and performance. MARS improves modularity and support for users' legacy and complex workflows, and it optimizes task execution based on the most appropriate approach.