Hierarchical Deep Reinforcement Learning Approach for Multi-Objective Scheduling With Varying Queue Sizes

07/17/2020 ∙ by Yoni Birman, et al. ∙ 0

Multi-objective task scheduling (MOTS) is task scheduling performed while optimizing multiple and possibly contradicting constraints. A challenging extension of this problem occurs when every individual task is a multi-objective optimization problem by itself. While deep reinforcement learning (DRL) has been successfully applied to complex sequential problems, its application to the MOTS domain has been stymied by two challenges. The first challenge is the inability of the DRL algorithm to ensure that every item is processed identically regardless of its position in the queue. The second challenge is the need to manage large queues, which results in large neural architectures and long training times. In this study we present MERLIN, a robust, modular and near-optimal DRL-based approach for multi-objective task scheduling. MERLIN applies a hierarchical approach to the MOTS problem by creating one neural network for the processing of individual tasks and another for the scheduling of the overall queue. In addition to being smaller and having shorter training times, the resulting architecture ensures that an item is processed in the same manner regardless of its position in the queue. Additionally, we present a novel approach for efficiently applying DRL-based solutions on very large queues, and demonstrate how we effectively scale MERLIN to process queue sizes that are larger by orders of magnitude than those on which it was trained. Extensive evaluation on multiple queue sizes shows that MERLIN outperforms multiple well-known baselines by a large margin (>22




1. Introduction

Scheduling algorithms plan the allocation of resources to tasks over a given time in order to optimize one or more evaluation metrics (e.g., throughput, average waiting-time in queue). The resources addressed by task scheduling can take on multiple forms: memory and CPU in a computing environment, machines in a workshop, runways at an airport, etc. Scheduling is crucial in multiple domains, including the manufacturing and service industries (Hall et al., 1997), medicine (May et al., 2011), and even malware detection (Staniford and Aziz, 2015). While simple scheduling tasks can be easily solved using existing heuristic approaches (Zhao and Stankovic, 1989), multi-objective task scheduling (MOTS) problems are more challenging. The added difficulty stems not only from the need to balance multiple goals, but also from the need to sometimes reconcile contradictory metrics.

In recent years, deep reinforcement learning (DRL)-based solutions have emerged as a promising alternative to existing heuristic scheduling solutions, often achieving state-of-the-art results (Mao et al., 2016). DRL algorithms have some significant strengths compared to other types of scheduling methods, particularly in cases that involve uncertainty (Ganesan et al., 2016). First and foremost, they enable the formulation of sophisticated strategies, including those in which they make short-term sacrifices in order to reap larger rewards later on. Secondly, DRL algorithms are capable of efficiently exploring large state and action spaces, thus enabling them to develop novel and effective policies for complex scenarios. Thirdly, if the reward function (i.e., rewards and punishments for various actions and outcomes) is defined correctly, the DRL algorithm will very likely develop a strategy that achieves the desired goals (see (Birman et al., 2019) for a good example of how changing the reward function changes the algorithm's behavior).

While highly effective, DRL-based algorithms also have two significant shortcomings when applied to MOTS problems. The first shortcoming of DRL-based approaches is the need to integrate multiple and often conflicting objectives into a single reward function. This is particularly the case when the processing of each task is not fixed but instead constitutes a multi-objective optimization problem of its own. Such a function will need to address the multi-objective optimization problem both for the individual tasks and for the entire queue, resulting in a complex and difficult-to-define expression. Moreover, each optimization goal is likely to have its own reward frequency and scale, thus contributing to the difficulty of defining the reward function. The use-case presented in this study (see Section 4) provides an excellent example of such a scenario: our goal is to screen a set of files for malware while minimizing the average processing time of each file. This challenge is further complicated by the fact that the analysis of each file is also a multi-objective optimization problem, with multiple detectors that can be used, each with its own capabilities and resource usage.

The second challenge associated with applying DRL-based solutions to MOTS problems is that there is no easy way of ensuring that samples are processed identically regardless of the state of the queue. This is the case because of the need to define a single reward function that models all constraints and priorities; integrating bounds and constraints into a complex function is far from trivial.
Consider a DRL-based system tasked with detecting malware in a queue of incoming files: the two (conflicting) goals set for the system are high detection rates (accuracy) and low average processing times. If such a system becomes backlogged over time (i.e., average processing times rising quickly), the DRL-agent may begin conducting less extensive analysis of files (i.e., compromising detection rates) in order to reduce processing times. A scenario where an item is processed differently based on the current state of the queue is unacceptable in multiple domains, including medical testing and safety maintenance checks.

A more general shortcoming of applying DRL to scheduling problems – one that is shared by all artificial neural network (ANN) architectures as well as other types of machine learning (ML) algorithms – is that the size of the input used to train the model must be fixed. This requirement means that scheduling policies developed by such algorithms are unable to effectively operate on queues larger than the ones on which they were trained. This limitation can possibly lead to significantly sub-optimal solutions, as the algorithm can only process a part of the queue at each given time. To (partially) address this problem, DRL-based scheduling algorithms are often trained on large queue sizes. While this course of action improves the algorithm’s performance, it also requires considerably larger architectures, training data, and training time (see our results in Section 


In this study we present MERLIN, a hierarchical DRL-based scheduling approach for multi-objective scheduling. Our proposed approach addresses the abovementioned shortcomings of other DRL-based approaches by applying a two-part solution. The first part of our proposed solution is creating a hierarchy of DRL agents. Instead of attempting to solve the MOTS problem in its entirety using a single ANN architecture, we divide the problem into two parts: a) devising a policy for the processing of individual items in the queue; and b) devising a policy for managing the queue. By using this approach we are able to accurately define the policy for the processing of specific items while also ensuring that each item is processed in an identical manner regardless of its position in the queue. Additionally, the use of a modular solution results in smaller architectures that are easier to train than a single large architecture.

The second part of our proposed solution is a novel approach for enabling DRL-based scheduling solutions to manage queues larger than those on which they were trained. Our proposed approach is easy to implement and compatible with most DRL-based approaches. Moreover, our proposed solution enables the efficient processing of dynamic queue sizes, i.e., queues where new items are stochastically added over time. To the best of our knowledge, the management of dynamic queues has not been addressed previously.

We evaluate MERLIN on a large-scale malware detection dataset (presented in (Birman et al., 2019)). Our goal is to enable cost-effective analysis of the dataset's files: maintaining high detection rates while spending as little time as possible analyzing each file. The domain is challenging as it requires balancing both detection accuracy and the processing time of the entire queue. Our evaluation results show that MERLIN outperforms multiple practical and realistic baselines used to evaluate our use-case by at least 22% across multiple queue sizes.

Our contributions in this study are: (1) we propose a multi-objective scheduling framework that utilizes a multi-tier DRL solution. This approach simplifies the training process while enabling us to define various constraints; (2) to the best of our knowledge, we are the first to present a DRL-based solution for scheduling with no prior data both on the processed item and stochastic processing times, i.e., high uncertainty; (3) we propose a hierarchical modeling approach that enables the processing of varying queue sizes without the need of retraining the algorithm; and (4) we present an evaluation on a large real-world use-case (malware detection) that demonstrates the effectiveness and usefulness of our proposed scheduling framework.

2. Background: Reinforcement Learning

Reinforcement learning (RL) is a learning technique used for sequential decision making, often when only partial information is available or when the solution space is large. RL is highly adept at exploring various strategies efficiently. Because of its ability to successfully operate in complex domains, especially when coupled with deep learning, RL has been applied in domains such as robotics, control problems (Schulman et al., 2015), genetic algorithms (Such et al., 2017), complex games (Silver et al., 2017) and scheduling (Mao et al., 2016).

The RL problem setting consists of an environment and an agent. The agent takes actions that affect the environment and change its state. Each action (or sequence of actions) incurs a reward that provides feedback to the agent on the quality of its decisions. Agents can optimize their behavior by interacting with the environment and devising a policy that will yield maximal rewards overall. At every time-step $t$, the agent selects an action $a_t$ from the action space $A$ that modifies the state of the environment and incurs a reward $r_t$ (positive or negative). The goal of the agent is to maximize the future accumulated reward $R_t = \sum_{t'=t}^{T} r_{t'}$, where $T$ is the index of the final time-step.
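
As a small illustration of the accumulated reward described above, the following sketch sums the rewards an agent receives from a given time-step until the final one. The function name and the toy episode are assumptions made for illustration, and the undiscounted sum is only one possible form; many DRL formulations additionally apply a discount factor.

```python
from typing import Sequence


def accumulated_reward(rewards: Sequence[float], t: int) -> float:
    """Sum the rewards from time-step t through the final time-step.

    rewards[i] is the reward received at time-step i (0-indexed), so the
    final time-step is len(rewards) - 1.
    """
    if not 0 <= t < len(rewards):
        raise IndexError("t must be a valid time-step index")
    return float(sum(rewards[t:]))


# Hypothetical episode: a short-term sacrifice (-1) followed by larger rewards.
episode = [-1.0, 0.0, 2.0, 3.0]
print(accumulated_reward(episode, 0))  # 4.0
print(accumulated_reward(episode, 2))  # 5.0
```

The first call shows why an agent may accept an immediate negative reward: the return from the start of the episode is still positive.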

A common approach for selecting the action to be taken at each state is the action-value function $Q(s, a)$ (Sutton and Barto, 2018), also known as the $Q$-function. The function approximates the expected returns should we take action $a$ at state $s$. While the methods are varied, RL algorithms which use $Q$-functions aim to discover (or closely approximate) the optimal action-value function $Q^*$, which is defined as $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$, where $\pi$ is the policy mapping states to actions (Sutton and Barto, 2018). Since estimating $Q$ for every possible state-action combination is highly impractical (Mnih et al., 2015), it is common to use an approximator $Q(s, a; \theta) \approx Q^*(s, a)$, where $\theta$ represents the parameters of the approximator. Deep reinforcement learning (DRL) algorithms perform this approximation using neural nets, with $\theta$ being the parameters of the network.
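
To make the idea of a parameterized action-value approximator concrete, the sketch below uses a linear model as a deliberately simple stand-in for a neural network: each action has its own weight vector, and the greedy action is the one with the highest estimated value. The function names and the toy parameters are assumptions for illustration only.

```python
from typing import List, Sequence


def q_values(state: Sequence[float], theta: List[List[float]]) -> List[float]:
    """Linear approximator: theta[a] is the weight vector of action a."""
    return [sum(w * x for w, x in zip(weights, state)) for weights in theta]


def greedy_action(state: Sequence[float], theta: List[List[float]]) -> int:
    """Return the index of the action with the highest estimated value."""
    qs = q_values(state, theta)
    return max(range(len(qs)), key=qs.__getitem__)


# Hypothetical parameters for two actions over a two-feature state.
theta = [[1.0, 0.0],
         [0.0, 2.0]]
state = [0.5, 0.5]
print(q_values(state, theta))       # [0.5, 1.0]
print(greedy_action(state, theta))  # 1
```

A DRL agent replaces the linear model with a neural network, but the interface (state in, one value per action out) stays the same.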

3. The Proposed Method

3.1. Motivation

While DRL has proven very effective in optimizing a single objective (Mao et al., 2019), to date no study has successfully applied this approach to multi-objective scheduling (although multi-resource problems have been addressed (Mao et al., 2016)). This is likely due to the difficulty of balancing multiple, and often contradicting, objectives in a single reward function. When multiple goals affect the reward function, it is more difficult to isolate the effect each single action has on each of the objectives. Moreover, there is an inherent difficulty in integrating objectives that may have different value scales and distributions and are provided at different intervals.

Our proposed approach partitions the original problem into separate “sub-problems”, each solved with its own DRL-agent. Such partitioning simplifies each individual problem, enables modularity, and reduces the complexity of the overall optimization process. It is important to note that each sub-problem does not have to address all the goals of the original problems, but instead can solve a subset of the said goals. This can be done by defining additional (intermediary) goals in order to facilitate the desired outcome for each sub-problem. Such a use of an intermediary goal is presented in our own use-case in Section 4.

We define our approach as modular since it enables us to easily replace each of the individual DRL-agents used to solve the sub-problems. This trait is very useful from a practical standpoint since it enables us to re-calibrate our model (e.g., change some of our priorities—greater accuracy and lower throughput) without re-training all of its components. Our approach was inspired by the modular NN approach (Kimoto et al., 1990), which applies this idea in the field of robotics.

3.2. Problem Formulation

In order to simplify our representation, we present a problem formulation for a two-tier hierarchical model. The proposed representation can easily be expanded to include additional tiers.

Let $\mathcal{Q}$ be a queue of $N$ items. Let $R_{in}$ be the internal reward function, which defines the reward (positive or negative) obtained when processing item $q_n$. Let $R_{out}$ be the outer reward function, which defines the reward for processing the entire queue. We begin by optimizing the loss function of the internal agent, thus setting the policy $\pi_{in}$ of the internal agent, defined by

$$\pi_{in} = \arg\min_{\pi} \sum_{n=1}^{N} L_{in}(\pi, q_n)$$

where $q_n$ is the $n$th item in $\mathcal{Q}$ and $L_{in}$ is the loss function of the internal agent.

Once we set the policy of the internal agent we can define the policy of the outer agent. Our goal is to minimize the loss function and optimize the policy of the outer agent $\pi_{out}$:

$$\pi_{out} = \arg\min_{\pi} L_{out}(\pi, \mathcal{Q}, \pi_{in})$$

where $L_{out}$ is the loss function of the outer agent.

Motivation. Our proposed approach for addressing this challenge consists of a modular and hierarchical DRL architecture. We begin by setting all of the problem domain's constraints and priorities, both for each individual item in the queue (e.g., desired detection rate for defects) and for the queue overall (e.g., average processing time). We denote these priorities by $P_{in}$ and $P_{out}$, respectively. We then train one DRL agent, the "internal agent", to create the internal policy $\pi_{in}$ that optimally addresses $P_{in}$. Finally, we "freeze" $\pi_{in}$ and use a second DRL agent, the "outer agent", to train the outer policy $\pi_{out}$, whose goal is to optimize $P_{out}$ by scheduling the order in which $\pi_{in}$ is applied to the items in the queue. The inputs for $\pi_{out}$ are both the current state of the queue and the outputs of the internal agent.

It is important to point out that we first set the internal policy and only then attempt to optimize the outer policy. This approach ensures that each item in the queue is processed in the same manner regardless of its position in the queue. This trait, which cannot be guaranteed in DRL architectures that use a single reward function for the entire problem, ensures consistent performance and equal treatment of all items. Guarantees such as the ones our approach provides are critical in many fields, including medical testing and airplane maintenance.

3.3. System Architecture

Modular architecture.

Our proposed solution architecture consists of an outer agent whose goal is to schedule the processing of the various items in the queue, and an internal agent whose goal is to determine the manner by which each individual item is processed (see Figure 1). As mentioned before, each agent is trained separately: the internal agent is trained first until it converges. Then, the outer agent is trained by interacting with the internal agent, i.e., exploring various scheduling strategies that involve the fully-trained internal model. The internal agent is “frozen” while the outer agent is trained, ensuring both modularity and that the desired performance of the internal agent—which was defined during its training—is maintained.

Figure 1. Illustrating our two-tier, modular DRL-based scheduling algorithm. The internal agent uses an optimal policy defining how to process a single item in the queue. The outer agent’s goal is to serve as the scheduling mechanism for the queue.

Multiple objectives.

The roles of the two agents are very different. The goal of the internal agent is to create an optimal policy for a single item in the queue. As a result, the state of the internal agent represents the current state $s_i$ of a single item $i$. The goal of the outer agent is to serve as the scheduling mechanism for the queue. Therefore, the state representation of the outer agent is a concatenation of all current item representations $s_1, \dots, s_N$, with the addition (to each item state $s_i$) of a single value $c_i$ indicating whether the processing of the item has ended. An example of the state representation of the outer agent is presented in Figure 2, where each row represents an item, and each column represents an action that can be taken by the internal agent on that item. Non-negative cell values are the outputs of executed actions.

Figure 2. An example of a state matrix with $N$ items. The inner item representation $s_i$ consists of $M$ values. $c_i$ indicates whether the processing of item $i$ has ended. Note that $s_i^j$ is the $j$th value in the inner item representation.
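
The state representation of the outer agent can be sketched as follows: each row is the inner representation of one item followed by a completion flag. This is a minimal sketch; the function name, the use of -1 as the "not yet executed" placeholder (borrowed from the use-case in Section 4.1), and the toy values are assumptions for illustration.

```python
from typing import List, Sequence

NOT_EXECUTED = -1.0  # assumed placeholder for an action not yet executed


def build_outer_state(item_states: Sequence[Sequence[float]],
                      done_flags: Sequence[bool]) -> List[List[float]]:
    """Append a completion flag to each item's inner representation.

    Returns one row per item: the inner values followed by 1.0 if the
    item's processing has ended and 0.0 otherwise.
    """
    if len(item_states) != len(done_flags):
        raise ValueError("one completion flag is required per item")
    return [list(state) + [1.0 if done else 0.0]
            for state, done in zip(item_states, done_flags)]


# Hypothetical queue of two items, each with three possible internal actions.
items = [[0.9, NOT_EXECUTED, NOT_EXECUTED],             # one action executed
         [NOT_EXECUTED, NOT_EXECUTED, NOT_EXECUTED]]    # untouched item
for row in build_outer_state(items, [True, False]):
    print(row)
# [0.9, -1.0, -1.0, 1.0]
# [-1.0, -1.0, -1.0, 0.0]
```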

Actions selection.

At every time step, the outer agent can select the item that will be sent to the internal agent for processing. As a result, the size of the outer agent's action space is equal to the size of the queue. The size of the internal agent's action space is determined by the number of processes it can apply on an item (columns in Figure 2), with an additional action for final classification.

It is important to note that the outer agent's scheduling is preemptive. This means that once an item has been submitted to the internal agent, the internal agent only performs a single action rather than analyzing the item until completion. The outer agent can then choose to send a different item to the internal agent, leaving the first item to be processed at a later time. The rationale for this approach is simple: since each action taken by the internal agent reveals additional information about the item, the outer agent has a chance to weigh the benefit of continuing to process the current item against processing another.

The internal agent is also preemptive in the sense that it can issue a final decision about the analyzed item without having to run all possible tests or processes on it. This setting, initially proposed in (Birman et al., 2019), enables the internal agent to strike the desired balance between performance (e.g., classification accuracy) and the resources allocated to achieving it.
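
The preemptive interaction between the two agents can be sketched as a simple loop in which the outer policy chooses an item, the internal agent performs exactly one action on it, and control returns to the outer policy. Everything below is a toy stand-in under stated assumptions: `Item`, `internal_step` and `fewest_remaining_first` are hypothetical names, and the toy outer policy reads `remaining` directly, which a trained agent operating under uncertainty would not be able to do.

```python
from typing import Callable, List


class Item:
    """Toy queue item that needs a fixed number of internal actions."""

    def __init__(self, name: str, actions_needed: int) -> None:
        self.name = name
        self.remaining = actions_needed

    @property
    def done(self) -> bool:
        return self.remaining == 0


def internal_step(item: Item) -> None:
    """Stand-in for the frozen internal agent: perform exactly one action."""
    if not item.done:
        item.remaining -= 1


def run_queue(queue: List[Item],
              outer_policy: Callable[[List[Item]], int]) -> List[str]:
    """Preemptive loop: the outer policy re-chooses after every single action."""
    finish_order: List[str] = []
    while any(not item.done for item in queue):
        item = queue[outer_policy(queue)]
        if item.done:
            raise ValueError("outer policy selected a finished item")
        internal_step(item)
        if item.done:
            finish_order.append(item.name)
    return finish_order


def fewest_remaining_first(queue: List[Item]) -> int:
    """Toy outer policy used only to drive the loop."""
    candidates = [i for i, item in enumerate(queue) if not item.done]
    return min(candidates, key=lambda i: queue[i].remaining)


queue = [Item("a", 3), Item("b", 1), Item("c", 2)]
print(run_queue(queue, fewest_remaining_first))  # ['b', 'c', 'a']
```

Because the outer policy is consulted after every single internal action, it is free to abandon a partially processed item in favor of another one.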

Operating under high levels of uncertainty.

We argue that, because it is a DRL-based method, MERLIN has an important advantage over existing solutions when dealing with high degrees of uncertainty. The uncertainty presents itself in two ways in our use-case. First, the internal agent has no way of knowing in advance the output provided by each detector. Additionally, the runtime of each detector varies from file to file, thus adding another level of complexity to the malware detection process. Secondly, the outer agent has to contend with uncertainty regarding the actions and running time of the internal agent as it selects the next files to be processed by the latter.

Unlike other commonly-used approaches (see Section 5.2), MERLIN requires no preliminary information on the processed items (file size, file type, bounds on running time, etc.) and adapts its policy by interacting with the items over time. The ability to operate under high uncertainty is shared by both the internal agent and the outer agent. The internal agent interacts with individual items and devises its own policy for processing them. The outer agent interacts both with the items of the queue and with the internal agent, without any prior knowledge of either.

3.4. Adapting to Changes in Queue Length

One significant shortcoming of DRL-based solutions to queue management is the network's inability to adapt to changes in the state or action space (Schulman et al., 2015). More specifically, we refer to the fact that the input of the network, and consequently its number of actions, must be of a fixed size. This inflexibility leads to two types of problems. First, it could easily lead to sub-optimal solutions, with easy-to-process items having to wait until the first items are done. An example of this scenario is presented in (Mao et al., 2016): since the fixed input size was 100, the 101st item will not be considered until the first 100 items are completed. Secondly, this inability to generalize the learnt logic to larger queue sizes forces practitioners to train their DRL agents on relatively large state representations, which leads to longer running times and makes it harder for the deep network to converge.

To address the challenges described above, we propose a novel hierarchical approach for dynamic queue size scheduling. Given a queue $\mathcal{Q}$ of $N$ items, we partition it into fixed-size subsets of size $k$, where $k$ is also the number of queue items the outer agent is configured to receive as input. This partitioning results in $\lceil N/k \rceil$ sub-queues. To ensure that all sub-queues are exactly of size $k$, we use padding when needed. Our padding consists of items that are flagged as "completed" (i.e., their processing is already complete), which effectively ensures that our fully-trained DRL agent will ignore them.

Once the partitioning into sub-queues is complete, we apply the outer agent on each sub-queue. This results in the creation of a selected set of queue items $S$, whose size we denote as $m$. Next we check whether $m \le k$, where $k$ is the input size of the outer agent. If that is the case, $S$ is provided as input to the outer agent and the scheduling process continues as described in Section 3.3. Otherwise, if $m > k$, we once again partition the current set into sub-queues of size $k$, and continue to do so iteratively until we reach an item set of size at most $k$. An illustration of the proposed process is presented in Figure 3.

Figure 3. An example of a two-stage hierarchical state-action reduction process, with a DRL algorithm trained on a queue of length four.
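
The iterative reduction described in this section can be sketched as follows. This is a minimal sketch under stated assumptions: `select` stands in for the trained outer agent and is assumed to pick exactly one item per sub-queue, `PAD` is a hypothetical marker for padding items that are treated as already completed, and `pick_smallest` is a toy selector used only to make the example runnable.

```python
from typing import Callable, List, Sequence

PAD = None  # assumed padding marker: behaves like an already-completed item


def partition(items: Sequence, k: int) -> List[List]:
    """Split items into sub-queues of exactly k entries, padding the last."""
    chunks = [list(items[i:i + k]) for i in range(0, len(items), k)]
    if chunks and len(chunks[-1]) < k:
        chunks[-1] += [PAD] * (k - len(chunks[-1]))
    return chunks


def hierarchical_select(items: Sequence, k: int,
                        select: Callable[[List], int]):
    """Reduce a queue of any size to one choice using a size-k selector."""
    if k < 2:
        raise ValueError("k must be at least 2 for the queue to shrink")
    if not items:
        raise ValueError("the queue is empty")
    current = list(items)
    while len(current) > k:
        # One selected item per sub-queue forms the next, smaller queue.
        current = [chunk[select(chunk)] for chunk in partition(current, k)]
    final = partition(current, k)[0]
    return final[select(final)]


def pick_smallest(chunk: List) -> int:
    """Toy selector: index of the smallest real (non-padding) entry."""
    real = [i for i, x in enumerate(chunk) if x is not PAD]
    return min(real, key=lambda i: chunk[i])


queue = [7, 3, 9, 4, 8, 1, 6, 5, 2, 10]
print(hierarchical_select(queue, 4, pick_smallest))  # 1
```

With ten items and an input size of four, the first pass yields three candidates (3, 1 and 2), which fit into a single padded sub-queue for the final selection.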

The proposed hierarchical approach has two significant advantages. First, it enables the use of DRL algorithms with a fixed-size input representation for processing queues of practically any size, thus removing one of the main limitations to applying DRL to queue management. Moreover, the item processing is done in a way that looks at every item in the queue, so that no item is ignored. Secondly, our hierarchical approach makes it possible to train networks with smaller input sizes to process large queues. As a result, we can train smaller networks with fewer parameters, thus leading to faster convergence and the need for fewer computing resources. The results of our evaluation in Section 5.3 support our claim.

4. Use-Case: Malware Detection

Our use-case is based on the study presented in (Birman et al., 2019), where a DRL-based framework was used to perform cost-aware analysis of malware files. The underlying insight of the said study is that while organizations often deploy an ensemble of detectors to ensure high detection rates, in many cases a subset of the available detectors can produce the correct classification using far fewer computing resources and shorter execution times. The authors of (Birman et al., 2019) created a reward function that factors in both correctness of the classification and the time needed to reach the decision, and show that their approach can significantly reduce the time needed to classify a file (by 80%) while only marginally harming detection accuracy.

It is important to note that the solution presented in (Birman et al., 2019) is designed for the cost-effective classification of individual files and not for the management of queues. For this reason, we use the DRL agent developed in (Birman et al., 2019) as the internal agent in this use case, and then train the outer agent ourselves. For a comprehensive overview of the architecture, we refer the reader to the original paper. The remainder of this section provides a short overview of the internal and outer agents' architecture.

4.1. The Internal Agent

The goal of the internal DRL-agent is to create a cost-effective policy for the analysis of files for possible malware. To achieve this goal, the agent performs the following steps: 1) send a file to one detector; 2) receive the classification output of the detector (a value in the range [0,1]); 3) based on the available information, determine whether to provide a final classification to the file, thus terminating the process, or query an additional detector(s). If the latter option is chosen, all steps are repeated.

The reward functions evaluated in (Birman et al., 2019) are presented in Table 1. All functions define the cost of making a mistake (i.e., a false positive or a false negative) as a function of the time spent classifying the file. The logic behind this approach is simple yet novel: discourage the DRL agent from querying detectors that are unlikely to provide useful information (i.e., increase the chances of being correct), as they will only lead to more "pain" in case of a mistake. The reward for correct classifications is either fixed or a function of the time spent. The former option encourages the algorithm to be more cost-oriented, resulting in shorter processing times per file. The latter approach yields superior performance but saves only a small amount of computing resources. In our study we chose to use the reward function of experiment #3, which offers (in our view) the most cost-effective solution (savings of about 80% in running time while reducing performance by only 0.5%). This is the policy used in all our experiments throughout this study.

The state space of the internal agent is represented by a vector containing a single entry for each of the available malware detectors. The value of each cell is either -1 (minus one), meaning that the detector was not yet queried, or the output of the detector, a value in the range [0,1].


The action space of the internal agent is similarly simple: it contains one action for each detector, and choosing this action will query the corresponding detector to classify the file. Additionally, there are two more actions: 1) classify the file as benign; and 2) classify the file as malware. Choosing one of the two latter actions terminates the analysis of the file.
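
The state and action spaces of the internal agent can be sketched as a small class. This is an illustrative sketch rather than the implementation of (Birman et al., 2019): the class name, the ordering of the two terminal actions, and the choice of four detectors are assumptions.

```python
from typing import List, Optional

NOT_QUERIED = -1.0  # value of a cell whose detector was not yet queried


class FileAnalysisState:
    """Per-file state: one cell per detector plus an optional final verdict."""

    def __init__(self, num_detectors: int) -> None:
        self.scores: List[float] = [NOT_QUERIED] * num_detectors
        self.verdict: Optional[str] = None

    @property
    def num_actions(self) -> int:
        # One action per detector, plus "benign" and "malware".
        return len(self.scores) + 2

    def apply(self, action: int, detector_output: float = 0.0) -> bool:
        """Apply an action; return True if it terminates the analysis."""
        n = len(self.scores)
        if self.verdict is not None:
            raise RuntimeError("the analysis of this file already ended")
        if 0 <= action < n:
            if not 0.0 <= detector_output <= 1.0:
                raise ValueError("detector outputs lie in the range [0, 1]")
            self.scores[action] = detector_output
            return False
        if action == n:
            self.verdict = "benign"
            return True
        if action == n + 1:
            self.verdict = "malware"
            return True
        raise ValueError("unknown action")


state = FileAnalysisState(num_detectors=4)  # hypothetical detector count
state.apply(2, detector_output=0.93)        # query the third detector
print(state.scores)                         # [-1.0, -1.0, 0.93, -1.0]
print(state.apply(5), state.verdict)        # True malware
```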

| Exp. # | Reward: TP | Reward: TN | Reward: FP | Reward: FN | Accuracy (%) | Mean Time (sec) |
|---|---|---|---|---|---|---|
| 1 | C'(t) | C'(t) | -C'(t) | -C'(t) | 96.867 | 48.61 |
| 2 | C'(t) | C'(t) | -10C'(t) | -10C'(t) | 96.801 | 48.37 |
| 3 | 1 | 1 | -C'(t) | -C'(t) | 96.212 | 10.53 |
| 4 | 10 | 10 | -C'(t) | -C'(t) | 95.424 | 3.68 |
| 5 | 100 | 100 | -C'(t) | -C'(t) | 91.220 | 0.73 |

Table 1. The five reward setups evaluated in (Birman et al., 2019). In the experiments presented in this study, we use the configuration of experiment #3.
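
The reward setup of experiment #3 (a fixed reward of 1 for correct classifications and a time-dependent penalty for mistakes) can be sketched as below. The exact form of the cost function C'(t) is not given in this section, so it is passed in as a parameter; `toy_cost` is a purely hypothetical placeholder and not the function used in (Birman et al., 2019).

```python
from typing import Callable


def reward_experiment_3(predicted_malicious: bool,
                        actually_malicious: bool,
                        elapsed_seconds: float,
                        cost: Callable[[float], float]) -> float:
    """Fixed reward for a correct verdict, time-based penalty for a mistake."""
    if predicted_malicious == actually_malicious:
        return 1.0                    # true positive or true negative
    return -cost(elapsed_seconds)     # false positive or false negative


def toy_cost(t: float) -> float:
    """Hypothetical cost that grows with the time spent on the file."""
    return 1.0 + t


print(reward_experiment_3(True, True, 12.0, toy_cost))   # 1.0
print(reward_experiment_3(False, True, 12.0, toy_cost))  # -13.0
```

Because the penalty grows with the time spent while the reward for success does not, querying additional detectors only pays off when it meaningfully raises the chance of a correct verdict.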

4.2. The Outer Agent

The goal of the outer agent is to schedule the processing of the analyzed files by the internal agent so as to minimize the average time a file spends in the queue while waiting to be classified. A detailed description of the evaluation metric is provided in Section 5.1.

The state space of the outer agent is modeled using a matrix like the one presented in Figure 2. Each row in the matrix represents a single file in the queue, and it consists of the file's state, as it is represented by the internal agent. Simply put, the state space of the outer agent is a concatenation of the internal agent's state representation for all files. The size of the outer agent's action space is equal to the initial number of files in the queue. Choosing action $i$ indicates that the $i$th file is sent to the internal agent for processing.

It is important to note that the outer agent interacts with the internal agent as a black-box model. The outer agent has no information regarding the internal agent's inner workings or decision-making process. The outer agent develops its own policy simply by interacting with the internal agent and inferring on its own the optimal policy. This setting is both simpler and more robust, as it enables modular training and the replacement of either the internal or the outer agent.

5. Evaluation

5.1. Experimental Setting

Hardware setting. For our experiments, we used servers running the VMware ESXi operating system, each with two processing units (CPUs). The server had a total of 32 cores, 512GB of RAM and 100TB of SSD disk space. Figure 4 provides a comprehensive overview of the infrastructure we used. The outer agent process ran on a virtual machine (VM) with the Ubuntu 18.04 LTS operating system. The virtual machine had 16 CPU cores, 16GB of RAM, and 10TB of SSD storage. The agent uses a management service that allows both the training and execution of the DRL algorithm, using different tuning parameters.

The internal agent process ran, according to the specification provided in (Birman et al., 2019), on three VMs with the Ubuntu 18.04 LTS operating system. Each machine had a configuration of 4 CPUs and 16GB of RAM, with an additional 100GB of SSD storage. Upon the arrival of files for analysis, the outer agent stores them in a logical queue at a dedicated storage space, which is also accessible to the internal agent. Both agents send all logging information to external storage, where it is held in an indexing engine that supports search and analysis.

Figure 4. The experimental infrastructure's architecture. One virtual machine hosts the outer agent and the PE files. A separate set of virtual machines hosts the internal agent.

The dataset. We evaluated MERLIN on the dataset presented in (Birman et al., 2019), which consists of 25,000 executable files, half malicious and half benign. We obtained both the dataset and the reported running time of each detector for every file, which enabled us to train a DRL agent that accurately replicates the results reported in (Birman et al., 2019). As explained in Section 4, we use this architecture as our internal agent.

Training and the evaluation measure. We trained our outer agent with the goal of optimizing the scheduling process of all files. The metric used both to train the outer agent and to evaluate the overall performance of our approach was the average job completion time (Hall et al., 1997). Let $N$ be the number of files in the queue, $p_i$ the total processing time of file $f_i$ in the internal agent, and $w_i$ the waiting time of $f_i$ in the queue. The completion time for item $f_i$ in the queue is $C_i = w_i + p_i$. The average completion time is calculated as shown in Equation 1. In essence, this metric measures the average processing time for each file in the queue.

$$\bar{C} = \frac{1}{N} \sum_{i=1}^{N} (w_i + p_i) \qquad (1)$$
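
The average completion time can be computed directly from the per-file waiting and processing times. The sketch below also includes a hypothetical helper, `sequential_waiting_times`, which assumes files are processed one after another without preemption; it is there only to show how the processing order changes the metric and does not model the preemptive scheduling used in our approach.

```python
from typing import List, Sequence


def average_completion_time(processing_times: Sequence[float],
                            waiting_times: Sequence[float]) -> float:
    """Mean of (waiting time + processing time) over all files."""
    if len(processing_times) != len(waiting_times):
        raise ValueError("one waiting time is required per file")
    if not processing_times:
        raise ValueError("the queue is empty")
    total = sum(w + p for w, p in zip(waiting_times, processing_times))
    return total / len(processing_times)


def sequential_waiting_times(processing_times: Sequence[float]) -> List[float]:
    """Waiting times when files run back to back in the given order."""
    waits, elapsed = [], 0.0
    for p in processing_times:
        waits.append(elapsed)
        elapsed += p
    return waits


arrival_order = [10.0, 1.0, 3.0]        # toy processing times in seconds
shortest_first = sorted(arrival_order)
for order in (arrival_order, shortest_first):
    waits = sequential_waiting_times(order)
    print(round(average_completion_time(order, waits), 2))
# 11.67
# 6.33
```

The same three files yield a much lower average completion time when the short ones are processed first, which is the kind of gain a scheduling policy tries to achieve without knowing the processing times in advance.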


It is important to note that the experimental setting contains a high degree of uncertainty. Our approach uses no prior information about the analyzed files (not even their size, which is used by some of our baselines in Section 5.2). All available information on a given file is obtained solely through its processing (i.e., sending it to the internal agent). Additionally, we trained MERLIN on a queue size of 10 items. By doing so, we demonstrate our model's ability to easily scale to larger queues, sometimes larger by orders of magnitude, despite the high uncertainty of the dataset.

Hyperparameter settings. We used the following settings throughout our evaluation. The train/test split was 90%/10%, with the same split used for all experiments. The DRL architecture used was actor-critic with experience replay (Lin, 1992). The outer agent was trained for 34 epochs, which required 22 hours. The framework was implemented using OpenAI Gym and Python 3.6. Both DRL agents (the outer agent and the internal agent) used an actor-critic architecture with a single hidden layer of size 20. The hidden and output layers used the ReLU and Softmax functions, respectively. We used a replay buffer of size 10, which was activated after 1,000 episodes. Optimization was performed using RMSprop (Tieleman and Hinton, 2012), configured with a learning rate, an exponential decay rate and a fuzz factor (epsilon). All experiments were trained until convergence. We used penalties to discourage the agent from taking illegal actions (i.e., selecting files that were already classified).
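The penalty for illegal actions can be sketched as follows. This is only an illustration: the text states that penalties were used but not how large they were or how the regular reward was shaped, so the constant `ILLEGAL_ACTION_PENALTY`, the function `outer_agent_reward` and the choice of a negative time cost as the regular reward are all assumptions.

```python
# Hypothetical penalty value; the magnitude used in the experiments is not given.
ILLEGAL_ACTION_PENALTY = -10.0


def outer_agent_reward(selected_index, already_classified, step_cost):
    """Reward for selecting the file at `selected_index`.

    already_classified[i] is True when file i has already received its
    final classification. Selecting such a file is an illegal action and
    yields a fixed penalty. Otherwise the reward is the negative time cost
    of the step (an assumed shaping that favors short completion times).
    """
    if already_classified[selected_index]:
        return ILLEGAL_ACTION_PENALTY
    return -step_cost
```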

5.2. Scheduling Algorithms Baselines

We compare MERLIN both to “naive” solutions and to well-known scheduling algorithms. All of our chosen baselines are known to function well under high levels of uncertainty. Additionally, all baselines are able to seamlessly operate both on different queue sizes and on dynamic queues where additional items arrive stochastically.

It is important to note that all baselines are “competing” against the outer agent, i.e., they all select the order of the files to be sent to the internal agent. This was done for two reasons. First, these baselines are scheduling algorithms, and therefore cannot perform the classification process of individual files. Second, by using the same internal agent for all algorithms we ensure that the performance in terms of classification accuracy is uniform. We can therefore evaluate the various algorithms based on their running times. We wish to stress this point again, as it is crucial to understanding our evaluation: using different internal agents or allowing the internal agent to change its policy throughout the evaluation would lead to different detection rates and thus make the comparison between the algorithms impossible.

Because of the high uncertainty of our problem definition (i.e., no prior information exists on the analyzed files), several commonly-used scheduling approaches could not be used in our experiments. In order to overcome this limitation, we define two groups of baselines: a “realistic” group in which the baseline algorithms have access to the same information as our approach, and an “unrealistic” group in which the baselines have access to additional data that is not available to MERLIN. Our evaluation shows that MERLIN outperforms both groups (except for the SPT baseline, which serves as an optimal lower bound).

5.2.1. “Realistic” Baselines

This group consists of four baselines, all with access to the same information as MERLIN. It should be noted that two baselines in this group (SFF and LFF) also use the sizes of the analyzed files. MERLIN does not use this information, but since it can be easily obtained we include these baselines in the current group.

First Come First Serve (FCFS). A naive scheduling algorithm that schedules tasks according to their initial position in the queue. In our use-case (see Section 4), once a file reaches the top of the queue, it is processed by the internal agent until a classification decision is reached (i.e., “malware” or “benign”).

Smallest File First (SFF). A variant of the shortest job first (SJF) approach. Assuming that a smaller file is likely to require less time to classify, the algorithm sorts the files in the queue based on their size, in ascending order. The files are then sequentially processed until completion. We have also tested an inverted version of this scheduler: the longest file first (LFF) algorithm.
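Both size-based orderings reduce to a sort on file size. The sketch below assumes each queued file is described by a name and a size in bytes; the `QueuedFile` type and the function names are ours, introduced only for illustration.

```python
from typing import List, NamedTuple


class QueuedFile(NamedTuple):
    name: str
    size_bytes: int


def smallest_file_first(queue: List[QueuedFile]) -> List[QueuedFile]:
    """SFF: process files in ascending order of size."""
    return sorted(queue, key=lambda f: f.size_bytes)


def inverted_size_order(queue: List[QueuedFile]) -> List[QueuedFile]:
    """LFF: the inverted variant, descending order of size."""
    return sorted(queue, key=lambda f: f.size_bytes, reverse=True)


queue = [
    QueuedFile("b.exe", 4096),
    QueuedFile("a.exe", 512),
    QueuedFile("c.exe", 1_048_576),
]
sff_names = [f.name for f in smallest_file_first(queue)]
lff_names = [f.name for f in inverted_size_order(queue)]
```

Neither function mutates the incoming queue, so the same sampled queue can be handed to every scheduler under comparison.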

Multi Level Feedback Queue (MLFQ). A priority queue-based algorithm that allocates items to multiple sub-queues based on their required resources. In our experiments we used three sub-queues that partitioned the items based on the running time of the next detector assigned to them by the internal agent (i.e., the time of the next action to be performed on the file). Once the detector was applied to the file, the internal agent determined (but did not execute) the next detector that needed to be used. Based on the running time of that detector, the file was then assigned to the appropriate sub-queue. When the next action was the final classification, the item was removed from the queue.
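The sub-queue assignment described above can be sketched as follows. The running-time thresholds that separate the three sub-queues are not given in the text, so `THRESHOLDS_SECONDS` holds placeholder values, and serving the fastest sub-queue first is likewise an assumption. All function names are ours.

```python
from collections import deque

# Hypothetical boundaries (in seconds) between the three sub-queues.
THRESHOLDS_SECONDS = (1.0, 10.0)


def sub_queue_index(next_detector_seconds, thresholds=THRESHOLDS_SECONDS):
    """Map the running time of the next detector to a sub-queue index."""
    for level, limit in enumerate(thresholds):
        if next_detector_seconds <= limit:
            return level
    return len(thresholds)


def requeue(sub_queues, file_id, next_detector_seconds):
    """Place a file back into the sub-queue matching its next detector.

    `next_detector_seconds` is None when the next action is the final
    classification; in that case the file leaves the queue.
    """
    if next_detector_seconds is None:
        return
    sub_queues[sub_queue_index(next_detector_seconds)].append(file_id)


def next_file(sub_queues):
    """Serve the first non-empty sub-queue, fastest detectors first."""
    for sub_queue in sub_queues:
        if sub_queue:
            return sub_queue.popleft()
    return None


sub_queues = [deque() for _ in range(len(THRESHOLDS_SECONDS) + 1)]
requeue(sub_queues, "slow.exe", 42.0)
requeue(sub_queues, "fast.exe", 0.3)
requeue(sub_queues, "done.exe", None)  # final classification: not re-queued
first_served = next_file(sub_queues)
```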

5.2.2. “Unrealistic” Baselines

Each baseline in this group has access to information that is either unavailable to MERLIN (e.g., knowledge of general processing-time distributions) or “oracular” (knowledge of specific running times in advance). For each baseline, we specify the information available to it.

Shortest Expected Processing Time (SEPT). This baseline implements a stochastic scheduling approach. Since approaches of this type require knowledge about the distribution of the overall processing time of the population (Smith, 1956), we extract this information from the training set prior to running the scheduling algorithm.

Correlation Based Processing Time (CBPT). This baseline assumes that we have, in advance, the classification results (i.e., confidence scores) of one of the malware detectors for all files. Based on these scores, we sort the files in the queue according to their likelihood of being benign. Since benign files usually require less analysis than malicious ones, the internal agent usually processes them more quickly, which makes this a high-performing baseline. For this task we chose the detector with the highest Pearson correlation between its confidence scores and the true item labels.

It is important to note that we treat the confidence scores used for the item ranking as prior knowledge, meaning that the internal agent may still call the detector that produced the classifications as part of its analysis. Despite its relatively high performance (see Figure 5), we consider this baseline unrealistic, since applying even a single detector in advance to all the items in a large queue would lead to very poor performance due to the long time it would take to produce the classifications.
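The two steps of CBPT (choosing the detector, then ranking the files) can be sketched in plain Python. The sketch assumes that each detector emits a score where higher means more likely malicious and that labels are encoded as 1 for malicious and 0 for benign; these encodings, the detector names and the function names are illustrative assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant sequence carries no ranking information
    return cov / (std_x * std_y)


def pick_detector(scores_by_detector, labels):
    """Return the detector whose scores correlate best with the labels."""
    return max(
        scores_by_detector,
        key=lambda name: pearson(scores_by_detector[name], labels),
    )


def cbpt_order(file_ids, malicious_scores):
    """Order files from most likely benign to most likely malicious."""
    ranked = sorted(range(len(file_ids)), key=lambda i: malicious_scores[i])
    return [file_ids[i] for i in ranked]


labels = [0, 1, 0, 1]
scores_by_detector = {
    "detector_a": [0.1, 0.9, 0.2, 0.8],
    "detector_b": [0.5, 0.4, 0.6, 0.5],
}
chosen = pick_detector(scores_by_detector, labels)
order = cbpt_order(["f1", "f2", "f3", "f4"], scores_by_detector[chosen])
```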

Shortest Processing Time (SPT). The files are ordered in ascending order, based on their total classification time. In other words, we have perfect information on the time needed to classify each file. This baseline is guaranteed to achieve the top performance.

Longest Processing Time (LPT). The files are ordered in descending order, based on their total classification time. This baseline is guaranteed to achieve the worst performance.
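The guarantees for SPT and LPT can be checked by brute force on a small queue. The sketch below assumes files are processed one after another without interruption and enumerates every possible order of four illustrative processing times; the values themselves are made up.

```python
from itertools import accumulate, permutations


def avg_completion(times):
    """Average completion time of files processed in the given order."""
    return sum(accumulate(times)) / len(times)


# Illustrative processing times (seconds) with perfect prior knowledge.
times = [4.0, 1.0, 3.0, 2.0]

all_orders = list(permutations(times))
best_order = min(all_orders, key=avg_completion)
worst_order = max(all_orders, key=avg_completion)

spt_order = tuple(sorted(times))                # shortest first
lpt_order = tuple(sorted(times, reverse=True))  # longest first
```

Among all 24 orders, the ascending one attains the minimum and the descending one the maximum, which is why SPT and LPT serve as the lower and upper bounds in the evaluation.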

5.3. Experimental Results

We conducted three types of experiments, each with an increasing degree of complexity. We began by evaluating MERLIN on the queue size on which it was trained (10 items). Then, we evaluated our approach’s ability to perform on larger queue sizes. Finally, we evaluated a dynamic queue where new items are stochastically added.

5.3.1. Experiment 1: Fixed Queue Size

This evaluation was conducted on a queue of 10 items, which is also the input size of our DRL agent. To ensure the validity of our results, we randomly sampled 10 files from our test set and provided them to all evaluated algorithms. This process was repeated 2,500 times, with the presented results being the average performance across all runs.
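The sampling protocol can be sketched as a small evaluation loop in which every scheduler receives the same randomly drawn queue in each repetition. The sketch is a toy: it represents each file only by a processing time and uses simple orderings as stand-in schedulers, whereas the actual evaluation runs files through the internal agent. The function `evaluate` and its parameters are our own naming.

```python
import random
from itertools import accumulate


def evaluate(schedulers, test_times, queue_size=10, repetitions=2500, seed=0):
    """Average each scheduler's completion time over repeated random queues.

    schedulers maps a name to a function that takes a list of processing
    times and returns them in the order chosen for processing.
    """
    rng = random.Random(seed)
    totals = {name: 0.0 for name in schedulers}
    for _ in range(repetitions):
        # One shared sample per repetition keeps the comparison fair.
        queue = rng.sample(test_times, queue_size)
        for name, order in schedulers.items():
            scheduled = order(list(queue))
            totals[name] += sum(accumulate(scheduled)) / queue_size
    return {name: total / repetitions for name, total in totals.items()}


schedulers = {
    "FCFS": lambda q: q,
    "SPT": lambda q: sorted(q),
    "LPT": lambda q: sorted(q, reverse=True),
}
results = evaluate(schedulers, [float(t) for t in range(1, 51)], repetitions=200)
```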

Figure 5. Average completion time comparison in a 10-item queue (diagonal stripe: “realistic” baselines; vertical stripe: “unrealistic” baselines).

The results of this experiment, presented in Figure 5, clearly show that our approach outperforms all the evaluated baselines (except for SPT, which is the optimal scenario). When compared to the “realistic” baselines (diagonal stripe columns in Figure 5), the average job completion time is shorter by 27%-71%. When compared to the “unrealistic” baselines (vertical stripe columns), the average job completion time is shorter by 14%-57%.

Our analysis indicates that the reason for MERLIN’s superior performance is its ability to better handle uncertainty. Since both the outer and internal agents need to address uncertainty (albeit in different aspects of the challenge), the outer agent’s ability to infer the internal agent’s policy and behavioral patterns enables it to create its own complementary policy. The interaction between the two DRL agents is particularly evident in their dealings with difficult-to-classify files, which are files that the internal agent would not classify without applying multiple detectors. The outer agent identifies these time-consuming files early on (usually after the confidence score of the first detector is produced) and immediately pushes such files to the back of the queue in order to finish with the “easier” files first.

5.3.2. Experiment 2: Large Queue Size

Next we evaluate MERLIN’s ability to perform well on varying queue sizes. We ran the same experimental setup as Experiment 1, but with queue sizes ranging from 10 to 100. For each queue size, we generated 1,250 random queues which were used by all algorithms for evaluation. It is important to note that the architecture used in all experiments was trained on a queue size of 10.

10 16.91 19.72 23.32 25.49 40.84 56.41
20 28.74 32.53 37.14 45.26 78.28 111.14
30 41.01 45.31 50.82 64.32 115.38 166.53
40 50.44 58.08 64.96 83.40 152.92 221.96
50 62.55 70.89 79.15 103.35 189.05 276.65
60 74.22 83.89 92.58 122.04 227.22 332.22
70 85.05 96.55 106.75 141.61 264.95 387.17
80 95.44 109.29 120.64 160.80 302.00 442.48
90 105.48 122.26 134.37 178.20 339.48 499.50
100 115.50 134.97 148.60 198.70 374.90 553.80
Table 2. The average completion time for the different algorithms for a single file over queue sizes ranging from 10 to 100.

The results of our evaluation are presented in Table 2 and Figure 6. MERLIN once again outperforms all the baselines across all queue sizes. The percentage of improvement in performance over the realistic baselines is 22%-76%, while the improvement over the unrealistic baselines is 13%-64%. The results clearly show that our hierarchical approach to the modeling of large queues is very effective in enabling DRL-based solutions to scale to larger queue sizes.
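The improvement percentages can be related to the table values with a short calculation. The sketch below makes two assumptions that the extracted table does not state explicitly: that the first value column of Table 2 belongs to MERLIN (its 20- and 30-item entries, 28.74 and 41.01, coincide with the original architecture's results in Table 3), and that improvement is measured as the reduction relative to the baseline's completion time. The remaining columns are treated as unnamed baselines.

```python
def relative_reduction(ours, baseline):
    """Percentage by which `ours` is shorter than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Table 2 row for a queue size of 10: assumed MERLIN value, then baselines.
row_10 = [16.91, 19.72, 23.32, 25.49, 40.84, 56.41]
ours, *baselines = row_10
reductions = [relative_reduction(ours, b) for b in baselines]
```

Under these assumptions, the smallest reductions in this row come out at roughly 14% and 27%, in line with the lower ends of the ranges reported for the 10-item queue in Experiment 1.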

Additionally, we were interested in determining whether MERLIN’s superior performance is likely to persist for larger queue sizes. We therefore analyzed the increase in average file processing time as a function of the queue size for each of the evaluated algorithms. The results, presented in Figure 7, show that the increase in average processing time plateaus relatively quickly. This means that methods that achieved better performance for smaller queue sizes are likely to maintain their relative lead for larger queue sizes (it is important to note that each algorithm in Figure 7 is measured with respect to itself).

Figure 6. The relative performance of the baselines compared to MERLIN. A higher percentage means longer running times compared to our method.
Figure 7. The percentage of the increase in average file processing time as a function of the queue size (each algorithm is measured with respect to its own performance).

5.3.3. Experiment 3: Dynamic Queues with Stochastic Arrivals

Figure 8. Queue size behavior under different entry rates. The graphs show three use-cases of incoming rates, from left to right: incoming rate > average processing time, incoming rate = average processing time, and incoming rate < average processing time. MERLIN showed stability and consistency in its results with respect to the other methods.

In most real-world scenarios, queues are dynamic, with new items being added at various time intervals. This is the case for call centers, manufacturing floors, and (as in our use-case) organizational firewalls that filter incoming files. This scenario adds another level of complexity compared with the previous experiments, because the scheduling algorithms need to predict the number and characteristics of the incoming files.

We evaluated three use-cases (i.e., different scenarios). In the first use-case, the incoming file rate is higher than the average processing time. This means that the backlog will grow for all approaches, and the algorithms are evaluated on their ability to slow this growth. In the second use-case, the incoming file rate is equal to the average processing time. In this case we expect the size of the backlog to be stable, and the scheduling algorithms are evaluated based on the size of the backlog they keep. In the third use-case, the incoming file rate is lower than the average processing time, and the scheduling algorithms are evaluated based on their ability to keep the backlog as close as possible to zero.

The time interval for adding new files to the queue was identical for all three use-cases. It was derived from the average file processing time and its standard deviation, which we obtained through the analysis of our training set. Every 7.8 seconds, we randomly sampled a fixed number of files, with a different number used for each of the three use-cases. The variance in performance between the added file batches stems from the fact that the files of each batch are sampled randomly and therefore their characteristics vary.
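The dynamic setting can be sketched as a small single-server simulation in which a fixed-size batch arrives at every interval and the backlog is recorded at each arrival. The batch sizes used in the experiments are not recoverable here, so the values below are placeholders. The sketch also represents each file only by a sampled processing time and runs files to completion, both simplifying assumptions. All names are ours.

```python
import random


def simulate_backlog(batch_size, interval, total_files, pick_next,
                     sample_time, seed=0):
    """Return (arrival_time, waiting_files) samples for one scheduler.

    pick_next(queue) returns the index of the file to process next;
    sample_time(rng) draws the processing time of a newly arrived file.
    """
    rng = random.Random(seed)
    clock, next_arrival, arrived = 0.0, 0.0, 0
    queue, history = [], []
    while arrived < total_files or queue:
        # Admit every batch whose arrival time has already passed.
        while arrived < total_files and next_arrival <= clock:
            count = min(batch_size, total_files - arrived)
            queue.extend(sample_time(rng) for _ in range(count))
            arrived += count
            history.append((next_arrival, len(queue)))
            next_arrival += interval
        if queue:
            clock += queue.pop(pick_next(queue))  # process one file fully
        else:
            clock = next_arrival  # idle until the next batch arrives
    return history


history = simulate_backlog(
    batch_size=2,                       # placeholder batch size
    interval=7.8,                       # seconds between batches
    total_files=20,
    pick_next=lambda q: 0,              # FCFS stand-in scheduler
    sample_time=lambda rng: rng.uniform(1.0, 5.0),
)
```

Swapping `pick_next` for another selection rule while keeping the seed fixed compares schedulers on identical arrivals, mirroring the way all baselines face the same incoming batches.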

In all use-cases, we sampled 1,000 files overall and recorded the backlog of each analyzed algorithm until the backlog was cleared. The results of our experiments are presented in Figure 8, and they clearly show that MERLIN significantly outperformed all baselines (except for the optimal baseline, which we use as a bound). In all three scenarios, the backlog kept by our approach was the smallest, often by a significant margin. These results once again illustrate the effectiveness of our proposed approach in general and that of our hierarchical representation in particular.

5.3.4. Analysis: Training Larger DRL architectures

We now demonstrate the significant efficiency gains that can be obtained by using our proposed hierarchical approach for analyzing larger queues. To this end, we trained two additional architectures, where the sizes of the input and output layers of the outer agent were enlarged so that the architecture could analyze queue sizes of 20 and 30, respectively.

We compare the running times of the two new architectures to that of the original architecture. The results, presented in Table 3, show that in order to achieve performance comparable to that of the original architecture (trained on a queue size of 10), the larger architectures need to run for significantly longer periods of time. For example, in order to reach the same final average processing time as the original architecture, the version trained on a queue size of 20 needs to run almost seven times as long. For the version trained on a queue size of 30, we were not even able to obtain full convergence on our hardware and had to terminate the experiment.

Queue   # of      Training time (hours)    Avg. completion time (seconds)
size    epochs    Original    Larger       Original    Larger
20      10        5           19           34.46       39.62
20      20        12          42           31.48       36.64
20      34        22          73           28.74       30.24
20      40        -           102          -           30.04
20      58        -           141          -           29.86
30      10        5           24           49.17       65.19
30      20        12          51           44.91       61.47
30      34        22          95           41.01       53.88
Table 3. A comparison of the training time (hours) and the average completion time for a file in the queue (seconds) between the original architecture (trained on a queue size of 10) and two larger architectures trained on queue sizes of 20 and 30, respectively.

6. Related Work

The field of MOTS-related solutions is diverse both in its techniques and in the domains in which it is applied. For example, in the field of operations management, the authors of (Köksalan and Keha, 2003) developed a heuristic approach based on a genetic algorithm for the bi-objective single-machine scheduling problem of minimizing flow time (i.e., total processing time) and the number of tardy jobs (i.e., jobs whose execution is delayed). In the semiconductor manufacturing domain, the work presented in (Gupta and Sivakumar, 2005) addressed the problem of scheduling independent jobs on a single testing machine with due dates and sequence-dependent setup times.

RL has been proven to be an efficient method for scheduling. In the computing memory control domain, RL was used for resource allocation: in their paper (Ipek et al., 2008), the authors introduced a memory control-based scheduling algorithm that uses RL to fully utilize dynamic random-access memory (DRAM) bandwidth. This goal was achieved by observing the system state and estimating the long-term performance impact of each of the processor’s predefined possible actions. In the cloud computing domain (Mao et al., 2016), the authors presented an RL-based scheduling method for a multi-resource cluster environment that uses preliminary data on job duration and resource requirements per job. In the application domain (Zhang and Dietterich, 1995), the authors presented an automated repair system that uses an RL algorithm for scheduling and allocating repairs based on previously defined constraints. Our review of the related works indicates that there are no implementations of RL where scheduling was conducted under uncertainty (i.e., with no knowledge of the expected processing time of each job).

RL for multi-objective scheduling has also been introduced in several domains. In the manufacturing systems domain, (Aissani et al., 2009) presented an RL-based approach for the efficient scheduling of machine maintenance tasks. Their objective was to assign maintenance tasks between different resources while minimizing resource down-time due to maintenance. Another implementation, conducted in the grid computing resource allocation domain (Perez et al., 2010), proposed the use of RL for minimizing the time tasks wait for computational resources when those resources are in use by other tasks. To the best of our knowledge, there were no cases of using RL for multi-objective scheduling that included both resource allocation optimization and task scheduling of a queue.

Regarding the problem of changes in the state-action space, there are hardly any solutions, since RL relies on a fixed state-action space. In their paper, (Heffetz et al., 2019) showed an approach for dynamic action modeling that enables an RL agent to model a varying number of actions using a fixed-size representation. They devise a hierarchical representation of the action space, where each level of the hierarchy is split into equal-sized clusters of actions. The agent iterates over the clusters of each level, selecting one action per cluster. The chosen actions are passed to the next level of the hierarchy, which is then also clustered. The process is repeated until it reaches a hierarchy level that contains at most a fixed number of actions, from which a single action is chosen. We adopt this method and fit it to our specific needs.
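The hierarchical filtering procedure described in this paragraph can be sketched as a loop that repeatedly clusters the surviving actions and keeps one per cluster. This is a sketch of the description above rather than of the cited implementation: `choose` stands in for the fixed-size agent, `cluster_size` is a hypothetical parameter, and the last cluster of a level may be smaller than the others when the sizes do not divide evenly.

```python
def hierarchical_select(actions, cluster_size, choose):
    """Pick one action from an arbitrarily long list with a fixed-size chooser.

    choose(cluster) receives a non-empty list of at most `cluster_size`
    actions and returns one of them.
    """
    if cluster_size < 2:
        raise ValueError("cluster_size must be at least 2")
    level = list(actions)
    if not level:
        raise ValueError("at least one action is required")
    # Shrink the candidate set level by level until one cluster remains.
    while len(level) > cluster_size:
        clusters = [
            level[i:i + cluster_size]
            for i in range(0, len(level), cluster_size)
        ]
        level = [choose(cluster) for cluster in clusters]
    return choose(level)


# A stand-in chooser that prefers the smallest value in each cluster.
selected = hierarchical_select(range(100), cluster_size=10, choose=min)
```

Because `choose` never sees more than `cluster_size` candidates, an agent trained on a small fixed input can be applied to a far larger set of candidates.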

Hierarchical reinforcement learning. Hierarchical reinforcement learning (HRL) is a computational approach intended to address issues of RL such as large action/state spaces and generalization to complex environments by learning a policy made up of multiple layers, each of which is responsible for control at a different level of temporal abstraction.

HRL has been proposed in several forms. The feudal learning (Dayan and Hinton, 1993) approach takes advantage of two notions: the managerial hierarchy observes the environment at different resolutions, and communication between managers and “workers” is carried out through goals, with a reward given for reaching them. In (Bacon et al., 2017), the authors showed another two-level approach, where the bottom level is responsible for outputting actions given a sub-policy, and the top level outputs a sub-policy given a policy-over-options. Another approach, presented in (Dietterich, 2000), obtains the task hierarchy by decomposing the Q value of a state-action pair into the sum of two components: the total expected reward received when executing the action in the current state and the total reward expected from the performance of the parent task.

It is very important to note that all the approaches mentioned above, and to the best of our knowledge all HRL methods, do not address the challenges associated with MOTS problems, and particularly those which have conflicting goals.

7. Conclusions

We present MERLIN, a two-stage DRL-based scheduling approach specifically designed to tackle the challenges of multi-objective optimization and high levels of uncertainty. Through extensive evaluation, we show that MERLIN outperforms several commonly-used baselines by a wide margin. Additionally, we present a novel hierarchical approach for applying DRL on dynamic queues. The ability to train the DRL agent over smaller queue sizes enables the use of smaller architectures and leads to shorter convergence times. For future work, we plan to further explore the application of MERLIN in additional domains.


  • N. Aissani, B. Beldjilali, and D. Trentesaux (2009) Dynamic scheduling of maintenance tasks in the petroleum industry: a reinforcement approach. Engineering Applications of Artificial Intelligence 22 (7), pp. 1089–1103.
  • P. Bacon, J. Harb, and D. Precup (2017) The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Y. Birman, S. Hindi, G. Katz, and A. Shabtai (2019) ASPIRE: automated security policy implementation using reinforcement learning. CoRR abs/1905.10517.
  • P. Dayan and G. E. Hinton (1993) Feudal reinforcement learning. In Advances in Neural Information Processing Systems, pp. 271–278.
  • T. G. Dietterich (2000) Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13, pp. 227–303.
  • R. Ganesan, S. Jajodia, A. Shah, and H. Cam (2016) Dynamic scheduling of cybersecurity analysts for minimizing risk using reinforcement learning. ACM Trans. Intell. Syst. Technol. 8 (1). ISSN 2157-6904.
  • A. K. Gupta and A. I. Sivakumar (2005) Single machine scheduling with multiple objectives in semiconductor manufacturing. The International Journal of Advanced Manufacturing Technology 26 (9-10), pp. 950–958.
  • L. A. Hall, A. S. Schulz, D. B. Shmoys, and J. Wein (1997) Scheduling to minimize average completion time: off-line and on-line approximation algorithms. Mathematics of Operations Research 22 (3), pp. 513–544.
  • Y. Heffetz, R. Vainshtein, G. Katz, and L. Rokach (2019) DeepLine: AutoML tool for pipelines generation using deep reinforcement learning and hierarchical actions filtering. CoRR abs/1911.00061.
  • E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana (2008) Self-optimizing memory controllers: a reinforcement learning approach. In ACM SIGARCH Computer Architecture News, Vol. 36, pp. 39–50.
  • T. Kimoto, K. Asakawa, M. Yoda, and M. Takeoka (1990) Stock market prediction system with modular neural networks. In 1990 IJCNN International Joint Conference on Neural Networks, pp. 1–6.
  • M. Köksalan and A. B. Keha (2003) Using genetic algorithms for single-machine bicriteria scheduling problems. European Journal of Operational Research 145 (3), pp. 543–556.
  • L. Lin (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8 (3-4), pp. 293–321.
  • H. Mao, M. Alizadeh, I. Menache, and S. Kandula (2016) Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, pp. 50–56.
  • H. Mao, M. Schwarzkopf, S. B. Venkatakrishnan, Z. Meng, and M. Alizadeh (2019) Learning scheduling algorithms for data processing clusters. In Proceedings of the ACM Special Interest Group on Data Communication, pp. 270–288.
  • J. H. May, W. E. Spangler, D. P. Strum, and L. G. Vargas (2011) The surgical scheduling problem: current research and future opportunities. Production and Operations Management 20 (3), pp. 392–405.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
  • J. Perez, C. Germain-Renaud, B. Kégl, and C. Loomis (2010) Multi-objective reinforcement learning for responsive grids. Journal of Grid Computing 8 (3), pp. 473–492.
  • J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
  • D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354.
  • W. E. Smith (1956) Various optimizers for single-stage production. Naval Research Logistics Quarterly 3 (1-2), pp. 59–66.
  • S. G. Staniford and A. Aziz (2015) Systems and methods for scheduling analysis of network content for malware. US Patent 8,990,939. Google Patents.
  • F. P. Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and J. Clune (2017) Deep neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. 2nd edition, MIT Press, Cambridge.
  • T. Tieleman and G. Hinton (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4 (2), pp. 26–31.
  • W. Zhang and T. G. Dietterich (1995) A reinforcement learning approach to job-shop scheduling. In IJCAI, Vol. 95, pp. 1114–1120.
  • W. Zhao and J. A. Stankovic (1989) Performance analysis of FCFS and improved FCFS scheduling algorithms for dynamic real-time computer systems. In Proceedings of the Real-Time Systems Symposium, pp. 156–165.