Deep Reinforcement Learning in System Optimization

08/04/2019 ∙ by Ameer Haj-Ali, et al. ∙ berkeley college

The recent advancements in deep reinforcement learning have opened new horizons and opportunities to tackle various problems in system optimization. Such problems are generally tailored to delayed, aggregated, and sequential rewards, which is an inherent behavior in the reinforcement learning setting, where an agent collects rewards while exploring and exploiting the environment to maximize the long term reward. However, in some cases, it is not clear why deep reinforcement learning is a good fit for the problem. Sometimes, it does not perform better than the state-of-the-art solutions. And in other cases, random search or greedy algorithms could outperform deep reinforcement learning. In this paper, we review, discuss, and evaluate the recent trends of using deep reinforcement learning in system optimization. We propose a set of essential metrics to guide future works in evaluating the efficacy of using deep reinforcement learning in system optimization. Our evaluation includes challenges, the types of problems, their formulation in the deep reinforcement learning setting, embedding, the model used, efficiency, and robustness. We conclude with a discussion on open challenges and potential directions for pushing further the integration of reinforcement learning in system optimization.




1 Introduction

Reinforcement learning (RL) is a class of stochastic optimization techniques for Markov Decision Processes (MDPs) Bellman (1957), when the MDP is not known. In RL, an agent continually interacts with the environment Kaelbling et al. (1996); Sutton and Barto (2018). In particular, the agent observes the state of the environment and, based on this observation, takes an action. The goal of the RL agent is then to compute a policy, a mapping between the environment states and actions, that maximizes a long term reward. There are multiple ways to extrapolate the policy. Non-approximation methods usually fail to predict good actions in states that were not visited in the past, and require storing all the action-reward pairs for every visited state, a task that incurs a huge memory overhead and complex computation. Instead, approximation methods have been proposed. Among the most successful ones is using a neural network in conjunction with RL, also known as deep RL. Deep models allow RL algorithms to solve complex problems in an end-to-end fashion, handle unstructured environments, learn complex functions, or predict actions in states that have not been visited in the past. Deep RL has recently gained wide interest due to its success in robotics, Atari games, and superhuman capabilities Mnih et al. (2013); Doya (2000); Kober et al. (2013); Peters et al. (2003). Deep RL was the key technique behind defeating the human European champion in the game of Go, which has long been viewed as the most challenging of classic games for artificial intelligence Silver et al. (2016).

Many system optimization problems have a nature of delayed, sparse, aggregated or sequential rewards, where improving the long term sum of rewards is more important than a single immediate reward. For example, an RL environment can be a computer cluster. The state could be defined as a combination of the current resource utilization, available resources, time of the day, duration of jobs waiting to run, etc. The action could be to determine on which resources to schedule each job. The reward could be the total revenue, jobs served in a time window, wait time, energy efficiency, etc., depending on the objective. In this example, if the objective is to minimize the waiting time of all jobs, then a good solution must interact with the computer cluster and monitor the overall wait time of the jobs to determine good schedules. This behavior is inherent in RL. RL also has the advantage of self-learning, adaptation, and exploration, which do not require prior knowledge. RL can also learn sophisticated system characteristics that a straightforward solution like first come first served allocation scheme cannot. For instance, it could be better to put earlier long-running arrivals on hold if a shorter job requiring fewer resources is expected shortly.
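The scheduling intuition at the end of this paragraph can be checked with a small worked example. The sketch below is illustrative only: the single-machine setting, the job names, and the arrival times and durations are assumptions, not values from any surveyed work. It compares first come first served against holding the long job until the short one has run.

```python
def total_wait(jobs, order):
    """Total waiting time on one machine.

    jobs maps a (hypothetical) job name to (arrival_time, duration);
    order is the sequence in which the jobs are run.
    """
    clock = 0.0
    wait = 0.0
    for name in order:
        arrival, duration = jobs[name]
        start = max(clock, arrival)   # a job cannot start before it arrives
        wait += start - arrival       # time the job spent in the queue
        clock = start + duration
    return wait


# Assumed toy workload: a long job arrives first, a short one just after.
jobs = {"long": (0.0, 10.0), "short": (1.0, 1.0)}

fcfs_wait = total_wait(jobs, ["long", "short"])   # short job waits 9 time units
hold_wait = total_wait(jobs, ["short", "long"])   # long job waits 2 time units
```

Under these assumed numbers the first come first served order accumulates 9 units of waiting time while the delayed order accumulates 2, which is the kind of long-term trade-off an RL scheduler is expected to learn from experience.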

In this paper, we review different attempts to overcome system optimization challenges with the use of deep RL. Unlike previous reviews Hameed et al. (2016); Mahdavinejad et al. (2018); Krauter et al. (2002); Wang and O’Boyle (2018); Ashouri et al. (2018); Luong et al. (2019) that focus on machine learning methods without discussing deep RL models or applying them beyond a specific system problem, we focus on deep RL in system optimization in general. From reviewing prior work, it is evident that standardized metrics for assessing deep RL solutions in system optimization problems are lacking. We thus propose essential metrics to guide future work in evaluating the use of deep RL in system optimization. We also discuss and address multiple challenges that are faced when integrating deep RL into systems.

We observe that the system problems that have been tackled thus far with deep RL include packet classification Liang et al. (2019), congestion control Jay et al. (2019); Ruffy et al. (2018), resource allocation and scheduling in the cloud Mao et al. (2016); He et al. (2017a, b); Tesauro et al. (2006); Xu et al. (2017); Liu et al. (2017); Arabnejad et al. (2017); Xu et al. (2012); Rao et al. (2009), query optimization Krishnan et al. (2018); Marcus and Papaemmanouil (2018); Zhong et al. (2017); Ortiz et al. (2018); Guu et al. (2017); Liang et al. (2016); Marcus et al. (2019), compiler optimization Huang et al. (2019); Kulkarni and Cavazos (2012), and scheduling Addanki et al. (2019); Paliwal et al. (2019); Coons et al. (2008), with most of the works focusing on resource allocation and scheduling in the cloud.

We anticipate that the use of deep RL in system optimization will grow. However, we believe there is much room for improvement in the RL algorithms, system simulators, and problems tackled.

2 Background

Figure 1: Reinforcement learning. By observing the state of the environment, the agent takes actions for which it receives rewards. The agent’s goal is to take actions that maximize cumulative reward.

One of the promising machine learning approaches is reinforcement learning (RL), in which an agent learns by continually interacting with an environment Kaelbling et al. (1996). In RL, the agent observes the state of the environment and, based on this state/observation, takes an action, as illustrated in Figure 1. The ultimate goal is to compute a policy, a mapping between the environment states and actions, that maximizes expected future reward. RL can be viewed as a stochastic optimization solution for solving Markov Decision Processes (MDPs) Bellman (1957), when the MDP is not known. An MDP is defined by a tuple with four elements: $(S, A, P, r)$, where $S$ is the set of states of the environment, $A$ describes the set of actions or transitions between states, $s' \sim P(s, a)$ describes the probability distribution of next states given the current state and action, and $r(s, a)$ is the reward of taking action $a$ in state $s$. Given an MDP, the goal of the agent is to gain the largest possible cumulative reward. The objective of an RL algorithm associated with an MDP is to find a decision policy $\pi^*(a \mid s)$ that achieves this goal for that MDP:

$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi(\tau)}\left[\sum_{t=1}^{T} r(s_t, a_t)\right] = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi(\tau)}\left[r(\tau)\right] \tag{1}$$

where $\tau$ is a sequence of states and actions that define a single episode, and $T$ is the length of that episode. Deep RL leverages a neural network to learn the policy (and sometimes the reward function). Over the past couple of years, a plethora of new deep RL techniques have been proposed Mnih et al. (2016); Ross et al. (2011); Sutton et al. (2000); Schulman et al. (2017); Lillicrap et al. (2015).

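Policy Gradient (PG) Sutton et al. (2000), for example, uses a neural network to represent the policy. This policy is updated directly by differentiating the term in Equation 1 as follows:

$$\nabla_{\theta} J = \nabla_{\theta}\, \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[r(\tau)\right] = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[\nabla_{\theta} \log \pi_{\theta}(\tau)\, r(\tau)\right]$$

and updating the network parameters (weights) $\theta$ in the direction of the gradient, with learning rate $\alpha$:

$$\theta \leftarrow \theta + \alpha \nabla_{\theta} J$$

Proximal Policy Optimization (PPO) Schulman et al. (2017) improves on top of PG for more deterministic, stable, and robust behavior by limiting the updates and ensuring the deviation from the previous policy is not large.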
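The policy gradient update described above can be sketched on a toy problem. The code below is a minimal illustration, not the method of any surveyed work: it uses a tabular softmax policy in place of a neural network, and the environment (a fixed-horizon episode where action 1 pays a reward of 1 and action 0 pays nothing), the step size, and the function names are all assumptions made for the example.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def run_reinforce(episodes=3000, horizon=3, alpha=0.1, seed=0):
    """Vanilla policy gradient on an assumed toy episodic task.

    The state is the step index; theta[s][a] is the preference for
    action a in state s. Returns the learned probability of action 1
    in every state.
    """
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(horizon)]
    for _ in range(episodes):
        trajectory = []
        episode_reward = 0.0
        for s in range(horizon):
            probs = softmax(theta[s])
            a = 0 if rng.random() < probs[0] else 1
            trajectory.append((s, a, probs))
            episode_reward += float(a)  # assumed reward: 1 for action 1, else 0
        # theta <- theta + alpha * grad log pi(tau) * r(tau)
        for s, a, probs in trajectory:
            for b in range(2):
                grad_log = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += alpha * grad_log * episode_reward
    return [softmax(row)[1] for row in theta]


learned = run_reinforce()
```

The update is applied once per episode, after the whole trajectory has been collected, which is the episodic behavior that distinguishes this family from the temporal difference methods. No baseline or clipping is used here; PPO adds constraints on top of this basic update.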

In contrast, Q-Learning Watkins and Dayan (1992), state-action-reward-state-action (SARSA) Rummery and Niranjan (1994) and deep deterministic policy gradient (DDPG) Lillicrap et al. (2015) are temporal difference methods, i.e., they update the policy on every timestep (action) rather than on every episode. Furthermore, these algorithms bootstrap and, instead of using a neural network for the policy itself, they learn a Q-function $Q(s, a)$, which estimates the long term reward from taking action $a$ in state $s$. The policy is then defined using this Q-function. In Q-learning the Q-function is updated as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor. In other words, the Q-function updates are performed based on the action that maximizes the value of that Q-function. On the other hand, in SARSA, the Q-function is updated as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r(s_t, a_t) + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$

i.e., the Q-function updates are performed based on the action $a_{t+1}$ that the policy would select given state $s_{t+1}$. DDPG fits multiple neural networks to the policy, including the Q-function and target time-delayed copies that slowly track the learned networks and greatly improve stability in learning.
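The difference between the two temporal difference updates is easiest to see side by side. The following is a minimal tabular sketch; the function names, the dictionary-based Q-table, and the default learning rate and discount factor are assumptions for illustration, and the surveyed deep RL works replace the table with a neural network.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstrap from the best action in the next state."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap from the action the policy actually chose."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


# Usage on an assumed two-action problem with a partially filled table.
Q_off = defaultdict(float, {(1, 0): 1.0, (1, 1): 2.0})
Q_on = defaultdict(float, {(1, 0): 1.0, (1, 1): 2.0})

q_learning_update(Q_off, s=0, a=0, r=1.0, s_next=1, actions=[0, 1])
sarsa_update(Q_on, s=0, a=0, r=1.0, s_next=1, a_next=0)
```

Both functions run after a single environment step, so the estimate can change at every timestep rather than at the end of an episode. With the same transition, the two rules produce different values whenever the policy's next action is not the greedy one.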

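Algorithms such as upper-confidence-bound and greedy could later be used to determine the way the policy is defined based on the Q-function Auer (2002); Sutton and Barto (2018). The surveyed works in this paper focus on the epsilon greedy method where the policy is defined as follows:

$$\pi(s) = \begin{cases} \arg\max_{a} Q(s, a) & \text{with probability } 1 - \epsilon \\ \text{a random action} & \text{with probability } \epsilon \end{cases}$$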
A method is considered to be on-policy if the new policy is computed directly from the decisions made by the current policy. PG, PPO, and SARSA are thus on-policy while DDPG and Q-Learning are off-policy. All the mentioned methods are model-free: they do not require a model of the environment to learn, but instead learn directly from the environment by trial and error. In some cases, a model of the environment could be available. It might also be possible to learn a model of the environment. This model could be used for planning and enable more robust training, as less interaction with the environment may be required.

Field | Problem | Works
Networking | Congestion Control | Ruffy et al. (2018); Jay et al. (2019); Ruffy et al. (2018); Ruffy et al. (2018)
Networking | Classification | Liang et al. (2019)
Cloud | Performance | Xu et al. (2012); Tesauro et al. (2006); Arabnejad et al. (2017); Rao et al. (2009); Mao et al. (2016); He et al. (2017a); He et al. (2017b)
Cloud | Energy Efficiency | Xu et al. (2017); Liu et al. (2017)
 | Sequence to | Zhong et al. (2017); Guu et al. (2017); Liang et al. (2016)
 | Query Optimization | Krishnan et al. (2018); Marcus et al. (2019); Ortiz et al. (2018); Marcus and Papaemmanouil (2018)
Compiler | Phase Ordering | Huang et al. (2019); Huang et al. (2019); Kulkarni and Cavazos (2012)
Scheduling | Device Placement | Paliwal et al. (2019); Addanki et al. (2019); Paliwal et al. (2019); Coons et al. (2008)
Table 1: The RL agents used for each system problem.

Most RL methods considered in this work are structured around value function estimation (e.g., Q-values) and using gradients to update the policy. However, this is not always the case. For example, genetic algorithms, simulated annealing, genetic programming, and other gradient-free optimization methods, often called evolutionary methods Sutton and Barto (2018), can also solve RL problems. They are analogous to the way biological evolution produces organisms with skilled behavior. Evolutionary methods can be effective if the space of policies is sufficiently small, the policies are common and easy to find, and the state of the environment is not fully observable. This work considers only the deep versions of these methods, i.e., using a neural network in conjunction with evolutionary methods that are usually used to evolve and update the neural network parameters or vice versa.

Multi-armed bandits Berry and Fristedt (1985); Auer et al. (2002) simplify RL by removing the learning dependency on state and thus providing evaluative feedback that depends entirely on the action taken (1-step RL problems). The actions are usually decided in a greedy manner, by updating the benefit estimates of performing each action independently of other actions. To take the state into account in a bandit solution, contextual bandits can be used Chu et al. (2011). In many cases, a bandit solution might perform as well as the more complex full RL solution, or even better or faster.
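A bandit of the kind described above fits in a few lines, which is part of its appeal as a baseline. The sketch below assumes an epsilon-greedy action choice, Gaussian reward noise, and hypothetical arm means; none of these come from the surveyed works.

```python
import random


def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy multi-armed bandit with incremental mean estimates.

    true_means are the (assumed) expected rewards of each arm; the agent
    never sees them and only observes noisy rewards.
    """
    rng = random.Random(seed)
    n = len(true_means)
    estimates = [0.0] * n   # benefit estimate per action
    counts = [0] * n        # how often each action was taken
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                             # explore
        else:
            arm = max(range(n), key=lambda i: estimates[i])    # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Each estimate is updated independently of the other actions.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = run_bandit([0.0, 0.5, 1.5])
most_pulled = counts.index(max(counts))
```

There is no state and no bootstrapping: each estimate is simply the running average of the rewards observed for that action. A contextual bandit would keep one such set of estimates per context.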

2.1 Prior RL Works With Alternative Approximation Methods

Multiple prior works used approximation methods other than deep neural networks for RL in system optimization. These works address reliability and monitoring Das et al. (2014); Zhu and Yuan (2007); Zeppenfeld et al. (2008) and memory management Ipek et al. (2008); Andreasson et al. (2002); Peled et al. (2015); Diegues and Romano (2014) in multicore systems, as well as congestion control Li et al. (2016); Silva et al. (2016), packet routing Choi and Yeung (1996); Littman and Boyan (2013); Boyan and Littman (1994), algorithm selection Lagoudakis and Littman (2000), cloud caching Sadeghi et al. (2017), energy efficiency Farahnakian et al. (2014), and performance Peng et al. (2015); Jamshidi et al. (2015); Barrett et al. (2013); Arabnejad et al. (2017); Mostafavi et al. (2018). Instead of using a neural network to approximate the policy, these works used tables, linear approximations, and other approximation methods to train and represent the policy. Tables were generally used to store the Q-values, i.e., one value for each state-action pair; these values are used in training, and the table becomes the ultimate policy. Neural networks in general can outperform both approaches Lin (1993). Unlike neural networks, tables require very large memory to store every visited state and action, and cannot properly approximate actions in new states that might be similar to other states visited in the past. Neural networks can also learn both linear and non-linear functions.

3 RL in System Optimization

In this section we discuss the different system challenges tackled using RL and divide them into two categories: Episodic Tasks in which the agent-environment interaction naturally breaks down into a sequence of separate terminating episodes and Continuing Tasks in which it does not. For example, when optimizing resources in the cloud, the jobs arrive continuously and there is not a clear termination state. But when optimizing the order of SQL joins, the query has a finite number of joins, and thus after enough steps the agent arrives at a terminating state.

3.1 Continuing Tasks

An important feature of RL is its ability to learn without prior knowledge of the target scenario, to update its knowledge of the environment from actual observations, and to adapt to unpredictable environments. Jobs in the cloud arrive in an unpredictable and continuous manner. This might explain why most system optimization challenges tackled with RL were in the cloud Mao et al. (2016); He et al. (2017a, b); Tesauro et al. (2006); Xu et al. (2017); Liu et al. (2017); Xu et al. (2012); Rao et al. (2009). A good job scheduler in the cloud should make decisions that are good in the long term. Such a scheduler should sometimes forgo short term gains in an effort to realize greater long term benefits. For example, it might be better to delay a long running job if a short running job is expected to arrive soon. It should also adapt to variations in the underlying resource performance and scale in the presence of new or unseen workloads combined with large numbers of resources.

These schedulers have a variety of objectives, including improving the average performance of jobs and optimizing the resource allocation of the virtual machines Mao et al. (2016); Tesauro et al. (2006); Xu et al. (2012); Rao et al. (2009), optimizing data caching on edge devices and base stations versus the cloud He et al. (2017a, b), and energy efficiency Xu et al. (2017); Liu et al. (2017). Table 1 lists the algorithms used for addressing each problem.

Interestingly, for cloud challenges most works were driven by Q-learning (or the very similar SARSA). In the absence of a complete environmental model, model-free Q-learning can be used to generate optimal policies. It is able to make predictions incrementally by bootstrapping the current estimate onto previous estimates and provide good sample efficiency Jin et al. (2018). It is also characterized by inherent continuous temporal difference behavior, where the policy can be updated immediately after each step rather than at the end of the trajectory, which might be very useful for online adaptation.

3.2 Episodic Tasks

Due to the sequential nature of decision making in RL, the order of the actions taken has a major impact on the rewards the RL agent collects. The agent can thus learn these patterns and pick more rewarding actions. Previous works took advantage of this behavior in RL to optimize congestion control Jay et al. (2019); Ruffy et al. (2018), decision trees for packet classification Liang et al. (2019), sequence to SQL/program translation Zhong et al. (2017); Guu et al. (2017); Liang et al. (2016), order of SQL joins Krishnan et al. (2018); Ortiz et al. (2018); Marcus and Papaemmanouil (2018); Marcus et al. (2019), compiler phase ordering Huang et al. (2019); Kulkarni and Cavazos (2012), and device placement Addanki et al. (2019); Paliwal et al. (2019).

In these problems, after enough steps, the agent will always arrive at a clear terminating step. For example, in query join order optimization, the number of joins is finite and known from the query. As another example, in congestion control, where the routers need to adapt the sending rates to provide high throughput without compromising fairness, the updates are performed on a fixed number of senders/receivers known in advance. These updates combined define one episode. This episodic behavior might explain why there is a bigger trend towards using PG methods for these types of problems, as they do not require continuous temporal difference behavior and can operate in batches, for example batches of multiple queries. Nevertheless, in some cases Q-learning was still used, mainly for sample efficiency as the environment step might take a relatively long time.

To improve the performance of PG methods it is possible to take advantage of the way the gradient computation is performed. If the environment is not needed to generate the observation, it is possible to save many environment steps. This is achieved by rolling out the whole episode by interacting only with the policy and performing one environment step at the very end. The sum of rewards will be the same as the reward received from this environment step. For example, in query optimization, since the observations are encoded directly from the actions and the environment is mainly used to generate the rewards, it is possible to repeatedly perform an action, form the observation directly from this action, and feed it to the policy network. At the end of the episode, the environment can be triggered to get the final reward, which would be the sum of the intermediate rewards. This could significantly reduce the training time.
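The rollout trick described in this paragraph can be sketched for join ordering. Everything in the code below is a hypothetical stand-in: `CountingCostModel` plays the role of the (slow) environment with a made-up cost formula, and `random_policy` replaces a trained policy network. The point is only that observations are built from the actions taken so far, so the environment is queried once per episode.

```python
import random


class CountingCostModel:
    """Hypothetical environment: scores a complete join order and counts calls."""

    def __init__(self, table_sizes):
        self.table_sizes = table_sizes
        self.calls = 0

    def evaluate(self, plan):
        self.calls += 1
        # Toy cost: intermediate result sizes accumulate as tables are joined.
        cost, running = 0, 1
        for table in plan:
            running *= self.table_sizes[table]
            cost += running
        return -cost  # reward is the negative cost


def rollout(tables, policy, cost_model):
    """Build a whole episode from the policy alone, then score it once."""
    remaining = sorted(tables)
    plan = []
    while remaining:
        # The observation is derived from past actions, not from the environment.
        observation = (tuple(plan), tuple(remaining))
        action = policy(observation)
        remaining.remove(action)
        plan.append(action)
    return plan, cost_model.evaluate(plan)  # single environment step


def random_policy(observation, rng=random.Random(0)):
    """Placeholder for a policy network: picks any table that is still unjoined."""
    _, remaining = observation
    return rng.choice(remaining)


sizes = {"a": 10, "b": 2, "c": 5}
model = CountingCostModel(sizes)
plan, episode_reward = rollout(sizes.keys(), random_policy, model)
```

With three tables the episode has three policy decisions but only one call to the cost model, so the saving grows with the episode length and with the cost of a real environment step.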

3.3 Discussion: Continuous VS Episodic

Continuous policies can handle both continuous and episodic tasks while episodic policies cannot. So, for example, Q-Learning can handle all the tasks mentioned in this work, while PG based methods cannot handle them without special handling. For example, in Mao et al. (2016), the authors limited the scheduler window of jobs to $M$, allowing the agent in every time step to schedule up to $M$ jobs out of all arrived jobs. The authors also discussed this issue of "bounded time horizon" and hoped to overcome it by using a value network to replace the time-dependent baseline. It is interesting to notice that previous works on continuous system optimization tasks using non-deep RL approaches Choi and Yeung (1996); Littman and Boyan (2013); Boyan and Littman (1994); Peng et al. (2015); Jamshidi et al. (2015); Barrett et al. (2013); Arabnejad et al. (2017); Sadeghi et al. (2017); Farahnakian et al. (2014) used Q-Learning.

One solution for handling continuing problems with PG based methods without episode boundaries is to define performance in terms of the average rate of reward per time step Sutton and Barto (2018) (Chapter 13.6). Such an approach could help better fit continuing problems to these episodic RL algorithms.
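For reference, the average-reward formulation mentioned above is usually written as shown below. This is the standard textbook form in the notation of Sutton and Barto (2018), not a formula taken from the surveyed works; $R_t$, $S_0$, $A_t$ and $G_t$ are that textbook's symbols for the reward, initial state, action and return.

```latex
% Average rate of reward per time step under policy \pi
r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h}
    \mathbb{E}\left[ R_t \mid S_0, A_{0:t-1} \sim \pi \right]

% Differential return: rewards are measured relative to the average rate
G_t \doteq \left(R_{t+1} - r(\pi)\right) + \left(R_{t+2} - r(\pi)\right)
         + \left(R_{t+3} - r(\pi)\right) + \cdots
```

Because the return is defined relative to $r(\pi)$, it stays finite without discounting and without an episode boundary, which is what lets a policy gradient method be applied to a continuing task.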

Description | Reference | State/Observation | Action | Reward | Objective | Model
congestion control | Jay et al. (2019); Ruffy et al. (2018) | histories of sending rates and resulting statistics (e.g., loss rate) | changes to sending rate | and negative of latency or loss rate | maximize throughput while maintaining |
packet classification | Liang et al. (2019) | encoding of the node, e.g., the split | cutting a classification tree node or partitioning a set of rules | classification time | build optimal decision tree for packet |
SQL join order optimization | Krishnan et al. (2018); Ortiz et al. (2018) | encoding of the query | next join | negative cost | minimize execution |
query optimization | Marcus et al. (2019) | query/plan encodings | next join | negative cost | performance | tree conv.,
SQL join order optimization | Marcus and Papaemmanouil (2018) | matrix encoding of the join tree structure | next join | 1/cost | minimize execution |
sequence to | Zhong et al. (2017) | SQL vocabulary, question, column | query corresponding to the token | -2 invalid query, -1 valid but wrong result, +1 valid and right result | tokens in the WHERE clause |
language to program translation | Guu et al. (2017) | natural language | a sequence of program tokens | 1 if correct result, 0 otherwise | generate equivalent |
semantic parsing | Liang et al. (2016) | embedding of the words | a sequence of program tokens | positive if correct, 0 otherwise | generate equivalent |
resource allocation in the cloud | Mao et al. (2016) | current allocation of cluster resources & resource profiles of waiting jobs | next job to schedule | all jobs in the system ($T_j$ is the duration of job $j$) | minimize average job slowdown |
resource allocation | He et al. (2017a, b) | status of edge devices, base stations, content caches | which base station, to offload/cache or not | total revenue | maximize total |
resource allocation in the cloud | Tesauro et al. (2006) | current allocation & demand | next resource to allocate | payments | maximize revenue | FCNN
resource allocation in cloud radio access networks | Xu et al. (2017) | active remote radio heads & user demands | which remote radio heads to activate | negative power | |
cloud resource allocation & power management | Liu et al. (2017) | current allocation & demand | next resource to allocate | linear combination of total power consumption, VM latency, & reliability metrics | power efficiency | weight sharing
automate virtual machine (VM) configuration process | Rao et al. (2009); Xu et al. (2012) | current resource | | -response time | maximize performance |
compiler phase | Kulkarni and Cavazos (2012); Huang et al. (2019) | program features | next optimization | | minimize execution |
device placement | Paliwal et al. (2019); Addanki et al. (2019) | computation graph | of graph node | | maximize performance & minimize peak |
distributed instruction placement | Coons et al. (2008) | instruction features | instruction placement | speedup | maximize performance | FCNN
Table 2: Problem formulation in the deep RL setting.
Work | Problem | Step Time | Number of Steps Per Iteration | Number of Training Iterations | Total Number of Steps | Improves State of the Art | Compares Against Random Search/Bandit
Liang et al. (2019) | | 20-600ms | up to 60,000 | up to 167 | 1,002,000 | ✓ (18%) |
Jay et al. (2019) | | 50-500ms | 8192 | 1200 | 9,830,400 | ✓ (similar) |
Ruffy et al. (2018) | | 0.5s | N/A | N/A | 50,000-100,000 | |
Mao et al. (2016) | | 10-40ms | 20,000 | 1000 | 20,000,000 | ✓ (10-63%) |
He et al. (2017a, b) | | N/A | N/A | 20,000 | N/A | no comparison |
Tesauro et al. (2006) | | N/A | N/A | 10,000-20,000 | N/A | no comparison |
Xu et al. (2017) | | N/A | N/A | N/A | N/A | no comparison |
Liu et al. (2017) | | 1-120 minutes | 100,000 | 20 | 2,000,000 | no comparison |
Rao et al. (2009); Xu et al. (2012) | | N/A | N/A | N/A | N/A | no comparison |
Krishnan et al. (2018) | | 10ms | 640 | 100 | 64,000 | ✓ (70%) |
Ortiz et al. (2018) | | N/A | N/A | N/A | N/A | no comparison |
Marcus et al. (2019) | | 250ms | 100-8,000 | 100 | 10,000-80,000 | ✓ (10-66%) |
Marcus and Papaemmanouil (2018) | | 1.08s | N/A | N/A | 10,000 | ✓ (20%) |
Zhong et al. (2017) | sequence to | N/A | 80,654 | 300 | 24,196,200 | ✓ (similar) |
Guu et al. (2017) | language to program trans. | N/A | N/A | N/A | 13,000 | ✓ (56%) |
Liang et al. (2016) | | N/A | 3,098 | 200 | 619,600 | ✓ (3.4%) |
Huang et al. (2019) | | 1s | N/A | N/A | 1,000-10,000 | ✓ (similar) |
Kulkarni and Cavazos (2012) | | 13.2 days for all steps | | | | |
Addanki et al. (2019) | | N/A (seconds) | N/A | N/A | 1,600-94,000 | ✓ (3%) |
Paliwal et al. (2019) | | N/A (seconds) | N/A | N/A | 400,000 | ✓ (5%) |
Coons et al. (2008) | | N/A (minutes) | N/A | 200 | N/A (days) | |
Table 3: Evaluation results.

4 Formulating the RL environment

Table 2 lists all the works we reviewed and their problem formulation in the context of RL, i.e., the model, observations, actions and rewards definitions. Among the major challenges when formulating the problem in the RL environment is properly defining the translation layer between the system and the RL environment, i.e., the state space, the action space, and the rewards. The rewards are generally sparse and behave similarly for different actions, making the RL training ineffective due to bad gradients. The states are generally defined using hand engineered features that are believed to encode the state of the system. This results in a large state space in which some features are less helpful than others and which rarely captures the actual system state. Using model based RL can alleviate this bottleneck and provide more sample efficiency. Liu et al. (2017) used auto-encoders to help reduce the state dimensionality. The action space is also large but generally represents actions that are directly related to the objective. Another challenge is the environment step. Some tasks require a long time for the environment to perform one step, significantly slowing the learning process of the RL agent.

Interestingly, most works focus on using simple out of the box fully connected neural networks (FCNNs), while some works that targeted parsing and translation (Liang et al. (2016); Guu et al. (2017); Zhong et al. (2017)) used recurrent neural networks (RNNs) Graves et al. (2013) due to their ability to parse strings and natural language. While FCNNs are simple and easy to train to learn both linear and non-linear policy mappings, a more complicated network structure suited to the problem can sometimes further improve the results.

4.1 Evaluation Results

Table 3 lists the training and evaluation results of the reviewed works. We consider the time it takes to perform a step in the environment, the number of steps needed in each iteration of training, the number of training iterations, the total number of steps needed, and whether the prior work improves the state of the art and compares against a random search/bandit solution.

The total number of steps and the cost of each environment step are important for understanding the sample efficiency and practicality of the solution, especially when considering RL’s inherent sample inefficiency Schaal (1997); Hester et al. (2018). For different workloads the number of samples needed varied from thousands to millions. The environment step time also varied from milliseconds to minutes. In multiple cases the interaction with the environment is very slow. Note that in most cases when the environment step time was a few milliseconds, it was because it was a simulated environment, not a real one. We observe that for faster environment steps more training samples were gathered to leverage that and further improve the performance. This excludes Liu et al. (2017) where a cluster was used and thus more samples could be processed in parallel.

As listed in Table 3, many works did not provide sufficient data to reproduce the results. Reproducing the results is necessary to further improve the solution and enable future evaluation of the solution and comparison against it.

4.2 Frameworks and Toolkits

A few RL benchmark toolkits for developing and comparing reinforcement learning algorithms, and providing a faster simulated system environment have been recently proposed. OpenAI Gym Brockman et al. (2016) supports an environment for teaching agents everything from walking to playing games like Pong or Pinball. Iroko Ruffy et al. (2018) provides a data center emulator to understand the requirements and limitations of applying RL in data center networks. It interfaces with the OpenAI Gym and offers a way to evaluate centralized and decentralized RL algorithms against conventional traffic control solutions.

Park Mao et al. (2019) proposes an open platform for easier formulation of the RL environment for twelve real world system optimization problems with one common easy to use API. The platform provides a translation layer between the system and the RL environment making it easier for RL researchers to work on systems problems. With that being said, the framework lacks the ability to change the action, state and reward definitions, making it harder to improve the performance by easily modifying these definitions.

5 Metrics for Evaluating Deep RL in System Optimization

In this section, we present a set of essential metrics that make the case for applying deep RL to a system optimization problem. These metrics can help researchers evaluate the deep RL solution in system optimization problems.

Is the MDP Clearly Defined?

By formal definition from dynamical systems theory, the problem of RL is the optimal control of an incompletely known MDP. MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also future states and rewards. This involves delayed rewards and a trade-off between delayed and immediate reward. In MDPs, the new state and new reward depend only on the preceding state and action. Given a perfect model of the environment, the optimal policy of an MDP can be computed.

The MDP should be a straightforward formulation of the system problem as an agent learning from continually interacting with the system to achieve a particular goal, with the system responding to these interactions with a new state and reward. The agent’s goal is to maximize rewards over time.

Is It a Reinforcement Learning Problem?

What distinguishes RL from other machine learning approaches is the presence of self exploration and exploitation, and the trade-off between them. For example, RL is different from supervised learning. The latter learns from a training set with labels provided by a knowledgeable external supervisor. For each example the label is the correct action the system should take. The objective of this kind of learning is to act correctly in new situations not present in the training set. However, supervised learning is not suitable for learning from interaction, as it is often impractical to obtain examples representative of all the cases in which the agent has to act.

Are the Rewards Delayed?

RL algorithms do not maximize the immediate reward of taking actions but the reward over time. For example, an RL agent can take actions that give low intermediate rewards but an overall higher reward than taking a greedy action in every step. If the objective is to maximize the immediate reward or the actions are independent, then simpler approaches such as bandit and greedy algorithms will perform better than deep RL, as their objective is to maximize the immediate reward.
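A tiny deterministic example makes the role of delayed rewards concrete. The transition table below is invented for illustration (state and action names and reward values are assumptions): a greedy agent that always takes the largest immediate reward ends up with a lower total than a plan that accepts a zero reward first.

```python
# Assumed toy MDP: state -> action -> (immediate reward, next state or None).
TRANSITIONS = {
    "s0": {"grab": (1.0, "s1"), "wait": (0.0, "s2")},
    "s1": {"grab": (1.0, None), "wait": (0.0, None)},
    "s2": {"grab": (5.0, None), "wait": (0.0, None)},
}


def greedy_return(state="s0"):
    """Total reward when always taking the highest immediate reward."""
    total = 0.0
    while state is not None:
        options = TRANSITIONS[state]
        action = max(options, key=lambda a: options[a][0])
        reward, state = options[action]
        total += reward
    return total


def best_return(state="s0"):
    """Total reward of the best action sequence, found by exhaustive search."""
    if state is None:
        return 0.0
    return max(r + best_return(nxt) for r, nxt in TRANSITIONS[state].values())


greedy_total = greedy_return()   # grab, grab
optimal_total = best_return()    # wait, grab
```

If the two totals were equal for a given system problem, the rewards would not be meaningfully delayed and a greedy or bandit approach would be the simpler choice.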

What is Being Learned?

It is important to provide insights on what is being learned by the agent. For example, which actions are taken in which states and why? Can the knowledge learned be applied to new states/tasks? Is there a structure of the problem being learned? If a brute force solution is possible for simpler tasks, it will also be helpful to show how far the performance of the RL agent is from the brute force solution. In some cases, not all the hand engineered features are useful. This results in high variance and prolonged training. Feature analysis can help overcome this challenge. For example, in Coons et al. (2008) significant performance gaps were shown for different feature selection.

Does it Outperform Random Search and a Bandit Solution?

In some cases the RL solution is just another form of a smart random search. In many cases, the good RL results were achieved merely by luck. Random search might perform as well as RL, or even better, while being less complicated. For example, in Huang et al. (2019), the authors showed a 10% improvement over the baseline by using random search.

In some cases the actions are independent and a greedy or a bandit solution can achieve the optimal or near optimal solution. Using a bandit method is equivalent to a 1-step RL solution, and thus the objective is to maximize the immediate reward. Maximizing the immediate reward could also deliver the overall maximum reward, and thus a comparison against a bandit solution can reveal this and perhaps show that the problem is not an MDP.
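Both baselines discussed in this subsection are cheap to implement, which makes the comparison easy to include in an evaluation. The following is a minimal sketch, not the setup of any cited work: `evaluate`, `sample_config`, and `pull` are hypothetical callbacks that stand in for the system being optimized.

```python
import random


def random_search(evaluate, sample_config, budget, seed=0):
    """Evaluate `budget` randomly sampled configurations and keep the best."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(budget):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def epsilon_greedy_bandit(pull, n_arms, steps, epsilon=0.1, seed=0):
    """One-step baseline: estimate the mean immediate reward of each arm."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore
        else:
            arm = max(range(n_arms), key=lambda a: means[a])  # exploit
        reward = pull(arm, rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
    return means, counts
```

Giving the RL agent and these baselines the same budget of environment evaluations keeps the comparison fair.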

Are the Expert Actions Observable?

In some cases it might be possible to have access to expert actions, i.e., optimal actions. For example, if a brute force search is plausible and practical then it is possible to outperform deep RL by using it or using imitation learning Schaal (1999), which is a supervised learning approach that learns by imitating the expert actions.

Is It Possible to Reproduce/Generalize the Good Results?

The learning process in deep RL is stochastic, and thus good results are sometimes achieved due to a local maximum, a simple task, or luck. In Haarnoja et al. (2018), different results were generated by just changing the random seeds. In many cases the good results cannot be reproduced by retraining, training on new tasks, or generalizing to new tasks.
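One simple way to expose seed sensitivity is to repeat training with several random seeds and report the spread rather than a single best run. The sketch below assumes a hypothetical `train_and_evaluate(seed)` function that trains an agent with the given seed and returns a scalar score.

```python
import statistics


def summarize_across_seeds(train_and_evaluate, seeds):
    """Run one training per seed and summarize the resulting scores."""
    scores = [train_and_evaluate(seed) for seed in seeds]
    return {
        "mean": statistics.mean(scores),
        # The sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }
```

A large gap between the minimum and maximum scores suggests that a single reported result may not be reproducible.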

Does It Outperform State of the Art?

The most important metric in the context of system optimization in general is outperforming the state of the art. Improving the state of the art includes different objectives such as efficiency, performance, throughput, bandwidth, fault tolerance, security, utilization, reliability, robustness, complexity, and energy. If the proposed approach does not perform better than the state of the art in some metric, then it is hard to justify using it. Frequently, the state of the art solution is also more stable, practical, and reliable than deep RL. In many prior works listed in Table 3, a comparison against the state of the art is not available or deep RL performs worse. In some cases deep RL can perform as well as the state of the art or slightly worse, but still be a useful solution as it achieves an improvement in other metrics.

6 Policies and Network Models

Multiple policies and network models can be used. RL frameworks like RLlib Liang et al. (2017), Intel’s Coach Caspi et al. (2017), TensorForce Kuhnle et al. (2017), Facebook Horizon Gauci et al. (2018), and Google’s Dopamine Castro et al. (2018) can help the users pick the right RL model as they provide implementations of many policies and models for which a convenient interface is available.

As a rule of thumb, we rank RL algorithms based on sample efficiency as follows: model-based approaches (most efficient), temporal difference methods, PG methods, and evolutionary algorithms (least efficient). In general, many RL environments run in a simulator. For example, Paliwal et al. (2019); Mao et al. (2019, 2016) run in a simulator, as a step in the real environment would take minutes or hours, which significantly slows down the training. If this simulator is fast enough or training time is not constrained, then PG methods can perform well. If the simulator is not fast enough or training time is constrained, then temporal difference methods can do better than PG methods as they are more sample efficient.

If the environment is real, then temporal difference methods can do well if interaction with the environment is not slow, while model-based RL performs better if the environment is slow. Model-based methods require a model of the environment (that can be learned) and rely on planning rather than learning. Since planning is not done in the actual environment but as much faster steps in the model of that environment, it requires fewer samples from the real environment to learn. That said, model-free methods are often used as they are simpler to deploy and have the potential to generalize better from exploration in the real environment.

If good policies are easy to find and the space of policies is small enough or if time is not a bottleneck for the search, then evolutionary methods can be effective. Evolutionary methods also have advantages when the learning agent cannot observe the complete state of the environment. Bandit solutions are good if the problem can be viewed as a one step RL problem.
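The rules of thumb above can be summarized as a small decision helper. This is only one possible reading of the guidance, written as a sketch: the function name, its parameters, and the order in which the conditions are checked are assumptions made for illustration, and the output is a starting point rather than a recommendation for any specific system.

```python
def suggest_methods(one_step, simulated, interaction_is_fast,
                    time_constrained=True, partially_observable=False,
                    small_policy_space=False):
    """Return candidate method families following the rules of thumb."""
    if one_step:
        # A one-step problem can be treated as a bandit problem.
        return ["bandit"]
    candidates = []
    if simulated:
        # Fast simulator or no time limit: sample efficiency matters less.
        if interaction_is_fast or not time_constrained:
            candidates.append("policy gradient")
        else:
            candidates.append("temporal difference")
    else:
        # Real environment: prefer more sample-efficient methods when slow.
        if interaction_is_fast:
            candidates.append("temporal difference")
        else:
            candidates.append("model-based")
    if small_policy_space or not time_constrained or partially_observable:
        candidates.append("evolutionary")
    return candidates
```

For example, a slow real system with a limited training budget maps to model-based methods, while a fast simulator maps to policy gradient methods.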

PG methods in general are more stable than methods, such as Q-learning, that do not directly use a neural network to represent and derive the agent’s policy itself. The greedy nature of directly deriving the policy and moving the gradient in the direction of the objective makes PG methods more stable, easier to reason about, and more reliable.

7 Challenges

In this section, we discuss the primary challenges that face the application of deep RL in system optimization.

Interactions with Real Systems Can Be Slow. Generalizing from Faster Simulated Environments Can Be Restricted.

Unlike simulated environments, which can run fast, a real system may deliver the reward for an action only after a long delay. For example, when scheduling jobs on a cluster of nodes, some jobs might require hours to run, and thus improving their performance by monitoring job execution time will be very slow. To speed up this process, some works used simulators as cost models instead of the actual system. These simulators often do not fully capture the actual behavior of the real system, and thus the RL agent might not work well when used in practice after being trained in a simulated environment. More comprehensive environment models can ease generalization from simulated environments. RL methods that are more sample efficient will speed up the training in the real system environment.

Instability and High Variance. This is a common problem which leads to bad policies when tackling system problems with deep RL. Such policies can generate a large performance gap when trained multiple times and behave in an unpredictable manner. This is mainly due to poor formulation of the problem as an RL problem, limited observation of the state (i.e., the embedding and input features used are not sufficient or meaningful), and sparse or similar rewards. Sparse rewards can be due to a bad reward definition or the fact that some rewards cannot be computed directly and are known only at the end of the episode. For example, in Liang et al. (2019), where deep RL is used to optimize decision trees for packet classification, the reward (the performance of the tree) is known only after up to 15,000 steps, when the whole tree is built. In some cases using more robust and stable policies can help. For example, Q-learning is known to have good sample efficiency but unstable behavior. SARSA, double Q-learning Van Hasselt et al. (2016), and policy gradient methods, on the other hand, are more stable. Subtracting a baseline in PG can also help reduce variance Greensmith et al. (2004).
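The variance reduction obtained by subtracting a constant term from the reward in PG can be checked exactly on a toy problem. The sketch below assumes a hypothetical two-action policy with P(action = 1) = sigmoid(theta) and deterministic rewards. It enumerates both actions, so the mean and variance of the one-sample score-function gradient estimate are exact rather than sampled.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pg_estimator_stats(theta, rewards, baseline=0.0):
    """Exact mean and variance of the one-sample PG gradient estimate.

    The policy picks action 1 with probability sigmoid(theta), and
    `rewards[a]` is the deterministic reward of action a.
    """
    p1 = sigmoid(theta)
    probs = [1.0 - p1, p1]
    # d/dtheta log pi(a): -p1 for action 0 and (1 - p1) for action 1.
    scores = [-p1, 1.0 - p1]
    estimates = [(rewards[a] - baseline) * scores[a] for a in (0, 1)]
    mean = sum(p * g for p, g in zip(probs, estimates))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, estimates))
    return mean, var
```

With rewards of 10 and 11 and theta = 0, the expected gradient is the same with or without the subtracted term, while subtracting the average reward of 10.5 removes the variance of the estimate in this toy case.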

Lack of Reproducibility.

This is often the case with most recent system optimization works that rely on deep RL in their solutions. It becomes difficult to reproduce the results due to restricted access to the resources, code and workloads used, lack of a detailed list of the used network hyperparameters and lack of stable, predictable, and scalable behavior of the different RL algorithms. This challenge prevents future deployment, incremental improvements, and proper evaluation.

Defining Appropriate Rewards, Actions and States.

This is a major challenge since without proper definitions of states, actions, and rewards, the RL solution is not useful. In many cases, the rewards are sparse or similar, and the states are not fully observable: they have limited features that capture only a small portion of the system state. This results in unstable and inadequate policies. Generally, the action and state spaces are large, requiring a lot of samples to learn and resulting in instability and large variance in the learned network. Therefore, retraining often fails to generate the same results.

Lack of Generalization.

The lack of generalization is an issue that deep RL solutions often suffer from. This might be beneficial when learning a particular structure. For example, in NeuroCuts Liang et al. (2019), the target is to build the best decision tree for a fixed set of predefined rules, and thus the objective of the RL agent is to find the optimal fit for these rules. However, lack of generalization sometimes results in a solution that works for a particular workload or setting but does not deliver good overall system performance. This problem manifests when generalization is important and the RL agent has to deal with new states that it did not visit in the past. For example, in Paliwal et al. (2019); Addanki et al. (2019), where the RL agent has to learn good resource placements for different computation graphs, the authors avoided the possibility of learning only good placements for particular computation graphs by training and testing on a wide range of graphs.

Lack of Standardized Benchmarks, Frameworks and Evaluation Metrics.

The lack of standardized benchmarks, frameworks and evaluation metrics makes it very difficult to evaluate the effectiveness of the deep RL methods in the context of system optimization. In some cases, the promising results were achieved by mere luck. Thus, it is crucial to have proper standardized frameworks and evaluation metrics that define success. Moreover, we need benchmarks that enable proper training, evaluation of the results, measuring the generalization of the solution to new problems and performing valid comparisons against baseline approaches.

8 An Illustrative Example

To put all the metrics and challenges together, we discuss DeepRM Mao et al. (2016) as an illustrative example. In DeepRM, the targeted system problem is resource allocation in the cloud. The objective is the job slowdown, i.e., the goal is to minimize the wait time for all jobs. DeepRM uses PG in conjunction with a simulated environment rather than a real cloud environment. This significantly improves the step time but can result in restricted generalization when used in the real environment. Furthermore, since all the simulation parameters are known, a fully observable state of the simulated environment can be captured. The actions are defined as which job should be scheduled next. The state is defined as the current allocation of cluster resources, as well as the resource profiles of jobs waiting to be scheduled. The reward is defined as the sum of job slowdowns: $\sum_{j \in \mathcal{J}} \frac{-1}{T_j}$, where $\mathcal{J}$ is the set of jobs currently in the system and $T_j$ is the pure execution time of job $j$ without considering the wait time. This reward basically gives a penalty of $-1$ for jobs that are waiting to be scheduled. The penalty is divided by $T_j$ to give a higher priority to shorter jobs.
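To make the slowdown-based reward concrete, the sketch below assumes that the per-step reward sums -1/T_j over the jobs currently in the system, where T_j is the pure execution time of job j. The function names and the job representation are hypothetical and are not taken from the DeepRM code.

```python
def slowdown_reward(durations_in_system):
    """Per-step reward: each job j still in the system contributes -1 / T_j."""
    return sum(-1.0 / t for t in durations_in_system)


def episode_return(jobs):
    """Undiscounted return for jobs given as (T_j, C_j) pairs.

    T_j is the pure execution time and C_j is the number of steps the job
    spends in the system (waiting plus running). Job j is penalized -1 / T_j
    on each of its C_j steps, so the return equals minus the sum of the
    slowdowns C_j / T_j.
    """
    horizon = max(c for _, c in jobs)
    total = 0.0
    for step in range(horizon):
        present = [t for t, c in jobs if step < c]
        total += slowdown_reward(present)
    return total
```

For two jobs with (T, C) = (2, 4) and (1, 1), the slowdowns are 2 and 1, and the return is -3. Because each step costs -1/T_j, delaying a short job is penalized more heavily per step than delaying a long one.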

The state, actions, and reward clearly define an MDP and a reinforcement learning problem. Specifically, the agent interacts with the system by making sequential allocations, observing the state of the current allocation of resources, and receiving delayed long-term rewards as overall slowdowns of jobs. The rewards are delayed because the agent cannot know the effect of the current allocation action on the overall slowdown of all the jobs at any particular time step. Thus, the agent would have to wait until all the other jobs get allocated. The agent then learns which jobs to allocate in the current time step to minimize the average job slowdown, given the current resource allocation in the cloud. Note that DeepRM also learns to withhold larger jobs to make room for smaller jobs to reduce the overall average job slowdown. DeepRM is shown to outperform random search.

The expert actions are not observable in this problem as there are no methods to find the optimal allocation decision at any particular time step. During training in DeepRM, multiple examples of job arrival sequences were considered to see if the policy generalizes and makes reliable decisions (the results provided were in the simulated system, not in the real system). DeepRM is also shown to outperform the state-of-the-art by %.

Clearly, in the case of DeepRM, most of the challenges mentioned in Section 7 are manifested. The interaction with the real cloud environment is slow, and thus the authors opted for a simulated environment. This has the advantage of speeding up the training but often could result in a policy that does not generalize to the real environment. Unfortunately, generalization tests in the real environment were not provided. The instability and high variance were addressed by subtracting a baseline in the PG equation. The baseline was defined as the average of job slowdowns taken at a single time step across all episodes. The implementation of DeepRM was open sourced, allowing others to reproduce the results. The rewards, actions, and states defined allowed the agent to learn a policy that performed well in the simulated environment. Note that defining the state of the system was easier because the environment was simulated. The solution also considered multiple reward definitions. For example, $-|\mathcal{J}|$, where $|\mathcal{J}|$ is the number of unfinished jobs in the system. This reward definition optimizes the average job completion time. The jobs evaluated in DeepRM were considered to arrive online according to a Bernoulli process. In addition, the jobs were chosen randomly and it is unclear whether they represent real workload scenarios or not. This emphasizes the necessity for standardized benchmarks and frameworks to evaluate the effectiveness of the deep RL methods in scheduling jobs in the cloud.
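The per-time-step averaging described above can be sketched as follows. This is an illustration under stated assumptions, not the DeepRM implementation: `returns_per_episode` is a hypothetical list holding, for each episode, the value observed at every time step, and episodes may have different lengths.

```python
def timestep_baseline(returns_per_episode):
    """Average value at each time step across the episodes that reach it."""
    horizon = max(len(episode) for episode in returns_per_episode)
    baseline = []
    for t in range(horizon):
        values = [ep[t] for ep in returns_per_episode if t < len(ep)]
        baseline.append(sum(values) / len(values))
    return baseline


def subtract_baseline(returns_per_episode):
    """Subtract the per-time-step average from every episode's values."""
    baseline = timestep_baseline(returns_per_episode)
    return [
        [value - baseline[t] for t, value in enumerate(episode)]
        for episode in returns_per_episode
    ]
```

After the subtraction, an action is reinforced only to the extent that it did better than the average at the same time step, which is what reduces the variance of the gradient estimate.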

9 Conclusion

In this work, we reviewed recent trends in applying deep reinforcement learning in system optimization, discussed the associated challenges, and proposed a set of metrics that can help evaluate the use of deep reinforcement learning in a system optimization problem.

The recent trends of deep RL in system optimization were mainly in packet classification, congestion control, compiler optimization, scheduling, query optimization and cloud computing. Looking forward, we anticipate deep RL in system optimization will grow. We also believe there is much more room for improvement in both the deep reinforcement learning algorithms and the system problems that could be tackled with it.


  • R. Addanki, S. B. Venkatakrishnan, S. Gupta, H. Mao, and M. Alizadeh (2019) Placeto: learning generalizable device placement algorithms for distributed machine learning. arXiv preprint arXiv:1906.08879. Cited by: §1, Table 1, §3.2, Table 2, Table 3, §7.
  • E. Andreasson, F. Hoffmann, and O. Lindholm (2002) To collect or not to collect? machine learning for memory management.. In Java Virtual Machine Research and Technology Symposium, pp. 27–39. Cited by: §2.1.
  • H. Arabnejad, C. Pahl, P. Jamshidi, and G. Estrada (2017) A comparison of reinforcement learning techniques for fuzzy cloud auto-scaling. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 64–73. Cited by: §1, §2.1, Table 1, §3.3.
  • A. H. Ashouri, W. Killian, J. Cavazos, G. Palermo, and C. Silvano (2018) A survey on compiler autotuning using machine learning. ACM Computing Surveys (CSUR) 51 (5), pp. 96. Cited by: §1.
  • P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine learning 47 (2-3), pp. 235–256. Cited by: §2.
  • P. Auer (2002) Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3 (Nov), pp. 397–422. Cited by: §2.
  • E. Barrett, E. Howley, and J. Duggan (2013) Applying reinforcement learning towards automating resource allocation and application scalability in the cloud. Concurrency and Computation: Practice and Experience 25 (12), pp. 1656–1674. Cited by: §2.1, §3.3.
  • R. Bellman (1957) A markovian decision process. In Journal of Mathematics and Mechanics, pp. 679–684. Cited by: §1, §2.
  • D. A. Berry and B. Fristedt (1985) Bandit problems: sequential allocation of experiments (monographs on statistics and applied probability). London: Chapman and Hall 5, pp. 71–87. Cited by: §2.
  • J. A. Boyan and M. L. Littman (1994) Packet routing in dynamically changing networks: a reinforcement learning approach. In Advances in neural information processing systems, pp. 671–678. Cited by: §2.1, §3.3.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: 1606.01540 Cited by: §4.2.
  • I. Caspi, G. Leibovich, G. Novik, and S. Endrawis (2017) Reinforcement learning coach. External Links: Document, Link Cited by: §6.
  • P. S. Castro, S. Moitra, C. Gelada, S. Kumar, and M. G. Bellemare (2018) Dopamine: A Research Framework for Deep Reinforcement Learning. External Links: Link Cited by: §6.
  • S. P. Choi and D. Yeung (1996) Predictive q-routing: a memory-based reinforcement learning approach to adaptive traffic control. In Advances in Neural Information Processing Systems, pp. 945–951. Cited by: §2.1, §3.3.
  • W. Chu, L. Li, L. Reyzin, and R. Schapire (2011) Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214. Cited by: §2.
  • K. E. Coons, B. Robatmili, M. E. Taylor, B. A. Maher, D. Burger, and K. S. McKinley (2008) Feature selection and policy optimization for distributed instruction placement using reinforcement learning. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques, pp. 32–42. Cited by: §1, Table 1, Table 2, Table 3, §5.
  • A. Das, R. A. Shafik, G. V. Merrett, B. M. Al-Hashimi, A. Kumar, and B. Veeravalli (2014) Reinforcement learning-based inter-and intra-application thermal optimization for lifetime improvement of multicore systems. In Proceedings of the 51st Annual Design Automation Conference, pp. 1–6. Cited by: §2.1.
  • N. Diegues and P. Romano (2014) Self-tuning intel transactional synchronization extensions. In 11th International Conference on Autonomic Computing (ICAC 14), pp. 209–219. Cited by: §2.1.
  • K. Doya (2000) Reinforcement learning in continuous time and space. Neural computation 12 (1), pp. 219–245. Cited by: §1.
  • F. Farahnakian, P. Liljeberg, and J. Plosila (2014) Energy-efficient virtual machines consolidation in cloud data centers using reinforcement learning. In 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, pp. 500–507. Cited by: §2.1, §3.3.
  • J. Gauci, E. Conti, Y. Liang, K. Virochsiri, Y. He, Z. Kaden, V. Narayanan, and X. Ye (2018) Horizon: facebook’s open source applied reinforcement learning platform. arXiv preprint arXiv:1811.00260. Cited by: §6.
  • A. Graves, A. Mohamed, and G. Hinton (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645–6649. Cited by: §4.
  • E. Greensmith, P. L. Bartlett, and J. Baxter (2004) Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research 5 (Nov), pp. 1471–1530. Cited by: §7.
  • K. Guu, P. Pasupat, E. Z. Liu, and P. Liang (2017) From language to programs: bridging reinforcement learning and maximum marginal likelihood. arXiv preprint arXiv:1704.07926. Cited by: §1, Table 1, §3.2, Table 2, Table 3, §4.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §5.
  • A. Hameed, A. Khoshkbarforoushha, R. Ranjan, P. P. Jayaraman, J. Kolodziej, P. Balaji, S. Zeadally, Q. M. Malluhi, N. Tziritas, A. Vishnu, et al. (2016) A survey and taxonomy on energy efficient resource allocation techniques for cloud computing systems. Computing 98 (7), pp. 751–774. Cited by: §1.
  • Y. He, F. R. Yu, N. Zhao, V. C. Leung, and H. Yin (2017a) Software-defined networks with mobile edge computing and caching for smart cities: a big data deep reinforcement learning approach. IEEE Communications Magazine 55 (12), pp. 31–37. Cited by: §1, Table 1, §3.1, §3.1, Table 2, Table 3.
  • Y. He, N. Zhao, and H. Yin (2017b) Integrated networking, caching, and computing for connected vehicles: a deep reinforcement learning approach. IEEE Transactions on Vehicular Technology 67 (1), pp. 44–55. Cited by: §1, Table 1, §3.1, §3.1, Table 2, Table 3.
  • T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, et al. (2018) Deep q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §4.1.
  • Q. Huang, A. Haj-Ali, W. Moses, J. Xiang, I. Stoica, K. Asanovic, and J. Wawrzynek (2019) AutoPhase: compiler phase-ordering for hls with deep reinforcement learning. In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 308–308. Cited by: §1, Table 1, §3.2, Table 2, Table 3, §5.
  • E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana (2008) Self-optimizing memory controllers: a reinforcement learning approach. In ACM SIGARCH Computer Architecture News, Vol. 36, pp. 39–50. Cited by: §2.1.
  • P. Jamshidi, A. M. Sharifloo, C. Pahl, A. Metzger, and G. Estrada (2015) Self-learning cloud controllers: fuzzy q-learning for knowledge evolution. In 2015 International Conference on Cloud and Autonomic Computing, pp. 208–211. Cited by: §2.1, §3.3.
  • N. Jay, N. Rotman, B. Godfrey, M. Schapira, and A. Tamar (2019) A deep reinforcement learning perspective on internet congestion control. In International Conference on Machine Learning, pp. 3050–3059. Cited by: §1, Table 1, §3.2, Table 2, Table 3.
  • C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan (2018) Is q-learning provably efficient?. In Advances in Neural Information Processing Systems, pp. 4863–4873. Cited by: §3.1.
  • L. P. Kaelbling, M. L. Littman, and A. W. Moore (1996) Reinforcement learning: a survey. Vol. 4, pp. 237–285. Cited by: §1, §2.
  • J. Kober, J. A. Bagnell, and J. Peters (2013) Reinforcement learning in robotics: a survey. The International Journal of Robotics Research 32 (11), pp. 1238–1274. Cited by: §1.
  • K. Krauter, R. Buyya, and M. Maheswaran (2002) A taxonomy and survey of grid resource management systems for distributed computing. Software: Practice and Experience 32 (2), pp. 135–164. Cited by: §1.
  • S. Krishnan, Z. Yang, K. Goldberg, J. Hellerstein, and I. Stoica (2018) Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196. Cited by: §1, Table 1, §3.2, Table 2, Table 3.
  • A. Kuhnle, M. Schaarschmidt, and K. Fricke (2017) Tensorforce: a tensorflow library for applied reinforcement learning. Note: Web page External Links: Link Cited by: §6.
  • S. Kulkarni and J. Cavazos (2012) Mitigating the compiler optimization phase-ordering problem using machine learning. In ACM SIGPLAN Notices, Vol. 47, pp. 147–162. Cited by: §1, Table 1, §3.2, Table 2, Table 3.
  • M. G. Lagoudakis and M. L. Littman (2000) Algorithm selection using reinforcement learning.. In ICML, pp. 511–518. Cited by: §2.1.
  • W. Li, F. Zhou, W. Meleis, and K. Chowdhury (2016) Learning-based and data-driven tcp design for memory-constrained iot. In 2016 International Conference on Distributed Computing in Sensor Systems (DCOSS), pp. 199–205. Cited by: §2.1.
  • C. Liang, J. Berant, Q. Le, K. D. Forbus, and N. Lao (2016) Neural symbolic machines: learning semantic parsers on freebase with weak supervision. arXiv preprint arXiv:1611.00020. Cited by: §1, Table 1, §3.2, Table 2, Table 3, §4.
  • E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, J. Gonzalez, K. Goldberg, and I. Stoica (2017) Ray rllib: a composable and scalable reinforcement learning library. arXiv preprint arXiv:1712.09381. Cited by: §6.
  • E. Liang, H. Zhu, X. Jin, and I. Stoica (2019) Neural packet classification. arXiv preprint arXiv:1902.10319. Cited by: §1, Table 1, §3.2, Table 2, Table 3, §7, §7.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §2, §2.
  • L. Lin (1993) Reinforcement learning for robots using neural networks. Cited by: §2.1.
  • M. Littman and J. Boyan (2013) A distributed reinforcement learning scheme for network routing. In Proceedings of the international workshop on applications of neural networks to telecommunications, pp. 55–61. Cited by: §2.1, §3.3.
  • N. Liu, Z. Li, J. Xu, Z. Xu, S. Lin, Q. Qiu, J. Tang, and Y. Wang (2017) A hierarchical framework of cloud resource allocation and power management using deep reinforcement learning. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp. 372–382. Cited by: §1, Table 1, §3.1, §3.1, Table 2, Table 3, §4.1, §4.
  • N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y. Liang, and D. I. Kim (2019) Applications of deep reinforcement learning in communications and networking: a survey. IEEE Communications Surveys & Tutorials. Cited by: §1.
  • M. S. Mahdavinejad, M. Rezvan, M. Barekatain, P. Adibi, P. Barnaghi, and A. P. Sheth (2018) Machine learning for internet of things data analysis: a survey. Digital Communications and Networks 4 (3), pp. 161–175. Cited by: §1.
  • H. Mao, M. Alizadeh, I. Menache, and S. Kandula (2016) Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, pp. 50–56. Cited by: §1, Table 1, §3.1, §3.1, §3.3, Table 2, Table 3, §6, §8.
  • H. Mao, P. Negi, A. Narayan, H. Wang, J. Yang, H. Wang, R. Marcus, R. Addanki, M. Khani, S. He, et al. (2019) Park: an open platform for learning augmented computer systems. Cited by: §4.2, §6.
  • R. Marcus, P. Negi, H. Mao, C. Zhang, M. Alizadeh, T. Kraska, O. Papaemmanouil, and N. Tatbul (2019) Neo: a learned query optimizer. arXiv preprint arXiv:1904.03711. Cited by: §1, Table 1, §3.2, Table 2, Table 3.
  • R. Marcus and O. Papaemmanouil (2018) Deep reinforcement learning for join order enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, pp. 3. Cited by: §1, Table 1, §3.2, Table 2, Table 3.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1.
  • S. Mostafavi, F. Ahmadi, and M. A. Sarram (2018) Reinforcement-learning-based foresighted task scheduling in cloud computing. arXiv preprint arXiv:1810.04718. Cited by: §2.1.
  • J. Ortiz, M. Balazinska, J. Gehrke, and S. S. Keerthi (2018) Learning state representations for query optimization with deep reinforcement learning. arXiv preprint arXiv:1803.08604. Cited by: §1, Table 1, §3.2, Table 2, Table 3.
  • A. Paliwal, F. Gimeno, V. Nair, Y. Li, M. Lubin, P. Kohli, and O. Vinyals (2019) REGAL: transfer learning for fast optimization of computation graphs. arXiv preprint arXiv:1905.02494. Cited by: §1, Table 1, §3.2, Table 2, Table 3, §6, §7.
  • L. Peled, S. Mannor, U. Weiser, and Y. Etsion (2015) Semantic locality and context-based prefetching using reinforcement learning. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), pp. 285–297. Cited by: §2.1.
  • Z. Peng, D. Cui, J. Zuo, Q. Li, B. Xu, and W. Lin (2015) Random task scheduling scheme based on reinforcement learning in cloud computing. Cluster computing 18 (4), pp. 1595–1607. Cited by: §2.1, §3.3.
  • J. Peters, S. Vijayakumar, and S. Schaal (2003) Reinforcement learning for humanoid robotics. In Proceedings of the third IEEE-RAS international conference on humanoid robots, pp. 1–20. Cited by: §1.
  • J. Rao, X. Bu, C. Xu, L. Wang, and G. Yin (2009) VCONF: a reinforcement learning approach to virtual machines auto-configuration. In Proceedings of the 6th international conference on Autonomic computing, pp. 137–146. Cited by: §1, Table 1, §3.1, §3.1, Table 2, Table 3.
  • S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. Cited by: §2.
  • F. Ruffy, M. Przystupa, and I. Beschastnikh (2018) Iroko: a framework to prototype reinforcement learning for data center traffic control. arXiv preprint arXiv:1812.09975. Cited by: §1, Table 1, §3.2, Table 2, Table 3, §4.2.
  • G. A. Rummery and M. Niranjan (1994) On-line q-learning using connectionist systems. Vol. 37, University of Cambridge, Department of Engineering Cambridge, England. Cited by: §2.
  • A. Sadeghi, F. Sheikholeslami, and G. B. Giannakis (2017) Optimal and scalable caching for 5g using reinforcement learning of space-time popularities. IEEE Journal of Selected Topics in Signal Processing 12 (1), pp. 180–190. Cited by: §2.1, §3.3.
  • S. Schaal (1997) Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046. Cited by: §4.1.
  • S. Schaal (1999) Is imitation learning the route to humanoid robots?. Trends in cognitive sciences 3 (6), pp. 233–242. Cited by: §5.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §2, §2.
  • A. P. Silva, K. Obraczka, S. Burleigh, and C. M. Hirata (2016) Smart congestion control for delay-and disruption tolerant networks. In 2016 13th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), pp. 1–9. Cited by: §2.1.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of go with deep neural networks and tree search. nature 529 (7587), pp. 484. Cited by: §1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §1, §2, §2, §3.3.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §2, §2.
  • G. Tesauro, R. Das, and N. K. Jong (2006) Online performance management using hybrid reinforcement learning. Proceedings of SysML. Cited by: §1, Table 1, §3.1, §3.1, Table 2, Table 3.
  • H. Van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence, Cited by: §7.
  • Z. Wang and M. O’Boyle (2018) Machine learning in compiler optimization. Proceedings of the IEEE 106 (11), pp. 1879–1901. Cited by: §1.
  • C. J. Watkins and P. Dayan (1992) Q-learning. Machine learning 8 (3-4), pp. 279–292. Cited by: §2.
  • C. Xu, J. Rao, and X. Bu (2012) URL: a unified reinforcement learning approach for autonomic cloud management. Journal of Parallel and Distributed Computing 72 (2), pp. 95–105. Cited by: §1, Table 1, §3.1, §3.1, Table 2, Table 3.
  • Z. Xu, Y. Wang, J. Tang, J. Wang, and M. C. Gursoy (2017) A deep reinforcement learning based framework for power-efficient resource allocation in cloud rans. In 2017 IEEE International Conference on Communications (ICC), pp. 1–6. Cited by: §1, Table 1, §3.1, §3.1, Table 2, Table 3.
  • J. Zeppenfeld, A. Bouajila, W. Stechele, and A. Herkersdorf (2008) Learning classifier tables for autonomic systems on chip. GI Jahrestagung (2) 134, pp. 771–778. Cited by: §2.1.
  • V. Zhong, C. Xiong, and R. Socher (2017) Seq2sql: generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103. Cited by: §1, Table 1, §3.2, Table 2, Table 3, §4.
  • Q. Zhu and C. Yuan (2007) A reinforcement learning approach to automatic error recovery. In 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07), pp. 729–738. Cited by: §2.1.