1 Introduction
Optimal sequential decision making is a fundamental problem in many diverse fields. In recent years, the Reinforcement Learning (RL) approach has experienced unprecedented success, reaching human-level performance in several domains, including Atari video games (Mnih et al., 2015) and the ancient game of Go (Silver et al., 2016). This success has been largely enabled by the use of advanced function approximation techniques in combination with large-scale data generation from self-play games. Current RL methods, however, still require enormous amounts of data to learn, especially in tasks characterized by delayed, sparse rewards, where more efficient ways of exploring the problem state space are needed.
Safe online exploration can be incentivized by adding a reward bonus. This is known under different names: reward shaping (Ng et al., 1999), optimism in the face of uncertainty (Kearns & Singh, 2002), intrinsic motivation (Chentanez et al., 2005), curiosity-driven RL (Still & Precup, 2012), prediction gain (Bellemare et al., 2016), or entropy-regularized MDPs (Neu et al., 2017). Alternative approaches introduce correlated noise directly in the parameter space of the policy or value function (Plappert et al., 2018; Fortunato et al., 2018; Osband et al., 2016; Liu et al., 2017).
While these approaches offer significant improvements over classical exploration techniques such as $\epsilon$-greedy or Boltzmann exploration, none of them makes explicit use of the representation of the state, which is treated as a black box. The traditional way to exploit state structure in RL is through the framework of factored MDPs. Factored MDPs compactly represent the transition probabilities of the MDP as products of factors that each involve a small subset of the state variables, which in some cases allows an exponential reduction in sample complexity
(Boutilier et al., 2000; Kearns & Koller, 1999). However, the factorization structure is in general not inherited by the value function, and is not usually exploited to guide exploration. In contrast to the RL approach, in which the agent learns a policy by interacting with the environment, the planning approach to decision making assumes known models for the agent's goals and domain dynamics, and focuses on determining how the agent should behave to achieve its objectives (Kolobov, 2012). Current planners are able to solve problem instances involving huge state spaces precisely by exploiting the problem structure defined in the state-action model (Geffner & Bonet, 2013).
A family of planners, known as width-based planners, was introduced by Lipovetzky & Geffner (2012) and became the state of the art for solving planning benchmarks. While originally proposed for classical planning problems, i.e., deterministic, goal-driven problems with a fully defined model, width-based planners have since evolved closer to the RL setting. Recently, these planners have been applied to Atari games using pixel features, reaching results comparable to learning methods in almost real time (Bandres et al., 2018).
In this paper, we further explore this family of methods originated in the planning community. In particular, we consider width-based planning in combination with a learned policy, showing that this combination has benefits both in terms of exploration and feature learning. Our approach is to train a policy in the form of a neural network, and to use the compact state representation learned by the policy to guide the width-based planning algorithm. We show that the lookahead power of the resulting algorithm is comparable to or better than previous width-based implementations that use static features. Our approach is based on the same principle as the recently proposed AlphaZero (Silver et al., 2017a), which also interleaves learning with planning, but uses a Monte-Carlo tree search planner instead.

2 Background
In this section, we review the fundamental concepts of reinforcement learning and width-based planning.
2.1 Reinforcement Learning and MDPs
We consider sequential decision problems modelled as Markov decision processes (MDPs). An MDP is a tuple $M = \langle S, A, P, r \rangle$, where $S$ is the finite state space, $A$ is the finite action space, $P : S \times A \to \Delta(S)$ is the transition function, and $r : S \times A \to \mathbb{R}$ is the reward function. Here, $\Delta(S)$ is the probability simplex over $S$. In each timestep $t$, the learner observes state $s_t$, selects action $a_t$, moves to the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and obtains reward $r_t$ such that $\mathbb{E}[r_t] = r(s_t, a_t)$. We refer to the tuple $(s_t, a_t, r_t, s_{t+1})$ as a transition. The aim of the learner is to select actions to maximize the expected cumulative discounted reward $\mathbb{E}\left[\sum_t \gamma^t r_t\right]$, where $\gamma \in [0, 1)$ is a discount factor. We assume that the transition function and reward function are unknown to the learner.
The decision strategy of the learner is represented by a policy $\pi : S \to \Delta(A)$, i.e., a mapping from states to distributions over actions. A policy $\pi$ induces a value function $V^\pi$ such that for each state $s$, $V^\pi(s)$ is the expected cumulative reward when starting in state $s$ and using policy $\pi$ to select actions. The optimal value function $V^*$ achieves the maximum value in each state $s$, i.e., $V^*(s) = \max_\pi V^\pi(s)$, and the optimal policy $\pi^*$ is the policy that attains this maximum in each state $s$, i.e., $\pi^* \in \arg\max_\pi V^\pi(s)$. Typically, an estimate $\hat\pi$ of the optimal policy and/or an estimate $\hat{V}$ of the optimal value function are maintained, parameterized on vectors $\theta$ and $w$, respectively.

2.2 Width-based Planning
Iterated Width (IW) is a pure exploration algorithm originally developed for goal-directed planning problems with deterministic actions (Lipovetzky & Geffner, 2012). It requires the state space to be factored into a set of features or atoms $\Phi$. We assume that each feature has the same domain $D$, e.g., binary ($D = \{0, 1\}$) or real-valued ($D = \mathbb{R}$). The algorithm consists of a sequence of calls IW($k$) for $k = 1, 2, \ldots$ until a termination condition is reached. IW($k$) performs a standard breadth-first search from a given initial state $s_0$, but prunes states that are not novel. When a new state $s$ is generated, IW($k$) considers all tuples of atoms of $s$ of size at most $k$. The state is considered novel if at least one tuple has not appeared in the search before; otherwise it is pruned.
IW($k$) is thus a blind search algorithm that traverses the entire state space for sufficiently large $k$. The traversal is solely determined by how the state is structured, i.e., by the features used to represent the state. Each iteration IW($k$) is a $k$-width search that is complete for problems whose width is bounded by $k$ and has complexity $O(n^k)$, where $n$ is the number of problem variables or features (Lipovetzky & Geffner, 2012). Interestingly, most planning benchmarks turn out to have very small width and, in practice, can be solved in linear or quadratic time.
We illustrate IW(1) using a small example that involves three binary features (i.e., $D = \{0, 1\}$) and four actions. Figure 1 shows an example run of IW(1), starting from an initial state with some feature vector. Assume that breadth-first search expands the states at depth 1 in left-to-right order. The first state generates two new feature values: the first and third features take values not seen before, and therefore the state is not pruned. The second state does not generate any new feature values, and is thus pruned. The third state assigns a new value to the second feature for the first time, while the fourth state is also pruned. The algorithm continues expanding the nodes that have not been pruned in a breadth-first manner until all nodes are pruned or the goal state is reached.
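The IW(1) novelty test at the heart of this example can be sketched in a few lines (a minimal illustration of the idea, not the authors' implementation; the concrete feature vectors below are hypothetical values consistent with the example):

```python
def iw1_is_novel(features, seen):
    """IW(1) novelty test: a state is novel if at least one atom,
    i.e. a (feature index, value) pair, has not appeared before.
    Novel states register their atoms in `seen`; others are pruned."""
    atoms = set(enumerate(features))
    new_atoms = atoms - seen
    if not new_atoms:
        return False  # no new atom: the state is pruned
    seen |= new_atoms
    return True

seen = set()
iw1_is_novel((0, 1, 0), seen)  # initial state: all three atoms are new
iw1_is_novel((1, 1, 1), seen)  # new values for first/third feature: kept
iw1_is_novel((1, 1, 1), seen)  # nothing new: pruned
iw1_is_novel((0, 0, 0), seen)  # second feature takes a new value: kept
```

For IW($k$) with $k > 1$ the same test would be applied to all size-$k$ combinations of atoms instead of single atoms.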
3 Related work
In the following subsections, we review relevant literature from three different viewpoints: possible extensions to the original IW algorithm, efficient exploration in RL, and recently proposed methods that combine planning and RL.
3.1 Width-based Planning for MDPs
There are two follow-ups to the original IW algorithm of special relevance to this work. First, Lipovetzky et al. (2015) extended the original algorithm to MDPs by associating with each state $s$ a reward $R(s)$ during search, equal to the discounted reward accumulated on the path from the root $s_0$ to $s$, i.e., $R(s) = \sum_{i=0}^{d-1} \gamma^i r_i$, where $d$ is the depth of $s$ in the search tree. The discount factor $\gamma$ has the effect of favoring earlier rewards. With this extension, after the search completes, the first action on the path from $s_0$ to the highest-reward state is applied, similarly to model predictive control. This version of IW(1) achieves competitive performance in the Atari suite (Bellemare et al., 2013), using as features the 128 bytes of the RAM memory representing the current game configuration.
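The reward associated with each state is then just a discounted sum along its path; a minimal sketch (function name ours):

```python
def path_reward(path_rewards, gamma=0.99):
    """Discounted reward accumulated from the root to a node at depth d:
    R(s) = sum_{i=0}^{d-1} gamma^i * r_i, so earlier rewards weigh more."""
    return sum(gamma ** i * r for i, r in enumerate(path_rewards))

path_reward([0.0, 0.0, 1.0], gamma=0.5)  # reward found deeper is discounted
```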
Bandres et al. (2018) further modified the algorithm to handle visual features. Specifically, their algorithm uses the (binary) B-PROST features extracted from the images of Atari games (Liang et al., 2016). Since storing and retrieving Atari states during breadth-first search is costly, the authors introduced a rollout version of IW(1) that emulates the breadth-first traversal of the original algorithm by keeping track of the minimum depth at which each feature is found for the first time, extending the notion of novelty accordingly. The pruned states are kept as leaves of the tree, and are considered as candidates for the state with highest reward.

The above contributions brought the original formulation of IW closer to the RL setting. The rollout version of IW only requires a simulator (a successor function from a given state) and a structured representation of the state in terms of atoms or features. However, two important challenges remain. First, IW is used in an open-loop way and does not produce a compact policy that can be used in a reactive, closed-loop environment. Second, width-based algorithms use a fixed set of features that needs to be defined in advance. While competitive performance has been achieved using pixels as features in the Atari domain, interesting states may still require a large width to be reached, which can be infeasible. These two challenges are the main motivation for our work.
3.2 Exploration in Reinforcement Learning
As mentioned in the introduction, there are several alternative approaches to efficient exploration in RL. Most of these approaches are based on the idea of adding an explicit bonus to the reward function. The intuition is that by adding a bonus to states that have not been frequently visited during search, the likelihood of visiting unexplored parts of the state space increases, potentially leading to higher rewards. Even though this scheme does not preserve the Markov property (since the reward now typically depends on the number of times a state has been visited), the result is often that the learner ends up exploring larger sections of the state space. This follows the well-known principle of optimism in the face of uncertainty (Kearns & Singh, 2002).
The UCT algorithm for Monte-Carlo tree search (Kocsis & Szepesvári, 2006), and its precursor UCB for multi-armed bandits (Auer et al., 2002), are examples of algorithms that assign a reward bonus to states which is inversely proportional to the number of times that a state has been visited during search. Since the search tree has finite size, it is feasible to count the number of times each state is visited.
When the state space is very large, maintaining an explicit visitation count becomes infeasible. Bellemare et al. (2016) address this problem by introducing pseudo-counts based on the prediction gain of states. The authors show that pseudo-counts are good approximations of how often states are visited, and define a reward bonus inversely proportional to the pseudo-count. Martin et al. (2017) further extend the idea of pseudo-counts from states to feature vectors. The idea is to decompose the computation such that the pseudo-count of a state $s$ is composed of the pseudo-counts of each feature of $s$. This significantly simplifies the computation and reduces the algorithmic complexity.
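A generic count-based bonus in this spirit can be sketched as follows (the square-root form and the scale `beta` are illustrative choices, not the exact bonus of the cited works):

```python
import math
from collections import Counter

def shaped_reward(reward, state, counts, beta=0.1):
    """Augment the environment reward with an exploration bonus that
    decays with the (pseudo-)count of the visited state."""
    counts[state] += 1
    return reward + beta / math.sqrt(counts[state])

counts = Counter()
shaped_reward(0.0, "s0", counts)  # first visit: full bonus beta
shaped_reward(0.0, "s0", counts)  # bonus shrinks on revisits
```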
Compared to the above approaches, IW(1) is a pure exploration algorithm that does not take reward into account at all. The only purpose of tracking the reward of states is to decide which action to perform when the search concludes.
3.3 Combining Reinforcement Learning and Planning
A natural way to combine planning and learning is to treat the planner as a "teacher" that provides correct transitions used to learn a policy, as in imitation learning (Ross et al., 2011; Guo et al., 2014). Recently, AlphaGo achieved superhuman performance in the game of Go (Silver et al., 2016) by combining supervised learning from expert moves and self-play. AlphaZero (Silver et al., 2017b), a version of the same algorithm that learns solely from self-play, has outperformed previous variants, also showing stunning results in chess and shogi (Silver et al., 2017a).

At every iteration, AlphaZero generates a tree using Monte-Carlo tree search, guided by a policy and a value estimate. It keeps a visit count on the branches of the tree, which it uses both to explore less frequent states (using a variant of UCT) and to generate a target policy. After tree expansion, an action is selected at the root following the target policy, and the resulting subtree is kept for the next iteration. At the end of the episode, the win/loss result is recorded and all transitions are added to a dataset. In parallel, the policy and value estimates are trained in a supervised manner with minibatches sampled from the dataset.
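AlphaZero's target policy is derived from the visit counts at the root; a minimal rendering of that rule (names ours):

```python
def target_policy_from_counts(visit_counts, tau=1.0):
    """pi(a) proportional to N(a)^(1/tau): lower temperatures sharpen
    the target toward the most-visited action at the root."""
    weights = [n ** (1.0 / tau) for n in visit_counts]
    total = sum(weights)
    return [w / total for w in weights]

target_policy_from_counts([1, 3], tau=1.0)  # -> [0.25, 0.75]
```

This count-based target is the key contrast with PIW below, whose target policy is derived from observed returns rather than visit counts.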
4 Policy-guided Iterated Width
We now present our proposed algorithm combining planning and learning. Our motivation is twofold. On the one hand, in spite of its success, IW does not learn from experience, so its performance does not improve over time. On the other hand, RL algorithms usually suffer from poor exploration, struggling to solve problems with sparse rewards, which significantly slows down learning.
In this work we leverage the exploration capacity of IW(1) to train a policy estimate $\hat\pi$. Both IW and Rollout IW select actions at random with uniform probabilities. Even though both algorithms favor novel states with previously unseen feature values, random action selection does not take previous experience into account and results in uninformed exploration. As a consequence, whether a distant reward is reached in a specific search may be largely arbitrary. We build on the recently proposed Rollout IW (Bandres et al., 2018) by incorporating an action selection policy, resulting in an informed IW search. The combination of IW and RL addresses the shortcomings of each approach, resulting in an algorithm that is efficient both in exploration and in learning a policy that can be used in closed-loop scenarios.
Our extension, Policy-guided Iterated Width (PIW), enhances Rollout IW by guiding the search with the current policy estimate $\hat\pi$. We consider tuples of size 1, i.e., IW(1), which keeps the planning step tractable. Similar to Rollout IW, PIW requires a simulator that provides the successor of a state $s$ and a representation of $s$ in terms of features $\phi(s)$.
The algorithm interleaves a Rollout IW planning step with a policy learning step, which we describe next. After describing the basic PIW algorithm, we present a possible way to discover features using the learned policy that can in turn be used by IW. This second use of the policy can be beneficial if the original features are poor or unknown.
4.1 Planning step
At every iteration, the algorithm first selects a node for expansion, and then performs a rollout from it. To find this node, PIW uses the current policy $\hat\pi$ to select actions. The tree is traversed until a state-action pair that has not yet been expanded is found. The rollout also uses $\hat\pi$ to select actions, until a terminal state or a state that is not novel is reached. At that point, the final node is marked as solved, and the process restarts until all nodes have been solved or a maximum budget of time or nodes is exhausted. Algorithm 1 shows the planning step of PIW.
In this work, our policy takes the form of a neural network (NN) with softmax outputs $\hat\pi(a \mid s) \propto \exp(z_a / \tau)$, where $z_a$, $a \in A$, are the logits output by the NN and $\tau$ is a temperature parameter for additional control of the exploration. We rely on the NN to learn a good representation of the state, trained using samples from IW. In the limit $\tau \to \infty$ we recover the uniform policy used in width-based algorithms. Just as in Rollout IW, actions that lead to nodes labelled as solved should not be considered. Thus, we set the probability of each solved action to zero and normalize over the remaining actions before sampling (Select_action_following_policy).
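Action selection with temperature and solved-action masking can be sketched as follows (an illustration of the rule just described under our assumptions, not the authors' code; it assumes at least one unsolved action remains):

```python
import math
import random

def select_action_following_policy(logits, solved, tau=1.0):
    """Softmax over logits / tau, with solved actions assigned
    probability zero and the rest renormalized; returns a sampled
    action index together with the masked distribution."""
    live = [z for z, s in zip(logits, solved) if not s]
    m = max(live)  # subtract the max for numerical stability
    probs = [0.0 if s else math.exp((z - m) / tau)
             for z, s in zip(logits, solved)]
    total = sum(probs)
    probs = [p / total for p in probs]
    return random.choices(range(len(logits)), weights=probs)[0], probs

action, probs = select_action_following_policy([0.0, 0.0, 4.0],
                                               [False, True, False])
```

A large `tau` flattens the distribution toward uniform over the unsolved actions, recovering the uninformed exploration of plain Rollout IW.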
Every time a node is labelled as solved, the label is propagated along the branch to the root. A node is labelled as solved if each of its children is labelled as solved (Solve_and_propagate_label). Initially, all nodes of the cached tree are marked as not solved, except for the ones that are terminal (Initialize_labels).
As previously mentioned, an expanded state is considered novel if one of its atoms is true at a smaller depth than the one registered so far in the novelty table. A node that was already in the tree will not be pruned if its depth is exactly equal to the one in the novelty table for one of its atoms.
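The depth-aware novelty test can be sketched as follows (a minimal rendering of the rule just described; the table maps each atom to the smallest depth at which it has been seen):

```python
def novel(features, depth, table, cached=False):
    """A state is novel if some atom (index, value) appears at a depth
    strictly smaller than the one recorded so far; a cached node is
    also kept when its depth exactly ties the recorded one."""
    atoms = list(enumerate(features))
    is_novel = any(
        depth < table.get(a, float("inf")) or
        (cached and depth == table.get(a, float("inf")))
        for a in atoms)
    if is_novel:
        for a in atoms:  # update the best known depth for each atom
            table[a] = min(table.get(a, float("inf")), depth)
    return is_novel

table = {}
novel((0, 1), 0, table)               # all atoms new at depth 0: kept
novel((0, 1), 1, table)               # deeper duplicate: pruned
novel((0, 1), 0, table, cached=True)  # cached node ties depth 0: kept
```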
4.2 Learning step
Once the tree has been generated, the discounted rewards are backpropagated to the root: each node receives return $R(s) = r + \gamma \max_{s'} R(s')$, where the maximum is taken over the children $s'$ of $s$. A target policy is induced from the returns at the root node by applying a softmax with temperature $\tau \to 0$, i.e., a one-hot encoding of the maximum return, except when more than one path leads to the same maximum return, in which case each such path is assigned equal probability. The state $s$ is stored together with the target policy $\pi^{\text{target}}$ to train the model in a supervised manner. We use the cross-entropy error between the induced target policy and the current policy estimate to update the policy parameters, defining the loss function
$$\ell(\theta) = -\sum_{a \in A} \pi^{\text{target}}(a \mid s) \log \hat\pi_\theta(a \mid s).$$
L2 regularization may be added to avoid overfitting and help convergence. The model is trained by randomly sampling transitions from the dataset, which can be done in parallel, as in AlphaZero, or by taking a training step after each lookahead. In our experiments we choose the latter, sampling a batch of transitions at each iteration. We keep a bounded dataset of transitions, discarding outdated transitions in FIFO order.
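The induced target policy and the loss can be sketched as follows (a minimal rendering under the notation above; function names ours):

```python
import math

def target_policy(root_returns):
    """One-hot over the maximum return at the root; if several actions
    attain the maximum, the probability is split equally among them."""
    best = max(root_returns)
    winners = [i for i, r in enumerate(root_returns) if r == best]
    return [1.0 / len(winners) if i in winners else 0.0
            for i in range(len(root_returns))]

def cross_entropy(target, estimate):
    """Cross-entropy between the target policy and the policy estimate
    (the supervised loss, before any L2 regularization term)."""
    return -sum(t * math.log(p) for t, p in zip(target, estimate) if t > 0)

target_policy([1.0, 2.0, 2.0])  # -> [0.0, 0.5, 0.5]
```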
Finally, a new root is selected from the nodes at depth 1 following the target policy, and the resulting subtree is kept for the next planning step. This has been referred to as tree caching in previous work (Lipovetzky & Geffner, 2012), where it has been argued that not including cached nodes in the novelty table increases exploration and hence performance. Note that cached nodes will contain outdated information. Although we did not find this to have a great impact on performance, one possibility is to re-run the model on all nodes of the tree at regular intervals. However, this is not done in our experiments.
4.3 Dynamic features
The quality of the transitions recorded by IW greatly depends on the feature set used to define the novelty of states. For example, even though IW has been applied directly to visual (pixel) features (Bandres et al., 2018), it tends to work best when the features are symbolic, e.g., when the RAM state is used as a feature vector (Lipovetzky et al., 2015). Symbolic features make planning more effective, since the information encoded in the features effectively reduces the width of the problem. However, how to automatically learn powerful features for this type of structured exploration remains an open challenge.
In PIW, we can use the representation learned by the NN to define a feature space, as in representation learning (Goodfellow et al., 2016). With this dependence, the behavior of IW effectively changes as policy updates are interleaved with runs of IW. If appropriately defined, these features should help to distinguish between important parts of the state space. In this work, we extract the features $\phi(s)$ from the last hidden layer of the neural network. In particular, we use the outputs of the rectified linear units, which we subsequently discretize in the simplest way, resulting in binary features (0 for zero outputs and 1 for positive outputs).

5 Experiments
We evaluate the performance of Policy-guided IW in different settings. First, we consider a toy problem where we compare our method against state-of-the-art RL algorithms. We show that PIW is superior to current methods in a challenging, sparse-reward environment. Second, we present preliminary results in large-scale problems, testing our approach in four Atari 2600 games, and we show that PIW outperforms previous width-based approaches. Finally, we compare the policy learned by PIW with state-of-the-art results in the Atari benchmark.
5.1 Simple environments
To test our approach, we use a gridworld environment in which an agent has to navigate to first pick up a key and then go through a door. An episode terminates with a reward when the goal is accomplished, with a penalty when a wall is hit, or with no reward after a maximum of 200 steps. No intermediate states are rewarded. The observation is an RGB image, and the possible actions are no-op, up, down, left, and right. The environment is challenging since the reward is sparse and each episode terminates when the agent hits a wall (resetting the agent's position). We consider three variants of the game, with increasing difficulty (see Figure 2).
We compare our approach to AlphaZero. Although originally designed for two-player zero-sum games, it can easily be extended to the MDP setting. AlphaZero controls the balance between exploration and exploitation through an exploration constant together with a temperature parameter $\tau$ in the target policy, similar to ours. In the original paper, $\tau$ is set to 1 for a few steps at the beginning of every episode, and then changed to an infinitesimal temperature for the rest of the game (Silver et al., 2017b). Nevertheless, in our experiments we achieved better results with AlphaZero using $\tau = 1$ for the entire episode. Furthermore, AlphaZero needs to wait until the episode ends to assign a target value to all transitions in the episode. Thus, for a fair comparison, in these experiments PIW also adds all the transitions of an episode to the dataset upon termination.
We analyze PIW using static and dynamic features. For the first case, we take the set of BASIC features (Bellemare et al., 2013), where the input image is divided into tiles and an atom, represented by a (tile, color) tuple, is true if the color appears in the tile. In our simple environment, we make the tiles coincide with the grid, and since there is only one color per tile, the number of features is limited to 100. For the second case, we take the (discretized) outputs of the last hidden layer as binary feature vectors.
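Both feature types are straightforward to compute; a sketch under our assumptions (a grid of one color index per tile for BASIC, and a vector of ReLU activations for the dynamic case):

```python
def basic_atoms(grid):
    """BASIC-style atoms: one (row, column, color) triple per tile; an
    atom is true if that color appears in that tile."""
    return {(i, j, c) for i, row in enumerate(grid)
                      for j, c in enumerate(row)}

def dynamic_features(hidden):
    """Binarized last hidden layer: 1 for positive ReLU outputs,
    0 otherwise (the discretization of Section 4.3)."""
    return tuple(int(h > 0) for h in hidden)

basic_atoms([[0, 1], [1, 1]])      # one atom per tile of a 2x2 grid
dynamic_features([0.0, 2.5, 0.7])  # -> (0, 1, 1)
```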
All algorithms share the same NN architecture and hyperparameters, specified in Table 1. We use two convolutional and two fully connected layers, as in Mnih et al. (2013), and train using the non-centered version of the RMSProp algorithm. Although all algorithms can be run in parallel, the experiments presented in this paper were executed using a single thread.
Hyperparameter      | Value  | Algorithm
Discount factor     | 0.99   | All
Batch size          | 10     | All
Learning rate       | 0.0007 | All
Clip gradient norm  | 40     | All
RMSProp decay       | 0.99   | All
RMSProp epsilon     | 0.1    | All
Tree budget nodes   | 50     | PIW, AlphaZero
Dataset size        |        | PIW, AlphaZero
L2 reg. loss factor |        | PIW, AlphaZero
Tree policy temp.   | 1      | PIW, AlphaZero
                    | 0.5    | AlphaZero
Dirichlet noise     | 0.3    | AlphaZero
Noise factor        | 0.25   | AlphaZero
Value loss factor   | 1      | AlphaZero, A2C
Entropy loss factor | 0.1    | A2C
Figure 3 (top row) shows results comparing PIW and AlphaZero for the three mazes. We plot the average reward as a function of the number of interactions with the environment. As expected, the number of interactions with the environment required to solve the problem increases with the level of difficulty. We observe that PIW outperforms AlphaZero in these environments using both dynamic and static features. Surprisingly, the two variants of PIW show little difference, indicating that in this simple maze, the features can be learned easily.
The difference in performance between AlphaZero and the PIW variants can be explained by the fact that AlphaZero needs to traverse an optimal branch several times to increase its probability of selection, since its policy estimate is based on visit counts. In contrast, PIW makes decisions based solely on the rewards present in the tree. Thus, it may select a branch with a low visit count after spending its budget exploring different parts of the state space. This, together with the use of rollouts that reach deeper states, makes IW more suitable for these challenging sparse-reward environments.
We also evaluate the learned policy of both algorithms at regular intervals, and compare it with A2C. To do this, we use a greedy policy, i.e., we sample uniformly among the actions of maximum probability. Figure 3 (bottom row) shows the results. We observe that PIW outperforms A2C in all three scenarios. This is again explained by the different types of exploration performed by the algorithms. The exploration of A2C is purely random, and depends heavily on the entropy loss factor. Since the reward in these environments is sparse, this factor needs to be high, which slows down learning.
5.2 Atari games
We now consider the Atari benchmark. In this case, we only consider dynamic features. We set the maximum number of expanded nodes to 100, use a fixed-size transition dataset with an L2-norm penalty, and linearly anneal the learning rate. Just as in Bandres et al. (2018), we set the frame skip parameter to 15, and keep all other hyperparameters equal to the previous experiments. In contrast to AlphaZero, PIW does not need to wait until an episode terminates to add transitions to the dataset. Thus, in these experiments transitions are added right after they are generated, making them immediately available for the training step. As in previous work, the input to the NN consists of the last four grayscale frames, stacked together to form a 4-channel image.
Game     | IW RAM (1500) | RIW B-PROST (0.5s) | RIW B-PROST (32s) | PIW dynamic (100)
Breakout | 384.0         | 82.4               | 36.0              | 107.1
Freeway  | 31.0          | 2.8                | 12.6              | 28.65
Pong     | 21.0          | 7.4                | 17.6              | 20.7
Qbert    | 3,705.0       | 3,375.0            | 8,390.0           | 415,271.5
Table 2 shows results comparing PIW with previous width-based algorithms on the Atari games Pong, Freeway, Qbert and Breakout. Our budget of 100 nodes at each tree expansion takes approximately 1 second in all games except Freeway, where the simulator is slower and takes 3 seconds. In these four games, PIW clearly outperforms previous width-based algorithms based on pixel features, even compared to Rollout IW executions with a tree phase of 32 seconds. Furthermore, our results are comparable to those achieved by Lipovetzky et al. (2015), where the internal RAM state was used as the feature vector. This suggests that the policy is not only beneficial for guiding the search; its learned representation (our simple discretized features of the hidden layer) also yields features that can be exploited by IW. Note that in our experiments we use a smaller budget of nodes (100 vs 1,500), which could explain the poorer performance in Breakout, for instance.
In two executions of Qbert, the lookahead exploits a recently discovered glitch that leads to scores of nearly a million points (Chrabaszcz et al., 2018), while still achieving a remarkable score of around 30,000 in the other three. Thus, the learned policy serves as a good heuristic for guiding the search. Nevertheless, the resulting policy itself is not a very good closed-loop controller. This is shown in Figure 4, where we show the performance of PIW (what we could call the teacher) together with the learned closed-loop controller $\hat\pi$, which is evaluated at regular intervals. In Freeway and Pong, the policy estimate is able to follow the target performance, although it does not match the results of the lookahead. In the game of Breakout we find a similar behavior as in Qbert, although the lookahead only improves at the beginning, resulting in noisy performance.

Game     | Human    | DQN      | A3C       | A3C+      | PIW
Breakout | 31.8     | 259.40   | 432.42    | 473.93    | 6.9
Freeway  | 29.6     | 30.12    | 0.00      | 30.48     | 23.55
Pong     | 9.3      | 19.17    | 20.84     | 20.75     | 16.38
Qbert    | 13,455.0 | 7,094.91 | 19,175.72 | 19,257.55 | 570.5
Finally, we also compare the policy learned by PIW against state-of-the-art RL methods. Table 3 shows preliminary results. Although the policy learned by PIW is not competitive, it is important to note that we use far fewer training samples (the horizontal axes include all environment interactions, including tree generation). Moreover, we used a frame skip of 15 based on previous work, instead of 4 as in DQN or A3C. This value may be adequate for algorithms that perform a lookahead, since all movements can be anticipated, but may be too high for estimating an action based solely on the current observation.
6 Conclusions
The exploration strategy of width-based planners is fundamentally different from that of existing RL methods, achieving state-of-the-art performance in planning problems and, more recently, in the Atari benchmark. In this work, we have brought width-based exploration closer to the RL setting.
Width-based algorithms require a factorization of states into features, which may not always be available. A second contribution of this paper is the use of the representation learned by the policy as the feature space. We show how such a representation can be exploited by IW, achieving results comparable to using predefined features.
Our approach learns a compact policy using the exploration power of IW(1), which helps reach distant high-reward states. We use the transitions recorded by IW(1) to train a policy in the form of a neural network. Simultaneously, the search is informed by the current policy estimate, reinforcing promising paths. Our algorithm operates in a manner similar to AlphaZero: it extends Rollout IW to use a policy estimate to guide the search, and it interleaves learning and planning. Unlike AlphaZero, exploration relies on the pruning mechanism of IW, no value estimate is kept, and the target policy is based on observed rewards rather than visitation counts.
Just like Monte-Carlo tree search, PIW requires access to a simulator. This is a departure from model-free RL, which uses the simulator only as the environment. In this sense, we use the simulator to generate experience, and use that experience to learn a policy that can be applied efficiently at execution time. We remark that Rollout IW (and consequently PIW) does not require storing and retrieving arbitrary states, since rollouts always follow a trajectory and backtrack to the root state before the next rollout.
We have shown experimentally that our proposed PIW algorithm outperforms existing RL algorithms in simple environments. Moreover, we have provided results for a subset of the Atari 2600 games, in which PIW outperforms other width-based planning algorithms. We have also evaluated the learned policy; although it serves as a good heuristic for generating the tree, it fails to reach the target performance of the lookahead. This could be due to several reasons (e.g., the frame skip may be too high compared to what is used in DQN or A3C), and we leave for future work the improvements necessary for the policy estimate to match the lookahead performance. We would also like to investigate the use of a value estimate in our algorithm, and to decouple the learning of features for IW from the policy estimate.
Acknowledgements
Miquel Junyent’s research is partially funded by project 2016DI004 of the Catalan Industrial Doctorates Plan. Anders Jonsson is partially supported by grants TIN2015-67959 and PCIN-2017-082 of the Spanish Ministry of Science. Vicenç Gómez is supported by the Ramón y Cajal program RYC-2015-18878 (AEI/MINEICO/FSE, UE).
References
Auer et al. (2002) Auer, Peter, Cesa-Bianchi, Nicolò, and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Bandres et al. (2018) Bandres, Wilmer, Bonet, Blai, and Geffner, Hector. Planning with pixels in (almost) real time. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, 2018.
 Bellemare et al. (2016) Bellemare, Marc, Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Remi. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.
 Bellemare et al. (2013) Bellemare, Marc G, Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 Boutilier et al. (2000) Boutilier, Craig, Dearden, Richard, and Goldszmidt, Moisés. Stochastic dynamic programming with factored representations. Artificial intelligence, 121(1-2):49–107, 2000.
 Chentanez et al. (2005) Chentanez, Nuttapong, Barto, Andrew G., and Singh, Satinder P. Intrinsically motivated reinforcement learning. In Saul, L. K., Weiss, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 17, pp. 1281–1288. MIT Press, 2005.
 Chrabaszcz et al. (2018) Chrabaszcz, Patryk, Loshchilov, Ilya, and Hutter, Frank. Back to basics: Benchmarking canonical evolution strategies for playing Atari. arXiv preprint arXiv:1802.08842, 2018.
 Fortunato et al. (2018) Fortunato, Meire, Azar, Mohammad Gheshlaghi, Piot, Bilal, Menick, Jacob, Hessel, Matteo, Osband, Ian, Graves, Alex, Mnih, Volodymyr, Munos, Remi, Hassabis, Demis, Pietquin, Olivier, Blundell, Charles, and Legg, Shane. Noisy networks for exploration. In International Conference on Learning Representations, 2018.
 Geffner & Bonet (2013) Geffner, H. and Bonet, B. A Concise Introduction to Models and Methods for Automated Planning. Morgan & Claypool, 2013.
 Goodfellow et al. (2016) Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. Deep Learning. MIT Press, 2016.
 Guo et al. (2014) Guo, Xiaoxiao, Singh, Satinder, Lee, Honglak, Lewis, Richard L, and Wang, Xiaoshi. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in neural information processing systems, pp. 3338–3346, 2014.
 Kearns & Koller (1999) Kearns, Michael and Koller, Daphne. Efficient reinforcement learning in factored MDPs. In International Joint Conference on Artificial Intelligence, volume 16, pp. 740–747, 1999.
 Kearns & Singh (2002) Kearns, Michael and Singh, Satinder. Near-optimal reinforcement learning in polynomial time. Machine learning, 49(2-3):209–232, 2002.
 Kocsis & Szepesvári (2006) Kocsis, Levente and Szepesvári, Csaba. Bandit based Monte-Carlo planning. In European conference on machine learning, pp. 282–293. Springer, 2006.
 Kolobov (2012) Kolobov, Andrey. Planning with Markov decision processes: An AI perspective. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–210, 2012.
 Liang et al. (2016) Liang, Yitao, Machado, Marlos C, Talvitie, Erik, and Bowling, Michael. State of the art control of Atari games using shallow reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 485–493. International Foundation for Autonomous Agents and Multiagent Systems, 2016.
 Lipovetzky & Geffner (2012) Lipovetzky, Nir and Geffner, Hector. Width and serialization of classical planning problems. Frontiers in Artificial Intelligence and Applications, 242:540–545, 2012. ISSN 0922-6389. doi: 10.3233/978-1-61499-098-7-540.
 Lipovetzky et al. (2015) Lipovetzky, Nir, Ramirez, Miquel, and Geffner, Hector. Classical planning with simulators: Results on the Atari video games. In International Joint Conference on Artificial Intelligence, volume 15, pp. 1610–1616, 2015.
 Liu et al. (2017) Liu, Yang, Ramachandran, Prajit, Liu, Qiang, and Peng, Jian. Stein variational policy gradient. In Uncertainty in Artificial Intelligence, 2017.
 Martin et al. (2017) Martin, Jarryd, Sasikumar, Suraj Narayanan, Everitt, Tom, and Hutter, Marcus. Count-based exploration in feature space for reinforcement learning. In International Joint Conference on Artificial Intelligence, 2017.
 Mnih et al. (2013) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
 Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015. URL http://dx.doi.org/10.1038/nature14236.
 Neu et al. (2017) Neu, Gergely, Gómez, Vicenç, and Jonsson, Anders. A unified view of entropy-regularized Markov decision processes. Deep Reinforcement Learning Symposium, NIPS, 2017.
 Ng et al. (1999) Ng, Andrew Y, Harada, Daishi, and Russell, Stuart. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning, volume 99, pp. 278–287, 1999.
 Osband et al. (2016) Osband, Ian, Blundell, Charles, Pritzel, Alexander, and Van Roy, Benjamin. Deep exploration via bootstrapped DQN. In Advances in neural information processing systems, pp. 4026–4034, 2016.
 Plappert et al. (2018) Plappert, Matthias, Houthooft, Rein, Dhariwal, Prafulla, Sidor, Szymon, Chen, Richard Y., Chen, Xi, Asfour, Tamim, Abbeel, Pieter, and Andrychowicz, Marcin. Parameter space noise for exploration. In International Conference on Learning Representations, 2018.
 Ross et al. (2011) Ross, Stéphane, Gordon, Geoffrey, and Bagnell, Drew. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635, 2011.
 Silver et al. (2016) Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Silver et al. (2017a) Silver, David, Hubert, Thomas, Schrittwieser, Julian, Antonoglou, Ioannis, Lai, Matthew, Guez, Arthur, Lanctot, Marc, Sifre, Laurent, Kumaran, Dharshan, Graepel, Thore, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017a.
 Silver et al. (2017b) Silver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez, Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017b.
 Still & Precup (2012) Still, Susanne and Precup, Doina. An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 131(3):139–148, 2012.