
Improving width-based planning with compact policies

by Miquel Junyent, et al.

Optimal action selection in decision problems characterized by sparse, delayed rewards is still an open challenge. For these problems, current deep reinforcement learning methods require enormous amounts of data to learn controllers that reach human-level performance. In this work, we propose a method that interleaves planning and learning to address this issue. The planning step hinges on the Iterated-Width (IW) planner, a state-of-the-art planner that makes explicit use of the state representation to perform structured exploration. IW is able to scale up to problems independently of the size of the state space. From the state-action pairs visited by IW, the learning step estimates a compact policy, which in turn is used to guide the planning step. The type of exploration used by our method is radically different from the standard random exploration used in RL. We evaluate our method in simple problems where we show it to outperform the state-of-the-art reinforcement learning algorithms A2C and AlphaZero. Finally, we present preliminary results in a subset of the Atari games suite.



1 Introduction

Optimal sequential decision making is a fundamental problem in many diverse fields. In recent years, the Reinforcement Learning (RL) approach has experienced unprecedented success, reaching human-level performance in several domains, including Atari video games (Mnih et al., 2015) and the ancient game of Go (Silver et al., 2016). This success has been largely enabled by the use of advanced function approximation techniques in combination with large-scale data generation from self-play games. Current RL methods, however, still require enormous amounts of data to learn, especially in tasks characterized by delayed, sparse rewards, where more efficient ways of exploring the problem state space are needed.

Safe online exploration can be incentivized by adding a reward bonus. This is known under different names: reward shaping (Ng et al., 1999), optimism in the face of uncertainty (Kearns & Singh, 2002), intrinsic motivation (Chentanez et al., 2005), curiosity-driven RL (Still & Precup, 2012), prediction gain (Bellemare et al., 2016), or entropy-regularized MDPs (Neu et al., 2017). Alternative approaches introduce correlated noise directly in the parameter space of the policy or value function (Plappert et al., 2018; Fortunato et al., 2018; Osband et al., 2016; Liu et al., 2017).

While these approaches offer significant improvements over classical exploration techniques such as $\epsilon$-greedy or Boltzmann exploration, none of them makes explicit use of the representation of the state, which is treated as a black box. The traditional way to exploit the state structure in RL is through the framework of factored MDPs. Factored MDPs compactly represent the transition probabilities of the MDP as products of factors that each involve a small subset of the state variables, which in some cases reduces the sample complexity exponentially (Boutilier et al., 2000; Kearns & Koller, 1999). However, the factorization structure is in general not inherited by the value function and is not generally exploited to guide the exploration.

In contrast to the RL approach in which the agent learns a policy by interacting with the environment, the planning approach for decision making assumes known models for the agent’s goals and domain dynamics, and focuses on determining how the agent should behave to achieve its objectives (Kolobov, 2012). Current planners are able to solve problem instances involving huge state spaces by precisely exploiting the problem structure that is defined in the state-action model (Geffner & Bonet, 2013).

A family of planners, known as width-based planners, was introduced by Lipovetzky & Geffner (2012) and became the state-of-the-art for solving planning benchmarks. While originally proposed for classical planning problems, i.e., deterministic, goal-driven problems with a fully defined model, width-based planners have evolved closer to the RL setting. Recently, these planners have been applied to Atari games using pixel features reaching comparable results with learning methods in almost real-time (Bandres et al., 2018).

In this paper, we further explore this family of methods, which originated in the planning community. In particular, we consider width-based planning in combination with a learned policy, showing that this combination has benefits both in terms of exploration and feature learning. Our approach is to train a policy in the form of a neural network, and use the compact state representation learned by the policy to guide the width-based planning algorithm. We show that the lookahead power of the resulting algorithm is comparable to or better than previous width-based implementations that use static features. Our approach is based on the same principle as the recently proposed AlphaZero (Silver et al., 2017a), which also interleaves learning with planning, but uses a Monte-Carlo tree search planner instead.

The next section introduces the basic background and Section 3 describes related work. We then present our approach in Section 4, followed by experimental results in Section 5. We conclude and outline future work directions in Section 6.

2 Background

In this section, we review the fundamental concepts of reinforcement learning and width-based planning.

2.1 Reinforcement Learning and MDPs

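We consider sequential decision problems modelled as Markov decision processes (MDPs). An MDP is a tuple $M = \langle S, A, P, r \rangle$, where $S$ is the finite state space, $A$ is the finite action space, $P: S \times A \to \Delta(S)$ is the transition function, and $r: S \times A \to \mathbb{R}$ is the reward function. Here, $\Delta(S)$ is the probability simplex over $S$.

In each time-step $t$, the learner observes state $s_t \in S$, selects action $a_t \in A$, moves to the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and obtains reward $r_{t+1}$ such that $\mathbb{E}[r_{t+1}] = r(s_t, a_t)$. We refer to the tuple $(s_t, a_t, r_{t+1}, s_{t+1})$ as a transition. The aim of the learner is to select actions to maximize the expected cumulative discounted reward $\mathbb{E}\left[\sum_{k \geq 0} \gamma^k r_{t+k+1}\right]$, where $\gamma \in (0, 1]$ is a discount factor. We assume that the transition function $P$ and reward function $r$ are unknown to the learner.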
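As a small illustration of the cumulative discounted reward defined above, the following sketch computes it for one finite reward sequence. The function name and the example rewards are illustrative assumptions, not part of the original work.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**k * rewards[k] over a finite reward sequence."""
    total = 0.0
    # Accumulate backwards using G_k = r_k + gamma * G_{k+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A sparse, delayed reward: only the last step is rewarded (made-up values).
sparse_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(sparse_rewards, gamma=0.5))  # 0.125
```

With a sparse reward such as this one, the return seen from the start of the episode shrinks geometrically with the delay, which is one reason distant rewards are hard to discover by random exploration.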

The decision strategy of the learner is represented by a policy $\pi: S \to \Delta(A)$, i.e. a mapping from states to distributions over actions. A policy $\pi$ induces a value function $V^\pi$ such that for each state $s$, $V^\pi(s)$ is the expected cumulative reward when starting in state $s$ and using policy $\pi$ to select actions. The optimal value function $V^*$ achieves the maximum value in each state $s$, i.e. $V^*(s) = \max_\pi V^\pi(s)$, and the optimal policy $\pi^*$ is the policy that attains this maximum in each state $s$, i.e. $\pi^* = \arg\max_\pi V^\pi(s)$. Typically, an estimate $\hat\pi_\theta$ of the optimal policy and/or an estimate $\hat V_\omega$ of the optimal value function are maintained, parameterized on vectors $\theta$ and $\omega$, respectively.

2.2 Width-based Planning

Iterated Width (IW) is a pure exploration algorithm originally developed for goal-directed planning problems with deterministic actions (Lipovetzky & Geffner, 2012). It requires the state space to be factored into a set of features or atoms $\Phi$. We assume that each feature $f \in \Phi$ has the same domain $D$, e.g. binary ($D = \{0, 1\}$) or real-valued ($D = \mathbb{R}$). The algorithm consists of a sequence of calls IW($i$) for $i = 1, 2, \ldots$ until a termination condition is reached. IW($i$) performs a standard breadth-first search from a given initial state $s_0$, but prunes states that are not novel. When a new state $s$ is generated, IW($i$) contemplates all $k$-tuples of atoms of $s$ with size $k \leq i$. The state is considered novel if at least one tuple has not appeared in the search before, otherwise it is pruned.

IW($i$) is thus a blind search algorithm that traverses the entire state space for sufficiently large $i$. The traversal is solely determined by how the state is structured, i.e., what features are used to represent the state. Each iteration IW($i$) is an $i$-width search that is complete for problems whose width is bounded by $i$ and has complexity $O(n^i)$, where $n$ is the number of problem variables or features (Lipovetzky & Geffner, 2012). Interestingly, most planning benchmarks turn out to have very small width and, in practice, they can be solved in linear or quadratic time.

Figure 1: Example run of IW(1). States are represented by their feature vectors, and actions correspond to edges.

We illustrate IW(1) using a small example that involves three binary features (i.e. $|\Phi| = 3$) and four actions. Figure 1 shows an example run of IW(1), in which the initial state $s_0$ maps to the feature vector $(0, 0, 0)$. Assume that breadth-first search expands the states at depth $1$ in left-to-right order. The first state generates two new feature values: the first and third features have value $1$, and therefore the state is not pruned. The second state does not generate any new feature values, and is thus pruned. The third state assigns $1$ to the second feature for the first time and is kept, while the fourth state is also pruned. The algorithm continues expanding the nodes that have not been pruned in a breadth-first manner until all nodes are pruned or the goal state is reached.
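The pruning rule of IW(1) can be sketched in a few lines. This is a minimal sketch under stated assumptions: `successors` and `features` are hypothetical callables standing in for a simulator and a feature map, and the toy tree is invented for illustration rather than transcribed from Figure 1.

```python
from collections import deque


def iw1(initial_state, successors, features, is_goal=lambda s: False):
    """Breadth-first search that prunes states contributing no new atom.

    An atom is an (index, value) pair of the feature vector. Returns the
    states in the order in which they were expanded.
    """
    seen_atoms = set(enumerate(features(initial_state)))
    queue = deque([initial_state])
    expanded = []
    while queue:
        state = queue.popleft()
        expanded.append(state)
        if is_goal(state):
            break
        for child in successors(state):
            atoms = set(enumerate(features(child)))
            if atoms <= seen_atoms:
                continue  # not novel: every atom was seen before, so prune
            seen_atoms |= atoms
            queue.append(child)
    return expanded


# Toy problem (assumption): states are their own 3-bit feature vectors.
toy_tree = {(0, 0, 0): [(1, 0, 1), (0, 0, 0), (0, 1, 0), (1, 0, 0)]}
order = iw1((0, 0, 0), lambda s: toy_tree.get(s, []), lambda s: s)
print(order)  # [(0, 0, 0), (1, 0, 1), (0, 1, 0)]
```

In the toy run, the second and fourth children are pruned because each of their feature values has already been observed, while the first and third children each introduce at least one new value.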

3 Related work

In the following subsections, we review relevant literature from three different viewpoints: possible extensions to the original IW algorithm, efficient exploration in RL, and recently proposed methods that combine planning and RL.

3.1 Width-Based planning for MDPs

There are two follow-ups to the original IW algorithm of special relevance to this work. First, Lipovetzky et al. (2015) extended the original algorithm to MDPs by associating a reward $R(s)$ to each state $s$ during search, equivalent to the discounted reward accumulated on the path from $s_0$ to $s$, i.e. $R(s) = \sum_{t=0}^{d-1} \gamma^t r_{t+1}$, where $d$ is the depth of $s$ in the search tree. The discount factor has the effect of favoring earlier rewards. With this extension, after the search completes, the first action on the path from $s_0$ to the state with the highest reward is applied, similarly to model predictive control. This version of IW(1) achieves competitive performance in the Atari suite (Bellemare et al., 2013), using as features the 128 bytes of RAM that represent the current game configuration.

Bandres et al. (2018) further modified the algorithm to handle visual features. Specifically, their algorithm uses the (binary) B-PROST features extracted from the images of Atari games (Liang et al., 2016). Since storing and retrieving Atari states during breadth-first search is costly, the authors introduced a Rollout version of IW(1) which emulates the breadth-first traversal of the original algorithm, by keeping track of the minimum depth at which a feature is found for the first time and extending the notion of novelty accordingly. The pruned states are kept as leaves of the tree, and are considered as candidates for states with highest reward.

The above contributions brought the original formulation of IW closer to the RL setting. The rollout version of IW only requires a simulator (a successor function from a given state) and a structured representation of the state in terms of atoms or features. However, two important challenges remain. First, IW is used in an open-loop way and does not produce a compact policy that can be used in a reactive, closed-loop environment. Second, width-based algorithms use a fixed set of features that needs to be defined in advance. While competitive performance has been achieved using pixels as features in the Atari domain, interesting states may still require a large width to be reached, which can be infeasible. These two challenges are the main motivation for our work.

3.2 Exploration in Reinforcement Learning

As mentioned in the introduction, there are several alternative approaches to efficient exploration in RL. Most of these approaches are based on the idea of adding an explicit bonus to the reward function. The intuition is that adding a bonus to states that have not been frequently visited during search increases the likelihood of visiting unexplored parts of the state space, potentially leading to higher rewards. Even though this scheme does not preserve the Markov property (since the reward now typically depends on the number of times a state has been visited), the result is often that the learner ends up exploring larger sections of the state space. This follows the well-known principle of optimism in the face of uncertainty (Kearns & Singh, 2002).

The UCT algorithm for Monte-Carlo tree search (Kocsis & Szepesvári, 2006), and its precursor UCB for multi-armed bandits (Auer et al., 2002), are examples of algorithms that assign a reward bonus that shrinks as the number of times a state has been visited during search grows. Since the search tree has finite size, it is feasible to count the number of times each state is visited.
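A count-based bonus of this kind can be sketched with the standard UCB1 rule, in which the bonus decays with the square root of the visit count. The function name and the example numbers are illustrative assumptions.

```python
import math


def ucb1_scores(mean_rewards, counts, c=math.sqrt(2)):
    """UCB1 score per arm: empirical mean plus a count-based bonus.

    Arms that were never tried get an infinite score so they are tried first.
    """
    total = sum(counts)
    scores = []
    for mean, n in zip(mean_rewards, counts):
        if n == 0:
            scores.append(math.inf)
        else:
            scores.append(mean + c * math.sqrt(math.log(total) / n))
    return scores


# Two arms with the same empirical mean: the rarely visited one scores higher.
scores = ucb1_scores([0.5, 0.5], [10, 1])
print(scores[1] > scores[0])  # True
```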

When the state space is very large, maintaining an explicit visitation count becomes infeasible. Bellemare et al. (2016) address this problem by introducing pseudo-counts based on the prediction gain of states. The authors show that the pseudo-counts are good approximators of how often states are visited, and define a reward bonus inversely proportional to the pseudo-count. Martin et al. (2017) further extend the idea of pseudo-counts from states to feature vectors. The idea is to decompose the computation of the pseudo-count such that the pseudo-count of a state $s$ is composed of the pseudo-counts of each feature of $s$. This significantly simplifies the computation and reduces the algorithmic complexity.

Compared to the above approaches, IW(1) is a pure exploration algorithm that does not take reward into account at all. The only purpose of tracking the reward of states is to decide which action to perform when the search concludes.

3.3 Combining Reinforcement Learning and Planning

A natural way to combine planning and learning is to identify the planner as a “teacher” that provides correct transitions that are used to learn a policy, as in imitation learning (Ross et al., 2011; Guo et al., 2014). Recently, AlphaGo achieved superhuman performance in the game of Go (Silver et al., 2016) by combining supervised learning from expert moves and self-play. AlphaZero (Silver et al., 2017b), a version of the same algorithm that learned solely from self-play, has outperformed previous variants, also showing stunning results in Chess and Shogi (Silver et al., 2017a).

At every iteration, AlphaZero generates a tree using Monte-Carlo tree search, guided by a policy and a value estimate. It keeps a visit count on the branches of the tree, and uses it to explore less frequent states (using a variant of UCT) and to generate a target policy $\pi^{\text{target}}$. After tree expansion, an action is selected at the root following $\pi^{\text{target}}$, and the resulting subtree is kept for the next iteration. At the end of the episode, the win/loss result is recorded and all transitions are added to a dataset. In parallel, the policy and value estimates are trained in a supervised manner with minibatches sampled from the dataset.

4 Policy-guided Iterated Width

We now present our proposed algorithm to combine planning and learning. Our aim is two-fold. On the one hand, in spite of its success, IW does not learn from experience, so its performance does not improve over time. On the other hand, RL algorithms usually suffer from poor exploration, struggling to solve problems with sparse rewards and significantly slowing down learning.

In this work we leverage the exploration capacity of IW(1) to train a policy estimate $\hat\pi_\theta$. Both IW and Rollout IW select actions at random with uniform probabilities. Even though both algorithms favor novel states with previously unseen feature values, random action selection does not take into account previous experience and results in an uninformed exploration. As a result, whether a distant reward is reached in a specific search is largely a matter of chance. We build on the recently proposed rollout-based version of IW (Bandres et al., 2018) by incorporating an action selection policy, resulting in an informed IW search. The combination of IW and RL addresses the shortcomings of each approach, resulting in an algorithm that is efficient both at exploring and at learning a policy that can be used in closed-loop scenarios.

Our extension, Policy-guided Iterated Width (PIW), enhances Rollout IW by guiding the search with the current policy estimate $\hat\pi_\theta$. We consider tuples of size $1$, i.e., IW(1), which keeps the planning step tractable. Similar to Rollout IW, PIW requires a simulator that provides the successor of a state $s$ and a representation of $s$ as features $\phi(s)$.

The algorithm interleaves a Rollout IW planning step with a policy learning step, which we describe next. After describing the basic PIW algorithm, we present a possible way to discover features using the learned policy that can in turn be used by IW. This second use of the policy can be beneficial if the original features are poor or unknown.

4.1 Planning step

At every iteration, the algorithm first selects a node $n$ for expansion, and then performs a rollout from $n$. To find $n$, PIW uses the current policy $\hat\pi_\theta$ to select actions. The tree is traversed until a state-action pair $(n, a)$ that has not yet been expanded is found. The rollout from $n$ also uses $\hat\pi_\theta$ to select actions until a terminal state or a state that is not novel is reached. At that point, the final node is marked as solved and the process restarts until all nodes have been solved or a maximum budget of time or nodes is exhausted. Algorithm 1 shows the planning step of PIW.

In this work, our policy takes the form of a neural network (NN) with softmax outputs $\hat\pi_\theta(a \mid s) = \frac{\exp(h_a(s)/\tau)}{\sum_{b \in A} \exp(h_b(s)/\tau)}$, where $h_a(s)$, $a \in A$, are the logits output by the NN and $\tau$ is a temperature parameter for additional control of the exploration. We rely on the NN to find a good representation of the state, which it learns from samples generated by IW.
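The effect of the temperature on a softmax policy can be sketched as follows. The logits are made-up numbers and the function name is an illustrative assumption; the network that would produce the logits is omitted.

```python
import math


def softmax_policy(logits, tau=1.0):
    """Softmax over logits divided by a temperature tau > 0."""
    scaled = [h / tau for h in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical network outputs
print(softmax_policy(logits, tau=1.0))    # peaked on the first action
print(softmax_policy(logits, tau=100.0))  # close to uniform
print(softmax_policy(logits, tau=0.01))   # close to one-hot
```

A high temperature flattens the distribution towards uniform sampling, while a low temperature concentrates it on the action with the largest logit.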

  function Generate_lookahead_tree(tree)
     Initialize_labels(tree)
     D := Make_empty_novelty_table()
     while within_budget and not tree.root.solved do
        n, a := Select(tree.root, D)
        if a != none then
           Rollout(n, a, D)
  function Select(n, D)
     loop
        novel := Check_novelty(D, n.atoms, n.depth, false)
        if is_terminal(n) or not novel then
           Solve_and_propagate_label(n)
           return n, none
        a := Select_action_following_policy(n)
        if n[a] in tree then
           n := n[a]
        else
           return n, a
  function Rollout(n, a, D)
     while within_budget do
        n := expand_node(n, a)
        n.solved := false
        novel := Check_novelty(D, n.atoms, n.depth, true)
        if is_terminal(n) or not novel then
           Solve_and_propagate_label(n)
           return
        a := Select_action_following_policy(n)
  function Check_novelty(D, atoms, d, is_new)
     novel := false
     for f in atoms do
        novel := novel or d < D[f] or (not is_new and d = D[f])
        if d < D[f] and is_new then
           D[f] := d
     return novel
Algorithm 1 Planning step of Policy-Guided IW(1)
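The depth-based novelty test of Algorithm 1 can be sketched in Python as follows. This is one reading of the pseudocode and of the description in this section, not the authors' implementation; representing the novelty table as a dictionary that defaults to infinite depth is an assumption.

```python
import math


def check_novelty(table, atoms, depth, is_new):
    """Depth-based novelty test for Rollout IW(1).

    table maps each atom to the smallest depth at which it has been seen.
    A newly generated node (is_new=True) is novel if some atom appears at a
    strictly smaller depth than recorded; a node already in the tree
    (is_new=False) is also novel if its depth equals the recorded one.
    """
    novel = False
    for atom in atoms:
        best = table.get(atom, math.inf)
        if depth < best or (not is_new and depth == best):
            novel = True
        if depth < best and is_new:
            table[atom] = depth  # only new nodes update the table
    return novel


table = {}
print(check_novelty(table, {("f0", 1)}, 3, True))   # True: atom never seen
print(check_novelty(table, {("f0", 1)}, 5, True))   # False: seen at depth 3
print(check_novelty(table, {("f0", 1)}, 3, False))  # True: cached, same depth
```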

In the limit $\tau \to \infty$ we obtain the uniform policy used in width-based algorithms. Just as in Rollout IW, actions that lead to nodes labelled as solved should not be considered. Thus, we set the probability $\hat\pi_\theta(a \mid s) = 0$ for each solved action $a$ and normalize over the remaining actions before sampling (Select_action_following_policy).
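Masking solved actions and renormalizing before sampling can be sketched as below. The function signatures are assumptions made for illustration; the policy probabilities are passed in directly rather than computed by a network.

```python
import random


def mask_and_normalize(probs, solved):
    """Zero the probability of solved actions and renormalize the rest."""
    masked = [0.0 if done else p for p, done in zip(probs, solved)]
    total = sum(masked)
    if total == 0.0:
        raise ValueError("no unsolved action with positive probability")
    return [p / total for p in masked]


def select_action_following_policy(probs, solved, rng=random):
    """Sample an action index from the masked, renormalized distribution."""
    dist = mask_and_normalize(probs, solved)
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


print(mask_and_normalize([0.5, 0.25, 0.25], [True, False, False]))
# [0.0, 0.5, 0.5]
```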

Every time a node is labelled as solved, the label is propagated along the branch to the root. A node is labelled as solved if each of its children is labelled as solved (Solve_and_propagate_label). Initially, all nodes of the cached tree are marked as not solved, except for the ones that are terminal (Initialize_labels).
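The label propagation described above can be sketched with a minimal tree node. The `Node` class is hypothetical, and the sketch assumes that `children` lists one child per available action, so that a node with unexpanded actions is never marked as solved by mistake.

```python
class Node:
    """Minimal tree node (hypothetical) holding only what the sketch needs."""

    def __init__(self, parent=None, terminal=False):
        self.parent = parent
        self.children = []
        self.solved = terminal  # terminal nodes start out solved
        if parent is not None:
            parent.children.append(self)


def solve_and_propagate_label(node):
    """Mark node as solved, then mark ancestors whose children are all solved."""
    node.solved = True
    parent = node.parent
    while parent is not None and all(c.solved for c in parent.children):
        parent.solved = True
        parent = parent.parent


root = Node()
left, right = Node(root), Node(root)
solve_and_propagate_label(left)
print(root.solved)  # False: the right child is still open
solve_and_propagate_label(right)
print(root.solved)  # True: every child is solved
```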

As previously mentioned, an expanded state is considered novel if one of its atoms is true at a smaller depth than the one registered so far in the novelty table. A node that was already in the tree will not be pruned if its depth is exactly equal to the one in the novelty table for one of its atoms.

4.2 Learning step

Once the tree has been generated, the discounted rewards are backpropagated to the root: $R_i = r_i + \gamma \max_{j \in \text{children}(i)} R_j$. A target policy $\pi^{\text{target}}(\cdot \mid s)$ is induced from the returns at the root node by applying a softmax with $\tau \to 0$, i.e. a one-hot encoding of the maximum return, except for the cases where more than one path leads to the same maximum return, in which case each such path is assigned equal probability. The state $s$ is stored together with the target policy to train the model in a supervised manner. We use the cross-entropy error between the induced target policy and the current policy estimate to update the policy parameters $\theta$, defining a loss function

$$\mathcal{L}(\theta) = -\sum_{a \in A} \pi^{\text{target}}(a \mid s) \log \hat\pi_\theta(a \mid s).$$

L2 regularization may be added to avoid overfitting and help convergence. The model is trained by randomly sampling transitions from the dataset, which can be done in parallel, as in AlphaZero, or a training step can be taken after each lookahead. In our experiments we choose the latter, sampling a batch of transitions at each iteration. We cap the number of stored transitions at a fixed dataset size, discarding outdated transitions in a FIFO manner.
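The target policy, the cross-entropy loss, and the FIFO dataset can be sketched as follows. The function names, the example returns, and the dataset capacity are illustrative assumptions; the gradient step on the network is omitted.

```python
import math
from collections import deque


def target_policy(returns):
    """One-hot on the maximum return, split equally among ties."""
    best = max(returns)
    ties = [1.0 if r == best else 0.0 for r in returns]
    count = sum(ties)
    return [t / count for t in ties]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target and a predicted action distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


# FIFO dataset: appending beyond maxlen silently drops the oldest entry.
dataset = deque(maxlen=3)  # capacity chosen only for this example
for step in range(5):
    dataset.append((f"state_{step}", target_policy([0.0, 1.0, 1.0])))

print(target_policy([0.0, 1.0, 1.0]))  # [0.0, 0.5, 0.5]
print([state for state, _ in dataset])  # ['state_2', 'state_3', 'state_4']
```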

Finally, a new root is selected from the nodes at depth 1 following $\pi^{\text{target}}$ and the resulting subtree is kept for the next planning step. This has been referred to as tree caching in previous work (Lipovetzky & Geffner, 2012), and it has been argued that not including cached nodes in the novelty table increases exploration and hence performance. Note that cached nodes will contain outdated information. Although we did not find this to have a great impact on performance, one possibility is to rerun the model on all nodes of the tree at regular intervals. However, this is not done in our experiments.

4.3 Dynamic features

The quality of the transitions recorded by IW greatly depends on the feature set used to define the novelty of states. For example, even though IW has been applied directly to visual (pixel) features (Bandres et al., 2018), it tends to work best when the features are symbolic, e.g., when the RAM state is used as a feature vector (Lipovetzky et al., 2015). Symbolic features make planning more effective, since the width of a problem is effectively reduced by the information encoded in the features. However, how to automatically learn powerful features for this type of structured exploration is an open challenge.

In PIW, we can use the representation learned by the NN to define a feature space, as in representation learning (Goodfellow et al., 2016). With this dependence, the behavior of IW effectively changes when interleaving policy updates with runs of IW. If appropriately defined, these features should help to distinguish between important parts of the state space. In this work, we extract $\phi(s)$ from the last hidden layer of the neural network. In particular, we use the output of the rectified linear units, which we subsequently discretize in the simplest way, resulting in binary features ($0$ for units whose output is zero, i.e. whose pre-activation is negative, and $1$ for positive outputs).
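The discretization of the hidden layer into binary atoms can be sketched as below. The activations are made-up numbers and the function names are illustrative assumptions; in practice they would come from the last hidden layer of the policy network.

```python
def binary_features(relu_outputs):
    """Threshold rectified activations: 1 if the unit is active, else 0."""
    return tuple(1 if h > 0.0 else 0 for h in relu_outputs)


def atoms(relu_outputs):
    """Atoms for IW(1): (unit index, binary value) pairs."""
    return set(enumerate(binary_features(relu_outputs)))


hidden = [0.0, 2.3, 0.0, 0.1]  # hypothetical last-hidden-layer outputs
print(binary_features(hidden))  # (0, 1, 0, 1)
```

Because the network weights change after every learning step, the same state can map to different atoms at different iterations, which is why the features are called dynamic.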

5 Experiments

We evaluate the performance of Policy-guided IW in different settings. First, we consider a toy problem where we compare our method against state-of-the-art RL algorithms. We show that PIW is superior to current methods in a challenging, sparse-reward environment. Second, we present preliminary results in large-scale problems, testing our approach in four Atari 2600 games, and we show that PIW outperforms previous width-based approaches. Finally, we compare the policy learned by PIW with state-of-the-art results in the Atari benchmark.

5.1 Simple environments

To test our approach, we use a gridworld environment where an agent has to navigate to first pick up a key and then go through a door. An episode terminates with a positive reward when the goal is accomplished, with a negative reward when a wall is hit, or with no reward after a maximum of 200 steps is reached. No intermediate state is rewarded. The observation is an RGB image and the possible actions are no-op and moving up, down, left or right. The environment is challenging since the reward is sparse and each episode terminates when the agent hits a wall (resetting the agent’s position). We consider three variants of the game, with increasing difficulty (see Figure 2).

Figure 2: Snapshot of the three versions of the game. The blue, green and red squares represent the agent, the key and the door respectively.

We compare our approach to AlphaZero. Although originally designed for two-player zero-sum games, it can be easily extended to the MDP setting. AlphaZero controls the balance between exploration and exploitation by a parameter $c_{\text{puct}}$ together with a temperature parameter $\tau$ in the target policy $\pi^{\text{target}}$, similar to ours. In the original paper, $\tau$ is set to 1 for a few steps at the beginning of every episode, and then it is changed to an infinitesimal temperature for the rest of the game (Silver et al., 2017b). Nevertheless, we achieved better results in our experiments with AlphaZero using $\tau = 1$ for the entire episode. Furthermore, AlphaZero needs to wait until the episode ends to assign a target value for all transitions in the episode. Thus, for a fair comparison, in these experiments, PIW also adds all the transitions of an episode to the dataset upon termination.

We analyze PIW using static and dynamic features. For the first case, we take the set of BASIC features (Bellemare et al., 2013), where the input image is divided into tiles and an atom, represented by a tuple $(i, j, c)$, is true if color $c$ appears in the tile $(i, j)$. In our simple environment, we make the tiles coincide with the grid, and since there is only one color per tile, the number of features is limited to 100. For the second case, we take the (discretized) outputs of the last hidden layer as binary feature vectors.
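When the tiles coincide with the grid cells, the static atoms can be sketched as below. The grid encoding (one color label per cell, `None` for an empty cell) and the function name are assumptions made for illustration, not the paper's preprocessing.

```python
def basic_atoms(grid):
    """Atoms (i, j, c): color c appears in tile (i, j); empty tiles are skipped."""
    return {
        (i, j, color)
        for i, row in enumerate(grid)
        for j, color in enumerate(row)
        if color is not None
    }


# Hypothetical 2x2 grid with an agent and a key.
grid = [["agent", None],
        [None, "key"]]
print(sorted(basic_atoms(grid)))  # [(0, 0, 'agent'), (1, 1, 'key')]
```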

All algorithms share the same NN architecture and hyperparameters, specified in Table 1. We use two convolutional and two fully connected layers as in Mnih et al. (2013), and we train the network using the non-centered version of the RMSProp algorithm. Although all algorithms can be run in parallel, the experiments presented in this paper have been executed using one thread.

| Hyperparameter | Value | Algorithm |
| --- | --- | --- |
| Discount factor | 0.99 | All |
| Batch size | 10 | All |
| Learning rate | 0.0007 | All |
| Clip gradient norm | 40 | All |
| RMSProp decay | 0.99 | All |
| RMSProp epsilon | 0.1 | All |
| Tree budget nodes | 50 | PIW, AlphaZero |
| Dataset size | | PIW, AlphaZero |
| L2 reg. loss factor | | PIW, AlphaZero |
| Tree policy temp. | 1 | PIW, AlphaZero |
| $c_{\text{puct}}$ | 0.5 | AlphaZero |
| Dirichlet noise | 0.3 | AlphaZero |
| Noise factor | 0.25 | AlphaZero |
| Value loss factor | 1 | AlphaZero, A2C |
| Entropy loss factor | 0.1 | A2C |

Table 1: Hyperparameters used for PIW, AlphaZero and A2C.

Figure 3: Performance of PIW, AlphaZero and A2C in three simple mazes. Columns represent difficulty (1, 2 and 3 walls respectively); the first row shows the lookahead performance of the tree search-based algorithms and the second row compares the compact policies during learning. All plots are averages over five runs, and shades show the minimum and maximum score.

Figure 3 (top row) shows results comparing PIW and AlphaZero for the three mazes. We plot the average reward as a function of the number of interactions with the environment. As expected, the number of interactions with the environment required to solve the problem increases with the level of difficulty. We observe that PIW outperforms AlphaZero in these environments using both dynamic and static features. Surprisingly, the two variants of PIW show little difference, indicating that in this simple maze, the features can be learned easily.

The difference in performance between AlphaZero and the PIW variants arises because AlphaZero needs to go through an optimal branch several times to increase its probability for action selection, since the policy estimate is based on counts. In contrast, PIW makes decisions based solely on the rewards present in the tree. Thus, it may select a branch with a low count after spending the budget on exploring different parts of the state space. This, together with the use of rollouts that reach deeper states, makes IW more suitable for these challenging sparse-reward environments.

We also evaluate the learned policy of both algorithms at regular frame intervals, and compare it with A2C. To do this, we choose a greedy policy with $\tau \to 0$, i.e. sampling uniformly among the actions with maximum probability. Figure 3 (bottom row) shows the results. We observe that PIW outperforms A2C in the three scenarios. This is again explained by the different types of exploration performed by the algorithms. The exploration of A2C is purely random, and depends heavily on the entropy loss factor. Since the reward in these environments is sparse, this factor needs to be high, which slows down learning.

5.2 Atari games

We now consider the Atari benchmark. In this case, we only consider dynamic features. We set the maximum number of expanded nodes to 100, adjust the dataset size and the L2-norm penalty, and linearly anneal the learning rate. Just as in Bandres et al. (2018), we set the frame skip parameter to 15 and all other hyperparameters equal to the previous experiments. In contrast to AlphaZero, PIW does not need to wait until an episode terminates to add transitions to the dataset. Thus, in these experiments transitions are added right after they are generated, so they are directly available for the training step. Similar to previous work, the input for the NN consists of the last four grayscale frames, which are stacked together to form a 4-channel image.

| Game | IW, RAM (1500) | Rollout IW (0.5s) | Rollout IW (32s) | PIW (100) |
| --- | --- | --- | --- | --- |
| Breakout | 384.0 | 82.4 | 36.0 | 107.1 |
| Freeway | 31.0 | 2.8 | 12.6 | 28.65 |
| Pong | 21.0 | -7.4 | 17.6 | 20.7 |
| Qbert | 3,705.0 | 3,375.0 | 8,390.0 | 415,271.5 |

Table 2: Comparison of accumulated reward of different width-based tree search methods. Performance of PIW (lookahead) is an average of 5 runs after 15M frames. Tree budget of either nodes or time shown in parentheses. Results from Bandres et al. (2018).

Table 2 shows results comparing PIW with previous width-based algorithms on the Atari games Pong, Freeway, Qbert and Breakout. Our budget of 100 nodes at each tree expansion takes approximately 1 second in all games except Freeway, where the simulator is slower and takes 3 seconds. In these four games, PIW clearly outperforms previous width-based algorithms based on pixel features, even compared to Rollout IW executions with a tree phase of 32 seconds. Furthermore, our results are comparable to the ones achieved by Lipovetzky et al. (2015), where the internal RAM state was used as the feature vector. This suggests that using the policy is not only beneficial to guide the search, but also using its learned representation (our simple discretized features of the hidden layer) results in features that are exploited by IW. Note that in our experiments we use a smaller budget of nodes (100 vs 1,500), which could explain the poorer performance in Breakout, for instance.

Figure 4: Performance of PIW and its learned closed-loop controller in Freeway, Pong, Qbert and Breakout. Plots are averages of five runs and shades show the minimum and maximum score. Skipped frames are not counted as interactions.

In two executions of Qbert, the lookahead exploits a recently discovered glitch that leads to scores of nearly a million points (Chrabaszcz et al., 2018), while still achieving a remarkable score of around 30,000 in the other three. Thus, the learned policy serves as a good heuristic for guiding the search. Nevertheless, the resulting policy itself is not a very good closed-loop controller. This is shown in Figure 4, where we show the performance of PIW (what we could call the teacher) together with the learned closed-loop controller $\hat\pi_\theta$, which is evaluated at regular intervals of interactions. In Freeway and Pong, the policy estimate is able to follow the target performance, although it does not match the results of the lookahead. In the game of Breakout we find a similar behavior as in Qbert, although the lookahead only improves in the beginning, resulting in a noisy performance.

Game       Human      DQN        A3C         A3C+        PIW
Breakout   31.8       259.40     432.42      473.93      6.9
Freeway    29.6       30.12      0.00        30.48       23.55
Pong       9.3        19.17      20.84       20.75       16.38
Qbert      13,455.0   7,094.91   19,175.72   19,257.55   570.5

Table 3: Results for the policy learned by PIW compared to DQN, A3C and A3C+ (results taken from Bellemare et al. (2016)).

Finally, we also compare the policy learned by PIW against some state-of-the-art RL methods. Table 3 shows some preliminary results. Although the policy learned by PIW is not competitive, it is important to note that we use far fewer training samples (the horizontal axis of Figure 4 includes all environment interactions, including those used for tree generation). Moreover, we used a frameskip of 15 based on previous work, instead of 4 as in DQN or A3C. This value may be appropriate for algorithms that perform a lookahead, since all movements can be anticipated, but may be too high for estimating an action based solely on the current observation.
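As a rough illustration of what the two frameskip values mean for a reactive controller, the short calculation below converts a frameskip into decisions per second and time between decisions. It assumes the emulator runs at 60 frames per second, which is an assumption of this sketch rather than a figure stated above; the function names are hypothetical.

```python
EMULATOR_FPS = 60  # assumption: emulator frame rate, not taken from the text


def decisions_per_second(frameskip: int, fps: int = EMULATOR_FPS) -> float:
    """Number of actions the agent chooses per second of game time."""
    return fps / frameskip


def ms_between_decisions(frameskip: int, fps: int = EMULATOR_FPS) -> float:
    """Game time, in milliseconds, during which one chosen action is repeated."""
    return 1000.0 * frameskip / fps


for skip in (4, 15):
    print(skip, decisions_per_second(skip), round(ms_between_decisions(skip), 1))
```

Under that assumption, a frameskip of 15 gives the policy 4 decisions per second (one every 250 ms), against 15 decisions per second with a frameskip of 4, which is consistent with the concern that a purely reactive policy may act too coarsely.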

6 Conclusions

The exploration strategy of width-based planners is fundamentally different from that of existing RL methods, and it has achieved state-of-the-art performance in planning problems and, more recently, in the Atari benchmark. In this work, we have brought width-based exploration closer to the RL setting.

Width-based algorithms require a factorization of states into features, which may not always be available. A second contribution of this paper is the use of the representation learned by the policy as the feature space. We show how such a representation can be exploited by IW, achieving results comparable to those obtained with pre-defined features.

Our approach learns a compact policy using the exploration power of IW(1), which helps reach distant high-reward states. We use the transitions recorded by IW(1) to train a policy in the form of a neural network. Simultaneously, the search is informed by the current policy estimate, reinforcing promising paths. Our algorithm operates in a similar manner to AlphaZero. It extends Rollout IW to use a policy estimate to guide the search, and interleaves learning and planning. Unlike AlphaZero, exploration relies on the pruning mechanism of IW, no value estimate is kept, and the target policy is based on seen rewards rather than visitation counts.
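The contrast between a target policy built from seen rewards and one built from visitation counts can be sketched as follows. This is an illustrative sketch only: the softmax over returns, the `temperature` parameter and both function names are assumptions, not the exact target used by PIW.

```python
import math
from typing import List


def softmax(values: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of values."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def reward_based_target(returns: List[float], temperature: float = 1.0) -> List[float]:
    """Target built from the best return seen under each root action (assumed form)."""
    return softmax(returns, temperature)


def count_based_target(visits: List[int]) -> List[float]:
    """Target proportional to how often each root action was visited."""
    total = sum(visits)
    return [n / total for n in visits]


# Hypothetical root statistics for three actions.
returns = [0.0, 1.0, 0.0]  # only the second action ever led to a reward
visits = [10, 2, 8]        # but it was explored the least

reward_target = reward_based_target(returns, temperature=0.5)
count_target = count_based_target(visits)
```

With these made-up statistics, the reward-based target puts most of its mass on the second action, while the count-based target favors the first, which shows how the two choices can train the policy toward different actions from the same tree.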

Just like Monte-Carlo tree search, PIW requires access to a simulator. This is a departure from model-free RL, which uses the simulator as the environment. In this sense, we make use of the simulator to generate experience, and use that experience to learn a policy that can be used efficiently at execution time. We remark that Rollout IW (and consequently PIW) does not require storing and retrieving arbitrary states, since rollouts always follow a trajectory and backtrack to the root state prior to the next rollout.
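The remark about backtracking to the root can be illustrated with a small sketch in which the only simulator state ever saved or restored is the root. The `ChainSim` toy simulator, its `clone_state`/`restore_state` methods and the uniform random action choice are all hypothetical stand-ins; policy guidance and novelty pruning are deliberately left out.

```python
import random
from typing import List, Tuple


class ChainSim:
    """Toy deterministic simulator (hypothetical): a walk on the integers."""

    def __init__(self) -> None:
        self.pos = 0

    def clone_state(self) -> int:
        return self.pos

    def restore_state(self, state: int) -> None:
        self.pos = state

    def step(self, action: int) -> float:
        """Apply action (-1 or +1) and return a reward of 1 at position 3."""
        self.pos += action
        return 1.0 if self.pos == 3 else 0.0


def run_rollouts(sim: ChainSim, n_rollouts: int, depth: int,
                 rng: random.Random) -> List[Tuple[List[int], float]]:
    """Run rollouts that each start from the root; only the root state is stored."""
    root = sim.clone_state()
    results = []
    for _ in range(n_rollouts):
        sim.restore_state(root)  # backtrack to the root before each rollout
        actions, total = [], 0.0
        for _ in range(depth):
            a = rng.choice([-1, 1])  # stand-in for policy-guided selection
            total += sim.step(a)
            actions.append(a)
        results.append((actions, total))
    sim.restore_state(root)  # leave the simulator at the root
    return results


sim = ChainSim()
results = run_rollouts(sim, n_rollouts=5, depth=4, rng=random.Random(0))
```

Each rollout is a single forward trajectory, so intermediate states never need to be stored; reaching a deeper node again means replaying actions from the root.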

We have shown experimentally that our proposed PIW algorithm outperforms existing RL algorithms in simple environments. Moreover, we have provided results for a subset of the Atari 2600 games, in which PIW outperforms other width-based planning algorithms. We have also evaluated the learned policy, and although it serves as a good heuristic to generate the tree, it fails to achieve the target performance of the lookahead. This could be due to several reasons (e.g. the frameskip may be too high compared to what is used in DQN or A3C), and we leave for future work the improvements needed to make the policy estimate match the lookahead performance. We would also like to investigate the use of a value estimate in our algorithm, or to decouple learning features for IW from the policy estimate.


Acknowledgments

Miquel Junyent’s research is partially funded by project 2016DI004 of the Catalan Industrial Doctorates Plan. Anders Jonsson is partially supported by the grants TIN2015-67959 and PCIN-2017-082 of the Spanish Ministry of Science. Vicenç Gómez is supported by the Ramon y Cajal program RYC-2015-18878 (AEI/MINEICO/FSE,UE).


References

  • Auer et al. (2002) Auer, Peter, Cesa-Bianchi, Nicolo, and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
  • Bandres et al. (2018) Bandres, Wilmer, Bonet, Blai, and Geffner, Hector. Planning with pixels in (almost) real time. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, 2018.
  • Bellemare et al. (2016) Bellemare, Marc, Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Remi. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.
  • Bellemare et al. (2013) Bellemare, Marc G, Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Boutilier et al. (2000) Boutilier, Craig, Dearden, Richard, and Goldszmidt, Moisés. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1-2):49–107, 2000.
  • Chentanez et al. (2005) Chentanez, Nuttapong, Barto, Andrew G., and Singh, Satinder P. Intrinsically motivated reinforcement learning. In Saul, L. K., Weiss, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 17, pp. 1281–1288. MIT Press, 2005.
  • Chrabaszcz et al. (2018) Chrabaszcz, Patryk, Loshchilov, Ilya, and Hutter, Frank. Back to basics: Benchmarking canonical evolution strategies for playing Atari. arXiv preprint arXiv:1802.08842, 2018.
  • Fortunato et al. (2018) Fortunato, Meire, Azar, Mohammad Gheshlaghi, Piot, Bilal, Menick, Jacob, Hessel, Matteo, Osband, Ian, Graves, Alex, Mnih, Volodymyr, Munos, Remi, Hassabis, Demis, Pietquin, Olivier, Blundell, Charles, and Legg, Shane. Noisy networks for exploration. In International Conference on Learning Representations, 2018.
  • Geffner & Bonet (2013) Geffner, H. and Bonet, B. A Concise Introduction to Models and Methods for Automated Planning. Morgan & Claypool, 2013.
  • Goodfellow et al. (2016) Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. Deep Learning. MIT Press, 2016.
  • Guo et al. (2014) Guo, Xiaoxiao, Singh, Satinder, Lee, Honglak, Lewis, Richard L, and Wang, Xiaoshi. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems, pp. 3338–3346, 2014.
  • Kearns & Koller (1999) Kearns, Michael and Koller, Daphne. Efficient reinforcement learning in factored MDPs. In International Joint Conference on Artificial Intelligence, volume 16, pp. 740–747, 1999.
  • Kearns & Singh (2002) Kearns, Michael and Singh, Satinder. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
  • Kocsis & Szepesvári (2006) Kocsis, Levente and Szepesvári, Csaba. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pp. 282–293. Springer, 2006.
  • Kolobov (2012) Kolobov, Andrey. Planning with Markov decision processes: An AI perspective. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–210, 2012.
  • Liang et al. (2016) Liang, Yitao, Machado, Marlos C, Talvitie, Erik, and Bowling, Michael. State of the art control of Atari games using shallow reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 485–493. International Foundation for Autonomous Agents and Multiagent Systems, 2016.
  • Lipovetzky & Geffner (2012) Lipovetzky, Nir and Geffner, Hector. Width and serialization of classical planning problems. Frontiers in Artificial Intelligence and Applications, 242:540–545, 2012. ISSN 09226389. doi: 10.3233/978-1-61499-098-7-540.
  • Lipovetzky et al. (2015) Lipovetzky, Nir, Ramirez, Miquel, and Geffner, Hector. Classical planning with simulators: Results on the atari video games. In International Joint Conference on Artificial Intelligence, volume 15, pp. 1610–1616, 2015.
  • Liu et al. (2017) Liu, Yang, Ramachandran, Prajit, Liu, Qiang, and Peng, Jian. Stein variational policy gradient. In Uncertainty in Artificial Intelligence, 2017.
  • Martin et al. (2017) Martin, Jarryd, Sasikumar, Suraj Narayanan, Everitt, Tom, and Hutter, Marcus. Count-Based Exploration in Feature Space for Reinforcement Learning. In International Joint Conference on Artificial Intelligence, 2017.
  • Mnih et al. (2013) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
  • Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Neu et al. (2017) Neu, Gergely, Gómez, Vicenç, and Jonsson, Anders. A unified view of entropy-regularized Markov decision processes. Deep Reinforcement Learning Symposium, NIPS, 2017.
  • Ng et al. (1999) Ng, Andrew Y, Harada, Daishi, and Russell, Stuart. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning, volume 99, pp. 278–287, 1999.
  • Osband et al. (2016) Osband, Ian, Blundell, Charles, Pritzel, Alexander, and Van Roy, Benjamin. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026–4034, 2016.
  • Plappert et al. (2018) Plappert, Matthias, Houthooft, Rein, Dhariwal, Prafulla, Sidor, Szymon, Chen, Richard Y., Chen, Xi, Asfour, Tamim, Abbeel, Pieter, and Andrychowicz, Marcin. Parameter space noise for exploration. In International Conference on Learning Representations, 2018.
  • Ross et al. (2011) Ross, Stéphane, Gordon, Geoffrey, and Bagnell, Drew. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635, 2011.
  • Silver et al. (2016) Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Silver et al. (2017a) Silver, David, Hubert, Thomas, Schrittwieser, Julian, Antonoglou, Ioannis, Lai, Matthew, Guez, Arthur, Lanctot, Marc, Sifre, Laurent, Kumaran, Dharshan, Graepel, Thore, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017a.
  • Silver et al. (2017b) Silver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez, Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017b.
  • Still & Precup (2012) Still, Susanne and Precup, Doina. An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 131(3):139–148, 2012.