Single-Agent Policy Tree Search With Guarantees

11/27/2018 ∙ by Laurent Orseau, et al. ∙ Google ∙ Federal University of Viçosa

We introduce two novel tree search algorithms that use a policy to guide search. The first algorithm is a best-first enumeration that uses a cost function that allows us to prove an upper bound on the number of nodes to be expanded before reaching a goal state. We show that this best-first algorithm is particularly well suited for `needle-in-a-haystack' problems. The second algorithm is based on sampling and we prove an upper bound on the expected number of nodes it expands before reaching a set of goal states. We show that this algorithm is better suited for problems where many paths lead to a goal. We validate these tree search algorithms on 1,000 computer-generated levels of Sokoban, where the policy used to guide the search comes from a neural network trained using A3C. Our results show that the policy tree search algorithms we introduce are competitive with a state-of-the-art domain-independent planner that uses heuristic search.




1 Introduction

Monte-Carlo tree search (MCTS) algorithms (coulom2007efficient; browne2012survey) have been recently applied with great success to several problems such as Go, Chess, and Shogi (silver2016mastering; silver2017mastering). Such algorithms are well adapted to stochastic and adversarial domains, due to their sampling nature and the convergence guarantee to min-max values. However, the sampling procedure used in MCTS algorithms is not well-suited for other kinds of problems (nakhost2013random), such as deterministic single-agent problems where the objective is to find any solution at all. In particular, if the reward is very sparse—for example the agent is rewarded only at the end of the task—MCTS algorithms revert to uniform search. In practice such algorithms can be guided by a heuristic but, to the best of our knowledge, no bound is known that depends on the quality of the heuristic. For such cases one may use instead other traditional search approaches such as A* (hart1968aFormalBasis) and Greedy Best-First Search (GBFS) (Doran1966), which are guided by a heuristic cost function.

In this paper we tackle single-agent problems from the perspective of policy-guided search. One may view policy-guided search as a special kind of heuristic search in which a policy, instead of a heuristic function, is provided as input to the search algorithm. As a policy is a probability distribution over sequences of actions, this allows us to provide theoretical guarantees that cannot be offered by value (e.g., reward-based) functions: we can bound the number of node expansions (roughly speaking, the search time) depending on the probability of the sequences of actions that reach the goal. We propose two different algorithms with different strengths and weaknesses. The first algorithm, called LevinTS, is based on Levin search (levin1973search) and we derive a strict upper bound on the number of nodes to search before finding the least-cost solution. The second algorithm, called LubyTS, is based on the scheduling of luby1993speedup for randomized algorithms and we prove an upper bound on the expected number of nodes to search before reaching any solution while taking advantage of the potential multiplicity of the solutions. LevinTS and LubyTS are the first policy tree search algorithms with such guarantees. Empirical results on the PSPACE-hard domain of Sokoban (Culberson1999) show that LubyTS and in particular LevinTS guided by a policy learned with A3C (mni2016asynchronous) are competitive with a state-of-the-art planner that uses GBFS (hoffmannN01). Although we focus on deterministic environments, LevinTS and LubyTS can be extended to stochastic environments with a known model.

LevinTS and LubyTS bring important research areas closer together: areas that traditionally rely on heuristic-guided tree search with guarantees, such as classical planning, and areas devoted to learning control policies, such as reinforcement learning. We expect future work to explore closer relations between these areas, such as the use of LevinTS and LubyTS as part of classical planning systems.

2 Notation and background

We write $\mathbb{N}_1 := \{1, 2, \ldots\}$. Let $\mathcal{S}$ be a (possibly uncountable) set of states, and let $\mathcal{A}$ be a finite set of actions. The environment starts in an initial state $s_0 \in \mathcal{S}$. During an interaction step (or just step) the environment in state $s \in \mathcal{S}$ receives an action $a \in \mathcal{A}$ from the searcher and transitions deterministically according to a transition function $T : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ to the state $s' = T(s, a)$. The state of the environment after a sequence of actions $a_{1:t}$ is written $T(a_{1:t})$, which is a shorthand for the recursive application of the transition function $T$ from the initial state $s_0$ to each action of $a_{1:t}$, where $a_{1:t}$ is the sequence of actions $a_1, a_2, \ldots, a_t$. Let $\mathcal{S}^g \subseteq \mathcal{S}$ be a set of goal states. When the environment transitions to one of the goal states, the problem is solved and the interaction stops. We consider tree search algorithms and define the set of nodes in the tree as the set of sequences of actions $\mathcal{N} := \mathcal{A}^*$. The root node $n_0$ is the empty sequence of actions. Hence a sequence of actions $a_{1:t}$ of length $t$ is uniquely identified by a node $n \in \mathcal{N}$ and we define $d_0(n) = d_0(a_{1:t}) := t + 1$ (the usual depth $d(n)$ of the node is recovered with $d(n) = d_0(n) - 1$). Several sequences of actions (hence several nodes) can lead to the same state of the environment, and we write $\mathcal{N}(s) := \{n \in \mathcal{N} : T(n) = s\}$ for the set of nodes with the same state. We define the set of children $\mathcal{C}(n)$ of a node $n \in \mathcal{N}$ as $\mathcal{C}(n) := \{na : a \in \mathcal{A}\}$, where $na$ denotes the sequence of actions $n$ followed by the action $a$. We define the target set $\mathcal{N}^g \subseteq \mathcal{N}$ as the set of nodes such that the corresponding states are goal states: $\mathcal{N}^g := \{n \in \mathcal{N} : T(n) \in \mathcal{S}^g\}$. The searcher does not know the target set in advance and only recognizes a goal state when the environment transitions to one. If $n_1 = a_{1:t}$ and $n_2 = a_{1:t'}$ with $t < t'$ then we say that $a_{1:t}$ is a prefix of $a_{1:t'}$ and that $n_1$ is an ancestor of $n_2$ (and $n_2$ is a descendant of $n_1$).

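A search tree $\mathcal{T} \subseteq \mathcal{N}$ is a set of sequences of actions (nodes) such that (i) for all nodes $n \in \mathcal{T}$, $\mathcal{T}$ also contains all the ancestors of $n$ and (ii) if $n \in \mathcal{T} \cap \mathcal{N}^g$, then the tree contains no descendant of $n$. The leaves $\mathcal{L}(\mathcal{T})$ of the tree $\mathcal{T}$ are the set of nodes $n \in \mathcal{T}$ such that $\mathcal{T}$ contains no descendant of $n$. A policy $\pi$ assigns probabilities to sequences of actions under the constraint that $\pi(n_0) = 1$ and $\pi(n) = \sum_{n' \in \mathcal{C}(n)} \pi(n')$ for all $n \in \mathcal{N}$. If $n'$ is a descendant of $n$, we define the conditional probability $\pi(n' \mid n) := \pi(n')/\pi(n)$. The policy is assumed to be provided as input to the search algorithm.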

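Let TS be a generic tree search algorithm defined as follows. At any expansion step $k \ge 1$, let $\mathcal{V}_k$ be the set of nodes that have been expanded (visited) before (excluding) step $k$, and let the fringe set $\mathcal{F}_k := \bigcup_{n \in \mathcal{V}_k} \mathcal{C}(n) \setminus \mathcal{V}_k$ be the set of not-yet-expanded children of expanded nodes, with $\mathcal{V}_1 := \emptyset$ and $\mathcal{F}_1 := \{n_0\}$. At iteration $k$, the search algorithm TS chooses a node $n_k \in \mathcal{F}_k$ for expansion: if $n_k \in \mathcal{N}^g$, then the algorithm terminates with success. Otherwise, $\mathcal{V}_{k+1} := \mathcal{V}_k \cup \{n_k\}$ and the iteration $k+1$ starts. At any expansion step, the set of expanded nodes is a search tree. Let $n_k$ be the node expanded by TS at step $k$. Then we define the search time $N(\text{TS}, \mathcal{N}^g) := \min \{k \ge 1 : n_k \in \mathcal{N}^g\}$ as the number of node expansions before reaching any node of the target set $\mathcal{N}^g$.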

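A policy is Markovian if the probability of an action depends only on the current state of the environment, that is, $\pi(a \mid n_1) = \pi(a \mid n_2)$ for all $a \in \mathcal{A}$ and all $n_1$ and $n_2$ with $T(n_1) = T(n_2)$. In this paper we consider both Markovian and non-Markovian policies. For some function $\text{cost}$ over nodes, we define the cost of a state $s$ as $\text{cost}(s) := \min \{\text{cost}(n) : n \in \mathcal{N}(s)\}$. Then we say that a tree search algorithm with a cost function expands states in best-first order if for all states $s_1$ and $s_2$, if $\text{cost}(s_1) < \text{cost}(s_2)$, then $s_1$ is visited before $s_2$. We say that a state is expanded at its lowest cost if for all states $s$, the first node $n \in \mathcal{N}(s)$ to be expanded has cost $\text{cost}(n) = \text{cost}(s)$.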

3 Levin tree search: policy-guided enumeration

First, we show that merely expanding nodes by decreasing order of their probabilities can fail to reach a goal state of non-zero probability.

Theorem 1.

The version of TS that chooses at iteration $k$ the node $n_k := \arg\max_{n \in \mathcal{F}_k} \pi(n)$ may never expand any node of the target set $\mathcal{N}^g$, even if $\pi(n) > 0$ for all $n \in \mathcal{N}^g$.

Proof. Consider the tree in Fig. 1. Under the left child of the root is an infinite ‘chain’ in which each node has probability $1/2$. Under the right child of the root is an infinite binary tree in which each node has two children, each of conditional probability $1/2$, and thus each node of depth $d$ has probability $2^{-d}$. Before testing a node of depth at least 2 in the right-hand-side binary tree (with probability at most $1/4$), the search expands infinitely many nodes of probability $1/2$. Defining the target set as any set of nodes with individual probability at most $1/4$ proves the claim. ∎
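The argument of the proof can be replayed numerically. The following is a minimal sketch, assuming the chain-and-bin tree has the probabilities described in the proof (1/2 for every chain node, $2^{-d}$ for a binary-tree node of depth $d$). The node encoding `(kind, depth, prob)`, the helper names, and the first-in first-out tie-breaking are illustrative choices, not part of the original text.

```python
import heapq
from fractions import Fraction

def children(node):
    """Children of a (kind, depth, prob) node of the chain-and-bin tree."""
    kind, depth, prob = node
    if kind == "root":
        # The root splits its probability evenly between the chain and the binary tree.
        return [("chain", 1, prob / 2), ("bin", 1, prob / 2)]
    if kind == "chain":
        # A chain node has a single child of conditional probability 1.
        return [("chain", depth + 1, prob)]
    # A binary-tree node has two children of conditional probability 1/2 each.
    return [("bin", depth + 1, prob / 2), ("bin", depth + 1, prob / 2)]

def greedy_by_probability(max_expansions):
    """Expand the most probable fringe node first; return the expanded nodes."""
    root = ("root", 0, Fraction(1))
    fringe = [(-root[2], 0, root)]  # max-heap on probability via negation
    ticket = 1                      # insertion counter: breaks ties first-in first-out
    expanded = []
    while fringe and len(expanded) < max_expansions:
        _, _, node = heapq.heappop(fringe)
        expanded.append(node)
        for child in children(node):
            heapq.heappush(fringe, (-child[2], ticket, child))
            ticket += 1
    return expanded

expanded = greedy_by_probability(1000)
deepest_bin = max(depth for kind, depth, _ in expanded if kind == "bin")
print(deepest_bin)  # depth of the deepest binary-tree node that was expanded
```

However large the expansion budget is, the search keeps walking down the chain, so binary-tree nodes of depth 2 or more are never reached.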

To solve this problem, we draw inspiration from Levin search (levin1973search; trakhtenbrot1984survey), which (in a different domain) penalizes the probability with computation time. Here, we take computation time to mean the depth of a node. The new Levin tree search (LevinTS) algorithm is a version of TS in which nodes are expanded in order of increasing costs $d_0(n)/\pi(n)$ (see Algorithm 1).

LevinTS also performs state cuts (see Line 11 of Algorithm 1). That is, LevinTS does not expand node $n'$ representing state $s$ if (i) the policy is Markovian, (ii) it has already expanded another node $n$ that also represents $s$, and (iii) $\pi(n) \ge \pi(n')$. By performing state cuts only if these three conditions are met, we can show that LevinTS expands states in best-first order.

 1 def LevinTS()
 2   V := ∅
 3   F := {n_0}
 4   while F ≠ ∅
 5     n := argmin_{n ∈ F} d_0(n)/π(n)
 6     F := F \ {n}
 7     s := T(n)
 8     if s ∈ S^g
 9       return true
10     if is_Markov(π)
11       if ∃ n' ∈ V: T(n') = s and π(n') ≥ π(n)
12         # s has already been visited with
13         # a higher probability: State cut
14         continue
15     V := V ∪ {n}
16     F := F ∪ C(n)
17   return false
Algorithm 1: Levin tree search.

def sample_traj(depth)
  n := n_0
  for d := 0 to depth
    if T(n) ∈ S^g
      return true
    a ∼ π(·|n)
    n := na
  return false
Algorithm 2: Sampling and execution of a single trajectory.

Figure 1: A ‘chain-and-bin’ tree. (Figure not reproduced: the left child of the root heads an infinite chain and the right child heads an infinite binary tree; both edges out of the root have probability 1/2, chain edges have conditional probability 1, and binary-tree edges have conditional probability 1/2.)
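Levin tree search can be sketched in runnable form with a priority queue. This is a minimal illustration under stated assumptions, not the authors' implementation: the policy is Markovian and given as a hypothetical callable `policy(state)` returning a dict from actions to conditional probabilities, states are hashable, the cost is taken to be $d_0(n)/\pi(n)$ with $d_0$ equal to the number of actions plus one, and floating-point costs are used for simplicity. The names `levin_ts`, `transition`, and `is_goal` are illustrative.

```python
import heapq
from itertools import count

def levin_ts(start, transition, is_goal, policy, max_expansions=100_000):
    """Best-first search on the cost d0(n) / pi(n), with state cuts.

    Returns (action tuple or None, number of nodes taken from the fringe).
    """
    tie = count()  # tie-breaker so that the heap never compares states
    # Fringe entries: (cost, tie, d0, probability, state, actions so far).
    fringe = [(1.0, next(tie), 1, 1.0, start, ())]
    best_prob = {}  # state -> highest pi(n) among the nodes expanded for it
    expansions = 0
    while fringe and expansions < max_expansions:
        _, _, d0, prob, state, path = heapq.heappop(fringe)
        expansions += 1
        if is_goal(state):
            return path, expansions
        if best_prob.get(state, 0.0) >= prob:
            continue  # state cut: already expanded with at least this probability
        best_prob[state] = prob
        for action, p in policy(state).items():
            if p > 0.0:
                child_prob = prob * p
                child = (
                    (d0 + 1) / child_prob, next(tie), d0 + 1, child_prob,
                    transition(state, action), path + (action,),
                )
                heapq.heappush(fringe, child)
    return None, expansions

# Toy usage: walk on the integers from 0 to the goal 5 under a uniform policy.
path, n_expanded = levin_ts(
    start=0,
    transition=lambda s, a: s + (1 if a == "R" else -1),
    is_goal=lambda s: s == 5,
    policy=lambda s: {"L": 0.5, "R": 0.5},
)
print(path, n_expanded)
```

In this toy problem the returned path has 5 actions and probability $2^{-5}$, so the number of nodes taken from the fringe stays below $6 \cdot 2^5 = 192$; state cuts make the actual count much smaller.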
Theorem 2.

LevinTS expands states in best-first order and at their lowest cost first.


Proof. Let us first consider the case where the policy is non-Markovian. Then, LevinTS does not perform state cuts (see Line 10 of Algorithm 1). Let $n_1$ and $n_2$ be two arbitrary different nodes (sequences of actions), with $\text{cost}(n_1) < \text{cost}(n_2)$. Let $n_{12}$ be the closest common ancestor of $n_1$ and $n_2$; it must exist since at least the root is one of their common ancestors. Then all nodes on the path from $n_{12}$ to $n_1$ have cost less than $\text{cost}(n_1)$ and thus than $\text{cost}(n_2)$, due to the monotonicity of $d_0$ and $1/\pi$ and thus of cost, which implies by recursion from $n_{12}$ that all these nodes and thus also $n_1$ are expanded before $n_2$. Hence, if $T(n_1) = T(n_2)$, this proves that all states are visited first at their lowest cost. Furthermore, if $T(n_1) \ne T(n_2)$, this proves that states of lower cost are visited first.

Now, if the policy is Markovian, then we need to show that state cuts do not prevent best-first order and lowest cost. Let $n_1$ and $n_2$ be two nodes representing the same state $s$, where $n_1$ is expanded before $n_2$. Assume that no cut has been performed before $n_2$ is expanded. First, since no cuts were performed, we have from the non-Markovian case that $\text{cost}(n_1) \le \text{cost}(n_2)$. Secondly, consider a sequence of actions $a_{1:k}$ taken after state $s$, and let $n_1'$ be the node reached after taking $a_{1:k}$ starting from $n_1$ and similarly for $n_2'$. Since the environment is deterministic, this sequence leads to the same state $s'$, whether starting from $n_1$ or from $n_2$. Since the policy is Markovian, $\pi(n_1' \mid n_1) = \pi(n_2' \mid n_2)$. Then from the condition (iii) of state cuts, $\pi(n_1) \ge \pi(n_2)$, and

$$\text{cost}(n_1') = \frac{d_0(n_1) + k}{\pi(n_1)\,\pi(n_1' \mid n_1)} = \frac{\text{cost}(n_1) + k/\pi(n_1)}{\pi(n_1' \mid n_1)} \le \frac{\text{cost}(n_2) + k/\pi(n_2)}{\pi(n_2' \mid n_2)} = \text{cost}(n_2')$$

so the state $s'$ has a lower or equal cost below $n_1$ than below $n_2$. Since this holds for any such $s'$, $n_2$ can be safely cut, and by recurrence all cuts preserve the best-first ordering and lowest costs of states. The rest of the proof is as in the non-Markovian case. ∎

LevinTS’s cost function allows us to provide the following guarantee, which is an adaptation of Levin search’s theorem (solomonoff1984optimum) to tree search problems.

Theorem 3.

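Let $\mathcal{N}^g$ be a set of target nodes, then LevinTS with a policy $\pi$ ensures that the number of node expansions $N(\text{LevinTS}, \mathcal{N}^g)$ before reaching any of the target nodes is bounded by

$$N(\text{LevinTS}, \mathcal{N}^g) \le \min_{n \in \mathcal{N}^g} \frac{d_0(n)}{\pi(n)}.$$

Proof. From Theorem 2, the first state of $\mathcal{N}^g$ to be expanded is the one of lowest cost, and with one of the nodes $n^g$ of lowest cost, that is, with cost $c := \min_{n \in \mathcal{N}^g} d_0(n)/\pi(n)$. Let $\mathcal{T}$ be the current search tree when $n^g$ is being expanded. Then all nodes in $\mathcal{T}$ that have been expanded up to now have at most cost $c$. Therefore at all leaves $n \in \mathcal{L}(\mathcal{T})$ of the current search tree, $d_0(n)/\pi(n) \le c$. Since each node is expanded at most once (each sequence of actions is tried at most once) the number of nodes expanded by LevinTS until node $n^g$ is at most

$$|\mathcal{T}| \le \sum_{n \in \mathcal{L}(\mathcal{T})} d_0(n) \le \sum_{n \in \mathcal{L}(\mathcal{T})} \pi(n)\, c \le c$$

where the first inequality is because each leaf $n$ accounts for at most $d_0(n)$ nodes of the tree (the leaf and its ancestors), the second inequality follows from $d_0(n)/\pi(n) \le c$, and the last inequality is because $\sum_{n \in \mathcal{L}(\mathcal{T})} \pi(n) \le 1$, which follows from $\pi(n) = \sum_{n' \in \mathcal{C}(n)} \pi(n')$, that is, each parent node splits its probability among its children, and the root has probability 1. ∎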

The upper bound of Theorem 3 is tight within a small factor for a tree like in Fig. 1, and is almost an equality when the tree splits at the root into multiple chains.

4 Luby tree search: policy-guided unbounded sampling


When a good upper bound $d_{\max}$ is known on the depth of a subset of the target nodes with large cumulative probability, a simple idea is to sample trajectories of that maximum depth according to $\pi$ (see Algorithm 2) until a solution is found, if one exists. Call this strategy multiTS (see Algorithm 3). We can then provide the following straightforward guarantee.

Theorem 4.

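The expected number of node expansions before reaching a node in $\mathcal{N}^g$ is bounded by

$$\mathbb{E}\big[N(\text{multiTS}(\infty, d_{\max}), \mathcal{N}^g)\big] \le \frac{d_{\max}}{\pi^+_{d_{\max}}},$$

where $\pi^+_{d_{\max}}$ is the cumulative probability of the target nodes with depth at most $d_{\max}$.

Proof. Remembering that a tree search algorithm does not expand children of target nodes, the result follows from observing that the number of sampled trajectories follows a geometric distribution with success probability $\pi^+_{d_{\max}}$, whose expectation is $1/\pi^+_{d_{\max}}$, where each failed trial takes exactly $d_{\max}$ node expansions and the success trial takes at most $d_{\max}$ node expansions. ∎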
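The geometric-distribution argument can be spelled out in one line. In the sketch below, $K$ denotes the (random) number of sampled trajectories, $d_{\max}$ the fixed depth bound, and $\pi^+$ the cumulative probability of the target nodes within that depth; these symbol names, and the numbers in the example, are assumptions chosen for illustration only.

```latex
\begin{align*}
N &\le d_{\max}\,(K - 1) + d_{\max} = d_{\max}\, K,
  & \mathbb{E}[K] &= \frac{1}{\pi^+}
  & \Longrightarrow \quad \mathbb{E}[N] &\le \frac{d_{\max}}{\pi^+}. \\[4pt]
\text{Example: } d_{\max} &= 200,\ \pi^+ = 0.1
  & & & \Longrightarrow \quad \mathbb{E}[N] &\le 2000.
\end{align*}
```

With a hypothetical depth bound of 200 and a 10% chance per trajectory of hitting a target, the expected number of expansions is at most 2,000.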

This strategy can have an important advantage over LevinTS if there are many target nodes within depth bounded by $d_{\max}$ with small individual probability but large cumulative probability.

The drawback is that if no target node has a depth shorter than the bound $d_{\max}$, this strategy will never find a solution (the expectation is infinite), even if the target nodes have high probability according to the policy $\pi$. Ensuring such target nodes can always be found leads to the LubyTS algorithm.


Suppose we are given a randomized program $\rho$ that has an unknown distribution $p$ over the halting times (where halting means solving an underlying problem). We want to define a strategy that can restart the program multiple times and run it each time with a different allowed running time so that it halts in as little cumulative time as possible in expectation. luby1993speedup prove that the optimal strategy is to run $\rho$ for running times of fixed lengths $t^*_p$ optimized for $p$; then either the program halts within $t^*_p$ steps, or it is forced to stop and is restarted for another $t^*_p$ steps and so on. This strategy has an expected running time of $\ell_p$, with $L_p/4 \le \ell_p \le L_p = \min_{t} t/q(t)$, where $q(t)$ is the cumulative distribution function of $p$. luby1993speedup also devise a universal restarting strategy based on a special sequence of running times:

1 1 2 1 1 2 4 1 1 2 1 1 2 4 8 1 1 2 1 1 2 4 1 1 2 1 1 2 4 8 16 1 1 2…

They prove that the expected running time of this strategy is bounded by $192\,\ell_p(\log_2 \ell_p + 5)$ and also prove a lower bound of $\frac{1}{8}\ell_p \log_2 \ell_p$ for any universal restarting strategy. We propose to use instead the sequence A6519 (footnote: Gary Detlefs (ibid) notes that it can be computed with $((n \text{ XOR } (n-1)) + 1)/2$ or with $(n \text{ AND } -n)$, where $-n$ is $n$’s complement to 2):

1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 16 1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 32 1 2…

which is simpler to compute and for which we can prove the following tighter upper bound.
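Both restart sequences are easy to generate. The sketch below computes A6519 with the two's-complement expression `n & -n` (the highest power of 2 dividing `n`) and the sequence of luby1993speedup with its usual recursive definition; the function names `a6519` and `luby` are illustrative.

```python
def a6519(n: int) -> int:
    """Highest power of 2 dividing n (n >= 1), computed as n AND -n."""
    return n & -n

def luby(n: int) -> int:
    """n-th term (n >= 1) of the universal restart sequence of Luby et al."""
    k = n.bit_length()               # 2**(k-1) <= n <= 2**k - 1
    if n == (1 << k) - 1:
        return 1 << (k - 1)          # the term at index 2**k - 1 is 2**(k-1)
    return luby(n - (1 << (k - 1)) + 1)  # otherwise the sequence repeats its prefix

print([luby(n) for n in range(1, 16)])   # 1 1 2 1 1 2 4 1 1 2 1 1 2 4 8
print([a6519(n) for n in range(1, 17)])  # 1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 16
```

Python integers behave as if stored in two's complement with unbounded width, so `n & -n` works for any positive `n`.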

Theorem 5.

For all distributions $p$ over halting times, the expected running time of the restarting strategy based on A6519 is bounded by $\min_{t} \; t + \frac{t}{q(t)}\left(\log_2 \frac{t}{q(t)} + 6.1\right)$, where $q$ is the cumulative distribution of $p$.

The proof is provided in Appendix B. We can easily import the strategy described above into the tree search setting (see Algorithm 4), and provide the following result.

def multiTS(nsims, d_max)
  for k := 1 to nsims
    if sample_traj(d_max)
      return true
  return false
Algorithm 3: Sampling of nsims trajectories of fixed depths d_max.

def LubyTS(nsims, d_min=1)
  for k := 1 to nsims
    if sample_traj(d_min · A6519(k))
      return true
  return false
Algorithm 4: Sampling of nsims trajectories of depths that follow A6519, with optional coefficient d_min.
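The sampling-based searchers can be written as a short runnable sketch. It assumes a hypothetical interface in which `policy(state)` returns a dict from actions to conditional probabilities, and it names the two parameters `d_max` (fixed depth of multiTS) and `d_min` (coefficient of LubyTS); these names and the function signatures are illustrative assumptions rather than the authors' code.

```python
import random

def sample_traj(start, transition, is_goal, policy, depth, rng):
    """Follow the policy from `start`, testing the states reached after 0..depth actions.

    Returns the list of actions on success, or None if no goal state was reached.
    """
    state, path = start, []
    for _ in range(depth + 1):
        if is_goal(state):
            return path
        actions, probs = zip(*policy(state).items())
        action = rng.choices(actions, weights=probs)[0]
        state = transition(state, action)
        path.append(action)
    return None

def multi_ts(start, transition, is_goal, policy, nsims, d_max, rng=None):
    """Sample up to `nsims` trajectories, all of depth `d_max`."""
    rng = rng or random.Random()
    for _ in range(nsims):
        path = sample_traj(start, transition, is_goal, policy, d_max, rng)
        if path is not None:
            return path
    return None

def luby_ts(start, transition, is_goal, policy, nsims, d_min=1, rng=None):
    """Sample up to `nsims` trajectories of depths d_min * A6519(k)."""
    rng = rng or random.Random()
    for k in range(1, nsims + 1):
        depth = d_min * (k & -k)  # A6519(k): highest power of 2 dividing k
        path = sample_traj(start, transition, is_goal, policy, depth, rng)
        if path is not None:
            return path
    return None
```

For instance, with a goal five steps away and a policy that always moves toward it, `luby_ts` with `d_min=1` only succeeds at its eighth trajectory (the first one of depth at least 5), which illustrates why a coefficient `d_min` larger than 1 can help.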
Theorem 6.

Let $\mathcal{N}^g$ be the set of target nodes, then LubyTS($\infty$, 1) with a policy $\pi$ ensures that the expected number of node expansions before reaching a target node is bounded by

$$\mathbb{E}\big[N(\text{LubyTS}(\infty, 1), \mathcal{N}^g)\big] \le \min_{d \in \mathbb{N}_1} \; d + \frac{d}{\pi^+_d}\left(\log_2 \frac{d}{\pi^+_d} + 6.1\right)$$

where $\pi^+_d$ is the cumulative probability of the target nodes with depth at most $d$.

Proof. This is a straightforward application of Theorem 5: The randomized program samples a sequence of actions from the policy $\pi$, the running time $t$ becomes the depth of a node $d_0(n)$, the probability distribution $p$ over halting times becomes the probability of reaching a target node of depth $t$, $p(t) = \sum_{n \in \mathcal{N}^g :\, d_0(n) = t} \pi(n)$, and the cumulative distribution function $q$ becomes $\pi^+_d$. ∎

Compared to Theorem 4, the cost of adapting to an unknown depth is an additional factor $\log_2(d/\pi^+_d)$. The proof of Theorem 5 suggests that the term $\log_2 d$ is due to not knowing the lower bound on $d$, and the term $\log_2(1/\pi^+_d)$ is due to not knowing the upper bound. If a good lower bound $d_{\min}$ on the average solution length is known, one can also multiply A6519($n$) by $d_{\min}$ to avoid sampling too short trajectories as in Algorithm 4; this may lessen the factor $\log_2 d$ while still guaranteeing that a solution can be found if one of positive probability exists. In particular, in the tree search domain, the sequence A6519 samples trajectories of depth 1 half of the time, which is wasteful. Conversely, in general it is not possible to cap $d$ at some upper bound, as this may prevent finding a solution as for multiTS. Hence the factor $\log_2(1/\pi^+_d)$ remains, which is unfortunate since $\pi^+_d$ can easily be exponentially small with $d$.

5 Strengths and weaknesses of LevinTS and LubyTS

Consider a “needle-in-the-haystack problem” represented by a perfect full and infinite binary search tree where all nodes have probability . Suppose that the set of target nodes contains a single node at some depth . According to Theorem 3, LevinTS needs to expand no more than nodes before expanding . For this particular tree, the number of expansions is closer to since there are only at most nodes with cost lower or equal to . Theorem 6 and the matching-order lower bound of (luby1993speedup) suggest LubyTS may expand in expectation nodes to reach . This additional factor of compared to LevinTS is a non-negligible price for needle-in-a-haystack searches. For multiTS, if the depth bound is larger than , then the expected search time is at most and close to , which is a factor faster than LubyTS, unless .

Now suppose that the set of target nodes is composed of nodes, all at depth . Since all nodes at a given depth have the same probability, LevinTS will expand at least and at most nodes before expanding any of the target nodes. By contrast, because the cumulative probability of the target nodes at depth is , LubyTS finds a solution in node expansions, which is an exponential gain over LevinTS. For multiTS it would be , which can be worse than due to the need for a large enough .

LevinTS can perform state cuts if the policy is Markovian, which can substantially reduce the algorithm’s search effort. For example, suppose that in the binary tree above every left child represents the same state as the root and thus is cut off from the search tree, leaving in effect only nodes for any depth . If the target set contains only one node at some depth , even when following a uniform policy, LevinTS expands only those nodes. By contrast, LubyTS expands in expectation more than nodes. LevinTS has a memory requirement that grows linearly with the number of nodes expanded, as well as a log factor in the computation time due to the need to maintain a priority queue to sort the nodes by cost. By contrast, LubyTS and multiTS have a memory requirement that grows linearly with the solution depth, as they only need to store in memory the trajectory sampled. LevinTS’s memory cost could be alleviated with an iterative deepening (korf1985depth) variant with transposition table (ReinefeldM94).

6 Mixing policies and avoiding zero probabilities

For both LevinTS and LubyTS, if the provided policy incorrectly assigns a probability too close to 0 to some sequences of actions, then the algorithm may never find the solution. To mitigate such outcomes, it is possible to ‘mix’ the policy with the uniform policy so that the former behaves slightly more like the latter. There are several ways to achieve this, each with their own pros and cons.

Bayes mixing of policies

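If $\pi_1$ and $\pi_2$ are two policies, we can build their Bayes average $\pi_{12}$ with prior $\alpha \in [0, 1]$ and $1 - \alpha$ such that for all sequences of actions $a_{1:t}$, $\pi_{12}(a_{1:t}) = \alpha\,\pi_1(a_{1:t}) + (1 - \alpha)\,\pi_2(a_{1:t})$. The conditional probability of the next action is given by

$$\pi_{12}(a_t \mid a_{<t}) = w_1\,\pi_1(a_t \mid a_{<t}) + (1 - w_1)\,\pi_2(a_t \mid a_{<t}), \qquad w_1 := \frac{\alpha\,\pi_1(a_{<t})}{\pi_{12}(a_{<t})},$$

where $w_1$ is the ‘posterior weight’ of the policy $\pi_1$ in $\pi_{12}$. This ensures that $\pi_{12}(n) \ge \alpha\,\pi_1(n)$ and $\pi_{12}(n) \ge (1 - \alpha)\,\pi_2(n)$ for all nodes $n$, which leads to the following refinement for Theorem 3 for example (and similarly for LubyTS):

$$N(\text{LevinTS}, \mathcal{N}^g) \le \min_{n \in \mathcal{N}^g} \min\left\{ \frac{d_0(n)}{\alpha\,\pi_1(n)},\; \frac{d_0(n)}{(1 - \alpha)\,\pi_2(n)} \right\}.$$

In particular, with $\alpha = 1/2$, LevinTS with $\pi_{12}$ is within a factor 2 of the best between LevinTS with $\pi_1$ and LevinTS with $\pi_2$. More than two policies can be mixed together, leading for example to a factor $m$ compared to the best of $m$ policies when all prior weights are equal. This is very much like running several instances of LevinTS in parallel, each with its own policy, except that (weighted) time sharing is done automatically. For example, if the provided policy $\pi$ is likely to occasionally assign too low probabilities, one can run LevinTS with a Bayes mixture of $\pi$ and the uniform policy, with a prior weight $\alpha$ closer to 1 if $\pi$ is likely to be far better than the uniform policy for most instances.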
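The posterior-weight update can be checked numerically. The sketch below assumes the Bayes mixture assigns to a sequence the prior-weighted average of the two policies' sequence probabilities, and builds that probability one action at a time from the conditionals; the function name `bayes_mixture_prob` and the example numbers are illustrative, and the code assumes at least one policy keeps a non-zero probability on every prefix.

```python
from math import prod

def bayes_mixture_prob(alpha, cond1, cond2):
    """Probability of an action sequence under the Bayes mixture, built step by step.

    cond1[k] and cond2[k] are the conditional probabilities that policies 1 and 2
    assign to the k-th action of the sequence given the previous actions.
    """
    prefix1 = prefix2 = mix = 1.0
    for p1, p2 in zip(cond1, cond2):
        # Posterior weight of policy 1 given the actions taken so far.
        w1 = alpha * prefix1 / (alpha * prefix1 + (1 - alpha) * prefix2)
        mix *= w1 * p1 + (1 - w1) * p2  # mixture conditional for this action
        prefix1 *= p1
        prefix2 *= p2
    return mix

# Policy 1 is confident but badly wrong on the second action; policy 2 is uniform over 4 actions.
cond1, cond2 = [0.9, 0.01, 0.9], [0.25, 0.25, 0.25]
mix = bayes_mixture_prob(0.5, cond1, cond2)
print(mix, 0.5 * prod(cond1) + 0.5 * prod(cond2))
```

The two printed values agree up to rounding, and the mixture probability is at least half of the better of the two policies' probabilities, which is the factor 2 mentioned above.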

Local mixing of policies, fixed rate

Bayes mixing of two policies splits the search into 2 (mostly) independent searches. But one may want to mix at a more ‘local’ level: Along a trajectory , if the provided policy assigns high probability to almost all actions but a very low probability to a few ones, we may want to use a different policy just for these actions, and not for the whole trajectory. Thus, given two policies and and , the local-mixing policy is defined through its conditional probability . Then for all ,

where is the set of steps where . This can be interpreted as ‘At each step , must pay a factor of to use policy or a factor of to use ’. This works well for example if and is small, that is, the policy is used most of the time. For example, can be the uniform policy, , and is a given policy that may sometimes be wrong.
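A common instance of fixed-rate local mixing blends a given policy with the uniform policy at every step. The sketch below assumes the local mixture is the per-step convex combination of the two policies' conditional probabilities, with a hypothetical mixing rate `epsilon` on the uniform policy; the function name and the dict-based representation of a conditional distribution are illustrative.

```python
def mix_with_uniform(action_probs, epsilon=0.01):
    """Per-step mixture: (1 - epsilon) * given policy + epsilon * uniform policy."""
    uniform = 1.0 / len(action_probs)
    return {a: (1 - epsilon) * p + epsilon * uniform for a, p in action_probs.items()}

# A policy that wrongly rules out three of the four actions in some state.
overconfident = {"up": 1.0, "down": 0.0, "left": 0.0, "right": 0.0}
mixed = mix_with_uniform(overconfident, epsilon=0.01)
print(mixed)
```

After mixing, every action keeps a conditional probability of at least `epsilon / 4`, so no sequence of actions is assigned probability zero.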

Local mixing, varying rate

The problem with the previous approach is that needs to be fixed in advance. For a depth , a penalty of the number of node expansions of is large as soon as . If no good bound on is known, one can use a more adaptive with : This gives , which means that the maximum price to pay to use only the policy for all the steps is at most , and the price to pay each step the policy is used is approximately . The optimal value of can also be learned automatically using an algorithm such as Soft-Bayes (orseau2017softbayes) where the ‘experts’ are the provided policies, but this may have a large probability overhead for this setup.

7 Experiments: computer-generated Sokoban

We test our algorithms on 1,000 computer-generated levels of Sokoban (racaniere2017imagination) of 10x10 grid cells and 4 boxes. [Footnote 3: The levels are available at …] For the policy, we use a neural network pre-trained with A3C (details on the architecture and the learning procedure are in Appendix A). We picked the best performing network out of 4 runs with different learning rates. Once the network is trained, we compare the different algorithms using the same network’s fixed Markovian policy. Note that for each new level, the goal states (and thus target set) are different, whereas the policy does not change (but still depends on the state). We test the following algorithms and parameters: LubyTS(256, 1), LubyTS(256, 32), LubyTS(512, 32), multiTS(1, 200), multiTS(100, 200), multiTS(200, 200), LevinTS. Excluding the small values (i.e., nsims = 1 and $d_{\min} = 1$), the parameters were chosen to obtain a total number of expansions within the same order of magnitude. In addition to the policy trained with A3C, we tested LevinTS, LubyTS, and multiTS with a variant of the policy in which we add 1% of noise to the probabilities output of the neural network. That is, these variants use the policy $\tilde\pi(a \mid n) = (1 - \varepsilon)\,\pi(a \mid n) + \varepsilon/4$, where $\pi$ is the network’s policy and $\varepsilon = 0.01$, to guide their search. These variants are marked with the symbol (*) in the table of results. We compare our policy tree search methods with a version of the LAMA planner (Richter2010) that uses the lazy version of GBFS with preferred operators and queue alternation with the FF heuristic. This version of LAMA is implemented in Fast Downward (Helmert06thefast), a domain-independent solver. We used this version of LAMA because it was shown to perform better than other state-of-the-art planners on Sokoban problems (XieNM12). Moreover, similarly to our methods, LAMA searches for a solution of small depth rather than a solution of minimal depth.

Table 1 presents the number of levels solved (“Solved”), average solution length (“Avg. length”), longest solution length (“Max. length”), and total number of nodes expanded (“Total expansions”). The top part of the table shows the sampling-based randomized algorithms. In addition to the average values, we present the standard deviation of five independent runs of these algorithms. Since LevinTS and LAMA are deterministic, we present a single run of these approaches.

Fig. 2 shows the number of nodes expanded per level by each method when the levels are independently sorted for each approach from the easiest to the hardest Sokoban level in terms of node expansions. The Uniform searcher (LevinTS with a uniform policy) with maximum 100,000 node expansions per level—and still with state cuts—can solve no more than 9% of the levels, which shows that the problem is not trivial.

| Algorithm | Solved | Avg. length | Max. length | Total expansions |
|---|---|---|---|---|
| Uniform | 88 | 19 | 59 | 94,423,278 |
| LubyTS(256, 1) | 753 ± 5 | 41.0 ± 0.6 | 228 ± 18.6 | 638,481 ± 2,434 |
| LubyTS(256, 32) | 870 ± 2 | 48.4 ± 0.9 | 1,638.4 ± 540.7 | 6,246,293 ± 73,382 |
| LubyTS(512, 32) | 884 ± 4 | 54.8 ± 4.2 | 3,266.6 ± 1,287.8 | 11,515,937 ± 211,524 |
| LubyTS(512, 32) (*) | 896 ± 2 | 50.7 ± 2.5 | 1,975.6 ± 904.5 | 10,730,753 ± 164,410 |
| multiTS(1, 200) | 669 ± 5 | 41.3 ± 0.6 | 196.4 ± 2.2 | 93,768 ± 925 |
| multiTS(100, 200) | 866 ± 4 | 47.8 ± 0.5 | 199.4 ± 0.5 | 3,260,536 ± 57,185 |
| multiTS(200, 200) | 881 ± 1 | 47.9 ± 0.7 | 196.4 ± 2.3 | 5,768,680 ± 116,152 |
| multiTS(200, 200) (*) | 895 ± 3 | 48.8 ± 0.4 | 198.8 ± 1 | 5,389,534 ± 45,085 |
| LevinTS | 1,000 | 39.8 | 106 | 6,602,666 |
| LevinTS (*) | 1,000 | 39.5 | 106 | 5,026,200 |
| LAMA | 1,000 | 51.6 | 185 | 3,151,325 |

Table 1: Comparison of different solvers on the 1,000 computer-generated levels of Sokoban. For randomized solvers (shown at the top part of the table), the results are aggregated over 5 random seeds (± indicates standard deviation). (*) Uses $\tilde\pi$ with $\varepsilon = 0.01$.
Figure 2: Node expansions for Sokoban on log-scale. The levels indices (x-axis) are sorted independently for each solver from the easiest to the hardest level. For clarity a typical run has been chosen for randomized solvers; see Table 1 for standard deviations.

For most of the levels, LevinTS (with the A3C policy) expands many fewer nodes than LAMA, but has to expand many more nodes on the last few levels. On 998 instances, the cumulative number of expansions taken by LevinTS is ~2.7e6 nodes while LAMA expands ~3.1e6 nodes. These numbers contrast with the number of expansions required by LevinTS (6.6e6) and LAMA (3.15e6) to solve all 1,000 levels. The addition of noise to the policy reduces the number of nodes expanded by LevinTS while solving harder instances at the cost of increasing the number of nodes expanded for easier problems (see the lines of the two versions of LevinTS crossing at the right-hand side of Fig. 2). Overall, noise reduces from 6.6e6 to 5e6 the total number of nodes LevinTS expands (see Table 1). LevinTS has to expand a large number of nodes for a small number of levels likely due to the training procedure used to derive the policy. That is, the policy is learned only from the 65% easiest levels solved after sampling single trajectories—harder levels are never solved during training. Nevertheless, LevinTS can still solve harder instances by compensating the lack of policy guidance with search.

The sampling-based methods have a hard time reaching 90% success, but still improve by more than 20% over sampling a single trajectory. LubyTS(256, 32) improves substantially over LubyTS(256, 1) since many solutions have length around 30 steps. LubyTS(256, 32) is as good as multiTS(200, 100) that uses a hand-tuned upper bound on the length of the solutions.

The solutions found by LevinTS are noticeably shorter (in terms of number of moves) than those found by LAMA. It is remarkable that LevinTS can find shorter solutions and expand fewer nodes than LAMA for most of the levels. This is likely due to the combination of good search guidance through the policy for most of the problems and LevinTS’s systematic search procedure. By contrast, due to its sampling-based approach, LubyTS tends to find very long solutions.

racaniere2017imagination report different neural-network based solvers applied to a long sequence of Sokoban levels generated by the same system used in our experiments (although we use a different random seed to generate the levels, we believe they are of the same complexity). racaniere2017imagination’s primary goal was not to produce an efficient solver per se, but to demonstrate how an integrated neural-based learning and planning system can be robust to model errors and more efficient than an MCTS baseline. Their MCTS approach solves 87% of the levels within approximately 30e6 node expansions (25,000 per level for 870 levels, and 500 simulations of 120 steps for the remaining 130 levels). Although LevinTS had much stronger results in our experiments, we note that racaniere2017imagination’s implementation of MCTS commits to an action every 500 node expansions. By contrast, in our experimental setup, we assume that LevinTS solves the problem before committing to an action. This difference makes the results not directly comparable. racaniere2017imagination’s second solver (I2A) is a hybrid model-free and model-based planner using an LSTM-based recurrent neural network with more learning components than our approaches. I2A reaches 95% success within an estimated total of 5.3e6 node expansions (4,000 on average over 950 levels, and 30,000 steps for the remaining 50 unsolved levels; this counts the internal planning steps). For comparison, LevinTS with 1% noise solves all the levels within the same total time (999 for LevinTS without noise). Moreover, LevinTS solves 95% of the levels within a total of less than 160,000 steps, which is approximately 168 node expansions on average for solved levels, compared to the reported 4,000 for I2A. It is also not clear how long it would take I2A to solve the remaining 5%.

8 Conclusions and future works

We introduced two novel tree search algorithms for single-agent problems that are guided by a policy: LevinTS and LubyTS. Both algorithms have guarantees on the number of nodes that they expand before reaching a solution (strictly for LevinTS, in expectation for LubyTS). LevinTS and LubyTS depart from the traditional heuristic approach to tree search by employing a policy instead of a heuristic function to guide search while still offering important guarantees.

The results on the computer-generated Sokoban problems using a pre-trained neural network show that these algorithms can largely improve through tree search upon the score of the network during training. Our results also showed that LevinTS is able to solve most of the levels used in our experiment while expanding many fewer nodes than a state-of-the-art heuristic search planner. In addition, LevinTS was able to find considerably shorter solutions than the planner.

The policy can be learned by various means, or it can even be handcrafted. In this paper we used reinforcement learning to learn the policy. However, the bounds offered by the algorithms could also serve directly as metrics to be optimized while learning a policy; this is a research direction we are interested in investigating in future work.


The authors wish to thank Peter Sunehag, Andras Gyorgy, Rémi Munos, Joel Veness, Arthur Guez, Marc Lanctot, André Grahl Pereira, and Michael Bowling for helpful discussions pertaining to this research. Financial support for this research was in part provided by the Natural Sciences and Engineering Research Council of Canada (NSERC).


Appendix A Network architecture and learning protocol

The network takes as input a 10x10x4 grid where the last dimension is a binary encoding of the different attributes (wall, man, goal, box), which is passed through 2 convolutional layers (both with 64 channels), followed by a fully connected layer of 512 ReLU units. The output layer provides logits for the 4 actions (up, down, left, right). Training is performed using A3C with a reward function giving a reward of -0.1 per step, +1 per box on a goal and -1 for the converse action, and +10 for solving the level (all boxes on goals), with a discount factor of 0.99; the optimizer used is RMSProp [Tieleman2012rmsprop] (no momentum, epsilon 0.1, decay 0.99), with entropy regularization of 0.005. During training, at each episode, the learner performs a single trajectory of length 100 (like multiTS(1, 100)), receives the corresponding rewards, then moves on to the next episode. A single level is (very likely) never seen twice during training. Similarly, it is very unlikely that any of the 1000 test levels was seen during training. We take the best performing network, which solves around 65% of the levels when sampling a single sequence of actions. The network is trained for 3.5e9 steps (node expansions), which may seem like a lot; note, however, that this is equivalent to fully searching a single level of Sokoban (without state cuts) uniformly with 4 actions up to depth 16 (whereas solutions are usually at depth greater than 30). The learning process was repeated for 4 learning rates (4e-4, 2e-4, 1e-4, 5e-5) (see Fig. 3).
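As an order-of-magnitude check of the comparison between the training budget and a uniform search, the Python sketch below counts the nodes of a complete tree with 4 actions per node and no state cuts. The function name `nodes_up_to_depth` is ours and not part of the original work.

```python
def nodes_up_to_depth(branching: int, depth: int) -> int:
    """Number of nodes in a complete tree, root at depth 0, down to `depth`."""
    return sum(branching ** d for d in range(depth + 1))

TRAINING_STEPS = 3.5e9  # training budget quoted in the text

for depth in (15, 16, 30):
    print(f"depth {depth:2d}: {nodes_up_to_depth(4, depth):.2e} nodes")

# The training budget falls between the full trees of depth 15 and depth 16.
print(nodes_up_to_depth(4, 15) < TRAINING_STEPS < nodes_up_to_depth(4, 16))
```

The depth-30 line illustrates why exhaustive uniform search is hopeless at typical solution depths: the complete tree there has on the order of 1e18 nodes.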

Figure 3: Learning curves of A3C for the 4 chosen learning rates (4e-4, 2e-4, 1e-4, 5e-5) on the Sokoban level generator.

Appendix B Another universal restarting strategy for Las Vegas programs

We use the sequence A6519 of runtimes, in which the n-th runtime is the largest power of 2 that divides n:

1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 16 1 2 1…
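The listed prefix matches the sequence whose n-th term is the largest power of 2 dividing n (OEIS A006519). Assuming that is the intended definition, a minimal Python sketch for generating it is given below; the function name `runtime` is illustrative.

```python
def runtime(n: int) -> int:
    """Largest power of 2 dividing n, for n >= 1 (lowest set bit of n)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return n & -n

# First 19 terms, to compare against the prefix listed in the text.
prefix = [runtime(n) for n in range(1, 20)]
print(" ".join(map(str, prefix)))
```

The expression `n & -n` isolates the lowest set bit of `n`, which is exactly the largest power of 2 that divides it.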

It has the ‘fractal’ property that (since ), for and , and it follows that and .

At iteration , the Las Vegas program is run for steps. For all , if , then it has probability at least of halting, otherwise it does not halt and is forcibly stopped after computation steps. Let be the smallest power of 2 greater than or equal to . Then Lemma 8 below tells us that for we have that , that is, between two consecutive factors of , .
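The restart loop described here can be sketched as follows, assuming the runtime at iteration n is the largest power of 2 dividing n (consistent with the sequence listed above). The `try_run` callback is a hypothetical stand-in for the Las Vegas program: it receives a step budget and reports whether the program halted within it. The sketch charges the full budget for every run, so the returned total is an upper bound on the steps actually used.

```python
from typing import Callable

def runtime(n: int) -> int:
    """Assumed schedule: largest power of 2 dividing n, for n >= 1."""
    return n & -n

def run_with_restarts(try_run: Callable[[int], bool],
                      max_runs: int = 1_000_000) -> int:
    """Restart `try_run` with budgets runtime(1), runtime(2), ... until it halts.

    Returns the sum of the budgets of all runs, including the successful one.
    """
    total_steps = 0
    for n in range(1, max_runs + 1):
        budget = runtime(n)
        total_steps += budget      # charge the whole budget of this run
        if try_run(budget):        # hypothetical program: True if it halted
            return total_steps
    raise RuntimeError("program did not halt within max_runs restarts")

# Toy program that halts if and only if it is given at least 4 steps.
total = run_with_restarts(lambda budget: budget >= 4)
print(total)  # budgets 1, 2, 1, 4 are used, so the total is 8
```

A randomized program can be plugged in the same way, for example a callback that draws a halting time from some distribution and compares it with the budget.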

Let denote the probability that the algorithm halts exactly at the th run, and take and , then the expected number of computation steps (sum of the lengths of the runs) before halting is given by:

where when , and otherwise.

We restate Theorem 5 more precisely:

Theorem 7.

For all distributions over halting times, the expected runtime of the universal restarting strategy based on is bounded by

where is the cumulative distribution of .

Proof of Theorems 7 and 5.

At step , if is the number of past runs where (with ), then with and :

where we used (remembering that is a power of 2) and Lemma 8 for . Since when , we can decompose into the steps where and the rest:

where we used Lemma 13 on the last line with . Finally, since and and :

which proves the result. ∎

Lemma 8.

For A6519, with and , and with odd, then


Since is odd, then so is , and so . ∎

Hence, for all numbers between two adjacent factors of , .

Lemma 9.

For and A6519,


If and using Lemma 8 again at :

Lemma 10.

Let A6519, then for :


Since any number can be uniquely written in the form , and with , then . ∎

Lemma 11.

For ,


Let for , then where for the unique such that and since , we have that is positive for and negative for . Thus is unimodal, and since furthermore is positive the sum can be upper bounded by the integral of the continuous function plus its maximum:

where we used integration by substitution. Adding the two terms finishes the proof. ∎

Lemma 12.

For and :


Let , then