Introduction
One of the natural approaches for selecting actions in very large state spaces is to perform a limited amount of lookahead. In the context of discounted MDPs, Kearns, Mansour, and Ng have shown that near-optimal actions can be selected by considering a sampled lookahead tree that is sufficiently sparse, whose size depends on the discount factor and the suboptimality bound but not on the number of problem states [Kearns, Mansour, and Ng 1999]. The UCT algorithm [Kocsis and Szepesvári 2006] is a version of this form of Monte-Carlo planning, where the lookahead trees are not grown depth-first but 'best-first', following a selection criterion that balances 'exploration' and 'exploitation', borrowed from the UCB algorithm for multi-armed bandit problems [Auer, Cesa-Bianchi, and Fischer 2002].
While UCT does not inherit the theoretical properties of either Sparse Sampling or UCB, UCT is an anytime optimal algorithm for discounted or finite horizon MDPs that eventually picks the optimal actions when given sufficient time. The popularity of UCT follows from its success in the game of Go, where it outperformed all other approaches [Gelly and Silver 2007]; a success that has been replicated in other models and tasks such as Real-Time Strategy Games [Balla and Fern 2009], General Game Playing [Finnsson and Björnsson 2008], and POMDPs [Silver and Veness 2010].
An original motivation for the work reported in this paper was to get a better understanding of the success of UCT and related Monte-Carlo Tree Search (MCTS) methods [Chaslot et al. 2008].^1 It has been argued that this success is the result of adaptive sampling methods: sampling methods that achieve a good exploration-exploitation tradeoff. Yet adaptive sampling methods like Real-Time Dynamic Programming (RTDP) [Barto, Bradtke, and Singh 1995] have been used before in planning. For us, another important reason for the success of MCTS methods is that they address a slightly different problem; a problem that can be characterized as:

^1 For a different attempt at understanding the success of UCT, see [Ramanujan, Sabharwal, and Selman 2010].
1. anytime action selection over MDPs (and related models) given a time window, resulting in good selection when the window is short, and near-optimal selection when the window is sufficiently large, along with

2. non-exhaustive search combined with the ability to use informed base policies for improved performance.
From this perspective, an algorithm like RTDP fails on two grounds: first, RTDP does not appear to make best use of short time windows in large state spaces; second, and more importantly, RTDP can use admissible heuristics but not informed base policies. On the other hand, algorithms like Policy Iteration [Howard 1971] deliver all of these features except one: they are exhaustive, and thus, even to get started, they need vectors with the size of the state space. At the same time, while there are non-exhaustive versions of (asynchronous) Value Iteration such as RTDP, there are no similar 'focused' versions of Policy Iteration ensuring anytime optimality. Rollouts and nested rollouts are two practical and focused versions of Policy Iteration over large spaces [Bertsekas, Tsitsiklis, and Wu 1997; Yan et al. 2005], yet neither one aims at optimality.^2

^2 Nested rollouts can deliver optimal action selection, but only for impracticably large levels of nesting.

In this work, we introduce a new, simple heuristic search algorithm designed to address points 1 and 2 above, and compare it with UCT. We call the new algorithm Anytime AO*, because it is a very simple variation of the classical AO* algorithm for AND/OR graphs [Nilsson 1980] that is optimal even in the presence of non-admissible heuristics. Anytime AO* is related to recent anytime and online heuristic search algorithms for OR graphs [Likhachev, Gordon, and Thrun 2003; Hansen and Zhou 2007; Koenig and Sun 2009; Thayer and Ruml 2010]. It is well known that A* and other best-first algorithms can be easily converted into anytime optimal algorithms in the presence of non-admissible heuristics by simply continuing the search after the first solution is found. The same trick, however, does not work for best-first algorithms over AND/OR graphs, which must be able to expand leaf nodes of the explicit graph that are not part of the best partial solution. Thus, Anytime AO* differs from AO* in two main points: first, with probability p, it expands leaf nodes that are not part of the best partial solution; second, the search finishes when time is up or there are no more leaf nodes to expand at all (not just in the best solution graph). Anytime AO* delivers an optimal policy eventually, and can also use random heuristics that are not admissible and can be sampled, as when the heuristics result from rollouts of a given base policy.

MDPs
Markov Decision Processes are fully observable, stochastic state models. In the discounted reward formulation, an MDP is given by a set of states S, sets of actions A(s) applicable in each state s, transition probabilities P_a(s'|s) of moving from s into s' when the action a in A(s) is applied, real rewards r(a, s) for doing action a in the state s, and a discount factor γ, 0 ≤ γ < 1. A solution is a policy π selecting an action π(s) in A(s) in each state s. A policy is optimal if it maximizes the expected accumulated discounted reward.
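As an illustration, a discounted MDP of this kind can be held in a minimal container; this is only a sketch with names of our own choosing (not the paper's implementation), matching the definitions above:

```python
import random
from dataclasses import dataclass

@dataclass
class MDP:
    """Minimal discounted-reward MDP container (illustrative names only)."""
    states: list      # set of states S
    actions: dict     # actions[s]: list of actions A(s) applicable in s
    transition: dict  # transition[(s, a)]: list of (s', P_a(s'|s)) pairs
    reward: dict      # reward[(a, s)]: immediate reward r(a, s)
    gamma: float = 0.95  # discount factor, 0 <= gamma < 1

    def successors(self, s, a):
        """All (s', prob) pairs with P_a(s'|s) > 0 -- the model-based view."""
        return self.transition[(s, a)]

    def sample_next(self, s, a, rng=random):
        """Sample s' with probability P_a(s'|s) -- what a simulator exposes."""
        nexts, probs = zip(*self.transition[(s, a)])
        return rng.choices(nexts, weights=probs, k=1)[0]
```

The two accessors mirror the distinction drawn later in the paper: AO* uses the explicit probabilities (`successors`), while UCT only needs a simulator (`sample_next`).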
Undiscounted finite horizon MDPs replace the discount factor by a positive integer horizon H. The policies for such MDPs are functions mapping nodes (s, d) into actions a in A(s), where s is a state and d is the horizon to go, 0 < d ≤ H. Optimal policies maximize the total reward that can be accumulated in H steps. Finite horizon MDPs are acyclic, as all actions decrement the horizon to go by 1.
Undiscounted infinite horizon MDPs are like discounted MDPs but with discount γ = 1. For such MDPs to have well-defined solutions, it is common to assume that rewards are negative, and thus represent positive costs, except in certain goal states that are cost-free and absorbing. If these goal states are reachable from all the other states with positive probability, the set of optimal policies is well defined.
The MDPs above are reward-based. AO* is used normally in a cost setting, where rewards r(a, s) are replaced by costs c(a, s) = −r(a, s), and maximization is replaced by minimization. The two points of view are equivalent, and we will use one or the other when most convenient.
We are interested in the problem of selecting the action to do in the current state s of a given infinite horizon MDP, whether discounted or undiscounted. This will be achieved by a lookahead that uses a limited time window to run an anytime optimal algorithm over the version of the MDP that results from fixing the horizon.
AO*
A finite horizon MDP defines an implicit AND/OR graph that can be solved by the well-known AO* algorithm [Nilsson 1980]. The root node of this graph is the pair (s, H), where s is the initial state and H is the horizon, while the terminal nodes are of the form (s, d), where s is a state and d is 0, or s is a terminal state (goal or dead-end). The children of a non-terminal node (s, d) are the triplets (a, s, d), where a is an action in A(s), while the children of a node (a, s, d) are the nodes (s', d−1), where s' is a possible successor state of s under a; i.e., P_a(s'|s) > 0. Non-terminal nodes of the form (s, d) are OR-nodes, where an action needs to be selected, while nodes (a, s, d) are non-terminal AND-nodes.
The solution graphs of these AND/OR graphs are defined in the standard way; they include the root node (s, H), and recursively, one child of every OR-node and all children of every AND-node. The value of a solution is defined recursively: the leaves have value equal to 0, the OR-nodes have value equal to the value of the selected child, and the AND-nodes (a, s, d) have value equal to the cost c(a, s) of the action a in s plus the sum of the values of the children (s', d−1) weighted by their probabilities P_a(s'|s). The optimal solution graph yields a minimum value to the root node (cost setting), and can be computed as part of the evaluation procedure, marking the best child (min cost child) of each OR-node. This procedure is called backward induction. The algorithm AO* computes an optimal solution graph in a more selective and incremental manner, using a heuristic function h over the nodes of the graph that is admissible or optimistic (does not overestimate in the cost setting).
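The backward induction procedure just described can be sketched as follows (cost setting; the function names and data layout are our own, not the paper's Fig. 1):

```python
def backward_induction(s0, H, actions, succ, cost):
    """Evaluate the implicit AND/OR graph of a finite-horizon MDP bottom-up.

    actions(s) -> applicable actions; succ(s, a) -> [(s', prob)];
    cost(a, s) -> immediate cost.  Returns the value table V and the
    best-action table.  Terminal nodes (d == 0 or no actions) have value 0.
    """
    V, best = {}, {}

    def value(s, d):
        if d == 0 or not actions(s):          # terminal node
            return 0.0
        if (s, d) in V:                       # memoized: duplicates share values
            return V[(s, d)]
        # AND-nodes: cost plus probability-weighted sum over successors;
        # OR-node: minimize over applicable actions, marking the best child.
        q = {a: cost(a, s) + sum(p * value(s2, d - 1) for s2, p in succ(s, a))
             for a in actions(s)}
        best[(s, d)] = min(q, key=q.get)
        V[(s, d)] = q[best[(s, d)]]
        return V[(s, d)]

    value(s0, H)
    return V, best
```

The memoization on (s, d) pairs corresponds to treating the problem as a graph rather than a tree, as the experimental implementations described later do.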
AO* maintains a graph that explicates part of the implicit AND/OR graph, the explicit graph, and a second graph, the best partial solution, that represents an optimal solution of the explicit graph under the assumption that its tips are the terminal nodes, with values given by the heuristic h. Initially, the explicit graph contains the root node (s, H) of the implicit graph only, and coincides with the best partial solution. Then, iteratively, a non-terminal leaf node is selected from the best partial solution, and the children of this node in the implicit graph are explicated in the explicit graph. The best partial solution is then revised by applying an incremental form of backward induction, setting the values of the leaves of the explicit graph to their heuristic values. The procedure finishes when there are no leaf nodes in the best partial graph. If the heuristic values are optimistic, the best partial solution is then an optimal solution to the implicit AND/OR graph, which is partially explicated. In the best case, the explicit graph ends up containing no more nodes than those in the best solution graph; in the worst case, it ends up explicating all the nodes of the implicit graph. Code for AO* is shown in Fig. 1. The choice of which non-terminal leaf node in the best partial graph to expand is important for performance, but it is not part of the algorithm and does not affect its optimality.
UCT
UCT has some of the flavour of AO*, and it is often presented as a best-first search algorithm too.^3 Indeed, UCT maintains an explicit partial graph that is expanded incrementally, like the explicit graph in AO*. The main differences are in how leaf nodes of this graph are selected and expanded, and in how heuristic values for leaves are obtained and propagated. The code for UCT, which is more naturally described as a recursive algorithm, is shown in Fig. 2.

^3 This is technically wrong, though, as a best-first algorithm just expands nodes in the best partial solution, in correspondence to what is called 'exploitation'. Yet UCT, like the other MCTS algorithms and Anytime AO* below, does 'exploration' as well.
UCT consists of a sequence of stochastic simulations, like RTDP, that start at the root node. However, while the choice of successor states is stochastic, the choice of the actions is not greedy on the action values as in RTDP, but greedy on the sum of the action values Q(a, s, d) and a bonus term of the form C·sqrt(log N(s, d) / N(a, s, d)) that ensures that all applicable actions are tried in all states infinitely often at suitable rates. Here, C is an exploration constant, and N(s, d) and N(a, s, d) are counters that track the number of simulations that have passed through the node (s, d) and the number of times that action a has been selected at that node.
The counters N(s, d) and N(a, s, d) are maintained for the nodes in the explicit graph, which is extended incrementally, starting, as in AO*, with the single root node (s, H). The simulations start at the root and terminate at a terminal node or at the first node that is not in the graph. In between, UCT selects an action a that is greedy on the stored value plus the bonus term, samples the next state s' with probability P_a(s'|s), increments the counters N(s, d) and N(a, s, d), and generates the node (s', d−1). When a node (s', d−1) is generated that is not in the explicit graph, the node is added to the explicit graph, its counters and values are allocated and initialized to 0, and a total discounted reward is sampled by simulating a base policy for d−1 steps starting at s', and propagated upward along the nodes in the simulated path. These values are not propagated using full Bellman backups as in AO* or Value Iteration, but through Monte-Carlo backups that extend the current average with a new sampled value [Sutton and Barto 1998]; see Fig. 2 for details.
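A compact sketch of this simulation loop is given below (reward setting, over the tree of (s, d) nodes). The bonus form, the helper names, and the handling of untried actions are our own assumptions, not the paper's Fig. 2:

```python
import math
import random

def uct_search(s0, H, actions, sample_next, reward, base_rollout,
               iterations=1000, C=1.0, rng=random):
    """Simulation-based UCT over the finite-horizon tree of (s, d) nodes.

    N counts visits to nodes and to (action, node) pairs; Q holds the
    Monte-Carlo averages.  base_rollout(s, d) returns the sampled total
    reward of running the base policy for d steps from s.
    """
    N, Na, Q = {}, {}, {}

    def simulate(s, d):
        if d == 0 or not actions(s):
            return 0.0
        if (s, d) not in N:                   # new node: add it, evaluate by rollout
            N[(s, d)] = 0
            for a in actions(s):
                Na[(a, s, d)] = 0
                Q[(a, s, d)] = 0.0
            return base_rollout(s, d)

        def ucb(a):                           # value plus exploration bonus
            if Na[(a, s, d)] == 0:
                return float('inf')           # untried actions go first
            return Q[(a, s, d)] + C * math.sqrt(
                math.log(N[(s, d)]) / Na[(a, s, d)])

        a = max(actions(s), key=ucb)
        s2 = sample_next(s, a, rng)
        r = reward(a, s) + simulate(s2, d - 1)
        # Monte-Carlo backup: extend the running average with the new sample
        N[(s, d)] += 1
        Na[(a, s, d)] += 1
        Q[(a, s, d)] += (r - Q[(a, s, d)]) / Na[(a, s, d)]
        return r

    for _ in range(iterations):
        simulate(s0, H)
    return max(actions(s0), key=lambda a: Q[(a, s0, H)])
```

Note how a simulation that reaches a node outside the graph adds it and stops there, returning the rollout value: this is the leaf evaluation step that distinguishes UCT from the full Bellman backups of AO*.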
It can be shown that UCT eventually explicates the whole finite horizon MDP graph, and that it converges to the optimal policy asymptotically. Unlike AO*, however, UCT does not have a termination condition, and moreover, UCT does not necessarily add a new node to the graph in every iteration, even when there are such nodes to explicate. Indeed, while in the worst case AO* converges in a number of iterations that is bounded by the number of nodes in the implicit graph, UCT may require an exponential number of iterations [Munos and Coquelin 2007; Walsh, Goschin, and Littman 2010]. On the other hand, AO* is a model-based approach to planning that assumes that the probabilities and costs are known, while UCT is a simulation-based approach that just requires a simulator that can be reset to the current state.
The differences between UCT and AO* are in the leaves of the graphs selected for expansion, the way they are expanded, the values to which they are initialized, and how these values are propagated. These dimensions are the ones used to define a family of Monte Carlo Tree Search (MCTS) methods that includes UCT as the best known member [Chaslot et al. 2008]. The feature that is common to this family of methods is the use of Monte Carlo simulations to evaluate the leaves of the graph. The resulting values are heuristic, although not necessarily optimistic as required for the optimality of AO*. One way to bridge the gap between AO* and MCTS methods is by modifying AO* to accommodate non-admissible heuristics, and moreover, random heuristics that can be sampled, such as the cost of a base policy.
Anytime AO*
Anytime AO* involves two small changes from AO*. The first, shown in Fig. 3, is for handling non-admissible heuristics: rather than always selecting a non-terminal tip node from the explicit graph that is IN the best partial graph, Anytime AO* selects with probability p a non-terminal tip node from the explicit graph that is OUT of the best partial graph. The probability p is a given parameter between 0 and 1, by default p = 0.5. Of course, if a choice for an IN node is decided (a tip in the best partial graph) but there is no such node, then an OUT choice is forced, and vice versa. Anytime AO* terminates when neither IN nor OUT choices are possible, i.e., when no tip nodes in the explicit graph are left, or when the time is up.
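The modified selection step can be sketched as follows (a sketch with hypothetical helper names; the probability parameter is called p here, following the description above):

```python
import random

def select_tip(in_tips, out_tips, p=0.5, rng=random):
    """Anytime AO* tip selection: with probability p pick a tip OUT of the
    best partial graph, otherwise a tip IN it.  Fall back to the other set
    when the chosen one is empty, and signal termination when both are."""
    if not in_tips and not out_tips:
        return None                       # no tips left: search is complete
    prefer_out = rng.random() < p
    if prefer_out:
        return out_tips.pop() if out_tips else in_tips.pop()
    return in_tips.pop() if in_tips else out_tips.pop()
```

The `None` return corresponds to the termination condition of Anytime AO*: the implicit graph has been fully explicated.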
It is easy to see that Anytime AO* is optimal, whether the heuristic is admissible or not, because it terminates when the implicit graph has been fully explicated. In the worst case, the complexity of Anytime AO* is no worse than that of AO*, as AO* expands the complete graph in the worst case too.
The second change in Anytime AO*, not shown in the figure, is for dealing with random heuristics h. Basically, when the value of a tip node n is set to a heuristic h(n) that is a random variable, such as the reward obtained by following a base policy for a number of steps from n, Anytime AO* uses samples of h(n) until the node is expanded. Until then, a 'fetch' for the value h(n), which occurs each time that a parent node of n is updated, results in a new sample of h(n), which is averaged with the previous ones. This is implemented in the standard fashion, by incrementally updating the value using a counter and the new sample. These counters are no longer needed when the node is expanded, as then the value of the node is given by the values of its children in the graph.
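The incremental averaging of heuristic samples is the standard running-mean update; a minimal sketch (class and method names are ours):

```python
class SampledHeuristic:
    """Running average of samples of a random heuristic h(n), e.g. rollout
    rewards of a base policy, refreshed on each fetch until n is expanded."""

    def __init__(self, sampler):
        self.sampler = sampler            # sampler(n) -> one sample of h(n)
        self.count = {}
        self.mean = {}

    def fetch(self, n):
        """Called each time a parent of n is updated: draw one more sample
        and fold it into the running mean for n."""
        x = self.sampler(n)
        c = self.count.get(n, 0) + 1
        m = self.mean.get(n, 0.0)
        m += (x - m) / c                  # incremental mean update
        self.count[n], self.mean[n] = c, m
        return m

    def expand(self, n):
        """Counters are no longer needed once n is expanded: its value is
        then given by the values of its children."""
        self.count.pop(n, None)
        self.mean.pop(n, None)
```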
Choice of Tip Nodes in Anytime AO*
AO* and Anytime AO* leave open the criterion for selecting the tip node to expand. For the experiments, we use a selection criterion aimed at selecting the tip nodes that can have the biggest potential impact on the best partial graph. For this, a function Δ(n) is introduced that measures the change in the value of the node n that is needed in order to produce a change in the best partial graph. The function Δ is defined top-down over the explicit graph as:
- For the root node n, Δ(n) = ∞.

- For a child n = (a, s, d) of an OR-node (s, d) in the best solution graph, Δ(n) is the gap between the value of n and the value of the best child if a is not the best action in (s, d); else it is the minimum between Δ(s, d) and the gap to the second-best child of (s, d).

- For a child n of an OR-node (s, d) that is not in the best solution graph, Δ(n) is Δ(s, d).

- For a child n = (s', d−1) of an AND-node (a, s, d), Δ(n) is Δ(a, s, d) divided by the probability P_a(s'|s).
The tip nodes that are chosen for expansion are the ones that minimize the values Δ(n). Since this computation is expensive, as it involves a complete traversal of the explicit graph, we select several tip nodes for expansion at a time. This selection is implemented by using two priority queues during the graph traversal: one for selecting the best tips in the best partial solution graph (IN), and one for selecting the best tips OUT of it. The first selected tip is the one with minimum Δ in the IN queue with probability 1 − p, and in the OUT queue otherwise. Once selected, the tip is removed from its queue, and the process is repeated the desired number of times.
Experimental Results
We have evaluated Anytime AO* (abbreviated AOT) vs. UCT as an action selection mechanism over a number of MDPs. In each case, we run the planning algorithm for a number of iterations from the current state s, apply the best action according to the resulting values for the root node (s, H), where H is the planning horizon, and repeat the loop from the state that results, until the goal is reached. The quality profile of an algorithm is given by the average cost to the goal as a function of the time window for action selection. Each data point in the profiles (and table) is the average over 1,000 sampled episodes that finish when the goal is reached or after 100 steps. For the profiles, the x axis stands for the average time per action selected, and the y axis for the average cost to the goal. AOT was run with parameter p = 0.5 (i.e., equal probability for IN and OUT choices), and UCT with the exploration 'constant' C set to the current value of the node; a choice that appears to be standard [Balla and Fern 2009; Eyerich, Keller, and Helmert 2010]. The number of tips selected at a time in the computation of the Δ's was set as a fraction of the number of iterations.^4 The actual codes for UCT and AOT differ from the ones shown in that they deal with graphs instead of trees; i.e., there are no duplicate nodes with the same state s and depth d. The same applies to the computation of the Δ's. All experiments were run on Xeon 'Woodcrest' computers of 2.33 GHz and 8 GB of RAM.

^4 The number of tip nodes in UCT is bounded by the number of iterations (rollouts). In AOT, the number of tips is bounded in the worst case by the number of iterations (expansions) multiplied by the number of actions and successor states per node.
CTP.
The Canadian Traveller Problem (CTP) is a path finding problem over a graph whose edges (roads) may be blocked [Papadimitriou and Yannakakis 1991]. Each edge is blocked with a given prior probability, and the status of an edge (blocked or free) can be sensed noise-free from either end of the edge. The problem is an acyclic POMDP that results in a belief MDP whose states are given by the agent position and the edge beliefs, which can take 3 values: prior, known to be blocked, and known to be free. If the number of nodes is n and the number of edges in the graph is m, the number of MDP states is n·3^m. The problem has been addressed recently using a domain-specific implementation of UCT [Eyerich, Keller, and Helmert 2010]. We consider instances with 20 nodes from that paper and instances with 10 nodes obtained from the authors. Following Eyerich, Keller and Helmert, the actions are identified with moves to any node in the frontier of the known graph. The horizon can then be set to the number of nodes in the graph.
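For concreteness, the n·3^m belief-state count grows very quickly with the number of edges; a quick check (the edge counts in the usage note are assumed figures, not the instances' actual sizes):

```python
def ctp_belief_states(n_nodes, n_edges):
    """Number of belief MDP states for a CTP instance: one agent position
    out of n_nodes, times one of 3 beliefs (prior / blocked / free) per
    edge, i.e. n * 3**m."""
    return n_nodes * 3 ** n_edges
```

For example, a hypothetical 20-node graph with 30 edges would already have 20·3^30 ≈ 4·10^15 belief states, which is why exhaustive methods are out of reach here.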
Quality profiles are shown in Fig. 4 for instances 10-7 and 20-7. On each panel, results on the left are for UCT and AOT using the random base policy, while results on the right are for the optimistic base policy that assumes that all edges of unknown status are traversable. The points for UCT correspond to running 10, 50, 100, 500, 1k, 5k, 10k and 50k iterations (rollouts), resulting in the times shown. The points for AOT correspond to running 10, 50, 100, 500, 1k, 5k and 10k iterations (expansions). Points that time out are not displayed. The curves show that AOT performs better than UCT, except up to the time window of 1 second in the instance 20-7 for the random base policy. We have computed the curves for the 20 instances and the pattern is similar (not shown for lack of space). A comprehensive view of the results is shown in Table 1. As a reference, we also include the results for two specialized implementations of UCT, UCTB and UCTO [Eyerich, Keller, and Helmert 2010], which take advantage of the specific MDP structure of the CTP, use a more informed base policy, and actually solve a slightly simpler version of the problem where the given CTP is solvable. While all the algorithms are evaluated over solvable instances, AOT and our UCT solve the harder problem of getting to the target location if the problem is solvable, and determining otherwise that the problem is unsolvable. In some instances, the edge priors are such that the probability of an instance not being solvable is high. This is shown as the probability of 'bad weather' in the second column of the table. In spite of these differences, AOT improves upon the domain-specific implementations of UCT in several cases, and practically dominates the domain-independent version. Since UCTO is the state of the art in CTP, the results show that our domain-independent implementations of both AOT and UCT are good enough, and that AOT in particular appears to be competitive with the state of the art in this domain.
Sailing and Racetrack.
The Sailing domain is similar to the one in the original UCT paper [Kocsis and Szepesvári 2006]. The profile for an instance with 80,000 states is shown in the left panel of Fig. 5 for a random base policy. The problem is discounted, and UCT and AOT are run with a fixed horizon. Profiles for other instances show the same pattern: AOT is slower to get started, because of the more expensive expansions, but then learns faster. The profile for the instance 'barto-big' of the Racetrack domain [Barto, Bradtke, and Singh 1995] is shown in the right panel of Fig. 5. In this case, AOT converges much faster than UCT.
Base Policies, Heuristics, and RTDP.
We also performed experiments comparing the use of heuristics vs. base policies for initializing the value of tip nodes in AOT. We refer to AOT using π as a base policy as AOT(π), and to AOT using the heuristic h as AOT(h). Clearly, the overhead per iteration is smaller in AOT(h) than in AOT(π), which must do a full rollout of π to evaluate a tip node. Results of AOT(π) vs. AOT(h) on the instances 20-1 and 20-4 of CTP are shown in Fig. 6, with both the zero heuristic and the min-min heuristic [Bonet and Geffner 2003]. The base policy π is the policy that is greedy with respect to h, with π being the random policy when h = 0. The curves also show the performance of LRTDP [Bonet and Geffner 2003], used as an anytime action selection mechanism that is run over the same finite-horizon MDP as AOT and with the same heuristic h. Interestingly, when LRTDP is used in this form, as opposed to an offline planning algorithm, LRTDP does rather well on CTP too, and there is no clear dominance between AOT(π), AOT(h), and LRTDP(h). The curves for the instance 20-1 are rather typical of the CTP instances. For the zero heuristic, AOT(π) does better than LRTDP(h), which does better than AOT(h). On the other hand, for the min-min heuristic, the ranking is reversed, but with the differences in performance being smaller. There are some exceptions to this pattern where AOT(π) does better, and even much better, than both AOT(h) and LRTDP(h) for the two heuristics. One such instance, 20-4, is shown in the bottom part of Fig. 6.
Figure 7 shows the same comparison for a Racetrack instance and variations of the min-min heuristic h obtained by multiplying it by different constants. Notice that multiplying a heuristic by a positive constant has no effect on the policy that is greedy with respect to h, and hence no effect on AOT(π). On the other hand, these changes affect both AOT(h) and LRTDP(h). As expected, the performance of LRTDP(h) deteriorates when the heuristic is scaled down, but, somewhat surprisingly, improves when it is scaled up, even as the resulting heuristic is not (necessarily) admissible. For small time windows, AOT(h) does worse than LRTDP(h), but for larger windows AOT(h) catches up, and in two cases surpasses LRTDP(h). In these cases, AOT with the greedy base policy appears to do best of all.
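The observation that positively scaling the heuristic leaves the induced greedy policy unchanged can be checked directly. The sketch below makes the simplifying assumption that the greedy policy ranks actions by the expected heuristic value of their successors (names and numbers are made up for illustration):

```python
def greedy_action(actions, succ, h):
    """Pick the action minimizing the expected heuristic value of the
    successors.  Since argmin is invariant under multiplication by a
    positive constant, replacing h by alpha*h (alpha > 0) cannot change
    the selected action."""
    def score(a):
        return sum(p * h(s2) for s2, p in succ(a))
    return min(actions, key=score)
```

This is why the scaled heuristics in Fig. 7 affect AOT(h) and LRTDP(h), which use the heuristic values themselves, but not AOT(π), which only uses the policy the heuristic induces.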
Leaf Selection.
We have also tested the value of the leaf selection method used in AOT by comparing the Δ-based selection of tips vs. random tip selection. It turns out that the former pays off, in particular when the base policies are not informed. Some results are shown in Fig. 8.
The results obtained from the experiments above are not conclusive. They show, however, that the new AOT algorithm, while a simple variation of the standard AO* algorithm, appears to be competitive with both UCT and LRTDP, while producing state-of-the-art results in the CTP domain.
Conclusions
The algorithm UCT addresses the problem of anytime action selection over MDPs and related models, combining a non-exhaustive search with the ability to use informed base policies. In this work, we have developed a new algorithm for this task and showed that it compares well with UCT. The new algorithm, Anytime AO* (AOT), is a very small variation of AO* that retains the optimality of AO* and its worst case complexity, yet it does not require admissible heuristics and can use base policies. The work helps to bridge the gap between Monte-Carlo Tree Search (MCTS) methods and anytime heuristic search methods, both of which have flourished in recent years, the former over AND/OR graphs and game trees, the latter over OR graphs. The relation of Anytime AO* to both classes of methods also suggests a number of extensions that are worth exploring in the future.
Acknowledgments.
H. Geffner is partially supported by grants TIN2009-10232, MICINN, Spain, and EC-7PM-SpaceBook.
References
[Auer, Cesa-Bianchi, and Fischer 2002] Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multi-armed bandit problem. Machine Learning 47(2):235–256.
[Balla and Fern 2009] Balla, R., and Fern, A. 2009. UCT for tactical assault planning in real-time strategy games. In Proc. IJCAI-09, 40–45.
[Barto, Bradtke, and Singh 1995] Barto, A.; Bradtke, S.; and Singh, S. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence 72:81–138.
[Bertsekas, Tsitsiklis, and Wu 1997] Bertsekas, D.; Tsitsiklis, J.; and Wu, C. 1997. Rollout algorithms for combinatorial optimization. J. of Heuristics 3(3):245–262.
[Bonet and Geffner 2003] Bonet, B., and Geffner, H. 2003. Labeled RTDP: Improving the convergence of real-time dynamic programming. In Proc. ICAPS, 12–31.
[Chaslot et al. 2008] Chaslot, G.; Winands, M.; Herik, H.; Uiterwijk, J.; and Bouzy, B. 2008. Progressive strategies for Monte-Carlo tree search. New Math. and Natural Comp. 4(3):343.
[Eyerich, Keller, and Helmert 2010] Eyerich, P.; Keller, T.; and Helmert, M. 2010. High-quality policies for the Canadian traveler's problem. In Proc. AAAI.
[Finnsson and Björnsson 2008] Finnsson, H., and Björnsson, Y. 2008. Simulation-based approach to general game playing. In Proc. AAAI, 259–264.
[Gelly and Silver 2007] Gelly, S., and Silver, D. 2007. Combining online and offline knowledge in UCT. In Proc. ICML, 273–280.
[Hansen and Zhou 2007] Hansen, E., and Zhou, R. 2007. Anytime heuristic search. J. Artif. Intell. Res. 28:267–297.
[Howard 1971] Howard, R. 1971. Dynamic Probabilistic Systems–Volume I: Markov Models. New York: Wiley.
[Kearns, Mansour, and Ng 1999] Kearns, M.; Mansour, Y.; and Ng, A. 1999. A sparse sampling algorithm for near-optimal planning in large MDPs. In Proc. IJCAI-99, 1324–1331.
[Kocsis and Szepesvári 2006] Kocsis, L., and Szepesvári, C. 2006. Bandit based Monte-Carlo planning. In Proc. ECML-2006, 282–293.
[Koenig and Sun 2009] Koenig, S., and Sun, X. 2009. Comparing real-time and incremental heuristic search for real-time situated agents. Autonomous Agents and Multi-Agent Systems 18(3):313–341.
[Likhachev, Gordon, and Thrun 2003] Likhachev, M.; Gordon, G.; and Thrun, S. 2003. ARA*: Anytime A* with provable bounds on suboptimality. In Proc. NIPS.
[Munos and Coquelin 2007] Munos, R., and Coquelin, P. 2007. Bandit algorithms for tree search. In Proc. UAI, 67–74.
[Nilsson 1980] Nilsson, N. 1980. Principles of Artificial Intelligence. Tioga.
[Papadimitriou and Yannakakis 1991] Papadimitriou, C., and Yannakakis, M. 1991. Shortest paths without a map. Theoretical Comp. Sci. 84(1):127–150.
[Ramanujan, Sabharwal, and Selman 2010] Ramanujan, R.; Sabharwal, A.; and Selman, B. 2010. On adversarial search spaces and sampling-based planning. In Proc. ICAPS, 242–245.
[Silver and Veness 2010] Silver, D., and Veness, J. 2010. Monte-Carlo planning in large POMDPs. In Proc. NIPS, 2164–2172.
[Sutton and Barto 1998] Sutton, R., and Barto, A. 1998. Introduction to Reinforcement Learning. MIT Press.
[Thayer and Ruml 2010] Thayer, J., and Ruml, W. 2010. Anytime heuristic search: Frameworks and algorithms. In Proc. SOCS.
[Walsh, Goschin, and Littman 2010] Walsh, T.; Goschin, S.; and Littman, M. 2010. Integrating sample-based planning and model-based reinforcement learning. In Proc. AAAI.
[Yan et al. 2005] Yan, X.; Diaconis, P.; Rusmevichientong, P.; and Van Roy, B. 2005. Solitaire: Man versus machine. In Proc. NIPS 17.