1 Introduction
Monte Carlo Tree Search (MCTS) [Coulom, 2006] is a state-of-the-art planning algorithm [Browne et al., 2012; Chaslot et al., 2008]. The strength of MCTS is the use of statistical uncertainty to balance exploration versus exploitation [Munos et al., 2014]. A popular MCTS selection rule is Upper Confidence Bounds for Trees (UCT) [Kocsis and Szepesvári, 2006; Cazenave and Jouandeau, 2007], which explores based on the Upper Confidence Bound (UCB) [Auer et al., 2002] of the mean action value estimate. More recently, MCTS was also popularized in an iterated planning and learning scheme, where a low-budget planning iteration is nested in a learning loop. This approach achieved superhuman performance in the games Go, Chess and Shogi [Silver et al., 2016; Silver et al., 2017].
However, the UCB selection rule only uses a local statistical uncertainty estimate derived from the number of visits to an action node. Thereby, it does not take into account how large the subtree below a specific action is. When we sample a single trace from a very large remaining subtree, we have much more remaining uncertainty than when we sample a trace from a very shallow subtree. However, standard MCTS cannot discriminate between these settings. It turns out that MCTS can perform arbitrarily badly when the variation in subtree size between arms is large.
We propose a solution to this problem through an extra backup of an estimate of the size of the subtree below an action. This information is then integrated in an adapted UCB formula to better inform the exploration decision. Next, we show that loops, where the same state reappears in a trace, can be seen as a special case of our framework. Our final algorithm, MCTS-T+, vastly increases performance in environments with variation in subtree depth and/or many loops, while performing at least on par with standard MCTS on environments that have less of these characteristics. Our experiments indicate that the benefits are mostly present 1) for single-player RL tasks with more early termination and loops, and 2) for lower computational budgets, which is especially relevant in real-time search with time limitations (e.g., robotics), and in iterated search and learning paradigms with small nested searches, e.g., in AlphaGo Zero [Silver et al., 2017].
The remainder of this paper is organized as follows. Section 2 provides essential preliminaries. Section 3 illustrates the problems caused by variation in subtree depth, and introduces a solution based on subtree depth estimation. Section 4 identifies the problem of loops, and extends the algorithm of the previous section to MCTST+, which naturally deals with loops. The remaining sections 5, 6, 7 and 8 present experiments, related work, discussion and conclusion, respectively. Code to replicate experiments is available from https://github.com/tmoer/mctst.git.
2 Preliminaries
2.1 Markov Decision Process
We adopt a Markov Decision Process (MDP) [Sutton and Barto, 2018] defined by the tuple $\{\mathcal{S}, \mathcal{A}, T, R, \gamma\}$. Here, $\mathcal{S}$ is a state set, and $\mathcal{A}$ is a discrete action set. We assume that the MDP is deterministic, with transition function $T: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ and reward function $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. Finally, $\gamma \in [0,1]$ denotes a discount parameter. At every timestep $t$ we observe a state $s_t \in \mathcal{S}$ and pick an action $a_t \in \mathcal{A}$, after which the environment returns a reward $r_t = R(s_t, a_t)$ and next state $s_{t+1} = T(s_t, a_t)$. We act in the MDP according to a stochastic policy $\pi(a|s)$. Define the (policy-dependent) state value $V^\pi(s) = \mathbb{E}_\pi[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,|\, s_t = s]$ and state-action value $Q^\pi(s,a) = \mathbb{E}_\pi[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,|\, s_t = s, a_t = a]$, respectively. Our goal is to find a policy $\pi$ that maximizes this cumulative, discounted sum of rewards.
2.2 Monte Carlo Tree Search
One approach to solving the MDP optimization problem from some state $s_0$ is through Monte Carlo Tree Search [Browne et al., 2012]. We here choose to illustrate our work with a variant of the PUCT algorithm [Rosin, 2011], as used in AlphaGo Zero [Silver et al., 2017], but our methodology is equally applicable to other MCTS (select step) variants.
The tree consists of state nodes connected by action links. Each action link stores statistics $\{n(s,a), W(s,a), Q(s,a)\}$, where $n(s,a)$ is the visit count, $W(s,a)$ the cumulative return over all rollouts through $(s,a)$, and $Q(s,a) = W(s,a)/n(s,a)$ is the mean action value estimate. MCTS repeatedly performs four subroutines [Browne et al., 2012]:


Select We first descend the known part of the tree based on the tree policy rule:
(1) $\pi_{tree}(s) = \arg\max_a \left[ Q(s,a) + c \cdot \frac{\sqrt{n(s)}}{n(s,a)+1} \right]$, where $n(s) = \sum_a n(s,a)$ is the total number of visits to state $s$, and $c$ is a constant that scales exploration. The tree policy naturally balances exploration versus exploitation, as it initially prefers all actions (due to low visit count), but asymptotically only selects the optimal action(s).
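The select rule can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the dictionaries `Q` and `n` (mapping actions to mean value estimates and visit counts) and the function name `select_action` are hypothetical stand-ins for the edge statistics above.

```python
import math

def select_action(Q, n, c=1.0):
    """PUCT-style select step (Eq. 1 sketch): pick the action maximizing
    the mean value estimate plus an exploration bonus that shrinks
    with the action's visit count."""
    n_s = sum(n.values())  # n(s): total visits to this state
    def ucb(a):
        return Q[a] + c * math.sqrt(n_s) / (n[a] + 1)
    return max(Q, key=ucb)
```

Note how an untried action (count 0) keeps the full bonus, so it is preferred early on, while a well-visited action is selected only if its mean value justifies it.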

Expand Once we encounter a child action edge which has not been tried before ($n(s,a) = 0$), we expand the tree with a new leaf state $s'$ according to the transition function. Subsequently, we initialize all the child links (actions) of the new leaf $s'$.

Rollout To obtain a fast estimate of the value of $s'$, we then make a rollout up to depth $D$ with some rollout policy $\pi_{roll}$, for example a random policy, and estimate $V(s')$ as the cumulative discounted reward of the rollout. Instead of the rollout, planning-learning integrations typically plug in a value estimate obtained from a learned value function [Silver et al., 2016; Silver et al., 2017].

Backup In the last step, we recursively back up our value estimates in the tree. We recursively iterate through the trace, and update, for $i \in \{T-1, \dots, 0\}$:
(2) $R(s_i, a_i) = r_i + \gamma \cdot R(s_{i+1}, a_{i+1})$, where $R(s_T, a_T)$ is initialized from the value estimate at the leaf, and subsequently set
(3) $n(s_i, a_i) \leftarrow n(s_i, a_i) + 1$ (4) $W(s_i, a_i) \leftarrow W(s_i, a_i) + R(s_i, a_i)$ (5) $Q(s_i, a_i) = W(s_i, a_i) / n(s_i, a_i)$
This procedure is repeated until the overall MCTS trace budget is reached. We then recommend an action at the root, typically the one with the highest visitation count.
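The backup step (Eqs. 2–5) can be sketched as follows. This is an illustrative snippet under our own conventions, not the paper's code: each node is assumed to be a plain dictionary holding the per-action statistics `n`, `W` and `Q`.

```python
def backup(trace, leaf_value, gamma=1.0):
    """Recursive value backup along a trace of (node, action, reward)
    tuples, walked from the leaf back to the root."""
    R = leaf_value  # value estimate at the leaf (rollout or learned value)
    for node, a, r in reversed(trace):
        R = r + gamma * R                           # Eq. 2: discounted return
        node['n'][a] = node['n'].get(a, 0) + 1      # Eq. 3: visit count
        node['W'][a] = node['W'].get(a, 0.0) + R    # Eq. 4: cumulative return
        node['Q'][a] = node['W'][a] / node['n'][a]  # Eq. 5: mean estimate
```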
3 Variation in Subtree Size
We now focus on a specific aspect that the above MCTS formulation does not account for: variation in the size of the subtree below actions (in the select step). Imagine we have two available actions in a certain state. The first action directly terminates the domain, and sampling it once therefore provides much information. The second action has a large subtree below it, and sampling it once only explores a single realization of all possible traces, with much more remaining uncertainty about the true optimal value. Now the key issue is: standard MCTS does not discriminate between these cases, since it only tracks how often a node is visited, but completely ignores the size of the subtree below that action.
Variation in subtree size is widespread in many single-player RL tasks. Examples include grid worlds [Sutton and Barto, 2018], exploration/adventure games (e.g. Montezuma’s Revenge [Bellemare et al., 2013]), shooting games (where in some arms we die quickly) [Kempka et al., 2016], and robotics tasks (where the robot breaks or the environment terminates if we exceed certain physical limitations) [Brockman et al., 2016]. In the experimental section we test on different versions of such problems.
When the subtree size below actions varies, we can vastly gain efficiency by incorporating information about their size. For conceptual illustration, we will first focus on the Chain domain (Figure 1, left) [Osband et al., 2016], a well-known task from RL exploration research. The Chain is a long, narrow path with sparse reward at the end, which gives a very asymmetric tree structure that extends much deeper in one direction (Figure 1, right).
The total number of terminating traces in this domain is $N{+}1$ for a Chain of length $N$. Exhaustive search therefore solves the task with $O(N^2)$ time complexity. Surprisingly, MCTS actually has exponential time complexity, $O(2^N)$, on this task. The problem is that MCTS receives returns of 0 for both actions at the root (since the chance of sampling the full correct trace is very small, $(1/2)^N$ under uniform exploration). Therefore, MCTS keeps spreading its traces at the root, and recursively the same spreading happens at deeper nodes, leading to the exponential complexity. What MCTS lacks is information about the depth of the subtree below an arm. We empirically illustrate this behaviour in Sec. 3.2.
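For concreteness, the Chain can be encoded in a few lines. This is a hypothetical minimal encoding (function name, action coding and reward placement are our own assumptions): action 1 advances along the chain, the other action terminates with zero reward, and only the full trace of correct actions collects the sparse reward at the end.

```python
def chain_step(state, action, length):
    """One step in a Chain MDP of the given length.
    Returns (next_state, reward, done)."""
    if action == 0:
        return state, 0.0, True          # wrong action: immediate termination
    if state + 1 == length:
        return state + 1, 1.0, True      # reached the end: sparse reward
    return state + 1, 0.0, False         # advance one step along the chain
```

A uniformly random policy reaches the reward with probability $(1/2)^N$, which is what makes naive Monte Carlo sampling exponentially inefficient here.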
3.1 MCTS with Tree Uncertainty Backup (MCTS-T)
We now extend the MCTS algorithm to make a soft estimate of the size of the subtree below an action, which we represent as the remaining uncertainty $\sigma_\tau(s) \in [0,1]$. For each state $s$ in the tree, we will estimate and recursively back up $\sigma_\tau(s)$, where $\sigma_\tau(s) = 1$ indicates a completely unexplored subtree below $s$, while $\sigma_\tau(s) = 0$ indicates a fully enumerated subtree.
We first define the $\sigma_\tau$ of a new leaf state $s_L$ as:
(6) $\sigma_\tau(s_L) = 0$ if $s_L$ is terminal, and $\sigma_\tau(s_L) = 1$ otherwise.
We then recursively back up $\sigma_\tau$ to previous states in the search tree, i.e., we update $\sigma_\tau(s)$ from the uncertainties of its successors $s'$. We could use a uniform policy for this backup, but one of the strengths of MCTS is that it gradually starts to prefer (i.e., more strongly weigh) the outcomes of good arms. We therefore weigh the backups by the empirical MCTS counts. Moreover, if an action has not been tried yet (and we therefore lack an estimate of $\sigma_\tau(s')$), then we initialize the action as if tried once and with maximum uncertainty (the most conservative estimate). Defining
(7) $\tilde{n}(s,a) = \max(n(s,a), 1)$
(8) $\tilde{\sigma}_\tau(s') = 1$ if $n(s,a) = 0$, and $\tilde{\sigma}_\tau(s') = \sigma_\tau(s')$ otherwise,
then the weighted backup is
(9) $\sigma_\tau(s) = \frac{\sum_a \tilde{n}(s,a) \cdot \tilde{\sigma}_\tau(s')}{\sum_a \tilde{n}(s,a)}$
for $s' = T(s,a)$ given by the deterministic environment dynamics. This backup process is illustrated in Figure 2.
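The weighted uncertainty backup (Eqs. 7–9) is simple to express in code. The following sketch assumes dictionary inputs of our own choosing: `counts[a]` is the visit count of action `a`, and `sigmas[a]` is the $\sigma_\tau$ of the successor reached by `a`.

```python
def sigma_tau_backup(counts, sigmas):
    """Count-weighted backup of subtree uncertainty (Eq. 9).
    Untried actions are treated as tried once with maximum
    uncertainty 1 (Eqs. 7-8), the most conservative estimate."""
    total, weighted = 0.0, 0.0
    for a in counts:
        m = max(counts[a], 1)                     # Eq. 7: count floor of 1
        s = sigmas[a] if counts[a] > 0 else 1.0   # Eq. 8: untried => sigma = 1
        total += m
        weighted += m * s
    return weighted / total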
Modified select step
A small $\sigma_\tau(s')$ reduces our need to visit that subtree again for exploration, as we already (largely) know what will happen there. We therefore modify our tree policy at node $s$ to:
(10) $\pi_{tree}(s) = \arg\max_a \left[ Q(s,a) + \sigma_\tau(s') \cdot c \cdot \frac{\sqrt{n(s)}}{n(s,a)+1} \right]$
for $s'$ the successor state of action $a$ in $s$. The introduction of $\sigma_\tau(s')$ acts as a prior on the upper confidence bound, reducing exploration pressure on those arms of which we have (largely) enumerated the subtree.
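In code, the modified select step (Eq. 10) differs from the standard rule only by the multiplicative $\sigma_\tau$ factor on the exploration term. As before, the dictionaries and function name are illustrative assumptions, with `sigma[a]` holding $\sigma_\tau(s')$ for the successor of action `a`.

```python
import math

def select_action_t(Q, n, sigma, c=1.0):
    """MCTS-T select step (Eq. 10 sketch): the exploration bonus of
    each arm is scaled by the remaining uncertainty of its subtree."""
    n_s = sum(n.values())  # n(s): total visits to this state
    def ucb(a):
        return Q[a] + sigma[a] * c * math.sqrt(n_s) / (n[a] + 1)
    return max(Q, key=ucb)
```

An arm with $\sigma_\tau = 0$ receives no exploration bonus at all: it is selected only if its mean value is competitive.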
Value backup
The normal MCTS backup averages the returns of all traces that passed through a node. However, the $\sigma_\tau$ mechanism, introduced above, puts extra exploration pressure on actions with a larger subtree below them. Now imagine such a deep subtree has poor return. Then, due to $\sigma_\tau$, we will still visit the action often, and this will make the state above the action look too poor. When we are overly optimistic on the forward pass, we do not want to commit to always backing up the value estimate of the explored action.
To overcome this issue, we specify a different backup mechanism that essentially recovers the standard MCTS backup. On the forward pass, we track a second set of counts, $n'(s,a)$, which are incremented as if we acted according to the standard MCTS formula (without $\sigma_\tau$):
(11) $n'(s,a) \leftarrow n'(s,a) + \mathbb{1}\left[ a = \arg\max_b \left( Q(s,b) + c \cdot \frac{\sqrt{n(s)}}{n(s,b)+1} \right) \right]$
where $\mathbb{1}[\cdot]$ denotes the indicator function. We act according to Eq. 10, but on the backward pass use the $n'(s,a)$ counts for the value backup:
(12) $Q(s,a) = r(s,a) + \gamma \cdot \frac{\sum_b n'(s',b) \cdot Q(s',b)}{\sum_b n'(s',b)}$
for $s' = T(s,a)$. This reweighs the means of all child actions according to the visit count they would have received in standard MCTS, which is the same as the standard MCTS backup.
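The reweighted value backup (Eq. 12) can be sketched as below. Argument names are our own: `child_Q` and `child_n_prime` hold the successor state's per-action mean values and its $n'$ counts.

```python
def backup_value(r, gamma, child_Q, child_n_prime):
    """MCTS-T value backup (Eq. 12 sketch): the successor state's value
    is the n'-weighted mean over its child Q-values, where n' counts
    visits as standard MCTS would have assigned them."""
    total = sum(child_n_prime.values())
    v = sum(child_n_prime[a] * child_Q[a] for a in child_Q) / total
    return r + gamma * v
```

The $\sigma_\tau$-driven extra visits thus influence where we search, but not how returns are averaged on the way back up.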
Finally, we no longer want to recommend an action at the root based on the visit counts, so we instead recommend the action with the highest mean value at the root.
3.2 Results on Chain
Figure 3 shows the performance of MCTS versus MCTS-T on the Chain (Fig. 1). Plots progress horizontally for longer lengths of the Chain, i.e., stronger asymmetry and therefore a stronger exploration challenge. In the short Chain of length 10 (Fig. 3, left), we see that both algorithms do learn, although MCTS-T is already more efficient. For the deeper chains of length 25, 50 and 100 (next three plots), we see that MCTS does not learn at all any more (flat red dotted lines), even for higher budgets. This illustrates the exponential sample complexity (in the length of the Chain) that MCTS starts to suffer from. In contrast, MCTS-T does consistently learn in the longer chains as well.
4 Loops
We will next generalize the ideas about tree asymmetry to the presence of loops in a domain. A loop occurs when the same state appears twice in the same trace within a single search. In such cases, it never makes sense to further expand the tree below the second appearance. As an example, imagine we need to navigate three steps to the left. If we first plan one step right, then one step back left (a loop), then it does not make sense to continue planning to the left from that point. We could better plan to the left directly from the root itself.
There is an important conceptual difference between a loop and a transposition [Plaat et al., 1996]. Transpositions are ways of sharing information between states that were visited in other traces. In the above example, a transposition table stores the estimated value of going left in the start state. In contrast, a loop is a property within a single search, where information sharing has no benefit. Loops are especially frequent in single-player RL tasks, for example navigation tasks where we may step back and forth between two states. Note that the detection of loops does require full observability (since otherwise we do not know whether we truly observe a repeated state, or something relevant changed in the background).
We will illustrate the problem of loops with a variant of the Chain where the ‘wrong’ action at each timestep returns the agent to the start state $s_0$ without episode termination (Figure 4, left). When we now unfold the search tree (Figure 4, right), we see that the tree is no longer asymmetric, but does have a lot of repeated appearances of $s_0$. Standard MCTS cannot detect this problem, and will therefore repeatedly expand the tree in all directions.
4.1 MCTS-T+: blocking loops
When we remove all the repeated visits of $s_0$, we actually get the same tree as for the normal Chain again. This suggests that our $\sigma_\tau$ mechanism has a close relation to the appearance of loops as well. A natural solution is to detect duplicate states in a trace, and then set $\sigma_\tau = 0$ for the looped state. Thereby, we completely remove the exploration pressure from this arm, i.e., treat the looped state as if it has an empty subtree.
The value/rollout estimate of the duplicate state depends on the sum of rewards in the loop, $\sum_{t=i}^{j} r_t$, where $i$ and $j$ specify the subset of the trace containing the loop. For infinite time-horizon problems with $\gamma = 1$ (whose return is not guaranteed to be finite itself), we could theoretically repeat the loop forever, and therefore:
(13) $V(s_j) = +\infty$ if $\sum_{t=i}^{j} r_t > 0$, $\;0$ if $\sum_{t=i}^{j} r_t = 0$, and $-\infty$ if $\sum_{t=i}^{j} r_t < 0$.
For finite horizon problems, or problems with $\gamma < 1$, we may approximate the value of the loop based on the number of remaining steps and the discount parameter. However, note that loops with a net positive or negative return are most frequently a domain artifact, as the solution of a (real-world) sequential decision making task is seldom to repeat the same action loop forever.
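The infinite-horizon, undiscounted case above reduces to a sign check on the net loop reward. The following sketch encodes that case only (the function name and the handling are our own assumptions; the finite-horizon approximation mentioned above is not implemented):

```python
import math

def loop_value(loop_reward_sum):
    """Value of repeating a loop forever (Eq. 13 sketch, gamma = 1,
    infinite horizon): a positive net loop reward diverges to +inf,
    a negative one to -inf, and a zero-reward loop is worth 0."""
    if loop_reward_sum > 0:
        return math.inf
    if loop_reward_sum < 0:
        return -math.inf
    return 0.0
```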
In larger state spaces, exact loops are rare. We therefore check for approximate loops, where the looped state is very similar to a state above it. We mark a new leaf state $s_L$ as looped when, for any state $s_i$ above it in the trace, the L2-norm with the newly expanded state is below a tunable threshold $\eta$:
(14) $\| s_L - s_i \|_2 < \eta$
Once a loop is detected, we set $\sigma_\tau = 0$ for the looped state, and apply all methodology from the previous section.
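The approximate loop check (Eq. 14) is a plain distance test. A minimal sketch, assuming states are represented as equal-length numeric vectors (the function name and default threshold are illustrative, not from the paper):

```python
import math

def is_loop(leaf_state, trace_states, eta=1e-3):
    """Eq. 14 sketch: the new leaf counts as looped if its L2 distance
    to any state above it in the current trace is below eta."""
    def l2(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return any(l2(leaf_state, s) < eta for s in trace_states)
```

The cost is linear in the trace depth per expansion, which matches the modest (roughly 10%) overhead reported for MCTS-T+ in the experiments.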
Note that a simpler solution to blocking loops could be to completely remove the parent action of a looped state from the tree. We present the above $\sigma_\tau$ treatment to i) be robust against situations where the loop is relevant, and ii) conceptually show what a loop implies: a state with an empty subtree below it ($\sigma_\tau = 0$).
4.2 Results on Chain with loops
We illustrate the performance of MCTS-T+ on the Chain with loops (Figure 4). The results are shown in Figure 5. We observe a similar pattern as in the previous section, where MCTS only (partially) solves the shorter chains, but does not solve the longer chains at all. In contrast, MCTS-T+ does efficiently solve the longer chains as well. Note that MCTS-T (without loop detection) does not solve this problem either (curves not shown), as the loops prevent any termination, and therefore all $\sigma_\tau$ estimates stay at 1.
5 Experiments
The previous experiments, on the Chain and Chain with loops, present extreme cases of variation in subtree depth and the presence of loops. They are example cases to show the worst-case performance of MCTS in such scenarios, but are not very representative of most problems in the RL community. We therefore compare our algorithm to standard MCTS on several reinforcement learning tasks from the OpenAI Gym repository [Brockman et al., 2016]: CartPole, FrozenLake and the Atari games Pong and AirRaid.
These results are visualized in Figure 6. We see that MCTS-T and MCTS-T+ consistently perform equal to or better than MCTS. This difference seems more pronounced for smaller MCTS search budgets. This makes sense, since the $\sigma_\tau$ machinery is especially applicable when we want to squeeze as much information out of our traces as possible.
Note that the search budgets are relatively small compared to most tree search implementations. We will return to this point in the discussion. The computational overhead of MCTS-T itself is negligible (compared to the environment simulations). For MCTS-T+, loop detection does incur some cost in larger state spaces. In the worst case, on Atari, MCTS-T+ has a 10% increase in computation time.
6 Related Work
The closest related work is probably MCTS-Solver [Winands et al., 2008], designed for two-player, zero-sum games. In MCTS-Solver, once a subtree is enumerated, the action link above it is associated with its game-theoretical value $+\infty$ (forced win) or $-\infty$ (forced loss). It then uses specific backup mechanisms (e.g., if one child action is a win, then the parent node is a win, and if all child actions are a loss, then the parent node is a loss). Compared to MCTS-Solver, our approach can be seen as a soft variant, where we gradually squeeze arms based on their estimated subtree size, instead of only squeezing completely once we have fully enumerated the arm. Moreover, our approach is more generally applicable: it does not have any constraints on the reward functions (like win/loss), nor does it use backup rules that are specific to two-player, zero-sum games. As such, MCTS-Solver would not be applicable to the problems studied in this paper.
Other related work has focused on maintaining confidence bounds on the value of internal nodes, first introduced in B* [Berliner, 1981]. For example, score-bounded MCTS [Cazenave and Saffidine, 2010] propagates explicit upper and lower bounds through the tree, and then prunes the tree based on alpha-beta style cuts [Knuth and Moore, 1975]. This approach is only applicable to two-player games with minimax structure, while our approach is more general. Tesauro et al. [2012] present an MCTS variant that propagates Bayesian uncertainty bounds. This approach is robust against variation in subtree size (not against loops), but requires priors on the confidence bounds, and will generally be quite conservative. One of the benefits of MCTS is that it gradually starts to ignore certain subtrees, without ever enumerating them, a property that is preserved in our approach.
While MCTS is a regret minimizing algorithm, a competing formulation, known as best-arm identification [Audibert and Bubeck, 2010; Kaufmann and Koolen, 2017], only cares about the final recommendation. Our approach also departs from the regret minimization objective, by putting additional exploration pressure on arms that have more remaining uncertainty. Finally, our solution also bears connections to RL exploration research that uses the return distribution to enhance exploration [Moerland et al., 2018; Tang and Agrawal, 2018], which may implicitly perform a similar mechanism as described in this paper.
7 Discussion
This paper introduced MCTS-T+, an MCTS extension that is robust against variation in subtree size and loops. We will briefly cover some potential criticism and future extensions of our approach.
From a games perspective, one could argue that our method is only useful in the endgame, when the search is relatively simple anyway (compared to the midgame). While this is true in two-player games, such as Go and Chess, many single-player reinforcement learning tasks, as studied in this paper, tend to have terminating arms right from the start (like dying in a shooting game), or many loops (like navigation tasks where we step back and forth). Our results are especially useful for the latter scenarios.
Our method seems predominantly beneficial with relatively small search budgets per timestep, compared to the budgets typically expended for search in two-player games. We see three important ways in which our approach is relevant. First, real-time search with a limited time budget, as for example present in robotics applications, will benefit from maximum data efficiency. Second, we have recently seen a surge of success in iterated search and learning paradigms, like AlphaGo Zero [Silver et al., 2017], which nest a small search within a learning loop. Such approaches definitely require an effective small search. Finally, we believe our approach is also conceptually relevant in itself, since it identifies a second type of uncertainty not frequently identified in MCTS, nor individually studied.
A limitation of our algorithm may occur when a sparse reward is hiding within an otherwise poorly returning subtree. In such scenarios, we risk squeezing out much exploration pressure based on initial traces that do not hit the sparse reward. However, MCTS itself suffers from the same problem, as its success also builds on the idea that the payoffs of leaves in a subtree show correlation. Although MCTS does have asymptotic guarantees [Kocsis and Szepesvári, 2006], it will generally also take very long on such sparse reward problems. This is almost inevitable, since these problems have so little structure that they technically require exhaustive search.
Note that in large domains without early termination, like the game of Go, MCTS-T will behave exactly like MCTS for a long time. As long as no expand step reaches a terminal node, all $\sigma_\tau$ estimates remain at 1, and MCTS-T exactly reduces to MCTS. This gives the algorithm a sense of robustness: it exploits variation in subtree depth when possible, but otherwise automatically reduces to standard MCTS.
There are several directions for future work. First, the approach could be generalized to deal with stochastic and partially observable environments. Another direction would be to generalize information about $\sigma_\tau$, for example by training a neural network that predicts this quantity. Finally, the $\sigma_\tau$ mechanism may also suggest when a search can be stopped (e.g., when $\sigma_\tau \approx 0$ for all actions at the root). Time management for MCTS has been studied before, for example by Huang et al. [2010].
8 Conclusion
This paper introduces MCTS-T+, an extension to vanilla MCTS that estimates the depth of subtrees below actions, uses these estimates to better target exploration, and uses the same mechanism to deal with loops in the search. Empirical results indicate that MCTS-T+ performs on par with or better than standard MCTS on several illustrative tasks and OpenAI Gym experiments, especially for smaller planning budgets. The method is simple to implement, has negligible computational overhead, and, in the absence of termination, stays equal to standard MCTS. It can be useful in single-player RL tasks with frequent termination and loops, real-time planning with limited time budgets, and iterated search and learning paradigms with small nested searches. Together, the paper also provides a conceptual introduction of a type of uncertainty that standard MCTS does not account for.
References
[Audibert and Bubeck, 2010] Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. 2010.
[Auer et al., 2002] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[Bellemare et al., 2013] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
[Berliner, 1981] Hans Berliner. The B* tree search algorithm: A best-first proof procedure. In Readings in Artificial Intelligence, pages 79–87. Elsevier, 1981.
[Brockman et al., 2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[Browne et al., 2012] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
[Cazenave and Jouandeau, 2007] Tristan Cazenave and Nicolas Jouandeau. On the parallelization of UCT. In Proceedings of the Computer Games Workshop, pages 93–101. Citeseer, 2007.
[Cazenave and Saffidine, 2010] Tristan Cazenave and Abdallah Saffidine. Score bounded Monte-Carlo tree search. In International Conference on Computers and Games, pages 93–104. Springer, 2010.
[Chaslot et al., 2008] Guillaume Chaslot, Sander Bakkes, Istvan Szita, and Pieter Spronck. Monte-Carlo Tree Search: A New Framework for Game AI. In AIIDE, 2008.
[Coulom, 2006] Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, pages 72–83. Springer, 2006.
[Huang et al., 2010] Shih-Chieh Huang, Rémi Coulom, and Shun-Shii Lin. Time management for Monte-Carlo tree search applied to the game of Go. In 2010 International Conference on Technologies and Applications of Artificial Intelligence, pages 462–466. IEEE, 2010.
[Kaufmann and Koolen, 2017] Emilie Kaufmann and Wouter M Koolen. Monte-Carlo tree search by best arm identification. In Advances in Neural Information Processing Systems, pages 4897–4906, 2017.
[Kempka et al., 2016] Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pages 1–8. IEEE, 2016.
[Knuth and Moore, 1975] Donald E Knuth and Ronald W Moore. An analysis of alpha-beta pruning. Artificial Intelligence, 6(4):293–326, 1975.
[Kocsis and Szepesvári, 2006] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In ECML, volume 6, pages 282–293. Springer, 2006.
[Moerland et al., 2018] Thomas M Moerland, Joost Broekens, and Catholijn M Jonker. The Potential of the Return Distribution for Exploration in RL. arXiv preprint arXiv:1806.04242, 2018.
[Munos et al., 2014] Rémi Munos et al. From bandits to Monte-Carlo Tree Search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 7(1):1–129, 2014.
[Osband et al., 2016] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and Exploration via Randomized Value Functions. In International Conference on Machine Learning, pages 2377–2386, 2016.
[Plaat et al., 1996] Aske Plaat, Jonathan Schaeffer, Wim Pijls, and Arie De Bruin. Exploiting graph properties of game trees. In AAAI/IAAI, Vol. 1, pages 234–239, 1996.
[Rosin, 2011] Christopher D Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.
[Silver et al., 2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[Silver et al., 2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
[Sutton and Barto, 2018] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, second edition, 2018.
[Tang and Agrawal, 2018] Yunhao Tang and Shipra Agrawal. Exploration by distributional reinforcement learning. arXiv preprint arXiv:1805.01907, 2018.
[Tesauro et al., 2012] Gerald Tesauro, VT Rajan, and Richard Segal. Bayesian inference in Monte-Carlo tree search. arXiv preprint arXiv:1203.3519, 2012.
[Winands et al., 2008] Mark HM Winands, Yngvi Björnsson, and Jahn-Takeshi Saito. Monte-Carlo tree search solver. In International Conference on Computers and Games, pages 25–36. Springer, 2008.