Monte Carlo Tree Search (MCTS) [Coulom2006] is a state-of-the-art planning algorithm [Browne et al.2012, Chaslot et al.2008]. The strength of MCTS is the use of statistical uncertainty to balance exploration versus exploitation [Munos and others2014]. A popular MCTS selection rule is Upper Confidence Bounds for Trees (UCT) [Kocsis and Szepesvári2006, Cazenave and Jouandeau2007], which explores based on the Upper Confidence Bound (UCB) [Auer et al.2002] of the mean action value estimate. More recently, MCTS was also popularized in an iterated planning and learning scheme, where a low-budget planning iteration is nested in a learning loop. This approach achieved super-human performance in the games Go, Chess and Shogi [Silver et al.2016, Silver et al.2017].
However, the UCB selection rule only uses a local statistical uncertainty estimate derived from the number of visits to an action node. It therefore does not take into account how large the subtree below a specific action is. When we sample a single trace from a very large remaining subtree, we have much more remaining uncertainty than when we sample a trace from a very shallow subtree. Standard MCTS cannot discriminate between these settings. As a result, MCTS can perform arbitrarily badly when the variation in subtree size between arms is large.
We propose a solution to this problem through an extra back-up of an estimate of the size of the subtree below an action. This information is then integrated in an adapted UCB formula to better inform the exploration decision. Next, we show that loops, where the same state re-appears in a trace, can be seen as a special case of our framework. Our final algorithm, MCTS-T+, vastly increases performance in environments with variation in subtree depth and/or many loops, while performing at least on par with standard MCTS on environments that have fewer of these characteristics. Our experiments indicate that the benefits are mostly present 1) for single-player RL tasks with more early termination and loops, and 2) for lower computational budgets, which is especially relevant in real-time search with time limitations (e.g., robotics), and in iterated search and learning paradigms with small nested searches, e.g., in AlphaGo Zero [Silver et al.2017].
The remainder of this paper is organized as follows. Section 2 provides essential preliminaries. Section 3 illustrates the problems caused by variation in subtree depth, and introduces a solution based on subtree depth estimation. Section 4 identifies the problem of loops, and extends the algorithm of the previous section to MCTS-T+, which naturally deals with loops. The remaining sections 5, 6, 7 and 8 present experiments, related work, discussion and conclusion, respectively. Code to replicate experiments is available from https://github.com/tmoer/mcts-t.git.
2 Preliminaries
2.1 Markov Decision Process
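We adopt a Markov Decision Process (MDP) [Sutton and Barto2018] defined by the tuple $\{\mathcal{S}, \mathcal{A}, f, \mathcal{R}, \gamma\}$. Here, $\mathcal{S}$ is a state set, and $\mathcal{A}$ is a discrete action set. We assume that the MDP is deterministic, with transition function $f: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ and reward function $\mathcal{R}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. Finally, $\gamma \in [0, 1]$ denotes a discount parameter.
At every time-step $t$ we observe a state $s_t \in \mathcal{S}$ and pick an action $a_t \in \mathcal{A}$, after which the environment returns a reward $r_t = \mathcal{R}(s_t, a_t)$ and next state $s_{t+1} = f(s_t, a_t)$. We act in the MDP according to a stochastic policy $\pi(a|s)$. Define the (policy-dependent) state value $V^\pi(s) = \mathbb{E}_\pi\big[\sum_{k \geq 0} \gamma^k r_{t+k} \mid s_t = s\big]$ and state-action value $Q^\pi(s, a) = \mathbb{E}_\pi\big[\sum_{k \geq 0} \gamma^k r_{t+k} \mid s_t = s, a_t = a\big]$, respectively. Our goal is to find a policy $\pi$ that maximizes this cumulative, discounted sum of rewards.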
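As a small worked example of the objective, the cumulative discounted sum of rewards along a single finite trace can be computed backwards from the last reward. The function name `discounted_return` and the example rewards are illustrative assumptions, not part of the original method.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma^k * rewards[k] for a finite reward sequence.

    Illustrative helper (hypothetical name): accumulates from the last
    reward backwards, using G_k = r_k + gamma * G_{k+1}.
    """
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# A sparse reward of 1 received after two zero-reward steps.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))
```

With `gamma=1.0` the function simply sums the rewards, which corresponds to the undiscounted case.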
2.2 Monte Carlo Tree Search
One approach to solving the MDP optimization problem from some state $s$ is through Monte Carlo Tree Search [Browne et al.2012]. Here we illustrate our work with a variant of the PUCT algorithm [Rosin2011], as used in AlphaGo Zero [Silver et al.2017], but our methodology is equally applicable to other MCTS (select step) variants.
The tree consists of state nodes connected by action links. Each action link stores statistics $\{n(s,a), W(s,a), Q(s,a)\}$, where $n(s,a)$ is the visit count, $W(s,a)$ the cumulative return over all roll-outs through $(s,a)$, and $Q(s,a) = W(s,a)/n(s,a)$ is the mean action value estimate. MCTS repeatedly performs four subroutines [Browne et al.2012]:
Select We first descend the known part of the tree based on the tree policy rule:
$$\pi_{\text{tree}}(s) = \arg\max_a \left[ Q(s,a) + c \cdot \frac{\sqrt{n(s)}}{n(s,a) + 1} \right] \tag{1}$$
where $n(s) = \sum_a n(s,a)$ is the total number of visits to state $s$, and $c > 0$ is a constant that scales exploration. The tree policy naturally balances exploration versus exploitation, as it initially prefers all actions (due to low visit count), but asymptotically only selects the optimal action(s).
Expand Once we encounter a child action edge which has not been tried before ($n(s,a) = 0$), we expand the tree with a new leaf state $s_L$ according to the transition function. Subsequently, we initialize all the child links (actions) of the new leaf $s_L$.
Roll-out To obtain a fast estimate of the value of $s_L$, we then make a roll-out up to depth $D$ with some roll-out policy $\pi_{\text{roll}}$, for example a random policy, and estimate the return $R(s_L)$ as the discounted sum of rewards along the roll-out. Instead of the roll-out, planning-learning integrations typically plug in a value estimate obtained from a learned value function [Silver et al.2016, Silver et al.2017].
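Back-up In the last step, we recursively back up our value estimates in the tree. We iterate backwards through the trace, and update, for $i \in \{L-1, \ldots, 0\}$:
$$R(s_i, a_i) = r(s_i, a_i) + \gamma R(s_{i+1}, a_{i+1}) \tag{2}$$
where $R(s_L, a_L) := R(s_L)$, and subsequently set
$$W(s_i, a_i) \leftarrow W(s_i, a_i) + R(s_i, a_i) \tag{3}$$
$$n(s_i, a_i) \leftarrow n(s_i, a_i) + 1 \tag{4}$$
$$Q(s_i, a_i) \leftarrow \frac{W(s_i, a_i)}{n(s_i, a_i)} \tag{5}$$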
This procedure is repeated until the overall MCTS trace budget is reached. We then recommend an action at the root, typically the one with the highest visitation count.
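The four subroutines can be sketched in a few lines for a deterministic environment. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Node`, `mcts`, `rollout` and `bandit_step` are hypothetical, the environment is assumed to expose a function `env_step(state, action) -> (next_state, reward, done)`, the exploration bonus is assumed to take the form $c \sqrt{n(s)} / (n(s,a) + 1)$, and the prior term of PUCT is omitted.

```python
import math
import random


class Node:
    """State node that stores the statistics of its outgoing action links."""

    def __init__(self, state, n_actions, terminal=False):
        self.state = state
        self.terminal = terminal
        self.n = [0] * n_actions            # visit counts n(s, a)
        self.W = [0.0] * n_actions          # cumulative returns W(s, a)
        self.rewards = [0.0] * n_actions    # immediate rewards r(s, a)
        self.children = [None] * n_actions  # successor nodes

    def Q(self, a):
        return self.W[a] / self.n[a] if self.n[a] > 0 else 0.0

    def select(self, c):
        n_s = sum(self.n)

        def score(a):
            # Assumed bonus form: c * sqrt(n(s)) / (n(s, a) + 1).
            return self.Q(a) + c * math.sqrt(n_s) / (self.n[a] + 1)

        return max(range(len(self.n)), key=score)


def rollout(env_step, state, n_actions, depth, gamma, rng):
    """Random roll-out of at most `depth` steps; returns the discounted return."""
    ret, discount = 0.0, 1.0
    for _ in range(depth):
        state, r, done = env_step(state, rng.randrange(n_actions))
        ret += discount * r
        discount *= gamma
        if done:
            break
    return ret


def mcts(env_step, root_state, n_actions, n_traces, c=1.0, gamma=1.0,
         depth=20, seed=0):
    rng = random.Random(seed)
    root = Node(root_state, n_actions)
    for _ in range(n_traces):
        node, path = root, []
        while True:
            a = node.select(c)                      # Select
            path.append((node, a))
            if node.children[a] is None:            # Expand
                s2, r, done = env_step(node.state, a)
                node.rewards[a] = r
                node.children[a] = Node(s2, n_actions, terminal=done)
                leaf = node.children[a]
                break
            node = node.children[a]
            if node.terminal:
                leaf = node
                break
        # Roll-out (a terminal leaf has no remaining return).
        R = 0.0 if leaf.terminal else rollout(
            env_step, leaf.state, n_actions, depth, gamma, rng)
        for parent, a in reversed(path):            # Back-up
            R = parent.rewards[a] + gamma * R
            parent.n[a] += 1
            parent.W[a] += R
    # Recommend the most visited root action.
    return max(range(n_actions), key=lambda a: root.n[a]), root


def bandit_step(state, action):
    """Toy one-step task: action 1 pays reward 1, both actions terminate."""
    return state + 1, float(action), True
```

For example, `mcts(bandit_step, 0, 2, n_traces=50)` returns the recommended root action together with the root node, whose `n` and `W` lists can be inspected to see how the traces were spread over the two arms.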
3 Variation in Subtree Size
We now focus on a specific aspect that the above MCTS formulation does not account for: variation in the size of the subtree below actions (in the select step). Imagine we have two available actions in a certain state. The first action directly terminates the domain, and sampling it once therefore provides much information. The second action has a large subtree below it, and sampling it once only explores a single realization of all possible traces, with much more remaining uncertainty about the true optimal value. The key issue is that standard MCTS does not discriminate between these two cases, since it only tracks how often a node is visited, and completely ignores the size of the subtree below that action.
Variation in subtree size is widespread in many single-player RL tasks. Examples include grid worlds [Sutton and Barto2018], exploration/adventure games (e.g. Montezuma’s Revenge [Bellemare et al.2013]), shooting games (where in some arms we die quickly) [Kempka et al.2016], and robotics tasks (where the robot breaks or environment terminates if we exceed certain physical limitations) [Brockman et al.2016]. In the experimental section we test on different versions of such problems.
When the subtree size below actions varies, then we can vastly gain efficiency by incorporating information about their size. For conceptual illustration, we will first focus on the Chain domain (Figure 1, left) [Osband et al.2016], a well-known task from RL exploration research. The Chain is a long, narrow path with sparse reward at the end, which gives a very asymmetric tree structure that extends much deeper in one direction (Figure 1, right).
The total number of terminating traces in this domain is $N + 1$ for a Chain of length $N$. Exhaustive search therefore solves the task with $\mathcal{O}(N)$ time complexity. Surprisingly, MCTS actually has exponential time complexity, $\mathcal{O}(2^N)$, on this task. The problem is that MCTS receives returns of 0 for both actions at the root, since the chance of sampling the full correct trace is very small ($2^{-N}$ under a uniformly random roll-out). Therefore, MCTS keeps spreading its traces at the root, and recursively the same spreading happens at deeper nodes, leading to the exponential complexity. What MCTS lacks is information about the depth of the subtree below an arm. We empirically illustrate this behaviour in Sec. 3.2.
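The asymmetry of the Chain can be checked with a short enumeration. The sketch below assumes a particular encoding that is not taken from the paper: states are integers, action 1 moves one step along the Chain, action 0 terminates the episode with zero reward, and reaching the end yields reward 1. The function names are hypothetical.

```python
def chain_step(state, action, length):
    """Assumed Chain dynamics: action 0 terminates, action 1 moves right."""
    if action == 0:
        return state, 0.0, True
    next_state = state + 1
    done = next_state == length
    return next_state, (1.0 if done else 0.0), done


def count_terminating_traces(length):
    """Exhaustively count all action sequences that end the episode."""

    def count(state):
        total = 0
        for action in (0, 1):
            nxt, _, done = chain_step(state, action, length)
            total += 1 if done else count(nxt)
        return total

    return count(0)


def random_success_probability(length):
    """Chance that a uniformly random policy picks the right action every step."""
    return 0.5 ** length


for n in (10, 25, 50):
    print(n, count_terminating_traces(n), random_success_probability(n))
```

Under these assumptions the number of terminating traces grows linearly with the Chain length, while the chance that a random roll-out observes the reward shrinks exponentially.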
3.1 MCTS with Tree Uncertainty Back-up (MCTS-T)
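We now extend the MCTS algorithm to make a soft estimate of the size of the subtree below an action, which we represent as the remaining uncertainty $\sigma_\tau$. For each state $s$ in the tree, we will estimate and recursively back up $\sigma_\tau(s) \in [0, 1]$, where $\sigma_\tau(s) = 1$ indicates a completely unexplored subtree below $s$, while $\sigma_\tau(s) = 0$ indicates a fully enumerated subtree.
We first define the $\sigma_\tau$ of a new leaf state $s_L$ as:
$$\sigma_\tau(s_L) = \begin{cases} 0, & \text{if } s_L \text{ is terminal} \\ 1, & \text{otherwise} \end{cases} \tag{6}$$
We then recursively back up $\sigma_\tau$ to previous states in the search tree, i.e., we update $\sigma_\tau(s)$ from the uncertainties of its successors $\sigma_\tau(s')$. We could use a uniform policy for this back-up, but one of the strengths of MCTS is that it gradually starts to prefer (i.e., more strongly weigh) the outcomes of good arms. We therefore weigh the back-ups by the empirical MCTS counts. Moreover, if an action has not been tried yet (and we therefore lack an estimate of $\sigma_\tau(s')$), then we initialize the action as if tried once and with maximum uncertainty (the most conservative estimate). Defining
$$m(s,a) = \begin{cases} n(s,a), & \text{if } n(s,a) \geq 1 \\ 1, & \text{otherwise} \end{cases} \tag{7}$$
$$\sigma_\tau(s') := 1 \quad \text{if } n(s,a) = 0 \tag{8}$$
then the weighted backup is
$$\sigma_\tau(s) = \frac{\sum_a m(s,a) \, \sigma_\tau(s')}{\sum_a m(s,a)} \tag{9}$$
for $s' = f(s,a)$ given by the deterministic environment dynamics. This back-up process is illustrated in Figure 2.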
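The count-weighted back-up of the remaining uncertainty can be sketched as follows. This is an illustrative reading of the description above rather than the reference implementation: the function names are hypothetical, and an untried action is represented by a count of 0 (or a missing successor estimate, `None`), which is then treated as tried once with maximum uncertainty 1.

```python
def leaf_uncertainty(terminal):
    """A new leaf has no remaining subtree if it is terminal, else a full one."""
    return 0.0 if terminal else 1.0


def backup_uncertainty(counts, child_uncertainties):
    """Count-weighted average of the successors' remaining uncertainty.

    counts[a] is the visit count of action a; child_uncertainties[a] is the
    uncertainty of its successor state, or None if the action is untried.
    """
    numerator, denominator = 0.0, 0.0
    for n_a, sigma in zip(counts, child_uncertainties):
        if n_a == 0 or sigma is None:
            weight, sigma = 1, 1.0   # untried: as if tried once, fully uncertain
        else:
            weight = n_a
        numerator += weight * sigma
        denominator += weight
    return numerator / denominator


# One arm visited 3 times and fully enumerated, one arm still untried.
print(backup_uncertainty([3, 0], [0.0, None]))
```

In the example the parent keeps a quarter of its uncertainty, because the untried arm contributes weight 1 out of a total weight of 4.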
Modified select step
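Small $\sigma_\tau$ reduces our need to visit that subtree again for exploration, as we already (largely) know what will happen there. We therefore modify our tree policy at node $s$ to:
$$\pi_{\text{tree}}(s) = \arg\max_a \left[ Q(s,a) + c \cdot \sigma_\tau(s') \cdot \frac{\sqrt{n(s)}}{n(s,a) + 1} \right] \tag{10}$$
for $s' = f(s,a)$ the successor state of action $a$ in $s$. The introduction of $\sigma_\tau$ acts as a prior on the upper confidence bound, reducing exploration pressure on those arms of which we have (largely) enumerated the subtree.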
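The effect of scaling the exploration bonus by the remaining uncertainty can be illustrated with a small selection function. The bonus form $c \sqrt{n(s)} / (n(s,a) + 1)$ and the name `select_action_t` are assumptions made for this sketch; only the multiplication of the bonus by the successor's uncertainty is taken from the text.

```python
import math


def select_action_t(Q, n, sigma_next, c=1.0):
    """Pick the action maximizing Q plus an uncertainty-scaled exploration bonus.

    Q[a] and n[a] are the mean value and visit count of action a, and
    sigma_next[a] in [0, 1] is the remaining uncertainty of its successor.
    """
    n_s = sum(n)

    def score(a):
        bonus = c * math.sqrt(n_s) / (n[a] + 1)   # assumed bonus form
        return Q[a] + sigma_next[a] * bonus

    return max(range(len(Q)), key=score)


# Same counts, but the subtree below action 0 is fully enumerated.
print(select_action_t(Q=[0.2, 0.0], n=[5, 5], sigma_next=[0.0, 1.0]))
# Now the subtree below action 1 is the enumerated one.
print(select_action_t(Q=[0.2, 0.0], n=[5, 5], sigma_next=[1.0, 0.0]))
```

With all entries of `sigma_next` equal to 1 the rule reduces to the unscaled bonus, which matches the statement that the method falls back to standard MCTS when no subtree has been enumerated.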
The normal MCTS back-up averages the returns of all traces that passed through a node. However, the $\sigma_\tau$ mechanism introduced above puts extra exploration pressure on actions with a larger subtree below. Now imagine such a deep subtree has poor return. Then, due to $\sigma_\tau$, we will still visit the action often, and this will make the state above the action look too poor. When we are overly optimistic on the forward pass, we do not want to commit to always backing up the value estimate of the explored action.
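To overcome this issue, we specify a different back-up mechanism that essentially recovers the standard MCTS back-up. On the forward pass, we track a second set of counts, $\tilde{n}(s,a)$, which are incremented as if we acted according to the standard MCTS formula (without $\sigma_\tau$):
$$\tilde{n}(s,a) \leftarrow \tilde{n}(s,a) + \mathbb{1}\big[a = \pi_{\text{std}}(s)\big]$$
where $\pi_{\text{std}}(s)$ is the action selected by the standard tree policy of Eq. 1 and $\mathbb{1}[\cdot]$ denotes the indicator function. We act according to Eq. 10, but on the backward pass use the counts $\tilde{n}$ for the value back-up:
$$R(s_i, a_i) = r(s_i, a_i) + \gamma \sum_a \frac{\tilde{n}(s_{i+1}, a)}{\sum_b \tilde{n}(s_{i+1}, b)} \, Q(s_{i+1}, a)$$
for $s_{i+1} = f(s_i, a_i)$. This reweighs the means of all child actions according to the visit count they would have received in standard MCTS, which is the same as the standard MCTS back-up.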
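One possible reading of this reweighted back-up is sketched below: the value passed upwards from a state is the average of its child action means, weighted by the second set of counts rather than by the actual visit counts. The function name `reweighted_value` and the fallback for a state without any counts are assumptions for illustration.

```python
def reweighted_value(Q, shadow_counts):
    """Average the child means Q[a], weighted by the second set of counts.

    shadow_counts[a] is how often action a would have been chosen by the
    standard tree policy. Returns 0.0 when no counts exist yet (assumed
    convention for this sketch).
    """
    total = sum(shadow_counts)
    if total == 0:
        return 0.0
    return sum(w * q for q, w in zip(Q, shadow_counts)) / total


Q = [1.0, 0.0]
actual_counts = [1, 3]   # the deep, poor arm was visited often for exploration
shadow_counts = [3, 1]   # standard MCTS would have favoured the good arm

print(reweighted_value(Q, actual_counts))   # value if we averaged actual visits
print(reweighted_value(Q, shadow_counts))   # value under the second set of counts
```

The example shows why the second set of counts matters: averaging over the actual visits would make the state look poor only because the extra exploration pressure was spent on the low-return arm.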
Finally, we no longer want to recommend an action at the root based on the counts, so we instead recommend the action with the highest mean value at the root.
3.2 Results on Chain
Figure 3 shows the performance of MCTS versus MCTS-T on the Chain (Fig. 1). Plots progress horizontally for longer lengths of the Chain, i.e., stronger asymmetry and therefore a stronger exploration challenge. In the short Chain of length 10 (Fig. 3, left), we see that both algorithms do learn, although MCTS-T is already more efficient. For the deeper chains of length 25, 50 and 100 (next three plots), we see that MCTS does not learn at all any more (flat red dotted lines), even for higher budgets. This illustrates the exponential sample complexity (in the length of the Chain) that MCTS starts to suffer from. In contrast, MCTS-T does consistently learn in the longer chains as well.
4 Loops
We will next generalize the ideas about tree asymmetry to the presence of loops in a domain. A loop occurs when the same state appears twice in the same trace within a single search. In such cases, it never makes sense to further expand the tree below the second appearance. As an example, imagine we need to navigate three steps to the left. If we first plan one step right, then one step back left (a loop), then it does not make sense to continue planning to the left from that point. It would be better to plan to the left directly from the root itself.
There is an important conceptual difference between a loop and a transposition [Plaat et al.1996]. Transpositions are ways of sharing information between states that were visited in other traces. In the above example, a transposition table stores the estimated value of going left in the start state. In contrast, a loop is a property within a single search, where information sharing has no benefit. Loops are especially frequent in single-player RL tasks, for example navigation tasks where we may step back and forth between two states. Note that the detection of loops does require full observability (since otherwise we do not know whether we truly observe a repeated state, or whether something relevant changed in the background).
We will illustrate the problem of loops with a variant of the Chain where the ‘wrong’ action at each timestep returns the agent to the initial state without episode termination (Figure 4, left). When we now unfold the search tree (Figure 4, right), we see that the tree is no longer asymmetric, but does have a lot of repeated appearances of the initial state. Standard MCTS cannot detect this problem, and will therefore repeatedly expand the tree in all directions.
4.1 MCTS-T+: blocking loops.
When we remove all the repeated visits of the initial state, then we actually get the same tree as for the normal Chain again. This suggests that our $\sigma_\tau$ mechanism has a close relation to the appearance of loops as well. A natural solution is to detect duplicate states in a trace, and then set $\sigma_\tau = 0$ for the duplicate. Thereby, we completely remove the exploration pressure from this arm, i.e., treat the looped state as if it has an empty subtree.
The value/roll-out estimate of the duplicate state depends on the sum of rewards in the loop, $g = \sum_{(s_i, a_i) \in \mathcal{L}} r(s_i, a_i)$, where $\mathcal{L}$ specifies the subset of the trace containing the loop. For infinite time-horizon problems with $\gamma = 1$ (whose return is not guaranteed to be finite itself), we could theoretically repeat the loop forever, and therefore:
$$R(s_L) = \begin{cases} +\infty, & \text{if } g > 0 \\ 0, & \text{if } g = 0 \\ -\infty, & \text{if } g < 0 \end{cases}$$
For finite horizon problems, or problems with $\gamma < 1$, we may approximate the value of the loop based on the number of remaining steps and the discount parameter. However, note that most frequently loops with a net positive or negative return are a domain artifact, as the solution of a (real-world) sequential decision making task is seldom to repeat the same action loop forever.
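The three cases for the value of a loop can be sketched as below. This is an illustrative approximation, not the authors' implementation: the name `loop_value` and its arguments are hypothetical, `loop_rewards` is assumed to be a non-empty list of the rewards collected in one pass through the loop, and the finite-horizon case simply replays the loop for the remaining steps.

```python
import math


def loop_value(loop_rewards, gamma=1.0, remaining_steps=None):
    """Estimate the return of repeating a detected loop.

    loop_rewards: rewards of one pass through the loop (non-empty).
    remaining_steps: if given, the loop is replayed for exactly this many steps.
    """
    if remaining_steps is not None:
        # Finite horizon: cycle through the loop rewards until time runs out.
        return sum(gamma ** t * loop_rewards[t % len(loop_rewards)]
                   for t in range(remaining_steps))
    if gamma < 1.0:
        # Discounted infinite horizon: geometric series over repeated passes.
        one_pass = sum(gamma ** t * r for t, r in enumerate(loop_rewards))
        return one_pass / (1.0 - gamma ** len(loop_rewards))
    # Undiscounted infinite horizon: only the sign of the loop sum matters.
    total = sum(loop_rewards)
    if total > 0:
        return math.inf
    if total < 0:
        return -math.inf
    return 0.0


print(loop_value([0.0, 0.0]))              # neutral loop
print(loop_value([1.0], gamma=0.5))        # discounted, repeated forever
print(loop_value([1.0, 0.0], remaining_steps=3))
```

A zero-sum loop, such as stepping back and forth in a navigation task without step rewards, receives value 0 in the undiscounted case, so it neither attracts nor repels the search by itself.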
In larger state spaces, exact loops are rare. We therefore check for approximate loops, where the looped state is very similar to a state above. We mark a new leaf state $s_L$ as looped when for any state above it in the trace, $s_i$ with $i < L$, the L2-norm of the difference with the new expanded state is below a tunable threshold $\eta$:
$$\| s_L - s_i \|_2 < \eta$$
Once a loop is detected, we set $\sigma_\tau(s_L) = 0$, and apply all methodology from the previous section.
Note that a simpler solution to blocking loops could be to completely remove the parent action of a looped state from the tree. We present the above formulation to i) be robust against situations where the loop is relevant, and ii) conceptually show what a loop implies: a state with an empty subtree below it ($\sigma_\tau = 0$).
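Approximate loop detection amounts to a distance check against the states earlier in the current trace. The sketch below assumes states are fixed-length numeric vectors; the names `is_looped` and `new_leaf_uncertainty` and the default threshold are hypothetical choices for illustration.

```python
import math


def is_looped(new_state, ancestors, threshold=1e-3):
    """True if new_state lies within `threshold` (L2 distance) of any ancestor."""
    return any(math.dist(new_state, s) < threshold for s in ancestors)


def new_leaf_uncertainty(new_state, ancestors, terminal, threshold=1e-3):
    """Remaining uncertainty of a new leaf: 0 if terminal or looped, else 1."""
    if terminal or is_looped(new_state, ancestors, threshold):
        return 0.0
    return 1.0


trace = [(0.0, 0.0), (1.0, 0.0)]          # states visited earlier in the trace
print(new_leaf_uncertainty((0.0, 0.0004), trace, terminal=False))
print(new_leaf_uncertainty((2.0, 0.0), trace, terminal=False))
```

The check is linear in the depth of the trace for every expanded leaf, which is consistent with loop detection adding some cost in larger state spaces.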
4.2 Results on Chain with loops
We illustrate the performance of MCTS-T+ on the Chain with loops (Figure 4). The results are shown in Figure 5. We observe a similar pattern as in the previous section, where MCTS only (partially) solves the shorter chains, but does not solve the longer chains at all. In contrast, MCTS-T+ does efficiently solve the longer chains as well. Note that MCTS-T (without loop detection) does not solve this problem either (curves not shown), as the loops prevent any termination, and therefore all $\sigma_\tau$ estimates stay at 1.
5 Experiments
The previous experiments, on the Chain and Chain with loops, present extreme cases of variation in subtree depth and the presence of loops. They are example cases to show the worst-case performance of MCTS in such scenarios, but are not very representative of most problems in the RL community. We therefore compare our algorithm to standard MCTS on several reinforcement learning tasks from the OpenAI Gym repository [Brockman et al.2016]: CartPole, FrozenLake and the Atari games Pong and AirRaid.
These results are visualized in Figure 6. We see that MCTS-T and MCTS-T+ consistently perform equal to or better than MCTS. This difference seems more pronounced for smaller MCTS search budgets. This seems to make sense, since the $\sigma_\tau$ machinery is especially applicable when we want to squeeze as much information out of our traces as possible.
Note that the search budgets are relatively small compared to most tree search implementations. We will return to this point in the discussion. The computational overhead of MCTS-T itself is negligible (compared to the environment simulations). For MCTS-T+, loop detection does incur some cost in larger state spaces. In the worst case, on Atari, MCTS-T+ incurs a 10% increase in computation time.
6 Related Work
The closest related work is probably MCTS-Solver [Winands et al.2008], designed for two-player, zero-sum games. In MCTS-Solver, once a subtree is enumerated, the action link above it is associated with its game-theoretical value $+\infty$ (forced win) or $-\infty$ (forced loss). It then uses specific back-up mechanisms (e.g., if one child action is a win, then the parent node is a win, and if all child actions are a loss, then the parent node is a loss). Compared to MCTS-Solver, our approach can be seen as a soft variant, where we gradually squeeze arms based on their estimated subtree size, instead of only squeezing completely once we fully enumerated the arm. Moreover, our approach is more generally applicable: it does not have any constraints on the reward functions (like win/loss), nor does it use back-up rules that are specific to two-player, zero-sum games. As such, MCTS-Solver would not be applicable to the problems studied in this paper.
Other related work has focused on maintaining confidence bounds on the value of internal nodes, first introduced in B* [Berliner1981]. For example, score-bounded MCTS [Cazenave and Saffidine2010] propagates explicit upper and lower bounds through the tree, and then prunes the tree based on alpha-beta style cuts [Knuth and Moore1975]. This approach is only applicable to two-player games with minimax structure, while our approach is more general. Tesauro et al. [Tesauro et al.2012] present an MCTS variant that propagates Bayesian uncertainty bounds. This approach is robust against variation in subtree size (not against loops), but requires priors on the confidence bounds, and will generally be quite conservative. One of the benefits of MCTS is that it gradually starts to ignore certain subtrees, without ever enumerating them, a property that is preserved in our approach.
While MCTS is a regret minimizing algorithm, a competing formulation, known as best-arm identification [Audibert and Bubeck2010, Kaufmann and Koolen2017], only cares about the final recommendation. Our approach also departs from the regret minimization objective, by putting additional exploration pressure on arms that have more remaining uncertainty. Finally, our solution also bears connections to RL exploration research papers that use the return distribution to enhance exploration [Moerland et al.2018, Tang and Agrawal2018], which implicitly may perform a similar mechanism as described in this paper.
7 Discussion
This paper introduced MCTS-T+, an MCTS extension that is robust against variation in subtree size and loops. We will briefly cover some potential criticism and future extensions of our approach.
From a games perspective, one could argue that our method is only useful in the endgame, when the search is relatively simple anyway (compared to the midgame). While this is true in two-player games, such as Go and Chess, many single-player reinforcement learning tasks, as studied in this paper, tend to have terminating arms right from the start (like dying in a shooting game), or many loops (like navigation tasks where we step back and forth). Our results are especially useful for the latter scenarios.
Our method seems predominantly beneficial with relatively small search budgets per timestep, compared to the budgets typically expended for search on two-player games. We do see three important ways in which our approach is relevant. First, real-time search with a limited time budget, as for example present in robotics applications, will benefit from maximum data efficiency. Second, we have recently seen a surge of success in iterated search and learning paradigms, like AlphaGo Zero [Silver et al.2017], which nest a small search within a learning loop. Such approaches definitely require an effective small search. Finally, we believe our approach is also conceptually relevant in itself, since it identifies a second type of uncertainty that is not frequently identified in MCTS, nor individually studied.
A limitation of our algorithm may occur when a sparse reward is hiding within an otherwise poorly returning subtree. In such scenarios, we risk squeezing out much exploration pressure based on initial traces that do not hit the sparse reward. However, MCTS itself suffers from the same problem, as its success also builds on the idea that the pay-offs of leaves in a subtree show correlation. Although MCTS does have asymptotic guarantees [Kocsis and Szepesvári2006], it will generally also take very long on such sparse reward problems. This is almost inevitable, since these problems have so little structure that they technically require exhaustive search.
Note that in large domains without early termination, like the game of Go, MCTS-T will behave exactly like MCTS for a long time. As long as there is no expand step that reaches a terminal node, all $\sigma_\tau$ estimates remain at 1, and MCTS-T exactly reduces to MCTS. This gives the algorithm a sense of robustness: it exploits variation in subtree depth when possible, but otherwise automatically reduces to standard MCTS.
There are several directions for future work. First, the approach could be generalized to deal with stochastic and partially observable environments. Another direction would be to generalize information about $\sigma_\tau$, for example by training a neural network that predicts this quantity. Finally, the $\sigma_\tau$ mechanism may also suggest when a search can be stopped (e.g., when $\sigma_\tau = 0$ at the root). Time management for MCTS has been studied before, for example by Huang et al. [Huang et al.2010].
8 Conclusion
This paper introduces MCTS-T+, an extension to vanilla MCTS that estimates the depth of subtrees below actions, uses these to better target exploration, and uses the same mechanism to deal with loops in the search. Empirical results indicate that MCTS-T+ performs on par with or better than standard MCTS on several illustrative tasks and OpenAI Gym experiments, especially for smaller planning budgets. The method is simple to implement, has negligible computational overhead, and, in the absence of termination, stays equal to standard MCTS. It can be useful in single-player RL tasks with frequent termination and loops, real-time planning with limited time budgets, and iterated search and learning paradigms with small nested searches. Altogether, the paper also provides a conceptual introduction of a type of uncertainty that standard MCTS does not account for.
References
- [Audibert and Bubeck2010] Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. 2010.
- [Auer et al.2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
- [Bellemare et al.2013] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- [Berliner1981] Hans Berliner. The B* tree search algorithm: A best-first proof procedure. In Readings in Artificial Intelligence, pages 79–87. Elsevier, 1981.
- [Brockman et al.2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
- [Browne et al.2012] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012.
- [Cazenave and Jouandeau2007] Tristan Cazenave and Nicolas Jouandeau. On the parallelization of UCT. In proceedings of the Computer Games Workshop, pages 93–101. Citeseer, 2007.
- [Cazenave and Saffidine2010] Tristan Cazenave and Abdallah Saffidine. Score bounded Monte-Carlo tree search. In International Conference on Computers and Games, pages 93–104. Springer, 2010.
- [Chaslot et al.2008] Guillaume Chaslot, Sander Bakkes, Istvan Szita, and Pieter Spronck. Monte-Carlo Tree Search: A New Framework for Game AI. In AIIDE, 2008.
- [Coulom2006] Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In International conference on computers and games, pages 72–83. Springer, 2006.
- [Huang et al.2010] Shih-Chieh Huang, Remi Coulom, and Shun-Shii Lin. Time management for Monte-Carlo tree search applied to the game of Go. In 2010 International Conference on Technologies and Applications of Artificial Intelligence, pages 462–466. IEEE, 2010.
- [Kaufmann and Koolen2017] Emilie Kaufmann and Wouter M Koolen. Monte-carlo tree search by best arm identification. In Advances in Neural Information Processing Systems, pages 4897–4906, 2017.
- [Kempka et al.2016] Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. Vizdoom: A doom-based ai research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pages 1–8. IEEE, 2016.
- [Knuth and Moore1975] Donald E Knuth and Ronald W Moore. An analysis of alpha-beta pruning. Artificial intelligence, 6(4):293–326, 1975.
- [Kocsis and Szepesvári2006] Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In ECML, volume 6, pages 282–293. Springer, 2006.
- [Moerland et al.2018] Thomas M Moerland, Joost Broekens, and Catholijn M Jonker. The Potential of the Return Distribution for Exploration in RL. arXiv preprint arXiv:1806.04242, 2018.
- [Munos and others2014] Rémi Munos et al. From bandits to Monte-Carlo Tree Search: The optimistic principle applied to optimization and planning. Foundations and Trends® in Machine Learning, 7(1):1–129, 2014.
- [Osband et al.2016] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and Exploration via Randomized Value Functions. In International Conference on Machine Learning, pages 2377–2386, 2016.
- [Plaat et al.1996] Aske Plaat, Jonathan Schaeffer, Wim Pijls, and Arie De Bruin. Exploiting graph properties of game trees. In AAAI/IAAI, Vol. 1, pages 234–239, 1996.
- [Rosin2011] Christopher D Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.
- [Silver et al.2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- [Silver et al.2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
- [Sutton and Barto2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An Introduction. MIT press Cambridge, second edition, 2018.
- [Tang and Agrawal2018] Yunhao Tang and Shipra Agrawal. Exploration by distributional reinforcement learning. arXiv preprint arXiv:1805.01907, 2018.
- [Tesauro et al.2012] Gerald Tesauro, VT Rajan, and Richard Segal. Bayesian inference in monte-carlo tree search. arXiv preprint arXiv:1203.3519, 2012.
- [Winands et al.2008] Mark HM Winands, Yngvi Björnsson, and Jahn-Takeshi Saito. Monte-Carlo tree search solver. In International Conference on Computers and Games, pages 25–36. Springer, 2008.