MCTS, and especially UCT, appear in numerous search applications. Although these methods have been shown to be successful empirically, most authors appear to use UCT “because it has been shown to be successful in the past” and “because it does a good job of trading off exploration and exploitation”. While the latter statement may be correct for the Multi-armed Bandit problem and for the UCB1 algorithm, we argue that a simple reconsideration from basic principles can result in schemes that outperform UCT.
The core issue is that in MCTS for adversarial search and for search in “games against nature” the goal is typically to find the best first action of a good (or even optimal) policy, which is closer to minimizing the simple regret than to minimizing the cumulative regret, the quantity minimized by UCB1. However, the simple and the cumulative regret cannot be minimized simultaneously; moreover, it has been shown that in many cases the smaller the cumulative regret, the greater the simple regret.
We begin with background definitions and related work. Value of information (VOI) estimates for arm pulls in MAB are then presented, and a VOI-aware sampling policy is suggested, both for the simple regret in MAB and for MCTS. Finally, the performance of the proposed sampling policy is evaluated on sets of Bernoulli arms and on Computer Go, showing improved performance.
2 Background and Related Work
Monte-Carlo tree search was initially suggested as a scheme for finding approximately optimal policies for Markov Decision Processes (MDP). MCTS explores an MDP by performing rollouts—trajectories from the current state to a state in which a termination condition is satisfied (either the goal or a cutoff state).
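A rollout as described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the callback interface (`actions`, `step`, `is_goal`), the uniformly random action choice, and the toy chain MDP are all assumptions made for the example.

```python
import random
from typing import Callable, Hashable, List, Tuple

State = Hashable
Action = Hashable


def rollout(
    state: State,
    actions: Callable[[State], List[Action]],
    step: Callable[[State, Action, random.Random], Tuple[State, float]],
    is_goal: Callable[[State], bool],
    max_depth: int,
    rng: random.Random,
) -> float:
    """Follow random actions until a goal or the depth cutoff; return the total reward."""
    total = 0.0
    for _ in range(max_depth):      # cutoff: stop after max_depth steps
        if is_goal(state):          # goal: stop as soon as it is reached
            break
        action = rng.choice(actions(state))
        state, reward = step(state, action, rng)
        total += reward
    return total


# Hypothetical toy MDP: a chain of states 0..5, goal at 5, every step costs 1.
def chain_actions(state: int) -> List[int]:
    return [-1, +1]


def chain_step(state: int, action: int, rng: random.Random) -> Tuple[int, float]:
    return max(0, min(5, state + action)), -1.0


def chain_is_goal(state: int) -> bool:
    return state == 5


if __name__ == "__main__":
    print(rollout(0, chain_actions, chain_step, chain_is_goal, 20, random.Random(0)))
```

In a full MCTS the action at each step would be chosen by a sampling policy rather than uniformly at random; the random choice here only keeps the sketch short.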
Taking a sequence of samples in order to minimize the regret of a decision based on the samples is captured by the Multi-armed Bandit problem (MAB). In MAB, we have a set of $K$ arms. Each arm can be pulled multiple times. When the $i$th arm is pulled, a random reward $X_i$ from an unknown stationary distribution is encountered. In the cumulative setting, all encountered rewards are collected; UCB1 was shown to be near-optimal in this respect. UCT, an extension of UCB1 to MCTS, has been shown to outperform many state-of-the-art search algorithms in both MDP and adversarial search [5, 4]. In the simple regret setting, the agent gets to collect only the reward of the last pull.
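For reference, the UCB1 selection rule can be sketched as follows. The sketch uses the standard index, the sample mean plus $\sqrt{2 \ln n / n_i}$, for rewards in $[0, 1]$; the function and argument names are illustrative assumptions.

```python
import math
from typing import List


def ucb1_select(means: List[float], counts: List[int]) -> int:
    """Return the index of the arm that UCB1 pulls next."""
    # Every arm is pulled once before the index is defined.
    for i, n_i in enumerate(counts):
        if n_i == 0:
            return i
    total = sum(counts)
    indices = [
        mean + math.sqrt(2.0 * math.log(total) / n_i)
        for mean, n_i in zip(means, counts)
    ]
    return max(range(len(indices)), key=indices.__getitem__)


if __name__ == "__main__":
    # The rarely pulled arm wins because of its large exploration bonus.
    print(ucb1_select([0.6, 0.5], [100, 1]))
```

The exploration bonus shrinks as an arm accumulates pulls, so the rule keeps returning to arms with high sample means, which is what makes it suited to the cumulative setting.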
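The simple regret of a sampling policy for MAB is the expected difference between the best expected reward $\mu_*$ and the expected reward $\mu_j$ of the empirically best arm, that is, the arm $j$ whose sample mean satisfies $\overline{X}_j = \max_i \overline{X}_i$:
$$\mathbb{E}r = \sum_{j=1}^{K} \Delta_j \Pr\left(\overline{X}_j = \max_i \overline{X}_i\right) \tag{1}$$
where $\Delta_j = \mu_* - \mu_j$ and $K$ is the number of arms.
Strategies that minimize the simple regret are called pure exploration strategies.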
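As a small worked illustration of the definition, the simple regret is a weighted sum: the gap between the best expected reward and each arm's expected reward, weighted by the probability that this arm ends up empirically best. The sketch below assumes these probabilities are given; the function name, argument names, and the numbers in the example are hypothetical.

```python
from typing import Sequence


def simple_regret(mu: Sequence[float], p_chosen: Sequence[float]) -> float:
    """Expected simple regret, given true means and selection probabilities."""
    best = max(mu)
    return sum((best - m) * p for m, p in zip(mu, p_chosen))


if __name__ == "__main__":
    # Three arms; the best arm (mean 0.7) is selected 70% of the time.
    print(simple_regret([0.5, 0.7, 0.4], [0.2, 0.7, 0.1]))
```

Only the selection probabilities of suboptimal arms contribute, so a policy reduces the simple regret by lowering the chance that a suboptimal arm has the highest sample mean when sampling stops.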
3 Upper Bounds on Value of Information
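The intrinsic VOI $\Lambda_i$ of pulling an arm is the expected decrease in the regret compared to selecting the best arm without pulling any arm at all. Two cases are possible:
- the arm $\alpha$ with the highest sample mean $\overline{X}_\alpha$ is pulled, and $\overline{X}_\alpha$ becomes lower than the sample mean $\overline{X}_\beta$ of the second-best arm $\beta$;
- another arm $i$ is pulled, and $\overline{X}_i$ becomes higher than $\overline{X}_\alpha$.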
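The myopic VOI estimate is of limited applicability to Monte-Carlo sampling, since the effect of a single sample is small, and the myopic VOI estimate will often be zero. However, for the common case of a fixed budget of samples per node, the VOI of pulling the $i$th arm can be estimated as the intrinsic VOI $\Lambda_i^b$ of pulling the $i$th arm for the rest of the budget. Let us denote the current number of samples of the $i$th arm by $n_i$, and the remaining number of samples by $N$. Let $\alpha$ be the arm with the highest sample mean and $\beta$ the second-best arm, and let the rewards lie in $[0, 1]$.
Theorem 1. $\Lambda_i^b$ is bounded from above as
$$\begin{aligned}
\Lambda_\alpha^b &\le \frac{N \overline{X}_\beta^{n_\beta}}{N + n_\alpha} \Pr\left(\overline{X}_\alpha^{n_\alpha + N} \le \overline{X}_\beta^{n_\beta}\right) \\
\Lambda_{i \mid i \ne \alpha}^b &\le \frac{N \left(1 - \overline{X}_\alpha^{n_\alpha}\right)}{N + n_i} \Pr\left(\overline{X}_i^{n_i + N} \ge \overline{X}_\alpha^{n_\alpha}\right)
\end{aligned} \tag{2}$$
where $\overline{X}_i^{n_i + N}$ is the sample mean of the $i$th arm after $n_i + N$ samples.
Theorem 2. The probabilities in equations (2) are bounded from above as
$$\begin{aligned}
\Pr\left(\overline{X}_\alpha^{n_\alpha + N} \le \overline{X}_\beta^{n_\beta}\right) &\le 2 \exp\left(-\varphi(N, n_\alpha) \left(\overline{X}_\alpha^{n_\alpha} - \overline{X}_\beta^{n_\beta}\right)^2 n_\alpha\right) \\
\Pr\left(\overline{X}_i^{n_i + N} \ge \overline{X}_\alpha^{n_\alpha}\right) &\le 2 \exp\left(-\varphi(N, n_i) \left(\overline{X}_\alpha^{n_\alpha} - \overline{X}_i^{n_i}\right)^2 n_i\right)
\end{aligned} \tag{3}$$
where $\varphi(N, n) = 2 \left(\frac{1 + n/N}{1 + \sqrt{n/N}}\right)^2 > 1.37$.
Corollary 1. Substituting (3) into (2) gives the VOI estimates
$$\begin{aligned}
\Lambda_\alpha^b &\le \hat\Lambda_\alpha^b = \frac{2 N \overline{X}_\beta^{n_\beta}}{N + n_\alpha} \exp\left(-1.37 \left(\overline{X}_\alpha^{n_\alpha} - \overline{X}_\beta^{n_\beta}\right)^2 n_\alpha\right) \\
\Lambda_{i \mid i \ne \alpha}^b &\le \hat\Lambda_{i \mid i \ne \alpha}^b = \frac{2 N \left(1 - \overline{X}_\alpha^{n_\alpha}\right)}{N + n_i} \exp\left(-1.37 \left(\overline{X}_\alpha^{n_\alpha} - \overline{X}_i^{n_i}\right)^2 n_i\right)
\end{aligned}$$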
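The upper bounds can be turned into a computable per-arm estimate. The following is a minimal sketch under stated assumptions: rewards lie in $[0, 1]$, every arm has been pulled at least once, and the probability bound takes the Hoeffding-style form $2\exp(-1.37\, d^2 n_i)$, where $d$ is the gap between the sample mean of the best arm and that of the arm in question (the second-best arm when the best arm itself is considered). The function name, the argument names, and the use of the constant 1.37 in this form are assumptions of the sketch.

```python
import math
from typing import List


def voi_estimates(means: List[float], counts: List[int], budget: int) -> List[float]:
    """Upper-bound VOI estimate per arm for spending `budget` more pulls on that arm.

    Assumes rewards in [0, 1], at least two arms, and counts[i] >= 1 for all i.
    """
    order = sorted(range(len(means)), key=means.__getitem__, reverse=True)
    alpha, beta = order[0], order[1]    # best and second-best sample means
    estimates = []
    for i, (mean, n_i) in enumerate(zip(means, counts)):
        if i == alpha:
            gain = means[beta]                  # leading factor for the best arm
            gap = means[alpha] - means[beta]
        else:
            gain = 1.0 - means[alpha]           # leading factor for any other arm
            gap = means[alpha] - mean
        estimates.append(
            2.0 * budget * gain / (budget + n_i) * math.exp(-1.37 * gap * gap * n_i)
        )
    return estimates


if __name__ == "__main__":
    print(voi_estimates([0.6, 0.5, 0.2], [10, 10, 10], 10))
```

The estimate decays exponentially in both the gap to the best arm and the number of samples already taken, so arms that are clearly worse, or already well sampled, receive little further attention.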
4 VOI-based Sample Allocation
Following the principles of rational metareasoning, for pure exploration in Multi-armed Bandits an arm with the highest VOI should be pulled at each step. The upper bounds established in Corollary 1 can be used as VOI estimates. In MCTS, pure exploration takes place at the first step of a rollout, where an action with the highest utility must be chosen. MCTS differs from pure exploration in Multi-armed Bandits in that the distributions of the rewards are not stationary. However, VOI estimates computed as for stationary distributions work well in practice. As illustrated by the empirical evaluation (Section 5), estimates based on upper bounds on the VOI result in a rational sampling policy exceeding the performance of some state-of-the-art heuristic algorithms.
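The resulting allocation rule for the bandit case can be sketched as a loop that, at every step, pulls the arm with the highest VOI estimate. This is an illustration under assumptions rather than the authors' implementation: arms are simulated Bernoulli arms, each arm is pulled once to initialize its sample mean, the remaining budget plays the role of $N$, and the estimate uses an assumed Hoeffding-style form with the constant 1.37. All names are hypothetical.

```python
import math
import random
from typing import List, Tuple


def voi_upper_bound(i: int, means: List[float], counts: List[int], remaining: int) -> float:
    """Assumed Hoeffding-style VOI estimate for giving arm i the remaining pulls."""
    order = sorted(range(len(means)), key=means.__getitem__, reverse=True)
    alpha, beta = order[0], order[1]
    if i == alpha:
        gain, gap = means[beta], means[alpha] - means[beta]
    else:
        gain, gap = 1.0 - means[alpha], means[alpha] - means[i]
    scale = 2.0 * remaining * gain / (remaining + counts[i])
    return scale * math.exp(-1.37 * gap * gap * counts[i])


def voi_aware_best_arm(
    probs: List[float], budget: int, rng: random.Random
) -> Tuple[int, List[int]]:
    """Spend `budget` pulls (budget >= number of arms >= 2) on Bernoulli arms.

    Returns the empirically best arm and the number of pulls per arm.
    """
    k = len(probs)
    counts, sums = [0] * k, [0.0] * k

    def pull(i: int) -> None:
        sums[i] += 1.0 if rng.random() < probs[i] else 0.0
        counts[i] += 1

    for i in range(k):                  # one initial pull per arm
        pull(i)
    for used in range(k, budget):
        means = [s / n for s, n in zip(sums, counts)]
        remaining = budget - used
        voi = [voi_upper_bound(i, means, counts, remaining) for i in range(k)]
        pull(max(range(k), key=voi.__getitem__))    # pull the arm with the highest VOI
    means = [s / n for s, n in zip(sums, counts)]
    return max(range(k), key=means.__getitem__), counts


if __name__ == "__main__":
    print(voi_aware_best_arm([0.2, 0.8, 0.5], 200, random.Random(1)))
```

In MCTS the same rule would be applied only to the action choice at the first step of a rollout, with the rest of the rollout left to the existing policy.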
5 Empirical Evaluation
5.1 Selecting The Best Arm
The sampling policies are first compared on random Multi-armed Bandit problem instances. Figure 1 shows results for randomly generated Multi-armed Bandits with 32 Bernoulli arms, with the mean rewards of the arms distributed uniformly at random, for a range of sample budgets spaced by a constant multiplicative step. The experiment for each number of samples was repeated 10000 times. UCB1 is always considerably worse than the VOI-aware sampling policy.
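An experiment of this kind can be sketched as a small harness that averages the simple regret of a policy over random Bernoulli instances. The harness below is an assumption-laden illustration: mean rewards are drawn uniformly from $[0, 1)$, the policy interface is hypothetical, and the round-robin policy is only a stand-in baseline, not one of the policies compared in the paper.

```python
import random
from typing import Callable, List

# A policy receives the true success probabilities (used only to simulate pulls),
# the sample budget, and a random generator; it returns the arm it finally selects.
Policy = Callable[[List[float], int, random.Random], int]


def round_robin_policy(probs: List[float], budget: int, rng: random.Random) -> int:
    """Stand-in baseline: pull the arms in turn, then pick the best sample mean."""
    k = len(probs)
    counts, sums = [0] * k, [0.0] * k
    for t in range(budget):
        i = t % k
        sums[i] += 1.0 if rng.random() < probs[i] else 0.0
        counts[i] += 1
    means = [s / n if n else 0.0 for s, n in zip(sums, counts)]
    return max(range(k), key=means.__getitem__)


def average_simple_regret(
    policy: Policy, n_arms: int, budget: int, repetitions: int, seed: int = 0
) -> float:
    """Average simple regret of `policy` over random Bernoulli bandit instances."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repetitions):
        probs = [rng.random() for _ in range(n_arms)]   # assumed: means uniform in [0, 1)
        chosen = policy(probs, budget, rng)
        total += max(probs) - probs[chosen]
    return total / repetitions


if __name__ == "__main__":
    for budget in (32, 64, 128):
        print(budget, average_simple_regret(round_robin_policy, 32, budget, 200))
```

Any sampling policy with the same signature can be passed in place of the baseline, which makes it straightforward to compare several policies on identical budgets.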
5.2 Playing Go Against UCT
The policies were also compared on Computer Go, a search domain in which UCT-based MCTS has been particularly successful. A modified version of Pachi, a state-of-the-art Go program, was used for the experiments. The UCT engine was extended with a VOI-aware sampling policy, and a time allocation mode was added to ensure that both the original UCT policy and the VOI-aware policy use the same average number of samples per node. (While the UCT engine is not the most powerful engine of Pachi, it is still a strong player; on the other hand, additional features of more advanced engines would obstruct the MCTS phenomena which are the subject of the experiment.)
The engines were compared on the 9x9 board, for 5000, 7000, 10000, and 15000 samples per ply; each experiment was repeated 1000 times. Figure 2 shows the winning rate of VOI against UCT vs. the number of samples. For most numbers of samples per node, VOI outperforms UCT.
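When reading winning rates obtained from a fixed number of games, it helps to keep the sampling error in mind. The sketch below computes the binomial standard error and a normal-approximation interval, assuming independent games; the function names and the win count in the example are hypothetical and are not results from the experiments.

```python
import math
from typing import Tuple


def win_rate_std_error(win_rate: float, games: int) -> float:
    """Standard error of a winning rate estimated from independent games."""
    return math.sqrt(win_rate * (1.0 - win_rate) / games)


def win_rate_interval(wins: int, games: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval (95% for z = 1.96)."""
    rate = wins / games
    half_width = z * win_rate_std_error(rate, games)
    return rate - half_width, rate + half_width


if __name__ == "__main__":
    # Hypothetical example: 550 wins in 1000 games.
    print(win_rate_std_error(0.5, 1000))
    print(win_rate_interval(550, 1000))
```

With 1000 games per setting, the standard error near a 50% winning rate is about 1.6 percentage points, so differences of a few percentage points are at the edge of what such an experiment can resolve.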
6 Summary and Future Work
This work suggested a Monte-Carlo sampling policy in which sample selection is based on upper bounds on the value of information. Empirical evaluation showed that this policy outperforms heuristic algorithms for pure exploration in MAB, as well as for MCTS.
MCTS remains a largely unexplored field of application of VOI-aware algorithms. More elaborate VOI estimates, taking into consideration re-use of samples in future search states, should be considered. The policy introduced in the paper differs from the UCT algorithm only at the first step, where the VOI-aware decisions are made. Consistent application of principles of rational metareasoning at all steps of a rollout may further improve the sampling.
- Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, ‘Finite-time analysis of the Multiarmed bandit problem’, Mach. Learn., 47, 235–256, (May 2002).
- Petr Baudiš and Jean-loup Gailly, ‘Pachi: State of the art open source Go program’, in ACG 13, (2011).
- Sébastien Bubeck, Rémi Munos, and Gilles Stoltz, ‘Pure exploration in finitely-armed and continuous-armed bandits’, Theor. Comput. Sci., 412(19), 1832–1852, (2011).
- Patrick Eyerich, Thomas Keller, and Malte Helmert, ‘High-quality policies for the Canadian traveler’s problem’, in Proc. AAAI 2010, pp. 51–58, (2010).
- Sylvain Gelly and Yizao Wang, ‘Exploration exploitation in Go: UCT for Monte-Carlo Go’, Computer, (2006).
- Nicholas Hay and Stuart J. Russell, ‘Metareasoning for Monte Carlo tree search’, Technical Report UCB/EECS-2011-119, EECS Department, University of California, Berkeley, (Nov 2011).
- Wassily Hoeffding, ‘Probability inequalities for sums of bounded random variables’, Journal of the American Statistical Association, 58(301), pp. 13–30, (1963).
- Eric J. Horvitz, ‘Reasoning about beliefs and actions under computational resource constraints’, in Proceedings of the 1987 Workshop on Uncertainty in Artificial Intelligence, pp. 429–444, (1987).
- Levente Kocsis and Csaba Szepesvári, ‘Bandit based Monte-Carlo planning’, in ECML, pp. 282–293, (2006).
- Stuart Russell and Eric Wefald, Do the right thing: studies in limited rationality, MIT Press, Cambridge, MA, USA, 1991.
- Joannès Vermorel and Mehryar Mohri, ‘Multi-armed bandit algorithms and empirical evaluation’, in ECML, pp. 437–448, (2005).