1 Introduction
Monte Carlo Tree Search (MCTS) is a popular tree-based search strategy within the framework of reinforcement learning (RL), which estimates the optimal value of a state and action by building a tree with Monte Carlo simulation. It has been widely used in sequential decision-making problems, including scheduling, inventory and production management, and real-world games such as Go, Chess, Tic-tac-toe, and Chinese Checkers. See Browne et al. (2012), Fu (2018), and Świechowski et al. (2021) for thorough overviews. MCTS uses little or no domain knowledge and improves itself by running more simulations. Many variations of MCTS have been proposed to improve its performance. In particular, deep neural networks have been combined with MCTS to achieve remarkable success in the game of Go
(Silver et al. 2016, 2017).
A basic MCTS builds a game tree from the root node in an incremental and asymmetric manner, where nodes correspond to states and edges correspond to possible state-action pairs. In each round of MCTS, a tree policy is used to find a node from which a rollout (simulation) is then performed, and the nodes on the collected search path are updated according to the received terminal reward. Moves are made during the rollout by a default policy, which in the simplest case makes uniform random moves. Unlike depth-limited minimax search, which needs to evaluate the values of intermediate states, MCTS evaluates only the reward of the terminal state at the end of each rollout, which greatly reduces the amount of domain knowledge required. The best action at the root node is selected based on the information collected from simulations after the computational budget is exhausted. The tree policy plays a vital role in the success of MCTS since it determines how the tree is built and how the computational budget is allocated among simulations. The key issue is to balance the exploration of nodes that have not been well sampled yet and the exploitation of nodes that appear to be promising. In our work, we propose a new tree policy to improve the performance of MCTS.
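To make the four phases of a round concrete, the loop described above can be sketched as a short generic implementation. This is a minimal illustration, not the paper's algorithm; the function and argument names (`step`, `rollout_reward`, `is_terminal`) are placeholders for a game's actual rules.

```python
import math
import random

class Node:
    """A tree node holding MCTS statistics for one state."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}        # action -> child Node
        self.visits = 0
        self.total_reward = 0.0

def mcts(root_state, actions, step, rollout_reward, is_terminal,
         n_rollouts=500, c=math.sqrt(2)):
    """One generic MCTS run: selection (UCT), expansion, simulation,
    backpropagation; returns the root action with the most visits."""
    root = Node(root_state)
    for _ in range(n_rollouts):
        node = root
        # Selection: descend via UCT while the node is fully expanded.
        while node.children and len(node.children) == len(actions):
            node = max(node.children.values(),
                       key=lambda ch: ch.total_reward / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
        # Expansion: add one untried child of a non-terminal node.
        if not is_terminal(node.state):
            a = random.choice([a for a in actions if a not in node.children])
            node.children[a] = Node(step(node.state, a), parent=node)
            node = node.children[a]
        # Simulation: the default policy evaluates a terminal reward.
        reward = rollout_reward(node.state)
        # Backpropagation: update statistics along the search path.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].visits)
```

On a toy one-move game where one action yields strictly higher stochastic rewards than the others, this loop concentrates visits on that action and returns it as the recommended move.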
One of the most popular tree policies in MCTS is the Upper Confidence Bounds for Trees (UCT) algorithm, proposed by applying the Upper Confidence Bound (UCB1) algorithm (Auer et al. 2002), originally designed for stochastic multi-armed bandit (MAB) problems, to each node of the tree (Kocsis and Szepesvári 2006, Kocsis et al. 2006). The stochastic MAB is a well-known sequential decision problem in which the goal is to maximize the expected total reward over finitely many rounds by choosing among finitely many actions (also known as arms of slot machines in the MAB literature) to sample. Other variants of bandit-based methods have been developed for the tree policy. Auer et al. (2002) introduce UCB1-Tuned to tune the bounds of UCB1 more finely. Tesauro et al. (2010) suggest a Bayesian framework, motivated by its more accurate estimation of the values and uncertainties of nodes under a limited computational budget. Teytaud and Flory (2011) employ Exploration-Exploitation with Exponential weights in conjunction with UCT to deal with partially observable games with simultaneous moves. Mansley et al. (2011) incorporate Hierarchical Optimistic Optimization into rollout planning, overcoming the limitation of UCT for a continuous decision space. Teraoka et al. (2014)
propose a tree policy that selects the node with the largest confidence interval, inspired by the Best Arm Identification (BAI) problem in the MAB literature (Bubeck and Cesa-Bianchi 2012), and Kaufmann and Koolen (2017) further extend their results to a tighter upper bound. However, both tree policies are pure exploration policies and are developed only for min-max game trees.
Although the goal in MCTS is very similar to that of the MAB problem, i.e., choosing the action with the best average reward at a given state, their setups differ in many ways. Stochastic rewards are collected in every round of MAB, whereas in MCTS a reward is observed only at the end of each rollout. Most bandit-based methods assume that rewards are bounded with a known range, typically $[0,1]$; however, a more general tree search problem has node values with an unknown and unbounded range. A common objective function of bandit-based methods is the cumulative regret, i.e., the expected sum of differences between the performance of the best arm and that of the arm chosen for sampling. Li et al. (2021)
show that algorithms designed to minimize regret tend to discourage exploration. In addition to the differences mentioned above, most bandit-based tree policies consider only the average value and the number of visits of nodes, and do not utilize other available information such as variances. These findings lead us to formulate the tree policy as a statistical ranking and selection (R&S) problem (Chen and Lee 2011, Powell and Ryzhov 2012) that has been actively studied in simulation optimization. In statistical R&S, the goal is to efficiently allocate a limited computational budget among finitely many actions (also known as alternatives in the R&S literature) so that the probability of correct selection (PCS) for the best action is maximized. The samples of each action are usually assumed to be independent and identically Gaussian distributed with known variances, and a reward is collected after the computational budget is exhausted. Although BAI and R&S share the same goal, they make different assumptions on the distributions of samples; in particular, the former assumes samples to be bounded or sub-Gaussian distributed.
In our work, we aim to maximize the PCS for the optimal action at the root node of the tree. We propose a dynamic sampling tree policy by applying the Asymptotically Optimal Allocation Policy (AOAP) algorithm (Peng et al. 2018), originally designed for statistical R&S problems. AOAP is a myopic sampling procedure that maximizes a value function approximation one step look ahead. The closest work to ours is Li et al. (2021), which proposes a tree policy by applying the Optimal Computing Budget Allocation (OCBA) algorithm (Chen et al. 2000, Chen and Lee 2011) to each node of the tree. The key algorithmic difference is that OCBA is developed from a static optimization problem and is designed to achieve good asymptotic behavior, whereas AOAP is derived in a stochastic dynamic programming framework that can capture the finite-sample behavior of a sampling policy. To implement OCBA in a fully sequential manner, Li et al. (2021) combine it with a "most starving" sequential rule. Our proposed tree policy removes the known and bounded assumption on node values, and balances exploration and exploitation to efficiently identify the optimal action. We demonstrate the efficiency of the new tree policy through numerical experiments on Tic-tac-toe and Gomoku.
The rest of the paper is organized as follows. Section 2 formulates the problem. The new tree policy and its convergence result are presented in Section 3. Section 4 provides numerical results. The last section concludes the paper.
2 Problem Formulation
We consider the setup of a finite-horizon discrete-time Markov decision process (MDP). An MDP is described by a four-tuple $(\mathcal{S}, \mathcal{A}, P, R)$ with a horizon length $H$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $P$ is the Markovian transition kernel, and $R$ is a random bounded reward function. The random reward can be discrete (win/draw/loss), continuous, or a vector of reward values relative to each agent in more complex multi-agent domains. We assume that $\mathcal{S}$ and $\mathcal{A}$ are finite sets and that $P$ is deterministic. The assumption of deterministic transitions is reasonable since traditional MCTS is introduced in the context of deterministic games with a tree representation. At each stage, the system is in a state $s \in \mathcal{S}$. After taking an action $a \in \mathcal{A}$, the state transits to the next state $s'$ and an immediate reward is generated according to $R$. A stationary policy $\pi$ specifies the probability $\pi(a \mid s)$ of performing action $a$ in the current state $s$. The value function $V^{\pi}(s)$ of each state under policy $\pi$ is its expected cumulative reward, and the state-action value function $Q^{\pi}(s, a)$ is defined analogously for each state-action pair. The optimal value function under the optimal policy is defined as $V^{*}(s) = \max_{\pi} V^{\pi}(s)$. The following Bellman equation holds: $V^{*}(s) = \max_{a \in \mathcal{A}} \{ R(s, a) + V^{*}(s') \}$, where $s'$ is the next state reached by applying action $a$ in state $s$.
For the tree search problem, let $s_d$ and $a_d$ be a state and an action at depth $d$ of the tree, respectively. We model the best action identification for every explored state node in the tree policy of MCTS as a separate R&S problem. All actions of the current state node are treated as alternatives. The optimal value of a state node is unknown. Each state-action pair $(s_d, a_d)$ has an unknown value $Q(s_d, a_d)$, which is estimated by random samples collected from rollouts: $N_t(s_d, a_d)$ denotes the number of visits to the next state after taking action $a_d$ at state $s_d$ in the first $t$ rollouts, each rollout records a search path from the root to a terminal state, and $T$ is the total number of rollouts (also known as the total simulation budget in the R&S literature); an indicator function, equal to 1 when the event in braces is true and 0 otherwise, is used to count whether a state-action pair lies on the search path of a given rollout. We assume that the samples of each state-action pair
are independent and identically distributed normal random variables, i.e., the samples of $Q(s_d, a_d)$ follow $N(Q(s_d, a_d), \sigma^2(s_d, a_d))$ with a known state-action variance $\sigma^2(s_d, a_d)$, where we suppress the depth index for simplicity of notation. The sample variance is used as a plug-in for $\sigma^2(s_d, a_d)$ in practice, computed from the samples around the sample mean $\bar{Q}_t(s_d, a_d)$. Under a Bayesian framework, we assume the prior distribution of $Q(s_d, a_d)$ is a conjugate prior of the sampling distribution, which is also a normal distribution, $N(\mu_0(s_d, a_d), \sigma_0^2(s_d, a_d))$. Then the posterior distribution of $Q(s_d, a_d)$ after $t$ rollouts is $N(\mu_t(s_d, a_d), \sigma_t^2(s_d, a_d))$, with posterior state-action variance
$$\sigma_t^2(s_d, a_d) = \left( \frac{1}{\sigma_0^2(s_d, a_d)} + \frac{N_t(s_d, a_d)}{\sigma^2(s_d, a_d)} \right)^{-1} \qquad (1)$$
and posterior state-action mean
$$\mu_t(s_d, a_d) = \sigma_t^2(s_d, a_d) \left( \frac{\mu_0(s_d, a_d)}{\sigma_0^2(s_d, a_d)} + \frac{N_t(s_d, a_d) \, \bar{Q}_t(s_d, a_d)}{\sigma^2(s_d, a_d)} \right). \qquad (2)$$
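As a sanity check on this conjugate update, the posterior mean and variance can be computed directly. The helper below is a minimal sketch with illustrative names, assuming a known sampling variance as in the text.

```python
def posterior_update(mu0, var0, sample_mean, sampling_var, n):
    """Normal-normal conjugate update for an unknown mean:
    prior N(mu0, var0), then n i.i.d. samples with known sampling
    variance `sampling_var` and average `sample_mean`."""
    if n == 0:
        return mu0, var0
    # Precisions (inverse variances) add; the data term grows with n.
    post_var = 1.0 / (1.0 / var0 + n / sampling_var)
    post_mean = post_var * (mu0 / var0 + n * sample_mean / sampling_var)
    return post_mean, post_var
```

With a very diffuse prior (large `var0`), the update reduces to the sample mean and `sampling_var / n`, which matches the uninformative case discussed in the text.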
Note that if $\sigma_0^2(s_d, a_d) \to \infty$, then the posterior variance reduces to $\sigma^2(s_d, a_d) / N_t(s_d, a_d)$ and the posterior mean reduces to the sample mean $\bar{Q}_t(s_d, a_d)$; such a case is called uninformative. We aim to identify the best action that achieves the highest state-action value at the initial state $s_0$, i.e., to find $a^* = \arg\max_{a \in \mathcal{A}(s_0)} Q(s_0, a)$, where $\mathcal{A}(s_0)$ is the set of actions at state $s_0$. A correct selection of the best action occurs when $\hat{a}_T = a^*$, where $\hat{a}_T$ is the estimated best action that achieves the highest posterior mean at the initial state after $T$ rollouts. The PCS for selecting $a^*$ can be expressed as $\mathrm{PCS} = \Pr(\hat{a}_T = a^*)$.
We aim to find an efficient dynamic sampling tree policy such that the PCS can be maximized. Compared with minimizing the expected cumulative regret in the canonical MAB problem, maximizing the PCS results in an allocation of the limited computational budget that optimally balances exploration and exploitation. Based on the information collected from simulations, a sampling policy is a sequence of mappings, where the $t$-th mapping allocates the $t$-th unit of computational budget to an action of the initial state based on the information collected throughout the first $t-1$ rollouts. The expected payoff for a dynamic sampling tree policy can be recursively defined in a stochastic dynamic programming problem by
and for ,
where the $t$-th sample is drawn for the allocated action. An optimal dynamic sampling tree policy can then be defined as the solution of this stochastic dynamic programming problem, where
contains the prior information. Such a stochastic dynamic programming problem can be viewed as an MDP, and the optimality condition of a dynamic sampling tree policy is then governed by the Bellman equation of this MDP. However, solving such an MDP typically suffers from the curse of dimensionality. In the R&S literature, Peng et al. (2018) find a suitable value function approximation (VFA) for the Bellman equations and use a further approximation of the VFA, which leads to the so-called AOAP algorithm that maximizes a VFA one step look ahead. Inspired by their work, we propose a tree policy that applies the AOAP algorithm to each node of the tree, leading to a dynamic sampling tree policy for MCTS.
3 A New Tree Policy
In this section, we first briefly describe the AOAP algorithm under the tree search setup. Then we propose a new tree policy for MCTS that finds the best action at each state node.
In the tree policy of MCTS, for each visited state node in the search path at the $t$-th rollout, the AOAP algorithm first identifies the action with the largest posterior state-action mean, and then calculates the following quantities:
(3) 
and for , ,
(4)  
where
After calculating these values for all actions, the AOAP algorithm selects the action with the largest one, i.e., it samples
(5) 
The MCTS algorithm using AOAP as its tree policy is named AOAP-MCTS. Compared with UCT, the tree policy based on AOAP utilizes posterior means and posterior variances, which incorporate the average value, the variance, and the number of visits of each node. The proposed tree policy attempts to balance the exploration of nodes with high variances and the exploitation of nodes with high state-action values. In implementation, if more than one action attains the maximal posterior state-action mean or the maximal selection criterion, the tie can be broken randomly or by choosing the action with a low frequency of visits and a large posterior state-action variance. In addition, notice that the variance information of a node appears as a denominator in the calculation of both the posterior state-action mean and variance; to ensure the variance stays positive, a small positive real number can be introduced when the sample variance is zero.
We highlight some major modifications to the canonical MCTS when using AOAP as the tree policy. First, statistics including the visit count, sample mean, sample variance, and posterior parameters are required to be stored for each node in the tree. The prior parameters $\mu_0$ and $\sigma_0^2$ can be specified and adjusted in implementation. Second, in order to calculate the selection criterion for each state-action node, each state node is required to be well expanded when it is visited, and each state-action pair and its corresponding child state-action node are required to be added to the search path. Each state-action pair is required to be sampled at least $n_0$ times, i.e., a state node is expandable when one of its child nodes has been visited fewer than $n_0$ times. Third, after receiving the terminal reward of the collected search path at the $t$-th rollout, all values of nodes in the collected search path are updated in reverse order, from the leaf node of the path back to the root, through:
(6) 
(7) 
(8) 
(9) 
(10) 
(11) 
(12) 
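One common way to realize this kind of backward pass is to maintain the count, mean, and sample variance online with Welford's update. The sketch below is illustrative rather than a reproduction of equations (6)-(12); the `Stats` container and its field names are assumptions.

```python
class Stats:
    """Illustrative per-node statistics container."""
    def __init__(self):
        self.visits = 0
        self.mean = 0.0       # running sample mean of rewards
        self.m2 = 0.0         # running sum of squared deviations (Welford)
        self.sample_var = 0.0

def backpropagate(path, reward):
    """Update every node on the search path, leaf to root, with the
    terminal reward, maintaining count, mean, and sample variance online."""
    for node in reversed(path):
        node.visits += 1
        delta = reward - node.mean
        node.mean += delta / node.visits
        node.m2 += delta * (reward - node.mean)   # Welford's update
        if node.visits > 1:
            node.sample_var = node.m2 / (node.visits - 1)
```

The posterior parameters in equations (1)-(2) can then be refreshed at each node from `mean`, `sample_var`, and `visits`.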
Algorithm 1 shows the pseudocode of the AOAP-MCTS algorithm. The AOAP-MCTS algorithm is run with $T$ rollouts from the root state node, after which a game tree is built and the estimated optimal action is found as the action of the root node with the highest posterior state-action mean. Notice that since we consider deterministic transitions, the tree is fixed once the root node is chosen. When a node in the tree is visited, the tree policy first determines whether the node is expandable. If there are state-action pairs that are not yet part of the tree, one of them is chosen randomly and added to the tree, and if there are state-action pairs visited fewer than $n_0$ times, one of them is chosen randomly. If all state-action pairs are well expanded, AOAP is used to determine the action to sample. A simulation is then run from the node reached during the tree policy stage, according to a default policy, until a terminal node has been reached. The reward of the terminal state is then backpropagated to all nodes collected in the search path during this round to update the node statistics.
Algorithm 1 The AOAP-MCTS algorithm
We now present a theoretical result regarding AOAP-MCTS.
Proposition 1
The proposed AOAP-MCTS is consistent, i.e., the estimated best action at each explored state node converges almost surely to the true best action as the number of rollouts goes to infinity.
At each explored state node in the tree policy, the best action is identified by the AOAP algorithm. As shown in Peng et al. (2018), AOAP is consistent, i.e., every alternative is sampled infinitely often almost surely as the computational budget goes to infinity, so that the best alternative is eventually selected. Following their analysis, Proposition 1 can be proved by induction. We leave the proof to future work.
4 Numerical Experiments
In this section, we conduct numerical experiments to test the performance of different tree policies for MCTS. We apply our proposed algorithm to the games of Tic-tac-toe and Gomoku. The proposed AOAP-MCTS is compared with UCT (Kocsis and Szepesvári 2006), OCBA-MCTS (Li et al. 2021), and TTTS-MCTS, which runs the tree policy using Top-Two Thompson Sampling (TTTS) (Russo 2020). We describe the three competing tree policies as follows:
UCT: The policy selects the action with the highest upper confidence bound, i.e., $\arg\max_a \bar{Q}(s, a) + c \sqrt{\ln N(s) / N(s, a)}$, where $\bar{Q}(s, a)$ is the average reward of action $a$, $N(s)$ and $N(s, a)$ are visit counts, and $c$ is the exploration weight specified in implementation.
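A concrete reading of this rule is the following small helper; the dictionary keys are illustrative, and the default exploration weight is an assumption, not the paper's setting.

```python
import math

def uct_score(avg_reward, n_node, n_action, c):
    """Upper confidence bound: exploitation term plus c-weighted exploration."""
    return avg_reward + c * math.sqrt(math.log(n_node) / n_action)

def uct_select(children, c=math.sqrt(2)):
    """Pick the action whose child maximizes the UCT score."""
    n_node = sum(ch["visits"] for ch in children.values())
    return max(children, key=lambda a: uct_score(
        children[a]["reward"] / children[a]["visits"],
        n_node, children[a]["visits"], c))
```

With a positive `c`, a rarely tried action can outscore a well-explored one with a higher mean, which is exactly the exploration behavior the bound encodes; with `c = 0` the rule is purely greedy.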

OCBA-MCTS: The policy solves the OCBA optimality conditions for a set of target allocation ratios and selects the action that is the most starving, i.e., the action whose current number of visits falls furthest below its target allocation.
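The allocation equations themselves are not reproduced in the text; the sketch below is based on the standard OCBA ratios of Chen et al. (2000) paired with a most-starving rule, under the assumptions of known variances and distinct means. The function names are illustrative.

```python
import math

def ocba_fractions(means, variances):
    """Target OCBA budget fractions: for non-best arm i with gap
    delta_i = mu_b - mu_i, the share is proportional to sigma_i^2 / delta_i^2;
    the best arm b gets sigma_b * sqrt(sum_i ratio_i^2 / sigma_i^2)."""
    b = max(range(len(means)), key=lambda i: means[i])
    ratios = [0.0] * len(means)
    for i in range(len(means)):
        if i != b:
            delta = means[b] - means[i]          # assumed nonzero
            ratios[i] = variances[i] / delta**2
    ratios[b] = math.sqrt(variances[b] * sum(r**2 / variances[i]
                                             for i, r in enumerate(ratios)
                                             if i != b))
    total = sum(ratios)
    return [r / total for r in ratios]

def most_starving(fractions, counts):
    """Sample the arm whose visit count is furthest below its target share."""
    n = sum(counts) + 1
    return max(range(len(counts)), key=lambda i: fractions[i] * n - counts[i])
```

As expected, the budget concentrates on the best arm and its closest competitor, and the most-starving rule turns the static fractions into a fully sequential sampling decision.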

TTTS-MCTS: The policy first draws one sample from the posterior distribution of each action and finds the action $I$ with the largest sampled value. Then the policy re-draws samples from the same distributions until a different action $J \neq I$ attains the largest sampled value. The allocated action is determined by randomly choosing between $I$ and $J$. Since the second stage of the policy can be time-consuming when the action space is large, we truncate it at 10 rounds in implementation, i.e., if $J$ cannot be found within 10 rounds, we determine $J$ by the second-largest sampled value of the first draw.
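The two-stage draw with truncation described above can be sketched as follows. This is a hedged illustration of the rule as described, with normal posteriors assumed; `beta` (the probability of keeping the first leader) and the function name are not fixed by the text.

```python
import random

def ttts_select(post_means, post_vars, beta=0.5, max_resample=10):
    """Top-two Thompson sampling with truncation: draw theta from each
    posterior, let I be the best; re-draw until a different arm J tops
    the sample, falling back to the second-best of the first draw after
    max_resample rounds; return I with probability beta, else J."""
    theta = [random.gauss(m, v ** 0.5) for m, v in zip(post_means, post_vars)]
    I = max(range(len(theta)), key=lambda i: theta[i])
    J = I
    for _ in range(max_resample):
        redraw = [random.gauss(m, v ** 0.5)
                  for m, v in zip(post_means, post_vars)]
        J = max(range(len(redraw)), key=lambda i: redraw[i])
        if J != I:
            break
    if J == I:  # truncation fallback: second-largest of the first draw
        J = max((i for i in range(len(theta)) if i != I),
                key=lambda i: theta[i])
    return I if random.random() < beta else J
```

When the posteriors are nearly degenerate, every redraw picks the same leader, so the truncation fallback is what supplies the challenger $J$.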
Experiment 1: Tic-tac-toe Tic-tac-toe is a game played on a three-by-three board by two players, who alternately place the marks 'X' and 'O' in one of the nine spaces on the board. The player who succeeds in placing three of their marks in a horizontal, vertical, or diagonal row is the winner. If both players act optimally, the game always ends in a draw.
Experiment 1.1: Precision In this experiment, we focus on the precision of MCTS in finding the optimal move under different tree policies. The effectiveness of a policy is measured by PCS. Given a place marked by Player 1, we apply different tree policies to identify the optimal move for Player 2. Figure 1 shows two board setups, where we use black and white to represent ‘X’ and ‘O’, respectively, for ease of presentation.
The optimal move for Player 2 is unique in Setup 1, whereas any of the four corner moves is optimal for Player 2 in Setup 2. Setup 2 is an easier setting since Player 2 has a 50% chance of choosing an optimal move even when choosing randomly. At the end of the game, if Player 2 wins, the reward of the terminal state is 1; if the game ends in a draw, the reward is 0.5; otherwise, the reward is 0. We consider two policies for Player 1 under both setups: one plays randomly, i.e., marks any feasible space with equal probability; the other plays UCT, choosing the move that minimizes the lower confidence bound in order to minimize the reward of Player 2. We use the same algorithmic constants for all policies and specify the prior parameters for AOAP-MCTS. The PCS for the optimal move of Player 2 is estimated based on 100,000 independent macro experiments. We plot the PCS of all policies under each setup as a function of the number of rollouts, ranging from 80 to 300. The results are shown in Figure 2.
We can see that AOAP-MCTS performs best among all tree policies, and its advantage is larger when the number of rollouts is relatively low. The policies based on R&S (i.e., AOAP-MCTS and OCBA-MCTS) perform better than the policies based on MAB (i.e., UCT and TTTS-MCTS) as the number of rollouts increases. TTTS-MCTS performs better than OCBA-MCTS when the number of rollouts is low. The performances of all policies become comparable as the number of rollouts grows. AOAP-MCTS achieves PCS values 33.2%, 2.8%, 19.2%, and 1.9% higher than UCT in settings (a)-(d), respectively. The gap between policies is smaller when Player 1 plays UCT, since Player 1 then has a better chance of taking the optimal action. Although the differences between policies in Setup 2 are not as significant as those in Setup 1, AOAP-MCTS still performs best.
Experiment 1.2: Win-draw-lose In this experiment, we focus on the numbers of wins, draws, and losses when Player 1 plays against Player 2. Both players play randomly or by one of the four tree policies. Since the opponent's policy is unknown, each player's policy is trained against a random or UCT opponent. The algorithmic constants are the same as in Experiment 1.1, except for one constant. The number of rollouts to determine a move at a state is set to 200. The numbers of wins, draws, and losses of Player 1 are estimated from 1,000 independent rounds. The results are shown in Table 1 and Table 2, where the triple in each cell comprises the numbers of wins, draws, and losses, respectively. The last column of each table shows the net win of a policy, calculated as the cumulative wins minus the cumulative losses of both players.
Player 1 \ Player 2  Random  UCT  OCBA-MCTS  TTTS-MCTS  AOAP-MCTS  Net Win 

Random  (526,449,25)  (425,552,23)  (504,475,21)  (454,520,26)  (469,513,18)  483 
UCT  (604,391,5)  (446,545,9)  (404,588,8)  (319,671,10)  (408,586,6)  142 
OCBA-MCTS  (574,415,11)  (439,551,10)  (522,467,11)  (497,491,12)  (527,463,10)  156 
TTTS-MCTS  (557,431,12)  (455,538,7)  (484,506,10)  (501,490,9)  (402,589,9)  135 
AOAP-MCTS  (555,430,15)  (583,403,14)  (493,499,8)  (513,477,10)  (469,522,9)  334 
Player 1 \ Player 2  Random  UCT  OCBA-MCTS  TTTS-MCTS  AOAP-MCTS  Net Win 

Random  (434,545,21)  (307,658,35)  (291,672,37)  (288,689,23)  (288,693,19)  633 
UCT  (476,507,17)  (340,649,11)  (278,713,9)  (258,730,12)  (204,787,9)  2 
OCBA-MCTS  (415,579,6)  (314,682,4)  (286,703,11)  (312,678,10)  (270,721,9)  227 
TTTS-MCTS  (416,572,12)  (297,696,7)  (279,710,11)  (293,692,15)  (277,715,8)  93 
AOAP-MCTS  (435,551,14)  (307,685,8)  (278,708,14)  (341,643,16)  (279,704,17)  315 
From Tables 1 and 2, we can see that the net win of AOAP-MCTS is the highest among all policies. OCBA-MCTS performs better than TTTS-MCTS and UCT. The net wins of AOAP-MCTS and TTTS-MCTS trained against a UCT opponent are lower than those trained against a random opponent, showing that both policies are relatively conservative when the opponent has a better chance of taking an optimal action.
Experiment 1.3: Behaviors In this experiment, we analyze the behaviors of the four tree policies by observing the boards at the terminal state in games of Tic-tac-toe. Some terminal boards are shown in Figure 3.
From (b), (d), and (e) in Figure 3, we find that the performance of UCT does not vary much as the game goes on. The behavior of TTTS-MCTS is similar to UCT, but it performs better than UCT. OCBA-MCTS tends to perform better at the beginning of the game, but it sometimes fails to intercept the opponent's moves in time, leading to a loss, e.g., Figure 3 (a) and (b). AOAP-MCTS adaptively switches between aggressively intercepting the opponent's moves and greedily pursuing a win. Although it sometimes does not choose the optimal action at the beginning of the game, its performance improves as the game goes on, e.g., Figure 3 (d).
Experiment 2: Gomoku We next consider a game played on a larger board, Gomoku. It is played on a fifteen-by-fifteen board by two players, who alternate turns placing a stone of their color on an empty intersection; Black plays first. The winner is the first player to form an unbroken row of five stones horizontally, vertically, or diagonally. We restrict the board size to eight-by-eight for ease of computation.
Experiment 2.1: Precision In this experiment, we focus on the precision of MCTS in finding the optimal move under different tree policies. The effectiveness of a policy is measured by PCS. The algorithmic constants are the same as in Experiment 1, except for one constant. The true optimal move at a given state is more difficult to determine in Gomoku than in Tic-tac-toe. In order to identify the true optimal moves, we let two random policies play against each other and record the change in the number of wins of each move using two neural networks. A move is considered optimal if its number of wins increases by more than 50% and the number of such moves does not exceed half the number of positions on the board. PCS is estimated over 100 independent board states, each estimated from 100,000 independent macro experiments. We plot the PCS of all policies as a function of the number of rollouts, ranging from 0 to 10,000. The results are shown in Figure 4.
We can see that AOAP-MCTS performs best among all tree policies. OCBA-MCTS performs better than TTTS-MCTS, which in turn performs better than UCT and Random.
Experiment 2.2: Win-draw-lose We focus on the numbers of wins, draws, and losses when Player 1 plays against Player 2. The setup of the experiment is the same as in Experiment 1.2, and the algorithmic constants are the same as in Experiment 1.2, except for one constant. The number of rollouts to determine a move at a state is set to 2,000. The results are shown in Table 3 and Table 4.
Player 1 \ Player 2  Random  UCT  OCBA-MCTS  TTTS-MCTS  AOAP-MCTS  Net Win 

Random  (501,0,499)  (392,5,603)  (401,5,594)  (481,4,515)  (450,13,537)  1549 
UCT  (540,4,456)  (472,14,514)  (412,4,584)  (471,11,518)  (471,0,529)  374 
OCBA-MCTS  (660,7,333)  (580,10,610)  (560,5,435)  (521,12,467)  (432,14,554)  635 
TTTS-MCTS  (660,11,329)  (531,5,464)  (480,14,506)  (511,0,489)  (542,1,457)  377 
AOAP-MCTS  (641,0,359)  (571,13,416)  (591,3,406)  (552,3,445)  (551,2,447)  635 
Player 1 \ Player 2  Random  UCT  OCBA-MCTS  TTTS-MCTS  AOAP-MCTS  Net Win 

Random  (500,7,493)  (510,9,481)  (432,49,519)  (331,4,665)  (211,7,782)  1814 
UCT  (481,2,511)  (521,12,477)  (421,3,576)  (341,17,652)  (350,6,644)  1619 
OCBA-MCTS  (610,11,379)  (661,5,334)  (511,7,482)  (540,5,455)  (481,3,516)  901 
TTTS-MCTS  (691,7,302)  (550,3,447)  (501,16,483)  (510,3,487)  (361,9,630)  717 
AOAP-MCTS  (621,19,360)  (681,8,311)  (661,9,330)  (541,2,457)  (531,6,463)  2215 
We can see that AOAP-MCTS performs best among all policies, and OCBA-MCTS performs better than TTTS-MCTS and UCT. Compared with Experiment 1.2, the advantage of AOAP-MCTS is more significant on the larger board.
Experiment 2.3: Behaviors In this experiment, we analyze the behaviors of the four tree policies by observing the boards at the terminal state in games of Gomoku. Some terminal boards are shown in Figure 5.
The behavior of each policy observed in Figure 5 is the same as in Figure 3. UCT does not vary much as the game goes on. The behavior of TTTS-MCTS is similar to UCT, but it performs better than UCT. OCBA-MCTS tends to perform better at the beginning of the game, but it sometimes fails to intercept the opponent's moves in time, leading to a loss, e.g., (a) and (c) in Figure 5. AOAP-MCTS adaptively switches between aggressively intercepting the opponent's moves and greedily pursuing a win. Although it sometimes does not choose the optimal action at the beginning of the game, its performance improves as the game goes on.
5 Conclusion
This paper studies the tree policy for Monte Carlo Tree Search. We formulate the tree policy in MCTS as a ranking and selection problem. We propose an efficient dynamic sampling tree policy named AOAP-MCTS, which maximizes the probability of correct selection of the best action at the root state. Numerical experiments demonstrate that AOAP-MCTS is more efficient than the other tested tree policies. Future research includes a theoretical analysis of the proposed tree policy. The normality assumption on the samples deserves verification, and how to guarantee sampling precision under a limited computational budget is another direction for future work.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants 71901003 and 72022001.
References
Auer, P., N. Cesa-Bianchi, and P. Fischer (2002). Finite-time analysis of the multi-armed bandit problem. Machine Learning 47(2), pp. 235-256.
Powell, W. B. and I. O. Ryzhov (2012). Ranking and selection. Chapter 4 in Optimal Learning, pp. 71-88. John Wiley and Sons, New York.
Browne, C. B., E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton (2012). A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1), pp. 1-43.
Bubeck, S. and N. Cesa-Bianchi (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721.
Chen, C.-H. and L. H. Lee (2011). Stochastic Simulation Optimization: An Optimal Computing Budget Allocation. Vol. 1, World Scientific, Singapore.
Chen, C.-H., J. Lin, E. Yücesan, and S. E. Chick (2000). Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Journal of Discrete Event Dynamic Systems 10(3), pp. 251-270.
Fu, M. C. (2018). Monte Carlo tree search: a tutorial. In 2018 Winter Simulation Conference (WSC), B. Johansson (Ed.), Gothenburg, Sweden, pp. 222-236.
Kaufmann, E. and W. M. Koolen (2017). Monte-Carlo tree search by best arm identification. Advances in Neural Information Processing Systems 30.
Kocsis, L., C. Szepesvári, and J. Willemson (2006). Improved Monte-Carlo search. Univ. Tartu, Estonia, Tech. Rep. 1.
Kocsis, L. and C. Szepesvári (2006). Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pp. 282-293.
Li, Y., M. C. Fu, and J. Xu (2021). An optimal computing budget allocation tree policy for Monte Carlo tree search. IEEE Transactions on Automatic Control, early access. doi: 10.1109/TAC.2021.3088792.
Mansley, C., A. Weinstein, and M. Littman (2011). Sample-based planning for continuous action Markov decision processes. In Twenty-First International Conference on Automated Planning and Scheduling, Freiburg, Germany.
Peng, Y., E. K. P. Chong, C.-H. Chen, and M. C. Fu (2018). Ranking and selection as stochastic control. IEEE Transactions on Automatic Control 63(8), pp. 2359-2373.
Russo, D. (2020). Simple Bayesian algorithms for best-arm identification. Operations Research 68(6), pp. 1625-1647.
Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), pp. 484-489.
Silver, D., et al. (2017). Mastering the game of Go without human knowledge. Nature 550(7676), pp. 354-359.
Świechowski, M., K. Godlewski, B. Sawicki, and J. Mańdziuk (2021). Monte Carlo tree search: a review of recent modifications and applications. arXiv preprint arXiv:2103.04931.
Teraoka, K., K. Hatano, and E. Takimoto (2014). Efficient sampling method for Monte Carlo tree search problem. IEICE Transactions on Information and Systems 97(3), pp. 392-398.
Tesauro, G., V. T. Rajan, and R. Segal (2010). Bayesian inference in Monte-Carlo tree search. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, Catalina Island, California, pp. 580-588.
Teytaud, O. and S. Flory (2011). Upper confidence trees with short term partial information. In European Conference on the Applications of Evolutionary Computation, pp. 153-162.