I. Introduction
Monte Carlo tree search (MCTS) algorithms [1, 2], often in combination with learning algorithms, provide state-of-the-art AI in many games and other domains [3, 4, 5, 6]. The most straightforward implementations of MCTS use large numbers of playouts, in which actions are selected uniformly at random, to estimate the value of the starting state of those playouts. Playouts that use handcrafted heuristics, learned policies, or search to more closely resemble realistic lines of play can often significantly increase playing strength, even if the increased computational cost leads to a reduction in the number of playouts [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 5].

The majority of policy learning approaches use supervised learning with human expert moves as training targets, or traditional reinforcement learning (RL) update rules [20], but the most impressive results have been obtained using the Expert Iteration framework, in which MCTS and a learned policy iteratively improve each other through self-play [4, 5, 6]. In this framework, a policy is trained to mimic the MCTS search behaviour using a cross-entropy loss, and the policy is used to bias the MCTS search. Note that playouts are sometimes replaced altogether by trained value function estimators, leaving only the selection phase of MCTS to be biased by a trained policy [4, 6], but a learned policy may also be used to run playouts [5].

The selection phase of MCTS provides a balance between exploration and exploitation: exploration consists of searching parts of the game tree that have not yet been thoroughly searched, and exploitation consists of searching parts of the game tree that appear the most promising based on the search process so far. Using the search behaviour of MCTS as an update target for a policy means that this policy is trained to have a similar balance between exploration and exploitation as the MCTS algorithm.
Within the context of the Digital Ludeme Project [21], we aim to learn policies based on interpretable features [22] for state-action pairs, where future goals of the project include extracting explainable strategies from learned policies, and estimating similarities or distances between different (variants of) games in terms of strategies. For these goals, we do not expect the exploratory behaviour that is learned with the standard cross-entropy loss to be desirable.
We formulate a new training objective for policies. A policy that optimises this objective can intuitively be understood as one that selects actions such that MCTS is subsequently expected to be capable of performing well. Unlike the case where the MCTS search behaviour is used as training target, this optimisation criterion does not encourage any level of exploration. We derive an expression for the gradient of this objective with respect to a differentiable policy’s parameters, which allows for training using gradient descent.
Like the standard updates used to optimise the cross-entropy loss in Expert Iteration [4, 5, 6], these updates are guided by "advice" generated by MCTS. This is hypothesised to be important for a stable and robust self-play learning process, with a reduced risk of overfitting to the self-play opponent. The primary difference is that this advice consists of value estimates, rather than a distribution over actions.
We empirically compare policies trained to optimise the proposed objective function with policies trained on the standard cross-entropy loss, across a variety of deterministic, perfect-information, two-player board games. The proposed objective consistently leads to policies that are at least as strong as, and in some games significantly stronger than, policies trained with the cross-entropy loss. We also confirm that the resulting policies have significantly lower entropy in their distributions over actions, which suggests that the learned policies are less exploratory. Finally, we compare the resulting distributions of weights learned for different features, and the performance of MCTS agents biased by policies trained on the different objectives.
II. Background
This section formalises the concepts from reinforcement learning (RL) theory required in this paper. We assume a standard singleagent setting. When subsequently applying these concepts to multiplayer, adversarial game settings, any states in which a learning agent is not the player to move are ignored, and moves selected by opponents are simply assumed to be a part of the “environment” and its transition dynamics.
II-A. Markov Decision Processes
We use the standard single-agent, fully-observable, episodic Markov decision process (MDP) setting, where $\mathcal{S}$ denotes a set of states, and $\mathcal{A}$ denotes a set of actions. At discrete time steps $t = 0, 1, 2, \dots$, the agent observes states $S_t \in \mathcal{S}$. Whenever $S_t$ is not terminal, the agent selects an action $A_t \in \mathcal{A}(S_t)$ from the set of actions that are legal in $S_t$, which leads to an observed reward $R_{t+1}$. We assume that there is a fixed starting state $S_0$. Given a current state $s$ and action $a$, the probability of observing any arbitrary successor state $s'$ and reward $r$ is given by $p(s', r \mid s, a)$.

Let $\pi$ denote some policy, such that $\pi(s, a)$ denotes the probability of selecting an action $a$ in a state $s$, and $\sum_{a \in \mathcal{A}(s)} \pi(s, a) = 1$. The value of a state $s$ under policy $\pi$ is given by (1):

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s \right] \quad (1)$$

where $\gamma \in [0, 1]$ denotes a discount factor (in the board game applications considered in this paper, typically $\gamma = 1$). We define $V^{\pi}(S_T) = 0$ for $T$ in any episode where $S_T$ is a terminal state. The value of an action $a$ in a state $s$ under policy $\pi$ is given by (2):

$$Q^{\pi}(s, a) = \sum_{s', r} p(s', r \mid s, a) \left( r + \gamma \sum_{a'} \pi(s', a') Q^{\pi}(s', a') \right) \quad (2)$$

where $a'$ covers all actions $a' \in \mathcal{A}(s')$ for which $\pi(s', a') > 0$.
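As a concrete illustration of the recursions behind (1) and (2), the sketch below runs iterative policy evaluation on a toy two-state MDP. The states, rewards, and policy are made up purely for illustration; this is not an environment used in the paper.

```python
GAMMA = 1.0  # the board games considered in the paper are undiscounted

# transitions[state][action] = list of (probability, next_state, reward),
# a tabular stand-in for p(s', r | s, a)
transitions = {
    "s0": {"a": [(1.0, "s1", 0.0)], "b": [(1.0, "terminal", 0.5)]},
    "s1": {"a": [(1.0, "terminal", 1.0)]},
}
policy = {"s0": {"a": 0.5, "b": 0.5}, "s1": {"a": 1.0}}

def evaluate(policy, sweeps=50):
    """Iteratively approximate V^pi(s) as in (1)."""
    v = {s: 0.0 for s in transitions}
    v["terminal"] = 0.0  # V is defined to be 0 in terminal states
    for _ in range(sweeps):
        for s, acts in transitions.items():
            v[s] = sum(policy[s][a] * sum(p * (r + GAMMA * v[s2])
                                          for p, s2, r in outcomes)
                       for a, outcomes in acts.items())
    return v

def q_value(v, s, a):
    """Q^pi(s, a) as in (2), computed from the converged V^pi."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in transitions[s][a])

v = evaluate(policy)
# V(s1) = 1.0, and V(s0) = 0.5 * Q(s0, a) + 0.5 * Q(s0, b) = 0.75
```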
II-B. Policy Gradients
Let $J(\pi)$ denote the expected performance, in terms of returns per episode, of a policy $\pi$:

$$J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \right] \quad (3)$$

A common goal in RL is to find a policy $\pi$ such that this objective is maximised. Suppose that $\pi = \pi_{\theta}$ is a differentiable function, parameterised by a vector $\theta$, such that $\nabla_{\theta} \pi_{\theta}(s, a)$ exists. Then, the Policy Gradient Theorem [23] states that:

$$\nabla_{\theta} J(\pi_{\theta}) = \sum_{s} d^{\pi_{\theta}}(s) \sum_{a} \nabla_{\theta} \pi_{\theta}(s, a) \, Q^{\pi_{\theta}}(s, a) \quad (4)$$

where $d^{\pi_{\theta}}(s)$ gives a discounted weighting of states according to how likely they are to be reached in trajectories following $\pi_{\theta}$. Sample-based estimators of this gradient allow for the objective to be optimised directly, using stochastic gradient ascent to adjust the policy parameters [24, 25, 20].
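A sample-based estimator of (4) can be sketched with a minimal REINFORCE-style update on a hypothetical single-state, two-action problem. All names and numbers below are illustrative; this is a generic textbook sketch, not the training setup used in the paper.

```python
import math, random

theta = [0.0, 0.0]  # one logit per action (softmax policy)

def pi(theta):
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [x / s for x in z]

def grad_log_pi(theta, a):
    # for a softmax policy: d/d theta_i log pi(a) = 1{i == a} - pi(i)
    p = pi(theta)
    return [(1.0 if i == a else 0.0) - p[i] for i in range(len(theta))]

def reward(a):
    # action 1 is better in expectation (made-up reward model)
    return random.gauss(1.0 if a == 1 else 0.0, 0.1)

random.seed(0)
alpha = 0.1
for _ in range(2000):  # stochastic gradient ascent on J(theta)
    p = pi(theta)
    a = random.choices([0, 1], weights=p)[0]
    g = reward(a)  # return of this one-step episode
    for i, d in enumerate(grad_log_pi(theta, a)):
        theta[i] += alpha * g * d

# the learned policy should come to strongly prefer action 1
```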
II-C. Monte Carlo Tree Search Value Estimates
Most variants of Monte Carlo tree search (MCTS) [3] can be viewed as RL approaches which, based on simulated experience, learn on-policy value estimates for the states represented by nodes in the search tree that is gradually built up [26]. Let $s_0$ denote a state from which we run an MCTS search process (meaning that $s_0$ corresponds to the root node). Then we can formally describe a policy $\hat{\pi}$:

$$\hat{\pi}(s, a) = \begin{cases} \dfrac{n(s, a)}{\sum_{a'} n(s, a')} & \text{if } s \text{ is represented by a node in the search tree,} \\ \pi_{rollout}(s, a) & \text{otherwise,} \end{cases} \quad (5)$$

where $n(s, a)$ denotes the number of times that the search process selected $a$ in the node representing $s$, and $\pi_{rollout}$ denotes the rollout policy.
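The visit-count policy of (5) can be sketched as follows. The dictionary-based tree representation and all names are illustrative stand-ins, not the paper's implementation.

```python
def pi_hat(state, action, visit_counts, rollout_policy):
    """Probability assigned to `action` in `state` by the policy of Eq. (5)."""
    counts = visit_counts.get(state)
    if counts:  # state is represented by a node in the search tree
        return counts.get(action, 0) / sum(counts.values())
    return rollout_policy(state, action)  # fall back to the rollout policy

# made-up visit counts in a root node with three legal actions
visits = {"root": {"a": 70, "b": 20, "c": 10}}
uniform = lambda s, a: 1.0 / 3.0  # illustrative uniform rollout policy

# pi_hat("root", "a", ...) is 70 / 100 = 0.7; off-tree states use the rollout policy
```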
Suppose that value estimates $\hat{V}(s)$ in nodes of the search tree are computed, as is customary, as the averages of backpropagated scores, or using some other approach that can be viewed as implementing on-policy backups, such as Sarsa-UCT($\lambda$) [26]. These value estimates are then unbiased estimators of $V^{\hat{\pi}}(s)$, as defined in (1). We typically expect these value estimates to be unreliable and to exhibit high variance deep in the search tree, but, given a sufficiently high MCTS iteration count, they may be more reliable close to the root node.
III. Policy Gradient with MCTS Value Estimates
Unlike the standard cross-entropy loss used in Expert Iteration, optimising the policy gradient objective of (3) does not incentivise an element of exploration in trained policies. However, this objective focuses on the long-term performance of the standalone policy $\pi$ being trained. Suppose that it is infeasible to learn a good distribution over actions in some state $s$, for instance because there are no features available that allow distinguishing between any actions in $s$. Reaching $s$ will then be detrimental to the long-term performance of $\pi$ according to (3), and actions leading to $s$ will therefore be disincentivised, even if they may otherwise clearly be a part of the principal variation. This is problematic when we aim to use $\pi$ for purposes such as strategy extraction (even if only for some parts of the state space), rather than using it for standalone gameplaying.
III-A. Objective Function
To address the issues illustrated above, we propose to maximise the objective function given by (6), where $\pi_{\theta}$ is the apprentice policy to be trained, parameterised by a vector $\theta$:

$$J_{TSPG}(\theta) = \mathbb{E}_{S_t \sim \hat{\pi},\; A_t \sim \pi_{\theta}}\left[ Q^{\hat{\pi}}(S_t, A_t) \right] \quad (6)$$

where $S_t \sim \hat{\pi}$ denotes that, for all $t' \neq t$, we run an MCTS process and sample $A_{t'}$ from $\hat{\pi}(S_{t'}, \cdot)$. We refer to this as the Tree-Search Policy Gradient (TSPG) objective function. Intuitively, sampling actions for $t' < t$ from MCTS can be understood as stating that it is only important for $\pi_{\theta}$ to be well-trained in states that are likely to be reached when playing according to MCTS processes prior to time $t$. Sampling actions for $t' > t$ from MCTS in this objective can be understood as stating that $\pi_{\theta}$ is not required to be capable of playing well for the remainder of an episode, but only needs to be able to select actions such that MCTS would be expected to perform well in subsequent states.
Suppose that there is a small game tree, in which MCTS can easily find an optimal line of play, but where that optimal line of play leads to a subtree in which a parameterised policy $\pi_{\theta}$ cannot play well. This may, for instance, be due to a lack of representational capacity of $\pi_{\theta}$ itself (i.e., using a simple linear function), or due to using a restricted set of input features that is insufficient for states or actions in that subtree to be distinguished from each other. A standard RL objective function, such as the one in (3), would lead to a policy that learns to avoid that subtree altogether, because the same policy cannot guarantee long-term success in that subtree. We argue that this is detrimental for our goal of interpretable strategy extraction, because it leads to a poor strategy in the root of such a game tree. In contrast, the TSPG objective still allows for a strong strategy to be learned for states other than those in the problematic subtree.
III-B. Policy Gradient
Our derivation of an expression for the gradient of this objective with respect to the parameters $\theta$ takes inspiration from the original proof of the policy gradient theorem [23]. We start by defining $V^{\theta, \hat{\pi}}(s)$ as the expected value of sampling a single action from $\pi_{\theta}$ in state $s$, and sampling actions from MCTS search processes for the remainder of the episode:

$$V^{\theta, \hat{\pi}}(s) = \sum_{a} \pi_{\theta}(s, a) \, Q^{\hat{\pi}}(s, a) \quad (7)$$

where $\hat{\pi}$ is used as a shorthand notation to indicate that a separate policy $\hat{\pi}_t$, involving a separate complete search process, is used at every time $t$. The gradient of this function with respect to $\theta$ is given by:

$$\nabla_{\theta} V^{\theta, \hat{\pi}}(s) = \sum_{a} \nabla_{\theta} \pi_{\theta}(s, a) \, Q^{\hat{\pi}}(s, a) \quad (8)$$

where we assume that $\nabla_{\theta} Q^{\hat{\pi}}(s, a) = 0$. Note that this assumption may be violated in practice by making use of $\pi_{\theta}$ in the playouts of MCTS processes, but it is not feasible to accurately estimate the gradient of the performance of MCTS with respect to parameters used in playouts. We can avoid violating the assumption by freezing the versions of the parameters used for biasing any MCTS process, and clearing any old experience when updating parameters used by MCTS, but in practice we expect this to be detrimental to learning speed. Also note that this assumption is very similar to the omission of the $\nabla_{\theta} Q^{\pi}(s, a)$ term in the Off-Policy Policy-Gradient Theorem, where $\theta$ is a parameter vector and $\pi$ is a target policy [27].
Now, we rewrite the TSPG objective function into a more convenient expression, starting from (6):

$$J_{TSPG}(\theta) = \sum_{s} d^{\hat{\pi}}(s) \sum_{a} \pi_{\theta}(s, a) \, Q^{\hat{\pi}}(s, a) \quad (9)$$

where $d^{\hat{\pi}}(s)$ gives a discounted weighting of states according to how likely they are to be reached in trajectories following $\hat{\pi}$. Taking the gradient with respect to $\theta$ gives:

$$\nabla_{\theta} J_{TSPG}(\theta) = \sum_{s} d^{\hat{\pi}}(s) \sum_{a} \nabla_{\theta} \pi_{\theta}(s, a) \, Q^{\hat{\pi}}(s, a) \quad (10)$$

where again we assume that $\theta$ has no effect on MCTS processes by taking $\nabla_{\theta} Q^{\hat{\pi}}(s, a) = 0$.

The analytical expression of the gradient of the TSPG objective in (10) is exact if the involved MCTS processes are unaffected by $\theta$, and an approximation otherwise. Note that it has a similar form to the original policy gradient expression in (4). The weighting of states and the value estimates are now both provided by $\hat{\pi}$, but the only required gradient is that of $\pi_{\theta}$ (which, by assumption, is differentiable).
III-C. Estimating the Gradient
In the Expert Iteration framework [4, 5, 6], experience is typically generated by playing self-play games in which actions are selected proportional to the visit counts in root states after running MCTS processes. This corresponds precisely to the definition of the policies $\hat{\pi}$ given in (5). It is customary to store states encountered in such a self-play process in a dataset $\mathcal{D}$ (keeping only one randomly-selected state per full game, to avoid excessive correlations between instances), and to sample batches from $\mathcal{D}$ for stochastic gradient descent updates. Sampling batches $\mathcal{B}$ of states from $\mathcal{D}$ leads to unbiased estimates of the gradient expression in (10):

$$\nabla_{\theta} J_{TSPG}(\theta) \approx \frac{1}{|\mathcal{B}|} \sum_{s \in \mathcal{B}} \sum_{a \in \mathcal{A}(s)} \nabla_{\theta} \pi_{\theta}(s, a) \, \hat{Q}(s, a) \quad (11)$$

Optimisation of the cross-entropy loss typically used in Expert Iteration requires storing the MCTS visit counts for all states $s$ in the dataset $\mathcal{D}$, alongside the states themselves. Instead of storing visit counts, our approach requires storing MCTS value estimates $\hat{Q}(s, a)$ for all actions $a \in \mathcal{A}(s)$; these are simply the state-value estimates of all successors of $s$. These values can be plugged into (11) as unbiased estimators of $Q^{\hat{\pi}}(s, a)$.

We now have an unbiased estimator of the gradient which can be readily computed from data collected as in the standard Expert Iteration self-play framework. The form of this estimator most closely resembles that of the Mean Actor-Critic [28], in the sense that we explicitly sum over all actions rather than sampling trajectories with actions selected according to $\pi_{\theta}$. As in the gradient estimator of the Mean Actor-Critic, it is unnecessary to subtract a state-dependent baseline from $\hat{Q}(s, a)$ for variance reduction, as is typically done in sample-based estimators of policy gradients [23, 25].
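Under a softmax-linear parameterisation (as used later in Section IV), the batch estimator of (11) can be sketched as below. The feature vectors and value estimates are made up, and this is an illustrative sketch rather than the authors' implementation.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    e = [math.exp(l - m) for l in logits]
    s = sum(e)
    return [x / s for x in e]

def tspg_gradient(theta, batch):
    """Estimate of Eq. (11) for a softmax over linear functions of features.

    batch: list of (features_per_action, q_hat_per_action) tuples, one per
    stored state; q_hat holds the MCTS value estimates for all legal actions.
    """
    grad = [0.0] * len(theta)
    for feats, q_hat in batch:
        probs = softmax([sum(t * f for t, f in zip(theta, fv)) for fv in feats])
        for i in range(len(theta)):
            # d pi(s,a) / d theta_i = pi(s,a) * (phi_i(s,a) - E_{a'}[phi_i(s,a')])
            expected_phi = sum(p * fv[i] for p, fv in zip(probs, feats))
            for a, fv in enumerate(feats):
                grad[i] += probs[a] * (fv[i] - expected_phi) * q_hat[a]
    return [g / len(batch) for g in grad]

# one state, two actions, one feature that is active only for the winning move;
# the winning move has Q_hat = 1, the other move Q_hat = -1
batch = [([[1.0], [0.0]], [1.0, -1.0])]
g = tspg_gradient([0.0], batch)
# the gradient is positive: ascent increases the weight of the predictive feature
```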
IV. Learning Offsets from Exploratory Policy
A differentiable policy $\pi_{\theta}$ is typically implemented to compute logits $\theta^{\top} \phi(s, a)$, where $\theta$ is a trainable parameter vector and $\phi(s, a)$ is a feature vector for a state-action pair $(s, a)$. Probabilities are subsequently computed using the softmax function:

$$\pi_{\theta}(s, a) = \frac{\exp\left(\theta^{\top} \phi(s, a)\right)}{\sum_{a'} \exp\left(\theta^{\top} \phi(s, a')\right)}$$

In preliminary testing, we found that there is a risk for strong features that are only discovered and added in the middle of a self-play training process [29] to remain unused. When this happens, the learning approach appears to remain stuck in what used to be a local optimum given an older feature set, even though newly-added features should enable escaping that local optimum. First, we elaborate on why this can happen, and subsequently propose an approach to address this issue.

IV-A. Gradients for Low-probability Actions
Suppose that $\pi_{\theta}$ uses the softmax function, as described above. Then, the gradient of $\pi_{\theta}(s, a)$ with respect to the $i^{th}$ parameter $\theta_i$ of the parameter vector $\theta$ is given by

$$\frac{\partial \pi_{\theta}(s, a)}{\partial \theta_i} = \sum_{a'} \pi_{\theta}(s, a) \left( \delta_{a a'} - \pi_{\theta}(s, a') \right) \phi_i(s, a') \quad (12)$$

where the Kronecker delta $\delta_{a a'}$ is equal to $1$ if $a = a'$, or $0$ otherwise, and $\phi_i(s, a)$ denotes the $i^{th}$ feature value for the state-action pair $(s, a)$.
This is the gradient that is multiplied by $\hat{Q}(s, a)$ in (11) to compute the update for the parameter $\theta_i$ corresponding to the feature $\phi_i$. In cases where feature values $\phi_i(s, a)$ correlate strongly with state-action values $\hat{Q}(s, a)$, we would intuitively expect to obtain consistent, high-value gradient estimates that rapidly adapt $\theta_i$. However, if previous learning steps, possibly taken before the feature $\phi_i$ was being used at all, resulted in a parameter vector $\theta$ such that $\pi_{\theta}(s, a)$ is low (i.e., $\pi_{\theta}(s, a) \approx 0$), this gradient will also be close to zero, and learning progresses very slowly.
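A small numeric illustration of this effect, using the softmax-linear parameterisation and made-up feature values: when the weight of the single feature is strongly negative, the probability of the affected action is near zero, and the gradient of (12) collapses with it.

```python
import math

def softmax(logits):
    e = [math.exp(l) for l in logits]
    s = sum(e)
    return [x / s for x in e]

def dpi_dtheta(probs, feats, a, i):
    # Eq. (12): sum_{a'} pi(s,a) * (delta_{a a'} - pi(s,a')) * phi_i(s,a')
    return sum(probs[a] * ((1.0 if a == a2 else 0.0) - probs[a2]) * feats[a2][i]
               for a2 in range(len(probs)))

feats = [[1.0], [0.0]]      # a single feature, active only for action 0
results = {}
for theta0 in (0.0, -8.0):  # neutral weight vs. strongly "suppressed" weight
    probs = softmax([theta0 * feats[0][0], theta0 * feats[1][0]])
    results[theta0] = dpi_dtheta(probs, feats, 0, 0)

# with theta0 = 0 the gradient is 0.25; with theta0 = -8 the action's
# probability is ~0.0003 and the gradient shrinks to roughly the same size,
# so even a large Q_hat multiplier produces an almost-zero update
```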
An example in which we were consistently able to observe this problem is the game of Yavalath [30], in which players win the game by constructing lines of four pieces of their colour, but immediately lose if they first construct a line of three pieces of their colour. Fig. 1 provides a graphical representation of three features that could be used to detect winning and/or losing moves. The top feature detects winning moves that place a piece to complete a line of four, and the bottom two features detect losing moves that place pieces to complete lines of three. Note that the features that detect losing moves can be viewed as more "general" features, in the sense that they will also always be active in situations where the win-detecting feature is active.
When the set of features is automatically grown over time during self-play, and more "specific" features are constructed by combining multiple more "general" features [29], the loss-detecting features are often discovered before the win-detecting features. These features are, as expected, quickly associated with negative weights, resulting in low probabilities of playing actions in which loss-detecting features are active. When a win-detecting feature is discovered at a later point in time, the loss-detecting features result in low probabilities for most situations in which the win-detecting feature also applies, leading to gradients and update steps close to $0$ despite a strong correlation between feature activity and high values (winning games).
IV-B. Exploratory Policy as Baseline
In most (sample-based) policy gradient methods [24, 23, 25], there is no longer a $\sum_a \nabla_{\theta} \pi_{\theta}(s, a)$ term in the gradient estimator. Instead of summing over all actions, updates are typically performed for actions sampled according to $\pi_{\theta}$, which leads to a $\nabla_{\theta} \log \pi_{\theta}(s, a)$ term in the gradient estimator. This gradient, when combined with a softmax-based policy $\pi_{\theta}$, no longer leads to the issue described above. However, there is a closely-related issue in that actions with low probabilities are rarely sampled at all; this problem is generally viewed as a lack of exploration. It is commonly addressed by introducing an entropy regularisation term in the objective function, which punishes low-entropy policies [31]. That solution is not acceptable for our goals, because it forces an element of exploration in the learned policies; this is precisely the property inherent in the standard cross-entropy-based approach of Expert Iteration that we aim to avoid. Instead, we propose to use the parameters of a more exploratory policy as a baseline, and to train offsets from those parameters using our new policy gradient approach.
Consider a softmax-based policy $\pi_{CE}$, parameterised by a vector $\theta^{CE}$, trained to minimise the standard cross-entropy loss normally used in Expert Iteration. For any given state $s$, this loss is given by (13), where $\hat{\pi}(s, \cdot)$ and $\pi_{CE}(s, \cdot)$, respectively, denote discrete probability distributions (vectors) over all actions in the state $s$:

$$\mathcal{L}_{CE}(\theta^{CE}) = -\sum_{a \in \mathcal{A}(s)} \hat{\pi}(s, a) \log \pi_{CE}(s, a) \quad (13)$$

Suppose that $\pi_{CE}$ is defined as a softmax over linear functions of state-action features, parameterised by the trainable parameters $\theta^{CE}$, as described in the beginning of this section. Then, the gradient of this loss is given by (14):

$$\nabla_{\theta^{CE}} \mathcal{L}_{CE} = \sum_{a \in \mathcal{A}(s)} \left( \pi_{CE}(s, a) - \hat{\pi}(s, a) \right) \phi(s, a) \quad (14)$$
Note that, unlike the gradient in (12), this gradient does not suffer from the problem that the magnitudes of gradient-based updates are close to $0$ when the trainable policy (in this case $\pi_{CE}$) has (incorrectly) converged to parameters that result in near-zero probabilities for certain state-action pairs. In the example situation described above for Yavalath, we indeed find that a policy trained to minimise this cross-entropy loss is capable of learning high weights for win-detecting features quickly after the feature itself is first introduced.
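The contrast with (12) can be shown numerically under the same made-up one-feature setup: even when the current policy assigns near-zero probability to an action, the cross-entropy gradient stays large as long as the policy disagrees with the MCTS target distribution.

```python
import math

def softmax(logits):
    e = [math.exp(l) for l in logits]
    s = sum(e)
    return [x / s for x in e]

feats = [[1.0], [0.0]]   # a single feature, active only for action 0
target = [0.95, 0.05]    # made-up MCTS visit distribution favouring action 0

theta0 = -8.0            # action 0 is currently suppressed by the policy
probs = softmax([theta0 * f[0] for f in feats])

# Eq. (14): (pi_CE(s,a) - pi_hat(s,a)) summed against the feature values
ce_grad = sum((probs[a] - target[a]) * feats[a][0] for a in range(2))
# ce_grad is roughly (0.0003 - 0.95) * 1, i.e. about -0.95; descent on the
# loss therefore takes a large step that increases the suppressed weight
```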
We propose to exploit this advantage of the cross-entropy loss by defining the logits that are plugged into the softmax of a TSPG-based policy $\pi_{TSPG}$ (trained to maximise the TSPG objective of (6)) as follows:

$$z(s, a) = \left( \theta^{CE} + \theta^{TSPG} \right)^{\top} \phi(s, a) \quad (15)$$

Here, $\theta^{CE}$ denotes the parameter vector of a policy trained to minimise the cross-entropy loss: a more "exploratory" policy which learns to mimic the exploratory behaviour of MCTS. When training the policy $\pi_{TSPG}$ to maximise (6), we freeze $\theta^{CE}$ and only allow the parameters $\theta^{TSPG}$ to be adjusted. This leaves all the gradients and estimators in Section III unchanged. The parameters $\theta^{CE}$ can be viewed as a smart "initialisation" of parameters, which is dynamic and can change over time due to its own learning process. The parameters $\theta^{TSPG}$ can be viewed as "offsets", and the sums $\theta^{CE} + \theta^{TSPG}$ are then the parameters that actually optimise the TSPG objective.
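The offset scheme of (15) can be sketched as follows; all vectors and features are illustrative. With zero offsets, the combined policy reproduces the cross-entropy-trained policy exactly, and training then only adjusts the offsets.

```python
import math

def softmax(logits):
    e = [math.exp(l) for l in logits]
    s = sum(e)
    return [x / s for x in e]

def tspg_policy(theta_ce, theta_offset, feats):
    """Softmax policy with logits (theta_CE + theta_TSPG)^T phi(s,a), Eq. (15).

    theta_ce is frozen (trained with the cross-entropy loss); only
    theta_offset would be updated by TSPG gradient ascent.
    """
    combined = [c + o for c, o in zip(theta_ce, theta_offset)]
    return softmax([sum(w * f for w, f in zip(combined, fv)) for fv in feats])

theta_ce = [2.0, -1.0]            # made-up frozen baseline parameters
theta_offset = [0.0, 0.0]         # trainable offsets, initially zero
feats = [[1.0, 0.0], [0.0, 1.0]]  # one-hot features for two actions

p = tspg_policy(theta_ce, theta_offset, feats)
# with zero offsets, p equals softmax([2.0, -1.0]), i.e. the baseline policy
```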
V. Experiments
This section describes a number of experiments carried out to compare policies trained to minimise the standard cross-entropy loss of (13) with policies trained to maximise the TSPG objective of (6). All experiments are carried out using a variety of deterministic, adversarial, two-player, perfect-information board games.
V-A. Setup
All policies are trained using self-play Expert Iteration processes [4, 5, 6]. The policies are all defined as linear functions of state-action features [22], transformed into probability distributions using a softmax, as described in Section IV. The sets of features grow automatically throughout self-play [29].
Experience is generated in self-play, where all players are identical MCTS agents. They use the same PUCT strategy as AlphaGo Zero [4] for the selection phase, with a fixed exploration constant, and a policy trained to minimise the cross-entropy loss providing bias. All value estimates lie in a fixed range, with the extremes corresponding to losses and wins, and the midpoint to ties. In the selection phase, unvisited actions are not automatically prioritised; they are assigned a value estimate equal to the value estimate of the parent node. We experiment with policies trained on the cross-entropy objective, as well as policies trained on the TSPG objective, for the playout phase. Every turn, MCTS reuses the relevant subtree of the complete search tree generated in previous turns, and runs a fixed number of additional MCTS iterations (with a lower limit in Hex, due to high computation time). Actions in self-play are selected proportional to the MCTS visit counts (i.e., sampled from the distributions $\hat{\pi}(s_0, \cdot)$ in root states $s_0$).
Every training run described in this section consists of 200 sequential games of self-play. For every state $s$ encountered in self-play, we store a tuple in an experience buffer, containing $s$, the distribution $\hat{\pi}(s, \cdot)$ induced by the visit counts of MCTS, and a vector of value estimates $\hat{Q}(s, a)$ for all actions $a \in \mathcal{A}(s)$. Note that the choice to store every encountered state, rather than only one state per full game of self-play, may lead to a poor estimate of the desired distribution over states due to high correlations, but is better in terms of sample efficiency. The experience buffer operates as a FIFO queue with a fixed maximum size.
After every turn in self-play, we run a single mini-batch gradient descent (or ascent) update per vector of parameters that we aim to optimise (first updating any parameters for cross-entropy losses, and then any parameters for the TSPG objective). Gradients are averaged over mini-batches of samples, drawn uniformly at random from the experience buffer. Updates are performed using a centred variant of RMSProp [32], with fixed settings for the base learning rate, momentum, discounting factor, and the constant added to the denominator for stability. After every full game of self-play, we add a new feature to the set of features [29].

All self-play games are automatically terminated after 150 moves. In the playout phase of MCTS, playouts are terminated and declared a tie after a fixed number of moves have been selected according to the playout policy.
Some of the experiments involve evaluating the playing strength of different variants of MCTS after self-play training as described above. We use Biased MCTS to refer to a version of MCTS that is identical to the agents used to generate self-play experience as described above, except that in evaluation games it selects actions to maximise visit count, rather than selecting actions proportional to visit counts. We use UCT to refer to a standard implementation of MCTS [1, 3], using the UCB1 strategy [33] with a fixed exploration constant in the selection phase, and selecting actions uniformly at random in the playout phase. We also allow UCT to reuse search trees from previous turns.
V-B. Results
In the first experiment, we compare the raw playing strength of standalone policies trained either to minimise the standard cross-entropy loss, or to maximise the TSPG objective. At various checkpoints during the self-play learning process (after 1, 25, 50, 100, and 200 games of self-play), we run evaluation games between softmax-based policies using the parameters learned at that checkpoint for either objective. We use $\pi_{CE}$ to denote the policy trained on the cross-entropy loss. This is also the same policy that is used throughout self-play to bias the selection phase. We use $\pi_{TSPG}$ to denote the policy trained on the TSPG objective. Finally, we use $\pi_{CE}$ (double) to denote a policy that, like $\pi_{TSPG}$, uses the parameters of $\pi_{CE}$ as a baseline (see Subsection IV-B), but, unlike $\pi_{TSPG}$, again uses the cross-entropy loss to compute offsets from the baseline parameters.
Fig. 2 depicts learning curves, with the win percentages of $\pi_{TSPG}$ and $\pi_{CE}$ (double) against $\pi_{CE}$ measured at the different checkpoints. We repeat the complete training process from scratch five times with different random seeds, and play 200 evaluation games for each repetition. This leads to five different estimates of each win percentage, each of which is itself measured across 200 evaluation games. We use a bootstrap method to estimate confidence intervals [34, 35] from these five estimates of win percentage per checkpoint, which are depicted as shaded areas.
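A percentile bootstrap is one common way to compute such intervals from a handful of per-repetition estimates; the sketch below uses made-up win rates, and the exact bootstrap variant and confidence level of the paper are not restated here.

```python
import random

def bootstrap_ci(samples, n_boot=10000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    # resample with replacement and record the mean of each resample
    means = sorted(sum(rng.choices(samples, k=len(samples))) / len(samples)
                   for _ in range(n_boot))
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# hypothetical win rates from five independent training repetitions
win_rates = [0.61, 0.58, 0.65, 0.55, 0.60]
lo, hi = bootstrap_ci(win_rates)
# lo and hi bound the mean win rate; they lie within [min, max] of the samples
```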
It is clear from the figure that $\pi_{TSPG}$ consistently outperforms $\pi_{CE}$, in many games by a significant margin. We also observe that $\pi_{CE}$ (double) occasionally outperforms $\pi_{CE}$, but generally by a smaller margin than $\pi_{TSPG}$.
Table I shows win percentages in evaluation games of a Biased MCTS agent versus UCT. We compare two variants of the Biased MCTS: one where the cross-entropy-based $\pi_{CE}$ (double) policy is used to run MCTS playouts, and one where the TSPG-based $\pi_{TSPG}$ policy is used to run MCTS playouts. In both cases, we use the final parameters learned after 200 games of self-play. Because our focus in this paper is on evaluating the quality of learned policies or strategies, we run these evaluation games with equal MCTS iteration count limits for all players. Note that this is not representative of playing strength under equal time constraints, since Biased MCTS generally takes more time to run than UCT. However, we do in most games find that Biased MCTS still outperforms UCT under equal time constraints (with most results being slightly improved relative to our previously-published results [29]).
Similar to the evaluation in the previous subsection, we include all the different parameters learned from the five different repetitions of the training runs in the evaluation. For each vector of parameters resulting from a different repetition, we run 40 evaluation games, for a total of 200 evaluation games across the five repetitions. The different estimates of win percentages from the different repetitions are used to construct bootstrap confidence intervals, which are shown in brackets in the table. In most games, we observe that both variants of Biased MCTS significantly outperform UCT, but playouts from the cross-entropy-based $\pi_{CE}$ (double) policy often appear to be slightly more informative to the MCTS agent than playouts based on the TSPG objective.
Table I: Win % (bootstrap conf. interval) of Biased MCTS vs. UCT

Game (board size)    $\pi_{CE}$ (double) playouts    $\pi_{TSPG}$ playouts
Breakthrough
Connect 4
Fanorona
Gomoku
Hex
Hex
Knightthrough
Othello
Teeko
Yavalath
Fig. 3 depicts how the entropy of the distributions over actions computed by a number of different policies varies throughout different stages of the different games. The entropy values are normalised to adjust for differences in the number of legal actions between different games and different stages of the same game. These entropy values were recorded in the evaluation games of Biased MCTS vs. UCT, for which win percentages are shown in Table I. In most stages of most games, we find that UCT has the highest entropy, followed (often closely) by $\pi_{CE}$, followed by Biased MCTS, finally followed by $\pi_{TSPG}$.
Fig. 4 depicts kernel density estimates for the distributions of values in the learned parameter vectors after 200 games of self-play, when optimising for the cross-entropy loss or the TSPG objective in the game of Othello. We observe that the cross-entropy loss leads to a higher peak of parameter values close to $0$, and a shorter range of more extreme parameter values far away from $0$. In all other games (plots omitted to save space), we consistently observed similar differences between the two distributions.
VI. Discussion
The clear advantage in playing strength that $\pi_{TSPG}$ has over $\pi_{CE}$ in Fig. 2 suggests that the TSPG objective is better suited for learning strong strategies, likely due to the lack of an incentive to explore in the objective. The $\pi_{CE}$ (double) policy slightly outperforms $\pi_{CE}$ in some games, which suggests that some small gains in playing strength may simply be due to the increased number of gradient descent update steps that are taken by $\pi_{CE}$ (double) in comparison to $\pi_{CE}$.
The results in Table I suggest that, despite the higher standalone playing strength of $\pi_{TSPG}$, $\pi_{CE}$ (double) may be more informative when used as a playout policy for MCTS agents. It has previously been observed [10, 12, 17] that policies optimised for "balance", rather than standalone playing strength, may result in more informative evaluations from MCTS playouts. Our results suggest that the cross-entropy loss may similarly lead to more balanced policies, leading to a decreased likelihood of biased evaluations.
The entropy plots in Fig. 3 show that the distributions over actions recommended by $\pi_{TSPG}$ tend to have the lowest entropy, which means that $\pi_{TSPG}$ most often approaches deterministic policies, assigning the majority of the probability mass to only one or a few actions. We expect this to be beneficial for the extraction of interpretable strategies from trained policies, because it means that there is more often a clear ranking of actions, and little ambiguity as to which action to pick in any given game state.
An interesting observation is that $\pi_{CE}$ is explicitly optimised (through the cross-entropy loss) to have distributions close to those of Biased MCTS, yet it still often has significantly higher entropy than Biased MCTS. In terms of entropy, the distributions resulting from $\pi_{TSPG}$ appear to be closer to those of Biased MCTS in many games, despite not being directly optimised for that target.
The results in Fig. 4 suggest that optimising for the TSPG objective, rather than the cross-entropy loss, may make it easier to obtain a clear ranking of features, due to differences between feature weights being more pronounced, and fewer different features having highly similar weights. We again expect this to be beneficial for the interpretation of learned strategies. A comparison to results published on learning balanced playout policies in Go [12] supports the observation described above that the cross-entropy loss may lead to more "balanced" [10] policies.
VII. Conclusion
We proposed a novel objective function, referred to as the TSPG objective, for policies in Markov decision processes. Intuitively, a policy that maximises this objective function can be understood as one that selects actions such that, in expectation, an MCTS agent can perform well when playing out the remainder of the episode. We derived a policy gradient expression, which can be estimated using value estimates resulting from MCTS processes. Policies can be trained to optimise this objective using self-play, similar to cross-entropy-based policies in AlphaGo Zero and related research [4, 5, 6]. We argue that, due to the lack of a level of exploration in this objective's training target, it is more suitable for goals such as interpretable strategy extraction [21, 22].
Across a variety of different board games, we empirically demonstrate that the TSPG objective tends to lead to stronger standalone policies than the cross-entropy loss. Their distributions over actions tend to have significantly lower entropy, which may make it easier to extract clear, unambiguous advice or strategies from them. The TSPG objective also leads to a wider range of different values for feature weights, which can make it easier to separate features from each other based on their perceived importance.
In future work, we aim to extract interpretable strategies from learned policies, for instance by analysing the contribution [36] of individual features to the predictions made for specific game positions, or larger sets of positions. The feature representation [22] that we use is generally applicable across many different games, and allows for easy visualisation, which will be beneficial in this regard.
Acknowledgment
This research is part of the European Research Council-funded Digital Ludeme Project (ERC Consolidator Grant #771292) run by Cameron Browne at Maastricht University’s Department of Data Science and Knowledge Engineering.
References
 [1] L. Kocsis and C. Szepesvári, “Bandit based Monte-Carlo planning,” in Mach. Learn.: ECML 2006, ser. LNCS, J. Fürnkranz, T. Scheffer, and M. Spiliopoulou, Eds. Springer, Berlin, Heidelberg, 2006, vol. 4212, pp. 282–293.
 [2] R. Coulom, “Efficient selectivity and backup operators in Monte-Carlo tree search,” in Computers and Games, ser. LNCS, H. J. van den Herik, P. Ciancarini, and H. H. L. M. Donkers, Eds., vol. 4630. Springer Berlin Heidelberg, 2007, pp. 72–83.
 [3] C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, “A survey of Monte Carlo tree search methods,” IEEE Trans. Comput. Intell. AI Games, vol. 4, no. 1, pp. 1–49, 2012.
 [4] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, “Mastering the game of Go without human knowledge,” Nature, vol. 550, pp. 354–359, 2017.
 [5] T. Anthony, Z. Tian, and D. Barber, “Thinking fast and slow with deep learning and tree search,” in Adv. in Neural Inf. Process. Syst. 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 5360–5370.
 [6] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play,” Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
 [7] S. Gelly, Y. Wang, R. Munos, and O. Teytaud, “Modification of UCT with patterns in Monte-Carlo Go,” INRIA, Paris, Tech. Rep. RR-6062, 2006.
 [8] R. Coulom, “Computing ‘Elo ratings’ of move patterns in the game of Go,” ICGA Journal, vol. 30, no. 4, pp. 198–208, 2007.
 [9] S. Gelly and D. Silver, “Combining online and offline knowledge in UCT,” in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 273–280.
 [10] D. Silver and G. Tesauro, “Monte-Carlo simulation balancing,” in Proc. 26th Int. Conf. Mach. Learn., 2009, pp. 945–952.
 [11] H. Baier and P. D. Drake, “The power of forgetting: Improving the last-good-reply policy in Monte Carlo Go,” IEEE Trans. Comput. Intell. AI Games, vol. 2, no. 4, pp. 303–309, 2010.
 [12] S.-C. Huang, R. Coulom, and S.-S. Lin, “Monte-Carlo simulation balancing in practice,” in Computers and Games. CG 2010., ser. LNCS, H. J. van den Herik, H. Iida, and A. Plaat, Eds., vol. 6515. Springer, Berlin, Heidelberg, 2011, pp. 81–92.
 [13] M. H. M. Winands and Y. Björnsson, “αβ-based play-outs in Monte-Carlo tree search,” in Proc. 2011 IEEE Conf. Comput. Intell. Games. IEEE, 2011, pp. 110–117.
 [14] J. A. M. Nijssen and M. H. M. Winands, “Playout search for Monte-Carlo tree search in multi-player games,” in Adv. in Computer Games. ACG 2011., ser. LNCS, H. J. van den Herik and A. Plaat, Eds., vol. 7168. Springer, Berlin, Heidelberg, 2012.
 [15] D. Silver, R. S. Sutton, and M. Müller, “Temporaldifference search in computer Go,” Mach. Learn., vol. 87, no. 2, pp. 183–219, 2012.
 [16] T. Graf and M. Platzner, “Adaptive playouts for online learning of policies during Monte Carlo tree search,” Theoretical Comput. Sci., vol. 644, pp. 53–62, 2016.
 [17] ——, “Monte-Carlo simulation balancing revisited,” in Proc. 2016 IEEE Conf. Comput. Intell. Games. IEEE, 2016, pp. 186–192.
 [18] D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
 [19] T. Cazenave, “Playout policy adaptation with move features,” Theoretical Comput. Sci., vol. 644, pp. 43–52, 2016.
 [20] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA: MIT Press, 2018.
 [21] C. Browne, “Modern techniques for ancient games,” in Proc. 2018 IEEE Conf. Comput. Intell. Games. IEEE, 2018, pp. 490–497.
 [22] C. Browne, D. J. N. J. Soemers, and É. Piette, “Strategic features for general games,” in Proc. 2nd Workshop on Knowledge Extraction from Games (KEG), 2019, pp. 70–75.
 [23] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Adv. in Neural Inf. Process. Syst. 12, S. A. Solla, T. K. Leen, and K. Müller, Eds. MIT Press, 2000, pp. 1057–1063.
 [24] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Mach. Learn., vol. 8, no. 3–4, pp. 229–256, 1992.
 [25] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” in Int. Conf. Learning Representations (ICLR 2016), 2016.
 [26] T. Vodopivec, S. Samothrakis, and B. Šter, “On Monte Carlo tree search and reinforcement learning,” J. Artificial Intell. Res., pp. 881–936, 2017.
 [27] T. Degris, M. White, and R. S. Sutton, “Off-policy actor-critic,” in Proc. 29th Int. Conf. Mach. Learn., J. Langford and J. Pineau, Eds. Omnipress, 2012, pp. 457–464.
 [28] C. Allen, K. Asadi, M. Roderick, A. Mohamed, G. Konidaris, and M. Littman, “Mean actor critic,” 2018. [Online]. Available: https://arxiv.org/abs/1709.00503
 [29] D. J. N. J. Soemers, É. Piette, and C. Browne, “Biasing MCTS with features for general games,” in 2019 IEEE Congr. Evol. Computation, 2019, in press. [Online]. Available: https://arxiv.org/abs/1903.08942v1
 [30] C. Browne, “Automatic generation and evaluation of recombination games,” Ph.D. dissertation, Queensland University of Technology, Brisbane, Australia, 2008.
 [31] Z. Ahmed, N. L. Roux, M. Norouzi, and D. Schuurmans, “Understanding the impact of entropy on policy optimization,” 2019. [Online]. Available: https://arxiv.org/abs/1811.11214v3
 [32] A. Graves, “Generating sequences with recurrent neural networks,” 2013. [Online]. Available: https://arxiv.org/abs/1308.0850v5
 [33] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multi-armed bandit problem,” Mach. Learn., vol. 47, no. 2–3, pp. 235–256, 2002.
 [34] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. CRC Press, 1994.
 [35] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in Proc. 32nd AAAI Conf. Artificial Intell. AAAI, 2018, pp. 3207–3214.
 [36] S. M. Lundberg and S.I. Lee, “A unified approach to interpreting model predictions,” in Adv. in Neural Inf. Process. Syst. 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 4765–4774.