One of the most difficult tasks in artificial intelligence is the sequential decision-making problem (Littman, 1996), whose applications include robotics and games. In games, the successes are numerous: machines now surpass humans at several games, such as backgammon, checkers, chess, and Go (Silver et al., 2017b). A major class of games is two-player games with perfect information, that is to say, games in which players play in turn, without chance or hidden information. Many challenges remain for these games. For example, in the game of Hex, computers have only been able to beat strong humans since 2020 (Cazenave et al., 2020). In general game playing (Genesereth et al., 2005), even restricted to games with perfect information, humans remain superior to machines on an unknown game (when human and machine have a relatively short learning time to master the rules of the game). In this article, we focus on two-player zero-sum games with perfect information, although most of the contributions of this article should be applicable or easily adaptable to a more general framework.
The first approaches used to design a game-playing program are based on a game tree search algorithm, such as minimax, combined with a handcrafted game state evaluation function based on expert knowledge. A notable use of this technique is the Deep Blue chess program (Campbell et al., 2002). However, the success of Deep Blue is largely due to the raw power of the computer, which could analyze two hundred million game states per second. In addition, this approach is limited by having to design an evaluation function manually (at least partially). This design is a very complex task, which must, in addition, be carried out for each different game. Several works have thus focused on the automatic learning of evaluation functions (Mandziuk, 2010). One of the first successes of learning evaluation functions was on the game of Backgammon (Tesauro, 1995). However, for many games, such as Hex or Go, minimax-based approaches, with or without machine learning, have failed to surpass humans. Two causes have been identified (Baier and Winands, 2018). Firstly, the very large number of possible actions at each game state prevents an exhaustive search at a significant depth (the game can only be anticipated a few turns in advance). Secondly, for these games, no sufficiently powerful evaluation function could be identified. An alternative approach, called Monte Carlo Tree Search and denoted MCTS (Coulom, 2006; Browne et al., 2012), has been proposed to solve these two problems, giving notably good results for Hex and Go. This algorithm explores the game tree non-uniformly, which is a solution to the problem of the very large number of actions. In addition, it evaluates game states from victory statistics of a large number of random end-game simulations; it does not need an evaluation function. This was not enough, however, to surpass the level of the best human players.
Several variants of Monte Carlo tree search were then proposed, using in particular knowledge to guide the exploration of the game tree and/or the random end-game simulations (Browne et al., 2012). Recent improvements in Monte Carlo tree search have focused on the automatic learning of this knowledge and its use. This knowledge was first generated by supervised learning (Clark and Storkey, 2015; Gao et al., 2017, 2018; Cazenave, 2018; Tian and Zhu, 2015), then by supervised learning followed by reinforcement learning (Silver et al., 2016), and finally by reinforcement learning alone (Silver et al., 2017b; Anthony et al., 2017; Silver et al., 2018). This allowed programs to reach and surpass the level of the world champion at the game of Go with the latest versions of the program AlphaGo (Silver et al., 2016, 2017b). In particular, AlphaGo Zero (Silver et al., 2017b), which only uses reinforcement learning, did not need any knowledge to reach its level of play. This last success, however, required 29 million games. This approach has also been applied to chess (Silver et al., 2017a): the resulting program defeated the best chess program (which is based on minimax).
It is therefore questionable whether minimax is totally outdated or whether the spectacular successes of recent programs owe more to reinforcement learning than to Monte Carlo tree search. In particular, it is interesting to ask whether reinforcement learning would enhance minimax enough to make it competitive with Monte Carlo tree search on games where MCTS has so far dominated minimax, such as Go or Hex.
In this article, we therefore focus on reinforcement learning within the minimax framework. We propose and assess new techniques for reinforcement learning of evaluation functions. Then, we apply them to design new program-players for the game of Hex (without using any knowledge other than the rules of the game). We compare these program-players to Mohex 2.0 (Huang et al., 2013), the Hex champion (sizes 11 and 13) of the Computer Olympiad from 2013 to 2017 (Hayward and Weninger, 2017), which is also the strongest player program publicly available.
In the next section, we briefly present game algorithms, in particular minimax with unbounded depth, on which we base several of our experiments. We also present reinforcement learning in games, the game of Hex, and the state of the art of game programs for this game. In the following sections, we propose different techniques aimed at improving learning performance and we present the experiments carried out using these techniques. In particular, in Section 3, we extend the tree bootstrapping (tree learning) technique to the context of reinforcement learning without knowledge based on non-linear functions. In Section 4, we present a new search algorithm, a variant of unbounded minimax called descent, intended to be used during the learning process. In Section 5, we introduce reinforcement heuristics. Their usage is a simple way to use general or dedicated knowledge in reinforcement learning processes. We study several reinforcement heuristics in the context of different games. In Section 6, we propose another variant of unbounded minimax, which plays the safest action instead of the best action. This modified search is intended to be used after the learning process. In Section 7, we introduce a new action selection distribution and we apply it, with all the previous techniques, to design program-players for the game of Hex (sizes 11 and 13). Finally, in the last section, we conclude and discuss the different research perspectives.
2 Background and Related Work
In this section, we briefly present game tree search algorithms, reinforcement learning in the context of games and their applications to Hex (for more details about game algorithms, see (Yannakakis and Togelius, 2018)).
Games can be represented by their game tree (a node corresponds to a game state and the children of a node are the states that can be reached by an action). From this representation, we can determine the action to play using a game tree search algorithm. In order to win, each player tries to maximize his score (i.e. the value of the game state for this player at the end of the game). As we place ourselves in the context of two-player zero-sum games, to maximize the score of a player is to minimize the score of his opponent (the score of a player is the negation of the score of his opponent).
2.1 Game Tree Search Algorithms
The central algorithm is minimax, which recursively determines the value of a node from the values of its children and the functions min and max, up to a limit recursion depth. With this algorithm, the game tree is uniformly explored. A better implementation of minimax uses alpha-beta pruning (Knuth and Moore, 1975; Yannakakis and Togelius, 2018), which makes it possible not to explore the sections of the game tree which are less interesting given the values of the nodes already encountered and the properties of min and max. Many variants and improvements of minimax have been proposed (Millington and Funge, 2009). For instance, iterative deepening (Slate and Atkin, 1983; Korf, 1985) allows one to use minimax with a time limit. It sequentially performs alpha-beta searches of increasing depth as long as there is time left. It is generally combined with the move ordering technique (Fink, 1982), which consists of extending the best move from the previous search first, which accelerates the new search. Some variants perform a search with unbounded depth (that is, the depth of their search is not fixed) (Van Den Herik and Winands, 2008; Schaeffer, 1990; Berliner, 1981). Unlike minimax with or without alpha-beta pruning, the exploration of these algorithms is non-uniform. One of these algorithms is the best-first minimax search (Korf and Chickering, 1996). To avoid any confusion with some best-first approaches at fixed depth, we call this algorithm Unbounded Best-First Minimax, or more succinctly UBFM. UBFM iteratively extends the game tree by adding the children of one of the leaves of the game tree having the same value as the root (its minimax value). These leaves are the states obtained after having played one of the best sequences of actions possible given the current partial knowledge of the game tree. Thus, this algorithm iteratively extends the a priori best sequences of actions. These best sequences usually change at each extension.
Thus, the game tree is non-uniformly explored by focusing on the a priori most interesting actions, without exploring just one sequence of actions. In this article, we use the anytime version of UBFM (Korf and Chickering, 1996), i.e. we leave a fixed search time for UBFM to decide the action to play. We also use transposition tables (Greenblatt et al., 1988; Millington and Funge, 2009) with UBFM, which makes it possible not to explicitly build the game tree and to merge the nodes corresponding to the same state. Algorithm 1 is the implementation of UBFM used in this paper. (This implementation is a slight variant of Korf and Chickering's algorithm. Their algorithm is very slightly more efficient, but it offers less freedom: our algorithm behaves slightly differently depending on how ties between two states having the same value are broken. The exploration of states is identical between their algorithm and ours when, with our algorithm, ties are broken in favor of the deepest states. Our variant was discovered independently of Korf and Chickering's work.)
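To make the exploration scheme concrete, here is a minimal sketch of unbounded best-first minimax on a hard-coded toy game tree. The tree, the leaf values, and all names are illustrative assumptions for this example only; this does not reproduce Algorithm 1 or its transposition-table details.

```python
# Toy game tree: interior nodes alternate MAX/MIN; leaves carry heuristic
# values from the first player's point of view.
CHILDREN = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
LEAF_VALUE = {"a1": 3, "a2": -1, "b1": 2, "b2": 5}

def ubfm(root, iterations):
    """Iteratively extend the a priori best sequence of actions."""
    value = {}          # acts as a transposition table: node -> value
    expanded = set()

    def heuristic(n):
        return LEAF_VALUE.get(n, 0)

    def select_and_extend(n, maximizing):
        if n not in CHILDREN:                     # leaf: evaluate it
            value[n] = heuristic(n)
            return
        if n not in expanded:                     # first visit: add children
            expanded.add(n)
            for c in CHILDREN[n]:
                value.setdefault(c, heuristic(c))
        else:                                     # descend into the best child
            best = (max if maximizing else min)(CHILDREN[n],
                                                key=lambda c: value[c])
            select_and_extend(best, not maximizing)
        # back the minimax value up
        value[n] = (max if maximizing else min)(value[c] for c in CHILDREN[n])

    for _ in range(iterations):
        select_and_extend(root, True)
    best_action = max(CHILDREN[root], key=lambda c: value[c])
    return best_action, value[root]

best_child, root_value = ubfm("root", 4)
```

Each iteration descends along the current best line, extends one node, and backs the minimax values up, so the principal variation can change between iterations.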
2.2 Learning of Evaluation Functions
Reinforcement learning of evaluation functions can be done by different techniques (Mandziuk, 2010; Silver et al., 2017b; Anthony et al., 2017; Young et al., 2016). The general idea of reinforcement learning of state evaluation functions is to use a game tree search algorithm and an adaptive evaluation function f_θ of parameter θ (for example a neural network) to play a sequence of games (for example against oneself, which is the case in this article). Each game generates pairs (s, v), where s is a state and v the value of s calculated by the chosen search algorithm using the evaluation function f_θ. The states generated during one game can be the states of the sequence of states of the game (Tesauro, 1995; Veness et al., 2009). For example, in the case of root bootstrapping (a technique that we call root learning in this article), the set of pairs used during the learning phase consists of the pairs (s, v) for the states s of the sequence of states of the game. In the case of the tree bootstrapping (tree learning) technique (Veness et al., 2009), the generated states are the states of the partial game tree built to decide which actions to play (which includes the states of the sequence of states of the game). Thus, contrary to root bootstrapping, tree bootstrapping does not discard most of the information used to decide the actions to play. The values of the generated states can be their minimax values in the partial game tree built to decide which actions to play (Veness et al., 2009; Tesauro, 1995). Work on tree bootstrapping has been limited to reinforcement learning of linear functions of state features. It has not been formulated or studied in the context of reinforcement learning without knowledge based on non-linear functions. Note that, in the case of AlphaGo Zero, the value of each generated state (the states of the sequence of the game) is the value of the terminal state of the game (Silver et al., 2017b). We call this technique terminal learning.
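The three ways of generating training pairs described above can be contrasted with a small sketch. The data structures are assumptions for illustration: a dictionary mapping each state of the partial game tree to its minimax value, the sequence of states actually played, and the result of the game.

```python
# Contrast of the three training-data schemes: root, tree, and terminal
# learning. All names and values here are illustrative assumptions.

def root_learning_pairs(game_states, tree_values):
    # learn only the searched values of the states actually played
    return [(s, tree_values[s]) for s in game_states]

def tree_learning_pairs(tree_values):
    # learn the minimax value of every state of the partial game tree
    return list(tree_values.items())

def terminal_learning_pairs(game_states, terminal_value):
    # label every played state with the final result of the game
    return [(s, terminal_value) for s in game_states]

# Tiny example: two played states, a four-state partial game tree,
# and a win of the first player (value 1).
tree_values = {"s0": 0.4, "s1": 0.7, "s0a": -0.2, "s1b": 0.9}
game_states = ["s0", "s1"]
```

Tree learning keeps all four searched states, while the two other schemes keep only the two played ones.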
Generally, between two consecutive games (between match phases), a learning phase occurs, using the pairs of the last game. Each learning phase consists in modifying θ so that, for all pairs (s, v), f_θ(s) sufficiently approaches v to constitute a good approximation. Note that, in a variant, learning phases can use the pairs of several games. This technique is called experience replay (Mnih et al., 2015). Note also that adaptive evaluation functions only serve to evaluate non-terminal states, since we know the true value of terminal states.
2.3 Action Selection Distribution
One of the problems related to reinforcement learning is the exploration-exploitation dilemma (Mandziuk, 2010). It consists of choosing between exploring new states to learn new knowledge and exploiting the acquired knowledge. Many techniques have been proposed to deal with this dilemma (Mellor, 2014). However, most of these techniques do not scale, because their application requires memorizing all the encountered states. For this reason, in the context of games with large numbers of states, some approaches use probabilistic exploration (Young et al., 2016; Silver et al., 2017b; Mandziuk, 2010; Schraudolph et al., 2001). With this approach, to exploit is to play the best action and to explore is to play uniformly at random. More precisely, a parametric probability distribution is used to associate with each action its probability of being played. The parameter associated with the distribution corresponds to the exploration rate ε (between 0 and 1); the exploitation rate is therefore 1 − ε. The rate ε is often fixed experimentally. Simulated annealing (Kirkpatrick et al., 1983) can, however, be applied to avoid choosing a value for this parameter. In this case, at the beginning of reinforcement learning, ε is 1 (pure exploration); it gradually decreases until reaching 0 at the end of learning (pure exploitation). The simplest action selection distribution is ε-greedy (Young et al., 2016). With this distribution, an action is chosen uniformly at random with probability ε and the best action is chosen with probability 1 − ε (see also Algorithm 2).
The ε-greedy distribution has the disadvantage of not differentiating between actions (except the best one) in terms of probabilities. Another distribution, the softmax distribution (Schraudolph et al., 2001; Mandziuk, 2010), is often used to correct this disadvantage. It is defined by p_i = exp(v_i / τ) / Σ_j exp(v_j / τ), where j ranges over the children of the current state, p_i is the probability of playing action i, v_i is the value of the state obtained after playing i in the current state, and τ > 0 is a parameter called temperature (τ → 0: exploitation; τ → +∞: exploration).
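The two distributions can be sketched as follows; the value lists and parameter values are illustrative assumptions, and this is not the paper's Algorithm 2.

```python
import math
import random

def epsilon_greedy(values, epsilon, rng=random):
    """Play a uniformly random action with probability epsilon,
    otherwise play the action with the best value."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])

def softmax_probs(values, temperature):
    """p_i proportional to exp(v_i / tau): tau -> 0 exploits,
    tau -> infinity explores uniformly."""
    m = max(values)                    # subtract the max for numerical stability
    exps = [math.exp((v - m) / temperature) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax_probs([1.0, 2.0, 3.0], temperature=1.0)
```

Unlike ε-greedy, the softmax probabilities grow with the values of the actions, so second-best actions are explored more often than bad ones.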
2.4 Game of Hex
The game of Hex (Browne, 2000) is a two-player combinatorial strategy game. It is played on an initially empty board of hexagonal cells; a board with n cells per side is said to be of size n. The board can be of any size, although the classic sizes are 11, 13, and 19. In turn, each player places a stone of his color on an empty cell (all stones are identical). The goal of the game is to be the first to connect the two opposite sides of the board corresponding to one's color. Figure 1 illustrates an end game. Although these rules are simple, Hex tactics and strategies are complex. The number of states and the number of actions per state are very large, similar to the game of Go. Beyond a certain board size, the number of states is, for example, higher than that of chess (see the corresponding table in (Van Den Herik et al., 2002)). For any board size, the first player has a winning strategy (Berlekamp et al., 2003), but this strategy is unknown except for small board sizes (Pawlewicz and Hayward, 2013), up to which the game is weakly solved. In fact, resolving a particular state is PSPACE-complete (Reisch, 1981; Bonnet et al., 2016). There is a variant of Hex using a swap rule. With this variant, the second player can play as first action a special action, called swap, which swaps the colors of the two players (i.e. they exchange their pieces and their sides). This rule prevents the first move from being too advantageous.
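As an illustration of the winning condition, the following sketch checks whether the first player has connected their two sides on a small board; the array encoding and the neighborhood convention are assumptions for the example.

```python
from collections import deque

def first_player_wins(board):
    """board[r][c] == 1 marks the first player's stones; that player wins
    by connecting row 0 to row n-1. On this grid encoding, the six hex
    neighbors of (r, c) are E, W, S, N, NE, and SW."""
    n = len(board)
    starts = [(0, c) for c in range(n) if board[0][c] == 1]
    seen, queue = set(starts), deque(starts)
    while queue:
        r, c = queue.popleft()
        if r == n - 1:
            return True                      # opposite side reached
        for dr, dc in [(0, 1), (0, -1), (1, 0), (-1, 0), (-1, 1), (1, -1)]:
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n and board[nr][nc] == 1 \
                    and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False

win = first_player_wins([[1, 0, 0], [1, 0, 0], [1, 0, 0]])
no_win = first_player_wins([[1, 0, 0], [0, 0, 0], [1, 0, 0]])
```

Since stones are never removed, a single such connectivity test on the final position decides the game.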
2.5 Hex Programs
Many Hex player programs have been developed. For example, Mohex 1.0 (Huang et al., 2013) is a program based on Monte Carlo tree search. It also uses many techniques dedicated to Hex, based on specific theoretical results. In particular, it is able to quickly determine a winning strategy for some states (without expanding the search tree) and to prune at each state many actions that it knows to be inferior. It also uses ad hoc knowledge to bias simulations of Monte Carlo tree search.
Mohex 2.0 (Huang et al., 2013) is an improvement of Mohex 1.0 that uses learned knowledge through supervised learning (namely correlations between victory and board patterns) to guide both tree exploration and simulations.
Other work then focused on predicting the best actions, through supervised learning on a database of games, using a neural network (Michalski et al., 2013; LeCun et al., 2015; Goodfellow et al., 2016). The neural network is used to learn a policy, i.e. a prior probability distribution over the actions to play. These prior probabilities are used to guide the exploration of Monte Carlo tree search. First, there is Mohex-CNN (Gao et al., 2017), which is an improvement of Mohex 2.0 using a convolutional neural network (Krizhevsky et al., 2012). A new version of Mohex was then proposed: Mohex-3HNN (Gao et al., 2018). Unlike Mohex-CNN, it is based on a residual neural network (He et al., 2016). It calculates, in addition to the policy, a value for states and for actions. The state value replaces the evaluation of states based on simulations of Monte Carlo tree search. Adding a value to actions allows Mohex-3HNN to reduce the number of calls to the neural network, improving performance. Mohex-3HNN is the best Hex program: it won the Hex size 11 and size 13 tournaments at the 2018 Computer Olympiad (Gao et al., 2019).
Programs which learn the evaluation function by reinforcement have also been designed. These programs are NeuroHex (Young et al., 2016), EZO-CNN (Takada et al., 2017), DeepEzo (Takada et al., 2019), and ExIt (Anthony et al., 2017). They learn from self-play. Unlike the other three programs, NeuroHex performs supervised learning (of a common Hex heuristic) followed by reinforcement learning. NeuroHex also starts its games with a state from a database of games. EZO-CNN and DeepEzo use knowledge to learn winning strategies in some states. DeepEzo also uses knowledge during confrontations. ExIt learns a policy in addition to the value of states, and it is based on MCTS. It is the only program to have learned to play Hex without using knowledge. This result is, however, limited to board size 9. A comparison of the main characteristics of these different programs is presented in Table 1.
|Program||Sizes||Search||Learning||Network||Outputs|
|Mohex-3HNN||13||MCTS||supervised||residual||policy, state, action|
|EZO-CNN||7, 9, 11||Minimax||reinforcement||convolutional||state|
3 Data Use in Game Learning
In this section, we adapt and study tree learning (see Section 2.2) in the context of reinforcement learning and the use of non-linear adaptive evaluation functions. For this, we compare it to root learning and terminal learning in this context. We start by adapting tree learning, root learning, and terminal learning. Next, we describe the experiment protocol common to several sections of this article. Finally, we present the comparison of tree learning with root learning and terminal learning.
3.1 Tree Learning
As we saw in Section 2.2, tree learning consists in learning the values of the states of the partial game tree obtained at the end of the game. Root learning consists in learning the values of the states of the sequence of states of the game (the value of each state is its value in the search tree). Terminal learning consists in learning the values of the states of the sequence of the game, but the value of each state is the value of the terminal state of the game (i.e. the gain of the game). The data to learn after each game can be modified by optional data processing methods, such as experience replay (see Section 2.2). The learning phase uses a particular update method so that the adaptive evaluation function fits the chosen data. The adaptations of tree learning, root learning, and terminal learning are given respectively in Algorithm 3, Algorithm 4, and Algorithm 5. In this article, we use experience replay as the data processing method (see Algorithm 6; its parameters are the memory size and the sampling rate). In addition, we use stochastic gradient descent as the update method (see Algorithm 7; its parameter is the batch size). Formally, in Algorithm 3, Algorithm 4, and Algorithm 5, the processing method is experience_replay and the update method is stochastic_gradient_descent. Finally, we use ε-greedy as the default action selection method, i.e. action_selection is ε-greedy applied to the values of the children (see Algorithm 2).
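The data pipeline just described (experience replay followed by a gradient update) can be sketched as follows; the one-parameter linear model, the class names, and the parameter values are assumptions for illustration, not Algorithms 6 and 7.

```python
import random

class ExperienceReplay:
    def __init__(self, memory_size, sample_size):
        self.memory_size = memory_size
        self.sample_size = sample_size
        self.memory = []

    def process(self, new_pairs):
        """Add the pairs of the last game, keep only the most recent ones,
        and sample the batch actually used for learning."""
        self.memory.extend(new_pairs)
        self.memory = self.memory[-self.memory_size:]
        k = min(self.sample_size, len(self.memory))
        return random.sample(self.memory, k)

def sgd_update(theta, pairs, lr=0.1):
    """One stochastic-gradient pass for the toy model f_theta(s) = theta * s
    with squared error (theta is a single scalar parameter here)."""
    for s, v in pairs:
        grad = 2.0 * (theta * s - v) * s
        theta -= lr * grad
    return theta

er = ExperienceReplay(memory_size=3, sample_size=2)
sample = er.process([(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)])
theta = sgd_update(0.0, [(1.0, 2.0)] * 50)   # fits f(1.0) toward 2.0
```

Replaying pairs from past games decorrelates consecutive updates, while the bounded memory keeps the targets close to the current evaluation function.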
3.2 Common Experiment Protocol
The experiments of several sections share the same protocol, which is presented in this section. The protocol is used to compare different variants of reinforcement learning algorithms. A variant corresponds to a certain combination of elementary algorithms. More specifically, a combination consists of the association of a search algorithm (iterative deepening alpha-beta (with move ordering), MCTS (UCT with a fixed exploration constant), UBFM, …), an action selection method (ε-greedy distribution (used by default), softmax distribution, …), a terminal evaluation function (the classic game gain (used by default), …), and a procedure for selecting the data to be learned (root learning, tree learning, or terminal learning). The protocol consists in carrying out a reinforcement learning process of a fixed number of hours for each variant. At several stages of the learning process, matches are performed using the adaptive evaluation functions obtained by the different variants. Each variant is then characterized by a winning percentage at each stage of the reinforcement learning process. More formally, each combination is evaluated every hour by a winning percentage: the winning percentage of a combination at hour t is computed from matches of its evaluation function at hour t against the final evaluation function (i.e. at the last hour) of each combination (there is one match as first player and another as second player per pair of combinations). The matches are played using alpha-beta at a fixed depth.
This protocol is repeated several times for each experiment in order to reduce the statistical noise in the winning percentages obtained for each variant (the reported percentage is the average of the percentages over the repetitions). The winning percentages are then represented in a graph showing their evolution during training.
In addition to the curves, the different variants are also compared with respect to their final winning percentage, i.e. at the end of the learning process. Unlike the experiment on the evolution of winning percentages, in the comparison of the different variants at the final stage, each evaluation function confronts every other evaluation function of all the repetitions. In other words, this experiment consists of performing an all-play-all tournament with all the evaluation functions generated during the different repetitions. The presented winning percentage of a combination is still the average over the repetitions. The matches are also played using alpha-beta at a fixed depth. These percentages are shown in tables.
3.2.1 Technical Details
The parameters used are the search time per action, the batch size, the memory size, and the sampling rate (see Section 3.1). Moreover, the adaptive evaluation function used for each combination is a convolutional neural network (Krizhevsky et al., 2012) having three convolutional layers (with one exception: for the game Surakarta, there are only two convolutional layers) followed by a fully connected hidden layer. Each convolutional layer uses a fixed kernel size and number of filters. The hidden layers use the ReLU activation function (Glorot et al., 2011). The output layer contains one neuron. When the classical terminal evaluation is used, a bounded activation function is applied to the output; otherwise, there is no activation function for the output.
3.3 Comparison of Learning Data Selection Algorithms
We now compare tree learning, root learning, and terminal learning, using the protocol of Section 3.2. Each combination uses either tree learning, root learning, or terminal learning. Moreover, each combination uses either iterative deepening alpha-beta (denoted by ID) or MCTS. Furthermore, each combination uses ε-greedy as action selection method (see Section 3.1) and the classical terminal evaluation (1 if the first player wins, −1 if the first player loses, 0 in case of a draw). There are a total of six combinations. The experiment was repeated several times. The winning percentage of a combination for each game and for each evaluation step (i.e. each hour) is therefore calculated over the corresponding matches. The winning percentage curves are shown in Figure 2. The final winning percentages are shown in Table 2. In all games, except Clobber and Amazons, tree learning with MCTS and with ID have the best winning percentages. In Clobber, the percentages are very tight. In Amazons, the best percentage is for ID with tree learning and the second is MCTS with terminal learning (the latter being just higher than MCTS with tree learning). Finally, apart from Surakarta, Hex, and Outer Open Gomoku, it is tree learning with ID which obtains the best percentage. Over all games, averaging the MCTS percentage with that of ID, tree learning is better than root learning or terminal learning: on average, using tree learning (with MCTS or ID) increases the winning percentage compared to root learning or terminal learning. The remarks are the same for the learning curves, with the difference that MCTS with tree learning is slightly better than ID with tree learning in Santorini, and MCTS with terminal learning is slightly the best combination in Amazons and Clobber. In conclusion, tree learning performs much better than root learning or terminal learning, although terminal learning seems slightly better in Clobber and Amazons.
[Table 2: final winning percentages of tree learning, root learning, and terminal learning for each game, including Outer Open Gomoku and Lines of Action.]
4 Tree Search Algorithms for Game Learning
In this section, we introduce a new tree search algorithm, which we call descent, intended to be used during the learning process. It requires tree learning (combining it with root learning or terminal learning is of no interest). After presenting descent, we compare it to MCTS with root learning and with tree learning, to iterative deepening alpha-beta with root learning and with tree learning, and to UBFM with tree learning.
4.1 Descent: Generate Better Data
We now present descent. It is a modification of UBFM which builds a different, deeper game tree, and is meant to be combined with tree learning. The idea of descent is to combine UBFM with deterministic end-game simulations, which provide values that are interesting from the point of view of learning. The algorithm descent (Algorithm 8) recursively selects the best child of the current node, which becomes the new current node. It adds the children of the current node if they are not yet in the tree. It performs this recursion from the root (the current state of the game) until reaching a terminal node (an end of game). It then updates the values of the selected nodes (minimax values). The algorithm descent repeats this recursive operation, starting from the root, as long as there is search time left. Descent is almost identical to UBFM: the only difference is that descent performs an iteration until reaching a terminal state, while UBFM performs this iteration only until reaching a leaf of the tree (UBFM stops the iteration much earlier). In other words, during an iteration, UBFM just extends one of the leaves of the game tree, while descent recursively extends the best child from this leaf until reaching the end of the game. The algorithm descent has the advantage of UBFM, i.e. it performs a longer search to determine a better action to play. By learning the values of the game tree (by using, for example, tree learning), it also has the advantage of a deep minimax search, i.e. it backs the values of terminal nodes up to the other nodes more quickly. In addition, the states thus generated are closer to the terminal states, so their values are better approximations.
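A minimal sketch of descent on a toy game follows. The game, the stub evaluation, and the names are illustrative assumptions, not the paper's Algorithm 8; the point is that the selection recurses until a terminal state before backing up minimax values.

```python
# Toy game for the sketch: values are from the first player's point of view.
CHILDREN = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
TERMINAL = {"a1": 1, "a2": -1, "b1": -1, "b2": 1}   # end-of-game results

def heuristic(state):
    """Stand-in for the adaptive evaluation f_theta on non-terminal states."""
    return 0.0

def descent_iteration(state, value, maximizing=True):
    if state in TERMINAL:                        # reached an end of game
        value[state] = TERMINAL[state]
        return
    for c in CHILDREN[state]:                    # add unseen children
        value.setdefault(c, TERMINAL.get(c, heuristic(c)))
    best = (max if maximizing else min)(CHILDREN[state],
                                        key=lambda c: value[c])
    descent_iteration(best, value, not maximizing)   # recurse until terminal
    value[state] = (max if maximizing else min)(value[c]
                                                for c in CHILDREN[state])

value = {}
descent_iteration("root", value)     # each iteration reaches a terminal state
descent_iteration("root", value)
```

After two iterations, every backed-up value already reflects a true end-of-game result, which is exactly the property that makes the generated training targets better approximations.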
4.2 Comparison of Search Algorithms for Game Learning
We now compare descent with tree learning to MCTS with root learning and with tree learning, to iterative deepening alpha-beta with root learning and with tree learning, and to UBFM with tree learning, using the protocol of Section 3.2. Each combination uses one of these tree search algorithms combined with tree or root learning. There are a total of six combinations. The experiment was repeated several times. The winning percentage of a combination for each game and for each evaluation step (i.e. each hour) is therefore calculated over the corresponding matches. The winning percentage curves are shown in Figure 3. The final winning percentages are shown in Table 3. It is descent which gets the best curves on all games. For two games (Surakarta and Outer Open Gomoku), the difference with UBFM is very narrow, but the results remain better than those of the classic approaches (MCTS and alpha-beta). On each game, descent obtains a final percentage higher than that of all the other combinations (the percentage is equal to that of UBFM in the case of Santorini). On average over all games, descent has the best winning percentage, above UBFM, the second best combination, and ID with tree learning, the third best combination.
[Table 3: final winning percentages of each search algorithm with tree learning and with root learning for each game, including O. O. Gomoku and Lines of Action.]
In the previous section, in Clobber and Amazons, MCTS with terminal learning scored relatively higher percentages than on the other games, rivaling tree learning. We may then wonder whether, on these two games, MCTS with terminal learning could compete with descent or UBFM. This is not the case: the experiment of this section was carried out again for these two games, replacing MCTS (resp. ID) with root learning by MCTS (resp. ID) with terminal learning, and the result is analogous.
In conclusion, descent (with tree learning) is undoubtedly the best combination. UBFM (with tree learning) is the second best combination, sometimes very close to descent's performance and sometimes far from it, but always superior to the other combinations (slightly or largely depending on the game), except on Clobber.
5 Reinforcement Heuristic to Improve Learning Performance
In this section, we propose the technique of reinforcement heuristics, which consists in replacing the classical terminal evaluation function – the function which returns 1 if the first player wins, −1 if the second player wins, and 0 in case of a draw (Young et al., 2016; Silver et al., 2017b; Gao et al., 2018) – by another heuristic to evaluate terminal states during the learning process. With this technique, non-terminal states are evaluated differently, so the partial game trees, and thus the matches played during the learning process, are different, which can impact learning performance. We start by proposing several reinforcement heuristics. Then, we propose a complementary technique, that we call completion, which corrects state evaluation functions by taking into account the resolution of states. Finally, we compare the reinforcement heuristics that we propose to the classical terminal evaluation function.
5.1 Some Reinforcement Heuristics
We now present these reinforcement heuristics in turn.
5.1.1 Game Score
Some games have a natural reinforcement heuristic: the game score. For example, in Othello (and in Surakarta), the game score is the number of the player's pieces minus the number of pieces of his opponent (the goal of the game being to have more pieces than the opponent at the end of the game). The scoring heuristic used as a reinforcement heuristic consists of evaluating terminal states by the final score of the game. With this reinforcement heuristic, the adaptive evaluation function seeks to learn the score of states. In the context of a minimax-based algorithm, the score of a non-terminal state is the minimax value of the subtree starting from this state whose terminal leaves are evaluated by their scores. After training, the adaptive evaluation function then contains more information than just an approximation of the result of the game: it contains an approximation of the score of the game. If the game score correlates well with winning, this should improve learning performance.
In the context of the game of the Amazons, the score is the size of the territory of the winning player, i.e. the number of squares that can be reached by a piece of the winning player. This is approximately the number of empty squares.
5.1.2 Additive and Multiplicative Depth Heuristics
We now propose the following reinforcement heuristic: the depth heuristic. It consists in giving a better value to winning states close to the start of the game than to winning states far from it. Reinforcement learning with the depth heuristic thus learns the duration of matches in addition to their results. This learned information is then used to try to win as quickly as possible and to lose as late as possible. The hypothesis behind this heuristic is that a state close to the end of the game has a more precise value than a more distant state, and that the duration of the game is easy to learn. Under this assumption, trying to win as quickly as possible and to lose as late as possible takes less risk. In addition, in a long game, a player in difficulty has more opportunities to regain the upper hand. We propose two realizations of the depth heuristic: the additive depth heuristic and the multiplicative depth heuristic. The additive depth heuristic returns a positive value if the first player wins, the opposite value if the second player wins, and 0 in case of a draw; the magnitude of this value decreases additively with p, the number of actions played since the beginning of the game, relative to M, the maximum number of playable actions in a game. For the game of Hex, M is the number of empty cells on the board plus 1. For games where M is very large or difficult to compute, we can instead use a constant approximating it (close to the empirical average length of matches). The multiplicative depth heuristic is identical, except that the magnitude decreases multiplicatively with p.
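As a minimal sketch, assuming the additive variant returns a value of magnitude M − p and the multiplicative variant a value of magnitude M / p (both exact forms are illustrative assumptions, not definitions from this article):

```python
def additive_depth_value(result, p, M):
    """Additive depth heuristic (illustrative form: magnitude M - p).

    result: +1 first-player win, -1 second-player win, 0 draw.
    p: number of actions played since the start of the game.
    M: maximum number of playable actions in a game.
    """
    return result * (M - p)

def multiplicative_depth_value(result, p, M):
    """Multiplicative variant (illustrative form: magnitude M / p)."""
    return result * M / max(p, 1)
```

Either way, an earlier win (smaller p) gets a larger value and an earlier loss a more negative one, which is exactly the ordering the depth heuristic needs.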
Note that the idea of fast victories and slow defeats has already been proposed, but not used in a learning process (Cazenave et al., 2016).
5.1.3 Cumulative Mobility
The next reinforcement heuristic that we propose is cumulative mobility. It consists in favoring games in which the player has more possibilities of action and his opponent fewer. The implementation used in this article is as follows: the value of a terminal state is positive if the first player wins, negative if the second player wins, and 0 in case of a draw, with a magnitude determined by m1 and m2, where m1 is the sum of the number of available actions on each turn of the first player since the start of the game and m2 is the corresponding sum for the second player.
5.1.4 Piece Counting: Presence
Finally, we propose the presence heuristic. It takes into account the number of pieces of each player and starts from the assumption that the more pieces a player has, the greater his advantage. Among the several possible implementations of this heuristic, we use the following one in this article: the heuristic value is positive if the first player wins, negative if the second player wins, and 0 in case of a draw, with a magnitude determined by n1 and n2, where n1 is the number of pieces of the first player and n2 is the number of pieces of the second player. Note that in Surakarta and Othello, the score corresponds to a presence heuristic.
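Both count-based heuristics (cumulative mobility and presence) share the same shape. The sketch below combines the counts into a magnitude via the winner's share of the total, which is an illustrative assumption; m1, m2, n1, n2 are as defined above:

```python
def cumulative_mobility_value(result, m1, m2):
    """result: +1 first-player win, -1 second-player win, 0 draw.
    m1, m2: total available actions summed over every turn of the
    first (resp. second) player since the start of the game."""
    if result == 0:
        return 0.0
    winner, loser = (m1, m2) if result > 0 else (m2, m1)
    return result * winner / (winner + loser)  # illustrative combination

def presence_value(result, n1, n2):
    """n1, n2: final piece counts of the first / second player."""
    if result == 0:
        return 0.0
    winner, loser = (n1, n2) if result > 0 else (n2, n1)
    return result * winner / (winner + loser)  # illustrative combination
```

With this choice, a crushing win (the winner holding most of the pieces or mobility) gets a larger magnitude than a narrow one, while the sign still encodes the game result.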
5.2 Completion
Relying solely on the values of states computed from the terminal evaluation function and the adaptive evaluation function can sometimes lead to aberrant behavior. More precisely, if we only seek to maximize the value of states, we will choose to play a state s rather than another state s' if s is of greater value than s', even if s' is a winning resolved state (a state is resolved if we know the result of the game starting from this state when both players play optimally). A search algorithm can resolve a state; this happens when all the leaves of the subtree starting from this state are terminal. Choosing s rather than s', a winning resolved state, is an error (although there is perhaps, in certain circumstances, an interest in making this error from the point of view of learning) when s is not resolved, or when s is resolved and is not winning. By choosing s, the guarantee of winning is lost. The left graph of Figure 4 illustrates such a scenario.
It is therefore necessary to take into account both the value of states and their resolution. The completion technique, which we propose in this section, is one way of doing this. It consists, on the one hand, in associating with each state a resolution value r. The resolution value of a leaf state is 0 if the state is not resolved or if it is resolved as a draw, 1 if it is resolved as a winning state, and -1 if it is resolved as a losing state. The resolution value of a non-leaf state is computed as the minimax value of the partial game tree in which the leaves are evaluated by their resolution values. It consists, on the other hand, in comparing states by pairs (r, v), where v is the value of the state, using the lexicographic order, instead of comparing states by the value v alone. We then seek to maximize the pair, in particular to decide which action to play. The right graph of Figure 4 illustrates the use of completion. The use of the resolution of states also makes it possible to stop the search in resolved subtrees and thus to save computing time. The descent algorithm modified to use completion and the resolution stop is described in Algorithm 9. With completion, descent always chooses an action leading to a winning resolved state and never chooses, if possible, an action leading to a losing resolved state.
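A compact sketch of this bookkeeping, encoding resolution as +1 (winning resolved), -1 (losing resolved), and 0 (unresolved or draw), and relying on Python's built-in lexicographic comparison of tuples; the triple layout is an assumption for illustration:

```python
def backup_resolution(children_res, maximizing):
    """Minimax backup of resolution values over a node's children
    (+1 win, -1 loss, 0 unknown/draw, from the maximizing player's
    point of view -- an assumed convention)."""
    return max(children_res) if maximizing else min(children_res)

def best_action(children):
    """children: list of (action, resolution, value) triples for the
    player to move. Pairs (resolution, value) are compared
    lexicographically: resolution dominates, the value breaks ties."""
    return max(children, key=lambda c: (c[1], c[2]))[0]
```

The lexicographic order is what makes a winning resolved child preferred over any unresolved child, however high the unresolved child's learned value is.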
We also propose to use the resolution of states in action selection, to reduce the duration of games and therefore, a priori, the duration of the learning process: always play an action leading to a winning resolved state if one exists, and never play an action leading to a losing resolved state if possible. Thus, if among the available actions one is known to be winning, we play it. If there is none, we play according to the chosen action selection method among the actions not leading to a losing resolved state (if possible). We call this completed action selection.
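Completed action selection can be sketched by wrapping epsilon-greedy with the two resolution rules above; the (action, resolution, value) layout and epsilon handling are assumptions for illustration:

```python
import random

def completed_epsilon_greedy(actions, epsilon=0.1, rng=random):
    """actions: list of (action, resolution, value) with resolution in
    {+1 winning, 0 unknown/draw, -1 losing} for the player to move.
    Always play a winning resolved action; otherwise apply epsilon-greedy
    among actions not resolved as losing (falling back to all actions
    when every action is losing)."""
    winning = [a for a in actions if a[1] == 1]
    if winning:
        return winning[0][0]
    candidates = [a for a in actions if a[1] != -1] or actions
    if rng.random() < epsilon:
        return rng.choice(candidates)[0]  # exploration step
    return max(candidates, key=lambda a: a[2])[0]  # greedy step
```

The fallback to all actions matters: when every move is resolved as losing, the player must still move, so the filter is only applied "if possible", as in the text.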
5.3 Comparison of Reinforcement Heuristics
We now compare the different heuristics that we have proposed to the classical terminal evaluation function on different games, using the protocol of Section 3.2. Each combination uses descent with completion (Algorithm 9) and completed epsilon-greedy (see Algorithm 2 and Section 5.2), together with a different terminal evaluation function: the classical ("binary") evaluation function f_t, the additive depth heuristic, the multiplicative depth heuristic, the scoring heuristic, cumulative mobility, or the presence heuristic. The other parameters are the same as in Section 3.3. On some games, some heuristics are not evaluated because they are trivially of no interest or equivalent to another heuristic. The experiment was repeated several times; the winning percentage of a combination, for each game and for each evaluation step (i.e. each hour), is computed over the corresponding set of matches. The final winning percentages are shown in Table 4. On average, and in most of the games, the classic terminal heuristic has the worst percentage. In Clobber and Othello, it is the second worst; in Lines of Action, the third worst. In scoring games, scoring is the best heuristic, as one might expect. Leaving the score heuristic aside, with the exception of Surakarta, one of the two depth heuristics has the best winning percentage, and on average, using a depth heuristic instead of the classic evaluation clearly increases the winning percentage. The winning percentage curves are shown in Figure 5; the final percentages summarize the curves quite well.
Note, however, on the one hand, the clear impact, compared to the other heuristics (except score), of the additive depth heuristic on Breakthrough, Amazons, Othello, Hex, and Santorini, and of the multiplicative depth heuristic on Clobber, Hex, and Outer Open Gomoku. Note, on the other hand, that the classic heuristic is behind on all games, except on Othello, Clobber, and particularly Lines of Action. In conclusion, the use of generic reinforcement heuristics has significantly improved performance, and the depth heuristics are prime candidates as powerful generic reinforcement heuristics.
[Table 4: final winning percentages of each terminal evaluation heuristic for each game; an X marks heuristics not evaluated on a game, e.g. for Outer Open Gomoku and Lines of Action.]
6 Search Algorithms for Game Playing
In this section, we propose another variant of UBFM, dedicated to competition play. Then, we compare it with other tree search algorithms.
6.1 Unbound Best-First Minimax with Safe Decision
We thus propose a modification of UBFM, which aims to provide safer play. The action UBFM chooses to play is the one that leads to the state of best value. In some cases, the (a priori) best action can lead to a state that has not been sufficiently visited (such as a non-terminal leaf); choosing this action is therefore a risky decision. To avoid this problem, we propose a different decision that aims to play the safest action, in the same way as MCTS (max child selection (Browne et al., 2012)). If no action leads to a winning resolved state, the action chosen by the safe variant is the one that has been selected the most times (from the current state of the game) during the exploration of the game tree. In case of a tie, it decides by choosing the action that leads to the state of best value. This decision is safer because the number of times an action has been selected is the number of times this action appeared more interesting than the others.
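A sketch of this decision rule, with per-child statistics laid out as (action, selection count, value, resolution) tuples (the layout is an assumption for illustration):

```python
def safe_decision(children):
    """children: list of (action, selection_count, value, resolution)
    for the children of the current state. Play a winning resolved
    action if one exists; otherwise the most-selected action, breaking
    ties by the best value."""
    winning = [c for c in children if c[3] == 1]
    if winning:
        return winning[0][0]  # a winning resolved action is always played
    # Tuple comparison: selection count first, then value as tie-breaker.
    return max(children, key=lambda c: (c[1], c[2]))[0]
```

On the example of the next paragraph, a child of higher value but lower selection count loses to a child selected more often, which is exactly where the safe variant and plain UBFM disagree.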
For example, suppose the current player has the choice between two actions a and b: action a leads to a state of better value, but action b was selected more times (from the current state and from the beginning of the game). UBFM chooses action a, while the safe variant chooses action b.
6.2 Comparison of Search Algorithms for Game Playing
We now compare the winning percentages of different game-playing algorithms, using the learned evaluation functions of Section 4.2 (those obtained at the end of the learning process). We compare UBFM, its safe variant, and iterative deepening alpha-beta with move ordering (each of these algorithms uses completion). For each game, each combination confronts opponents playing minimax at a fixed depth with one of the final evaluation functions of Section 4.2, over each repetition of the learning process and with a fixed search time per action. The winning percentage of a search algorithm is the average of its winning percentages over these evaluation functions and repetitions. The winning percentages are described in Table 5. On all games except Clobber, the safe variant gets the best winning percentage (on two games, Hex and Outer Open Gomoku, it is tied for the best). On Clobber, another algorithm obtains the best percentage, but only by a narrow margin. On average, across all games, the safe variant is better than both UBFM and iterative deepening alpha-beta.
A variation of this experiment was then performed: for each game, each combination confronts all the others, but the evaluation functions used are restricted to those generated by the descent learning algorithm, with a different search time per action. The corresponding winning percentages are described in Table 6. In all games except Clobber and Santorini, it is again the safe variant which obtains the best winning percentage; on Clobber and Santorini, another algorithm obtains the best percentage. On average across all games, the safe variant remains better than both UBFM and iterative deepening alpha-beta. In conclusion, in the context of these experiments, the safe variant of UBFM is the best search algorithm.
[Table 5: winning percentages of each search algorithm for each game (Outer Open Gomoku, Clobber, Breakthrough, Santorini, Hex, Lines of Action, Othello, Amazons, Surakarta) and their mean.]
[Table 6: winning percentages of each search algorithm for each game and their mean, for the variant experiment.]
7 Ordinal Distribution and Application to Hex
In this section, we propose a final technique, a new action selection distribution, and we apply it together with all the previous techniques to design program-players for the game of Hex.
7.1 Ordinal Action Distribution
We propose an alternative probability distribution (see Section 2.3), which we call the ordinal distribution. This distribution does not depend on the values of the states themselves, only on the order of these values: the probability of playing the action leading to the i-th best child of the root depends only on its rank i, on the number n of children of the root, and on the exploitation parameter. Algorithm 12 describes the action selection method resulting from the use of the ordinal distribution, with an optimized calculation.
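Since the probabilities depend only on ranks, one plausible instantiation is a geometric decay in rank; the decay law below is an assumption for illustration, not the article's formula, with epsilon in (0, 1) as the exploitation parameter:

```python
import random

def ordinal_distribution(n, epsilon):
    """Probability of playing the i-th best child (i = 1..n), depending
    only on rank, not on the state values themselves. Geometric decay
    in rank is an illustrative choice."""
    probs = [epsilon * (1 - epsilon) ** (i - 1) for i in range(1, n + 1)]
    probs[-1] /= epsilon  # give the leftover mass to the last rank: sums to 1
    return probs

def sample_ordinal(children_sorted, epsilon, rng=random):
    """Sample a child from a list sorted from best to worst value."""
    probs = ordinal_distribution(len(children_sorted), epsilon)
    return rng.choices(children_sorted, weights=probs, k=1)[0]
```

A larger epsilon concentrates the mass on the best-ranked children (more exploitation); the distribution is unchanged by any rescaling of the state values, which is the defining property of an ordinal distribution.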
In an experiment using the common protocol, not presented in this article, the ordinal distribution was mostly better than the softmax distribution, but worse than epsilon-greedy. However, during long learning processes at Hex (similar to the experiments of the following sections), the ordinal distribution performed best.
7.2 A Long Training for Hex
We now apply all the techniques that we have proposed to carry out a long self-play reinforcement learning run on Hex size 11. More precisely, we use completed descent (Algorithm 9) with tree learning (Algorithm 3), the completed ordinal distribution (see Section 5.2 and Algorithm 12), and the additive depth heuristic (see Section