Learning to Play Two-Player Perfect-Information Games without Knowledge

08/03/2020
by   Quentin Cohen-Solal, et al.

In this paper, several techniques for learning game state evaluation functions by reinforcement are proposed. The first is a generalization of tree bootstrapping (tree learning): it is adapted to the context of reinforcement learning without knowledge based on non-linear functions. With this technique, no information is lost during the reinforcement learning process. The second is a modification of minimax with unbounded depth extending the best sequences of actions to the terminal states. This modified search is intended to be used during the learning process. The third is to replace the classic gain of a game (+1 / -1) with a reinforcement heuristic. We study particular reinforcement heuristics such as: quick wins and slow defeats; scoring; mobility; and presence. The fourth is another variant of unbounded minimax, which plays the safest action instead of playing the best action. This modified search is intended to be used after the learning process. The fifth is a new action selection distribution. The conducted experiments suggest that these techniques improve the level of play. Finally, we apply these different techniques to design program-players for the game of Hex (size 11 and 13) surpassing the level of Mohex 2.0 with reinforcement learning from self-play without knowledge. At Hex size 11 (without swap), the program-player reaches the level of Mohex 3HNN.


1 Introduction

One of the most difficult tasks in artificial intelligence is the sequential decision-making problem (Littman, 1996), whose applications include robotics and games. As for games, the successes are numerous: machines surpass humans at several games, such as backgammon, checkers, chess, and Go (Silver et al., 2017b). A major class of games is two-player games with perfect information, that is to say, games in which players play in turn, without any chance or hidden information. There are still many challenges for these games. For example, for the game of Hex, computers have only been able to beat strong humans since 2020 (Cazenave et al., 2020). In general game playing (Genesereth et al., 2005), even restricted to games with perfect information, humans remain superior to machines on an unknown game (when human and machine both have a relatively short learning time to master the rules of the game). In this article, we focus on two-player zero-sum games with perfect information, although most of the contributions in this article should be applicable or easily adaptable to a more general framework.

The first approaches used to design a game-playing program are based on a game tree search algorithm, such as minimax, combined with a handcrafted game state evaluation function based on expert knowledge. A notable use of this technique is the Deep Blue chess program (Campbell et al., 2002). However, the success of Deep Blue is largely due to the raw power of the computer, which could analyze two hundred million game states per second. In addition, this approach is limited by having to design the evaluation function manually (at least partially). This design is a very complex task, which must, moreover, be carried out for each different game. Several works have thus focused on the automatic learning of evaluation functions (Mandziuk, 2010). One of the first successes of learning evaluation functions was on the game of backgammon (Tesauro, 1995). However, for many games, such as Hex or Go, minimax-based approaches, with or without machine learning, have failed to surpass humans. Two causes have been identified (Baier and Winands, 2018). Firstly, the very large number of possible actions at each game state prevents an exhaustive search at a significant depth (the game can only be anticipated a few turns in advance). Secondly, for these games, no sufficiently powerful evaluation function could be identified. An alternative approach addressing these two problems has been proposed, giving notably good results for Hex and Go: Monte Carlo Tree Search, denoted MCTS (Coulom, 2006; Browne et al., 2012). This algorithm explores the game tree non-uniformly, which is a solution to the problem of the very large number of actions. In addition, it evaluates game states from victory statistics over a large number of random end-game simulations, and therefore does not need an evaluation function. This was not enough, however, to go beyond the level of human players. Several variants of Monte Carlo tree search were then proposed, using in particular knowledge to guide the exploration of the game tree and/or the random end-game simulations (Browne et al., 2012). Recent improvements of Monte Carlo tree search have focused on automatically learning this knowledge and on its use. This knowledge was first generated by supervised learning (Clark and Storkey, 2015; Gao et al., 2017, 2018; Cazenave, 2018; Tian and Zhu, 2015), then by supervised learning followed by reinforcement learning (Silver et al., 2016), and finally by reinforcement learning alone (Silver et al., 2017b; Anthony et al., 2017; Silver et al., 2018). This allowed programs to reach and surpass the level of the world champion at the game of Go with the latest versions of the program AlphaGo (Silver et al., 2016, 2017b). In particular, AlphaGo Zero (Silver et al., 2017b), which only uses reinforcement learning, did not need any knowledge to reach its level of play. This last success, however, required 29 million games. This approach has also been applied to chess (Silver et al., 2017a). The resulting program beat the best chess program (which is based on minimax).

It is therefore natural to ask whether minimax is totally outdated or whether the spectacular successes of recent programs owe more to reinforcement learning than to Monte Carlo tree search. In particular, it is interesting to ask whether reinforcement learning would enhance minimax enough to make it competitive with Monte Carlo tree search on the games where MCTS has so far dominated minimax, such as Go or Hex.

In this article, we therefore focus on reinforcement learning within the minimax framework. We propose and assess new techniques for reinforcement learning of evaluation functions. Then, we apply them to design new program-players for the game of Hex (without using any knowledge other than the rules of the game). We compare this program-player to Mohex 2.0 (Huang et al., 2013), the champion at Hex (size 11 and 13) of the Computer Olympiad from 2013 to 2017 (Hayward and Weninger, 2017), which is also the strongest player program publicly available.

In the next section, we briefly present game algorithms and in particular minimax with unbounded depth, on which we base several of our experiments. We also present reinforcement learning in games, the game of Hex, and the state of the art of game programs on this game. In the following sections, we propose different techniques aimed at improving learning performance and we present the experiments carried out using these techniques. In particular, in Section 3, we extend the tree bootstrapping (tree learning) technique to the context of reinforcement learning without knowledge based on non-linear functions. In Section 4, we present a new search algorithm, a variant of unbounded minimax called descent, intended to be used during the learning process. In Section 5, we introduce reinforcement heuristics. Their usage is a simple way to use general or dedicated knowledge in reinforcement learning processes. We study several reinforcement heuristics in the context of different games. In Section 6, we propose another variant of unbounded minimax, which plays the safest action instead of playing the best action. This modified search is intended to be used after the learning process. In Section 7, we introduce a new action selection distribution and we apply it with all the previous techniques to design program-players for the game of Hex (size 11 and 13). Finally, in the last section, we conclude and discuss the different research perspectives.

2 Background and Related Work

In this section, we briefly present game tree search algorithms, reinforcement learning in the context of games and their applications to Hex (for more details about game algorithms, see (Yannakakis and Togelius, 2018)).

Games can be represented by their game tree (a node corresponds to a game state and the children of a node are the states that can be reached by an action). From this representation, we can determine the action to play using a game tree search algorithm. In order to win, each player tries to maximize his score (i.e. the value of the game state for this player at the end of the game). As we place ourselves in the context of two-player zero-sum games, to maximize the score of a player is to minimize the score of his opponent (the score of a player is the negation of the score of his opponent).

2.1 Game Tree Search Algorithms

The central algorithm is minimax, which recursively determines the value of a node from the values of its children and the functions min and max, up to a limit recursion depth. With this algorithm, the game tree is uniformly explored. A better implementation of minimax uses alpha-beta pruning (Knuth and Moore, 1975; Yannakakis and Togelius, 2018), which makes it possible not to explore the sections of the game tree that are less interesting given the values of the nodes already encountered and the properties of min and max. Many variants and improvements of minimax have been proposed (Millington and Funge, 2009). For instance, iterative deepening (Slate and Atkin, 1983; Korf, 1985) allows one to use minimax with a time limit. It sequentially performs alpha-beta searches of increasing depth as long as there is time left. It is generally combined with the move ordering technique (Fink, 1982), which consists of exploring first the best move of the previous search, which accelerates the new search. Some variants perform a search with unbounded depth (that is, the depth of their search is not fixed) (Van Den Herik and Winands, 2008; Schaeffer, 1990; Berliner, 1981). Unlike minimax with or without alpha-beta pruning, the exploration of these algorithms is non-uniform. One of these algorithms is the best-first minimax search (Korf and Chickering, 1996). To avoid any confusion with some best-first approaches at fixed depth, we call this algorithm Unbounded Best-First Minimax, or more succinctly UBFM. UBFM iteratively extends the game tree by adding the children of one of the leaves of the game tree having the same value as the root (the minimax value). These leaves are the states obtained after having played one of the best sequences of actions possible given the current partial knowledge of the game tree. Thus, this algorithm iteratively extends the a priori best sequences of actions. These best sequences usually change at each extension. Thus, the game tree is non-uniformly explored by focusing on the a priori most interesting actions without exploring just one sequence of actions. In this article, we use the anytime version of UBFM (Korf and Chickering, 1996), i.e. we leave a fixed search time for UBFM to decide the action to play. We also use transposition tables (Greenblatt et al., 1988; Millington and Funge, 2009) with UBFM, which makes it possible not to explicitly build the game tree and to merge the nodes corresponding to the same state. Algorithm 1 is the implementation of UBFM used in this paper. (This implementation is a slight variant of the Korf and Chickering algorithm. Their algorithm is very slightly more efficient, but it offers less freedom: our algorithm behaves slightly differently depending on how ties between two states having the same value are broken. The exploration of states is identical between their algorithm and ours when, with our algorithm, ties are broken deepest-first. Our variant was discovered independently of Korf and Chickering's work.)

Function UBFM_iteration(s)
       if terminal(s) then
             return f(s)
       else
             if s ∉ T then
                   T ← T ∪ {s}
                   foreach a ∈ actions(s) do
                         v(s, a) ← f(a(s))
             else
                   a_b ← best_action(s)
                   v(s, a_b) ← UBFM_iteration(a_b(s))
             a_b ← best_action(s)
             return v(s, a_b)
Function best_action(s)
       if first_player(s) then
             return argmax_{a ∈ actions(s)} v(s, a)
       else
             return argmin_{a ∈ actions(s)} v(s, a)
Function UBFM(s, τ)
       t ← time()
       while time() − t < τ do UBFM_iteration(s)
       return best_action(s)
Algorithm 1 UBFM (Unbounded Best-First Minimax) algorithm: it computes the best action to play in the generated non-uniform partial game tree (a(s): state obtained after playing the action a in the state s; v(s, a): value obtained after playing a in s; f: the used evaluation function (first-player point of view), which returns the exact game value for terminal states; T: keys of the transposition table (global variable); τ: search time per action).

2.2 Learning of Evaluation Functions

Reinforcement learning of evaluation functions can be done by different techniques (Mandziuk, 2010; Silver et al., 2017b; Anthony et al., 2017; Young et al., 2016). The general idea of reinforcement learning of state evaluation functions is to use a game tree search algorithm and an adaptive evaluation function f_θ, of parameter θ (for example a neural network), to play a sequence of games (for example against oneself, which is the case in this article). Each game generates pairs (s, v) where s is a state and v the value of s calculated by the chosen search algorithm using the evaluation function f_θ. The states generated during one game can be the states of the sequence of states of the game (Tesauro, 1995; Veness et al., 2009). For example, in the case of root bootstrapping (a technique that we call root learning in this article), the set of pairs used during the learning phase consists of the pairs (s, v) for the states s of the sequence of the game. In the case of the tree bootstrapping (tree learning) technique (Veness et al., 2009), the generated states are the states of the game tree built to decide which actions to play (which includes the states of the sequence of states of the game): the learned pairs are the pairs (s, v) for the states s of the partial game tree of the game. Thus, contrary to root bootstrapping, tree bootstrapping does not discard most of the information used to decide the actions to play. The values of the generated states can be their minimax values in the partial game tree built to decide which actions to play (Veness et al., 2009; Tesauro, 1995). Work on tree bootstrapping has been limited to reinforcement learning of linear functions of state features. It has not been formulated or studied in the context of reinforcement learning without knowledge and based on non-linear functions. Note that, in the case of AlphaGo Zero, the value of each generated state (the states of the sequence of the game) is the value of the terminal state of the game (Silver et al., 2017b). We call this technique terminal learning.
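As an illustration, the following minimal Python sketch shows how the training pairs could be assembled under root and tree bootstrapping, assuming the search produces a dictionary mapping each state of the partial game tree to its minimax value (names and data structures are hypothetical, not taken from the paper):

def root_bootstrap_pairs(game_sequence, tree_values):
    """Root bootstrapping: keep only the states of the played game sequence,
    each labeled with its minimax value from the search tree."""
    return [(s, tree_values[s]) for s in game_sequence]


def tree_bootstrap_pairs(tree_values):
    """Tree bootstrapping: keep every state of the partial game tree built
    during the searches, with its minimax value; nothing is discarded."""
    return list(tree_values.items())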

Generally, between two consecutive games (between match phases), a learning phase occurs, using the pairs of the last game. Each learning phase consists in modifying the parameter θ so that, for each pair (s, v), f_θ(s) sufficiently approaches v to constitute a good approximation. Note that, in the context of a variant, learning phases can use the pairs of several games. This technique is called experience replay (Mnih et al., 2015). Note also that adaptive evaluation functions only serve to evaluate non-terminal states, since we know the true value of terminal states.

2.3 Action Selection Distribution

One of the problems related to reinforcement learning is the exploration-exploitation dilemma (Mandziuk, 2010). It consists of choosing between exploring new states to learn new knowledge and exploiting the acquired knowledge. Many techniques have been proposed to deal with this dilemma (Mellor, 2014). However, most of these techniques do not scale because their application requires memorizing all the encountered states. For this reason, in the context of games with large numbers of states, some approaches use probabilistic exploration (Young et al., 2016; Silver et al., 2017b; Mandziuk, 2010; Schraudolph et al., 2001). With this approach, to exploit is to play the best action and to explore is to play uniformly at random. More precisely, a parametric probability distribution is used to associate with each action its probability of being played. The parameter associated with the distribution corresponds to the exploration rate (between 0 and 1), which we denote ε (the exploitation rate is therefore 1 − ε). The rate ε is often fixed experimentally. Simulated annealing (Kirkpatrick et al., 1983) can, however, be applied to avoid choosing a value for this parameter. In this case, at the beginning of reinforcement learning, the parameter is 1 (we are just exploring). It gradually decreases until reaching 0 at the end of learning. The simplest action selection distribution is ε-greedy (Young et al., 2016) (of parameter ε). With this distribution, an action is chosen uniformly at random with probability ε and the best action is chosen with probability 1 − ε (see also Algorithm 2).

Function ε_greedy(s, t)
       if probability t / t_total then
             if first_player(s) then
                   return argmax_{a ∈ actions(s)} v(a(s))
             else
                   return argmin_{a ∈ actions(s)} v(a(s))
       else
             return a ∈ actions(s) uniformly chosen
Algorithm 2 ε-greedy algorithm with simulated annealing used in the experiments of this article (t: time elapsed since the start of the reinforcement learning process; t_total: chosen total duration of the learning process; v(a(s)): value of the state obtained after playing the action a in the state s).

The ε-greedy distribution has the disadvantage of not differentiating the actions (except the best action) in terms of probabilities. Another distribution is often used, correcting this disadvantage: the softmax distribution (Schraudolph et al., 2001; Mandziuk, 2010). It is defined by p_i = e^{v_i / τ} / Σ_{j=1..n} e^{v_j / τ}, with n the number of children of the current state s, p_i the probability of playing the action a_i, v_i the value of the state obtained after playing a_i in s, and τ a parameter called temperature (τ → 0: exploitation, τ → +∞: exploration).
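As an illustration, here is a minimal Python sketch of softmax action selection with a temperature parameter (the function and variable names are hypothetical):

import math
import random

def softmax_select(action_values, temperature):
    """Sample an action with probability proportional to exp(value / temperature).

    action_values: dict mapping each action to the value of the state it leads to
    (from the current player's point of view). A low temperature favors
    exploitation, a high temperature favors exploration."""
    actions = list(action_values)
    m = max(action_values[a] / temperature for a in actions)  # for numerical stability
    weights = [math.exp(action_values[a] / temperature - m) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]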

2.4 Game of Hex

The game of Hex (Browne, 2000) is a two-player combinatorial strategy game. It is played on an empty n × n board of hexagonal cells; we say that such a board is of size n. The board can be of any size, although the classic sizes are 11, 13, and 19. In turn, each player places a stone of his color on an empty cell (each stone is identical). The goal of the game is to be the first to connect the two opposite sides of the board corresponding to one's color. Figure 1 illustrates an end game. Although these rules are simple, Hex tactics and strategies are complex. The number of states and the number of actions per state are very large, similar to the game of Go. From a small board size onward, the number of states is, for example, higher than that of chess (Van Den Herik et al., 2002). For any board size, the first player has a winning strategy (Berlekamp et al., 2003), but this strategy is unknown except for small board sizes (the game is weakly solved up to these sizes) (Pawlewicz and Hayward, 2013). In fact, resolving a particular state is PSPACE-complete (Reisch, 1981; Bonnet et al., 2016). There is a variant of Hex using a swap rule. With this variant, the second player can play as first action a special action, called swap, which swaps the color of the two players (i.e. they swap their pieces and their sides). This rule prevents the first move from being too advantageous.

Figure 1: A Hex end game of size 11 (white wins)

2.5 Hex Programs

Many Hex player programs have been developed. For example, Mohex 1.0 (Huang et al., 2013) is a program based on Monte Carlo tree search. It also uses many techniques dedicated to Hex, based on specific theoretical results. In particular, it is able to quickly determine a winning strategy for some states (without expanding the search tree) and to prune at each state many actions that it knows to be inferior. It also uses ad hoc knowledge to bias simulations of Monte Carlo tree search.

Mohex 2.0 (Huang et al., 2013) is an improvement of Mohex 1.0 that uses learned knowledge through supervised learning (namely correlations between victory and board patterns) to guide both tree exploration and simulations.

Other work then focused on predicting best actions, through supervised learning of a database of games, using a neural network (Michalski et al., 2013; LeCun et al., 2015; Goodfellow et al., 2016). The neural network is used to learn a policy, i.e. a prior probability distribution on the actions to play. These prior probabilities are used to guide the exploration of Monte Carlo tree search. First, there is Mohex-CNN (Gao et al., 2017), which is an improvement of Mohex 2.0 using a convolutional neural network (Krizhevsky et al., 2012). A new version of Mohex was then proposed: Mohex-3HNN (Gao et al., 2018). Unlike Mohex-CNN, it is based on a residual neural network (He et al., 2016). It calculates, in addition to the policy, a value for states and actions. The value of states replaces the evaluation of states based on simulations of Monte Carlo tree search. Adding a value to actions allows Mohex-3HNN to reduce the number of calls of the neural network, improving performance. Mohex-3HNN is the best Hex program. It won the Hex size 11 and 13 tournaments at the 2018 Computer Olympiad (Gao et al., 2019).

Programs which learn the evaluation function by reinforcement have also been designed. These programs are NeuroHex (Young et al., 2016), EZO-CNN (Takada et al., 2017), DeepEzo (Takada et al., 2019) and ExIt (Anthony et al., 2017). They learn from self-play. Unlike the other three programs, NeuroHex performs supervised learning (of a common Hex heuristic) followed by reinforcement learning. NeuroHex also starts its games with a state from a database of games. EZO-CNN and DeepEzo use knowledge to learn winning strategies in some states. DeepEzo also uses knowledge during confrontations. ExIt learns a policy in addition to the value of states and it is based on MCTS. It is the only program to have learned to play Hex without using knowledge. This result is, however, limited to board size 9. A comparison of the main characteristics of these different programs is presented in Table 1.

Programs Size Search Learning Network Use
Mohex-CNN 13 MCTS supervised convolutional policy
Mohex-3HNN 13 MCTS supervised residual policy, state, action
NeuroHex 13 none supervised, reinforcement convolutional state
EZO-CNN 7, 9, 11 Minimax reinforcement convolutional state
DeepEZO 13 Minimax reinforcement convolutional policy, state
ExIt 9 MCTS reinforcement convolutional policy, state
Table 1: Comparison of the main features of the latest Hex programs. These characteristics are respectively the board sizes on which learning is based, the used tree search algorithm, the type of learning, the type of neural network, and its use (to approximate the values of states, actions and/or policy).

3 Data Use in Game Learning

In this section, we adapt and study tree learning (see Section 2.2) in the context of reinforcement learning and the use of non-linear adaptive evaluation functions. For this, we compare it to root learning and terminal learning in this context. We start by adapting tree learning, root learning, and terminal learning. Next, we describe the experiment protocol common to several sections of this article. Finally, we present the comparison of tree learning with root learning and terminal learning.

3.1 Tree Learning

As we saw in Section 2.2, tree learning consists in learning the values of the states of the partial game tree obtained at the end of the game. Root learning consists in learning the values of the states of the sequence of states of the game (the value of each state is its value in the search tree). Terminal learning consists in learning the values of the states of the sequence of states of the game, but the value of each state is the value of the terminal state of the game (i.e. the gain of the game). The data to learn after each game can be modified by some optional data processing methods, such as experience replay (see Section 2.2). The learning phase uses a particular update method so that the adaptive evaluation function fits the chosen data. The adaptations of tree learning, root learning, and terminal learning are given respectively in Algorithm 3, Algorithm 4, and Algorithm 5. In this article, we use experience replay as data processing method (see Algorithm 6; its parameters are the memory size and the sampling rate). In addition, we use stochastic gradient descent as update method (see Algorithm 7; its parameter is the batch size). Formally, in Algorithm 3, Algorithm 4, and Algorithm 5, the processing procedure is the experience replay of Algorithm 6 and the update procedure is the stochastic gradient descent of Algorithm 7. Finally, we use ε-greedy as default action selection method, applied to the children values stored in the transposition table (see Algorithm 2).

Function tree_learning(δ)
       t ← time()
       while time() − t < δ do
             s ← initial_game_state()
             S ← ∅
             while ¬terminal(s) do
                   search(s, S, f_θ, f_t, T)
                   a ← action_selection(s, S, T)
                   s ← a(s)
             D ← processing({(s′, v(s′)) : s′ ∈ S})
             update(f_θ, D)
Algorithm 3 Tree learning (tree bootstrapping) algorithm (δ: learning duration; a(s): state obtained after playing the action a in the state s; v(s): value of state s in the game tree according to the last tree search; S: index of the transposition table (set of states which are non-leaves or terminal); f_θ: adaptive evaluation function used for evaluating the non-terminal leaves of the game tree; f_t: evaluation of terminal states (e.g. the gain of the game); T: transposition table (contains the function v and other functions depending on the used search algorithm); search(s, S, f_θ, f_t, T): a search algorithm (it extends the game tree from s, by adding new states in S and labeling its states, in particular, by a value v(s), stored in T, using f_θ as evaluation of the non-terminal leaves and f_t as evaluation of terminal states); action_selection(s, S, T): decides the action to play in the state s depending on the current game tree (i.e. depending on S and T); processing(D): various optional data processing such as data augmentation (adding symmetrical states, …), experience replay, …; update(f_θ, D): updates the parameter θ of f_θ in order for f_θ(s) to be closer to v for each pair (s, v) of D).
Function root_learning(δ)
       t ← time()
       while time() − t < δ do
             s ← initial_game_state()
             S ← ∅ ; G ← {s}
             while ¬terminal(s) do
                   search(s, S, f_θ, f_t, T)
                   a ← action_selection(s, S, T)
                   s ← a(s)
                   G ← G ∪ {s}
             D ← processing({(s′, v(s′)) : s′ ∈ G})
             update(f_θ, D)
Algorithm 4 Root learning (root bootstrapping) algorithm (same notation as Algorithm 3; G is the set of states of the sequence of states of the game: only these states, labeled with their search values, are learned).
Function terminal_learning(δ)
       t ← time()
       while time() − t < δ do
             s ← initial_game_state()
             S ← ∅ ; G ← {s}
             while ¬terminal(s) do
                   search(s, S, f_θ, f_t, T)
                   a ← action_selection(s, S, T)
                   s ← a(s)
                   G ← G ∪ {s}
             D ← processing({(s′, f_t(s)) : s′ ∈ G})
             update(f_θ, D)
Algorithm 5 Terminal learning algorithm (same notation as Algorithm 3; G is the set of states of the sequence of states of the game; each of these states is labeled with the value f_t(s) of the terminal state s of the game).
Function experience_replay(D, μ, σ)
       add the elements of D in M
       if |M| > μ then
             remove the oldest items of M to have |M| = μ
       if |M| ≤ σ · |D| then
             return M
       return a list of random items of M whose size is σ · |D|
Algorithm 6 Experience replay (replay buffer) algorithm used in the experiments of this article. μ is the memory size and σ is the sampling rate. M is the memory buffer (global variable initialized by an empty queue). If the number of data in memory is less than the number of elements to sample, then it returns all the data (no sampling). Otherwise, it returns the sampled random elements.
Function stochastic_gradient_descent(f_θ, D, B)
       split D in disjoint sets D_1, …, D_k such that D_1 ∪ … ∪ D_k = D and |D_i| = B for each i
       foreach D_i do
             minimize the error between f_θ(s) and v over the pairs (s, v) ∈ D_i by using Adam and regularization
Algorithm 7 Stochastic gradient descent algorithm used in the experiments of this article. It is based on Adam optimization (one epoch per update) (Kingma and Ba, 2014) and regularization (Ng, 2004), and is implemented with TensorFlow. B is the batch size.

3.2 Common Experiment Protocol

The experiments of several sections share the same protocol, which is presented in this section. The protocol is used to compare different variants of reinforcement learning algorithms. A variant corresponds to a certain combination of elementary algorithms. More specifically, a combination consists of the association of a search algorithm (iterative deepening alpha-beta (with move ordering), MCTS (UCT with a fixed exploration constant), UBFM, …), of an action selection method (ε-greedy distribution (used by default), softmax distribution, …), of a terminal evaluation function (the classic game gain (used by default), …), and of a procedure for selecting the data to be learned (root learning, tree learning, or terminal learning). The protocol consists in carrying out a reinforcement learning process of the same fixed number of hours for each variant. At several stages of the learning process, matches are performed using the adaptive evaluation functions obtained by the different variants. Each variant is then characterized by a winning percentage at each stage of the reinforcement learning process. More formally, each combination is evaluated every hour by a winning percentage: the winning percentage of a combination at a given hour is computed from matches against each combination taken at the final time of its learning process (there is one match as first player and another as second player per pair of combinations). The matches are played by using alpha-beta at a fixed depth with the learned evaluation functions.

This protocol is repeated several times for each experiment in order to reduce the statistical noise in the winning percentages obtained for each variant (the obtained percentage is the average of the percentages of repetitions). The winning percentages are then represented in a graph showing the evolution of the winning percentages during training.

In addition to the curves, the different variants are also compared with respect to their final winning percentage, i.e. at the end of the learning process. Unlike the experiment on the evolution of winning percentages, in the comparison of the different variants at the final stage, each evaluation confronts every other evaluation of all the repetitions. In other words, this experiment consists of performing an all-play-all tournament with all the evaluation functions generated during the different repetitions. The presented winning percentage of a combination is still the average over the repetitions. The matches are also played by using alpha-beta at the same fixed depth. These percentages are shown in tables.

3.2.1 Technical Details

The parameters used are the search time per action, the batch size, the memory size, and the sampling rate (see Section 3.1); they are the same for all combinations. Moreover, the adaptive evaluation function used for each combination is a convolutional neural network (Krizhevsky et al., 2012) having three convolutional layers (with one exception: for the game Surakarta, there are only two convolutional layers) followed by a fully connected hidden layer. For each convolutional layer, the kernel size and the number of filters are fixed, as is the number of neurons in the fully connected layer. The padding of each layer is zero. After each layer except the last one, the ReLU activation function (Glorot et al., 2011) is used. The output layer contains one neuron. When the classic terminal evaluation is used, a bounded activation function is applied to the output. Otherwise, there is no activation function for the output.
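As an illustration, here is a minimal TensorFlow/Keras sketch of a network with this overall shape; the board size, number of input planes, kernel size, filter count, hidden units and the bounded output activation (tanh) are hypothetical placeholders, not the values used in the paper:

import tensorflow as tf

def build_value_network(board_size=11, planes=2, filters=64, fc_units=128,
                        bounded_output=True):
    """Sketch of a convolutional value network in the spirit of Section 3.2.1.

    The numbers of filters, hidden units, input planes and the kernel size are
    hypothetical placeholders; the section only specifies three convolutional
    layers, one fully connected hidden layer, ReLU activations, and a single
    output neuron, bounded only with the classic terminal evaluation."""
    inputs = tf.keras.Input(shape=(board_size, board_size, planes))
    x = inputs
    for _ in range(3):  # three convolutional layers (padding amount assumed to be zero)
        x = tf.keras.layers.Conv2D(filters, kernel_size=3, padding="valid",
                                   activation="relu")(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(fc_units, activation="relu")(x)
    out = tf.keras.layers.Dense(1, activation="tanh" if bounded_output else None)(x)
    return tf.keras.Model(inputs, out)

model = build_value_network()
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")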

3.3 Comparison of Learning Data Selection Algorithms

We now compare tree learning, root learning, and terminal learning, using the protocol of Section 3.2. Each combination uses either tree learning, root learning, or terminal learning. Moreover, each combination uses either iterative deepening alpha-beta (denoted by ID) or MCTS. Furthermore, each combination uses ε-greedy as action selection method (see Section 3.1) and the classic terminal evaluation (1 if the first player wins, −1 if the first player loses, 0 in case of a draw). There is a total of 6 combinations. The experiment was repeated several times, and the winning percentage of a combination for each game and for each evaluation step (i.e. each hour) is calculated over all the corresponding matches. The winning percentage curves are shown in Figure 2. The final winning percentages are shown in Table 2. In all games, except Clobber and Amazons, tree learning with MCTS and with ID have the best winning percentages. In Clobber, the percentages are very tight. In Amazons, the best percentage is for ID with tree learning and the second is MCTS with terminal learning (the latter being just higher than MCTS with tree learning). Finally, apart from Surakarta, Hex, and Outer Open Gomoku, it is tree learning with ID which obtains the best percentage. On all games, by averaging the MCTS percentage with that of ID, tree learning is better than root learning or terminal learning. On average, using tree learning (with MCTS or ID) clearly increases the winning percentage compared to root learning or terminal learning. The remarks are the same for the learning curves, with the difference that MCTS with tree learning is slightly better than ID with tree learning in Santorini, and MCTS with terminal learning is slightly but more clearly the best combination in Amazons and Clobber. In conclusion, tree learning performs much better than root learning or terminal learning, although terminal learning seems slightly better in Clobber and Amazons.

Figure 2: Evolution of the winning percentages of the combinations of the experiment of Section 3.3, i.e. MCTS (dotted line) or iterative deepening alpha-beta (continuous line) with tree learning (blue line), root learning (red line), or terminal learning (green line). The display uses a simple moving average of 6 data points.
tree learning root learning terminal learning
ID MCTS ID MCTS ID MCTS
Surakarta
Othello
Hex
Outer Open Gomoku
Clobber
Breakthrough
Amazons
Santorini
Lines of Action
mean
Table 2: Final winning percentages of the combinations of the experiment of Section 3.3 (ID: iterative deepening alpha-beta)

4 Tree Search Algorithms for Game Learning

In this section, we introduce a new tree search algorithm, that we call descent, intended to be used during the learning process. It requires tree learning (combining it with root learning or terminal learning is of no interest). After presenting descent, we compare it to MCTS with root learning and with tree learning, to iterative deepening alpha-beta with root learning and with tree learning, and to UBFM with tree learning.

4.1 Descent: Generate Better Data

Function descent_iteration(s, f_θ, f_t, S, T)
       if terminal(s) then
             v(s) ← f_t(s)
       else
             if s ∉ S then
                   S ← S ∪ {s}
                   foreach a ∈ actions(s) do
                         if terminal(a(s)) then
                               v(s, a) ← f_t(a(s))
                         else
                               v(s, a) ← f_θ(a(s))
             a_b ← best_action(s)
             v(s, a_b) ← descent_iteration(a_b(s), f_θ, f_t, S, T)
             a_b ← best_action(s)
             v(s) ← v(s, a_b)
       return v(s)
Function best_action(s)
       if first_player(s) then
             return argmax_{a ∈ actions(s)} v(s, a)
       else
             return argmin_{a ∈ actions(s)} v(s, a)
Function descent(s, f_θ, f_t, S, τ, T)
       t ← time()
       while time() − t < τ do descent_iteration(s, f_θ, f_t, S, T)
       return S, T
Algorithm 8 Descent tree search algorithm (a(s): state obtained after playing the action a in the state s; v(s, a): value obtained after playing a in s; v(s): value of s; f_θ: adaptive evaluation function; f_t: evaluation of terminal states; f_θ and f_t are from the point of view of the first player; S: index of the transposition table (set of states which are non-leaves or terminal); τ: search time per action; T: transposition table, which contains v).

Thus, we present descent. It is a modification of UBFM which builds a different, deeper game tree, to be combined with tree learning. The idea of descent is to combine UBFM with deterministic end-game simulations providing values that are interesting from the point of view of learning. The algorithm descent (Algorithm 8) recursively selects the best child of the current node, which becomes the new current node. It adds the children of the current node if they are not in the tree. It performs this recursion from the root (the current state of the game) until reaching a terminal node (an end game). It then updates the value of the selected nodes (minimax value). The algorithm descent repeats this recursive operation starting from the root as long as there is some search time left. Descent is almost identical to UBFM. The only difference is that descent performs an iteration until reaching a terminal state while UBFM performs this iteration until reaching a leaf of the tree (UBFM stops the iteration much earlier). In other words, during an iteration, UBFM just extends one of the leaves of the game tree while descent recursively extends the best child from this leaf until reaching the end of the game. The algorithm descent has the advantage of UBFM, i.e. it performs a longer search to determine a better action to play. By learning the values of the game tree (by using for example tree learning), it also has the advantage of a minimax search reaching the terminal states, i.e. it backs up the values of the terminal nodes to the other nodes more quickly. In addition, the states thus generated are closer to the terminal states, and their values are therefore better approximations.

4.2 Comparison of Search Algorithms for Game Learning

We now compare descent with tree learning to MCTS with root learning and with tree learning, to iterative deepening alpha-beta with root learning and with tree learning, and to UBFM with tree learning, using the protocol of Section 3.2. Each combination uses one of these tree search algorithms combined with tree or root learning. There is a total of 6 combinations. The experiment was repeated several times, and the winning percentage of a combination for each game and for each evaluation step (i.e. each hour) is calculated over all the corresponding matches. The winning percentage curves are shown in Figure 3. The final winning percentages are shown in Table 3. It is descent which gets the best curves on all games. For two games (Surakarta and Outer Open Gomoku), the difference with UBFM is very narrow, but the results remain better than the classic approaches (MCTS and alpha-beta). On each game, descent obtains a final percentage higher than all the other combinations (the percentage is equal to that of UBFM in the case of Santorini). On average over all games, descent has the best winning percentage, ahead of UBFM with tree learning, the second best combination, and further ahead of the remaining combinations.

Figure 3: Evolution of the winning percentages of the combinations of the experiment of Section 4.2, i.e. descent (dashed line), UBFM (dot-dashed line), MCTS (dotted line), and iterative deepening alpha-beta (continuous line), with tree learning (blue line) or root learning (red line). The display uses a simple moving average of 6 data points.
tree learning root learning
descent UBFM ID MCTS ID MCTS
Surakarta
Othello
Hex
O. O. Gomoku
Clobber
Breakthrough
Amazons
Santorini
Lines of Action
mean
Table 3: Final winning percentages of the combinations of the experiment of Section 4.2 (ID: iterative deepening alpha-beta)
Remark 1.

In the previous section, in Clobber and Amazons, MCTS with terminal learning scored relatively higher percentages than on the other games, rivaling tree learning. We can then wonder whether, on these two games, MCTS with terminal learning could compete with descent or UBFM. This is not the case: the experiment of this section was carried out again for these two games, replacing MCTS (resp. ID) with root learning by MCTS (resp. ID) with terminal learning, and the result is analogous.

In conclusion, descent (with tree learning) is undoubtedly the best combination. UBFM (with tree learning) is the second best combination, sometimes very close to descent's performance and sometimes very far, but always superior to the other combinations (slightly or largely depending on the game), apart from Clobber.

5 Reinforcement Heuristic to Improve Learning Performance

In this section, we propose the technique of reinforcement heuristics, which consists in replacing the classic terminal evaluation function – which returns 1 if the first player wins, −1 if the second player wins, and 0 in case of a draw (Young et al., 2016; Silver et al., 2017b; Gao et al., 2018) – by another heuristic to evaluate terminal states during the learning process. By using this technique, non-terminal states are evaluated differently, partial game trees and thus matches during the learning process are different, which can impact the learning performance. We start by offering several reinforcement heuristics. Then, we propose a complementary technique, that we call completion, which corrects state evaluation functions by taking into account the resolution of states. Finally, we compare the reinforcement heuristics that we propose to the classic terminal evaluation function.

5.1 Some Reinforcement Heuristics

Thus, we start by proposing different reinforcement heuristics.

5.1.1 Scoring

Some games have a natural reinforcement heuristic: the game score. For example, in the case of the game Othello (and in the case of the game Surakarta), the game score is the number of the player's pieces minus the number of pieces of his opponent (the goal of the game is to have more pieces than the opponent at the end of the game). The scoring heuristic used as a reinforcement heuristic consists of evaluating the terminal states by the final score of the game. With this reinforcement heuristic, the adaptive evaluation function will seek to learn the score of states. In the context of an algorithm based on minimax, the score of a non-terminal state is the minimax value of the subtree starting from this state whose terminal leaves are evaluated by their scores. After training, the adaptive evaluation function then contains more information than just an approximation of the result of the game: it contains an approximation of the score of the game. If the game score is a good indicator of the advantage of a player, this should improve learning performance.

Remark 2.

In the context of the game of the Amazons, the score is the size of the territory of the winning player, i.e. the squares which can be reached by a piece of the winning player. This is approximately the number of empty squares.

5.1.2 Additive and Multiplicative Depth Heuristics

Now we propose the following reinforcement heuristic: the depth heuristic. It consists in giving a better value to the winning states close to the start of the game than to the winning states far from the start. Reinforcement learning with the depth heuristic thus learns the duration of matches in addition to their results. This learned information is then used to try to win as quickly as possible and to lose as late as possible. The hypothesis behind this heuristic is that a state close to the end of the game has a more precise value than a more distant state, and that the duration of the game is easily learned. Under this assumption, with this heuristic, we take less risk by trying to win as quickly as possible and to lose as late as possible. In addition, with a long game, a player in difficulty will have more opportunities to regain the upper hand. We propose two realizations of the depth heuristic: the additive depth heuristic and the multiplicative depth heuristic. The additive depth heuristic returns, for a terminal state, a positive value if the first player wins, the opposite negative value if the second player wins, and 0 in case of a draw; the magnitude of this value decreases with the number of actions played since the beginning of the game and depends on the maximum number of playable actions in a game. For the game of Hex, this maximum number is the number of empty cells of the board plus one. For games where it is very large or difficult to compute, we can instead use a constant approximating it (close to the empirical average length of matches). The multiplicative depth heuristic is identical except that the dependence on the game length is multiplicative instead of additive.

Remark 3.

Note that the idea of fast victory and slow defeat has already been proposed, but not used in a learning process (Cazenave et al., 2016).
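As an illustration, here is a minimal Python sketch of one possible additive depth-based terminal evaluation; the exact formulas used in the paper may differ, and all names are hypothetical:

def additive_depth_value(winner, moves_played, max_moves):
    """Hypothetical additive depth-based terminal evaluation (illustrative only).

    winner: +1 if the first player wins, -1 if the second player wins, 0 for a draw.
    moves_played: number of actions played since the start of the game.
    max_moves: upper bound on the number of playable actions in a game
               (or a constant close to the average game length).

    Early wins get a larger magnitude than late wins, so a learner using
    these values prefers quick wins and slow defeats."""
    if winner == 0:
        return 0.0
    remaining = (max_moves - moves_played) / max_moves  # fraction of the game left
    return winner * (1.0 + remaining)


# Example: a win on move 20 of an 11x11 Hex game (121 cells) is valued
# higher than a win on move 100.
print(additive_depth_value(+1, 20, 121))   # ~1.83
print(additive_depth_value(+1, 100, 121))  # ~1.17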

5.1.3 Cumulative Mobility

The next reinforcement heuristic that we propose is cumulative mobility. It consists in favoring the games where the player has more possibilities of action and where his opponent has fewer. The implementation used in this article is the following: the value of a terminal state is positive if the first player wins, negative if the second player wins, and 0 in case of a draw, and its magnitude depends on the sum of the numbers of available actions over the turns of the first player since the start of the game and on the corresponding sum for the second player.

5.1.4 Piece Counting: Presence

Finally, we propose the presence heuristic as a reinforcement heuristic. It consists in taking into account the number of pieces of each player and starts from the assumption that the more pieces a player has, the greater his advantage. There are several possible implementations for this heuristic; we use in this article the following one: the heuristic value of a terminal state is positive if the first player wins, negative if the second player wins, and 0 in case of a draw, and its magnitude depends on the number of pieces of the first player and on the number of pieces of the second player at the end of the game. Note that in the games Surakarta and Othello, the score corresponds to a presence heuristic.

5.2 Completion

Relying solely on the value of states calculated from the terminal evaluation function and the adaptive evaluation function can sometimes lead to certain aberrant behaviors. More precisely, if we only seek to maximize the value of states, we will choose to play a state s rather than another state s′ if s is of greater value than s′, even if s′ is a winning resolved state (a state is resolved if we know the result of the game starting from this state when the two players play optimally). A search algorithm can resolve a state: this happens when all the leaves of the subtree starting from this state are terminal. Choosing s rather than s′, a winning resolved state, is an error when s is not resolved (or when s is resolved and is not winning), although there is perhaps, in certain circumstances, an interest in making this error from the point of view of learning. By choosing s, the guarantee of winning is lost. The left graph of Figure 4 illustrates such a scenario.

Figure 4: The left graph is a game tree where maximizing the value does not lead to the best decision; the right graph is the left game tree with completion (nodes are labeled by a pair of values), where maximizing leads to the best decision (square node: first player node (max node), circle node: second player node (min node), octagon: terminal node).

It is therefore necessary to take into account both the value of states and the resolution of states. The completion technique, which we propose in this section, is one way of doing it. It consists, on the one hand, in associating with each state a resolution value r. The resolution value of a leaf state is 0 if the state is not resolved or if it is resolved as a draw, 1 if it is resolved as a winning state, and −1 if it is resolved as a losing state (from the first player's point of view). The resolution value of a non-leaf state is computed as the minimax value of the partial game tree where the leaves are evaluated by their resolution value. It consists, on the other hand, in comparing states by their pairs (r, v), where v is the state value, using the lexicographic order (instead of just comparing states by the value v). We then seek to maximize the pair, in particular to decide which action to play. The right graph of Figure 4 illustrates the use of completion. The use of the resolution of states also makes it possible to stop the search in the resolved subtrees and thus to save computing time. The descent algorithm modified to use completion and the resolution stop is described in Algorithm 9. With completion, descent always chooses an action leading to a winning resolved state when there is one and never chooses, if possible, an action leading to a losing resolved state.
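As an illustration, a minimal Python sketch of the lexicographic comparison used by completion, with resolution values +1/0/−1 as described above (the class and function names are hypothetical):

from dataclasses import dataclass

@dataclass
class Child:
    value: float       # (approximate) minimax value of the child state
    resolution: int    # +1 resolved win, -1 resolved loss, 0 unresolved or resolved draw

def completed_best_child(children, first_player=True):
    """Pick the child maximizing (resolution, value) lexicographically for the
    first player (minimizing it for the second player)."""
    key = lambda c: (c.resolution, c.value)
    return max(children, key=key) if first_player else min(children, key=key)

# A resolved win (resolution=+1) is preferred over an unresolved child with a
# higher heuristic value, avoiding the aberrant behavior described above.
children = [Child(value=0.9, resolution=0), Child(value=0.2, resolution=+1)]
print(completed_best_child(children))  # Child(value=0.2, resolution=1)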

Function completed_descent_iteration(s, f_θ, f_t, S, T)
       if terminal(s) then
             v(s) ← f_t(s)
             r(s) ← 1 if s is winning, −1 if s is losing, 0 if s is a draw
       else
             if s is not resolved then
                   if s ∉ S then
                         S ← S ∪ {s}
                         foreach a ∈ actions(s) do
                               if terminal(a(s)) then
                                     v(s, a) ← f_t(a(s)) ; r(s, a) ← resolution of a(s)
                               else
                                     v(s, a) ← f_θ(a(s)) ; r(s, a) ← 0
                   a_b ← completed_best_action(s)
                   (r(s, a_b), v(s, a_b)) ← completed_descent_iteration(a_b(s), f_θ, f_t, S, T)
                   a_b ← completed_best_action(s)
                   (r(s), v(s)) ← (r(s, a_b), v(s, a_b))
       return (r(s), v(s))
Function completed_best_action(s)
       if first_player(s) then
             return argmax_{a ∈ actions(s)} (r(s, a), v(s, a))  (lexicographic order)
       else
             return argmin_{a ∈ actions(s)} (r(s, a), v(s, a))  (lexicographic order)
Function completed_descent(s, f_θ, f_t, S, τ, T)
       t ← time()
       while time() − t < τ and s is not resolved do completed_descent_iteration(s, f_θ, f_t, S, T)
       return S, T
Algorithm 9 Descent tree search algorithm with completion and resolution stop (a(s): state obtained after playing the action a in the state s; v(s, a): value obtained after playing a in s; v(s): value of s; r(s): resolution value of state s (0 by default), which is 0 if s is resolved as a draw, 1 if s is resolved as winning, −1 if s is resolved as losing; f_θ: adaptive evaluation function; f_t: evaluation of terminal states; f_θ, f_t, v and r are from the point of view of the first player; S: set of states of the game tree which are non-leaves or terminal; τ: search time per action; T: transposition table, which contains v and r).

We also propose to use the resolution of states with action selections, to reduce the duration of games and therefore a priori the duration of the learning process: always play an action leading to a winning resolved state if it exists and never play an action leading to a losing resolved state if possible. Thus, if among the available actions we know that one of the actions is winning, we play it. If there is none, we play according to the chosen action selection method among the actions not leading to a losing resolved state (if possible). We call it completed action selection.
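For concreteness, a minimal Python sketch of this completed action selection, assuming resolution values are given from the current player's point of view and that any base action selection method (e.g. ε-greedy) is passed in (all names are hypothetical):

import random

def completed_action_selection(children, base_select):
    """children maps each action to a pair (resolution, value), with
    resolution in {+1, 0, -1} from the current player's point of view.
    base_select is any action selection method (e.g. epsilon-greedy)
    applied to the remaining candidate actions."""
    # Play a known win immediately if there is one.
    winning = [a for a, (r, _) in children.items() if r == +1]
    if winning:
        return random.choice(winning)
    # Otherwise avoid actions leading to resolved losses, when possible.
    safe = {a: rv for a, rv in children.items() if rv[0] != -1}
    return base_select(safe if safe else children)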

5.3 Comparison of Reinforcement Heuristics

We now compare the different heuristics that we have proposed to the classic terminal evaluation function on different games, using the protocol of Section 3.2. Each combination uses descent with completion (Algorithm 9) and completed ε-greedy (see Algorithm 2 and Section 5.2). Each combination uses a different terminal evaluation function. These terminal evaluations are the classic (“binary”) evaluation function, the additive depth heuristic, the multiplicative depth heuristic, the scoring heuristic, cumulative mobility, and the presence heuristic. The other parameters are the same as in Section 3.3. There are, at most, 6 combinations per game (on some games, some heuristics are not evaluated because they are trivially of no interest or equivalent to another heuristic). The experiment was repeated several times, and the winning percentage of a combination for each game and for each evaluation step (i.e. each hour) is calculated over all the corresponding matches. The final winning percentages are shown in Table 4. On average, and in most of the games, the classic terminal heuristic has the worst percentage. In Clobber and Othello, it is the second worst. In Lines of Action, it is the third worst. In scoring games, scoring is the best heuristic, as we might expect. Leaving aside the score heuristic, with the exception of Surakarta, it is one of the two depth heuristics that has the best winning percentage. On average, using a depth heuristic instead of the classic evaluation clearly increases the winning percentage, and using the best depth heuristic increases it even more. The winning percentage curves are shown in Figure 5. The final percentages summarize the curves quite well. Note however, on the one hand, the clear impact compared to the other heuristics (except score) of the additive depth heuristic on Breakthrough, Amazons, Othello, Hex, and Santorini and of the multiplicative depth heuristic on Clobber, Hex, and Outer Open Gomoku. Note, on the other hand, that the classic heuristic lags behind on all games, except on Othello, Clobber and particularly Lines of Action. In conclusion, the use of generic reinforcement heuristics significantly improves performance, and the depth heuristics are prime candidates as powerful generic reinforcement heuristics.

Figure 5: Evolution of the winning percentages of the combinations of the experiment of Section 5.3, i.e. the use of the following heuristics: classic (black line), score (purple line), additive depth (blue line), multiplicative depth (turquoise line), cumulative mobility (green line), and presence (red line). The display uses a simple moving average of 6 data points.
depth
classic score additive multiplicative mobility presence
Surakarta score
Othello score
Hex X X X
Outer Open Gomoku X X X
Clobber X X
Breakthrough X
Amazons X
Santorini X X
Lines of Action X
mean
Table 4: Final winning percentages of the combinations of the experiment of Section 5.3 (X: heuristic without interest in this context; presence coincides with score in Surakarta and Othello)

6 Search Algorithms for Game Playing

In this section, we propose another variant of UBFM, intended to be used in competition mode, i.e. after the learning process. Then, we compare it with other tree search algorithms.

6.1 Unbounded Best-First Minimax with Safe Decision: UBFMs

Thus, we propose a modification of UBFM, denoted UBFMs. It aims to provide a safer game. The action UBFM chooses to play is the one that leads to the state of best value. In some cases, the (a priori) best action can lead to a state that has not been sufficiently visited (such as a non-terminal leaf). Choosing this action is therefore a risky decision. To avoid this problem, we propose a different decision that aims to play the safest action, in the same way as MCTS (max child selection (Browne et al., 2012)). If no action leads to a winning resolved state, the action chosen by UBFMs is the one that has been the most selected (from the current state of the game) during the exploration of the game tree. In case of a tie, UBFMs decides by choosing the one that leads to the state of best value. This decision is safer because the number of times an action is selected is the number of times that this action appeared more interesting than the others.

Example 4.

The current player has the choice between two actions. The first leads to the state of best value but has been selected only a few times (from the current state and since the beginning of the game); the second leads to a state of slightly lower value but has been selected much more often. UBFM chooses the first action while UBFMs chooses the second.

The UBFMs algorithm with completion is described in Algorithm 11 (which uses Algorithm 10; the resolution stop is not used, for succinctness of the presentation).

Function ubfms_iteration(s, S, T, f_θ, f_t)
       if terminal(s) then
             v(s) ← f_t(s)
       else
              if s ∉ S then
                    foreach a ∈ actions(s) do
                          if terminal(a(s)) then
                                v(s, a) ← f_t(a(s))
                          else
                                v(s, a) ← f_θ(a(s))
                    S ← S ∪ {s}
              else
                    a_b ← best_action(s)
                    n(s, a_b) ← n(s, a_b) + 1
                    v(s, a_b) ← ubfms_iteration(a_b(s), S, T, f_θ, f_t)
              v(s) ← v(s, best_action(s))
       return v(s)
Function best_action(s)
       if first_player(s) then
             return arg max_{a ∈ actions(s)} v(s, a)
       else
             return arg min_{a ∈ actions(s)} v(s, a)
Function ubfms_tree_search(s, S, T, f_θ, f_t, τ)
       t ← time()
       while time() − t < τ do
             ubfms_iteration(s, S, T, f_θ, f_t)
       return S, T
Algorithm 10: UBFMs tree search algorithm with completion (a(s): state obtained after playing the action a in the state s; v(s, a): value obtained after playing a in s; v(s): value of s; r(s): resolved value of the state s (unresolved by default), indicating whether s is a proven draw, win, or loss; f_θ: adaptive evaluation function; f_t: evaluation of terminal states; values and evaluations are from the point of view of the first player; S: set of states of the game tree which are non-leaves or terminal; τ: search time per action; n(s, a): number of times the action a is selected in state s (initially 0 for all s and a); T: transposition table).
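To make the structure of Algorithm 10 concrete, here is a minimal Python sketch of this kind of best-first iteration with selection counts. It is a simplified reading rather than the exact algorithm: completion (resolved values) and the transposition table are omitted, states are assumed to be hashable, and evaluate, evaluate_terminal, actions, play, is_terminal, and is_first_player are hypothetical interfaces to the game and to the learned evaluation function; all values are expressed from the first player's point of view.

import time

def ubfms_iteration(s, tree, evaluate, evaluate_terminal,
                    actions, play, is_terminal, is_first_player):
    # One best-first iteration: descend along best actions, expand one leaf,
    # and back up the values along the visited path (completion omitted).
    if is_terminal(s):
        return evaluate_terminal(s)
    node = tree.get(s)
    if node is None:
        # Leaf of the current tree: expand it by evaluating its children.
        node = {"values": {}, "counts": {}}
        for a in actions(s):
            child = play(s, a)
            node["values"][a] = (evaluate_terminal(child) if is_terminal(child)
                                 else evaluate(child))
            node["counts"][a] = 0
        tree[s] = node
    else:
        # Internal node: follow the currently best action and recurse.
        a_best = best_action(s, node, is_first_player)
        node["counts"][a_best] += 1
        node["values"][a_best] = ubfms_iteration(
            play(s, a_best), tree, evaluate, evaluate_terminal,
            actions, play, is_terminal, is_first_player)
    # Back up: the value of s is the value of its best child.
    return node["values"][best_action(s, node, is_first_player)]

def best_action(s, node, is_first_player):
    # Maximize the value for the first player, minimize it for the second.
    if is_first_player(s):
        return max(node["values"], key=node["values"].get)
    return min(node["values"], key=node["values"].get)

def ubfms_tree_search(s, tree, evaluate, evaluate_terminal,
                      actions, play, is_terminal, is_first_player, budget):
    # Anytime loop: repeat iterations until the search time is exhausted.
    deadline = time.time() + budget
    while time.time() < deadline:
        ubfms_iteration(s, tree, evaluate, evaluate_terminal,
                        actions, play, is_terminal, is_first_player)
    return tree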
Function safest_action(s, T)
       if first_player(s) then
             return arg max_{a ∈ actions(s)} (n(s, a), v(s, a))   (lexicographic order)
       else
             return arg max_{a ∈ actions(s)} (n(s, a), −v(s, a))   (lexicographic order)
Function ubfms(s, S, T, f_θ, f_t, τ)
       S, T ← ubfms_tree_search(s, S, T, f_θ, f_t, τ)
       return safest_action(s, T)
Algorithm 11: UBFMs action decision algorithm with completion (a(s): state obtained after playing the action a in the state s; v(s, a): value obtained after playing a in s; v(s): value of s; r(s): resolved value of the state s (unresolved by default), indicating whether s is a proven draw, win, or loss; f_θ: adaptive evaluation function; f_t: evaluation of terminal states; values and evaluations are from the point of view of the first player; S: set of states of the game tree which are non-leaves or terminal; τ: search time per action; n(s, a): number of times the action a is selected in state s (initially 0 for all s and a); T: transposition table).
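A corresponding sketch of the safe decision of Algorithm 11, reusing the names of the previous sketch, could look as follows. It ignores resolved (proven winning) states and simply plays the most selected action, breaking ties with the action value as described above.

def safest_action(s, tree, is_first_player):
    # Play the most selected action at the root; break ties with the value
    # (maximal value for the first player, minimal value for the second).
    node = tree[s]
    sign = 1 if is_first_player(s) else -1
    return max(node["counts"],
               key=lambda a: (node["counts"][a], sign * node["values"][a]))

def ubfms_decide(s, tree, evaluate, evaluate_terminal,
                 actions, play, is_terminal, is_first_player, budget):
    # Grow the tree for the allotted search time, then play the safest action.
    ubfms_tree_search(s, tree, evaluate, evaluate_terminal,
                      actions, play, is_terminal, is_first_player, budget)
    return safest_action(s, tree, is_first_player)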

6.2 Comparison of Search Algorithms for Game Playing

We now compare the winning percentages of different game-playing algorithms, using the evaluation functions learned in Section 4.2 (obtained at the end of the learning process). We compare UBFMs with UBFM and with iterative deepening alpha-beta with move ordering (each of these algorithms uses completion). For each game, each combination of a search algorithm and a learned evaluation function confronts reference combinations, with a fixed search time per action; each reference combination consists of minimax at a fixed depth paired with one of the final evaluation functions of Section 4.2 (one for each repetition of the learning process). The winning percentage of a search algorithm is the average of its winning percentages over these evaluation functions and over the repetitions. The winning percentages are described in Table 5. For each game, the winning percentage of a search algorithm is calculated from the corresponding set of matches. On all games except Clobber, UBFMs obtains the best winning percentage (on two games, Hex and Outer Open Gomoku, it shares the best percentage with another algorithm). On Clobber, another algorithm obtains the best percentage, but only by a small margin. On average, across all games, UBFMs is better than both UBFM and iterative deepening alpha-beta.
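For reference, the iterative deepening alpha-beta baseline of this comparison can be sketched in Python as follows. This is a standard, simplified version rather than the exact program used in the experiments: completion is omitted, the move ordering simply tries the best move of the previous depth first, the last depth may slightly overrun the time budget, and evaluate, actions, play, is_terminal, and is_first_player are the same hypothetical interfaces as in the previous sketches.

import time

def alphabeta(s, depth, alpha, beta, best_first_move,
              evaluate, actions, play, is_terminal, is_first_player):
    # Depth-limited alpha-beta from the first player's point of view.
    if is_terminal(s) or depth == 0:
        return evaluate(s), None
    moves = list(actions(s))
    if best_first_move in moves:
        # Move ordering: try the previously best move first.
        moves.remove(best_first_move)
        moves.insert(0, best_first_move)
    best_move = None
    if is_first_player(s):
        value = float("-inf")
        for a in moves:
            v, _ = alphabeta(play(s, a), depth - 1, alpha, beta, None,
                             evaluate, actions, play, is_terminal, is_first_player)
            if v > value:
                value, best_move = v, a
            alpha = max(alpha, value)
            if alpha >= beta:
                break
        return value, best_move
    value = float("inf")
    for a in moves:
        v, _ = alphabeta(play(s, a), depth - 1, alpha, beta, None,
                         evaluate, actions, play, is_terminal, is_first_player)
        if v < value:
            value, best_move = v, a
        beta = min(beta, value)
        if alpha >= beta:
            break
    return value, best_move

def iterative_deepening(s, budget, evaluate, actions, play,
                        is_terminal, is_first_player):
    # Search at increasing depths until the time budget is exhausted,
    # keeping the best move of the deepest completed search.
    deadline = time.time() + budget
    best_move, depth = None, 1
    while time.time() < deadline:
        _, move = alphabeta(s, depth, float("-inf"), float("inf"), best_move,
                            evaluate, actions, play, is_terminal, is_first_player)
        if move is not None:
            best_move = move
        depth += 1
    return best_move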

Then, a variation of this experiment was performed. For each game, each combination confronts all the others, but the evaluation functions used are restricted to those generated by the descent learning algorithm, and a different search time per action is used. The corresponding winning percentages are described in Table 6. For each game, the winning percentage of a search algorithm is calculated from the corresponding set of matches. In all games except Clobber and Santorini, it is again UBFMs which obtains the best winning percentage; on Clobber and Santorini, one of the other algorithms obtains the best percentage. On average across all games, UBFMs is again better than both UBFM and iterative deepening alpha-beta. In conclusion, in the context of these experiments, UBFMs is the best search algorithm.

Table 5: Average winning percentages of UBFMs, UBFM, and iterative deepening alpha-beta over the evaluation functions of Section 4.2, for Outer Open Gomoku, Clobber, Breakthrough, Santorini, Hex, Lines of Action, Othello, Amazons, and Surakarta (and their mean), in the first experiment of Section 6.2 (search time: one second per action).
Table 6: Average winning percentages of UBFMs, UBFM, and iterative deepening alpha-beta over the evaluation functions generated by descent in Section 4.2, for Outer Open Gomoku, Clobber, Breakthrough, Santorini, Hex, Lines of Action, Othello, Amazons, and Surakarta (and their mean), in the second experiment of Section 6.2 (with a different search time per action).

7 Ordinal Distribution and Application to Hex

In this section, we propose the last technique, a new action selection distribution, and we apply it together with all the previous techniques to design program-players for the game of Hex.

7.1 Ordinal Action Distribution

Thus, we propose an alternative probability distribution (see Section 2.3), which we call the ordinal distribution. This distribution does not depend on the values of the states; it depends only on the order of these values. Its formula is expressed in terms of the number of children of the root, the rank of each child (the i-th best child of the root), the probability of playing the action leading to that child, and the exploitation parameter. Algorithm 12 describes the action selection method resulting from the use of the ordinal distribution, with an optimized calculation.

Function ordinal(s, T)
       if first_player(s) then
              a_1, …, a_k ← actions(s) sorted in descending order by v(s, a)
       else
              a_1, …, a_k ← actions(s) sorted in ascending order by v(s, a)
       for i ← 1 to k do
              if a_i is accepted, with the probability prescribed by the annealed ordinal distribution, then
                    return a_i
Algorithm 12: Ordinal action distribution algorithm with simulated annealing, used in the experiments of this article (the annealing depends on the time elapsed since the start of the reinforcement learning process and on the chosen total duration of the reinforcement learning process; v(s, a): value obtained after playing a in s).
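The following Python sketch gives one concrete reading of Algorithm 12. The control flow (walking through the actions from best to worst and stopping at random) follows the pseudocode, but the acceptance probability and its annealing schedule, which here grow linearly from the exploitation parameter toward 1 with the elapsed fraction of the learning time, are assumptions of this sketch rather than the exact formula of the ordinal distribution; node reuses the representation of the earlier sketches.

import random

def ordinal_action(s, node, is_first_player, exploitation, elapsed, total):
    # Walk through the root actions from best to worst and stop at random.
    # `node["values"]` maps each action to its value, as in the previous
    # sketches. Assumed annealing: the acceptance probability grows from
    # `exploitation` toward 1 as `elapsed` approaches `total`.
    reverse = is_first_player(s)
    ranked = sorted(node["values"], key=node["values"].get, reverse=reverse)
    accept = exploitation + (1 - exploitation) * min(elapsed / total, 1.0)
    for a in ranked[:-1]:
        if random.random() < accept:
            return a
    # If no earlier action was accepted, play the last (worst) one.
    return ranked[-1]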
Remark 5.

In an experiment using the common protocol, not presented in this article, the ordinal distribution was mostly better than the softmax distribution, but worse than ε-greedy. However, during long learning processes on Hex (similar to the experiments of the following sections), the ordinal distribution performed best.

7.2 A Long Training for Hex

We now apply all the techniques that we have proposed to carry out a long self-play reinforcement learning process on Hex size 11. More precisely, we use completed descent (Algorithm 9) with tree learning (Algorithm 3), the completed ordinal distribution (see Section 5.2 and Algorithm 12), and the additive depth heuristic (see Section