Board games are a favorite among AI researchers for experiments with intelligent decision making, building, for example, on works analyzing the game of Chess that date back centuries [24, 35]. Already in 1826, papers were published on machines that supposedly played Chess automatically, although it was unclear whether the machine was still operated somehow by humans. Nowadays, for some games, such as Chess and Go [33, 34], we know for sure that there are algorithms that can, without the help of humans, automatically decide on a move, and even outplay the best human players. In this paper, we investigate a new game, Tetris Link, that, to the best of our knowledge, has not yet received attention from researchers (see section II). Tetris Link is a manual, multi-player version of the well-known video game Tetris. It is played on a vertical “board”, not unlike Connect-4. The game has a large branching factor, and since it is not immediately obvious how a strong computer program should be designed, we set ourselves this task in this paper. For that, we implement a digital version of the board game and take a brief look at the game’s theoretical aspects (section III). Based on that theory, we develop heuristics for a minimax-based program that we also test against human players (section IV-A). Since its performance is limited, we also try other common AI approaches: Deep Reinforcement Learning (RL) and Monte Carlo tree search (MCTS), approaches that were combined by Silver et al. in AlphaGo. In section V-A, we look at MCTS, and in section V-B, at experimental results for RL, as options to implement agents. In our design of the game environment for the RL agent, we assess the impact of choices such as the reward on training success. Finally, we compare the performance of these agents after letting them compete against each other in a tournament (section V-C). To our surprise, the humans are stronger.
The main contribution of this paper is that we present to the community the challenge of implementing a well-playing computer program for Tetris Link. This challenge is much harder than expected, and we provide evidence on why this might be the case, even for the deterministic two-player version of the game (without dice). The real Tetris Link can be played with four players using dice, which will presumably be even harder for an AI.
We document our approach, implementing three players based on the three main AI game-playing approaches of heuristic planning, Monte Carlo Tree Search, and Deep Reinforcement Learning. To our surprise and regret, all players were handily beaten by human players. We respectfully offer our approach, code and experience to the community to improve upon.
II Related Work
Few papers on Tetris Link exist in the literature. A single paper describes an experiment using Tetris Link. This work is about teaching undergraduates “business decisions” using the game. To provide background, we analyze the game in more depth in section III.
The AI approaches that we try have been successfully applied to a variety of board games. Heuristic planning has been the standard approach in many games such as Othello, Checkers, and Chess [30, 31, 18, 39, 25]; MCTS has been used in a variety of applications such as Go and General Game Playing (GGP) [8, 4, 29]; and Deep RL has seen great success in Backgammon and Go [36, 38, 33, 37]. Multi-agent MCTS has also been presented in the literature.
III Tetris Link
Tetris Link, depicted in Figure 1, is a turn-based board game for two to four players.
Just like the original Tetris video game, Tetris Link features a ten by twenty grid in which shapes called tetrominoes are placed. (A shape built from squares that touch each other edge-to-edge is called a polyomino; because they are made out of precisely four squares, these shapes are called tetrominoes.) This paper will refer to tetrominoes as blocks for brevity. The five available block shapes are referred to as I, O, T, S, and L (the S and L blocks may also be referred to as Z and J). Every shape has a small white dot, also in the original physical board game variant, to make it easier to distinguish individual pieces from each other. Every player is assigned a colour for distinction and gets twenty-five blocks: five of each shape. In every turn, a player must place precisely one block. If no fitting blocks are available any more, the player is skipped. A player can never voluntarily skip if one of the available blocks fits somewhere on the board, even if placing it is disadvantageous. The game ends when no block of any player fits onto the board any more.
The goal of the game is to obtain the most points. One point is awarded for every block, provided that it is connected to a group of at least three blocks. Not every block has to touch every other block in the group, as shown in Figure 2(b).
There, the I block only touches the T but not the L on the far right. Since together they form a chained group of three, the group counts as three points. Blocks have to touch each other edge-to-edge. In Figure 2(a), the red player receives no points as the I is only connected edge-to-edge to the blue L.
A player loses one point per empty square (or hole) created, with a maximum of two minus points per turn. Figure 2(c) shows what one minus point for red would look like. Moreover, the figure underlines a fundamental difference to video-game Tetris. In video-game Tetris, blocks fall slowly, and one could nudge the transparent L under the S to fill the hole by precisely timing an action. In Tetris Link, one can only drop pieces in from the top and let them fall straight to the bottom. In the original rules, a die is rolled to determine which block is placed. If a player is out of that specific block, the player is skipped. Since every block is potentially worth one point, being skipped means missing out on one point.
Due to the dice roll, Tetris Link is an imperfect information game. The dice roll also causes random skips and point deductions for players. In this paper we omit the dice roll and analyze the perfect information version of Tetris Link. Note that we also focus on the two-player game only in this work; the three- and four-player versions are presumably even harder. However, our web-based implementation for human test games (https://hizoul.github.io/contetro) can handle up to four players and can provide an impression of the Tetris Link gameplay.
III-A Verification that all games can end
Each game of Tetris Link can be played to the end, in the sense that there are enough blocks to fill the board. This is easy to see by the following argument. The board is ten squares wide and twenty squares high, so it can accommodate 200 individual squares. Every player has twenty-five blocks, each consisting of four squares, that is, 100 squares per player. There are always at least two players playing the game, so together they hold at least 200 squares and are always able to fill the board.
III-B Branching Factor
An essential property of a problem with respect to approaching it by means of a search algorithm is the branching factor. This is the number of possible moves a player can perform in one state. In order to compute this number, we look at the number of orientations for each block. The I block has two orientations as it can be used either horizontally or vertically. The O block has only one orientation because it is a square. The T and the S block each have four different orientations, one for every side they can be turned to. The L is an unusual case as it has eight orientations: four for every side it can be turned to, and four more for the mirrored shape. Hence, in total, nineteen different shapes can be placed by rotating or mirroring the five available base blocks. Since the board is ten units wide, there are also ten different drop points per shape. In total, there can be up to 190 possible moves available in one turn. However, the game rules state that all placed squares have to be within the bounds of the game board. Twenty-eight of these moves are always impossible because some of their squares would exceed the bounds on either the left or the right side of the board. Therefore, the exact maximum number of possible moves in one turn for Tetris Link is 162. Since the board gets fuller throughout the game, not all moves are always possible, and the branching factor decreases towards the end. In order to show this development throughout matches, we simulate 10,000 games. We depict the average number of moves per turn in Figure 3.
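The count of nineteen orientations and 162 in-bounds moves can be re-derived mechanically. The following is a minimal sketch, not the implementation used in this paper: the cell coordinates in `BASE_SHAPES` and the helper names are illustrative assumptions, and a move is taken to be an orientation together with its leftmost column.

```python
# Hypothetical cell encodings (x, y) of the five base blocks; any rotation
# or mirror image of these would serve equally well as a starting point.
BASE_SHAPES = {
    "I": [(0, 0), (1, 0), (2, 0), (3, 0)],
    "O": [(0, 0), (1, 0), (0, 1), (1, 1)],
    "T": [(0, 0), (1, 0), (2, 0), (1, 1)],
    "S": [(1, 0), (2, 0), (0, 1), (1, 1)],
    "L": [(0, 0), (0, 1), (0, 2), (1, 2)],
}


def normalize(cells):
    """Shift a shape so that its smallest x and y are both zero."""
    min_x = min(x for x, _ in cells)
    min_y = min(y for _, y in cells)
    return frozenset((x - min_x, y - min_y) for x, y in cells)


def orientations(cells):
    """Return all distinct shapes reachable by rotating and mirroring."""
    seen = set()
    current = list(cells)
    for _ in range(4):
        current = [(-y, x) for x, y in current]  # rotate by 90 degrees
        seen.add(normalize(current))
        seen.add(normalize([(-x, y) for x, y in current]))  # mirrored copy
    return seen


def count_moves(board_width=10):
    """Count orientations and in-bounds drop columns on an empty board."""
    n_orientations = 0
    n_moves = 0
    for cells in BASE_SHAPES.values():
        for shape in orientations(cells):
            width = max(x for x, _ in shape) + 1
            n_orientations += 1
            n_moves += board_width - width + 1
    return n_orientations, n_moves


if __name__ == "__main__":
    print(count_moves())
```

Under these assumptions the difference between 19 × 10 drop points and the in-bounds count is exactly the twenty-eight moves that would stick out of the board.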
In the first eight to ten turns, all moves are available regardless of the quality of play. After that, there is a slow but steady decline. Tetris Link is a game of skill: random moves perform badly. A game consisting of random moves ends after only thirty turns. Many holes with many minus points are created, and the game ends quickly with a low score. The heuristic lines show that simple rules of thumb fill the board most of the time, taking more than forty turns. Furthermore, the branching factor in the midgame (turns 13-30) declines more slowly and hence offers more variety in outcomes.
We are now ready to calculate the approximate size of the state space of Tetris Link, in order to compare its complexity to other games. On average, across all three agents, a game takes 37 turns and allows for 74 actions per turn ($74^{37} \approx 1.45 \times 10^{69}$). The state-space complexity is larger than in Chess but smaller than in Go.
III-C First move advantage
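As a quick check of the order of magnitude, the estimate is the average branching factor raised to the power of the average game length. The sketch below uses Python's exact integer arithmetic; the function name is an illustrative assumption, and the second call applies the same formula to the six opening turns in which all 162 moves are available.

```python
def tree_size(branching_factor: int, depth: int) -> int:
    """Number of move sequences of the given depth at a fixed branching factor."""
    return branching_factor ** depth


if __name__ == "__main__":
    full_game = tree_size(74, 37)   # averages reported in the text
    opening = tree_size(162, 6)     # first six turns, all moves available
    print(f"full game: about {full_game:.2e} sequences")
    print(f"opening:   {opening:,} sequences")
```

This is only a rough estimate, since the real branching factor varies from turn to turn and different move orders can lead to the same position.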
An important property of turn-based games is whether making the first move gives the player an advantage. To put the possible advantage into numbers, we let different strategies play against themselves 10,000 times to look at the win rate. The first six (#1) or all (#2) moves are recorded and checked for uniqueness.
| Win Rate #1 | 47.84% | 47.15% | 71.93% | 68.41% |
| Unique Games #1 | 10,000 | 2188 | 7 | 29 |
| Win Rate #2 | 48.16% | 47% | 71.65% | 70% |
| Unique Games #2 | 10,000 | 10,000 | 7 | 50 |
As can be seen in Table I, the win rate for the random heuristic is almost 50%. Although the win rate for the first player is higher for the tuned heuristics, these numbers are not as representative because the heuristic repeats the same tactics over and over again, resulting in only seven or twenty-nine unique game starts. If we repeat the same few games, then we will not know whether the first player has an advantage, especially considering that at least until turn six all moves are always possible, which gives around $162^{6}$ or 18 trillion possible outcomes. Since the random heuristic has more deviation and plays properly as opposed to random moves, we believe that it is a good indicator of the actual first player advantage. Note that 47% is close to an equal opportunity. Different match history comparisons of Chess measure a difference of around two to five percent in win rate for the first player. However, since neither Tetris Link nor Chess has been mathematically solved, one cannot be certain that there is a definite advantage.
IV AI Player Design
We now describe the design of our heuristic player. A heuristic is a rule of thumb that works well most of the time. For Tetris Link, we identify four heuristic measures: the number of connectable edges, the size of groups, the player score, and the number of blocked edges. The number of blocked edges is the number of edges belonging to opponents that are blocked by the current player’s blocks. All heuristic values are positively related to the chance of winning.
Each parameter is multiplied by a weight, and the overall heuristic score is the sum of all four weighted values. For every possible move in a given turn, the heuristic value is calculated, and the move with the highest value is chosen. If multiple moves have the same maximum value, a random one of these best moves is chosen. The initial weights were set manually by letting the heuristic play against itself and detecting which combination would result in the most points gained for both players. We refer to this as the user heuristic. We then use Optuna, a hyperparameter tuner, to tune a set of weights that reliably beats the user heuristic. This version is called the tuned heuristic.
To achieve a greater variety in playstyle, we also test a random heuristic, which uses randomly chosen weights.
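The move selection of the heuristic player can be sketched as a weighted sum followed by an arg-max with random tie-breaking. This is a minimal illustration, not the Rust implementation: the feature names, the `evaluate` callback and the weight values are hypothetical, and computing the four measures from an actual board is left out.

```python
import random

# Hypothetical names for the four heuristic measures described in the text.
FEATURES = ("connectable_edges", "group_size", "player_score", "blocked_edges")


def heuristic_value(features, weights):
    """Weighted sum of the four heuristic measures for one candidate move."""
    return sum(weights[name] * features[name] for name in FEATURES)


def choose_move(moves, evaluate, weights, rng=random):
    """Pick a move with maximal heuristic value; ties are broken at random.

    `evaluate(move)` is assumed to return a dict with one entry per feature,
    describing the position that results from playing `move`.
    """
    scored = [(heuristic_value(evaluate(move), weights), move) for move in moves]
    best_value = max(value for value, _ in scored)
    best_moves = [move for value, move in scored if value == best_value]
    return rng.choice(best_moves)


if __name__ == "__main__":
    # Toy feature table standing in for a real board evaluation.
    table = {
        "a": dict(connectable_edges=3, group_size=1, player_score=0, blocked_edges=0),
        "b": dict(connectable_edges=2, group_size=3, player_score=1, blocked_edges=1),
    }
    weights = dict.fromkeys(FEATURES, 1.0)
    print(choose_move(list(table), table.__getitem__, weights))
```

With this structure, the user, tuned and random heuristics differ only in the `weights` dictionary that is passed in.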
For applications in which no efficient heuristic can be found, MCTS is often used, as it constructs a value function by averaging random roll-outs. Our MCTS implementation uses the standard UCT selection rule. As further enhancements, we also use MCTS-RAVE and MCTS-PoolRAVE to see whether these modifications improve the quality of results. Furthermore, we experimented with improving the default (random) policy by replacing it with the heuristic. However, the heuristic calculation is so slow that MCTS then only manages to visit ten nodes per second.
MCTS is well-suited for parallelization, leading to more simulations per second and hence better play. We implemented tree parallelization, a frequently used parallelization scheme. In tree-parallel MCTS, many threads expand the same game tree simultaneously. Using 12 threads, we visit 16,258 nodes per second on average with a random default policy. To put this into perspective, this is a fraction of roughly $9 \times 10^{-10}$ of the $162^{6}$ possibilities in the first six turns. Thus, only a small part of the game tree is explored, even with parallel MCTS.
IV-C Reinforcement Learning Environment and Agent
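For readers unfamiliar with the UCT selection rule, the following sketch shows the textbook UCB1 form: the average reward of a child plus an exploration bonus that shrinks as the child is visited more often. The `Node` class, the function names and the exploration constant `c` are assumptions made for illustration; they are not taken from our Rust implementation.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Minimal search-tree node: visit count, summed reward, children."""
    visits: int = 0
    total_reward: float = 0.0
    children: List["Node"] = field(default_factory=list)


def uct_score(parent_visits: int, child: Node, c: float = math.sqrt(2)) -> float:
    """UCB1 value of a child; unvisited children are always tried first."""
    if child.visits == 0:
        return math.inf
    exploitation = child.total_reward / child.visits
    exploration = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploitation + exploration


def select_child(parent: Node, c: float = math.sqrt(2)) -> Node:
    """Selection step of MCTS: descend to the child with the highest UCT score."""
    return max(parent.children, key=lambda child: uct_score(parent.visits, child, c))


if __name__ == "__main__":
    strong = Node(visits=5, total_reward=4.0)
    weak = Node(visits=5, total_reward=1.0)
    root = Node(visits=10, children=[strong, weak])
    print(select_child(root) is strong)
```

With a branching factor of up to 162, every child of a node has to be visited once before the exploitation term starts to matter, which illustrates why unguided search covers so little of the tree.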
A reinforcement learning environment requires an observation, actions and a reward, and an RL agent requires an algorithm as well as a network structure. To avoid reinventing the wheel, we use existing code for RL, namely OpenAI Gym and stable-baselines, which are written in Python. To connect Python to our Rust implementation, we compile a native shared library file and interact with it using Python’s ctypes. As RL algorithm, we exclusively use the deep reinforcement learning algorithm PPO2, without AlphaZero-like MCTS to further improve training samples. For the network structure, we increase the number of hidden layers from two layers of size 64 to three layers of size 128, because increasing the network size decreases the chances of getting stuck in local optima. We do not use a two-headed output, so the network only returns the action probabilities but not the certainty of winning as in AlphaZero.
The observation portrays the current state of the game field. Inspired by AlphaGo, which includes as much information as possible (even the “komi”, which refers to the first turn advantage points), we add additional information such as the number of pieces left per player, the players’ current score and which moves are currently legal. For the action space, we use a probability distribution over the possible moves. Probabilities of illegal moves are set to 0, so only valid moves are considered. For the reward, we have three different options.
Simple: depending on win / loss
The Guided reward stands out because it is the only one that reduces the number of points via scolding. If the chosen move was illegal, the reward is reduced, so the agent learns to only make valid moves. This technique is called reward shaping, and its results may vary.
In order to detect which one of the three options is the most effective, we conduct an experiment. Per reward function, we collect the average number of steps taken, the average reward achieved, and the average score of the players. Our results, shown in Table II, indicate that the Guided reward function works best. It only takes around 3183 steps on average to reach a local optimum, and the average score achieved in the matches is the highest. The Score reward function also lets the agent reach a local optimum, but it takes twice as long as the Guided function, and the score is slightly lower as well. The Simple reward function seems unfit for training. It never reached a local optimum in the 10,000 steps we allowed it to run, and it got the lowest score in its games.
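Setting the probabilities of illegal moves to zero, as described for the action space, can be sketched as follows. The renormalization step and the uniform fallback are assumptions added so that the result is again a probability distribution; the function name is hypothetical and the code is independent of stable-baselines.

```python
from typing import List, Sequence


def mask_illegal(probs: Sequence[float], legal: Sequence[bool]) -> List[float]:
    """Zero out illegal moves and renormalize the remaining probabilities."""
    masked = [p if ok else 0.0 for p, ok in zip(probs, legal)]
    total = sum(masked)
    if total == 0.0:
        # Assumed fallback: the network put all its mass on illegal moves,
        # so spread the probability uniformly over the legal ones.
        n_legal = sum(1 for ok in legal if ok)
        if n_legal == 0:
            raise ValueError("no legal move available; the player is skipped")
        return [1.0 / n_legal if ok else 0.0 for ok in legal]
    return [p / total for p in masked]


if __name__ == "__main__":
    print(mask_illegal([0.5, 0.3, 0.2], [True, False, True]))
```

In Tetris Link the mask would have one entry per orientation and column, that is, up to 162 entries per turn.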
| Reward Type | Steps | Episode Reward | Score |
V Agent Training and Comparison
For our experimental analysis, we first look at the performance of the MCTS agent (section V-A) and the training process of the RL agents (section V-B). Finally, we compare all previously introduced agents in a tournament to analyze their play quality and determine the currently best playing approach.
V-A MCTS Effectiveness
Initial test matches of MCTS against the user heuristic resulted in a zero percent win rate, and a look at the game boards suggested near-random play. We use a basic version of MCTS with random playouts because heuristic guidance was too slow. AlphaZero has shown that even games with a high branching factor such as Go can be played well by MCTS when guided by a neural network. However, without decision support from a learned model or a heuristic, we rely on simulations. In order to see whether this lack of guidance is the reason for the bad MCTS performance, we exploit the fact that the user heuristic plays very predictably (section III-C). We use the RAVE-MCTS variant (without the Pool addition), pre-fill the RAVE values with 100 games of the user heuristic playing against itself, and then let the MCTS play 100 matches against the user heuristic. We repeat this three times and use the average value across all three runs. We run this experiment with different values of the RAVE parameter β. This parameter is responsible for balancing exploration and exploitation and replaces the usual exploration parameter. The closer the number of RAVE visits of a node gets to β, the smaller the exploration component becomes. Furthermore, we employ the slow heuristic default policy at every node in this experiment. We simulate one match per step because otherwise the one second of thinking time is not enough for the slow heuristic policy to finish the simulation step.
With this guidance, our MCTS implementation can play well, with a decent win rate against the user heuristic, as shown in Figure 4. This result underlines that in games with high branching factors, MCTS needs good guidance through the tree in order to perform well.
The declining win rate with a higher beta value suggests that exploration on an already partially explored game tree worsens the result if the opponent does not deviate from its paths. The rise in win rate for a value of 5000 after the large drop at 2500 underlines the effect that the randomness involved in the search process can have.
Even though the heuristic supported playout policy works well, we will still use a random playout policy for the tournament (section V-C). Pre-filling the tree is very costly and would, therefore, provide an unfair advantage to the MCTS method.
We perform another small experiment in order to see how the branching factor influences MCTS performance: we run MCTS on different board sizes (2x2 to 11x11) of Hex against a shortest path heuristic. The result is striking: as long as the branching factor stays below 49 (7x7), MCTS wins up to 90% of the matches. For larger branching factors, the win rate drops to 0% quickly.
V-B RL Agents Training
We define an RL agent as the combination of environment, algorithm and training opponent. We use the guided reward function because it worked best in our experiment and call this agent RL-Selfplay. (This is a neural network only RL, without MCTS to improve training samples.)
In addition to this rather simple agent, we introduce the RL-Selfplay-Heuristic agent. It builds on a trained RL-Selfplay agent where we continue training by playing against the heuristic. Observation and reward are the same as for RL-Selfplay.
From the first turn advantage experiment, we know that the heuristic plays well even with random weights. That is why we also introduce an agent called RL-Heuristic. This agent gets the numerical observation as input and outputs four numbers that represent the heuristic’s weights (section IV-A). We use a modified version of the guided reward function, which additionally takes the group size into account.
Group size stands for the total number of stones that are connected with at least one other stone. This is added because we want the algorithm to draw a connection between the number of points gained and the number of connected stones. However, mainly the difference in points between itself and the opponent is used as learning signal, so it aims for gaining more points than the opponent. Scolding is not necessary any more as we do not have to filter the output in any way.
In this section, we detail the training process of the RL agents. Each training is done four times, and only the best run is shown. Agents are trained with the default PPO2 hyperparameters, except for RL-Heuristic, which uses hand-tuned parameters.
Furthermore, we increase the hidden layer amount from two hidden layers with size 64 to three hidden layers with size 128 because increasing the network size decreases the chances of getting stuck in local optima.
When playing only against themselves, the networks still quickly reached a local optimum even with increased layer size. This optimum manifested itself in the same game being played on repeat and the reward per episode staying the same. This repetition is a known problem in self-play and can be called “chasing cycles”. To prevent these local optima, we train five different agents against each other in random order. To be able to train against other agents, we modified the stable-baselines code.
The training process for RL-Selfplay is visualized in Figure 5. In the beginning, it keeps improving, but after peaking around 1.5 million steps, it only deteriorates. (Note that this is a form of self-play using the neural net only, without MCTS, as opposed to AlphaZero.) Usually, a reward training graph should, although jittery, steadily climb in the reward achieved.
For RL-Selfplay-Heuristic we use the two best candidates from RL-Selfplay, namely #3 after one million steps with a reward of 0.04, and #1 after 1.5 million steps with a reward of 0.034 (the actual peak is at 1.7 million steps, but the model was only saved every 500,000 steps). The training of RL-Selfplay#1-Heuristic reaches its peak after 3.44 million steps and that of RL-Selfplay#3-Heuristic after 3.64 million steps, with rewards of 0.032 and 0.024, respectively. These are our first RL agents that can achieve a positive reward while playing against the heuristic.
The RL-Heuristic training worked well, achieving mostly a positive reward. But by looking at the output values, we realize that the reward function design was unfortunate. The agent sets all weights to zero, except for the enemy block value, which is fifteen, and the number of open edges, which varies between four and seven. So by offsetting the player’s score against the opponent’s score, we have unintentionally forced the heuristic to focus on blocking the opponent over everything else. Needless to say, with these weights, RL-Heuristic rarely wins. Although it manages to keep the opponent’s score low, it does not focus on gaining points, which leaves it with a point disadvantage.
In the tournament, we pit all previously shown AI approaches against each other. Every bot plays 100 matches against every other bot. We have five different RL bots, three MCTS bots and three heuristic bots. The bots’ skill is compared via a Bayesian Bradley-Terry (BBT) skill rating. The original BBT uses a skill rating in the range of 0 to 50, similar to TrueSkill. By changing the parameter of the rating function, we change the range to 0 to 3000, so it is similar to the standard Elo range.
The final skill rating is portrayed in Figure 7. The three heuristic agents take the top three places, followed by RL and MCTS. Remarkably, the tuned heuristic performs best across all agents, even though it is only optimized to play well against the user heuristic.
Seeing RL-Heuristic as the best RL approach shows that the other RL agents are far from playing well. Yet the fact that all RL agents consistently beat MCTS with random playouts shows that the agents did learn to play reasonably.
It is interesting to see that the MCTS-UCB variant (14% win rate) performed best, because the other two variants, RAVE (0.02%) and PoolRAVE (0.04%), were conceived in order to improve the performance of UCB via slight modifications.
The skill rating omits information about the quality of the individual moves. To gain further insight into that, we provide Figure 6. Here, we can see that every agent manages at least once to gain 8 points or more. This means that every agent had at least one match it played well. Looking at the lowest achieved scores and average scores, we find that every agent except for the pure heuristic ones plays badly, considering how few points they make on average.
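To give an impression of how pairwise match results turn into a skill ranking, the sketch below fits the plain (non-Bayesian) Bradley-Terry model with the classic iterative update. It is a simplified stand-in, not the online BBT rating used for our tournament: the function names and the toy win counts are assumptions, and the code presumes that every bot wins at least one match.

```python
from typing import List


def bradley_terry(wins: List[List[int]], iterations: int = 200) -> List[float]:
    """Fit Bradley-Terry strengths from a matrix of pairwise win counts.

    wins[i][j] is the number of matches bot i won against bot j.
    Returns strengths that are normalized to sum to one.
    """
    n = len(wins)
    strength = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        norm = sum(updated)
        strength = [s / norm for s in updated]
    return strength


def win_probability(strength: List[float], i: int, j: int) -> float:
    """Modelled probability that bot i beats bot j."""
    return strength[i] / (strength[i] + strength[j])


if __name__ == "__main__":
    # Toy round robin with 100 matches per pairing (made-up numbers).
    toy = [[0, 70, 90], [30, 0, 60], [10, 40, 0]]
    print(bradley_terry(toy))
```

The resulting strengths could be rescaled to any convenient range, in the same spirit as the mapping to an Elo-like range of 0 to 3000 described above.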
VI Conclusion and Future Work
Board game strategy analysis has been done for decades, and especially games like Chess and Go have seen countless papers analyzing the game, its patterns and more to find the best play strategies. We contributed to that field by taking a close look at the board game Tetris Link. While strategy is key to winning, some games, such as Hex, give the first player a definite advantage. We have experimentally shown that there is no clear advantage for the starting player in Tetris Link (section III-C).
We have implemented three game playing programs, based on common approaches in AI: heuristic search, MCTS, and reinforcement learning. Despite some effort, none of our programs was able to beat human players.
In doing so, we have obtained an understanding of why it may be hard to design a good AI for Tetris Link:
- Especially at the beginning, the branching factor is large, staying at around 160 for at least the first six turns.
- Many moves cannot be reversed. The unforgiving nature of these moves may make it harder to come up with a decent strategy, as has been postulated more generally in the literature.
- Many rewards in the game stack: they come delayed after multiple appropriate moves because groups of pieces count and not single pieces.
All this holds true for the simplified version we treat here: no dice, only two players. Adding up to two more players and dice will also make the game harder.
With a solid understanding of the game itself, we investigated different approaches for AI agents to play the game, namely heuristic, RL and MCTS. We have shown that all tested approaches can perform well against certain opponents. The best currently known algorithmic approach is the tuned heuristic, although it cannot consistently beat human players.
Training an RL agent (section V-B) for Tetris Link has proven to be complicated. Just getting the network to produce positive rewards required much trial and error, and in the end, the agent did not perform well even when consistently achieving a positive reward. We believe the learning difficulty in Tetris Link comes from the many opportunities to make minus points in the game. One turn offers at most one plus point, or three or more if a group is connected, but that means that the previous two or more turns gave at most zero points, if not minus points. Hence recovering from minus points is difficult, meaning small mistakes have graver consequences. (Note that the RL agent did not use MCTS-based self-play as in AlphaZero, but a neural network only, as used for Atari.)
Although MCTS performed poorly in our tournament, we have shown that with proper guidance through the tree, MCTS can perform well in Tetris Link and Hex (section V-A). That is why a combination in which RL guides MCTS through the tree, as in AlphaZero or MoHex v3, might work well and is something to try in future work.
We invite the research community to use our code and improve upon our approaches (https://github.com/Hizoul/masterthesis).
- (2019) Optuna: a next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2623–2631.
- (1826) The history and analysis of the supposed automaton chess player of M. de Kempelen: now exhibiting in this country, by Mr. Maelzel. Hilliard, Gray & Company.
- (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540.
- (2012) A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4 (1), pp. 1–43.
- (1997) How to lose at Tetris. The Mathematical Gazette 81 (491), pp. 194–200.
- (2005) Applying reinforcement learning to Tetris. Department of Computer Science, Rhodes University.
- (2019) Hyperstate space graphs for automated game analysis. In IEEE Conference on Games, CoG 2019, London, United Kingdom, August 20-23, 2019, pp. 1–8.
- (2006) Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, pp. 72–83.
- (2007) Jigsaw puzzles, edge matching, and polyomino packing: connections and complexity. Graphs and Combinatorics 23 (1), pp. 195–208.
- (1998) The branching factor of regular search spaces. In AAAI/IAAI, pp. 299–304.
- (2009) A lock-free multithreaded Monte-Carlo tree search algorithm. In Advances in Computer Games, pp. 14–20.
- (2014) Heuristic-based multi-agent Monte Carlo tree search. In IISA 2014, The 5th International Conference on Information, Intelligence, Systems and Applications, pp. 177–182.
- Move prediction using deep convolutional neural networks in Hex. IEEE Transactions on Games 10 (4), pp. 336–343.
- (1999) Rating the chess rating system. Chance 12, pp. 21–28.
- (2008) Plan-based reward shaping for reinforcement learning. In 2008 4th International IEEE Conference Intelligent Systems, Vol. 2, pp. 10–22.
- (2007) TrueSkill™: a Bayesian skill rating system. In Advances in Neural Information Processing Systems, pp. 569–576.
- (2018) Stable baselines. GitHub. https://github.com/hill-a/stable-baselines
- (1997) Computer chess, then and now: the Deep Blue saga. In Proceedings of Technical Papers. International Symposium on VLSI Technology, Systems, and Applications, pp. 153–156.
- Reinforcement learning: a survey. Journal of Artificial Intelligence Research 4, pp. 237–285.
- Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pp. 282–293.
- (1997) Lessons in neural network training: overfitting may be harder than expected. In AAAI/IAAI, pp. 540–545.
- (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
- Does experiential learning improve learning outcomes in an undergraduate course in game theory – a preliminary analysis.
- (1790) Analysis of the game of chess. P. Elmsly.
- (1996) Best-first fixed-depth minimax algorithms. Artificial Intelligence 87 (1-2), pp. 255–293.
- (2020) Learning to play: reinforcement learning and games. Universiteit Leiden.
- (2010) Biasing Monte-Carlo simulations through RAVE values. In International Conference on Computers and Games, pp. 59–68.
- (1985) What is a heuristic?. Computational Intelligence 1 (1), pp. 47–58.
- (2013) Combining simulated annealing and Monte Carlo tree search for expression simplification. arXiv preprint arXiv:1312.0841.
- (2016) Artificial intelligence: a modern approach. Pearson Education Limited.
- (2002) Games, computers and artificial intelligence. In Chips Challenging Champions: Games, Computers and Artificial Intelligence, pp. 3–9.
- (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484.
- (2017) Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354.
- (1889) On the different possible non-linear arrangements of eight men on a chess-board. Proceedings of the Edinburgh Mathematical Society 8, pp. 30–43.
- (2018) Reinforcement learning: an introduction. MIT Press.
- (1989) Neurogammon: a neural network backgammon learning program. Heuristic Programming in Artificial Intelligence: The First Computer Olympiad, Chichester, England.
- (2008) Application of reinforcement learning to the game of Othello. Computers & Operations Research 35 (6), pp. 1999–2017.
- (2002) Search and evaluation in Hex. Master of Science thesis, University of Alberta.
- (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, pp. 1–5.
- (2011) A Bayesian approximation method for online ranking. Journal of Machine Learning Research 12 (Jan), pp. 267–300.
- (2019-08-18) (Website).
- (2019-09-09) (Website).