AlphaGomoku: An AlphaGo-based Gomoku Artificial Intelligence using Curriculum Learning

09/27/2018, by Zheng Xie et al.

In this project, we combine the AlphaGo algorithm with Curriculum Learning to crack the game of Gomoku. Modifications such as the Double Networks Mechanism and Winning Value Decay are implemented to address the intrinsic asymmetry and short sight of Gomoku. Our final AI, AlphaGomoku, reached human playing level after two days' training on a single GPU.


I Introduction

Free style Gomoku is an interesting strategy board game with quite simple rules: two players alternately place black and white stones on a 15 by 15 board, and the winner is the first to reach a line of five or more consecutive stones of his or her color. It is popular among students, since it can be played with just a piece of paper and a pencil to pass the time in a boring class. It is also popular among computer scientists, since Gomoku is a natural playground for many artificial intelligence algorithms. Some powerful AIs have been created in this field, for example YiXin and RenjuSolver. However, most existing AIs are rule-based, requiring human experts to construct well-crafted evaluation functions. To some extent, such AIs are slaves of human will, with no ideas formed by themselves; they are more like smart machines than intelligences.

Go is a far more complicated board game than Gomoku, and cracking the game of Go has long been the holy grail of the whole AI community. In 2016, AlphaGo became the world's first AI to defeat a human Go grandmaster [3], and in 2017 it even mastered Go entirely from scratch, without human knowledge [1]. AlphaGo is, in general, a powerful universal solution to board games such as Chess and Shogi [2]. The most exciting part of AlphaGo is that it masters how to play through learning, without any form of human-based evaluation.

Therefore, given the great potential of AlphaGo, we deploy this algorithm on the game of Gomoku to construct an artificial intelligence, AlphaGomoku, that can learn how to play free style Gomoku. However, the customization is difficult because of some intrinsic properties of free style Gomoku, e.g. asymmetry and short sight. If we apply the AlphaGo method directly, training easily gets stuck in a poor local optimum where the white player almost resigns and the black player attacks blindly.

Hence, to address these issues, we use the Curriculum Learning [13] paradigm to smooth the training process. Modifications such as the Double Networks Mechanism and Winning Value Decay are implemented to alleviate the intrinsic issues of Gomoku. Through two days' training, AlphaGomoku has reached human-level performance.

All the source code of this project is published on GitHub: https://github.com/PolyKen/15_by_15_AlphaGomoku.

II Original Components of AlphaGo

II-A A Bionic Explanation of AlphaGo's Components

AlphaGo's decision-making system is comprised of two units, namely the policy-value network and Monte Carlo Tree Search (MCTS) [6], both of which have intuitive explanations in bionics.

The policy-value network functions like the human brain: it observes the current board and generates prior judgments of the current situation, analogous to human intuition. MCTS is similar to human meditation or contemplation: it simulates multiple possible outcomes starting from the current state.

Just as human meditation is guided by intuition, the simulation procedure of MCTS is controlled by the prior judgments generated by the policy-value network. Conversely, the policy-value network is trained according to the simulation results of MCTS, an analogy to how human intellect is enhanced through deep meditation.

The final playing decision is made from MCTS's simulation results, not from the prior judgments of the neural network, an analogy to how a rational person makes decisions through deep meditation rather than instant intuition.

In the following subsections, we will make the above discussions more precise.

II-B Policy-Value Network

The policy-value network $f_\theta$, with $\theta$ as the parameter, receives a tensor $s$ characterizing the current board and outputs a prior policy probability distribution $\vec{p}$ and a value scalar $v$. Formally, we can write $f_\theta$ as:

$(\vec{p}, v) = f_\theta(s)$   (1)

Specifically, in AlphaGomoku, $s$ is a $15 \times 15 \times 3$ tensor, where 15 is the width and length of the board and 3 is the number of channels. The first two channels consist of binary values representing the presence of the current player's stones and the opponent's stones respectively (an entry of the first channel equals one if the location is occupied by a stone of the current player's color, and zero if the location is either empty or occupied by the other color; the second channel is assigned analogously for the opponent). The third channel is the last-move channel, whose values are binary: an entry equals one if and only if the opponent's last move was at that location.
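As a concrete illustration, the encoding above can be written in a few lines of NumPy. The following is only a sketch under our own conventions (the function name, the 0/1/2 stone encoding, and the argument layout are assumptions, not taken from the released code):

```python
import numpy as np

def encode_board(board, current_player, last_move, size=15):
    """Build the size x size x 3 input tensor s described in the text.

    board[x][y]: 0 = empty, 1 = black, 2 = white (our own convention).
    current_player: 1 or 2, the player about to move.
    last_move: (x, y) of the opponent's last move, or None at game start.
    """
    s = np.zeros((size, size, 3), dtype=np.float32)
    opponent = 3 - current_player
    for x in range(size):
        for y in range(size):
            if board[x][y] == current_player:
                s[x, y, 0] = 1.0              # channel 1: current player's stones
            elif board[x][y] == opponent:
                s[x, y, 1] = 1.0              # channel 2: opponent's stones
    if last_move is not None:
        s[last_move[0], last_move[1], 2] = 1.0  # channel 3: opponent's last move
    return s
```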

$\vec{p}$ is a 225-dimensional vector whose $i$-th component $p_i$ represents the prior probability of placing the current stone at the $i$-th location of the board. $v$ represents the winning value of the current player, who is about to place his or her stone at the current playing stage. The bigger $v$ is, the more the network believes the current player is winning.

The network used here is a deep convolutional neural network [9] equipped with residual blocks [8], since these solve the degradation problem caused by network depth, which is essential for learning capability. Other mechanisms such as Batch-Normalization [10] are used to further improve the performance of our policy-value net. The architecture of the net consists of three parts:

  • Residual Tower: Receives the raw board tensor and conducts high-level feature extraction. The output of the residual tower is passed to the policy head and the value head separately.

  • Policy Head: Generates the prior policy probability distribution vector $\vec{p}$.

  • Value Head: Generates the winning value scalar $v$.

A detailed topology is shown in Table I, Table II, and Table III.

Layer
(1) Convolution of 32 filters of size 3 with stride 1
(2) Batch-Normalization
(3) ReLU Activation
(4) Convolution of 32 filters of size 3 with stride 1
(5) Batch-Normalization
(6) ReLU Activation
(7) Convolution of 32 filters of size 3 with stride 1
(8) Batch-Normalization
(9) Shortcut
(10) ReLU Activation
(11) Convolution of 32 filters of size 3 with stride 1
(12) Batch-Normalization
(13) ReLU Activation
(14) Convolution of 32 filters of size 3 with stride 1
(15) Batch-Normalization
(16) Shortcut
(17) ReLU Activation
TABLE I: Residual Tower
Layer
(1) Convolution of 2 filters of size 1 with stride 1
(2) Batch-Normalization
(3) ReLU Activation
(4) Flatten
(5) Dense Layer with 225-dim vector output
(6) Softmax Activation
TABLE II: Policy Head
Layer
(1) Convolution of 1 filter of size 1 with stride 1
(2) Batch-Normalization
(3) ReLU Activation
(4) Flatten
(5) Dense Layer with hidden vector output
(6) ReLU Activation
(7) Dense Layer with scalar output
(8) Tanh Activation
TABLE III: Value Head
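To make the three parts concrete, the sketch below assembles a network of the same shape as Tables I-III in Keras. It is only an approximation under our own assumptions: the hidden size of the value head's dense layer is not specified in the tables, so the value 32 used here is arbitrary.

```python
from tensorflow.keras import layers, models

def residual_block(x, filters=32):
    """Layers (4)-(10) / (11)-(17) of Table I: conv-BN-ReLU-conv-BN-shortcut-ReLU."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.add([y, shortcut])                 # shortcut connection
    return layers.Activation("relu")(y)

def build_policy_value_net(board_size=15, channels=3):
    inp = layers.Input(shape=(board_size, board_size, channels))
    # Residual tower (Table I): initial conv block, then two residual blocks
    x = layers.Conv2D(32, 3, padding="same")(inp)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = residual_block(x)
    x = residual_block(x)
    # Policy head (Table II)
    p = layers.Conv2D(2, 1)(x)
    p = layers.BatchNormalization()(p)
    p = layers.Activation("relu")(p)
    p = layers.Flatten()(p)
    p = layers.Dense(board_size * board_size, activation="softmax", name="policy")(p)
    # Value head (Table III)
    v = layers.Conv2D(1, 1)(x)
    v = layers.BatchNormalization()(v)
    v = layers.Activation("relu")(v)
    v = layers.Flatten()(v)
    v = layers.Dense(32, activation="relu")(v)    # hidden width: our assumption
    v = layers.Dense(1, activation="tanh", name="value")(v)
    return models.Model(inp, [p, v])
```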

II-C Monte Carlo Tree Search

MCTS $\alpha_\theta$, instructed by the policy-value net $f_\theta$, receives the tensor $s$ of the current board and outputs a policy probability distribution vector $\vec{\pi}$ generated from multiple simulations. Formally, we can write $\alpha_\theta$ as:

$\vec{\pi} = \alpha_\theta(s)$   (2)

Compared with those of $f_\theta$, the outputs of $\alpha_\theta$ are more well-considered, since MCTS deliberately chews over the current situation. Hence, $\vec{\pi}$ serves as the ultimate guidance for decision-making in the AlphaGo algorithm. There are three types of policies used in our AlphaGomoku program based on $\vec{\pi}$:

  • Stochastic Policy: The AI chooses its action $a$ randomly with respect to $\vec{\pi}$, i.e.

    $a \sim \vec{\pi}$   (3)

  • Deterministic Policy: The AI plays the current optimal move, i.e.

    $a = \arg\max_i \pi_i$   (4)

  • Semi-Stochastic Policy: At the very beginning of the game, the AI plays stochastically with respect to the policy distribution $\vec{\pi}$. As the game continues, after a user-specified stage, the AI adopts the deterministic policy.

These policies are used in different contexts, which we shall cover later.
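As a small illustration, all three policies reduce to a few lines of NumPy once $\vec{\pi}$ is available. The function name and the default switching stage below are our own assumptions:

```python
import numpy as np

def choose_move(pi, move_number, stochastic_stage=6):
    """Pick a move index from the MCTS policy vector pi (length 225).

    Semi-stochastic policy: sample from pi during the first
    `stochastic_stage` moves of the game, then switch to argmax.
    Setting stochastic_stage = 0 gives the pure deterministic policy,
    and a very large value gives the pure stochastic policy.
    """
    if move_number < stochastic_stage:
        return int(np.random.choice(len(pi), p=pi))   # stochastic: a ~ pi
    return int(np.argmax(pi))                          # deterministic: argmax
```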

Unlike the black-box policy-value network, we know exactly the logic inside the MCTS algorithm. For a search tree, each node represents a board situation and the edges from that node represent all possible moves the player can take in this situation. Each edge stores the following statistics:

  • Visit Count, $N(s,a)$: the number of visits of the edge. A larger $N(s,a)$ implies MCTS is more interested in this move. Indeed, the policy probability distribution $\vec{\pi}$ of state $s$ is derived from the visit counts of all edges from $s$ in a way that:

    $\pi_a = \dfrac{N(s,a)^{1/\tau}}{\sum_b N(s,b)^{1/\tau}}$   (5)

    where $\tau$ is the temperature parameter controlling the trade-off between exploration and exploitation. If $\tau$ is rather small, then the exponential operation magnifies the differences between components of $\vec{\pi}$ and therefore reduces the level of exploration.

  • Prior Probability, $P(s,a)$: the prior policy probability generated by the network $f_\theta$ after it evaluates the edge's root state $s$. A larger $P(s,a)$ indicates the network prefers this move and hence may guide MCTS to examine it carefully. Note that a large $P(s,a)$ does not guarantee a good move, since it is merely a prior judgment, an analogy to human instant intuition.

  • Mean Action Value, $Q(s,a)$: the mean of the values of all nodes in the subtree under this edge; it represents the average winning value of this move. We can write $Q(s,a)$ as:

    $Q(s,a) = \dfrac{1}{N(s,a)} \sum_{s'} \pm v(s')$   (6)

    where the sum runs over the states $s'$ in the subtree under the edge $(s,a)$. The tricky part here is the plus-minus sign "$\pm$": $v(s')$ is the winning value of the current player of $s'$, not necessarily the current player of $s$, while $Q(s,a)$ evaluates the winning chance of the current player of $s$ if he or she chooses this move, and therefore we need to adjust the sign of $v(s')$ accordingly.

  • Total Action Value, $W(s,a)$: the total sum of the values of all nodes in the subtree under this edge. $W(s,a)$ serves as an intermediate variable when we update $Q(s,a)$, since we have the relation $Q(s,a) = W(s,a) / N(s,a)$. A minimal code sketch of these edge statistics is given after this list.
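The sketch below shows one way to store these statistics and to turn visit counts into $\vec{\pi}$ according to Equation 5 (the class and function names are ours, not the released code):

```python
import numpy as np

class Edge:
    """Statistics attached to one (state, action) edge of the search tree."""
    def __init__(self, prior):
        self.N = 0        # visit count N(s, a)
        self.W = 0.0      # total action value W(s, a)
        self.Q = 0.0      # mean action value Q(s, a) = W / N
        self.P = prior    # prior probability P(s, a) from the policy-value net

def policy_from_visits(edges, tau=1.0):
    """Equation 5: pi_a is proportional to N(s, a)^(1 / tau)."""
    counts = np.array([e.N for e in edges], dtype=np.float64)
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()
```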

Before each action is played, MCTS simulates possible outcomes starting from the current board situation multiple times. Each simulation is made up of the following three steps:

  • Selection: Starting from the root board $s_0$, MCTS iteratively selects the edge $a_t$ such that:

    $a_t = \arg\max_a \left( Q(s_t, a) + U(s_t, a) \right)$   (7)

    at each board situation $s_t$ below $s_0$, until $s_t$ is a leaf node. The expression $Q(s_t,a) + U(s_t,a)$ is called the upper confidence bound, where:

    $U(s,a) = c_{puct} \, P(s,a) \, \dfrac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}$   (8)

    and $c_{puct}$ is a constant controlling the level of exploration. The upper confidence bound is designed to balance the mean action value $Q(s,a)$, the prior probability $P(s,a)$, and the visit count $N(s,a)$: MCTS tends to select moves with a large mean action value, a large prior probability, and a small visit count.

  • Evaluation and Expansion: Once MCTS encounters a leaf node, say $s_L$, which has not been evaluated by the network before, we let $f_\theta$ output its prior policy probability distribution $\vec{p}$ and winning value $v$. Then, we create $s_L$'s edges and initialize their statistics as $N(s_L,a) = 0$, $W(s_L,a) = 0$, $Q(s_L,a) = 0$, and $P(s_L,a) = p_a$ for each edge.

  • Backup: Once we finish the evaluation and expansion procedure, we traverse backwards along the path to the root and update the statistics of all the edges we pass in a way that $N(s,a) \leftarrow N(s,a) + 1$, $W(s,a) \leftarrow W(s,a) \pm v$, and $Q(s,a) \leftarrow W(s,a)/N(s,a)$. The plus-minus sign "$\pm$" here is to implement Equation 6.

Fig. 1 demonstrates the pipeline of the decision-making process of the AlphaGo algorithm.

Fig. 1: Decision-Making of AlphaGo Algorithm [1]

One tricky issue to note is the problem of end nodes, i.e. draw, win, or lose nodes, which have no legal child nodes. If an end node is encountered in simulation, instead of executing the evaluation and expansion procedure, we directly run the backup mechanism, where the backup value is -1 if it is a win (lose) node, or 0 if it is a draw node.
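Putting Equations 6 to 8 and the end-node rule together, a single simulation can be sketched roughly as follows. The node interface (`edges`, `child`, `is_end`, `expand`, and so on) and the value of `c_puct` are our own assumptions layered on the Edge statistics sketched earlier:

```python
import math

def simulate(root, net, c_puct=5.0):
    """One MCTS simulation: select with the upper confidence bound,
    evaluate/expand at a leaf (or apply the end-node rule), then back up."""
    node, path = root, []
    # Selection: descend while the node has edges, maximizing Q + U (Eq. 7, 8)
    while node.edges:
        total_n = sum(e.N for e in node.edges.values())
        action, edge = max(
            node.edges.items(),
            key=lambda kv: kv[1].Q
            + c_puct * kv[1].P * math.sqrt(total_n) / (1 + kv[1].N))
        path.append(edge)
        node = edge.child
    # Evaluation and expansion, or the end-node rule for terminal positions
    if node.is_end():
        v = -1.0 if node.is_win_or_loss() else 0.0   # draw nodes back up 0
    else:
        p, v = net.predict(node.board_tensor())
        node.expand(p)       # create edges with N = W = Q = 0 and P = p_a
    # Backup: walk back to the root, flipping the sign of v at every step
    for edge in reversed(path):
        v = -v               # switch to the parent player's point of view
        edge.N += 1
        edge.W += v
        edge.Q = edge.W / edge.N
```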

II-D Training Target of the Policy-Value Network

The key to the success of the AlphaGo algorithm is that the prior policy probability distribution $\vec{p}$ and winning value $v$ provided by the policy-value network narrow MCTS down to a much smaller search space containing more promising moves. The guidance of $\vec{p}$ and $v$ directly influences the quality of the moves chosen by MCTS, and therefore we need to train our network wisely to improve its predictive intellect. There are three criteria for a powerful policy-value network:

  • It can predict the winner correctly.

  • The policy distribution it provides is similar to a deliberate one, say the one simulated by MCTS.

  • It can generalize nicely.

To these ends, we apply the Stochastic Gradient Descent algorithm with Momentum [11] to minimize the following loss function:

$l = (z - v)^2 - \vec{\pi}^{T} \log \vec{p} + c \lVert \theta \rVert^2$   (9)

where $c$ is a parameter controlling the $L_2$ penalty and $z$ is the result of the whole game from the perspective of the player of the recorded move, i.e.

$z = \begin{cases} +1 & \text{the current player wins} \\ -1 & \text{the current player loses} \\ 0 & \text{draw} \end{cases}$   (10)

We can see the three terms in the loss function reflect the three criteria mentioned above: the squared error term rewards correct winner prediction, the cross-entropy term pushes $\vec{p}$ towards the deliberate MCTS policy $\vec{\pi}$, and the $L_2$ regularization term encourages generalization.
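For instance, with the Keras model sketched after Table III, Equation 9 can be approximated by combining categorical cross-entropy for the policy head, mean squared error for the value head, and SGD with momentum; this compile call is our own sketch, not the released training code:

```python
from tensorflow.keras.optimizers import SGD

model = build_policy_value_net()    # the sketch defined after Table III
model.compile(
    optimizer=SGD(learning_rate=1e-2, momentum=0.9),  # SGD with Momentum [11]
    loss={"policy": "categorical_crossentropy",       # the -pi^T log p term
          "value": "mean_squared_error"})             # the (z - v)^2 term
# The c * ||theta||^2 term corresponds to attaching L2 kernel regularizers
# to the convolutional and dense layers of the network (omitted above).
```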

II-E Multi-Threading Simulation

The time bottleneck of the decision-making process is the simulation of MCTS, which can be accelerated greatly using the multi-threading technique.

There are three points that need to be addressed to implement asynchronous simulation:

  • Expansion Conflict: If two threads happen to encounter the same leaf node and expand it simultaneously, the number of child nodes under this node will be wrong and a conflict occurs. To address this issue, we maintain an expanding list, which is a list of leaf nodes currently under expansion. When a thread encounters a leaf node, instead of expanding the node instantly, the thread checks the expanding list to see if the node is in it. If yes, the thread waits for a short period of time to let the other thread expand the node, and continues selecting after the other thread finishes its expansion. If no, it simply executes the normal expansion procedure.

  • Virtual Loss: To let different threads try as many different paths as possible, after each selection we virtually discredit the selected edge by temporarily increasing its visit count $N(s,a)$ and decreasing its total action value $W(s,a)$, deceiving other threads into choosing other edges. DeepMind terms the amount of this increase and decrease the virtual loss. Clearly, each thread needs to clear out the virtual loss in the backup procedure (a small sketch follows this list).

  • Dilution Problem: If the virtual loss and the number of threads are set too high, the simulation counts of the most promising moves may be diluted severely, since many threads are deceived into searching other, less promising edges. Therefore, the tuning of the related hyperparameters is crucial.
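As referenced in the Virtual Loss item above, a rough sketch of applying and reverting virtual loss on an edge might look as follows (the constant `VIRTUAL_LOSS` and the single global lock are our own simplifying assumptions):

```python
import threading

VIRTUAL_LOSS = 3              # assumed magnitude; needs the tuning noted above
lock = threading.Lock()

def apply_virtual_loss(edge):
    """Temporarily discredit an edge so other threads explore elsewhere."""
    with lock:
        edge.N += VIRTUAL_LOSS
        edge.W -= VIRTUAL_LOSS
        edge.Q = edge.W / edge.N

def revert_virtual_loss(edge, v):
    """Undo the virtual loss during backup and apply the real value v."""
    with lock:
        edge.N += 1 - VIRTUAL_LOSS
        edge.W += v + VIRTUAL_LOSS
        edge.Q = edge.W / edge.N
```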

III Customizing AlphaGo to the Game of Gomoku

III-A Asymmetry and Short Sight of Free Style Gomoku

Although AlphaGo is, in general, a universal board game algorithm which has also successfully mastered Chess and Shogi besides Go, it is still difficult to make the algorithm work on Gomoku without proper customization. As a matter of fact, our first attempt at directly applying the AlphaGo algorithm to Gomoku without customization failed, with the AI stuck in a poor local optimum.

The reasons for the difficulty lie in the fact that free style Gomoku is extremely asymmetric and short-sighted compared to Go, Chess, and Shogi.

Asymmetry: The black player, the side placing a stone first, has a greater advantage than the white player, which results in an unbalanced training dataset, i.e. black wins far more games than white. Training on such a dataset directly will easily corrupt the value branch of the network, making the white AI almost resign and the black AI arrogant, since they both mistakenly agree that black is winning regardless of the current circumstance. Besides, due to the asymmetry of the game, the strategies of black and white are quite different: black is prone to attack while white tends to defend. It is hard to master such divergence with a single network. As we observe, if we apply the original AlphaGo algorithm to Gomoku, the white AI will sometimes attack blindly, indicating that the white strategy is negatively influenced by the black strategy.

Short Sight: Unlike Go, Chess, and Shogi, where the global situation of the board determines the winning probability, Gomoku is more short-sighted: the local, recent situation is more important than the global, long-term situation. Therefore, the AI should decrease the winning value back-propagated from the simulation of the future, which is not implemented in the original AlphaGo algorithm.

To alleviate these issues, we propose the following modifications to customize the AlphaGo algorithm to free style Gomoku: the Double Networks Mechanism and Winning Value Decay, which are made clear in the subsections below.

III-B Double Networks Mechanism

To solve the problem of asymmetry, we construct two policy-value networks to learn the black and white strategies separately. Specifically, the black net is trained solely on the black moves and the white net solely on the white moves. In simulation, the black net or the white net is applied, depending on the color of the current expanding node, to give the prior policy distribution and predicted winning value of the node.

By implementing the Double Networks Mechanism, we see a significant improvement in AlphaGomoku's performance, since it now plays asymmetrically, which fits the traits of the game.

III-C Winning Value Decay

There are two kinds of backward processes in the original AlphaGo algorithm: one is the backup stage of simulation, where the predicted winning value of the leaf node is used to update the values of the nodes on the current simulation path, and the other is the labelling process, where each move's $z$ is set according to the final result of the game. To solve the problem of short sight, we let the values decay exponentially in both processes, and the resulting improvement is promising, since the AI now focuses more on recent events, which are dominantly significant in free style Gomoku.
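Both backward processes can incorporate the decay with one extra multiplication per step. The sketch below uses a hypothetical decay factor `gamma`; its actual value is a tunable hyperparameter that we do not specify here, and the sign-alternating labelling is our own reading of the scheme:

```python
def backup_with_decay(path, leaf_value, gamma=0.9):
    """Backup stage with winning value decay: gamma < 1 shrinks the influence
    of values propagated from far in the simulated future."""
    v = leaf_value
    for edge in reversed(path):
        v = -v * gamma        # flip perspective and decay by one step
        edge.N += 1
        edge.W += v
        edge.Q = edge.W / edge.N

def label_with_decay(num_moves, final_z, gamma=0.9):
    """Labelling stage: the last move gets final_z; earlier moves get
    sign-alternated, exponentially decayed labels."""
    labels = []
    z = final_z
    for _ in range(num_moves):
        labels.append(z)
        z = -z * gamma        # the previous move was made by the opponent
    return list(reversed(labels))
```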

IV Curriculum Learning

IV-A Intuition of Curriculum Learning

Curriculum Learning [13], a machine learning paradigm proposed by Bengio et al., mimics the way humans receive education. It introduces relatively easy concepts to the learning algorithm at the initial training stage and gradually increases the difficulty of the learning mission. By training like this, the learning algorithm can take advantage of previously learned basic concepts to ease the learning of higher-level abstractions. Bengio et al. have shown empirically that curriculum learning can accelerate the convergence of non-convex training and improve the quality of the local optimum obtained.

Back to the case of Gomoku: although the AlphaGo algorithm is capable of learning Go strategy without the guidance of human knowledge, it is computationally intractable for most organizations to conduct such learning, and the quality of the resulting AI cannot be guaranteed. Hence, we train AlphaGomoku in a curriculum learning pipeline to accelerate the training and secure the performance of the AI.

Specifically, our training pipeline can be divided into three phases: Learn Basic Rule, Imitate Mentor AI, and Self-Play Reinforcement Learning. In the first two phases, AlphaGomoku learns basic strategy and receives guidance from the mentor AI. In the final phase, AlphaGomoku learns from its own playing experience and enhances its playing performance. We can summarize the training pipeline using a famous Chinese saying:

The master teaches the trade, but apprentice’s skill is self-made.

Note that in all phases we need to train the black network and white network separately on corresponding moves. Fig. 2 demonstrates the general pipeline of our curriculum learning.

Fig. 2: Curriculum Learning

IV-B Learn Basic Rule

At the very beginning of training, we teach the basic strategy of Gomoku to AlphaGomoku. To be specific, we randomly generate eighty thousand basic moves and let the networks learn the moves using mini-batch stochastic gradient descent with momentum. The moves generated include basic attacks and basic defenses.

For an attack instance, we randomly generate a triple $(s, \vec{\pi}, z)$. $s$ is a situation, whose current player is black, containing a consecutive line of four black stones not blocked on either side. $\vec{\pi}$, the target policy distribution, is a one-hot vector whose entries are all zero except at the place which leads black to five. $z$, the target winning value, is set to one.

For a defense instance, we generate a triple $(s, \vec{\pi}, z)$. $s$ is a situation, whose current player is white, containing a consecutive line of four black stones blocked on only one side. $\vec{\pi}$, the target policy distribution, is a one-hot vector whose entries are all zero except at the place which prevents black from reaching five. $z$, the target winning value, is set to zero.
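A simplified sketch of generating one attack instance of this kind is shown below; the 0/1/2 board convention, the horizontal-only placement, and the choice of a single winning extension are our own simplifications of the generator described above:

```python
import random
import numpy as np

def make_attack_instance(size=15):
    """One 'open four' attack example: black to move, four black stones in a
    row with both ends empty; the target policy is one-hot at a winning
    extension and the target winning value z is 1."""
    board = np.zeros((size, size), dtype=int)   # 0 empty, 1 black, 2 white
    row = random.randint(0, size - 1)
    col = random.randint(1, size - 5)           # keep both open ends on the board
    for k in range(4):
        board[row, col + k] = 1                 # the unblocked four-in-a-row
    win_col = col + 4                           # one of the two winning moves
    pi = np.zeros(size * size, dtype=np.float32)
    pi[row * size + win_col] = 1.0              # one-hot target policy
    z = 1.0                                     # the current (black) player wins
    return board, pi, z
```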

IV-C Imitate Mentor Gomoku Artificial Intelligence

We implement a rule-based tree search Gomoku AI as the mentor of AlphaGomoku. The mentor AI competes against itself, and AlphaGomoku learns from these games using mini-batch stochastic gradient descent with momentum. As we observe, after learning from the mentor AI, AlphaGomoku has formed some advanced strategies such as "three-three", "four-four" and "three-four", and its playing style is quite similar to the mentor AI's.

When AlphaGomoku can successfully defeat the mentor AI by a great margin, we stop the imitation process to avoid over-fitting, and further enhance the performance of AlphaGomoku through self-play reinforcement learning.

IV-D Self-Play Reinforcement Learning

After the imitation, we train the networks through self-play reinforcement learning, where we maintain a single search tree guided by the double policy-value networks and let it compete with itself using the semi-stochastic policy. After each move, the search tree moves its root node to the move it takes and discards the remainder of the tree, until the game ends. We collect all the data generated in several games and sample uniformly from the collected data to train the networks. The structure of each training datum is:

$(s, \vec{\pi}, z)$   (11)

Note that Gomoku is invariant under rotation and reflection, and hence we can augment the training data by rotating and reflecting the board in seven different ways (giving eight symmetric positions in total, including the original). After each round of training, we evaluate the training effect by letting the trained network compete with the currently strongest model using the semi-stochastic policy. If the trained model wins, we set the newly trained model to be the currently strongest and discard the old model. If the trained model loses, we discard the newly trained model and keep the old one as the currently strongest. Fig. 3 demonstrates the pipeline of self-play reinforcement learning.
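The eight-fold augmentation can be written with NumPy as follows; the function name is ours, and the policy vector is reshaped to the board so that it is transformed consistently with the input planes:

```python
import numpy as np

def augment(s, pi, z, board_size=15):
    """Return the original sample plus its seven rotations and reflections.

    s has shape (board_size, board_size, channels); pi has length
    board_size * board_size."""
    pi_board = pi.reshape(board_size, board_size)
    samples = []
    for k in range(4):                        # 0, 90, 180, 270 degree rotations
        rot_s = np.rot90(s, k, axes=(0, 1))
        rot_pi = np.rot90(pi_board, k)
        samples.append((rot_s, rot_pi.flatten(), z))
        samples.append((np.flip(rot_s, axis=1),        # mirror each rotation
                        np.flip(rot_pi, axis=1).flatten(), z))
    return samples                            # eight symmetric samples in total
```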

We discuss two important questions from the above paragraph that are worth special attention:

  • Why do we need the evaluation procedure? To avoid poor local optima by discarding badly performing trained models.

  • Why do we adopt the semi-stochastic policy in self-play and evaluation? Firstly, we add randomness into our policy to conduct exploration, since more possible moves will be tried. Secondly, we let our model behave deterministically after a certain stage to avoid bad-quality data.

Another interesting observation is the variation of the time spent on each self-play game: the time first decreases and then increases. The reason for this phenomenon is that as the agent evolves over time, it first grasps the attacking technique, which shortens the game, and then learns the defending technique, which prolongs the game.

Fig. 3: Pipeline of Self-Play Reinforcement Learning [1]

V Experiment

V-A Mentor AI vs AlphaGomoku

We let the mentor AI compete with the version-14 AlphaGomoku for one hundred games, in which AlphaGomoku plays black for fifty games and white for fifty games. We can see clearly that AlphaGomoku dominates the competition and surpasses the mentor AI. Fig. 4 shows the statistics of the competition. Fig. 5 and Fig. 6 are two sample games between the mentor and AlphaGomoku.

Fig. 4: Statistics of Mentor vs AlphaGomoku
Fig. 5: Mentor(White) vs AlphaGomoku(Black)
Fig. 6: Mentor(Black) vs AlphaGomoku(White)

V-B Human vs AlphaGomoku

The version-14 AlphaGomoku uses the WeChat mini program Happy Gomoku to challenge random online players. Fig. 7 shows the statistics of our online test. Note that two of the games in the online test were played against a Macau Gomoku professional, and AlphaGomoku only loses to him when it plays the black stones. Fig. 8, Fig. 9, Fig. 10 and Fig. 11 are sample games of human vs AlphaGomoku.

Fig. 7: Statistics of Human vs AlphaGomoku
Fig. 8: Human(White) vs AlphaGomoku(Black)
Fig. 9: Human(White) vs AlphaGomoku(Black)
Fig. 10: Human(Black) vs AlphaGomoku(White)
Fig. 11: Human(Black) vs AlphaGomoku(White)

Acknowledgment

We would like to thank BaiAn Chen from Vthree.AI and MingWen Liu from ShiningMidas Private Fund for their generous help throughout the research. We are also grateful to ZhiPeng Liang and Hao Chen from Sun Yat-sen University for their support of the training process of our Gomoku AI. Without their support, it would have been hard for us to finish such a complicated task.

References