I Introduction
Free style Gomoku is an interesting strategy board game with quite simple rules: two players alternately place black and white stones on a 15 by 15 grid, and the winner is the first to form a line of five or more consecutive stones of his or her color. It is popular among students since it can be played with just a piece of paper and a pencil to kill boring class time. It is also popular among computer scientists since Gomoku is a natural playground for many artificial intelligence algorithms. Some powerful AIs have been created in this field, for example YiXin and RenjuSolver. However, most existing AIs are rule-based and require human experts to construct a well-crafted evaluation function. To some extent, such AIs are slaves of human will without ideas of their own. They are more like smart machines than intelligences.
Go is a far more complicated board game than Gomoku, and cracking the game of Go has long been the holy grail of the AI community. In 2016, AlphaGo became the world's first AI to defeat a human Go grandmaster [3], and in 2017 it mastered Go entirely from scratch without human knowledge [1]. AlphaGo is, in general, a powerful universal solution to board games such as Chess and Shogi [2]. The most exciting part of AlphaGo is that it learns how to play without any form of human-based evaluation.
Therefore, given the great potential of AlphaGo, we apply the algorithm to the game of Gomoku to construct an artificial intelligence, AlphaGomoku, that can learn how to play free style Gomoku. However, customization is difficult because of some intrinsic properties of free style Gomoku, e.g. asymmetry and short sight. If we directly apply the AlphaGo method, training easily gets stuck in a poor local optimum where white almost resigns and black attacks blindly.
Hence, to address these issues, we use the so-called Curriculum Learning [13] paradigm to smooth the training process. Modifications such as the Double Networks Mechanism and Winning Value Decay are implemented to alleviate the intrinsic issues of Gomoku. After two days of training, AlphaGomoku has reached human-level performance.
All the source code of this project is published on GitHub: https://github.com/PolyKen/15_by_15_AlphaGomoku.
II Original Components of AlphaGo
II-A Bionics' Explanation of AlphaGo's Components
AlphaGo's decision-making system comprises two units, i.e. the policy-value network and Monte Carlo Tree Search (MCTS) [6], both of which have intuitive explanations in bionics.
The policy-value network functions like the human brain: it observes the current board and generates prior judgments of the current situation, analogous to human intuition. MCTS is similar to human meditation or contemplation: it simulates multiple possible outcomes starting from the current state.
Just as human meditation is guided by intuition, the simulation procedure of MCTS is controlled by the prior judgments generated by the policy-value network. Conversely, the policy-value network is trained on the simulation results from MCTS, analogous to the way human intellect is enhanced through deep meditation.
The final playing decision is made from MCTS's simulation results, not from the prior judgments of the neural network, just as a rational person makes decisions through deep meditation rather than instant intuition.
In the following subsections, we will make the above discussions more precise.
II-B Policy-Value Network
The policy-value network $f_\theta$, with $\theta$ as the parameter, receives a tensor $s$ characterizing the current board and outputs a prior policy probability distribution $\vec{p}$ and a value scalar $v$. Formally, we can write:
$$(\vec{p}, v) = f_\theta(s) \qquad (1)$$
Specifically, in AlphaGomoku, $s$ is a $W \times W \times C$ tensor, where $W = 15$ is the width and length of the board and $C = 3$ is the number of channels. The first and second channels consist of binary values representing the presence of the current player's stones and the opponent's stones respectively (an entry of the first channel equals one if the location is occupied by a stone of the current color and zero if the location is either empty or occupied by the other color; the second channel is assigned analogously). The third channel is the last-move channel, whose entries are binary: an entry equals one if and only if the opponent's last move was at that location.
$\vec{p}$ is a vector whose $i$-th component represents the prior probability of placing the current stone at the $i$-th location of the board. $v$ represents the winning value of the current player, who is about to place his or her stone at the current playing stage. The larger $v$ is, the more strongly the network believes the current player is winning.

The network we use is a deep convolutional neural network [9] equipped with residual blocks [8], since residual blocks address the degradation problem caused by network depth, which is essential for learning capability. Other mechanisms such as Batch Normalization [10] are used to further improve the performance of our policy-value net. The architecture of the net consists of three parts:

Residual Tower: Receives the raw board tensor and conducts high-level feature extraction. The output of the residual tower is passed to the policy head and the value head separately.

Policy Head: Generates the prior policy probability distribution vector $\vec{p}$.

Value Head: Generates the winning value scalar $v$.
The detailed topology is shown in Table 1, Table 2, and Table 3.
Table 1: Residual tower
Layer
(1) Convolution of 32 filters of size 3 with stride 1
(3) ReLU Activation
(4) Convolution of 32 filters of size 3 with stride 1
(5) Batch Normalization
(6) ReLU Activation
(7) Convolution of 32 filters of size 3 with stride 1
(8) Batch Normalization
(9) Shortcut
(10) ReLU Activation
(11) Convolution of 32 filters of size 3 with stride 1
(12) Batch Normalization
(13) ReLU Activation
(14) Convolution of 32 filters of size 3 with stride 1
(15) Batch Normalization
(16) Shortcut
(17) ReLU Activation
Table 2: Policy head
Layer
(1) Convolution of 2 filters of size 1 with stride 1
(2) Batch Normalization
(3) ReLU Activation
(4) Flatten
(5) Dense Layer with 225-dim vector output
(6) Softmax Activation
Table 3: Value head
Layer
(1) Convolution of 1 filter of size 1 with stride 1
(2) Batch Normalization
(3) ReLU Activation
(4) Flatten
(5) Dense Layer with vector output
(6) ReLU Activation
(7) Dense Layer with scalar output
(8) Tanh Activation
II-C Monte Carlo Tree Search
MCTS $\alpha_\theta$, instructed by the policy-value net $f_\theta$, receives the tensor $s$ of the current board and outputs a policy probability distribution vector $\vec{\pi}$ generated from multiple simulations. Formally, we can write:
$$\vec{\pi} = \alpha_\theta(s) \qquad (2)$$
Compared with $\vec{p}$, the information in $\vec{\pi}$ is better thought out, since MCTS deliberately ruminates on the current situation. Hence, $\vec{\pi}$ serves as the ultimate guidance for decision-making in the AlphaGo algorithm. There are three types of policies used in our AlphaGomoku program based on $\vec{\pi}$:

Stochastic Policy: The AI chooses an action randomly according to $\vec{\pi}$, i.e.
$$a \sim \vec{\pi} \qquad (3)$$
Deterministic Policy: The AI plays the current optimal move, i.e.
$$a = \arg\max_i \pi_i \qquad (4)$$
Semi-Stochastic Policy: At the very beginning of the game, the AI plays stochastically according to the policy distribution $\vec{\pi}$. As the game continues, after a user-specified stage, the AI adopts the deterministic policy.
These policies are used in different contexts, which we shall cover later.
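The three policies can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's code: the function names and the `switch_after` parameter (the user-specified stage at which play becomes deterministic) are hypothetical, and the policy vector is assumed to be a plain list of probabilities indexed by board location.

```python
import random


def stochastic_move(pi, rng=random):
    """Sample a move index with probability proportional to pi."""
    return rng.choices(range(len(pi)), weights=pi, k=1)[0]


def deterministic_move(pi):
    """Return the index of the largest component of pi."""
    return max(range(len(pi)), key=lambda i: pi[i])


def semi_stochastic_move(pi, move_number, switch_after, rng=random):
    """Sample during the opening, then play greedily.

    `switch_after` is a hypothetical name for the user-specified stage.
    """
    if move_number < switch_after:
        return stochastic_move(pi, rng)
    return deterministic_move(pi)
```

Passing a seeded `random.Random` instance as `rng` makes the stochastic branch reproducible.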
Unlike the black-box policy-value network, we know exactly the logic inside the MCTS algorithm. In a search tree, each node represents a board situation and the edges from that node represent all possible moves the player can take in this situation. Each edge stores the following statistics:

Visit Count, $N(s,a)$: the number of visits of the edge. A larger $N(s,a)$ implies MCTS is more interested in this move. Indeed, the policy probability distribution $\vec{\pi}$ of state $s$ is derived from the visit counts of all edges from $s$ in a way that:
$$\pi_a = \frac{N(s,a)^{1/\tau}}{\sum_b N(s,b)^{1/\tau}} \qquad (5)$$
where $\tau$ is the temperature parameter controlling the trade-off between exploration and exploitation. If $\tau$ is rather small, then the exponentiation magnifies the differences between the components of $\vec{\pi}$ and therefore reduces the level of exploration.

Prior Probability, $P(s,a)$: the prior policy probability generated by the network after it evaluates the edge's root state $s$. A larger $P(s,a)$ indicates that the network prefers this move and hence may guide MCTS to examine it carefully. Note, however, that a large $P(s,a)$ does not guarantee a good move since it is merely a prior judgment, analogous to a human's instant intuition.

Mean Action Value, $Q(s,a)$: the mean of the values of all nodes of the subtree under this edge, representing the average winning value of this move. We can write:
$$Q(s,a) = \frac{1}{N(s,a)} \sum_{s'} \pm v_{s'} \qquad (6)$$
where $(\vec{p}_{s'}, v_{s'}) = f_\theta(s')$ for $s'$ in the subtree of $(s,a)$. The tricky part here is the plus-minus sign "$\pm$". $v_{s'}$ is the winning value of the current player of $s'$, not necessarily $s$'s current player, while $Q(s,a)$ evaluates the winning chance of the current player of $s$ if he or she chooses this move; therefore we need to adjust the sign of $v_{s'}$ accordingly.

Total Action Value, $W(s,a)$: the total sum of the values of all nodes of the subtree under this edge. $W(s,a)$ serves as an intermediate variable when we update $Q(s,a)$, since we have the following relation: $Q(s,a) = W(s,a) / N(s,a)$.
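To make the role of the temperature concrete, the following sketch turns a list of visit counts into a policy vector, assuming the usual form of Eq. (5) in which each count is raised to the power $1/\tau$ and the results are normalized. The function name is hypothetical.

```python
def visit_count_policy(visit_counts, tau=1.0):
    """Normalize N(s, a) ** (1 / tau) over all edges of a state."""
    powered = [n ** (1.0 / tau) for n in visit_counts]
    total = sum(powered)
    if total == 0:
        raise ValueError("no edge has been visited yet")
    return [x / total for x in powered]


# With tau = 1 the policy is proportional to the visit counts;
# a smaller tau concentrates probability on the most visited move.
proportional = visit_count_policy([10, 30, 60], tau=1.0)
sharpened = visit_count_policy([10, 30, 60], tau=0.5)
```

For the counts above, the most visited move receives probability 0.6 at $\tau = 1$ and 3600/4600 (about 0.78) at $\tau = 0.5$, which illustrates how a small temperature reduces exploration.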
Before each action is played, MCTS simulates possible outcomes starting from the current board situation multiple times. Each simulation consists of the following three steps:

Selection: Starting from the root board $s_0$, MCTS iteratively selects the edge $a$ such that:
$$a = \arg\max_{a} \big( Q(s,a) + U(s,a) \big) \qquad (7)$$
at each board situation $s$ under $s_0$ until $s$ is a leaf node. The expression $Q(s,a) + U(s,a)$ is called the upper confidence bound, where:
$$U(s,a) = c_{puct} \, P(s,a) \, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \qquad (8)$$
and $c_{puct}$ is a constant controlling the level of exploration. The upper confidence bound is designed to balance the mean action value $Q(s,a)$, the prior probability $P(s,a)$ and the visit count $N(s,a)$. MCTS tends to select moves with large mean action value, large prior probability and small visit count.

Evaluation and Expansion: Once MCTS encounters a leaf node, say $s_L$, which has not been evaluated by the network before, we let $f_\theta$ output its prior policy probability distribution $\vec{p}_L$ and winning value $v_L$. Then, we create $s_L$'s edges and initialize their statistics as $N(s_L, a_i) = 0$, $W(s_L, a_i) = 0$, $Q(s_L, a_i) = 0$, $P(s_L, a_i) = p_{L,i}$ for the $i$-th edge.

Backup: Once we finish the evaluation and expansion procedure, we traverse back along the path to the root and update the statistics of all the edges we pass in a way that $N(s,a) \leftarrow N(s,a) + 1$, $W(s,a) \leftarrow W(s,a) \pm v_L$, $Q(s,a) \leftarrow W(s,a) / N(s,a)$. The plus-minus sign "$\pm$" here implements Eq. (6).
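The three steps can be put together in a compact, single-threaded sketch. It is written under several assumptions rather than taken from the project: the class and function names are hypothetical, the network is replaced by a caller-supplied `evaluate(state)` returning a dict of priors and a value, the game rules are supplied as `next_state` and `terminal_value` callbacks, the default `c_puct` is arbitrary, and the selection rule is the standard PUCT form of Eqs. (7) and (8).

```python
import math


class Edge:
    def __init__(self, prior):
        self.N = 0          # visit count
        self.W = 0.0        # total action value
        self.Q = 0.0        # mean action value
        self.P = prior      # prior probability from the network
        self.child = None   # node reached by playing this move


class Node:
    def __init__(self, state):
        self.state = state
        self.edges = {}     # move -> Edge; empty until the node is expanded


def select(node, c_puct):
    """Pick the (move, edge) pair maximizing Q + U."""
    sqrt_total = math.sqrt(sum(e.N for e in node.edges.values()))

    def ucb(item):
        e = item[1]
        return e.Q + c_puct * e.P * sqrt_total / (1 + e.N)

    return max(node.edges.items(), key=ucb)


def simulate(root, evaluate, next_state, terminal_value, c_puct=5.0):
    """Run one simulation: selection, evaluation/expansion, backup."""
    node, path = root, []
    # Selection: descend until an unexpanded (or end) node is reached.
    while node.edges:
        move, edge = select(node, c_puct)
        path.append(edge)
        if edge.child is None:
            edge.child = Node(next_state(node.state, move))
        node = edge.child
    # Evaluation and expansion; end nodes are backed up directly.
    v = terminal_value(node.state)  # None if the game is not over
    if v is None:
        priors, v = evaluate(node.state)
        node.edges = {m: Edge(p) for m, p in priors.items()}
    # Backup: v is from the viewpoint of the player to move at the leaf,
    # so its sign alternates on the way back to the root.
    for edge in reversed(path):
        v = -v
        edge.N += 1
        edge.W += v
        edge.Q = edge.W / edge.N
```

A toy usage, where a state is just a move counter and the player to move at depth 2 has lost: `root = Node(0)`, then repeated calls to `simulate(root, lambda s: ({0: 0.5, 1: 0.5}, 0.0), lambda s, m: s + 1, lambda s: -1.0 if s >= 2 else None)`. The first call only expands the root; each later call adds one visit to a root edge.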
Fig. 1 demonstrates the pipeline of the decision-making process of the AlphaGo algorithm.
One tricky issue to note is the end node, e.g. a draw, win, or loss, which has no legal child nodes. If an end node is encountered in simulation, instead of executing the evaluation and expansion procedure, we directly run the backup mechanism, where the backup value is $1$ ($-1$) if it is a win (lose) node or $0$ if it is a draw node.
II-D Training Target of Policy-Value Network
The key to the success of the AlphaGo algorithm is that the prior policy probability distribution $\vec{p}$ and winning value $v$ provided by the policy-value network narrow MCTS to a much smaller search space with more promising moves. The guidance of $\vec{p}$ and $v$ directly influences the quality of the moves chosen by MCTS, and therefore we need to train our network wisely to improve its predictive power. A powerful policy-value network should meet three criteria:

It can predict the winner correctly.

The policy distribution it provides is similar to a deliberate one, say the one simulated by MCTS.

It can generalize nicely.
To these ends, we apply the Stochastic Gradient Descent algorithm with Momentum [11] to minimize the following loss function:
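$$l = (z - v)^2 - \vec{\pi}^{T} \log \vec{p} + c \lVert \theta \rVert^2 \qquad (9)$$
where $c$ is a parameter controlling the $L_2$ penalty and $z$ is the result of the whole game, i.e.
$$z = \begin{cases} 1 & \text{if the current player wins} \\ -1 & \text{if the current player loses} \\ 0 & \text{if the game is a draw} \end{cases} \qquad (10)$$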
The three terms in the loss function reflect the three criteria mentioned above.
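As a worked illustration, the sketch below evaluates such a loss for a single training sample, assuming the AlphaGo Zero form: a squared value error, a cross-entropy between the MCTS policy and the network policy, and an $L_2$ penalty on the parameters. The function name, the default penalty weight `c`, and the small `eps` guard inside the logarithm are assumptions made for this example.

```python
import math


def policy_value_loss(z, v, pi, p, theta, c=1e-4, eps=1e-12):
    """Loss of one sample: value error + policy cross-entropy + L2 penalty."""
    value_term = (z - v) ** 2                      # criterion 1: predict the winner
    policy_term = -sum(t * math.log(q + eps)       # criterion 2: match the MCTS policy
                       for t, q in zip(pi, p))
    l2_term = c * sum(w * w for w in theta)        # criterion 3: generalize
    return value_term + policy_term + l2_term


# Example: wrong value prediction, uniform policies, two toy parameters.
example = policy_value_loss(z=1.0, v=0.0, pi=[0.5, 0.5], p=[0.5, 0.5],
                            theta=[1.0, 2.0], c=0.1)
```

Here the value term is 1, the policy term is $\ln 2$, and the penalty is 0.5, so each criterion contributes a separate, inspectable part of the total.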
II-E Multi-Threading Simulation
The time bottleneck of the decision-making process is the MCTS simulation, which can be accelerated greatly using multi-threading.
Three points need to be addressed to implement the asynchronous simulation:

Expansion Conflict: If two threads happen to encounter the same leaf node and expand it simultaneously, then the number of child nodes under this node will be wrong and a conflict occurs. To address this issue, we maintain an expanding list, which is a list of leaf nodes under expansion. When a thread encounters a leaf node, instead of expanding the node instantly, it checks whether the node is in the expanding list. If so, the thread waits for a short period of time to let the other thread expand the node and continues selecting after the other thread finishes its expansion. If not, it executes the normal expansion procedure.

Virtual Loss: To let different threads try as many different paths as possible, after each selection we virtually discredit the selected edge by temporarily increasing its $N$ and decreasing its $W$, which steers other threads toward other edges. DeepMind terms the amount of increase and decrease the virtual loss. Each thread needs to clear out its virtual loss in the backup procedure.

Dilution Problem: If the virtual loss and the number of threads are set too high, then the simulation counts of the most promising moves may be diluted severely since many threads are steered toward less promising edges. Therefore, the tuning of the related hyperparameters is crucial.
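The virtual loss bookkeeping can be illustrated with a small sketch. The class below is hypothetical and only covers one edge: a thread adds the virtual loss right after selecting the edge and removes it again while recording the real result during backup. A per-edge lock is one simple way to keep the two counters consistent and is an assumption of this sketch.

```python
import threading


class EdgeStats:
    """Visit count and total value of one edge, safe to share across threads."""

    def __init__(self):
        self.N = 0
        self.W = 0.0
        self.lock = threading.Lock()

    @property
    def Q(self):
        return self.W / self.N if self.N else 0.0

    def add_virtual_loss(self, amount):
        """Called after selection: make the edge look visited and losing."""
        with self.lock:
            self.N += amount
            self.W -= amount

    def backup(self, value, amount):
        """Remove the virtual loss and record the real simulation result."""
        with self.lock:
            self.N += 1 - amount
            self.W += value + amount
```

While a simulation is in flight the edge's mean value is pulled toward -1, which lowers its upper confidence bound for the other threads; after `backup` only the real visit and value remain.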
III Customize AlphaGo to the Game of Gomoku
III-A Asymmetry and Short Sight of Free Style Gomoku
Although AlphaGo is, in general, a universal board game algorithm which has also successfully mastered Chess and Shogi besides Go, it is still difficult to make the algorithm work in Gomoku without proper customization. As a matter of fact, our first attempt at directly applying the AlphaGo algorithm to Gomoku without customization failed, with the AI stuck in a poor local optimum.
The reasons for the difficulty lie in the fact that free style Gomoku is extremely asymmetric and short-sighted compared to Go, Chess, and Shogi.
Asymmetry: The black player, who places the first stone, has a greater advantage than the white player, which results in an unbalanced training dataset, i.e. black wins far more games than white. Training on such a dataset directly will easily corrupt the value branch of the network, making the white AI almost resign and the black AI arrogant, since they both mistakenly agree that black is winning regardless of the current circumstances. Besides, because of the asymmetry of the game, the strategies of black and white are quite different: black is prone to attack while white tends to defend. It is hard to master such divergence with a single network. As we observe, if we apply the original AlphaGo algorithm to Gomoku, the white AI will sometimes attack blindly, indicating that the white strategy is negatively influenced by the black strategy.
Short Sight: Unlike Go, Chess, and Shogi, where the global situation of the board determines the winning probability, Gomoku is more short-sighted: the local, recent situation is more important than the global, long-term situation. Therefore, the AI should decrease the winning value back-propagated from the simulation of the future, which is not implemented in the original AlphaGo algorithm.
To alleviate these issues, we propose the following modifications to customize the AlphaGo algorithm to free style Gomoku: the Double Networks Mechanism and Winning Value Decay, which are described in the subsections below.
III-B Double Networks Mechanism
To solve the problem of asymmetry, we construct two policy-value networks to learn the black and white strategies separately. Specifically, the black net is trained solely on black moves and the white net is trained solely on white moves. In simulation, the black net or the white net is applied depending on the color of the node being expanded to give the prior policy distribution and predicted winning value of the node.
By implementing the Double Networks Mechanism, we see a significant improvement in AlphaGomoku's performance since it now plays asymmetrically, which fits the traits of the game.
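A minimal sketch of the two places where the mechanism shows up, namely choosing the network during expansion and splitting the training data by color, might look as follows. The function names and the tuple layout `(color, s, pi, z)` are assumptions made for illustration; the networks are treated as plain callables.

```python
def pick_network(to_move, black_net, white_net):
    """Return the network matching the color of the node being expanded."""
    return black_net if to_move == "black" else white_net


def split_samples_by_color(samples):
    """Split (color, s, pi, z) samples so each net trains on its own moves."""
    black, white = [], []
    for color, s, pi, z in samples:
        (black if color == "black" else white).append((s, pi, z))
    return black, white
```

During simulation, `pick_network(color_of_leaf, black_net, white_net)(state)` would then play the role of the single network evaluation in the original algorithm.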
III-C Winning Value Decay
Two kinds of backward processes happen in the original AlphaGo algorithm: one is the backup stage of simulation, where the predicted winning value of the leaf node is used to update the values of the nodes on the current simulation path, and the other is the labelling process of $z$, where each move's $z$ is set according to the final result of the game. To solve the problem of short sight, we let the values decay exponentially in both processes. The resulting improvement is promising since the AI now focuses more on recent events, which are dominantly significant in free style Gomoku.
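The text does not give the decay formula, so the following is only one plausible reading: the magnitude of a value is multiplied by a factor `gamma` for every move separating it from its source, while its sign still alternates between the two players. The function names, the default `gamma`, and the choice to apply the decay from the first step onward are all assumptions of this sketch.

```python
def decayed_labels(num_moves, gamma=0.9):
    """Label each move of a decisive game, assuming the last mover won.

    The label is from the viewpoint of the player who made that move and
    shrinks exponentially with the distance to the end of the game.
    """
    labels = []
    for t in range(num_moves):
        steps_to_end = num_moves - 1 - t
        sign = 1.0 if steps_to_end % 2 == 0 else -1.0
        labels.append(sign * gamma ** steps_to_end)
    return labels


def decayed_backup_values(leaf_value, depth, gamma=0.9):
    """Values added to W along the path, from the edge nearest the leaf upward."""
    values, v = [], leaf_value
    for _ in range(depth):
        v = -gamma * v   # flip viewpoint and shrink at every step
        values.append(v)
    return values
```

For example, with `gamma = 0.5` a four-move game is labelled `[-0.125, 0.25, -0.5, 1.0]`, so positions far from the decisive move carry much weaker training signals than the final ones.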
IV Curriculum Learning
IV-A Intuition of Curriculum Learning
Curriculum Learning [13], a machine learning paradigm proposed by Bengio et al., mimics the way humans receive education. It introduces relatively easy concepts to the learning algorithm at the initial training stage and gradually increases the difficulty of the learning task. Trained this way, the learning algorithm can take advantage of previously learned basic concepts to ease the learning of higher-level abstractions. Bengio et al. have shown empirically that curriculum learning can accelerate the convergence of non-convex training and improve the quality of the local optimum obtained.
Back to the case of Gomoku: although the AlphaGo algorithm is capable of learning Go strategy without the guidance of human knowledge, it is computationally intractable for most organizations to conduct such learning, and the quality of the AI obtained cannot be guaranteed. Hence, we train AlphaGomoku in a curriculum learning pipeline to accelerate the training and secure the performance of the AI.
Specifically, our training pipeline can be divided into three phases: Learn Basic Rule, Imitate Mentor AI, and Self-play Reinforcement Learning. In the first two phases, AlphaGomoku learns basic strategy and receives guidance from the mentor AI. In the final phase, AlphaGomoku learns from its own playing experience and enhances its playing performance. We can summarize the training pipeline using a famous Chinese saying:
The master teaches the trade, but the apprentice's skill is self-made.
Note that in all phases we need to train the black network and the white network separately on the corresponding moves. Fig. 2 demonstrates the general pipeline of our curriculum learning.
IV-B Learn Basic Rule
At the very beginning of training, we teach the basic strategy of Gomoku to AlphaGomoku. To be specific, we randomly generate eighty thousand basic moves and let the networks learn the moves using mini-batch stochastic gradient descent with momentum. The moves generated include basic attack and basic defense.
For an attack instance, we randomly generate $(s, \vec{\pi}, z)$. $s$ is a situation, whose current player is black, containing a consecutive line of four black stones that is not blocked on both sides. $\vec{\pi}$, the target policy distribution, is a one-hot vector whose entries are zero except at the place which leads black to five. $z$, the target winning value, is set to one.
For a defense instance, we generate $(s, \vec{\pi}, z)$. $s$ is a situation, whose current player is white, containing a consecutive line of four black stones that is blocked on only one side. $\vec{\pi}$, the target policy distribution, is a one-hot vector whose entries are zero except at the place which prevents black from reaching five. $z$, the target winning value, is set to zero.
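As an illustration of how such instances could be generated, the sketch below builds one defense instance. It is deliberately simplified and is not the project's generator: only horizontal lines are produced, the board is returned as lists of stone coordinates rather than as the input tensor, no other stones are placed, and all names are hypothetical.

```python
import random

BOARD = 15  # board width and length


def defense_instance(rng=random):
    """Four black stones in a row, blocked on one side; white to move."""
    row = rng.randrange(BOARD)
    start = rng.randrange(1, BOARD - 4)          # keep one cell free on each side
    black = [(row, start + k) for k in range(4)]
    left, right = (row, start - 1), (row, start + 4)
    # One end is already blocked by white; the other is the move to learn.
    blocked, open_end = (left, right) if rng.random() < 0.5 else (right, left)
    white = [blocked]
    pi = [0.0] * (BOARD * BOARD)                 # one-hot target policy
    pi[open_end[0] * BOARD + open_end[1]] = 1.0
    z = 0.0                                      # target winning value
    return black, white, pi, z
```

An attack instance would follow the same pattern with black to move, the one-hot entry at a cell that completes five, and the target value set to one.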
IV-C Imitate Mentor Gomoku Artificial Intelligence
We implement a rule-based tree search Gomoku AI as the mentor of AlphaGomoku. The mentor AI competes against itself and AlphaGomoku learns from these games using mini-batch stochastic gradient descent with momentum. As we observe, after learning from the mentor AI, AlphaGomoku has formed some advanced strategies such as "three-three", "four-four" and "three-four", and its playing style is quite similar to that of the mentor AI.
When AlphaGomoku can defeat the mentor AI by a large margin, we stop the imitation process to avoid overfitting and further enhance the performance of AlphaGomoku through self-play reinforcement learning.
IV-D Self-Play Reinforcement Learning
After the imitation, we train the networks through self-play reinforcement learning, where we maintain a single search tree guided by the double policy-value networks and let it compete with itself using the semi-stochastic policy. After each move, the search tree moves its root node to the move taken and discards the remainder of the tree; this repeats until the game ends. We collect all the data generated in several games and sample uniformly from the collected data to train the networks. The structure of each training sample is:
$$(s, \vec{\pi}, z) \qquad (11)$$
Note that Gomoku is invariant under rotation and reflection, and hence we can augment the training data by rotating and reflecting the board in seven different ways. After each training, we evaluate the training effect by letting the trained network compete with the currently strongest model using the semi-stochastic policy. If the trained model wins, we set the newly trained model to be the currently strongest and discard the old model. If the trained model loses, we discard the newly trained model and keep the old one as the currently strongest. Fig. 3 demonstrates the pipeline of self-play reinforcement learning.
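The augmentation step can be sketched with plain nested lists. The eight results below are the original grid plus its seven rotated and reflected versions; the function names are hypothetical. In practice the same transform has to be applied to every channel of the board tensor and to the policy vector reshaped as a 15 by 15 grid, so that moves stay aligned with stones.

```python
def rotate90(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def reflect(grid):
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def symmetries(grid):
    """Return the 8 dihedral transforms: the original and seven others."""
    out = []
    g = [list(row) for row in grid]
    for _ in range(4):
        out.append(g)
        out.append(reflect(g))
        g = rotate90(g)
    return out
```

The winning value label is unchanged by these transforms, so each self-play sample yields eight training samples at no extra simulation cost.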
We discuss two important questions about the above paragraph that are worth special attention:

Why do we need the evaluation procedure? To avoid poor local optima by discarding badly performing trained models.

Why do we adopt the semi-stochastic policy in self-play and evaluation? Firstly, we add randomness to our policy to encourage exploration, since more possible moves will be tried. Secondly, we let our model behave discreetly after a certain stage to avoid low-quality data.
Another interesting observation is the variation of the time spent in each self-play game. The time first decreases and then increases. The reason is that as the agent evolves, it first grasps the attacking technique, which shortens the game, and then learns the defending technique, which prolongs the game.
V Experiment
V-A Mentor AI vs AlphaGomoku
We let the mentor AI compete with version-14 AlphaGomoku for one hundred games, in which AlphaGomoku plays black and white for fifty games each. AlphaGomoku clearly dominates the games and surpasses the mentor AI. Fig. 4 shows the statistics of the competition. Fig. 5 and Fig. 6 are two sample games between the mentor AI and AlphaGomoku.
V-B Human vs AlphaGomoku
Version-14 AlphaGomoku uses the WeChat mini program Happy Gomoku to challenge random online players. Fig. 7 shows the statistics of our online test. Note that two of the games of the online test were played by a Macau Gomoku professional, and AlphaGomoku only lost to him when it played black. Fig. 8, Fig. 9, Fig. 10 and Fig. 11 are sample games of human vs AlphaGomoku.
Acknowledgment
We would like to thank BaiAn Chen from Vthree.AI and MingWen Liu from ShiningMidas Private Fund for their generous help throughout the research. We are also grateful to ZhiPeng Liang and Hao Chen from Sun Yat-sen University for their support of the training process of our Gomoku AI. Without their support, it would have been hard for us to finish such a complicated task.
References
 [1] Silver D, Schrittwieser J, Simonyan K, et al. Mastering the game of Go without human knowledge[J]. Nature, 2017, 550(7676): 354.
 [2] Silver D, Hubert T, Schrittwieser J, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm[J]. arXiv preprint arXiv:1712.01815, 2017.
 [3] Silver D, Huang A, Maddison C J, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature, 2016, 529(7587): 484.
 [4] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529.
 [5] Sutton R S, Barto A G. Reinforcement learning: An introduction[M]. MIT press, 1998.
 [6] Browne C B, Powley E, Whitehouse D, et al. A survey of Monte Carlo tree search methods[J]. IEEE Transactions on Computational Intelligence and AI in Games, 2012, 4(1): 1-43.
 [7] Goodfellow I, Bengio Y, Courville A. Deep learning[M]. Cambridge: MIT press, 2016.
 [8] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.
 [9] Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks[C]//Advances in neural information processing systems. 2012: 1097-1105.
 [10] Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift[J]. arXiv preprint arXiv:1502.03167, 2015.
 [11] Ruder S. An overview of gradient descent optimization algorithms[J]. arXiv preprint arXiv:1609.04747, 2016.
 [12] Knuth D E, Moore R W. An analysis of alpha-beta pruning[J]. Artificial Intelligence, 1975, 6(4): 293-326.
 [13] Bengio Y, et al. Curriculum learning[C]//Proceedings of the 26th annual international conference on machine learning. ACM, 2009.