1 Introduction
The longstanding challenge in artificial intelligence of playing Go at professional human level has been successfully tackled in recent works [5, 7, 6], where software tools (AlphaGo, AlphaGo Zero, AlphaZero) combining neural networks and Monte Carlo tree search reached superhuman level. A recent development was Leela Zero [4], an open source software whose neural network is trained over millions of games played in a distributed fashion, thus allowing improvements within reach of the resources of the academic community.
However, all these programs suffer from a relevant limitation: it is impossible to target their victory margin. They are trained with a fixed komi of 7.5 and they are built to maximize just the winning probability, not considering the score difference.
This has several negative consequences for these programs: when they are ahead, they choose suboptimal moves and often win by a small margin; they cannot be used with komi 6.5, which is also common in professional games; they play poorly in handicap games, since the winning probability is not a relevant attribute in those situations.
In principle all these problems could be overcome by replacing the binary reward (win=1, lose=0) with the game score difference, but the latter is known to be less robust [3, 8], and the strongest programs have generally used the former since the seminal works [1, 3, 2].
Indeed, letting the score difference be the reward for the AlphaGo Zero method, where averages of the value are computed over different positions, would lead to situations in which a low probability of winning by a huge margin could outweigh a high probability of winning by 0.5 points in the MCTS search, resulting in weaker play.
An improvement that would keep the robustness of estimating winning probabilities, but at the same time overcome these limitations, would be the ability to set the initial bonus for the white player (komi) to an arbitrary value. The agent would then maximize the winning probability with a variable virtual bonus/malus, resulting in flexible play able to adapt to positions in which it is ahead or behind, taking into account implicit information about the score difference.
The first attempt in this direction gave unclear results [9].
In this work we propose a model to pursue this strategy, and as a proof of concept of its value we apply it to the game of Go on a 7×7 goban. We discuss the results of our experiment and propose a way forward.
The source code of the SAI fork of Leela Zero and of the corresponding server can be found on GitHub at https://github.com/sai-dev/sai and https://github.com/sai-dev/sai-server.
2 General ideas
In this section we explain the general ideas of our project. Many details, both theoretical and applied, are deferred to the following sections.
2.1 Probability of winning
The probability of victory of the current player depends on the state $s$. For the sake of generality we include a second parameter, i.e. a number $x$ of virtual bonus points the player is given. We denote this probability by $\sigma_s(x)$, which will be our standard notation. When trying to win by some amount of points $n$, the agent may let $x=-n$ to ponder its chances. When trying to recover from a losing position (maybe because of handicap) where it estimates being $m$ points behind, the agent may initially let $x$ be a positive integer less than $m$ to try to recover some points to start.
Since $\sigma_s(x)$ as a function of $x$ must be increasing and map the real line onto $(0,1)$, a family of sigmoid functions is a natural choice:
(1) $\sigma_s(x) := \sigma\bigl(\beta_s(\alpha_s + k_s + x)\bigr)$
Here we set
(2) $\sigma(t) := \dfrac{1}{1+e^{-t}}$
The number $k_s$ is the signed komi, i.e. if the real komi of the game is $k$, we set $k_s = k$ if at $s$ the current player is white and $k_s = -k$ if it is black.
The number $\alpha_s$ is a shift parameter: since $\sigma_s(x) = 1/2$ exactly when $k_s + x = -\alpha_s$, it represents the expected difference of points on the board from the perspective of the current player.
The number $\beta_s$ is a scale parameter: the higher it is, the steeper is the sigmoid, generally meaning that the result is set.
The highest meaningful value of $\beta_s$ is of the order of 10, since at the end of the game, when the score on the board is set, $\sigma_s$ must go from about 0 to about 1 by increasing its argument by one single point.
The lowest meaningful value of $\beta_s$ for the full 19×19 board is of the order of , since at the start of the game, even for a very weak agent it would be impossible to lose with a 361.5 points komi in favor.
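As an illustration of the sigmoid family described above, here is a minimal Python sketch. It assumes the logistic form $\sigma(\beta_s(\alpha_s + k_s + x))$, with $\alpha_s$ the expected board score difference for the player to move, $\beta_s$ the scale, $k_s$ the signed komi and $x$ the virtual bonus. The function names and the numeric values are hypothetical and chosen only for the example.

```python
import math


def signed_komi(komi: float, white_to_move: bool) -> float:
    """Komi seen from the player to move: +komi for white, -komi for black."""
    return komi if white_to_move else -komi


def logistic(t: float) -> float:
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def winrate(alpha: float, beta: float, k_signed: float, x: float = 0.0) -> float:
    """Estimated winning probability of the player to move with x bonus points.

    Assumed form (not taken verbatim from the text):
    logistic(beta * (alpha + k_signed + x)).
    """
    return logistic(beta * (alpha + k_signed + x))


# Hypothetical position: black to move, estimated 9 points ahead on the board,
# komi 7.5, scale parameter beta = 1.
k_s = signed_komi(7.5, white_to_move=False)
p_now = winrate(9.0, 1.0, k_s)                  # chances with the real komi
p_with_malus = winrate(9.0, 1.0, k_s, x=-2.0)   # chances after giving up 2 points
```

With these numbers the bonus that makes the position even is $x = -1.5$, which matches the reading of $\alpha_s + k_s$ as the expected lead of the player to move.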
2.2 Neural network
AlphaGo, AlphaGo Zero, AlphaZero and Leela Zero all share the same core structure, with a neural network (two neural networks, in the case of AlphaGo) that for every state $s$ gives

a probability distribution over the possible moves (the policy);

a real number (the value).
The policy is trained to choose the most promising moves for searching the tree of subsequent positions.
The value is trained to estimate the probability of winning for the current player. (In some programs the value is mapped to $[-1,1]$ instead of $[0,1]$, but this doesn't change its meaning and use.)
We propose a modification of the Leela Zero neural network that for every state $s$ gives the usual policy, and the two parameters $\alpha_s$ and $\beta_s$ described above instead of the value.
The net will feature the usual main structure, based on a 3×3 residual convolutional tower, topped with 3 heads, for the 3 outputs to estimate.
The value head in Leela Zero is formed by a 1×1, 1-filter convolution (with batch normalization and ReLU as usual), followed by a 256-unit fully-connected layer, ReLU, then a 1-unit fully-connected layer and finally a tanh transformation. It is trained against the outcome of the game with $L^2$ loss function.
We will introduce several slightly different net structures, but the basic idea is to duplicate the value head up to the single unit output. These two heads would yield the final value of $\alpha_s$ and a raw output that is transformed to $\beta_s$ by
(3) 
The exponential transform imposes the natural condition that $\beta_s$ is always positive. The constant in (3) is clearly redundant when the net is fully trained, but the first numerical experiments show that it may be useful to tune the training process at the very beginning, when the net weights are almost random, because otherwise $\beta_s$ would be of the order of 1, which is much too large for random play, yielding training problems.
The two outputs will be trained with the usual loss function, but with the value substituted with the winning probability given by the sigmoid (1).
Remark 1.
It must be noted that, from a theoretical point of view, for each state $s$ we are trying to train two parameters $\alpha_s$ and $\beta_s$ from a single output (i.e. the game's outcome). This is expected to work to some extent, thanks to the generalizing abilities of neural networks, but it makes the training more difficult, in particular at the beginning of the reinforcement learning pipeline.
We stress that it would be much better to have at least two finished games (with different komi) for many training states $s$. More on this in the next section.
Remark 2.
The basic approach proposed here was chosen to be simple and lightweight. The komi value of the game is exclusively used in a rigid way to compute the probability of winning. In particular it is not used to modify the estimated policy probabilities. This may have to be changed in the future. A different choice would have been, for example, to feed the komi value into the input planes.
2.3 Training data
Training of Go neural networks with multiple komi evaluation is a challenge on its own.
A supervised approach appears infeasible, since large databases of games typically have standard komi values of 6.5, 7.5 or so, and moreover it is not possible to estimate final territory reliably for them.
Unsupervised learning asks for the creation of millions of games even when the komi value is fixed. If that had to be made variable, then theoretically millions of games would be needed for each komi value. (Footnote: The argument that one can play the games to the end and then score under multiple komi does not work here, because this does not allow estimating the $\beta$ parameter. Moreover that approach would rely on the agent of the self-plays converging to score-perfect play, while the current approach is satisfied with convergence to winning-perfect play.)
Moreover, games started with komi very different from the natural values may well be weird, wrong and useless for training, unless one is able to provide agents with different strength.
Finally, the fact that we need to estimate two values instead of one would certainly work better if there were two games with different komi starting from every position, because of the issue mentioned in Remark 1.
We propose a solution to this problem, by dropping the usual choice that selfplay games for training always start from the initial empty board position.
The proposed procedure is the following.

Start a game from the empty board with random komi near the natural one.

For each state $s$ in the game, take note of the estimated value of $\alpha_s$.

After the game is finished, look for states in which the estimated score difference is large in absolute value: these are positions in which one of the sides was estimated to be ahead by that many points.

With some probability start a new game from such states, with the komi corrected by the same number of points, in such a way that the new game starts with even chances of winning, but with a komi very different from the natural one.

Iterate from the start.
With this approach games branch when they become uneven, generating fragments of games with natural situations in which a large komi may be given without compromising the style of game.
Moreover, the starting fuseki positions, which are greatly overrepresented in the training data with the typical naive approach, are in this way much less frequent.
Finally, many (though not all) training states are in fact branching points for which there exist two games with different komi, which is expected to yield easier training.
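The branching step of the procedure above can be sketched as follows. This is only an illustration under stated assumptions: a finished game is represented as a list of `(alpha, white_to_move)` pairs, the estimated lead of the player to move is taken to be `alpha` plus the signed komi, and the names `pick_branches`, `threshold` and `prob` (with their default values) are hypothetical.

```python
import random


def pick_branches(game, komi, threshold=5.0, prob=0.1, rng=None):
    """Select branch points of a finished game and the komi for each new game.

    game: list of (alpha, white_to_move) pairs, where alpha is the estimated
          board score difference from the point of view of the player to move.
    komi: komi (points given to white) of the finished game.
    Returns a list of (position_index, new_komi) pairs.
    """
    rng = rng or random.Random()
    branches = []
    for index, (alpha, white_to_move) in enumerate(game):
        k_signed = komi if white_to_move else -komi
        lead = alpha + k_signed  # estimated lead of the player to move
        if abs(lead) >= threshold and rng.random() < prob:
            # Komi that the network regards as even: black's estimated
            # lead on the board, whoever is to move.
            new_komi = -alpha if white_to_move else alpha
            branches.append((index, new_komi))
    return branches


# Hypothetical three-position game fragment played with komi 9.5.
example_game = [(9.0, False), (-8.5, True), (20.0, False)]
all_branches = pick_branches(example_game, 9.5, threshold=5.0, prob=1.0)
```

In this example only the last position is uneven enough to branch, and the new game would start from it with a komi of 20 points.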
2.4 Agent behaviour
In Leela Zero, the agent chooses moves according to information obtained during a Monte Carlo Tree Search (MCTS) inside the tree of possible futures. The MCTS is guided by the policy and value estimated for each state. The precise algorithm, which will be detailed in Section 4.3.1, is often referred to as multi-armed bandit and the tree which it defines iteratively is called upper confidence tree (UC tree).
The procedure is actually quite noisy, due to the useful but very imprecise network evaluation of states. MCTS exploration with very high visit number is then able to mitigate this potential weakness thanks to two smart choices:

the evaluation of the winning probability of an intermediate state $s$ is the average of the value over the subtree of states rooted at $s$, instead of the typical minimax that is expected in these situations;

the final selection of the move to play is done, at the root of the MCTS tree, by maximizing the number of playouts instead of the winning probability.
Both these choices are very robust to defects of the neural network and hence we chose to incorporate them in our new agent, which is designed to be able to win by large score differences. In other words, we maintained this procedure and just designed a new value $\nu$. We actually designed a parametric family of values $\nu_\lambda$, containing the plain winning probability as a special case, as follows.
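The idea is to define $\nu_\lambda$ as the average of $\sigma_s(x)$ for $x$ on an interval ranging from $0$ to a level of bonus/malus points that would make the game closer to be even, therefore under- or overestimating the probability of victory.
Let
$$\bar{x}_s := -\alpha_s - k_s.$$
Notice that if the signed komi was $k_s + \bar{x}_s$, then the game would be even, i.e. $\sigma_s(\bar{x}_s) = 1/2$.
We want the point $x_\lambda$ to be chosen between $0$ and $\bar{x}_s$ in such a way that it is equal to $0$ when $\lambda=0$ and to $\bar{x}_s$ when $\lambda=1$. To this end, let
$$\pi_\lambda := (1-\lambda)\,\sigma_s(0) + \lambda\,\tfrac12,$$
that is, $\pi_\lambda$ is a probability intermediate between the true one $\sigma_s(0)$ (with komi $k$) and the even probability $1/2$, and is closer to $1/2$ if $\lambda$ is higher. Then, let $x_\lambda$ be the unique real number such that
$$\sigma_s(x_\lambda) = \pi_\lambda,$$
that is, the level of bonus/malus points that would make the probability of victory $\pi_\lambda$.
Finally, let
$$\nu_\lambda(s) := \frac{1}{x_\lambda(r)}\int_0^{x_\lambda(r)} \sigma_s(x)\,dx,$$
where $r$ is the root node of the UC tree for which the value of $\nu_\lambda$ is needed (more details in Section 4.3).
As a result we have a family of agents, parametrized by $\lambda$. If $\lambda=0$ the agent simply tries to maximize the winning probability, as the Leela Zero agent did, even though this probability is estimated using $\alpha_s$ and $\beta_s$ instead of the value head. If $\lambda>0$, the behaviour depends on whether the current player is winning or not, that is, on whether the estimated winning probability is bigger or smaller than $1/2$.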

When the current player is winning, that is, the estimated winning probability is above $1/2$, agents with a high $\lambda$ try to get a higher probability of winning even with a score malus ($x_\lambda<0$), thus they aim at maximizing the final score.

When the current player is losing, that is, the estimated winning probability is below $1/2$, agents with a high $\lambda$ try to get a higher probability of winning under the assumption that they have a score bonus ($x_\lambda>0$), thus they embark on a calm plan to recover points, instead of trying desperate moves to subvert the game.
We observe that the first situation is useful during games with weaker opponents, because agents with high $\lambda$ would try very strong moves also in the endgame and late middlegame. The second situation is useful to produce a strong game when the software plays with some handicap, in stones or komi.
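The construction of the parametric value can be sketched numerically as follows. This is a simplified illustration, not the agent's actual implementation: it evaluates everything at a single state, and it assumes a logistic winrate of the form $\sigma(\beta(\alpha + k_s + x))$ and a linear interpolation between the true winning probability and $1/2$ for the intermediate probability. The function names are hypothetical.

```python
import math


def logistic(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def softplus(t):
    """log(1 + exp(t)), an antiderivative of the logistic function."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def lambda_value(alpha, beta, k_signed, lam):
    """Average of the winrate over bonus points x between 0 and x_lambda."""
    lead = alpha + k_signed              # expected lead with the real komi
    p0 = logistic(beta * lead)           # winrate with no bonus (x = 0)
    if lam == 0.0:
        return p0
    # Assumed interpolation between the true winrate and the even probability.
    target = (1.0 - lam) * p0 + lam * 0.5
    if not 0.0 < target < 1.0:
        raise ValueError("intermediate probability must lie in (0, 1)")
    # Bonus/malus x_lambda at which the winrate equals the target.
    x_lam = math.log(target / (1.0 - target)) / beta - lead
    if abs(x_lam) < 1e-12:
        return p0
    # Closed-form integral of the sigmoid between 0 and x_lambda.
    integral = (softplus(beta * (lead + x_lam)) - softplus(beta * lead)) / beta
    return integral / x_lam


winning_plain = lambda_value(3.0, 1.0, 0.0, 0.0)   # plain winrate when ahead
winning_full = lambda_value(3.0, 1.0, 0.0, 1.0)    # lambda = 1, ahead by 3
losing_full = lambda_value(-3.0, 1.0, 0.0, 1.0)    # lambda = 1, behind by 3
```

When the player is ahead, the averaged value is lower than the plain winrate, so moves that only preserve a narrow win look less attractive; when the player is behind, the averaged value is higher than the plain winrate.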
It is natural to imagine tuning the value of $\lambda$ dynamically depending on the situation, the opponent's style, or the moment of the game. It would even be possible to apply values of $\lambda$ outside of $[0,1]$, provided that the intermediate probability defined above stays strictly between 0 and 1.
3 Setting the stage: 7×7 Leela Zero
To provide a benchmark for the development of SAI, we adapted Leela Zero to the 7×7 Go board and performed several training runs from purely random play to a level at which further improvement wasn't expected.
3.1 Scaling down Go complexity
Scaling the Go board from size $n$ to size $\rho n$ with $\rho<1$ yields several advantages:

Average number of legal moves at each position scales by $\rho^2$.

Average length of a game scales by $\rho^2$.

The number of visits in the UC tree that would result in a similar understanding of the total game scales at an unclear rate; nevertheless one may naively infer from the above two that it may scale by about $\rho^4$.

The number of resconv layers in the ANN tower scales by .

The fully connected layers in the ANN are also much smaller, even if it is more complicated to estimate the speed contribution.
All in all it is reasonable that the total speed improvement for selfplay games is of the order of at least.
Since the expected time to train 19×19 Go on reasonable hardware has been estimated to be in the order of several hundred years, we anticipated that for 7×7 Go this time should be in the order of weeks.
In fact, with a small cluster of 3 personal computers with average GPUs we were able to complete most runs of training in less than a week each.
We always used networks with 3 residual convolutional layers of 128 filters, the other details being the same as Leela Zero.
The number of visits corresponding to the standard value of 3200 used on the regular Go board would scale to about 60 for 7×7. We initially experimented with 40, 100 and 250 visits and then went with the last of these, which we found to be much better.
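The figure of about 60 visits can be reproduced with a short computation, under the assumption that the number of visits scales with the fourth power of the ratio of board sizes (the product of the scaling of the branching factor and of the game length). The function name is hypothetical.

```python
def scaled_visits(reference_visits, board_size, reference_size=19):
    """Visits for a smaller board, assuming scaling by the fourth power of
    the linear size ratio (an assumption that matches the quoted figure)."""
    rho = board_size / reference_size
    return reference_visits * rho ** 4


visits_7x7 = scaled_visits(3200, 7)  # roughly 59, i.e. "about 60"
```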
The Dirichlet noise parameter has to be scaled with the size of the board, according to [6] and we did so, testing with the (nonscaled) values of , and .
The number of games on which the training is performed was assumed to be considerably smaller than the standard 250k window used at size 19, and after some experimenting we observed that values between 8k and 60k generally give good results.
3.2 Measuring playing strength
When doing experiments with training runs of Leela Zero, we produce many networks, which must be tested to measure their playing strength, so that we can assess the performance and efficiency of each run.
The simple usual way to do so is to estimate an Elo/GOR score for each network. (Footnote: In fact the neural network is just one of many components of the playing software, which depends also on several other important choices, such as the number of visits, fpu policies and all the other parameters. Rigorously the strength should be defined for the playing agent (each software implementation of Leela Zero), but to ease the language and the exposition, we will speak of the strength of the network, meaning that the other parameters were fixed at some value for all matches.) The idea which defines this number is that if $s_1$ and $s_2$ are the scores of two nets, then the probability that the first one wins against the second one in a single match is
$$\frac{1}{1+10^{(s_2-s_1)/c}},$$
so that $s_1-s_2$ is, apart from a scaling coefficient $c$ (traditionally set to 400), the log-odds-ratio of winning.
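For concreteness, the relation between score difference and winning probability can be sketched as below. The sketch assumes the standard base-10 Elo convention with scaling coefficient 400; the function names are hypothetical.

```python
import math


def elo_win_probability(score_first, score_second, scale=400.0):
    """Probability that the first network beats the second one in one game,
    under the standard base-10 Elo model (assumed convention)."""
    return 1.0 / (1.0 + 10.0 ** ((score_second - score_first) / scale))


def elo_difference(win_probability, scale=400.0):
    """Score difference implied by an observed winning probability."""
    return scale * math.log10(win_probability / (1.0 - win_probability))


p_equal = elo_win_probability(1500.0, 1500.0)    # evenly matched networks
p_stronger = elo_win_probability(1900.0, 1500.0)  # 400 points stronger
```

A difference of 400 points corresponds to odds of 10 to 1. A single scalar per network necessarily produces a transitive ranking, which is why cycles of networks beating each other cannot be represented by this model.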
This model is so simple that it is actually unsuitable to deal with the complexity of Go and Go playing ability, even for the size 7 board.
In fact in several runs of Leela Zero 7×7 we observed that each training phase would produce at least one network which solidly won over the previous best, and was thus promoted to new best. This process would continue forever, or at least as long as we dared keep the run going, even if from some point on the observed playing style was not evolving anymore. When matches were tried between non-consecutive networks, we saw that the strength inequality was not transitive, in that it was easy to find cycles of 3 or more networks that regularly beat each other in a directed circle, even with very strong margins.
We even tried to measure the playing strength in a more refined way, by performing round-robin tournaments between nets and then estimating Elo scores by maximum likelihood methods. This is much heavier to perform and still gave little improvement in predicting match outcomes.
It must be noted that this appears to be an interesting research problem in its own right. The availability of many artificial playing agents with different styles, strengths and weaknesses will open new possibilities in collecting data and experimenting in this field.
Remark 3.
It appears that this problem is mainly due to the peculiarity of 7×7 Go and only relevant to it.
In the official 19×19 Leela Zero project the Elo estimation is done with respect to the previous best agent only and it is known that there is some Elo inflation, but tests against a fixed set of other opponents or against further past networks have shown that real playing strength does improve.
3.2.1 Panel evaluation and elicitation
A different approach, which is both robust and refined and is easy to generalize, is to use a panel of networks to evaluate the strength of each new candidate.
We chose 15 networks of different strength from the first 5 runs of Leela Zero 7×7. Each network to be evaluated is opposed to each of these in a 100-game match. The result is then a vector of 15 sample winning rates, which contains useful multivariate information on the playing style, strengths and weaknesses of the tested net.
To summarize this information in one rough (but scalar) score number, it would not do to simply sum the winning rates, since the elements of the panel have different strength and the results against them will certainly be correlated in a complex way. Some sort of weighted sum is recommended, thus allowing for a sort of elicitation of the panel nets.
An imperfect but simple solution is to use principal component analysis. We performed covariance PCA once and for all on the match results of the first few hundred good networks, determined the principal factor and used its components as weights. (Footnote: By the properties of PCA, the principal factor will find the maximum distinguisher between results against the panel networks. In the 15-dimensional space of results this will be the direction in which the networks analysed differ most from one another. The coefficients of this factor turned out to be all positive numbers, smallest for very weak networks and largest for the strongest one, thus confirming that these weights represent a reasonable measure of strength.) Hence the score of a network is the principal component of its PCA decomposition.
This value, which we call panel evaluation, correlates well with the maximum likelihood estimation of Elo by round-robin matches, but is much easier and quicker to compute and convenient to use on large sets of networks from different runs.
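The panel evaluation can be sketched in a few lines of dependency-free Python. This is an illustrative sketch only: it finds the first principal direction of the covariance matrix by power iteration and scores a network by projecting its centered vector of winning rates on that direction. Whether the original computation centered the data in the same way is an assumption, and the function names and the tiny two-member panel are hypothetical.

```python
import math


def principal_weights(results, iterations=500):
    """First principal direction of the covariance of `results`.

    results: one row per network, one column per panel member (win rates).
    Returns (column_means, weights), with weights normalized to unit length.
    """
    n, d = len(results), len(results[0])
    means = [sum(row[j] for row in results) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in results]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    w = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):  # power iteration on the covariance matrix
        w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    if sum(w) < 0:  # fix the sign so that stronger networks score higher
        w = [-v for v in w]
    return means, w


def panel_evaluation(win_rates, means, weights):
    """Scalar score: projection of centered win rates on the weights."""
    return sum(w * (r - m) for w, r, m in zip(weights, win_rates, means))


# Hypothetical results of four networks against a panel of two opponents.
panel_results = [[0.1, 0.0], [0.5, 0.2], [0.9, 0.4], [1.0, 0.8]]
panel_means, panel_weights = principal_weights(panel_results)
panel_scores = [panel_evaluation(r, panel_means, panel_weights)
                for r in panel_results]
```

Once the means and weights are fixed, scoring a new network only requires its matches against the panel, which is what makes the approach cheap compared with round-robin tournaments.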
3.3 Effect of visits and Dirichlet parameter
After a few experimental runs to ensure that the system was working well, we decided to study the influence of the two main factors: the number of visits and the Dirichlet parameter $\alpha$.
3.3.1 Design of experiment
We performed 5 runs (Taguchi L4 design plus central point).
 Number of visits.

It is the total number of nodes of the UC tree which is built for every position in the game to decide the next move. The higher it is, the slower are the games, because each move takes more time. Because of tree reuse, if the net is strong, only a small fraction of the nodes will be computed anew on each move, and hence the time dependence on this parameter is not expected to be linear.
Moreover if a larger value is used, then the agent will play better, generations will improve faster and so fewer generations of nets are expected to be required to reach the same level of play. In this sense we expected some tradeoff in the total effort required and possibly the existence of an optimal value for this parameter.
The values used for these experiments are 40, 100 and 250. (Logarithmically spaced.)
 Dirichlet noise parameter.

It is denoted by and is used in the generation of a random probability distribution on the moves (itself Dirichlet distributed), which is used as a perturbation of the policy distribution estimated by the neural network to get the noisy policy :
Here , and is rescaled according to [6].
By the properties of the Dirichlet distribution, if is the number of legal moves, it is expected that most of the probability mass of will be concentrated on moves. (This number is at the start of the game, for the default value . Hence these randomly selected moves will have an average probability bonus of about , or .)
It is expected that this parameter impacts the quality and quantity of exploration during self-play and that it may interact in a nonlinear way with the number of visits, since if the latter is too low then these probability bonuses can have little to no effect.
The values used for these experiments are , , . (Logarithmically spaced.)
Other parameters were fixed at their default value. In particular, the number of moves played more randomly at the beginning of self-play games was set to 4, which seems a suitable rescaling of the default value of 30 which is used on the size 19 board.
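The perturbation of the policy by Dirichlet noise described above can be sketched as follows. The mixing weight of 0.25 is the value published for AlphaGo Zero and is an assumption here, as are the function names and the example concentration parameter; the Dirichlet sample is obtained by normalizing independent Gamma variates from the standard library.

```python
import random


def dirichlet_sample(concentration, size, rng):
    """Symmetric Dirichlet sample obtained by normalizing Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    if total == 0.0:  # extremely unlikely underflow: put all mass on one move
        draws = [0.0] * size
        draws[rng.randrange(size)] = 1.0
        return draws
    return [d / total for d in draws]


def noisy_policy(policy, concentration, mix=0.25, rng=None):
    """Mix the network policy with Dirichlet noise.

    mix = 0.25 is an assumed value (the one published for AlphaGo Zero).
    """
    rng = rng or random.Random()
    noise = dirichlet_sample(concentration, len(policy), rng)
    return [(1.0 - mix) * p + mix * e for p, e in zip(policy, noise)]


# Hypothetical uniform policy over 50 moves (49 intersections plus pass).
uniform_policy = [1.0 / 50] * 50
perturbed = noisy_policy(uniform_policy, 0.2, rng=random.Random(0))
```

With a small concentration parameter most of the noise mass falls on a few moves, which then receive a sizeable probability bonus while all other moves keep most of their original probability.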
3.3.2 Experimental protocol
After some trial and error the following choices were made.

AlphaGo Zero promotion criterion: at least 55% wins of 400 games against current best network.

Fixed number of 10240 self-play games per generation.

Training window of 1 to 5 previous generations (10240 to 51200 games) according to whether the last generations improved much or not.

Minibatch size 512. Training rate 0.05, training steps 8000. New candidate network to test every 1000 steps.

Training starts from current best.

If no new network gets promoted, replicate the last training once, then try again up to 10 times by changing the number and choice of generations in the training window.

Stop the run if unable to find a network that gets promoted, or after waiting at least 10 more generations if there is evidence that the playing strength against the panel is not improving (even if new networks get promoted each time).
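Two of the numerical rules in the protocol above can be written down directly. The sketch below is illustrative and uses hypothetical function names; integer arithmetic is used for the promotion test to avoid floating-point rounding at the 55% boundary.

```python
def is_promoted(wins, games=400, threshold_percent=55):
    """Promotion test: at least 55% wins of 400 games against the current best."""
    return wins * 100 >= threshold_percent * games


def training_window_games(generations, games_per_generation=10240):
    """Number of self-play games in a training window of 1 to 5 generations."""
    if not 1 <= generations <= 5:
        raise ValueError("the protocol uses windows of 1 to 5 generations")
    return generations * games_per_generation


promoted_at_boundary = is_promoted(220)  # exactly 55% of 400 games
largest_window = training_window_games(5)
```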
3.3.3 Results
The results of the five runs are in the plots of Figure 1. The Dirichlet parameter has apparently no effect in this range of values. The number of visits instead has a major importance, in that not only does it impact the speed of learning, but it also affects the final level of play that is reached by the best networks of the run.
3.4 Effect of net structure and AlphaZero protocol
After the first experiments, we decided to fix the number of visits to 250 and the Dirichlet parameter to , and studied the dependence on different choices in the run protocol.
 AlphaZero promotion criterion.

There is only one new network for each generation. Every network plays 1280 self-play games. The training window is of the last 16 generations, for a total of at most 20480 games. The next generation's network is produced after 1000 training steps and automatically promoted to best network.
 AlphaZero randomness.

All moves of self-play games are played more randomly.
 Augmented filters.

The net structure is slightly augmented by raising from 1 to 2 the number of filters for the 1×1 convolutions of the value head.
3.4.1 Results
The results of the five runs are in the plots of Figure 2.
It is apparent that the AlphaZero promotion criterion gives good, fast improvement and even seems more uniform than promotion conditioned on winning.
AlphaZero randomness seems to give more stability but slows learning quite a bit at the beginning.
The augmented filters version seems harder to train and performs similarly.
4 Proof of concept: 7×7 SAI
After gaining a proper understanding of the learning process of 7×7 Leela Zero we started experimenting with SAI.
4.1 Neural network structure
As explained in Section 2.2, Leela Zero’s neural network provides for each position two outputs:
 policy – a probability distribution over the existing moves which predicts the moves that the engine should read with higher priority;

 winrate – an estimate of the probability of winning of the current player.
SAI's neural network should provide for each position three outputs: the policy as before and the two parameters $\alpha$ and $\beta$ of a sigmoid function, which allow the winrate to be estimated for different komi values with a single computation of the net.
It is unclear whether the komi itself should be provided as an input of the neural network: it may help the policy adapt to the situation, but could also make the other two parameters unreliable. (Footnote: As will be explained soon, the training is done at the level of winrate, so in principle, knowing the komi, the net could train $\alpha$ and $\beta$ to any of the infinite pairs that, with that komi, give the right winrate.) For the initial experiments the komi will not be provided as an input to the net.
With the above premises, the first structure we propose for the network is very similar to Leela Zero's one, with the value head substituted by two identical copies of itself devoted to the parameter $\alpha$ and to a raw output $\hat\beta$. The latter is then mapped to $\beta$ by equation (3).
We call the following type V structure.

One 3×3 convolutional layer with 18 input planes (two input planes are either all ones or all zeros according to the color of the current player; the other 16 planes represent the last 8 positions of the board as bit planes describing the stones of the current/other player) and 128 output filters, followed by batch normalization and ReLU.

A tower of three 3×3 residual convolutional layers with 128 inputs and 128 filters each, followed by batch normalization and ReLU.

One 1×1 convolutional layer with 128 inputs and 2 filters, followed by batch normalization and ReLU.

One fully connected layer with 98 inputs and 50 outputs, followed by softmax, gives the policy distribution.

Attached again to the output of the tower of residual layers, one 1×1 convolutional layer with 128 inputs and 1 filter, followed by batch normalization and ReLU.

One fully connected layer with 49 inputs and 256 outputs, followed by ReLU, and then a fully connected layer with 256 inputs and 1 output, and finally ReLU, gives the parameter $\alpha$.

Attached again to the output of the tower of residual layers, one 1×1 convolutional layer with 128 inputs and 1 filter, followed by batch normalization and ReLU.

One fully connected layer with 49 inputs and 256 outputs, followed by ReLU, and then a fully connected layer with 256 inputs and 1 output, and finally ReLU, gives the parameter $\hat\beta$.
Some natural alternatives to the above are type Y and type T structures, which substitute the last four points of the list above respectively with:

Attached again to the output of the tower of residual layers, one 1×1 convolutional layer with 128 inputs and 2 filters, followed by batch normalization and ReLU.

One fully connected layer with 98 inputs and 256 outputs, followed by ReLU, and then a fully connected layer with 256 inputs and 1 output, and finally ReLU, gives the parameter $\alpha$.

Attached again to the output of the last convolutional layer, one fully connected layer with 98 inputs and 256 outputs, followed by ReLU, and then a fully connected layer with 256 inputs and 1 output, and finally ReLU, gives the parameter $\hat\beta$.
and with:

Attached again to the output of the tower of residual layers, one 1×1 convolutional layer with 128 inputs and 2 filters, followed by batch normalization and ReLU.

One fully connected layer with 98 inputs and 256 outputs, followed by ReLU, and then a fully connected layer with 256 inputs and 2 outputs, and finally ReLU, gives the parameters $\alpha$ and $\hat\beta$.
4.2 Training
To train the network we included the komi value in the training data used by SAI. The training is then performed the same way as for Leela Zero, with the loss function given by the sum of a regularization term, cross entropy for the policy and $L^2$ norm for the winning rate.
The winning rate is computed with the sigmoid function given by equations (1) and (2), in particular we set
and backpropagate gradients through these functions.
4.2.1 On generating good training data
To train the neural network it is clearly necessary to have different komi values in the data set. It would be best to have very different komi values, but when the agent starts playing well enough, only few values around the correct komi make the games meaningful. (Footnote: The correct komi for 7×7 Go is known to be 9, in that with that value both players can obtain a draw. Since we didn't want to deal with draws, for 7×7 Leela Zero we chose a komi, thus giving victory to white in case of perfect play. In fact we noticed that with a komi of or (equivalent by Chinese scoring) the final level of play of the agents didn't seem to be as subtle as it appears to be for the komi.)
To adapt the range of komi values to the ability of the current network, when the server assigns a self-play match to a client, it chooses a komi value randomly generated with distribution given by the sigmoid itself. Formally,
(4) 
where , is the initial empty board state, and are the computed values with current network and , thus giving to an approximate logistic distribution.
As the learning goes on, we expect the estimated $\alpha$ at the empty board to converge to the correct value of 9, and the corresponding $\beta$ to increase, narrowing the range of generated komi values.
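Generating a komi whose distribution is given by the sigmoid can be sketched by inverse transform sampling. The sketch below is an assumption-laden illustration: it takes the network's estimates at the empty board (`alpha0`, the expected board lead of black, and `beta0`, the scale), draws a logistic variate centered at `alpha0` with scale `1 / beta0`, and optionally snaps it to a half-integer to avoid draws. The rounding rule and the function name are hypothetical.

```python
import math
import random


def sample_komi(alpha0, beta0, rng=None, half_integer=True):
    """Draw a komi from the logistic distribution implied by the sigmoid.

    alpha0, beta0: network estimates at the empty board (black to move).
    half_integer: if True, snap the draw to the form n + 0.5 (an assumption).
    """
    rng = rng or random.Random()
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    komi = alpha0 + math.log(u / (1.0 - u)) / beta0  # inverse logistic CDF
    if half_integer:
        komi = math.floor(komi) + 0.5
    return komi


komi_rng = random.Random(12345)
raw_draws = sorted(sample_komi(9.0, 1.0, komi_rng, half_integer=False)
                   for _ in range(2001))
median_draw = raw_draws[1000]
snapped_draws = [sample_komi(9.0, 1.0, komi_rng) for _ in range(100)]
```

As the estimated scale grows during training, the draws concentrate around the estimated even komi, which is the narrowing of the komi range mentioned above.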
To deal with this problem we implemented the possibility for the server to assign selfplay games starting from any intermediate position.
After a standard game is finished, the server looks at each of the game's positions and from each one may branch a new game (independently and with small probability). The branched game starts at that position with a komi value that is considered even by the network. Formally,
where is the branching position and is the value of at position , as computed by the current network, with the sign changed if the current player was white.
The branched game is then played until it finishes and then all its positions starting from the branching position are stored in the training data, with the new komi and the correct information on the winner of the branch.
This procedure should produce branches of positions with unbalanced situations and values for the komi that are natural to the situation but nevertheless range on a wide interval of values.
4.3 Sensible agent
When SAI plays, it can estimate the winning probability for all values of the komi with a single computation of the neural network. In fact, once it has $\alpha$ and $\beta$, it knows the sigmoid function that gives the probability of winning with different values of the komi for the current position.
We propose the generalization of the original agent of Leela Zero as introduced in Section 2.4. Here we give further details.
The agent behaviour is parametrized by a real number $\lambda$ which will usually be chosen in $[0,1]$. The expected behaviour is to simply maximize the probability of winning when $\lambda=0$ (as Leela Zero does) and to try to win but also increase the score when $\lambda>0$.
4.3.1 Formalization of the UC tree
To describe rigorously the agent, we need to introduce some more mathematical notation.
Games, moves, trees.
Let be the set of all legal game states, with denoting the empty board starting state.
For every , let the set of legal moves at state and for every , let denote the game state reached from by performing move . This clearly induces a directed graph structure on with no directed cycles (which are not legal because of superko rule) and with root . This graph can be uplifted to a rooted tree by taking multiple copies of the states which can be reached from the root by more than one path. From now on we will identify with this rooted tree and denote by the edge relation going away from the root.
For all let denote the unique state such that .
For all , let denote the set of states reachable from by a single move. We will identify with from now on.
For any subtree , let denote its size (number of nodes) and for all , let denote the subtree of rooted at .
Values, preferences and playouts.
Suppose that we are given a policy and two value functions , with the following properties:

the policy , defined on with values in and such that

the value , defined on with values in ;

the first play urgency , defined on with values in .
Then for any nonempty subtree and node not necessarily inside we can define the evaluation of over , as
It should be noted here that may well also depend on the subtree . In fact two proposed choices for are the following:
(AlphaGo Zero)  
(Leela Zero) 
We can then define the UC urgency of over , as
Finally, the playout over , starting from is defined as the unique path on the tree which starts from and at every node chooses the node that maximizes .
Definition of .
In the case of Leela Zero, the value function does not depend on and is simply the output of the value head of the neural network, passed through a hyperbolic tangent and rescaled in .
In the case of SAI, as explained in Section 2.4 we compute the average winrate at over a range of komi values that depends on .
Formally, for any state , let
Let be the real komi value of the game and let
Then the estimated winrate for the current player at is and it is natural to generalize this value to the family of averages
(5) 
Ideally the right value of should be not too far from , in particular if the estimated winning rate is not far from . The winning rate estimated at should not be too different from the one estimated at .
To this end, a natural choice is to define
(not too different from and on the same side of for ) and to let
(not too far from when the state is critical, but possibly very far from it when the winning rate is high or low).
The final definition is then
where the value is computed at state but the range of the average is decided at state .
Remark 4.
We bring to the attention of the reader that a simple rescaling shows that the quantity would be somewhat less useful, because it depends on and only through .
Remark 5.
Notice that the integral in equation (5) can be computed analytically and easily implemented in the software.
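To make the remark concrete, the sketch below assumes that the integrand is a logistic function of the komi, with hypothetical parameters `shift` and `steepness`, and that the average is taken over an interval `[lo, hi]`. Under that assumption the antiderivative is a softplus, so the average has a closed form. A midpoint-rule approximation is included only as a cross-check.

```python
import math


def softplus(x: float) -> float:
    """Stable log(1 + exp(x)); antiderivative of the logistic function."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def logistic(x: float) -> float:
    """Logistic function written via tanh to avoid overflow."""
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def mean_logistic(shift: float, steepness: float, lo: float, hi: float) -> float:
    """Closed-form average of logistic(steepness * (shift + x)) for x in [lo, hi]."""
    if hi == lo or steepness == 0.0:
        # Degenerate interval or flat curve: the average is the point value.
        return logistic(steepness * (shift + lo))
    upper = softplus(steepness * (shift + hi))
    lower = softplus(steepness * (shift + lo))
    return (upper - lower) / (steepness * (hi - lo))


def mean_logistic_numeric(shift: float, steepness: float, lo: float, hi: float,
                          steps: int = 10000) -> float:
    """Midpoint-rule approximation of the same average, for checking only."""
    width = (hi - lo) / steps
    total = sum(
        logistic(steepness * (shift + lo + (i + 0.5) * width)) for i in range(steps)
    )
    return total / steps
```

The closed form costs two logarithms per evaluation, which is why an analytic expression is attractive inside a tree search.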
Tree construction and move choice.
Suppose we are at state and the agent has to choose a move in . This will be done by defining a suitable decision subtree of , rooted at , and then choosing the move randomly inside with probabilities proportional to
where is the Gibbs temperature, which defaults to 1 for the first moves of self-play games and to 0 (meaning that the move with the highest is chosen) for other moves and for match games.
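A temperature-controlled move choice of this kind can be sketched as follows. The weighting used here, a per-move count raised to the power `1 / temperature`, is the AlphaGo Zero convention and is an assumption of this example; the helper name `choose_move` and the `visit_counts` mapping are hypothetical.

```python
import random
from typing import Dict, Hashable


def choose_move(
    visit_counts: Dict[Hashable, int],
    temperature: float,
    rng: random.Random,
) -> Hashable:
    """Pick a move from per-move counts using a temperature parameter."""
    if not visit_counts:
        raise ValueError("no candidate moves")
    if temperature == 0:
        # Zero temperature: deterministically take the most visited move.
        return max(visit_counts, key=visit_counts.get)
    moves = list(visit_counts)
    # Assumed weighting: count ** (1 / temperature).
    weights = [visit_counts[move] ** (1.0 / temperature) for move in moves]
    return rng.choices(moves, weights=weights, k=1)[0]
```

With temperature 1 the moves are sampled in proportion to their counts, which adds variety to the opening of self-play games; with temperature 0 the choice is greedy, as wanted for match games.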
The decision tree
is defined by an iterative procedure. In fact we define a succession of trees and stop the procedure by letting for some (usually the number of visits or when the thinking time is up). The trees in the succession are all rooted at and satisfy for all , so each one adds just one node to the previous one:
The new node is defined as the first node outside reached by the playout over starting from .
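The iterative construction can be sketched generically as follows. The urgency is passed in as a function of the candidate node and the current tree, so nothing is assumed here about its exact formula; the names `playout`, `grow_tree`, `children` and `urgency` are illustrative, and nodes are assumed to be hashable and unique, as in the rooted tree of states.

```python
from typing import Callable, Hashable, Iterable, List, Set

Node = Hashable


def playout(
    root: Node,
    tree: Set[Node],
    children: Callable[[Node], Iterable[Node]],
    urgency: Callable[[Node, Set[Node]], float],
) -> List[Node]:
    """Follow the child of maximal urgency from the root until leaving the tree."""
    path = [root]
    node = root
    while node in tree:
        options = list(children(node))
        if not options:
            break  # terminal state: nothing to expand
        node = max(options, key=lambda child: urgency(child, tree))
        path.append(node)
    return path


def grow_tree(
    root: Node,
    children: Callable[[Node], Iterable[Node]],
    urgency: Callable[[Node, Set[Node]], float],
    iterations: int,
) -> Set[Node]:
    """Build the decision tree by adding one node per playout."""
    tree = {root}
    for _ in range(iterations):
        new_node = playout(root, tree, children, urgency)[-1]
        if new_node in tree:
            break  # the playout ended on a terminal node already in the tree
        tree.add(new_node)
    return tree
```

In a real agent the stopping rule would be a visit budget or a time limit, and the urgency would combine the evaluation, the policy and the visit counts.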
4.4 Results
4.4.1 Obtaining a strong SAI
The first experimental run was done before the generation of branching games was implemented, and it showed that SAI could learn to play at the same level as Leela Zero, attaining a maximum evaluation of with many networks above 1, meaning that the winrate was correctly estimated. Nevertheless the two parameters and were not estimated very well and even the stronger networks failed to discover that the correct komi is 9. (See Figure 3.)
In the subsequent runs, plotted in Figure 4, we experimented with several parameters: the promotion protocol (AlphaGo Zero vs AlphaZero), the network type (V, Y or T), the value of for self-play games (0 or 0.5) and the first play urgency definition (Leela Zero or AlphaGo Zero).
These experiments were done with 250 visits and Dirichlet parameter 0.02. The number of initial moves played with added randomness was varied between 4 and 15.
Generation of branching games was implemented in all these runs, with probability of branching , where is the estimated probability of winning at state and the constant is set to . The rationale behind this choice was that we thought it more useful to branch in very unbalanced situations, and since the average length of a game is around 40 moves, this formula gives approximately one branch per standard game.
All the matches with the evaluation panel were done with and Leela Zero’s definition of first play urgency, with the intent to attain the maximum strength.
These experiments showed that with branching generation it is indeed possible to train and correctly, and to learn very sharply that the correct komi is 9. (See Figure 5.) Nevertheless the maximum playing strength was somewhat lower than Leela Zero's and the learning process has a tendency to get stuck at evaluation values in the range .
Analysis of the lost match games that most affected the panel evaluation showed that SAI nets were unable to anticipate some tesuji moves of the Leela Zero panel nets in complex situations. Further study of the nets' evaluations and of the tree search development in those positions suggested that SAI nets had poor accuracy in estimating in near-even positions.
Considering this, we tried to simplify the formula for the branching probability, giving a constant probability of branching for all states, and thus a higher chance of branching in balanced situations.
We applied this change at the end of the 9th run of SAI to see if it was able to produce an immediate benefit, and indeed the improvement was almost immediate, steady and substantial, with many networks above 1 and a maximum of 1.11. See Figure 6.
4.4.2 Evaluation of positions by Leela Zero and by SAI
To illustrate the ability of SAI to understand the winning probability in a more complex fashion, we chose 3 meaningful positions which are shown in Figure 7.
For each position we plotted SAI’s sigmoid evaluations of the winrate (black curves) and Leela Zero’s point estimates at standard komi (blue dots). Every one of these plots shows a sample of 41 Leela Zero and 76 SAI nets from different runs, chosen among the strongest ones.
There appears to be much variability, showing that even strong nets do not have a clear understanding of single complicated positions, this being mitigated by MCTS when choosing the next move. It is important to observe that the distributions of the winrates seem to agree for the two groups at standard komi, indicating that SAI’s estimates have similar accuracy and precision as Leela Zero’s.
Remark 6.
It would be really important for the playing agent to be able to estimate this variability for each position, that is, to have some knowledge of how precise its current estimate of the winrate is. This is particularly relevant because this variability is highly dependent on the position.
SAI networks have this possibility: these plots show that in the more complex first position, the value of is small (consistently across the nets), while in the latter two positions, is much larger.
Thus, the estimate of can be considered as a measure of the precision of both and the winrate estimate, hence allowing a single network to understand if the situation is more or less complicated and allowing for subtle tuning of future agents based on this kind of network.
Finally, we emphasize that the SAI nets also provide an estimate of the difference of points between the players.
 Position 1.

Here it is black's turn and black is ahead by 13 points on the board; thus, with komi 9.5, black's margin is 3.5 points. However, the position is difficult, because there is a seki (quite uncommon in our 7×7 games) which may be wrongly interpreted as dead (black ahead by 49 points on the board) or as alive (black ahead by 5 points on the board). In agreement with this analysis, the sample of SAI nets gives a low and sharp estimate for with average
and a wild estimate for , with average and standard deviation, with values ranging from 5 to 42. The sample of Leela Zero nets gives winrate estimates which are almost uniformly distributed in
: many of these nets have a wrong understanding of the position and are not aware of this. SAI nets on the other hand are aware of the high level of uncertainty.
Position 2.

Here it is white's turn and white is behind by 5 points on the board; thus, with komi 9.5, white is winning by 4.5 points. Following the policy, which recognizes a frequent shape here, many nets will consider cutting at F6 instead of E5, thus losing two points. Accordingly, the estimate of ranges approximately from to with average and standard deviation . The sample of has average and standard deviation , thus showing that is to be considered precise to plus or minus one unit.
 Position 3.

Here the situation is very similar to the previous one: white is behind by 7 points on the board; thus, with komi 9.5, white is winning by 2.5 points. Following the policy, which recognizes a frequent shape here, many nets will consider cutting at B2 instead of C3, thus losing two points. Accordingly, the estimate of ranges approximately from to with average and standard deviation . The sample of has average and standard deviation , thus showing that is to be considered precise to plus or minus one unit.
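The margins quoted for the three positions follow from subtracting the komi from black's lead on the board. The small helper below, whose name and signature are purely illustrative, reproduces that arithmetic.

```python
def margin_for_player(black_board_lead: float, komi: float, player: str) -> float:
    """Final margin for `player`, given black's lead on the board and the komi.

    The komi is a bonus for white, so black's margin is the board lead
    minus the komi, and white's margin is the opposite.
    """
    black_margin = black_board_lead - komi
    if player == "black":
        return black_margin
    if player == "white":
        return -black_margin
    raise ValueError("player must be 'black' or 'white'")
```

For instance, a board lead of 13 for black with komi 9.5 gives black a margin of 3.5, while board leads of 5 and 7 give white margins of 4.5 and 2.5 respectively.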
4.4.3 Experimenting with different agents for SAI
Finally, we experimented on how the attitude parameter of SAI affects the playing strength and the style of the agent. This was done by computing the panel evaluation of the same nets when playing matches with (normally panel matches for SAI nets are played with ).
We computed the panel evaluation at of all the nets of the 9th run of SAI. We expected to obtain a slightly lower strength than with , since the latter agent should in principle maximize the winning probability, while the former should try to balance winning probability and a higher score margin, which, against an almost perfect player, could result in running the risk of losing more often. Moreover, when at a disadvantage, an agent with overestimates its winning probability, thus leading to a weaker game against an almost perfect player.
The results are illustrated in Figure 8. We found that the strength of most networks was not affected by setting . Only one group of weak consecutive networks showed a clear decrease in strength, while the other differences were compatible with pure statistical fluctuations.
This is probably due to the fact that margins of victory are almost nonexistent in the 7×7 setting; therefore the performance of a strong net may be independent of in this context. On larger boards, agents with a variable value of may be tested for their ability to target higher margins of victory and avoid suboptimal moves, while still maintaining a strong game.
5 Conclusions
We introduced SAI, a model incorporating a variable level of bonus points in the traditional neural network and Monte Carlo tree search strategy. We trained several nets on the simplified 7×7 goban, exploring several sets of parameters and settings, and obtained nets able to play at an almost perfect level. We showed that the estimates of the winning probability of our nets at standard komi are compatible with those of Leela Zero, and that SAI’s bonus-dependent estimates provide a deeper understanding of the game situation. SAI also provides an estimate of the current point difference between players.
Due to the limitations of the 7×7 goban, it was not possible to assess whether our model allows targeting higher margins of victory, but the results were promising.
We posit that implementing SAI in a distributed effort could produce a software tool able to provide a deeper understanding of the potential of each position, to target high margins of victory and to play with handicap on the 9×9 and full 19×19 boards, thus providing an opponent for human players which never plays suboptimal moves, and ultimately progressing towards the optimal game.
References
[1] Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In International conference on computers and games, pages 72–83. Springer, 2006.
[2] Sylvain Gelly and David Silver. Combining online and offline knowledge in UCT. In Proceedings of the 24th international conference on Machine learning, pages 273–280. ACM, 2007.
[3] Sylvain Gelly, Yizao Wang, Rémi Munos, and Olivier Teytaud. Modification of UCT with patterns in Monte-Carlo go. Research Report RR-6062, INRIA, Nov 2006.
[4] Gian-Carlo Pascutto and contributors. Leela Zero, 2018. [Online; accessed 17-August-2018].
[5] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
[6] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
[7] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
[8] David Silver, Richard Sutton, and Martin Müller. Reinforcement learning of local shape in the game of go. In IJCAI, volume 7, pages 1053–1058, 2007.
[9] Ti-Rong Wu, I Wu, Guan-Wun Chen, Ting-han Wei, Tung-Yi Lai, Hung-Chun Wu, Li-Cheng Lan, et al. Multi-Labelled Value Networks for Computer Go. arXiv preprint arXiv:1705.10701, 2017.