1 Introduction
Bayesian optimization was used as a routine service to adjust the hyperparameters of AlphaGo (Silver et al., 2016) during its design and development cycle, resulting in progressively stronger agents. In particular, Bayesian optimization was a significant factor in the strength of AlphaGo in the highly publicized match against Lee Sedol.
AlphaGo may be described in terms of two stages: Neural network training, and game playing with Monte Carlo tree search (MCTS). Each of these stages has many hyperparameters. We focused on tuning the hyperparameters associated with game playing. We did so because we had reasonably robust strategies for tuning the neural networks, but less human knowledge on how to tune AlphaGo during game playing.
We meta-optimized many components of AlphaGo. Notably, we tuned the MCTS hyperparameters, including those governing the UCT exploration formula, the node-expansion thresholds, several hyperparameters associated with the distributed implementation of MCTS, and the hyperparameters of the formula for choosing between fast rollouts and value network evaluation per move. We also tuned the hyperparameters associated with the evaluation of the policy and value networks, including the softmax annealing temperatures. Finally, we meta-optimized a formula for deciding the search time per move during games. The number of hyperparameters to tune varied from 3 to 10 depending on the tuning task. The results section of this brief paper expands on these tasks.
Bayesian optimization not only reduced the time and effort of manual tuning, but also improved the playing strength of AlphaGo by a significant margin. Moreover, it resulted in useful insights on the individual contribution of the various components of AlphaGo, for example shedding light on the value of fast Monte Carlo rollouts versus value network board evaluation.
There is no analytically tractable formula relating AlphaGo's winrate to the values of its hyperparameters. However, we can easily estimate it via self-play, that is, by playing an AlphaGo version $v$ against a baseline version $v_0$ for $N$ games and, subsequently, computing the average winrate:

$$\bar{p}(v, v_0) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\text{game } i \text{ is won by } v). \qquad (1)$$
By playing several games with different versions of AlphaGo, we can also adopt the BayesElo algorithm (Coulom, 2008) to estimate a scalar value indicating the strength of each AlphaGo agent.
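Under the logistic Elo model used by BayesElo, a winrate $p$ between two players corresponds to an Elo difference of $-400 \log_{10}(1/p - 1)$. The following sketch shows this standard conversion (it is not code from the AlphaGo system); it reproduces the winrate-to-Elo figures quoted in Section 3.1:

```python
import math

def winrate_to_elo(p: float) -> float:
    """Elo difference implied by a winrate p under the logistic Elo model."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_to_winrate(delta: float) -> float:
    """Expected winrate of a player who is `delta` Elo points stronger."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

print(round(winrate_to_elo(0.632)))  # ~94 Elo
print(round(winrate_to_elo(0.644)))  # ~103 Elo
```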
Since each Go game has only two outcomes, win or lose, the average winrate is the sample average of a Bernoulli random variable with true winrate $p(v, v_0)$. We can also easily compute confidence intervals for a sample of size $N$. The winrate $p(\theta_v, \theta_{v_0})$, or simply $p(\theta)$, is a function of the hyperparameters $\theta$ of $v$ with $\theta_{v_0}$ fixed, but the analytical form of this function is unknown.

Before applying Bayesian optimization, we attempted to tune the hyperparameters of AlphaGo one-at-a-time using grid search. Specifically, for every hyperparameter, we constructed a grid of valid values and ran self-play games between the current version $v$ and a fixed baseline version $v_0$, playing a fixed number of games per grid value. The games were played with a fixed 5-second search time per move, and it took approximately 20 minutes to play one game. By parallelizing the games across several workers, using 400 GPUs, it took approximately 6.7 hours to estimate the winrate for a single hyperparameter value. The optimization of 6 hyperparameters, each taking 5 possible values, would have required 8.3 days. This high cost motivated us to adopt Bayesian optimization.
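As a concrete illustration of the winrate estimator and its normal-approximation confidence interval, consider the following minimal sketch, where `play_game` is a hypothetical stand-in for one self-play game:

```python
import math
import random

def play_game(theta, baseline) -> bool:
    """Hypothetical stand-in: returns True if the version with hyperparameters
    `theta` beats `baseline`. A real evaluation launches a full AlphaGo game."""
    return random.random() < 0.55  # placeholder outcome

def estimate_winrate(theta, baseline, n_games):
    """Average winrate as in Eq. (1), with a 95% confidence interval based on
    the Bernoulli standard error sqrt(p(1-p)/N)."""
    wins = sum(play_game(theta, baseline) for _ in range(n_games))
    p_hat = wins / n_games
    stderr = math.sqrt(p_hat * (1.0 - p_hat) / n_games)
    return p_hat, (p_hat - 1.96 * stderr, p_hat + 1.96 * stderr)
```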
2 Methods
Bayesian optimization is a sequential model-based approach to optimizing black-box functions $f(\theta)$, $\theta \in \Theta$. Its data efficiency makes it particularly suitable for expensive black-box evaluations such as hyperparameter evaluation. Bayesian optimization specifies a probabilistic prior model over the unknown function $f$ and applies Bayesian inference to compute a posterior distribution over $f$ given the previous observations $\{(\theta_i, y_i)\}_{i=1}^{n}$. This posterior distribution is in turn used to construct an acquisition function that decides the next query point $\theta_{n+1}$. The acquisition function trades off exploitation and exploration.

One could use a wide variety of probabilistic models and acquisition functions; see, for example, the tutorial of Shahriari et al. (2016). In this work, we use Gaussian process (GP) priors over functions (Rasmussen and Williams, 2006) and the Expected Improvement (EI) acquisition function (Mockus et al., 1978). Figure 1 (from Brochu et al. (2010)) illustrates Bayesian optimization with these choices in a 1D scenario.
The expected improvement at a given point $\theta$ is defined as

$$\mathrm{EI}(\theta) = \mathbb{E}\left[\max\{0,\; f(\theta) - f^\star\}\right], \qquad (2)$$

where $f^\star$ is a target value, usually the best past observation or the best posterior mean at past query points. Hence, $f^\star$ can be thought of as an aspiration value. EI attempts to do better than $f^\star$, but instead of greedily exploiting, it also uses the estimates of uncertainty derived from the probabilistic model over $f$ to explore the space $\Theta$.
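For a GP posterior with mean $\mu(\theta)$ and standard deviation $s(\theta)$, EI has a well-known closed form, $\mathrm{EI}(\theta) = (\mu - f^\star)\Phi(z) + s\,\phi(z)$ with $z = (\mu - f^\star)/s$. A minimal sketch of this computation (not the Spearmint implementation):

```python
import math

def expected_improvement(mu: float, sigma: float, f_star: float) -> float:
    """Closed-form EI for a Gaussian posterior N(mu, sigma^2) at a single
    point, for maximization of f."""
    if sigma <= 0.0:
        return max(0.0, mu - f_star)  # no uncertainty left at this point
    z = (mu - f_star) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal cdf
    return (mu - f_star) * cdf + sigma * pdf
```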
The choice of Bayesian optimization to tune the hyperparameters of AlphaGo was motivated by several factors: (i) the winrate is not differentiable, (ii) large computational resources are required to evaluate the winrate at a single hyperparameter setting, and (iii) the number of hyperparameters is moderate, making it possible to find a good setting within a few hundred steps.
We use a modified version of Spearmint (Snoek et al., 2012) with input warping to conduct Bayesian optimization. The hyperparameter tuning procedure is summarized in Algorithm 1.
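In essence, the procedure is a standard noisy Bayesian optimization loop. A hedged outline follows (reusing `estimate_winrate` from the sketch above; `fit_gp` and `propose_next` are placeholders, not Spearmint's API, and the constants are illustrative):

```python
import math
import random

MIN_STD = 0.02  # assumed lower clip for the noise estimate (illustrative value)

def fit_gp(observations):
    """Placeholder for the GP fit. The paper uses a modified Spearmint with
    input warping and a nonstationary Gaussian observation noise model."""
    return observations

def propose_next(model, candidates):
    """Placeholder for maximizing the EI acquisition over candidate settings."""
    return random.choice(candidates)

def tune(candidates, baseline, n_iterations=100, n_games=50):
    """Illustrative outline of the tuning loop summarized in Algorithm 1."""
    observations, model = [], None
    for _ in range(n_iterations):
        theta = propose_next(model, candidates)
        p_hat, _ = estimate_winrate(theta, baseline, n_games)
        noise_std = max(math.sqrt(p_hat * (1.0 - p_hat) / n_games), MIN_STD)
        observations.append((theta, p_hat, noise_std))
        model = fit_gp(observations)
    return max(observations, key=lambda obs: obs[1])[0]  # best observed setting
```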
In black-box optimization, most algorithms assume that the function value is observed either exactly or noisily with an unknown noise magnitude. In contrast, in self-play games we can estimate the observation noise, since the observed winrate is an average of Bernoulli variables. When the number of games $N$ is large enough, the observed value is approximately normally distributed by the central limit theorem, and the noise standard deviation can be estimated as

$$\hat{\sigma} = \sqrt{\frac{\bar{p}(1 - \bar{p})}{N}}. \qquad (3)$$

We adopted a Gaussian process model with a nonstationary Gaussian observation noise model. For every hyperparameter setting, we supplied the Gaussian process model with the observed winrate and the estimated standard deviation. The estimate $\hat{\sigma}$ was clipped from below to avoid ignoring the noise when all games were won or lost.
We can reduce the cost of a function evaluation by decreasing the number of self-play games per hyperparameter candidate, at the price of higher noise. (While this is reminiscent of the general idea of early stopping in Bayesian optimization, it is simpler in our domain.) Guided by this trade-off and the central limit theorem for Bernoulli random variables, we evaluated each candidate with a reduced number of games.
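The trade-off is easy to quantify: from Eq. (3), the worst-case (at $\bar{p} = 0.5$) standard error after $N$ games is $0.5/\sqrt{N}$, so quartering the number of games only doubles the noise. For instance:

```python
import math

# Worst-case standard error (p = 0.5) of the winrate estimate for N games.
for n in (10, 50, 100, 400, 1000):
    print(n, round(0.5 / math.sqrt(n), 3))
# N = 100 gives a standard error of 0.05, i.e., a 95% interval of roughly +/-10%;
# N = 400 halves it to 0.025.
```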
Finally, we developed a visualization tool to understand the winrate sensitivity with respect to each individual hyperparameter. Specifically, we plot the winrate posterior mean and variance as a function of one hyperparameter, or of a pair of hyperparameters, while holding the others fixed, as illustrated in Figure 2. We also estimated the contribution of each parameter to the difference in playing strength between two settings. We found this useful for understanding the importance of each hyperparameter and the correlations among hyperparameters, and we used this information to select the most influential hyperparameters to optimize in subsequent AlphaGo versions.

3 Tasks and Results
In the following subsections, we describe the various tasks to which Bayesian optimization was applied and in which it yielded fruitful results.
3.1 Task 1: Tuning MCTS hyperparameters
We optimized the MCTS hyperparameters governing the UCT exploration formula (Silver et al., 2018, "Search" section), the network output tempering, and the mixing ratio between the fast rollout value and the value network output. The number of hyperparameters to tune varied from 3 to 10.
The development of AlphaGo involved many design iterations. After completing the development of an AlphaGo version, we refined it with Bayesian optimization and self-play. At the start of each design iteration, the winrate was 50%. By tuning the MCTS hyperparameters, however, this winrate increased to 63.2% and 64.4% (that is, 94 and 103 Elo gains) in the two design iterations prior to the match with Lee Sedol. Importantly, every time we tuned a version, the gained knowledge, including the hyperparameter values, was passed on to the team developing the next version of AlphaGo; that is, the improvements from all tuning tasks compounded. After the match with Lee Sedol, we continued optimizing the MCTS hyperparameters, resulting in progressively stronger AlphaGo agents. Figure 3 shows a typical curve of the winrate against an opponent of the same version with fixed hyperparameters.
Interestingly, the automatically found hyperparameter values were very different from the default values found by previous hand-tuning efforts. Moreover, the hyperparameters were often correlated, and hence the values found by Bayesian optimization were not reachable with element-wise hand-tuning, or in some cases even by tuning pairs of parameters.
By tuning the mixing ratio between rollout estimates and value network estimates, we found that Bayesian optimization gave increasing preference to value network estimates as the design cycle progressed. This eventually led the team to abandon rollout estimates in future versions of AlphaGo and in AlphaGo Zero (Silver et al., 2017).
3.2 Task 2: Tuning fast AlphaGo players for data generation
We generated training datasets for the policy and value networks by running self-play games with a very short search time, a small fraction of the regular search time per move. The improvement of AlphaGo across versions depended on the quality of these datasets, so it was crucial for the fast players used for data generation to be as strong as possible. Under this special time setting, the optimal hyperparameter values were very different, making manual tuning prohibitive without proper prior knowledge. Tuning the different versions of the fast players resulted in Elo gains of 300, 285, 145, and 129 for four key versions of these players.
3.3 Task 3: Tuning on TPUs
Tensor Processing Units (TPUs) provided faster network evaluation than GPUs. After migrating to the new hardware, AlphaGo's performance was boosted by a large margin. This, however, changed the optimal values of the existing hyperparameters, and new hyperparameters arose in the distributed TPU implementation. Bayesian optimization yielded further large Elo improvements in the early TPU implementations.
3.4 Task 4: Developing and tuning a dynamic mixing ratio formula
Early versions of AlphaGo used a constant mixing ratio between the fast rollout and value network board evaluations, regardless of the stage of the game and the search time. This was clearly a suboptimal choice, but we lacked a proper technique to search for the optimal mixing function. With the introduction of Bayesian optimization, we could define a more flexible formula and search for the best formula parameters. In particular, we defined the new dynamic mixing ratio $\lambda(s)$ at a tree node $s$ as a function of the move number $m(s)$ and the number of node visits $n(s)$ during tree search, in the following form:

$$\lambda(s) = \sigma\!\left(w_0 + w_m\, m(s) + w_n \log n(s)\right), \qquad (4)$$

where $\sigma(x) = 1/(1+e^{-x})$ is the logistic function restricting $\lambda \in (0, 1)$, $w_0$ is the parameter corresponding to the mixing ratio at move 0 without search, and $w_m$ and $w_n$ are linear coefficients.
During optimization, we observed that the best value of $w_0$ was highly correlated with that of $w_m$, as illustrated in Figure 4(a). This caused difficulties in optimizing the parameters jointly and resulted in failed hand-tuning attempts. By inspecting the ridge of high winrate in the figure, we noticed that all the points along the ridge corresponded to mixing ratio curves that crossed around a common point near move 150. Figure 4(b) shows four mixing ratio vs. move number curves corresponding to the four points marked in Figure 4(a). This suggested that it was important to find a good value for the mixing ratio around move 150, a finding consistent with the observation that the game-determining points in AlphaGo's self-play games usually happened between moves 150 and 200.
With this observation, we reparameterized the mixing formula as

$$\lambda(s) = \sigma\!\left(w_{150} + w_m\,(m(s) - 150) + w_n \log n(s)\right), \qquad (5)$$

with $w_{150}$ indicating the mixing ratio at move 150, and obtained a decorrelated objective, as shown in Figure 4(c). The optimal dynamic mixing ratio formula is the green curve in Figure 4(b). With this formula, AlphaGo placed more weight on the value network at the beginning of a game and less towards the end. This was consistent with the facts that the value network was better at global opening judgment than rollouts, and that rollouts became more accurate in endgames, where the search depth is reduced.
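A minimal sketch of the two parameterizations in Eqs. (4)-(5); the weights passed in would come from tuning and are not shown here:

```python
import math

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def mixing_ratio(move: int, visits: int, w0: float, wm: float, wn: float) -> float:
    """Dynamic mixing ratio of Eq. (4): lambda = sigma(w0 + wm*m + wn*log n)."""
    return logistic(w0 + wm * move + wn * math.log(visits))

def mixing_ratio_reparam(move: int, visits: int, w150: float, wm: float, wn: float) -> float:
    """Reparameterized form of Eq. (5), pivoting on the mixing ratio at move 150."""
    return logistic(w150 + wm * (move - 150) + wn * math.log(visits))

# The two forms are equivalent under w0 = w150 - 150 * wm, but the second
# decorrelates the pivot parameter from the slope during optimization.
```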
The parameter $w_n$ was also highly correlated with the other parameters, and we did find a reparameterization to decorrelate them, but the optimal value of $w_n$ turned out to be 0 after tuning, suggesting that the visit count need not be included in the mixing ratio formula.
3.5 Task 5: Tuning a time control formula
MCTS is an anytime algorithm whose tree search can be interrupted at any point, returning the current best choice. To prepare for the formal match with Lee Sedol, which had a main time of 2 hours and three 60-second byoyomi periods per player, we wanted to optimize the allocation of search time across moves. We treated time allocation as an optimization problem: maximize the winrate of a player, subject to the time restriction, against an opponent with a fixed time schedule. Huang et al. (2010) proposed a parameterized time control formula to allocate search time for a fixed main time without byoyomi. We adopted a more flexible formula that includes the byoyomi time, as follows:
$$t_{\text{main}}(m) = \frac{r(m)}{c + \max(m_{\text{peak}} - m,\, 0)}, \qquad (6)$$

$$t(m) = t_{\text{main}}(m) + t_{\text{byo}}\left(a + b\,\mathbb{1}(m > m_{\text{peak}})\right), \qquad (7)$$

where $m$ is the 1-based move number, $m_{\text{peak}}$ is the move with the peak search time, $r(m)$ is the remaining main time at move $m$, $t_{\text{byo}}$ is the byoyomi time, and $\mathbb{1}(x) = 1$ if $x$ is true and 0 otherwise. $a$, $b$, and $c$ are hyperparameters to tune together with $m_{\text{peak}}$. The formula reduces to that of Huang et al. (2010) when $a = 0$ and $b = 0$.
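A sketch of this schedule as written in Eqs. (6)-(7); the parameter values a caller would pass are illustrative, not the tuned match settings:

```python
def search_time(m: int, remaining_main: float, t_byo: float,
                m_peak: int, a: float, b: float, c: float) -> float:
    """Per-move search time following the parametric schedule of Eqs. (6)-(7).
    With a = b = 0 this reduces to the Huang et al. (2010) formula."""
    t_main = remaining_main / (c + max(m_peak - m, 0))
    return t_main + t_byo * (a + b * (1 if m > m_peak else 0))
```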
The optimal formula after tuning all the hyperparameters is shown in Figure 5. With it, AlphaGo achieved a 66.5% winrate against the default time setting of a fixed 30-second search time per move. Interestingly, the move with the peak search time under the optimal time control formula is also around move 150.
4 Conclusion
Bayesian optimization provided an automatic solution for tuning the game-playing hyperparameters of AlphaGo; this would have been impossible with traditional hand-tuning. Bayesian optimization contributed significantly to the winrate of AlphaGo and helped us gain important insights, which continue to be instrumental in the development of new versions of self-play agents with MCTS.
5 Acknowledgments
We are very thankful to Misha Denil and Alexander Novikov for providing us with valuable feedback in the preparation of this document.
References
Brochu et al. [2010] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Report arXiv:1012.2599, 2010.

Coulom [2008] Rémi Coulom. Whole-history rating: A Bayesian rating system for players of time-varying strength. In International Conference on Computers and Games, pages 113–124. Springer, 2008.

Huang et al. [2010] Shih-Chieh Huang, Rémi Coulom, and Shun-Shii Lin. Time management for Monte-Carlo tree search applied to the game of Go. In Technologies and Applications of Artificial Intelligence, pages 462–466, 2010.

Mockus et al. [1978] J. Mockus, V. Tiesis, and A. Žilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2, 1978.

Rasmussen and Williams [2006] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Shahriari et al. [2016] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

Silver et al. [2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

Silver et al. [2018] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.

Snoek et al. [2012] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.