Weighting NTBEA for Game AI Optimisation

03/23/2020 ∙ by James Goodman, et al. ∙ Queen Mary University of London

The N-Tuple Bandit Evolutionary Algorithm (NTBEA) has proven very effective in optimising algorithm parameters in Game AI. A potential weakness is the use of a simple average of all component Tuples in the model. This study investigates a refinement to the N-Tuple model used in NTBEA, weighting the component Tuples by their level of information and specificity of match. We introduce weighting functions to the model to obtain Weighted-NTBEA and test this on four benchmark functions and two game environments. These tests show that vanilla NTBEA is the most reliable and performant of the algorithms tested. Furthermore, we show that given an iteration budget it is better to execute several independent NTBEA runs and use part of the budget to find the best recommendation from these runs.







1 Introduction

In Game AI, as in many other fields, algorithms usually have several parameters that need to be specified. For any given problem some parameter settings may give good results, while other settings give very poor results. For any new problem (a new game, for example) we need to decide which parameter values to use. In many cases a set of ‘standard’ parameter settings is available based on previous work, but these may not be ideal for the new domain. An exhaustive search of all possible parameter settings is usually infeasible - it may take days of processing time on a large parallel cluster to train a complex neural network using Reinforcement Learning (RL). If the RL algorithm has four parameters, each of which can have five values, then training a policy under each possible setting will take 5^4 = 625 cluster-days, or about 2 cluster-years, to evaluate each combination once. The problem is considerably worse if the outcome of any one evaluation (or experiment) is stochastic, so that a good estimate of the value of a given parameter setting requires many independent evaluations.

The field of parameter (and hyper-parameter) optimisation seeks fast methods for deciding on parameter settings in a new domain with an available computational budget. This generally involves constructing a predictive model for the result of a future untried evaluation. After each time-consuming real-world evaluation has been run, the computationally cheap predictive model is updated with the result and interrogated to suggest the next set of parameter values to try. By reducing the number of expensive full evaluations to find a good (if not necessarily optimal) set of parameters, we save significant time and money.
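The loop described above can be sketched in a few lines. This is a minimal, hypothetical skeleton: the names `evaluate`, `fit_model` and `suggest` are stand-ins for the expensive real-world evaluation and the cheap surrogate model, not any particular library's API.

```python
import random

def optimise(evaluate, candidates, budget, fit_model, suggest):
    """Generic model-based optimisation loop: alternate one expensive
    real evaluation with a cheap model update and a model-guided
    suggestion of the next setting to try."""
    history = []                              # (setting, result) pairs
    setting = random.choice(candidates)       # first setting is random
    for _ in range(budget):
        result = evaluate(setting)            # expensive black-box call
        history.append((setting, result))
        model = fit_model(history)            # cheap predictive model
        setting = suggest(model, candidates)  # next point to evaluate
    return max(history, key=lambda sr: sr[1])[0]
```

With a `suggest` that simply prefers unexplored settings, a budget equal to the number of candidates degenerates to exhaustive search; the interest of the approach is in doing better than that when the budget is much smaller than the space.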

The N-Tuple Bandit Evolutionary Algorithm (NTBEA) was introduced in [Kunanusont_Gaina_Liu_Perez-Liebana_Lucas_2017, Lucas_Liu_Perez-Liebana_2018]. It has been benchmarked against several other optimisation algorithms in stochastic game environments and proven to be more effective at finding a good set of parameter settings than other algorithms within a fixed computational budget [Lucas_Liu_Bravi_Gaina_Woodward_Volz_Perez-Liebana_2019]. Similarly, [Sironi_Winands_2019] find that NTBEA is the best of a number of optimisers tried for modifying MCTS parameters during algorithm execution across a number of games.

NTBEA in [Lucas_Liu_Perez-Liebana_2018, Lucas_Liu_Bravi_Gaina_Woodward_Volz_Perez-Liebana_2019] estimates the value of a set of parameter values using the simple average of all matching Tuples in the model (see Background for a detailed explanation). The current work extends this to weight the matching Tuples using the amount of data (i.e. the number of real-world experiments) that informs a given Tuple, and the degree of specificity of the Tuple match. We hypothesise that this approach will allow us to converge to a good parameter setting faster and more robustly than vanilla NTBEA.

In addition to introducing Weighted-NTBEA in this work, we also modify four benchmark tests from the function optimisation literature to incorporate noise. These enable optimisation algorithms to be compared cheaply (in terms of computational budget) and also provide greater confidence in conclusions, because the true underlying value of each setting is known exactly rather than being an estimate over multiple expensive evaluations.

2 Background

2.1 Black-box optimisation

Black-box function optimisation addresses the problem of finding the optimal value of some function

    x* = argmax_{x ∈ X} f(x)    (1)

where f can be evaluated at any x, but not differentiated. When f is expensive to evaluate we wish to minimise the number of evaluations we make, and can use the real evaluations made so far to model the result of f (the ‘response surface’) to decide what value of x should be evaluated next. A common approach is to use Bayesian optimisation techniques with a prior over the response surface, and update a posterior model after each evaluation. To pick the next point a trade-off is made between exploitation and exploration; for example the point with the largest expected improvement (EI), or the highest 95% confidence bound (UCB) [Shahriari_Swersky_Wang_Adams_deFreitas_2016, Brochu_Cora_deFreitas_2010, Jones_Schonlau_Welch_1998]. Bayesian methods require either a model to be specified, or a decision on the kernel functions to use in a (non-parametric) Gaussian Process. They are sensitive to stochastic noise, especially noise that is highly non-Gaussian [Brochu_Cora_deFreitas_2010]. Approaches exist to integrate different types of noise into the model, but these add complexity [Shahriari_Swersky_Wang_Adams_deFreitas_2016].

Most Bayesian methods and libraries assume that f is continuous in all dimensions of x, and do not work in discrete spaces. This is not true of all; for example, BOCS [Baptista_Poloczek_2018] uses Bayesian Linear Regression with semi-definite programming to optimise a discrete combinatorial problem. However, BOCS does assume uniform Gaussian noise. Other approaches have been used to model the response surface in black-box optimisation: for example, Random Forests are used in the SMAC algorithm.


In a bandit-based approach, each setting of the parameters is one ‘arm’ of the bandit, and we seek to find out which ‘arm’ gives us the highest reward in a limited number of pulls. This is a natural fit if each parameter can take a small number of discrete values, but it cannot cope easily with continuous dimensions.

NTBEA combines a bandit-based approach with an N-Tuple model [Kunanusont_Gaina_Liu_Perez-Liebana_Lucas_2017] and an evolutionary algorithm to select the next point to be evaluated. The UCB1 (Upper Confidence Bound) algorithm is used to balance exploration and exploitation [Auer_Cesa-Bianchi_Fischer_2002]. NTBEA is described in detail in the next section.

2.2 NTBEA

This explanation of NTBEA closely follows [Lucas_Liu_Perez-Liebana_2018]. During each iteration of NTBEA we:

  1. Run a full game (or experiment, or other expensive function evaluation) using the current test setting θ. For the first iteration θ is selected at random.

  2. Update the N-Tuple Model with the evaluation result.

  3. Generate a neighbourhood of points by applying a mutation operator to θ (repeated N times to get a neighbourhood of size N).

  4. Evaluate the Upper Confidence Bound (UCB) for each of the N points using the N-Tuple Model. Select the one with the highest UCB as the new θ, and repeat from 1.

In this study, as in [Lucas_Liu_Bravi_Gaina_Woodward_Volz_Perez-Liebana_2019, Lucas_Liu_Perez-Liebana_2018, Kunanusont_Gaina_Liu_Perez-Liebana_Lucas_2017], we set N = 50, and the mutation operator used is to randomly mutate each parameter θ_i to a random setting with a fixed mutation probability, always mutating at least one θ_i.
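The four-step loop and the mutation operator can be sketched as below. This is a minimal reading of the algorithm, not the authors' implementation: the `model` object is assumed to expose `update` and `ucb` methods, and the mutation probability and neighbourhood size used here are illustrative defaults.

```python
import random

def mutate(setting, param_values, p_mut):
    """Mutation operator (step 3): flip each parameter to a random value
    with probability p_mut, always mutating at least one parameter."""
    new = list(setting)
    mutated = False
    for i, values in enumerate(param_values):
        if random.random() < p_mut:
            new[i] = random.choice(values)
            mutated = True
    if not mutated:                      # guarantee at least one mutation
        i = random.randrange(len(new))
        new[i] = random.choice(param_values[i])
    return tuple(new)

def ntbea_iteration(model, setting, param_values, evaluate,
                    neighbourhood=50, p_mut=0.3):
    """One NTBEA iteration: evaluate, update the N-Tuple model, then
    return the neighbour with the highest UCB as the next setting."""
    result = evaluate(setting)           # step 1: expensive evaluation
    model.update(setting, result)        # step 2: update N-Tuple model
    neighbours = [mutate(setting, param_values, p_mut)
                  for _ in range(neighbourhood)]   # step 3
    return max(neighbours, key=model.ucb)          # step 4
```

Note that only one expensive evaluation happens per iteration; the N neighbourhood evaluations in step 4 query the cheap N-Tuple model, not the real game.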

2.2.1 N-Tuple Model

A 1-Tuple model breaks down the modelled value of a setting θ into one component per parameter dimension, using Equation (2). Each component is the expected value of f(θ) assuming that only the i-th parameter θ_i affects the value; it is the mean of all evaluation results so far that share the same value of θ_i:

    f̂_i(θ) = (1 / n_i(θ)) Σ_k δ(θ_i, θ^k_i) f(θ^k)    (2)

Here δ is a delta-function that is 1 when a previously evaluated setting matches the current setting in the i-th dimension, n_i(θ) is the total number of such previous evaluations, and θ^k is the k-th of these.

In other words, our prediction is the average of all the matching 1-Tuple predictions based on past observations. There are no interactions between different parameters, and there are no assumptions about relationships between different values of a given parameter. For example, if one parameter has discrete values 1, 2 or 3 then the results of evaluations where this was 1 or 3 will have no impact at all on predictions for the intermediate 2. This is a very conservative non-parametric model. In the case of 5 dimensions with 10 possible values for each, we need to maintain just 50 sets of statistics for a 1-Tuple model (the number of times each tuple-setting has been tried, and the mean of those evaluations). Any θ will match with exactly five of these, and the prediction is the mean of these five.

A 2-Tuple model extends this to consider interactions between two parameter settings. We replace the single-dimension match δ(θ_i, θ^k_i) in (2) with a match on a pair of dimensions, and now consider all evaluations that were a match on two particular parameters. In the case of 5 dimensions with 10 possible values for each this gives a total of C(5,2) × 10² = 1000 distinct 2-Tuples for which counts and means are maintained. Any θ will match with exactly C(5,2) = 10 of these.

In all the experiments in this study, as in [Lucas_Liu_Bravi_Gaina_Woodward_Volz_Perez-Liebana_2019, Lucas_Liu_Perez-Liebana_2018, Kunanusont_Gaina_Liu_Perez-Liebana_Lucas_2017], we use 1-Tuples, 2-Tuples and N-Tuples in the model. An N-Tuple matches on all parameters, so is unique for each θ. The predicted value of the model for any new θ is the arithmetic mean across all matching tuples.
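The statistics store described above can be sketched as follows. This is a minimal reading of the model (a visit count and running mean per tuple-setting), not the authors' code; the class and method names are hypothetical.

```python
from itertools import combinations

class NTupleModel:
    """N-Tuple statistics: 1-Tuples, 2-Tuples and the full N-Tuple,
    each keeping a visit count and a running mean of results."""

    def __init__(self, n_params):
        idx = list(range(n_params))
        self.tuples = [(i,) for i in idx] + list(combinations(idx, 2))
        if n_params > 2:
            self.tuples.append(tuple(idx))   # the full N-Tuple
        self.stats = {t: {} for t in self.tuples}

    def update(self, setting, result):
        for t in self.tuples:
            key = tuple(setting[i] for i in t)
            count, mean = self.stats[t].get(key, (0, 0.0))
            count += 1
            mean += (result - mean) / count  # incremental running mean
            self.stats[t][key] = (count, mean)

    def predict(self, setting):
        """Vanilla NTBEA estimate: the simple average over all matching
        tuples that have at least one observation."""
        means = []
        for t in self.tuples:
            key = tuple(setting[i] for i in t)
            if key in self.stats[t]:
                means.append(self.stats[t][key][1])
        return sum(means) / len(means) if means else 0.0
```

For 5 parameters with 10 values each this holds 5 one-tuple tables, 10 two-tuple tables and one full-tuple table, matching the 16-tuple average discussed in the Hypothesis section.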

2.2.2 UCB

The UCB1 algorithm [Auer_Cesa-Bianchi_Fischer_2002] calculates a probable upper bound on the true value of an ‘arm’ of a bandit, given the data observed so far, using (3):

    UCB(θ) = f̄(θ) + k √( ln N / n(θ) )    (3)

where f̄(θ) is the mean result observed so far, N is the total number of trials of the bandit, and n(θ) is the number of times this ‘arm’ has been pulled (i.e. the number of times that θ has been evaluated).
The N-Tuple model uses equation (2) to calculate the first, exploitation term, but we still have the second term of equation (3) that controls exploration. We can calculate this for each individual tuple, with N equal to the total number of NTBEA iterations, and n equal to the number of these for which the tuple matches θ; i.e. n_i(θ) in (2). NTBEA calculates the second term for each matching tuple, and then takes the arithmetic average. There is one additional nuance in that some tuples will never have been evaluated, and formally (3) will return infinity in this case. To avoid this an additional hyper-parameter ε is added, so that

    UCB(θ) = f̄(θ) + k √( ln N / (n(θ) + ε) )    (4)

In this study, as in [Lucas_Liu_Bravi_Gaina_Woodward_Volz_Perez-Liebana_2019, Lucas_Liu_Perez-Liebana_2018, Kunanusont_Gaina_Liu_Perez-Liebana_Lucas_2017], we use the same value of ε as those works. The value of the exploration constant k needs to be scaled to the range of f, and is set for each domain (see Method section).

3 Hypothesis

Vanilla NTBEA estimates the value of a parameter setting as the simple arithmetic mean of all the Tuples in the model that match. For example, if we have five parameters and are using 1-, 2- and N-Tuples then any θ will have one matching N-Tuple (where N = 5), five matching 1-Tuples and ten matching 2-Tuples. The statistics gathered for each of these 16 Tuples are then averaged. The same approach applies to calculating the exploration estimate using (3). Even if we have evaluated a specific θ multiple times, the results from those evaluations still only comprise 1/16 of the NTBEA estimate; the remainder always comes from the matching lower-level Tuples. Our hypothesis is that NTBEA will better estimate the value of a parameter setting if it applies greater weight to the more specific tuples as the number of evaluations increases. In the limit of a large number of evaluations of a specific θ, only the statistics from the fully-matching N-Tuple should be relevant.

We propose four distinct weighting schemes, which vary in the rate of decay in the influence of less-specific tuples. In all cases the value of a parameter setting θ with N different parameters is

    f̂(θ) = w_N f̂_N(θ) + ((1 − w_N) / |T_{N−1}|) Σ_{t ∈ T_{N−1}} f̂_t(θ)    (5)

where f̂_N(θ) is the average value from the N-Tuple statistics of θ and w_N is the weight used for the N-Tuple statistics. The remaining 1 − w_N weight is applied to the average of all (N−1)-Tuples, i.e. all Tuples on the next level down; |T_{N−1}| (a slight abuse of notation) refers to the number of such Tuples. In the case that no Tuples are held at the N−1 level, this descends to the next level for which we do have Tuples in the NTBEA model. Note that (5) is recursive: each of the f̂_t terms is calculated by weighting its own Tuple statistics against a sum over the level below.

In our example, vanilla NTBEA always weights the 5-Tuple, the ten 2-Tuples and the five 1-Tuples at 1/16 each. Using (5) this weighting will change as we gain more information. With no evaluations, the weight for any Tuple will be 0, and as the number of evaluations for a Tuple increases we want to increase its weight towards a maximum of 1, so that asymptotically we ignore information from lower-level Tuples.

The four weighting schemes use linear, inverse, inverse square-root and exponential decay functions.

Figure 1: The four weighting functions used in Weighted-NTBEA. The x-axis is the number of evaluations n that match the tuple, and the y-axis is the weighting applied to the tuple, between 0 and 1. The remainder of the value is calculated from the average of the next level of tuples. The same value of k is used in all cases.

  1. Linear: w(n) = min(1, n/k)    (6)

  2. Inverse-root: w(n) = 1 − 1/√(1 + n/k)    (7)

  3. Inverse: w(n) = 1 − 1/(1 + n/k)    (8)

  4. Exponential: w(n) = 1 − e^(−n/k)    (9)

These functions are sketched in Figure 1. They have the desired properties that w(0) = 0 (when no evaluations have been conducted that match the tuple), and w(n) → 1 as n → ∞. They differ in the rate at which this decay happens, which in all cases is parameterised by some k. The Linear decay is most draconian, and will ignore any information from lower-level tuples once n ≥ k, while under Inverse-root decay lower-level tuples retain a residual weight of 0.71 after k evaluations. For all experiments in this study we use a single fixed value of k. This is somewhat arbitrary, but scaled to be about 5% of the total iterations in the smallest experiments, which have a budget of about 300 NTBEA iterations.
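The four schemes can be written out as below. The linear and inverse-root forms follow directly from the properties stated in the text (zero residual weight once n ≥ k, and a 0.71 residual at n = k respectively); the inverse and exponential forms are plausible analogues, an assumption rather than the paper's exact formulae.

```python
import math

def weight(n, k, scheme):
    """Weight given to a tuple after n matching evaluations, with decay
    constant k. All schemes satisfy w(0) = 0 and w -> 1 as n grows;
    the 'inverse' and 'exponential' forms here are assumed analogues."""
    if scheme == "linear":
        return min(1.0, n / k)
    if scheme == "inverse_root":
        return 1.0 - 1.0 / math.sqrt(1.0 + n / k)
    if scheme == "inverse":
        return 1.0 - 1.0 / (1.0 + n / k)
    if scheme == "exponential":
        return 1.0 - math.exp(-n / k)
    raise ValueError(f"unknown scheme: {scheme}")
```

The remaining 1 − w(n) weight is what flows down to the average of the next level of tuples in the recursive estimate (5).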

Parameter Planet Wars I Asteroids I Planet Wars II Asteroids II
Sequence Length 5, 10, 15, 20, 25, 30 5, 10, 15, 20, 50, 100, 150 7, 10, 13, 16, 20, 25, 30 50, 75, 100, 125, 150, 200
Mutated Points 0, 1, 2, 3 0, 1, 2, 3 1, 2, 3, 5, 10, 15, 20 1, 2, 3, 5, 10, 20, 30, 50
Resample 1, 2, 3 1, 2, 3 1, 2, 3 1, 2, 3
Flip One Value false, true false, true false, true false, true
Use Shift Buffer false, true false, true false, true false, true
Mutation Transducer false false false, true false, true
Repeat Prob. - - 0.2, 0.4, 0.6, 0.8 0.2, 0.4, 0.6, 0.8
Discount Factor 1.0 1.0 1.0, 0.999, 0.99, 0.95, 0.9 1.0, 0.999, 0.99, 0.95, 0.9
Parameter Space size 288 336 23,520 23,040
Table 1: Parameter space for RHEA in Planet Wars and Asteroids game experiments. The first two columns for the I experiments are as in [Lucas_Liu_Bravi_Gaina_Woodward_Volz_Perez-Liebana_2019]. The optimal values found for the games in that paper and in [Lucas_Liu_Perez-Liebana_2018] are in bold.

4 Method

We apply each of the decay functions (6)–(9) to a number of different optimisation problems to determine whether our hypothesis holds and the modified model does converge faster and more robustly than vanilla NTBEA. By using a number of different problems we seek to test that any improvement generalises, and is not specific to one domain. A secondary goal is exploratory: to see if the four different weighting functions have varying patterns of performance.

4.1 Benchmark functions

We test on four benchmark functions from the global optimisation literature [dixon1978global, Jones_Schonlau_Welch_1998]. These are interesting non-convex functions for which we can calculate the true value, and hence judge the performance of the NTBEA variants. Some amendments are needed to the original functions:

  1. These are all deterministic functions with no noise. To convert them to a stochastic win/lose setting appropriate for a game benchmark we convert the function value to a probability p of a +1 score (a ‘win’), and a probability 1 − p of a -1 score (a ‘loss’).

  2. They are continuous functions in all dimensions. We discretise by taking values at equally spaced intervals for each dimension.

  3. Global optimisation seeks to minimise a function. To maximise we multiply by -1.
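The noise amendment in step 1 amounts to a small wrapper. A minimal sketch, assuming the benchmark has already been negated (step 3) and scaled so that dividing by `scale` yields a value in [0, 1]:

```python
import random

def noisy_win_loss(f, scale):
    """Wrap a deterministic benchmark f as a stochastic +1/-1 outcome:
    the scaled function value becomes the win probability."""
    def g(x):
        p = max(0.0, min(1.0, f(x) / scale))  # clamp to a valid probability
        return 1 if random.random() < p else -1
    return g
```

Each call to the wrapped function is then a single noisy ‘game’, while the exact win probability remains known for judging the optimiser's recommendation.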

We outline the four functions below. A complete description is in [dixon1978global].

  • Hartmann-3. A three-dimensional function with four local optima. Two of these optima are close in value, with one slightly higher. In the original problem the output range is [0.0, 3.59], so we divide by 4.0 to get a value between 0 and 1. We split all three dimensions into ten equally spaced discrete values, for a total parameter-space size of 1000 with a true optimal value of 0.897.

  • Hartmann-6. A six-dimensional function with four optima similar to Hartmann-3. We apply the same modifications as with Hartmann-3, and discretise each dimension into five equally spaced values, for a parameter-space of size 15,625.

  • Branin. A two-dimensional function with three global maxima at 0.4. We split each dimension into 20 equally spaced intervals to get a parameter-space of 400. We add 10 to the result and divide by 12, with a floor at 0, to get a valid range for the win probability. In this case only 14.8% of the 400 points are non-zero.

  • Goldstein-Price. A two-dimensional function with one global maximum, and several local ones. We split each dimension into 20 equally spaced intervals to get a parameter-space of 400. We add 400 to the result and divide by 500, with a floor at 0, to get a valid range for the win probability. 13.3% of the 400 points are non-zero.

In all cases we try each weighting function, plus vanilla NTBEA, on each benchmark function with 300, 1000 and 3000 iterations. For each setting we run NTBEA 1000 times, and record both the estimated value (by NTBEA) of the finally selected setting and its actual value. The exploration constant k in (4) is held fixed across these benchmark experiments.

4.2 Game Parameters

Lucas et al. 2019 [Lucas_Liu_Bravi_Gaina_Woodward_Volz_Perez-Liebana_2019] compare NTBEA against several other popular optimisation algorithms in two games: Planet Wars and Asteroids. They optimise a Rolling Horizon Evolutionary Algorithm (RHEA) to find the best setting to win the 2-player Planet Wars (+1 for a win, and -1 for a loss), and also to obtain the highest score in 2000 game-ticks in the 1-player Asteroids. For comparable results we use exactly the same games and settings. The exploration constant k in (4) is set separately for Planet Wars and for Asteroids.

In Planet Wars each player has a number of planets which generate ships at a constant rate. Players send ships from a planet to invade another, and to win the game they must conquer all planets. In Asteroids the player controls a ship which can rotate and shoot to destroy surrounding asteroids. Points are gained for shooting asteroids, and if one collides with the player then a life is lost; after three lost lives the game ends. The details of the gameplay are not central to this study, and more details can be found in [Lucas_Liu_Perez-Liebana_2018, Lucas_Liu_Bravi_Gaina_Woodward_Volz_Perez-Liebana_2019].

RHEA is optimised over five parameters in [Lucas_Liu_Bravi_Gaina_Woodward_Volz_Perez-Liebana_2019], which are listed in Table 1. Each optimisation algorithm was permitted 288 evaluations in Planet Wars, and 336 in Asteroids. This allowed Grid Search to run one game for each parameter setting. We repeat these experiments up to 100 times for each game and each weighting function. We record the parameter setting that is chosen each time. To get a good estimate of the actual value of the 288 and 336 possible settings it is feasible to run 1000 games for each setting of Planet Wars and 500 for Asteroids, although this takes 6 days to run for Asteroids, illustrating the value of a rapid optimiser.

These small parameter spaces of 288 and 336 have the advantage of permitting a good estimate of the ‘best’ setting to be found by brute-force computation, but they are not representative of the larger spaces in real problems. For example, when optimising RHEA for a Game of Life variant, the authors of [lucas2019local] use NTBEA with 100 evaluations in a space of size 28,800. As a final experimental set we add further parameters to RHEA (discount factor, mutation transducer and repeat probability) from [lucas2019local], and extend the other parameters to give a larger overall space as detailed in Table 1 in the ‘II’ columns. These extensions were fixed after seeing the results of the first set of experiments (the ‘I’ columns) to focus on areas with higher performance. For Planet Wars we increased the concentration of Sequence Length options around the optimal 10-15 range, and in Asteroids we did the same around the optimal 100 value. We also increased the upper range of Mutated Points significantly, especially for Asteroids where the optimal value of 3 was the highest possible.

For these larger parameter spaces we used a budget of about 20,000 total iterations to try different overall approaches:

  • 10 runs of 2,000 iterations each

  • 3 runs of 7,000 iterations each

  • 2 runs of 10,000 iterations each

  • 1 run of 20,000 iterations

Given the size of the parameter spaces it was not feasible to estimate an accurate value for all parameter settings. Instead we do this (by running 1000 or 500 games for Planet Wars and Asteroids respectively) for just the settings suggested by any of these runs. The purpose of these experiments is to understand how best to spend an available budget of iterations. Should we use them in a single NTBEA run, or spread them over several runs and then pick the best of the suggestions? This is motivated by an observation from Deep Reinforcement Learning research, in which the random seed can have a major effect on the outcome of the algorithm, and results are often reported using ‘best of N’ runs [Henderson_Islam_Bachman_Pineau_Precup_Meger_2018].
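The budget-splitting strategy being compared can be sketched as follows. All the callables here are hypothetical stand-ins: `run_ntbea` performs one independent NTBEA run of a given iteration budget and returns its recommended setting, and `evaluate_mean` averages the expensive game over a number of plays.

```python
def best_of_n(run_ntbea, evaluate_mean, total_budget, n_runs, eval_games):
    """Spend most of the budget on several short independent NTBEA runs,
    then use the remainder to re-evaluate each recommendation and keep
    the best. Returns the winning recommended setting."""
    per_run = (total_budget - n_runs * eval_games) // n_runs
    recommendations = [run_ntbea(per_run) for _ in range(n_runs)]
    return max(recommendations,
               key=lambda rec: evaluate_mean(rec, eval_games))
```

With n_runs = 1 this degenerates to the single-run strategy; the Results section compares splits such as 10 × 2,000 against 1 × 20,000 within a ~20,000-iteration budget.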

5 Results

5.1 Benchmark functions

Table S1 in the Supplementary Material tabulates the numeric means and confidence intervals for the NTBEA experiments on the four benchmark functions with added noise. Figure 2 displays boxplots of the true value of the NTBEA-recommended parameters for each benchmark function and weighting function (1000 NTBEA runs for each, at 300, 1000 and 3000 iterations).

  • Hartmann-3. This appears to be the easiest of the four functions for NTBEA to optimise, with 300 iterations achieving a mean value of 0.862 against a maximum of 0.897 for vanilla NTBEA (STD) and the Linear and Inverse-root weighting functions. With 3000 iterations all of the variants obtain a mean score of between 0.88 and 0.89; in all cases 25% to 35% of all runs recommend one of the three top parameter settings, with actual values between 0.895 and 0.897.

  • Hartmann-6. This is harder to optimise, with a clear progression in score as iterations increase from 300 to 3000. Vanilla NTBEA is a clear winner at only 300 iterations, and the Inverse-root and Inverse weighting functions are joint top with the Vanilla version at 3000 iterations (in a parameter space of size 15,625). The Linear weighting function does very poorly in comparison.

  • Branin. As with Hartmann-6, Vanilla NTBEA is a clear winner at 300 iterations, and is joint top with the Inverse-root and Inverse weighting functions at 3000 iterations. The parameter space is only 400.

  • Goldstein-Price. The same pattern is repeated here. Vanilla NTBEA is best for a small number of iterations, and all except the Linear weighting function are equally good with 3000 iterations to explore a parameter space of size 400.

The key finding is that here vanilla NTBEA (‘STD’ in Figure 2) is always the best or joint best for any combination of benchmark function and number of iterations, and is particularly effective for smaller numbers of iterations.

Figure 2: Boxplots for the true Score of settings recommended by NTBEA after 300, 1000 and 3000 iterations in each of the four benchmark functions.

5.2 Games

NTBEA Runs Iterations Game Mean S Dev 95% Interval Delta 95% Interval Top6
STD 100 288 Planet Wars 0.655 0.079 0.640 0.671 -0.185 -0.203 -0.167 60%
LIN 100 288 Planet Wars 0.615 0.111 0.593 0.636 0.187 0.164 0.211 44%
INV 100 288 Planet Wars 0.630 0.110 0.610 0.656 0.086 0.061 0.107 53%
SQRT 100 288 Planet Wars 0.633 0.097 0.616 0.653 -0.017 -0.036 0.001 51%
EXP 100 288 Planet Wars 0.643 0.091 0.625 0.663 0.130 0.107 0.151 58%
STD 62 336 Asteroids 9596 67 9580 9613 -709 -741 -680 94%
LIN 66 336 Asteroids 9577 87 9556 9598 129 104 155 88%
INV 68 336 Asteroids 9584 77 9567 9604 -27 -52 -3 87%
SQRT 69 336 Asteroids 9563 118 9536 9591 -248 -274 -222 81%
EXP 67 336 Asteroids 9570 82 9552 9590 104 80 125 87%
Table 2: Results for Weighted NTBEA variants with Planet Wars and Asteroids. Mean is the estimated value of the final recommended parameter setting from 1000 offline games, with a 95% confidence interval. Delta is the average difference to the NTBEA-estimated value of this point in the N-Tuple model, with a 95% confidence interval. All confidence intervals are calculated with a basic bootstrap. Bold entries indicate the best performing variants (within confidence bounds) for each game. LIN is the Linear weighting function; INV is Inverse, SQRT is the Inverse-root and EXP is the Exponential. STD is vanilla NTBEA. Top6 is the percentage of runs that recommended one of the Top 6 parameter settings as estimated from the 1000 games run for each.

Table 2 shows the results for the Planet Wars I and Asteroids I experiments, with 288 and 336 NTBEA iterations on similarly sized parameter spaces. Figures 3 and 4 show box plots of the data. These are averaged over 100 runs for each setting for Planet Wars, and between 62 and 69 runs for Asteroids (the number that completed in an 84-hour window). For Planet Wars vanilla NTBEA gives both the best and the most reliable (i.e. lowest standard deviation) results. The Exponential decay variant is the only one with performance within the 95% confidence interval of vanilla NTBEA. The single highest parameter setting gives a score of 0.732, with 6 of the 288 settings having a score of 0.65 or higher averaged over 1000 games. Since we have run 1000 games for each of the 288 settings and then picked the highest result, the 0.732 will be an over-estimate. Apart from the Linear-weighted variant, all algorithms pick one of the top 6 settings between 50% and 60% of the time.

For Asteroids the results are quite similar. Vanilla NTBEA gives the best result with the smallest standard deviation. One of the variants is within the 95% confidence interval, but in this case it is the Inverse weighting function. In both games it is clear, as in the benchmark function results, that vanilla NTBEA gives the best recommended parameter setting despite giving a very poor estimate of the absolute value that the recommendation will provide when used.

The 95% confidence intervals in Table 2 are calculated on the basis that the estimated values of each parameter setting are exact. This was true for the benchmark functions in Table S1, but is not true here due to noise in these estimates from averaging across 1000 or 500 independent games. We do not have an estimate of this additional uncertainty.

Encouragingly, we obtain exactly the same optimal parameter settings for both games as those found in the original work (highlighted in Table 1) [Lucas_Liu_Bravi_Gaina_Woodward_Volz_Perez-Liebana_2019, Lucas_Liu_Perez-Liebana_2018]. However, we measure rather higher values for these settings in game play than the scores reported there, both for 288 iterations of NTBEA on Planet Wars and for Asteroids. The reason for this discrepancy is not clear, but we do not believe it affects the key conclusions of this study.

Figure 3: Boxplots for the estimated true Score of settings recommended by NTBEA after 288 iterations in Planet Wars (top), and the Delta of the NTBEA predicted value to this (bottom). The red horizontal line marks a Delta of 0.0, indicating perfect prediction by NTBEA.
Figure 4: Boxplots settings recommended by NTBEA after 336 iterations in Asteroids. Key as in Figure 3.
Game NTBEA Iterations Runs Best score Mean SD 95% Bounds
Planet Wars STD 1000 20 0.772 0.707 0.045 0.688 0.727
Planet Wars LIN 1000 20 0.752 0.679 0.067 0.652 0.711
Planet Wars INV 1000 20 0.788 0.694 0.070 0.665 0.728
Planet Wars SQRT 1000 20 0.762 0.712 0.035 0.697 0.728
Planet Wars EXP 1000 20 0.774 0.681 0.061 0.656 0.708
Planet Wars STD 3000 7 0.762 0.709
Planet Wars LIN 3000 7 0.762 0.718
Planet Wars INV 3000 7 0.762 0.708
Planet Wars SQRT 3000 7 0.760 0.714
Planet Wars EXP 3000 7 0.774 0.735
Planet Wars STD 10000 2 0.756 0.717
Planet Wars LIN 10000 2 0.748 0.736
Planet Wars INV 10000 2 0.756 0.747
Planet Wars SQRT 10000 2 0.756 0.740
Planet Wars EXP 10000 2 0.770 0.748
Planet Wars STD 20000 1 0.708
Planet Wars LIN 20000 1 0.640
Planet Wars INV 20000 1 0.674
Planet Wars SQRT 20000 1 0.732
Planet Wars EXP 20000 1 0.632
Asteroids STD 1000 20 9815 9701 63 9675 9728
Asteroids LIN 1000 20 9776 9655 89 9617 9694
Asteroids INV 1000 20 9803 9706 70 9690 9722
Asteroids SQRT 1000 20 9811 9702 68 9676 9736
Asteroids EXP 1000 20 9819 9620 125 9569 9673
Asteroids STD 3000 7 9804 9707
Asteroids LIN 3000 7 9804 9764
Asteroids INV 3000 7 9835 9778
Asteroids SQRT 3000 7 9817 9736
Asteroids EXP 3000 7 9818 9758
Asteroids STD 10000 2 9705 9705
Asteroids LIN 10000 2 9709 9612
Asteroids INV 10000 2 9804 9801
Asteroids SQRT 10000 2 9817 9814
Asteroids EXP 10000 2 9779 9762
Asteroids STD 20000 1 9735
Asteroids LIN 20000 1 9783
Asteroids INV 20000 1 9783
Asteroids SQRT 20000 1 9815
Asteroids EXP 20000 1 9815
Table 3: Results for Weighted NTBEA variants with Planet Wars and Asteroids over larger parameter spaces. Mean is the estimated value of the final recommended parameter setting from 1000/500 offline games for Planet Wars/Asteroids, with 95% confidence intervals calculated with a basic bootstrap. Bold entries indicate the best performing variants (within confidence bounds) for each game. Best score is the best individual result for any of the runs for that line.

Table 3 shows the results from the Planet Wars II and Asteroids II experiments with larger, more realistic parameter spaces to explore. There were 142 unique parameter settings recommended by the 150 NTBEA runs for the Planet Wars II experiments, and an estimated value for each of these was calculated by averaging 1000 runs of the game. The best estimated scores of the recommended parameter settings have increased to 0.77, compared to the best possible score of 0.73 for Planet Wars I, so the additional parameters enable RHEA to play the game better, if we can efficiently explore the space.

For Planet Wars vanilla NTBEA gives the best mean result at 1k iterations, and does not give significantly different results at more iterations (within 95% error bounds). The same caveat applies to these error bounds as in Table 2 as they do not include the additional uncertainty from the average over 1000 runs used to estimate the value of the final parameter settings.

The Inverse-root weighting function matches vanilla NTBEA at 1k, and at 3k all variants at least match vanilla performance, with the Exponential weighting being the best. These results make clear that there is a high level of uncertainty in any individual NTBEA run. The best of the 20 vanilla runs at 1k gives a parameter setting that scores 0.772 over 1000 games, and the worst scores a mere 0.616. This remains true at 10k and 20k iterations, with three of the five 20k runs recommending parameters that score less than 0.7.

Even with a large number of iterations any single NTBEA run may give a relatively poor result. Given a fixed budget of games with which to optimise over a parameter space, Table 3 suggests that it is not a good idea to put the whole budget into a single NTBEA run. It is far better to execute several NTBEA runs with a small number of iterations each, and then use the remaining game budget to estimate the true value of each recommendation and pick the best.

This is reinforced when we look at the Asteroids results in Table 3. Vanilla NTBEA is joint best at 1k iterations, and the mean score does not increase significantly for higher numbers of iterations. At higher iterations all variants except the Linear function are at least as good, but not necessarily reliably better. In the Asteroids case there is an effective maximum score of 10000 when we use 2000 game ticks as here, so with all the mean and best results in the 9700 to 9800 range, the optimisation does not have much room to work, especially when we add noise.

6 Discussion

In all four of the benchmark functions, and in both games across small and large parameter spaces, vanilla NTBEA is at least as good as the weighting variants tried for small numbers of iterations, and usually better, with lower variance in results. As the number of iterations increases this effect shrinks, and in some cases one of the weighting variants can be significantly better: for example, Inverse-root with 1000 iterations on the Hartmann-6 function, or the Exponential function with 3000 iterations in Asteroids II. However, this is cherry-picking. Furthermore, the weighting variants introduce complexity with a new hyper-parameter to be specified.

When we optimise an expensive function such as game performance over a parameter space we are deliberately trying to use a small number of iterations. Vanilla NTBEA works best in this situation, and we conclusively reject the hypothesis that improving the N-Tuple model with these weighting functions improves either reliability or performance.

We do not reject the hypothesis that the variants provide a better estimate of the true value of a parameter setting. Across all benchmark functions and game environments vanilla NTBEA provides very poor estimates of the actual value, under-estimating by a very large margin because it is averaging over all possible Tuple matches. The Inverse and Inverse-root weighting functions consistently do a much better job of estimating the value of their recommendation. However, this is not as important when our key objective is to get a good recommendation; we can always go on to get a good estimate of its value later.
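The difference between the two estimators can be sketched as below. The stats layout and the weight function are illustrative assumptions, not the paper's implementation; the sketch only shows why averaging over all matching Tuples drags the vanilla estimate down.

```python
def vanilla_estimate(tuple_stats):
    """Vanilla NTBEA value estimate for a parameter setting: a simple
    average of the mean reward of every matching Tuple."""
    means = [s["total"] / s["count"] for s in tuple_stats if s["count"] > 0]
    return sum(means) / len(means)

def weighted_estimate(tuple_stats, weight):
    """Weighted variant: each matching Tuple's mean reward is weighted
    by some function of how many evaluations it has seen."""
    num = den = 0.0
    for s in tuple_stats:
        if s["count"] > 0:
            w = weight(s["count"])
            num += w * (s["total"] / s["count"])
            den += w
    return num / den

# One well-visited specific Tuple scoring 0.9, plus four sparse general
# Tuples scoring around 0.3: the simple average lands well below 0.9.
stats = [{"total": 90.0, "count": 100}] + [{"total": 0.3, "count": 1}] * 4
print(vanilla_estimate(stats))                # 0.42: a large under-estimate
print(weighted_estimate(stats, lambda n: n))  # close to 0.9
```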

Linear weighting is clearly worse than the other options, which never entirely exclude the contributions from less-specific Tuples. This appears to be because, once the Linear variant has the threshold number of evaluations of a specific setting, it ignores all other data and uses only the average of those evaluations. With a larger number of iterations what often happens is that sequential iterations focus on the current best estimate until its mean falls sufficiently and the focus shifts to another setting. With noisy function evaluations this often leads to a recommendation with a smaller number of trials (but more than the threshold) that happens to currently have a high estimate. Hence the recommendation is optimistic, because it picks the best (stochastic) estimate across all options with more than the threshold number of evaluations, and we can see this reflected in the general over-estimate of the value of its recommendation (a version of the ‘winner’s curse’). This effect is less evident for the other weighting functions, as they never let the weighting of other Tuples fall to zero.
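The qualitative difference can be seen from plausible shapes for the four decay functions, as a function of the visit count n of the more-specific Tuples and a threshold hyper-parameter n0. The exact functional forms here are assumptions for illustration; the key property is which ones can reach exactly zero.

```python
import math

def linear_weight(n, n0):
    # falls to exactly zero once n reaches n0: from then on,
    # less-specific Tuples contribute nothing at all
    return max(0.0, 1.0 - n / n0)

def inverse_weight(n, n0):
    return 1.0 / (1.0 + n / n0)

def inverse_root_weight(n, n0):
    return 1.0 / math.sqrt(1.0 + n / n0)

def exponential_weight(n, n0):
    return math.exp(-n / n0)

# Only the linear weight ever reaches exactly zero; the other three
# decay but always leave some weight on the less-specific Tuples.
assert linear_weight(10, 10) == 0.0
assert all(f(10, 10) > 0.0
           for f in (inverse_weight, inverse_root_weight, exponential_weight))
```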

7 Conclusion and Future Work

We hypothesised that adding a recursive weighting function to apply to Tuples in NTBEA would improve performance in parameter optimisation in terms of quality and reliability of a recommended (optimised) parameter setting and in providing a more accurate estimate of the value of this. We tried four different weighting functions with different decay characteristics (linear, inverse, inverse-root and exponential) across four benchmark functions from the function optimisation literature, and two games with two distinct sizes of parameter space.

Across all ten experiments we found no evidence that the proposed weighting functions improved NTBEA, except on the least important criterion: providing a better estimate of the true value of the parameter setting recommended by the optimising process. On the contrary, we found strong evidence that vanilla NTBEA is better able than the weighting function variants to reliably find a higher quality recommendation. This is especially true for the smaller numbers of iterations that would tend to be used in real-world applications.

Finally we investigated how best to use a fixed budget of NTBEA iterations in the Planet Wars and Asteroids games. These experiments showed that any individual NTBEA run may give a poor recommendation, and that it is better to execute several NTBEA runs with a smaller number of iterations each, then use the remaining budget to estimate the value of their recommendations more accurately and pick the best.

We have not explored different values of the hyper-parameter introduced to determine how the weighting function is used, and it is possible that other values may perform better. There are other, more adventurous options to improve the N-Tuple model, such as regression across the tuples to determine which ones are important. The updated model in this paper still assumes that each Tuple at a given level is equally important: if we have no data for the full N-Tuple then we average across all matching 2-Tuples, when in practice some of these may be more important than others. One approach to try would be to construct a regression model across the tuples to up-weight the ones that better predict the observed results. We have also not changed the exploration model, which averages across all matching tuples as in vanilla NTBEA. It could be worthwhile to experiment with different noise models, for example using a square root instead of a log function in Equation (3), which has been found useful in other areas where exploration is more important than exploitation [Tolpin_Shimony_2012].
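One possible instantiation of that regression idea, purely as an illustration (the binary match-feature encoding and the plain gradient-descent fit are assumptions, not a tested design): each evaluation is encoded by which Tuples it matched, and a linear model learns per-Tuple weights so that more predictive Tuples receive larger weight.

```python
def fit_tuple_weights(match_rows, observed, lr=0.1, steps=2000):
    """Fit per-Tuple weights by least squares via gradient descent.
    match_rows[i][j] = 1 if evaluation i matched Tuple j, else 0;
    observed[i] is the result of evaluation i."""
    n_tuples = len(match_rows[0])
    w = [0.0] * n_tuples
    for _ in range(steps):
        grad = [0.0] * n_tuples
        for row, y in zip(match_rows, observed):
            err = sum(wj * xj for wj, xj in zip(w, row)) - y
            for j, xj in enumerate(row):
                grad[j] += 2 * err * xj
        for j in range(n_tuples):
            w[j] -= lr * grad[j] / len(match_rows)
    return w

# Tuple 0 matches exactly the high-scoring evaluations, Tuple 1 the
# low-scoring ones, so the fit should up-weight Tuple 0.
rows = [[1, 0], [1, 0], [0, 1], [0, 1]]
results = [1.0, 1.0, 0.0, 0.0]
weights = fit_tuple_weights(rows, results)
print(weights)  # close to [1.0, 0.0]
```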

8 Acknowledgments

This work was funded by the EPSRC CDT in Intelligent Games and Game Intelligence (IGGI) EP/S022325/1.