
Rolling Horizon Evolutionary Algorithms for General Video Game Playing

Game-playing Evolutionary Algorithms, specifically Rolling Horizon Evolutionary Algorithms, have recently managed to beat the state of the art in performance across many games. However, the best results per game are highly dependent on the specific configuration of modifications and hybrids introduced over several works, each described as parameters in the algorithm. Moreover, the search for the best parameters has been reduced to several human-picked combinations, as the possibility space has grown beyond exhaustive search. This paper presents the state of the art in Rolling Horizon Evolutionary Algorithms, combining all modifications described in the literature, plus some additional ones, into one large hybrid. It then uses a parameter optimiser, the N-Tuple Bandit Evolutionary Algorithm, to find the best combination of parameters in 20 games with various properties from the General Video Game AI Framework. We highlight the resulting noisy optimisation problem, as both the games and the algorithm being optimised are stochastic. We then analyse the algorithm's parameters and interesting combinations revealed through the parameter optimisation process. Lastly, we show that it is possible to automatically explore a large parameter space and find configurations which outperform the state of the art on several games.





I Introduction

In this paper we revisit the application of game-playing Evolutionary Algorithms with a deeper analysis of algorithm modifications and hybrids. We further argue that automatic exploration of algorithm variations is essential for optimisation problems with large search spaces, even though such exploration still cannot be exhaustive due to computation speed limitations. There have been several recent advances in game-playing Evolutionary Algorithms [2, 12] and a multitude of modifications and hybrids proposed to improve performance across a large number of games. As a result, the possibility space for algorithm parameters has grown beyond what manual optimisation can handle efficiently. Although performing grid search is sometimes possible for finding good values for some parameters [7], more recent works have needed to restrict their analysis to ever fewer parameter combinations [11]. Therefore, the interesting insights into which variation of the algorithm is actually the best are limited to human exploration of very small sections of the search space.

The specific novel application of Evolutionary Algorithms as game-playing methods (referred to as Rolling Horizon Evolutionary Algorithms, or RHEA) was introduced for the first time in 2013 by Perez et al. [25]. In the context of playing games, RHEA evolves, at every game step, a sequence of actions to play in the game; the first action of the best sequence found is played at the end of the evolutionary process and a new sequence is evolved for the subsequent game step. This base algorithm has been extended in several works. Gaina et al. [7] performed an in-depth analysis of the algorithm’s main parameters (population size and individual length), generally finding that the higher the parameter values, the better RHEA performs across several games; this work further highlights an increase in performance with available budget and correspondingly higher parameter values. Different population initialisation methods were explored in [8]; this work was important in highlighting the benefit of different options in different game types, as some games saw increased performance with greedy initialisation, while others preferred a statistical approach instead. Furthermore, Gaina et al. tested in [9] various hybrids and combinations with other techniques, which further pinpointed not only the difference in performance of certain parameter configurations across the different games, but also that the RHEA parameter space was already being expanded beyond the possibility of exhaustively exploring all parameter combinations. Some of these enhancements were further tested by Santos et al. [28] in General Video Game AI (GVGAI) and by Tong et al. [31] in MuJoCo’s physical control tasks, both with great success. 
Finally, a study on dynamically adjusting individual length based on the fitness landscape observed during evolution [11] shows that some parameters might be conflicting with each other and cause poor performance in some games; this suggests a need for carefully constructed parameter search spaces, as well as potential extra engineering for all combinations to work as expected.

The work in this paper is carried out within the domain of general video game playing, which focuses on finding general-purpose Artificial Intelligence players that are able to play any game, even those unseen previously. The concepts behind this could be further extended to general AI which is able to solve any given task (as opposed to any given game), as methods developed for games have been shown to be applicable to wider domains, such as chemistry [29]. Two large categories of players can be differentiated in this domain: planning and learning. The latter requires training for several episodes on a game before it can figure out how to play it, which is often an expensive process leading to narrow results: the agent trained for one game would be unlikely to be able to play another without significant training on the new game. The former category, which RHEA belongs to, refers to methods which work online, during the game, to search for the appropriate solutions. These methods require an internal model of the games (referred to as a forward model, or FM) to be able to simulate possible futures and effects of their actions.

Although Monte Carlo Tree Search (MCTS) had for a long time represented the state of the art in general video game playing, RHEA has been shown to outperform MCTS in multiple games in some of its variations [9], while other combinations of modifications led to significantly worse results. As highlighted by Lucas et al. [19], there can be a large difference in performance for the same base algorithm when using different parameters, and optimisation is key. Ashlock et al. [1] emphasise this in the context of general game playing, where one single method (or single parameter configuration, in our approach) is unlikely to achieve high performance across all possible tasks. Our specific problem is additionally highly noisy: most games are stochastic and the same sequence of actions in a game could lead to different outcomes; furthermore, the algorithm itself is stochastic and may produce different outputs given the same game state.

N-Tuple Bandit Evolutionary Algorithms (NTBEA) [15] have shown robust, high performance in noisy optimisation problems when compared with alternatives, together with high sample efficiency, fast convergence and good scaling to large search spaces [19]. Evaluating AI player performance on a multitude of games can be very expensive; sample efficiency is therefore key, which makes NTBEA suitable for optimising RHEA parameters. The algorithm has previously been employed successfully in a variety of noisy optimisation problems, such as tuning game parameters [16] as well as AI game-player parameters [20, 19, 4]. A highly adaptive system which can optimise its parameters and structure so as to achieve the best performance in various games could easily feed into a generic life-long learning system such as that presented in [10].

To summarise our contributions, first, we give an overview of the current state of the art in Rolling Horizon Evolutionary Algorithms and parameter optimisation within the context of general video game playing. Second, we perform an in-depth analysis of the algorithm’s parameters with respect to its performance across the various games tested. And third, we show that it is possible to automatically search the RHEA parameter space for configurations which outperform the state of the art on specific games.

The rest of this paper is structured as follows: Section II describes the workings of RHEA and NTBEA, as well as introducing our testbed and game set employed in the experiments. Section III details the RHEA modifications considered for optimisation and the resulting parameter space. Section IV presents experiments carried out, followed by a discussion of results. Section V concludes the paper and provides insights for future work.

II Background

This section describes the two key concepts employed in this paper, the Rolling Horizon Evolutionary Algorithm (RHEA) and the N-Tuple Bandit Evolutionary Algorithm (NTBEA), as well as introducing the framework and game set used for experiments.

II-A Rolling Horizon Evolution

RHEA utilises Evolutionary Algorithms (EA) to evolve an in-game sequence of actions at every game tick, with restricted computation time per execution. This subsection will describe the baseline algorithm, often referred to as vanilla; modifications applied are detailed in Section III.

In this application of EAs for game-playing, the genotype is a vector of integers whose length is the individual length, where each integer is bounded by the maximum number of actions available in a given game state. This translates to a phenotype as a sequence of actions played in the game starting from that state, or, in other words, the behaviour of the player. In order to evaluate an individual in this context, RHEA uses the forward model (FM) of the game, an internal model of the world, to simulate through the actions, one at a time. The game state reached at the end is then evaluated with a heuristic function and this value becomes the fitness of the individual: therefore, we are evolving action sequences which lead to the best game outcome, limited to an exploration range given by the individual length. The heuristic function is always kept to a generic form throughout the experiments; it aims to maximise the game score, while favouring wins and discouraging losses (see Equation 1, which is defined over the game state being evaluated and its normalised game score).


Using this method to evaluate individuals, the vanilla algorithm follows a typical EA process. It begins by initialising a population of individuals at random, each of the configured individual length, and evaluates them. At every generation, while budget is still available, it promotes the elite individuals directly to the next generation through elitism. It then generates offspring by repeatedly selecting parents through tournament selection, crossing them with uniform crossover to create a child, and mutating the child through uniform mutation before adding it to the pool of offspring. The best individuals from both parents and offspring pools are added to the next generation and the process repeats. Typically, a fixed time budget per game tick, on the order of milliseconds, is given to the algorithm for real-time decision-making.
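The vanilla loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `forward_model`, `heuristic`, all default parameter values and the fixed generation count (standing in for the real-time budget) are hypothetical.

```python
import random


def evaluate(individual, state, forward_model, heuristic):
    """Roll the action sequence through the forward model and score the final state."""
    for action in individual:
        state = forward_model(state, action)
    return heuristic(state)


def rhea_step(state, forward_model, heuristic, n_actions, rng,
              pop_size=10, length=10, n_elites=1, tournament=3, generations=20):
    """Evolve action sequences for one game tick; return the first action of the best one."""
    def fitness(ind):
        return evaluate(ind, state, forward_model, heuristic)

    population = [[rng.randrange(n_actions) for _ in range(length)]
                  for _ in range(pop_size)]
    population.sort(key=fitness, reverse=True)
    for _ in range(generations):  # assumption: a generation cap replaces the time budget
        offspring = []
        while len(offspring) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(population, tournament), key=fitness)
            p2 = max(rng.sample(population, tournament), key=fitness)
            # Uniform crossover: each gene comes from either parent.
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            # Uniform mutation (simplified: the redraw may return the same value).
            child = [rng.randrange(n_actions) if rng.random() < 1 / length else g
                     for g in child]
            offspring.append(child)
        elites = population[:n_elites]
        rest = sorted(population[n_elites:] + offspring, key=fitness, reverse=True)
        population = elites + rest[:pop_size - n_elites]
        population.sort(key=fitness, reverse=True)
    return population[0][0]
```

For instance, with a toy one-dimensional world where `forward_model = lambda s, a: s + (a - 1)` and `heuristic = float`, calling `rhea_step(0, forward_model, heuristic, 3, random.Random(1))` returns one of the three action indices. Fitness values are recomputed on every comparison for brevity; a practical version would cache them.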

We would further like to highlight the benefits of choosing RHEA over Monte Carlo Tree Search (MCTS), the current favourite among general game-playing methods. RHEA is easily parallelisable, with each individual being independent from the rest and from the search process. It is able to handle continuous actions without modification, which has been previously explored in [27]. It can easily adapt to n-player games with minimal changes through co-evolution [18], and the many possible enhancements and modifications make it highly adaptive and controllable for different given problems.

| Idx | Game | Stoch. | Rewards | Win | Lose | Levels | NPCs | Res. | Actions | [3] | [22] | [30] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dig Dug | x | D | Puzzle/Kill | Timeout | L/Dense | E | | Move+Shoot | 4 | | 5 |
| 1 | Lemmings | | D | Exit/Puzzle | Death | L/Dense | N | | Move+Shoot | 4 | | 5 |
| 2 | Roguelike | x | D | Exit | Death | L/Dense | E | x | Move+Shoot | 4 | | 4 |
| 3 | Chopper | x | D+Disq | Kill | No-kill | L/Dense | E | x | Move+Shoot | | g4-g1 | 2 |
| 4 | Crossfire | x | N | Exit | Death | M/Dense | E | | Move | 2 | | 4 |
| 5 | Chase | | D | Kill | Death | M/Sparse | F+E | | Move | 2 | | 4 |
| 6 | Camel Race | | N | Exit | Timeout | L/Sparse | E | | Move | 2 | | 3 |
| 7 | Escape | | N | Exit/Puzzle | Death | M/Dense | | | Move | 2 | | 3 |
| 8 | Hungry Birds | | Disq | Exit | Timeout | M/Sparse | | HP | Move | | g7-g10 | 3 |
| 9 | Bait | | N | Puzzle/Exit | Timeout | S/Sparse | | x | Move | 4 | | 4 |
| 10 | Wait for Breakfast | | N | Puzzle | Timeout | M/Dense | N | | Move | 2 | | 3 |
| 11 | Survive Zombies | x | D | Timeout | Death | M/Dense | F+E | | Move | 3 | | 4 |
| 12 | Modality | | N | Puzzle | Timeout | S/Dense | | | Move | 3 | | 4 |
| 13 | Missile Command | | D+Disq | Kill | No-kill | M/Sparse | E | | Move+Shoot | 3 | | 2 |
| 14 | Plaque Attack | | D | Kill | No-kill | L/Dense | E | | Move+Shoot | 3 | | 2 |
| 15 | Sea Quest | x | D+Disq | Timeout | Death | M/Dense | F+E | x | Move+Shoot | 3 | | 2 |
| 16 | Infection | x | D | Kill | Timeout | M/Dense | F+E | | Move+Shoot | 1 | | 1 |
| 17 | Aliens | x | D | Kill | Death | M/Dense | E | | LR+Shoot | 1 | | 1 |
| 18 | Butterflies | x | D | Kill | Timeout | M/Dense | F | | Move | 1 | | 2 |
| 19 | Intersection | x | D+Disq | Timeout | Death | L/Dense | E | HP | Move | | g18-g17 | 1 |

TABLE I: Game set including feature analysis. The last 3 columns show clusters as depicted in previous works; games with the same value are part of the same cluster. As the study in [3] does not include all the games we use, column [22] shows the game indexes between which the missing games are placed by Mark Nelson (lower-higher); [30] shows more recent work clustering all GVGAI games.

II-B N-Tuple Bandit Evolutionary Algorithm

NTBEA is a model-based optimiser based on an Evolutionary Algorithm. It begins by randomly initialising a solution, or with a given solution (referred to as the seed, and the process as seeding the algorithm) if specified in its application. It then evaluates one solution at a time, with the evaluation method determined by the specific application, and adds it to its internal N-tuple model; that is, all combinations of parameters in the solution are registered as having observed the fitness of the evaluated solution. We use tuples of three sizes, the largest spanning the full solution length.

A set of neighbours of the solution is then generated through uniform random mutation with a fixed probability, forming its neighbourhood. The fitness of all neighbours is estimated based on previously observed N-tuple values and two statistics are calculated: the average value of the neighbour's tuples, as well as the average number of times each tuple was previously explored. These two statistics are used within a bandit equation (see Equation 2) to choose the next neighbour to become the current solution being evaluated. This equation balances exploiting promising solutions with high fitness values against exploring uncertain solutions seen less often during the process. An exploration constant sets the focus of the algorithm, whether more exploitative or more exploratory. Small random noise is added to each neighbour's final value to break ties randomly.

The process then repeats with the newly chosen solution for a set number of iterations.
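To make the loop concrete, the following is a simplified sketch of an N-tuple bandit optimiser. It is written under several assumptions that are not taken from the text: the tuple sizes (single parameters, pairs and the full solution), a UCB1-style bonus standing in for the bandit equation, and the values of `k`, `eps`, the mutation probability and the neighbourhood size are all hypothetical.

```python
import itertools
import math
import random


class NTupleModel:
    """Fitness statistics for every observed combination of parameter values."""

    def __init__(self, n_params, k=2.0, eps=0.5):
        self.k, self.eps = k, eps  # hypothetical exploration constant and smoothing term
        idx = range(n_params)
        # Assumed tuple sizes: single parameters, pairs, and the full solution.
        self.tuples = ([(i,) for i in idx]
                       + list(itertools.combinations(idx, 2))
                       + [tuple(idx)])
        self.stats = {}  # (positions, values) -> [fitness sum, visit count]
        self.total = 0

    def _keys(self, solution):
        return [(t, tuple(solution[i] for i in t)) for t in self.tuples]

    def add(self, solution, fitness):
        self.total += 1
        for key in self._keys(solution):
            entry = self.stats.setdefault(key, [0.0, 0])
            entry[0] += fitness
            entry[1] += 1

    def mean(self, solution):
        # Average tuple value; tuples never seen before contribute zero.
        seen = [self.stats[k] for k in self._keys(solution) if k in self.stats]
        return sum(s / n for s, n in seen) / len(self.tuples)

    def ucb(self, solution):
        counts = [self.stats.get(k, (0.0, 0))[1] for k in self._keys(solution)]
        avg_count = sum(counts) / len(counts)
        explore = math.sqrt(math.log(self.total + 1) / (avg_count + self.eps))
        return self.mean(solution) + self.k * explore


def ntbea(evaluate, n_options, iterations=50, n_neighbours=10, seed=None, rng=None):
    """Return the evaluated solution the model rates best, plus the model itself."""
    rng = rng or random.Random(0)
    model = NTupleModel(len(n_options))
    current = list(seed) if seed is not None else [rng.randrange(n) for n in n_options]
    evaluated = []
    for _ in range(iterations):
        model.add(current, evaluate(current))
        evaluated.append(current)
        neighbours = [[rng.randrange(n) if rng.random() < 1 / len(n_options) else v
                       for v, n in zip(current, n_options)]
                      for _ in range(n_neighbours)]
        # Tiny noise breaks ties between equally rated neighbours.
        current = max(neighbours, key=lambda nb: model.ucb(nb) + rng.uniform(0, 1e-6))
    return max(evaluated, key=model.mean), model
```

Here a solution is a list of option indices, one per parameter, and `evaluate` would run the (noisy) game-playing agent with the corresponding configuration. Only one real evaluation is spent per iteration; neighbours are ranked using the model alone, which is where the sample efficiency comes from.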


II-C Framework

We use NTBEA to tune RHEA parameters within the General Video Game AI (GVGAI) framework [24]. GVGAI is a framework widely used in research [23] which features a large corpus of single-player and two-player games. These are fairly small games, each focusing on specific mechanics or skills the players should be able to demonstrate, including clones of classic arcade games such as Space Invaders, puzzle games like Sokoban, adventure games like Zelda or game-theory problems such as the Iterated Prisoner's Dilemma. All games are real-time and require players to make decisions within a short, fixed time budget at every game tick, although not all games explicitly reward or require fast reactions; in fact, some of the best game-playing approaches pool their thinking time at the beginning of the game to run Breadth-First Search in puzzle games in order to find an accurate solution [23]. However, given the large variety of games (many of which are stochastic and difficult to predict accurately), scoring systems and termination conditions, all unknown to the players, highly adaptive general methods are needed to tackle the diverse challenges proposed.

GVGAI includes several different tracks which tackle different problems: single-player planning [26], two-player planning [6] and single-player learning tracks focus on finding general game-playing AI agents which would be capable of planning (with internal models of the world) or learning across all the games in the framework. More recently, level generation [14] (creating levels for any game) and rule generation [13] (creating rules for any given level) challenges were introduced as well, to push the limits of general game AI.

For the purpose of the experiments described in this paper, we will focus on the single-player planning track, although the work could easily be expanded to include two-player games.

II-D Game set

We select 20 single-player games out of the larger GVGAI corpus, as previously analysed in several works. First introduced in [7] and described in Table I, the game set used in this study is sampled based on the performance of GVGAI competition entries across large subsets, so as to include games of varying difficulty. Additionally, half of the games are deterministic and half are stochastic, introducing additional noise to the parameter optimisation problem explored in this paper.

The game table includes additional information about each game. They showcase varying reward structures, such as games with no rewards (with the possibility of gaining points on win/lose conditions only), games with dense rewards (multiple interactions with the environment result in a score change) or games with discontinuous rewards (a longer sequence of actions is required to obtain the reward). Four different types of winning conditions are featured, in which the player has to kill certain game objects (Kill), reach an exit point (Exit), wait for a timer to run out (Timeout) or complete a certain more precise sequence of actions (Puzzle, such as move a box onto a specific point). Three types of losing conditions are included, which result in the player losing if they run out of time (Timeout), die (Death) or fail to kill specific game objects (No-kill).

Additionally, the 5 levels included with each game vary in size (Large - L, Medium - M or Small - S) and density of interactive tiles (that is, tiles which produce some sort of effect when the player interacts with them, such as blocking the player's path, moving or getting destroyed). Some games include Non-Player Characters (NPCs) that might either help the player (F), hurt the player (E) or have no direct influence on the player's win/lose condition or score (N) through their behaviour. The player may need to collect resources or pay particular attention to their avatar's hitpoints (HP). Finally, games vary in the actions available to the players (Move includes movement in all 4 directions, up, down, left and right; LR includes only left and right movement; a special Shoot action might be available in some games, with different effects). All symbols mentioned here refer strictly to the table notation.

When discussing parameter choices, we will refer to games as similar based on the features described in Table I, or on the clusterings identified in previous works (the last three columns of Table I).

III RHEA Parameter Space

This section describes all the evolutionary algorithm hyper-parameters used for the experiments, including hybrids and game-specific modifications, some introduced in previous work [7, 8, 9, 11]. All algorithm parameters are presented in Table II. Several dependent parameters are highlighted in the table: these are parameters that would not impact the phenotype without specific values taken by other parameters, as detailed below.

| Name | Values | N_options |
|---|---|---|
| Population Size | 1, 10, 15, 20 | 4 |
| Individual Length | 5, 10, 15, 20 | 4 |
| Dynamic Depth | False, True | 2 |
| Offspring Count | 5, 10, 15, 20 | 4 |
| Number Elites | 0, 1 | 2 |
| Initialisation Type | Random, 1SLA, MCTS | 3 |
| Genetic Operator | Crossover Only, Mutation Only, Crossover + Mutation | 3 |
| Selection Type † | Rank, Tournament, Roulette | 3 |
| Crossover Type † | Uniform, 1-point, 2-point | 3 |
| Mutation Type † | Uniform, 1-Bit, 2-Bits, Softmax, Diversity | 5 |
| Fitness Assignment | Last, Delta, Average, Min, Max, Discount | 6 |
| Diversity Weight | 0.0, 0.5, 1.0 | 3 |
| Frame Skip | 0, 5, 10 | 3 |
| Frame Skip Type † | Repeat, Null, Random, Sequence | 4 |
| Shift Buffer | False, True | 2 |
| Shift Discount † | 0.9, 0.99, 1.0 | 3 |
| MC Rollouts Length | 0.0, 0.5, 1.0, 2.0 | 4 |
| MC Rollouts Repeat † | 1, 5, 10 | 3 |

TABLE II: Parameter search space. Parameters marked with † are dependent on other parameters.
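As a quick sanity check on the scale of this search space, the raw product of the N_options column can be computed directly. This is only arithmetic over the values as listed in Table II; it is not necessarily the total reported by the authors, and since dependent parameters make some combinations behaviourally identical, the number of distinct behaviours may be smaller.

```python
import math

# Number of options per parameter, copied from the N_options column of Table II.
n_options = {
    "Population Size": 4, "Individual Length": 4, "Dynamic Depth": 2,
    "Offspring Count": 4, "Number Elites": 2, "Initialisation Type": 3,
    "Genetic Operator": 3, "Selection Type": 3, "Crossover Type": 3,
    "Mutation Type": 5, "Fitness Assignment": 6, "Diversity Weight": 3,
    "Frame Skip": 3, "Frame Skip Type": 4, "Shift Buffer": 2,
    "Shift Discount": 3, "MC Rollouts Length": 4, "MC Rollouts Repeat": 3,
}

# Raw size of the grid: one factor per parameter.
raw_size = math.prod(n_options.values())
print(f"{len(n_options)} parameters, {raw_size:,} raw combinations")
```

A grid on the order of a billion configurations, each needing several noisy game runs to evaluate, illustrates why exhaustive search is impractical here.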

III-A Genetic operators

There are three main genetic operators used by the evolutionary algorithm in RHEA: crossover, selection and mutation. In our implementation, selection is only used to select parents for offspring, subsequent generations being formed directly with the best individuals from the current generation (with no further selection being applied). These three genetic operators each have several implementation options, as discussed below. A hyper-parameter controls which operators should be applied, with options of only using crossover (and selection), only using mutation, or using all three to first obtain an offspring from crossover and then mutate it as well. It is worth noting that the operator type parameters are dependent on the choice of genetic operator: changing the mutation type parameter would not have any effect on the phenotype if no mutation is used in the algorithm, and similarly for crossover and selection.

Selection. Three types of selection are available in the system: tournament, roulette and rank. Tournament selection picks a fixed percentage of the population at random and then chooses the best individuals from these to reproduce. Roulette selection chooses individuals with probabilities proportional to their fitness (therefore, higher fitness individuals have a higher chance of being selected). Rank selection first assigns inverse ranks to all individuals in the population according to their fitness (the lowest fitness individual would have rank 1, the second lowest rank 2, etc.) and then chooses individuals with probabilities proportional to their rank (therefore, higher fitness individuals have a higher chance of being selected, but the selection pressure is reduced by minimising the differences in fitness).
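A minimal sketch of the three selection schemes follows. The function names, the tournament `fraction` default and the shift applied to make roulette weights positive are assumptions made for illustration.

```python
import random


def tournament_select(population, fitnesses, rng, fraction=0.5):
    """Sample a fraction of the population and return its fittest member."""
    size = max(1, int(len(population) * fraction))
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: fitnesses[i])]


def roulette_select(population, fitnesses, rng):
    """Pick with probability proportional to (shifted) fitness."""
    low = min(fitnesses)
    weights = [f - low + 1e-9 for f in fitnesses]  # assumption: shift so weights are positive
    return rng.choices(population, weights=weights, k=1)[0]


def rank_select(population, fitnesses, rng):
    """Pick with probability proportional to inverse rank (worst individual has rank 1)."""
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    ranks = [0] * len(population)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return rng.choices(population, weights=ranks, k=1)[0]
```

With fitnesses `[1, 5, 3]`, rank selection uses weights `[1, 3, 2]`, so the gap between the best and worst individual is compressed compared to roulette selection on the raw values.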

Crossover. Two types of crossover are available in the system: uniform and n-point. Uniform crossover selects genes from either of the parents with equal probability. n-point crossover randomly selects n points along the individuals, splitting them into subsections; the offspring is then formed by alternately choosing subsections of genes from the parents. We use 1 and 2 as possible values for n, leading to three total values for the crossover parameter.
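Both crossover variants can be sketched in a few lines. The function names are hypothetical, and the sketch assumes both parents have the same length, with at least n + 1 genes for n-point crossover.

```python
import random


def uniform_crossover(p1, p2, rng):
    """Take each gene from either parent with equal probability."""
    return [rng.choice((a, b)) for a, b in zip(p1, p2)]


def n_point_crossover(p1, p2, n, rng):
    """Cut at n distinct interior points and alternate segments between the parents."""
    points = sorted(rng.sample(range(1, len(p1)), n))
    child, parents, start = [], (p1, p2), 0
    for k, cut in enumerate(points + [len(p1)]):
        child.extend(parents[k % 2][start:cut])  # even segments from p1, odd from p2
        start = cut
    return child
```

With `n=1` the child is a prefix of the first parent followed by a suffix of the second; with `n=2` the middle segment comes from the second parent and both ends from the first.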

Mutation. Four types of mutation are available in the system: uniform, softmax, diversity and n-bit. Uniform mutation assigns each gene an equal probability of mutation, which depends on the individual length, and picks a different value, uniformly at random, for each mutating gene. Softmax mutation uses the softmax equation (see Equation 3) to bias mutation towards the beginning of the individual, which causes the largest perturbation in the action sequence (changing any gene in the individual, in this context, also changes the meaning of all subsequent genes; therefore changes at the beginning of the genome have the largest impact on the phenotype). Diversity mutation keeps track of all values for all genes from all individuals explored during evolution and chooses to mutate the gene that has currently been explored the most, to the value for that gene that has been explored the least. Finally, n-bit mutation chooses n genes uniformly at random to mutate to a new and different random value; we use 1 and 2 as possible values for n, leading to a total of five values for the mutation parameter.
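The two simplest mutation operators can be sketched as below; softmax and diversity mutation are omitted. The default rate of one over the individual length is an assumption, as are the function names, and both functions assume at least two available actions.

```python
import random


def _different_value(gene, n_actions, rng):
    """Draw a new action index that differs from the current one."""
    return rng.choice([a for a in range(n_actions) if a != gene])


def uniform_mutation(individual, n_actions, rng, rate=None):
    """Mutate each gene independently with the same probability."""
    rate = 1 / len(individual) if rate is None else rate  # assumed default rate
    return [_different_value(g, n_actions, rng) if rng.random() < rate else g
            for g in individual]


def n_bit_mutation(individual, n_actions, n, rng):
    """Mutate exactly n distinct genes, chosen uniformly at random."""
    child = list(individual)
    for i in rng.sample(range(len(child)), n):
        child[i] = _different_value(child[i], n_actions, rng)
    return child
```

Unlike uniform mutation, which may change zero or many genes, n-bit mutation always changes exactly n of them.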


III-B Fitness assignment

A key part of evaluating individuals, represented as action sequences, is the fitness assignment resulting from the phenotype interpretation (i.e. the sequence of game states the AI player traverses through the action sequence). If all game states traversed are evaluated with a heuristic function (see Equation 1), then the array of values corresponding to each of the game states can be translated to a fitness value in different ways: keeping only the value of the last game state reached; keeping the difference between the value of the last state and the value of the first state (the state improvement value); keeping the average of all game state values; keeping the minimum value, a pessimistic model; keeping the maximum value, an optimistic model; or keeping a discounted sum of all values, with a discount factor that prioritises immediate rewards.
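The six options reduce an array of state values to a single number. A minimal sketch, assuming `values` holds the heuristic value of each traversed state in order, that the "first state" is the first entry of that array, and that `gamma` is a hypothetical discount factor:

```python
def assign_fitness(values, method, gamma=0.9):
    """Reduce the per-state heuristic values of one rollout to a single fitness."""
    if method == "last":
        return values[-1]
    if method == "delta":
        return values[-1] - values[0]
    if method == "average":
        return sum(values) / len(values)
    if method == "min":
        return min(values)
    if method == "max":
        return max(values)
    if method == "discount":
        # Earlier states get larger weights, so immediate rewards count more.
        return sum(gamma ** i * v for i, v in enumerate(values))
    raise ValueError(f"unknown fitness assignment: {method}")
```

For the values `[1.0, 3.0, 2.0]`, "last" gives 2.0, "delta" gives 1.0, "average" gives 2.0, "min" gives 1.0, "max" gives 3.0, and "discount" with `gamma=0.5` gives 1.0 + 1.5 + 0.5 = 3.0.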

III-C Initialisation

In the vanilla version, the algorithm is initialised with random individuals (all genes in all individuals are picked uniformly at random from all possible values). Different initialisation (or seeding) methods have been previously tested in conjunction with the vanilla algorithm with varying success [8]. Both One Step Look Ahead (1SLA) and Monte Carlo Tree Search (MCTS) initialisation options, which have shown promise in various games in the previous study by Gaina et al., are included in this system.


1SLA. This algorithm performs an exhaustive search of all possible options for a gene and picks the action which leads to the highest value for the following game state. To form an individual, this process is followed for each gene: an action is chosen, the game state is advanced with the chosen action and the process is repeated for the next gene until an action sequence of sufficient length is generated. If the end of the game is reached during the creation of an individual, the individual is padded with random actions until it meets the required length. For initialisation of a RHEA population, the first individual is created with the 1SLA algorithm and the rest are mutated from the first. Being a greedy approach, this reduces the randomness of the initial population and begins the search from a local optimum.
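The greedy construction of the first individual can be sketched as follows. `forward_model`, `heuristic` and `is_terminal` are hypothetical callables standing in for the game interface.

```python
import random


def one_step_look_ahead(state, forward_model, heuristic, is_terminal,
                        n_actions, length, rng):
    """Build an action sequence by greedily picking the best next state at each step."""
    individual = []
    for _ in range(length):
        if is_terminal(state):
            # Game over before the sequence is complete: pad with random actions.
            individual.append(rng.randrange(n_actions))
            continue
        best = max(range(n_actions),
                   key=lambda a: heuristic(forward_model(state, a)))
        individual.append(best)
        state = forward_model(state, best)
    return individual
```

Each gene costs one forward model call per available action plus one to advance the state, so a full individual needs roughly the individual length times the number of actions in simulations before evolution even starts.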

MCTS. This algorithm iteratively builds a search tree by selecting nodes in the tree to expand using the UCB1 formula (see Equation 4). The formula combines an exploration constant with three statistics kept for the current game state and each action available from it: the value of choosing that action from the state, the number of times the state has been visited, and the number of times the state has been visited and that action was chosen next. MCTS then evaluates nodes with Monte Carlo simulations (a sequence of random actions up to a maximum tree depth) starting from the newly expanded node and updates the statistics (values and visit counts) of all nodes traversed during an iteration with the value given by the heuristic for the final game state reached after the Monte Carlo simulations. The tree grows asymmetrically as MCTS balances between exploration of uncertain actions and exploitation of seemingly good actions. For initialisation of a RHEA population, MCTS is run for half the budget and the first individual is selected by greedily traversing the tree created. As the tree would not be fully expanded, the path through the tree is capped when a node with fewer than a minimum number of visits is reached, and actions are added randomly until the individual length is met; the rest of the individuals are mutated from the first.
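For reference, the standard form of the UCB1 selection rule can be written as below. The symbols are chosen here for illustration and may differ from those of Equation 4: $a^*$ is the chosen action, $A(s)$ the set of actions available in state $s$, $Q(s,a)$ the value of choosing $a$ from $s$, $N(s)$ the number of visits to $s$, $N(s,a)$ the number of times $a$ was chosen from $s$, and $C$ the exploration constant.

```latex
a^{*} = \arg\max_{a \in A(s)} \left\{ Q(s,a) + C \sqrt{\frac{\ln N(s)}{N(s,a)}} \right\}
```

The first term favours actions that have paid off so far, while the second grows for actions tried rarely relative to their parent state, which is what produces the asymmetric tree growth.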


III-D Frame skip

Frame skipping has become common practice in several Reinforcement Learning works, and key to the success of specific applications [5, 21]: game states are grouped when making a decision, which increases the data available and reduces the frequency of decisions to once every several game states. Statistical forward planning (SFP) approaches, on the other hand, usually make a new decision at every game tick, repeating their search process in the very limited time. With this modification, we test whether SFP methods can also benefit from a longer decision time by returning an action only once every several game ticks, replying according to a specific strategy for the game ticks in between and using all the time between decisions to plan the next move. We test 0 (no frame skip, decisions at every game tick), 5 and 10 as values for the frame skip and four different strategies for actions in between decisions: repeat, null, random and sequence. The repeat strategy simply repeats the previously decided action until a new action is decided. The null strategy plays ACTION_NIL (does nothing), which more closely mimics human gameplay, with pauses in between actions. The random strategy plays a random action and the sequence strategy continues playing the following actions in the best individual returned with the last decision. The frame skip type parameter is dependent on the frame skip value: if no frame skip is used (value 0), then changing the frame skip type would have no effect on the phenotype.
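The four filler strategies can be sketched as a single dispatch function. The signature and the `nil_action` index are hypothetical; `ticks_since_decision` counts game ticks since the last planned action (so 1 on the first skipped tick).

```python
import random


def filler_action(strategy, last_action, best_individual, ticks_since_decision,
                  n_actions, rng, nil_action=0):
    """Action to play on a skipped tick, between two planned decisions."""
    if strategy == "repeat":
        return last_action
    if strategy == "null":
        return nil_action  # assumed index of the do-nothing action
    if strategy == "random":
        return rng.randrange(n_actions)
    if strategy == "sequence":
        # Keep following the last best plan; clamp if the plan is shorter than the skip.
        i = min(ticks_since_decision, len(best_individual) - 1)
        return best_individual[i]
    raise ValueError(f"unknown frame skip strategy: {strategy}")
```

Note that the sequence strategy only makes full use of the plan when the individual length is at least the frame skip plus one; the clamp above is one possible way of handling shorter plans.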

A form of frame skip using the repeat strategy described above was previously tested in GVGAI by Perez et al. [17] with notable success in several games.

III-E Shift buffer

This is a population management technique which avoids repeating the entire search process from scratch at every new game tick, which usually loses information gained in previous iterations of the algorithm; it is meant to make the algorithm more sample-efficient by retaining previous computation. The shift buffer has been employed in several works and tested in GVGAI by Gaina et al. [9], and it works by carrying the final population evolved during one game tick over to the next. However, as the first action of the best individual has just been played, the first action of every individual in the population is removed and a new random action is added at the end. Additionally, our implementation has the option to apply a discount to the values of all individuals in the new population, which can be either 0.9, 0.99 or 1.0 (no discount applied); this weakens the values of previously obtained sequences in the new context. The shift buffer discount parameter is dependent on the shift buffer toggle: if no shift buffer is used, then changing the shift buffer discount would have no effect on the phenotype.
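The shift itself is a one-line list operation per individual. A minimal sketch, assuming the stored fitness values are simply multiplied by the discount (the function name and signature are hypothetical):

```python
import random


def shift_population(population, values, n_actions, rng, discount=1.0):
    """Drop each individual's first action, append a random one, discount old values."""
    shifted = [ind[1:] + [rng.randrange(n_actions)] for ind in population]
    discounted = [v * discount for v in values]
    return shifted, discounted
```

For example, the individual `[1, 2, 3]` becomes `[2, 3, x]` with `x` drawn at random, so the plan made at the previous tick stays aligned with the new game state.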

III-F Dynamic depth

It is often the case that different games benefit from different algorithm parameters. In particular, the individual length has a high impact on the performance of the vanilla RHEA, as shown in [7]. This was mainly tied to the density of rewards in the various games in [11]: games with dense rewards generally benefit from shorter individuals, which allow for more generations and more statistics gathered to facilitate quick strategic reactions, whereas games with sparse or no rewards require longer individuals in order to find rewards further ahead. This difference in reward density can also be observed at a more granular level, during the play-through of a single game: some areas of the game may contain more rewards, whereas others would require more exploration. Therefore, the dynamic depth modification presented in [11] is included in the system, which has the option to change the length of the individuals at a fixed interval of game ticks: if the standard deviation of the fitness landscape observed previously falls below a threshold, decisions are considered to be uncertain, without much variety in rewards observed, and the individual length is increased by a fixed step; if the opposite happens and the standard deviation of the fitness landscape observed previously rises above a threshold, more generations are prioritised for more informed decision making in a highly varied environment and the individual length is decreased instead. All parameters for dynamic length adjustments were set based on [11] and could represent one point for further increasing the parameter search space in further studies.
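As a rough sketch of the adjustment rule described above, assuming the recently observed fitness values are available as a list: the threshold and step values are left as arguments because they are taken from [11], and the `min_length`/`max_length` bounds are an added assumption to keep the length valid.

```python
import statistics


def adjust_length(length, recent_fitness, low_threshold, high_threshold, step,
                  min_length=1, max_length=None):
    """Return the new individual length (illustrative sketch, hypothetical names)."""
    spread = statistics.pstdev(recent_fitness)
    if spread < low_threshold:
        # Flat fitness landscape: look further ahead to find rewards.
        length += step
    elif spread > high_threshold:
        # Highly varied landscape: shorter individuals allow more generations.
        length -= step
    # Bounds are an assumption of this sketch, not taken from the paper.
    length = max(min_length, length)
    if max_length is not None:
        length = min(max_length, length)
    return length
```

Such a function would be called once per adjustment interval rather than at every game tick.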

III-G MC rollouts

Lastly, we consider the hybridisation of the algorithm and its further combination with MCTS, which has been very successful in many GVGAI games [22]. We have previously described MCTS initialisation, but concepts from MCTS can further be borrowed and integrated into RHEA, such as its Monte Carlo (MC) simulation phase. As described in [9], the evaluation process in RHEA may add MC rollouts (or use none) after advancing through the action sequence represented by the individual, with the rollout length chosen from a set of options; the rollouts may be repeated several times to gather more statistics. In order for this to be compatible with the fitness assignment modifications, the values of all game states traversed (or the average value for a particular game tick if repetitions are performed) are added at the end of the array of state values obtained from the individual, and all fitness assignment methods are applied to the combined array of values instead. The MC rollout repetition parameter is dependent on the rollout length: if no rollouts are used, then changing the number of rollout repetitions has no effect on the phenotype.
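The construction of the combined value array can be illustrated with the sketch below. The forward model is abstracted as two hypothetical callables, `step(state, action)` and `value(state)`; these, and all other names here, are assumptions for illustration and not the GVGAI API.

```python
import random


def evaluate_with_rollouts(state, actions, step, value, n_actions,
                           rollout_length, repetitions, rng=None):
    """Return state values along the individual, followed by rollout values."""
    rng = rng or random.Random()
    values = []
    # Advance through the action sequence encoded by the individual.
    for action in actions:
        state = step(state, action)
        values.append(value(state))
    if rollout_length > 0:
        sums = [0.0] * rollout_length
        for _ in range(repetitions):
            current = state  # every repetition restarts from the end of the individual
            for tick in range(rollout_length):
                current = step(current, rng.randrange(n_actions))
                sums[tick] += value(current)
        # Average per rollout tick across repetitions, appended at the end.
        values.extend(total / repetitions for total in sums)
    return values
```

Any fitness assignment method (for example taking the last value, or a discounted sum) would then be applied to the returned list. The sketch assumes `step` returns a new state rather than mutating its argument.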

IV Experiments

Given the large number of possible parameter combinations, it would take a significant amount of time to test each combination exhaustively in several games and with repetitions for statistical significance. Therefore we choose to analyse the different parameters indirectly through the evolutionary process described by an N-Tuple Bandit Evolutionary Algorithm (NTBEA). We ran NTBEA for a fixed number of iterations on each of the 20 games described in Section II to perform a search through the RHEA parameter space depicted in Table II. Each individual evaluated by NTBEA is therefore one parameter combination. We seed NTBEA with the previous state-of-the-art parameter configuration for each game (see ‘SotA’ rows in Table III). To evaluate each individual, we run RHEA with the specific parameter configuration on the given game, once in each of the levels of the game, and we use the average win rate over the levels as the individual's fitness. To test the final configuration, we run it repeatedly on the given game (the same number of times per level), and we additionally test the tuned parameter configuration on the entire set of games, similarly with repeated runs per game.
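The fitness used for one NTBEA individual can be sketched as follows, assuming a hypothetical callable `play_game(config, level)` that runs RHEA once and returns whether the game was won.

```python
def configuration_fitness(play_game, config, levels):
    """Average win rate of one parameter configuration, one run per level."""
    wins = [1.0 if play_game(config, level) else 0.0 for level in levels]
    return sum(wins) / len(wins)
```

Because each level is played only once and both the games and RHEA are stochastic, this estimate is noisy, which is the source of the noisy optimisation problem discussed in the results.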

All experiments were run on IBM System X iDataPlex dx360 M3 Server nodes, with one game per node, each having one Intel Xeon E5645 processor core allocated to it and a maximum of 3GB of RAM of JVM Heap Memory. The runs took from hours to days to complete, including NTBEA tuning and final configuration testing; the number of game ticks in one run varies considerably between games, and every tick uses a fixed number of Forward Model calls for AI decision making (plus game engine computations). The budget for all agents was set as a number of Forward Model calls instead of a time limit, in order for the experiments to be consistent and replicable across different machines.

In the following sections we aim to analyse not only the performance of the optimised agents on the different games, but also the parameter space explored during the evolution and the parameter choices themselves. We hypothesise that similar games would lead to similar choices in parameters, which would differ across game types.

The paper presents and discusses the most interesting aspects observed, but all results, plotting scripts and additional figures are available on GitHub.

Fig. 1: Progression of solution fitness during NTBEA optimisation process in 2 games, Missile Command (left) and Intersection (right). The green dots indicate when the solution evaluated becomes the new best (the last green dot before or at iteration X is the solution which NTBEA would return if execution was stopped at iteration X). Trendline in white.

IV-A Optimisation Effectiveness

We first discuss the effectiveness of the optimisation. We summarise in Table III the results obtained on all 20 games used for tuning RHEA parameters with NTBEA. For each game, we present the parameter configuration of the previous state of the art (previous highest win rate recorded), its win rate and standard error; similarly, we present the optimised configuration for each game.

There are many games in which the win rate remains at or very close to 0%. This set of games (Dig Dug, Lemmings, Roguelike) remains too difficult for these methods to solve without more game-specific information or better exploration policies.

There are also several games which see win rates at, or very close to, 100% (Intersection, Aliens, Infection, Chopper and Plaque Attack). We do not see a decrease in performance in these games after optimisation (and we see a definite increase in Plaque Attack, with several modifications in parameter choices, including using dynamic depth, 1SLA initialisation and random frame skip).

We do see several games improving performance significantly (see Table III): the win rate in Sea Quest increases by employing longer individual lengths, a larger population size, a shift buffer and MC rollouts. Missile Command sees an increase in win rate with a shift buffer, MC rollouts and a discounted fitness assignment. Performance in Camel Race increases by using repetition frame skip, a shift buffer and MC rollouts, amongst many configuration modifications. These 3 games do not immediately show common features as per Table I, with Sea Quest standing out due to its stochastic nature and dense environment, while the other two feature sparser deterministic environments.

We see a decrease in performance in three of the games: Butterflies, Escape and Modality (see Table III). As NTBEA was seeded with the previously best solution, we believe these are cases in which the noisy fitness evaluation was most harmful, as during optimisation the initial solution ended up with a worse fitness than the solution returned. However, with more runs of the two configurations on the game, their ranking turns out to be the opposite. A similar, smaller decrease is also observed in Lemmings and Bait. All of these, except for Butterflies, are games with puzzle elements, which appear to be the most difficult to optimise and to estimate solution quality for: they require more precise action sequences, with one move possibly making the game unsolvable, and therefore more precise evaluation.

Finally, we highlight the progression of NTBEA’s optimisation process in two games, Missile Command and Intersection, in Figure 1. Both of these games see an upwards trend in solution quality, and they represent the games with the slowest and fastest convergence, respectively. We can observe that the algorithm settles on the solution for Intersection very quickly, whereas it uses almost all of its computation budget for Missile Command to find the best option. This could be an indication of not only game difficulty, but also strategic depth: most parameter options work well and obtain very good performance in Intersection, while Missile Command poses a challenge in which few options are successful and finding those few good configurations is more difficult.

Overall, the game-specific optimised agents achieve lower win rates when tested on the entire set of games, which is not surprising in the general game playing context; the agents do not use any game-specific information. The best performing tuned agent is that for Sea Quest (by average win rate on all games), which shares different features with several other games; this appears to make the specific configuration more generally applicable than the others.

Fig. 2: 1-tuple: rollout length percentage parameter. Color intensity represents number of occurrences of the data point (the brighter the color, the more occurrences).

IV-B 1-tuple Analysis

We further look into the parameter space explored by NTBEA starting with 1-tuples: that is, looking at each parameter in isolation and its preferred values in the different games tested; we use the term prefer to mean that the value achieves the highest win rate. We group together the solutions in which the parameter had the same value chosen and plot the fitness of these solutions against parameter values. It is worth noting that the parameter may not have had a great influence on the fitness obtained. We further exclude the data points where the parameter had no influence at all, in the case of dependent parameters (see Table II). The resulting heatmaps show the fitness values observed for each parameter value, as well as how many times each parameter value was explored by NTBEA. The latter is given by the intensity of colours in the figures presented; we cap the maximum number of occurrences of a data point and normalise all values for visualisation purposes.
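The grouping behind the 1-tuple plots can be sketched as below. It assumes each evaluated solution is available as a `(config, fitness)` pair, with `config` a dictionary of parameter indices; the parameter names and the `is_active` predicate are hypothetical, and reporting the mean per value is a simplification, since the heatmaps show all observed fitness values.

```python
from collections import defaultdict


def one_tuple_stats(samples, param, is_active=lambda config: True):
    """Map each value of `param` to (mean fitness, number of occurrences)."""
    groups = defaultdict(list)
    for config, fitness in samples:
        # Skip solutions in which a dependent parameter had no effect.
        if is_active(config):
            groups[config[param]].append(fitness)
    return {value: (sum(fits) / len(fits), len(fits))
            for value, fits in groups.items()}
```

For a dependent parameter such as the shift buffer discount, `is_active` would check that the shift buffer itself is switched on in that configuration.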

A number of games prefer the shift buffer turned on and keeping 1 elite between generations. Additionally, as previously seen in [7], most games prefer long individual lengths and large population sizes. A number of games further prefer the agent employing Monte Carlo rollouts at the end of its individual evaluation: Chopper, Sea Quest and Missile Command in particular prefer very long rollouts (see Figure 2); this could be due to these three games featuring different types of rewards and delays in obtaining rewards. The other similar game in terms of rewards, Intersection, does not show a particular preference in this parameter, achieving similar fitness for all values. Full plots and results are available on GitHub.

In terms of genetic operators, most games prefer the agent to use both mutation and crossover in its evolutionary process. However, there are some exceptions: Chopper and Plaque Attack prefer to use mutation only, whereas Missile Command prefers options that do include crossover and more disturbance in its offspring (see Figure 3). Although these games are seen as similar in [30] and obtain high winning rates, the way the agents achieve their good performance does differ in these games, suggesting win-rate-based clustering methods could be improved by taking into account agent-based features.

Fig. 3: 1-tuple: genetic operator parameter. 0 - crossover and mutation. 1 - mutation only. 2 - crossover only.

When looking at the number of offspring (see Figure 4), Survive Zombies, Missile Command and Chopper prefer more. These games are quite similar in terms of features (win/lose conditions, level sizes, enemy NPCs) and are clustered together in [30]. However, in the same cluster, Butterflies and Plaque Attack do not show a strong preference here. As opposed to the others, these two games have a smoother reward function and score progression, while Missile Command and Chopper show more delay in getting rewards, as more actions are required from the player to find particular rewarding scenarios. Sea Quest is placed in a similar cluster by [3], but it shows the opposite preference, for fewer offspring instead. In this game we see large discontinuous rewards as well as many smaller dense rewards: the larger variety in types of rewards could be what leads to favouring fewer solutions sampled, in order to increase the number of generations in the evolutionary process and to gain better insight into which reward type is preferable.

Fig. 4: 1-tuple: offspring count parameter.

Lastly, we highlight that Intersection and Wait for Breakfast are the only games that benefit largely from null frame skipping (see Figure 5) - in both of these games it is essential to wait for specific events to happen (a way in the road to clear, or the waiter to arrive) and this is highlighted in choice of method. Plaque Attack prefers sequence frame skipping, as plans evolved are precise enough in line with the constant stream of rewards. And Camel Race prefers repeat or sequence frame skipping, with more frames skipped being better, which are more effective strategies of exploring large sparse environments. Most other games dislike frame skipping and prefer more fine-grained search; however, we note that the choice in values for this parameter is very coarse and it might be that more games could benefit from some or dynamic frame skipping.

Fig. 5: 1-tuple: frame skip type parameter. 0 - repeat. 1 - null. 2 - random. 3 - sequence.

Fig. 6: 2-tuple: individual length and population size. Colors show average fitness for each data point, with green being highest and red being lowest. Each data point is highlighted with a black circle; the darker the circle, the more times that combination of values was sampled.

IV-C 2-tuples Analysis

Similarly as with 1-tuples, we can look at how combinations of parameters affect overall solution fitness. In this section we group together solutions which had the same values for each parameter combination, while eliminating the data points where either one or both of the parameter values did not impact solution phenotype, in case of dependent parameters (see Table II). We plot each parameter against all others in the different games tested, each data point representing the average fitness observed in the respective group of solutions. We further add black circles on each data point to highlight the number of times each combination was explored by NTBEA during the optimisation process.
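The 2-tuple grouping can be sketched in the same spirit, again assuming `(config, fitness)` pairs. In this illustration a parameter that had no effect on the phenotype is encoded as `None` (shown as "-" in Table III), which is an assumption of the sketch rather than the authors' data format.

```python
from collections import defaultdict
from itertools import combinations


def two_tuple_stats(samples, params):
    """Map (param_a, param_b, value_a, value_b) to (mean fitness, sample count)."""
    totals = defaultdict(lambda: [0.0, 0])
    for config, fitness in samples:
        for a, b in combinations(params, 2):
            # Drop data points where either parameter was inactive.
            if config[a] is None or config[b] is None:
                continue
            entry = totals[(a, b, config[a], config[b])]
            entry[0] += fitness
            entry[1] += 1
    return {key: (total / count, count)
            for key, (total, count) in totals.items()}
```

The mean gives the colour of each data point and the count gives the darkness of its circle in plots such as Figure 6.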

The first thing that stands out in all resulting figures is that NTBEA explores the best combinations the most, while mostly ignoring less promising options. This can be seen as a direct confirmation of the effectiveness of the bandit-based approach, but also as a potential point of improvement: due to the nature of the very noisy optimisation, it might be beneficial to obtain more accurate estimates of some data points which do not immediately stand out as the best: as discussed in a previous section, it was the case in several games that the optimised solution ended up performing worse than the initial solution given to the algorithm, which could have been avoided had a more accurate evaluation of solution quality been done.

In Figure 6 we can observe the combination of individual length and population size parameters. We have previously observed that longer individuals lead to higher fitness values, and similarly for larger population sizes. It is interesting to see that this also holds true for the combination of the two parameters, although specific combinations achieve better results in some games (for example in Chopper).

Another interesting parameter combination to discuss is that of mutation type and crossover type, shown in Figure 7, which largely decides how offspring are created at each generation. Although the overall fitness of solutions differs, games Hungry Birds and Plaque Attack show a similar distribution of good or bad quality combinations: in particular, 1-point crossover does not agree with diversity mutation, and 2-point crossover does not agree with bit-mutation. This could largely be due to the specific modifications n-point crossover wishes to generate, which are modified unexpectedly by bit-mutation. However, these two games singled out here do not appear to have much in common according to our feature descriptions and clustering in Table I; it is thus interesting to find game similarities beyond those given by traditionally-employed features.

Fig. 7: 2-tuple: mutation type and crossover type.

V Conclusion

In this paper we use the N-Tuple Bandit Evolutionary Algorithm (NTBEA) to optimise the performance of the Rolling Horizon Evolutionary Algorithm (RHEA) in 20 GVGAI games, by modifying the configuration of RHEA’s 18 parameters. The various values possible for all parameters form a large search space, which makes manual optimisation or exhaustive search difficult with limited compute; thus we choose to use NTBEA to attempt to improve the win rate of the agent in each of the games.

As a result of the optimisation, the performance increases in several games. However, puzzles appeared to be the games where NTBEA struggled to estimate the quality of different agent configurations and the solution returned was worse than the state of the art, although NTBEA’s evolutionary process was run with SotA as the initial solution. The optimisation process differed in the games tested, NTBEA being able to converge quickly in some games, while taking most of its iteration budget to find good solutions in others: this strengthens the idea that one specific method is unlikely to perform well across all games, and that games might require specialised parameter search spaces to ensure fast optimisation or even the possibility of a high-performing solution being found.

We further analysed RHEA’s parameters through the evolutionary process, by looking at some 1-tuples and 2-tuples and the values explored for each. Several games with features in common were found to prefer similar parameter values, although exceptions exist: some game clusters appear in parameter values but not in the traditional game features considered. This suggests that game clustering methods can be further enhanced by considering agent-based features.

To further expand on the work carried out in this paper, we propose further exploring larger and more complex search spaces, with an enhanced NTBEA which is able to handle tree-structures: we’ve seen several parameters dependent on others and optimisation would be more sample-efficient if this was taken into account during the evolutionary process. More enhancements can also be added into the system, as well as optimising RHEA on a larger set of games, with the possibility of testing approaches at optimising a generally applicable player. Lastly, information gathered during optimisation and in-depth analysis can be used for designing hyper-parameter methods which would be able to identify game features, relate these to previously seen situations and adapt to new unknown environments.


This work was funded by the EPSRC CDT in Intelligent Games and Game Intelligence (IGGI) EP/L015846/1

Game  Config  Win Rate  P.Size  I.Len  DD  O.Cnt  Elite  Init  G.Op  Select  Cross  Mut  Fit  D.W  F.Skip  F.Skip.T  S.Buffer  S.B.Disc  MC.Len  MC.Rep
0     SotA    0%        1       2      0   2      1      0     2     1       0      0    0    0    0       -         0         -         0       -
0     opt     0%        0       3      0   2      1      2     2     0       0      2    0    0    1       0         1         0         1       0
1     SotA    4%        1       2      1   2      1      0     2     1       0      0    0    0    0       -         0         -         0       -
1     opt     0%        1       1      1   0      0      0     2     1       1      2    0    0    1       0         1         0         1       1
2     SotA    0%        1       2      0   2      1      0     2     1       0      0    0    0    0       -         0         -         0       -
2     opt     0%        0       3      0   2      1      2     2     0       0      2    0    0    1       2         1         0         1       0
3     SotA    100%      1       2      0   2      1      0     2     1       0      0    0    0    0       -         0         -         0       -
3     opt     99%       1       2      1   3      1      1     1     -       -      0    5    0    0       -         1         0         2       1
4     SotA    10%       1       2      0   2      1      0     2     1       0      0    0    0    0       -         0         -         0       -
4     opt     10%       3       1      1   2      0      1     1     -       -      2    5    0    0       -         1         0         0       -
5     SotA    13.13%    1       2      0   2      1      0     2     1       0      0    0    0    0       -         0         -         0       -
5     opt     3%        1       3      0   3      1      0     0     1       2      -    2    0    0       -         0         -         1       0
6     SotA    11.00%    0       1      0   1      1      0     2     1       0      0    0    0    0       -         0         -         1       2
6     opt     41%       2       2      0   0      1      2     0     1       0      4    5    1    1       0         1         1         2       1
7     SotA    46%       1       2      0   2      1      0     2     1       0      0    0    0    0       -         0         -         1       0
7     opt     32%       3       1      0   1      0      0     1     -       -      4    5    1    0       -         1         0         0       -
8     SotA    12%       1       2      0   2      1      0     2     1       0      0    0    0    0       -         1         0         1       2
8     opt     11%       1       3      0   2      0      1     2     0       2      2    4    0    0       -         1         1         3       1
9     SotA    20%       1       2      0   2      1      0     2     1       0      0    0    0    0       -         1         0         1       2
9     opt     19%       1       3      0   1      1      0     2     1       1      0    4    0    0       -         1         1         2       1
10    SotA    78.33%    1       2      0   2      1      0     2     1       0      0    0    0    0       -         0         -         0       -
10    opt     83%       1       3      1   1      1      1     0     1       1      -    5    0    1       1         1         1         3       0
11    SotA    54.55%    1       2      0   2      1      0     2     1       0      0    0    0    0       -         0         -         0       -
11    opt     56%       3       2      1   1      1      0     2     1       0      0    5    0    0       2         1         1         3       1
12    SotA    37.5%     1       2      0   2      1      0     2     1       0      0    0    0    0       -         0         -         0       -
12    opt     25%       1       3      1   2      0      0     2     1       1      4    4    0    2       3         0         -         1       2
13    SotA    77.78%    1       2      0   2      1      0     2     1       0      0    0    0    0       -         0         -         0       -
13    opt     86%       2       3      0   2      1      0     2     1       1      3    4    0    0       -         1         0         3       0
14    SotA    98.99%    1       2      0   2      1      0     2     1       0      0    0    0    0       -         0         -         0       -
14    opt     100%      2       3      1   0      1      1     1     0       1      4    4    0    0       2         1         2         3       0
15    SotA    65%       0       0      0   0      1      0     2     1       0      0    0    0    0       -         1         0         1       2
15    opt     84%       3       3      0   0      1      0     0     0       0      -    2    0    0       -         1         0         3       0
16    SotA    100%      1       2      0   2      1      0     2     1       0      0    0    0    0       -         0         -         0       -
16    opt     99%       2       3      1   0      0      0     1     -       -      2    0    0    2       2         1         1         1       2
17    SotA    100%      1       2      0   2      1      0     2     1       0      0    0    0    0       -         0         -         0       -
17    opt     100%      2       3      1   3      0      0     1     -       -      2    5    0    0       -         1         0         0       -
18    SotA    96%       1       2      0   2      1      0     2     1       0      0    0    0    0       -         0         -         0       -
18    opt     90%       2       0      0   3      0      0     0     2       2      -    5    0    0       -         1         1         3       1
19    SotA    100%      1       2      0   2      1      0     2     1       0      0    0    0    0       -         0         -         0       -
19    opt     100%      1       0      1   0      1      1     2     2       1      2    5    0    0       -         0         -         3       1
TABLE III: RHEA best win rate (and standard error) recorded in all games. “opt” rows show NTBEA optimisation results, “SotA” rows show previously best recorded. The parameter values are indexes to the parameter values array displayed in Table II. Win-rates in bold are the higher values observed, if different.


  • [1] D. Ashlock, D. Perez-Liebana, and A. Saunders (2017) General Video Game Playing Escapes the No Free Lunch Theorem. In 2017 IEEE Conference on Computational Intelligence and Games (CIG), pp. 17–24.
  • [2] H. Baier and P. I. Cowling (2018) Evolutionary MCTS with Flexible Search Horizon. In Fourteenth Artificial Intelligence and Interactive Digital Entertainment Conference.
  • [3] P. Bontrager, A. Khalifa, A. Mendes, and J. Togelius (2016) Matching Games and Algorithms for General Video Game Playing. In Twelfth Artificial Intelligence and Interactive Digital Entertainment Conference, pp. 122–128.
  • [4] I. Bravi, S. Lucas, D. Perez-Liebana, and J. Liu (2019) Rinascimento: Optimising Statistical Forward Planning Agents for Playing Splendor. arXiv preprint arXiv:1904.01883.
  • [5] A. Braylan, M. Hollenbeck, E. Meyerson, and R. Miikkulainen (2015) Frame Skip is a Powerful Parameter for Learning to Play Atari. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence.
  • [6] R. D. Gaina, A. Couëtoux, D. J.N.J. Soemers, M. H.M. Winands, T. Vodopivec, F. Kirchgessner, J. Liu, S. M. Lucas, and D. Perez-Liebana (2018-06) The 2016 Two-Player GVGAI Competition. IEEE Transactions on Games 10 (2), pp. 209–220. ISSN 2475-1502.
  • [7] R. D. Gaina, J. Liu, S. M. Lucas, and D. P. Liébana (2017) Analysis of Vanilla Rolling Horizon Evolution Parameters in General Video Game Playing. In Springer Lecture Notes in Computer Science, Applications of Evolutionary Computation, EvoApplications, pp. 418–434.
  • [8] R. D. Gaina, S. M. Lucas, and D. P. Liébana (2017-06) Population Seeding Techniques for Rolling Horizon Evolution in General Video Game Playing. In Proceedings of the Congress on Evolutionary Computation, pp. 1956–1963.
  • [9] R. D. Gaina, S. M. Lucas, and D. P. Liébana (2017-08) Rolling Horizon Evolution Enhancements in General Video Game Playing. In Proceedings of IEEE Conference on Computational Intelligence and Games, pp. 88–95. ISSN 2325-4289.
  • [10] R. D. Gaina, S. M. Lucas, and D. Perez-Liebana (2019) Project Thyia: A Forever Gameplayer. In IEEE Conference on Games (COG), pp. 1–8.
  • [11] R. D. Gaina, S. M. Lucas, and D. Perez-Liebana (2019) Tackling Sparse Rewards in Real-Time Games with Statistical Forward Planning Methods. In AAAI Conference on Artificial Intelligence (AAAI-19), Vol. 33, pp. 1691–1698.
  • [12] N. Justesen, T. Mahlmann, S. Risi, and J. Togelius (2017) Playing Multiaction Adversarial Games: Online Evolutionary Planning Versus Tree Search. IEEE Transactions on Games 10 (3), pp. 281–291.
  • [13] A. Khalifa, M. C. Green, D. Perez-Liebana, and J. Togelius (2017) General Video Game Rule Generation. In 2017 IEEE Conference on Computational Intelligence and Games (CIG), pp. 170–177.
  • [14] A. Khalifa, D. Perez-Liebana, S. M. Lucas, and J. Togelius (2016) General Video Game Level Generation. In Proc. of the Genetic and Evolutionary Computation Conference 2016, pp. 253–259.
  • [15] K. Kunanusont, R. D. Gaina, J. Liu, D. Perez-Liebana, and S. M. Lucas (2017) The N-Tuple Bandit Evolutionary Algorithm for Automatic Game Improvement. In 2017 IEEE Congress on Evolutionary Computation (CEC), pp. 2201–2208.
  • [16] K. Kunanusont, S. M. Lucas, and D. Perez-Liebana (2018) Modelling Player Experience with the N-Tuple Bandit Evolutionary Algorithm. In Artificial Intelligence and Interactive Digital Entertainment (AIIDE).
  • [17] D. P. Liébana, M. Stephenson, R. D. Gaina, J. Renz, and S. M. Lucas (2017-08) Introducing Real World Physics and Macro-Actions to General Video Game AI. In Proceedings of IEEE Conference on Computational Intelligence and Games, pp. 248–255. ISSN 2325-4289.
  • [18] J. Liu, D. Perez-Liebana, and S. M. Lucas (2016) Rolling Horizon Coevolutionary Planning for Two-Player Video Games. In Proceedings of the IEEE Conference on Computational Intelligence and Games (CIG).
  • [19] S. M. Lucas, J. Liu, I. Bravi, R. D. Gaina, J. Woodward, V. Volz, and D. Perez-Liebana (2019) Efficient Evolutionary Methods for Game Agent Optimisation: Model-Based is Best. arXiv preprint arXiv:1901.00723.
  • [20] S. M. Lucas, J. Liu, and D. Perez-Liebana (2018) The N-Tuple Bandit Evolutionary Algorithm for Game Agent Optimisation. In 2018 IEEE Congress on Evolutionary Computation (CEC), pp. 1–9.
  • [21] M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling (2018) Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents. Journal of Artificial Intelligence Research 61, pp. 523–562.
  • [22] M. J. Nelson (2016) Investigating Vanilla MCTS Scaling on the GVG-AI Game Corpus. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–7.
  • [23] D. Perez-Liebana, J. Liu, A. Khalifa, R. D. Gaina, J. Togelius, and S. M. Lucas (2019-09) General Video Game AI: a Multi-Track Framework for Evaluating Agents Games and Content Generation Algorithms. IEEE Transactions on Games 11 (3), pp. 195–214.
  • [24] D. Perez-Liebana, S. M. Lucas, R. D. Gaina, J. Togelius, A. Khalifa, and J. Liu (2019) General Video Game Artificial Intelligence. Morgan and Claypool Publishers.
  • [25] D. Perez-Liebana, S. Samothrakis, S. M. Lucas, and P. Rohlfshagen (2013) Rolling Horizon Evolution versus Tree Search for Navigation in Single-Player Real-Time Games. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 351–358.
  • [26] D. Perez-Liebana, S. Samothrakis, J. Togelius, S. M. Lucas, and T. Schaul (2016) General Video Game AI: Competition, Challenges and Opportunities. In Thirtieth AAAI Conference on Artificial Intelligence.
  • [27] S. Samothrakis, S. A. Roberts, D. Perez, and S. M. Lucas (2014) Rolling Horizon Methods for Games with Continuous States and Actions. In 2014 IEEE Conference on Computational Intelligence and Games, pp. 1–8.
  • [28] B. Santos, H. Bernardino, and E. Hauck (2018) An Improved Rolling Horizon Evolution Algorithm with Shift Buffer for General Game Playing. In 2018 17th Brazilian Symposium on Computer Games and Digital Entertainment (SBGames), pp. 31–316.
  • [29] M. H. Segler, M. Preuss, and M. P. Waller (2018) Planning Chemical Syntheses with Deep Neural Networks and Symbolic AI. Nature 555 (7698), pp. 604.
  • [30] M. Stephenson, D. Anderson, A. Khalifa, J. Levine, J. Renz, J. Togelius, and C. Salge (2018) A Continuous Information Gain Measure to Find the Most Discriminatory Problems for AI Benchmarking. arXiv preprint arXiv:1809.02904.
  • [31] X. Tong, W. Liu, and B. Li (2019) Enhancing Rolling Horizon Evolution with Policy and Value Networks. In 2019 IEEE Conference on Games (CoG), pp. 1–8.