Enhancing the Monte Carlo Tree Search Algorithm for Video Game Testing

by Sinan Ariyurek, et al.
Middle East Technical University

In this paper, we study the effects of several Monte Carlo Tree Search (MCTS) modifications for video game testing. Although MCTS modifications are widely studied in game playing, their impact on finding bugs remains unexplored. We focused on bug finding in our previous study, where we introduced synthetic and human-like test goals and used these test goals in Sarsa and MCTS agents to find bugs. In this study, we extend the MCTS agent with several modifications for game testing purposes. Furthermore, we present a novel tree reuse strategy. We experiment with these modifications by testing them on three testbed games, four levels each, that contain 45 bugs in total. We use the General Video Game Artificial Intelligence (GVG-AI) framework to create the testbed games and to collect 427 human tester trajectories. We analyze the proposed modifications in three parts: we evaluate their effects on the bug-finding performance of agents, we measure their success under two different computational budgets, and we assess their effects on the human-likeness of the human-like agent. Our results show that MCTS modifications improve the bug-finding performance of the agents.




I Introduction

The success of a video game can be attributed to various qualities, but not to its bugs. The bugs found after release decrease the overall user experience and increase the budget spent on the game. To decrease the number of bugs, video game companies employ tremendous test efforts. However, video game testing is challenging as the game requirements change frequently [25]. Each change in the game requirements requires repeating the tests and conducting new tests. Consequently, researchers have proposed techniques to automate video game testing, such as scenario-based testing [5], regression testing using record and replay segments [16], generating test sequences from UML and state diagrams [9], creating a Petri net of the game and producing sequences to be tested [8], and employing reinforcement learning (RL) to expeditiously test an adventure game [21]. Nonetheless, these approaches lack one or more of the following: an overall game testing experiment, an automated oracle, an intelligent tester agent, or a comparison with human testers.

In our previous work [1], we generated test goals for Sarsa and MCTS agents to play the game with the purpose of testing the game (see Section II-C). In our experiments, we used Sarsa(λ) [31], and MCTS with transpositions and knowledge-based evaluations (KBE) [3]. We used the GVG-AI framework to create testbed games that contain bugs. We conducted the experiments using these games, and our agents achieved bug-finding percentages comparable to those of the human testers. Additionally, our experiments revealed that the stochasticity of MCTS is beneficial in bug finding. Therefore, in this paper, we investigate MCTS modifications and examine the consequences of different computational budgets for game testing purposes.

MCTS modifications are used by several researchers. In GVG-AI, several enhancements [18], [7], [30], [10] are employed to increase the performance of the Vanilla MCTS. Moreover, the authors of [7] and [30] noted that not every enhancement contributes equally to the performance. Furthermore, in board games, different MCTS enhancements [22] were favored in different games. Therefore, in this study, we experiment with several MCTS modifications and compare them under two distinct computational budgets. Our aim is to analyze their impact on the bug-finding performance of our agents. In this regard, we propose to use 6 different enhancements, and within these enhancements, we introduce a new tree reuse strategy.

This paper is structured as follows: Section II gives preliminary information about MCTS, GVG-AI, and our previous work. Section III presents the considered modifications and their use in related research. The details of our experiments are given in Section IV, and Section V presents the results. Section VI discusses the outcome of the strategies used and their contributions. Section VII presents the conclusion and proposes further enhancements for future work.

II Preliminaries

The following subsections introduce the preliminary material: MCTS, GVG-AI, and our previous work.

II-A Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) [3] is a search algorithm that builds a tree to find the best available action. MCTS consists of four consecutive steps: selection, expansion, simulation, and backpropagation. These steps are executed iteratively until a certain condition is met. This condition can be a computational budget or finding the desiderata. The selection step chooses a node based on a Tree Policy. Eq. 1 shows Upper Confidence Bounds (UCB1), which is a well-known approach:

$$UCB1 = \overline{X}_j + C \sqrt{\frac{\ln N}{n_j}} \qquad (1)$$

Here, $\overline{X}_j$ is the average score of child $j$, $C$ is the exploration constant, $N$ represents the visitation count of the root, and $n_j$ is the visitation count of the child.

The expansion phase expands the search tree by adding one of the unexplored children of the selected node to the search tree. Simulation, starting from this unexplored child, generates actions to be taken based on a default policy. The score obtained from the state reached at the end of the simulation is backpropagated from the unexplored child up to the root. These four steps are executed in succession until the computational budget expires. Afterward, a child of the root node which corresponds to the best available action is returned.
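The four steps above can be sketched as a short loop. The following is a minimal illustration, not the implementation used in our experiments: the number-line domain (`ACTIONS`, `TARGET`, `step`, `score`) is a hypothetical stand-in for a game, the budget is counted in iterations rather than milliseconds, and returning the most visited child is one common choice for the "best available action".

```python
import math
import random

ACTIONS = (-1, +1)  # hypothetical toy domain: walk along a number line
TARGET = 3          # hypothetical goal position


def step(state, action):
    return state + action


def score(state):
    # Closer to TARGET is better; the score is clipped to [0, 1].
    return max(0.0, 1.0 - abs(TARGET - state) / 10.0)


class Node:
    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action          # action that led from the parent here
        self.children = []
        self.untried = list(ACTIONS)  # actions that are not expanded yet
        self.visits = 0
        self.total = 0.0              # sum of backpropagated scores


def ucb1(child, parent_visits, c=math.sqrt(2)):
    mean = child.total / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def mcts(root_state, iterations=200, rollout_depth=6, rng=None):
    rng = rng or random.Random(0)
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ucb1(ch, node.visits))
        # 2. Expansion: add one unexplored child to the tree.
        if node.untried:
            action = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(step(node.state, action), parent=node, action=action)
            node.children.append(child)
            node = child
        # 3. Simulation: random default policy up to a fixed depth.
        state = node.state
        for _ in range(rollout_depth):
            state = step(state, rng.choice(ACTIONS))
        reward = score(state)
        # 4. Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.total += reward
            node = node.parent
    best = max(root.children, key=lambda ch: ch.visits)
    return best.action, root
```

Calling `mcts(0)` returns the chosen action together with the root, so the visit counts of the root's children can be inspected after the budget expires.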

II-B GVG-AI

GVG-AI [20] is a framework that contains several two-dimensional games. There are more than 120 single-player games, including well-known games such as Mario, Sokoban, and Zelda. The game rules are written using a language called Video Game Description Language (VGDL) [27]. The diversity of the games creates a challenging environment for general video game AI.

II-C Synthetic and Human-like Test Goals

Game testing behavior is different from game playing, as a tester’s aim is finding defects and interacting with the game to break it. A tester may have different goals besides finishing the game. In our previous research, we encapsulated these goals as test goals, and we proposed two different approaches to generate them, synthetic and human-like [1].

A test goal consists of features and a criterion for each feature, where each criterion is a positive rational number. Each feature stores a weight to define the reward obtained. A criterion defines a percentage for a feature, such as the percentage of a wall to be tested or the percentage of space to be explored. If the agent interacts with a feature more than its criterion, the agent should receive less reward, as the criterion is fulfilled. We implemented this behavior by using a dampening factor, which diminishes the reward obtained.

Moreover, we challenged our agents to test multiple goals in a game while preserving the goal order. Hence, the agent is given a sequence of goals, and starting from the first goal, the agent iterates over them. When the agent accomplishes a goal, it passes on to the next goal.

The accomplishment of a goal is determined by a criteria threshold. The sequence generated by the agent is checked to evaluate how much of the criteria are fulfilled. This evaluated score is then compared with the criteria threshold to determine whether the agent has reached the goal.
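The relationship between features, criteria, dampening, and the criteria threshold can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the `Feature` fields, the averaging used in `fulfillment`, and the default values of `dampening` and `threshold` are hypothetical and are not the values or formulas used in our experiments.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    weight: float     # reward weight of the feature
    criterion: float  # positive fraction of the feature to be exercised


def feature_reward(feature, exercised, dampening=0.5):
    """Full weight until the criterion is met, dampened weight afterwards."""
    if exercised < feature.criterion:
        return feature.weight
    return feature.weight * dampening


def fulfillment(goal, progress):
    """Average fraction of each criterion that is fulfilled (assumed aggregation).

    `goal` is a list of Feature objects and `progress` maps a feature name
    to the fraction of that feature exercised so far.
    """
    ratios = [min(progress.get(f.name, 0.0) / f.criterion, 1.0) for f in goal]
    return sum(ratios) / len(ratios)


def current_goal_index(goals, progress_per_goal, threshold=0.8):
    """Index of the first goal in the sequence that is not yet accomplished."""
    for index, (goal, progress) in enumerate(zip(goals, progress_per_goal)):
        if fulfillment(goal, progress) < threshold:
            return index
    return len(goals)  # every goal in the sequence is accomplished
```

Because `current_goal_index` scans the sequence from the start, a later goal is only pursued after all earlier goals have passed the threshold, which mirrors the goal-order requirement described above.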

A game can be represented with a graph, where nodes are the states of a game, and edges are the actions that progress the story. We generated paths using this graph and a graph coverage criterion. Since playing only these paths corresponds to testing the valid game paths, we modified these paths to examine the effects of unintended game transitions. For a game where the player has to pick up the key to go through the locked door, two examples of unintended game transitions are attacking the key and trying to move through the locked door without picking up the key. We created synthetic goals using these game paths and modified game paths. We also created baseline goals by using only the game paths.

We introduced multiple greedy-policy inverse reinforcement learning (MGP-IRL) to extract test goals from the collected human tester trajectories. We call the test goals obtained by this approach human-like test goals.

Lastly, we introduced the test state, which is a supplementary state to the game state that holds executed interactions. In grid games, interactions occur between sprites. When the avatar attacks a wall, for example, we formulate this interaction and store it in the test state.

In this paper, an agent that uses a synthetic test goal is a synthetic agent, an agent that uses a baseline test goal is a baseline agent, and an agent that uses a human-like test goal is a human-like agent.

III MCTS Modifications

Researchers have designed various enhancements to increase the performance of game-playing agents. We, however, are interested in enhancing the bug-finding performance of our MCTS agent. In this section, we present the modifications we use in our tester agent and review their uses in game-playing research. These modifications are Transpositions, Knowledge-Based Evaluations, Tree Reuse, MixMax, Boltzmann Rollout, Single Player MCTS, and Computational Budget [3].

III-A Transpositions

In MCTS, the space of the game is explored as a tree, which can lead to having multiple nodes for a game state. Childs et al. [4] introduced transpositions in MCTS. Transposition tables (TT) promote sharing information between nodes of a tree that correspond to the same game state. The authors used this shared information to calculate the UCB1 value of a node, and they proposed three methods for this calculation. Perez et al. [19] used TT for the Deep Sea Treasure game, Świechowski et al. [32] used TT in General Game Playing (GGP), and Choe and Kim [6] used TT for Hearthstone. Xiao et al. [34] used the feature representation of a state to query similar states in memory.

We use transposition tables because they are an effective way to share statistics between nodes. In our implementation, an entry in the TT stores the information corresponding to a node in the search tree, and the tree node only holds a pointer to an entry in the table. In the selection step, the values stored in the TT are used. In the backpropagation step, the information corresponding to the nodes from the simulated node up to the root node is updated. The TT is not used during rollouts.
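A minimal sketch of this arrangement is given below. The names (`Entry`, `TranspositionTable`, `TreeNode`, `backpropagate`) and the use of a hashable `state_key` are assumptions for illustration; the sketch only shows tree nodes pointing at shared table entries, and it updates only the path that was actually traversed.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    visits: int = 0
    total: float = 0.0

    @property
    def mean(self):
        return self.total / self.visits if self.visits else 0.0


class TranspositionTable:
    def __init__(self):
        self._entries = {}

    def entry(self, state_key):
        # Nodes that reach the same game state receive the same entry.
        return self._entries.setdefault(state_key, Entry())

    def __len__(self):
        return len(self._entries)


class TreeNode:
    def __init__(self, state_key, table, parent=None):
        self.entry = table.entry(state_key)  # pointer into the table
        self.parent = parent


def backpropagate(node, reward):
    # Update the shared entries from the simulated node up to the root.
    while node is not None:
        node.entry.visits += 1
        node.entry.total += reward
        node = node.parent
```

With this layout, two tree nodes that correspond to the same state read the same `mean` and `visits` during selection, regardless of which path reached them.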

III-B Knowledge-Based Evaluations

In GVG-AI, it is often difficult to find a terminal state or even a state that changes the game score, which may cause the MCTS agent to behave randomly. Powley et al. [22] introduced enhancements to exploit the episodic nature of games. Their information capture and reuse techniques were found beneficial in games such as Dou Di Zhu, Hearts, and Othello. Soemers et al. [30] applied 8 modifications to the open-loop implementation of MCTS. These modifications, one of which is KBE, were tested on the GVG-AI corpus, and the authors state that the KBE modification is prominent. İlhan and Etaner-Uyar [10] combined MCTS with temporal difference learning in GVG-AI games to exploit domain knowledge using past experience. Silver et al. [28] trained a value network in AlphaGo to effectively evaluate the state of Go, and AlphaGo, using this value network in MCTS, beat the world champion.

In game testing research, the game being tested may contain bugs, and these bugs may prevent the agent from reaching a terminal state. Furthermore, terminal states or the points received from the game can be deceptive for the game testing agents, as losing the game is also an objective. Hence, we use the KBE to direct an agent.


We evaluate a state using Eq. 2. The evaluation combines the features that are seen in the state resulting from taking an action in the current state with the weights of these features. It also includes the weight of completing a goal and the fulfillment amounts of the goal criterion in the current state and in the resulting state. A dampening factor reduces the weights of the features that surpass their criteria. We employ this enhancement in all of our MCTS agents.

The parameters of this evaluation are the criteria threshold, the goal reward, and the weight given to features that are observed but do not exist in the features of the goal.

III-C Tree Reuse

A tree reuse strategy uses the previously generated tree to guide the forthcoming MCTS runs. Moreover, pruning this tree is as simple as selecting the subtree of the selected child. Santos et al. [24] employed tree reuse for an MCTS agent in Hearthstone. Pepels et al. [17] proposed a decaying tree reuse strategy in Ms. Pac-Man, and Soemers et al. [30] used this decaying tree reuse strategy in GVG-AI games.

With transposition tables, an update scheme such as UCT3 [4] updates every node that precedes the simulated node, which is cumbersome and time-consuming. When the tree is reused, the complexity of updating a node increases as the game progresses. Furthermore, our proposed test state increases the number of states of a game. Consequently, reusing the whole search tree is not practical, and we need to prune this tree. Pepels et al. [17] used a rule-based method to remove the old nodes, and Powley et al. [23] proposed a node recycling method. In contrast, we propose a lightweight tree reuse method that integrates effortlessly with transpositions, called fast expansion.

Fast expansion uses the previously acquired tree in the selection and expansion phases. If the selected node exists in the previous tree, it is flattened and added to the current tree. The flattening process calculates the average score of the node and sets the visitation count of the node to a new, flattened value. The selection phase continues until it finds a node that does not exist in the previous tree. At this point, MCTS continues with simulation and backpropagation. This algorithm prunes the children that are not chosen in the selection and expansion steps. Fast expansion supports acquiring the previous relevant knowledge and prevents bloating of previous visits and scores by flattening the nodes.

This approach can be perceived as remembering by doing. If MCTS repeats an action, its value is passed on to the next generation; otherwise, it is forgotten. Furthermore, this strategy can also be applied to graphs.
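The "remembering by doing" behavior can be sketched as follows. This is a simplified illustration under stated assumptions: both trees are reduced to dictionaries that map a hypothetical state key to `(visits, total_score)`, the selected path is supplied as a list of keys, and the flattened visitation count `flat_visits=1` is an assumed value chosen for the example.

```python
def flatten(prev_visits, prev_total, flat_visits=1):
    """Keep the average score but shrink the visitation count (assumed to 1)."""
    mean = prev_total / prev_visits
    return flat_visits, mean * flat_visits


def select_with_fast_expansion(path_keys, previous, current):
    """Walk the selected path and copy flattened statistics of known nodes.

    Returns the first key on the path that is absent from the previous tree,
    which is where simulation and backpropagation would start, or None if
    the whole path was already known.
    """
    for key in path_keys:
        if key in current:
            continue               # already part of the current tree
        if key not in previous:
            return key             # unseen node: stop fast expansion here
        visits, total = previous[key]
        current[key] = flatten(visits, total)
    return None
```

Nodes of the previous tree that the current search never selects are simply never copied, so the pruning happens implicitly.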

III-D MixMax

Jacobsen et al. [11] introduced MixMax to avert cowardly behavior in Super Mario Bros., and Khalifa et al. [13] used MixMax to enhance the human-likeness of an MCTS agent. Frydenberg et al. [7] employed MixMax in GVG-AI games. The authors found mixed results for MixMax. We assume that a tester agent should be able to act more boldly, as we want the agents to consider the paths that lead to a goal even though they are risky. MixMax blends the average score $\overline{X}_j$ in Eq. 1 with the maximum score $\max_j$ of the child, as shown in Eq. 3, where $Q$ is the mixing parameter.

$$Q \cdot \max_j + (1 - Q) \cdot \overline{X}_j \qquad (3)$$

The win condition of our tester agent is the accomplishment of a test goal. Since the MixMax modification supports choosing the risky move, it increases the possibility of pursuing this path. In the experiments, the mixing parameter is held at a fixed value.
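As a small illustration of the blending, the function below mixes a child's maximum and average scores. The function name and argument names are hypothetical, and the value of `q` used in the usage note is an arbitrary example, not the value used in our experiments.

```python
def mixmax(mean_score, max_score, q):
    """Blend the maximum and the average score with mixing parameter q in [0, 1]."""
    return q * max_score + (1.0 - q) * mean_score
```

For example, a child with an average score of 0.5 and a best observed score of 1.0 is valued at `mixmax(0.5, 1.0, 0.25)`, which lies between the two; with `q = 0` the value falls back to the plain average used in UCB1.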

III-E Boltzmann Rollout

In Vanilla MCTS, the simulation policy selects random actions in the rollouts. In GGP, Finnsson and Bjornsson used Gibbs sampling to calculate the probability of actions. The authors biased the simulation policy by selecting an action using the probabilities. Tak et al. [33] argued that this selection mechanism does not fix the selection probability of the best action. Therefore, they used ε-greedy to fix this probability. Powley et al. [22] stated that the ε-greedy approach is better than Gibbs sampling in GGP. In Go, Silver et al. [29] used softmax to parameterize the simulation policy. In GVG-AI, Perez et al. [18] used the learned experience to bias the rollouts. In this study, we use the Boltzmann rollout. The Boltzmann rollout is based on the Boltzmann exploration strategy in RL [31].


$$P(a_i) = \frac{e^{\beta V(a_i)}}{\sum_{k} e^{\beta V(a_k)}} \qquad (4)$$

$P(a_i)$ represents the probability of choosing node $i$ in the rollout. The Boltzmann beta $\beta$ in Eq. 4 controls the randomness of the move, where $\beta = 0$ is the same as a random rollout. The value $V(a_i)$ is the score obtained from taking the action $a_i$; in our case, it is calculated using Eq. 2.

On the other hand, James et al. [12] investigated why better-informed rollouts often result in worse-performing agents. In their work, they described that heavy knowledge-based rollouts cause high bias and low variance, which can result in poor performance. Hence, in the experiments, we choose a value of the Boltzmann beta that increases the randomness of the simulation policy.
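A Boltzmann rollout step can be sketched as below, assuming the common softmax form in which beta multiplies the action values, so that a beta of zero gives a uniformly random rollout. The function names are hypothetical, and the action values would come from the knowledge-based evaluation in our setting; here they are plain numbers.

```python
import math
import random


def boltzmann_probabilities(values, beta):
    """Softmax over action values; beta = 0 yields a uniform distribution."""
    highest = max(values)
    # Subtracting the maximum keeps the exponentials numerically stable
    # without changing the resulting probabilities.
    weights = [math.exp(beta * (v - highest)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(actions, values, beta, rng=random):
    """Draw one action for the rollout according to the Boltzmann probabilities."""
    probabilities = boltzmann_probabilities(values, beta)
    return rng.choices(actions, weights=probabilities, k=1)[0]
```

A small beta keeps the rollout close to random while still nudging it toward actions with higher evaluated scores; a large beta makes the rollout nearly greedy.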

III-F SP-MCTS

Schadd et al. [26] introduced Single-Player MCTS (SP-MCTS). SP-MCTS modifies UCB1 by adding a term that represents the possible deviation of a node. This term offers finer control over the exploration/exploitation dilemma when the nodes have varying results. The authors showed that, in puzzle games, their modification outperformed other methods such as IDA*, which is a variant of A* that limits the depth of search and iteratively increases this depth until the criterion is achieved.


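$$\overline{X}_j + C \sqrt{\frac{\ln N}{n_j}} + \sqrt{\frac{\sum x^2 - n_j \overline{X}_j^2 + D}{n_j}} \qquad (5)$$

SP-MCTS adds a third term for finer control in the exploitation/exploration dilemma (see the third term in Eq. 5). Here, $\overline{X}_j$ is the average score of child $j$, $n_j$ is its visitation count, $N$ is the visitation count used in Eq. 1, and $\sum x^2$ is the sum of the squared scores obtained in the child. This third term tilts UCB1 in favor of the nodes that have a high variance, and $D$ is a large constant value that marks rarely explored nodes as uncertain. In the experiments, this constant is set to a fixed value.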
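The effect of the third term can be checked with a short sketch of the selection score. The function and argument names are hypothetical, and the constants in the usage note are arbitrary example values rather than the ones used in our experiments.

```python
import math


def sp_mcts_score(mean, sum_squares, child_visits, parent_visits, c, d):
    """UCB1 plus the SP-MCTS deviation term for one child."""
    exploration = c * math.sqrt(math.log(parent_visits) / child_visits)
    # The variance estimate is clamped at zero to guard against rounding noise.
    spread = max(sum_squares - child_visits * mean ** 2 + d, 0.0)
    deviation = math.sqrt(spread / child_visits)
    return mean + exploration + deviation
```

Consider two children that were each visited four times with the same average of 0.5: one always scored 0.5 (sum of squares 1.0), the other alternated between 0 and 1 (sum of squares 2.0). The second child receives the higher score, which is the tilt toward high-variance nodes described above.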

III-G Computational Budget

Although the computational budget is not a modification, it is a parameter of MCTS. Nelson [15] examined various computational budgets for MCTS. The author found that in the GVG-AI framework after a certain computational budget, the win rate becomes stable. Baier and Winands [2] compared different time management strategies of MCTS in detail for five different board games. The experimental results of research on the computational budget are promising. Therefore, we would like to investigate the effect of the computational budget on bug finding behavior.

IV Experiments

We created three games, each consisting of four levels, using the GVG-AI framework. We inserted a total of 45 bugs into these games, mostly by changing the VGDL code. The first game has a 6×7 grid and is called Game A. In this game, the player has to pick up the key and go through the locked door to finish the game. The second game has an 8×9 grid and is called Game B. In this game, the player has to put out the fire by pushing a water bucket, and pick up the key to go through the locked door. The last game, Game C, has a 10×11 grid. In this game, the key is broken into two pieces, and the player has to combine them by pushing them into each other, then pick up the key to go through the locked door. For these three games, we used a similar sprite set, but a different layout for each level of a game.

We collected a total of 427 trajectories from 15 different human participants who have various gaming and testing experience. The testers warmed up by playing example levels to get used to the game controls and the environment. During testing, the players were able to test the games in any order and any number of times. The tester trajectories are collected using the GVG-AI framework.

Our human-like test goals are generated using these collected trajectories. During tests, human-like agents used the human-like test goals that were extracted from the other three levels of the same game. We generated synthetic test goals by sampling paths from the game graph of a level. This game graph is provided by the game developer. The paths are modified using the sprite set of this level. The unmodified test goals are used as baseline test goals. During the tests, the synthetic agent used the synthetic test goals, and the baseline agent used baseline test goals, both of which are specifically generated for that level.

We created five different MCTS agents using the modifications described in Section III, and for each level, we ran them five times. These agents are KBE-MCTS, FE-MCTS, MM-MCTS, BR-MCTS, and SP-MCTS (see Table I for the modifications). All of the MCTS agents shared the same settings, including a rollout depth of 6. The exploration term is set to the same value in all MCTS agents except SP-MCTS, which uses a different value. For the computational budget, we experimented with 40 and 300 milliseconds on an i7-8750H (4.1 GHz) using a single core. After these agents generated the test sequences, each sequence was checked by an automated test oracle.

In this study, we asked the following research questions (RQ). RQ1: What is the impact of different computational budgets? RQ2: Which modifications enhance MCTS's bug-finding performance? RQ3: How does the bug-finding performance compare to that of Sarsa(λ) using the same test goals? RQ4: What is the effect of modifications on human-like behavior? The answers to these research questions and the results of the experiments are presented in the next section.

TABLE I: Modifications Used in MCTS Agents

V Results

Table II presents the results of our experiments. All of the values that are shown with intervals are given at a confidence level of 0.95. When counting the number of bugs found, multiple occurrences of the same bug are counted as one. As more than one tester tested a game, Combined indicates all of the bugs found by these testers when their results are merged, and Individual indicates the bugs found by each tester individually. As we aim to create tester agents, we also value the agents that find the most bugs with shorter test sequences within the shortest computational budget. Lastly, we used cross-entropy to compare the interactions executed in a human tester's trajectory with those of the human-like agent. The lower the cross-entropy, the more similar the interactions are.
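To make the cross-entropy comparison concrete, the sketch below compares the interaction frequencies of two trajectories. The representation of an interaction as a string, the shared vocabulary, and the additive smoothing are assumptions introduced for this example; they are not a description of the exact procedure used to produce Table II.

```python
import math
from collections import Counter


def distribution(interactions, vocabulary, smoothing=1.0):
    """Smoothed relative frequency of each interaction in the vocabulary."""
    counts = Counter(interactions)
    total = len(interactions) + smoothing * len(vocabulary)
    return {key: (counts[key] + smoothing) / total for key in vocabulary}


def cross_entropy(human, agent, smoothing=1.0):
    """Cross-entropy of the agent's interaction distribution relative to the human's."""
    vocabulary = sorted(set(human) | set(agent))
    p = distribution(human, vocabulary, smoothing)
    q = distribution(agent, vocabulary, smoothing)
    return -sum(p[key] * math.log(q[key]) for key in vocabulary)
```

Under these assumptions, an agent trajectory that executes the same mix of interactions as the human trajectory yields a lower value than one that repeats a single interaction, which matches the reading that lower cross-entropy means more similar interactions.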

V-A Game A

Game A has a 6×7 grid size. Table II shows that human testers, when combined, were able to find 90% of the bugs, whereas individual performance is almost half of this score. Human-like MCTS agents with a 40ms computational budget were able to find all of the bugs, except BR-MCTS. All MCTS agents generated sequences of similar length, except BR-MCTS. Cross-entropy scores of MM-MCTS and SP-MCTS are lower than those of the other MCTS agents. Increasing the computational budget increased the bug-finding percentages and decreased the sequence lengths. Cross-entropy scores also decreased for every agent except FE-MCTS. The synthetic agent with a 40ms computational budget was not able to find all the bugs. SP-MCTS has the highest bug-finding score and FE-MCTS has the lowest sequence length. The increase in the computational budget affects FE-MCTS and MM-MCTS positively. Baseline MCTS agents found at most 44% of the bugs with a 40ms computational budget, and an increase in the computational budget decreased the bug-finding percentage to 40%. Overall, human-like MCTS scores are similar to those of human-like Sarsa(λ), but the synthetic Sarsa(λ) score is better than that of synthetic MCTS.

V-B Game B

Fig. 1: The paths that lead to four different bugs in Game B are shown with lines. Yellow Line: When the Avatar pushes the Water Bucket into Key, the two sprites overlap. The rule to prevent this overlap is missing in VGDL, which is a discrepancy from the game design. Orange Line: The Avatar can push the Water Bucket into Wall. The collision rule between this specific Wall and Water Bucket is missing in VGDL, which is another discrepancy from the game design. Purple Line: The Avatar can move through the Wall and can finish the game without picking up the Key. These requirements exist in the game design, but they are not implemented in VGDL.

Game B has an 8×9 grid size, as shown in Fig. 1. Table II shows that human testers, combined, were able to find all of the bugs. However, when they are evaluated individually, their scores are lower than in Game A. In Game B, none of the agents were able to find all of the bugs. Under the 40ms computational budget, FE-MCTS found more bugs than other MCTS agents. The sequence length of all MCTS agents is similar except BR-MCTS. KBE-MCTS has the lowest cross-entropy, followed by FE-MCTS. The increase in computational budget decreased the cross-entropies of all agents. This increase also positively affected all agents except FE-MCTS. For synthetic agents, under both computational budgets, FE-MCTS found more bugs than other MCTS agents. Baseline scores of KBE-MCTS, FE-MCTS, and BR-MCTS are close to each other and higher than those of MM-MCTS and SP-MCTS. The bug-finding percentages of human-like Sarsa(λ) and synthetic Sarsa(λ) are higher than those of human-like MCTS and synthetic MCTS, respectively.

V-C Game C

Game C has the biggest grid size of all three games, which is 10×11. Table II shows that several MCTS agents were able to surpass Sarsa(λ). The individual bug-finding performances of human testers are the lowest, and their combined performance is 90%. Although synthetic MCTS agents do not surpass synthetic Sarsa(λ) agents, the human-like MCTS agents surpass human-like Sarsa(λ) agents, and some even surpass human testers. The increase in the computational budget also increases the performance of every MCTS agent except baseline BR-MCTS. Using a 300ms computational budget, human-like FE-MCTS and SP-MCTS beat every other agent. The shortest trajectory amongst human-like agents is executed by MM-MCTS, and FE-MCTS has the lowest cross-entropy. Synthetic BR-MCTS has the best bug-finding percentage amongst MCTS agents. The baseline agent was not able to find any bugs in some runs and was most efficiently played by FEBR-MCTS.

TABLE II: Bug Finding Percentage (%), Trajectory Length, and Cross-Entropy Results of Human Testers and Agents using Sarsa(λ), KBE-MCTS, MM-MCTS, FE-MCTS, BR-MCTS, and SP-MCTS, obtained from Game A (6x7), Game B (8x9), and Game C (10x11) under computational budgets of 40ms and 300ms. The values shown as ranges are given at a confidence level of 0.95.

VI Discussion

In this paper, we experimented with several modifications to MCTS for creating a better tester agent. We experimented with these modifications on three games with 45 bugs, evaluated their effects on bug-finding performance, examined how they are affected by the computational budget, and compared these findings with the human testers and an agent using Sarsa(λ).

To address RQ1, we discuss the effect of increasing the computational budget on the MCTS modifications. For human-like agents, our experiments reveal that the increase in computational budget positively affected most of the MCTS agents. Amongst those, KBE-MCTS benefits the most. Therefore, we can state that KBE-MCTS was not able to explore the tree sufficiently using a 40ms computational budget. SP-MCTS and MM-MCTS are also affected positively, but not as much as KBE-MCTS. However, with additional computation, they become more stable agents, as their confidence intervals shrink. BR-MCTS has a better performance than KBE-MCTS in Game B and Game C using a 40ms computational budget. However, the increase in the computational budget reverses the situation. The bias in the rollouts of BR-MCTS limits its upper bound, and BR-MCTS becomes the most stable agent. The only advantage of FE-MCTS over KBE-MCTS is tree reuse. Tree reuse improves the bug-finding percentage considerably under 40ms, but with the increase in computational budget, KBE-MCTS outperforms FE-MCTS in Game A and Game B. Since these games are small compared to Game C, tree reuse starts to decrease the stochasticity. FE-MCTS also executes the shortest sequences under the 300ms computational budget. For synthetic agents, the increase in computational budget increased the bug-finding performance of all agents, which indicates that synthetic test goals are more difficult to reach compared to human-like test goals. The percentage of bugs that a baseline agent can find is limited to the bugs in the scenario, and baseline Sarsa(λ) represents this percentage. In baseline FE-MCTS, the increase in the computational budget had a positive effect. For other baseline MCTS agents, the effect is mixed. Furthermore, there are instances where a baseline MCTS agent surpasses the baseline Sarsa(λ). This bug-finding performance boost is also due to the stochasticity of MCTS, which also contributed to finding the fake walls in [14].

We address RQ2 by comparing the effects of modifications on bug-finding performance. For human-like agents, SP-MCTS with a 300ms computational budget is the overall best. Although FE-MCTS can reach the same upper bound and even exceed that bound, FE-MCTS achieves this performance when we consider both computational budgets. However, for these instances, the sequence lengths of SP-MCTS are shorter than those of FE-MCTS. MM-MCTS has more variance than the other agents in Game C. This variance can be explained by the MixMax modification. The upper bounds of MM-MCTS and SP-MCTS are close, but since SP-MCTS explores more, it guarantees a higher lower bound. BR-MCTS is the least successful human-like agent, but it is stable. On the other hand, when we look at synthetic agents, SP-MCTS is one of the least successful agents, and BR-MCTS starts to excel. This indicates that synthetic test goals are located deeper in the tree compared to human-like goals. Therefore, MixMax and Tree Reuse become useful. BR-MCTS is also useful, as these goals can be found during biased rollouts. There is no clear winner in synthetic MCTS, but FE-MCTS is promising.

To address RQ3, we compare the MCTS variants with Sarsa(λ) using Table II. In Game A, human-like agents using MCTS variants reach the bug-finding percentage obtained with Sarsa(λ), and they can achieve this with a 40ms computational budget, except BR-MCTS. In Game B, the human-like agents using MM-MCTS, FE-MCTS, and SP-MCTS are 3-5% behind Sarsa(λ). In Game C, MM-MCTS, FE-MCTS, and SP-MCTS with a 40ms computational budget and all human-like MCTS agents with 300ms surpass the bug-finding performance of Sarsa(λ). These bug-finding percentages show that human-like MCTS agents can compete with Sarsa(λ) in bug finding, with the advantage of using a smaller computational budget. For synthetic agents, we observe that the bug-finding performance of Sarsa(λ) is better. Nevertheless, FE-MCTS in Game A and Game B, and MM-MCTS in Game C, are the best competitors. Furthermore, for every game, we can see that Sarsa(λ) produces shorter sequences, and these sequences are more human-like compared to those of MCTS agents. For the baseline agent, FE-MCTS with a 300ms computational budget performs closest to Sarsa(λ), thanks to the tree reuse modification.

We address RQ4 by comparing the cross-entropy scores in Table II. The human-likeness of KBE-MCTS, SP-MCTS, and MM-MCTS is directly related to the computational budget, but this is not the case for FE-MCTS and MM-MCTS. However, we cannot conclude that an agent that behaves more like the original human tester will find more bugs. The heuristics learned from human testers provide the goals to test the game, and any randomization added while generating a sequence decreases the similarity. On the other hand, randomization allows different runs to find distinct bugs.
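As a generic illustration of how a cross-entropy score can compare an agent with a human tester, the sketch below computes the cross-entropy between two empirical action distributions. This is an assumption-laden simplification: the helper names, the smoothing constant `eps`, and the use of plain action frequencies are hypothetical, and the scores in Table II follow the definition given earlier in the paper, not this code.

```python
import math
from collections import Counter


def action_distribution(actions, action_set, eps=1e-6):
    """Empirical action frequencies with additive smoothing (eps is assumed)."""
    counts = Counter(actions)
    total = len(actions) + eps * len(action_set)
    return {a: (counts[a] + eps) / total for a in action_set}


def cross_entropy(p, q):
    """H(p, q) = -sum_a p(a) * log q(a); lower means q is closer to p."""
    return -sum(p[a] * math.log(q[a]) for a in p)


# Hypothetical trajectories over a four-action game.
ACTIONS = ["LEFT", "RIGHT", "UP", "DOWN"]
human = action_distribution(["LEFT", "LEFT", "RIGHT", "UP"], ACTIONS)
similar_agent = action_distribution(["LEFT", "RIGHT", "LEFT", "UP"], ACTIONS)
different_agent = action_distribution(["DOWN", "DOWN", "DOWN", "DOWN"], ACTIONS)

score_similar = cross_entropy(human, similar_agent)
score_different = cross_entropy(human, different_agent)
```

In this toy setting the agent whose action frequencies match the human's obtains the lower score, while any randomization that shifts those frequencies raises it.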

VII Conclusion

In this paper, we employed several modifications to the MCTS algorithm to evaluate their effects on finding bugs. In this regard, we proposed to use transpositions, knowledge-based evaluations, tree reuse, MixMax, Boltzmann rollouts, and SP-MCTS. We exercised these modifications in three games. Our synthetic and human-like test goals were exercised using MCTS to generate sequences that were later replayed in the game to check for bugs with our oracle. We employed two different computational budgets, 40 and 300 milliseconds, to better understand the effect of timing on these modifications.
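Of the modifications listed above, Boltzmann rollouts are perhaps the simplest to illustrate. The following minimal sketch samples a rollout action with probability proportional to the exponential of its heuristic value. The function names, the `temperature` parameter, and the example values are assumptions for illustration only and do not reproduce the settings of BR-MCTS in our experiments.

```python
import math
import random


def boltzmann_probs(values, temperature=1.0):
    """Softmax over action values; subtracting the max keeps exp() stable."""
    highest = max(values)
    weights = [math.exp((v - highest) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_choice(actions, values, temperature=1.0, rng=random):
    """Sample one action; better-valued actions are chosen more often."""
    probs = boltzmann_probs(values, temperature)
    return rng.choices(actions, weights=probs, k=1)[0]


# Hypothetical heuristic values for three actions.
probs = boltzmann_probs([1.0, 2.0, 3.0])
nearly_uniform = boltzmann_probs([1.0, 2.0, 3.0], temperature=1000.0)
picked = boltzmann_choice(["LEFT", "RIGHT"], [0.0, 0.0], rng=random.Random(0))
```

A low temperature makes the rollout nearly greedy with respect to the heuristic, whereas a high temperature approaches uniform random rollouts, so the temperature controls how strongly the rollouts are biased.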

Our results show that the modifications are useful, but the effectiveness of each modification depends on the type of agent. From our experiments, we found that, for the synthetic agent, MM-MCTS and FE-MCTS performed better, but BR-MCTS had a better lower bound score with a shorter sequence. FE-MCTS with a 300ms computational budget was a better baseline agent than the other baseline MCTS agents. For human-like agents, SP-MCTS performed solidly under both computational budgets, and FE-MCTS was a close contender.

In the future, we would like to experiment with reuse strategies for MM-MCTS and SP-MCTS. MM-MCTS and FE-MCTS are the best-performing synthetic MCTS agents, so their combination may beat synthetic Sarsa(λ). Integrating tree reuse into SP-MCTS may create a more powerful human-like agent. Furthermore, we would like to extend the experiments to other GVG-AI games.


  • [1] S. Ariyurek, A. Betin-Can, and E. Surer (2019) Automated video game testing using synthetic and human-like agents. IEEE Transactions on Games, pp. 1–1. ISSN 2475-1510.
  • [2] H. Baier and M. H. M. Winands (2016-09) Time management for Monte Carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games 8 (3), pp. 301–314. ISSN 1943-068X.
  • [3] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton (2012-03) A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4 (1), pp. 1–43. ISSN 1943-068X.
  • [4] B. E. Childs, J. H. Brodeur, and L. Kocsis (2008-12) Transpositions and move groups in Monte Carlo tree search. In 2008 IEEE Symposium On Computational Intelligence and Games, pp. 389–395. ISSN 2325-4270.
  • [5] C. Cho, D. Lee, K. Sohn, C. Park, and J. Kang (2010-10) Scenario-based approach for blackbox load testing of online game servers. In 2010 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, pp. 259–265.
  • [6] J. S. B. Choe and J. Kim (2019-08) Enhancing Monte Carlo tree search for playing Hearthstone. In 2019 IEEE Conference on Games (CoG), pp. 1–7. ISSN 2325-4270.
  • [7] F. Frydenberg, K. R. Andersen, S. Risi, and J. Togelius (2015-08) Investigating MCTS modifications in general video game playing. In 2015 IEEE Conference on Computational Intelligence and Games (CIG), pp. 107–113. ISSN 2325-4289.
  • [8] J. Hernández Bécares, L. Costero, and P. Gómez-Martín (2016-08) An approach to automated videogame beta testing. Entertainment Computing 18.
  • [9] S. Iftikhar, M. Z. Iqbal, M. U. Khan, and W. Mahmood (2015-09) An automated model based testing approach for platform games. In 2015 ACM/IEEE 18th International Conference on Model Driven Engineering Languages and Systems (MODELS), pp. 426–435.
  • [10] E. İlhan and A. Ş. Etaner-Uyar (2017-08) Monte Carlo tree search with temporal-difference learning for general video game playing. In 2017 IEEE Conference on Computational Intelligence and Games (CIG), pp. 317–324. ISSN 2325-4289.
  • [11] E. J. Jacobsen, R. Greve, and J. Togelius (2014) Monte Mario: platforming with MCTS. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO ’14, New York, NY, USA, pp. 293–300. ISBN 978-1-4503-2662-9.
  • [12] S. James, G. Konidaris, and B. Rosman (2017) An analysis of Monte Carlo tree search. In AAAI.
  • [13] A. Khalifa, A. Isaksen, J. Togelius, and A. Nealen (2016) Modifying MCTS for human-like general video game playing. In IJCAI, pp. 2514–2520.
  • [14] T. Machado, D. Gopstein, A. Nealen, O. Nov, and J. Togelius (2018) AI-assisted game debugging with Cicero. 2018 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8.
  • [15] M. J. Nelson (2016-09) Investigating vanilla MCTS scaling on the GVG-AI game corpus. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–7. ISSN 2325-4289.
  • [16] M. Ostrowski and S. Aroudj (2013) Automated regression testing within video game development. GSTF Journal on Computing (JoC) 3 (2), pp. 1–5.
  • [17] T. Pepels, M. H. M. Winands, and M. Lanctot (2014-09) Real-time Monte Carlo tree search in Ms Pac-Man. IEEE Transactions on Computational Intelligence and AI in Games 6 (3), pp. 245–257. ISSN 1943-068X.
  • [18] D. Perez, S. Samothrakis, and S. Lucas (2014-08) Knowledge-based fast evolutionary MCTS for general video game playing. In 2014 IEEE Conference on Computational Intelligence and Games, pp. 1–8. ISSN 2325-4289.
  • [19] D. Perez Liebana, S. Mostaghim, S. Samothrakis, and S. Lucas (2014-01) Multi-objective Monte Carlo tree search for real-time games. IEEE Transactions on Computational Intelligence and AI in Games, pp. 1–1.
  • [20] D. Perez-Liebana, S. Samothrakis, J. Togelius, T. Schaul, S. M. Lucas, A. Couëtoux, J. Lee, C. Lim, and T. Thompson (2016-09) The 2014 general video game playing competition. IEEE Transactions on Computational Intelligence and AI in Games 8 (3), pp. 229–243. ISSN 1943-068X.
  • [21] J. Pfau, J. Smeddinck, and R. Malaka (2017) Automated game testing with ICARUS: intelligent completion of adventure riddles via unsupervised solving. In CHI PLAY.
  • [22] E. J. Powley, P. I. Cowling, and D. Whitehouse (2014-12) Information capture and reuse strategies in Monte Carlo tree search, with applications to games of hidden information. Artif. Intell. 217 (C), pp. 92–116. ISSN 0004-3702.
  • [23] E. J. Powley, P. I. Cowling, and D. Whitehouse (2017) Memory bounded Monte Carlo tree search. In AIIDE, pp. 94–100.
  • [24] A. Santos, P. A. Santos, and F. S. Melo (2017-08) Monte Carlo tree search experiments in Hearthstone. pp. 272–279.
  • [25] R. E. S. Santos, C. V. C. de Magalhães, L. F. Capretz, J. S. C. Neto, F. Q. B. da Silva, and A. Saher (2018) Computer games are serious business and so is their quality: particularities of software testing in game development from the perspective of practitioners. CoRR abs/1812.05164.
  • [26] M. P. D. Schadd, M. H. M. Winands, H. J. van den Herik, G. M. J. -B. Chaslot, and J. W. H. M. Uiterwijk (2008) Single-player Monte-Carlo tree search. In Computers and Games, Berlin, Heidelberg, pp. 1–12. ISBN 978-3-540-87608-3.
  • [27] T. Schaul (2014-12) An extensible description language for video games. Computational Intelligence and AI in Games, IEEE Transactions on 6, pp. 325–331.
  • [28] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529, pp. 484–489.
  • [29] D. Silver and G. Tesauro (2009) Monte-Carlo simulation balancing. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 945–952.
  • [30] D. J. N. J. Soemers, C. F. Sironi, T. Schuster, and M. H. M. Winands (2016-09) Enhancements for real-time Monte-Carlo tree search in general video game playing. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8. ISSN 2325-4289.
  • [31] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT Press.
  • [32] M. Świechowski, J. Mańdziuk, and Y. S. Ong (2016-09) Specialization of a UCT-based general game playing program to single-player games. IEEE Transactions on Computational Intelligence and AI in Games 8 (3), pp. 218–228. ISSN 1943-068X.
  • [33] M. J. W. Tak, M. H. M. Winands, and Y. Bjornsson (2012-06) N-grams and the last-good-reply policy applied in general game playing. IEEE Transactions on Computational Intelligence and AI in Games 4 (2), pp. 73–83. ISSN 1943-0698.
  • [34] C. Xiao, J. Mei, and M. Müller (2018) Memory-augmented Monte Carlo tree search. In AAAI, pp. 1455–1462.