Mixed-Initiative Procedural Content Generation using Level Design Patterns and Interactive Evolutionary Optimisation

by Sean P. Walton, et al.

An approach for building mixed-initiative tools for the procedural generation of game levels using interactive evolutionary optimisation is introduced. A tool is created based on this approach which (a) is focused on supporting the designer to explore the design space and (b) only requires the designer to interact with it by designing levels. The tool identifies level design patterns in an initial hand-designed map and uses that information to drive an optimisation algorithm. This results in a number of suggestions which are presented to the designer, who can then edit them, providing the system with valuable designer feedback. The effectiveness of this approach to create levels with similar level design patterns to a target is illustrated through a series of algorithm-driven benchmark tests. To test the mixed-initiative aspect of the tool, a triple-blind, mixed-methods user study was conducted. When compared to a control group, provided with random level suggestions throughout the design process, the mixed-initiative approach increased engagement in the level design task and was effective in inspiring new ideas and design directions. This provides significant evidence that procedural content generation can be used as a powerful tool to support the human design process.




1 Introduction

Game developers are under increasing pressure not only to launch games with hours of unique content, but to continue to add new fresh content post launch [16, 10, 6]. Creating new and diverse content is expensive both in terms of time and money [13]. This provides motivation [13, 17, 3] to develop tools which can support or automate content generation, which is the aim of procedural content generation (PCG) algorithms [22]. PCG algorithms have been developed to create a wide variety of content [13]. In addition to saving developers time, PCG can also benefit the player experience, resulting in an increased diversity of content [13, 17, 19] and creating a source of curiosity and unpredictability [5]. Perhaps the most notable example of this in recent years is Hello Games’ title No Man’s Sky (https://www.nomanssky.com/), a space exploration game in which almost everything is procedurally generated [1]. There are even examples of using PCG as a game mechanic itself, such as in the game Petalz [15] where players breed and share flowers, becoming part of the PCG algorithm itself.

Despite the clear benefits of PCG algorithms, there are still a number of open challenges in the field. For example, the vast majority of PCG algorithms are highly problem specific, often designed for a single genre of game [19] or limited to specific geometries [23]. A frequently-cited limitation, which motivates our work, is the lack of control human designers have when generating content using PCG [17, 23]. PCG algorithms are often non-intuitive, requiring designers to tweak and adjust tuning parameters which are difficult to relate to their goals. This ultimately limits the control designers have over the generation process [22] and builds a knowledge barrier [3].

Despite significant investment into researching new methods for PCG, there is little research on how designers interact with these tools [7]. In an attempt to address this gap, Craveirinha and Roque [7] undertook a participatory design process involving game designers and researchers to design an interface for a PCG algorithm. In doing so they explored the attitudes of game designers toward PCG tools. They make two key observations which will inform our work:

  • The tool needs an understandable metaphor. This finding highlights the problem with the complexity of PCG algorithms.

  • Exploration is needed before optimisation. Many PCG algorithms work by optimising certain metrics which the algorithm designers have identified as being important for player experience. These metrics, and target values for them, are determined a priori. Designers do not operate in this way, but instead explore the design space to determine metrics which can then be used to optimise player experience.

1.1 Our Contribution

In this paper we present a new mixed-initiative approach to PCG using level design patterns and interactive evolutionary optimisation. Our technique is rooted in the approach introduced by Baldwin et al. [3], but placed into the context of the two observations of Craveirinha and Roque [7]. We have re-framed these observations into two design pillars which are at the core of our decision making:

  1. The designer must interact with the algorithm by designing content, rather than adjusting parameters.

  2. The designer will be supported to explore the design space.

Our test application is designing a series of small dungeon maps (as in [3]), or mazes, which would be shipped with the game, rather than tuning an algorithm which generates new maps at run-time. We validate this approach by running a triple-blind user study comparing our algorithm to a tool which gives the user random suggestions in place of evolutionary optimisation. The tool we created and its source code are available on GitHub (https://github.com/seanwalton/mixed-initiative-procedural-dungeon-designer).

2 Background

2.1 Search-Based Procedural Content Generation

There are numerous approaches to PCG [20]. In our work we adopt a search-based approach to PCG as it aligns well with our second design pillar to support exploration. In search-based PCG an algorithm generates a large volume of content and evaluates each item created using a fitness function. There are two key identifying characteristics of a search-based approach: (a) the fitness function allows the comparison and ranking of content, and (b) this ranking is used to inform the generation of new content [20]. Search-based PCG approaches are often implemented using evolutionary algorithms (EAs), optimisation algorithms which aim to minimise a fitness function over several generations. In the context of PCG, an EA will initialise a population of potential designs, rank these according to the quality defined by the fitness function, then create the next generation through stochastic mutation and interbreeding [3]. Search-based approaches have been used to generate a wide range of content including mazes [2, 12], race tracks [11] and dungeon maps [21]. A common aspect of these contributions is that the authors design and specify a fitness function which they argue will result in a good player experience, in some cases validating it with player testing. Our aim is to allow the designer to directly influence the fitness function through design, rather than relying on the fitness function to dictate what is good.

2.2 Mixed-Initiative Approaches to Content Generation

As mentioned in the introduction, one of the key challenges of PCG algorithms is that level designers often do not have knowledge of how to control them. This challenge is directly related to our first design pillar, that designers should interact with our system by designing content. Many researchers [10, 3, 17, 23] have made contributions towards addressing this challenge. The works by Liapis et al. [10] and Baldwin et al. [3] are particularly relevant to our goals and inform our approach.

Liapis et al. [10] introduced the Sentient Sketchbook, a tool for supporting designers creating levels for games. As the designer sketches ideas via the tool’s interface, real-time feedback is given to the designer based on a number of game play relevant metrics. The tool suggests alternative map designs based on the sketch the designer creates. This is achieved through a genetic search algorithm which attempts to maximise the map’s score based on a number of metrics, or a diversity measure. The results of all these searches are presented to the designer. The general feedback from their user study was positive, with users reporting that the tool started pushing them in design directions they did not initially expect.

Baldwin et al. [3] present a mixed-initiative tool for generating dungeon levels using evolutionary algorithms. Their aim was to allow the designer to control the algorithm using parameters with which they are familiar, based on what they term game design patterns, such as mean corridor length or number of enemies. We suggest a slight change in terminology by referring to these as level design patterns hereafter. Game design often refers to the design of mechanics in a game rather than the level geometry, so we feel it is clearer to use the term level design patterns when describing these metrics. Essentially, the level designer specifies targets for the various level design pattern metrics and an evolutionary algorithm attempts to optimise a fitness function based on this. Their results show an impressive degree of control based on these patterns, hence we were motivated to extend this technique further.

3 Methodology

3.1 Specification and Design of the System/Tool

Figure 1: Artwork used to represent map layout and tiles. Assets are distributed by LazerGunStudios without license at https://lazergunstudios.itch.io/roguelike-asset-pack. In (a), a complete map with a path from entrance to exit is shown, while in (b) we show different adjustable components of the map (Floor, Wall, Treasure, Enemy, Entrance and Exit).

A system was designed in the context of two design pillars (described and justified in Section 1.1) to support a level designer in creating a series of 2D maps/levels for a simple dungeon game. An example of a map is shown in Figure 1(a). In this study the dungeon maps are made up of 12 by 12 tiles. Each tile has one of six possible values: (1) Wall: this is impassable by the player. (2) Floor: this is passable by the player. (3) Treasure: this is an item which is desirable for the player to reach. (4) Enemy: this is a non-player character which can damage the player, something the player wishes to avoid. (5) Entrance: this is where the player starts in the level. There is only one entrance per level. (6) Exit: the player’s goal is to reach the exit. There is only one exit per level. There must be a passable path between the entrance and exit for a level to be valid. The graphical representation of these tiles is shown in Figure 1(b).
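The validity rule above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the character codes for tiles, the list-of-strings grid, the 4-connected movement and the choice to treat every non-wall tile as passable are all assumptions made for the example.

```python
from collections import deque

# Hypothetical one-character tile codes chosen for this sketch.
WALL, FLOOR, TREASURE, ENEMY, ENTRANCE, EXIT = "#", ".", "T", "E", "S", "X"


def find(grid, tile):
    """Return the (row, col) positions of every occurrence of `tile`."""
    return [(r, c) for r, row in enumerate(grid)
            for c, t in enumerate(row) if t == tile]


def is_valid(grid):
    """True if the map has one entrance, one exit and a path between them."""
    entrances, exits = find(grid, ENTRANCE), find(grid, EXIT)
    if len(entrances) != 1 or len(exits) != 1:
        return False
    start, goal = entrances[0], exits[0]
    seen, queue = {start}, deque([start])
    while queue:  # breadth-first search over non-wall tiles
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if inside and (nr, nc) not in seen and grid[nr][nc] != WALL:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False
```

A check of this kind is what separates feasible from infeasible maps later in the optimisation.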

When surveying the search-based PCG literature we observed two key points:

  1. Search-based PCG is inherently a multi-objective problem

  2. The majority of researchers tackle this multi-objective problem by combining the results from multiple fitness functions into one scalar value through a weighted sum.

An exception to this is the work by Loiacono et al. [11] who used a multi-objective optimisation algorithm without scalarisation. They found an interesting diversity of solutions along the Pareto fronts, which has the potential to support our second design pillar. Although there are many advanced techniques for multi-objective optimisation and finding the Pareto front [27] we opt for a simple approach for two reasons. Advanced techniques are often less efficient and therefore take longer to produce solutions and, in our practical experience, simpler approaches tend to be more robust and easier to adapt to new applications. Therefore we adopt a scalarising approach based on a multi-criterion ranking. One map ranks higher than another if all of its fitness values are better or the same. This comparison method will be used to determine the outcome of tournaments which are used to select which individuals in the evolutionary optimisation algorithm procreate to create the next generation.

Since we wish our designers to interact with our system through designing levels, we turn to the approach by Liapis et al. [10] as a starting point. In their approach suggestions are presented to the designer by optimising predetermined fitness functions with the designer’s initial design as a starting point. In our approach the level designer will design the first level; the system will then calculate some metrics which describe that level and record those as targets. An evolutionary optimisation algorithm will then randomly initialise a population and try to match the metrics from the user-designed level. Preuss et al. [14] found that restarting their evolutionary algorithm performs as well as advanced approaches to increasing novelty and diversity. Therefore we will restart our algorithm at regular intervals and use this opportunity to allow the level designer to influence the target metrics at run time. This will be achieved by allowing the level designer to edit and select maps produced by the system which are desirable. The system will store the metrics of these liked maps and use them in fitness function evaluations.

3.2 System Overview

user designs first level
store in the list of liked maps and the list of levels
repeat
     run optimisation algorithm
     display a subset of maps from the final generation of the GA
     user may edit maps and tag them as like and/or keep
     for each map m do
         if m is tagged like or keep then
              store m in the list of liked maps
              if user has tagged m to keep then
                  add m to the list of game levels
              end if
         end if
     end for
until the list of game levels is full
Algorithm 1 System Overview
Figure 2: Feedback view of the system. At the top row (five smaller windows), the user can see the designs that they have already chosen or created. In middle and bottom rows, we show the generated levels, and provide options for keeping or liking designs. On the right, the user has the option to request further suggestions.

In Algorithm 1, we show how the final system works. The user is initially asked to design a map from scratch. Once happy with the map, the user clicks the submit button. After the optimisation algorithm has finished running the user is presented with the feedback view shown in Figure 2. The eight maps displayed are a selection from the feasible population of the final generation produced by the optimisation algorithm. These are interactive, allowing the user to edit them. Underneath each map are tick boxes which allow the user to tag them as like and/or keep. Any that are tagged like or keep are then used in subsequent fitness evaluations by the optimisation algorithm, explained in more detail in Section 3.4.1. Maps tagged keep are added to the list of levels at the top of the view.
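The control flow of Algorithm 1 can be sketched as a short Python loop. The function and parameter names here (`design_session`, `optimise`, `review`, `levels_needed`) are hypothetical; `optimise` stands in for the optimisation run and `review` for the designer editing and tagging the suggestions, both supplied by the caller.

```python
def design_session(first_level, optimise, review, levels_needed=5):
    """Run the suggest/edit/tag loop until enough levels have been kept.

    optimise(liked) -> list of suggested maps (stand-in for the GA run)
    review(maps)    -> iterable of (possibly edited map, set of tags) pairs
    """
    liked = [first_level]    # maps whose metrics drive the fitness functions
    levels = [first_level]   # maps that will ship as game levels
    while len(levels) < levels_needed:
        suggestions = optimise(liked)
        for level, tags in review(suggestions):
            if "like" in tags or "keep" in tags:
                liked.append(level)
                if "keep" in tags:
                    levels.append(level)
    return levels[:levels_needed], liked
```

Note that in this sketch the loop only ends once the designer keeps enough maps, mirroring the "until the list of game levels is full" condition.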

3.3 Metrics used to Define the Fitness Functions

The fitness functions based on level design patterns designed by Baldwin et al. [3] show an impressive ability to control the types of maps generated by their search algorithm. Therefore, we have opted to use these functions along with visual impression metrics which Preuss et al. [14] found to be highly effective. In total there were 31 metrics used to characterise a map design. The metrics are split into two broad categories: level design patterns (3.3.1 to 3.3.7) and visual impression metrics (3.3.8 to 3.3.9). We use the notation that $m_i(M)$ is metric $i$ calculated for the map $M$.

3.3.1 Path Length

The path length metric is simply the path length, measured in number of tiles, divided by the total number of tiles in the map.

3.3.2 Global Wall to Passable Tile Ratio

This metric is the ratio of walls to non-wall tiles in the map.
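These two metrics are simple enough to sketch directly. The example below is illustrative only: it assumes the path in question is a shortest 4-connected path from entrance to exit, counts both end tiles, uses hypothetical one-character tile codes, and returns zero for the wall ratio when a map has no passable tiles.

```python
from collections import deque

WALL, ENTRANCE, EXIT = "#", "S", "X"  # hypothetical tile codes


def shortest_path_tiles(grid):
    """Number of tiles on a shortest entrance-to-exit path, or None."""
    cells = {(r, c): t for r, row in enumerate(grid) for c, t in enumerate(row)}
    start = next(p for p, t in cells.items() if t == ENTRANCE)
    dist, queue = {start: 1}, deque([start])
    while queue:
        r, c = queue.popleft()
        if cells[(r, c)] == EXIT:
            return dist[(r, c)]
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in cells and nxt not in dist and cells[nxt] != WALL:
                dist[nxt] = dist[(r, c)] + 1
                queue.append(nxt)
    return None  # infeasible map: no path exists


def path_length_metric(grid):
    """Path length in tiles divided by the total number of tiles."""
    tiles = shortest_path_tiles(grid)
    return None if tiles is None else tiles / (len(grid) * len(grid[0]))


def wall_ratio_metric(grid):
    """Ratio of wall tiles to non-wall tiles (zero if nothing is passable)."""
    walls = sum(row.count(WALL) for row in grid)
    passable = len(grid) * len(grid[0]) - walls
    return walls / passable if passable else 0.0
```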

3.3.3 Corridor Metrics

Corridors are defined as horizontal or vertical series of passable tiles enclosed by impassable tiles on either side [3]. In our implementation corridors of length one are counted as corridors. A simple fill algorithm is used to identify corridors within a map, and each corridor is then stored. The four corridor metrics are the number of corridors followed by the maximum, minimum and mean corridor lengths.
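As a rough illustration of the corridor definition, the sketch below finds corridors with a plain row and column scan rather than the fill algorithm mentioned above, so it should be read as one possible interpretation. Treating the map boundary as impassable, the tile code for walls and the zero values returned when no corridor exists are assumptions.

```python
WALL = "#"  # hypothetical tile code


def blocked(grid, r, c):
    """Walls and out-of-bounds cells count as impassable (the latter assumed)."""
    inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
    return not inside or grid[r][c] == WALL


def in_corridor(grid, r, c, dr, dc):
    """Passable tile whose two neighbours perpendicular to (dr, dc) are blocked."""
    return (not blocked(grid, r, c)
            and blocked(grid, r + dc, c + dr)
            and blocked(grid, r - dc, c - dr))


def find_corridors(grid):
    """Return each corridor as a tuple of (row, col) positions."""
    found = set()
    for dr, dc in ((0, 1), (1, 0)):          # horizontal, then vertical runs
        for r in range(len(grid)):
            for c in range(len(grid[0])):
                if in_corridor(grid, r - dr, c - dc, dr, dc):
                    continue                  # not the first tile of its run
                run = []
                rr, cc = r, c
                while in_corridor(grid, rr, cc, dr, dc):
                    run.append((rr, cc))
                    rr, cc = rr + dr, cc + dc
                if run:
                    found.add(tuple(run))     # the set merges isolated tiles
    return sorted(found)


def corridor_metrics(grid):
    """Number of corridors, then maximum, minimum and mean corridor length."""
    lengths = [len(c) for c in find_corridors(grid)]
    if not lengths:
        return 0, 0, 0, 0.0
    return len(lengths), max(lengths), min(lengths), sum(lengths) / len(lengths)
```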

3.3.4 Chamber Metrics

A chamber is defined as a continuous block of passable tiles which are wider than a corridor. A less rigid definition is followed than the one outlined by Baldwin et al. [3]. In their work these metrics are used to generate dungeons using user inputs such as chamber size, therefore they have to consider what a user might expect a chamber to look like. In our work these metrics are only used to compare the structure of two maps, so as long as the metrics are consistent they will achieve this goal. We do not want to assume a minimum chamber size. Chambers are identified following corridor identification. Once chambers are identified, two qualities are calculated for each chamber: its area and its squareness, both determined from the height and width of the chamber. This then leads to 7 metrics for chambers: the total number of chambers, the maximum, minimum and mean chamber areas, and the maximum, minimum and mean chamber squareness.

3.3.5 Dead Floor Tiles

A dead tile is defined as a passable tile which has not been identified as a chamber or corridor. These often appear as tiles which connect multiple corridors or chambers. The dead tile metric is simply the number of these tiles divided by the total number of tiles.

3.3.6 Entrance Metrics

Two metrics are defined for the number of treasure and enemy tiles around the entrance [3]. The first is based on the minimum area around the entrance tile which does not contain an enemy tile, and the second on the minimum area around the entrance tile which does not contain a treasure tile. Each of these areas is normalised to give the two metrics respectively.

3.3.7 Enemy and Treasure Metrics

Two metrics are simply the fraction of enemy and treasure tiles respectively. In addition, a safety measure, defined in [3], is calculated for each treasure, and two further metrics are the mean and standard deviation of this.

3.3.8 Visual Symmetry of Wall Tiles

Preuss et al. [14] introduced a number of visual symmetry metrics which we have adapted for use here. Two lines of symmetry are defined along the centre of the map horizontally and vertically. The number of a specific type of tile is counted either side of these lines then used to calculate ratios. For example, the number of wall tiles in the top half of the map and the number of wall tiles in the left half of the map are both counted. A total of 8 metrics are defined based on these ratios: left to right and top to bottom wall tile ratios, left to right and top to bottom enemy and treasure ratios, and treasure to enemy ratios. For each of these ratios, if the denominator would be zero the metric is given a value of zero.

3.3.9 Exact Symmetry Metrics

As well as the visual symmetries we also introduce and define 3 metrics which give a measure of exact reflection over the symmetry lines used for the visual symmetry metrics. In addition a measure of rotational symmetry is considered by comparing the map against its transpose. These metrics are calculated by simply counting the number of tiles that exactly match their reflected counterpart across the various symmetry lines (or the tile at the transposed location) and expressing this count as a fraction of the total number of tiles.
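The counting described above can be sketched as follows. This is an illustrative reading rather than the authors' code: it returns one fraction for each of the two symmetry lines and one for the transpose, and the transpose comparison assumes a square map, as is the case for the 12 by 12 maps used here.

```python
def exact_symmetry_metrics(grid):
    """Fractions of tiles that match their mirrored or transposed counterpart.

    Returns (left-right, top-bottom, transpose) fractions.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    total = n_rows * n_cols
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    # Reflection across the vertical centre line (left half vs right half).
    left_right = sum(grid[r][c] == grid[r][n_cols - 1 - c] for r, c in cells)
    # Reflection across the horizontal centre line (top half vs bottom half).
    top_bottom = sum(grid[r][c] == grid[n_rows - 1 - r][c] for r, c in cells)
    # Comparison against the transpose; assumes n_rows == n_cols.
    transpose = sum(grid[r][c] == grid[c][r] for r, c in cells)
    return left_right / total, top_bottom / total, transpose / total
```

A map that is perfectly mirrored about both lines and equal to its transpose would score 1.0 on all three values.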

3.4 Genetic Algorithm

Following the approach of Baldwin et al. [3] we use a feasible-infeasible two-population (FI-2Pop) genetic algorithm (GA) [9] as our evolutionary optimisation algorithm. In short, FI-2Pop works the same as a standard GA but splits the population into two sub-populations, feasible and infeasible. In our application feasible maps have a valid path from the entrance to exit, infeasible maps do not. Individuals are automatically entered into the feasible and infeasible populations once created. In our system the only difference between evaluating fitness in these two populations is that the path length metric is not considered for the infeasible maps. Crossover is limited within each population, so only infeasible maps procreate with infeasible maps, and only feasible with feasible. A tournament selection approach is used to select individuals to reproduce and a number of elite individuals survive from one generation to the next.
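One generation of this scheme might look like the sketch below. It is a simplified illustration under stated assumptions: the function names are hypothetical, the caller supplies `is_feasible`, `better`, `crossover` and `mutate`, each sub-population produces as many children as it has members, and elitism is left out for brevity.

```python
import random


def split_population(maps, is_feasible):
    """Partition maps into (feasible, infeasible) sub-populations."""
    feasible = [m for m in maps if is_feasible(m)]
    infeasible = [m for m in maps if not is_feasible(m)]
    return feasible, infeasible


def tournament(population, better, size, rng):
    """Draw `size` random entrants and return the last one left standing."""
    entrants = rng.sample(population, min(size, len(population)))
    winner = entrants[0]
    for challenger in entrants[1:]:
        if better(challenger, winner):
            winner = challenger
    return winner


def next_generation(maps, is_feasible, better, crossover, mutate,
                    tournament_size=2, rng=random):
    """Breed within each sub-population, then re-split the children."""
    children = []
    for group in split_population(maps, is_feasible):
        for _ in range(len(group)):
            a = tournament(group, better, tournament_size, rng)
            b = tournament(group, better, tournament_size, rng)
            children.append(mutate(crossover(a, b)))
    # Children are sorted into feasible/infeasible as soon as they exist.
    return split_population(children, is_feasible)
```

Because children are re-split after breeding, a child of two infeasible parents can migrate into the feasible population, and vice versa.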

3.4.1 Fitness Functions and Ranking Procedure

The fitnesses of an individual map are defined by comparing each of its metrics with the corresponding metric of the levels a user has liked or stored. This is a form of goal programming approach, where our aim is to generate maps that have metrics similar to those provided by the designer. Thus, a lower fitness is considered desirable.

Map $A$ ranks higher than map $B$, denoted as $A \prec B$, iff $f_i(A) \le f_i(B)$ for all $i$, where $f_i$ is the $i$-th fitness function, and there is at least one fitness function for which $f_i(A) < f_i(B)$ [8]. Thus the Pareto set of mutually non-dominated solutions is defined as: $P = \{A \in X \mid \nexists\, B \in X : B \prec A\}$, where $X$ is the feasible decision space. Typically, it is impossible to exactly locate the Pareto set, so often a good approximation is sufficient. We denote this approximation as $P^*$.
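The dominance comparison is compact enough to state in code. The sketch below assumes each map's fitnesses are available as a tuple of numbers where lower is better; the function names are chosen for the example.

```python
def dominates(fa, fb):
    """True if fitness vector `fa` Pareto-dominates `fb` (lower is better)."""
    return (all(a <= b for a, b in zip(fa, fb))
            and any(a < b for a, b in zip(fa, fb)))


def non_dominated(fitnesses):
    """Indices of the fitness vectors that no other vector dominates."""
    return [i for i, fi in enumerate(fitnesses)
            if not any(dominates(fj, fi)
                       for j, fj in enumerate(fitnesses) if j != i)]
```

Two maps with identical fitness vectors do not dominate one another, so both remain in the non-dominated set.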

Now, it is well known that for a large number of objectives Pareto ranking may become ineffective, as most solutions in the population tend to become part of the estimated non-dominated set [4]. To reduce the number of candidate solutions, we chose to use a subset of the approximated Pareto set: given two maps in this set, if one has the better fitness on more of the fitness functions than the other, then we prefer it over the other. This ranking approach prioritises the solutions that are most common across all fitness functions over other mutually non-dominated solutions.

3.4.2 Crossover

Three stochastic crossover methods were designed and tested. In the notation which follows, two parent maps $A$ and $B$ are crossed to create a child map $C$. Regardless of which method is used the first step is to randomly pick an entrance and exit from $A$ and $B$ such that $C$ has exactly one entrance and one exit.

Random Selection: In this method $C$ is created by selecting each tile from $A$ and $B$ with equal probability.

Edit Distance: This method is a slight modification of Random Selection. Here the edit distance between $A$ and $B$ is calculated, which is simply the number of tiles whose values would need changing to change $A$ to $B$. Then random tiles in $A$, which do not equal those in $B$, are switched to the values in $B$. The resulting map is $C$.

Fixed Point: In this method two random tile positions are selected as corners of a bounding box. The child map is created by taking the tiles within the bounding box from $A$ and the tiles outside the bounding box from $B$.
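The Random Selection and Fixed Point methods can be sketched as below, with maps held as lists of lists of one-character tile codes. Several details are assumptions made for the example: the tile codes, the helper names, applying the entrance and exit choice after the tiles are mixed, turning any stray entrance or exit copied from a parent into floor, and falling back to a single parent when the chosen entrance and exit would land on the same tile.

```python
import random

FLOOR, ENTRANCE, EXIT = ".", "S", "X"  # hypothetical tile codes


def locate(grid, tile):
    """Position of the first occurrence of `tile`."""
    return next((r, c) for r, row in enumerate(grid)
                for c, t in enumerate(row) if t == tile)


def place_endpoints(child, a, b, rng):
    """Give `child` exactly one entrance and one exit taken from the parents."""
    donor = rng.choice((a, b))
    entrance = locate(donor, ENTRANCE)
    exit_ = locate(rng.choice((a, b)), EXIT)
    if exit_ == entrance:            # clash between parents: reuse one donor
        exit_ = locate(donor, EXIT)
    for row in child:
        for c, t in enumerate(row):
            if t in (ENTRANCE, EXIT):
                row[c] = FLOOR       # assumption: stray endpoints become floor
    child[entrance[0]][entrance[1]] = ENTRANCE
    child[exit_[0]][exit_[1]] = EXIT
    return child


def random_selection(a, b, rng=random):
    """Each tile comes from either parent with equal probability."""
    child = [[rng.choice(pair) for pair in zip(ra, rb)] for ra, rb in zip(a, b)]
    return place_endpoints(child, a, b, rng)


def fixed_point(a, b, rng=random):
    """Tiles inside a random bounding box come from `a`, the rest from `b`."""
    rows, cols = len(a), len(a[0])
    r0, r1 = sorted(rng.randrange(rows) for _ in range(2))
    c0, c1 = sorted(rng.randrange(cols) for _ in range(2))
    child = [[a[r][c] if r0 <= r <= r1 and c0 <= c <= c1 else b[r][c]
              for c in range(cols)] for r in range(rows)]
    return place_endpoints(child, a, b, rng)
```

Neither method guarantees that the child has a path from entrance to exit, which is why children are sorted into feasible and infeasible populations after creation.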

3.4.3 Mutation

Three mutation methods were designed to mutate a map $M$. These methods were compared and tested.

Replace: A random tile in $M$, which is neither the entrance nor the exit, is selected. The value of that tile is then changed randomly to a non-entrance, non-exit tile.

Swap: A random tile in $M$ is selected and then swapped with a random adjacent tile to create the mutated map.

Rotate: Two random tile positions are selected as corners of a bounding box. The values of tiles within that box are then transposed to create the mutated map.
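A minimal sketch of the three mutations is given below. It mutates the map in place and makes some assumptions that the text does not specify: the tile codes, orthogonal adjacency for Swap, and shrinking the bounding box in Rotate to its largest top-left square so that a transpose is well defined.

```python
import random

PLACEABLE = ("#", ".", "T", "E")   # wall, floor, treasure, enemy (hypothetical)
ENDPOINTS = ("S", "X")             # entrance, exit (hypothetical)


def replace(grid, rng=random):
    """Change one random non-endpoint tile to a random non-endpoint value."""
    cells = [(r, c) for r, row in enumerate(grid)
             for c, t in enumerate(row) if t not in ENDPOINTS]
    r, c = rng.choice(cells)
    grid[r][c] = rng.choice(PLACEABLE)
    return grid


def swap(grid, rng=random):
    """Swap one random tile with a random orthogonally adjacent tile."""
    rows, cols = len(grid), len(grid[0])
    r, c = rng.randrange(rows), rng.randrange(cols)
    neighbours = [(r + dr, c + dc)
                  for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                  if 0 <= r + dr < rows and 0 <= c + dc < cols]
    nr, nc = rng.choice(neighbours)
    grid[r][c], grid[nr][nc] = grid[nr][nc], grid[r][c]
    return grid


def rotate(grid, rng=random):
    """Transpose the tiles inside a random square box (squareness assumed)."""
    rows, cols = len(grid), len(grid[0])
    r0, r1 = sorted(rng.randrange(rows) for _ in range(2))
    c0, c1 = sorted(rng.randrange(cols) for _ in range(2))
    side = min(r1 - r0, c1 - c0) + 1
    box = [[grid[r0 + i][c0 + j] for j in range(side)] for i in range(side)]
    for i in range(side):
        for j in range(side):
            grid[r0 + i][c0 + j] = box[j][i]
    return grid
```

Swap and Rotate only move tiles around, so they preserve the number of each tile type, whereas Replace can change the balance of walls, floor, treasure and enemies.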

4 Algorithm-Driven Benchmarks

4.1 Methodology

To test the effectiveness of the GA itself a series of studies were performed using an entirely algorithm-driven approach. For benchmark tests we use six maps presented by Baldwin et al. [3] to show the different styles of maps which could be created by varying the configurations of their approach. Conveniently they represent a range of styles, from maps with no corridors to maps with no chambers. We use these maps as benchmarks to avoid unconscious bias which could result from us designing our own. For each test the target map was entered as the initial user-designed level. The GA was then run and the highest ranked map in the population at the end of the optimisation is presented.

4.2 Selecting Tuning Parameters

The performance of GAs is highly problem dependent [25]. It is therefore crucial to carry out a parameter sensitivity study for each new application to maximise performance [24]. The map shown in Figure 5(a) was selected for use in the parameter study since it has a balance of corridors and chambers. For brevity we do not present the detailed results of our studies, but explain our process and present the final parameters used. For each combination of parameters we performed tests with different random seeds for the random number generator. We then compared mean performance to select the final set of parameters. In all tests the number of objective function evaluations was kept constant. This was a decision made based on the time taken for an optimisation run to complete on the machines used in the user study, with the aim of limiting the participation time of the user study. The best performing methods were found to be Mutation Method: Swap and Crossover Method: Random Selection. The mutation rate, tournament size, number of elite individuals, population size and number of generations were fixed at the values selected by the same process.

4.3 Results

Figure 3: Algorithm-Driven Test A. (a) Initial Design; (b) Output; (c) Sum of fitness functions for the highest ranked individual each generation, and mean of population.

The first algorithm-driven test is one which is made up of corridors with zero chambers. The target map is shown in Figure 3(a), and the map created by the GA is shown in Figure 3(b). The created map contains only one chamber, is predominantly made up of corridors and has the same number of treasure and enemy tiles as the target. For this first test we have included an optimisation history graph, Figure 3(c), constructed by taking the sum of fitnesses for the best individual each generation. It is typical of the behaviour observed in all tests.

Figure 4: Algorithm-Driven Test B. (a) Initial Design; (b) Output.

The results from test B are shown in Figure 4. In this test the target map has a single chamber and many corridors. The resulting output is dominated by corridors, although some of them are unreachable by the player. The output design has a similar ratio of passable to impassable tiles. Both maps have a single treasure and enemy tile.

Figure 5: Algorithm-Driven Test C. (a) Initial Design; (b) Output.

Test C is a map with a comparable number of corridors to chambers. The results of this study are shown in Figure 5. The output has a similar balance of corridors and chambers, and a similar distribution of treasures and enemies.

Figure 6: Algorithm-Driven Test D. (a) Initial Design; (b) Output.

The results for test D are shown in Figure 6. This target design is largely made up of chambers with a few corridors. The GA is capable of matching this distribution.

Figure 7: Algorithm-Driven Test E. (a) Initial Design; (b) Output.

In test E the target map is made up of chambers connected by single tile corridors. The results of this test are shown in Figure 7. Much like the target the output is made up of chambers and single tile corridors with the same number of treasure and enemy tiles.

Figure 8: Algorithm-Driven Test F. (a) Initial Design; (b) Output.

The final test, F, is simply a map with zero wall tiles. Figure 8 shows that the GA handles this edge case. Also notice that the path length is almost the same in both.

Our benchmark tests show that the algorithm at the core of our system is able to produce maps with similar qualities to those the designer presents to it.

5 User Study

5.1 Methodology

Yannakakis et al. [26] introduce an assessment methodology for mixed-initiative systems. They recommend evaluating how often the computational creations are used by the designer, and whether or not those creations changed the thinking process of the designer; our user study was designed to evaluate these aspects. Ethical approval for the study was obtained from the Swansea University College of Science ethics committee (reference SU-Ethics-Staff-100220/214). Our original plan was to perform the study in lab conditions, however due to the COVID-19 pandemic we had to change our methodology and carry out the study on-line. Four participants did complete the study in lab conditions prior to the UK lock-down, and every effort was made to ensure parity between the lab and on-line experiments. Participants were recruited through social media and the research team’s professional networks. The only requirements for taking part in the study were that participants had to be aged 18 or over and have access to an internet-connected computer running Windows or Linux. Participants were each given an information sheet which explained that we were investigating approaches people take when designing levels for video games, with the aim to better understand this process to enable us to make level design tools. They were then asked to create 5 levels for a simple dungeon game using a computer-assisted tool. A set of instructions for using the tool and what constituted a valid level were provided along with the tool itself.

Some slight modifications to the tool were made for the user study. Alongside the suggestions from the system, participants were given a blank canvas where they could design a new level from scratch. Before starting the process the tool asked the participant to enter a unique ID. Based on this ID there was a 50% chance the tool used the GA designed in this paper to generate suggested maps. In all other cases maps were randomly generated with no optimisation at all. This was done using a triple-blind approach: neither the participant nor the researchers knew which algorithm had been selected until after the data was analysed. The result is that we have two groups of participants to compare, the GA group and the control group (who were given random suggestions). Once the participant completed the game design task they were asked to upload log files which contained quantitative results and answer a series of free response questions.

5.1.1 Quantitative Measures

Each participant submitted a log file which contained the following quantitative measures:

  • Which participant group they belong to (GA or control)

  • The number of maps the participant marked as like or keep at each iteration.

  • The number of times the participant created a map from scratch using the blank editor.

  • How much a participant tweaked a suggested design if they decided to keep or like it.

5.1.2 Qualitative Questions

Each participant was then asked 4 questions with a free text response. The questions were:

  1. Describe the process you took to design a new level.

  2. Was designing 5 levels challenging, or could you have easily designed many more? Explain your answer.

  3. Did the tool affect the way you designed your levels? Explain your answer.

  4. How would you describe the tool to someone else?

To analyse the responses an inductive coding approach was adopted. Codes were created by reading through all responses, to all questions, independently by each member of the research team. These codes were combined into a final set of codes for each question, which were used for the final coding which was performed by SW. This analysis was all carried out before participant responses were linked to their group, making our study triple-blind.

5.2 Results and Discussion

5.2.1 Materials

A total of 24 participants took part in the study. Of those, 17 (71%) were male, 6 (25%) were female and one (4%) did not disclose their gender. The mean age of participants was 25.2 years (SD = 7.81, range = 18 to 48). Participants were asked two questions relating to how frequently they play video games and their experience with designing levels. The majority (83%) of participants reported that they play games frequently, more than once a month. Around half (54%) of the participants reported designing game levels as an occasional pastime or hobby, and a quarter had never attempted level design prior to the study. The remaining 21% of participants reported that level design was either their primary or secondary job.

5.2.2 Quantitative Results

Five participants failed to correctly upload log files following the user study, resulting in a total of 19 data points, of which 11 (58%) were given level suggestions by the GA and 8 (42%) were in the control group. Welch’s t-test was used to test for statistically significant differences between the means of the two groups; p-values of less than 0.05 were considered statistically significant.

The mean number of iterations taken to create 5 levels was 5.08 (SD = 2.39) for the GA group and 3.88 (SD = 1.53) for the control group. The p-value for this comparison was 0.22, so there was no statistically significant difference between the two groups in the number of iterations.
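As an illustration of the test used here, the following sketch computes Welch’s t statistic, the Welch–Satterthwaite degrees of freedom and a two-sided p-value from summary statistics alone. It assumes group sizes of 11 (GA) and 8 (control), taken from the number of uploaded log files, and uses the rounded means and standard deviations quoted above, so it can only approximate the reported value. The helper names are illustrative, and the p-value is obtained by numerically integrating the Student t density rather than by calling a statistics library.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1
    v2 = sd2 ** 2 / n2  # squared standard error of group 2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1 - 2 * area)


if __name__ == "__main__":
    # Iterations to create 5 levels: GA group vs control group (rounded values).
    t, df = welch_t(5.08, 2.39, 11, 3.88, 1.53, 8)
    print(f"t = {t:.2f}, df = {df:.1f}, p = {two_sided_p(t, df):.2f}")
```

With these rounded inputs the p-value comes out at about 0.20, close to but not identical to the reported 0.22, which was presumably computed from the raw per-participant data. Either way the difference is well above the 0.05 threshold.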

Figure 9: Comparing the number of edits of liked and kept levels between the GA and control group. The difference in the means of these distributions is statistically significant (p-value < 0.01).

The mean number of edits of liked maps was 13.00 (SD = 14.59) and 2.82 (SD = 5.34) for the GA and control groups respectively. For kept maps the mean number of edits was 14.81 (SD = 14.89) for the GA group and 4.10 (SD = 6.02) for the control group. In both cases the p-value was less than 0.01, so the difference in the means is statistically significant. The full distributions are shown in Figure 9.

The mean number of likes per iteration was 1.31 (SD = 1.24) for the GA group and 1.84 (SD = 1.15) for the control group. The p-value for this comparison was 0.71, so the difference was not statistically significant. In the group of participants who were presented suggestions by the GA there were 2 (3.5%) cases where a user used the blank canvas to create a new design from scratch, and 1 (3.2%) case in the control group. We cannot conclude a statistically significant difference from this data.

5.2.3 Qualitative Results

A total of 24 participants answered the qualitative questions after they completed the task. Even though 5 participants failed to upload log files, we were able to determine which group they belonged to using their participant number. A total of 14 (58.3%) participants were given suggestions from the GA and 10 (41.7%) were in the control group.

| Code | Total | Control | GA |
|---|---|---|---|
| Thoughts relating to level design approach | | | |
| Considered player experience/game mechanics | 20 | 8 | 12 |
| Creating Risk-Reward Trade-off/Balance | 11 | 3 | 8 |
| Encourage/reward exploration | 11 | 3 | 8 |
| Focused on the path from entrance to exit | 10 | 4 | 6 |
| Creating interesting decisions for the player | 10 | 4 | 6 |
| Incremental complexity/difficulty | 6 | 2 | 4 |
| Considered visual aesthetics | 5 | 4 | 1 |
| Aimed to create diversity | 3 | 0 | 3 |
| Used prior experience | 2 | 1 | 1 |
| Unstructured approach | 2 | 1 | 1 |
| Thoughts relating to the system/tool | | | |
| Tweaked/edited suggestions from the system | 4 | 1 | 3 |
| Not satisfied by the suggested levels | 1 | 0 | 1 |
| Used suggestions from the system | 1 | 1 | 0 |

Table 1: Describe the Process you Took to Design a New Level

Table 1 shows all of the codes and frequencies for the first question; a single response can have multiple codes assigned to it. Two broad topics of discussion were identified. The first was thoughts relating to the approach participants took to level design. Many (N=20, 83.3%) participants considered the player experience when designing levels, for example: “The first level ensured an easy layout where everything is encountered and choice is allowed…” Participants also used the answer to this first question to discuss how they used the system. The most frequent (N=4, 16.7%) subject was that they tweaked and edited suggestions from the system as part of their design process: “I kept generating layouts until I found one I liked, then tweaked it with a few things in mind.” Of the 4 participants who discussed tweaking the suggestions from the system, 3 (75%) were given suggestions from the GA and only 1 (25%) was in the control group, which indicates that the suggestions had more influence on the experience of participants using the GA.

| Code | Total | Control | GA |
|---|---|---|---|
| Comments related to challenge | | | |
| It was challenging to design multiple levels | 10 | 4 | 6 |
| It was easy to produce lots of maps | 10 | 3 | 7 |
| The designs I created ended up similar | 3 | 1 | 2 |
| Comments related to tool/system | | | |
| The tool was useful/helped | 5 | 3 | 2 |
| The levels generated by the system changed my approach | 2 | 1 | 1 |
| The tool made it difficult | 1 | 0 | 1 |
| Comments related to the task | | | |
| The limited design space/options made it challenging | 5 | 2 | 3 |
| It was enjoyable/fun/interesting | 4 | 1 | 3 |
| The rules of the game were not well defined, so it was difficult | 3 | 2 | 1 |
| Took longer than expected | 2 | 0 | 2 |

Table 2: Was Designing 5 Levels Challenging?

Participant responses to the second question were broadly split into three categories, shown along with the codes and frequencies in Table 2. The same number of participants described the task as challenging (N=10, 41.7%) as described it as not challenging (N=10, 41.7%), with no apparent relationship between these answers and which group a participant was in. When explaining their answers there were two categories of responses, one relating to the tool itself and one to the level design task. For example, some (N=5, 20.8%) described the tool as helpful: “Developing levels from the pre-generated ones was a lot easier than from a blank canvas…” These positive responses were close to evenly split between the two groups, suggesting that any sort of map suggestion can support the design process. Many participants (N=14, 58.3%) related the challenge to the level design task itself. Five (20.8%) participants commented that the limited design space and options made the task challenging. Four (16.7%) participants, three of whom (75%) were given suggestions by the GA, described the task as fun or enjoyable: “I could have done many more levels because its fun to do, and the tool is handy.” This suggests that participants who were given suggestions by the GA felt more engaged in the process.

| Code | Total | Control | GA |
|---|---|---|---|
| Description of the effectiveness | | | |
| It did affect my approach | 5 | 1 | 4 |
| It moderately affected my approach | 4 | 1 | 3 |
| It did not affect my approach | 3 | 2 | 1 |
| Discussion of the suggestions presented by tool/system | | | |
| I tweaked suggestions from the system | 6 | 2 | 4 |
| The suggestions changed my approach | 6 | 2 | 4 |
| It is good for generating starting points | 6 | 2 | 4 |
| The suggestions seemed random | 3 | 1 | 2 |
| I kept generating maps until something good appeared | 3 | 2 | 1 |
| Suggestions not varied enough | 2 | 0 | 2 |
| No suggestions were useful/helpful | 2 | 1 | 1 |
| I had to significantly modify the suggestions | 2 | 0 | 2 |
| Suggestions rarely got the treasure/enemy layout right | 2 | 0 | 2 |
| I tried to influence the suggestions | 1 | 0 | 1 |
| Some of the generated maps were unsuitable | 1 | 0 | 1 |

Table 3: Did the Tool Affect the way you Designed your Levels?

Only half of participants directly stated whether the tool affected their approach when answering question 3. These participants were disproportionately made up of those who were given suggestions by the GA. Of those, 7 participants from the GA group stated that the tool affected their approach, compared to 2 from the control group. The exact breakdown is presented in Table 3. When explaining their answer, the three most common responses (N=18, 75%) were that participants tweaked suggestions from the system, that the suggestions changed a participant’s approach or thinking, and that the system created good starting points. For example, “…the next three were mostly local modifications of levels that were generated, the fourth was a more extensive reconceptualization of something I noticed the tool was doing…” Of the 18 participants who gave responses such as this, 12 were from the group given suggestions by the GA and 6 were from the control group. Two participants who were given suggestions from the GA noted that the suggestions were not varied enough, two that they had to significantly modify the suggestions and one that some of the generated maps were unsuitable. Participant responses would sometimes include both positive and negative comments about the suggestions given by the system. In total, over twice as many comments relating to the suggestions from the system came from participants who were given suggestions by the GA (24 compared to 10). This further supports the conclusion that the GA group were more engaged with the process.
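The totals quoted above can be checked directly against Table 3. The short sketch below tallies the rows under “Discussion of the suggestions presented by tool/system”; the counts are copied from the table, while the variable and function names are illustrative. Note that these are counts of code assignments, and a single response can carry more than one code.

```python
# (code, control count, GA count) for the second block of Table 3.
SUGGESTION_CODES = [
    ("I tweaked suggestions from the system", 2, 4),
    ("The suggestions changed my approach", 2, 4),
    ("It is good for generating starting points", 2, 4),
    ("The suggestions seemed random", 1, 2),
    ("I kept generating maps until something good appeared", 2, 1),
    ("Suggestions not varied enough", 0, 2),
    ("No suggestions were useful/helpful", 1, 1),
    ("I had to significantly modify the suggestions", 0, 2),
    ("Suggestions rarely got the treasure/enemy layout right", 0, 2),
    ("I tried to influence the suggestions", 0, 1),
    ("Some of the generated maps were unsuitable", 0, 1),
]


def tally(rows):
    """Sum the control and GA columns over the given table rows."""
    control = sum(c for _, c, _ in rows)
    ga = sum(g for _, _, g in rows)
    return control, ga


control_total, ga_total = tally(SUGGESTION_CODES)
top3_control, top3_ga = tally(SUGGESTION_CODES[:3])

print(control_total, ga_total)                        # 10 24
print(top3_control, top3_ga, top3_control + top3_ga)  # 6 12 18
```

The first line reproduces the 24 compared to 10 split of comments about the suggestions, and the second reproduces the 18 responses (12 GA, 6 control) covered by the three most common codes.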

| Code | Total | Control | GA |
|---|---|---|---|
| It is a tool which works with the designer | | | |
| It learns from your seed designs | 7 | 2 | 5 |
| It generates starting points - you’ll need to edit them | 5 | 1 | 4 |
| It suggests different levels to you | 5 | 1 | 4 |
| It helps inspire new ideas | 2 | 1 | 1 |
| It is a rapid prototyping tool | 2 | 1 | 1 |
| It is an interactive tool for PCG | 1 | 0 | 1 |
| It is a tool which works independently from the designer | | | |
| It randomly generates levels | 4 | 2 | 2 |
| No inclusion of human approach to games | 1 | 0 | 1 |
| Description of UI/UX | | | |
| Functional description of UI | 5 | 3 | 2 |
| It is fun/enjoyable | 2 | 1 | 1 |
| The tool can be tedious | 1 | 0 | 1 |

Table 4: How Would you Describe the Tool to Someone Else?

In the final question participants were asked to describe the tool. Table 4 shows the results from coding the answers to this question. There were two identifiable groups of description based on how a level designer interacts with the tool: participants described it as a tool which works either with the designer or independently of the designer. Overall there were more comments which described the tool as working with the designer than independently, 22 compared to 5. Of the comments which described the tool as working with the designer, 16 came from participants who were given suggestions by the GA and 6 from the control group. The most frequent (N=7, 29.2%) comment was that the tool learns from your seed design, such as “…it takes the level I originally made, and adapts it, creating 7 new maps as suggestions…” A number of participants (N=4, 16.7%) described the tool as generating random levels, evenly split between both groups. We opted to put these responses in the category describing a tool which works independently from the designer. In some cases participants may have used the term random in place of procedural. For example, “It will randomly generate a level for you, which can help inspire new thoughts…” could be interpreted as implying that the tool does work with the designer. However, whenever a comment could be interpreted in multiple ways, we always interpreted it in the way which does not provide evidence that our design goals were met.

6 Discussion

At the start of this paper we detailed two design pillars we aimed to satisfy with our system. The first, that the designer should interact with the algorithm by designing content, was shown to be achieved in the algorithm-driven benchmarks. In that study the results show that the GA can reproduce maps with similar qualities to a supplied map design without needing to adjust or specify parameters.

To evaluate the second pillar, that designers will be supported to explore the design space, we conducted the user study. A common thread throughout the qualitative data was that participants who were given suggestions by the GA talked a lot more about the suggestions the system gave them. They described the tool as learning from the designs they created and as a tool which works with the designer to support prototyping, which shows that they clearly understood the metaphor of our system. The participants from the control group focused more on the functional description of the UI and generally provided less detailed responses to questions. This general lack of engagement from participants in the control group is further supported by the quantitative data, which showed that the GA group edited suggestions from the system more than the control group. Initially we thought this was an indication that the GA was doing a bad job, but when taken in the context of the qualitative data we found that these participants were considering their designs much more; as one participant stated, the suggestions sparked new ideas. Although we would need to collect more data to determine if this is a significant difference, there was a general trend that the GA group spent longer on the task (in terms of iterations to design 5 levels). We were surprised to find that as many participants in the control group described the suggestions as being useful as in the GA group, suggesting that any suggestions are helpful to the creative process. It is possible that this is due to the simplicity of the levels; in an application with a larger design space, random suggestions may be less useful. Overall our data shows that the system we designed does support designers through the design process and is more effective than random suggestions.

6.1 Evaluation of Our Scientific Approach

In hindsight it would have been appropriate to include some Likert scale questions as part of our user survey. In particular, with question 3 we found that not all participants clearly stated whether the tool affected their approach, which would have been captured by a scale response. Performing the study online introduced problems, such as not all participants correctly submitting log files and possible minor differences in experience based on the hardware participants were running. As with all studies it would have been good to have had more data. There were a few trends in our quantitative data which were not statistically significant, and this may have been due to the small sample size.

6.2 Future Work

One limitation of our approach was that a number of participants in the GA group noted that the suggestions they were given were all too similar to each other. Our aim was to ensure diversity in design by restarting the GA regularly. Although this creates diversity over time, what we failed to realise is that the suggestions a designer sees at any one time all come from a single GA run. This could have limited the system’s ability to support exploration of the design space. In the future it would be interesting to add a mechanism which explicitly ensures diversity in the suggestions (such as in [14, 18, 13]). It would also be interesting to try this approach with different applications, and even simply larger maps.


Acknowledgements

The authors would like to thank Stephen Lindsay for his excellent advice for designing the user study.


References

  • [1] H. Alexandra. A look at how no man’s sky’s procedural generation works. Kotaku, Oct. 2016. Accessed: 2020-1-22.
  • [2] D. Ashlock, C. Lee, and C. McGuinness. Search-Based procedural generation of Maze-Like levels. IEEE Trans. Comput. Intell. AI Games, 3(3):260–273, Sept. 2011.
  • [3] A. Baldwin, S. Dahlskog, J. M. Font, and J. Holmberg. Mixed-initiative procedural generation of dungeons using game design patterns. In 2017 IEEE Conference on Computational Intelligence and Games (CIG), pages 25–32, Aug. 2017.
  • [4] C. A. C. Coello, G. B. Lamont, D. A. Van Veldhuizen, et al. Evolutionary algorithms for solving multi-objective problems, volume 5. Springer, 2007.
  • [5] M. Cook. Make something that makes something: A report on the first procedural generation jam. In H. Toivonen, S. Colton, M. Cook, and D. Ventura, editors, Proceedings of the Sixth International Conference on Computational Creativity (ICCC 2015), pages 197–203, Park City, Utah, June 2015. Brigham Young University.
  • [6] M. Cook and S. Colton. Multi-faceted evolution of simple arcade games. In 2011 IEEE Conference on Computational Intelligence and Games (CIG’11), pages 289–296, Aug. 2011.
  • [7] R. Craveirinha and L. Roque. Studying an Author-Oriented approach to procedural content generation through participatory design. In Entertainment Computing - ICEC 2015, pages 383–390. Springer International Publishing, 2015.
  • [8] K. Deb. Multi-objective optimization using evolutionary algorithms, volume 16. John Wiley & Sons, 2001.
  • [9] S. O. Kimbrough, G. J. Koehler, M. Lu, and D. H. Wood. On a Feasible–Infeasible Two-Population (FI-2Pop) genetic algorithm for constrained optimization: Distance tracing and no free lunch. Eur. J. Oper. Res., 190(2):310–327, Oct. 2008.
  • [10] A. Liapis, G. N. Yannakakis, and J. Togelius. Sentient sketchbook: Computer-aided game level authoring. In FDG, pages 213–220, 2013.
  • [11] D. Loiacono, L. Cardamone, and P. L. Lanzi. Automatic track generation for High-End racing games using evolutionary computation. IEEE Trans. Comput. Intell. AI Games, 3(3):245–259, Sept. 2011.
  • [12] C. McGuinness and D. Ashlock. Decomposing the level generation problem with tiles. In 2011 IEEE Congress of Evolutionary Computation (CEC), pages 849–856, June 2011.
  • [13] A. S. Melotti and C. H. V. de Moraes. Evolving roguelike dungeons with deluged novelty search local competition. IEEE Trans. Comput. Intell. AI Games, 11(2):173–182, June 2019.
  • [14] M. Preuss, A. Liapis, and J. Togelius. Searching for good and diverse game levels. In 2014 IEEE Conference on Computational Intelligence and Games, pages 1–8, Aug. 2014.
  • [15] S. Risi, J. Lehman, D. B. D’Ambrosio, R. Hall, and K. O. Stanley. Petalz: Search-Based procedural content generation for the casual gamer. IEEE Trans. Comput. Intell. AI Games, 8(3):244–255, Sept. 2016.
  • [16] J. Roberts and K. Chen. Learning-Based procedural content generation. IEEE Trans. Comput. Intell. AI Games, 7(1):88–101, Mar. 2015.
  • [17] A. S. Ruela and K. Valdivia Delgado. Scale-Free evolutionary level generation. In 2018 IEEE Conference on Computational Intelligence and Games (CIG), pages 1–8, Aug. 2018.
  • [18] P. Sampaio, A. Baffa, B. Feijó, and M. Lana. A fast approach for automatic generation of populated maps with seed and difficulty control. In 2017 16th Brazilian Symposium on Computer Games and Digital Entertainment (SBGames), pages 10–18, Nov. 2017.
  • [19] S. Snodgrass and S. Ontañón. Learning to generate video game maps using markov models. IEEE Trans. Comput. Intell. AI Games, 9(4):410–422, Dec. 2017.
  • [20] J. Togelius, G. N. Yannakakis, K. O. Stanley, and C. Browne. Search-Based procedural content generation: A taxonomy and survey. IEEE Trans. Comput. Intell. AI Games, 3(3):172–186, Sept. 2011.
  • [21] V. Valtchanov and J. A. Brown. Evolving dungeon crawler levels with relative placement. In Proceedings of the Fifth International C* Conference on Computer Science and Software Engineering, C3S2E ’12, pages 27–35, New York, NY, USA, 2012. ACM.
  • [22] R. van der Linden, R. Lopes, and R. Bidarra. Procedural generation of dungeons. IEEE Trans. Comput. Intell. AI Games, 6(1):78–89, Mar. 2014.
  • [23] B. von Rymon Lipinski, S. Seibt, J. Roth, and D. Abé. Level graph – incremental procedural generation of indoor levels using minimum spanning trees. In 2019 IEEE Conference on Games (CoG), pages 1–7, Aug. 2019.
  • [24] S. P. Walton and M. R. Brown. Predicting effective control parameters for differential evolution using cluster analysis of objective function features. Journal of Heuristics, 25(6):1015–1031, Dec. 2019.
  • [25] D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. IEEE Trans. Evol. Comput., 1(1):67–82, Apr. 1997.
  • [26] G. N. Yannakakis, A. Liapis, and C. Alexopoulos. Mixed-initiative co-creativity. In 9th International Conference on the Foundations of Digital Games. Foundations of Digital Games, 2014.
  • [27] A. Zhou, B.-Y. Qu, H. Li, S.-Z. Zhao, P. N. Suganthan, and Q. Zhang. Multiobjective evolutionary algorithms: A survey of the state of the art. Swarm and Evolutionary Computation, 1(1):32–49, Mar. 2011.