Experience-Driven PCG via Reinforcement Learning: A Super Mario Bros Study

by   Tianye Shu, et al.
University of Malta

We introduce a procedural content generation (PCG) framework at the intersections of experience-driven PCG and PCG via reinforcement learning, named ED(PCG)RL, EDRL in short. EDRL is able to teach RL designers to generate endless playable levels in an online manner while respecting particular experiences for the player as designed in the form of reward functions. The framework is tested initially in the Super Mario Bros game. In particular, the RL designers of Super Mario Bros generate and concatenate level segments while considering the diversity among the segments. The correctness of the generation is ensured by a neural net-assisted evolutionary level repairer and the playability of the whole level is determined through AI-based testing. Our agents in this EDRL implementation learn to maximise a quantification of Koster's principle of fun by moderating the degree of diversity across level segments. Moreover, we test their ability to design fun levels that are diverse over time and playable. Our proposed framework is capable of generating endless, playable Super Mario Bros levels with varying degrees of fun, deviation from earlier segments, and playability. EDRL can be generalised to any game that is built as a segment-based sequential process and features a built-in compressed representation of its game content.





I Introduction

Procedural content generation (PCG) [31, 46] is the algorithmic process that enables the (semi-)autonomous design of games to satisfy the needs of designers or players. As games become more complex and less linear, and uses of PCG tools become more diverse, the need for generators that are reliable, expressive, and trustworthy is increasing. Largely speaking, game content generators can produce outcomes either in an offline or in an online manner [46]. Compared to offline PCG, online PCG is flexible, dynamic and interactive but it comes with several drawbacks: it needs to be able to generate meaningful content rapidly without causing any catastrophic failure to the existing game content. Because of the many challenges that arise when PCG systems operate online (i.e. during play), only limited studies have focused on that mode of generation [33, 14, 37, 35].

In this paper we introduce a framework for online PCG at the intersections of the experience-driven PCG (EDPCG) [48] and the PCG via reinforcement learning (PCGRL) [17] frameworks. The ED(PCG)RL framework, EDRL for short, enables the generation of personalised content via the RL paradigm. EDRL builds upon and extends PCGRL as it makes it possible to generate endless levels of arcade-like games beyond the General Video Game AI (GVGAI) framework [26, 27] in an online fashion. It also extends the EDPCG framework as it enables RL agents to create personalised content that is driven by experience-based reward functions. EDRL benefits from experience-driven reward functions and RL agents that are able to design in real-time based on these functions. The result is a versatile generator that can yield game content in an online fashion respecting certain functional and aesthetic properties as determined by the selected reward functions.

We test EDRL initially in Super Mario Bros (SMB) (Nintendo, 1985) [25] through the generative designs of RL agents that learn to optimise certain reward functions relevant to level design. In particular, we are inspired by Koster’s theory of fun [18] and train our RL agents to moderate the level of diversity across SMB level segments. Moreover, we test the notion of historical deviation by considering earlier segment creations when diversifying the current segment. Finally, we repair defects in levels (e.g., broken pipes) via a neural net-assisted evolutionary repairer [36], and then check the playability of levels through agent-based testing. Importantly, EDRL is able to operate online in SMB as it represents the state and action via a latent vector. The key findings of the paper suggest that EDRL is possible in games like SMB; the RL agents are able to generate playable levels online, with varying degrees of fun, that deviate over time.

Beyond introducing the EDRL framework, we highlight a number of ways this paper contributes to the current state of the art. First, to the best of our knowledge, this is the first functional implementation of PCGRL in SMB, a platformer game with potentially infinite level length, that is arguably more complex than the GVGAI games studied in [17]. Second, compared to tile-scale design in [17], the proposed approach is (required to be) faster as level segments are generated online through the latent vectors of pre-trained generators. Finally, we quantify Koster’s fun [18] as a function that maintains moderate levels of Kullback-Leibler divergence (KL-divergence) within a level, teaching our RL agent to generate levels with such properties.

II Background

A substantial body of literature has defined the area of PCG in recent years [43, 41, 31, 48, 46, 39, 7, 28, 23]. In this section, we focus on related studies in PCG via reinforcement learning and approaches for online level generation.

II-A PCG via Reinforcement Learning

Togelius et al. [41] have proposed three fundamental goals for PCG: “multi-level multi-content PCG, PCG-based game design and generating complete games”. To achieve these goals, various types of PCG methods and frameworks have been researched and applied in games since then [31, 43, 48, 46]. The development of machine learning (ML) has revolutionised PCG [46]. The combination of ML and PCG (PCGML) [39] shows great potential compared with classical PCG methods; in particular, deep learning methods have been playing an increasingly important role in PCG in recent years [23]. Furthermore, PCG methods can be used to increase the generality in ML [28]. PCGML, however, is limited by the lack of training data, which is often the case in games [39]. Khalifa et al. [17] proposed PCG via reinforcement learning (PCGRL), which frames level generation as a game and uses an RL agent to solve it. A core advantage of PCGRL compared with existing PCGML frameworks [39] is that no training data is required. More recently, adversarial reinforcement learning has been applied to PCG [9]; specifically, a PCGRL agent for generating different environments is co-evolved with a problem-solving RL agent that acts in the generated environments [9]. Engelsvoll et al. [8] applied a Deep Q-Network (DQN) agent to play SMB levels generated by a DQN-based level designer, which takes as input the latest columns of a played level and outputs new level columns in a tile-by-tile manner. Although the work of [8] was the first attempt at implementing PCGRL in SMB, the emphasis in that work was not on online, experience-driven generation.

II-B Online Level Generation

Content generation that occurs in real-time (i.e. online) requires rapid generation times. Towards that aim, Greuter et al. [12] managed to generate “pseudo infinite” virtual cities in real-time via simple constructive methods. Johnson et al. [15] used cellular automata to generate infinite cave levels in real-time. In [14], the model named polymorph was proposed for dynamic difficulty adjustment during the generation of levels. Stammer et al. [37] generated personalised and difficulty adjusted levels in the 2D platformer Spelunky (Mossmouth, LLC, 2008) while Shaker et al. formed the basis of the experience-driven PCG framework [48] by generating personalised platformer levels online [33]. Shi and Chen [34] combined rule-based and learning-based methods to generate online level segments of high quality, called constructive primitives (CPs). In the work of [35], a dynamic difficulty adjustment algorithm with Thompson Sampling [40] was proposed to combine these CPs.

III EDRL: Learn to Design Experiences via RL

Fig. 1: General overview of the EDRL framework interweaving elements of both the EDPCG and the PCGRL frameworks. White rounded boxes and blue boxes depict components of EDPCG [48] and PCGRL [17], respectively. Content representation (i.e., environment in RL terms) is a common component across the two frameworks and is depicted with a dashed line.

Our aim is to introduce a framework that is able to generate endless, novel yet consistent and playable levels, ideally for any type of game. Given an appropriate level-segment generator, our level generator selects iteratively the suitable segment to be concatenated successively to the current segment to build a level. This process resembles a jigsaw puzzle game, of which the goal is to pick suitable puzzle pieces (i.e., level segment) and put them together to match a certain pattern or “style”.

Playing this jigsaw puzzle can be modelled as a Markov Decision Process (MDP) [3]. In this MDP, a state $s$ models the current puzzle piece (level segment) and an action $a$ is the next puzzle piece to be placed. The reward $r(s, a)$ evaluates how these two segments fit. The next state $s'$ after selecting $a$ is set as $a$ itself assuming deterministic actions, i.e., $s' = a$. An optimal action $a^*$ at state $s$ is defined as the segment that maximises the reward if being placed after segment $s$, i.e., $a^* = \arg\max_{a \in A} r(s, a)$. $A$ denotes the action space, i.e., the set of all possible segments. An RL agent is trained to learn the optimal policy $\pi^*$ that selects the optimal segments, thus $\pi^*(s) = a^*$. Endless-level generation can be achieved if this jigsaw puzzle game is played infinitely.
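The jigsaw formulation can be illustrated with a minimal sketch. It assumes a small, finite set of candidate segments and a toy reward; the names `step`, `optimal_action` and `toy_reward`, as well as the tuple encoding of a segment, are hypothetical and serve only to show the deterministic transition and the greedy choice of the next segment.

```python
from typing import Callable, Sequence, Tuple

Segment = Tuple[int, ...]  # hypothetical stand-in for a level segment
RewardFn = Callable[[Segment, Segment], float]


def step(state: Segment, action: Segment, reward_fn: RewardFn) -> Tuple[Segment, float]:
    """Deterministic transition: the chosen segment becomes the next state."""
    return action, reward_fn(state, action)


def optimal_action(state: Segment, actions: Sequence[Segment], reward_fn: RewardFn) -> Segment:
    """Greedy choice: the segment that maximises the reward when placed after `state`."""
    return max(actions, key=lambda a: reward_fn(state, a))


def toy_reward(state: Segment, action: Segment) -> float:
    """Toy reward (assumption): prefer a next segment differing in exactly one tile."""
    differences = sum(x != y for x, y in zip(state, action))
    return -abs(differences - 1)


# Usage: pick the best continuation of the current segment among three candidates.
candidates = [(0, 0, 0), (0, 0, 1), (1, 1, 1)]
current = (0, 0, 0)
best = optimal_action(current, candidates, toy_reward)
next_state, reward = step(current, best, toy_reward)
```

In the framework itself the action space is continuous (a latent space), so the exhaustive maximisation used here is replaced by a learned policy.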

One could argue that the jigsaw puzzle described above does not satisfy the Markov property as the level consistency and diversity should be measured based on the current and the concatenated segment. When a human plays such a game, however, she can only perceive the current game screen. In fast-paced reactive games, the player’s short term memory is highly active during reactive play and, thus episodic memory of the game’s surroundings is limited to a local area around play [44]. Therefore, we can assume that the Markov property is satisfied to a good degree within a suitable length of such level segments.

The general overview of the EDRL framework is shown in Fig. 1. The framework builds on EDPCG [48] and PCGRL [17] and extends them both by enabling experience-driven PCG via the RL paradigm. According to EDRL, an RL agent learns to design content with certain player experience and aesthetic aspects (experience model) by interacting with the RL environment, which is defined through a content representation. The content quality component guarantees that the experience model will consider content of certain quality (e.g., tested via gameplay simulations). The RL designer takes an action that corresponds to a generative act that alters the state of the represented content, and receives a reward through the experience model function. The agent iteratively traverses the design (state-action representation) and experience (reward) space to find a design policy that optimises the experience model.

The framework is directly applicable to any game featuring levels that can be segmented and represented rapidly through a compressed representation such as a latent vector. This includes Atari-like 2D games [2, 26] but can also include more complex 3D games (e.g., VizDoom [16]) if latent vectors are available and can synthesise game content.

III-A Novelty of EDRL

The presented EDRL framework sits at the intersection of experience-driven PCG and PCGRL being able to create personalised (i.e., experience-tailored) levels via RL agents in a real-time manner. EDRL builds on the core principles of PCGRL [17] but it extends it in a number of ways. First, vanilla PCGRL focuses on training an RL agent to design levels from scratch. Our framework, instead, teaches the agent to learn to select suitable level segments based on content and game-play features. Another key difference is the action space of RL agents. Instead of tile-scale design, our designer agent selects actions in the latent space of the generator and uses the output segment to design the level in an online manner. The length of the level is not predefined in our framework. Therefore, game levels can in principle be generated and played endlessly.

For designing levels in real-time, we consider the diversity of new segments compared to the ones created already, which yields reward functions for fun and deviation over time; earlier work (e.g., [35, 14]) largely focused on objective measures for dynamic difficulty adjustment. Additionally, we use a repairer that corrects level segments without human intervention, and game-playing agents that ensure their playability. It is a hard challenge to determine which level segment will contribute best to the level generation process—i.e. a type of credit assignment problem for level generation. To tackle this challenge, Shi and Chen [35] formulated dynamic difficulty adjustment as an MDP [3] with binary reward and a Thompson sampling method. We, instead, frame online level generation as an RL game bounded by any reward function which is not limited to difficulty, but rather to player experience.

IV EDRL for Mario: MarioPuzzle

In this section we introduce an implementation of EDRL for SMB level generation and the specific reward functions designed and used in our experiments (Section IV-D). EDRL in this implementation enables online and endless generation of content under functional (i.e., playability) and aesthetic (e.g., fun) metrics. As illustrated in Fig. 2, the implementation for SMB, namely MarioPuzzle (available on GitHub: https://github.com/SliverySky/mariopuzzle), features three main components: (i) a generator and repairer of non-defective segments, (ii) an artificial SMB player that tests the playability of the segments and (iii) an RL agent that plays this MarioPuzzle endless-platform generation game. We describe each component in dedicated sections below.

Fig. 2: Implementing EDRL on Super Mario Bros: the MarioPuzzle framework.
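Algorithm 1 Training procedure of MarioPuzzle.
Require: trained GAN (segment generator)
Require: trained repairer
Require: agent (artificial player)
Require: RL agent
Require: reward function
Require: maximum number of training epochs
Require: maximum segment number of one game
while the maximum number of training epochs is not reached do
    repeat
        Sample a latent vector uniformly from the latent space
        Generate a segment from the latent vector
        Repair the segment if applicable
        Play this segment
        if the segment is not playable then
            Discard the segment
        end if
    until a playable segment is obtained
    Add the segment to the list of previous segments
    while the latest segment is playable and the maximum segment number is not reached do
        Select an action with the RL agent
        Generate a segment from the action
        Repair the segment if applicable
        Play this segment
        Compute the reward of the new segment with the previous segments // According to metrics
        Update the RL agent with the reward
        Update the state with the selected action
    end while
end while
return the trained RL agent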

IV-A Generate and Repair via the Latent Vector

As presented earlier in Section III, both the actions and states in this MDP represent different level segments. Our framework naturally requires a level segment generator to operate. For that purpose we use and combine the MarioGAN [45] generator and the CNet-assisted Evolutionary Repairer [36] to, respectively, generate and repair the generated level segments. The CNet-assisted Evolutionary Repairer has been shown to be capable of determining wrong tiles in segments generated by MarioGAN [45] and repairing them [36]; hence, the correctness of generated segments is guaranteed. In particular, we train a GAN (https://github.com/schrum2/GameGAN) on fifteen SMB levels of three types (overworld, underground and athletic) in VGLC [38]. The CNet-assisted Evolutionary Repairer trained by [36] is used directly and unmodified (https://github.com/SUSTechGameAI/MarioLevelRepairer).

It is important to note that we are not using a direct tile-based representation as in [17] or one-hot encoded tiles as in [8]; instead we use the latent vector of MarioGAN to represent the actions and states of the RL agent. Thus, the selected action or state is sampled from the latent space, rather than the game space.

IV-B The AI Player

The agent in the Mario AI framework [32] (https://github.com/amidos2006/Mario-AI-Framework) is used as an artificial player for determining the playability of generated segments.

IV-C The RL Designer

MarioPuzzle is embedded into the OpenAI gym [4] and the PPO algorithm [30] is used to train our agent [19]. The training procedure of MarioPuzzle is described in Algorithm 1. The actions and states are represented by latent vectors of length . When training the PPO agent, the initial state is a playable segment randomly sampled from the latent space of the trained segment generator, as shown in Algorithm 1. The latent vector of the current level segment feeds the current observation of the PPO agent. Then, an action (i.e., a latent vector) is selected by the PPO agent and used as an input of the generator to generate a new segment. The repairer determines and fixes broken pipes in the new segment. The fixed segment is concatenated to the earlier one and the agent tests if the addition of the new segment is playable. Then, the latent vector of this new segment (same as action) is returned as a new observation to the agent. The agent receives an immediate reward for taking an action in a particular state. The various reward functions we considered in this study are based on both content and game-play features and are detailed below.
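The interaction loop described above can be sketched as a small gym-style environment. This is a sketch under stated assumptions rather than the released implementation: the class `PuzzleEnv`, the toy generator, repairer and player functions, the latent length `LATENT_DIM` and the uniform sampling range of [-1, 1] are all hypothetical placeholders.

```python
import random
from typing import Callable, List, Sequence, Tuple

Latent = Tuple[float, ...]
Segment = List[int]
LATENT_DIM = 8  # placeholder length, not the value used in the paper


class PuzzleEnv:
    """Gym-style sketch in which both states and actions are latent vectors."""

    def __init__(self,
                 generate: Callable[[Sequence[float]], Segment],
                 repair: Callable[[Segment], Segment],
                 is_playable: Callable[[List[Segment]], bool],
                 reward_fn: Callable[[Segment, List[Segment]], float],
                 max_segments: int,
                 seed: int = 0) -> None:
        self.generate = generate
        self.repair = repair
        self.is_playable = is_playable
        self.reward_fn = reward_fn
        self.max_segments = max_segments
        self.rng = random.Random(seed)
        self.segments: List[Segment] = []

    def _sample_latent(self) -> Latent:
        # Assumption: latent entries are drawn uniformly from [-1, 1].
        return tuple(self.rng.uniform(-1.0, 1.0) for _ in range(LATENT_DIM))

    def reset(self) -> Latent:
        """Sample latent vectors until the generated, repaired segment is playable."""
        while True:
            latent = self._sample_latent()
            segment = self.repair(self.generate(latent))
            if self.is_playable([segment]):
                self.segments = [segment]
                return latent

    def step(self, action: Latent) -> Tuple[Latent, float, bool]:
        """Generate, repair and test a segment; the action is the next observation."""
        segment = self.repair(self.generate(action))
        playable = self.is_playable(self.segments + [segment])
        reward = self.reward_fn(segment, self.segments)  # uses previous segments only
        self.segments.append(segment)
        done = (not playable) or len(self.segments) >= self.max_segments
        return action, reward, done


# Toy stand-ins (assumptions) so that the sketch runs end to end.
def toy_generate(latent: Sequence[float]) -> Segment:
    return [1 if v > 0 else 0 for v in latent]


def toy_repair(segment: Segment) -> Segment:
    return segment


def toy_playable(segments: List[Segment]) -> bool:
    return True


def toy_reward(segment: Segment, previous: List[Segment]) -> float:
    return float(len(previous))
```

A PPO learner would then call `reset` once per episode and `step` once per designed segment, replacing the toy functions with the trained generator, the repairer and the testing agent.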

IV-D Reward Functions

Designing a suitable reward function for the RL agent to generate desired levels is crucial. In this implementation of EDRL, we design three metrics that formulate various reward functions ( in Algorithm 1) aiming at guiding the RL agent to learn to generate playable levels with desired player experiences.

IV-D1 Moderating Diversity Makes Fun!

Koster’s theory of fun [18] suggests that a game is fun when the patterns a player perceives are neither too unfamiliar (i.e., challenging) nor too familiar (i.e., boring). Inspired by this principle, when our agent concatenates two segments, we assume it should keep the diversity between them at moderate levels; too high diversity leads to odd connections (e.g., mix of styles) whereas too low diversity yields segments that look the same. To do so, we first define a diversity measure, and then we moderate diversity as determined from human-designed levels.

When a player plays through a level, the upcoming level segment is compared with previous ones for diversity. We thus define the diversity of a segment through its dissimilarity to previous segments. While there have been several studies focusing on quantifying content similarity (e.g., [13]), we adopt the tile-based KL-divergence [24] as a simple and efficient measure of similarity between segments. More specifically diversity, , is formulated as follows:


Here, diversity is computed for a generated level segment of a given height and width, and the tile-based KL-divergence is the standard KL-divergence between the distributions over occurrences of tile patterns in two given level segments [24]. We define a window with the same size as the segment and consider the part of the level contained in that window. The sliding window moves from the position of the segment towards the previous segments a fixed number of times with a fixed stride, and a control parameter limits the number of times that the window moves. According to (1), larger values of this parameter consider more level segments in the past. After preliminary hyper-parameter tuning, we use a window and for calculating tile-based KL-divergence [24]. and are set as and in this work.
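To make the diversity measure more concrete, the sketch below computes a tile-pattern KL-divergence between two segments and averages it over a sliding window that moves back through the level. Several choices are assumptions made for illustration only: the 2x2 pattern size, the smoothing constant `eps`, the averaging over window positions, and the function names `pattern_counts`, `tile_kl` and `diversity`. Levels are encoded as lists of equal-length strings, one string per row.

```python
from collections import Counter
from math import log
from typing import List

Level = List[str]  # one string per row, one character per tile


def pattern_counts(segment: Level, k: int = 2) -> Counter:
    """Count the k x k tile patterns occurring in a segment."""
    rows, cols = len(segment), len(segment[0])
    counts: Counter = Counter()
    for r in range(rows - k + 1):
        for c in range(cols - k + 1):
            counts[tuple(segment[r + i][c:c + k] for i in range(k))] += 1
    return counts


def tile_kl(seg_p: Level, seg_q: Level, k: int = 2, eps: float = 1e-6) -> float:
    """KL-divergence between the smoothed tile-pattern distributions of two segments."""
    p_counts, q_counts = pattern_counts(seg_p, k), pattern_counts(seg_q, k)
    patterns = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(patterns)
    q_total = sum(q_counts.values()) + eps * len(patterns)
    kl = 0.0
    for pattern in patterns:
        p = (p_counts[pattern] + eps) / p_total
        q = (q_counts[pattern] + eps) / q_total
        kl += p * log(p / q)
    return kl


def diversity(level: Level, width: int, n: int, stride: int) -> float:
    """Average divergence between the newest segment and up to n earlier windows."""
    start = len(level[0]) - width  # the newest segment is the last `width` columns
    current = [row[start:start + width] for row in level]
    values = []
    for i in range(1, n + 1):
        left = start - i * stride
        if left < 0:  # the window would leave the level
            break
        window = [row[left:left + width] for row in level]
        values.append(tile_kl(current, window))
    return sum(values) / len(values) if values else 0.0


# Usage: a level whose second half repeats its first half has zero diversity.
repeated = ["--------", "--X---X-", "XXXXXXXX"]
repeated_diversity = diversity(repeated, width=4, n=2, stride=4)
```

Identical segments give a divergence of zero, and the value grows as the tile-pattern distributions of the two segments drift apart.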

Once we have diversity defined, we attempt to moderate diversity by considering the fifteen human-authored SMB levels used for training our GAN. We thus calculate the average and standard deviation of diversity for all segments across the three different level types of our dataset. Unsurprisingly, Table I shows that different level types yield significantly different degrees of diversity. It thus appears that by varying the degree of diversity we could potentially vary the design style of our RL agent. Randomly concatenating level segments, however, cannot guarantee moderated diversity as its value depends highly on the expressive range of the generator. To that end, we moderate the diversity of each segment in predetermined ranges, thereby defining our fun reward function, as follows.


where and denote the lower and upper bounds of diversity, respectively. According to the analysis on human-authored SMB levels (Table I), we assume that the diversity of overworld levels is moderate and arbitrarily set and , based on the diversity values obtained for this type of levels.
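A reward of this kind can be sketched as follows. The exact functional form is an assumption: the sketch returns zero when diversity lies within the bounds and a negative squared distance to the nearest bound otherwise, which is at least consistent with the non-positive average fun values reported in Table II. The name `fun_reward` and the bound values in the usage lines are hypothetical.

```python
def fun_reward(diversity: float, lower: float, upper: float) -> float:
    """Zero inside [lower, upper]; negative squared distance to the nearest bound outside.

    The quadratic penalty is an assumed shape, not necessarily the one used in Eq. (2).
    """
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if diversity < lower:
        return -(lower - diversity) ** 2
    if diversity > upper:
        return -(diversity - upper) ** 2
    return 0.0


# Usage with placeholder bounds (not the values chosen in the paper).
moderate = fun_reward(1.0, lower=0.5, upper=1.5)   # within bounds, no penalty
too_similar = fun_reward(0.0, lower=0.5, upper=1.5)  # below the lower bound
too_diverse = fun_reward(2.0, lower=0.5, upper=1.5)  # above the upper bound
```

Under this shape the agent is indifferent between all diversity values inside the bounds and is pushed back towards them with increasing strength as diversity drifts away.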

Type #Level #Segment [Eq. (1)]
Overworld 9
Athletic 4
Underground 2
Total 15
TABLE I: Average diversity values and corresponding standard deviations of segments across the three types of SMB levels.

IV-D2 Historical Deviation

Inspired by the notion of novelty score [20], we consider historical deviation, , as a measure that encourages our agent to generate segments that deviate from earlier creations. In particular, we define of a segment as the average similarity of the most similar segments among the previous segments, as formalised in (3).


where is the index of the most similar segment among previous segments compared to ; represents the number of segments that our RL agent holds in its memory. This parameter is conceptually similar to the sparseness parameter in novelty search [21]. After preliminary hyper-parameter tuning, and are set as and , respectively, in the experiments of this paper.
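The computation behind historical deviation can be sketched as a nearest-neighbour average over a bounded memory, in the spirit of the novelty score. The function name `historical_deviation`, the parameter names `k` (number of most similar segments) and `m` (memory size), and the generic `dissimilarity` callable are assumptions; in the paper's setting the dissimilarity would be the tile-based KL-divergence.

```python
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")


def historical_deviation(segment: S,
                         history: Sequence[S],
                         k: int,
                         m: int,
                         dissimilarity: Callable[[S, S], float]) -> float:
    """Average dissimilarity to the k most similar of the last m segments."""
    if k < 1 or m < 1:
        raise ValueError("k and m must be positive")
    recent = list(history)[-m:]  # only the last m segments are remembered
    if not recent:
        return 0.0
    distances = sorted(dissimilarity(segment, past) for past in recent)
    nearest = distances[:k]  # smallest dissimilarity means most similar
    return sum(nearest) / len(nearest)


# Usage with a toy dissimilarity on numbers standing in for segments.
toy_history = [0, 10, 4, 7, 9]
toy_value = historical_deviation(8, toy_history, k=2, m=3,
                                 dissimilarity=lambda a, b: abs(a - b))
```

Averaging over the nearest neighbours only means that a segment scores highly when it differs even from its closest recent precedents.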

IV-D3 Playability

The playability of a newly generated segment is tested by an agent. The generated segment, however, is not played on its own; its concatenation to the three previous segments is played as a whole instead, to ensure that the concatenation will not yield unplayable levels. When testing, Mario starts playing from its ending position in the previous level segment. A segment is determined as playable only if Mario succeeds in reaching the right-most end of the segment.

Playability is essential in online level generation. Therefore, when an unplayable segment is generated, the current episode of game design ends (see Algorithm 1) and the reward returned is set as the number of segments that Mario managed to complete. With such a reward function, the agent is encouraged to generate the longest playable levels possible.

V Design Experiments Across Reward Functions

In this section, we test the effectiveness of the metrics detailed in Section IV-D through empirical experimentation. We use each of the fun and historical deviation metrics as independent reward functions and observe their impact on generated segments. Then, we use combinations of these metrics and playability to form new reward functions.

V-A Experimental Details and Results

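Evaluation metrics: fun [% of segments with moderate diversity], historical deviation, playability. Number of level elements in generated segments: gaps, pipes, enemies, bullets, coins, question-marks.

| Agent | Fun [%] | Historical deviation | Playability | Gaps | Pipes | Enemies | Bullets | Coins | Question-marks |
|  | -0.005 ± 0.044 [87.1 ± 14.1] | 0.86 ± 0.28 | 29.6 ± 28.3 | 0.60 ± 0.40 | 0.43 ± 0.17 | 2.11 ± 0.63 | 0.05 ± 0.13 | 0.64 ± 0.52 | 1.00 ± 0.59 |
|  | -0.092 ± 0.092 [57.0 ± 18.8] | 1.43 ± 0.32 | 24.2 ± 21.8 | 0.73 ± 0.34 | 0.40 ± 0.31 | 1.48 ± 0.58 | 0.09 ± 0.10 | 1.10 ± 0.62 | 0.68 ± 0.55 |
|  | -0.065 ± 0.086 [63.4 ± 21.8] | 1.38 ± 0.38 | 16.4 ± 16.2 | 0.74 ± 0.37 | 0.49 ± 0.33 | 1.69 ± 0.74 | 0.11 ± 0.19 | 1.06 ± 0.79 | 0.53 ± 0.59 |
|  | -0.023 ± 0.013 [76.7 ± 5.1] | 0.72 ± 0.07 | 97.3 ± 11.8 | 0.12 ± 0.04 | 0.15 ± 0.04 | 1.58 ± 0.15 | 0.10 ± 0.03 | 2.37 ± 0.25 | 0.29 ± 0.09 |
|  | -0.032 ± 0.017 [75.2 ± 5.2] | 0.83 ± 0.09 | 96.6 ± 14.4 | 0.18 ± 0.05 | 0.17 ± 0.04 | 1.28 ± 0.17 | 0.10 ± 0.04 | 2.51 ± 0.25 | 0.46 ± 0.11 |
|  | -0.037 ± 0.020 [74.2 ± 5.7] | 0.84 ± 0.09 | 97.0 ± 13.9 | 0.18 ± 0.07 | 0.22 ± 0.05 | 1.40 ± 0.20 | 0.10 ± 0.04 | 2.48 ± 0.29 | 0.46 ± 0.18 |
|  | -0.034 ± 0.018 [74.0 ± 6.0] | 0.84 ± 0.09 | 97.4 ± 13.7 | 0.23 ± 0.17 | 0.18 ± 0.09 | 1.19 ± 0.40 | 0.12 ± 0.04 | 2.43 ± 0.36 | 0.55 ± 0.13 |
|  | -0.064 ± 0.098 [64.3 ± 21.9] | 1.35 ± 0.37 | 16.0 ± 14.3 | 0.82 ± 0.44 | 0.45 ± 0.36 | 1.53 ± 0.75 | 0.10 ± 0.18 | 0.94 ± 0.71 | 0.52 ± 0.55 |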
TABLE II: Evaluation metrics of levels generated across different RL agents over 300 levels each. The Table presents average values and corresponding standard deviations across the 300 levels. refers to the trained policy using the reward function. refers to a random agent that designs SMB levels. , and are respectively the , and values averaged over playable segments by the agent; in addition to , the value appearing in square brackets refers to the average percentage of playable segments within the bounds of moderate diversity (cf. Section IV-D1). In addition to the three evaluation metrics, the Table presents the number of gaps, pipes, enemies, bullet blocks, coins, and question-mark blocks that exist in the generated playable segments. Values in bold indicate the highest value in the column.

When calculating , the size of the sliding window and the segment are both and the control parameters and of Eq. (1) are set as and , respectively. When calculating , the control parameters and of Eq. (3) are set as and , respectively. For each RL agent presented in this paper, a PPO algorithm is used and trained for epochs.

In all results and illustrations presented in this paper, , and refer to independent metrics that consider, respectively, the fun (2), historical deviation (3) and playability (cf. Section IV-D3) of a level. The notation refers to the PPO agent trained solely through the reward function. Similarly, refers to the agent trained with the sum of and as a reward, while refers to the one trained with the sum of , and as a reward. When a reward function is composed by multiple metrics, each of the metrics included is normalised within based on the range determined by the maximum and minimum of its most recent values.
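The normalisation of composite rewards can be sketched with a rolling min-max normaliser per metric. The target range of [0, 1], the window length, and the names `RollingMinMax` and `combined_reward` are assumptions for illustration; the sketch only shows how each metric could be rescaled by the extremes of its most recent values before the metrics are summed.

```python
from collections import deque
from typing import Deque, Dict


class RollingMinMax:
    """Rescale a metric by the minimum and maximum of its most recent values."""

    def __init__(self, window: int) -> None:
        if window < 1:
            raise ValueError("window must be positive")
        self.values: Deque[float] = deque(maxlen=window)  # oldest values drop out

    def normalise(self, value: float) -> float:
        self.values.append(value)
        low, high = min(self.values), max(self.values)
        if high == low:
            return 0.0  # no spread observed yet, so no meaningful scale
        return (value - low) / (high - low)


def combined_reward(metrics: Dict[str, float],
                    normalisers: Dict[str, RollingMinMax]) -> float:
    """Sum of the individually normalised metrics (assumed unweighted sum)."""
    return sum(normalisers[name].normalise(value) for name, value in metrics.items())


# Usage: with a window of three values, the first observation is forgotten
# once the fourth arrives.
scaler = RollingMinMax(window=3)
scaled = [scaler.normalise(v) for v in (2.0, 4.0, 3.0, 10.0)]
```

Normalising each metric separately keeps metrics with a large numeric range, such as the playability count, from dominating those with a small one.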

For testing the trained agents, different level segments are used as initial states. Given an initial state, an agent is tested independently times for a maximum of segments—or until an unplayable segment is reached if is a component of the reward function. As a result, each agent designs levels of potentially different number of segments when is considered; otherwise, the generation of an unplayable segment will not terminate the design process, thus segments will be generated. For comparison, a random agent, referred to as , is used to generate levels by randomly sampling up to segments from the latent space or till an unplayable one is generated.

The degrees of fun, historical deviation, and playability of these generated levels are evaluated and summarised in Table II. Table II also showcases average numbers of core level elements in generated levels, including pipes, enemies and question-mark blocks. Figure 3 illustrates a number of arbitrarily chosen segments clipped from levels generated by each RL agent.

[: surprisingly uninteresting levels when designing solely for fun] [: highly diverse levels when designing for historical deviation] [: levels balancing between fun and historical deviation] [: levels with more ground tiles in assistance of ] [: playable levels with clearly repeated patterns] [: playable and diverse levels with limited gaps] [: fun, diverse and playable levels]

Fig. 3: Example segments clipped from levels generated by RL agents trained with different reward functions. The key characteristics of each EDRL policy are outlined in the corresponding captions.

V-B Analysis of Findings

The key findings presented in Table II suggest that the evaluation functions of fun and historical deviation across all generated levels reach their highest value when the corresponding reward functions are used. In particular, the policy trained solely on fun yields appropriate levels of diversity in 87.1 out of 100 segments, on average; higher than any other policy (see the values in square brackets in Table II). Similarly, historical deviation reaches its highest value on average (1.43) when the policy trained solely on historical deviation is used. Finally, playability is boosted when it is used as a reward function, but it reaches its peak value of 97.4 out of 100 segments on average when it is combined with both fun and historical deviation.

The policy trained on fun focuses primarily on enemy and question-mark block placement, but the levels it generates are far from looking diverse (see Fig. 3). The policy trained on historical deviation does not yield any distinct level element and results in rather diverse yet unplayable levels (see Fig. 3). The combination of fun and historical deviation without playability augments the number of pipes existent in the level and offers interesting but largely unplayable levels (see Table II and Fig. 3).

| Agent | Resampling method | (1) Failed generations | (2) Unplayable segments | (3) Resamples: Max | (4) Resamples: Total | (5) Generation time (s): Segment | (5) Generation time (s): Sample | (6) Faulty tiles: Original | (6) Faulty tiles: Repaired |
|  | Random | 205/300 | 6.15 | 1.97 | 7.24 | 1.09 | 1.02 | 50.2 | 8.4 |
|  | Policy | 196/300 | 6.36 | 2.18 | 7.61 | 1.11 | 1.03 | 54.1 | 10.2 |
|  | Random | 0/300 | 0.10 | 0.11 | 0.11 | 0.79 | 0.79 | 6.1 | 0.3 |
|  | Policy | 5/300 | 0.12 | 0.11 | 0.12 | 0.79 | 0.79 | 6.1 | 0.3 |
TABLE III: Generating 100-segment long levels. All values, except for the number of failed generations, are averaged across 300 levels (10 trials each from 30 different initial level segments). In the “Generation Time” column, “Sample” refers to the time averaged over the total number of generated segments, including unplayable and playable ones, while “Segment” refers to the time averaged over successful level generations.
Fig. 4: Example of partial game levels generated online by the agent.

When playability is considered by the reward function, the outcome is a series of playable levels (i.e., the average playability value is over 96 out of 100 segments per level). The most interesting agent among all the combinations of playability with the other two reward functions, however, seems to be the policy that is trained on all three (see Fig. 3). That policy yields highly playable levels that maintain high levels of fun (on average, 74 out of 100 segments reach appropriate levels of diversity) and historical deviation (0.84 on average). This agent appears to design more playable levels, but at the same time, the generated levels yield interesting patterns and have more gaps, coins and bullet blocks that increase the challenge and augment the exploration paths available for the player (see Table II and Fig. 3). Based on the performance of this agent, compared against all other SMB level design agents explored, the next section puts it under the magnifying glass and analyses its online generation capacities.

VI Online Endless-level Generation

To test the performance of the agent trained on all three reward functions for online level generation, we let it generate 300 levels (10 trials from 30 initial segments) composed of 100 playable segments each; in principle, however, the agent can generate endless levels online. The results obtained are shown in Table III and are discussed in this section. As a baseline, we compare its performance against an agent that does not consider playability during training.

When generating a level, it is necessary to repair its defects and test its playability. The faulty tiles of the generated segments before and after repairing are detected by CNet [36] (see column 6 in Table III). Clearly, the levels generated by the baseline policy feature more faulty tiles than levels generated by the policy trained with playability. After visualising the generated levels, we observe that the pipes in the levels generated by the baseline policy have diverse locations. Most of the pipes in levels generated by the policy trained with playability, instead, are located in similar positions (i.e., near the middle bottom of the segment). We assume that the segments with this pattern are more likely to be chosen by the agent as they are easier to complete by the testing agent.

The repair operation by the CNet-assisted Evolutionary Algorithm only satisfies the logical constraints; it does not guarantee the playability of the level. If a generated segment is unplayable, the RL agent re-samples a new segment with one of two methods: either (i) sampling a new action according to its policy, or (ii) sampling a new action randomly from a normal distribution and then clipping it into the valid action range. This resampling process repeats until a playable segment is obtained or a sampling counter reaches its preset limit. Note that this resampling technique is only used in this online level generation test to ensure that the generated levels are composed of playable segments only. Resampling is not enabled during training, as an RL agent trained with playability as part of its reward is expected to learn to generate playable segments.
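The resampling procedure described above can be sketched as follows. This is a minimal illustration rather than the implementation used in MarioPuzzle: the function name, the callbacks `sample_from_policy` and `is_playable`, the clipping bounds of -1 and 1, the use of a standard normal distribution, and the value passed as `max_resamples` are all assumptions made for the example, since the text does not fix them here.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Action = List[float]


def resample_until_playable(
    sample_from_policy: Callable[[], Sequence[float]],
    is_playable: Callable[[Sequence[float]], bool],
    max_resamples: int,
    action_dim: int,
    use_policy: bool = True,
    low: float = -1.0,   # assumed lower clipping bound
    high: float = 1.0,   # assumed upper clipping bound
    rng: Optional[random.Random] = None,
) -> Tuple[Optional[Action], int]:
    """Return (action, number of resamplings); action is None on failure."""
    rng = rng or random.Random()
    # The first candidate always comes from the policy.
    action: Action = list(sample_from_policy())
    resamples = 0
    while not is_playable(action):
        if resamples >= max_resamples:
            return None, resamples  # sampling counter reached its limit
        if use_policy:
            # Method (i): draw a new action according to the policy.
            action = list(sample_from_policy())
        else:
            # Method (ii): draw from a normal distribution, then clip.
            action = [
                min(max(rng.gauss(0.0, 1.0), low), high)
                for _ in range(action_dim)
            ]
        resamples += 1
    return action, resamples
```

In this sketch, `is_playable` stands in for the whole decode, repair and play-test step applied to the segment that the action selects, and the returned count corresponds to the resampling statistics that an online test would record.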

To evaluate the real-time generation efficiency of MarioPuzzle, we record the number of times resampling is required (see column 3 in Table III), the total number of resamplings (see column 4 in Table III), and the time taken for generating a level of 100 playable segments (see column 5 in Table III). Unsurprisingly, the agent trained without playability as part of its reward appears to generate more unplayable segments and requires more resampling.

As a baseline of real-time performance we asked three human players (students in our research group) to play 8 levels that are randomly selected from the level creations of . Their average playing time for one segment was , and . It thus appears that the average segment generation time (see column 5 in Table III) of is acceptable for online generation when compared to the time needed for a human player to complete one segment.

According to Table III, the agent trained with playability as part of its reward never fails to generate playable 100-segment-long levels when random resampling is used. Compared with the baseline trained without playability, it is clear that integrating playability into the reward function can reduce the probability of generating unplayable segments and the number of resamplings and, in turn, make the online generation of playable levels easier and faster.

Figure 4 displays examples from the levels generated by the agent trained on all three reward functions. The EDRL designer resolves the playability issues by placing more ground tiles while maintaining appropriate levels of fun and historical deviation. It is important to remember that EDRL in this work operates and designs fun, diverse and playable levels for an AI playing agent. The outcome of Fig. 4 reflects the attempt of the algorithm to maximise all three reward functions for this particular and rather predictable player, thereby offering largely linear levels without dead-ends, with limited gaps (ensuring playability) and limited degrees of exploration for the player. The resulting levels for agents that exhibit more varied gameplay are expected to vary substantially.

VII Discussion and Future Work

In this paper we introduced EDRL as a framework, and instantiated it in SMB as MarioPuzzle. We observed the capacity of MarioPuzzle to generate endless SMB levels that are playable and attempt to maximise certain experience metrics. As this is the first instance of EDRL in a game, there are a number of limitations that need to be explored in future studies; we discuss these limitations in this section.

By integrating playability in the reward function we aimed to generate playable levels that are endless in principle [42]. As a result, the RL agent generates more ground tiles to make it easier for the test agent to pass through. The generated levels depend heavily on both the reward function and the behaviour of the test agent. Various human-like agents and reward functions will need to be studied in future work to encourage the RL agent to learn to select suitable level segments for different types of players. Moreover, the behaviour of human-like agents when playing earlier segments can be used to continuously train the RL agent. In that way our RL designer may continually (online) learn through the behaviour of a human-like agent and adapt to the player's skill, preferences and even annotated experiences.

In this initial implementation of EDRL, we used a number of metrics to directly represent player experience in a theory-driven manner [46]. In particular, we tested expressions of fun and historical deviation for generating segments of game levels. As future work, we intend to generate levels with adaptive levels of diversity, surprise [47, 11], and/or novelty, and to investigate additional proxies of player experience, including difficulty-related functions stemming from the theory of flow [6]. Similarly to [5, 10], we are also interested in applying multi-objective optimisation procedures to consider the quality and diversity of levels simultaneously, instead of aggregating them linearly with fixed weights. Viewing the EDRL framework from an intrinsic motivation [1] or an artificial curiosity [29] perspective is another research direction we consider. All of the above reward functions, both current and future, will need to be cross-verified and tested against human players, as in [49].
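The difference between linear aggregation with fixed weights and a multi-objective treatment can be illustrated with a small sketch. It is not taken from the cited work: the function names, the two-objective reward tuples, and the equal weights are assumptions chosen for the example, and Pareto dominance is used here only as one common way of comparing candidates without fixing weights.

```python
from typing import List, Sequence, Tuple

Rewards = Tuple[float, ...]


def weighted_sum(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Linear aggregation of several reward terms with fixed weights."""
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    return sum(w * r for w, r in zip(weights, rewards))


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(
        x > y for x, y in zip(a, b)
    )


def pareto_front(candidates: List[Rewards]) -> List[Rewards]:
    """Keep the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Hypothetical (quality, diversity) scores for four candidate segments.
candidates = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.4, 0.4)]
front = pareto_front(candidates)
scores = [weighted_sum(c, (0.5, 0.5)) for c in candidates]
```

With equal weights, the first three candidates receive practically the same aggregated score even though they trade the two objectives off very differently, whereas the Pareto front keeps all three as distinct options and discards only the dominated fourth candidate.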

While EDRL builds on two general frameworks and is expected to operate in games with dissimilar characteristics, our plan is to test the degree to which EDRL is viable and scalable to more complex games that feature large game and action space representations. We argue that, given a learned game content representation (such as the latent vector of a GAN or an autoencoder) and a function that is able to segment the level, EDRL is directly applicable. Level design was our test-bed in this study; as both EDPCG and PCGRL (to a lesser degree) have explored their application to other forms of content beyond levels, EDRL also needs to be tested on other types of content, independently or even in an orchestrated manner.
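For a tile-based game, a function that segments a level could be as simple as the sketch below. It assumes, purely for illustration, that a level is stored as a list of equally long strings (one per tile row) and that segments have a fixed width; the names `segment_level` and `concatenate` are hypothetical and do not refer to any existing code base.

```python
from typing import List

Grid = List[str]  # one string per tile row, all rows equally long


def segment_level(rows: Grid, width: int) -> List[Grid]:
    """Split a tile grid into consecutive segments of `width` columns.

    The last segment is narrower if the level length is not a
    multiple of `width`.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    length = len(rows[0]) if rows else 0
    if any(len(row) != length for row in rows):
        raise ValueError("all rows must have the same length")
    return [
        [row[start:start + width] for row in rows]
        for start in range(0, length, width)
    ]


def concatenate(segments: List[Grid]) -> Grid:
    """Join segments left to right into a single tile grid."""
    if not segments:
        return []
    height = len(segments[0])
    if any(len(segment) != height for segment in segments):
        raise ValueError("all segments must have the same height")
    return ["".join(segment[r] for segment in segments) for r in range(height)]
```

Segmenting and concatenating are inverse operations in this sketch, which is the property a segment-based sequential generator relies on when it appends newly selected segments to an existing level.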


VIII Conclusion

In this paper, we introduced a novel framework that realises personalised online content generation by interweaving the EDPCG [48] and the PCGRL [17] frameworks. We test the ED(PCG)RL framework, EDRL in short, in Super Mario Bros and train RL agents to design endless and playable levels that maximise notions of fun [18] and historical deviation. To realise endless generation in real-time, we employ a pre-trained GAN generator that designs level segments; the RL agent selects suitable segments (as represented by their latent vectors) to be concatenated to the existing level. The application of EDRL in SMB makes online level generation possible while ensuring a certain degree of fun and deviation across level segments. The generated segments are automatically repaired by a CNet-assisted Evolutionary Algorithm [36] and tested by an agent that guarantees playability. This initial study showcases the potential of EDRL in fast-paced games like SMB and opens new research horizons for realising experience-driven PCG through the RL paradigm.


The authors thank the reviewers for their careful reviews and insightful comments.


  • [1] A. G. Barto (2013) Intrinsic motivation and reinforcement learning. In Intrinsically motivated learning in natural and artificial systems, pp. 17–47. Cited by: §VII.
  • [2] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling (2013) The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279. Cited by: §III.
  • [3] R. Bellman (1957) A markovian decision process. Journal of mathematics and mechanics 6 (5), pp. 679–684. Cited by: §III-A, §III.
  • [4] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540. Cited by: §IV-C.
  • [5] G. Cideron, T. Pierrot, N. Perrin, K. Beguir, and O. Sigaud (2020) QD-RL: efficient mixing of quality and diversity in reinforcement learning. arXiv preprint arXiv:2006.08505. Cited by: §VII.
  • [6] M. Csikszentmihalyi, S. Abuhamdeh, and J. Nakamura (2014) Flow. In Flow and the foundations of positive psychology, pp. 227–238. Cited by: §VII.
  • [7] B. De Kegel and M. Haahr (2020) Procedural puzzle generation: a survey. IEEE Transactions on Games 12 (1), pp. 21–40. Cited by: §II.
  • [8] R. N. Engelsvoll, A. Gammelsrød, and B. S. Thoresen (2020) Generating levels and playing Super Mario Bros. with deep reinforcement learning using various techniques for level generation and deep q-networks for playing. Master’s Thesis, University of Agder. Cited by: §II-A, §IV-A.
  • [9] L. Gisslén, A. Eakins, C. Gordillo, J. Bergdahl, and K. Tollmar (2021) Adversarial reinforcement learning for procedural content generation. In 2021 IEEE Conference on Games (CoG), pp. accepted. Cited by: §II-A.
  • [10] D. Gravina, A. Khalifa, A. Liapis, J. Togelius, and G. N. Yannakakis (2019) Procedural content generation through quality diversity. In 2019 IEEE Conference on Games (CoG), pp. 1–8. Cited by: §VII.
  • [11] D. Gravina, A. Liapis, and G. N. Yannakakis (2018) Quality diversity through surprise. IEEE Transactions on Evolutionary Computation 23 (4), pp. 603–616. Cited by: §VII.
  • [12] S. Greuter, J. Parker, N. Stewart, and G. Leach (2003) Real-time procedural generation of pseudo infinite cities. In Proceedings of the 1st international conference on Computer graphics and interactive techniques in Australasia and South East Asia, pp. 87–ff. Cited by: §II-B.
  • [13] A. Isaksen, C. Holmgård, and J. Togelius (2017) Semantic hashing for video game levels. Game Puzzle Des 3 (1), pp. 10–16. Cited by: §IV-D1.
  • [14] M. Jennings-Teats, G. Smith, and N. Wardrip-Fruin (2010) Polymorph: a model for dynamic level generation. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Vol. 5. Cited by: §I, §II-B, §III-A.
  • [15] L. Johnson, G. N. Yannakakis, and J. Togelius (2010) Cellular automata for real-time generation of infinite cave levels. In Proceedings of the 2010 Workshop on Procedural Content Generation in Games, pp. 1–4. Cited by: §II-B.
  • [16] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowski (2016) Vizdoom: a Doom-based AI research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8. Cited by: §III.
  • [17] A. Khalifa, P. Bontrager, S. Earle, and J. Togelius (2020) PCGRL: procedural content generation via reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Vol. 16, pp. 95–101. Cited by: §I, §I, §II-A, Fig. 1, §III-A, §III, §IV-A, §VIII.
  • [18] R. Koster (2013) Theory of fun for game design. O'Reilly Media, Inc. Cited by: §I, §I, §IV-D1, §VIII.
  • [19] I. Kostrikov (2018) PyTorch implementations of reinforcement learning algorithms. GitHub. Note: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail Cited by: §IV-C.
  • [20] J. Lehman and K. O. Stanley (2008) Exploiting open-endedness to solve problems through the search for novelty.. In ALIFE, pp. 329–336. Cited by: §IV-D2.
  • [21] J. Lehman and K. O. Stanley (2011) Novelty search and the problem with objectives. In Genetic programming theory and practice IX, pp. 37–56. Cited by: §IV-D2.
  • [22] A. Liapis, G. N. Yannakakis, M. J. Nelson, M. Preuss, and R. Bidarra (2018) Orchestrating game generation. IEEE Transactions on Games 11 (1), pp. 48–68. Cited by: §VII.
  • [23] J. Liu, A. Khalifa, S. Snodgrass, S. Risi, G. N. Yannakakis, and J. T. Togelius (2021) Deep learning for procedural content generation. Neural Computing and Applications 33, pp. 19–37. Cited by: §II-A, §II.
  • [24] S. M. Lucas and V. Volz (2019) Tile pattern KL-divergence for analysing and evolving game levels. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 170–178. Cited by: §IV-D1.
  • [25] R. Nintendo (1985) Super Mario Bros. Game [NES].(13 September 1985). Nintendo, Kyoto, Japan. Cited by: §I.
  • [26] D. Perez-Liebana, J. Liu, A. Khalifa, R. D. Gaina, J. Togelius, and S. M. Lucas (2019) General video game AI: a multitrack framework for evaluating agents, games, and content generation algorithms. IEEE Transactions on Games 11 (3), pp. 195–214. Cited by: §I, §III.
  • [27] D. Perez-Liebana, S. M. Lucas, R. D. Gaina, J. Togelius, A. Khalifa, and J. Liu (2019) General video game artificial intelligence. Morgan & Claypool Publishers. Note: https://gaigresearch.github.io/gvgaibook/ Cited by: §I.
  • [28] S. Risi and J. Togelius (2020) Increasing generality in machine learning through procedural content generation. Nature Machine Intelligence 2 (8), pp. 428–436. Cited by: §II-A, §II.
  • [29] J. Schmidhuber (2006) Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts. Connection Science 18 (2), pp. 173–187. Cited by: §VII.
  • [30] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §IV-C.
  • [31] N. Shaker, J. Togelius, and M. J. Nelson (2016) Procedural content generation in games. Springer. Cited by: §I, §II-A, §II.
  • [32] N. Shaker, J. Togelius, G. N. Yannakakis, B. Weber, T. Shimizu, T. Hashiyama, N. Sorenson, P. Pasquier, P. Mawhorter, G. Takahashi, et al. (2011) The 2010 Mario AI championship: level generation track. IEEE Transactions on Computational Intelligence and AI in Games 3 (4), pp. 332–347. Cited by: §IV-B.
  • [33] N. Shaker, G. Yannakakis, and J. Togelius (2010) Towards automatic personalized content generation for platform games. In Proceedings of the Sixth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, pp. 63–68. Cited by: §I, §II-B.
  • [34] P. Shi and K. Chen (2016) Online level generation in Super Mario Bros via learning constructive primitives. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8. Cited by: §II-B.
  • [35] P. Shi and K. Chen (2017) Learning constructive primitives for real-time dynamic difficulty adjustment in Super Mario Bros. IEEE Transactions on Games 10 (2), pp. 155–169. Cited by: §I, §II-B, §III-A.
  • [36] T. Shu, Z. Wang, J. Liu, and X. Yao (2020) A novel CNet-assisted evolutionary level repairer and its applications to Super Mario Bros. In 2020 IEEE Congress on Evolutionary Computation (CEC). Cited by: §I, §IV-A, §VI, §VIII.
  • [37] D. Stammer, T. Günther, and M. Preuss (2015) Player-adaptive spelunky level generation. In 2015 IEEE Conference on Computational Intelligence and Games (CIG), pp. 130–137. Cited by: §I, §II-B.
  • [38] A. J. Summerville, S. Snodgrass, M. Mateas, and S. Ontanón (2016) The VGLC: the video game level corpus. In Proceedings of 1st International Joint Conference of DiGRA and FDG. Cited by: §IV-A.
  • [39] A. Summerville, S. Snodgrass, M. Guzdial, C. Holmgård, A. K. Hoover, A. Isaksen, A. Nealen, and J. Togelius (2018) Procedural content generation via machine learning (PCGML). IEEE Transactions on Games 10 (3), pp. 257–270. Cited by: §II-A, §II.
  • [40] W. R. Thompson (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294. Cited by: §II-B.
  • [41] J. Togelius, A. J. Champandard, P. L. Lanzi, M. Mateas, A. Paiva, M. Preuss, and K. O. Stanley (2013) Procedural content generation: goals, challenges and actionable steps. Cited by: §II-A, §II.
  • [42] J. Togelius, S. Karakovskiy, and R. Baumgarten (2010) The 2009 Mario AI competition. In IEEE Congress on Evolutionary Computation, pp. 1–8. Cited by: §VII.
  • [43] J. Togelius, G. N. Yannakakis, K. O. Stanley, and C. Browne (2011) Search-based procedural content generation: a taxonomy and survey. IEEE Transactions on Computational Intelligence and AI in Games 3 (3), pp. 172–186. Cited by: §II-A, §II.
  • [44] E. Tulving (2002) Episodic memory: from mind to brain. Annual review of psychology 53 (1), pp. 1–25. Cited by: §III.
  • [45] V. Volz, J. Schrum, J. Liu, S. M. Lucas, A. Smith, and S. Risi (2018) Evolving Mario levels in the latent space of a deep convolutional generative adversarial network. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 221–228. Cited by: §IV-A.
  • [46] G. N. Yannakakis and J. Togelius (2018) Artificial intelligence and games. Vol. 2, Springer. Cited by: §I, §II-A, §II, §VII.
  • [47] G. N. Yannakakis and A. Liapis (2016) Searching for surprise. Cited by: §VII.
  • [48] G. N. Yannakakis and J. Togelius (2011) Experience-driven procedural content generation. IEEE Transactions on Affective Computing 2 (3), pp. 147–161. Cited by: §I, §II-A, §II-B, §II, Fig. 1, §III, §VIII.
  • [49] G. N. Yannakakis (2005) AI in computer games: generating interesting interactive opponents by the use of evolutionary computation. Ph.D. Thesis, University of Edinburgh. Cited by: §VII.