Winning Isn't Everything: Enhancing Game Development with Intelligent Agents

by   Yunqi Zhao, et al.

Recently, there have been several high-profile achievements of agents learning to play games against humans and beat them. In this paper, we study the problem of training intelligent agents in service of game development. Unlike the agents built to "beat the game", our agents aim to produce human-like behavior to help with game evaluation and balancing. We discuss two fundamental metrics based on which we measure the human-likeness of agents, namely skill and style, which are multi-faceted concepts with practical implications outlined in this paper. We discuss how this framework applies to multiple games under development at Electronic Arts, followed by some of the lessons learned.




I Introduction

The history of artificial intelligence (AI) can be mapped by its achievements playing and winning various games. From the early days of Chess-playing machines to the most recent accomplishments of Deep Blue [23], AlphaGo [64], and AlphaStar [3], game-playing AI (Footnote 1) has advanced from competent, to competitive, to champion in even the most complex games. Games have been instrumental in advancing AI, and most notably in recent times through tree search and deep reinforcement learning (RL).

Footnote 1: We refer to game-playing AI as any AI solution that powers an agent in the game. This can range from scripted AI solutions all the way to the state-of-the-art deep reinforcement learning agents.

Fig. 1: A depiction of the possible ranges of AI agents and the possible tradeoff/balance between skill and style. In this tradeoff, there is a region that captures human-like skill and style. AI Agents may not necessarily land in the human-like region. High-skill AI agents land in the green region while their style may fall out of the human-like region.

Complementary to these great efforts on training high-skill game-playing agents, at Electronic Arts, our primary goal is to train agents that assist in the game design process, which is iterative and laborious. The complexity of modern games steadily increases, making the corresponding design tasks even more challenging. To support designers in this context, we train game-playing AI agents to perform tasks ranging from automated playtesting to interaction with human players tailored to enhance game-play immersion.

To approach the challenge of creating agents that generate meaningful interactions that inform game developers, we propose techniques to model different behaviors. Each of these has to strike a different balance between style and skill. We define skill as how efficient the agent is at completing the task it is designed for. Style is loosely defined as how the player engages with the game and what makes the player enjoy their game-play. Defining and gauging skill is usually much easier than defining and gauging style. Nevertheless, in this paper we attempt to evaluate the style of an artificial agent using statistical properties of the underlying model.

One of the most crucial tasks in game design is the process of playtesting. Game designers usually rely on playtesting sessions and feedback they receive from playtesters to make design choices in the game development process. Playtesting is performed to guarantee quality game-play that is free of game-breaking exceptions (e.g., bugs and glitches) and delivers the experience intended by the designers. Since games are complex entities with many moving parts, solving this multi-faceted optimization problem is even more challenging. An iterative loop, in which data is gathered from the game by one or more playtesters and then analyzed by the designers, is repeated many times throughout the game development process.

To mitigate this expensive process, one of our major efforts is to implement agents that can help automate aspects of playtesting. These agents are meant to play through the game, or a slice of it, trying to explore behaviors that can generate data to assist in answering questions that designers pose. These can range from exhaustively exploring certain sequences of actions, to trying to play a scenario from start to finish in the fewest actions possible. We showcase use-cases focused on creating AI agents to playtest games at Electronic Arts and discuss the related challenges.

Another key task in game development is the creation of in-game characters that are human-facing and interact with real human players. Agents must be trained and delicately tuned to guarantee a quality experience. An AI adversary that reacts within a small number of frames can be deemed unfair rather than challenging. On the other hand, a pushover agent might be an appropriate introductory opponent for novice players, but it fails to retain player interest after a handful of matches. While traditional AI solutions are already providing excellent experiences for the players, it is becoming increasingly difficult to scale those traditional solutions up as the game worlds are becoming larger and the content is becoming dynamic.

As Fig. 1 shows, there is a range of style/skill pairs that are achievable by human players, and hence called human-like. On the other hand, high-skill game-playing agents may have an unrealistic style rating, if they rely on high computational power and memory size, unachievable by humans. Efforts to evaluate techniques to emulate human-like behavior have been presented [51], but measuring non-objective metrics such as fun and immersion is an open research question [26, 17]. Further, we cannot evaluate player engagement prior to the game launch, so we rely on our best approximation: designer feedback. Through an iterative process, the designers evaluate the game-play experience by interacting with the AI agents to measure whether the intended game-play experience is provided.

These challenges each require a unique equilibrium between style and skill. Certain agents could take advantage of superhuman computation to perform exploratory tasks, most likely relying more heavily on skill. Others need to interact with human players, requiring a style that won’t break player immersion. Then there are agents that need to play the game with players cooperatively, which requires a much more delicate balance in pursuit of a human-like play style. Each of these individual problems calls for a different approach and poses significant challenges. Pursuing human-like style and skill can be as challenging as (if not more challenging than) achieving high-performance agents.

Finally, training an agent to satisfy a specific need is often more efficient than trying to reach such a solution through high-skill AI agents. This is the case, for example, when using game-playing AI to automatically run multiple playthroughs of a specific in-game scenario to trace the origin of an unintended game-play behavior. In this scenario, an agent that explores the game space would potentially be a better option than one that reaches the goal state of the level more quickly. Another advantage of creating specialized AI is the cost of implementation and training. The agents needed for these tasks are commonly less complex than their optimal-play alternatives, making them easier to implement as well as faster to train.

To summarize, we mainly pursue two use-cases for having AI agents enhance the game development process.

  1. playtesting AI agents to provide feedback during game design, particularly when a large number of concurrent players are to interact in huge game worlds.

  2. game-playing AI agents to interact with real human players to shape their game-play experience.

The rest of the paper is organized as follows. In Section II, we review the related work on training agents for playtesting and NPCs. In Section III, we describe our training pipeline. In Sections IV and V, we provide four case studies that cover playtesting and game-playing, respectively. These studies are performed to help with the development process of multiple games at Electronic Arts. These games vary considerably in many aspects, such as the game-play platform, the target audience, and the engagement duration. The solutions in these case studies were created in constant collaboration with the game designers. The first case study, in Section IV-A, which covers game balancing and playtesting, was done in conjunction with the development of The Sims Mobile. The other case studies were performed on games that were still under development at the time this paper was written. Hence, we purposely omit specific details about them to comply with company confidentiality. Finally, the concluding remarks are provided in Section 


II Related Work

II-A Playtesting AI agents

To validate their design, game designers conduct playtesting sessions. Playtesting consists of having a group of players interact with the game in the development cycle to not only gauge the engagement of players, but also to discover elements and states that result in undesirable outcomes. As a game goes through the various stages of development, it is essential to continuously iterate and improve the relevant aspects of the game-play and its balance. Relying exclusively on playtesting conducted by humans can be costly and inefficient. Artificial agents could perform much faster play sessions, allowing the exploration of much more of the game space in much shorter time. This becomes even more valuable as game worlds grow large enough to hold tens of thousands of simultaneously interacting players. Games at this scale render traditional human playtesting infeasible.

Recent advances in the field of RL, when applied to playing computer games, assume that the goal of a trained agent is to achieve the best possible performance with respect to clearly defined rewards while the game itself remains fixed for the foreseeable future. In contrast, during game development the objectives and the settings are quite different and vary over time. The agents can play a variety of roles with rewards that are not obvious to define formally, e.g., the objective of an agent exploring a game level is different from foraging, defeating all adversaries, or solving a puzzle. Also, the game environment changes frequently between game builds. In such settings, it is desirable to quickly train agents that help with automated testing, data generation for game balance evaluation, and wider coverage of the game-play features. It is also desirable that the agent be mostly re-usable as the game build is updated with new appearance and game-play features. Simply throwing computational resources and substantial engineering effort at training agents in such conditions is far from practical and calls for a different approach.

The idea of using artificial agents for playtesting is not new. Algorithmic approaches have been proposed to address the issue of game balance in board games [22, 35] and card games [40, 43, 21]. More recently, Holmgard et al. [34], as well as Mugrai et al. [47], built variants of MCTS to create a player model for AI-agent-based playtesting. Guerrero-Romero et al. created different goals for general game-playing agents in order to playtest games emulating players of different profiles [28]. These techniques are relevant to creating rewarding mechanisms for mimicking player behavior. AI and machine learning can also play the role of a co-designer, making suggestions during the development process [77]. Tools for creating game maps [41] and level design [67, 62] have also been proposed. See [73, 69] for a survey of these techniques in game design.

In this paper, we describe our framework that supports game designers with automated playtesting. This also entails a training pipeline that universally applies this framework to a variety of games. We then provide two case studies that call for different solution techniques.

II-B Game-playing AI agents

Game-playing AI has been a main constituent of games since the dawn of video gaming. Analogously, games, given their challenging nature, have been a target for AI research [78]. Over the years, AI agents have become more sophisticated and have been providing excellent experiences to millions of players as games have grown in complexity. Scaling traditional AI solutions in ever growing worlds with thousands of agents and dynamic content is a challenging problem calling for alternative approaches.

The idea of using machine learning for game-playing AI dates back to Arthur Samuel [59], who applied some form of tree search combined with basic reinforcement learning to the game of checkers. His success motivated researchers to target other games using machine learning, and particularly reinforcement learning.

IBM Deep Blue followed the tree search path and was the first artificial game agent that beat the chess world champion, Garry Kasparov [23]. A decade later, Monte Carlo Tree Search (MCTS) [19, 39] was a big leap in AI to train game agents. MCTS agents for playing Settlers of Catan were reported in [70, 18] and shown to beat previous heuristics. Other work compares multiple agent approaches to one another on the two-player variant of Carcassonne and discusses variations of MCTS and Minimax search for playing the game [31]. MCTS has also been applied to the game of 7 Wonders [56] and Ticket to Ride [36]. Furthermore, Baier et al. biased MCTS with a player model, extracted from game-play data, to have an agent that was competitive while approximating human-like play [6]. Tesauro [71], on the other hand, used TD-Lambda, which is a temporal difference RL algorithm, to train Backgammon agents at a superhuman level. The impressive recent progress on RL to solve video games is partly due to the advancements in processing power and AI computing technology (Footnote 2).

Footnote 2: The amount of AI compute has been doubling every 3-4 months in the past few years [2].

More recently, following the success stories in deep learning, deep Q networks (DQNs) use deep neural networks as function approximators within Q-learning [45]. DQNs can use convolutional function approximators as a general representation learning framework from the pixels in a frame buffer without need for task-specific feature engineering.

DeepMind researchers brought the two approaches together by showing that DQNs combined with MCTS would lead to AI agents that play Go at a superhuman level [64], and solely via self-play [66, 65]. Subsequently, OpenAI researchers showed that a policy optimization approach with function approximation, called Proximal Policy Optimization (PPO) [61], would lead to training agents at a superhuman level in Dota 2 [24]. Cuccu et al. proposed learning policies and state representations individually, but at the same time, and did so using two novel algorithms [20]. With this approach they were able to play Atari games with neural networks of 18 neurons or fewer. Recently, highly publicized progress was reported by DeepMind on StarCraft II, where AlphaStar was unveiled to play the game at a superhuman level by combining a variety of techniques including attention networks 


Fig. 2: The AI agent training pipeline, which consists of two main components: 1) game-play environment; and 2) agent environment. The agent submits actions to the game-play environment and receives back the next state.

III Training Pipeline

To train AI agents efficiently, we have developed a unified training pipeline that is applicable to all EA games, regardless of platform and genre. In this section, we present the training pipeline used to solve the case studies presented in the sections that follow.

III-A Gameplay and Agent Environments

The AI agent training pipeline, which is depicted in Fig. 2, consists of two key components:

  • Gameplay environment refers to the simulated game world that executes the game logic with actions submitted by the agent every timestep and produces the next state (Footnote 3).

  • Agent environment refers to the medium where the agent interacts with the game world. The agent observes the game state and produces an action. This is where training occurs. Note that in the case of reinforcement learning, the reward computation and shaping also happen in the agent environment.

Footnote 3: Note that the reward is usually defined by the user as a function of the state and action outside of the game-play environment.

In practice, the game architecture can be complex and it might be too costly for the game to directly communicate the complete state space information to the agent at every timestep. To train artificial agents, we create a universal interface between the game-play environment and the learning environment (Footnote 4). The interface extends OpenAI Gym [49] and supports actions that take arguments, which is necessary to encode action functions and is consistent with PySC2 [74, 55]. In addition, our training pipeline enables creating new players on the game server, logging in/out an existing player, and gathering data from expert demonstrations. We also adapt Dopamine [8] to this pipeline to make DQN [45] and Rainbow [30] agents available for training in the game. Additionally, we add support for more complex preprocessing other than the usual frame buffer stacking, which we explicitly exclude following the motivation presented in the next section.

Footnote 4: These environments may be physically separated, and hence, we prefer a thin (i.e., headless) client that supports fast cloud execution, and is not tied to frame rendering.
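The split between the two environments can be sketched as follows. This is a minimal illustration only, not the pipeline described above: `ToyGameplayEnv`, `AgentEnv`, `Action`, and the reward function `level_gain` are hypothetical names, the toy game rules are invented for the example, and the sketch mimics a Gym-style `reset`/`step` loop without depending on any library. It shows an action that carries arguments and a reward that is computed in the agent environment rather than in the game. The reward here also looks at the next state, which is one possible choice.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple

State = Dict[str, int]


@dataclass
class Action:
    """An action function plus its arguments, e.g. mine(amount=10)."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


class ToyGameplayEnv:
    """Hypothetical stand-in for a headless game client: it only advances state."""

    def __init__(self) -> None:
        self.state: State = {"gold": 0, "level": 1}

    def reset(self) -> State:
        self.state = {"gold": 0, "level": 1}
        return dict(self.state)

    def step(self, action: Action) -> State:
        if action.name == "mine":
            self.state["gold"] += action.args.get("amount", 1)
        elif action.name == "upgrade" and self.state["gold"] >= 10:
            self.state["gold"] -= 10
            self.state["level"] += 1
        return dict(self.state)  # no reward is produced by the game itself


class AgentEnv:
    """Gym-style wrapper: the reward is computed here, outside the game."""

    def __init__(self, game: ToyGameplayEnv,
                 reward_fn: Callable[[State, Action, State], float],
                 max_steps: int = 100) -> None:
        self.game = game
        self.reward_fn = reward_fn
        self.max_steps = max_steps
        self._steps = 0
        self._prev: State = {}

    def reset(self) -> State:
        self._steps = 0
        self._prev = self.game.reset()
        return self._prev

    def step(self, action: Action) -> Tuple[State, float, bool, dict]:
        nxt = self.game.step(action)
        reward = self.reward_fn(self._prev, action, nxt)
        self._prev = nxt
        self._steps += 1
        return nxt, reward, self._steps >= self.max_steps, {}


def level_gain(prev: State, action: Action, nxt: State) -> float:
    """Example user-defined reward: number of levels gained by the action."""
    return float(nxt["level"] - prev["level"])


env = AgentEnv(ToyGameplayEnv(), level_gain, max_steps=3)
obs = env.reset()
obs, r1, done1, _ = env.step(Action("mine", {"amount": 10}))
obs, r2, done2, _ = env.step(Action("upgrade", {"building_id": 0}))
```

Keeping the reward function outside the game means that a designer question can change (for example, rewarding exploration instead of level gain) without touching the game build.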

III-B State Abstraction

The use of the frame buffer as an observation of the game state has proved advantageous in eliminating the need for manual feature engineering in Atari games [45]. However, to achieve the objectives of RL in a fast-paced game development process, the drawbacks of using the frame buffer outweigh its advantages. The main considerations that we take into account when deciding in favor of a lower-dimensional engineered representation of the game state are:

  (a) During almost all stages of the game development, the game parameters are evolving on a daily basis. In particular, the art may change at any moment and the look of already learned environments can change overnight. Hence, it is desirable to train agents using features that are more stable to minimize the need for retraining agents.

  (b) Another important advantage of state abstraction is that it allows us to train much smaller models (networks) because of the smaller input size and use of carefully engineered features. This is critical for deployment in real-time applications in console and consumer PC environments where rendering, animation and physics occupy much of the GPU and CPU power.

  (c) In playtesting, the game-play environment and the learning environment may reside in physically separate nodes. Naturally, closing the RL state-action-reward loop in such environments requires a lot of network communication. The presence of frame buffers as the representative of game state would significantly increase this communication cost whereas derived game state features enable more compact encodings.

  (d) Obtaining an artificial agent in a reasonable time (a few hours at most) usually requires that the game be clocked at a rate much higher than the usual game-play speed. As rendering each frame takes a significant portion of every frame’s time, overclocking with rendering enabled is not practical. Additionally, moving large amounts of data from GPU to main memory drastically slows down the game execution and can potentially introduce simulation artifacts by interfering with the target timestep rate.

  (e) Last but not least, we can leverage the advantage of having privileged access to the game code to let the game engine distill a compact state representation that could be inferred by a human player from the game and pass it to the agent environment. By doing so we also have a better hope of learning in environments where the pixel frames only contain partial information about the state space.

The compact state representation could include the inventory, resources, buildings, the state of neighboring players, and the distance to target. In an open-world shooter game the features may include the distance to the adversary, angle at which the agent approaches the adversary, presence of line of sight to the adversary, direction to the nearest waypoint generated by the game navigation system, and other features. The feature selection may require some engineering efforts but it is logically straightforward after the initial familiarization with the game-play mechanics, and often similar to that of traditional game-playing AI, which will be informed by the game designer. We remind the reader that our goal is not to train agents that win but to simulate human-like behavior, so we train on information that would be accessible to a human player.
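As an illustration of such an engineered observation, the sketch below builds a small feature vector for the open-world shooter example. All names (`ShooterObservation`, `observe`), the 2D geometry, and the normalization constant `max_distance` are assumptions made for the example and are not taken from any actual game.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec2 = Tuple[float, float]


@dataclass
class ShooterObservation:
    """Hypothetical compact observation for an open-world shooter agent."""
    distance_to_adversary: float
    approach_angle: float          # radians, wrapped to [-pi, pi]
    has_line_of_sight: bool
    waypoint_direction: Vec2       # unit vector toward the nearest waypoint

    def to_vector(self, max_distance: float = 100.0) -> List[float]:
        """Flatten into a fixed-size, roughly normalized feature vector."""
        return [
            min(self.distance_to_adversary, max_distance) / max_distance,
            math.sin(self.approach_angle),   # sin/cos avoids the wrap-around jump
            math.cos(self.approach_angle),
            1.0 if self.has_line_of_sight else 0.0,
            self.waypoint_direction[0],
            self.waypoint_direction[1],
        ]


def observe(agent_pos: Vec2, agent_heading: float, adversary_pos: Vec2,
            line_of_sight: bool, waypoint_pos: Vec2) -> ShooterObservation:
    """Derive the features from raw positions supplied by the game engine."""
    dx = adversary_pos[0] - agent_pos[0]
    dy = adversary_pos[1] - agent_pos[1]
    angle = math.atan2(dy, dx) - agent_heading
    angle = math.atan2(math.sin(angle), math.cos(angle))  # wrap to [-pi, pi]
    wx = waypoint_pos[0] - agent_pos[0]
    wy = waypoint_pos[1] - agent_pos[1]
    norm = math.hypot(wx, wy) or 1.0  # guard against a zero-length direction
    return ShooterObservation(math.hypot(dx, dy), angle, line_of_sight,
                              (wx / norm, wy / norm))


example = observe((0.0, 0.0), 0.0, (3.0, 4.0), True, (0.0, 2.0))
features = example.to_vector()
```

A vector of six floats like this one is unaffected by art changes and is far cheaper to send across nodes than a frame buffer, which is the motivation given in the list above.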

IV Playtesting AI Agents

IV-A Measuring player experience for different player styles

In this section, we consider the early development of The Sims Mobile, whose game-play is about “emulating life”: players create avatars, called Sims, and conduct them through a variety of everyday activities. In this game, there is no single predetermined goal to achieve. Instead, players craft their own experiences, and the designer’s objective is to evaluate different aspects of that experience. In particular, each player can pursue different careers, and as a result will have a different experience and trajectory in the game. In this specific case study, the designer’s goal is to evaluate if the current tuning of the game achieves the intended balanced game-play experience across different careers. For example, different careers should prove similarly difficult to complete. We refer the interested reader to [63] for a more comprehensive study of this problem.

The game is single-player, deterministic, real-time, fully observable and the dynamics are fully known. We also have access to the complete game state, which is composed mostly of character and on-going action attributes. This simplified case allows for the extraction of a lightweight model of the game (i.e., state transition probabilities). While this requires some additional development effort, we can achieve a dramatic speedup in training agents by avoiding (reinforcement) learning and resorting to planning techniques instead.

In particular, we use the A* algorithm because it is simple to propose a heuristic that can be tailored to the specific designer need. A* explores the state transition graph directly instead of relying on more expensive iterative processes, such as dynamic programming (Footnote 5), or even more expensive Monte Carlo search based algorithms. The customizable heuristics and the target states corresponding to different game-play objectives, which represent the style we are trying to achieve, provide sufficient control to conduct various experiments and explore multiple aspects of the game.

Footnote 5: Unfortunately, in dynamic programming every node will participate in the computation, while it is often true that most of the nodes are not relevant to the shortest path problem in the sense that they are unlikely candidates for inclusion in a shortest path [9].

Our heuristic for the A* algorithm is the weighted sum of the 3 main parameters that contribute to career progression: career level, current career experience points, and number of completed career events. These parameters are directly related. To gain career levels, players have to accumulate career experience points, and to obtain experience, players have to complete career events. The weights are attributed based on the order of magnitude of each parameter. Since the career level is the most important, it receives the highest weight. The number of completed career events has the lowest weight because it is already partially factored into the amount of career points received so far.
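A weighted-sum heuristic of this kind can be sketched on a toy career model. Everything concrete below is an assumption made for illustration: the event names, the experience values, `XP_PER_LEVEL`, and the weights are hypothetical, and the weights are hand-picked so that the heuristic stays a lower bound on the remaining number of actions in this toy model. The 2,000-node cutoff mirrors the cutoff mentioned later in this section.

```python
import heapq
from itertools import count

XP_PER_LEVEL = 10
# Hypothetical career events: name -> experience points granted.
EVENTS = {"short_shift": 3, "long_shift": 5}
# Weights for (career level, experience points, completed events);
# level has the highest weight and completed events the lowest.
WEIGHTS = (1.9, 0.19, 0.01)


def progress(state):
    """Weighted sum of the three career-progression parameters."""
    level, xp, events = state
    return WEIGHTS[0] * level + WEIGHTS[1] * xp + WEIGHTS[2] * events


def heuristic(state, goal_level):
    """Weighted progress still missing to reach the goal level."""
    return max(0.0, WEIGHTS[0] * goal_level - progress(state))


def successors(state):
    """Apply each event once; experience rolls over into career levels."""
    level, xp, events = state
    for name, gain in EVENTS.items():
        total = xp + gain
        yield name, (level + total // XP_PER_LEVEL, total % XP_PER_LEVEL, events + 1)


def a_star(start, goal_level, max_nodes=2000):
    """Return the shortest list of events reaching goal_level, or None at the cutoff."""
    tie = count()  # tie-breaker so the heap never compares states or plans
    frontier = [(heuristic(start, goal_level), next(tie), 0, start, [])]
    closed = set()
    expanded = 0
    while frontier and expanded < max_nodes:
        _, _, cost, state, plan = heapq.heappop(frontier)
        if state[0] >= goal_level:
            return plan
        if state in closed:
            continue
        closed.add(state)
        expanded += 1
        for name, nxt in successors(state):
            if nxt not in closed:
                g = cost + 1  # every action costs one appointment
                heapq.heappush(
                    frontier,
                    (g + heuristic(nxt, goal_level), next(tie), g, nxt, plan + [name]),
                )
    return None


plan = a_star(start=(1, 0, 0), goal_level=3)
```

Changing the weights or the goal test is how a designer-specific objective would be expressed in this sketch. With weights that overestimate the remaining cost, the search would still terminate but might no longer return the shortest plan.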

Fig. 3: Comparison of the average number of career actions (appointments) taken to complete the career using A* search and evolution strategy, adapted from [63].

We also compare A* results to the results from an optimization problem over a subspace of utility-based policies approximately solved with an evolution strategy (ES) [58]. Our goal, in this case, is to achieve a high environment reward against a selected objective, e.g., reach the end of a career track while maximizing earned career event points. We design the ES objective accordingly. The agent performs an action $a$ based on a probabilistic policy by taking a softmax on the utility measure $U(s, a)$ of the actions in a game state $s$. Utility here serves as an action selection mechanism to compactly represent a policy. In a sense, it is a proxy to a state-action value function $Q(s, a)$ in RL. However, we do not attempt to derive utility from Bellman’s equation and the actual environment reward $R$. Instead, we learn parameters that define $U(s, a)$ to optimize the environment rewards using the black-box ES optimization technique. In that sense, optimizing $R$ by learning parameters of $U$ is similar to Proximal Policy Optimization (PPO), however, in much more constrained settings. To this end, we design the utility of an action as a weighted sum of the immediate action rewards $r(a)$ and costs $c(a)$. These are vector-valued quantities and are explicitly present in the game tuning describing the outcome of executing such actions. The parameters evolved by the ES are the linear weights for the utility function $U$ explained below and the temperature of the softmax function. An additional advantage of the proposed linear design of the utility function is a certain level of interpretability of the weights, which correspond to the utilities, as perceived by the agent, of the individual components of the resources or the immediate rewards. Such interpretability can guide changes to the tuning data.

Concretely, given the game state $s$, we design the utility $U$ of an action $a$ as

$$U(s, a) = r(a) \cdot v(s) + c(a) \cdot w(s).$$

The immediate reward $r(a)$ here is a vector that can include quantities like the amount of experience, the amount of career points earned for the action, and the events triggered by it. The cost $c(a)$ is a vector defined similarly. The action costs specify the amounts of resources required to execute such an action, e.g., how much time, energy, hunger, etc. a player needs to spend to successfully trigger and complete the action. The design of the tuning data makes both quantities $r$ and $c$ depend only on the action itself. Since both the immediate reward $r(a)$ and the cost $c(a)$ are vector-valued, the products in the definition of $U$ above are dot products. The vectors $v(s)$ and $w(s)$ introduce dependence of the utility on the current game state and are the weights defining the relative contribution of the immediate rewards and immediate resource costs towards the current goals of the agent.

The inferred utilities of the actions depend on the state since some actions in certain states are more beneficial than in other states. E.g., triggering a career event while not having enough resources to complete it successfully may be wasteful and an optimal policy should avoid it. The relevant state components include available commodities like energy and hunger and a categorical event indicator (0 if outside of the event and 1 otherwise) wrapped into a vector. We denote the total number of the relevant dimensions here by $k$. We design the weights $v(s)$ and $w(s)$ as bi-linear functions with the coefficients $V$ and $W$ that we are learning: $v_i(s) = \sum_{j=1}^{k} V_{ij} s_j$ and $w_i(s) = \sum_{j=1}^{k} W_{ij} s_j$.

To define the optimization objective $J$, we construct it as a function of the number of successfully completed events $N$ and the number of attempted events $M$. We aim to maximize the ratio of successful to attempted events times the total number of successful events in the episode as follows:

$$J(N, M) = \frac{N}{M + \epsilon} \cdot N,$$

where $\epsilon$ is a small number less than 1 eliminating division by zero when the policy fails to attempt any events. The overall optimization problem looks like:

$$\max_{V, W} \; J(N, M)$$

subject to the policy defined by actions selected with a softmax over their utilities parameterized with the (learned) parameters $V$ and $W$.
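The utility-based softmax policy and the episode objective can be sketched as below. This is a minimal illustration under stated assumptions: the symbol names `V` and `W` for the learned coefficient matrices, the example state layout (energy, hunger, event indicator), the two example actions, and all numeric values are hypothetical. The sketch only evaluates a fixed set of parameters and does not implement the evolution strategy that would search over them.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]


def utility(reward, cost, state, V, W):
    """Utility = reward . (V state) + cost . (W state), both dot products."""
    return dot(reward, matvec(V, state)) + dot(cost, matvec(W, state))


def softmax_policy(utilities, temperature=1.0):
    """Action probabilities from a temperature-scaled softmax over utilities."""
    top = max(utilities)  # subtract the max for numerical stability
    exps = [math.exp((u - top) / temperature) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def objective(successes, attempts, eps=1e-3):
    """Success ratio times number of successes; eps avoids division by zero."""
    return successes / (attempts + eps) * successes


# Hypothetical state: [energy, hunger, in_event indicator].
state = [0.8, 0.2, 1.0]
# Hypothetical actions with (reward vector, cost vector) from tuning data.
actions = {
    "career_event": ([5.0, 2.0], [1.0, 3.0]),
    "rest": ([1.0, 0.0], [0.5, 0.5]),
}
# Hypothetical coefficients; in the ES setting these would be evolved.
V = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
W = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]

names = list(actions)
utilities = [utility(r, c, state, V, W) for r, c in actions.values()]
probs = softmax_policy(utilities, temperature=1.0)
chosen = random.Random(0).choices(names, weights=probs, k=1)[0]
```

A lower temperature makes the policy closer to always picking the highest-utility action, while a higher temperature makes it closer to uniform, which is why the temperature is a natural parameter to tune alongside the weights.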

The utility-based ES, as we describe it here, captures the design intention of driving career progression in the game-play by successful completion of career events. Due to the emphasis on event completion, our evolution strategy setup does not necessarily result in an optimization problem equivalent to the one we solve with A*. However, as we discuss below, it has a similar optimum most of the time, supporting the design view on the progression. A similar approach works for evaluating relationship progression, which is another important element of the game-play.

We compare the number of actions that it takes to reach the goal for each career in Fig. 3 as computed by the two approaches. We emphasize that our goal is to show that a simple planning method, such as A*, can sufficiently satisfy the designer’s goal in this case. We can see that the more expensive optimization-based evolution strategy performs similarly to the much simpler A* search.

The largest discrepancy arises for the Barista career, which might be explained by the fact that this career has an action that does not reward experience by itself, but rather enables another action that does. This action can be repeated often and can explain the high numbers despite the career having half the number of levels. Also, we observe that in the case of the medical career, the 2,000 node A* cutoff was potentially responsible for the underperformance of that solution.

When running the two approaches, another point of comparison can be made: how many sample runs are required to obtain statistically significant results? We performed 2,000 runs for the evolution strategy. The A* agent, notably, learns a deterministic playstyle, which has no variance. On the other hand, the agent trained using an evolution strategy has a high variance and requires a sufficiently high number of simulation runs to approach a final reasonable strategy.
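One common way to reason about the required number of runs is through the standard error of the mean. The sketch below is an illustration of that general statistical idea, not the procedure used in this study: the function name `runs_needed`, the normal approximation behind `z = 1.96` (roughly a 95% confidence interval), and the pilot samples are all assumptions.

```python
import math
import statistics


def runs_needed(pilot_samples, half_width, z=1.96):
    """Estimate how many runs make the confidence interval on the mean
    no wider than +/- half_width, using the pilot sample's spread.

    Assumes roughly normal sampling error; z=1.96 corresponds to ~95%.
    """
    spread = statistics.stdev(pilot_samples)
    if spread == 0.0:
        return 1  # a deterministic agent needs a single run
    return math.ceil((z * spread / half_width) ** 2)


# A deterministic planner returns the same action count every time.
deterministic_runs = runs_needed([120, 120, 120, 120], half_width=1.0)

# A stochastic policy: hypothetical action counts from a small pilot batch.
stochastic_runs = runs_needed([10, 14], half_width=1.0)
```

The required number of runs grows with the square of the spread, so a high-variance policy can need orders of magnitude more simulations than a deterministic one to support the same conclusion.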


In this use case, we were able to use a planning algorithm, A*, to explore the game space and gather data for the game designers to evaluate the current tuning of the game. This was possible because the goal was straightforward: to evaluate progression in the different careers. As a result, the requirements of skill and style for the agent were achievable and simple to model. Over the next use cases, we analyze scenarios that call for different approaches as a consequence of having more complex requirements and subjective agent goals.

Fig. 4: This plot belongs to Section IV-B. Average cumulative reward (return) in training and evaluation for the agents as a function of the number of iterations. Each iteration is worth 60 minutes of game-play. The trained agents are: (1) a DQN agent with complete state space, (2) a Rainbow agent with complete state space, (3) a DQN agent with augmented observation space, and (4) a Rainbow agent with augmented observation space.

IV-B Measuring competent player progression

In the next case study, we consider a real-time multi-player mobile game, with a stochastic environment, with sequential actions. The game is designed to engage a large number of players for months. The game dynamics are governed by a complex physics engine, which makes it impractical to apply planning methods. This game is much more complex than The Sims Mobile in the sense that it requires the players to exhibit strategic decision making for them to progress in the game. When the game dynamics are unknown or complex, most recent success stories are based on model-free RL (and particularly variants of DQN and PPO). In this section, we show how such model-free control techniques fit into the paradigm of playtesting modern games.

In this game, the goal of the player (and subsequently the agent) is to level up and reach a particular milestone in the game. To this end, the player needs to make smart decisions in terms of resource mining and resource management for different tasks. In the process, the agent needs to also upgrade some buildings. Each upgrade requires a certain level of resources. If the player’s resources are insufficient, the upgrade is not possible. A human player will be able to visually discern the validity of such action by clicking on the particular building for upgrade. The designer’s primary concern in this case study is to measure how a competent player would progress in the early stages of this game. In particular, the competent player is required to balance resources and make other strategic choices that the agent needs to discern as well.

We consider a simplified version of the state space that contains information about this early stage of the game, ignoring the full state space. The relevant part of the state space consists of 50 continuous and 100 discrete state variables. The set of valid actions is a subset of an action space that consists of 25 action classes, some of which take values from a continuous range and some from a discrete set of choices. The agent has the ability to generate actions, but not all of them are valid at every game state, since the set of valid actions depends on the timestep and the game state. Moreover, the subset of valid actions may only partially be known to the agent. If the agent attempts to take an unavailable action, such as a building upgrade without sufficient resources, the action will be deemed invalid and no actual action will be taken by the game server.

While the problems of a huge state space [33, 68, 54], a continuous action space [42], and a parametric action space [29] could each be dealt with, these techniques are not directly applicable to our problem. This is because, as we discuss below, some actions will be invalid at times and inferring that information may not be fully possible from the observation space. Finally, the game is designed to last tens of millions of timesteps, taking the problem of training a functional agent in such an environment outside of the domain of previously explored problems.

We study game progression while taking only valid actions. As we already mentioned, the set of valid actions may not be fully determined by the current observation, and hence, we deal with a partially observable Markov decision process (POMDP). Given the practical constraints outlined above, it is infeasible to apply deep reinforcement learning to train agents in the game in its entirety. In this game, we show progress toward training an artificial agent that takes valid actions and progresses in the game like a competent human player. To this end, we wrap this game in the game environment and connect it to our training pipeline with DQN and Rainbow agents. In the agent environment, we use a feedforward neural network with two fully connected hidden layers, each with 256 neurons followed by ReLU activation.
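As a rough illustration of the function approximator described above, the following NumPy sketch implements a forward pass through two fully connected hidden layers of 256 units with ReLU activations. The input size of 150 (one value per state variable), the 25 discrete outputs, and the initialization scheme are assumptions made for the example; the paper does not specify how observations or actions are encoded.

```python
import numpy as np

def init_q_network(obs_dim, num_actions, hidden=256, seed=0):
    """Create weights for obs_dim -> 256 -> 256 -> num_actions.
    He-style initialization is an assumption, not taken from the paper."""
    rng = np.random.default_rng(seed)
    sizes = [obs_dim, hidden, hidden, num_actions]
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def q_values(params, obs):
    """Forward pass: ReLU after each hidden layer, linear output layer."""
    h = np.asarray(obs, dtype=float)
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h

# Hypothetical sizes: 150 observation features, 25 discrete action choices.
params = init_q_network(obs_dim=150, num_actions=25)
q = q_values(params, np.zeros(150))
greedy_action = int(np.argmax(q))
```

A DQN or Rainbow agent would train these weights from replayed transitions; only the network shape is sketched here.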

As a first step in measuring game progression, we define an episode by setting an early goal state in the game that takes an expert human player 5 minutes to reach. We let the agent submit actions to the game server every second. We may have to revisit this assumption for longer episodes where the human player is expected to interact with the game more periodically. We use a simple rewarding mechanism, where we reward the agent with ‘+1’ when they reach the goal state, ‘-1’ when they submit an invalid action, ‘0’ when they take a valid action, and ‘-0.1’ when they choose the “do nothing” action. The game is such that at times the agent has no other valid action to choose, and hence they should choose “do nothing”, but such periods do not last more than a few seconds in the early stages of the game, which is the focus of this case study.
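The rewarding mechanism above can be written down in a few lines. This is a minimal sketch: the function names, the `"do_nothing"` label, and the order in which the conditions are checked are assumptions made for illustration, while the four reward values are taken from the text.

```python
GOAL_REWARD = 1.0       # agent reaches the goal state
INVALID_PENALTY = -1.0  # game server flags the submitted action as invalid
IDLE_PENALTY = -0.1     # agent chooses the "do nothing" action

def step_reward(action, is_valid, reached_goal):
    """Reward for one submitted action (precedence of checks is assumed)."""
    if not is_valid:
        return INVALID_PENALTY
    if reached_goal:
        return GOAL_REWARD
    if action == "do_nothing":
        return IDLE_PENALTY
    return 0.0

def episode_return(steps):
    """Undiscounted return over (action, is_valid, reached_goal) tuples."""
    return sum(step_reward(*step) for step in steps)

# Hypothetical episode: one idle step and one invalid upgrade before the goal.
example = [("mine", True, False), ("do_nothing", True, False),
           ("upgrade", False, False), ("upgrade", True, True)]
example_return = episode_return(example)
```

With one-second steps and a five-minute episode, an episode contains on the order of 300 such decisions, so even a handful of invalid actions outweighs the single goal reward.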

We consider two different versions of the observation space, both extracted from the game engine (state abstraction). The first is what we call the “complete” state space. The complete state space contains information that is not straightforward to infer from the real observation in the game and is only used as a baseline for the agent. In particular, the complete state space also includes the list of available actions at each state. The polar opposite of this state space could be called the “naive” state space, which only contains straightforward information that is always shown on the screen of the player. The second state space we consider is what we call the “augmented” observation space, which contains information from the “naive” state space and information the agent would reasonably infer and retain from current and previous game observations. For example, this includes the amount of resources needed for an upgrade after the agent has checked a particular building for an upgrade. The augmented observation space does not include the set of all available actions, and hence, we rely on the game server to validate whether a submitted action is available because it is not possible to encode and pass the set of available actions. Hence, if an invalid action is chosen by the agent, the game server will ignore the action and will flag the action so that we can provide a ‘-1’ reward.

We trained four types of agents as shown in Fig. 4, where we are plotting the average undiscounted return in each training episode. By design, this quantity is upper bounded by ‘+1’, which is achieved if the agent keeps taking valid actions until reaching the final goal state. In reality, this may not always be achievable as there are periods of time where no action is available and the agent has to choose the “do nothing” action and be rewarded with ‘-0.1’. Hence, the best a competent human player would achieve on these episodes would be around zero.

We see that after a few iterations, both the Rainbow and DQN agents converge to their asymptotic performance values. The Rainbow agent converges to a better asymptotic performance level as compared to the DQN agent. However, in the transient behavior we observe that the DQN agent achieves the asymptotic behavior faster than the Rainbow agent. We believe this might be due to the fact that we did not tune hyperparameters of prioritized experience replay [60], and distributional RL [7]. (Footnote 6: This is consistent with the results of Section V-C, where Rainbow with default hyperparameters does not outperform DQN either.) We used the default values that worked best on Atari games with frame buffer as state space. Extra hyperparameter tuning would have been costly in terms of cloud infrastructure for this particular problem, as the game server does not allow speedup and a single training run already takes a few hours.

As expected, we see in Fig. 4 that the augmented observation space makes the training slower and also results in worse performance of the final strategy. In addition, at evaluation time, the agent keeps attempting invalid actions in some cases, as the state remains mostly unchanged after each attempt and the policy is (almost) deterministic. This results in large negative returns accumulating in such episodes, which accounts for the dips in the right-hand-side panel in Fig. 4 at evaluation time.

The observed behavior drew our attention to the question of whether it is too difficult to discern and keep track of the set of valid actions for a human player as well. In fact, after seeking more extensive human feedback the game designers concluded that better visual cues were needed for a human player on information about valid actions at each state so that the human players could progress more smoothly without being blocked by invalid actions. As next steps, we intend to experiment with shaping the reward function for achieving different play styles to be able to better model different player clusters. We also intend to investigate augmenting the replay buffer with expert demonstrations for faster training and also for generative adversarial imitation learning [32] once the game is released and human play data is available.

We remark that without state abstraction (streamlined access to the game state), the neural network function approximator used for Q-learning would have needed to discern all such information from the pixels in the frame buffer, and hence we would not have been able to get away with such a simple two-layer feedforward function approximator to solve this problem. However, we observe that the training within the paradigm of model-free RL remains costly. Specifically, even using the complete state space, it takes several hours to train an agent that achieves a level of performance expected of a competent human player on this relatively short episode of 5 minutes. This calls for the exploration of complementary approaches to augment the training process. In particular, we also would like to streamline this process by training reusable agents and capitalizing on existing human data through imitation learning.

V Game-playing AI

We have shown the value of simulated agents in a fully modeled game, and the potential of training agents in a complex game to model player progression for game balancing. We can take these techniques a step further and make use of agent training to help build the game itself. Instead of applying RL to capture player behaviors, we consider an approach to game-play design where the player agents learn behavior policies from the game designers. The primary motivation is to give direct control to the designer and enable easy interactive creation of various behavior types, e.g., aggressive, exploratory, stealthy, etc. At the same time, we aim to complement organic demonstrations with bootstrapping and heuristics to eliminate the need for a human to train an agent on states not normally encountered by humans, e.g., unblocking an agent using obstacle avoidance.

V-A Human-Like Exploration in an Open-World Game

To bridge the gap between the agent and the designer, we introduce imitation learning (IL) to our system [32, 53, 5, 10]. In the present application, IL allows us to translate the intentions of the game designer into a primer and a target for our agent learning system. Learning from expert demonstrations has traditionally proved very helpful in training agents, including in games [72]. In particular, the original AlphaGo [64] used expert demonstrations to train a policy network via supervised learning. While it is argued in subsequent work that learning via self-play could achieve a better asymptotic return compared to relying on expert demonstrations, the better performance comes with a significantly higher cost in terms of computational resources for training, and superhuman performance is not what we are seeking in this work. There are also other cases where the preferred solution for training agents would utilize a few relatively short demonstration episodes played by the software developers or designers at the end of the current development cycle [25].

Fig. 5: Model performance measures the probability of the event that the Markov agent finds at least one previous action from human-played demonstration episodes in the current game state. The goal of interactive learning is to add support for new game features to the already trained model or improve its performance in underexplored game states. Plotted is the model performance during interactive training from demonstrations in a proprietary open-world game as a function of time measured in milliseconds (with the total duration around 10 minutes). The corresponding section of the paper offers additional details for the presented experiment.

In this application, we consider training artificial agents in an open-world video game, where the game designer is interested in training non-player characters that exhibit certain behavioral styles. The game we are exploring is a shooter with contextual game-play and a destructible environment. While the game can run in multi-player mode, we focus on single-player, which provides us with an environment tractable yet rich enough to test the approach we discuss in this section. An agent in such a game would have its state composed of a 3D-vector location, velocity, animation pose, health, weapon type, ammunition, scope on-off, cone of sight, collision state, distance to the obstacles in the principal directions, etc. Overall, the dimensionality of the agent state can grow to several dozen variables, some of them continuous and the others categorical. We construct similar vectors for NPCs with which the agent needs to engage.

The NPC state variables account for partial observability and line-of-sight constraints imposed by the level layout, its destruction state, and the current location of the agent relative to the NPCs. The NPCs in this environment represent adversarial entities, trying to eliminate the agent by attacking it until the agent runs out of health. Additionally, the environment can contain objects of interest, like health and ammo boxes, dropped weapons, etc. The environment itself is non-deterministic, i.e., there is no single random seed which we can set to control all random choices in the environment. Additionally, frequent saving and reloading of the game state is not practical due to relatively long loading times.

Our main objective in this application of agent training is to provide the designer with a tool to playtest the game by interacting with it in a number of particular styles that emulate different players. The styles can include:

  • Aggressive, when an agent tries to find, approach and defeat any adversarial NPC,

  • Sniper, when an agent finds a good sniping spot and waits for adversarial NPCs to appear in the cone of sight to shoot them,

  • Exploratory, when an agent attempts to visit as many locations and uncover as many objects of interest in the level as possible without getting engaged into combat unless encountering an adversarial NPC,

  • Sneaky, when an agent actively avoids combat while trying to achieve its objectives, like reaching particular points on the map.

Additionally, combat style can vary with deploying different weapons, e.g., long-range or hand-to-hand.

Manually testing the game and evaluating the results while following the outlined styles is very time consuming and tedious. Having an agent that can replace a designer in this process would be a great time saver. Conceivably, an agent trained as the design helper can also play as a stand-in or an “avatar” of an actual human player to replace the player when she is not online or the connection drops out for a short period of time. An agent trained in a specific style can also fill a vacant spot in a squad, or help with the cold start problem for multi-player games.

In terms of the skill/style tradeoff laid out earlier in the paper, these agents are not designed to have any specific level of performance (e.g., a certain kill-death ratio) and they may not necessarily follow any long-term goals. These agents are intended to explore the game and also be able to interact with human players at a relatively shallow level of engagement. Hence, the problem boils down to efficiently training an agent using demonstrations capturing only certain elements of the game-play. The training process has to be computationally inexpensive and the agent has to imitate the behavior of the teacher(s) by mimicking their relevant style (in a statistical sense) for implicit representation of the teacher’s objectives.

Casting this problem directly into the RL framework is complicated by two issues. First, it is not straightforward how to design a rewarding mechanism for imitating the style of the expert. While inverse RL aims at solving this problem, its applicability is not obvious given the reduced representation of the huge state-action space that we deal with and the ill-posed nature of the inverse RL problem [1, 48]. Second, the RL training loop often requires thousands of episodes to learn useful policies, directly translating to a high cost of training in terms of time and computational resources. Hence, rather than using more complex solutions such as generative adversarial imitation learning [32], which use an RL network as their generator, we propose a solution to the stated problem based on an ensemble of multi-resolution Markov models. One of the major benefits of the proposed model is the ability to perform interactive training within the same episode. As a useful byproduct of our formulation, we can also sketch a mechanism for numerical evaluation of the style associated with the agents we train. We outline the main elements of the approach next and, for additional details, point to [14, 11, 12, 13].

V-A1 Markov Decision Process with Extended State

Following the established treatment of training artificial agents, we place the problem into the standard MDP framework and augment it as follows. Firstly, we ignore the difference between the observation and the actual state of the environment. The actual state may be observable by the teacher but may be impractical to expose to the agent. To mitigate the partial observability, we extend observations with a short history of previously taken actions. In addition to implicitly encoding the intent of a teacher and her reactions to potentially richer observations, this history also helps to preserve the stylistic elements of human demonstrations.

Concretely, we assume the following. The interaction of the agent and the environment takes place at discrete moments $t = 1, \ldots, T$, with the value of $t$ trivially observable by the agent. The agent, after receiving an observation $s_t$ at time $t$, can take an action $a_t \in A(s_t, t)$ from the set of allowed actions $A(s_t, t)$ using policy $\pi: s_t \to a_t$. Executing an action in the environment results in a new state $s_{t+1}$. Since we focus on the stylistic elements of the agent behavior, the rewards are inconsequential for the model we build, and we drop them from the discussion. Next, we consider the episode-based environment, i.e., after reaching a certain condition, the execution of the described state-action loop ends. A complete episode is a sequence $E = \{(s_t, a_t)\}_{t = 1, \ldots, T}$. The fundamental assumption regarding the described decision process is that it has the Markov property.

Besides the most recent action taken before time $t$, i.e., action $a_{t-1}$, we also consider a recent history of the past $n$ actions, where $1 \le n < T$, $\alpha_{t,n} := \{a_{t-n}, \ldots, a_{t-1}\}$, whenever it is defined in an episode $E$. For $n = 0$, we define $\alpha_{t,0}$ as the empty sequence. We augment the directly observed state $s_t$ with the action history $\alpha_{t,n}$, to obtain an extended state $S_{t,n} = (s_t, \alpha_{t,n})$.
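The construction of extended states can be sketched as follows. The episode is assumed to be a list of (state, action) pairs with hashable entries, indexing is zero-based, and the function name is hypothetical.

```python
def extended_states(episode, n):
    """Return ((state, last n actions), action) pairs for every step at
    which a history of n previous actions is defined in the episode."""
    pairs = []
    for t in range(n, len(episode)):
        s_t, a_t = episode[t]
        history = tuple(a for _, a in episode[t - n:t])
        pairs.append(((s_t, history), a_t))
    return pairs

# Hypothetical toy episode with symbolic states and actions.
toy_episode = [("s0", "left"), ("s1", "left"), ("s2", "fire"), ("s3", "right")]
order_2 = extended_states(toy_episode, 2)
order_0 = extended_states(toy_episode, 0)
```

For order 0 the history is the empty tuple, so the extended state reduces to the observed state.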

The purpose of including the action history is to capture additional information (i.e., stylistic features and the elements of the objective-driven behavior of the teacher) from the human controlling the input during interactive demonstrations. An extended policy $\pi_n: S_{t,n} \to a_t$, which operates on the extended states $S_{t,n}$, is useful for modeling human actions in a manner similar to $n$-gram text models in natural language processing (NLP) (e.g., [38], [76], [4]). Of course, the analogy with $n$-gram models in NLP works only if both state and action spaces are discrete. We will address this restriction in the next subsection using multi-resolution quantization.

For a discrete state-action space and various $n$, we can compute probabilities of transitions $S_{t,n} \to a_t$ occurring in (human) demonstrations and use them as a Markov model $M_n$ of order $n$ of (human) actions. We say that the model $M_n$ is defined on an extended state $S_{\cdot,n}$ if the demonstrations contain at least one occurrence of $S_{\cdot,n}$. When a model $M_n$ is defined on $S_{\cdot,n}$, we can use $M_n$ to sample the next action from all ever observed next actions in state $S_{\cdot,n}$. Hence, $M_n$ defines a partial stochastic mapping $M_n: S_{\cdot,n} \to A$ from extended states to the action space $A$.
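A minimal sketch of such a model of a fixed order, assuming discrete, hashable states and actions. The class and method names are hypothetical; storing every observed next action in a list means that uniform sampling from the list reproduces the empirical transition probabilities.

```python
import random
from collections import defaultdict

class MarkovModel:
    """Order-n model mapping an extended state to the actions observed after it."""

    def __init__(self, n):
        self.n = n
        self.table = defaultdict(list)

    def fit(self, episode):
        for t in range(self.n, len(episode)):
            s_t, a_t = episode[t]
            history = tuple(a for _, a in episode[t - self.n:t])
            self.table[(s_t, history)].append(a_t)
        return self

    def sample(self, state, history, rng=random):
        """Return a sampled action, or None if the model is not defined here."""
        if len(history) < self.n:
            return None
        key = (state, tuple(history[len(history) - self.n:]))
        actions = self.table.get(key)
        return rng.choice(actions) if actions else None

# Hypothetical demonstration alternating between two states.
demo = [("a", "L"), ("b", "R"), ("a", "L"), ("b", "R")]
model = MarkovModel(1).fit(demo)
```

Returning `None` on an unseen extended state is what makes the mapping partial.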

V-A2 Stacked Markov models

We call a sequence of Markov models $\mathcal{M}_n = \{M_i\}_{i = 0, \ldots, n}$ a stack of models. A (partial) policy defined by $\mathcal{M}_n$ computes the next action at a state $s_t$; see [14] for the pseudo-code of the corresponding algorithm. Such a policy performs simple behavior cloning. The policy is partial since it may not be defined on all possible extended states and needs a fallback policy to provide a functional agent acting in the environment.

Note that it is possible to implement sampling from a Markov model using an $O(1)$ complexity operation with hash tables, making the inference very efficient and suitable for real-time execution in a video game or other interactive application where expected inference time has to be on the scale of 1 ms or less. (Footnote 7: A modern video game runs at least at 30 frames per second, with many computations happening during the roughly 33 ms allowed per frame, drastically limiting the “budget” allocated for inference.)
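The following sketch shows one way a stack of models could be queried with hash-table lookups. Trying the highest order first and relaxing towards order 0 is an assumption about the lookup order (the actual pseudo-code is in [14]); the function names and the fallback signature are hypothetical.

```python
import random

def build_stack(episode, max_order):
    """One dict per order 0..max_order, keyed by (state, last n actions)."""
    stack = []
    for n in range(max_order + 1):
        table = {}
        for t in range(n, len(episode)):
            s_t, a_t = episode[t]
            key = (s_t, tuple(a for _, a in episode[t - n:t]))
            table.setdefault(key, []).append(a_t)
        stack.append(table)
    return stack

def stack_policy(stack, state, history, fallback, rng=random):
    """Query from the highest order down; use `fallback` if nothing matches."""
    for n in range(len(stack) - 1, -1, -1):
        if len(history) < n:
            continue
        key = (state, tuple(history[len(history) - n:]))
        actions = stack[n].get(key)  # average-case O(1) dictionary lookup
        if actions:
            return rng.choice(actions)
    return fallback(state)

# Hypothetical demonstration and a trivial fallback policy.
demo = [("a", "L"), ("b", "R"), ("a", "L"), ("b", "R")]
stack = build_stack(demo, max_order=2)
noop = lambda state: "noop"
```

Each query performs at most one lookup per order, so the cost per decision is bounded by the stack depth rather than by the number of demonstrated steps.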

V-A3 Quantization

Quantization (aka discretization) allows us to work around the limitation of discrete state-action space enabling the application of the Markov Ensemble approach to environments with continuous dimensions. Quantization is commonly used in solving MDPs [75] and has been extensively studied in the signal processing literature [50], [27]. Using quantization schemes that have been optimized for specific objectives can lead to significant gains in model performance, improving various metrics vs. ad-hoc quantization schemes, e.g., [75], [52].

Instead of trying to pose and solve the problem of optimal quantization, we use a set of quantizers covering a range of schemes from coarse to fine. At the conceptual level, such an approach is similar to multi-resolution methods in image processing, mip-mapping and Level-of-Detail (LoD) representations in computer graphics [37]. The simplest quantization is a uniform one with step $\sigma$:

$$Q_\sigma(x) = \sigma \left\lfloor \frac{x}{\sigma} \right\rfloor.$$

For illustration purposes, it is sufficient to consider only the uniform quantization $Q_\sigma$. In practice, most variables have naturally defined limits which are at least approximately known. Knowing the environment scale gives an estimate of the smallest step size $\sigma_0$ at which we will have complete information loss, i.e., all observed values map to a single bin. For each continuous variable in the state-action space, we consider a sequence of quantizers with decreasing step size $Q = \{Q_{\sigma_j}\}_{j = 0, \ldots, K}$, $\sigma_j > \sigma_{j+1}$, which naturally gives a quantization sequence $\bar{Q}$ for the entire state-action space, provided $K$ is fixed across the continuous dimensions. To simplify notation, we collapse the sub-index and write $Q_j$ to stand for $Q_{\sigma_j}$. For more general quantization schemes, the main requirement is a decreasing reconstruction error for $Q_{j+1}$ in comparison to $Q_j$.
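A short sketch of a coarse-to-fine sequence of uniform quantizers, assuming the common floor-based form of uniform quantization. Halving the step at each level is an arbitrary choice made for the example; the text only requires that the step size decreases.

```python
import math

def uniform_quantize(x, step):
    """Map x to the lower edge of its bin of width `step`."""
    return step * math.floor(x / step)

def quantizer_steps(scale, levels):
    """Steps from coarsest (the full scale) to finest; halving is an assumption."""
    return [scale / (2 ** j) for j in range(levels + 1)]

def multires(x, steps):
    """Quantized values of x at every resolution, coarse to fine."""
    return [uniform_quantize(x, s) for s in steps]

# Hypothetical variable known to lie in [0, 8).
steps = quantizer_steps(scale=8.0, levels=3)
values = multires(5.5, steps)
```

At the coarsest level every value in the assumed range falls into one bin, which corresponds to the complete information loss mentioned above.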

For an episode $E$, we compute its quantized representation in an obvious component-wise manner:

$$E_j = \bar{Q}_j(E) = \{(\bar{Q}_j(s_t), \bar{Q}_j(a_t))\}_{t = 1, \ldots, T},$$

which defines a multi-resolution representation of the episode as a corresponding ordered set $\{E_j\}_{j = 0, \ldots, K}$ of quantized episodes, where $\bar{Q}$ is the vector version of quantization $Q$.

In the quantized Markov model $M_{n,j} = \bar{Q}_j(M_n)$, which we construct from the episode $E_j$, we compute extended states using the corresponding quantized values. Hence, the extended state is $\bar{Q}_j(S_{t,n}) = (\bar{Q}_j(s_t), \bar{Q}_j(\alpha_{t,n}))$. Further, we define the model $\bar{Q}_j(M_n)$ to contain probabilities for the original action values. In other words, we do not rely on the reconstruction mapping $\bar{Q}_j^{-1}$ to recover the action but store the original actions explicitly. In practice, continuous action values tend to be unique, and the model samples from the set of values observed after the occurrences of the corresponding extended state. Our experiments show that replaying the original actions instead of their quantized representation provides better continuity and a natural, true-to-the-demonstration look of the cloned behavior.

V-A4 Markov Ensemble

Combining stacking and multi-resolution quantization of Markov models, we obtain the Markov Ensemble as an array of Markov models parameterized by the model order $n$ and the quantization schema $Q_j$:

$$\mathcal{M}_{N,K} = \{M_{i,j}\}, \quad i = 0, \ldots, N, \quad j = 0, \ldots, K. \qquad (2)$$

The policy defined by the ensemble (2) computes each next action in an obvious manner. The Markov Ensemble technique, together with the policy defined by it, is our primary tool for cloning behavior from demonstrations.

Note that with the coarsest quantization present in the multi-resolution schema, the policy should always return an action sampled using one of the quantized models, which at order 0 always finds a match. Hence, such models always “generalize” by resorting to simple sampling of actions when no better match is found in the observations. Excluding quantizers that are too coarse and Markov order 0 will result in executing some “default policy”, which we discuss in the next section. The agent execution with the outlined ensemble of quantized stacked Markov models is easy to express as an algorithm, which in essence boils down to look-up tables [14].
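The ensemble look-up can be sketched as a grid of hash tables indexed by model order and quantization level. This is an illustrative simplification: states and actions are single floats, the same step is applied to both, and the search order (highest order first, finest quantizer first) is an assumption; see [14] for the actual algorithm. Original action values are stored and replayed, as described above.

```python
import math
import random

def q(x, step):
    return step * math.floor(x / step)

def build_ensemble(episode, max_order, steps):
    """tables[(n, j)] maps a quantized extended state to ORIGINAL actions."""
    tables = {}
    for n in range(max_order + 1):
        for j, step in enumerate(steps):
            table = {}
            for t in range(n, len(episode)):
                s_t, a_t = episode[t]
                hist = tuple(q(a, step) for _, a in episode[t - n:t])
                table.setdefault((q(s_t, step), hist), []).append(a_t)
            tables[(n, j)] = table
    return tables

def ensemble_policy(tables, max_order, steps, state, history, rng=random):
    """Relax from high order / fine steps towards order 0 / coarse steps."""
    for n in range(max_order, -1, -1):
        if len(history) < n:
            continue
        recent = history[len(history) - n:]
        for j in range(len(steps) - 1, -1, -1):  # steps are ordered coarse to fine
            key = (q(state, steps[j]), tuple(q(a, steps[j]) for a in recent))
            actions = tables[(n, j)].get(key)
            if actions:
                return rng.choice(actions)
    return None  # the caller would apply a default policy here

# Hypothetical demonstration with scalar states and actions in [0, 1).
demo = [(0.10, 0.31), (0.52, 0.29), (0.90, -0.12)]
steps = [1.0, 0.25]
tables = build_ensemble(demo, max_order=1, steps=steps)
```

In this toy setup a query at state 0.55 with previous action 0.30 falls into the same fine bins as the second demonstrated step and replays its original action 0.29.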

V-A5 Interactive Training of Markov Ensemble

If the environment allows a human to override the currently executing policy and record new actions (demonstrations), then we can generate a sequence of demonstrations produced interactively. For each demonstration, we construct a new Markov Ensemble and add it to the sequence (stack) of already existing models. The policy based on these models consults the latest one first. If the consulted model fails to produce an action, the next model is asked, and so on, until one of them returns a sampled action or there are no other models left. Thanks to the sequential organization, the latest demonstrations take precedence over the earlier ones, which allows correcting previous mistakes or adding new behavior for previously unobserved situations. We illustrate the logic of such an interaction with the sample git repository [16]. The computational cost for each ensemble, as already noted, is a small constant, while the overall complexity grows linearly with the number of demonstrations, allowing a sufficiently long interaction of a user with the environment and the training of a more powerful policy. In our case studies, we show that often even a small number of strategically provided demonstrations results in a well-behaving policy.

While the spirit of the outlined idea is similar to that of DAgger [57], providing labels on the newly generated samples is more time consuming than providing new demonstrations. The interactivity could also be used to support newly added features or to update the existing model otherwise. The designer can directly interact with the game, select a particular moment where a new demonstration is required, adjust the initial location of the character object, and run a short demonstration without reloading the game. The interactivity eliminates most of the complexity of the agent design process and brings down the cost of gathering data from under-explored parts of the state space.

We report an example chart for such an interactive training in Fig. 5. The goal in this example is to train an agent capable of an attack behavior. The training in the figure starts with the most basic game-play, when the designer provides a demonstration for approaching the target. The next training period happens after observing the trained model for a short period of time. In between the training periods, the designer makes sure that the agent reaches the intended state and is capable of executing already learned actions. The second training period adds more elements to the behavior, i.e., an agent learns to attack the target with a gun. It is possible to continue training from that point and introduce more sophisticated game-play, e.g., closer approach and melee combat. To provide additional feedback to the designer, we plot a (near) real-time chart showing how well the current model generalizes to the current conditions in the game environment. The figure covers several minutes of game-play, and the sliding window size is approximately one second, or 30 frames. The designer judges the quality of the model both visually and using the plotted metric of the model “competence”, i.e., its ability to generalize. The competence here is equated to the model performance and is a metric of how many states the model can handle by returning an action. The figure shows that the model competence grows as it accumulates more demonstrations in each of the two training segments. The competence metric is a natural proxy for evaluating how close the model behavior is stylistically to the demonstrations. Additional details on the interactive training are available from [14] and the repository [16], which allows experimentation with two classic control OpenAI environments.
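The competence metric can be approximated by a sliding-window hit rate, as in the sketch below. The class name and interface are hypothetical; the 30-frame default follows the window size quoted in the text, and a "hit" is assumed to mean that the model returned an action for the current state rather than deferring to a fallback.

```python
from collections import deque

class CompetenceMeter:
    """Fraction of recent frames in which the model returned an action."""

    def __init__(self, window=30):
        self.hits = deque(maxlen=window)  # old frames drop out automatically

    def update(self, model_returned_action):
        self.hits.append(1 if model_returned_action else 0)
        return sum(self.hits) / len(self.hits)

# Hypothetical short trace with a window of 4 frames.
meter = CompetenceMeter(window=4)
trace = [meter.update(hit) for hit in [True, False, True, True, True]]
```

Plotting the returned value once per frame gives a chart of the kind described for Fig. 5.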

The prolonged period of training may increase the size of the model with many older demonstrations already irrelevant, not used for inference, but still contributing to the model size. Instead of using rule-based compression of the resulting model ensemble, in the next subsection, we discuss the creation of a DNN model trained from the ensemble of Markov models via a novel bootstrap approach using the game itself as the way to compress the model representation and strip off obsolete demonstration data. Using the proposed approach, we train an agent that satisfies the design needs in only a few hours of interactive training.

V-A6 A sketch of style distance with Markov Ensemble

The ensemble of models defined above allows us to introduce a candidate metric for measuring the stylistic difference between behaviors $\pi_a$ and $\pi_b$ represented by the corresponding sets of episodes. For a fixed quantization scheme, we can compute a sample distribution of the $n$-grams for both behaviors, which we denote as $v_n(\pi_a)$ and $v_n(\pi_b)$. Then the “style” distance between $\pi_a$ and $\pi_b$ can be estimated using the formula:

$$D_\lambda(\pi_a, \pi_b) = \frac{\sum_{n} \lambda^{n} \, d\big(v_n(\pi_a), v_n(\pi_b)\big)}{\sum_{n} \lambda^{n}},$$

where the parameter $\lambda > 0$ emphasizes the contribution of shorter or longer $n$-grams. As defined, a larger $\lambda$ puts more weight on longer $n$-grams and as such values more complex sequences of actions more. The function $d$ is one of the probability distances. We used Jensen-Shannon (JSD) and Hellinger (HD); both vary in the range $[0, 1]$, hence $D_\lambda$ is also in $[0, 1]$. The introduced distance can augment the traditional RL rewards to preserve style during training of an agent without human inputs, as we discuss in [13]. However, the main motivation for introducing the distance $D_\lambda$ is to provide a numerical metric to evaluate how demonstrations and the learned policy differ in terms of style, without the need to visually inspect them in the environment.
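As an illustration, the sketch below computes a style distance between two action sequences as a weighted average of Jensen-Shannon divergences between their n-gram distributions. Several choices here are assumptions rather than details from the paper: n-grams are taken over discrete action labels only, orders run from 1 to `max_n`, the weights are powers of the emphasis parameter normalized to sum to one, and the base-2 divergence is used so that each term lies in [0, 1].

```python
import math
from collections import Counter

def ngram_distribution(actions, n):
    """Empirical distribution of length-n action windows."""
    grams = [tuple(actions[i:i + n]) for i in range(len(actions) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()}

def jensen_shannon(p, q):
    """Jensen-Shannon divergence with log base 2 (value in [0, 1])."""
    total = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0:
            total += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log2(qk / mk)
    return total

def style_distance(actions_a, actions_b, max_n, lam=1.0):
    """Weighted average of per-order divergences; larger lam favors long n-grams."""
    orders = range(1, max_n + 1)
    weights = [lam ** n for n in orders]
    dists = [jensen_shannon(ngram_distribution(actions_a, n),
                            ngram_distribution(actions_b, n)) for n in orders]
    return sum(w * d for w, d in zip(weights, dists)) / sum(weights)

# Hypothetical action traces.
same = style_distance("LRLRLR", "LRLRLR", max_n=3)
disjoint = style_distance("LLLL", "RRRR", max_n=3, lam=2.0)
```

Identical traces give a distance of 0, and traces with no n-grams in common give 1, which matches the stated range of the metric.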

V-B Bootstrapped DNN agent

While the ensemble of multi-resolution Markov models described in the previous section has many useful properties, it suffers from several drawbacks. One is the linear growth of the demonstrations dataset resident in RAM, making it less efficient as the number of demonstrations grows. The other problem stems from the limited nature of the human demonstrations. In particular, humans proactively take certain actions, e.g., dynamically avoid obstacles, and there are few if any “negative” examples where a human fails to navigate smoothly and has to recover from, say, a blocked state. Due to the lack of such states in demonstrations, the Markov agent would not be able to deal with blocked states efficiently and could escape only by chance via random sampling of actions. To address both issues, we introduce a bootstrapped DNN agent described in this subsection.

When generating bootstrapped episodes, we use the Markov model augmented with heuristics dealing with the states not encountered in the demonstrations. For instance, for the blocked state, it is possible to implement a simple obstacle avoidance fall-back policy that consistently unblocks an agent using its state as an input. Combining such heuristics with the demonstrations makes the bootstrapped training dataset much richer.

We treat the existing demonstrations as a training set for a supervised learning problem where we need to predict the next action from a sequence of observed state-action pairs. This approach has proved to be useful in pre-training of self-driving cars [46] and is also a common starting point for many imitation learning methods. Since our database of demonstrations is relatively small, it is desirable to generate more data by bootstrapping the dataset, for which we use our base Markov agent interacting with the game.
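To make the supervised formulation concrete, the sketch below turns one episode into (input, target) samples. The helper name `make_supervised_pairs`, the `history` length, and the choice to append the current state to the preceding state-action pairs are assumptions for illustration only.

```python
def make_supervised_pairs(episode, history=2):
    """Build next-action prediction samples from one episode.

    episode: list of (state, action) pairs in time order.
    Each sample pairs the previous `history` state-action pairs plus the
    current state (an assumed input layout) with the action taken next.
    """
    samples = []
    for t in range(history, len(episode)):
        context = tuple(episode[t - history:t])  # previously observed pairs
        current_state, target_action = episode[t]
        samples.append((context + (current_state,), target_action))
    return samples
```

An episode of length T yields T minus `history` samples, so the same routine applies unchanged to human demonstrations and to bootstrapped episodes.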

| | OpenAI 1v1 Bot | Bootstrapped Agent |
|---|---|---|
| Experience | 300 years (per day) | 5 min human |
| Bootstrap using game client | N/A | 5-20 |
| CPU | 60,000 CPU cores on Azure | 1 local CPU |
| GPU | 256 K80 GPUs on Azure | N/A |
| Size of observation | 3.3kB | 0.5kB |
| Observations per second of game-play | 10 | 33 |
TABLE I: Comparison between OpenAI 1v1 Dota 2 Bot [24] training metrics and training an agent via bootstrap from human demonstrations in a proprietary open-world game. The comparison is not apples-to-apples because the objectives of the training are very different. However, the environments are in a similar ballpark of complexity. The metrics below highlight the details of the practical training of agents during the game development cycle. The point of this seemingly unfair comparison is to illustrate that the training objectives play a critical role in the process.

Such a bootstrap process is easy to parallelize since we can have multiple simulations running without the need to cross-interact as in some learning algorithms like A3C [44]. The generated augmented dataset is used to train a DNN that predicts the next action from the already observed state-action pairs. Due to partial observability, the low dimensionality of the feature space results in fast training in a wide range of model architectures, allowing a quick experimentation loop. We converged on a simple model with a single “wide” hidden layer for motion control channels and a DNN model for discrete channels responsible for turning on/off actions such as sprinting, firing, and climbing. The approach shows promise even though many opportunities to improve its efficiency remain unexplored.

A reasonable architecture for both DNN models can be inferred from the tasks they solve. For the motion controller, the only hidden layer roughly corresponds to the temporal-spatial quantization levels in the base Markov model. When using ReLUs for the motion controller hidden layer, we start experimentation with their number equal to twice the number of quantization steps per input variable. Intuitively, training encodes those quantization levels into the layer weights. Adding more depth may help to better capture stylistic elements of the motion, e.g., moving by strafing left-right to avoid being shot. To prevent overfitting, the model complexity in traditional ML should be minimized depending on the size of the training dataset. In our case, overfitting to the few demonstrations may result in a better representation of the style, yet may lead to degraded in-game performance, e.g., an agent will not achieve game-play objectives as efficiently. In our experiments, we find that consistent (vs. random) demonstrations require only a single hidden layer for the motion controller to reproduce basic stylistic features of the agent motion. A useful rule of thumb for the discrete-action DNN is to start with the number of layers roughly equal to the maximum order of the Markov model used to drive the bootstrap and to increase the model complexity conservatively, only as needed. For such a DNN, we use fully connected layers with the number of ReLUs per layer roughly equal to double the dimensionality of the input space.
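These rules of thumb reduce to simple arithmetic, sketched below as starting points for experimentation. The function names and the example sizes in the usage note are hypothetical; only the "twice the quantization steps", "layers equal to the maximum Markov order", and "double the input dimensionality" rules come from the text.

```python
from typing import List


def motion_hidden_units(quantization_steps_per_input: int) -> int:
    """Starting width of the single ReLU hidden layer of the motion controller."""
    return 2 * quantization_steps_per_input


def discrete_dnn_layers(max_markov_order: int, input_dim: int) -> List[int]:
    """Starting hidden-layer widths for the fully connected discrete-action DNN."""
    return [2 * input_dim] * max_markov_order


def dense_parameter_count(input_dim: int, hidden: List[int], output_dim: int) -> int:
    """Weights plus biases of a fully connected network with the given layer sizes."""
    sizes = [input_dim] + hidden + [output_dim]
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))
```

For example, with a hypothetical 10-dimensional input, a maximum Markov order of 3, and 4 discrete outputs, `discrete_dnn_layers(3, 10)` gives three hidden layers of 20 units, for 1144 trainable parameters in total. Networks of this size are small enough to retrain within a quick experimentation loop.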

A practical solution emerging from the discussed techniques (multi-resolution Markov ensemble model, interactive training, heuristics and a DNN model) is to combine all of them in a single iterative training framework. In such a workflow, we interleave interactive training using Markov models, developing heuristics to complement demonstration, bootstrapping and training a DNN, which is the output of this workflow. By integrating this workflow with the game development process, we can naturally discard obsolete demonstrations, add new game-play features using version control, and deliver compact real-time models suitable for exploration of levels using differently styled behaviors. We envision the proposed workflow as a major tool in the toolbox of a designer or quality engineer working on a game project.

Table I illustrates the computational resources that solving this problem with the outlined 3-step process required as compared to training 1v1 agents in Dota 2 [24]. While we acknowledge that the goal of our agent is not to play optimally against the opponent and win the game, we observe that using model-based training augmented with expert demonstrations to solve the Markov decision process, in a complex game, results in huge computational savings compared to an optimal reinforcement learning approach.

V-C Assistive game-playing AI

Our last case study involves a team sports game, where the designer’s goal is to train agents that can learn strategic teamplay and complement arbitrary human player styles. For example, if the human player is more offensive, we would like the teammate agents to be more defensive, and vice versa. The game in question involves two teams that start on opposite sides of the field and try to score the most points before time runs out. To score a point, a team needs to make the ball go past the goal line on the side of the field where its opponents start, scoring a goal. As in several team sports games, the players (teams) have to fight for possession of the ball to be able to score, and hence ball control is a big component of this game. We emphasize that this is a more complex challenge than the previous case study, which was concerned with exploration of the game world. As the agent in this game is required to make strategic decisions, we resort to reinforcement learning to solve this problem.

Fig. 6: A screenshot of the simple team sports simulator (STS2). The red agents are home agents attempting to score at the upper end and the white agents are away agents attempting to score at the lower end. The highlighted player has possession of the ball and the arrows demonstrate a pass/shoot attempt.

Our training takes place on a simple team sports simulator (STS2). (Footnote 8: We intend to release this game-play environment as an open-source package.) A screenshot of STS2 game-play is shown in Fig. 6. The simulator embeds the rules of the game and the physics at a high level, abstracting away the low-level tactics. The simulator supports NvN matches for any positive integer N. The two teams are shown as red (home) and white (away). Each of the players can be controlled by a human, a pre-built scripted agent, or any other learned policy. The scripted agent consists of a handful of rules and constraints that govern its game-play strategy, and is most similar to the game-controlled opponents usually implemented in adversarial games. The STS2 state space consists of the coordinates of the players and their velocities as well as an indicator for the possession of the ball. The action space is discrete and consists of left, right, forward, backward, pass, and shoot. Although the player can hit two or more of the actions together, we do not consider that possibility, in order to keep the complexity of the action space manageable.
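The state and action spaces described above could be represented along the following lines. This is a hypothetical sketch and not the STS2 interface: the class names, the player ordering, and the one-hot encoding of the possession indicator are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Action(Enum):
    """The six discrete actions; simultaneous actions are not modeled."""
    LEFT = 0
    RIGHT = 1
    FORWARD = 2
    BACKWARD = 3
    PASS = 4
    SHOOT = 5


@dataclass
class PlayerState:
    position: Tuple[float, float]
    velocity: Tuple[float, float]


@dataclass
class GameState:
    players: List[PlayerState]        # assumed order: home players, then away players
    ball_owner: Optional[int] = None  # index into players, or None if the ball is loose

    def to_features(self) -> List[float]:
        """Flatten positions, velocities, and a one-hot possession indicator."""
        features: List[float] = []
        for player in self.players:
            features.extend(player.position)
            features.extend(player.velocity)
        features.extend(
            1.0 if i == self.ball_owner else 0.0 for i in range(len(self.players))
        )
        return features
```

Under these assumptions a 2v2 match yields a 20-dimensional feature vector (four values per player plus four possession flags), which keeps the input to a learned policy small.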

As the simplest multi-agent mode, we consider the game in the 2v2 mode. Our goal in this case study is to train a teammate agent that can adapt to a human player’s style. We show two scenarios. In one scenario, the human player is a novice player, and in the other the human player is a good offensive player. In the first case, we train a cooperative game-playing AI that compensates for the low-skill player. In the second case, we train a cooperative defensive agent that complements a high-skill offensive agent. A more comprehensive study on the material in this section appears in [80].

V-C1 Game-playing AI to assist a low-skill player

We consider training an agent in a 2v2 game that can assist a low-skill player. We let scripted agents take control of the opponent players. We also choose a low-skill scripted AI to control the teammate player. The goal is to train a cooperative agent that complements the low-skill agent. In this experiment, we provide a ‘+/-1’ reward for scoring. We also provide a ‘+/-0.8’ individual reward for the agent for gaining/losing possession of the ball. This reward encourages the agent to win the ball back from the opponent and score. We ran the experiment using DQN, PPO, and Rainbow (with its default hyperparameters). PPO requires an order of magnitude fewer trajectories for convergence, and the final policy is similar to that of DQN. However, Rainbow did not converge at all with the default hyperparameters, and we suspect that the prioritized experience replay [60] is sensitive to hyperparameters.
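The reward described above can be written as a small function of per-step events. The sketch below is an assumed formulation: the event flags and the function name are hypothetical, and only the magnitudes of 1 and 0.8 come from the text.

```python
def step_reward(team_scored: bool, opponent_scored: bool,
                gained_possession: bool, lost_possession: bool,
                score_reward: float = 1.0,
                possession_reward: float = 0.8) -> float:
    """Sparse scoring reward plus an individual possession reward for one step."""
    reward = 0.0
    if team_scored:
        reward += score_reward
    if opponent_scored:
        reward -= score_reward
    if gained_possession:
        reward += possession_reward
    if lost_possession:
        reward -= possession_reward
    return reward
```

Keeping the two magnitudes as parameters makes it straightforward to test how sensitive the learned style is to the balance between scoring and ball control.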

The team statistics for this agent are shown in Table II. As can be seen, the agent has learned an offensive game-play style where it scores most of the time. It also keeps more possession than the rest of the agents in the game.

| Statistics | DQN-1 | Scripted Agent | Opponent 1 | Opponent 2 |
|---|---|---|---|---|
| Score rate | 54% | 20% | 13% | 13% |
| Possession | 30% | 18% | 26% | 26% |
TABLE II: Offensive DQN agent in a 2v2 match against two scripted agents while also partnered with a scripted agent, with a sparse ‘+/-1’ reward for scoring and a ‘+/-0.8’ individual reward for gaining/losing possession of the ball.

V-C2 Game-playing AI to assist a high-skill offensive player

Next, we report training an agent that complements a high-skill offensive player. In particular, we train an agent that complements the DQN-1 agent that was trained in the previous experiment. We train another agent as the teammate using exactly the same reward mechanism as the one used in training the offensive DQN-1 agent. The statistics of the game-play for the two agents playing together against the scripted agents are shown in Table III. While the second agent is trained with the same reward function as the first one, it is trained in a different environment, as the teammate is now the offensive DQN-1 agent trained in the previous experiment rather than the scripted agent. As can be seen, the second agent now becomes defensive and is more interested in protecting the net, regaining possession of the ball, and passing it to the offensive teammate. We can also see that the game stats for DQN-2 are similar to those of the scripted agent in the previous experiment.

| Statistics | DQN-1 | DQN-2 | Opponent 1 | Opponent 2 |
|---|---|---|---|---|
| Score rate | 50% | 26% | 12% | 12% |
| Possession | 28% | 22% | 25% | 25% |
TABLE III: Two DQN agents in a 2v2 match against two scripted agents, with a sparse ‘+/-1’ reward for scoring and a ‘+/-0.8’ individual reward for gaining/losing possession of the ball.

We repeated these experiments using PPO and Rainbow as well. We observe that the PPO agent’s policy converges quickly to a simple one. When it is in possession of the ball, it wanders around in its own half without attempting to cross the half-line or to shoot until the game times out. This happens because the scripted agent is programmed not to chase the opponent in their half when the opponent is in possession of the ball, and hence, the game goes on as described until timeout with no scoring on either side. PPO has clearly reached a local minimum in the space of policies, which is not unexpected as it is optimizing the policy directly. Finally, the Rainbow agent does not learn a useful policy in this case.

VI Concluding Remarks

In this paper, we have described our efforts to create intelligent agents that can assist game designers in building games. To this end, we describe our training pipeline, designed to train agents in mobile and HD games. We present four case studies, two on creating playtesting agents and two on creating game-playing agents. Each use case introduces different challenges, both technically and in terms of design requirements. These intelligent agents have to strike a balance between skill and style, e.g., while one agent was created to evaluate whether the game tuning was achieving its intentions, another was intended to be a challenging opponent while approximating the experience of playing with another human player.

In the first case study, we consider The Sims Mobile in its early development stage. We show that the game dynamics could be fully extracted into a lightweight model of the game. Hence, this removes the need for learning, and in particular reinforcement learning, and the game-play experience could be modeled using much simpler planning methods. The game designer used this model to tune the game parameters for balancing the game-play experience across different career modes of the game. Each career mode is designed to align with the interests of a particular group of players, and the goal is for all interest groups to have a similar game progression. The playtesting agent modeled with the A* algorithm, despite being simple, is effective because the skill requirement is straightforward: progressing with the minimum number of actions possible.

In the second case study, we consider another mobile game. This game has very complex state and action spaces, rendering a direct application of reinforcement learning to the game in its entirety impractical. In this game, the player has to learn to utilize their resources strategically for fast progression. In particular, a greedy progression in the early stages of the game may hinder the player from fast progression in later stages. The game designer is interested in measuring the progression of an average competent player. We show that this problem conforms to the usual reinforcement learning formulation with a ‘+1’ reward for reaching the goal state and a ‘-1’ reward for attempting an invalid action. We break down the complexity of the state and action spaces by removing much of the information that is unnecessary for achieving the desired sub-task. In this case study, we showed how model-free RL could be used to inform the designer about some of the design choices for the game. As discussed in the paper, in this game the players should initially look to foster their resources for long-term planning; near the end of the game, they should reap the benefits of their initial resource management. Hence, the goal of the players will evolve and change as the game progresses. We are currently investigating breaking down the gameplay in this game in a hierarchical manner, where we can define clear reward mechanisms that encourage a desirable behavior for the sub-problem at hand.

In the third case study, we consider an open-world HD game. Our goal is to mimic designer-defined organic gameplaying styles, e.g., aggressive, exploratory, etc. We use a composite Markov model to address style similarity as our primary goal in a data-efficient manner. The model allows interactive training by a human with an incremental introduction of new gameplay elements as well as alteration of already learned behavior within the same game session or across several of them. While the model functions well in simple settings, it has two significant shortcomings: linear growth of the model size with training and poorly defined behavior for unknown states. We address the latter by augmenting the model with simple heuristics capturing implicit human behavior (e.g., avoid running into a wall for an extended time) to handle states not present in human demonstrations. The augmented model interacts with the game in an automated manner and generates bootstrapped data covering both the stylistics and the heuristics. We use the bootstrapped data to train a DNN as a supervised model, which “compresses” the trained behavior into a more compact representation with fast inference. The proposed workflow naturally fuses interactive human demonstrations capturing the stylistic elements, heuristics for addressing edge cases, and bootstrapped data. The complete end-to-end training can take only a couple of hours, allowing fast iterations and deployment of agents in the production environment. While the agents trained using this approach under-perform in terms of the game target metrics (e.g., kill-death ratio), they serve as a satisfactory proxy of organic gameplaying suitable for game evaluation and balancing.

In the last case study, we consider a team sports game. The goal is to train game-playing agents that approximate the experience of playing against human opponents. To do so, we would have to emulate human players in terms of both tactics and strategies. In order to simplify the problem, we focus on strategy using a high-level simulator, called the simple team sports simulator (STS2). We validate the argument that modeling human game-play is challenging when using sparse rewards for scoring, e.g., ‘+1’ for scoring and ‘-1’ for being scored against. In order to move closer to the desired skill and style, we provide more detailed rewards and observe that even then the resulting policies depend heavily on the teammates and opponents against which the agent is trained. We are currently investigating training meta-policies that could adapt to a variety of teammates and opponents without much tuning.

To summarize, the approaches shown in these four case studies showcase the different challenges posed by varied skill and style requirements. Evaluating whether the intended style and skill were achieved is relative to the problem and approach of each case. In a straightforward scenario, such as using the A* algorithm to playtest The Sims Mobile, observing the resulting data sufficed. For other cases, such as attempting to incorporate human-like game-play in an agent, the measurement is non-trivial, and so the constant participation of the designer is critical. While it is possible to identify solutions that evidently fail at modeling human behavior, through an iterative feedback process with the designers we are capable of steering towards an approach that fits the game’s design requirements.


The authors are thankful to EA Digital Platform – Data & AI, EA Sports, and other game team partners for their support. The authors also would like to thank the anonymous reviewers and the EIC for their constructive feedback that helped improve the quality of this paper.


  • [1] P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1. Cited by: §V-A.
  • [2] AI & Compute (2018) Note: [Online, May 2018] Cited by: footnote 2.
  • [3] AlphaStar (2019) Note: [Online, January 2019] Cited by: §I, §II-B.
  • [4] M. Andresen and H. Zinsmeister (2017) Approximating Style by N-gram-based Annotation. In Proceedings of the Workshop on Stylistic Variation, pp. 105–115. Cited by: §V-A1.
  • [5] B. D. Argall, S. Chernova, M. Veloso, and B. Browning (2009) A survey of robot learning from demonstration. Robotics and autonomous systems 57 (5), pp. 469–483. Cited by: §V-A.
  • [6] H. Baier, A. Sattaur, E. Powley, S. Devlin, J. Rollason, and P. Cowling (2018) Emulating human play in a leading mobile card game. IEEE Transactions on Games. Cited by: §II-B.
  • [7] M. G. Bellemare, W. Dabney, and R. Munos (2017) A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 449–458. Cited by: §IV-B.
  • [8] M. G. Bellemare, P. S. Castro, C. Gelada, and S. Kumar Note: [Online, 2018] Cited by: §III-A.
  • [9] D. P. Bertsekas (2005) Dynamic programming and optimal control. Vol. 1, Athena scientific Belmont, MA. Cited by: footnote 5.
  • [10] A. Billard, S. Calinon, R. Dillmann, and S. Schaal (2008) Robot programming by demonstration. In Springer handbook of robotics, pp. 1371–1394. Cited by: §V-A.
  • [11] I. Borovikov and A. Beirami (2018-12) Imitation learning via bootstrapped demonstrations in an open-world video game. In NeurIPS 2018 Workshop on Reinforcement Learning under Partial Observability. Cited by: Winning Isn’t Everything: Enhancing Game Development with Intelligent Agents, §V-A.
  • [12] I. Borovikov and A. Beirami (2019-03) From demonstrations and knowledge engineering to a DNN agent in a modern open-world video game. In AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering. Cited by: Winning Isn’t Everything: Enhancing Game Development with Intelligent Agents, §V-A.
  • [13] I. Borovikov, J. Harder, M. Sadovsky, and A. Beirami (2019-05) Towards a representative metric of behavior style in imitation and reinforcement learning. In The 23rd Annual Signal and Image Sciences Workshop at Lawrence Livermore National Laboratory, Center for Advanced Signal Image Sciences (CASIS). Cited by: Winning Isn’t Everything: Enhancing Game Development with Intelligent Agents, §V-A6, §V-A.
  • [14] I. Borovikov, J. Harder, M. Sadovsky, and A. Beirami (2019) Towards interactive training of non-player characters in video games. arXiv preprint arXiv:1906.00535. Cited by: Winning Isn’t Everything: Enhancing Game Development with Intelligent Agents, §V-A2, §V-A4, §V-A5, §V-A.
  • [15] I. Borovikov, Y. Zhao, A. Beirami, J. Harder, J. Kolen, J. Pestrak, J. Pinto, R. Pourabolghasem, et al. (2019-01) Winning isn’t everything: training agents to playtest modern games. In AAAI Workshop on Reinforcement Learning in Games. Cited by: Winning Isn’t Everything: Enhancing Game Development with Intelligent Agents.
  • [16] I. Borovikov (2019) Interactive Training (code base). GitHub. Cited by: Winning Isn’t Everything: Enhancing Game Development with Intelligent Agents, §V-A5, §V-A5.
  • [17] P. Cairns, A. Cox, and A. I. Nordin (2014) Immersion in digital games: review of gaming experience research. Handbook of digital games 1, pp. 767. Cited by: §I.
  • [18] G. Chaslot, S. Bakkes, I. Szita, and P. Spronck (2008) Monte-carlo tree search: a new framework for game ai.. In AIIDE, Cited by: §II-B.
  • [19] R. Coulom (2006) Efficient selectivity and backup operators in Monte-Carlo tree search. In International conference on computers and games, pp. 72–83. Cited by: §II-B.
  • [20] G. Cuccu, J. Togelius, and P. Cudré-Mauroux (2019) Playing atari with six neurons. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 998–1006. Cited by: §II-B.
  • [21] F. de Mesentier Silva, R. Canaan, S. Lee, M. C. Fontaine, J. Togelius, and A. K. Hoover (2019) Evolving the hearthstone meta. In IEEE Conference on Games, Cited by: §II-A.
  • [22] F. De Mesentier Silva, S. Lee, J. Togelius, and A. Nealen (2017) AI-based playtesting of contemporary board games. In Foundations of Digital Games 2017, Cited by: §II-A.
  • [23] Deep Blue (1997) Note: [Online] Cited by: §I, §II-B.
  • [24] OpenAI Five (2018) Note: [Online, June 2018] Cited by: §II-B, §V-B, TABLE I.
  • [25] Y. Duan, M. Andrychowicz, B. Stadie, O. J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba (2017) One-shot imitation learning. In Advances in neural information processing systems, pp. 1087–1098. Cited by: §V-A.
  • [26] M. A. Federoff (2002) Heuristics and usability guidelines for the creation and evaluation of fun in video games. Ph.D. Thesis, Citeseer. Cited by: §I.
  • [27] A. Gersho and R. M. Gray (1991) Vector quantization and signal compression. Technology and Engineering, Springer Science and Business Media. Cited by: §V-A3.
  • [28] C. Guerrero-Romero, S. M. Lucas, and D. Perez-Liebana (2018) Using a team of general ai algorithms to assist game design and testing. In 2018 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8. Cited by: §II-A.
  • [29] M. Hausknecht and P. Stone (2015) Deep reinforcement learning in parameterized action space. arXiv preprint arXiv:1511.04143. Cited by: §IV-B.
  • [30] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2017) Rainbow: combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298. Cited by: §III-A.
  • [31] C. Heyden (2009) Implementing a computer player for carcassonne. Ph.D. Thesis, Maastricht University. Cited by: §II-B.
  • [32] J. Ho and S. Ermon (2016) Generative adversarial imitation learning. In Advances in neural information processing systems, pp. 4565–4573. Cited by: §IV-B, §V-A, §V-A.
  • [33] J. Hoey and P. Poupart (2005) Solving POMDPs with continuous or large discrete observation spaces. In IJCAI, pp. 1332–1338. Cited by: §IV-B.
  • [34] C. Holmgard, M. C. Green, A. Liapis, and J. Togelius (2018) Automated playtesting with procedural personas with evolved heuristics. IEEE Transactions on Games. Cited by: §II-A.
  • [35] V. Hom and J. Marks (2007) Automatic design of balanced board games. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE), pp. 25–30. Cited by: §II-A.
  • [36] C. Huchler (2015) An mcts agent for ticket to ride. Master’s Thesis, Maastricht University. Cited by: §II-B.
  • [37] J. F. Hughes, A. V. Dam, M. McGuire, D. F. Sklar, J. D. Foley, S. K. Feiner, and K. Akeley (2013) Computer graphics. 3rd edition, Addison-Wesley Professional. External Links: ISBN 0321399528 Cited by: §V-A3.
  • [38] M. P. Kamiński (2016) In search of lexical discriminators of definition style: comparing dictionaries through n-grams. International Journal of Lexicography 29 (4), pp. 403–423. Cited by: §V-A1.
  • [39] L. Kocsis and C. Szepesvári (2006) Bandit based Monte-Carlo planning. In European conference on machine learning, pp. 282–293. Cited by: §II-B.
  • [40] J. Krucher (2015) Algorithmically balancing a collectible card game. Bachelor’s Thesis, ETH Zurich. Cited by: §II-A.
  • [41] A. Liapis, G. N. Yannakakis, and J. Togelius (2013) Sentient sketchbook: computer-aided game level authoring.. In FDG, pp. 213–220. Cited by: §II-A.
  • [42] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. Cited by: §IV-B.
  • [43] T. Mahlmann, J. Togelius, and G. N. Yannakakis (2012) Evolving card sets towards balancing dominion. In Evolutionary Computation (CEC), 2012 IEEE Congress on, pp. 1–8. Cited by: §II-A.
  • [44] V. Mnih et al. (2016) Asynchronous Methods for Deep Reinforcement Learning. arXiv:1602.01783v2. Cited by: §V-B.
  • [45] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §II-B, §III-A, §III-B.
  • [46] M. Montemerlo, S. Thrun, H. Dahlkamp, and D. Stavens (2006) Winning the darpa grand challenge with an ai robot. In Proceedings of the AAAI National Conference on Artificial Intelligence, pp. 17–20. Cited by: §V-B.
  • [47] L. Mugrai, F. de Mesentier Silva, C. Holmgård, and J. Togelius (2019) Automated playtesting of matching tile games. In IEEE Conference on Games, Cited by: §II-A.
  • [48] A. Y. Ng, S. J. Russell, et al. (2000) Algorithms for inverse reinforcement learning.. In Icml, pp. 663–670. Cited by: §V-A.
  • [49] OpenAI Gym (2016) Note: [Online] Cited by: §III-A.
  • [50] A. V. Oppenheim and R. W. Schafer (1975) Digital signal processing. 1st edition, Pearson. External Links: ISBN 0132146355 Cited by: §V-A3.
  • [51] J. Ortega, N. Shaker, J. Togelius, and G. N. Yannakakis (2013) Imitating human playing styles in super mario bros. Entertainment Computing 4 (2), pp. 93–104. Cited by: §I.
  • [52] G. Pagès, H. Pham, and J. Printems (2004) Optimal quantization methods and applications to numerical problems in finance. In Handbook of computational and numerical methods in finance, pp. 253–297. Cited by: §V-A3.
  • [53] D. A. Pomerleau (1989) Alvinn: an autonomous land vehicle in a neural network. In Advances in neural information processing systems, pp. 305–313. Cited by: §V-A.
  • [54] J. M. Porta, N. Vlassis, M. T. Spaan, and P. Poupart (2006) Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research 7 (Nov), pp. 2329–2367. Cited by: §IV-B.
  • [55] PySC2 (2017) Note: [Online] Cited by: §III-A.
  • [56] D. Robilliard, C. Fonlupt, and F. Teytaud (2014) Monte-carlo tree search for the game of “7 wonders”. In Computer Games, pp. 64–77. Cited by: §II-B.
  • [57] S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. Cited by: §V-A5.
  • [58] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever (2017) Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864. Cited by: §IV-A.
  • [59] A. Samuel (1959) Some studies in machine learning using the game of checkers. IBM Journal of Research and Development 3 (3), pp. 210–229. Cited by: §II-B.
  • [60] T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952. Cited by: §IV-B, §V-C1.
  • [61] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §II-B.
  • [62] N. Shaker, M. Shaker, and J. Togelius (2013) Ropossum: an authoring tool for designing, optimizing and solving cut the rope levels.. In AIIDE, Cited by: §II-A.
  • [63] F. D. M. Silva, I. Borovikov, J. Kolen, N. Aghdaie, and K. Zaman (2018) Exploring gameplay with AI agents. In AIIDE, Cited by: Winning Isn’t Everything: Enhancing Game Development with Intelligent Agents, Fig. 3, §IV-A, §IV-A.
  • [64] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §I, §II-B, §V-A.
  • [65] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. (2017) Mastering Chess and Shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815. Cited by: §II-B.
  • [66] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354. Cited by: §II-B.
  • [67] G. Smith, J. Whitehead, and M. Mateas (2010) Tanagra: a mixed-initiative level design tool. In Proceedings of the Fifth International Conference on the Foundations of Digital Games, pp. 209–216. Cited by: §II-A.
  • [68] M. T. Spaan and N. Vlassis (2005) Perseus: Randomized point-based value iteration for POMDPs. Journal of artificial intelligence research 24, pp. 195–220. Cited by: §IV-B.
  • [69] A. Summerville, S. Snodgrass, M. Guzdial, C. Holmgård, A. K. Hoover, A. Isaksen, A. Nealen, and J. Togelius (2018) Procedural content generation via machine learning (PCGML). IEEE Transactions on Games 10 (3), pp. 257–270. Cited by: §II-A.
  • [70] I. Szita, G. Chaslot, and P. Spronck (2009) Monte-carlo tree search in settlers of catan. In Advances in Computer Games, pp. 21–32. Cited by: §II-B.
  • [71] G. Tesauro (1995) Temporal difference learning and td-gammon. Communications of the ACM 38 (3), pp. 58–69. Cited by: §II-B.
  • [72] C. Thurau, T. Paczian, G. Sagerer, and C. Bauckhage (2007) Bayesian imitation learning in game characters. International journal of intelligent systems technologies and applications 2 (2), pp. 284. Cited by: §V-A.
  • [73] J. Togelius, G. N. Yannakakis, K. O. Stanley, and C. Browne (2011) Search-based procedural content generation: a taxonomy and survey. IEEE Transactions on Computational Intelligence and AI in Games 3 (3), pp. 172–186. Cited by: §II-A.
  • [74] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, et al. (2017) StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782. Cited by: §III-A.
  • [75] M. Wiering and van Martijn Otterlo (2012) Reinforcement learning. 1st edition, Vol. 12, Springer-Verlag Berlin Heidelberg, Cambridge, MA, USA. External Links: ISBN 9783642276453 Cited by: §V-A3.
  • [76] D. Wright (2017) Using word n-grams to identify authors and idiolects. International Journal of Corpus Linguistics 22 (2), pp. 212–241. Cited by: §V-A1.
  • [77] G. N. Yannakakis, A. Liapis, and C. Alexopoulos (2014) Mixed-initiative co-creativity. In Proceedings of the 9th Conference on the Foundations of Digital Games, Cited by: §II-A.
  • [78] G. N. Yannakakis and J. Togelius (2018) Artificial intelligence and games. Vol. 2, Springer. Cited by: §II-B.
  • [79] Y. Zhao, A. Beirami, M. Sardari, N. Aghdaie, and K. Zaman (2018-12) Training agents to play modern games: challenges and opportunities. In NeurIPS 2018 Workshop on Reinforcement Learning under Partial Observability. Cited by: Winning Isn’t Everything: Enhancing Game Development with Intelligent Agents.
  • [80] Y. Zhao, I. Borovikov, J. Rupert, C. Somers, and A. Beirami (2019-06) On multi-agent learning in team sports games. In ICML 2019 Workshop on Imitation, Intent, and Interaction (I3). Cited by: Winning Isn’t Everything: Enhancing Game Development with Intelligent Agents, §V-C.