Winning Isn't Everything: Training Human-Like Agents for Playtesting and Game AI

by   Yunqi Zhao, et al.

Recently, there have been several high-profile achievements of agents learning to play games against humans and beat them. We consider an alternative approach that instead addresses game design for a better player experience by training human-like game agents. Specifically, we study the problem of training game agents in service of the development processes of the game developers that design, build, and operate modern games. We highlight some of the ways in which we think intelligent agents can assist game developers to understand their games, and even to build them. Our early results using the proposed agent framework mark a few steps toward addressing the unique challenges that game developers face.




I Introduction

The history of artificial intelligence (AI) can be mapped by its achievements playing and winning various games. From the early days of Chess-playing machines to the most recent accomplishments of Deep Blue, AlphaGo, and AlphaStar, AI has advanced from competent, to competitive, to champion in even the most complex games. Games have been instrumental in advancing AI, and most notably in recent times through tree search and reinforcement learning (RL). Samuel applied some form of tree search combined with basic reinforcement learning to the game of checkers [1]. The success of Samuel motivated researchers to target other games.

IBM Deep Blue followed the tree search path and was the first artificial game agent to beat the chess world champion, Garry Kasparov [2]. A decade later, Monte Carlo Tree Search (MCTS) [3, 4] marked a big leap in training game agents. MCTS agents for playing Settlers of Catan were reported in [5, 6] and shown to beat previous heuristics. Other work compares multiple agent approaches on the two-player variant of Carcassonne and discusses variations of MCTS and Minimax search for playing the game [7]. MCTS has also been applied to 7 Wonders [8] and Ticket to Ride [9].

Tesauro [10], on the other hand, used TD-Lambda, a temporal difference RL algorithm, to train Backgammon agents at a superhuman level. The impressive recent progress of RL in solving video games is partly due to the advancements in processing power and AI computing technology. (Footnote 1: The amount of AI compute has been doubling every 3-4 months in the past few years [11].) More recently, deep Q networks (DQNs) have emerged as a general representation learning framework that learns from the pixels in a frame buffer, combined with Q-Learning with function approximation, without the need for task-specific feature engineering [12]. (Footnote 2: While the original DQNs worked with pixels as state space, the same idea could be applied to other cases by changing the network structure appropriately.)

DeepMind researchers remarried the two approaches by showing that DQNs combined with MCTS would lead to AI agents that play Go at a superhuman level [13], and solely via self-play [14, 15]. Subsequently, OpenAI researchers showed that a policy optimization approach with function approximation, called Proximal Policy Optimization (PPO) [16], would lead to training agents at a superhuman level in Dota 2 [17]. The most recent progress was reported by DeepMind on StarCraft II, where AlphaStar was unveiled to play the game at a superhuman level by combining a variety of techniques including attention networks [18].

Despite the tremendous success stories of deep RL, at Electronic Arts we combine a variety of planning methods and machine learning techniques (including state-of-the-art deep RL) to train human-like agents with the goal of making the gameplay experience more enjoyable for human players. The design of a deep network for function approximation, and setting the right hyperparameters for it to work well, is a daunting task. In addition, it takes hundreds of thousands of state-action pairs, equivalent to many years of experience, for the agent to reach human-level performance. (Footnote 3: AlphaStar is trained using the equivalent of 60,000 years of human experience.) Applying these same techniques to modern games for playtesting requires obtaining and processing hundreds of years of experience, which is only feasible using significant cloud infrastructure costing millions of dollars [18, 19]. We create hierarchical solutions by breaking the complex problem into a hierarchy of simpler learning problems [19, 20].

We move away from the recent trend of training superhuman agents and instead train believable agents that play like human players do. We believe that winning isn’t everything, and instead explore training agents that are engaging and human-like. In order to apply RL to modern video games or any other part of the problem, we would have to shape rewards that promote a certain style or human-like behavior. Reward shaping in this setup is an extremely challenging problem, as has been pointed out by several researchers. Additionally, we need to capture human-like cooperation/conflict in multi-agent strategic gameplay. These requirements make reward shaping extremely challenging, because the objectives are mathematically vague.

We also move away from using the raw state space through the screen pixels. On the contrary, we provide the agent with any additional form of information that could ease training and might otherwise be hard to infer from the screen pixels. Our ultimate goal is to train agents that play human-like. Thus, so long as the agents would pass the Turing test we are not alarmed by the unfair extra information at their disposal. Furthermore, in the game development stage, the game itself is dynamic in the design and multiple parameters and attributes (particularly related to graphics) may change between different builds, hence it is desirable to train agents on more stable features.

We mainly pursue two use-cases for our agent training pipeline to help with the game design.

  1. The first use-case is to provide design feedback during game design. Game designers usually rely on playtesting sessions and the feedback they receive from playtesters to make design choices in the game development process. However, as game worlds are becoming larger and the number of concurrent players is increasing, the traditional approach is becoming infeasible, calling for alternative approaches.

  2. The second use-case is to train player-facing game AI agents and non-player characters (NPCs) that constitute a part of the game itself and shape the gameplay experience of real players. The traditional AI solutions are already providing excellent experiences for the players. However, it is becoming increasingly difficult to scale those traditional solutions up as game worlds are becoming larger and the content is becoming dynamic.

The rest of the paper is organized as follows. In Section II, we review the related work on training agents for playtesting and NPCs. In Section III, we describe our training pipeline. In Section IV, we provide two case studies on the intelligent playtesting. In Section V, we provide two case studies on training intelligent NPCs. Finally, the concluding remarks are provided in Section VI.

II Related Work

II-A Playtesting

To validate their design, game designers conduct playtesting sessions. Playtesting consists of having a group of players interact with the game in the development cycle to not only gauge the engagement of players, but also to discover elements and states that result in undesirable outcomes. As a game goes through the various stages of development, it is essential to continuously iterate and improve the relevant aspects of the gameplay and its balance. Relying exclusively on playtesting conducted by humans can be costly and inefficient. Artificial agents could perform much faster play sessions, allowing the exploration of much more of the game space in much shorter time. This becomes even more valuable as game worlds grow large enough to hold tens of thousands of simultaneously interacting players. Games at this scale render traditional human playtesting infeasible.

Recent advances in the field of RL, when applied to playing computer games, assume that the goal of a trained agent is to achieve the best possible performance with respect to clearly defined rewards while the game itself remains fixed for the foreseeable future. In contrast, during game development the objectives and the settings are quite different and vary over time. The agents can play a variety of roles with rewards that are not obvious to define formally, e.g., the objective of an agent exploring a game level is different from foraging, defeating all adversaries, or solving a puzzle. Also, the game environment changes frequently between game builds. In such settings, it is desirable to quickly train agents that help with automated testing, data generation for game balance evaluation, and wider coverage of the gameplay features. It is also desirable that the agent be mostly re-usable as the game build is updated with new appearance and gameplay features. Following the direct path of throwing computational resources combined with substantial engineering efforts at training agents in such conditions is far from practical and calls for a different approach.

The idea of using artificial agents for playtesting is not new. Algorithmic approaches have been proposed to address the issue of game balance in board games [21, 22] and card games [23, 24]. More recently, Holmgard et al. [25] built a variant of MCTS to create a player model for AI-agent-based playtesting. These techniques are relevant to creating rewarding mechanisms for mimicking player behavior. AI and machine learning can also play the role of a co-designer, making suggestions during the development process [26]. Tools for creating game maps [27] and level design [28, 29] have also been proposed. See [30, 31] for a survey of these techniques in game design.

In this paper, we describe our framework that supports game designers with automated playtesting. This also entails a training pipeline that universally applies this framework to a variety of games. We then provide two case studies that entail different solution techniques.

II-B Game AI

Fig. 1: The AI agent training pipeline.

Game AI has been a main constituent of games since the dawn of video gaming. Game AI agents have become more sophisticated providing excellent experiences to millions of players as the games have grown in complexity over the years. Scaling traditional AI solutions in ever growing worlds with thousands of agents and dynamic content is a challenging problem calling for alternative approaches. In this paper, we describe the solution techniques that we are exploring to train agents that play games like human players. As already discussed in the introduction, this is a more challenging task than training agents with superhuman gameplay capabilities.

III Training Pipeline

III-A Gameplay and Agent Environments

The AI agent training pipeline, which is depicted in Fig. 1, consists of two key components:

  • Gameplay environment refers to the simulated game world that executes the game logic with actions submitted by the agent every timestep and produces the next state.

  • Agent environment refers to the medium where the agent interacts with the game world. The agent observes the game state and produces an action. This is where training occurs.

In practice, the game architecture can be complex and it might be too costly for the game to directly communicate the complete state space information to the agent at every timestep. To train artificial agents, we create a universal interface between the gameplay environment and the learning environment. (Footnote 4: These environments are usually physically separated, and hence, we prefer a thin (i.e., headless) client that supports fast cloud execution, and is not tied to frame rendering.) The interface extends OpenAI Gym [32] and supports actions that take arguments, which is necessary to encode action functions and is consistent with PySC2 [19, 33]. In addition, our training pipeline enables creating new players on the game server, logging in/out an existing player, and gathering data from expert demonstrations. We also adapt Dopamine [34] to this pipeline to make DQN [12] and Rainbow [35] agents available for training in the game. Additionally, we add support for more complex preprocessing other than the usual frame buffer stacking, which we explicitly exclude following the motivation presented in the next section.
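To make the idea of a Gym-style interface with argument-taking actions concrete, the following is a minimal sketch in plain Python. It only mirrors the `reset`/`step` convention of OpenAI Gym and does not import it. All names (`Action`, `GameClientStub`, `AgentEnv`) and the toy game logic are hypothetical and are not taken from the pipeline described above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Action:
    """An action function together with its arguments (hypothetical structure)."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


class GameClientStub:
    """Stand-in for a thin, headless game client (assumption: a real client
    would forward actions to the game server instead of simulating locally)."""

    def __init__(self) -> None:
        self.position = 0

    def reset(self) -> List[float]:
        self.position = 0
        return [float(self.position)]

    def apply(self, action: Action) -> Tuple[List[float], bool]:
        """Execute one action and return (state features, was the action valid)."""
        if action.name == "move":
            self.position += int(action.args.get("steps", 1))
            return [float(self.position)], True
        if action.name == "noop":
            return [float(self.position)], True
        # Unknown action functions are rejected and leave the state unchanged.
        return [float(self.position)], False


class AgentEnv:
    """Agent-side environment exposing Gym-like reset() and step()."""

    def __init__(self, client: GameClientStub, goal: int = 3) -> None:
        self.client = client
        self.goal = goal

    def reset(self) -> List[float]:
        return self.client.reset()

    def step(self, action: Action):
        obs, valid = self.client.apply(action)
        done = obs[0] >= self.goal
        reward = 1.0 if done else 0.0  # toy reward, for illustration only
        return obs, reward, done, {"valid": valid}
```

A typical loop would call `env.reset()` once and then `env.step(Action("move", {"steps": 2}))` repeatedly; the `info` dictionary is one possible place to report whether the game accepted the action.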

III-B State Abstraction

The use of frame buffer as an observation of the game state has proved advantageous in eliminating the need for manual feature-engineering in Atari games [12]. However, to achieve the objectives of RL in a fast-paced game development process, the drawbacks of using frame buffer outweigh its advantages. The main considerations which we take into account when deciding in favor of a lower-dimensional engineered representation of game state are:

  1. During almost all stages of game development, the game parameters are evolving on a daily basis. In particular, the art may change at any moment and the look of already learned environments can change overnight. Hence, it is desirable to train agents using features that are more stable to minimize the need for retraining agents.

  2. Another important advantage of state abstraction is that it allows us to train much smaller models (networks) because of the smaller input size and the use of carefully engineered features. This is critical for deployment in real-time applications in console and consumer PC environments, where rendering, animation and physics occupy much of the GPU and CPU power.

  3. In playtesting, the gameplay environment and the learning environment may reside in physically separate nodes. Naturally, closing the RL state-action-reward loop in such environments requires a lot of network communication. The presence of frame buffers as the representation of game state would significantly increase this communication cost, whereas derived game state features enable more compact encodings.

  4. Obtaining an artificial agent in a reasonable time (a few hours at most) usually requires that the game be clocked at a rate much higher than the usual gameplay speed. As rendering takes a significant portion of every frame’s time, overclocking with rendering enabled is not practical. Additionally, moving large amounts of data from the GPU to main memory drastically slows down the game execution and can potentially introduce simulation artifacts by interfering with the target timestep rate.

  5. Last but not least, we can leverage the advantage of having privileged access to the game code to let the game engine distill a compact state representation that could be inferred by a human player from the game and pass it to the agent environment. By doing so we also have a better hope of learning in environments where the pixel frames only contain partial information about the state space.

The compact state representation could include the inventory, resources, buildings, the state of neighboring players, and the distance to target. In an open-world first-person shooter game, the features may include the distance to the adversary, the angle at which the agent approaches the adversary, the presence of line of sight to the adversary, and the direction to the nearest waypoint generated by the game navigation system. The feature selection may require some engineering effort, but it is logically straightforward after the initial familiarization with the gameplay mechanics, and often similar to that of traditional CPU Game AI design, which will be informed by the game designer. We remind the reader that our goal is not to train agents that win but to simulate human-like behavior, so we train on information that would be accessible to a human player.
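As an illustration of how small such an engineered observation can be, the sketch below packs the first-person shooter features listed above into a flat vector and compares its length with a stacked frame buffer. The class name, field names, units and the sine/cosine encoding of angles are assumptions made for this example. The 84x84x4 frame stack is the common Atari DQN preprocessing and is used here only as a point of reference.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ShooterState:
    """Hypothetical compact state for an open-world first-person shooter."""
    distance_to_adversary: float  # assumed to be in world units
    approach_angle: float         # radians, relative to the adversary
    has_line_of_sight: bool
    waypoint_direction: float     # radians, towards the nearest waypoint

    def to_vector(self) -> List[float]:
        # Angles are encoded as (sin, cos) pairs so that values near 0 and
        # near 2*pi map to nearby inputs (a design choice of this sketch).
        return [
            self.distance_to_adversary,
            math.sin(self.approach_angle),
            math.cos(self.approach_angle),
            1.0 if self.has_line_of_sight else 0.0,
            math.sin(self.waypoint_direction),
            math.cos(self.waypoint_direction),
        ]


def frame_buffer_values(width: int = 84, height: int = 84, stacked: int = 4) -> int:
    """Number of input values in a stacked grayscale frame buffer."""
    return width * height * stacked


state = ShooterState(12.5, 0.0, True, math.pi / 2)
compact = state.to_vector()
ratio = frame_buffer_values() / len(compact)  # how many times larger the pixel input is
```

In this toy comparison the pixel observation has 28,224 values against 6 engineered ones, which is the kind of gap that motivates both the smaller networks and the lower communication cost discussed above.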

IV Playtesting

IV-A Optimizing Player Experience

In this section, we consider the early development of The Sims Mobile, whose gameplay is about “emulating life”: players create avatars, called Sims, and conduct them through a variety of everyday activities. In this game, there is no single predetermined goal to achieve. Instead, players craft their own experiences, and the designer’s objective is to evaluate different aspects of that experience. Each player can pursue different careers, and as a result will have a different experience. The designer’s goal is to measure the impact of high-level decisions on the progression path of the player. We refer the interested reader to [36] for a more complete study of this problem.

The game is fully observable and the game dynamics are fully known. This simplified case allows for the extraction of a lightweight model of the game. While this requires some additional development effort, we can achieve a dramatic speedup in training agents by avoiding (reinforcement) learning and resorting to planning techniques instead.

In particular, we use the A* algorithm as it enables computation of an optimal strategy by exploring the state transition graph instead of the more expensive iterative processes, such as dynamic programming, and even more expensive Monte Carlo search based algorithms. (Footnote 5: Unfortunately, in dynamic programming every node will participate in the computation, while it is often true that most of the nodes are not relevant to the shortest path problem in the sense that they are unlikely candidates for inclusion in a shortest path [37].) The customizable heuristics and the target states corresponding to different gameplay objectives, offered by A*, provide sufficient control to conduct various experiments and explore multiple aspects of the game.
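The following is a minimal sketch of A* over a state transition graph with a cap on the number of expanded nodes. The toy career model (two actions with made-up experience gains and appointment costs), the heuristic and all function names are hypothetical. Unlike the experiment reported here, this sketch simply gives up when the cutoff is reached instead of returning a suboptimal plan.

```python
import heapq
from itertools import count
from typing import Callable, Hashable, Iterable, List, Optional, Tuple

Successor = Tuple[str, Hashable, float]  # (action name, next state, step cost)


def a_star(
    start: Hashable,
    is_goal: Callable[[Hashable], bool],
    successors: Callable[[Hashable], Iterable[Successor]],
    heuristic: Callable[[Hashable], float],
    max_nodes: int = 2000,
) -> Optional[List[str]]:
    """Return a cheapest action sequence to a goal state, or None on cutoff."""
    tie = count()  # tie-breaker so the heap never compares states or plans
    frontier = [(heuristic(start), next(tie), 0.0, start, [])]
    best_cost = {start: 0.0}
    expanded = 0
    while frontier and expanded < max_nodes:
        _, _, cost, state, plan = heapq.heappop(frontier)
        if cost > best_cost.get(state, float("inf")):
            continue  # stale entry: a cheaper path to this state was found
        if is_goal(state):
            return plan
        expanded += 1
        for action, nxt, step_cost in successors(state):
            new_cost = cost + step_cost
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(nxt), next(tie), new_cost, nxt, plan + [action]),
                )
    return None


# Toy model: the state is accumulated experience, the goal is 10 experience points.
GOAL_XP = 10


def career_successors(xp: int) -> List[Successor]:
    return [("shift", xp + 2, 1.0), ("overtime", xp + 5, 2.0)]


def career_heuristic(xp: int) -> float:
    # Admissible: no action yields more than 2.5 experience per appointment.
    return max(0, GOAL_XP - xp) / 2.5


plan = a_star(0, lambda xp: xp >= GOAL_XP, career_successors, career_heuristic)
```

Changing `is_goal` and `heuristic` is how different gameplay objectives would be expressed in this sketch, while `max_nodes` plays the role of the node cutoff.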

Fig. 2: Comparison of the average amount of career actions (appointments) to reach the goal using A* search and evolution strategy adapted from [36].

We validate our approach against (approximately) solving a full optimization over the entire game strategy space using evolution strategies, where the agent optimizes for a utility function that selects between available actions. (Footnote 6: Coincidentally, OpenAI has recently advocated for evolution strategies as an alternative for reinforcement learning in training agents to play games [38].) We compare the number of actions that it takes to reach the goal for each career in Figure 2 as computed by the two approaches. We emphasize that our goal is to show that a simple planning method, such as A*, can sufficiently satisfy the designer’s goal in this case. We can see that the more expensive optimization based evolution strategy reaches a style of gameplay that is similar to the much simpler A* search.
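For intuition about optimizing a utility function that selects between available actions, here is a small sketch of a basic elitist (1+λ) evolution strategy on a toy career model. It is not the evolution strategy used in [36]; the action table, the linear utility, the mutation scale and all names are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

# (name, experience gained, appointments spent): made-up numbers.
ACTIONS: List[Tuple[str, int, int]] = [("shift", 2, 1), ("overtime", 5, 2)]
GOAL_XP = 10


def episode_cost(weights: Sequence[float]) -> int:
    """Play one episode greedily w.r.t. a linear utility; return appointments used."""
    xp, appointments = 0, 0
    while xp < GOAL_XP:
        # Utility of an action = w0 * gain + w1 * cost; ties pick the first action.
        _, gain, cost = max(ACTIONS, key=lambda a: weights[0] * a[1] + weights[1] * a[2])
        xp += gain
        appointments += cost
    return appointments


def evolve(generations: int = 30, offspring: int = 8, sigma: float = 0.5, seed: int = 0):
    """(1+lambda) ES: keep the parent unless a mutated child is strictly better."""
    rng = random.Random(seed)
    parent = [0.0, 0.0]
    parent_cost = episode_cost(parent)
    for _ in range(generations):
        for _ in range(offspring):
            child = [w + rng.gauss(0.0, sigma) for w in parent]
            child_cost = episode_cost(child)
            if child_cost < parent_cost:
                parent, parent_cost = child, child_cost
    return parent, parent_cost


best_weights, best_cost = evolve()
```

Because the policy in this toy is deterministic, a single episode per candidate suffices; with a stochastic game or policy, each fitness evaluation would need to be averaged over many runs, which is where the cost of this approach comes from.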

The largest discrepancy arises for the Barista career, which might be explained by the fact that this career has an action that does not reward experience by itself, but rather enables another action that does it. This action can be repeated often and can explain the high numbers despite having half the number of levels. Also, we observe that in the case of the medical career, the 2,000 node A* cutoff has led to a suboptimal solution.

When running the two approaches, another point of comparison can be made: how many sample runs are required to obtain statistically significant results? We ran 2,000 runs for the evolution strategy, whereas the A* agent learns a deterministic playstyle, which has no variance. The agent trained using an evolution strategy, on the other hand, has a high variance and requires a sufficiently high number of simulation runs to approach a reasonable final strategy.


In this experiment, we were able to create a simulation model for the game mechanics, and we found that its benefits outweigh the time needed to run the actual simulations to answer different questions raised by the game designer. However, it is worth reiterating that the availability of a game model as a separate application is not universally expected due to the huge state space, complex game dynamics, and a weakly structured heterogeneous action space of high dimensionality. The next case study discusses solution techniques to solve these problems using state-of-the-art approaches.

Fig. 3: Average cumulative reward (return) in training and evaluation for the agents as a function of the number of iterations. Each iteration is worth 60 minutes of gameplay. The trained agents are: (1) a DQN agent with complete state space, (2) a Rainbow agent with complete state space, (3) a DQN agent with augmented observation space, and (4) a Rainbow agent with augmented observation space.

IV-B Measuring Expert Player Progression

When the game dynamics are unknown, most of the recent success stories are based on RL (and particularly DQN and PPO). In this section, we show how such model-free control techniques fit into the paradigm of playtesting modern games.

In our second case study, we consider a mobile game designed to engage many players for months exhibiting all of the challenges discussed in the introduction. The designer’s primary concern is to test how quickly an expert player can possibly progress in the game.

The state consists of 50 continuous and 100 discrete state variables. The set of possible actions $\alpha$ is a subset of a space $\mathcal{A}$, which consists of 25 action classes, some of which are from a continuous range of possible action values, and some are from a discrete set of action choices. The agent has the ability to generate actions $a \in \mathcal{A}$, but not all of them are valid at every game state since $\alpha = \alpha(s, t)$, i.e., $\alpha$ depends on the timestep $t$ and the game state $s$. Moreover, the subset $\alpha$ of valid actions is only partially known to the agent.

We rely on the game server to validate whether a submitted action is available because it is impractical to encode and pass the set of available actions to the agent at every timestep. While the problems of a huge state space [39, 40, 41], a continuous action space [42], and a parametric action space [43] could be dealt with, these techniques are not directly applicable to our problem. This is because, as we shall see, some actions will be invalid at times and inferring that information may not be fully possible from the observation space. Finally, the game is designed to last tens of millions of timesteps, taking the problem of training a functional agent in such an environment outside of the domain of previously explored problems.

We study game progression while taking only valid actions. As we already mentioned, the set of valid actions is not fully determined by the current observation, and hence, we deal with a partially observable Markov decision process (POMDP). Given the practical constraints outlined above, it is infeasible to apply deep reinforcement learning to train agents in the game in its entirety. In this game, we show progress toward training an artificial agent that takes valid actions and progresses fast in the game like expert human players. We connect this game to our training pipeline with DQN and Rainbow agents, where we use a network with two fully connected hidden layers and ReLU activation.

We create an episode by setting an early goal state in the game that takes an expert human player 5 minutes to reach. We let the agent submit actions to the game server every second. We reward the agent with ‘+1’ when they reach the goal state, ‘-1’ when they submit an invalid action, ‘0’ when they take a valid action, and ‘-0.1’ when they choose the “do nothing” action. The game is such that at times the agent has no other valid action to choose, and hence they should choose “do nothing”, but such periods do not last more than a few seconds.
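The reward scheme above can be written down directly. In the sketch below, the function and argument names are hypothetical, and the order in which the cases are checked (goal first, then invalid action, then "do nothing") is an assumption, since the text does not say how overlapping cases are resolved.

```python
from typing import Iterable

GOAL_REWARD = 1.0
INVALID_REWARD = -1.0
DO_NOTHING_REWARD = -0.1
VALID_REWARD = 0.0


def step_reward(reached_goal: bool, accepted_by_server: bool, is_do_nothing: bool) -> float:
    """Reward for one submitted action, following the values given in the text."""
    if reached_goal:
        return GOAL_REWARD
    if not accepted_by_server:
        return INVALID_REWARD      # the game server rejected the action
    if is_do_nothing:
        return DO_NOTHING_REWARD   # valid, but discouraged unless unavoidable
    return VALID_REWARD


def undiscounted_return(rewards: Iterable[float]) -> float:
    """Sum of per-step rewards over one episode."""
    return sum(rewards)
```

The small penalty on "do nothing" keeps the agent from idling when a useful action exists, while staying an order of magnitude cheaper than submitting an invalid action.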

We consider two different versions of the observation space, both extracted from the game engine. The first is what we call the “complete” state space. The complete state space contains information that is not straightforward to infer from the real observation in the game and is only used as a baseline for the agent. The polar opposite of this state space could be called the “naive” state space, which only contains straightforward information. The second state space we consider is what we call the “augmented” observation space, which contains information from the “naive” state space and information the agent would reasonably infer and retain from current and previous game observations. Note that current RL techniques would have had difficulty in inferring this information from the frame buffer pixels which only constitute a partially observable state space.

We trained four types of agents as shown in Figure 3, where we are plotting the average undiscounted return in each training episode. By design, this quantity is upper bounded by ‘+1’, which is achieved if the agent keeps taking valid actions until reaching the final goal state. On the other hand, this is not achievable as there are periods of time where no action is available and the agent has to choose the “do nothing” action and be rewarded with ‘-0.1’. Hence, the best an expert human player would achieve on these episodes would be around zero.
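The bound can be made explicit with a short calculation. Writing $n_{\text{idle}}$ for the number of "do nothing" steps and $n_{\text{invalid}}$ for the number of rejected actions in an episode that reaches the goal (both symbols are introduced here for illustration), the undiscounted return is

```latex
G \;=\; \underbrace{1}_{\text{goal}} \;-\; 0.1\, n_{\text{idle}} \;-\; 1 \cdot n_{\text{invalid}},
\qquad G \le 1 .
```

As an example under assumed numbers, a player who never submits an invalid action but is forced to idle for about ten one-second steps obtains $G = 1 - 0.1 \times 10 = 0$, which is consistent with an expert return of around zero.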

We see that after a few iterations, both the Rainbow and DQN agents converge to their asymptotic performance values. The Rainbow agent converges to a better asymptotic performance level as compared to the DQN agent. However, in the transient behavior we observe that the DQN agent achieves the asymptotic behavior faster than the Rainbow agent. We will observe in the last case study that Rainbow does not outperform DQN in all experiments. We also see that the augmented observation space makes the training slower and also results in a worse performance on the final strategy. In addition, the agent will keep taking invalid actions in some cases in the evaluation phase, resulting in high negative returns because of the exploratory nature of the agent. We intend to experiment with shaping the reward function for achieving different play styles. We also intend to investigate augmenting the replay buffer with expert demonstrations for faster training.

While we achieved a certain level of success using the outlined approach (streamlined access to the game state, and direct communication of actions to the game, followed by training using a deep neural network), we observe that the training within the current paradigm of RL remains costly. Specifically, even using the complete state space, it takes several hours to train a model that achieves a level of performance expected of an expert human player on this relatively short episode. This calls for the exploration of complementary approaches to augment the training process.

V Game AI

We have shown the value of simulated agents in a fully modeled game, and the potential of training agents in a complex game to model player progression. We can take these techniques a step further and make use of agent training to help build the game itself. Instead of applying RL to capture player behaviors, we consider an approach to gameplay design where the player agents learn behavior policies from the game designers.

V-A Human-Like Exploration in an Open-World Game

To bridge the gap between the agent and the designer, we introduce imitation learning (IL) to our system [44, 45, 46]. In the present application, IL allows us to translate the intentions of the game designer into a primer and a target for our agent learning system. Learning from expert demonstrations has traditionally proved very helpful in training agents. In particular, the original AlphaGo [13] used expert demonstrations in training a deep Q network. While it is argued in subsequent work that learning via self-play could achieve a better asymptotic return compared to relying on expert demonstrations, the better performance comes with a significantly higher cost in terms of training computational resources, and superhuman performance is not what we are seeking in this work. There are also other cases where the preferred solution for training agents would utilize a few relatively short demonstration episodes played by the software developers or designers at the end of the current development cycle [47].

Fig. 4: Model performance measures the probability of the event that the Markov agent finds at least one previous action from human-played demonstration episodes in the current game state. The goal of interactive learning is to add support for new game features to the already trained model or improve its performance in underexplored game states. Plotted is the model performance during interactive training from demonstrations in a proprietary open-world game as a function of time measured in seconds.

In this experiment, we consider training artificial agents in an open-world video game, where the game designer is interested in training non-player characters in the game that follow certain behavioral styles. We need to efficiently train an agent using demonstrations capturing only a few key features. The training process has to be computationally inexpensive and the agent has to imitate the behavior of the teacher(s) by mimicking their relevant style (in a statistical sense) for implicit representation of the teacher’s objectives.

Casting this problem directly into the RL framework is complicated by two issues. First, it is not straightforward how to design a rewarding mechanism for imitating the style of the expert. (Footnote 7: While inverse RL aims at solving this problem, its applicability is not obvious given the reduced representation of the huge state-action space that we deal with and the ill-posed nature of the inverse RL problem [48, 49].) Second, the RL training loop often requires thousands of episodes to learn useful policies, directly translating to a high cost of training in terms of time and computational resources. We propose a three-component solution to the stated problem:

  • The first component is an ensemble of multi-resolution Markov models capturing the style of the teacher(s) with respect to key game features.

  • The second one is a DNN trained as a supervised model on samples bootstrapped from an agent playing the game following the Markov ensemble.

  • Lastly, we enable an interactive medium where the game designer can take the controller back at any time to provide more demonstration samples.

V-A1 Multi-resolution Markov agent

We start with capturing the demonstration data, which consists of a few engineered features reported by the game at every timestep. The total dimensionality of an individual frame of data is 20 variables, with some of them reported only once every few timesteps. We also record actions that are controller inputs from a human player. We intend to provide a more complete description of this setup by publishing a preprint of our internal report [50]. The key point is that the data from the demonstrations has low dimensionality and is sparse, but sufficient to capture the main characteristics of the core gameplay loop of a first person shooter in an open world.

To build our Markov agent we use a direct approach to style reproduction inspired by natural language processing literature (see review [51]). The demonstrations are converted to symbolic sequences using a hierarchy of multi-resolution quantization schemes with different levels of details for both the continuous data and the discrete channels. The most detailed quantization and higher order Markov models are able to reproduce sequences of human actions in similar situations with high accuracy, thus capturing the gameplay style. The coarsest level corresponds to a Markov agent blindly sampling actions from the demonstrations. The multi-resolution ensemble of Markov models provides an initial way of generalizing the demonstration data. The ensemble is straightforward to build and the inference is essentially a look-up process. We observe that even such a basic approach provided considerable mileage towards solving the stated problem.
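To make the look-up process concrete, the following is a minimal sketch of a multi-resolution ensemble with back-off from the most detailed level to the coarsest one. It is an illustration under our own simplifying assumptions, not the production implementation: the state is a single continuous feature, and the names `quantize`, `MarkovLevel`, and `MarkovEnsemble` are hypothetical.

```python
import random
from collections import defaultdict


def quantize(value, step):
    """Map a continuous feature to a bin index; a larger step is a coarser resolution."""
    return int(value // step)


class MarkovLevel:
    """One ensemble member: a table from a quantized context to demonstrated actions."""

    def __init__(self, order, step):
        self.order = order  # number of recent states in the context (0 = no context)
        self.step = step    # quantization step for the (assumed scalar) state feature
        self.table = defaultdict(list)

    def key(self, states):
        if self.order == 0:
            return ()  # coarsest level: every situation shares one context
        return tuple(quantize(s, self.step) for s in states[-self.order:])

    def fit(self, demo):
        # demo is a list of (state, action) pairs from one demonstration
        states = [s for s, _ in demo]
        for t, (_, action) in enumerate(demo):
            if t + 1 >= self.order:
                self.table[self.key(states[:t + 1])].append(action)

    def lookup(self, states):
        # Contexts shorter than `order` never match a stored key, so this returns []
        return self.table.get(self.key(states), [])


class MarkovEnsemble:
    """Levels ordered from most detailed to coarsest; inference is a back-off look-up."""

    def __init__(self, levels):
        self.levels = levels

    def fit(self, demo):
        for level in self.levels:
            level.fit(demo)

    def act(self, states, rng=random):
        for level in self.levels:
            candidates = level.lookup(states)
            if candidates:
                return rng.choice(candidates)  # sample in proportion to demo frequency
        raise ValueError("no demonstrations recorded")


demo = [(0.1, "forward"), (0.2, "forward"), (1.5, "fire"), (1.6, "fire")]
ensemble = MarkovEnsemble([MarkovLevel(2, 0.5), MarkovLevel(1, 1.0), MarkovLevel(0, 1.0)])
ensemble.fit(demo)
```

In this sketch a situation seen in the demonstrations is answered by the most detailed level, while an unseen one falls through to the order-0 level, which samples blindly from all demonstrated actions.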

V-A2 Bootstrapped DNN agent

We treat the existing demonstrations as a training set for a supervised learning problem where we need to predict the next action from a sequence of observed state-action pairs. This approach has proved to be useful in pre-training of self-driving cars [52]. Since our database of demonstrations is relatively small, it is desirable to generate more data by bootstrapping the dataset, for which we use our base Markov agent interacting with the game.

Metric                              | OpenAI 1v1 Bot            | Bootstrapped Agent
Experience                          | 300 years (per day)       | 5 min human
Bootstrap using game client         | N/A                       | 5-20
CPU                                 | 60,000 CPU cores on Azure | 1 local CPU
GPU                                 | 256 K80 GPUs on Azure     | N/A
Size of observation                 | 3.3kB                     | 0.5kB
Observations per second of gameplay | 10                        | 33
TABLE I: Comparison between OpenAI 1v1 Dota 2 Bot [17] training metrics and training an agent via bootstrap from human demonstrations in a proprietary open-world game. While the objectives of the training are different, the environments are comparable and the open-world game considered here is even more complex in terms of the state and action spaces. The metrics below highlight the details of practical training of agents during the game development cycle.

Such a bootstrap process is easy to parallelize since we can have multiple simulations running without the need to cross-interact as in some learning algorithms like A3C [53]. The generated augmented data set is used to train a DNN that predicts the next action from the already observed state-action pairs. Due to partial observability, the low dimensionality of the feature space results in fast training in a wide range of model architectures, allowing a quick experimentation loop. We converged on a simple model with a single “wide” hidden layer for motion control channels and a DNN model for discrete channels responsible for turning on/off actions like sprint, firing, climbing. The approach shows promise even with many yet unexplored opportunities to improve its efficiency.
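The bootstrap step can be sketched as follows. This is a toy illustration only: `toy_policy` stands in for the Markov agent and `toy_step` for the game client, both hypothetical, and the state is a single number rather than the 20 game features. The point is the shape of the pipeline: independent rollouts, each converted into (recent state-action pairs, current state, next action) samples for a supervised learner.

```python
import random


def toy_policy(state, rng):
    """Stand-in for the Markov agent: move toward the origin, fire when close."""
    if abs(state) < 1.0:
        return "fire"
    return "left" if state > 0 else "right"


def toy_step(state, action, rng):
    """Stand-in for the game client: a 1-D position with a little noise."""
    move = {"left": -1.0, "right": 1.0, "fire": 0.0}[action]
    return state + move + rng.uniform(-0.1, 0.1)


def rollout(policy, step, start, length, rng):
    """Let the policy play for `length` timesteps and record (state, action) pairs."""
    state, trajectory = start, []
    for _ in range(length):
        action = policy(state, rng)
        trajectory.append((state, action))
        state = step(state, action, rng)
    return trajectory


def to_supervised(trajectory, window):
    """Build (context, current_state, next_action) samples from one trajectory."""
    samples = []
    for t in range(window, len(trajectory)):
        context = trajectory[t - window:t]  # the already observed state-action pairs
        current_state, next_action = trajectory[t]
        samples.append((context, current_state, next_action))
    return samples


def bootstrap_dataset(n_rollouts, length, window, seed=0):
    """Rollouts share nothing, so each one could run in its own process or machine."""
    data = []
    for i in range(n_rollouts):
        rng = random.Random(seed + i)
        start = rng.uniform(-5.0, 5.0)
        trajectory = rollout(toy_policy, toy_step, start, length, rng)
        data.extend(to_supervised(trajectory, window))
    return data


dataset = bootstrap_dataset(n_rollouts=4, length=20, window=3)
```

Because each rollout uses its own random generator and never reads another rollout's data, the loop in `bootstrap_dataset` is a direct candidate for parallel execution.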

V-A3 Interactive learning

While bootstrapping can help train a better model, the quality of the final trained model is still limited by the amount of information contained in the relatively small set of demonstrations. Hence, it is highly desirable to obtain further information from the game designer, particularly in unexplored and underexplored parts of the state space where the trained model has little hope of generalizing.

We find that such a direct approach provides an opportunity to make the entire process of providing the demonstrations interactive. The interactivity entirely comes from the compound hierarchical nature of the initial ensemble of models, making it easy to add new ones to the set of already existing sub-models. In practical terms, it enables adding new demonstrations to directly override or augment already recorded ones.

While the spirit of the idea is similar to that of DAgger [54], providing labels on the newly generated samples is more time consuming than providing new demonstrations. (Footnote 8: We anticipate that both techniques could be combined to provide better sample efficiency.) The interactivity could also be used to support newly added features or to update the existing model otherwise. The designer can directly interact with the game, select a particular moment where a new demonstration is required, adjust the initial location of the character object, and run a short demonstration without reloading the game. The interactivity eliminates most of the complexity of the agent design process and brings down the cost of gathering data from under-explored parts of the state space.
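One way to picture how a new demonstration overrides or augments earlier ones is a stack of sub-models queried from newest to oldest. The sketch below is a simplification under our own assumptions: contexts are plain strings, each sub-model is a dictionary, and the class names are hypothetical.

```python
class DemoModel:
    """A sub-model built from one demonstration: a context -> action look-up."""

    def __init__(self, pairs):
        self.table = dict(pairs)

    def lookup(self, context):
        return self.table.get(context)


class InteractiveEnsemble:
    """Sub-models are kept in recording order and queried newest first."""

    def __init__(self):
        self.models = []

    def add_demonstration(self, pairs):
        # Adding a sub-model requires no retraining of the existing ones
        self.models.append(DemoModel(pairs))

    def act(self, context, default=None):
        for model in reversed(self.models):
            action = model.lookup(context)
            if action is not None:
                return action  # a newer demonstration overrides older ones
        return default  # no demonstration covers this context yet


agent = InteractiveEnsemble()
agent.add_demonstration([("near_target", "approach")])
# The designer takes the controller back and records a more refined behavior
agent.add_demonstration([("near_target", "attack"), ("far_from_target", "approach")])
```

Here the second demonstration overrides the action for `near_target` and augments the model with a previously uncovered context, while contexts that return the default mark where further demonstrations are needed.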

A designer can start with the most basic gameplay, e.g., approach the target, then add more elements, e.g., attack the target, followed by more sophisticated gameplay. To provide additional feedback to the designer, we provide a (near) real-time chart showing how well the current model generalizes to the current conditions in the game environment.

An example of such an interactive training chart is reported in Figure 4. A prolonged period of training may increase the size of the model, with many older demonstrations already irrelevant and not used for inference, but still contributing to the model size. Instead of using rule-based compression of the resulting model ensemble, we consider training a DNN from the ensemble of Markov models via a novel bootstrap, using the game itself as the way to compress the model representation and strip off obsolete demonstration data. Using the proposed approach, we train an agent that satisfies the design needs in only a few hours of interactive training.

Table I illustrates the computational resources that solving this problem with the outlined 3-step process required as compared to training 1v1 agents in Dota 2 [17]. While we acknowledge that the goal of our agent is not to play optimally against the opponent and win the game, we observe that using model-based training augmented with expert demonstrations to solve the Markov decision process, in a game even more complex than Dota 2, results in huge computational savings compared to an optimal reinforcement learning approach.

V-B Strategic Gameplay in Team Sports

Our last case study involves a team sports game, where the designer’s goal is to train agents that can learn strategic teamplay. At first glance, the problem lends itself to a multi-agent learning (MAL) framework. However, MAL is far more complicated than training a single agent, which has been the subject of all of the previous case studies, partly because the larger state and action spaces require far more computational resources. More importantly, MAL also suffers from non-convergence due to the breakdown of the Markovity of the decision process, as the environment for each single agent is changing while the rest of the agents update their policies [55, 56].

To simplify the problem, we employ a hierarchical approach, where we assume multiple levels of the problem abstraction. At the lowest level, the agent actions and movements should resemble those of actual human players. At the highest level, the agents should learn how to follow different defense and offense strategies. In the mid-level, the agents should learn to coordinate their movements with each other, e.g., to complete successful passes, or to shoot toward the opponent’s goal. In this section, we leave out the details of the low-level tactics and only focus on the high-level strategic gameplay. We also move away from MAL by training agents one at a time within a team, and letting them blend into the overall strategic gameplay. We remind the reader that the goal is to provide a viable approach to solving the problem with a reasonable amount of resources.

Fig. 5: A screenshot of the simple team sports simulator (STS2). The red agents are home agents attempting to score at the upper end and the white agents are away agents attempting to score at the lower end. The highlighted player has possession of the ball and the arrows demonstrate a pass/shoot attempt.

Our training takes place on a mid-level simulator, which we call simple team sports simulator (STS2). (Footnote 9: We intend to release the STS2 gameplay environment as an open-source package.) A screenshot of STS2 gameplay is shown in Fig. 5. The simulator embeds the rules of the game and the physics at a high level, abstracting away the low-level tactics. The simulator supports NvN matches for any positive integer N. The two teams are shown as red (home) and white (away). Each of the players can be controlled by a human, traditional game AI, or any other learned policy. The traditional game AI consists of a handful of rules and constraints that govern the gameplay strategy of the agents. The STS2 state space consists of the coordinates of the players and their velocities as well as an indicator for the possession of the ball. The action space is discrete and consists of left, right, forward, backward, pass, and shoot. Although the player can hit two or more of the actions together, we do not consider that possibility, to keep the action space small for better scalability.
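For illustration, the state and action spaces described above could be represented as follows. This is a sketch under our own assumptions and does not reflect the actual STS2 interface: the class and field names are hypothetical, coordinates and velocities are taken to be two-dimensional, and possession is encoded as a one-hot indicator over players.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Action(Enum):
    """The six discrete actions; simultaneous button presses are not modeled."""
    LEFT = 0
    RIGHT = 1
    FORWARD = 2
    BACKWARD = 3
    PASS = 4
    SHOOT = 5


@dataclass
class PlayerState:
    position: Tuple[float, float]
    velocity: Tuple[float, float]


@dataclass
class GameState:
    home: List[PlayerState]
    away: List[PlayerState]
    # ("home" or "away", player index); None when no player has the ball
    ball_owner: Optional[Tuple[str, int]]

    def to_vector(self) -> List[float]:
        """Flatten into coordinates, velocities, and a one-hot possession indicator."""
        flat: List[float] = []
        for player in self.home + self.away:
            flat.extend(player.position)
            flat.extend(player.velocity)
        slots = [("home", i) for i in range(len(self.home))]
        slots += [("away", i) for i in range(len(self.away))]
        flat.extend(1.0 if self.ball_owner == slot else 0.0 for slot in slots)
        return flat
```

Under these assumptions an NvN match yields a vector of 2N x 4 + 2N entries, i.e., 20 entries for a 2v2 game, which grows only linearly with the number of players.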

In the rest of this section, we report our progress toward applying deep RL in the STS2 environment.

V-B1 Single agent in a 1v1 game

As the simplest first experiment, we consider training an agent that learns to play against the traditional game AI in a 1v1 match. We start with a sparse reward function of ‘+1’ for scoring and ‘-1’ for being scored against. We used DQN [12], Rainbow [35], and PPO [16] to train agents that would replace the home team (player). DQN shows the best sign of learning useful policies after an equivalent of 5 years of human gameplay experience. The gameplay statistics of the DQN agent are reported in Table II. As can be seen, the DQN agent was losing roughly 1:4 to the traditional AI. Note that we randomize the orientation of the agents at the beginning of each episode, and hence, the agent encounters several easy situations with an open net for scoring. On the other hand, the agent does not learn how to play defensively when the opponent is in possession of the ball. In fact, we believe that a successful strategy for defense is more difficult to learn than that of offensive gameplay.

Statistics | DQN Agent | Trad. Game AI
Score rate | 22%       | 78%
Possession | 36%       | 64%
TABLE II: DQN agent in a 1v1 match against a traditional game AI agent with a sparse ‘+/-1’ reward for scoring.

Next, we shape the rewarding mechanism with the goal of training agents that also learn how to play defensively. In addition to the ‘+/-1’ scoring reward, we reward the agent with ‘+0.8’ for gaining the possession of the ball and ‘-0.8’ for losing it. The statistics of the DQN agent are reported in Table III. In this case, we observe that the DQN agent learns to play the game with an offensive style of chasing the opponent down, gaining the ball, and attempting to shoot. Its score rate as compared to the traditional game AI is 4:1, and it dominates the game.

Statistics | DQN Agent | Trad. Game AI
Score rate | 80%       | 20%
Possession | 65%       | 35%
TABLE III: DQN agent in a 1v1 match against a traditional game AI agent with a sparse ‘+/-1’ reward for scoring and a ‘+/-0.8’ reward for gaining/losing the possession of the ball.
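The shaped 1v1 reward described above can be written as a small function of the events in a timestep. This is a minimal sketch: the function name and the boolean event flags are hypothetical, and we assume the simulator can report these events per step.

```python
def shaped_reward(scored, conceded, gained_ball, lost_ball, possession_bonus=0.8):
    """Per-step reward: '+/-1' for scoring events plus a possession term."""
    reward = 0.0
    if scored:
        reward += 1.0
    if conceded:
        reward -= 1.0
    if gained_ball:
        reward += possession_bonus
    if lost_ball:
        reward -= possession_bonus
    return reward
```

Setting `possession_bonus=0.0` recovers the sparse ‘+/-1’ reward of the first experiment, so both 1v1 setups differ only in this one parameter.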

We repeated this experiment using PPO and Rainbow as well. We observe that the PPO agent’s policy converges quickly to a simple one. When it is in possession of the ball, it wanders around in its own half without attempting to cross the half-line or to shoot until the game times out. This happens because the traditional game AI is programmed not to chase the opponent in their half when the opponent is in possession of the ball, and hence, the game goes on as described until timeout with no scoring on either side. PPO has clearly reached a local minimum in the space of policies, which is not unexpected as it is optimizing the policy directly. Finally, the Rainbow agent does not learn a useful policy for either offense or defense.

As the last 1v1 experiment, we train a PPO agent against the abovementioned DQN agent with exactly the same reward function. The gameplay statistics are reported in Table IV. We observe that the PPO agent is no longer stuck in a local optimum policy, and it dominates the DQN agent with a score rate of 6:1. Notice that this is not a fair comparison, as the DQN agent was only trained against the traditional game AI agent and had not played against the PPO agent, whereas the PPO agent is trained directly against the DQN agent. While the PPO agent dominates the score rate, we also observe that the game is much more even in terms of the possession of the ball.

Statistics | PPO Agent | DQN Agent
Score rate | 86%       | 14%
Possession | 55%       | 45%
TABLE IV: PPO agent in a 1v1 match against a DQN agent, both with a sparse ‘+/-1’ reward for scoring and a ‘+/-0.8’ reward for gaining/losing the possession of the ball.

V-B2 Single agent in a 2v2 game

Having gained some confidence with single agent training, as the simplest multi-agent experiment, we consider training a single agent in a 2v2 game. We let the traditional game AI be in control of the opponent players as well as the teammate player. The first experiment entails a ‘+/-0.8’ team reward for any player in the team gaining/losing the ball, in addition to the ‘+/-1’ reward for scoring. The agent does not learn a useful defensive or offensive policy and the team loses overall.

In the second experiment, we change the rewarding mechanism to ‘+/-0.8’ individual reward for the agent gaining/losing the ball. This seems to turn the agent into an offensive player that chases the opponent down, gains the ball, and attempts to shoot. The team statistics for this agent are shown in Table V. We observe that the agent has learnt an offensive gameplay style where it scores most of the time.

Statistics | DQN Agent | Teammate | Opponent 1 | Opponent 2
Score rate | 54%       | 20%      | 13%        | 13%
Possession | 30%       | 18%      | 26%        | 26%
TABLE V: Offensive DQN agent in a 2v2 match against two traditional game AI agents and playing with a traditional game AI agent as teammate, with a sparse ‘+/-1’ reward for scoring and a ‘+/-0.8’ individual reward for gaining/losing the possession of the ball.

While the team is winning in the previous case, we observe that the teammate is not participating much in the game, with even less possession of the ball than the opponent players. Next, we explore training an agent that can assist the teammate in scoring and possessing the ball. We add another ‘-0.8’ teammate reward, which occurs whenever the teammate loses the ball. The gameplay statistics of this team are reported in Table VI. In terms of gameplay, we observe that the agent spends more time defending its own goal and passes the ball to the teammate to score when it gains possession of the ball.

Statistics | DQN Agent | Teammate | Opponent 1 | Opponent 2
Score rate | 20%       | 46%      | 17%        | 17%
Possession | 36%       | 22%      | 21%        | 21%
TABLE VI: Defensive DQN agent in a 2v2 match against two traditional game AI agents and playing with a traditional game AI agent as teammate, with a sparse ‘+/-1’ reward for scoring and a ‘+/-0.8’ individual reward for gaining/losing the possession of the ball and a ‘-0.8’ teammate reward when the teammate loses the possession of the ball to the opponent team.
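The three 2v2 reward variants differ only in whose possession events are credited to the learning agent. The sketch below makes this explicit. The event encoding and mode names are hypothetical, and we assume the ‘+/-1’ scoring reward is given for any goal by or against the team, which the text does not spell out.

```python
POSSESSION = 0.8


def reward_2v2(kind, player, mode):
    """Reward for the learning agent given one home-team event.

    kind:   "goal_for", "goal_against", "gain", or "loss"
    player: "agent" or "teammate", the home player involved in a gain/loss
    mode:   "team", "individual", or "individual_plus_teammate"
    """
    if kind == "goal_for":
        return 1.0
    if kind == "goal_against":
        return -1.0
    sign = {"gain": 1.0, "loss": -1.0}[kind]
    if mode == "team":
        # Any home player gaining/losing the ball is rewarded/penalized
        return sign * POSSESSION
    if mode == "individual":
        # Only the agent's own possession events count
        return sign * POSSESSION if player == "agent" else 0.0
    if mode == "individual_plus_teammate":
        if player == "agent":
            return sign * POSSESSION
        # Extra penalty when the teammate loses the ball; no bonus for teammate gains
        return -POSSESSION if kind == "loss" else 0.0
    raise ValueError(f"unknown mode: {mode}")
```

In this framing the offensive agent of Table V corresponds to the `individual` mode and the defensive agent of Table VI to `individual_plus_teammate`, so a single added penalty term is what separates the two styles.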

V-B3 Two agents trained separately in a 2v2 game

After successful training of a single agent in a 2v2 game, we train a second agent in the home team while reusing one of the previously trained agents as the teammate. For this experiment, we choose the DQN agent with an offensive gameplay style from the previous set of experiments as the teammate, and we train the new agent using exactly the same reward function as the offensive DQN agent. The statistics of the gameplay for the two agents playing together against the traditional game AI agents are shown in Table VII. While the second agent is trained with the same reward function as the first one, it is trained in a different environment, as the teammate is now the offensive DQN agent trained in the previous experiment rather than the traditional game AI agent. As can be seen, the second agent now becomes defensive and is more interested in protecting the net, gaining the possession of the ball back, and passing it to the offensive teammate.

Statistics | DQN 2 | DQN 1 | Opponent 1 | Opponent 2
Score rate | 50%   | 26%   | 12%        | 12%
Possession | 28%   | 22%   | 25%        | 25%
TABLE VII: Two DQN agents in a 2v2 match against two traditional game AI agents, with a sparse ‘+/-1’ reward for scoring and a ‘+/-0.8’ individual reward for gaining/losing the possession of the ball.

As the second 2v2 experiment, we train two PPO agents in the exact same manner as we trained the DQN agents in the previous experiment. We observe a similar trait in the role of the agents as offensive and defensive. Then we let the PPO team play against the DQN team. We observe that the PPO team defeats the DQN team by a slight edge, 55:45. While this experiment is a fair comparison between PPO and DQN, we emphasize that these teams are both trained against the traditional game AI agents and are now both playing in a new environment. In a sense, this is measuring how generalizable the learned policy is. Another point we mention is that the gameplay between the agents is interesting to watch.

Statistics | DQN 2 | DQN 1 | Opponent 1 | Opponent 2
Score rate | 20%   | 46%   | 17%        | 17%
Possession | 36%   | 22%   | 21%        | 21%
TABLE VIII: Two DQN agents in a 2v2 match against two traditional game AI agents, with a sparse ‘+/-1’ reward for scoring and a ‘+/-0.8’ individual reward for gaining/losing the possession of the ball.

We repeated all of these experiments using Rainbow agents as well, and they failed all of the experiments. We are still investigating which addition in Rainbow is resulting in the failure of the algorithm in the team sports environment.

V-B4 Two agents trained simultaneously in a 2v2 game

Finally, we consider a case where a single meta-policy controls two home agents at the same time. We tried multiple reward functions, including rewarding the team by ‘+1’ for scoring, ‘-1’ for being scored against, ‘+0.8’ for gaining the possession of the ball, and ‘-0.8’ for losing the possession of the ball. We observed that neither algorithm learned a useful policy in this case. We believe that with a higher-level planner on top of the reinforcement learning, we should be able to train the agents to exhibit teamplay, but that remains for future investigation.
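One factor that may make the meta-policy setting harder is the growth of the action space. As a sketch, assuming the meta-policy picks one action per controlled agent at every step (the text does not specify the encoding), the joint action space is the Cartesian product of the individual ones. The helper names below are hypothetical.

```python
from itertools import product

ACTIONS = ("left", "right", "forward", "backward", "pass", "shoot")

# Every pair (action of agent 1, action of agent 2), in lexicographic order
JOINT_ACTIONS = list(product(ACTIONS, repeat=2))


def encode(action_1, action_2):
    """Index of a joint action, as a single discrete output of the meta-policy."""
    return ACTIONS.index(action_1) * len(ACTIONS) + ACTIONS.index(action_2)


def decode(index):
    """Recover the per-agent actions from a joint action index."""
    return JOINT_ACTIONS[index]
```

With six actions per agent, the meta-policy faces 36 joint actions for two agents, and the count grows exponentially with the number of controlled agents.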

VI Concluding Remarks

In this paper we have described our approach to game-playing agents that considers the player experience over the agent’s ability to win. This is a more challenging but more beneficial route in understanding how players interact with games, and how to modify the games to change and improve player interaction. We describe our training pipeline that is designed to train agents in mobile and HD games. We then describe four case studies on our ongoing efforts for training agents. The first two focus on playtesting and the latter two focus on game AI.

In the first case study, we consider a mobile game. We show that the game dynamics could be fully extracted in a lightweight model of the game. This removes the need for learning, and in particular reinforcement learning, and the gameplay experience could be modeled using much simpler planning methods. The game designer used this model to tune the game parameters for balancing the gameplay experience across different career modes of the game. Each career mode is designed to align with the interests of a particular group of players, and the goal is for all interest groups to have a similar game progression. While it is not always possible to extract a full model of the game, especially in more complex games with complicated physics, it is usually possible to extract models for simpler events within the game and embed them in the machine learning loop to speed up the learning process. We use this technique whenever it is suitable.

In the second case study, we consider another mobile game. This game has very complex state and action spaces rendering a direct application of reinforcement learning impractical. In this game, the player has to learn to utilize their resources strategically for fast progression. In particular, a greedy progression in the early stages of the game may hinder the player from fast progression in the future stages. The game designer’s goal is to understand these trade-offs for an expert player. This problem conforms to the usual reinforcement learning formulation with a ‘+1’ reward for reaching the goal state, and ‘-1’ reward for attempting an invalid action. We break down the complexity of the state and action spaces by removing much of the unnecessary information for achieving the desired sub-task. We provide this tool to the game designer for balancing the resources that are provided to and required of each player at every state. While reinforcement learning on the game in its entirety is usually impractical, there are sub-problems where we can define clear rewarding mechanisms that encourage a desirable behavior. In these cases, we resort to reinforcement learning for solving these sub-problems for balancing the game design.

In the third case study, we consider an open-world HD game. The game designer’s goal is to create a non-player character that can follow a certain style. Further, it is desirable to conclude the entire training process in just a few hours. In this case, we start with a few minutes of training samples from the game designer. The training data is insufficient for training a complex model, such as a deep neural network. Thus, we resort to training a simple Markov model for the agent behavior. The Markov agent is capable of exhibiting the style the designer intended; however, it is far from ideal. We bootstrap the Markov agent to obtain more training samples. To augment the information on which we train a model, we make the training process interactive. The game designer is given the capability to take the game controller back from the agent at any time and provide more training data. After a few hours of interactive sample acquisition and training, we obtain a deep neural network model, which exhibits the behavior desired by the game designer. We believe this interactive training loop is an essential part of our agent training pipeline.

In the last case study, we consider a team sports game. The goal is to train agents that play like humans, both in terms of tactics and strategies. We focus on strategy using a high-level simulator, called simple team sports simulator (STS2). We show that sparse ‘+/-1’ rewards for scoring do not suffice to train agents that learn to play both offensively and defensively. We therefore introduce an additional possession reward of ‘+/-0.8’ for gaining/losing possession of the ball. With a variety of reward functions, we show that we can use reinforcement learning on this abstraction level to train agents that can play with different styles, such as offensive and defensive.

To summarize, in each of these four case studies, we give a simple example of how our approach to game-playing agents can yield valuable insight, and even drive the construction of the game itself. We believe that this is just the beginning of a long and fruitful exploration into experience-oriented game-playing agents that will not only deliver radical improvements to the games we play, but will be another major milestone on the roadmap of AI.


The authors are thankful to EA Digital Platform – Data & AI, EA Sports, and other game team partners for their support.