The paper accepted for ICML HILL 2019 workshop and the repository goes public as of June 3, 2019.
There is a high demand for high-quality Non-Player Characters (NPCs) in video games. Hand-crafting their behavior is a labor intensive and error prone engineering process with limited controls exposed to the game designers. We propose to create such NPC behaviors interactively by training an agent in the target environment using imitation learning with a human in the loop. While traditional behavior cloning may fall short of achieving the desired performance, we show that interactivity can substantially improve it with a modest amount of human efforts. The model we train is a multi-resolution ensemble of Markov models, which can be used as is or can be further "compressed" into a more compact model for inference on consumer devices. We illustrate our approach on an example in OpenAI Gym, where a human can help to quickly train an agent with only a handful of interactive demonstrations. We also outline our experiments with NPC training for a first-person shooter game currently in development.READ FULL TEXT VIEW PDF
The paper accepted for ICML HILL 2019 workshop and the repository goes public as of June 3, 2019.
Autonomous agents in video games are called Non-Player Characters (NPCs). They are an essential element of gameplay in ever increasingly complex game environments. The traditional hand-crafting methods of their creation require a substantial amount of domain knowledge, knowledge engineering, scripting, intuition, and testing. Meanwhile, the requirements for the NPCs are growing in two main dimensions: scale and depth. The scale calls for various types of characters present in the game to create an illusion of a diversely populated virtual universe. The depth is about making the characters more believable, human-like and engaging. The existing approach to hand-crafting behaviors of NPCs is hard to scale in both dimensions, calling for alternative approaches. The problem of training NPCs in video games may appear highly specific but it shares a lot of challenges with the problem of real-time interactive Machine Learning for fast training and serving custom models in a variety of contexts, e.g.,(Crankshaw et al., 2015).
In this paper, we consider an idealized NPC creation workflow, where a game designer interactively creates NPCs from demonstrations. The designer would produce some examples that leads to training of a model. While the model is running, the designer takes control of the character and produces more demonstrations of the desired behavior until NPCs are trained to the designer’s satisfaction. We show that we can make a significant step towards such idealized workflow using simple yet effective techniques based on an ensemble of multi-resolution Markov models.
Using human demonstrations helps training artificial agents in many applications and in particular in video games (Gudmundsson et al., 2018), (Harmer et al., 2018), (Stanley et al., 2006). Off-policy human demonstrations are easier to use and are abundant in player telemetry data. Supervised behavior cloning, imitation learning (IL), apprenticeship learning (e.g., (Bogdanovic et al., 2015)) and generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016) allow for the reproduction of a teacher style and achievement of a reasonable level of performance in the game environment. Unfortunately, an agent trained using IL is usually unable to effectively generalize to previously underexplored states or to extrapolate stylistic elements of the human player to new states.
Direct inclusion of a human in the control loop can potentially alleviate the problem of limited generalization. Dataset Aggregation, DAGGER (Ross et al., 2011), allows for an effective way of doing that when a human provides consistent optimal input, which may not be realistic in many environments. Another way of such inclusion of online human input is shared autonomy, which is an active research area with multiple applications, e.g., (Zhou et al., 2018), (Rueben & Smart, 2015), etc. The shared autonomy approach (Reddy et al., 2018) naturally extends to policy blending (Dragan & Srinivasa, 2013) and allows to train DQN agents cooperating with a human in complex environments effectively. The applications of including a human in the training loop to the fields of robotics and self-driving cars are too numerous to cover here, but they mostly address the optimality aspect of the target policy while here we also aim to preserve stylistic elements of organic human gameplay.
In this section, we introduce the notation and the necessary background on the building blocks used in the interactive procedure outlined in the next section.
Markov Decision Process:
Following the established treatment of training artificial agents, we place the problem into the framework of Markov Decision Processes (MDPs)(Sutton & Barto, 1998). While in many cases the actual observation space may not contain the full state space, the designed algorithms do not distinguish between state and the corresponding observation available to the agent. We rely on an extended state space (defined below), which helps to mitigate the partial observability and preserve the stylistic element of human demonstrations.
The interaction of the agent and the environment takes place at discrete momentswith the value of trivially observable by the agent. The agent, after receiving an observation at time , can take an action from the set of allowed actions using policy . Executing an action in the environment results in a new state and a reward also observed by the agent. The primary objective of training a policy in RL is the maximization of cumulative rewards but they are inconsequential for the model we build; hence we drop them from further discussion. In this paper, we consider the episode-based environment, i.e., after reaching a certain condition, the execution of the described state-action loop ends. A complete episode is a sequence . The fundamental assumption regarding the described decision process is that it has the Markov property.
Extended state: Besides the most recent action taken before time , i.e., action , we also consider a recent history of the past actions, where , , whenever it is defined in an episode . For , we define as the empty sequence. We augment the directly observed state with the action history , to obtain an extended state .
The purpose of including the action history is to better capture additional information (e.g., stylistic features) from human controlling the input during interactive demonstrations. An extended policy , which operates on the extended states , is useful for modeling human actions in a manner similar to
-grams text models in natural language processing (NLP) (e.g.,(Kamiński, 2016), (Wright, 2017), (Andresen & Zinsmeister, 2017)). Of course, the analogy with -gram models in NLP works only if both state and action spaces are discrete. We will address this restriction in the next subsection using multi-resolution quantization.
For a discrete state-action space and various
, we can compute probabilitiesof transitions occurring in (human) demonstrations and use them as a Markov model of order of (human) actions. We say that the model is defined on an extended state if the demonstrations contain at least one occurrence of . When a model is defined on , we can use to sample the next action from all ever observed next actions in state . Hence, defines a partial stochastic mapping from extended states to action space .
Stacked Markov models: We call a sequence of Markov models a stack of models. A (partial) policy defined by computes the next action at a state as described in Algorithm 1. Such policy performs a simple behavior cloning. The policy is partial since it may not be defined on all possible extended states and needs a fallback policy to provide a functional agent acting in the environment.
Note that it is possible to implement sampling from a Markov model using an complexity operation with hash tables. Hence, when properly implemented, Algorithm 1 is very efficient and suitable for real-time execution in a video game or other interactive application where expected inference time has to be on the scale of 1 ms or less 111A modern video game runs at least at 30 frames per second with lots computations happening during about 33 ms allowed per frame, drastically limiting the “budget” allocated for inference..
Quantization: Quantization (aka discretization) allows us to work around the limitation of discrete state-action space enabling the application of the Markov Ensemble approach to environments with continuous dimensions. Quantization is commonly used in solving MDPs (Wiering & van Martijn Otterlo, 2012) and has been extensively studied in the signal processing literature (Oppenheim & Schafer, 1975), (Gersho & Gray, 1991). Using quantization schemes that have been optimized for specific objectives can lead to significant gains in model performance, improving various metrics vs. ad-hoc quantization schemes, e.g., (Wiering & van Martijn Otterlo, 2012), (Pagès et al., 2004).
Instead of trying to pose and solve the problem of optimal quantization, we use a set of quantizers covering a range of schemes from coarse to fine. At the conceptual level, such an approach is similar to multi-resolution methods in image processing, mip-mapping and Level-of-Detail (LoD) representations in computer graphics (Hughes et al., 2013). The simplest quantization is a uniform one with step :
For illustration purposes, it is sufficient to consider only the uniform quantization
. In practice, most variables have naturally defined limits which are at least approximately known. Knowing the environment scale gives an estimate of the smallest step sizeat which we will have complete information loss, i.e., all observed values map to a single bin. For each continuous variable in the state-action space, we consider a sequence of quantizers with decreasing step size , , which naturally gives a quantization sequence for the entire state-action space, provided is fixed across the continuous dimensions. To simplify notation, we collapse the sub index and write to stand for . For more general quantization schemes, the main requirement is the decreasingly smaller reconstruction error for in comparison to .
For an episode , we compute its quantized representation in an obvious component-wise manner:
which defines a multi-resolution representation of the episode as a corresponding ordered set of quantized episodes, where
is the vector version of quantization.
In the quantized Markov model , which we construct from the episode , we compute extended states using the corresponding quantized values. Hence, the extended state is . Further, we define the model to contain probabilities for the original action values. In other words, we do not rely on the reconstruction mapping to recover action but store the original actions explicitly. In practice, continuous action values tend to be unique and the model samples from the set of values observed after the occurrences of the corresponding extended state. Our experiments show that replaying the original actions instead of their quantized representation provides better continuity and natural true-to-the-demonstration look of the cloned behavior.
Markov Ensemble: Combining together stacking and multi-resolution quantization of Markov models, we obtain Markov Ensemble as an array of Markov models parameterized by the model order and the quantization schema :
The policy defined by the ensemble (2) computes each next action following Algorithm 2. The Markov Ensemble technique, together with the policy defined by it, are our primary tools for cloning behavior from demonstrations.
Note, that with the coarsest quantization present in the multi-resolution schema, the policy should always return an action sampled using one of the quantized models, which at the level always finds a match. Hence, such models always “generalize” by resorting to simple sampling of actions when no better match found in the observations. Excluding too coarse quantizers and Markov order 0 will result in executing default policy in the Algorithms 1 and 2
. A possible default policy returns a random action sampled from the action space as in the provided OpenAI Gym examples. The default policy in a video game can be as simple (return an idle action) or more sophisticated and use some heuristics preventing the agent from erratic behavior.
Interactive Training of Markov Ensemble: If the environment allows a human to override the currently executing policy and record new actions (demonstrations), then we can generate a sequence of demonstrations produced interactively. For each demonstration, we construct a new Markov Ensemble and add it to the sequence (stack) of already existing models. The policy based on these models consults with the latest one first. If the consulted model fails to produce an action, the next model is asked, etc. until there are no other models or one of them returns action sampled according to Algorithm 2. Thanks to the sequential organization, the latest demonstrations take precedence of the earlier ones, allowing correcting previous mistakes or adding new behavior for the previously unobserved situations. We illustrate the logic of such an interaction with the sample git repository (Borovikov, 2019). The computational costs for each ensemble, as already noted, is small constant while the overall complexity grows linearly with the number of demonstrations, allowing sufficiently long interaction of a user with the environment and training a more powerful policy. In our case studies, we show that often even a small number of strategically provided demonstrations results in a well-behaving policy.
We consider two simple OpenIA Gym examples first and then discuss a more practical application of the proposed interactive training to a proprietary modern open-world first person shooter video game.
OpenAI Gym, classic control: We discuss Mountain Car and Lunar Lander to illustrate the approach with working Python code, available from the repository (Borovikov, 2019). The first example, Mountain Car, is nearly trivial for a human to solve. Also, it allows the evaluation of the quality of individual actions: the “good” actions add mechanical energy to the system. The plot on Figure 1 shows that after just a few episodes that included some demonstrations from a human, the model can solve the environment on its own quite often. Its performance keeps improving with additional demonstrations even if some of the demonstrated actions may be sub-optimal.
Lunar Lander poses more challenges to a human. It requires some level of skills from a human player to outperform the random agent substantially. However, despite the sub-optimality of human demonstrations, they significantly increase performance over that one of the random agent.
An Open World Video Game: For the practical application, we use an open world first-person shooter game, with OpenAI-style instrumentation for extracting game states and providing back actions. The game we explored is still in development, making learning from frame-buffer infeasible due to frequent changes of the visuals. The instrumentation, on the other hand, allows extracting relevant game state with relative ease. We convert the state into the features more suitable for training a general model. In particular, instead of the absolute coordinates and orientation of the player and non-player characters we use their relative to each other location and orientation. Additionally, we included the presence of other relevant features like line of site, ammo, health, ground speed, animation type (e.g., crouched, sprinting, jumping), collision state, etc. The quantization schemes come naturally from the gameplay design, e.g., for the distance to the adversary we defined ranges like “too far to aim”, “can shoot but the damage is negligible”, etc., all the way down to “melee weapon range” using game tuning data and personal experience of playing the game.
For the experiments reported in Figure 3, we trained only one type of gameplay - aggressive approach and attack of the adversary. The untrained model would not attempt to approach and attack without additional fallback logic in the control loop. After only about 40 seconds of human training, the Markov Ensemble model learns attack skills to a high degree of efficiency and eliminates the need for the additional logic most of the time. Here, efficiency refers to how frequently the Markov Ensemble is able to infer an action for the current extended state and not how competitive the performance of those actions is. Similarly, we can train an agent to negotiate contextual obstacles, use “goodies” (e.g., medical and ammo packs found in the environment) and interact with other types of objects in the game.
Since loading and inference time of the stacked model grows as with the number of demonstrations , it may be beneficial to apply DAGGER (Ross et al., 2011) to build a more compact aggregate model. Alternatively, we can use bootstrap to generate additional training data with no human in the loop to train an aggregate DNN model from the resulting data (Borovikov & Beirami, 2019). When using ANNs, aggregated models would load and execute inference at the speeds compatible with the real-time performance of the game.
In this paper, we show that including a human in the learning loop can result in practical training of even such a simple model like our Markov Ensemble model. Quantization provides a simple yet efficient way of generalization for the models trained only on a limited number of human inputs. The process of having a human actively monitor the model and take over control can train practically useful models like the one we discuss for a first-person shooter game. Our future work includes meta-parameters tuning for the ensemble parameters (quantization schemes, extended state history size) for more effective training and more accurate style reproduction. Also, it will be beneficial to place the proposed techniques into a more general framework of RL and IL.
Human-like playtesting with deep learning.2018 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8, 2018.
Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981.