Measuring and Characterizing Generalization in Deep Reinforcement Learning

12/07/2018 ∙ by Sam Witty, et al. ∙ 12

Deep reinforcement-learning methods have achieved remarkable performance on challenging control tasks. Observations of the resulting behavior give the impression that the agent has constructed a generalized representation that supports insightful action decisions. We re-examine what is meant by generalization in RL, and propose several definitions based on an agent's performance in on-policy, off-policy, and unreachable states. We propose a set of practical methods for evaluating agents with these definitions of generalization. We demonstrate these techniques on a common benchmark task for deep RL, and we show that the learned networks make poor decisions for states that differ only slightly from on-policy states, even though those states are not selected adversarially. Taken together, these results call into question the extent to which deep Q-networks learn generalized representations, and suggest that more experimentation and analysis is necessary before claims of representation learning can be supported.







Deep reinforcement learning (RL) has produced agents that can perform complex tasks using only pixel-level visual input data. Given the apparent competence of some of these agents, it is tempting to see them as possessing a deep understanding of their environments. Unfortunately, this intuition can be shown to be very wrong in some circumstances.

Consider a deep RL agent responsible for controlling a self-driving car. Suppose the agent is trained on typical road surfaces but one day it needs to travel on a newly paved roadway. If the agent operates the vehicle erratically in this scenario, we would conclude that this agent has not formed a sufficiently general policy for driving.

We provide a conceptual framework for thinking about generalization in RL. We contend that traditional notions that separate a training and testing set are misleading in RL because of the close relationship between the experience gathered during training and evaluations of the learned policy.

With this context in mind, we address the question:

To what extent do the accomplishments of deep RL agents demonstrate generalization, and how can we recognize such a capability when presented with only a black-box controller?

We propose a view of generalization in RL based on an agent’s performance in states it couldn’t have encountered during training, yet that only differ from on-policy states in minor ways. Our approach only requires knowledge of the training environment, and doesn’t require access to the actual training episodes. The intuition is simple: To understand how an agent will perform across parts of the state space it could easily encounter and should be able to handle, expose it to states it could never have observed and measure its performance. Agents that perform well under this notion of generalization could be rightfully viewed as having mastered their environment. In this work, we make the following contributions:

Recasting generalization. We define a range of types of generalization for value-based RL agents, based on an agent’s performance in on-policy, off-policy, and unreachable states. We do so by establishing a correspondence between the well-understood notions of interpolation and extrapolation in prediction tasks and the notions of off-policy and unreachable states in RL.

Empirical methodology. We propose a set of practical methods to: (1) produce off-policy evaluation states; and (2) use parameterized simulators and controlled experiments to produce unreachable states.

Analysis case-study. We demonstrate these techniques on a custom implementation of a common benchmark task for deep RL, the Atari 2600 game of Amidar. Our version, Intervenidar, is fully parameterized, allowing us to manipulate the game’s latent state, thus enabling an unprecedented set of experiments on a state-of-the-art deep Q-network architecture. We provide evidence that DQNs trained on pixel-level input can fail to generalize in the presence of non-adversarial, semantically meaningful, and plausible changes in an environment.


Figure 1: A minor change in Amidar game state can dramatically reduce a trained agent’s ability to obtain a large reward. Panels: (a) Default start; (b) Default death; (c) Modified start; (d) Modified death.

In Amidar, a Pac-Man-like video game, an agent moves a player around a two-dimensional grid, accumulating reward for each vertical and horizontal line segment the first time that the player traverses them. An episode terminates when the player makes contact with one of the five enemies that also move along the grid.

Consider the two executions of an agent’s learned policy in Figure 1 starting from two distinct states, default and modified. The default condition places the trained agent in the deterministic start position it experienced during training. The modified condition is identical, except that a single line segment has been filled in. While this exact state could never be observed during training, we would expect an agent that has learned appropriate representations and a generalized policy to perform well. Indeed, with a segment filled in, the agent is at least as close to completing the level as in the default condition. However, this small modification causes the agent to obtain an order of magnitude smaller reward. Importantly, this perturbation differs from an adversarial attack [Huang et al.2017] for deep agents in that it influences the latent semantics of state, not solely the agent’s perception of that state. Our experiments expand on this representative example, enumerating a set of agents and perturbations.

Background and Related Work

We consider the standard RL formulation of an agent sequentially interacting with an environment taking actions at discrete time steps. Formally, this process is modeled as a 4-tuple Markov decision process (MDP) $\langle S, A, P, R \rangle$. The agent starts from a state $s_0 \in S_0$, where $S_0 \subseteq S$ is a set of possible start states, and takes an action $a_t \in A$ at each timestep $t$. The transition function $P(s_{t+1} \mid s_t, a_t)$ is the probability of encountering state $s_{t+1}$ after taking action $a_t$ in state $s_t$. The reward function $R(s_t)$ defines the reward the agent receives when it encounters state $s_t$. The agent’s objective is to maximize the accumulated sum of rewards.

A policy, $\pi$, is a mapping from states to actions, fully characterizing the behavior of an agent. The Q-value of a state–action pair, $Q^{\pi}(s, a)$, is the expected return for following $\pi$ from $s$ after taking action $a$, $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R(s_{t+k+1}) \mid s_t = s, a_t = a\right]$, where $\gamma$ is the discount rate. The value of a state, $V^{\pi}(s)$, is the expected return by following $\pi$ from $s$, $V^{\pi}(s) = Q^{\pi}(s, \pi(s))$. The optimal policy $\pi^{*}$ is the policy that maximizes $V^{\pi}(s)$, which is equivalent to maximizing $Q^{\pi}(s, a)$.

A widely used class of methods for specifying policies in RL is to construct an approximation of the state–action value function, $\hat{Q}(s, a)$, and then select the action that maximizes $\hat{Q}(s, a)$ at each timestep [Sutton and Barto1998]. Deep Q-networks (DQNs) are one such method, using multi-layer artificial neural networks as a function approximator for $Q(s, a)$. We omit discussion of recent advances in network architecture and training for brevity, as they are tangential to the core contributions of our work.
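The two quantities above, the discounted return and greedy action selection over approximate Q-values, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the dictionary-based Q-table, and the example numbers are all assumptions made for the sketch.

```python
from typing import Dict, Hashable, Sequence, Tuple


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma**k * rewards[k], accumulated from the end backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def greedy_action(
    q_values: Dict[Tuple[Hashable, Hashable], float],
    state: Hashable,
    actions: Sequence[Hashable],
) -> Hashable:
    """Pick the action with the largest approximate Q-value in `state`.

    Unseen (state, action) pairs default to 0.0 (an assumption of this sketch).
    """
    return max(actions, key=lambda action: q_values.get((state, action), 0.0))


# Illustrative values only.
example_return = discounted_return([0.0, 1.0, 1.0], gamma=0.5)  # 0 + 0.5 + 0.25
example_action = greedy_action(
    {("s", "left"): 0.1, ("s", "right"): 0.7}, "s", ["left", "right"]
)
```

A DQN replaces the dictionary lookup with a forward pass through a neural network, but the greedy selection step is the same.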

Prior Work on Generalization in RL. Generalization has long been a concern in RL [Sutton and Barto1998]. Somewhat more recently, Kakade (2003) provided a theoretical framework for bounding the amount of training data needed for a discrete state and action RL agent to achieve near optimal reward. Nouri et al. (2009) discuss how to apply the idea of a training/testing split from supervised learning in the context of offline policy evaluation with batch data in RL.

Generalization has been cast as avoiding overfitting to a particular training environment, implying that sampling from diverse environments is necessary for generalization [Whiteson et al.2011, Zhang et al.2018]. Other work has focused on generalization as improved performance in off-policy states, a framework much closer to standard approaches in supervised learning. Techniques such as adding stochasticity to the policy [Hausknecht and Stone2015], having the agent take random steps, no-ops, steps from human play [Nair et al.2015], or probabilistically repeating the agent’s previous action [Machado et al.2017], all force the agent to transition to off-policy states.

These existing methods diversify the training data via exposure to on-policy and off-policy states, but none discuss generalization over states that are logically plausible but unreachable. The prior focus has been on generalization as a method for preventing overfitting, rather than as a capability of a trained agent.

Generalization vs. Memorization. Generalization is often contrasted with memorization and there have been recent efforts to understand their respective roles in deep learning. For instance, with an operationalized view of memorization as the behavior of deep networks trained on noise, Arpit et al. (2017) showed that the same architectures that memorize noise can learn generalized behaviour on real data.

Adversarial Attacks on Deep Networks. While related to adversarial attacks on deep networks, this work differs in two important ways: (1) interventions are not adversarially selected; and (2) interventions operate on latent states, not on the agent’s perception. Mandlekar et al. (2017) attempted to make agents robust to random high-level perturbations on the input. That is, for the domain they explore, the MuJoCo physics simulator, the inputs are at the resolution of human-understandable concepts. Yet, this work does not address questions of alignment between meaningful real-world high-level perturbations and the representations learned by the network.

Recasting Generalization

Using existing notions of generalization, such as held-out set performance, is complicated when applied to RL for two reasons: (1) training data is dependent on the agent’s policy; and (2) the vastness of the state space in real-world applications means it is likely for novel states to be encountered at deployment time.

One could imagine a procedure in RL that directly mimics evaluation on held-out samples by omitting some subset of training data from any learning steps. However, this methodology only evaluates the ability of a model to use data after it is collected, and ignores the effect of exploration on generalization. Using this definition, we could incorrectly claim that an agent has learned a general policy, even if this policy performs well on a very small subset of states. Instead, we focus on a definition that encapsulates the trained agent as a standalone entity, agnostic to the specific data it encountered during training.

Generalization via State-Space Partitioning.

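We partition the universe of possible input states to a trained agent into three sets, according to how the agent can encounter them following its learned policy $\pi$ from $s_0 \in S_0$. Here, $\Pi$ is the set of all policy functions, and $\alpha$, $\delta$, and $\beta$ are some small positive values close to 0. We can think of $\delta$ and $\beta$ as thresholds on estimation accuracy (how far the agent’s value estimate $\hat{V}(s)$ is from the actual value $V^{\pi}(s)$) and optimality performance (how far $V^{\pi}(s)$ is from the optimal value $V^{*}(s)$). The set of reachable states, $S_{\text{reachable}}$, is the set of states that an agent encounters with probability greater than $\alpha$ by following any $\pi' \in \Pi$. (Footnote 1: These definitions can be customized with alternative metrics for value estimation and optimality.)

Definition 1 (Repetition).

An RL agent has high repetition performance, $G_R$, if $\delta > |\hat{V}(s) - V^{\pi}(s)|$ and $\beta > V^{*}(s) - V^{\pi}(s)$, $\forall s \in S_{\text{on}}$. The set of on-policy states, $S_{\text{on}}$, is the set of states that the agent encounters with probability greater than $\alpha$ by following $\pi$ from $s_0 \in S_0$.

Definition 2 (Interpolation).

An RL agent has high interpolation performance, $G_I$, if $\delta > |\hat{V}(s) - V^{\pi}(s)|$ and $\beta > V^{*}(s) - V^{\pi}(s)$, $\forall s \in S_{\text{off}}$. The set of off-policy states, $S_{\text{off}}$, is defined as $S_{\text{reachable}} \setminus S_{\text{on}}$.

Definition 3 (Extrapolation).

An RL agent has high extrapolation performance, $G_E$, if $\delta > |\hat{V}(s) - V^{\pi}(s)|$ and $\beta > V^{*}(s) - V^{\pi}(s)$, $\forall s \in S_{\text{unreachable}}$. The set of unreachable states, $S_{\text{unreachable}}$, is defined as $S \setminus S_{\text{reachable}}$.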

Note that the set of unreachable states, $S_{\text{unreachable}}$, only includes states that are in the domain of the transition function $P$. In other words, specification of the transition function implicitly defines $S$, and by extension $S_{\text{unreachable}}$. This definition is particularly important in the context of deep RL, as the observable input space is typically much larger than $S$. If we wish to demonstrate that an agent generalizes well for Amidar, $P$ would need to be well defined with respect to latent state variables in the Amidar game, such as player and enemy position. If we wish to demonstrate that an agent generalizes well for all Atari games, $P$ would need to be well defined with respect to latent state variables in other Atari games as well, such as the paddle position in Breakout. Given any reasonable bound on the MDP, we would not expect the agent to perform well when exposed to random configurations of pixels. (Footnote 2: Modifications to the transition function itself are better described as transfer learning [Oquab et al.2014].)

Note that a large body of work implicitly uses repetition performance, $G_R$, as a criterion for performance, even though this is the weakest of the generalization capabilities. It is what you get when testing a learned policy in the environment in which it was trained. Some readers may doubt that it is possible to learn policies that extrapolate well. However, Kansky et al. (2017) show that, with an appropriate representation, reinforcement learning can produce policies that extrapolate well under conditions similar to those we describe in this paper. What has not been shown to date is that deep RL agents can learn policies that generalize well from pixel-level input.

We demonstrate a simple example of this state-space partition in Figure 2, a classic GridWorld benchmark. In this environment, the agent begins each episode in a deterministic start position, can take actions right, right and up, and right and down, and obtains a reward when it arrives at the goal state. Note that the agent must move right at every step; therefore, there are three regions that are unreachable from the agent’s fixed start position: the upper left corner, the lower left corner, and the lower left corner after the wall. While unreachable, the upper left corner is a valid state that does not restrict the agent’s ability to reach the goal state and obtain a large reward.

Note that an agent interacting in the GridWorld environment learns tabular Q-values; therefore, we should not expect it to satisfy any reasonable definition of generalization. However, given an adequate exploration strategy, an agent could conceivably visit every off-policy state during training, resulting in $\hat{Q}$ converging to $Q^{*}$. This agent would satisfy the repetition and interpolation definitions, $G_R$ and $G_I$, for arbitrarily small values of $\delta$ and $\beta$. Despite this positive outcome, most observers would not say that this agent “generalizes”, because it lacks any function-approximation method. Only the extrapolation definition, $G_E$, is consistent with this conclusion.
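The partition into on-policy, off-policy, and unreachable states can be made concrete on a small grid. The sketch below is an illustration under stated assumptions, not the layout of Figure 2: it uses a wall-free grid, deterministic transitions (so "encountered with probability greater than $\alpha$" reduces to plain graph reachability), and hypothetical names such as `partition` and `step`.

```python
from typing import Callable, Dict, Set, Tuple

Cell = Tuple[int, int]  # (column, row)

# The three moves described in the text; every action advances one column.
ACTIONS: Dict[str, Cell] = {
    "right": (1, 0),
    "right_up": (1, -1),
    "right_down": (1, 1),
}


def step(cell: Cell, action: str, width: int, height: int) -> Cell:
    """Deterministic transition; a move that leaves the grid keeps the agent in place."""
    d_col, d_row = ACTIONS[action]
    col, row = cell[0] + d_col, cell[1] + d_row
    if 0 <= col < width and 0 <= row < height:
        return (col, row)
    return cell


def partition(
    start: Cell, policy: Callable[[Cell], str], width: int, height: int
) -> Dict[str, Set[Cell]]:
    """Split all grid cells into on-policy, off-policy, and unreachable sets."""
    all_states = {(c, r) for c in range(width) for r in range(height)}

    # Reachable: cells some policy can visit from the start (graph search).
    reachable = {start}
    frontier = [start]
    while frontier:
        cell = frontier.pop()
        for action in ACTIONS:
            nxt = step(cell, action, width, height)
            if nxt not in reachable:
                reachable.add(nxt)
                frontier.append(nxt)

    # On-policy: cells visited by following the given policy from the start.
    on_policy = {start}
    cell = start
    while True:
        nxt = step(cell, policy(cell), width, height)
        if nxt in on_policy:
            break
        on_policy.add(nxt)
        cell = nxt

    return {
        "on": on_policy,
        "off": reachable - on_policy,
        "unreachable": all_states - reachable,
    }


parts = partition(start=(0, 1), policy=lambda cell: "right", width=4, height=3)
```

In this 4-by-3 example with a start in the middle of the first column, the only unreachable cells are the upper left and lower left corners, which mirrors the observation in the text that the agent must move right at every step.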

With the emergence of RL-as-a-service and concerns over proprietary RL technology, evaluators may not have access to an agent’s training episodes, even if they have access to the training environments. In this context, the distinction between interpolation ($G_I$) and extrapolation ($G_E$) is particularly important when measuring an agent’s generalization performance, as off-policy states may have unknowingly been visited during training.

Quantifying Generalization Error.

Generalization in Q-value-based RL can be encapsulated by two measurements for off-policy and unreachable states, one that accounts for the condition $\delta > |\hat{V}(s) - V^{\pi}(s)|$ (whether the agent’s estimate is close to the actual Q-value after executing $\pi$) and another for the condition $\beta > V^{*}(s) - V^{\pi}(s)$ (whether the actual Q-value is close to the optimal Q-value). In our work, we use value estimate error (VEE), $\hat{V}(s) - V^{\pi}(s)$, and total accumulated reward (TAR), respectively.

In most situations, $V^{*}(s)$ is not known explicitly; however, TAR can be used to evaluate the relative generalization ability between two agents, as the optimal value for a given state is fixed by definition.

Unlike TAR, which, when measured in isolation, can depend on the inherent difficulty of $s$, VEE has the advantage of consistency. For example, if an agent is placed in a state from which little reward is attainable, TAR alone does not capture the model’s ability to generalize. VEE may, however, if the agent’s value estimate for that state is far from the actual value. We address this limitation of TAR in our experiment by training benchmark (BM) agents on each of the evaluation conditions.
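The two measurements can be computed from a single evaluation episode. The sketch below assumes that TAR is the undiscounted sum of rewards collected from the evaluation state onward, and that VEE is the signed difference between the agent's value estimate at that state and the discounted return it actually obtained; the function names and the example numbers are hypothetical.

```python
from typing import Sequence


def total_accumulated_reward(rewards: Sequence[float]) -> float:
    """TAR: undiscounted sum of the rewards collected during the episode."""
    return float(sum(rewards))


def value_estimate_error(
    value_estimate: float, rewards: Sequence[float], gamma: float
) -> float:
    """VEE: estimated value minus the discounted return actually realized.

    A positive result means the agent overestimated the value of the state.
    """
    realized = 0.0
    for reward in reversed(rewards):
        realized = reward + gamma * realized
    return value_estimate - realized


# Illustrative numbers only: an estimate of 2.0 against a realized return of 1.25.
episode_rewards = [1.0, 0.0, 1.0]
tar = total_accumulated_reward(episode_rewards)
vee = value_estimate_error(2.0, episode_rewards, gamma=0.5)
```

Keeping the sign of VEE, rather than taking an absolute value, makes it possible to tell overestimation from underestimation when results are aggregated.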

Figure 2: Examples of on-policy, off-policy, and unreachable states in GridWorld.

Empirical Methodology

In this section, we describe specific techniques for producing off-policy states and a general methodology for producing unreachable states based on parameterized simulators and controlled experiments.

Off-Policy States

It is helpful to think of off-policy states as the set of states that a particular agent could encounter, but doesn’t when executing its policy from $s_0 \in S_0$. Framed in this way, the task of generating off-policy states in practice is equivalent to finding agents with policies that differ from the policy of the agent under inspection. We present three distinct categories of alternative policies for producing off-policy states, which we believe to encapsulate a broad set of historical methods for measuring generalization in RL. (Footnote 4: We encourage readers to think critically about whether their strategy for generating off-policy states does in fact differ from the agent’s policy, as this deviation may be difficult to measure.)

Stochasticity. One method for producing off-policy states is to introduce stochasticity into the policy of the agent under inspection [Machado et al.2017]. We present a representative method we call k off-policy actions (k-OPA), which causes the agent to execute some sequence of on-policy actions and then take $k$ random actions to place the agent in an off-policy state. This method is scalable to large and complex environments, but careful consideration must be made to avoid overlap between states, as well as to ensure that the episode does not terminate before the $k$ actions are completed. It is easy to imagine other variations, where the actions are not selected randomly but according to some other mechanism inconsistent with greedy-action selection.
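A minimal sketch of k-OPA follows, assuming a hypothetical environment interface with `reset()`, `step(action)` returning `(state, reward, done)`, and an `actions` list. `ChainEnv` is a toy stand-in invented for this example; it is not Amidar or any real library's environment. The function returns `None` when the episode ends early, which is the termination concern raised above.

```python
import random
from typing import Callable, List, Optional, Tuple


class ChainEnv:
    """Toy deterministic environment (hypothetical): the state is an integer position."""

    actions: List[int] = [-1, +1]

    def __init__(self, length: int = 50) -> None:
        self.length = length
        self.position = 0

    def reset(self) -> int:
        self.position = 0
        return self.position

    def step(self, action: int) -> Tuple[int, float, bool]:
        self.position = max(0, self.position + action)
        done = self.position >= self.length
        return self.position, float(done), done


def k_opa_state(
    env: ChainEnv,
    policy: Callable[[int], int],
    n_on_policy: int,
    k_random: int,
    rng: random.Random,
) -> Optional[int]:
    """Follow `policy` for n steps, then take k uniformly random actions.

    Returns the resulting state, or None if the episode terminated first.
    """
    state = env.reset()
    for _ in range(n_on_policy):
        state, _, done = env.step(policy(state))
        if done:
            return None
    for _ in range(k_random):
        state, _, done = env.step(rng.choice(env.actions))
        if done:
            return None
    return state


start_state = k_opa_state(
    ChainEnv(), policy=lambda s: +1, n_on_policy=10, k_random=10, rng=random.Random(0)
)
```

The returned state would then be used as the starting point for evaluating the agent under inspection. Checking that it really lies outside the agent's on-policy trajectory is a separate step that this sketch does not perform.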

Human Agents. The use of human agents has become a standard method in evaluating the generalization capabilities of RL agents. The most common method is known as human starts (HS) and is defined as exposing the agent to a state recorded by a human user interacting with an interface to the MDP environment [Mnih et al.2015]. One could easily imagine desirable variations on human starts within this general category, such as passing control back and forth between an agent and a human user. Human agents differ from other alternative agents in that they may not be motivated by the explicit reward function specified in the MDP, instead focusing on novelty or entertainment.

Synthetic Agents. Synthetic agents are commonly used during training in multiagent scenarios, although, to our knowledge, they have not been used previously to evaluate an agent’s generalization ability. We present a representative method we call agent swaps (AS), where the agent is exposed to a state midway through an alternative agent’s trajectory. This method has the potential to be significantly more scalable than human starts in large and complex environments, but attention must be paid to avoiding overlap between the alternative agents and the agent under inspection. This method may also be useful in applications that are not amenable to a user interface or where human data is otherwise challenging to gather.
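The overlap concern for agent swaps can be handled by discarding candidate states that the agent under inspection would have reached on its own. The sketch below assumes deterministic transitions and a fixed start state; `toy_step`, the grid-coordinate state, and both policies are hypothetical stand-ins chosen only to make the example runnable.

```python
from typing import Callable, Iterable, List, Tuple

State = Tuple[int, int]


def toy_step(state: State, action: str) -> State:
    """Hypothetical deterministic transition on an unbounded grid."""
    x, y = state
    return (x + 1, y) if action == "right" else (x, y + 1)


def rollout(start: State, policy: Callable[[State], str], n_steps: int) -> List[State]:
    """States visited by following `policy` for n_steps, including the start."""
    states = [start]
    for _ in range(n_steps):
        states.append(toy_step(states[-1], policy(states[-1])))
    return states


def agent_swap_states(
    start: State,
    inspected_policy: Callable[[State], str],
    alternative_policy: Callable[[State], str],
    swap_steps: Iterable[int],
    horizon: int,
) -> List[State]:
    """Take states from the alternative agent's trajectory at each swap step,
    keeping only those the inspected agent does not visit within `horizon` steps."""
    on_policy = set(rollout(start, inspected_policy, horizon))
    candidates = []
    for n in swap_steps:
        state = rollout(start, alternative_policy, n)[-1]
        if state not in on_policy:
            candidates.append(state)
    return candidates


swap_states = agent_swap_states(
    start=(0, 0),
    inspected_policy=lambda s: "right",
    alternative_policy=lambda s: "right" if s[0] < 2 else "up",
    swap_steps=[1, 2, 3, 4],
    horizon=10,
)
```

Here the two agents agree for the first two steps, so those swap points are filtered out and only the later, genuinely divergent states are kept.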

Unreachable States

Unreachable states are unlike off-policy states, which can be produced using carefully selected alternative agents. By definition, unreachable states require some modification to the training environment. We propose a methodology that is particularly well suited for applications of deep RL, where agents often only have access to low-level observable effects, rather than what we would typically describe as a semantically meaningful or high-level representation. In the case of Amidar and other Atari games, for example, the position of individual entities can be described as latent state and the rendered pixels are their observable effects.

Intervening on Latent State. We present two distinct classes of interventions on latent state: existential, adding or removing entities, and parameterized, varying the value of an input parameter for an entity. The particular design of intervention categories and magnitude should be based on expected sources of variation in the deployment environment, and will likely need to be customized for individual benchmarks.

To facilitate this kind of intervention on latent state, we implemented Intervenidar, an Amidar simulator. Intervenidar closely mimics the Atari 2600 Amidar’s behavior, while allowing users to modify board configurations, sprite positions, enemy movement behavior, and other features of gameplay without modifying Intervenidar source code. (Footnote 5: Readers familiar with Amidar will know that there are other features of gameplay not listed here; although Intervenidar reproduces them, they are not important to the training regimens, nor the overall results of this paper.) Some manipulable features that we use in our experiments are:

Enemy existence and movement. The five enemies in Amidar move at a constant speed along a fixed track. By default, Intervenidar also has five enemies whose movement behavior is a time-based lookup table that mimics enemy position and speed in Amidar. Other distinct enemy movement behaviors include following the perimeter and the alternative movement protocols. These enemy behaviors are implemented as functions of the enemy’s local board configuration and are used for our transfer learning experiments.

Line segment existence and predicates. A line segment is any piece of track that intersects with another piece of track at both endpoints. Line segments may be filled or unfilled; the player’s objective is to fill all of them. In Intervenidar, users may specify which of the 88 line segments are filled at any timestep. Furthermore, Intervenidar allows users to customize the quantity and position of line segments.

Player/enemy positions. Player and enemy entities always begin a game in the same start positions during Amidar, but they may be moved to arbitrary locations at any point in Intervenidar.

We included these features in the experiments because they encapsulate what we believe to be the fundamental components of Amidar gameplay: avoiding death and navigating the board to accumulate reward. The scale of these interventions was selected to reflect a small change from the original environment, and is detailed in the case-study section.

Control. In addition to producing unreachable states, parameterizable simulators enable fine control of experiments, informing researchers and practitioners about where agents fail to generalize, not simply that they fail macroscopically. One limitation of using exclusively off-policy states is that multiple components of latent state may be confounded, making it challenging to disentangle the causes of brittleness from other differences between on-policy and off-policy states. Controlled experiments avoid this problem of confounding by modifying only a single component of latent state.

Figure 3: Average total accumulated reward (TAR) from various unreachable states for each of the trained agents. The benchmark agents trained using ALS, ES, ER, FLS, and PRS configurations respectively achieved average TARs of 94, 74, 14, 77, and 90 percent of the baseline TAR.

Analysis Case Study: Amidar

We trained a suite of agents and evaluated them on a series of on-policy, off-policy, and unreachable Intervenidar states. Using our proposed partitioning of states and empirical methodology, we ran a series of experiments on these agents’ ability to generalize. In this section, we discuss how we generated off-policy and unreachable states for the Amidar problem domain.

We used the standard Amidar MDP specification for state: a three-dimensional tensor composed of greyscale pixel values for the current, and three previous, frames during gameplay [Mnih et al.2015]. There are five movement actions. The transition function is deterministic, and entirely encapsulated by the Amidar game. The reward function is the difference between successive scores, and is truncated such that positive differences in score result in a reward of 1. There are no negative rewards, and state transitions with no change in score result in a reward of 0.
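The reward truncation and the four-frame state can be sketched as follows. This is an illustration, not the OpenAI Baselines code: `clipped_reward` and `FrameStack` are hypothetical names, frames are treated as opaque objects, and padding the first observation by repeating the first frame is an assumption of the sketch.

```python
from collections import deque
from typing import Any, Deque, Tuple


def clipped_reward(previous_score: int, current_score: int) -> int:
    """Reward of 1 for any increase in score, 0 otherwise (never negative)."""
    return 1 if current_score > previous_score else 0


class FrameStack:
    """Keeps the current frame and the three previous frames as one state."""

    def __init__(self, n_frames: int = 4) -> None:
        self.n_frames = n_frames
        self.frames: Deque[Any] = deque(maxlen=n_frames)

    def push(self, frame: Any) -> Tuple[Any, ...]:
        if not self.frames:
            # Assumption: pad the first state by repeating the first frame.
            self.frames.extend([frame] * self.n_frames)
        else:
            self.frames.append(frame)  # the oldest frame is dropped automatically
        return tuple(self.frames)


stack = FrameStack()
first_state = stack.push("frame0")
second_state = stack.push("frame1")
```

In practice each frame would be a two-dimensional greyscale array, and the tuple would be stacked into the three-dimensional tensor described above.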

We trained all agents using the state-of-the-art dueling network architecture, double Q-loss function, and prioritized experience replay [Van Hasselt, Guez, and Silver2016, Wang et al.2015, Schaul et al.2016]. All of the training sessions in this paper used the same hyperparameters as Mnih et al. (2015), and we use the OpenAI Baselines implementation [Dhariwal et al.2017].

Amidar Agents. We explored three types of modifications on network architecture and training regimens in an attempt to produce more generalized agents: (1) increasing dataset size by increasing training time; (2) broadening the support of the training data by increasing exploration at the start of each episode; and (3) reducing model capacity by decreasing network size and number of layers. To establish performance benchmarks for unreachable states, we trained an agent on each of the experimental extrapolation configurations.

Training Time. To understand the effect of training-set size on generalization performance, we saved checkpoints of the parameters for the baseline DQN after 10, 20, 30, and 40 million training actions before the model’s training reward converged at approximately 50 million actions. This process differs from increasing training dataset size in prediction tasks in that increasing the number of training episodes simultaneously changes the distribution of states in the agent’s experience replay.

Exploring Starts. To increase the diversity of the agent’s experience, we trained agents with 30 and 50 random actions at the beginning of each training episode before returning to the agent’s standard $\epsilon$-greedy exploration strategy.
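One way to express exploring starts is as a wrapper around the usual exploration rule: the first few actions of each episode are uniformly random, and afterwards the agent falls back to $\epsilon$-greedy selection. The sketch below is a minimal illustration with hypothetical names (`select_action`, a per-state dictionary of Q-values); it is not the training code used in the paper.

```python
import random
from typing import Dict, Sequence


def select_action(
    step_in_episode: int,
    q_values: Dict[str, float],
    actions: Sequence[str],
    n_random_start: int,
    epsilon: float,
    rng: random.Random,
) -> str:
    """Random action during the first n_random_start steps, epsilon-greedy afterwards."""
    if step_in_episode < n_random_start or rng.random() < epsilon:
        return rng.choice(list(actions))
    # Greedy choice; actions missing from q_values default to 0.0 (an assumption).
    return max(actions, key=lambda action: q_values.get(action, 0.0))


rng = random.Random(0)
actions = ["up", "down", "left", "right", "noop"]
q_values = {"up": 1.0, "down": 0.2}
early = select_action(5, q_values, actions, n_random_start=30, epsilon=0.0, rng=rng)
later = select_action(30, q_values, actions, n_random_start=30, epsilon=0.0, rng=rng)
```

Setting `n_random_start` to 30 or 50 corresponds to the two exploring-starts agents described above.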

Model Capacity. To reduce the capacity of the Q-value function, we explored three architectural variations from the state-of-the-art dueling architecture: (1) reducing the size of the fully connected layers by half (256-HU); (2) reducing the number of channels in each of the three convolutional filters by half (HC); and (3) removing the last convolutional layer of the network (TL). Recent work on deep networks for computer vision suggests that deeper architectures produce more hierarchical representations, enabling a higher degree of generalization [Krizhevsky, Sutskever, and Hinton2012].

Off-policy States. We employed three strategies to generate off-policy states for an agent: human starts, agent swaps, and $k$-OPA. None of these methods require the Intervenidar system. In each case, we ran an agent nine times, each for a different number of steps $n$.

Human starts. Four individuals played 30 Intervenidar games each. We randomly selected 75 action sequences lasting more than 1000 steps and extracted 9 states, taken at each of the time steps $n$ [Nair et al.2015].

Agent swaps. We designated five of the trained agents as alternative agents: (1) the baseline agent, (2) the agent that starts with 50 random actions, (3) the agent with half as many convolutional channels as the original architecture, (4) the agent with only two convolutional layers, and (5) the agent with 256 hidden units. We chose these agents with the belief that their policies would be sufficiently different from each other to provide some variation in off-policy states. (Footnote 6: When evaluating any of the alternative agents, we only used states from the remaining four to generate off-policy states.)

$k$-OPA. Unlike the previous two cases where states came from sources external to the agent, in this case we had every agent play the game for $n$ steps before taking $k$ random actions, where $k$ was set to 10 and 20.

Unreachable States. With Intervenidar, we generated unreachable states, guaranteeing that the agent begins an episode in a state it has never encountered during training. All modifications to the board happen before gameplay.

Modifications to enemies. We make one existential and one parameterized modification to enemies: We randomly remove between one and four enemies from the board (ER), and we shift one randomly selected enemy along its path by a number of steps drawn randomly between 1 and 20 (ES).

Modifications to line segments. We make one existential and one parameterized modification to line segments: We add one new vertical line segment to a random location on the board (ALS) and we randomly fill between one and four non-adjacent unfilled line segments (FLS).

Modification to player start position. We start the player in a randomly chosen unoccupied tile location that has at least one tile of buffer between the player and any enemies (PRS).
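The interventions above can be sketched as pure functions over a small latent-state record. The code below is not Intervenidar's API: `BoardState`, its fields, and the `adjacency` map are hypothetical, and "non-adjacent" is interpreted here as "no two newly filled segments are adjacent to each other". Only three of the five interventions (ER, ES, FLS) are shown.

```python
import random
from dataclasses import dataclass, replace
from typing import Dict, FrozenSet, Set, Tuple


@dataclass(frozen=True)
class BoardState:
    """Hypothetical latent state: enemy positions along their tracks, filled segments."""

    enemy_offsets: Tuple[int, ...]
    filled_segments: FrozenSet[int]


def remove_enemies(state: BoardState, rng: random.Random) -> BoardState:
    """ER: remove between one and four enemies (requires at least one enemy)."""
    n_enemies = len(state.enemy_offsets)
    n_remove = rng.randint(1, min(4, n_enemies))
    keep = sorted(rng.sample(range(n_enemies), n_enemies - n_remove))
    return replace(state, enemy_offsets=tuple(state.enemy_offsets[i] for i in keep))


def shift_enemy(state: BoardState, rng: random.Random) -> BoardState:
    """ES: shift one randomly chosen enemy by 1 to 20 steps along its path."""
    index = rng.randrange(len(state.enemy_offsets))
    offsets = list(state.enemy_offsets)
    offsets[index] += rng.randint(1, 20)
    return replace(state, enemy_offsets=tuple(offsets))


def fill_segments(
    state: BoardState,
    adjacency: Dict[int, Set[int]],
    n_segments: int,
    rng: random.Random,
) -> BoardState:
    """FLS: fill between one and four unfilled, mutually non-adjacent segments."""
    n_target = rng.randint(1, 4)
    candidates = [s for s in range(n_segments) if s not in state.filled_segments]
    rng.shuffle(candidates)
    chosen: Set[int] = set()
    for segment in candidates:
        if len(chosen) == n_target:
            break
        if not (adjacency.get(segment, set()) & chosen):
            chosen.add(segment)
    return replace(state, filled_segments=state.filled_segments | chosen)


rng = random.Random(0)
default = BoardState(enemy_offsets=(0, 10, 20, 30, 40), filled_segments=frozenset())
chain = {i: {i - 1, i + 1} for i in range(88)}  # toy adjacency, not the real board
modified = fill_segments(default, chain, n_segments=88, rng=rng)
```

Because each function changes a single component of the state and returns a new record, an evaluation can apply exactly one intervention per episode, which is what keeps the experiment controlled.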

Figure 4: Value estimate $\hat{V}(s)$ and actual value $V^{\pi}(s)$ for replicated trajectories for all experiments. Each subplot is a single independent trial. For the interpolation experiments, the vertical grey line shows the point where the agent takes random actions (in the k-OPA experiments) or regains control (in the agent swaps and human-starts experiments). The length of each episode is consistently lower and the difference between $\hat{V}(s)$ and $V^{\pi}(s)$ is consistently higher for the extrapolation experiments.

Transfer Learning: Assessing Representations. We conducted a series of transfer learning experiments [Oquab et al.2014], freezing the convolutional layers and retraining the fully connected layers for 25 million steps. We use these results to understand how learned representations in the convolutional layers relate to overall generalization performance. We train each of the agents using the alternative enemy movement protocol so that enemies move on the basis of local track features, rather than using a lookup table. If an agent has learned useful representations in the convolutional layers, then we expect that agent to learn a new policy using those representations for the alternative movement protocol. (Footnote 7: We distinguish this transfer learning experiment from our extrapolation experiments in that the transfer learning experiment modifies the transition function $P$ and by extension $S$. In the extrapolation experiments, an agent can later encounter states it has observed during training and effectively use its learned policy, which is not necessarily true if the transition function changed.)


Our experiments demonstrate that: (1) the state-of-the-art DQN has poor generalization performance for Amidar gameplay; (2) distance in the network’s learned representation is strongly anti-correlated with generalization performance; (3) modifications to training volume, model capacity, and exploration have minor and sometimes counterintuitive effects on generalization performance; and (4) generalization performance does not necessarily correlate with an agent’s ability to transfer representations to a new environment.

Poor Generalization Performance. Figures 4 and 5 show that the fully trained state-of-the-art DQN dueling architecture produces a policy that is exceptionally brittle to small non-adversarial changes in the environment. The most egregious examples can be seen in Figure 5, in the filling line segments (FLS) and player random starts (PRS) interventions. Visual inspection of the action sequences following these states showed the agent predominantly remaining stationary, often terminating the episode without traversing a single line segment. This behavior can be seen in Figure 4, where PRS and FLS episodes terminate prematurely. Videos displaying this behaviour can be found in the supplementary materials.

Figure 5: TAR and average VEE for control, extrapolation, and interpolation experiments. The agent consistently overestimates the state value. TAR and VEE are strongly anti-correlated. All TAR bars are normalized by the TAR of the control condition. All VEE bars are normalized by their respective TAR.

Furthermore, Figure 5 shows that VEE and TAR are strongly anti-correlated across the experiments, indicating that the agent’s ability to select appropriate actions is related to its ability to correctly measure the value of a particular state. We observe that the model always overestimates the value of off-policy and unreachable states. In contrast, the agent’s value estimates are small and approximately symmetrically distributed around 0 in the control condition.
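The normalization stated in the Figure 5 caption, and one way to quantify an anti-correlation between TAR and VEE, can be sketched as follows. The condition names and all numbers below are invented purely for illustration and are not results from the experiments; the choice of the Pearson coefficient is also an assumption, since the text does not say how the relationship was measured.

```python
import math


def normalize(raw_tar, raw_vee, control="control"):
    """Normalize TAR by the control TAR, and each VEE by its own TAR."""
    if any(t == 0 for t in raw_tar.values()):
        raise ValueError("TAR of zero cannot be used as a normalizer")
    control_tar = raw_tar[control]
    tar = {c: t / control_tar for c, t in raw_tar.items()}
    vee = {c: raw_vee[c] / raw_tar[c] for c in raw_tar}
    return tar, vee


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Hypothetical per-condition averages (not data from the paper).
raw_tar = {"control": 100.0, "condition_a": 50.0, "condition_b": 10.0}
raw_vee = {"control": 1.0, "condition_a": 20.0, "condition_b": 30.0}

tar, vee = normalize(raw_tar, raw_vee)
conditions = sorted(raw_tar)
r = pearson([tar[c] for c in conditions], [vee[c] for c in conditions])
print(f"Pearson r between normalized TAR and VEE: {r:.3f}")
```

A coefficient near -1 would correspond to the pattern described here: conditions with lower accumulated reward show larger value overestimation.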

Distance in Representation. By extracting the activations of the last layer of the DQN, we are able to observe the distance between training and evaluation states with respect to the network’s learned representation. Figure 6 depicts the density estimates for the distribution of these distances. We find that the agent does not “recognize” the unreachable states where generalization is the worst, such as PRS and FLS, implying that the learned representation is inconsistent with these components of latent state. Alternatively, one could imagine a network that performs poorly by conflating states that are meaningfully different.
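One plausible way to compute such distances is sketched below, under assumptions that the text does not confirm: each state is represented by the vector of last-layer activations, distance is Euclidean, and an evaluation state is scored by its distance to the nearest training state. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length activation vectors."""
    if len(a) != len(b):
        raise ValueError("activation vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_training_distance(eval_activation, train_activations):
    """Distance from one evaluation state to its closest training state."""
    return min(euclidean(eval_activation, t) for t in train_activations)


def distances_to_training(eval_activations, train_activations):
    """Nearest-training-state distance for every evaluation state."""
    return [
        nearest_training_distance(e, train_activations)
        for e in eval_activations
    ]


# Toy two-dimensional activations standing in for last-layer features.
train = [[0.0, 0.0], [1.0, 0.0]]
evaluation = [[1.0, 0.0], [0.0, 3.0]]
print(distances_to_training(evaluation, train))
```

A density estimate over the resulting list, computed per intervention, would give curves of the kind shown in Figure 6. Large distances suggest states the network does not "recognize", while distances near zero for meaningfully different states would point to the conflation failure mode mentioned above.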

Figure 6: Smoothed empirical distributions of the distances between the test points of the extrapolation experiments and the training data. Generalization performance is anti-correlated with distance from previously seen states.

Training Agents for Generalization. We take inspiration from well-established methods in supervised learning: increasing training set size, broadening the support of the training distribution, and reducing model capacity. We propose the following analogs to each of these methods, respectively: increasing the number of training episodes, introducing additional exploration, and removing layers and nodes.

These experiments indicate that: (1) naïvely increasing the number of training episodes until training set performance converges reduces generalization; (2) some reductions to model capacity induce improvements to generalization; and (3) increasing exploration and otherwise diversifying training experience results in more generalized policies. These results are shown in Figure 3.

Training Episodes. While increasing training time clearly increases the total accumulated reward in the control condition, shorter training times appear to contribute to increased generalization ability. This increase is minimal, but it does illustrate that naïvely increasing training time until training rewards converge may not be the best strategy for producing generalized agents.

Model Capacity. Of the reductions to model capacity, we find that shrinking the fully connected layers results in the greatest increase in generalization performance across perturbations. Reducing the number of convolutional layers also improves generalization performance, particularly for the enemy perturbation experiments.

Exploration Starts. We find that increasing the diversity of training experience has the greatest effect on generalization performance, particularly for the agent with 50 random actions. This agent experiences almost a twofold increase in total accumulated reward for human starts and all of the extrapolation experiments, and it outperforms the baseline agent in every condition. Of particular interest is the agent’s performance on the enemy shift experiments, where its total accumulated reward approaches the reward achieved by an agent trained entirely in that scenario.
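The idea of diversifying start states with random actions can be sketched as follows. This assumes that the "50 random actions" are taken at the beginning of each training episode before the learned policy takes over, which this section does not spell out. The `ToyCorridor` environment and `reset_with_random_actions` helper are hypothetical stand-ins, not the Amidar simulator or its API, and the sketch ignores episodes that would end during the random actions.

```python
import random


class ToyCorridor:
    """Hypothetical environment: an agent on a line of cells 0..size-1."""

    ACTIONS = (-1, 1)

    def __init__(self, size=21):
        self.size = size
        self.position = size // 2

    def reset(self):
        # Every episode nominally starts from the same central cell.
        self.position = self.size // 2
        return self.position

    def step(self, action):
        # Move left or right, clipping at the ends of the corridor.
        self.position = min(self.size - 1, max(0, self.position + action))
        return self.position


def reset_with_random_actions(env, k, rng):
    """Reset the environment, then take k uniformly random actions."""
    state = env.reset()
    for _ in range(k):
        state = env.step(rng.choice(env.ACTIONS))
    return state


rng = random.Random(0)
env = ToyCorridor()
fixed_starts = {reset_with_random_actions(env, 0, rng) for _ in range(200)}
random_starts = {reset_with_random_actions(env, 50, rng) for _ in range(200)}
print(len(fixed_starts), "distinct start state(s) without random actions")
print(len(random_starts), "distinct start states with 50 random actions")
```

With no random actions every episode begins from the same state; with random actions the agent begins learning from a broader set of states, which is the supervised-learning analog of broadening the support of the training distribution.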

Hierarchical Representations and Generalization. While the agents with increased exploration demonstrate a clear improvement in generalization ability over baseline, this improvement is not matched by their ability to accumulate large reward with the alternative enemy-movement protocol after retraining. This finding contradicts findings from work on representations in computer vision, where transferability of representations directly corresponds to generalization ability.


Generalization in RL needs to be discussed more broadly, as a capability of an arbitrary agent. We propose framing generalization as the performance metric of the researcher’s choice over a partition of on-policy, off-policy, and unreachable states. Our custom, parameterizable Amidar simulator is a proof of concept of the type of simulation environments that are needed for generating unreachable states and training truly general agents.
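As a rough illustration of this framing, the sketch below averages a researcher-chosen metric separately over the three categories of states. The category labels, the record format, and the `generalization_report` helper are hypothetical, and the numbers are invented; any per-state performance metric could be substituted.

```python
from collections import defaultdict

CATEGORIES = ("on_policy", "off_policy", "unreachable")


def generalization_report(records):
    """Average a metric within each category of evaluation states.

    records is an iterable of (category, metric_value) pairs. Categories
    with no records are reported as None rather than as zero.
    """
    grouped = defaultdict(list)
    for category, value in records:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        grouped[category].append(value)
    return {
        c: (sum(grouped[c]) / len(grouped[c]) if grouped[c] else None)
        for c in CATEGORIES
    }


# Invented metric values, one per evaluated start state.
records = [
    ("on_policy", 1.0),
    ("on_policy", 0.8),
    ("off_policy", 0.5),
    ("unreachable", 0.1),
    ("unreachable", 0.3),
]
print(generalization_report(records))
```

Reporting the three averages side by side, rather than a single aggregate score, keeps poor performance on unreachable states from being hidden by strong on-policy performance.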


Thanks to Kaleigh Clary, John Foley, and the anonymous AAAI reviewers for thoughtful comments and contributions. This material is based upon work supported by the United States Air Force under Contract No. FA8750-17-C-0120. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force.