Toybox: A Suite of Environments for Experimental Evaluation of Deep Reinforcement Learning

by Emma Tosch, et al.

Evaluation of deep reinforcement learning (RL) is inherently challenging. In particular, learned policies are largely opaque, and hypotheses about the behavior of deep RL agents are difficult to test in black-box environments. Considerable effort has gone into addressing opacity, but almost no effort has been devoted to producing high-quality environments for experimental evaluation of agent behavior. We present TOYBOX, a new high-performance, open-source subset of Atari environments re-designed for the experimental evaluation of deep RL. We show that TOYBOX enables a wide range of experiments and analyses that are impossible in other environments.



1 Introduction

Since DeepMind’s 2015 Nature paper, the Arcade Learning Environment (ALE) has become the de facto deep RL benchmark for new training algorithms (Bellemare et al., 2013; Mnih et al., 2015; Machado et al., 2017). ALE has several appealing qualities: humans learn to play Atari and become more skilled with experience, it is a “real-world” environment that was not originally constructed to evaluate RL methods, and it has greater complexity than prior environments (e.g., GridWorld, mountain car).

ALE has been used in several ways to evaluate the performance of deep RL agents. The vast majority of evaluations follow a version of the approach described by Bellemare et al. (2012): first, researchers choose network architectures and tune hyperparameters on a small set of Atari games; then they train agents using those hyperparameters on new games, reporting the learning curves (or a statistic of a collection of those curves) (Mnih et al., 2015; Van Hasselt et al., 2016; Mnih et al., 2016; Hessel et al., 2018).

While ALE has enabled demonstration and evaluation of much more complex behaviors of deep RL agents, it presents challenges as a suite of evaluation environments for topics on the frontier of deep RL.

Challenge: Limited variation within games. Very little about individual games can be systematically altered, so ALE is poorly suited to testing how changes in the environment affect training and performance. New benchmarks such as OpenAI’s Sonic the Hedgehog emulator and CoinRun inject environmental variation into the training schedule, while introducing train/test splits (Nichol et al., 2018; Cobbe et al., 2018). Similarly, Zhang et al. (2018) suggest benchmarks that incorporate the kind of non-random noise found in nature. Kansky et al. (2017) implemented Breakout variants in order to achieve variation for generalization.

Challenge: No counterfactual evaluation. Meanwhile, assertions about intelligent agent behavior remain untestable in the face of black-box evaluation environments. For example, ALE does not enable testing the conjecture that agents trained on Breakout learn to build tunnels (Mnih et al., 2015) or that they enter a tunneling mode (Greydanus et al., 2018). No system currently permits experiments to answer counterfactual questions about agent behavior.

Contribution. We propose ToyBox, a suite of high-performance and highly parameterized Atari-like environments designed for the purpose of experimentation. We demonstrate that the ToyBox implementations of three Atari 2600 games achieve similar performance to their ALE counterparts across three deep RL algorithms. We demonstrate that ToyBox enables a range of post-training analyses not previously possible, and we show that ToyBox is orthogonal to concurrent efforts in deep RL to address issues of robustness, generalization, and reproducible evaluation.

Organization. The rest of the paper is organized as follows: Section 2 introduces the ToyBox design and its functional capabilities. Section 3 describes our evaluation, including performance and fidelity testing against the ALE. Section 4 describes four behavioral tests we present as case studies. Related work not otherwise addressed can be found in Section 5. Section 6 discusses ToyBox applications beyond the scope of this paper. We conclude in Section 7.

2 ToyBox: System Design

ToyBox is a high-performance, highly parameterized suite of Atari-like games implemented in Rust, with bindings to Python. The suite currently contains three games: Breakout, Amidar, and Space Invaders. We chose these games for diversity of genre (paddle-based, maze, and shooter, respectively) and likely familiarity to readers. (Footnote 1: Although Amidar may not be familiar, it is very similar to PacMan, but with simpler rules.)

Software requirements. Atari 2600 games were designed for human players. For ToyBox, the primary user is a reinforcement learning algorithm, and we expect machine learning researchers to be able to customize gameplay. To that end, we developed ToyBox to meet the following set of software requirements:

  R1. ToyBox should only leverage the CPU, even for graphical tasks. Although modern games leverage the GPU for faster rendering, we expect the machine learning libraries to be using the GPU, and so we wish to create our screen images using only the CPU.

  R2. ToyBox should be at least as efficient as the Stella-emulated version of the game. Since reinforcement learning algorithms require millions of frames of training data, we must be able to simulate and render millions of frames in order to enable efficient use of computation resources for learning.

  R3. ToyBox should provide for data-driven user customization. Changing the bricks in Breakout, the board in Amidar, or the alien configuration in Space Invaders should not require re-compilation of the core game code, nor should it require the ability to write Rust.

  R4. ToyBox should be accessible through OpenAI Gym, which is a Python API. Furthermore, ToyBox should be usable as a drop-in replacement for the analogous ALE environment.

Figure 1: The ToyBox architecture.
Figure 2: A traditional RL diagram, augmented with features required for counterfactual reasoning (shaded above). There is a physical system that governs behavior within the environment that may only be partially known. The researcher has a model of the environment’s physics and, when attempting to explain agent behavior, must implicitly form a model of the agent’s model. Arrows emanating from the researcher point to elements of the system where the researcher may intervene. The dotted line from the researcher to the environment is possible in ToyBox, but is not typically considered.

Architecture. Figure 1 depicts the ToyBox architecture. The game logic is written in Rust. Every game implements two core structs: Config and State. The Config struct contains data that we would generally only expect to be initialized at the start of an episode (i.e., a game, which may include multiple lives). The State struct contains data that may change between frames. At any point during execution, a ToyBox game can be paused, state exported and modified, and resumed with the new state.
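The pause, export, modify, and resume cycle described above can be sketched in a few lines of Python. This is a minimal mock rather than the ToyBox API: the class MockGame, its methods, and the field names (start_lives, ball_speed, lives, score, frame) are hypothetical stand-ins chosen only to illustrate the split between Config and State.

```python
import copy
import json


class MockGame:
    """Hypothetical stand-in for a ToyBox game; not the real API."""

    def __init__(self, config):
        # Config: data fixed at the start of an episode.
        self.config = copy.deepcopy(config)
        # State: data that may change between frames.
        self.state = {"frame": 0, "lives": config["start_lives"], "score": 0}

    def step(self):
        # Advance one frame; a real game would apply its physics here.
        self.state["frame"] += 1

    def export_state(self):
        # Serialize the state so it can be inspected or edited.
        return json.dumps(self.state)

    def load_state(self, state_json):
        # Resume from an arbitrary (possibly modified) state.
        self.state = json.loads(state_json)


game = MockGame({"start_lives": 3, "ball_speed": 4.0})
for _ in range(10):
    game.step()

snapshot = json.loads(game.export_state())  # pause and export
snapshot["lives"] = 1                       # intervene on the state
game.load_state(json.dumps(snapshot))       # resume with the new state
game.step()
```

Keeping the state as plain serializable data is what makes the intervention a data edit rather than a code change, which is the property requirement R3 asks for.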

Aside: Why Atari? Given the range of open problems in deep RL, creating a new ALE-like system may not seem like an effective way to facilitate cutting-edge deep RL research. However, there are still many poorly understood properties of Atari games and agent behavior on them.

There are many axes of complexity in the reinforcement learning environment: required planning horizon, reward function assignment, number of actions available, environment stochasticity, and state representation are all different components affecting the total complexity of the environment, each presenting unique challenges for policy learning:

  • Planning Horizon: Atari environments requiring temporally extended planning strategies for success, as in Frostbite, still present challenges to deep reinforcement learning algorithms (Jiang et al., 2015; Lake et al., 2017; Farquhar et al., 2018).

  • Reward Function: Environments with sparse reward functions are another sticking point for RL, having only recently seen significant progress on games in this category (Ecoffet et al., 2018; Salimans & Chen, 2018; Burda et al., 2019). Atari examples include Pitfall and Montezuma’s Revenge.

  • State Representation: Operating over pixel representations like Atari games means the agent experiences a huge state space, but the underlying objects defining those sensory representations can often be represented more compactly (Guestrin et al., 2003; Diuk et al., 2008; Kansky et al., 2017). Models trained on a compact representation of state can train more quickly (Keramati et al., 2018; Melnik et al., 2018), but the underlying environment dynamics have not changed.

  • Action Space Complexity: Increasing the size of the action space can quickly lead to intractable computation for Q-value or policy function approximation, especially when that approximation is computationally expensive as in deep RL (Dulac-Arnold et al., 2015). Atari games available in ALE allow 4-18 actions (Mnih et al., 2013). While this may seem small, even environments with as few as 10 actions can present challenges to efficient learning (Dulac-Arnold et al., 2012).

  • Environment Stochasticity: Previous work has shown that environments with higher stochasticity can be more difficult for some types of RL algorithms to learn (Henderson et al., 2017). Atari games are deterministic, but ToyBox enables Atari-like environments with parameterized stochasticity.

Despite these open challenges, it has been argued that Atari’s environments are not sufficiently complex to evaluate reinforcement learning agents because the source code is small (Zhang et al., 2018). However, source code size as minimum description length is a poor proxy for environment complexity. As  Raghu et al. (2017) have shown, Erdős-Selfridge-Spencer games can be represented quite compactly, simply requiring the assignment of two parameters, but represent a large combinatorial space of potential games for evaluating RL algorithms.

Ultimately, each of these axes of complexity is observed through the agent’s interaction with the environment. Figure 2 depicts the traditional RL diagram, overlaid with the interventions a researcher may make, as well as the models that inform both the researcher’s and the agent’s decision-making. The solid arrows represent current avenues for experimentation used by deep RL researchers to evaluate models: sensory perception of state, action selection manipulations, and reward function definition.

Without intervention or introspection of the environment, researchers must use observational data of agent behavior to reason about experimental results. ToyBox enables new methodology for experimentation in deep RL. Ultimately, we mimic Atari games because there is a font of untapped research questions related to testing and explaining the behavior of deep RL agents. We chose our initial set of games to establish that there are surprising results on even seemingly “solved” games such as Breakout and Space Invaders.

3 Evaluation

Performance. We achieved requirement R1 (CPU-only rendering) by designing a simple CPU graphics library; we demonstrate ToyBox efficiency (requirement R2) in Table 1. Note that ToyBox permits researchers to process games entirely in grayscale and achieve substantial additional performance gains. However, since this is not a feature offered in ALE, we only compared against the ToyBox RGB(A) rendering.

Game            Backend   Raw kFPS     Gym kFPS
Breakout        ALE       52 (1.3)     3.4 (0.065)
Breakout        ToyBox    230 (5.4)    7.2 (0.23)
Amidar          ALE       61 (2.9)     3.0 (0.083)
Amidar          ToyBox    250 (2.3)    6.0 (0.112)
Space Invaders  ALE       55 (1.3)     3.9 (0.072)
Space Invaders  ToyBox    120 (3.4)    5.2 (0.082)

Table 1: ToyBox vs ALE performance. Thousand frames per second (kFPS) on a MacBook Air (OSX Version 10.13.6) with a 1.6 GHz Intel Core i5 processor having 4 logical cores. Rates are averaged over 30 trials of 10 000 steps and reported to two significant digits, with standard error in parentheses. We consistently observed an approximately 95% slowdown when interacting with both ALE (C++) and ToyBox (Rust) via OpenAI Gym. All benchmarks are run from CPython 3.5 and include FFI overhead (via atari-py for ALE).
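The slowdown quoted in the Table 1 caption can be checked directly from the reported rates. The short calculation below copies the mean kFPS values from Table 1 (ignoring the standard errors) and computes, for each row, the fraction of throughput lost when stepping through the Gym wrapper, along with the raw-rate speedup of ToyBox over ALE. It is a worked arithmetic check, not part of the benchmark harness.

```python
# (raw kFPS, Gym kFPS) copied from Table 1; standard errors omitted.
rates = {
    ("Breakout", "ALE"): (52, 3.4),
    ("Breakout", "ToyBox"): (230, 7.2),
    ("Amidar", "ALE"): (61, 3.0),
    ("Amidar", "ToyBox"): (250, 6.0),
    ("Space Invaders", "ALE"): (55, 3.9),
    ("Space Invaders", "ToyBox"): (120, 5.2),
}

# Fractional slowdown when stepping through the Gym wrapper.
slowdown = {key: 1.0 - gym / raw for key, (raw, gym) in rates.items()}
mean_slowdown = sum(slowdown.values()) / len(slowdown)

# Raw-rate speedup of ToyBox over ALE for each game.
games = ["Breakout", "Amidar", "Space Invaders"]
speedup = {g: rates[(g, "ToyBox")][0] / rates[(g, "ALE")][0] for g in games}

for key, value in slowdown.items():
    print(key, f"{100 * value:.1f}% slowdown")
print(f"mean slowdown: {100 * mean_slowdown:.1f}%")
for game, factor in speedup.items():
    print(game, f"{factor:.1f}x raw speedup")
```

Because the wrapper overhead dominates in both backends, the relative advantage of ToyBox is smaller in the Gym column than in the raw column.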
(a) Upper: ALE; lower: ToyBox.
(b) Left: ALE; right: ToyBox.
(c) Upper: ALE; lower: ToyBox.
Figure 3: Side-by-side comparisons of screen shots from ALE and ToyBox Atari games. Each game represents a different deterministic action trace, but traces are the same between ALE and ToyBox. ToyBox implementations of Breakout and Space Invaders have nondeterministic elements. Amidar is deterministic in both ALE and ToyBox. Due to idiosyncrasies at the start of ALE Amidar gameplay, the frames are not from identical points in the action trace; frames for Breakout and Space Invaders are. Note that for Amidar ALE appears to be missing an enemy, which is due to Atari only rendering a subset of sprites each frame due to computational constraints.

Fidelity. Figure 3 depicts frames at roughly equivalent points in the execution of a fixed action trace. If ToyBox perfectly reproduced the games in ALE, the frames would be exactly the same. Three factors prevent exact replication of games: (1) Atari 2600 game source code is not available, (2) there are no formal specifications and few informal specifications of games, and (3) inferring arbitrarily complex programs from data is extremely challenging (Raychev et al., 2016). (Footnote 2: Some informal specifications contain errors; e.g., the Atari manual for Breakout refers to a row of bricks that does not exist.) Therefore, comparing frames of program traces is not a sufficient measure of how closely we have approximated ALE games.

A human study could help assess whether the two implementations are sufficiently alike. While we did solicit feedback from Atari aficionados during development (via a playable interface), we did not view human perception of equivalence as a sufficient measure of fidelity. Human players rely on unique problem-solving capabilities that deep RL agents have not yet achieved, while deep networks are undeterred by the kind of noise that can confuse humans (Szegedy et al., 2014; Dubey et al., 2018). Instead, we focused on tuning our environments to produce comparable post-training agent performance.

Methodology. We used three off-the-shelf implementations of training algorithms with default parameter settings for 5e7 steps from OpenAI Baselines (Dhariwal et al., 2017): a2c (Mnih et al., 2016), acktr (Wu et al., 2017), and ppo2 (Schulman et al., 2017). Due to issues with variability across agents and environments (Henderson et al., 2017; Clary et al., 2018; Jordan et al., 2018), we trained ten replicates for each of these training algorithms, differentiated by their random seed. Since there are various other uncontrolled sources of randomness, we evaluated each of these thirty agents per game using thirty unique random seeds over thirty games. Figure 4 depicts our results: we find that agents achieve sufficiently similar performance in each analogous environment, and have roughly equivalent rankings (idiosyncrasies are discussed in the Fig. 4 caption).
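The evaluation grid in this methodology (three algorithms, ten training seeds each, thirty evaluation games per trained agent) and the ranking by mean score with standard error can be sketched as follows. The scores here are synthetic draws used only to exercise the bookkeeping; they are not results from the paper, and the rollout step is a placeholder.

```python
import math
import random
import statistics

random.seed(0)  # synthetic scores only; not results from the paper

algorithms = ["a2c", "acktr", "ppo2"]
n_replicates = 10   # training seeds per algorithm
n_eval_games = 30   # evaluation games per trained agent


def summarize(scores):
    """Return (mean, standard error of the mean) for a list of scores."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


agents = []
for algo in algorithms:
    for seed in range(n_replicates):
        # Placeholder for rolling out one trained agent on n_eval_games games.
        scores = [random.gauss(100, 20) for _ in range(n_eval_games)]
        mean, sem = summarize(scores)
        agents.append({"algo": algo, "seed": seed, "mean": mean, "sem": sem})

# Rank agents from best to worst by mean evaluation score.
ranking = sorted(agents, key=lambda a: a["mean"], reverse=True)
```

Running the same procedure against both backends and comparing the two orderings is the kind of comparison Figure 4 summarizes.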

Findings. ToyBox is much faster than ALE and our ranking results in Fig. 4 show that our current implementations are comparable to their ALE counterparts. We will continue to strive for fidelity over the course of game development.

Figure 4: Rankings across 300 model replicates per model, per game, with standard error. Each horizontal bar is the average performance for a trained model, evaluated over 30 games. Striations should be similar in both environments. Breakout: One training seed led an agent trained using acktr to have abysmally poor performance in every trial (i.e., all average scores below 2). Amidar: The lack of within-game variation is reflected in the short error bars and similar rankings between backends. Space Invaders: ToyBox only implements the first level of Space Invaders. Since there is currently no way to detect levels within ALE, we let both agents play indefinitely. However, because Space Invaders levels vary considerably, performance is not comparable. For this reason we use different scales for the score (x-axis).

4 Case Studies

We demonstrate requirement R3 (customization) with four case studies, in which we test the post-training performance of agents according to some hypothesis about behavior. (Footnote 3: The analyses in this paper focus on evaluations of post-training performance, but ToyBox interventions can be applied at any time, including during training.) This sort of testing is useful for evaluating a single agent prior to deployment (i.e., acceptance testing) or as existential proof for behavior under counterfactual conditions. None of these tests are currently possible in ALE because they all rely on resuming gameplay from an arbitrary modified state. Furthermore, all experiments in this section can be expressed in fewer than 200 lines of code, in addition to the code required for loading up the OpenAI Baselines models. We include a code snippet from one of our experiments in Figure 7.

(a) Polar starts
(b) Tunneling
Figure 5: Breakout case study. Results from counterfactual queries for starting angles and tunnels. Both tests were run on an agent trained with the default OpenAI Baselines parameters for ppo for 5e7 steps, one life, and a 4 min. timeout. (a): The black lines indicate the starting angles seen during training, the light gray area the maximum score achieved from this starting angle, and the dark gray area represents the mean score achieved across trials. (b): Brick hue scaled according to the inverse of the median number of steps required to clear the particular brick in that test; bright-yellow represents fewer steps and red represents more steps (17-400).
(a) Accumulated game score of a single a2c-trained agent.
(b) Ranking of 30 model replicates.
Figure 6: Amidar case study for four enemy movement protocols (“Control” is a lookup table, “Amidar” is the “Amidar movement,” “Random” enemies move in a random direction at every junction, and “Target” causes enemies to pursue the player when it is in line of sight). (a): Upper left: baseline performance of the agent on each of the four protocols. Lower left: baseline performance of the agent on each protocol without the ability to jump over enemies. Upper right: “Ganging up” test, where all enemies start close to the player. Lower right: “Ganging up” test with no jump. (b): Score ranking of 30 model replicates for the baseline condition with jumps (i.e., the upper left corner of the left graph) for each movement protocol.

Breakout: Polar Angles.

An agent that has learned to play Breakout must have at least learned to hit the ball. Our first test manipulates the starting angle of the ball.

We modified the start state to change the initial launch angle of the ball in 5° increments (72 configurations). Figure 5 depicts the results. Note that the agent fails to achieve any score with horizontal ball angles: since Breakout has no gravity, balls simply bounce horizontally forever, never hitting any bricks or threatening the paddle. The agent also sometimes struggled with vertical angles. When we observed this behavior, the agent would keep the ball aligned perfectly in the center of the board, hitting it precisely in the center of the paddle, failing to make progress. This is an unexpected behavior that is entirely unlike human gameplay. In all, we found the agent to be resilient to starting angles, albeit with high variance. This suggests that an agent can be successful even with balls traveling at angles it may never have observed in training, a powerful recommendation for the training algorithms that produced such robust RL agents.
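The sweep over launch angles amounts to enumerating 360 / 5 = 72 start states. The sketch below generates them under an assumed unit ball speed; the field name ball_velocity is hypothetical and is not taken from the ToyBox state schema. It also shows why the two horizontal launches are degenerate: they have no vertical velocity component.

```python
import math


def launch_velocity(angle_deg, speed=1.0):
    """Convert a launch angle in degrees to (vx, vy) components."""
    theta = math.radians(angle_deg)
    return speed * math.cos(theta), speed * math.sin(theta)


# 0, 5, ..., 355 degrees: 360 / 5 = 72 start configurations.
angles = list(range(0, 360, 5))
start_states = []
for angle in angles:
    vx, vy = launch_velocity(angle)
    # "ball_velocity" is a hypothetical field name, not the ToyBox schema.
    start_states.append({"angle": angle, "ball_velocity": {"x": vx, "y": vy}})

# Horizontal launches have (numerically) no vertical component, so the
# ball never reaches the bricks or the paddle.
horizontal = [
    s["angle"] for s in start_states if abs(s["ball_velocity"]["y"]) < 1e-9
]
```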

Breakout: Tunneling.

One of the most promising behaviors observed in deep RL has been the apparent ability to learn higher-level strategies. Perhaps no high-level strategy has been written about more than “tunneling” in the game of Breakout, which happens when the player clears a column of bricks, causing the ball to bounce through the hole and onto the ceiling, clearing many bricks rapidly (Mnih et al., 2015; Greydanus et al., 2018). One way to test whether an agent intentionally exploits tunneling is to give it a board with a nearly complete tunnel, save for a single brick, and test whether the agent can prioritize aiming at that single brick.

For every brick, we removed all other bricks in the column, creating a nearly-completed tunnel. Figure 5 depicts the results: the value for each brick is the reciprocal of the median number of time steps before that brick was removed.
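The board construction for this test can be sketched as below: for each target brick, keep the target and drop every other brick in its column. The brick records (dicts with row and col keys) and the small 3-by-4 grid are illustrative assumptions, not the ToyBox representation or the real Breakout layout. The helper brick_value computes the per-brick quantity plotted in Figure 5, the reciprocal of the median number of steps to clear the brick.

```python
import statistics


def near_tunnel_boards(bricks):
    """Yield (target, board) pairs, one per brick.

    Each board keeps the target brick but drops every other brick that
    shares its column, leaving a tunnel blocked by a single brick.
    """
    for target in bricks:
        board = [
            b for b in bricks
            if b["col"] != target["col"] or b["row"] == target["row"]
        ]
        yield target, board


def brick_value(steps_to_clear):
    """Reciprocal of the median number of steps needed to clear a brick."""
    return 1.0 / statistics.median(steps_to_clear)


# Illustrative 3-row by 4-column grid; the real board is larger.
bricks = [{"row": r, "col": c} for r in range(3) for c in range(4)]
boards = list(near_tunnel_boards(bricks))
```

With 12 bricks this produces 12 boards of 10 bricks each, since every board loses the two non-target bricks in the target's column.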

If an agent were able to build tunnels, we would expect, for example, symmetry along one or both axes. Instead, we see that the agent clears one column in the center very quickly, the left adjacent column and some bricks in the upper left region a bit more slowly, and the remaining bricks take all about the same time. Observing agent gameplay, we saw the agents hit the ball to predictable locations, regardless of the board configuration.

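# Fourth test: Target Player
# toybox, env, agent, starts, n_trials, and max_steps are assumed to be
# defined earlier in the experiment script.
# Delete existing enemies
config = toybox.get_config_json()
config['enemies'] = []
# Add enemies that chase the player.
# The enclosing dict and the append are reconstructed; the extracted
# snippet lost them.
for i in range(5):
    enemy = {
        'EnemyTargetPlayer': {
            'start': starts[i],
            'start_dir': 'Right',
            'dir': 'Right',
            'player_seen': None
        }
    }
    config['enemies'].append(enemy)
# Update game configuration. The extracted snippet elides this call;
# write_config_json is an assumed name for the setter.
toybox.write_config_json(config)
# Run 30 trials with OpenAI Gym API
obs = env.reset()
for trial in range(n_trials):
    n_steps = 0
    done = False
    # until death or max_steps
    while n_steps < max_steps and not done:
        action = agent.step(obs)
        obs, _, death, info = env.step(action)
        done = death and not toybox.game_over()
        n_steps += 1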
Figure 7: Code snippet from our Amidar Protocol test. We build JSON config in Python and run it in ToyBox.

Amidar: Enemy Protocol.

Suppose we would like to test whether an agent has learned to avoid adversarial elements of a game: e.g., the enemies in Amidar. To test this, we might drop the agent around the corner from an enemy, or position the enemies to “gang up” on the player, forcing the agent to move in a particular direction.

This kind of intervention is only meaningful if enemy position is a function of current location. Observation led us to conclude that enemies move in fixed loops, likely implemented as lookup tables. This contrasts with the “Amidar movement” that is believed to dictate enemy behavior. (Footnote 4: An enemy moves with a diagonal velocity, flipping the vertical direction when encountering the top or bottom of the board and horizontal direction when encountering the left or right edge.) The protocol matters for intervention because, for a lookup table, moving enemies will have no effect: enemies will simply “teleport” to the next location in the lookup table.
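The difference between the two protocols can be made concrete with a small sketch. It follows the footnote's description of diagonal movement on an open rectangular grid and ignores the maze layout, which is a simplifying assumption; the function names, the grid size, and the example loop are hypothetical. The lookup-table version depends only on the time index, which is why repositioning an enemy has no lasting effect under that protocol.

```python
def amidar_step(pos, vel, width, height):
    """Advance one enemy by one step of diagonal movement.

    pos and vel are (x, y) tuples; each velocity component is +1 or -1.
    The vertical direction flips at the top or bottom of the board and
    the horizontal direction flips at the left or right edge.
    """
    x, y = pos
    dx, dy = vel
    if not 0 <= x + dx < width:
        dx = -dx
    if not 0 <= y + dy < height:
        dy = -dy
    return (x + dx, y + dy), (dx, dy)


def lookup_step(table, t):
    """Lookup-table movement: the position depends only on the time index."""
    return table[t % len(table)]


# Diagonal movement: the next position is a function of the current one.
pos, vel = (0, 0), (1, 1)
path = [pos]
for _ in range(6):
    pos, vel = amidar_step(pos, vel, width=4, height=3)
    path.append(pos)

# Lookup table: an enemy moved elsewhere still lands on the table's entry.
loop = [(0, 0), (1, 0), (1, 1), (0, 1)]
next_after_intervention = lookup_step(loop, t=2)
```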

The upper left plot in Fig. 6 shows a baseline test for how an individual trained agent performs under each of four different enemy movement protocols: (1) a lookup table, on which the model was trained, (2) the “Amidar movement” protocol, (3) a random protocol, where at each junction the enemy chooses a random direction, and (4) an adversarial protocol, where enemies explore via random turns until the player is within line of sight, at which time they move toward the player’s location. Note that, since enemies start far away from the player, the agent can (and does) easily make progress at the start of the game, regardless of enemy position. However, as the game progresses, the enemies close in and there are fewer opportunities for rewards.

The upper right plot in Fig. 6 shows a test in which enemies “gang up” on the player: the enemies’ start positions are modified to be close to the player. We were at first surprised to see how well the agent did; however, upon examination, we found that the agent was using up the jump button, which allows the player to bypass enemies, at the beginning of the game. The lower half of Fig. 6 depicts the results of running the baseline and test for no jumps: while the baseline performs similarly, the player dies quickly for all non-lookup table enemy protocols.

Figure 8: Space Invaders case study. The top three bands depict the tests of the solitary first, second, and third shields respectively. The bars depict the total number of steps spent in the corresponding horizontal location, while the boxplots depict the scores. The fourth band shows the behavior for all shields present (green; score boxplots on the left) and no shields present (purple; score boxplots on the right). The bottom band shows the Space Invaders terrain and default positions of the shields.

Space Invaders: Shield Usage.

In Space Invaders, the player can seek refuge under three shields from the frontier of alien ships shooting down. We ran a test to see whether removing two of the three shields would cause an agent to use the remaining one more often. We also ran two baseline comparisons for a fixed amount of time: one where all shields were present (the default setting) and one where no shields were present.

Figure 8 shows the results under test. Since score provides an incomplete picture of agent behavior, we also tracked the agent’s location (a simple query in ToyBox). We observe that the player does not appear to change its preferred locations under any of the tests.
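Tracking where the player spends its time reduces to binning the player's horizontal coordinate at every step. The sketch below assumes the x-coordinates have already been read from the game state at each step; the trace values and the bin width are made up for illustration, and the state query itself is not shown.

```python
from collections import Counter


def location_histogram(x_positions, bin_width=10):
    """Count steps spent in each horizontal bin of width bin_width."""
    counts = Counter()
    for x in x_positions:
        counts[(x // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


# Hypothetical per-step player x-coordinates recorded during one trial.
trace = [52, 55, 58, 61, 61, 64, 120, 121, 59]
hist = location_histogram(trace)
```

Comparing such histograms across the shield conditions is what the bars in Figure 8 depict.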

A Note on Negative Results: Space Invaders, as we have implemented it, has turned out to be a fairly uninteresting game. Randomly selecting from the trimmed action set that OpenAI Gym allows can lead to fairly good performance. Furthermore, our implementation, which included both random and adversarial enemy behavior, led the agent’s behavior to be invariant to randomness in enemy behavior.

Findings. We have shown a range of interventions and queries possible with ToyBox, all of which would be impossible to conduct using ALE. The interventions we demonstrated were designed to demonstrate the power of ToyBox’s design and implementation, rather than to satisfy any particular RL research agenda. We were able to rapidly iterate on all of our experiments due to ToyBox’s fast performance and its simple API for editing state. In addition to highlighting ToyBox’s capacity for evaluating a single agent, we have shown how ToyBox may be used to evaluate models, by comparing the post-training performance ranking under test.

5 Related Work

We are hardly the first to suggest new or different benchmarks for deep RL (Kansky et al., 2017; Zhang et al., 2018; Wang et al., 2019). Four major qualities differentiate ToyBox from prior work: (1) it is based on a widely used and accepted community standard (ALE); (2) results on ALE can be replicated and compared in ToyBox, providing continuity to individual research trajectories; (3) a wide array of features of ToyBox environments are intervenable; furthermore, a particular configuration is easily exported and can be shared to further replication efforts; and (4) an individual game may be modified to produce a family of games, leading to a potentially infinite number of environments per game; for example, the injection of real-world images into the background of Breakout described by Zhang et al. (2018) would be trivial to implement in ToyBox.

Recall the available interventions in the traditional RL research environment shown in Figure 2. Most existing work manipulates the state input to the agent (i.e., the agent’s perception of state), the reward function, or the agent’s actions, e.g.:

  • State input: Injecting real-world data into the background of Atari 2600 games simulates non-random noise (Zhang et al., 2018). Skipping frames periodically is a critically important hyperparameter for tuning algorithms to play Atari (Mnih et al., 2013; Braylan et al., 2015; Seita, 2016).

  • Reward function: Hybrid reward structures decompose the reward function, making it easier for some agents to learn particularly difficult games in the Atari suite (Van Seijen et al., 2017).

  • Agent actions: Sticky actions, human starts, and random starts are all methods for intervening on the agent’s actions outside the normal parameters of ε-greedy exploration (Sutton et al., 1998; Mnih et al., 2013; Bellemare et al., 2013; Nair et al., 2015).

These efforts can help combat overfitting, learning spurious correlations, or generally failing to make progress on a task. ToyBox is orthogonal to such efforts.

We have introduced relevant citations throughout the paper. Here we highlight critical work that was not otherwise mentioned.

Evaluation and Replication. Recent investigations into how the community handles the evaluation and replication of agent performance have exposed some serious challenges that the community needs to address (Henderson et al., 2017; Balduzzi et al., 2018; Clary et al., 2018; Jordan et al., 2018). Environments such as ToyBox and evaluations of the style presented in Section 4 are one possible way to ameliorate issues surrounding replication, robustness, and variability.

Adversarial RL. Much work on adversarial RL focuses on exploiting decision boundaries (Mandlekar et al., 2017), adding nonrandom noise to state input for the purpose of altering or misdirecting policies (Huang et al., 2017), or introducing additional agents to apply adversarial force during training to produce agents with more robust policies in physics simulations (Pinto et al., 2017).

Saliency maps. Saliency maps were developed as an insight into model behavior (Simonyan et al., 2013), but more recently have been put forth as a tool for explainability (Greydanus et al., 2018). We show that in at least one case, saliency maps can be misleading, due in part to bias exemplified by the researcher’s models of the agent and the environment as shown in Fig. 2. Experiments enabled by ToyBox provide much more specific information and can disambiguate competing hypotheses about agent behavior.

6 Discussion

This paper is a proof-of-concept for experimentation about the behavior of deep RL agents. However, there are many possible applications beyond this type of post-training testing:

Rejection sampling/dynamic analysis. One of the biggest strengths offered in ToyBox is the ability to answer arbitrary questions about the environment structures and code at any time. Agents may encounter local minima during training that are not representative of the target deployment distribution, due to factors such as random seeds (Irpan, 2018). Training replicates with many random seeds is a costly solution to this problem. Instead, researchers could use ToyBox to monitor environmental features to test whether an agent is spending too much time in an undesirable state. Similar types of model monitoring could be used to identify “detachment,” a condition of the agent-environment interaction that induces catastrophic forgetting (Kirkpatrick et al., 2017; Ecoffet et al., 2018).
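One way to picture this kind of monitoring is a sliding-window check over a predicate on the game state. The sketch below is an illustration under stated assumptions: StateMonitor, the paddle_x field, the window size, and the threshold are all hypothetical, and the states would in practice come from querying the environment during training.

```python
from collections import deque


class StateMonitor:
    """Track how often a predicate over game state holds in a sliding window."""

    def __init__(self, is_undesirable, window=1000, threshold=0.5):
        self.is_undesirable = is_undesirable
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, state):
        # Record whether this frame's state satisfies the predicate.
        self.history.append(bool(self.is_undesirable(state)))

    def fraction(self):
        # Fraction of recent frames spent in the undesirable state.
        return sum(self.history) / len(self.history) if self.history else 0.0

    def should_flag(self):
        # True once the window is full and the fraction exceeds the threshold.
        return (len(self.history) == self.history.maxlen
                and self.fraction() > self.threshold)


# Hypothetical predicate: the paddle is stuck against the left wall.
monitor = StateMonitor(lambda s: s["paddle_x"] <= 0, window=4, threshold=0.5)
for x in [10, 0, 0, 0]:
    monitor.observe({"paddle_x": x})
```

A training loop could consult should_flag to stop or restart a run early instead of paying for many full replicates.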

Datasets from game families. The ability to generate a family of games with similar but different mechanics provides a convenient dataset which can be used in a variety of ways. For example, with ToyBox, a researcher can define a family of Breakout-style games with slightly different movement physics (e.g., ball velocity and acceleration, paddle-bounce mechanics) sampled from some real-valued parameter domain. This can be used to create a train/test split over environments (Cobbe et al., 2018), to support transfer learning experiments from one game physics to another (Taylor & Stone, 2009), or to test generalization across multiple environments (Guestrin et al., 2003).
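A family of this kind, together with a split over environments, can be sketched as follows. The parameter names (ball_speed, ball_acceleration, paddle_bounce) and their ranges are illustrative assumptions rather than ToyBox configuration fields; the point is only that each environment is a sampled configuration and the split is taken over configurations, not over episodes.

```python
import random


def sample_game_family(n_games, seed=0):
    """Sample Breakout-style configs with varied movement physics."""
    rng = random.Random(seed)
    family = []
    for game_id in range(n_games):
        family.append({
            "id": game_id,
            # Parameter names and ranges are illustrative assumptions.
            "ball_speed": rng.uniform(2.0, 6.0),
            "ball_acceleration": rng.uniform(0.0, 0.5),
            "paddle_bounce": rng.uniform(0.8, 1.2),
        })
    return family


def train_test_split(family, test_fraction=0.2, seed=0):
    """Shuffle the configs and hold out a fraction as test environments."""
    rng = random.Random(seed)
    shuffled = family[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


family = sample_game_family(100)
train_envs, test_envs = train_test_split(family)
```

Fixing the seeds makes the family and the split reproducible, so the same held-out environments can be shared alongside reported results.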

Adversarial testing. With total control over environment dynamics, trained agents can be stress-tested by running the agent on progressively more difficult versions of the game. Tests of this form can serve to disambiguate agent behavior that can be explained in multiple ways, much as ToyBox’s more advanced Amidar movement protocols revealed agents had not necessarily learned to avoid enemies so much as memorize their observed paths. Difficulty can be increased by increasing stochasticity in the environment, or by increasing the speed or accuracy of adversarial game elements. Similar methods could be used to create a suite of curriculum learning environments.

7 Conclusions

We have shown that ToyBox unlocks novel and important capabilities for evaluating deep reinforcement learning agents. We introduce a new paradigm for thinking about evaluating agents, in the style of acceptance testing. We demonstrate ToyBox capabilities with four case studies and outline a variety of other applications.