
Escape Room: A Configurable Testbed for Hierarchical Reinforcement Learning

Recent successes in Reinforcement Learning have encouraged a fast-growing network of RL researchers and a number of breakthroughs in RL research. As the RL community and the body of RL work grows, so does the need for widely applicable benchmarks that can fairly and effectively evaluate a variety of RL algorithms. This need is particularly apparent in the realm of Hierarchical Reinforcement Learning (HRL). While many existing test domains may exhibit hierarchical action or state structures, modern RL algorithms still exhibit great difficulty in solving domains that necessitate hierarchical modeling and action planning, even when such domains are seemingly trivial. These difficulties highlight both the need for more focus on HRL algorithms themselves, and the need for new testbeds that will encourage and validate HRL research. Existing HRL testbeds exhibit a Goldilocks problem; they are often either too simple (e.g. Taxi) or too complex (e.g. Montezuma's Revenge from the Arcade Learning Environment). In this paper we present the Escape Room Domain (ERD), a new flexible, scalable, and fully implemented testing domain for HRL that bridges the "moderate complexity" gap left behind by existing alternatives. ERD is open-source and freely available through GitHub, and conforms to widely-used public testing interfaces for simple integration and testing with a variety of public RL agent implementations. We show that the ERD presents a suite of challenges with scalable difficulty to provide a smooth learning gradient from Taxi to the Arcade Learning Environment.



1 Introduction

Reinforcement Learning (RL) is an AI paradigm in which an artificial agent explores and exploits its environment through an action-observation loop. RL has seen increasing success in recent years as it has been integrated with parallel developments in AI such as Deep Learning. These successes, however, have brought to light new problems with replication and reproducibility, inspiring researchers to focus attention on the challenge of accurately gauging the efficacy of new approaches.


The proliferation of public source code hosting and sharing technologies has created an unprecedented potential for peer evaluation and collaboration in the applied Computer Science disciplines such as AI. Services such as GitHub and BitBucket have enabled AI researchers to publish not only high-level descriptions and results, but the low-level implementation details that are necessary for perfectly replicating the experiments in a given body of work. Projects like the OpenAI Gym [7] and TensorFlow [1] have built on such collaboration frameworks to minimize the barrier to entry for new researchers and empower newcomers to replicate and build upon existing work.

Despite the availability of these tools, however, it is still common for works to exclude details and source code that are necessary for reproducibility [43]. It is reasonable to assume that when testing frameworks are open and available, researchers will be more likely to use them and will be more likely to report results that can be easily reproduced, evaluated, and integrated with parallel work.

Hierarchical Reinforcement Learning (HRL) is one specification of RL in which evaluation and reproducibility are hindered by a lack of suitable test domains. Hierarchies provide an efficient means of breaking monolithic tasks down into manageable interdependent chunks, and for this reason hierarchical learning has been an active area of RL research since RL’s inception. And yet, the hierarchical domains frequently used for evaluation in the literature are generally either too simplistic or too rigid to fully evaluate the capabilities of an HRL algorithm.

In this work we present a new HRL testing framework which we call the Escape Room Domain (ERD), named after the Escape Room genre popular in both virtual- and real-world puzzle-based challenges. The ERD is a parameterized schematic for generating individual instances, where each instance is a single testbed upon which an HRL agent can be evaluated and compared against alternative HRL algorithms. The ERD enables large-scale evaluation and reproducibility by featuring the following characteristics:

  1. Open Source and Public Availability - The ERD is published on a public-facing GitHub repository and is built exclusively on open-source and publicly available frameworks.

  2. Tutorials and Examples - The ERD retains a minimal barrier to entry to ensure maximal adoption within the general HRL community.

  3. Randomized Instances - Because the ERD is a parameterized schematic, researchers can report results that are averaged over a large collection of instances of varying difficulty, which mitigates the potential for overfitting an algorithm to a particular environment configuration.

  4. Unbounded Difficulty - ERD instances can be configured with arbitrary levels of hierarchical depth, state and action space sizes, and general logistical complexity, ensuring it can be expanded to evaluate a wide array of HRL innovations.

In the next section we will discuss the related HRL advances and existing domains upon which the ERD is founded. In Section 4 we present precise descriptions of the ERD and its generated instances, including dimensions of complexity, randomization features, and general applicability to existing work. In Section 6 we test a set of existing algorithms against varying ERD instances and use these results to motivate the continued development of HRL algorithms using ERD. Finally, in Section 7 we conclude and discuss the future of ERD in RL research.

2 Background

Testbed domains allow AI researchers to empirically evaluate their algorithms repeatably within a highly controlled (and usually virtual) environment. A well-designed testbed can serve as both a proof of concept and a point of comparison from one algorithm to another. With enough traction, a testbed becomes a standardized benchmark through which a wider range of algorithms may be indirectly compared.

CartPole (CP) and Double Pendulum (DP) are two examples of classic control problems that are often used in RL publications to demonstrate viability. These problems are well-understood and already considered to be solved, but they can still provide some degree of insight into the inner-workings of a particular RL algorithm. Unfortunately, many of these classic RL domains are relatively simplistic and tend not to provide new and interesting challenges to recently developed algorithms.

More recently, Atari emulation has become popular due to the variety of available games and coinciding range of available difficulties. Atari games are generally simple to understand and explain, yet their use of pixels for perceptual input has rendered such games outside the reach of AI learning algorithms until the advent of Deep Learning. Moreover, Atari games can be easily built and emulated with publicly available source code via the Arcade Learning Environment (ALE) [4]; just as the classic control problems have provided a consistent metric for validation and comparison of traditional RL algorithms, so have Atari games for modern pixels-to-actions AI.

Many existing RL algorithms have been designed without the capability of converting pixels to abstract perceptions, and a number of HRL algorithms are included in this category (see Section 2.2). But when we restrict our attention to hierarchical domains with built-in abstract perceptual input (e.g. “pixel-free” domains), the pool of available testbeds shrinks considerably. Domains such as Taxi [13] and bitflip [14] served as challenges for early HRL algorithms such as R-Max [5], MaxQ [13], and others that derived from or extended these ideas, but these domains are akin to classic control problems in that they have been more or less “solved” and may not bring out the full capabilities of a modern HRL algorithm. This paper aims to fill this void by presenting a new RL domain of moderate difficulty that can be readily integrated for comparison of varying HRL algorithms.

2.1 Markov Decision Processes

Reinforcement Learning (RL) is a paradigm for enabling autonomous learning wherein rewards are used to influence an agent’s action choices in various states. RL is built on the formalism of the Markov Decision Process, which is the theoretical construct that describes Reinforcement Learning problems.

An MDP is defined by a 5-tuple $\langle S, A, T, R, \gamma \rangle$ with state space $S$, action space $A$, transition distribution function $T$, reward function $R$, and discount factor $\gamma$. Reinforcement Learning methods seek to learn optimal policies through exploring and exploiting the state-action space of an MDP.
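For concreteness, the learning objective implied by this definition can be written out as below. The symbols $S$, $A$, $T$, $R$, $\gamma$, and $\pi$ follow standard MDP notation and are an assumption here, since the original symbols did not survive in this copy of the text.

```latex
\pi^{*} \;=\; \arg\max_{\pi} \;
\mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)
\;\middle|\; a_t \sim \pi(\cdot \mid s_t),\;
s_{t+1} \sim T(\cdot \mid s_t, a_t) \right],
\qquad s_t \in S,\; a_t \in A,\; \gamma \in [0, 1).
```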

Common variants of MDPs are Factored MDPs, where the state space is multidimensional and each dimension is considered a state factor, e.g. $S = S_1 \times S_2 \times \cdots \times S_n$. Partially Observable MDPs are those in which the state space cannot be directly observed, at least in full. In a POMDP, an agent’s belief state is instead drawn from a distribution, while its true state remains unknown. Finally, Semi-MDPs (SMDPs) are a generalization of MDPs in which time is continuous instead of discrete; SMDPs are popular for the use of time-extended actions. The following works that comprise much of the HRL background literature make use of these MDP variants.

2.2 Hierarchical RL

Hierarchical RL is the study of RL methods that organize primitive actions into a system of abstract state, action, transition, and reward hierarchies. Different approaches to HRL may focus on these different components of an MDP.

Feudal Reinforcement Learning (FRL) [10] is one such approach where a managerial hierarchy is used in a manner reminiscent of feudal fiefdoms. State and reward signals are interpreted and then transformed between each layer of the hierarchy to enable a wide range of learning granularities. Vezhnevets et al. take the Feudal paradigm a step further by applying recent advances in deep learning to Feudal-style Manager and Worker modules. The authors then use these so-called FeUdal Networks (FUNs) for solving ALE domains.

Kaelbling expands on this idea of hierarchical control with HDG [23], a hierarchical version of Q-Learning [52]. Where FRL is intended to be applied to general RL problems, HDG shows improved performance on specific tasks where an agent is rewarded based on a handful of individual goals rather than a general-purpose reward function. Conversely, Wiering and Schmidhuber generalize the mechanics of Q-Learning in a hierarchical context by expanding their work to POMDPs. HQ-Learning [53], a “hierarchical extension to Q-Learning”, enables an agent to learn a hierarchical value function in the setting of partially observable state.

Rather than focusing on state and reward abstraction and compartmentalization, as is done in FRL, other variants of HRL have sought to create frameworks to simplify the discussion and implementation of hierarchical algorithms. Hierarchical Abstract Machines (HAMs) [36] are one such example in which an agent makes use of hierarchically-organized action sets that focus the choices available to an agent based on a given state.

Macro Actions are another early development in HRL frameworks, first presented by Precup et al.; macro-actions encapsulate the concept of time-extended actions as a tuple $\langle s, \pi, \beta \rangle$ where $s \in S$, $\pi$ is a policy, and $\beta$ is a termination condition [39]. These actions function as abstractions over time that can be used for simple navigation tasks such as navigating between rooms in a block world. Precup later extends this work with the Options Framework [38], where an Option is defined by a tuple $\langle I, \pi, \beta \rangle$ with $I \subseteq S$, and with $\pi$ and $\beta$ defined as with macro-actions. An option thus generalizes the concept of a macro action to sets of initiation states rather than a single discrete starting point.
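The option tuple described above can be illustrated with a minimal Python sketch. The names `Option`, `run_option`, and `corridor_step` are hypothetical, the corridor is a toy stand-in for a room-navigation task, and the termination condition is simplified to a deterministic predicate rather than a termination probability.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

State = int
Action = int


@dataclass(frozen=True)
class Option:
    """Illustrative option: initiation set, policy, termination condition."""
    initiation_set: FrozenSet[State]      # states in which the option may start
    policy: Callable[[State], Action]     # maps a state to a primitive action
    terminates: Callable[[State], bool]   # simplified, deterministic termination


def run_option(option: Option, state: State,
               step: Callable[[State, Action], Tuple[State, float]],
               max_steps: int = 100) -> Tuple[State, float, int]:
    """Follow the option's policy until it terminates or max_steps is hit."""
    if state not in option.initiation_set:
        raise ValueError("option cannot be initiated in this state")
    total_reward, steps = 0.0, 0
    while steps < max_steps:
        state, reward = step(state, option.policy(state))
        total_reward += reward
        steps += 1
        if option.terminates(state):
            break
    return state, total_reward, steps


def corridor_step(state: State, action: Action) -> Tuple[State, float]:
    """Toy corridor with cells 0..5; each primitive action costs 1."""
    return min(5, max(0, state + action)), -1.0


# A time-extended "walk to the door at cell 5" action, usable from cells 0..4.
go_to_door = Option(
    initiation_set=frozenset(range(5)),
    policy=lambda s: +1,
    terminates=lambda s: s == 5,
)

final_state, total_reward, n_steps = run_option(go_to_door, 2, corridor_step)
```

A macro-action in the earlier sense corresponds to the special case where `initiation_set` holds a single state.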

Other variants of HRL make use of hierarchical models to guide some form of planning or simulation. MAXQ [13] is one such example which relies on expert-provided hierarchical action models to guide action selection. R-MAXQ [22] expands on MAXQ’s hierarchical decomposition framework by incorporating the model-based exploration techniques of R-MAX to cope with scarce rewards. H-UCT [50] similarly expands the scope of MAXQ’s hierarchical action models by generalizing their application to POMDPs.

Conversely, model-based HRL algorithms may include mechanisms for learning their own models from scratch. SLF-RMax [44] analyzes histories in order to infer action dependencies and produce a hierarchical action model, using Dynamic Bayesian Networks (DBNs) [11] as the core structure for modeling dependencies in the factored state spaces of FMDPs. More recent work on learning DBN-based action models has focused on improving sample efficiency to learn accurate models more quickly [51, 29].

HRL algorithms have traditionally focused on discrete problems, or problems in which state-actions are discretized via intermediate methods (e.g. tile coding). However, this need for discretization hinders application to modern, complex domains like those found in ALE. To some extent, the lack of HRL-solvable domains has left modern RL algorithms without the mechanisms necessary to effectively model action hierarchies when such models would be beneficial. As we will show in Section 6, domains that can be readily solved may be trivially augmented with hierarchical components, leading to disproportionately negative effects on overall agent performance. We find that modeling action hierarchies is therefore an indispensable tool in solving some otherwise simple RL domains, and that new testing domains are needed to encourage new developments in this area of RL.

2.3 Desiderata

We now describe the key characteristics we consider in testing domains for general-purpose RL research, and then examine the extent to which these characteristics are found in HRL papers published in recent years. The desiderata are as follows.

  1. Availability: The domain is freely available for research online.

  2. Accessibility: The domain is constructed and documented in a way that conforms to the norms of the research community.

  3. Flexibility: The domain is built on an open-source framework and can be readily adapted to the varying needs of individual researchers.

  4. Scalability: The domain contains built-in mechanisms for iteratively rescaling its difficulty to provide a gradient of successive challenges.

Availability and accessibility are straightforward desiderata; any domain that is already implemented and simple to integrate with existing code is more likely to be tested against than domains that exist only in the abstract. However, many popular domains are neither flexible nor scalable, so these two characteristics merit further discussion.

2.4 Flexibility

Flexibility is an uncommon criterion for testing domains because changes in the testing environment translate to difficulty comparing results. Ideally, two distinct algorithms would be compared based on their performance on identical problems. But just as there is no single RL algorithm that can be applied to all problem domains, there is no single problem domain that can benchmark all RL algorithms.

Rather than keeping a given problem domain static and identical among all instantiations, variation can be critical to comparing otherwise incomparable algorithms. As a simple example, consider that some RL algorithms are designed for low-level control on continuous spaces, and others are designed for high-level planning over small sets of discrete state-actions. It is generally simpler to restrict comparison to those algorithms that fit a single paradigm, but there may still be value in transcending these distinctions. A flexible domain that can provide degrees of abstraction may enable comparison between otherwise incomparable solutions.

A second benefit of being able to easily modify a test domain is the incorporation of randomness and automatic domain generation. A completely static domain is especially vulnerable to overfitting, and may partially explain the recent problems with reproducibility in the RL community [15]. One way to mitigate overfitting is to evaluate algorithms on their aggregate performance over multiple randomly-varied instances of a particular domain schematic. In the Taxi domain, for example, the pickup and dropoff locations can be moved about the grid to avoid overfitting an agent’s policies for hard-coded locations. ALE lies on the opposite end of the flexibility spectrum, since it relies on binary Atari ROMs to describe each game.
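The aggregate-evaluation idea above can be sketched in a few lines of Python. Everything here is illustrative: `make_instance` is a hypothetical generator for a toy corridor whose goal position depends on a seed, not part of the ERD or Taxi code, and the reward values are chosen only for the example.

```python
import random
import statistics


def make_instance(seed):
    """Hypothetical instance generator: the goal cell varies with the seed."""
    rng = random.Random(seed)
    return {"length": 10, "goal": rng.randint(1, 9)}


def run_episode(instance, policy, max_steps=50):
    """Return the undiscounted return of one episode on a toy corridor."""
    position, total = 0, 0
    for _ in range(max_steps):
        position = max(0, min(instance["length"] - 1, position + policy(position)))
        total -= 1                      # illustrative per-step cost
        if position == instance["goal"]:
            total += 100                # illustrative goal bonus
            break
    return total


def evaluate(policy, seeds):
    """Report mean and standard deviation over randomly varied instances."""
    returns = [run_episode(make_instance(seed), policy) for seed in seeds]
    return statistics.mean(returns), statistics.stdev(returns)


def always_right(position):
    return 1


mean_return, std_return = evaluate(always_right, seeds=range(20))
```

Reporting the seed list alongside the mean and spread lets other researchers regenerate exactly the same set of instances.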

2.5 Scalability

In addition to being adaptive to the varying capabilities of different algorithms, an ideal testbed will provide researchers with a difficulty gradient by accommodating iterative adjustments to the domain’s overall difficulty. Grid Worlds naturally meet this criterion to some degree by nature of being configurable in size; a Grid World can be incrementally complexified by increasing its dimensions, for example. The more dimensions along which a domain can be complexified in this manner, the more precisely a researcher can probe and challenge her solutions, and the more quickly she can identify areas for improvement.

Ideally, the dimensions along which a domain is scaled are consistent throughout the works that make use of the domain. If Algorithm A is shown to perform well as the size of a Grid World increases, but Algorithm B is shown to perform well as the density of objects within the world increases, it can be difficult to make a comparative statement with respect to these two algorithms. So in addition to simply being scalable, scalability should be provided with the testbed as a configurable setting for other researchers to make use of. As with the focus on accessibility and availability, scaling dimensions should be made accessible and available along with the domain itself.

3 Related Work

Before setting out to construct a new domain for HRL research, we first investigated whether any suitable alternatives already exist. To do so, we searched and collected data from a large collection of conference papers published in recent years.

The analysis process presented a number of challenges. Most conferences make their publications freely available through online libraries organized by publication year, so there were many resources available for analysis. However, libraries are generally published for human readers rather than automated analysis systems, so in order to even filter through all of the papers that have been published in recent years it is necessary to implement an individual tool for each conference that can scrape download locations, convert PDF files to text, and perform basic keyword filtering.

We limited our analysis to two annual AI conferences: Autonomous Agents and MultiAgent Systems (AAMAS) [20], and the Association for the Advancement of Artificial Intelligence (AAAI) [2]. These two conferences were selected based on their relevance to HRL and the accessibility and availability of their online publication archives. We analyzed every work published by these two conferences from 2010 to 2018, amounting to 2,476 papers from AAMAS and 5,055 papers from AAAI. Of these papers, we identified 26 AAMAS papers and 42 AAAI papers that concern or mention HRL, and then specifically noted which testbed domain(s) each publication relied upon to explain, compare, or demonstrate its contributions.

3.1 AAAI Meta-Analysis

Table 3.1 lists all the domains found in the 42 AAAI papers that were used at least twice. Many of the domains fulfill some of the criteria of Section 2.3, but none of the domains fulfill all of them. The majority of domains have no source code available and are designed as “single-use” domains for the purpose of evaluating their respective authors’ algorithms.

| Domain | # | Citations |
|---|---|---|
| Taxi | 3 | Xu and Laird [54]; Vien and Toussaint [50]; Li et al. [25] |
| ALE | 3 | Bacon et al. [3]; Hessel et al. [18]; Harb et al. [17] |
| Blocks World | 2 | Hogg et al. [19]; Xu and Laird [54] |
| Mario | 2 | Taylor [46]; Derbinsky et al. [12] |
| RoboCup 2D | 2 | MacAlpine et al. [27]; Masson et al. [28] |
| Puddle World | 2 | Ruan et al. [41]; Osa and Sugiyama [34] |

Table 3.1: A list of all domains occurring in at least 2 of the 42 papers mentioning HRL in all AAAI publications since 2010. An additional 44 domains, each used in only 1 of the 42 surveyed publications, are not listed here due to space constraints.

One notable domain near the top of Table 3.1 comes close to meeting our needs. The Arcade Learning Environment [4] is by far the most easily available and accessible of the group. While ALE is found in just 3 of the HRL papers we surveyed, it is a popular choice for Deep Learning research in other venues. Its popularity speaks to its success as a robust testing environment for Deep RL agents.

3.2 AAMAS Meta-Analysis

We now consider the results of the AAMAS meta-analysis. As above, we compiled a list of all domains found in the 26 papers that were used at least twice in AAMAS conferences over the past decade; these results are found in Table 3.2. As with AAAI, most domains were single-use (even between conferences). We note that while ALE showed up once in this analysis, it is included in Table 3.2 because it also appears in Table 3.1.

| Domain | # | Citations |
|---|---|---|
| Taxi | 5 | Osentoski and Mahadevan [35]; Chaganty et al. [9]; Bratman et al. [6]; Ngo et al. [32]; Li et al. [24] |
| Four Rooms (Blocks World Variant) | 3 | Chaganty et al. [9]; Roderick et al. [40]; Jain and Precup [21] |
| Foraging (Blocks World Variant) | 2 | Bratman et al. [6]; Sullivan and Luke [45] |
| ALE | 1 | Omidshafiei et al. [33] |

Table 3.2: A list of all domains used in AAMAS papers over the past decade which occur at least twice in the 68 papers from our meta-analysis.

3.3 Combined Results

Taxi, Blocks World, Four Rooms, and Foraging are the only domains in Tables 3.1 and 3.2 that appeared in more than one publication and were designed specifically as HRL challenges. These four domains are similar in design and complexity; they are optimized for research on discrete, multi-level hierarchies, and contain embedded transition dynamics that greatly benefit from hierarchical planning without a need for image recognition. Just as domains like Cart Pole and Double Pendulum are considered “Classic Control” problems, these four grid navigation domains function as “Classic HRL” problems. We note that the Escape Room Domain we propose in Section 4 is similar to the grid-based domains of Tables 3.1 and 3.2 in an abstract sense. In fact, ERD is better described as a descendant of Taxi than as a direct alternative, since each domain is built on the premise of hierarchical path planning through a virtual environment.

Unfortunately, we found no domains designed for HRL that were of moderate difficulty, falling somewhere between these Classic HRL problems and the more modern pixel-based challenges like ALE. Just as modern AI agents flourish when a smooth gradient can be found, so do researchers when a smooth gradient exists between testbeds; the lack of gradients in HRL domains serves to hinder progress in this area.

Table 3.1 illustrates the lack of intersection among the test domains of recent publications in HRL and supports our assertion that a robust, unified testbed will help to encourage and validate progress in this area of Reinforcement Learning. In the next section we describe our solution to this problem in detail and explain how it fulfills our own criteria as well as the needs of the RL and HRL communities in general.

4 The Escape Room Domain

The Escape Room Domain (ERD) is based on a series of popular video games and real-life team-building exercises. An Escape Room is a game consisting of an enclosed space and a series of puzzles. In order to “escape”, the agent(s) inside the room must solve the available puzzles in order to unlock the exit. Generally (but not necessarily) these puzzles are arranged in some sequential order and are combined with a series of clues to guide the agent toward a solution.

Escape Rooms are becoming a trend both virtually and in the real world. Popular implementations can be found in major cities and group-oriented tourist destinations such as Las Vegas, USA, and variations on this theme can be found in computer platformer games like Portal [48] and virtual reality environments such as the game Keep Talking and Nobody Explodes [42]. However, unlike other virtual games that focus on controlling an avatar, the ultimate goal of an Escape Room is to break down complex tasks into manageable components, and to organize such components into a final solution for the domain. Hierarchical RL is an ideal candidate for such endeavors.

The ERD reconstructs the popular notions of an Escape Room as an RL testbed, and allows RL agents to solve virtual Escape Rooms using the infrastructure we present in this paper. Each ERD instance consists of a room with a single exit and a predefined puzzle that the RL agent must solve prior to exiting the room. The agent’s internal state (e.g. joint configurations) and external state (e.g. world position) are concatenated into a single state vector that can be manipulated through a set of discrete actions. In the next section we describe the specific state and action spaces, as well as the transition and reward functions, which together comprise the MDP of this testbed domain.

4.1 The Escape Room MDP

Our first step in describing the ERD implementation then is to describe its theoretical foundation in the language of an MDP. It is important to note that because the ERD is a “flexible” domain, it is more accurately described as a domain schematic than a single, static domain in itself. An instance of ERD may be defined based on the specific criteria and parameters laid out below; any ERD instance is therefore comprised of the following dimensions:

  1. The agent’s 6-dimensional pose consisting of both 3D position (X, Y, and Z) and 3D orientation (Heading, Pitch, and Roll). These state dimensions are always continuous.

  2. A set of puzzle-specific dimensions that describe the state of the puzzle (see Section 5) that must be solved in order to unlock the room’s exit. These state dimensions may be either continuous or discrete.

  3. A set of 1-DoF joint positions that describe the joints on the agent’s virtual robotic arm. The arm may (optionally) be used by the agent to interact with the puzzle specific to its ERD instance.

Based on these definitions, any ERD instance must have at least 6 state dimensions; however, there is no restriction on the maximum number of dimensions. Specifics such as the sizes of arm links, the orientation of arm joints, and the transition dynamics of an embedded puzzle are all determined at the discretion of the researcher who designs that particular ERD instance.

Figure 4.1: A bird’s-eye view of the layout used for the ERD with embedded Button Puzzle (see Section 5).

The Actions of the ERD are designed to be easily mapped to a real-world robot agent (e.g. a robot arm) with an auxiliary movement controller, and thus no explicit actions exist that are tied to a room’s embedded puzzle. Instead, actions only affect the robot’s position and its joint positions. The actions are defined in Table 4.1.

Movement Action Displacement
Move Forward/Move Back 1 Meter
Strafe Left/Right 1 Meter
Turn Left/Right 10 Degrees
Increment/Decrement Joint 10 Degrees
Table 4.1: The ERD action space.

The transition function is governed primarily by the puzzle embedded in each specific ERD instance; however, movement and actuation actions each invoke the expected state transitions. For example, if the agent executes “Move Forward”, it will be displaced by approximately 1 meter in whatever direction it is facing.
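The movement portion of these transitions can be sketched as an idealized planar update. This is a simplified illustration rather than the ERD implementation: it assumes a heading measured counterclockwise from the +X axis, ignores collisions and physics noise (the text only promises a displacement of approximately 1 meter), and omits the joint actions. The class `PlanarPose`, the function `apply_action`, and the action names are hypothetical.

```python
import math
from dataclasses import dataclass

STEP_METERS = 1.0     # displacement of the move and strafe actions
TURN_DEGREES = 10.0   # rotation of the turn actions


@dataclass(frozen=True)
class PlanarPose:
    x: float
    y: float
    heading_deg: float  # assumed convention: counterclockwise from +X


def apply_action(pose: PlanarPose, action: str) -> PlanarPose:
    """Return the pose after one noise-free movement action."""
    h = math.radians(pose.heading_deg)
    forward = (math.cos(h), math.sin(h))
    left = (-math.sin(h), math.cos(h))
    moves = {
        "move_forward": forward,
        "move_back": (-forward[0], -forward[1]),
        "strafe_left": left,
        "strafe_right": (-left[0], -left[1]),
    }
    turns = {"turn_left": TURN_DEGREES, "turn_right": -TURN_DEGREES}
    if action in moves:
        dx, dy = moves[action]
        return PlanarPose(pose.x + STEP_METERS * dx,
                          pose.y + STEP_METERS * dy,
                          pose.heading_deg)
    if action in turns:
        return PlanarPose(pose.x, pose.y,
                          (pose.heading_deg + turns[action]) % 360.0)
    raise ValueError(f"unknown action: {action}")


pose = PlanarPose(0.0, 0.0, 0.0)
pose = apply_action(pose, "turn_left")      # heading becomes 10 degrees
pose = apply_action(pose, "move_forward")   # moves 1 m along the new heading
```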

The reward function of ERD is intentionally simple. Whenever the agent takes an action it incurs a reward of -1. When the agent successfully exits the room, it earns a reward of 100. The purpose of using a sparse reward function is to encourage general-purpose algorithms that require minimal expert configuration prior to deployment.
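As a small worked example of this reward structure, the sketch below computes the undiscounted return of an episode. It assumes that the action which exits the room receives both the -1 step cost and the +100 exit reward; the text does not say whether the two are combined, and the function names are hypothetical.

```python
STEP_REWARD = -1    # incurred on every action
EXIT_REWARD = 100   # earned once, on successfully exiting the room


def reward(exited_room: bool) -> int:
    """Reward for one action (assumes the exiting action pays both terms)."""
    return STEP_REWARD + (EXIT_REWARD if exited_room else 0)


def episode_return(num_actions: int, escaped: bool) -> int:
    """Undiscounted return of an episode consisting of num_actions actions."""
    if num_actions < 1:
        raise ValueError("an episode needs at least one action")
    rewards = [reward(False)] * (num_actions - 1) + [reward(escaped)]
    return sum(rewards)


escaped_return = episode_return(40, escaped=True)
timeout_return = episode_return(40, escaped=False)
```

Under these assumptions an agent that escapes in 40 actions receives a return of 60, while an agent that exhausts a 40-action budget without escaping receives -40.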

In Section 6 we perform a set of experiments using a small set of similar ERD instances and provide results that have been averaged over each. The general layout of these instances is depicted in Figure 4.1. The path depicted in the figure shows one possible route that satisfies a hypothetical button dependency configuration. In general, ERD instances are randomized and the Button Puzzle uses a random seed for generating its button dependency graph, so the optimal routes change between instances.

5 Software Implementation

Our first step in implementing the Escape Room was to pick the software frameworks that would enable smooth integration with general RL problems, as well as support the HRL-specific task hierarchies we discussed earlier in this work.

The OpenAI Gym [7] is already a popular framework for designing RL test domains; however, it offers little support for modeling complex 3D environments. The Gym’s answer to 3D modeling is MuJoCo [47]. While MuJoCo provides much of the infrastructure we needed for ERD, it is closed-source and relatively expensive. Moreover, MuJoCo’s license restrictions make cluster-based learning prohibitively expensive. MuJoCo’s authors point out that free trials are available, but researchers might be hesitant to tightly couple their work with services that would eventually be too expensive to maintain.

Instead of relying on MuJoCo, we used the Panda3D Game Engine [8]. Panda3D is a completely free and open source game development framework that provides all of the basic infrastructure necessary for defining and interacting with a 3D virtual environment. We know of no other existing integrations between the Gym and Panda3D, so in creating the ERD we also designed a programmatic framework for expressing 3D environments as Gym-compatible testing environments. This framework is included in our ERD GitHub repository.

5.1 Puzzles

Each ERD instance contains an embedded puzzle that integrates with the physical makeup of the room. The puzzle can be any physics-based challenge that must be solved prior to exiting the room. Figure 4.1 shows how we embedded a Button Puzzle where the buttons must be pressed in a specific order before the room can be exited. However, this Button Puzzle can be swapped out for any other puzzle that a researcher (or domain designer) wishes. In this way, the puzzle embedded in a given ERD instance dictates the overall difficulty of the instance’s MDP.

Ideally, each puzzle is further customizable to a minor degree. For example, in the Button Puzzle the locations and sizes of the buttons can be adjusted in order to randomize the room and prevent overfitting. One can see how the puzzle might be extended to add more buttons, or even a variety of interactive controls to complexify the MDP and provide greater challenge and randomization to the agent.

5.2 Button Puzzle

Although the name is new, the Button Puzzle has been used in a variety of HRL publications under different identities, such as the BitFlip Domain [14], the LightBox Domain [51], the Random Lights Domain [29], and Randomly Generated Factored Domains [16]. In each instance, the domain consists of binary variables that are causally related to one another. Only one variable can be modified at a time, and so the agent benefits greatly from learning an action hierarchy that represents the causal structure of the different variables in the domain. As shown by [29], complex versions of the Button Puzzle can be fully modeled and solved by intrinsically-motivated agents in fewer than 20,000 timesteps.

In our implementation there are up to 4 buttons that may be causally related to one another according to any directed acyclic graph (DAG), meaning that a button can only be switched “on” after all of the buttons it depends on have been switched “on” as well. Therefore an agent that can deduce the causal structure of the buttons can quickly determine the correct order in which to toggle them. One button is designated as the goal, whose identity is unknown to the agent, and when this button is toggled the agent is free to exit the room.
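The dependency rule described above can be illustrated with a minimal sketch. This is not the ERD implementation: the class `ButtonPuzzle`, the helper `valid_order`, and the button names are hypothetical, and the sketch models only switching buttons “on”, ignoring the 3D room entirely.

```python
from typing import Dict, List, Set


class ButtonPuzzle:
    """Hypothetical model: buttons whose prerequisites form a DAG."""

    def __init__(self, prerequisites: Dict[str, Set[str]], goal: str):
        # prerequisites[b] is the set of buttons that must be "on" before b.
        self.prerequisites = prerequisites
        self.goal = goal
        self.on: Set[str] = set()

    def press(self, button: str) -> bool:
        """Try to switch a button on; return True if it is on afterwards."""
        if self.prerequisites[button] <= self.on:
            self.on.add(button)
        return button in self.on

    def solved(self) -> bool:
        """The room can be exited once the goal button is on."""
        return self.goal in self.on


def valid_order(prerequisites: Dict[str, Set[str]]) -> List[str]:
    """Return one press order that respects every prerequisite."""
    # Work on copies so the caller's sets are left untouched.
    remaining = {b: set(p) for b, p in prerequisites.items()}
    order: List[str] = []
    while remaining:
        # Buttons with no unmet prerequisites can be pressed next.
        ready = sorted(b for b, p in remaining.items() if not p)
        if not ready:
            raise ValueError("prerequisites contain a cycle")
        for b in ready:
            order.append(b)
            del remaining[b]
        for p in remaining.values():
            p.difference_update(ready)
    return order
```

For example, with `{"b1": set(), "b2": {"b1"}, "b3": {"b1"}, "b4": {"b2", "b3"}}` and goal `"b4"`, pressing `"b4"` first has no effect, while pressing the buttons in the order returned by `valid_order` solves the puzzle. An agent that has learned the graph can compute such an order directly instead of discovering it by trial and error.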

5.3 Accessibility

The OpenAI Gym and the Arcade Learning Environment have shown that reducing the barrier to entry is a key endeavor to ensure the proliferation of a testing domain. We have therefore taken every measure to minimize the cost of integration for ERD. ERD is hosted on our GitHub page, and can be integrated with most puzzles so long as they conform to a minimal API. We have also ensured that ERD can be run in a number of modes to aid with the process of debugging integrations or agent performance. The modes are as follows:

  • Manual - The agent is controlled entirely by the user via keyboard presses.

  • Debug - The agent is controlled by a provided RL algorithm, and a great deal of diagnostic information is reported by the program.

  • Release - Time is virtually dilated and visualizations are disabled to maximize processing speed. This mode runs approximately 100 times faster than the other modes, and is optimized for long-term, repeated, and unattended experiments.

Together, these features make ERD trivial to integrate with existing Gym-compatible agents for the general RL community. Any agent that is interoperable with the OpenAI Gym can be evaluated against the ERD without modification.
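To make the notion of Gym interoperability concrete, the following sketch shows the interaction contract such agents rely on. It assumes the classic Gym convention in which `reset()` returns an observation and `step(action)` returns an `(observation, reward, done, info)` tuple. `StubEnv` and `run_episode` are hypothetical stand-ins written for illustration; they are not part of ERD or of the Gym library.

```python
from typing import Callable, Tuple


class StubEnv:
    """Hypothetical environment exposing the classic reset/step contract."""

    def __init__(self, goal: int = 3, max_steps: int = 1000):
        self.goal = goal
        self.max_steps = max_steps
        self.state = 0
        self.t = 0

    def reset(self) -> int:
        self.state, self.t = 0, 0
        return self.state

    def step(self, action: int) -> Tuple[int, float, bool, dict]:
        # Action 1 moves toward the goal; any other action does nothing.
        self.t += 1
        if action == 1:
            self.state += 1
        done = self.state >= self.goal or self.t >= self.max_steps
        return self.state, -1.0, done, {}


def run_episode(env, policy: Callable[[int], int]) -> float:
    """Run one episode against any object with reset() and step()."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return total
```

Because `run_episode` depends only on `reset` and `step`, the same loop would apply to any environment that honors this contract, which is the property the paragraph above relies on.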

6 Experiments and Results

In this section we present the results of a set of experiments that demonstrates the need for hierarchical action models and algorithms that can integrate such models into their action selection processes. These experiments show that, when seemingly trivial problems rely on hierarchical action sequences and sparse reward functions, these problems can take an extremely long time to solve even with modern RL algorithms.

The following experiments make use of existing, publicly-available implementations [37] of the following algorithms:

  1. Deep Q Learning (DQN) [30, 31]

  2. Deep SARSA [55]

  3. Deep Deterministic Policy Gradient (DDPG) [26]

Figure 6.1: A simple baseline experiment using an ERD instance with one button. Each algorithm is able to reach the goal with varying degrees of frequency; however, DDPG never stabilizes on a single, high-fitness policy.

Figure 6.1 shows our baseline experiment, in which these algorithms achieve some amount of success on a minimally difficult instance of ERD, where a single button must be pressed before exiting the domain. The agent receives a reward of -1 for each timestep, up to a maximum of 1,000 timesteps per episode. If the agent touches all of the buttons and reaches the goal within 1,000 timesteps, the agent receives an additional reward of 100. Movement speed is defined such that the goal can be achieved within approximately 10 timesteps, meaning that the optimal reward is approximately 90 per episode. The rewards in the figure have been rescaled as percentages relative to the minimum and maximum cumulative rewards that can be obtained per episode. Each data series has been averaged over 10 separate trials.
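The reward arithmetic above can be checked with a short sketch. The constants come from the text; the function names, the use of exactly 10 timesteps as the best case, and the linear min-max form of the rescaling are assumptions made for illustration, since the text gives only approximate figures and does not spell out the rescaling formula.

```python
STEP_REWARD = -1.0   # reward per timestep
GOAL_BONUS = 100.0   # bonus for pressing all buttons and reaching the goal
MAX_STEPS = 1000     # episode length cap
BEST_STEPS = 10      # assumed: "approximately 10 timesteps" taken as exact


def episode_return(steps_taken: int, reached_goal: bool) -> float:
    """Cumulative reward for one episode under the stated reward scheme."""
    steps = min(steps_taken, MAX_STEPS)
    return STEP_REWARD * steps + (GOAL_BONUS if reached_goal else 0.0)


MIN_RETURN = episode_return(MAX_STEPS, False)   # never reaches the goal
MAX_RETURN = episode_return(BEST_STEPS, True)   # fastest successful episode


def rescale(ret: float) -> float:
    """Express a return as a percentage of the achievable range (assumed)."""
    return 100.0 * (ret - MIN_RETURN) / (MAX_RETURN - MIN_RETURN)
```

Under these assumptions the worst episode returns -1000 and the best returns 90, so a timed-out episode maps to 0% and an optimal one to 100%.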

The figure shows that DDPG and DQN reach viable policies almost immediately, and then repeatedly adjust those policies for the remainder of each trial. SARSA stabilizes after approximately 100 episodes. Policies that do not earn maximal reward tend to randomly explore partway through each episode. Small changes in orientation, for example, can lead to agent trajectories that never toggle buttons or reach the exit. Each agent consistently exhibited optimal performance on at least 25% of episodes by the end of the experiment.

In our next experiment we show how the addition of a seemingly trivial domain element can hamstring these algorithms’ ability to quickly learn viable policies. In Figure 6.2, we have taken the domain from Figure 6.1 and added a second button, and enforced the rule that both buttons must be pressed in a particular order before the agent can reach the domain instance’s goal state. Each algorithm was actively trained for 20,000 timesteps. For comparison, other versions of the Button Puzzle found in earlier work contain 10-20 buttons and can be accurately modeled and solved within 20,000 timesteps by model-based HRL algorithms [51, 29].

Figure 6.2: An extension of the baseline experiment using an ERD instance with two buttons. None of the tested algorithms were able to reach the goal state.

Figure 6.2 demonstrates the drastic reduction in fitness that is observed for these algorithms when a domain requires hierarchical action sequences to observe sparse rewards. It is striking that we see such contrast between deceptively similar domains; however, the fact that modern Deep RL algorithms fall apart in the face of such challenges reinforces the claim that a wider variety of testing domains is needed, and in particular, that domains requiring hierarchical action sequences should be included as standard benchmarks for forthcoming work.

6.1 Meta-Actions

In our final experiment we show that the inclusion of a hierarchical action model is sufficient for solving the proposed multi-button ERD instances in a timely manner. Direct modification of the algorithms we tested is outside the scope of this work, so instead we added auto-generated meta-actions that are capable of performing individual button presses from any state. For example, if an agent is in an ERD instance with 2 buttons, it will have access to all actions outlined in Table 4.1, as well as 3 additional actions: meta-01, meta-02, and meta-exit. These actions construct and execute sequences of primitive actions that navigate to Button 1, Button 2, and the exit region, respectively. Each such action appears as a single timestep to the agent, but yields a reward relative to the number of primitive steps taken (e.g. -16 for meta-actions that require 16 primitives).
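The following sketch illustrates how a meta-action can collapse a sequence of primitives into a single agent-visible step while accumulating the per-primitive reward. It uses a toy 2D grid rather than the ERD room; `ToyRoom`, the four primitive moves, and the target coordinates are hypothetical and do not correspond to the actions in Table 4.1.

```python
from typing import Dict, Tuple

Position = Tuple[int, int]

# Hypothetical primitive moves on a toy grid.
PRIMITIVES: Dict[str, Position] = {
    "north": (0, 1),
    "south": (0, -1),
    "east": (1, 0),
    "west": (-1, 0),
}


class ToyRoom:
    """Toy grid world with primitive moves and navigation meta-actions."""

    def __init__(self, start: Position, targets: Dict[str, Position]):
        self.pos = start
        self.targets = targets  # meta-action name -> destination cell

    def step_primitive(self, name: str) -> float:
        """Apply one primitive move and return its reward of -1."""
        dx, dy = PRIMITIVES[name]
        self.pos = (self.pos[0] + dx, self.pos[1] + dy)
        return -1.0

    def step_meta(self, name: str) -> float:
        """Navigate to the named target; return the summed primitive reward."""
        tx, ty = self.targets[name]
        total = 0.0
        while self.pos != (tx, ty):
            if self.pos[0] != tx:
                move = "east" if self.pos[0] < tx else "west"
            else:
                move = "north" if self.pos[1] < ty else "south"
            total += self.step_primitive(move)
        return total
```

For instance, starting at `(0, 0)` with `"meta-01"` mapped to `(10, 6)`, a single call to `step_meta("meta-01")` executes 16 primitives and returns -16, matching the example in the text. From the agent's perspective this is one timestep with a large state change.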

Figure 6.3: An extension of the baseline experiment using an ERD instance with two buttons and additional meta-actions. The meta-actions enabled significant performance improvements relative to the primitive-only experiment.

The results of the meta-action experiment are shown in Figure 6.3. These results demonstrate that significant performance improvements are possible when an accurate action model is available to an RL agent. However, both SARSA and DDPG show worsening performance over time. The reason for this phenomenon is that the meta-actions appear to these agents as exceptionally large state changes that occur over a single timestep. Because SARSA and DDPG updates consider multiple states and actions simultaneously, this can negatively impact the accuracy of these algorithms’ deep value networks. DDPG is particularly affected because of its reliance on computed gradients. Although it may be possible to modify these algorithms to address this issue, these results show that an action model alone is not enough: to ensure increasing fitness over time, the model must be thoroughly integrated with each algorithm’s action selection and value fitting procedures.

7 Conclusion and Future Work

In this paper we have presented the Escape Room Domain, a new testbed domain designed for advancing HRL research. We have demonstrated that the need for such testbeds exists, both as a means of bridging the “difficulty gap” left behind by existing testbeds, as well as to provide an available, flexible framework for comparison between a wide array of RL algorithms. The ERD is built on open source software and is freely available on our public GitHub repository.

This domain leaves ample room for future work through collaboration with the RL community on the GitHub repository. In particular, we plan to use the ERD as a basis for challenging, evaluating, and comparing new developments in HRL algorithms. We will be continually updating the repository with references to new algorithm implementations that make progress toward solving ERD instances of increasing difficulty.

Alongside our goal of integrating and comparing against new solutions, we will also be expanding the challenges available via ERD puzzles. The Button Puzzle we present in Section 5 is just one example of the possible subdomains that can be integrated with ERD. Just as the Light Box Domain [51] has been augmented with a 3D environment to create the Button Puzzle, we plan to retrofit other classic RL testbeds with the ERD infrastructure to create new flexible, scalable challenges for RL algorithms.


  • Abadi et al. [2015] Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
  • Association for the Advancement of Artificial Intelligence [2014] Association for the Advancement of Artificial Intelligence. Association for the Advancement of Artificial Intelligence, 2014.
  • Bacon et al. [2017] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pages 1726–1734, 2017.
  • Bellemare et al. [2013] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, jun 2013.
  • Brafman and Tennenholtz [2002] Ronen Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.
  • Bratman et al. [2012] Jeshua Bratman, Satinder Singh, Jonathan Sorg, and Richard Lewis. Strong mitigation: Nesting search for good policies within search for good reward. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 1, pages 407–414. International Foundation for Autonomous Agents and Multiagent Systems, 2012.
  • Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.
  • Carnegie Mellon University [2010] Carnegie Mellon University. Panda3d - free 3d game engine, 2010.
  • Chaganty et al. [2012] Arun Tejasvi Chaganty, Prateek Gaur, and Balaraman Ravindran. Learning in a small world. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 1, pages 391–397. International Foundation for Autonomous Agents and Multiagent Systems, 2012.
  • Dayan and Hinton [1993] Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271–278, 1993.
  • Dean and Kanazawa [1989] Thomas Dean and Keiji Kanazawa. A model for reasoning about persistence and causation. Computational intelligence, 5(2):142–150, 1989.
  • Derbinsky et al. [2012] Nate Derbinsky, Justin Li, and John E Laird. A multi-domain evaluation of scaling in a general episodic memory. In AAAI, 2012.
  • Dietterich [2000] Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. J. Artif. Intell. Res.(JAIR), 13:227–303, 2000.
  • Diuk et al. [2006] Carlos Diuk, Alexander L Strehl, and Michael L Littman. A hierarchical approach to efficient reinforcement learning in deterministic domains. In Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems, pages 313–319. ACM, 2006.
  • Gundersen and Kjensmo [2017] Odd Erik Gundersen and Sigbjørn Kjensmo. State of the art: Reproducibility in artificial intelligence. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence and the Twenty-Eighth Innovative Applications of Artificial Intelligence Conference, 2017.
  • Hallak et al. [2015] Assaf Hallak, Francois Schnitzler, Timothy Mann, and Shie Mannor. Off-policy model-based learning under unknown factored dynamics. In International Conference on Machine Learning, pages 711–719, 2015.
  • Harb et al. [2017] Jean Harb, Pierre-Luc Bacon, Martin Klissarov, and Doina Precup. When waiting is not an option: Learning options with a deliberation cost. arXiv preprint arXiv:1709.04571, 2017.
  • Hessel et al. [2017] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298, 2017.
  • Hogg et al. [2010] Chad Hogg, Ugur Kuter, and Hector Munoz-Avila. Learning methods to generate good plans: Integrating htn learning and reinforcement learning. In AAAI, 2010.
  • International Foundation for Autonomous Agents and Multiagent Systems [2019] International Foundation for Autonomous Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2019.
  • Jain and Precup [2018] Ayush Jain and Doina Precup. Eligibility traces for options. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 1008–1016. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
  • Jong and Stone [2008] Nicholas K Jong and Peter Stone. Hierarchical model-based reinforcement learning: R-max+ maxq. In Proceedings of the 25th international conference on Machine learning, pages 432–439. ACM, 2008.
  • Kaelbling [1993] Leslie Pack Kaelbling. Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the tenth international conference on machine learning, volume 951, pages 167–173, 1993.
  • Li et al. [2016] Zhuoru Li, Akshay Narayan, and Tze-Yun Leong. A core task abstraction approach to hierarchical reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 1411–1412. International Foundation for Autonomous Agents and Multiagent Systems, 2016.
  • Li et al. [2017] Zhuoru Li, Akshay Narayan, and Tze-Yun Leong. An efficient approach to model-based hierarchical reinforcement learning. In AAAI, pages 3583–3589, 2017.
  • Lillicrap et al. [2015] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
  • MacAlpine et al. [2015] Patrick MacAlpine, Mike Depinet, and Peter Stone. Ut austin villa 2014: Robocup 3d simulation league champion via overlapping layered learning. In AAAI, pages 2842–2848, 2015.
  • Masson et al. [2016] Warwick Masson, Pravesh Ranchod, and George Konidaris. Reinforcement learning with parameterized actions. In AAAI, pages 1934–1940, 2016.
  • Menashe and Stone [2015] Jacob Menashe and Peter Stone. Monte Carlo Hierarchical Model Learning. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 771–779, 2015.
  • Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Ngo et al. [2014] Vien Anh Ngo, Hung Ngo, and Ertel Wolfgang. Monte carlo bayesian hierarchical reinforcement learning. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pages 1551–1552. International Foundation for Autonomous Agents and Multiagent Systems, 2014.
  • Omidshafiei et al. [2018] Shayegan Omidshafiei, Dong-Ki Kim, Jason Pazis, and Jonathan P How. Crossmodal attentive skill learner. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 139–146. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
  • Osa and Sugiyama [2017] Takayuki Osa and Masashi Sugiyama. Hierarchical policy search via return-weighted density estimation. arXiv preprint arXiv:1711.10173, 2017.
  • Osentoski and Mahadevan [2010] Sarah Osentoski and Sridhar Mahadevan. Basis function construction for hierarchical reinforcement learning. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: volume 1-Volume 1, pages 747–754. International Foundation for Autonomous Agents and Multiagent Systems, 2010.
  • Parr and Russell [1998] Ronald Parr and Stuart Russell. Reinforcement learning with hierarchies of machines. Advances in neural information processing systems, pages 1043–1049, 1998.
  • Plappert [2016] Matthias Plappert. keras-rl, 2016.
  • Precup [2000] Doina Precup. Temporal abstraction in reinforcement learning. University of Massachusetts Amherst, 2000.
  • Precup et al. [1997] Doina Precup, Richard S Sutton, and Satinder P Singh. Planning with closed-loop macro actions. In Working notes of the 1997 AAAI Fall Symposium on Model-directed Autonomous Systems, pages 70–76, 1997.
  • Roderick et al. [2018] Melrose Roderick, Christopher Grimm, and Stefanie Tellex. Deep abstract q-networks. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 131–138. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
  • Ruan et al. [2015] Sherry Shanshan Ruan, Gheorghe Comanici, Prakash Panangaden, and Doina Precup. Representation discovery for mdps using bisimulation metrics. In AAAI, pages 3578–3584, 2015.
  • Steel Crate Games [2015] Steel Crate Games. Keep talking and nobody explodes. Digital Download, 2015.
  • Stodden et al. [2013] Victoria Stodden, Peixuan Guo, and Zhaokun Ma. Toward reproducible computational research: an empirical analysis of data and code policy adoption by journals. PloS one, 8(6):e67111, 2013.
  • Strehl et al. [2007] Alexander L Strehl, Carlos Diuk, and Michael L Littman. Efficient structure learning in factored-state MDPs. In AAAI, volume 7, pages 645–650, 2007.
  • Sullivan and Luke [2012] Keith Sullivan and Sean Luke. Learning from demonstration with swarm hierarchies. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 1, pages 197–204. International Foundation for Autonomous Agents and Multiagent Systems, 2012.
  • Taylor [2011] Matthew E Taylor. Teaching reinforcement learning with mario: An argument and case study. In Proceedings of the Second Symposium on Educational Advances in Artificial Intelligence, pages 1737–1742, 2011.
  • Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.
  • Valve Corporation [2007] Valve Corporation. Portal. Digital Download, 2007.
  • Vezhnevets et al. [2017] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
  • Vien and Toussaint [2015] Ngo Anh Vien and Marc Toussaint. Hierarchical monte-carlo planning. In AAAI, pages 3613–3619, 2015.
  • Vigorito and Barto [2010] Christopher M. Vigorito and Andrew G. Barto. Intrinsically motivated hierarchical skill learning in structured environments. IEEE Transactions on Autonomous Mental Development, 2(2):132–143, jun 2010. ISSN 1943-0604. doi: 10.1109/tamd.2010.2050205.
  • Watkins [1989] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, 1989.
  • Wiering and Schmidhuber [1997] Marco Wiering and Jürgen Schmidhuber. Hq-learning. Adaptive Behavior, 6(2):219–246, 1997.
  • Xu and Laird [2010] Joseph Z Xu and John E Laird. Instance-based online learning of deterministic relational action models. In AAAI, 2010.
  • Zhao et al. [2016] Dongbin Zhao, Haitao Wang, Kun Shao, and Yuanheng Zhu. Deep reinforcement learning with experience replay based on sarsa. In Computational Intelligence (SSCI), 2016 IEEE Symposium Series on, pages 1–6. IEEE, 2016.