Convert a PDDL domain into an OpenAI Gym environment.
We present PDDLGym, a framework that automatically constructs OpenAI Gym environments from PDDL domains and problems. Observations and actions in PDDLGym are relational, making the framework particularly well-suited for research in relational reinforcement learning and relational sequential decision-making. PDDLGym is also useful as a generic framework for rapidly building numerous, diverse benchmarks from a concise and familiar specification language. We discuss design decisions and implementation details, and also illustrate empirical variations between the 15 built-in environments in terms of planning and model-learning difficulty. We hope that PDDLGym will facilitate bridge-building between the reinforcement learning community (from which Gym emerged) and the AI planning community (which produced PDDL). We look forward to gathering feedback from all those interested and expanding the set of available environments and features accordingly. Code: https://github.com/tomsilver/pddlgym
The creation of benchmarks has often accelerated research progress in various subdomains of artificial intelligence [imagenet, glue, moleculenet]. In sequential decision-making tasks, tremendous progress has been catalyzed by benchmarks such as the environments in OpenAI Gym [openaigym] and the planning tasks in the International Planning Competition (IPC) [ipc]. Gym defines a standardized way to interact with an environment, allowing easy comparison of various reinforcement learning algorithms. IPC provides a set of planning domains and problems written in Planning Domain Definition Language (PDDL) [pddl], allowing easy comparison of various symbolic planners.
In this work, we present PDDLGym, an open-source framework (code available at https://github.com/tomsilver/pddlgym; pull requests are welcome) that combines elements of Gym and PDDL. Concretely, PDDLGym is a Python framework that automatically creates Gym environments from PDDL domain and problem files. As with Gym, PDDLGym allows for episodic, closed-loop interaction between the agent and the environment; the agent receives an observation from the environment and gives back an action, repeating this loop until the end of an episode. As in PDDL, PDDLGym is fundamentally relational: observations are sets of ground relations over objects (e.g. on(plate, table)) and actions are templates ground with objects (e.g. pick(plate)). PDDLGym is therefore particularly well-suited for relational learning and sequential decision-making research. See Figure 1 for renderings of some environments currently implemented in PDDLGym, and Figure 2 for code examples.
The Gym API defines a hard boundary between the agent and the environment. In particular, the agent only interacts with the environment by taking actions and receiving observations. The environment implements a function step that advances the state given an action by the agent; step defines the transition model of the environment. Likewise, a PDDL domain encodes a transition model via its operators. However, in typical usage, PDDL is understood to exist entirely in the “mind” of the agent. A separate process is then responsible for transforming plans into executable actions that the agent can take in the world.
PDDLGym defies this convention: in PDDLGym, PDDL domains and problems lie firmly on the environment side of the agent-environment boundary. The environment uses the PDDL files to implement the step function that advances the state given an action. PDDLGym is thus perhaps best understood as a repurposing of PDDL. Implementation-wise, this repurposing has subtle but important implications, discussed in (§2.2).
PDDLGym serves three main purposes:
(1) Facilitate the creation of numerous, diverse benchmarks for sequential decision-making in relational domains. PDDLGym allows tasks to be defined in PDDL, automatically building a Gym environment from PDDL files. PDDL offers a compact symbolic language for describing domains, which might otherwise be cumbersome and repetitive to define directly via the Gym API.
(2) Bridge reinforcement learning and planning research. PDDLGym makes it easy for planning researchers and learning researchers to test their methods on the exact same benchmarks and develop techniques that draw on the strengths of both families of approaches. Furthermore, since PDDLGym includes built-in domains and problems, it is straightforward to perform apples-to-apples comparisons without having to collect third-party code from disparate sources (see also [muise-icaps16demo-pd]).
(3) Catalyze research on sequential decision-making in relational domains. In our own research, we have found PDDLGym to be very useful while studying exploration for lifted operator learning [glib] and hierarchical goal-conditioned policy learning [silver2020genplan]. Other open research problems that may benefit from using PDDLGym include relational reinforcement learning [lang2012exploration, relational1, relational2], learning symbolic descriptions of operators [lang2012exploration, amir2008learning, pasula2007learning], discovering relational transition rules for efficient planning [xia2019learning, lang2010planning], and learning lifted options [konidaris2014constructing, options1, options2, options3].
The rest of this paper is organized as follows. (§2) discusses the design decisions and implementation details underlying PDDLGym. In (§3), we give an overview of the built-in PDDLGym domains and provide basic empirical results to illustrate their diversity in terms of the difficulty of planning and learning. Finally, in (§4), we discuss avenues for extending and improving PDDLGym.
The Gym API defines environments as Python classes with three essential methods: __init__, which initializes the environment; reset, which starts a new episode and returns an observation; and step, which takes an action from the agent, advances the current state, and returns an observation, reward, a Boolean indicating whether the episode is complete, and optional debugging information. The API also includes other minor methods, e.g., to handle rendering and random seeding. Finally, Gym environments are required to implement an action_space, which represents the space of possible actions, and an observation_space, which represents the space of possible observations. We next give a brief overview of PDDL files, and then we describe how action and observation spaces are defined in PDDLGym. Subsequently, we move to a discussion of our implementation of the three essential methods.
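As a minimal sketch of the interface just described, the toy environment below implements the three essential methods without depending on the Gym package. The class name `CounterEnv` and its counting task are hypothetical and serve only as an illustration; a real Gym environment would additionally define `action_space` and `observation_space` using Gym's space classes.

```python
class CounterEnv:
    """Toy environment following the Gym method conventions (hypothetical example)."""

    def __init__(self, target=3):
        # __init__ initializes the environment.
        self.target = target
        self.state = 0

    def reset(self):
        # reset starts a new episode and returns an observation.
        self.state = 0
        return self.state

    def step(self, action):
        # step advances the state given an action and returns
        # (observation, reward, done, debugging information).
        self.state += action
        done = self.state >= self.target
        reward = 1.0 if done else 0.0
        return self.state, reward, done, {}


# Closed-loop interaction: observe, act, repeat until the episode ends.
env = CounterEnv(target=3)
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    obs, reward, done, info = env.step(1)  # a fixed policy: always add 1
    total_reward += reward
```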
There are two types of PDDL files: domain files and problem files. A single benchmark is characterized by one domain file and multiple problem files.
A PDDL domain file includes predicates — named relations with placeholder variables such as (on ?x ?y) — and operators. An operator is composed of a name, a list of parameters, a first-order logic formula over the parameters describing the operator’s preconditions, and a first-order logic formula over the parameters describing the operator’s effects. The forms of the precondition and effect formulas are typically restricted depending on the version of PDDL. Early versions of PDDL only permit conjunctions of ground predicates [strips]; later versions also allow disjunctions and quantifiers [adl]. See Figure 2A for an example of a PDDL operator.
A PDDL problem file includes a set of objects (named entities), an initial state, and a goal. The initial state is a set of predicates ground with the objects. Any ground predicates not in the state are assumed to be false, following the closed-world assumption. The goal is a first-order logical formula over the objects (the form of the goal is limited by the PDDL version, like for operators’ preconditions and effects). Note that PDDL (and PDDLGym) also allows objects and variables to be typed. See Figure 2B for a partial example of a PDDL problem file.
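The closed-world assumption can be illustrated with a small sketch. The tuple encoding of ground predicates and the object names below are assumptions made for this example, not PDDLGym's internal representation.

```python
# A ground predicate is encoded as a tuple: (name, arg1, arg2, ...).
# Objects are typed, as PDDL permits (types here are illustrative).
objects = {"plate": "item", "cup": "item", "table": "surface"}

# The initial state lists only the ground predicates that are true.
initial_state = frozenset({
    ("on", "plate", "table"),
    ("clear", "plate"),
})


def holds(state, literal):
    """Closed-world assumption: anything not listed in the state is false."""
    return literal in state


plate_on_table = holds(initial_state, ("on", "plate", "table"))
cup_on_table = holds(initial_state, ("on", "cup", "table"))  # never stated, so false
```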
In PDDLGym, observations are sets of predicates grounded over the objects in the problem file. The observation space, then, represents the powerset of all possible ground predicates. This powerset is typically enormous; fortunately, it does not need to be explicitly computed. The observation_space can be viewed as a discrete space whose size is equal to the size of this powerset; since this space will be large, we expect that most algorithms for solving PDDLGym tasks will not be sensitive to its size.
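To make the size of this space concrete, the following sketch counts ground predicates for a tiny untyped problem and derives the size of the powerset. The predicate names and the helper `ground_predicates` are hypothetical; typing would reduce the count by ruling out ill-typed groundings.

```python
from itertools import product


def ground_predicates(predicate_arities, objects):
    """Yield every grounding of every predicate over the given objects (untyped)."""
    for name, arity in predicate_arities.items():
        for args in product(objects, repeat=arity):
            yield (name,) + args


predicate_arities = {"on": 2, "clear": 1}
objects = ["a", "b", "c"]

# 3^2 groundings of `on` plus 3 groundings of `clear`.
num_ground = sum(1 for _ in ground_predicates(predicate_arities, objects))

# Each observation is a subset of the ground predicates, so the space has
# 2^num_ground elements. The count is computed without enumerating the subsets.
num_observations = 2 ** num_ground
```

Even with three objects and two predicates there are 12 ground predicates and 4096 possible observations, which is why the powerset is never materialized.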
The action space for a PDDLGym environment is one of the more subtle aspects of the overall framework. In classical AI planning, actions are typically equated with ground operators — operators whose parameters are bound to objects. However, in most PDDL domains, only some operator parameters are free (in terms of controlling the agent); the remaining parameters are included in the operator because they are part of the precondition/effect expressions, but can be derived from the current state or the choice of free parameters. PDDL typically makes no distinction between free and non-free parameters. For example, consider the operator for Sokoban shown in Figure 3A. This operator represents the rules for a player (?p) moving in some direction (?dir) from one cell (?from) to another cell (?to). In a real game of Sokoban, the only choice that an agent makes is what direction to move — only the ?dir parameter is free. ?from is defined by the agent’s location in the current state; ?to can be derived from ?from and the agent’s choice of ?dir.
To properly define the action space for a PDDLGym environment, we must explicitly distinguish free parameters from non-free ones. One option is to require that operator parameters are all free. Non-free parameters could then be folded into the preconditions and effects using quantifiers [adl]; see Figure 3B for an example. However, this is cumbersome and leads to clunky, deeply nested operators. Instead, we opt to introduce new predicates that represent operators, and whose variables are these operators’ free parameters. We then include these predicates in the preconditions of the respective operators; see Figure 3C for an example. Doing so requires only minimal changes to existing PDDL files and does not affect readability, but requires adding in domain knowledge about the agent-environment boundary. Note that this domain knowledge is equivalent to defining an action space, which is very commonly done in sequential decision-making and is not a strong assumption. The action space of a PDDLGym environment is a discrete space over all possible groundings of the newly introduced predicates.
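The resulting action space can be sketched as an enumeration of typed groundings of the action predicates. The predicate name `move`, the type names, and the helper `ground_action_space` are assumptions for this illustration, loosely following the Sokoban example in which only the direction is a free parameter.

```python
from itertools import product


def ground_action_space(action_predicates, objects_by_type):
    """Enumerate all groundings of the action predicates, respecting types."""
    actions = []
    for name, param_types in action_predicates.items():
        domains = [objects_by_type[t] for t in param_types]
        for args in product(*domains):
            actions.append((name,) + args)
    return actions


# Hypothetical action predicate: move(?dir - direction).
action_predicates = {"move": ["direction"]}
objects_by_type = {
    "direction": ["up", "down", "left", "right"],
    "location": ["pos-1-1", "pos-1-2", "pos-2-1", "pos-2-2"],
    "player": ["player-1"],
}

actions = ground_action_space(action_predicates, objects_by_type)
```

Under these assumptions the agent has four actions. Grounding the full operator over its player, direction, and both location parameters would instead give 1 x 4 x 4 x 4 = 64 ground operators, most of which are never applicable.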
A PDDLGym environment is parameterized by a PDDL domain file and a list of PDDL problem files. For research convenience, each PDDLGym environment is associated with a test version of that environment, where the domain file is identical but the problem files are different (for instance, they could encode more complicated planning tasks, to measure generalizability). During environment initialization, all of the PDDL files are parsed into Python objects; we use a custom PDDL parser for this purpose. When reset is called, a single problem instance is randomly selected. (Problem selection when resetting an episode is the only place where randomness is used in PDDLGym.) The initial state of that problem instance is the state of the environment. For convenience, reset also returns (in the debugging information) paths to the PDDL domain and problem file of the current episode. This makes it easy for a user to run a symbolic planner and execute the resulting plans in the environment; see the README in the PDDLGym GitHub repository for an example that uses FastForward [ff].
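The reset behavior can be sketched as follows. This is an illustrative stand-in rather than PDDLGym's implementation: the class `ProblemSelector`, the file names, and the debugging-information keys are hypothetical, and the problems are given as already-parsed initial states instead of being read from PDDL files.

```python
import random


class ProblemSelector:
    """Sketch of reset: pick one problem at random and expose its file paths."""

    def __init__(self, domain_file, problems, seed=None):
        self.domain_file = domain_file
        self.problems = problems              # maps problem path -> initial state
        self.problem_files = sorted(problems)
        self.rng = random.Random(seed)        # the only source of randomness
        self.state = None

    def reset(self):
        problem_file = self.rng.choice(self.problem_files)
        self.state = self.problems[problem_file]
        # Hypothetical keys; they let a caller hand the files to a planner.
        debug_info = {
            "domain_file": self.domain_file,
            "problem_file": problem_file,
        }
        return self.state, debug_info


env = ProblemSelector(
    "domain.pddl",
    {
        "problem0.pddl": frozenset({("at", "player-1", "pos-1")}),
        "problem1.pddl": frozenset({("at", "player-1", "pos-2")}),
    },
    seed=0,
)
obs, info = env.reset()
```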
The step method of a PDDLGym environment takes in an action, updates the environment state, and returns an observation, reward, done Boolean, and debugging information. To determine the state update, PDDLGym checks whether any PDDL operator’s preconditions are satisfied given the current state and action. Note that it is impossible to “accidentally match” to an undesired operator: each operator has a unique precondition as illustrated in Figure 3C, which is generated automatically based on the passed-in action. Since actions are distinct from operators (§2.2), this precondition satisfaction check is nontrivial; non-free parameters must be bound. To perform this check, we include in PDDLGym a Python implementation of typed SLD resolution. (We also experimented with Prolog for inference, but found our own typed SLD resolution algorithm to be much faster, since Prolog is not natively typed.) When no operator preconditions hold for a given action, the state remains unchanged by default. In some applications, it may be preferable to raise an error if no preconditions hold; the optional initialization parameter raise_error_on_invalid_action permits this behavior.
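The sketch below illustrates how non-free parameters get bound during the precondition check. It uses a simple untyped backtracking matcher over positive preconditions, which is a simplification and not the typed SLD resolution that PDDLGym implements. The operator encoding and the predicate names (`move`, `move-dir`, `at`, `clear`) are assumptions for illustration.

```python
def is_var(term):
    return term.startswith("?")


def match(preconds, facts, binding):
    """Yield each extension of `binding` under which all preconds appear in facts."""
    if not preconds:
        yield binding
        return
    first, rest = preconds[0], preconds[1:]
    for fact in facts:
        if len(fact) != len(first) or fact[0] != first[0]:
            continue
        new = dict(binding)
        consistent = True
        for term, value in zip(first[1:], fact[1:]):
            if is_var(term):
                # Bind the variable, or check agreement with an earlier binding.
                if new.setdefault(term, value) != value:
                    consistent = False
                    break
            elif term != value:
                consistent = False
                break
        if consistent:
            yield from match(rest, facts, new)


def substitute(literal, binding):
    return (literal[0],) + tuple(binding.get(t, t) for t in literal[1:])


def step(state, action, operators):
    """Apply the first operator whose preconditions hold; otherwise keep the state."""
    facts = state | {action}  # the action literal must match the action predicate
    for op in operators:
        for binding in match(op["pre"], facts, {}):
            add = {substitute(lit, binding) for lit in op["add"]}
            delete = {substitute(lit, binding) for lit in op["del"]}
            return frozenset((state - delete) | add)
    return state  # default behavior: an invalid action leaves the state unchanged


# Hypothetical Sokoban-like operator: only ?dir is free; ?p, ?from, ?to are derived.
MOVE = {
    "pre": [("move", "?dir"), ("at", "?p", "?from"),
            ("move-dir", "?from", "?to", "?dir"), ("clear", "?to")],
    "add": [("at", "?p", "?to"), ("clear", "?from")],
    "del": [("at", "?p", "?from"), ("clear", "?to")],
}
state = frozenset({
    ("at", "player-1", "pos-1"),
    ("clear", "pos-2"),
    ("move-dir", "pos-1", "pos-2", "right"),
    ("move-dir", "pos-2", "pos-1", "left"),
})
next_state = step(state, ("move", "right"), [MOVE])
unchanged = step(state, ("move", "left"), [MOVE])  # no cell to the left of pos-1
```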
Rewards in PDDLGym are sparse and binary. In particular, the reward is 1.0 when the problem goal is satisfied and 0.0 otherwise. Similarly, the done Boolean is True when the goal is reached and False otherwise. (In practice, a maximum episode length is often used.)
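A minimal sketch of the reward and termination rule, together with a maximum episode length, is given below. It assumes the goal is a conjunction of positive ground literals, and the helper names and the one-dimensional toy task are hypothetical.

```python
def reward_and_done(state, goal_literals):
    """Sparse binary reward: 1.0 and done=True exactly when the goal holds."""
    reached = all(literal in state for literal in goal_literals)
    return (1.0 if reached else 0.0), reached


def run_episode(initial_state, transition, policy, goal_literals, max_steps):
    """Roll out a policy, stopping at the goal or after max_steps steps."""
    state = initial_state
    total_reward = 0.0
    for t in range(max_steps):
        state = transition(state, policy(state))
        reward, done = reward_and_done(state, goal_literals)
        total_reward += reward
        if done:
            return total_reward, t + 1
    return total_reward, max_steps


def toy_transition(state, action):
    """Toy dynamics: the agent sits at pos-<i> and `right` increments i."""
    _, pos = next(iter(state))
    index = int(pos.split("-")[1]) + (1 if action == "right" else 0)
    return frozenset({("at", f"pos-{index}")})


start = frozenset({("at", "pos-0")})
goal = [("at", "pos-2")]
solved = run_episode(start, toy_transition, lambda s: "right", goal, max_steps=5)
timed_out = run_episode(start, toy_transition, lambda s: "stay", goal, max_steps=5)
```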
In terms of lines of code, the bulk of PDDLGym is dedicated to PDDL file parsing and SLD resolution (used in step). We are continuing to develop both of these features so that a wider range of PDDL domains is supported. Aspects of PDDL 1.2 that are not yet supported by PDDLGym include hierarchical typing, equality, quantifiers, and constant objects. Aspects of later PDDL versions, such as numerical fluents, are not supported. Our short-term objective is to provide full support for PDDL 1.2. We have found that a wide range of standard PDDL domains are already well-supported by PDDLGym; see (§3) for an overview. We welcome requests for features and extensions, via either issues created on the GitHub page or email.
In this section, we start with an overview of the domains built into PDDLGym, as of the date of preparation of this report (February 8, 2020). We then provide some experimental results that give insight into the variation between these domains, in terms of planning and model-learning difficulty. All experiments are performed on a single laptop with 32GB RAM and a 2.9GHz Intel Core i9 processor.
Table 1 (columns): Domain Name | Source | Rendering Included | Average FPS
There are currently 15 domains built into PDDLGym. Most of the domains are adapted from existing PDDL repositories; the remainder are ones we found to be useful benchmarks in our own research. We have implemented custom rendering for 8 of the domains (see Figure 1 for examples). Table 1 gives a list of all environments, their sources, and their average frames per second (FPS) calculated by executing a random policy for 1,000 episodes of horizon 10, with no rendering.
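One way to compute such a frames-per-second figure is sketched below. The timing method is an assumption (wall-clock time over all `reset` and `step` calls), since the text does not specify it, and `ToyEnv` is a hypothetical stand-in for a PDDLGym environment.

```python
import random
import time


class ToyEnv:
    """Stand-in environment with the Gym reset/step signature (hypothetical)."""

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 0.0, False, {}


def average_fps(env, sample_action, num_episodes=1000, horizon=10):
    """Run a random policy without rendering and return (steps per second, steps)."""
    num_steps = 0
    start = time.perf_counter()
    for _ in range(num_episodes):
        env.reset()
        for _ in range(horizon):
            _, _, done, _ = env.step(sample_action())
            num_steps += 1
            if done:
                break
    elapsed = time.perf_counter() - start
    return num_steps / elapsed, num_steps


rng = random.Random(0)
fps, num_steps = average_fps(ToyEnv(), lambda: rng.choice([0, 1]))
```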
We now provide some results illustrating the variation between the domains built into PDDLGym. We examine two axes of variation: planning difficulty and difficulty of learning the transition model.
Figure 4 (left) illustrates the average time taken by FastForward [ff] to find a plan in each of the environments, averaged across all problem instances. The results reveal a considerable range in planning time, with the most difficult domain (Depot, omitted from the plot for visual clarity) requiring two orders of magnitude more time than the simplest one (Baking). The results also indicate that the included domains are relatively “easy” from a modern planning perspective. However, even in these simple domains, there are many interesting challenges to be tackled, such as learning the true PDDL operators from interaction data, or defining good state abstractions amenable to learning.
Figure 4 (right) provides insight into the difficulty of learning transition models in some of the environments. For each environment, an agent executes a random policy for episodes of horizon 25. The observed transitions are used to learn transition models, which are then used for planning on a suite of test problems. The fraction of test problems solved is reported as an indicator of the quality of the learned transition model. To learn the transition models, we use first-order logic decision tree (FOLDT) learning [foldt] (details will be described in forthcoming work). Five domains are visualized for clarity; among the remaining ones, several are comparable to the ones shown, but others, including Baking, Depot, and Sokoban, are difficult for our learning method: FOLDT learning is unable to find a model that fits the data in a reasonable amount of time. Of course, model-learning difficulty varies considerably with the learning method and the exploration strategy. We have implemented simple strategies here to show these results, but these avenues for future research are exactly the kind that we hope to enable with PDDLGym.
We have presented PDDLGym, an open-source Python framework that automatically creates OpenAI Gym environments from PDDL domain and problem files. Our empirical results demonstrate considerable diversity among the built-in environments. We have already begun to use PDDLGym in our own research into relational sequential decision-making and reinforcement learning. We also hope to interface PDDLGym with other related open-source frameworks, particularly the PDDL collection and tools in planning.domains [muise-icaps16demo-pd], so that a user can use PDDLGym simply by specifying a URL pointing to a repository of PDDL files (along with some domain-specific information about free parameters). Another important avenue for future work is to incorporate stochasticity into the environments. Probabilistic PDDL [ppddl] would be a natural starting point for such efforts.
We look forward to gathering feedback from the community and expanding the set of available environments and features accordingly.