GLIB: Exploration via Goal-Literal Babbling for Lifted Operator Learning

01/22/2020 ∙ by Rohan Chitnis, et al. ∙ MIT

We address the problem of efficient exploration for learning lifted operators in sequential decision-making problems without extrinsic goals or rewards. Inspired by human curiosity, we propose goal-literal babbling (GLIB), a simple and general method for exploration in such problems. GLIB samples goals that are conjunctions of literals, which can be understood as specific, targeted effects that the agent would like to achieve in the world, and plans to achieve these goals using the operators being learned. We conduct a case study to elucidate two key benefits of GLIB: robustness to overly general preconditions and efficient exploration in domains with effects at long horizons. We also provide theoretical guarantees and further empirical results, finding GLIB to be effective on a range of benchmark planning tasks.




1 Introduction

Human curiosity often manifests in the form of a question: “I wonder if I can do X?” A toddler wonders whether she can climb on the kitchen counter to reach a cookie jar. Her dad wonders whether he can make dinner when he’s missing one of the key ingredients. These questions lead to actions, actions may lead to surprising effects, and from this surprise, we learn. In this work, inspired by this style of playful experimentation [10, 6], we study exploration via goal-setting for the problem of learning lifted operators to enable robust, generalizable planning.

Lifted operators, like those used in strips [7], are composed of relational preconditions and effects that capture the applicability and consequences of each action. Such operators offer a compact description of a transition model and are well-suited for long-horizon planning.

Learning lifted operators is particularly important in sequential decision-making settings, where an agent must learn these operators through interaction with a domain. If successful, the agent can use the learned operators to generalize far beyond what it has seen during training, solving problems that feature new objects and more challenging goals.

In this paper, we address the problem of efficient exploration for lifted operator learning in domains where the agent is given no goal or reward function at training time. Previous approaches have focused on operator refinement: inspecting faulty operators and gathering data to adjust them [10, 21, 24, 19]. These approaches lack the intrinsic drive needed to explore unfamiliar regions of the state space. An exploration method that does use intrinsic drive to learn operators is rex [14]. We will discuss two common regimes in which rex struggles: when the learned operators have overly general preconditions, and when operator effects can only be discovered in regions of space that are far from the initial state.

Figure 1: We use goal-setting as a paradigm for exploration in lifted operator learning. Left: A robot in our Blocks domain sets itself a goal of holding the blue object. Middle: The robot (mistakenly) believes the goal can be achieved by executing pick on the blue object. When it tries this plan, it fails due to the green object in the way. From this and previous data, the robot can induce lifted operators (preconditions and effects). Right: The robot can plan with its learned operators to achieve goals in more complex environments.

Inspired by playful experimentation in humans, we propose a novel family of exploration methods for lifted operator learning called Goal-Literal Babbling (glib). Goals in glib are conjunctions of literals; informally, these can be understood as specific, targeted effects that the agent would like to achieve in the world. A particular instantiation of glib is characterized by an integer k, which bounds the number of literals involved in each goal, and a choice between lifted (glib-l) or ground (glib-g) goals. Goals are proposed according to their novelty [16] (conditioned on actions). To try to achieve these goals, we plan using the current (often flawed) operators.

In planning, we often have the intuition that searching with incorrect operators should be avoided due to the potential for compounding errors. This might be especially concerning in a learning-to-plan setting, where these errors could lead the agent to build representations in problematic and irreparable ways. However, we show both in theory and in practice that this intuition does not apply: we provide theoretical guarantees that glib cannot get stuck in a subregion of the reachable state space (§4.4), and we show empirically that glib yields strong performance across a range of benchmark tasks (§5).

This work has four main contributions. (1) We propose glib, a novel exploration method for lifted operator learning. (2) We prove that exploration under glib is ergodic and converges to the ground truth set of operators given mild assumptions on the planner, learner, and domain. (3) We present a case study that illustrates the limitations of rex (the closest prior approach) and demonstrates how glib overcomes these limitations. (4) We evaluate glib across four benchmark tasks and compare against several prior methods, finding glib to consistently outperform these baselines. We argue that glib is a strong approach that can be easily integrated into any architecture for learning lifted operators.

2 Problem Setting

We begin with background notation, then define the lifted operator learning problem and discuss its challenges.

2.1 STRIPS-Style Domains

We study exploration in sequential, deterministic, relational domains with no goals at training time. Formally, consider a strips-style planning domain [7] ⟨P, O⟩, where P is a set of predicates (Boolean-valued functions) sufficient to describe the state, and O is a set of operators. A predicate applied to objects (resp. variables) is a ground (resp. unground) literal. A state s is a set of literals constructed from the predicates in P; absent literals are considered false (closed-world assumption). An operator o ∈ O is defined by a unique name, a set of parameters, preconditions (a logical expression over the parameters specifying when the operator can be executed), and effects (a set of literals over the parameters specifying the consequences of executing the operator; effects may also be conditional, as in adl [17]). The parameters of o that are not uniquely determined given a state are called free parameters. An operator name along with an assignment of its free parameters to objects is an action; an operator name along with placeholders for its free parameters is an action template.

An action a is said to be applicable in state s if s satisfies the preconditions of a; applying a produces a new state s′ where the positive effects of a have been added and the negative effects of a have been removed. If a is not applicable in s, then applying a leaves s unchanged. We denote the transition model induced by a set of operators O as T_O.
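The ground transition semantics above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the representation of literals as tuples is our own assumption:

```python
# A ground STRIPS-style transition over states represented as frozensets of
# ground literals (closed-world: absent literals are false).
from typing import FrozenSet, Tuple

Literal = Tuple[str, ...]   # e.g., ("on", "a", "b")
State = FrozenSet[Literal]

def apply_action(state: State,
                 preconds: FrozenSet[Literal],
                 add_effects: FrozenSet[Literal],
                 del_effects: FrozenSet[Literal]) -> State:
    """Return the successor state; an inapplicable action leaves the state unchanged."""
    if not preconds <= state:          # all preconditions must hold in the state
        return state
    return (state - del_effects) | add_effects
```

For example, a pick action whose preconditions hold deletes on(a, b) and clear(a) while adding holding(a); the same action applied in a state violating its preconditions is a no-op.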

2.2 The Lifted Operator Learning Problem

At the beginning of training time, the agent is given a set of predicates P, action templates A, and an initial state s0. For simplicity, we assume that all objects appear within at least one literal in s0. This implies that the (ground) state space and (ground) action space are known to the agent. The agent interacts with an environment that has transition model T, as summarized in Algorithm 1. The agent's task is to learn T through its sequential interactions with the environment. We stress that the agent cannot evaluate T at any arbitrary state, only the current state. At each timestep, the agent selects an action according to an exploration method Explore, passes the action to the environment, and observes the resulting new state. The new transition is added to the dataset D and fed to the agent's operator learning algorithm OpLearn, which produces learned operators Ô. Our focus in this paper is on strategies for Explore; see (§5) for details on our OpLearn implementation.

Algorithm Lifted Operator Learning Pipeline
       Input: Predicates P and action templates A.
       Input: Transition function T.
       Input: Initial state s0.
       Initialize: D ← ∅, a dataset of transitions.
       Initialize: Ô ← ∅, a set of learned operators.
       s ← s0
       while not done do
             a ← Explore(s, Ô, D)
             s′ ← T(s, a)
             D ← D ∪ {(s, a, s′)}
             Ô ← OpLearn(D)
             s ← s′
       return final learned operators Ô
Algorithm 1 An overview of the pipeline for lifted operator learning. Our focus in this work is on strategies for Explore.
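The pipeline in Algorithm 1 amounts to a simple interaction loop. The sketch below assumes hypothetical `explore`, `op_learn`, and `env_step` callables (the names and signatures are ours, not the paper's):

```python
# A minimal sketch of the explore/learn loop from Algorithm 1.
# `explore` maps (state, learned_ops, dataset) to an action; `op_learn` maps a
# dataset of transitions to learned operators; `env_step` is the environment's
# transition function, queryable only at the current state.

def run_pipeline(s0, env_step, explore, op_learn, n_steps):
    state = s0
    dataset = []                  # transitions (s, a, s')
    ops = op_learn(dataset)       # initial operators learned from no data
    for _ in range(n_steps):
        action = explore(state, ops, dataset)
        next_state = env_step(state, action)
        dataset.append((state, action, next_state))
        ops = op_learn(dataset)   # relearn from all data so far
        state = next_state
    return ops
```

Any exploration strategy, including glib, slots in as the `explore` argument without changes to the rest of the loop.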

The agent’s objective is to learn operators Ô that are as close to the true O as possible. To evaluate its performance, at test time we present it with a set of goals G, where each goal g ∈ G is a classifier over states. For each g ∈ G, the agent uses a symbolic planner to try to achieve g under the learned Ô.

The Importance of Exploration. The success of operator learning depends critically on the dataset used for training. The Explore method in Algorithm 1 is responsible for gathering the dataset; it must guide the agent through maximally informative parts of the transition space. No matter how powerful OpLearn is, the learned operators will be useless if, for example, Explore only outputs actions that result in no state change. Our objective in this paper is to design an Explore method that efficiently gathers data and leads to good test-time performance in the low sample complexity regime.

3 Related Work

Operator Learning and Refinement. Learning operators that are amenable to planning has been the subject of a long line of work; see Minton (1999) and Arora et al. [1] for surveys. One popular approach is to learn lifted operators that are plug-compatible with strips planners [25], as we do in this work. Inductive Logic Programming (ILP) techniques have also been applied to this end [3, 19]. A related line of work has considered operator refinement: gathering data to improve a current set of good but imperfect operators [10, 21, 24]. These methods are useful when an error in the operators is discovered; they suggest actions that would be useful for gathering data to correct the operators. We use expo [10] as a baseline in our experiments.

irale [19] learns operators using the intuition that an action should be explored if its preconditions almost hold (measured using a variant of least general generalization [18]), because the agent is likely to learn something new from observing the resulting effects. Unlike methods such as expo, irale does not perform lookahead. We also use irale as a baseline in our experiments.

Exploration in Model-Based RL.

Exploration is one of the fundamental challenges of reinforcement learning. Exploration strategies for model-based RL are particularly relevant to our setting, though typically, the agent is given rewards, and the transition model is stochastic. E³ [12] and R-max [5] are two such classic strategies, but neither is lifted. Walsh (2010) proves the existence of a kwik ("knows what it knows") algorithm for efficient exploration in lifted operator learning. As pointed out by Lang et al. (2012), Walsh's algorithm provides theoretical insight but has never been realized in practice. Lang et al. (2012) propose rex, which extends E³ to the relational regime and as such is the closest prior approach to our own. We use rex as a baseline in our experiments.

Goal Babbling in Robotics. Our use of the term “babbling” is an homage to prior work in robotics on goal babbling, originally studied in the context of learning kinematic models [20, 2]. More recent work considers goal babbling for automatic curriculum generation in model-free RL [8, 9, 15]. Forestier and Oudeyer (2016) use goal babbling in a continuous model-based setting where trajectory optimization suffices for planning. Though quite different from our symbolic setting, we expect that insights from this line of research can be adapted to further extend the methods of our work.

4 Exploration via Goal-Literal Babbling

In this section, we begin by discussing limitations of existing methods (§3). We then introduce glib (goal-literal babbling) and provide theoretical guarantees on its performance.

4.1 Limitations of Existing Methods

Our objective in this work is to identify a method for efficient exploration in lifted operator learning. A naive approach would be to ignore the relational structure of the problem setting and apply a generic exploration method such as E³ or R-max. However, as discussed in detail by Lang et al. (2012), the relational structure can be heavily exploited not only during planning and learning, but also during exploration.

Existing methods that were designed for lifted operator refinement, such as expo [10], can be adapted to our setting in combination with a fallback strategy that selects actions when no existing operators require refinement. However, given their original purpose, these methods lack intrinsic drive; they were designed to guide the behavior of an agent trying to achieve a given goal. Consequently, if the fallback strategy is naive, there is nothing driving the agent toward unexplored regions of the state space. irale [19] is similarly prone to local, myopic exploration, as it does not use its learned operators for lookahead and therefore relies on a fallback strategy when no action from the current state is considered to be novel.

To our knowledge, rex is the only previous method that includes a form of intrinsic drive and performs lookahead during exploration. rex calculates novelty with respect to lifted states and actions using the learned operators, favoring states where “rare” operator preconditions hold. rex struggles in two common regimes. (1) rex attempts to find the shortest plan from the current state to any novel state. Thus, its exploration is greedy, preferring to visit novel states that are closest to the current state first. This makes it difficult to learn effects that require exploring far away from the current state. (2) When the preconditions of the learned operators are overly general, the novelty score wrongly reports that certain states are familiar and thus not worth exploring, when in fact these states must be visited to fix the preconditions.

4.2 Goal-Literal Babbling (glib)

Algorithm Explore: Goal-Literal Babbling
       Input: Bound on literal count k.
       Input: The mode [ground or lifted].
       Input: Number of sampling tries N.
       Input: s, Ô, D.   // See Algorithm 1.
       Input: plan_in_progress.   // Internal state.
       if plan_in_progress exists then
             a ← PopFirst(plan_in_progress)
             return a, plan_in_progress
       // Queue of novel goal-action pairs.
       queue ← EnumNovelGA(D, k, mode)
       for N iterations do
             (g, a) ← PopFirst(queue)
             plan ← SymbolicPlan(s, Ô, g)
             if plan found then
                   if mode is lifted then
                         a ← GroundAct(g, a, plan)
                   Append a to plan.
                   a ← PopFirst(plan)
                   return a, plan
       // Fallback: random ground action.
       return Sample(A), null
Algorithm 2 Pseudocode for the goal-literal babbling (glib) family of algorithms. In practice, the queue returned by EnumNovelGA can be cached between calls for efficiency.

We now present glib, an exploration method that overcomes the limitations discussed above. See Algorithm 2 for pseudocode. The key idea is to set goals that are conjunctions of a small number of literals, intuitively representing a targeted set of effects that the agent would like to achieve in the world. glib has two main parameters: k, a bound on the conjunction’s size; and a mode, representing whether the chosen goals are ground (such as OnTable(cup3)) or lifted (such as ∃cup. OnTable(cup)). glib can be used with any implementation of OpLearn (Algorithm 1).

Another key aspect of glib is that goal literals are proposed not in isolation, but together with actions that the agent should execute if and when that goal is achieved. The reason for considering actions in addition to goals is that to learn operators that predict the effects of actions, we must explore the space of transitions rather than states. A proposed goal-action pair can be interpreted as a transition that the agent would like to observe as it improves its transition model. Like the goals, actions can be ground or lifted, optionally sharing variables with the goal in the lifted case.

How should the agent select goals and actions to set itself? The most naive method would be to sample goals uniformly from all conjunctions of at most k literals; however, this may lead the agent to pursue the same goals repeatedly, not learning anything new. Instead, glib uses a novelty metric, only selecting goal-action pairs that have never held in any previous transition. In Algorithm 2, the method EnumNovelGA enumerates all novel pairs whose goal size is at most k.
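The novelty filter can be sketched as follows. This is a simplified, ground-only illustration; the function name and the tuple representation of literals are our assumptions, not the paper's code:

```python
# Enumerate ground goal-action pairs that are novel with respect to the
# dataset: a pair (goal, action) is novel iff the goal literals have never
# all held in a state from which that action was then taken.
from itertools import combinations

def novel_goal_action_pairs(literals, actions, dataset, k):
    """literals: set of ground literals; dataset: list of (state, action, next_state)."""
    pairs = []
    for size in range(1, k + 1):
        for goal in combinations(sorted(literals), size):
            for act in actions:
                seen = any(set(goal) <= s and a == act
                           for (s, a, _) in dataset)
                if not seen:
                    pairs.append((frozenset(goal), act))
    return pairs
```

In practice this enumeration would be cached and incrementally updated between calls, as noted in the caption of Algorithm 2.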

Once the agent has selected a goal-action pair (g, a), it uses a symbolic planner to find a plan for g under the current learned operators Ô. If a plan is found, a is appended to its end (in lifted mode, a will be lifted, so we first run GroundAct to ground it by randomly sampling values for the free parameters not bound in the goal); we then execute this plan step-by-step. If a plan is not found after N tries, we fall back to taking a random action, as discussed in (§4.3).

The choice of mode (ground or lifted) can have significant effects on the performance of glib, and the best choice is domain-dependent. On one hand, novelty in lifted mode has the tendency to over-generalize: if location5 is the only one containing an object, then lifted novelty cannot distinguish that object being at location5 versus anywhere else. On the other hand, novelty in ground mode may not generalize sufficiently, and so can be much slower to explore.
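The distinction between the two modes comes down to how a goal is checked against a state: a ground goal must match literally, while a lifted goal holds if some substitution of its variables makes every literal true. A sketch of the lifted check, under our own assumed representation (literals as tuples, variables prefixed with `?`):

```python
# A lifted goal holds in a state if some assignment of objects to its
# variables makes every goal literal a member of the state.
from itertools import permutations

def lifted_goal_holds(goal, state, objects):
    """goal: list of tuples like ("OnTable", "?x"); variables start with '?'."""
    variables = sorted({t for lit in goal for t in lit[1:] if t.startswith("?")})
    for binding in permutations(objects, len(variables)):
        sub = dict(zip(variables, binding))
        ground = {tuple([lit[0]] + [sub.get(t, t) for t in lit[1:]])
                  for lit in goal}
        if ground <= state:
            return True
    return False
```

This also makes the over-generalization concrete: the lifted goal OnTable(?x) is satisfied by any object on the table, so novelty computed this way cannot distinguish which object it was.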

We note that glib exploits the relational structure in the domain, and has a natural sense of far-sighted intrinsic drive, unlike both operator refinement methods and irale. In (§5.1), we will discuss how glib improves over the previously identified limitations of rex, through a case study.

4.3 Is Planning for Exploration Wise?

glib rests on the assumption that planning with faulty operators for exploration can lead to informative data, from which we can learn better operators. In general, there is good reason to hesitate before using faulty operators to plan, especially over long horizons: prediction errors will inevitably compound over time. However, for the particular case of planning for exploration, it is important to disambiguate two failure cases: (1) a plan is found with the learned operators, but it does not execute as expected; (2) no plan is found, but some plan exists. Interestingly, (1) is not problematic; in fact it is ideal, because following this plan gives useful data to improve the operators. The only truly problematic case is (2). Wang (1996) identifies a similar problem in the context of operator refinement and attempts to reduce its occurrence by approximating the “most general” operator. In our setting, where the same action template may lead to multiple sets of lifted effects [17], a “most general” operator is not well-defined. (If, for some action and state, one operator predicts one effect and another operator predicts a different effect, then planning with the first may produce a plan where the second does not, for instance when one of the two effects appears in the goal, and vice versa; therefore, neither operator can be described as universally “more general” than the other.) We therefore take a different approach: when no plan has been found after N goal attempts, we take a random action. This fallback strategy allows us to escape when no goals seem possible, either due to the domain structure or flaws in the current set of learned operators.

4.4 Theoretical Guarantees

We now present theoretical guarantees for the asymptotic behavior of glib. Let T_O be the transition model induced by operators O; P be a set of predicates; A be a set of action templates; and s0 be an initial state. A state s is reachable if there exists any sequence of actions that leads to s from s0 under T_O. Let X be the set of all state-action pairs (s, a) where s is reachable and a is an action template in A ground with objects from s0. Note that any exploration method Explore (see Algorithm 1) induces a Markov chain over the state-action pairs in X. Let M(Explore) denote this Markov chain. Let Random denote a purely random Explore policy.

Definition 1 (Ergodic task).

A task ⟨T_O, P, A, s0⟩ is ergodic if M(Random) is ergodic over X.

Theorem 1 states that if a task is ergodic, then exploration with glib is also ergodic; that is, it will never get “stuck” in a subset of the reachable space. Let glib-k refer to glib called with parameter k in either ground or lifted mode.

Theorem 1 (Ergodicity of glib).

Suppose that the task is ergodic, OpLearn is consistent, and SymbolicPlan is sound. Then for any integer k, M(glib-k) is ergodic over X.


Proof. Each step in the Markov chain corresponds to one call to glib. In each call, there are three possible outcomes: (1) a plan is in progress, so the next action in it is taken; (2) a new plan is made for a novel goal g and action a; (3) a random action is taken. We will show that (2) can only occur finitely many times. Note that when a plan for g is found, either (i) g is reached and then a is taken or (ii) g is not reached. In case (i), the number of novel goal-action pairs decreases; this can only happen finitely many times, since P, A, and s0 are finite. In case (ii), the operators must be incorrect, and the data generated by the execution of the plan will be used to update at least one such operator. Since OpLearn is consistent, this can only happen finitely many times. Thus, instances of outcome (2) are finite. Further, (1) can only occur after (2). Therefore, there exists a time after which only (3) occurs. Ergodicity of glib follows from the assumption that the task is ergodic (Definition 1). ∎

The consistency assumption on OpLearn holds for our operator learner and most others in the literature [1]. The soundness assumption on SymbolicPlan is similarly mild. The assumption of task ergodicity does not always hold in practice; some domains will have irreversible action effects. However, note that in the episodic regime, where a new initial state is sampled periodically, task ergodicity is guaranteed for any domain, since all initial states will get revisited infinitely often. While the conclusion of Theorem 1 is somewhat weak, it does not hold for all possible Explore implementations, including some natural variations on glib. For example, suppose that in place of EnumNovelGA, we were to enumerate all goal-action pairs irrespective of novelty. This method could get stuck in regions of the state space where it can achieve some goals ad infinitum.

Definition 2 (Sufficiently representative transitions).

Given a learning method OpLearn and operators O, a set of transitions D is sufficiently representative if the learned operators OpLearn(D) are logically equivalent to O.

Corollary 1.

Suppose the full transition set X is sufficiently representative for OpLearn and O. Then under the assumptions of Theorem 1, glib converges to a set of operators that are logically equivalent to the ground truth O.


Proof. By Theorem 1, the Markov chain induced by glib is ergodic over X; therefore, every transition in X will eventually appear in the dataset D (Algorithm 1). At that time, by Definition 2, the learned operators Ô will be logically equivalent to O. ∎

The challenge of proposing a practical exploration method with strong sample-complexity guarantees remains open. Walsh (2010) and Mehta et al. (2011) provide algorithms with guarantees that are intractable in practice; Rodrigues et al. (2011) and Lang et al. (2012) provide practical algorithms without guarantees. To compare glib against previous practical methods, we now turn to empirical investigations.

5 Experiments

In this section, we present empirical results for glib and several baselines. We begin with a case study and then proceed with an evaluation on four benchmark planning tasks.

5.1 Case Study: GLIB and REX

In (§4.1), we outlined two major limitations of rex: that it searches greedily for unfamiliar states and actions, and that its inspection of the learned operators can be harmful when the preconditions are overly general. Now, we consider a domain that is hand-crafted to demonstrate these two issues.

Figure 2: Results of glib (our method) and rex (closest prior approach) on 5-location (left) and 10-location (right) versions of the case study environment. All curves show averages over 10 seeds, with standard deviations shaded. glib continues to perform well as the domain size increases, while rex does not. Note that both action babbling and rex have a 0.0 success rate in the 10-location case.

The domain is a simple 1-dimensional grid where a robot starts on the left-most square and can move left or right in a single timestep. Each location i has a gadget g_i which the robot can choose to interact with; interacting with each gadget produces a different set of effects E_i. To build a complete model of the world, the agent should MoveRight, then Interact with gadget g_i, and repeat until done. At test time, we will give it the goal of producing the effects E_n, where location n is the right-most location. To successfully plan for this goal, the agent must have explored interacting with g_n during training, which requires executing MoveRight n−1 times, then executing Interact(g_n).
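The case-study environment is easy to model directly. Below is a toy sketch of its dynamics (our own simplification, not the paper's code): state is a (location, effects-seen) pair, and reaching the right-most gadget's effect from the start requires the long committed action sequence described above.

```python
# A toy model of the 1-D grid case study: MoveRight/MoveLeft shift the robot's
# location within [0, n_locs); Interact records the effect of the gadget at
# the current location.

def step(state, action, n_locs):
    loc, seen = state
    if action == "MoveRight" and loc < n_locs - 1:
        return (loc + 1, seen)
    if action == "MoveLeft" and loc > 0:
        return (loc - 1, seen)
    if action == "Interact":
        return (loc, seen | {loc})
    return state  # inapplicable actions are no-ops
```

Observing the right-most gadget's effect in a 5-location grid takes four MoveRight actions followed by one Interact; a greedy novelty-seeker that keeps wandering back left will rarely complete this sequence.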

Figure 2 shows the learning curves of this test-time goal, averaged over 10 seeds, for glib and rex. For further clarity, we also evaluate action babbling (random action selection). It is clear that glib continues to perform well as the size of the domain increases, while rex and action babbling do not.

There are two primary reasons for this difference in performance. First, rex explores unfamiliar states and actions greedily around the current state, and so does not quickly explore the far-away locations. For instance, when the robot is at location i and interacts with the gadget to produce effects E_i, then since the current state now contains E_i, MoveLeft seems like an interesting action to take because it leads to a novel state. On the other hand, glib can set itself the goal of moving to the right-most location, thereby committing to moving there over the following timesteps rather than frequently being drawn back to the left.

Second, rex is particularly sensitive to overly general preconditions, which arise frequently in this domain. For instance, when the robot interacts with a gadget and sees effects E_i, our learner induces that E_i can be achieved no matter where the robot is currently located, which is incorrect (overly general). Because rex uses this precondition to derive novelty scores, it concludes that there is no point in further attempting to interact with gadget g_i, and so will never learn the ground-truth transition model for inducing the effects E_i. glib shines here: when the operators have overly general preconditions, the agent sets a goal for itself and (mistakenly) believes it is achievable; when it executes its plan for achieving this goal, it is quickly able to correct the operators.

Figure 3: Results on the four benchmark planning tasks. All curves show averages over 10 seeds. Standard deviations are omitted for visual clarity. We can see that our proposed method glib-l2 performs better than all baselines in Blocks, Keys and Doors, and Travelling Salesman, while glib-l1 performs the best in Gripper.

5.2 Benchmark Tasks

Exploration Methods Evaluated.

  • Action babbling. A uniformly random exploration policy over the set of ground actions of the domain.

  • irale [19]. This exploration method uses the current learned operators for action selection, but does not perform lookahead with them.

  • expo [10]. This operator refinement method allows for correcting errors in operators when they are discovered. Since we do not have goals at training time, we run action babbling until an error is discovered.

  • rex [14] in E³-exploration mode.

  • glib-g1 (ours). glib in ground mode with k = 1.

  • glib-l1 (ours). glib in lifted mode with k = 1.

  • glib-l2 (ours). glib in lifted mode with k = 2.

  • Oracle. This method has access to the ground-truth operators and is intended to provide a rough upper bound on performance. The oracle picks an action for the current state whose predicted effects under the current operators and true operators do not match. If all effects are correctly predicted, the oracle performs breadth-first search (with horizon 2) for any future mismatches, falling back to action babbling when none are found.

Experimental Details. We evaluate on four benchmark planning tasks: Blocks [23], Keys and Doors (also called Lightworld [13]), Travelling Salesman, and Gripper [23]. We use PDDLGym [22], a library for interacting with predicate-based environments. Each training episode is 25 timesteps long, after which the environment is randomly reset to a new initial state. All experiments are run sequentially on a dual-core Intel Xeon E5 with 8GB RAM. All methods use Fast-Forward [11] with a 10-second timeout.

glib can be used with any implementation of OpLearn. Our implementation uses tilde [4], an inductive logic programming method that extends decision tree learning to the first-order logic regime. As done by Rodrigues et al. (2011), for efficiency we train only on transitions that previously led to prediction errors.

Results and Discussion. Figure 3 shows learning curves for all tasks, averaged over 10 seeds. Across all domains, glib outperforms all baselines, sometimes by large margins. In three of the four domains, glib-l2 performs better than irale, expo, and rex; in Blocks there is a 3x improvement in sample complexity, while in Keys and Doors there is a 4x improvement. In Gripper, glib-l1 performs best. This suggests that the setting of k is an important hyperparameter.

In the Keys and Doors domain, to open the door to the next room, the agent must first navigate to and pick up the key to unlock that door. In such settings with bottlenecks, we can see major improvements by using glib, as the agent often sets goals that drive itself through and beyond the bottleneck.

In all domains, glib-l outperforms glib-g, suggesting that these domains gain a lot from the computational benefits afforded by over-generalization in the novelty heuristic.

While the results suggest that glib is a strong approach for exploration, its per-iteration time can be higher than that of methods which do not perform lookahead (see Table 1). This difference can be partly attributed to the time- and compute-intensive nature of planning. For instance, glib-l2 is relatively slow in Gripper because of the large number of infeasible goals that may be proposed (e.g., is-spoon(fork3)). This motivates a straightforward extension to glib: provide as input an auxiliary logic that defines feasible goals. We have implemented this extension and found that it reduces the per-iteration time for glib-l2 on Gripper from 5.512 to 0.263 seconds without impacting performance.
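The feasibility-filter extension described above can be sketched simply. The function names and the static-predicate constraint are our own illustrative assumptions; the paper specifies only that an auxiliary logic defining feasible goals is supplied as input:

```python
# Drop babbled goals that violate a user-supplied feasibility constraint
# before any planner time is spent on them.

def filter_feasible(goals, is_feasible):
    """Keep only goals in which every literal passes the feasibility check."""
    return [g for g in goals if all(is_feasible(lit) for lit in g)]

# Example constraint for a Gripper-like domain: type predicates such as
# is-spoon are static, so they can never be achieved and make poor goals.
STATIC_PREDICATES = {"is-spoon", "is-fork"}

def default_feasible(lit):
    return lit[0] not in STATIC_PREDICATES
```

Pruning infeasible goals like is-spoon(fork3) up front avoids repeated doomed planner calls, which is consistent with the large per-iteration speedup reported for glib-l2 on Gripper.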

Method               Blocks   Keys and Doors   Travelling Salesman   Gripper
Action babbling       0.000        0.000              0.000           0.000
irale                 0.028        0.669              0.051           0.028
expo                  0.042        0.033              0.013           0.064
rex                   0.121       44.509              0.128           0.429
glib-g1 (ours)        0.213          —                7.045           0.202
glib-l1 (ours)        0.003        0.002              0.001           0.023
glib-l2 (ours)        1.683        0.022              0.092           5.512
Table 1: Average seconds per iteration taken by each method. rex and glib typically take much more time than the other baselines; this is because they perform search using the current learned operators. glib-g1 is intractable on Keys and Doors (marked —) because the space of ground literals is extremely large in this domain.

6 Conclusion

We have addressed the problem of efficient exploration for learning lifted operators in goal-free settings. We proposed glib as a new exploration method and showed it to be an empirically useful strategy, as validated on a range of benchmark planning tasks against several state-of-the-art baselines.

An important avenue for future work is to devise a mechanism for automatically determining the optimal length of conjunctive goals for a particular domain, rather than starting from single-literal goals and working upward incrementally. One approach would be to examine correlations between neighboring states to estimate how quickly literals tend to change. Another direction is to develop other goal-setting methods; for instance, one could fruitfully combine the insights of rex and glib, planning for potentially long horizons but only within known parts of the state space.
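The first idea above, estimating from observed transitions how quickly literals change between neighboring states, can be sketched as follows. This is an illustrative interpretation of the future-work proposal, not an implemented component of glib; the state representation and example trajectory are hypothetical.

```python
# Sketch: estimate, per predicate, the fraction of observed transitions in
# which some literal of that predicate was added or removed. Predicates that
# change slowly could inform the choice of conjunctive goal length.
from collections import defaultdict

def change_rates(trajectory):
    """trajectory: a list of states, each a frozenset of (predicate, args)
    ground literals. Returns {predicate: fraction of transitions in which
    a literal of that predicate flipped}."""
    changed = defaultdict(int)
    preds = {pred for state in trajectory for pred, _ in state}
    for s, s_next in zip(trajectory, trajectory[1:]):
        flipped = s ^ s_next  # symmetric difference: literals that changed
        for pred in {p for p, _ in flipped}:
            changed[pred] += 1
    n_transitions = max(len(trajectory) - 1, 1)
    return {pred: changed[pred] / n_transitions for pred in preds}

# Toy three-state trajectory from a blocks-like domain.
traj = [
    frozenset({("holding", ("a",)), ("clear", ("b",))}),
    frozenset({("clear", ("b",)), ("on", ("a", "b"))}),
    frozenset({("on", ("a", "b")), ("clear", ("a",))}),
]
print(change_rates(traj))
```

Slowly changing predicates plausibly mark stable structure worth targeting with longer conjunctions, while rapidly changing ones may be better served by short goals.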


Acknowledgments

We would like to thank Ferran Alet and Caris Moses for their valuable comments on an initial draft. We gratefully acknowledge support from NSF grants 1523767 and 1723381; from ONR grants N00014-13-1-0333 and N00014-18-1-2847; from AFOSR grant FA9550-17-1-0165; from Honda Research; from the MIT-Sensetime Alliance on AI; and from the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. Rohan is supported by an NSF Graduate Research Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.


References

  • [1] A. Arora, H. Fiorino, D. Pellier, M. Métivier, and S. Pesty (2018) A review of learning planning action models. The Knowledge Engineering Review.
  • [2] A. Baranes and P. Oudeyer (2013) Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems 61 (1), pp. 49–73.
  • [3] S. Benson (1995) Inductive learning of reactive action models. In Machine Learning Proceedings 1995, pp. 47–54.
  • [4] H. Blockeel and L. De Raedt (1998) Top-down induction of first-order logical decision trees. Artificial Intelligence 101 (1-2), pp. 285–297.
  • [5] R. I. Brafman and M. Tennenholtz (2002) R-MAX – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3 (Oct), pp. 213–231.
  • [6] A. Cropper (2019) Playgol: learning programs through play. arXiv preprint arXiv:1904.08993.
  • [7] R. E. Fikes and N. J. Nilsson (1971) STRIPS: a new approach to the application of theorem proving to problem solving. Artificial Intelligence 2 (3-4), pp. 189–208.
  • [8] C. Florensa, D. Held, X. Geng, and P. Abbeel (2017) Automatic goal generation for reinforcement learning agents. arXiv preprint arXiv:1705.06366.
  • [9] S. Forestier, Y. Mollard, and P. Oudeyer (2017) Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190.
  • [10] Y. Gil (1994) Learning by experimentation: incremental refinement of incomplete planning domains. In Machine Learning Proceedings 1994, pp. 87–95.
  • [11] J. Hoffmann (2001) FF: the fast-forward planning system. AI Magazine 22 (3), pp. 57–57.
  • [12] M. Kearns and S. Singh (2002) Near-optimal reinforcement learning in polynomial time. Machine Learning 49 (2-3), pp. 209–232.
  • [13] G. Konidaris and A. G. Barto (2007) Building portable options: skill transfer in reinforcement learning. In IJCAI, Vol. 7, pp. 895–900.
  • [14] T. Lang, M. Toussaint, and K. Kersting (2012) Exploration in relational domains for model-based reinforcement learning. Journal of Machine Learning Research 13 (Dec), pp. 3725–3768.
  • [15] A. Laversanne-Finot, A. Pere, and P. Oudeyer (2018) Curiosity driven exploration of learned disentangled goal spaces. In Conference on Robot Learning, pp. 487–504.
  • [16] J. Lehman and K. O. Stanley (2008) Exploiting open-endedness to solve problems through the search for novelty. In Eleventh International Conference on Artificial Life (ALIFE XI).
  • [17] E. P. Pednault (1989) ADL: exploring the middle ground between STRIPS and the situation calculus. KR 89, pp. 324–332.
  • [18] G. D. Plotkin (1970) A note on inductive generalization. Machine Intelligence 5 (1), pp. 153–163.
  • [19] C. Rodrigues, P. Gérard, C. Rouveirol, and H. Soldano (2011) Active learning of relational action models. In International Conference on Inductive Logic Programming, pp. 302–316.
  • [20] M. Rolf, J. J. Steil, and M. Gienger (2010) Goal babbling permits direct learning of inverse kinematics. IEEE Transactions on Autonomous Mental Development 2 (3), pp. 216–229.
  • [21] W. Shen and H. A. Simon (1994) Autonomous learning from the environment. W. H. Freeman and Company.
  • [22] T. Silver and R. Chitnis (2019) PDDLGym: OpenAI Gym environments from PDDL domains. GitHub.
  • [23] M. Vallati, L. Chrpa, M. Grześ, T. L. McCluskey, M. Roberts, S. Sanner, et al. (2015) The 2014 International Planning Competition: progress and trends. AI Magazine 36 (3), pp. 90–98.
  • [24] X. Wang (1996) Planning while learning operators. In AIPS, pp. 229–236.
  • [25] H. H. Zhuo, Q. Yang, D. H. Hu, and L. Li (2010) Learning complex action models with quantifiers and logical implications. Artificial Intelligence 174 (18), pp. 1540–1569.