CAMPs: Learning Context-Specific Abstractions for Efficient Planning in Factored MDPs

07/26/2020 ∙ by Rohan Chitnis, et al. ∙ MIT

Meta-planning, or learning to guide planning from experience, is a promising approach to improving the computational cost of planning. A general meta-planning strategy is to learn to impose constraints on the states considered and actions taken by the agent. We observe that (1) imposing a constraint can induce context-specific independences that render some aspects of the domain irrelevant, and (2) an agent can take advantage of this fact by imposing constraints on its own behavior. These observations lead us to propose the context-specific abstract Markov decision process (CAMP), an abstraction of a factored MDP that affords efficient planning. We then describe how to learn constraints to impose so the CAMP optimizes a trade-off between rewards and computational cost. Our experiments consider five planners across four domains, including robotic navigation among movable obstacles (NAMO), robotic task and motion planning for sequential manipulation, and classical planning. We find planning with learned CAMPs to consistently outperform baselines, including Stilman's NAMO-specific algorithm. Video: https://youtu.be/wTXt6djcAd4


1 Introduction

Online planning is a popular paradigm for sequential decision-making in robotics and beyond, but its practical application is limited by the computational burden of planning while performing a task. In meta-planning, the agent learns to guide planning efficiently and effectively based on its previous planning experience. Learning to impose constraints on the states considered and actions taken by an agent is a promising paradigm for meta-planning; it reduces the space of policies the agent must consider [23, 14, 38]. In contrast to (e.g., physical or kinematic) constraints beyond the agent’s control, these constraints are imposed by the agent on itself for the sole purpose of efficient planning.

Figure 1: (A, B) In the namo domain, the robot (red circle) must reach the red object, which requires navigating there while moving obstacles out of the way. Two sample problems are shown. (C, D) If the robot constrains itself to stay within certain rooms (yellow), the obstacles in other rooms become irrelevant. (E, F) In the sequential manipulation domain, the robot must put the red objects into bins. (G, H) If the robot constrains itself to top grasps, the blue objects become irrelevant. Similarly, if the robot constrains target placements to certain bins (yellow), the green objects become irrelevant.

Beyond reducing the space of policies, imposing constraints can improve planning efficiency in another important way. In factored domains, where the states and actions decompose into variables, imposing constraints can induce context-specific independences (CSIs) [7] that render some variables irrelevant. For example, consider the two navigation among movable obstacles (namo) problems in Figure 1A–B. Imposing a constraint that forbids certain rooms induces CSIs between the robot’s position and that of all obstacles in those forbidden rooms. Consequently, these obstacles can be ignored, as in Figure 1C–D. A planning problem with a context imposed (we henceforth use context as a synonym for constraint) and the resulting irrelevant variables removed constitutes an abstraction of the original problem [30, 27]. Adopting the Markov decision process (mdp) formalism, we refer to this as a context-specific abstract mdp (camp).

Planning in a camp is often more efficient than planning in the original mdp, but may abstract away important details of the environment, leading to a suboptimal policy. Practically speaking, we are often interested in a trade-off: we would like our planners to produce highly rewarding behavior, but not be too computationally burdensome; we are willing to sacrifice optimality, and in the case of goal-based tasks, even soundness and completeness, to maximize this trade-off in expectation. In this work, we propose a learning-based approach to maximize this trade-off. Given a set of training tasks with a shared transition model and factored states and actions, we first approximate the set of CSIs present in these tasks. We then train a context selector, which predicts a context that should be imposed for a given task. At test time, given a novel task, we use the learned context selector and CSIs to induce a camp, which we then use to plan. This overall pipeline is summarized in Figure 2.

Our approach rests on the premise that predicting contexts to impose is easier, and generalizes better, than learning a reactive policy. Intuitively, the burden on reactive policy learning is higher, as the policy must exactly carve out a specific, good path through transition space, whereas an imposed context must only carve out a region of transition space that includes at least one good path.

In experiments, we consider four domains, including robotic namo and sequential manipulation, that collectively exhibit discrete and continuous states and actions, relational states, sparse rewards, stochastic transitions, and long planning horizons. To evaluate the generality of camps, we consider multiple planners, including Monte Carlo tree search [9], FastDownward [17], and a task and motion planner [35]. Our results suggest that planning with learned camps strikes a strong balance between pure planning and pure policy learning [32]. In the namo domain, we also find that camps with a generic task and motion planner outperform Stilman’s namo-specific algorithm [37]. We conclude that camps offer a promising path toward fast, effective planning in large and challenging domains.

2 Preliminaries

We introduce notation and background formalism (§2.1), and then present a problem definition (§2.2).

2.1 Context-Specific Independence in Factored Markov Decision Processes

A Markov decision process (mdp) is given by a tuple ⟨S, A, P, R, H⟩, with: state space S; action space A; transition model P(S_{t+1} = s' | S_t = s, A_t = a), where S_t and A_t are random variables denoting the state and action taken at time t; reward function R(s); and horizon H. (We say the reward is a function of the state only for simplicity of notation; this is not critical to our method.) The solution to an mdp is a policy π : S → A, a mapping from states to actions, that maximizes the expected sum of rewards with respect to the transitions.

We focus on factored mdps [16], where each state is factored into variables X_1, …, X_n, with each X_i having domain D(X_i). A state is then an assignment of a value x_i ∈ D(X_i) to every X_i; thus, S = D(X_1) × ⋯ × D(X_n). Actions are similarly factored into variables U_1, …, U_m with domains D(U_j), so that an action is an assignment of a value u_j ∈ D(U_j) to every U_j; thus, A = D(U_1) × ⋯ × D(U_m). The reward function for a factored mdp is defined in terms of a subset of state variables V_R, which we call the reward variables. Let V denote all state and action variables together. Variable domains may be discrete or continuous for both states and actions.
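As a toy illustration of this factoring (our own example, not the paper's), states and actions can be represented as joint assignments to named variables, with the full spaces given by products of the variable domains:

```python
# Hypothetical factored gridworld-like domain: variables and their discrete domains.
state_domains = {"robot_room": {1, 2, 3}, "obstacle_room": {1, 2, 3}}
action_domains = {"move": {"up", "down", "left", "right"}, "destroy_obstacle": {True, False}}

# A state (or action) is one joint assignment of values to its variables;
# the state space is the Cartesian product of the variable domains.
example_state = {"robot_room": 1, "obstacle_room": 3}
num_states = 1
for domain in state_domains.values():
    num_states *= len(domain)  # |S| = product of the |D(X_i)| = 9 here
```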

Following [7], we define a context as a pair C = (W, Ψ), where W ⊆ V is some subset of state and action variables, and Ψ is a space of possible joint assignments to W. A state-action pair is in the context C when its joint assignment of the variables in W is present in Ψ. Two variables u, v ∈ V are contextually independent under C if, restricted to state-action pairs in the context, the transition distribution of u does not depend on the value of v; in this case we write u ⊥ v | C. This relation is called a context-specific independence (CSI). In this paper, we explore how CSIs can be automatically identified and exploited for planning in factored mdps.
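To make the idea of a CSI concrete, here is a toy example of our own in a namo-flavored setting: the robot's next room depends on an obstacle only when the robot targets that obstacle's room, so under a context that forbids targeting that room, the dependence disappears.

```python
def next_robot_room(state, action):
    """Toy transition for the robot's room: the obstacle only matters when room 2 is targeted."""
    target = action["target_room"]
    if target == 2 and state["obstacle_in_room_2"]:
        return state["robot_room"]  # blocked by the obstacle, robot stays put
    return target

# Context: the robot constrains itself to never target room 2. Restricted to that context,
# the next robot_room no longer depends on obstacle_in_room_2, i.e., the two variables are
# contextually independent under the context, even though they are dependent in general.
for blocked in (True, False):
    s = {"robot_room": 1, "obstacle_in_room_2": blocked}
    assert next_robot_room(s, {"target_room": 3}) == 3  # same outcome either way
```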

Figure 2: Three approaches to solving an mdp. Given a task, our approach (top row) applies its learned context selector to generate a camp, then plans in this camp to get a policy. Our approach often achieves higher reward than pure policy learning (middle row), and lower computational cost than pure planning (bottom row), leading to a good objective value (right; computed using the trade-off parameter λ in Equation 1).

2.2 Problem Formulation

A task is a pair of initial state and reward function, denoted T = (s_0, R). We are given a set of training tasks, T_1, …, T_N, and a test task, T_test, all of which are drawn from some unseen distribution. All tasks share the same factored state space S, factored action space A, transition model P, and horizon H; therefore, each task T_i induces a factored mdp, denoted M_i (and the test task induces M_test). Each task is parameterized by a feature vector, denoted φ(T), with featurizer φ. For instance, in our robotic namo domain, φ gives a top-down image of the initial scene (which also implicitly describes the goal). The agent interacts with P and R as black boxes; it does not know their analytical representations or causal structure. We also assume a black-box mdp solver, Plan, which takes as input an mdp M and a current state s, and returns a next action: a = Plan(M, s). (Planners generally return a policy or a sequence of actions; we suppose that the planner is called at every timestep to simplify exposition. In our experiments, we replan in domains that have stochastic transitions.)

Before being presented with the test task, the agent may first interact with the training tasks T_1, …, T_N, perhaps compiling useful knowledge that it can deploy at test time, such as a task-conditioned policy. Then, it is given the test task T_test, and its goal is to efficiently produce actions that accrue high cumulative reward in the test mdp M_test. We formalize this trade-off via the objective:

    maximize over π:   E[ Σ_{t=1}^{H} R(S_t) ] − λ · ComputeCost(π),     (1)

where π denotes the (possibly implicit) policy the agent follows on the test task, ComputeCost(π) denotes the cost (e.g., wall-clock time) of evaluating the policy π, λ ≥ 0 is a trade-off parameter, and the expectation is over stochasticity in the transitions. Note that ComputeCost includes the cost of both computation performed before the agent starts acting and any computation that might be performed on each timestep after the first.

We seek a policy maximizing this objective. One possible approach is to call Plan on the full test mdp, that is, to take the action Plan(M_test, s) in every state s encountered. This method would yield high rewards, but it may also incur a large ComputeCost. Another possibility is to learn (at training time) and transfer (to test time) a task-conditioned reactive policy; this can have low ComputeCost at test time, but perhaps at the expense of rewards if the policy fails to generalize well to the test task (Figure 2).
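To make the trade-off concrete, here is a minimal sketch of how one might estimate the objective in Equation 1 for a black-box planner, measuring ComputeCost as wall-clock planning time; `plan`, `step`, and `reward` are hypothetical stand-ins for the planner, transition model, and reward function, not the paper's API.

```python
import time

def estimate_objective(plan, step, reward, s0, horizon, lam):
    """One-episode estimate of Equation 1: cumulative reward minus lam * planning time."""
    total_reward, compute_cost = 0.0, 0.0
    s = s0
    for _ in range(horizon):
        start = time.perf_counter()
        a = plan(s)                                  # black-box planner, called at every timestep
        compute_cost += time.perf_counter() - start  # ComputeCost measured as wall-clock time
        s = step(s, a)                               # black-box (possibly stochastic) transition
        total_reward += reward(s)
    return total_reward - lam * compute_cost
```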

3 Context-Specific Abstract Markov Decision Processes (CAMPs)

The objective formulated in Equation 1 trades off the computational cost of planning with the resulting rewards. In this section, we present an approach to optimizing this trade-off that lies between the two extremes of pure planning and pure policy learning [32]. Rather than planning in the full test task, we propose to learn to generate an abstraction [30, 27] of the test task, in which we can plan efficiently.

An abstraction over state space S and action space A is a pair of functions (α, β), with α : S → S̄ and β : A → Ā, where S̄ and Ā are abstract state and action spaces. We are specifically interested in abstractions that are projections: α and β map each state and action to its assignment of some fixed subset of the variables. This has the effect of dropping state variables and action variables.
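With factored states represented as dictionaries, a projective abstraction is simply a key filter; the following is an illustrative sketch of our own, not code from the paper.

```python
def project(assignment, kept_variables):
    """Project a factored state or action (dict of variable -> value) onto a subset of variables."""
    return {var: val for var, val in assignment.items() if var in kept_variables}

# Example: dropping the obstacle variable of a namo-like state.
state = {"robot_pose": (1.0, 2.0), "robot_room": 3, "obstacle_7_pose": (5.0, 5.0)}
abstract_state = project(state, kept_variables={"robot_pose", "robot_room"})
```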

The relevant-variables projection is a simple projective abstraction that has been studied in prior work (under different names) [3, 18, 8]. It drops all irrelevant variables, in the following sense:

Definition 1 (Variable relevance).

Given a factored mdp with variables V and reward variables V_R ⊆ V, a variable v ∈ V is relevant iff v ∈ V_R, or there exist a relevant variable u ∈ V, a timestep t, and an assignment to the remaining variables such that the distribution of u at time t+1 depends on the value of v at time t.

Intuitively, a variable is relevant if there is any possibility that its value at some timestep will have an eventual influence, directly or indirectly, on the value of the reward. Unfortunately, as identified by Baum et al. [3], relevance is often too strong of a property for the relevant-variables projection to yield meaningful improvements in practice — most variables typically have some way of influencing the reward, under some sequence of actions taken by the agent. In search of greater flexibility, we now define a generalization of variable relevance that is conditioned on a particular context (§2.1).

Definition 2 (Context-specific variable relevance).

Given a context C and a factored mdp with variables V and reward variables V_R ⊆ V, a variable v ∈ V is relevant in the context C iff v ∈ V_R, or there exist a variable u ∈ V that is relevant in C, a timestep t, and an assignment in the context C such that the distribution of u at time t+1 depends on the value of v at time t.

Each possible context C induces a projection that drops the variables which are irrelevant in C; let (α_C, β_C) denote this abstraction. We now define a camp, an abstract mdp associated with C.

Definition 3 (Context-Specific Abstract mdp (camp)).

Consider an mdp M = ⟨S, A, P, R, H⟩ and a context C. Let (α_C, β_C) be the projection onto the variables that are relevant in C, with right inverses α_C⁻¹ and β_C⁻¹. Let s_⊥ be a new sink state, such that every action taken in s_⊥ leads back to s_⊥. The context-specific abstract mdp, M̄_C, for M and C is ⟨α_C(S) ∪ {s_⊥}, β_C(A), P̄, R̄, H⟩, where R̄(s̄) = R(α_C⁻¹(s̄)) for s̄ ≠ s_⊥, and P̄ is defined as follows:

  1. if (α_C⁻¹(s̄), β_C⁻¹(ā)) is not in the context C, then P̄(s_⊥ | s̄, ā) = 1;

  2. if (α_C⁻¹(s̄), β_C⁻¹(ā)) is in the context C, then P̄(s̄′ | s̄, ā) is the probability, under P(· | α_C⁻¹(s̄), β_C⁻¹(ā)), that the relevant state variables take the values s̄′.

We may also say that M̄_C is M with context C imposed.

Intuitively, a camp imposes a projective abstraction that drops the variables that are irrelevant under the given context, and also imposes that any transition in violation of the context leads the agent to s_⊥, an absorbing sink state. In practice, the right inverses α_C⁻¹ and β_C⁻¹ can be obtained by assigning arbitrary values to the dropped variables; the choice of value is inconsequential by Definition 2. For a graphical example of a camp, see Appendix C.

A camp is usually not optimality-preserving, because the context restricts the agent to a subregion of the state and action space [2]. However, context-specific relevance is much weaker than relevance: it only requires a variable to be relevant under the given context. For example, to a robot operating in a home, the weather outside is irrelevant as long as it remains in the context of staying indoors.
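The following is a minimal sketch of our own (not the authors' implementation) of wrapping a black-box factored simulator as a camp: state-action pairs that violate the context are routed to an absorbing sink state, and irrelevant variables are projected away. `step_fn`, `in_context`, and the default-value dicts are assumed, hypothetical inputs.

```python
SINK = "SINK"  # absorbing sink state for context violations

def make_camp_step(step_fn, in_context, relevant_state_vars, state_defaults, action_defaults):
    """Wrap a black-box factored transition function as a CAMP transition.

    step_fn(state, action) -> next_state, with states/actions as dicts of variable -> value.
    in_context(state, action) -> bool: whether the pair satisfies the imposed context.
    state_defaults / action_defaults: arbitrary values for the dropped variables, acting as a
    right inverse of the projection (the choice is inconsequential by Definition 2).
    """
    def camp_step(abstract_state, abstract_action):
        if abstract_state == SINK:
            return SINK                                   # the sink is absorbing
        # Lift the abstract state and action back to the full spaces.
        state = {**state_defaults, **abstract_state}
        action = {**action_defaults, **abstract_action}
        if not in_context(state, action):
            return SINK                                   # context violation routes to the sink
        next_state = step_fn(state, action)
        # Project the successor state onto the relevant variables.
        return {v: next_state[v] for v in relevant_state_vars}
    return camp_step
```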

camps offer a way to solve the test task (§2.2) that lies between the extremes of pure planning and pure policy learning. Namely, given a test mdp M_test, we select a context, compute the relevant variables under that context via backward induction, generate the camp, and finally plan in this camp to obtain a policy for M_test. We have therefore reduced the problem of optimizing the objective in Equation 1 to that of determining the best context to impose. To address this issue, we now turn to learning.

4 Learning to Generate CAMPs

We now have the ability to generate a context-specific abstract mdp (camp) when given a context and the associated context-specific independences. However, contexts and their associated independences are not provided in our problem. In this section, we describe how to learn approximate context-specific independences and a context selector, for use at test time. Figure 3 gives a data-flow diagram.

4.1 Approximating the Context-Specific Independences

Recall that the agent is given mdps with factored states and actions, but only query access to the (shared) transition model. For example, the transition model may be a black-box physics simulator, as in two of our experimental domains. In order to approximately determine the context-specific independences that are latent within a factored mdp, we propose a sample-based procedure. Given a context, the algorithm examines each pair of state or action variables (u, v) and tests for empirical dependence, that is, whether any sampled value of v induces a change in the distribution of u at the next timestep, conditioned on the sampled values of the remaining variables. For full pseudocode, see Appendix A.

The runtime of this algorithm depends on the size of the domain and the number of samples used to test dependence. In theory, the number of samples required to identify all independences could be arbitrarily large. In practice, for the tasks we considered in our experiments, including robotic manipulation and namo, we found this algorithm to be sufficient for detecting a useful set of context-specific independences. Moreover, our method is robust to errors in the discovered independences, which in the worst case will simply exclude some candidate abstractions from consideration.

Figure 3: Data-flow diagram for our method during learning and test time. (Left) Approximate context-specific independence (CSI) learning derives the relevant state and action variables under each context. (Middle) A context selector is learned by optimizing the objective on the training tasks. The transition model and approximate CSIs are used to evaluate the objective. (Right) Given a test task, the agent selects a context to impose using the learned context selector. The relevant variables for this context are calculated from the learned CSIs. From the context, relevant variables, test task, and transition model, the agent derives a camp, and can plan in it to obtain a policy for the test task.

Where Does the Space of Contexts Come From?

Our approximate algorithm allows us to estimate independences given a context; this raises the question of which contexts should be evaluated. We propose a simple method for deriving a space of possible contexts that works well across our varied experimental domains. From the set of variables V, we consider conjunctions and disjunctions up to some length (a hyperparameter), excluding any terms whose involved variables have joint domain size greater than some threshold (another hyperparameter). Note that for any finite threshold, this procedure immediately excludes contexts involving continuous variables. While this family of contexts has the benefit of being fairly general, we emphasize that other choices, e.g., more domain-specific context families, may be used as well [23, 15, 10].
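As an illustration of this enumeration, the following sketch (with invented hyperparameter names) generates single-term contexts over discrete variables and builds conjunctions and disjunctions of up to a maximum length, skipping variables whose domains exceed a size threshold:

```python
from itertools import combinations

def enumerate_contexts(domains, max_domain_size=10, max_length=2):
    """Enumerate candidate contexts from discrete variable domains.

    domains: dict mapping each variable to its finite set of values. A single term fixes one
    variable to one value; contexts are conjunctions or disjunctions of up to max_length terms.
    Variables whose domains exceed max_domain_size (e.g., continuous variables) are skipped.
    """
    terms = [(var, val)
             for var, vals in domains.items() if len(vals) <= max_domain_size
             for val in vals]
    contexts = [("single", (t,)) for t in terms]
    for k in range(2, max_length + 1):
        for combo in combinations(terms, k):
            if len({var for var, _ in combo}) == k:  # conjunctions need distinct variables
                contexts.append(("and", combo))
            contexts.append(("or", combo))
    return contexts

# Example: contexts restricting which rooms the robot allows itself to enter.
candidate_contexts = enumerate_contexts({"robot_room": {1, 2, 3, 4}})
```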

4.2 Learning the Context Selector

The performance of a camp depends entirely on the selected context; if the context constrains the agent to a poor region of plan space, or induces independences that make important variables irrelevant, the resulting policy could get very low rewards. However, if the context is selected judiciously, the camp may exhibit substantial efficiency gains with minor impact on rewards.

We now describe an algorithm for learning to select a context that optimizes the objective (Equation 1). Pseudocode is presented in Appendix B. For each training task T_i, we first identify the best possible context according to the objective in Equation 1. This process sets up a supervised multiclass classification problem that maps the featurized representation φ(T_i) of a task to the best context to impose on that task. We solve this classification problem by training a neural network with cross-entropy loss, resulting in a context selector f_θ, where θ denotes the parameters of the neural network. At test time, we choose a context by calling f_θ(φ(T_test)), generate the associated camp, and plan in this camp to efficiently obtain a policy for the test task.
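A minimal sketch of the resulting test-time pipeline under these assumptions, with all function names as hypothetical placeholders rather than the authors' API: featurize the task, pick a context with the learned selector, build the camp, and replan in it at every step.

```python
def solve_test_task(task, featurizer, selector, contexts, build_camp, plan, horizon):
    """Act in a test task by replanning in the camp induced by the selected context."""
    features = featurizer(task)
    context = contexts[selector(features)]  # learned classifier picks a context index
    camp = build_camp(task, context)        # drops variables irrelevant under the context
    state, actions = task.initial_state, []
    for _ in range(horizon):
        action = plan(camp, state)          # plan in the abstract MDP at every timestep
        state = task.step(state, action)    # but execute in the original task
        actions.append(action)
    return actions
```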

5 Experiments and Results

Our experiments aim to answer the following key questions:

  • How does planning with learned camps compare to pure planning and pure policy learning across a varied set of domains, both discrete and continuous? (§5.1), (§5.3)

  • To what extent is the performance of camps planner-agnostic? (§5.1), (§5.3), (Appendix H)

  • How does the performance of camps vary with the choice of the trade-off parameter λ (Equation 1)? (§5.4)

  • How does the performance of camps vary with the number of training tasks? (Appendix F)

We overview the experimental setup in (§5.1) and (§5.2), and provide details in Appendix E. Then, we present our main results in (§5.3), with additional results in (§5.4) and Appendices F, G, and H.

5.1 Domains and Planners

We consider four domains and five planners (four online, one offline). Full details are in Appendix D.

Domain D1: Gridworld. A maze-style gridworld in which the agent must navigate across rooms to reach a goal location, while avoiding or destroying stochastically moving obstacles. Task features are top-down images of the maze layout. The state is a vector of the current position and room of each obstacle, the agent, and the goal. The actions are moving up, down, left, right; and destroying each obstacle. For planning in this domain, we consider Monte Carlo tree search (MCTS), breadth-first graph search with replanning (BFSReplan), and value iteration (VI). VI results are in Appendix H.

Domain D2: Classical planning. A deterministic dinner-making domain written in PDDL [31], with three different possible meals to make. Preparing each meal requires a different number of actions. The relative rewards for making each meal are the only source of variation between tasks; task features are simply these rewards. States are binary vectors describing which logical fluents hold true, and actions are logical operators described in PDDL, each containing parameters, preconditions, and effects. We use an off-the-shelf classical planner (Fast-Downward [17]).

Domain D3: Robotic navigation among movable obstacles (namo), simulated in PyBullet [12]. Task features are overhead images. The state is a vector of the current pose of each object and the robot, and the robot’s current room. The actions are moving the robot to a target pose, and clearing an object in front of the robot. We use a state-of-the-art tamp planner [35], which is not namo-specific.

Domain D4: Robotic sequential manipulation, simulated in PyBullet [12]. Task features are a vector of the object radii and occupied bins. The state is a vector of the current pose of each object, the grasp style used by the robot, and the current held object (if any). The actions are moving the robot to a target base pose and grasping at a target gripper pose, and moving the robot to a target base pose and placing at a target placement pose. The planner for this task is the same as in namo.

5.2 Methods and Baselines

We consider the following methods:

  • camp. Our full method.

  • camp ablation. An ablation of our full method in which the camp only sends the agent to a sink state for context violation, but does not project away irrelevant variables.

  • Pure planning. This baseline does not use the training tasks, and just solves the full test task.

  • Plan transfer. This baseline solves each training task to obtain a plan, and at test time picks actions via majority vote across the training task plans. This can be understood as learning a “policy” that is conditioned only on the timestep, not on the state.

  • Policy learning. This baseline solves each training task to obtain a plan, then trains a state-conditioned neural network policy to imitate the resulting dataset of state-action trajectories, using supervised learning. This policy is used directly to choose actions at test time.

  • Task-conditioned policy learning. This baseline is the same as policy learning, but the neural network also receives as input the features of the task, in addition to the current state.

  • (Domain D3 only) Stilman’s planning algorithm [37] for namo problems, named ResolveSpatialConstraints, which attempts to find a feasible path to a target location by first finding feasible paths to any obstructing objects and moving them out of the way.

In all our domains, when no context is imposed, every variable is relevant. For this reason, the pure planning baseline can also be understood as an ablation of camp that does not account for contexts.

5.3 Main Results

Figure 4: Mean returns versus computation time on the test tasks, for all domains and methods. All points report an average over 10 independent runs of training and evaluation, with lines showing per-axis standard deviations. camps generally provide a better trade-off than the baselines: the blue points are usually higher than pure policy learning (camps accrue more reward) and to the left of pure planning (camps are more efficient). For the left-most plot, only the returns vary because MCTS is an anytime algorithm, so we run it with a fixed timeout. See (§5.3) for discussion on these results.

Figure 4 plots the mean returns versus computation time on the test tasks, for all domains and methods. Table 1 in Appendix G shows the corresponding objective values (Equation 1). All results report an average over 10 independent runs of context selector training and test task evaluation.

Discussion. camps outperform every baseline in all but Domain D2 (classical planning). camps often fare better than task-conditioned policy learning because the latter fails to generalize from training tasks to test tasks. This generalization failure manifests in low test task rewards, and in a substantial difference between the training and test objective values. In classical planning, however, task-conditioned policy learning outperforms camps; both achieve high task rewards, but the policy is faster to execute. This is because in this domain, learned policies generalize well: there is little variation between task instances, in stark contrast to the other domains. In Appendix F, we unpack this result further by testing the relative performance of camps and task-conditioned policy learning as a function of the number of training tasks.

Another clear conclusion from the main results is that camps outperform pure planning across all experiments, consistently achieving lower computational costs. In several cases, including namo and manipulation, camps also achieve higher rewards than pure planning does, since the latter sometimes hits the 60-second timeout before discovering the superior plans found very quickly by camps.

Results for the camp ablation show that imposing contexts alone provides clear benefits, focusing the planner on a promising region of the search space. This result is consistent with prior work showing that learning to impose constraints reduces planning costs [23, 14, 38]. However, the difference between camp and the ablation shows that dropping irrelevant variables provides even greater benefits.

A final observation is that camps perform comparably to Stilman’s namo algorithm [37]. This is notable because Stilman’s algorithm employs namo-specific assumptions, whereas the planner we use does not; in fact, we can see that the pure planner is strongly outperformed by Stilman’s algorithm. In our method, the context selector learns to constrain the robot to stay in emptier rooms, meaning it must move comparatively few objects out of the way. This leads to efficient planning, making the computational cost of camps almost as good as that of Stilman’s algorithm; additionally, it leads camps to obtain higher rewards than Stilman’s algorithm, because it often reaches the goal faster.

5.4 Performance as a Function of λ

The accompanying plots illustrate how the objective value (left) and returns (right) accrued by the camp policy vary as a function of λ, in Domain D2 (classical planning).

Discussion. The right plot shows that the returns from camp interpolate between those obtained by pure planning (when λ → 0, the agent is okay with spending a long time planning out its actions) and those obtained by a random policy (when λ → ∞, the agent spends as little time as possible choosing actions, and so performs no better than random). The green line is dashed because pure planning does not use λ, so its returns are unaffected by its value. The left plot (note the log-scale x-axis) shows objective values. We see that camp never suffers a lower objective value than that of a random policy, while pure planning drops off greatly as λ increases. This is because as λ grows, our context selector learns to choose contexts that induce very little planning (but obtain low returns), whereas pure planning does not adapt in this way.

6 Related Work

At a high level, our work falls under the broad research theme of learning to make planners more efficient using past planning experience. A fundamental question is deciding what to predict; for instance, it is common to learn a policy and/or value function from planning experience [34, 24, 25, 11]. In contrast, we learn to predict contexts that lead to a favorable trade-off between computational cost and rewards. Recent work leverages a given set of contexts to represent planning problem instances in “score space” [23], but does not consider the resulting context-specific independences, which we showed experimentally to yield large performance improvements. Other methods predict the feasibility of task plans or motion plans [14, 38, 15], which can also be seen as learning constraints on the search space. These methods can be readily incorporated into the camp framework.

We have formalized camps as a particular class of mdp abstractions. There is a long line of work on deriving abstractions for mdps, much of it motivated by the prospect of faster planning or more sample-efficient reinforcement learning [30, 18, 22, 36, 1]. One common technique is to aggregate states and actions into equivalence classes [4, 5, 27], a generalization of our notion of projective abstractions. Other work has learned to select abstractions [21, 28]; a key benefit of camps is that the contexts induce a structured hypothesis space of abstractions that greatly improve planning efficiency.

camps identify and exploit context-specific independences [7] in factored planning problems. In graphical models, context-specific independences can be similarly used to speed up inference, for example, by applying divide-and-conquer over contexts [39, 33, 13]. Stochastic Planning using Decision Diagrams (SPUDD) is a method that adapts these insights for planning with context-specific independences [19]. These insights are orthogonal to camps, but could be integrated to yield further efficiencies. SPUDD is a pure planning approach that considers all contexts, whereas in this work we learn a context selector that induces abstractions in which to plan.

7 Conclusion

In this work, we have presented a method for learning to generate context-specific abstractions of mdps, achieving more efficient planning while retaining high rewards. There are several clear directions for future work. On the learning side, one interesting question is whether factorizations of initially unfactored mdps can be automatically discovered in a way that leads to useful camps. Another direction to pursue is learning the task featurizer φ, which we assumed to be given in our problem formulation. Following [23], it could also be useful to extend the methods we have presented here so that multiple contexts can be imposed in succession at test time, using the performance of previous contexts to inform the choice of future ones. However, note that such a method would lead to an increase in computational cost, possibly to the detriment of the overall objective we formulated.

We would like to thank Kelsey Allen for valuable comments on an initial draft. We gratefully acknowledge support from NSF grant 1723381; from AFOSR grant FA9550-17-1-0165; from ONR grant N00014-18-1-2847; from the Honda Research Institute; from MIT-IBM Watson Lab; and from SUTD Temasek Laboratories. Rohan and Tom are supported by NSF Graduate Research Fellowships. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.

References

  • [1] D. Abel, D. E. Hershkowitz, and M. L. Littman (2017) Near optimal behavior via approximate state abstraction. arXiv preprint arXiv:1701.04113. Cited by: §6.
  • [2] D. Abel, N. Umbanhowar, K. Khetarpal, D. Arumugam, D. Precup, and M. L. Littman (2019) Value preserving state-action abstractions. ICLR. Cited by: §3.
  • [3] J. Baum, A. E. Nicholson, and T. I. Dix (2012) Proximity-based non-uniform abstractions for approximate planning. Journal of Artificial Intelligence Research 43, pp. 477–522. Cited by: §3, §3.
  • [4] J. C. Bean, J. R. Birge, and R. L. Smith (1987) Aggregation in dynamic programming. Operations Research 35 (2), pp. 215–220. Cited by: §6.
  • [5] D. P. Bertsekas, D. A. Castanon, et al. (1988) Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Transactions on Automatic Control. Cited by: §6.
  • [6] C. Boutilier, T. Dean, and S. Hanks (1999) Decision-theoretic planning: structural assumptions and computational leverage. Journal of Artificial Intelligence Research 11, pp. 1–94. Cited by: Figure 5, Appendix C.
  • [7] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller (1996) Context-specific independence in Bayesian networks. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence. Cited by: §1, §2.1, §6.
  • [8] C. Boutilier (1997) Correlated action effects in decision theoretic regression.. In UAI, pp. 30–37. Cited by: §3.
  • [9] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton (2012) A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games 4 (1), pp. 1–43. Cited by: §1.
  • [10] J. Carpentier, R. Budhiraja, and N. Mansard (2017) Learning feasibility constraints for multi-contact locomotion of legged robots. In Robotics: Science and Systems, pp. 9p. Cited by: §4.1.
  • [11] R. Chitnis, D. Hadfield-Menell, A. Gupta, S. Srivastava, E. Groshev, C. Lin, and P. Abbeel (2016) Guided search for task and motion plans using learned heuristics. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pp. 447–454. Cited by: §6.
  • [12] E. Coumans and Y. Bai (2016) PyBullet, a Python module for physics simulation for games, robotics and machine learning. GitHub repository. Cited by: Appendix D, Appendix D, §5.1, §5.1.
  • [13] C. Domshlak and S. E. Shimony (2004) Efficient probabilistic reasoning in bns with mutual exclusion and context-specific independence. International journal of intelligent systems 19 (8), pp. 703–725. Cited by: §6.
  • [14] D. Driess, J. Ha, and M. Toussaint (2020) Deep visual reasoning: learning to predict action sequences for task and motion planning from an initial scene image. In Proc. of Robotics: Science and Systems (R:SS), Cited by: §1, §5.3, §6.
  • [15] D. Driess, O. Oguz, J. Ha, and M. Toussaint (2020) Deep visual heuristics: learning feasibility of mixed-integer programs for manipulation planning. In IEEE International Conference on Robotics and Automation (ICRA), Cited by: §4.1, §6.
  • [16] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman (2003) Efficient solution algorithms for factored mdps. Journal of Artificial Intelligence Research 19, pp. 399–468. Cited by: §2.1.
  • [17] M. Helmert (2006) The fast downward planning system. Journal of Artificial Intelligence Research 26, pp. 191–246. Cited by: Appendix D, §1, §5.1.
  • [18] N. Hernandez-Gardiol (2008) Relational envelope-based planning. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA. External Links: Link Cited by: §3, §6.
  • [19] J. Hoey, R. St-Aubin, A. J. Hu, and C. Boutilier (1999) SPUDD: stochastic planning using decision diagrams. In UAI, Cited by: §6.
  • [20] J. Hoffmann (2001) FF: the fast-forward planning system. AI magazine 22 (3), pp. 57–57. Cited by: Appendix D.
  • [21] N. Jiang, A. Kulesza, and S. Singh (2015) Abstraction selection in model-based reinforcement learning. In International Conference on Machine Learning, pp. 179–188. Cited by: §6.
  • [22] N. K. Jong and P. Stone (2005) State abstraction discovery from irrelevant state variables.. In IJCAI, Vol. 8, pp. 752–757. Cited by: §6.
  • [23] B. Kim, L. P. Kaelbling, and T. Lozano-Pérez (2017) Learning to guide task and motion planning using score-space representation. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 2810–2817. Cited by: §1, §4.1, §5.3, §6, §7.
  • [24] B. Kim, L. P. Kaelbling, and T. Lozano-Pérez (2018) Guiding search in continuous state-action spaces by learning an action sampler from off-target search experience. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §6.
  • [25] B. Kim and L. Shimanuki (2019) Learning value functions with relational state representations for guiding task-and-motion planning. Conference on Robot Learning. Cited by: §6.
  • [26] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix E.
  • [27] G. Konidaris and A. Barto (2009) Efficient skill learning using abstraction selection. In Twenty-First International Joint Conference on Artificial Intelligence, Cited by: §1, §3, §6.
  • [28] G. Konidaris (2016) Constructing abstraction hierarchies using a skill-symbol loop. In IJCAI: proceedings of the conference, Vol. 2016, pp. 1648. Cited by: §6.
  • [29] J. J. Kuffner and S. M. LaValle (2000) RRT-connect: an efficient approach to single-query path planning. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), Vol. 2, pp. 995–1001. Cited by: Appendix D.
  • [30] L. Li, T. J. Walsh, and M. L. Littman (2006) Towards a unified theory of state abstraction for mdps. In In Proceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics, pp. 531–539. Cited by: §1, §3, §6.
  • [31] D. McDermott, M. Ghallab, A. Howe, C. Knoblock, A. Ram, M. Veloso, D. Weld, and D. Wilkins (1998) PDDL-the planning domain definition language. Technical Report CVC TR-98-003/DCS TR-1165, Yale. Cited by: §5.1.
  • [32] T. M. Moerland, A. Deichler, S. Baldi, J. Broekens, and C. M. Jonker (2020) Think too fast nor too slow: the computational trade-off between planning and reinforcement learning. External Links: 2005.07404 Cited by: §1, §3.
  • [33] D. L. Poole (2013) Context-specific approximation in probabilistic inference. arXiv preprint arXiv:1301.7408. Cited by: §6.
  • [34] D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of Go with deep neural networks and tree search. Nature. Cited by: §6.
  • [35] S. Srivastava, E. Fang, L. Riano, R. Chitnis, S. Russell, and P. Abbeel (2014) Combined task and motion planning through an extensible planner-independent interface layer. In 2014 IEEE international conference on robotics and automation (ICRA), pp. 639–646. Cited by: Appendix D, §1, §5.1.
  • [36] K. A. Steinkraus (2005) Solving large stochastic planning problems using multiple dynamic abstractions. Ph.D. Thesis, Massachusetts Institute of Technology. Cited by: §6.
  • [37] M. Stilman and J. J. Kuffner (2005) Navigation among movable obstacles: real-time reasoning in complex environments. International Journal of Humanoid Robotics 2 (04), pp. 479–503. Cited by: Table 1, §1, 7th item, §5.3.
  • [38] A. M. Wells, N. T. Dantam, A. Shrivastava, and L. E. Kavraki (2018) Learning feasibility for task and motion planning in tabletop environments. IEEE Robots and Automation Letters. Cited by: §1, §5.3, §6.
  • [39] N. L. Zhang and D. Poole (1999) On the role of context-specific independence in probabilistic inference. In IJCAI, Vol. 1, pp. 9. Cited by: §6.

Appendix A Pseudocode: Approximate Context-Specific Independence Learning

The following pseudocode describes our algorithm for learning approximate context-specific independences (CSIs). See Figure 5B for an example output, and see (§4.1) for discussion.

Input: State and action variables V
Input: Black-box transition model P
Input: Context C
Input: Number of samples K // hyperparameter
Returns: Approximate CSIs, i.e., the set of variable pairs (u, v) judged contextually independent under C
Initialize: CSIs ← all pairs of variables in V // initially assume all pairs are independent
Sample state and action assignments that lie in the context C
// Test pairs of variables for dependence
for each u in V do
       for each v in V do
              for up to K sampled assignments do
                    if changing the sampled value of v changes the empirical distribution of u at the next timestep, conditioned on the sampled values of the remaining variables then
                          // u is dependent on v; remove this pair from CSIs
                          CSIs ← CSIs \ {(u, v)}
return CSIs
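A runnable Python sketch of this procedure, under simplifying assumptions of our own (all variables discrete, assignments represented as dicts, and a crude equality-based dependence test); `sample_in_context` and `step_fn` are hypothetical stand-ins for the context sampler and the black-box transition model:

```python
from itertools import product

def approximate_csis(variables, domains, step_fn, sample_in_context, num_samples=50):
    """Return pairs (u, v) with no detected dependence of u's next value on v's current value.

    step_fn maps a joint state-action assignment (dict) to the next-timestep values of the
    state variables (dict); sample_in_context draws a joint assignment that lies in the context.
    """
    csis = set(product(variables, variables))        # start by assuming every pair is independent
    for u in variables:
        for v in variables:
            for _ in range(num_samples):
                assignment = sample_in_context()
                baseline = step_fn(assignment).get(u)           # u's next value under this assignment
                for alt in domains[v]:                          # perturb only v, holding the rest fixed
                    perturbed = dict(assignment, **{v: alt})
                    if step_fn(perturbed).get(u) != baseline:
                        csis.discard((u, v))                    # u depends on v under this context
                        break
                if (u, v) not in csis:
                    break
    return csis
```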

Appendix B Pseudocode: Context Selector Learning

The following pseudocode describes our algorithm for learning a context selector model, given training tasks and their context-specific independences (CSIs). See (§4.2) for discussion.

Input: Training tasks T_1, …, T_N with features φ(T_1), …, φ(T_N)
Input: Black-box transition model P
Input: Set of candidate contexts
Input: All learned CSIs
Returns: Context selector f_θ // neural network with parameters θ
Initialize: Inputs X and targets Y for supervised learning
for each training task T_i do
       C_i ← the candidate context with the highest ScoreContext(T_i, P, ·, CSIs) // see Subroutine below
       append φ(T_i) to X and C_i to Y
// Perform supervised learning (multiclass classification): train f_θ on (X, Y) with cross-entropy loss
return f_θ
Subroutine ScoreContext(T, P, C, CSIs)
       Input: Training task T; black-box transition model P; context C; learned CSIs Returns: A score
       Generate the camp for T under context C using the CSIs // see (§3)
       Plan in the camp to obtain a policy
       return the objective value attained by that policy on T // see Equation 1
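The labeling-plus-classification step above can be sketched compactly in Python; this is illustrative only, with `score_context` standing in for the ScoreContext subroutine and a scikit-learn classifier standing in for the paper's neural network (hidden layer sizes follow Appendix E):

```python
from sklearn.neural_network import MLPClassifier

def learn_context_selector(train_tasks, featurizer, contexts, score_context):
    """Label each training task with its best-scoring context, then fit a classifier."""
    X, y = [], []
    for task in train_tasks:
        scores = [score_context(task, c) for c in contexts]  # objective value per candidate context
        y.append(max(range(len(contexts)), key=lambda i: scores[i]))
        X.append(featurizer(task))
    selector = MLPClassifier(hidden_layer_sizes=(50, 32, 10), max_iter=2000)
    selector.fit(X, y)                                       # multiclass classification over context indices
    return selector
```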
      

Appendix C camp Graphical Example

Figure 5 provides an example of a camp. Note that standard influence diagrams [6] cannot capture context-specific independence, so we use a dotted line in the second panel to denote this concept.

Figure 5: (A) Example of a factored mdp represented as an influence diagram [6]. With no contexts imposed, all variables are relevant. (B) Imposing contexts can induce new independences. In this example, a context involving one of the variables is imposed, inducing an independence between two of the variables (marked in red). Two of the variables become irrelevant under the imposed context; relevant variables are highlighted in blue. Note that relevance is a time-independent property. (C) Dropping the irrelevant variables leads to a camp, an abstraction of the original mdp.

Appendix D Domain and Planner Details

Domain D1: Gridworld. The first domain we consider is a simple maze-style gridworld in which the agent must navigate across rooms to reach a goal location, while avoiding obstacles that stochastically move around at each timestep. The agent has available to it remove(obj) actions, which remove the given obstacle from the world so that the agent can no longer collide with it, but these actions can only be used when the agent is adjacent to the obstacle. Whenever the agent collides with an obstacle, it is placed back in its initial location. Each obstacle remains within a particular room, and so the agent can impose a context of not entering particular rooms, allowing it to ignore the obstacles that are in those rooms, and also not have to consider the action of removing those obstacles. Across tasks, we vary the maze layout. We train on 50 task instances and test on 10 held-out instances.

Planners. We consider the following planners for this domain: Monte Carlo tree search (MCTS), breadth-first graph search with replanning (BFSReplan), and asynchronous value iteration (VI). Both MCTS and BFSReplan are online planners, while VI is offline. As such, VI computes a policy over the full state space, and thus is only tractable in this relatively small (about 100,000 states) domain.

Representations. The features of each task are a top-down image of the maze layout. The state is a vector of the current position and room of each obstacle, the agent, and the goal. The actions are moving up, down, left, right; and removing each obstacle in the environment.

Domain D2: Classical planning. We next consider a deterministic classical planning domain in which an agent must make a meal for dinner, and has three options: to stay within the living room to make ramen, to go to the kitchen to make a sandwich, or to go to the store to buy and prepare a steak. Making any of these terminates the task. The steak gives higher terminal reward than the sandwich, which in turn gives higher terminal reward than the ramen. However, planning to go to the store for steak requires reasoning about many objects that would be irrelevant under the context of staying within the home (for a sandwich or ramen), and planning to go to the kitchen for a sandwich requires reasoning about many objects that would be irrelevant under the context of staying within the living room (for ramen). There is also a timestep penalty, incentivizing the agent to finish quickly. Optimal plans may involve 2, 16, or 22 actions depending on the relative rewards for obtaining the ramen, sandwich, and steak. These rewards are the only thing that varies between task instances; there is thus small variation between task instances relative to the other domains. We train on 20 task instances and test on 25 held-out instances.

Planner. We use an off-the-shelf classical planner (Fast-Downward [17] with the lmcut heuristic). The various rewards are implemented as action costs. As this domain is deterministic, we only run the planner once per task; it is guaranteed to find a reward-maximizing trajectory.

Representations. The features of each task are a vector of the terminal rewards for each meal. The state is a binary vector describing which logical fluents hold true (1) versus false (0). The actions are logical operators described in PDDL, each containing parameters, preconditions, and effects.

We also use this domain as a testbed for additional experiments into the impact of λ (the trade-off parameter in Equation 1) and of the number of training tasks on our method. See (§5.4) and Appendix F.

Domain D3: Robotic navigation among movable obstacles (namo). Illustrated in Figure 1A, this domain has a robot navigating through rooms with the goal of reaching the red object in the upper-right room. Roughly 20 blue obstacles are scattered throughout the rooms, and like in the gridworld, the robot may impose the context of not entering particular rooms; it may also pick up the obstacles and move them out of its way. Across tasks, we vary the positions of all objects. We train on 50 task instances and test on 10 held-out instances. This domain has continuous states and actions, and as such is extremely challenging for planning. Though the obstacles do not move on their own (like they do in the gridworld), the difficulty of this domain stems from the added complexity of needing to reason about geometry and continuous trajectories. We simulate this domain using PyBullet [12]. The reward function is sparse: 1000 if the goal location is reached and 0 otherwise.

Planner. Developing planners for robotic domains with continuous states and actions is an active area of research. For this domain, we use a state-of-the-art task and motion planner [35], which is not specific to namo problems. We use the RRT-Connect algorithm [29] for motion planning and the Fast-Forward PDDL planner [20] for task planning.

Representations. The features of each task are a top-down image of the scene. The state is a vector of the current pose of each object and the robot, and the robot’s current room. The actions are moving the robot base to a target pose, and clearing an object in front of the robot.

Domain D4: Robotic sequential manipulation. Illustrated in Figure 1C, this domain has a robot manipulating the two red objects that start off on the left table to be placed into the bins on the right table. The fifteen blue objects on the left table serve as distractors, with which the robot must be careful not to collide when grasping the red objects; the green objects in the bins indicate that certain bins are already occupied. Across tasks, we vary the positions of all objects, and which bins are occupied by green objects. We also vary the radii of the red objects. We train on 50 task instances and test on 10 held-out instances. We again simulate this domain using PyBullet [12]. As in Domain D3, the reward function is sparse: 1000 if the goal location is reached and 0 otherwise.

Broadly, there are two types of contexts that are useful to impose in this domain. (1) If the robot chooses to constrain its grasp style to only allow top-grasping the red objects, then it need not worry about colliding with the blue objects, and can thus ignore them. However, this does not always work, since not all geometries are amenable to being top-grasped; for instance, sometimes an object’s radius may be too large. Note, however, that to place the red objects into the bins upright, a side-grasp is necessary, and so we provide the robot a regrasp operator in addition to the standard move, pick, and place. Importantly, this regrasp operator is never necessary, but including it can allow the robot to simplify its planning problem by ignoring the blue objects (see Equation 1). (2) If the robot chooses to constrain which bins it will place the red objects into, then it need not worry about the green objects in the other bins, simplifying the planning problem.

Planner. Same as in Domain D3 (namo).

Representations. The features of each task are a vector of the object radii and occupied bins. The state is a vector of the current pose of each object, the grasp style used by the robot, and the current held object (if any). The actions are moving the robot to a target base pose and grasping at a target gripper pose (which requires an empty gripper), and moving the robot to a target base pose and placing at a target placement pose (which requires an object to be currently held).

Appendix E Experimental Details

In all experiments, computational cost is measured in wall-clock time (seconds). We use the following values of λ: 0 for MCTS (since MCTS is an anytime algorithm, we give it a timeout of 0.25 seconds; with λ = 0, the objective then reflects the best returns found within this timeout), 100 for BFSReplan, 250 for Fast-Downward, and 100 for tamp. Every domain uses the same horizon H. Additionally, to ensure that shorter plans are preferred in general, all domains use a discount factor below 1, except for Domain D2, which uses a timestep penalty as previously discussed.

To properly evaluate our objective (Equation 1), we would need to run every method until it completes, which can be extremely slow, e.g. for the pure planning baseline, or when our context selector picks a bad context. To safeguard against this, we impose a timeout of 60 seconds on all planning calls.

All neural networks are either fully connected for vector inputs or convolutional for image inputs. Fully connected networks have hidden layer sizes [50, 32, 10]. Convolutional networks use a convolutional layer with 10 output channels and kernel size 2, followed by a max-pooling layer with stride 2, and then fully connected layers of sizes [32, 10]. Neural networks are trained using the Adam optimizer [26] with a fixed learning rate, until the loss on the training dataset falls below a fixed threshold.
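For concreteness, here is a PyTorch sketch of the convolutional architecture just described; the input channel count, image size, and number of candidate contexts are hypothetical placeholders rather than values from the paper.

```python
import torch.nn as nn

class ContextSelectorCNN(nn.Module):
    """Convolutional context selector: conv (10 channels, kernel 2) -> max-pool (stride 2) -> FC [32, 10]."""
    def __init__(self, in_channels=3, image_size=32, num_contexts=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 10, kernel_size=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
        )
        feat_dim = 10 * ((image_size - 1) // 2) ** 2  # spatial size after conv (k=2) then pooling
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 32), nn.ReLU(),
            nn.Linear(32, 10), nn.ReLU(),
            nn.Linear(10, num_contexts),              # logits over candidate contexts
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```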

To generate the spaces of contexts, we use the method described in (§4.1). In Domains D1, D2, and D3, we consider disjunctive and single-term (a single variable and a single value in its domain) constraints only, while in Domain D4 we also consider conjunctive constraints. All contexts only consider the discrete variables in the domain. The hyperparameters of Appendix A are set per domain: one setting for Domain D1, another for Domain D2, and a third shared by Domains D3 and D4.

Appendix F Performance as a Function of Number of Training Tasks

The following plots illustrate how the objective value (left) and returns (right) accrued by the camp policy vary as a function of the number of training tasks, in Domain D2 (classical planning):

Discussion. As following a policy in the test task requires near-zero computational effort (our neural networks are small enough that inference is very fast), the red curves in both plots are nearly identical. Interestingly, in the regime of fewer training tasks, camp outperforms policy learning, despite policy learning performing better with the full set of 20 tasks. This leads us to believe that in domains where policy learning would perform well when given a lot of data, generating and planning in a camp may be a more viable strategy when data is limited. As shown in the main results, this disparity between camps and policy learning is more dramatic for the other three domains, where task instances are far more varied, so much so that camps sharply outperform policy learning for any reasonable number of training environments that we were able to test.

Appendix G Objective Values for Main Experiments

Method | D1 (Grid), MCTS | D1 (Grid), BFSReplan | D2 (Classical) | D3 (namo) | D4 (Manip)
camp (ours) | 70 (16) | 21 (10) | -286 (9.6) | 896 (63) | 744 (94)
camp ablation | 25 (11) | 0.9 (24) | -308 (52) | 707 (154) | 453 (237)
Pure planning | 6 (5) | -17 (11) | -414 (20) | 242 (385) | 335 (86)
Plan transfer | -7 (0.4) | -6 (15) | -467 (0.02) | 141 (227) | 21 (34)
Policy learning | -3 (4) | -11 (13) | -469 (0.2) | -0.2 (0.01) | -0.2 (0.01)
Task-conditioned | 5 (5) | -2 (11) | -145 (0.4) | -0.3 (0.01) | -0.2 (0.02)
Stilman’s [37] | - | - | - | 826 (36) | -
Table 1: Compilation of test task objective values (standard deviations in parentheses) on all our domains and methods. Objective values are computed using the same values of λ that were used during training. All table entries report an average over 10 independent runs of both context selector training and test task evaluation. Stilman’s algorithm [37] is namo-specific and so is only run on the namo domain.

Table 1 complements Figure 4 in the main text, showing the objective values obtained for all domains, planners, and methods with the values of λ that were used during training. See (§5.3) for further analysis and discussion.

Appendix H Performance with an Offline Planner

The table below shows test task objective values with an offline planner (asynchronous value iteration), in Domain D1 (gridworld).

Method | Objective (SD)
camp (ours) | 25 (3)
camp ablation | 4 (11)
Pure planning | -1 (2)
Plan transfer | -
Policy learning | 7 (0.01)
Task-conditioned | 7 (0.01)

Discussion. This result mirrors the trend found in the main results: camps strongly outperform both pure policy learning and pure planning, for reasons of generalization error and high computational cost, respectively. camp’s reduction of the state space leads to substantial benefits for offline planning, because offline planners find a policy over the entire state space. Nonetheless, camp remains primarily motivated by online planning for robotics, where the continuous states and actions make offline planning completely infeasible in practice.