Online Replanning in Belief Space for Partially Observable Task and Motion Problems

11/11/2019 ∙ by Caelan Reed Garrett, et al. ∙ MIT ∙ NVIDIA

To solve multi-step manipulation tasks in the real world, an autonomous robot must take actions to observe its environment and react to unexpected observations. This may require opening a drawer to observe its contents or moving an object out of the way to examine the space behind it. If the robot fails to detect an important object, it must update its belief about the world and compute a new plan of action. Additionally, a robot that acts noisily will never exactly arrive at a desired state. Still, it is important that the robot adjusts accordingly in order to keep making progress towards achieving the goal. In this work, we present an online planning and execution system for robots faced with these kinds of challenges. Our approach is able to efficiently solve partially observable problems both in simulation and in a real-world kitchen.




I Introduction

Robots acting autonomously in human environments are faced with a variety of challenges. First, they must make both discrete decisions about which object to manipulate and continuous decisions about which motions to execute to achieve a desired interaction. Planning in these large hybrid spaces is the subject of integrated Task and Motion Planning (TAMP) [11, 14, 33, 35, 9, 8]. Second, real-world robot actions are often quite stochastic. Uncertainty in the effects of actions can manifest both locally and globally through effects such as noisy actuation or dropping objects. Third, the robot can only partially observe the world due to occlusions caused by doors, drawers, other objects, and even the robot itself. Thus, the robot must maintain a belief over the locations of entities and intentionally select actions that reduce its uncertainty about the world [15].

This class of problems can be formalized as a hybrid partially observable Markov decision process (POMDP) [13]. Solutions are policies, which map distributions over world states (belief states) to actions. Because solving these problems exactly is intractable [13], we compute a policy online by repeatedly replanning [40, 19], each time solving an approximate, determinized [40, 19] version of the problem using an existing TAMP approach [10]. POMDP planning can be viewed as searching through belief space, the space of belief states, where both motion and perception actions operate on belief states instead of individual states.

Most related prior work has modeled belief space using either discrete [26, 36, 32] or fluent-based [15, 12] abstractions. In contrast, we operate directly on belief distributions by specifying procedures that model observation sampling, visibility checking, and Bayesian belief filtering. This allows us to tackle problems where a continuous component of the state governs the probability of an observation. For example, a movable object at a particular pose might occlude a target object, reducing the probability that it will be detected. By using a particle-based belief representation, we can model multi-modal beliefs that arise when several objects occlude regions of space. During planning, we conservatively approximate the probability of detection by factoring it into a product of conditions on each individual object. This exposes sparse interactions between an observation and the belief about each object’s pose, allowing the planner to identify and remove objects that are likely occluding the target object.

Fig. 1: The robot pulls open a drawer to detect whether the spam object lies within it.

Additionally, we introduce a replanning algorithm that uses past plans to constrain the structure of solutions for the current planning problem. These constraints ensure that future plans retain the discrete structure of prior plans, even if the exact parameter values must be changed due to stochastic execution or new observations. As a result, this ensures that the overall policy is making progress towards achieving the goal. Reusing prior plan structure, which includes any constant values, also reduces the search space of the planner and thus speeds up successive replanning invocations.

We introduce a mechanism that defers binding some plan parameters that are not needed in order to select and execute the first action in the plan. This technique prevents the planner from evaluating expensive sampling procedures each time that it replans. However, we only defer procedures that are likely to succeed, such as motion planners operating in free space. Deferring procedures that are not likely to succeed might cause the planner to find a plan that cannot be executed as intended. Intuitively, this strategy performs the least amount of computation possible to both obtain the next action and ensure it will make progress towards the goal. Finally, we evaluate our algorithms on several simulated tasks, and demonstrate our system running on a real robot acting in a kitchen environment in the accompanying videos.

II Related Work

There is much work that addresses the problem of efficiently solving deterministic, fully-observable TAMP problems [11, 14, 33, 35, 9, 8]. However, only a few of these approaches have been extended to incorporate some level of stochasticity or partial observability [15, 12, 26].

Solving for an optimal, closed-loop policy for even discrete POMDPs is undecidable in the infinite-horizon case [13, 22]. An alternative strategy is to dynamically compute a policy online in response to the current belief, rather than offline for all beliefs, by replanning [19]. One approach to online planning is to use Monte-Carlo sampling [30, 31] to efficiently explore likely outcomes of various actions. These methods have been successfully applied to robotic planning tasks such as grasping in clutter [20], non-prehensile rearrangement [18], and object search [38]. However, the hybrid action space in our application is too high-dimensional for uninformed action sampling to generate useful actions.

Another online planning strategy is to approximate the original stochastic problem as a deterministic problem through the process of determinization [40, 39, 19]. This enables deterministic planners, which are able to efficiently search large spaces, to be applied. Most-likely outcome determinization always assigns the action outcome that has the highest probability. When applied to observation actions, this approach is called maximum likelihood observation (MLO) determinization [28, 27, 12]. However, the approximation fails when the success of a policy depends on some outcome other than the most likely one actually occurring.

There are many approaches for representing and updating a belief, such as joint unscented Kalman filtering [28, 15], factoring the belief into independent distributions per object [12, 16], and maintaining a particle filter, which represents the belief as a set of weighted samples [30, 31, 20, 38]. Many approaches use a different belief representation when planning than when filtering. Several approaches plan on a purely discrete abstraction of the underlying hybrid problem [26, 36, 32]. Other approaches plan using a calculus defined on belief fluents [15, 12], which are logical tests on the underlying belief, such as whether the value of a random variable lies within a given tolerance of a given value with at least a given probability. In contrast, our approach plans directly on probability distributions, where actions update beliefs via proper transition and observation updates.

III Problem Definition

We address hybrid, belief-state Stochastic Shortest Path Problems (SSPPs) [2], a subclass of hybrid POMDPs in which the cost of every action is strictly positive. The robot starts with a prior belief. Its objective is to reach a goal set of beliefs while minimizing the cost it incurs. The robot selects actions according to a policy defined on belief states. We evaluate this policy online by replanning from the current belief state. We approximate the original belief-space SSPP by determinizing its action outcomes (Section V). We formalize each determinized SSPP in the PDDLStream [10] language and solve it using a cost-minimizing PDDLStream planner.

Although our technique is general-purpose, our primary application is partially-observable TAMP in a kitchen environment that contains a single mobile manipulator, counters, cabinets, drawers, and a set of unique, known objects. The robot can observe the world using an RGBD camera that is fixed to the world frame. The camera can detect the set of objects that are visible as well as noisily estimate their poses. The latent world state is given by the robot configuration, door and drawer joint angles, the discrete frame that each object is attached to, and the pose of the object relative to its attached frame. We maintain a factored belief as the product of independent posterior distributions over each variable. In our environment, the robot’s configuration as well as the door and drawer joint angles can be accurately estimated using our perception system [29], so we only maintain a point estimate for these variables. However, there is substantial partial observability when estimating object poses due to occlusions from doors, drawers, other objects, and even the robot. We represent and update our belief over the pose state of each object using particle filtering.
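As a rough illustration of particle filtering over an object pose, the following Python sketch reweights a set of weighted particles after a detection or a missed detection. It is a minimal sketch under stated assumptions, not the system's implementation: poses are one-dimensional, and the names `update_pose_belief` and `is_visible` as well as the parameter values `p_false_negative` and `noise_std` are hypothetical.

```python
import math


def update_pose_belief(particles, detected, observed_pose=None,
                       is_visible=lambda pose: True,
                       p_false_negative=0.1, noise_std=0.05):
    """Bayesian update of a weighted-particle belief over a 1-D object pose.

    particles: list of (pose, weight) pairs.
    detected: whether the camera reported the object.
    observed_pose: noisy pose estimate, required when detected is True.
    is_visible: hypothetical occlusion test for a candidate pose.
    """
    updated = []
    for pose, weight in particles:
        # Assumption: an occluded pose is never detected (no false positives).
        p_detect = (1.0 - p_false_negative) if is_visible(pose) else 0.0
        if detected:
            # Unnormalized Gaussian likelihood of the pose measurement.
            error = (observed_pose - pose) / noise_std
            likelihood = p_detect * math.exp(-0.5 * error * error)
        else:
            likelihood = 1.0 - p_detect
        updated.append((pose, weight * likelihood))
    total = sum(weight for _, weight in updated)
    if total <= 0.0:
        raise ValueError("observation has zero likelihood under this belief")
    return [(pose, weight / total) for pose, weight in updated]


# Two hypotheses: the object is in plain view (0.0) or hidden (1.0).
prior = [(0.0, 0.5), (1.0, 0.5)]
hidden_test = lambda pose: pose < 0.5  # only poses below 0.5 are visible
missed = update_pose_belief(prior, detected=False, is_visible=hidden_test)
seen = update_pose_belief(prior, detected=True, observed_pose=0.0,
                          is_visible=hidden_test)
```

In this toy case, failing to detect the object shifts most of the weight to the hidden hypothesis, while a detection at pose 0.0 removes the hidden hypothesis entirely.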

IV PDDLStream Formulation

We use the PDDLStream [10] planning formalism to model and solve determinized, hybrid belief-state SSPPs. PDDLStream is an extension of the Planning Domain Definition Language (PDDL) [23] that adds the ability to programmatically declare procedures for sampling values of continuous variables in the form of streams.

PDDLStream uses predicate logic to describe planning problems. An evaluation of a predicate for a given set of arguments is called a literal. A fact is a true literal. Static literals always remain constant, but fluent literals can change truth value as actions are applied. States are represented as a set of fluent literals. Our domain makes use of the following fluent predicates: (AtConf ?r ?q) states that robot part ?r (the base or arm) is at configuration ?q; (AtAngle ?j ?a) states that a door or drawer ?j is at joint angle ?a; (HandEmpty) indicates that the robot’s end-effector is empty; (AtGrasp ?o ?g) states that object ?o is attached to the end-effector using grasp ?g; (AtPoseB ?o ?pb) states that object ?o is at pose ?pb.

An action schema is specified by a set of free parameters (:param), a precondition formula (:pre) that must hold in a state in order to execute the action, and a conjunctive effect formula (:eff) that describes the changes to the state. Effect formulas may set a fluent fact to be true, set a fluent fact to be false (not), or increase the plan cost (incr) [7]. For example, consider the following action descriptions for move and pick. Other actions such as place, pull, push and press button can be defined similarly to pick. We used universally quantified conditional effects [25] (omitted here for clarity) to update the world poses of objects placed in drawers for pull and push actions.

(:action move
 :param (?r ?q1 ?t ?q2)
 :pre (and (Motion ?r ?q1 ?t ?q2) (AtConf ?r ?q1))
 :eff (and (AtConf ?r ?q2) (not (AtConf ?r ?q1))))
(:action pick
 :param (?o ?pb ?g ?bq ?aq)
 :pre (and (Kin ?o ?pb ?g ?bq ?aq) (AtPoseB ?o ?pb)
  (HandEmpty) (AtConf base ?bq) (AtConf arm ?aq))
 :eff (and (AtGrasp ?o ?g)
  (not (AtPoseB ?o ?pb)) (not (HandEmpty))))

The novel representational aspect of PDDLStream is streams: functions from a set of input values (:inp) that enumerate a possibly infinitely-long sequence of output values (:out). Streams have a declarative component that specifies the arity of input and output values as well as a domain formula (:dom) that governs legal inputs and a conjunctive certified formula (:cert) that expresses static facts that all input-output pairs are guaranteed to satisfy. Additionally, streams have a programmatic component that implements the procedure in a programming language such as Python. For example, the inv-kin stream takes in a tuple of values specifying an object ?o, its pose ?pb, a grasp ?g, and a robot base configuration ?bq. Using an inverse kinematics solver, it generates robot arm configurations ?aq that satisfy the Kin relationship: if the base and arm were at those configurations and holding the object in the specified grasp, then the object would be at the specified pose. The motion stream performs motion planning, certifying the static Motion precondition of the move action.

(:stream inv-kin
 :inp (?o ?pb ?g ?bq)
 :dom (and (Conf base ?bq)
  (PoseB ?o ?pb) (Grasp ?o ?g))
 :out (?aq)
 :cert (and (Conf arm ?aq)
  (Kin ?o ?pb ?g ?bq ?aq)))
(:stream motion
 :inp (?r ?q1 ?q2)
 :dom (and (Conf ?r ?q1)
           (Conf ?r ?q2))
 :out (?t)
 :cert (and (Traj ?r ?t)
  (Motion ?r ?q1 ?t ?q2)))
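The programmatic component of a stream can be pictured as a Python generator that lazily enumerates output values for given inputs. The sketch below is only an illustration under stated assumptions: it replaces the robot's inverse kinematics solver with the closed-form solution for a hypothetical planar two-link arm, and the names `inv_kin_stream` and `forward_kin` are not part of PDDLStream.

```python
import math


def forward_kin(q1, q2, l1=1.0, l2=1.0):
    """End-effector position of a planar two-link arm (toy stand-in)."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y


def inv_kin_stream(x, y, l1=1.0, l2=1.0):
    """Yield arm configurations (q1, q2) that reach the target (x, y).

    Like a stream, it maps input values to a sequence of output values;
    an unreachable target produces an empty sequence.
    """
    cos_q2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(cos_q2) > 1.0:
        return  # no kinematic solution exists
    for sign in (1.0, -1.0):  # elbow-down and elbow-up solutions
        q2 = sign * math.acos(cos_q2)
        q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2),
                                           l1 + l2 * math.cos(q2))
        yield (q1, q2)


solutions = list(inv_kin_stream(1.0, 1.0))
unreachable = list(inv_kin_stream(3.0, 0.0))
```

Each yielded configuration plays the role of an output ?aq whose certified fact holds by construction, which can be checked here by passing it back through `forward_kin`.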

IV-A Modeling Observations

In order to enable deliberate information gathering, we model the ability for the robot to perform a sensing action, receive an observation, and update its belief using the detect action. The detect action is parameterized by an object ?o, a prior pose belief ?pb1, an observation ?obs, and a posterior belief ?pb2. Thus, (AtPoseB ?o ?pb) now states that object ?o has the current pose belief ?pb. By the BeliefUpdate precondition, these four values must represent a valid Bayesian update. If the observation ?obs is not BOccluded by another object, detect updates the current pose belief for ?o.

(:action detect
 :param (?o ?pb1 ?obs ?pb2)
 :pre (and (BeliefUpdate ?o ?pb1 ?obs ?pb2)
  (AtPoseB ?o ?pb1) (not (BOccluded ?o ?obs)))
 :eff (and (AtPoseB ?o ?pb2) (not (AtPoseB ?o ?pb1))
  (incr (total-cost) (ObsCost ?o ?pb1 ?obs))))

The sample-obs stream samples from the set of possible observations given pose belief ?pb. We sample observations according to their likelihood under ?pb in order to prioritize likely, and thus low-cost, observations. The test-vis stream tests whether object ?o2 at pose belief ?pb2 prevents observation ?obs with probability exceeding a threshold described in Section V; it certifies BVis when this is not the case.

(:stream sample-obs
 :inp (?o ?pb)
 :dom (PoseB ?o ?pb)
 :out (?obs)
 :cert (Obs ?o ?obs))
(:stream test-vis
 :inp (?o1 ?obs ?o2 ?pb2)
 :dom (and (Obs ?o1 ?obs)
  (PoseB ?o2 ?pb2))
 :cert (BVis ?o1 ?obs
             ?o2 ?pb2))

The update-belief stream computes the posterior pose belief ?pb2 that results from updating prior pose belief ?pb1 with observation ?obs. Although observations are stochastic, the belief update process is deterministic.

(:stream update-belief
 :inp (?o ?pb1 ?obs)
 :dom (and (PoseB ?o ?pb1) (Obs ?o ?obs))
 :out (?pb2)
 :cert (and (PoseB ?o ?pb2)
  (BeliefUpdate ?o ?pb1 ?obs ?pb2)))

Finally, we specify BOccluded as a derived predicate [6, 34], a logical formula defined on the state. BOccluded is true if there exists another object ?o2 currently at pose belief ?pb2 that prevents observation ?obs from being received with high probability.

(:derived (BOccluded ?o ?obs)
 (exists (?o2 ?pb2)
  (and (Obs ?o ?obs) (AtPoseB ?o2 ?pb2)
   (not (= ?o ?o2)) (not (BVis ?o ?obs ?o2 ?pb2)))))
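To make the semantics of the derived predicate concrete, here is a minimal Python sketch that evaluates it on a state. The data layout is an assumption made for illustration: `at_pose_b` maps each object to its current pose belief, and `b_vis` is the set of certified BVis facts stored as tuples.

```python
def b_occluded(obj, obs, at_pose_b, b_vis):
    """Return True if some other object lacks a BVis fact for this observation.

    at_pose_b: dict mapping object name -> current pose belief name.
    b_vis: set of (object, observation, other object, other belief) tuples.
    """
    return any(
        other != obj and (obj, obs, other, belief) not in b_vis
        for other, belief in at_pose_b.items()
    )


# Hypothetical state: the spam is the target; sugar and crackers may occlude.
state = {"spam": "pb_spam", "sugar": "pb_sugar", "crackers": "pb_crackers"}
visible_facts = {("spam", "obs1", "sugar", "pb_sugar")}

blocked = b_occluded("spam", "obs1", state, visible_facts)
visible_facts.add(("spam", "obs1", "crackers", "pb_crackers"))
clear = b_occluded("spam", "obs1", state, visible_facts)
```

The existential quantifier means a single occluder without a BVis fact is enough to block detect, so the planner must address each such object individually.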

V Determinized Observation Costs

We are interested in enabling a deterministic planner to perform approximate probabilistic reasoning by minimizing plan costs. The maximum acceptable risk can always be specified using a user-provided maximum expected cost. We focus on computing ObsCost, the cost of detect, which is a function of the prior pose belief ?pb1 and the observation ?obs. Similar analysis can be applied to other probabilistic conditions, such as collision checks.

(:function (ObsCost ?o ?pb ?obs)
 :dom (and (PoseB ?o ?pb) (Obs ?o ?obs)))

Self-Loop Determinization. The widely-used most-likely-outcome and all-outcome determinization schemes do not provide a natural way of integrating the cost of an action and the probability of its intended outcome [3, 15]. Thus, we instead use self-loop determinization [17, 19], which approximates the original SSPP as a simplified self-loop SSPP. In a self-loop SSPP, an action $a$ executed from state $s$ may result in only two possible states: a new state $s'$ or the current state $s$. For this simple class of SSPPs, a planner can obtain an optimal policy by optimally solving a deterministic problem with transformed action costs. Let $c_a$ be the cost of $a$ upon a successful transition, which occurs with probability $p_a$, and let $c'_a$ be the cost of $a$ upon a failed (self-loop) transition. The determinized cost of action $a$ is then

$$\hat{c}_a = c_a + \left(\frac{1}{p_a} - 1\right) c'_a.$$

We directly model our domain as a self-loop SSPP by specifying an upper bound on the expected cost of a successful outcome, an upper bound on the expected recovery cost of returning to the original state (i.e., the self-loop transition), and a lower bound on the probability of a successful outcome.
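The transformed cost can be checked numerically. Under the self-loop model, the expected cost C of repeating an action until it succeeds satisfies C = p·c + (1 − p)·(c′ + C), which rearranges to C = c + (1/p − 1)·c′. The sketch below computes this closed form and compares it with a fixed-point iteration of the recursion; the function and argument names are hypothetical.

```python
def self_loop_cost(success_cost, failure_cost, success_prob):
    """Expected cost of retrying an action until it succeeds (closed form)."""
    if not 0.0 < success_prob <= 1.0:
        raise ValueError("success probability must lie in (0, 1]")
    return success_cost + (1.0 / success_prob - 1.0) * failure_cost


def self_loop_cost_by_iteration(success_cost, failure_cost, success_prob,
                                iterations=200):
    """Iterate C <- p*c + (1 - p)*(c' + C) as an independent check."""
    value = 0.0
    for _ in range(iterations):
        value = (success_prob * success_cost
                 + (1.0 - success_prob) * (failure_cost + value))
    return value


closed_form = self_loop_cost(1.0, 1.0, 0.5)
iterated = self_loop_cost_by_iteration(1.0, 1.0, 0.5)
```

With unit costs and a success probability of one half, both computations give an expected cost of 2, and a success probability of 1 leaves the original cost unchanged. A deterministic planner that minimizes this cost therefore prefers actions that are both cheap and likely to succeed.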

Computing the Likelihood of an Observation. Suppose there are unique objects in the world, and we are interested in detecting object . Let be the latent continuous pose random variable for an object , and let be a value of . As shorthand, define to be a tuple of latent poses for each of the objects except for object . Let be a probability density over , which in our application, is represented by a set of weighted particles. Let and be observed Bernoulli random variables for whether object is visible and is detected. When is true, let be a continuous random variable for the observed pose of object . Otherwise, is undefined. For detection, we will assume that where is the probability of a false negative. We will conservatively use zero as a lower bound for the probability of a false positive, i.e. , which removes false detection terms. For pose observations, we will assume a multivariate Gaussian noise model . We are interested in , the probability of observing a pose for object .

The key component of this expression is , the probability that is currently visible, which is contingent on the poses of the other objects . Define as a deterministic function that is if object at pose blocks object from being visible at pose and otherwise is . Ultimately, will be a component of the cost function ObsCost and thus must only depend on pose belief ?pb1 and observation ?obs. However, it is currently still dependent on the current beliefs for each of the other objects all at once. While we could instead parameterize ObsCost using the pose belief of all objects, it would be combinatorially difficult to instantiate as increases. And due to its unfactored form, we will not be able to benefit from efficient deterministic search strategies that leverage factoring. Thus, we marginalize out , which ties the objects together, by taking the worst-case probability of visibility due to object over a subset of states .


As a result, we can provide a non-trivial lower bound for that no longer depends on . Suppose satisfies , then


Inequality 4 follows from the fact that some combinations of would result in object collision and thus are not possible. Finally, this gives us the following lower bound for :


This probability depends on both and . Ideally, we would select and that maximize equation 7; however, this would require operating on all of the objects at once. Instead, we let the planner select . However, detect can only be applied at this cost if , which is enforced through BOccluded quantifying over each BVis condition. The choice of presents a trade-off because the prior probability increases as grows but each decreases. In practice, we sample points and take to be a -neighborhood of , capturing a local region where we anticipate observing object .
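The factored bound described above can be sketched as follows: for each potential occluder, take the worst-case probability that it leaves the target visible over a region of candidate target poses, then multiply these per-object terms. This is a toy one-dimensional illustration, not the paper's exact expression; the `blocks` test, the particle beliefs, and the function names are all assumptions.

```python
def worst_case_visibility(region, occluder_belief, blocks):
    """Minimum over target poses of P(the occluder does not block that pose).

    region: iterable of candidate target poses.
    occluder_belief: list of (pose, weight) particles for one occluder.
    blocks: hypothetical test blocks(target_pose, occluder_pose) -> bool.
    """
    return min(
        sum(weight for pose, weight in occluder_belief
            if not blocks(target, pose))
        for target in region
    )


def visibility_lower_bound(region, occluder_beliefs, blocks):
    """Product of per-occluder worst-case visibility probabilities."""
    bound = 1.0
    for belief in occluder_beliefs.values():
        bound *= worst_case_visibility(region, belief, blocks)
    return bound


# Toy 1-D scene: an occluder blocks targets within 0.5 of its own pose.
blocks = lambda target, occluder: abs(target - occluder) < 0.5
beliefs = {
    "A": [(0.0, 0.5), (5.0, 0.5)],  # A is near 0 with probability 0.5
    "B": [(10.0, 1.0)],             # B is certainly at 10
}
small_region = visibility_lower_bound([0.0, 0.2], beliefs, blocks)
large_region = visibility_lower_bound([0.0, 10.0], beliefs, blocks)
```

In this example the small region near 0 is affected only by object A, while enlarging the region to include pose 10 drives the bound to zero because B certainly blocks that pose. This mirrors the trade-off in choosing how large a region to consider.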

Observation Example. Consider the scenario in Fig. 2 with objects A, B, C, and D. Suppose that the object poses for A, B, and C are perfectly known, but object D is equally believed to be either at pose or (but not ). First, note that for all choices because object obstructs , object obstructs , and object obstructs , all with probability one. If we take , then , meaning all three objects must be moved before applying detect, despite the fact that . If we take then but , indicating that does not need to be moved. Finally, if we take then only . Intuitively, this shows that selecting to be a small, local region improves sparsity with respect to which objects likely affect a particular observation under our bound.

Fig. 2: An example detection scenario where object D is believed to be either behind object A or object C with equal probability.

VI Online Replanning

Now that we have incorporated probabilistic reasoning into our deterministic planner, we induce a policy by replanning after executing each action. However, done naively, replanning can result in a policy that never reaches the goal set of beliefs. This is true even when replanning in a deterministic problem. For example, consider a deterministic, observable planning problem where the goal is for the robot to hold object A. The first plan the robot finds might require moving its base, moving its arm, and finally picking object A:


Suppose the robot executes the first move action, arrives at the intended base configuration, and replans to obtain a new plan.

While this is a satisfactory solution when solving for a single plan in isolation, it is not desirable when generating the next plan because it requires another base movement action despite the robot having just executed one. This process could repeat indefinitely, causing the robot to never reach its goal despite never failing to find a plan. For a deterministic problem, this can be prevented by simply executing the first plan all at once. However, in a stochastic environment where, for example, base movements are imprecise, executing the full plan open loop will almost always fail. Thus, we must replan using the base pose that we actually reach instead of the one we intended to reach.

Intuitively, we need to enforce that some amount of overall progress is obtained when replanning after each action. One way to do this is to impose a constraint on the length or cost of future plans that converges to zero after a finite number of replanning iterations. For length, this constraint could be that the next plan must have at least one fewer action than the previous plan. If each action has positive probability of successful execution and the domain is dead-end free, then this strategy will achieve the goal with probability 1.

While this strategy ensures that the robot almost certainly reaches the goal, it incurs a significant computational cost because the robot plans from scratch on each iteration. However, while some of the values in the previous plan may change, if modeled correctly, its overall structure likely will not. Thus, one way to speed up each search is to additionally constrain the next plan to adhere to the same structure as the previous plan. To do this, we first identify all action arguments that are constants, meaning that they are valid quantities in subsequent problems. These include the names of objects and grasps for objects but not poses or motion plans, which are conditioned on the most recent observations of the world. We replace each use of a non-constant with a unique free parameter symbol (denoted by the prefix @). The example given in equation 8 results in the following plan structure after executing its first action.

Algorithm 1 gives the pseudocode for our online replanning policy. The inputs to Policy are the prior belief, the goal set of beliefs, and the maximum cost. Policy maintains a set of previously proven facts as well as the tail of the previous plan. On each iteration of the while-loop, the procedure Determinize first models the belief SSPP as a deterministic planning problem with an initial state, a goal set of states, and a set of actions. If the prior plan exists, Policy applies the plan constraints using the ConstrainPlan procedure described in Algorithm 2. If the PDDLStream planner Plan is unable to solve the constrained problem within a user-provided timeout, the constraints are removed, and planning is reattempted. If successful, Plan returns not only a plan but also the certified facts within the preimage of the plan that prove that it is a solution. Then, Policy executes the first action of the plan, receives an observation, and updates its current belief. Finally, it extracts the subset of constant facts among the certified facts, i.e., static facts that only involve constants, and sets the stored plan tail to be the remainder of the plan that was not executed.

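procedure Policy(prior belief, goal beliefs, maximum cost)
    facts ← ∅; previous plan tail ← None; current belief ← prior belief
    while True do
        problem ← Determinize(current belief, goal beliefs)
        plan ← None
        if previous plan tail ≠ None then    ▷ Reuse plan constraints
            plan, certified facts ← Plan(ConstrainPlan(problem, previous plan tail), facts, maximum cost)
        if plan = None then    ▷ No plan constraints
            plan, certified facts ← Plan(problem, facts, maximum cost)
        if plan = None then    ▷ No plan with cost below the maximum cost
            return False
        if plan is empty then    ▷ Reached goal belief
            return True
        observation ← execute the first action of plan    ▷ Receive observation
        current belief ← update of current belief given the action and observation
        facts ← constant facts among certified facts
        previous plan tail ← remainder of plan
Algorithm 1 Online Replanning Policy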
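The control flow of the replanning loop can be sketched in Python with the planner, executor, and belief update passed in as callables. This is a simplified sketch under stated assumptions: it omits the reuse of certified facts and the cost bound, and the toy domain (a robot taking unit steps toward position 3, where a step sometimes fails) is invented purely to exercise the loop.

```python
def run_policy(belief, plan, execute, update_belief, max_steps=50):
    """Replan after every action, first trying to reuse the previous plan tail.

    plan(belief, previous_tail) returns a list of actions, or None on failure;
    an empty list means the goal is already satisfied.
    """
    previous_tail = None
    for _ in range(max_steps):
        actions = None
        if previous_tail is not None:
            actions = plan(belief, previous_tail)  # reuse plan constraints
        if actions is None:
            actions = plan(belief, None)  # fall back to no plan constraints
        if actions is None:
            return False  # no plan found
        if not actions:
            return True  # reached the goal
        observation = execute(actions[0])
        belief = update_belief(belief, actions[0], observation)
        previous_tail = actions[1:]
    return False


GOAL = 3


def toy_plan(position, previous_tail):
    """Plan unit steps to GOAL; a constrained plan may not exceed the tail."""
    steps = ["step"] * max(0, GOAL - position)
    if previous_tail is not None and len(steps) > len(previous_tail):
        return None
    return steps


outcomes = iter([True, False, True, True])  # the second step fails
executed = []


def toy_execute(action):
    executed.append(action)
    return next(outcomes)


def toy_update(position, action, moved):
    return position + 1 if moved else position


reached = run_policy(0, toy_plan, toy_execute, toy_update)
```

In the toy run, the failed step makes the constrained problem unsolvable on the next iteration, so the loop drops the constraints and replans, which is the fallback described for Algorithm 1.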

Algorithm 2 gives the pseudocode for the constraint transformation. It adds a new set of action schemas, each of which has modified preconditions and effects, for every action on the previous plan. A total-ordering fact enforces that each action be applied only after the action that precedes it on the previous plan. For each argument of an action, if the argument is a constant, the new action is forced to use the same value. For each free parameter, one fact is true if the parameter symbol has already been assigned to some value in the action sequence. If so, a second fact is true if the free parameter has been assigned to a particular new value. Each free parameter must either be unbound or assigned to action argument ?p.

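procedure ConstrainPlan(actions, previous plan tail)
    for each action on the previous plan tail, in order do
        create a new action schema from the action's original schema
        if the action is not the first one then    ▷ Total ordering constraint
            require that the preceding new action has already been applied
        for each argument of the action do
            if IsConstant(argument) then    ▷ Enforce the same value
                fix the corresponding parameter ?p to the argument
            else
                require that the argument's free parameter is unbound or assigned to ?p
                record that the free parameter is now assigned to ?p
    return the new action schemas
Algorithm 2 Plan Constraint Compilation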
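The effect of these constraints can be illustrated outside the planner. The Python sketch below first abstracts a plan into a skeleton by replacing non-constant arguments with @-prefixed free parameters, then checks whether a new plan respects the skeleton's ordering, constants, and consistent parameter bindings. It is a post-hoc checker rather than the compilation into action schemas, and the example plan and symbol names are hypothetical.

```python
def abstract_plan(plan, constants):
    """Replace each non-constant argument with a shared free parameter."""
    symbols = {}
    skeleton = []
    for name, args in plan:
        new_args = []
        for value in args:
            if value in constants:
                new_args.append(value)
            else:
                # Reuse the same symbol wherever the same value reappears.
                symbols.setdefault(value, "@p{}".format(len(symbols)))
                new_args.append(symbols[value])
        skeleton.append((name, tuple(new_args)))
    return skeleton


def satisfies_skeleton(plan, skeleton):
    """Check action order, constant arguments, and consistent bindings."""
    if len(plan) != len(skeleton):
        return False
    bindings = {}
    for (name, args), (skel_name, skel_args) in zip(plan, skeleton):
        if name != skel_name or len(args) != len(skel_args):
            return False
        for value, symbol in zip(args, skel_args):
            if symbol.startswith("@"):
                # A free parameter is either unbound or bound to this value.
                if bindings.setdefault(symbol, value) != value:
                    return False
            elif symbol != value:
                return False
    return True


old_tail = [("move", ("arm", "q1", "t1", "q2")),
            ("pick", ("A", "p0", "g0", "bq", "q2"))]
skeleton = abstract_plan(old_tail, constants={"arm", "A", "g0"})

good = [("move", ("arm", "q5", "t7", "q6")),
        ("pick", ("A", "p9", "g0", "bq2", "q6"))]
bad = [("move", ("arm", "q5", "t7", "q6")),
       ("pick", ("A", "p9", "g0", "bq2", "q8"))]
```

Here `good` changes every continuous value yet keeps the structure, whereas `bad` ends the arm motion at one configuration and picks from another, violating the shared free parameter.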

VII Deferred Stream Evaluation

We use the Focused algorithm [10] to solve each determinized PDDLStream problem. The Focused algorithm lazily plans using optimistic, hypothetical stream output values before actually calling any stream procedures. As a result, it not only generates candidate action plans but also stream plans, which consist of a sequence of stream evaluations that optimistically might bind the free parameters on the action plan. Then, it calls the corresponding procedures for each stream on the stream plan to test the action plan’s validity. For example, consider the following possible stream plan that supports the action plan given in equation 8:


Normally, the Focused algorithm would not terminate until it has successfully bound all the free parameters on an action plan. As a result, it would recompute the motion stream for every move action on its plan per replanning invocation, spending a significant amount of computation constructing motion plans that will never be used. An alternative strategy is to defer evaluation of these expensive streams if they are not required before we anticipate replanning. For the example stream plan above, the inv-kin and motion(arm,…) streams could both be deferred because the first action they are used by is the move(arm, …) action.

However, it may not always be advantageous to defer computation of some streams. For instance, it might be the case that the initial pose, sampled grasp, and sampled base configuration do not admit the arm kinematic solution (Kin) required for a pick. Rather than move to the sampled base configuration before discovering this, it would be more efficient to infer this at the start and sample a new grasp or base configuration. Thus, we only defer the evaluation of streams that are both expensive and likely to succeed. In our domain, this corresponds to just the motion streams, which almost always succeed if the initial and final configurations are not in collision.
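The deferral rule can be sketched as a pass over a stream plan that evaluates a stream unless it is marked deferrable and none of its outputs are needed by the first action. This is a minimal sketch under stated assumptions: optimistic values are written with a `#` prefix, the stream plan is a list of (name, inputs, outputs) tuples, and the stub procedures stand in for real samplers.

```python
def evaluate_stream_plan(stream_plan, first_action_args, procedures,
                         deferrable):
    """Bind optimistic values, skipping deferrable streams not needed yet.

    Returns (bindings, deferred), or None if an evaluated stream fails.
    """
    bindings = {}
    deferred = []
    for name, inputs, outputs in stream_plan:
        needed_now = any(out in first_action_args for out in outputs)
        missing_input = any(inp.startswith("#") and inp not in bindings
                            for inp in inputs)
        # A stream whose input comes from a deferred stream must wait too.
        if missing_input or (name in deferrable and not needed_now):
            deferred.append((name, inputs))
            continue
        values = procedures[name](*[bindings.get(inp, inp) for inp in inputs])
        if values is None:
            return None  # the candidate plan cannot be executed as intended
        bindings.update(zip(outputs, values))
    return bindings, deferred


procedures = {
    "inv-kin": lambda o, pb, g, bq: (("aq", o, g),),
    "motion": lambda r, q1, q2: (("traj", r, q1, q2),),
}
stream_plan = [
    ("motion", ("base", "bq0", "bq1"), ("#t1",)),
    ("inv-kin", ("A", "pb0", "g0", "bq1"), ("#aq",)),
    ("motion", ("arm", "aq0", "#aq"), ("#t2",)),
]
first_action_args = ("base", "bq0", "#t1", "bq1")  # move(base, ...)
bindings, deferred = evaluate_stream_plan(
    stream_plan, first_action_args, procedures, deferrable={"motion"})

failing = dict(procedures)
failing["inv-kin"] = lambda o, pb, g, bq: None  # no kinematic solution
rejected = evaluate_stream_plan(
    stream_plan, first_action_args, failing, deferrable={"motion"})
```

The base motion is evaluated because the first action needs its trajectory, inverse kinematics is evaluated eagerly because it may fail, and only the arm motion is deferred. When the inverse kinematics stub fails, the whole candidate is rejected before the robot moves.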

VIII Experiments

We experimented on ten randomly generated problems within four partially-observable domains. We used PyBullet [5] for ray-casting (visibility checking) and collision checking, and TRAC-IK [1] for inverse kinematics. Our planner was implemented in Python. We experimented with three policies: using deferred streams, using plan constraints, and using both plan constraints and deferred streams. Each policy was limited to 10 minutes of planning time. For switch drawers, the block starts in one drawer, but the goal is to believe it is in the other drawer. The robot’s pose prior is uniform over both drawers. Successful policies typically inspect the goal drawer, fail to observe the block, and then are forced to retrieve it from the other drawer. This requires placing the block in an intermediate location in order to close one drawer and open the other. See the appendix for a description of the other tasks. Table I shows the results of the experiments. Applying plan constraints and deferring streams together gives the highest success rate and the lowest total planning time on every task.

Fig. 3: Left: the particle-filter pose belief for the green block after one observation. Green particles have high weight and black particles have low weight. Right: the robot must remove the sugar to place the block and close the drawer.
Task                         Deferred       Constraints    Both
                             %     t (s)    %     t (s)    %     t (s)
Stow Block (Fig. 3 right)    100   100      100   186      100   89
Inspect Drawer (Fig. 1)      83    115      100   39       100   18
Switch Drawers               60    311      60    492      80    266
Cook Block (Fig. 3 left)     20    586      67    450      80    268
TABLE I: The success rate (%) and mean total planning time in seconds (t) over 10 generated problems per task.

We applied our planner to real-world kitchen manipulation tasks performed by a Franka Emika Panda robot arm. A fixed Microsoft Kinect V1 records RGB and depth data. Videos of the tasks accompany the paper. The stow spam, inspect drawer, switch drawers, and cook spam videos show the robot solving the tasks in Table I. We used PoseCNN [37] to detect several YCB objects [4] in the scene and DeepIM [21] to refine the estimates of their poses. Finally, we used DART [29] to track the robot arm, door and drawer joint angles, and the detected objects. The integrated system was described in prior work [24].

IX Conclusions

We presented a replanning system for acting in partially-observable domains. By planning directly on beliefs, the planner can approximately compute the likelihood of detection given each movable object pose belief. Through plan structure constraints, we ensure our replanning policy makes progress towards the goal. And by deferring expensive stream evaluations, we enable replanning to be performed efficiently.


For each simulated experiment in Table I, the goal condition, the prior belief, the latent initial state, and a successful execution trace are listed as follows.

IX-1 Stow Block

The goal is for the green block to be in the top drawer and for the top drawer to be closed. The prior for the green block is uniform over the counter, and the prior for the sugar box is uniform over the top drawer. The green block is initially on the counter, and the sugar box is initially in the top drawer. Successful policies remove the sugar box from the top drawer (in order to close the top drawer), stow the green block in the top drawer, and close the top drawer. The robot automatically infers that it must move the sugar box, but not the green block, before closing the top drawer, as otherwise the tall sugar box would collide with the cabinet.

IX-2 Inspect Drawer

The goal is for the green block to be in the bottom drawer and for the bottom drawer to be closed. The prior for the green block is uniform over both drawers. The green block is initially in the bottom drawer. Successful policies open the bottom drawer, detect the green block, and then close the bottom drawer. The robot intentionally opens the bottom drawer, undoing one of its goals, in order to attempt to localize the green block. Afterwards, it must reachieve this goal by closing the bottom drawer.

IX-3 Switch Drawers

The goal and prior are the same as in inspect drawer. However, the green block is instead initially in the top drawer. Successful policies open the bottom drawer, fail to detect the green block, and close the bottom drawer in order to open the top drawer. Then, they detect the green block, pick up the green block, temporarily place the green block on the counter, close the top drawer, open the bottom drawer, stow the green block in the bottom drawer, and finally close the bottom drawer. The robot must update its belief upon failing to detect the green block and plan to investigate the other drawer.

Ix-4 Cook Block

The goal is for the green block to be cooked. The prior for the green block is uniform over the counter. A cracker box and a sugar box are initially on the counter, one of which always occludes the green block at its initial pose. Successful policies move the cracker box and/or the sugar box out of the way until the green block is detected. Then, they place the green block on the stove, press the stove’s button to turn it on (which cooks the green block), and press the stove’s button to turn it off. Depending on the initial pose of the green block and the robot’s first manipulation action, the robot might need to inspect behind one or both of the occluding objects in order to localize the green block.


  • [1] P. Beeson and B. Ames (2015-11) TRAC-IK: An Open-Source Library for Improved Solving of Generic Inverse Kinematics. In Proceedings of the IEEE RAS Humanoids Conference, Seoul, Korea. Cited by: §VIII.
  • [2] D. P. Bertsekas and J. N. Tsitsiklis (1991) An analysis of stochastic shortest path problems. Mathematics of Operations Research 16 (3), pp. 580–595. Cited by: §III.
  • [3] A. L. Blum and J. C. Langford (1999) Probabilistic planning in the graphplan framework. In European Conference on Planning, pp. 319–332. Cited by: §V.
  • [4] B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and A. M. Dollar (2015) The YCB object and model set: Towards common benchmarks for manipulation research. In 2015 International Conference on Advanced Robotics (ICAR), pp. 510–517. Cited by: §VIII.
  • [5] E. Coumans and Y. Bai PyBullet, a Python module for physics simulation for games, robotics and machine learning. Cited by: §VIII.
  • [6] S. Edelkamp (2004) PDDL2.2: The language for the classical part of the 4th international planning competition. 4th International Planning Competition (IPC’04), at ICAPS’04. Cited by: §IV-A.
  • [7] M. Fox and D. Long (2003) PDDL2.1: An extension to PDDL for expressing temporal planning domains. Journal of Artificial Intelligence Research (JAIR) 20. Cited by: §IV.
  • [8] C. R. Garrett, T. Lozano-Pérez, and L. P. Kaelbling (2018) Sampling-based methods for factored task and motion planning. The International Journal of Robotics Research 37 (13-14). ISSN 17413176. Cited by: §I, §II.
  • [9] C. R. Garrett, T. Lozano-Pérez, and L. P. Kaelbling (2017) FFRob: Leveraging symbolic planning for efficient task and motion planning. The International Journal of Robotics Research. Cited by: §I, §II.
  • [10] C. R. Garrett, T. Lozano-Pérez, and L. P. Kaelbling (2018) STRIPStream: Integrating Symbolic Planners and Blackbox Samplers. arXiv preprint arXiv:1802.08705. Cited by: §I, §III, §IV, §VII.
  • [11] F. Gravot, S. Cambon, and R. Alami (2005) aSyMov: a planner that deals with intricate symbolic and geometric problems. In Robotics Research, pp. 100–110. Cited by: §I, §II.
  • [12] D. Hadfield-Menell, E. Groshev, R. Chitnis, and P. Abbeel (2015) Modular task and motion planning in belief space. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4991–4998. Cited by: §I, §II, §II, §II.
  • [13] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra (1998) Planning and acting in partially observable stochastic domains. Artificial intelligence 101 (1-2), pp. 99–134. Cited by: §I, §II.
  • [14] L. P. Kaelbling and T. Lozano-Pérez (2011) Hierarchical Planning in the Now. In IEEE International Conference on Robotics and Automation (ICRA). Cited by: §I, §II.
  • [15] L. P. Kaelbling and T. Lozano-Pérez (2013) Integrated task and motion planning in belief space. International Journal of Robotics Research (IJRR). Cited by: §I, §I, §II, §II, §V.
  • [16] L. P. Kaelbling and T. Lozano-Pérez (2016) Implicit belief-space pre-images for hierarchical planning and execution. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 5455–5462. Cited by: §II.
  • [17] E. Keyder and H. Geffner (2008) The HMDPP planner for planning with probabilities. Sixth International Planning Competition at ICAPS 8. Cited by: §V.
  • [18] J. E. King, V. Ranganeni, and S. S. Srinivasa (2017) Unobservable Monte Carlo planning for nonprehensile rearrangement tasks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4681–4688. Cited by: §II.
  • [19] A. Kolobov (2012) Planning with Markov decision processes: An AI perspective. Synthesis Lectures on Artificial Intelligence and Machine Learning 6 (1), pp. 1–210. Cited by: §I, §II, §II, §V.
  • [20] J. K. Li, D. Hsu, and W. S. Lee (2016) Act to see and see to act: POMDP planning for objects search in clutter. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5701–5707. Cited by: §II, §II.
  • [21] Y. Li, G. Wang, X. Ji, Y. Xiang, and D. Fox (2018) DeepIM: Deep Iterative Matching for 6D Pose Estimation. European Conference on Computer Vision (ECCV). Cited by: §VIII.
  • [22] O. Madani, S. Hanks, and A. Condon (1999) On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. In AAAI/IAAI, pp. 541–548. Cited by: §II.
  • [23] D. McDermott, M. Ghallab, A. Howe, C. Knoblock, A. Ram, M. Veloso, D. Weld, and D. Wilkins (1998) PDDL: The Planning Domain Definition Language. Technical report Yale Center for Computational Vision and Control. Cited by: §IV.
  • [24] C. Paxton, N. Ratliff, C. Eppner, and D. Fox (2019) Representing Robot Task Plans as Robust Logical-Dynamical Systems. arXiv preprint arXiv:1908.01896. Cited by: §VIII.
  • [25] E. P. D. Pednault (1989) ADL: Exploring the Middle Ground Between STRIPS and the Situation Calculus. KR 89, pp. 324–332. Cited by: §IV.
  • [26] C. Phiquepal and M. Toussaint (2017) Combined task and motion planning under partial observability: An optimization-based approach. In RSS Workshop on Integrated Task and Motion Planning. Cited by: §I, §II, §II.
  • [27] R. Platt Jr and R. Tedrake (2012) Non-Gaussian belief space planning as a convex program. In Proc. of the IEEE Conference on Robotics and Automation. Cited by: §II.
  • [28] R. Platt, R. Tedrake, L. Kaelbling, and T. Lozano-Perez (2010) Belief space planning assuming maximum likelihood observations. Robotics: Science and Systems VI. ISBN 9780262516815. Cited by: §II, §II.
  • [29] T. Schmidt, R. A. Newcombe, and D. Fox (2014) DART: Dense Articulated Real-Time Tracking. In Robotics: Science and Systems, Vol. 2. Cited by: §III, §VIII.
  • [30] D. Silver and J. Veness (2010) Monte-Carlo planning in large POMDPs. In Advances in neural information processing systems, pp. 2164–2172. Cited by: §II, §II.
  • [31] A. Somani, N. Ye, D. Hsu, and W. S. Lee (2013) DESPOT: Online POMDP planning with regularization. In Advances in neural information processing systems, pp. 1772–1780. Cited by: §II, §II.
  • [32] S. Srivastava, N. Desai, R. Freedman, and S. Zilberstein (2018) An anytime algorithm for task and motion MDPs. arXiv preprint arXiv:1802.05835. Cited by: §I, §II.
  • [33] S. Srivastava, E. Fang, L. Riano, R. Chitnis, S. Russell, and P. Abbeel (2014) Combined Task and Motion Planning Through an Extensible Planner-Independent Interface Layer. In IEEE International Conference on Robotics and Automation (ICRA). Cited by: §I, §II.
  • [34] S. Thiébaux, J. Hoffmann, and B. Nebel (2005) In defense of PDDL axioms. Artificial Intelligence 168 (1-2), pp. 38–69. Cited by: §IV-A.
  • [35] M. Toussaint (2015) Logic-geometric programming: an optimization-based approach to combined task and motion planning. In AAAI Conference on Artificial Intelligence, pp. 1930–1936. Cited by: §I, §II.
  • [36] Y. Wang, S. Chaudhuri, and L. E. Kavraki (2018) Bounded policy synthesis for POMDPs with safe-reachability objectives. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 238–246. Cited by: §I, §II.
  • [37] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox (2018) PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. Robotics: Science and Systems (RSS). Cited by: §VIII.
  • [38] Y. Xiao, S. Katt, A. ten Pas, S. Chen, and C. Amato (2019) Online Planning for Target Object Search in Clutter under Partial Observability. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8241–8247. Cited by: §II, §II.
  • [39] S. W. Yoon, A. Fern, R. Givan, and S. Kambhampati (2008) Probabilistic Planning via Determinization in Hindsight. In AAAI, pp. 1010–1016. Cited by: §II.
  • [40] S. W. Yoon, A. Fern, and R. Givan (2007) FF-Replan: A Baseline for Probabilistic Planning. In International Conference on Automated Planning and Scheduling (ICAPS). Cited by: §I, §II.