Our objective is to design and build robot policies that can interact robustly and safely with large collections of objects that are only partially observable, where the objects have never been seen before and where achieving the goal may require many coordinated actions, as in putting away all the groceries or collecting all the ingredients for a meal. Our goal is a policy that will generalize without specialized re-engineering or re-training to a broad range of novel objects, physical environments, and goals, but also be able to acquire whole new competencies, cumulatively, through incremental engineering and learning.
There is a broad appreciation of the importance of generality in design methods: trajectory optimization and reinforcement learning, for example, are both general tools that can address a large array of problems. However, the policies that are typically built with them are quite narrow in their domain of application. We seek, instead, systems generality, in which the focus is on the generality of a single policy. In this paper, we describe an approach for building such policies as deliberative systems and then instantiate it with an implementation that is able to manipulate novel objects in novel arrangements to achieve novel goals, both in simulation and on a real robot. It makes use of engineered as well as machine-learned modules for object segmentation, shape estimation of 3D object meshes, and grasp prediction, along with a state-of-the-art task-and-motion planner.
The operation of our system, called M0M (Manipulation with Zero Models), is illustrated in Figure 1. The goal is for all objects to be on a blue target region. Importantly, the system has no prior geometric models of objects and no specification of what objects are present in the world. It takes as input RGB-D images, which it segments and processes to find surfaces, colored target regions and object candidates (see Section VII-A). The goal for this task is communicated to the system by the following logical formula:
This formula involves a relation On that the system knows how to achieve, by picking and placing, and perceptual properties (Is) such as color that the system can compute from the input images (see Section IV). The goal does not reference any individual objects by name because, in our problem setting, the object instances have no names. Instead, goals existentially and universally quantify over the perceivable objects, which may vary substantially in number and properties across, and even within, problem instances.
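As an illustration, a goal of this kind, asking that every perceived object rest on some blue region, could be written as below. This is only a sketch: the variable names $r$ and $o$ and the constant $\mathit{blue}$ are assumptions chosen for illustration, and only the relation On and the property test Is are taken from the text.

```latex
\[
  \exists r.\; \mathrm{Is}(r, \mathit{blue}) \;\wedge\; \forall o.\; \mathrm{On}(o, r)
\]
```

Note that neither quantifier names a specific object instance; both range over whatever the perception modules report in the current observation.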
Initially, two objects are purposefully hidden behind the tall cracker box so that the robot cannot perceive them. Finding only a single object on the table, the robot first picks and places the cracker box on the blue target region. It selects a placement for the cracker box on the blue region that is roughly in the middle of the region. Because the initial cracker box placement was planned without knowledge of the other two objects, upon observing the new objects, the robot intentionally moves the cracker box to a temporary new placement to make room for the tape measure and green cup. Finally, the robot plans a new placement for the cracker box that avoids collisions with the other two objects while also satisfying the goal. A video of this trial is available at: https://youtu.be/f-GCKQWuPyM; additional experiments are described in Section VIII and in Appendix C.
M0M can perform purposeful manipulation for a general class of object shapes, object arrangements and goals, while operating directly from perceptual data, even in partially observable settings. Importantly, the system is designed in a modular fashion so that different modules can be used for perceptual tasks such as segmenting the scene or choosing grasps on detected objects. Furthermore, new manipulation operations, such as pushing or pouring, can be added and immediately combined with existing operations to achieve new goals. Many more examples of M0M in operation are illustrated in the remainder of the paper and at https://tinyurl.com/open-world-tamp.
Any robot system that has an extended interaction with its environment, selecting actions based on the world state and the outcomes of its previous actions, can be seen at the most basic level as a control policy that maps a sequence of inputs (generally intensity and depth images and joint angles) into motor torques. It has been traditional to hand-design and implement such control policies. A classical strategy of multi-level model-predictive control with a general-purpose planner at the top level results in very robust behavior that generalizes over a wide range of situations and goals [1, 2, 3, 4, 5]; however, these approaches have traditionally required a substantial amount of prior knowledge of the objects in the world and their dynamics.
A newer strategy for constructing such control policies is to learn them via supervised, imitation, or reinforcement-learning methods in simulation or real-world settings [6, 7]. These approaches are attractive because they require less human engineering, but they make heavy demands for real-world training, which again poses a substantial development burden. In addition, these learned policies are often narrowly focused on a single “task”.
In this paper, we present a strategy for obtaining the best of both worlds: we encode fundamental, very generic, aspects of physical manipulation of objects in three-dimensional space in an algorithmic framework that implements a feedback control policy mapping sensory observations and a goal specification into motor controls. To instantiate this framework for a new domain one must provide:
A description of the robot’s kinematics and a basic position trajectory controller;
A characterization of the manipulation operations that the robot can use; and
A set of perceptual modules that estimate properties of objects that the system will interact with, which can generally be acquired via off-line training and shared over a variety of applications and robots.
Systems that instantiate this framework will immediately generalize without any re-engineering or re-training to a broad range of novel objects, physical environments, and goals. Due to the modularity of the architecture, they can also serve as a basis for acquiring whole new competencies, cumulatively, by adding new learned or engineered modules.
Our approach leverages the planning capabilities of general-purpose task and motion planning (tamp) systems. The key insight behind our approach is that such planners do not necessarily need a perfect and complete model of the world, as is often assumed; they only need answers to some set of “queries”, which can be answered by direct recourse to perceptual data. Existing tamp systems that have been demonstrated in real-world settings, including our prior work, require known object instance 3D mesh models that can be accurately aligned to the observed data using human-calibrated fiducials or pose estimators, which restricts their applicability to known environments, often with only a few unique object instances [1, 2, 3, 4, 5]. Even several extensions to tamp-based systems that actively deal with some uncertainty from perception (such as substantial occlusions) [9, 10, 11] require observations in the form of poses of known objects. This pose registration process is critical for these approaches for identifying human-annotated affordances, such as grasps and placement surfaces, and representing collision volumes during planning. However, we show that one can also fulfill these operations using only the observed point cloud, without the need for prior models. In this paper, we develop a strategy in which all such queries are resolved in sensory data (see Section V).
The system instance described in this paper constructs a “most likely” estimate of the current scene by segmenting it into objects that can then be used to estimate shapes, grasp affordances, and other salient properties. It then solves for a multi-step motion plan to achieve the goal given that interpretation, executes the first few steps of the plan, re-observes the scene, determines whether the goal is satisfied, and if not, re-plans.
We demonstrate, in simulation and on a real robot, that our system can handle objects of unknown types and a variety of goals. Even if it makes perceptual errors, which are often reflected in taking imperfect actions, it recovers from these problems by continually re-perceiving and re-planning. We experiment with different implementations of perceptual modules, illustrating the importance of modularity for the overall flexibility and extensibility of the system.
III Related work
The most closely related work to ours in manipulation without shape models is by Gualtieri et al. Many of the components of our system, e.g., grasping and shape estimation, are analogous to theirs. They, however, assume a task-specific rearrangement planner is provided and do not consider tasks that may require more general manipulation of the environment, e.g., moving an object out of the way, or the more complex goals enabled by a tamp system.
A number of other approaches [13, 14] demonstrate systems that exploit the ability to gain information by interacting with objects. There is also a long line of work aimed at “interactive segmentation”, that is, using robot motions to disambiguate among object hypotheses when manipulating in clutter. Object search under partial observability has been studied within a partially observable Markov decision process (POMDP) framework [16, 17], including work that learns policies that uncover hidden objects in piles.
IV Manipulation with zero models
We begin by describing the scope of the Manipulation with Zero Models (M0M) framework for prehensile manipulation, in which the robot moves objects using pick and place operations. We have previously implemented a variety of other manipulation operations, including pushing, pouring, scooping, and unscrewing bottle caps [19, 20]. In this paper, we focus on a single, prehensile, manipulation “mode”, which is to pick up objects in a rigid grasp, move them while not contacting any other objects, and then place them stably back onto a surface. The M0M system has already built into it the necessary descriptions of these operations for planning; an overview can be found in Section V. This single domain description is used for all the objects, arrangements and goals. The description provided here is the most basic version of the framework; Section IX discusses the simplifications and assumptions inherent in this version and outlines strategies for relaxing them.
To apply M0M to a manipulation robot, it is necessary to provide a urdf description of the robot’s kinematics and a position configuration controller for the robot. The robot may have multiple manipulators that move sequentially.
An instantiation of M0M requires perceptual modules of several different types. The first modules take an RGB-D image as input:
rigid objects: Output is a set of object hypotheses, each of which is characterized by an RGB partial point cloud.
fixed surfaces: Output is a set of approximately horizontal surfaces (such as tables, shelves, parts of the floor) that could serve as support surfaces for placing objects.
Associated with each entity is an arbitrary reference coordinate frame, the simplest being the robot’s base frame. When we speak of the pose of an entity, we mean a transform relative to the reference coordinate frame of that entity. This notion of a pose is useful for representing relative transformations but has no semantics outside the system. The remaining modules take an object point cloud as input:
grasps: Output is a possibly infinite sequence of transforms between the robot’s hand and the reference coordinate frame of the object such that, if the robot were to reach that relative pose with the gripper open and then close it, it would likely acquire a stable grasp of the object.
collision volume: Output is a predicted volume for the object, primarily used for reasoning about collision-avoidance and containment.
stable orientations: Output is a set of stable orientations of the object in its reference frame.
object properties: Output is a list of properties of the object, which will be used in goal specifications. They can include object class, aspects of shape, color, etc.
These modules can (and do) use different representations for their computations. Some may use conservative over-estimates of the input point cloud to find volumes for avoiding collisions, others may use tight approximations of local areas to find candidate grasps, while others may use learned networks operating on the whole input to compute such affordances.
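The module contract described above can be sketched as a bundle of interchangeable functions over an object point cloud. This is a minimal illustration, not the interface of the actual system: the names `ObjectCloud`, `PerceptionModules`, `bounding_box`, `dominant_channel`, and `no_grasps` are hypothetical, the stable-orientation module is omitted for brevity, and the toy implementations stand in for the learned or engineered estimators.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class ObjectCloud:
    """A segmented object hypothesis: an RGB partial point cloud (hypothetical type)."""
    points: List[Point]
    colors: List[Tuple[int, int, int]]


@dataclass
class PerceptionModules:
    """Per-object modules; any function with a matching signature can be swapped in."""
    grasps: Callable[[ObjectCloud], Iterator[tuple]]
    collision_volume: Callable[[ObjectCloud], Tuple[Point, Point]]
    properties: Callable[[ObjectCloud], Dict[str, str]]


def bounding_box(cloud):
    """Toy collision volume: the axis-aligned box enclosing the visible points."""
    xs, ys, zs = zip(*cloud.points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def dominant_channel(cloud):
    """Toy color property: the RGB channel with the largest summed intensity."""
    totals = [sum(color[i] for color in cloud.colors) for i in range(3)]
    return {"color": ("red", "green", "blue")[totals.index(max(totals))]}


def no_grasps(cloud):
    """Placeholder grasp module that yields no candidates."""
    return iter(())


modules = PerceptionModules(
    grasps=no_grasps, collision_volume=bounding_box, properties=dominant_channel
)
cloud = ObjectCloud(
    points=[(0.0, 0.0, 0.0), (0.1, 0.2, 0.3)],
    colors=[(10, 20, 200), (0, 30, 180)],
)
volume = modules.collision_volume(cloud)
props = modules.properties(cloud)
```

Because each module is only constrained by its input and output types, a learned grasp predictor and a geometric one can be exchanged without touching the rest of the pipeline.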
V M0M using pddlstream
Our implementation of M0M uses pddlstream, an existing open-source domain-independent planning framework for hybrid domains. We have previously used pddlstream to solve a rich class of observable manipulation problems; however, in our previous applications, object shapes were assumed to be known exactly. Other tamp frameworks that provide a similar interface between perceptual operations and the planner through, for example, suggestors or a refinement layer could also be used as the basis of an implementation of our approach.
pddlstream takes as input models of the manipulation operations, in the form of Planning Domain Definition Language (pddl) operator descriptions (see Figure 3), and a set of samplers (referred to as streams), which produce candidate values of continuous quantities, including joint configurations, grasps, object placements, and robot motion trajectories that satisfy the constraints specified in the vocabulary of the problem (see Figure 4). Critically, aside from a small declaration of the properties that their inputs and outputs satisfy, the implementation of each stream is treated as a blackbox. As a result, pddlstream is agnostic toward both the representation of stream inputs and outputs as well as whether operations are implemented using engineering or learning techniques. This allows state-of-the-art machine learning methods to be flexibly incorporated, without modification, during planning where they will be automatically combined with other independent operations by the pddlstream planning engine.
Problems described using the pddlstream planning language can be solved by a variety of pddlstream planning algorithms. Because the pddlstream planning engine is responsible for querying the perceptual operations (in the form of streams), it will automatically decide online which operations are relevant to the problem and how many generated values are needed. Furthermore, several pddlstream algorithms (e.g., the focused algorithm) will lazily query the perceptual operations in order to avoid unnecessary computation. As a result, the planner will not perform computationally expensive perceptual operations on images and point clouds to, for example, predict properties and grasps unless the segmented object or property has been identified as relevant to the problem.
Below we describe our pddlstream formulation in detail so as to make the contract between perceptual operations and action descriptions precise.
V-B pddlstream formulation
In pddl, an action is specified by a list of free parameters (:parameters), a precondition logical formula (:precondition) that must hold to correctly apply the action, and an effect logical conjunction (:effect) that describes changes to the state when the action is applied. Figure 3 gives the pddl description of the move and place actions for M0M. The move action models collision-free motion of the robot while it is not holding anything. The place action models the instantaneous change from when its hand is exerting a force to hold an object to when a force is no longer applied and the object is released. The move-holding and pick actions are described in Figure 15 in Appendix A.
A state is a goal state if the goal formula holds in it. Goal specifications, even those with quantifiers, can be directly and automatically encoded in a pddl formulation using axioms, logical inference rules [21, 22, 5]. Intuitively, an axiom has the same precondition and effect structural form as an action but is automatically derived at each state. Due to their similarities with actions, axioms can straightforwardly be incorporated in pddl, enabling a planner to efficiently reason about complex goal conditions, such as the ones present in M0M.
pddlstream builds on pddl by introducing stream descriptions, which are similar in syntax to pddl operator descriptions. A stream is declared by a list of input parameters (:inputs), a logical formula that all legal input parameter values must satisfy (:domain), a list of output parameters (:outputs), and a logical conjunction that all legal input parameter values and generated output parameter values are guaranteed to satisfy (:certified). Each stream is accompanied by a procedure that maps input parameter values to a possibly infinite sequence of output parameter values. Figure 4 displays six streams used in M0M, which we will describe in detail in Section V-D.
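The four declared fields and the accompanying procedure can be mirrored in a small data structure. The sketch below is an illustration in plain Python and is not the pddlstream API: the `Stream` class, its `certify` helper, and the `toy_grasps` generator are hypothetical, and the `(ObjectCloud ?oc)` domain condition for predict-grasps is an assumption.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Callable, Iterator, List, Tuple


@dataclass
class Stream:
    """Hypothetical in-memory form of a stream declaration."""
    name: str
    inputs: List[str]                      # :inputs
    domain: List[Tuple[str, ...]]          # :domain
    outputs: List[str]                     # :outputs
    certified: List[Tuple[str, ...]]       # :certified
    procedure: Callable[..., Iterator[tuple]]

    def certify(self, input_values, output_values):
        """Substitute concrete values for parameters in the certified facts."""
        names = self.inputs + self.outputs
        binding = dict(zip(names, tuple(input_values) + tuple(output_values)))
        return [tuple(binding.get(term, term) for term in fact)
                for fact in self.certified]


def toy_grasps(cloud_name):
    """Infinite generator standing in for a grasp predictor."""
    index = 0
    while True:
        yield (f"{cloud_name}-grasp-{index}",)
        index += 1


predict_grasps = Stream(
    name="predict-grasps",
    inputs=["?oc"],
    domain=[("ObjectCloud", "?oc")],       # assumed domain condition
    outputs=["?g"],
    certified=[("Grasp", "?oc", "?g")],
    procedure=toy_grasps,
)

# The planner, not the stream, decides how many values to draw.
first_two = list(islice(predict_grasps.procedure("cloud0"), 2))
facts = [predict_grasps.certify(("cloud0",), out) for out in first_two]
```

Modeling the procedure as a generator is what makes lazy querying possible: no grasp is computed until the planner asks for the next element.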
V-C Predicate semantics
First, we describe the semantics of the predicates used in Figure 3 and Figure 4. The following predicates encode the types of parameter values: (Conf ?q) indicates ?q is a continuous robot joint configuration; (Traj ?t) indicates ?t is a continuous robot joint trajectory; (ObjectCloud ?oc) indicates ?oc is an object, which crucially is represented by a segmented point cloud observation; (Pose ?oc ?p) indicates ?p is a pose transform for an object point cloud ?oc relative to its observed frame; (Grasp ?oc ?g) indicates ?g is a grasp transform for an object point cloud ?oc relative to its observed frame. The choice to use the observed frame as the reference frame for an object is arbitrary and has no bearing on the system as poses are only used internally during planning. As a result of this decision, the initial pose of each object cloud is the identity pose. (Property ?pr) denotes that ?pr is a perceivable property, such as a particular color or category.
The following fluent predicates model the current state of the system: (AtConf ?q) represents the current robot configuration ?q; (HandEmpty) is true if the robot’s hand is empty; (AtGrasp ?oc ?g) indicates that object cloud ?oc is held by the robot at grasp ?g; (AtPose ?oc ?p) indicates that object cloud ?oc is resting at placement ?p; (On ?oc ?oc2) signifies that object cloud ?oc is resting on object cloud ?oc2. Normally, in a fully observable tamp setting, ?oc would be the name of an object instance; however, those do not exist in our setting, so ?oc is simply a unique point cloud. In the initial planning state, constructed after object clouds are segmented from the latest observation, the robot is at its current configuration, each object cloud is at its identity pose, and the robot’s hand is empty.
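A minimal sketch of how such an initial state could be assembled from the predicates defined above is given below. The function name `initial_state`, the string tokens used for configurations, clouds, and the identity pose, and the exact set of type facts included are assumptions for illustration.

```python
def initial_state(current_conf, cloud_names, identity_pose="identity"):
    """Build initial facts: robot at its configuration, hand empty,
    and every segmented object cloud resting at its identity pose."""
    facts = [
        ("Conf", current_conf),
        ("AtConf", current_conf),
        ("HandEmpty",),
    ]
    for name in cloud_names:
        facts.append(("ObjectCloud", name))
        facts.append(("Pose", name, identity_pose))
        facts.append(("AtPose", name, identity_pose))
    return facts


state = initial_state("q0", ["cloud0", "cloud1"])
```

Since the state is rebuilt from each new observation, the cloud identifiers need only be unique within a single planning episode.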
V-D Engineered and learned streams
Next, we describe the streams as well as the constraint predicates that they certify. We highlight the distinction between streams that can be directly engineered and those that must be at least partially learned. The engineered streams we consider are robot-centric operations that can be performed using the robot’s fully-observed urdf, which encodes the robot’s kinematics and geometry. The inverse-kinematics stream solves for configurations ?q that satisfy the kinematic constraint (Kin ?q ?g ?p) with grasp ?g and pose ?p, for example, using IKFast. The plan-motion stream plans a continuous trajectory ?t between configurations ?q1 and ?q2 that respects joint limits and avoids self-collisions, certifying (Motion ?q1 ?t ?q2). It can be directly implemented by any off-the-shelf motion planner, such as RRT-Connect.
The learned streams can use a combination of machine learning and classical estimation techniques. In our system, we consider several implementations of each stream, each of which wraps a state-of-the-art estimation technique for its subproblem. The predict-grasps stream generates grasps ?g for object cloud ?oc that are predicted to remain stably in the robot’s hand, certifying (Grasp ?oc ?g). In Section VII-C, we describe several machine learning implementations of predict-grasps, some of which make predictions directly from ?oc without any intermediate representation.
The predict-placements stream generates poses ?p1 for object cloud ?oc1 that are predicted to rest stably on object cloud ?oc2 when at pose ?p2, certifying (Stable ?oc1 ?p1 ?oc2 ?p2). Our implementation of predict-placements decomposes the operation into two estimation subprocedures. First, we perform point cloud completion (Section VII-B) on object cloud ?oc2 and then estimate approximately horizontal planar surfaces in ?oc2 when at pose ?p2 using Random Sample Consensus (RANSAC) plane estimation. Next, we perform shape estimation (Section VII-B) on object cloud ?oc1 and then estimate stable orientations relative to a planar surface using the resulting mesh. By combining these two subprocedures, we obtain placements ?p1 for object cloud ?oc1.
The predict-cfree stream predicts whether all robot configurations along trajectory ?t do not collide (i.e., are collision-free) with object cloud ?oc2 at pose ?p2, certifying (CFreeTrajPose ?t ?oc2 ?p2). By finely sampling configurations along trajectory ?t, this test can be reduced to a sequence of collision predictions between a robot configuration and an object cloud. Although these predictions could be made directly, we instead use shape estimation (Section VII-B) to estimate the collision volume of the object, covering both its observable and unobservable portions, as a set of convex bodies. This enables us to use fast convex body collision checkers to answer these queries. A similar predict-traj-grasp stream that predicts collisions with a grasped object is described in Appendix A. Finally, the detect-property stream tests whether object cloud ?oc has property ?pr and, if so, certifies (Is ?oc ?pr). Section VII-D describes two property estimators, which detect the category and color of an object from the RGB image observation.
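The reduction of a trajectory test to per-configuration collision checks can be sketched as follows. This is a toy illustration under stated assumptions: the configuration is a 2D point rather than a joint vector, the estimated collision volume is a single axis-aligned box rather than a set of convex meshes, and the names `trajectory_is_collision_free`, `interpolate`, and `in_box` are hypothetical.

```python
import math


def interpolate(q1, q2, num_steps):
    """Evenly spaced configurations from q1 to q2, endpoints included."""
    return [tuple(a + (b - a) * i / num_steps for a, b in zip(q1, q2))
            for i in range(num_steps + 1)]


def trajectory_is_collision_free(waypoints, in_collision, resolution=0.05):
    """Sample each segment at the given resolution and test every sample."""
    if any(in_collision(q) for q in waypoints):
        return False
    for q1, q2 in zip(waypoints, waypoints[1:]):
        span = max(abs(b - a) for a, b in zip(q1, q2))
        num_steps = max(1, math.ceil(span / resolution))
        if any(in_collision(q) for q in interpolate(q1, q2, num_steps)):
            return False
    return True


# Toy convex obstacle: an axis-aligned box in the 2D configuration space.
BOX_MIN, BOX_MAX = (0.4, 0.4), (0.6, 0.6)


def in_box(q):
    return all(lo <= x <= hi for x, lo, hi in zip(q, BOX_MIN, BOX_MAX))


blocked = trajectory_is_collision_free([(0.0, 0.0), (1.0, 1.0)], in_box)
clear = trajectory_is_collision_free([(0.0, 0.0), (1.0, 0.0)], in_box)
```

The sampling resolution trades speed for the risk of missing thin obstacles between samples, which is one reason a conservative volume estimate is attractive for this query.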
VI Manipulation policy
The pseudocode for the manipulation policy, which at its core leverages planning using the model described in Section V, is displayed in Algorithm 1. A flowchart of the policy is illustrated in Figure 2. The policy assumes the set of manipulation actions (Section V-B) and the engineered streams (Section V-D). It requires an implementation of the learned streams; several options per stream are discussed in Section VII. The policy is conditioned on a particular robot and a specified goal. To apply the policy to a new robot, it is necessary to provide a urdf description of the robot’s kinematics and a position configuration controller for the robot.
On each decision-making iteration, the robot receives the current RGB-D image from its camera and its current joint configuration from its joint encoders. From each input RGB-D image, it segments out table point clouds and object point clouds. The segmented object and table point clouds as well as the robot configuration instantiate the current pddlstream state of the world and robot. This current state along with the goal, actions, and streams form a pddlstream planning problem, which is solved by solve-pddlstream, a procedure that denotes a generic pddlstream planning algorithm. In some cases, such as when a necessary attribute is not detected, solve-pddlstream will return None, indicating that the goal is unreachable from the current state. Otherwise, solve-pddlstream will return a plan, which consists of a finite sequence of instances of the available actions. If the plan is empty, the current state was proved to satisfy the goal and the policy terminates successfully. Otherwise, the robot executes the first action of the plan using its position controllers and repeats this process by reobserving the scene. Note that this control structure enforces that the robot observes the scene to infer whether it has achieved the goal; otherwise, the robot could erroneously declare success after executing a plan open loop.
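The control structure just described can be sketched as a short loop. This is an illustrative reading of the prose and not a reproduction of Algorithm 1: the function `run_policy`, its callback parameters, the iteration cap, and the toy world used to exercise it are all assumptions.

```python
def run_policy(observe, segment, solve, execute, goal, max_iterations=50):
    """Observe, plan, execute one action, and repeat until the plan is empty."""
    for _ in range(max_iterations):
        image, conf = observe()
        state = segment(image, conf)
        plan = solve(state, goal)
        if plan is None:
            return False   # goal judged unreachable from the current state
        if not plan:
            return True    # empty plan: the observed state satisfies the goal
        execute(plan[0])   # execute only the first action, then re-observe
    return False


# Toy world: three objects that each need one placement action.
remaining = ["box", "cup", "tape"]
executed = []


def observe():
    return list(remaining), "q0"


def segment(image, conf):
    return {"conf": conf, "misplaced": image}


def solve(state, goal):
    return [("place", name) for name in state["misplaced"]]


def execute(action):
    executed.append(action)
    remaining.remove(action[1])


success = run_policy(observe, segment, solve, execute, goal="all-on-blue")
```

Success is only ever declared after a fresh observation yields an empty plan, which matches the requirement that the robot confirm the goal perceptually rather than trusting open-loop execution.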
We have implemented an instance of M0M using pddlstream and experimented with different strategies for implementing the perceptual modules. All make use of RGB-D images gathered from the PR2’s Kinect 1 sensor. In this section, we briefly describe the implementation of individual modules and include experimental results comparing alternative implementations of several of the modules.
We used standard position trajectory controllers for simulation in PyBullet and on a physical PR2 robot, and simply opened and closed the parallel-jaw grippers to implement grasping and releasing objects. We used the actual opening of the gripper after commanding the gripper to close to detect grasp failure.
VII-A Segmentation of objects and surfaces
Category-agnostic segmentation is used to identify rigid collections of points that collectively move as an object when manipulated. We compare three different segmentation approaches: uois-net-3d, geometric clustering, and a combined method. uois-net-3d is a neural-network model that takes RGB-D images as input and returns a segmentation of the scene. It assumes that objects are generally resting on a table; it attempts to segment out image regions corresponding to the table, as well as a set of objects.
For geometric clustering, we first remove the points assigned to the table by uois-net-3d, then use density-based spatial clustering of applications with noise (dbscan), which finds connected components in a graph constructed by connecting points in the point cloud that are nearest-neighbors in 3D Euclidean distance. In a combined approach, we apply dbscan to the segmented point cloud produced by uois-net-3d in order to reduce under-segmentation. We additionally use post-processing to filter degenerate clusters.
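The geometric step can be approximated by the sketch below, which is a simplified stand-in for dbscan rather than dbscan itself: it groups points into connected components under a distance threshold and discards small clusters, without the core/border point distinction of the full algorithm. The function name `euclidean_clusters` and the parameters `radius` and `min_size` are assumptions, and the quadratic neighbor search is only suitable for small clouds.

```python
import math


def euclidean_clusters(points, radius, min_size=1):
    """Connected components of the graph linking points within `radius`."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            near = [i for i in unvisited
                    if math.dist(points[current], points[i]) <= radius]
            for i in near:
                unvisited.remove(i)
            cluster.extend(near)
            frontier.extend(near)
        if len(cluster) >= min_size:       # filter degenerate clusters
            clusters.append(sorted(cluster))
    return sorted(clusters)


points = [
    (0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.02, 0.0, 0.0),  # first object
    (1.0, 1.0, 0.0), (1.01, 1.0, 0.0),                    # second object
    (5.0, 5.0, 5.0),                                      # isolated noise point
]
clusters = euclidean_clusters(points, radius=0.02, min_size=2)
```

In the combined approach, a routine of this kind would be run separately on each segment returned by the network, so that a single under-segmented mask can be split into spatially separate objects.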
Figure 5 displays the segmentation mask predicted by uois-net-3d while our system was executing the task 5 trial displayed in Figure 14. As can be seen, uois-net-3d generally correctly segments the four instances; however, it does oversegment the cracker box into two contacting instances in the last two images.
We compared all three segmentation methods on the ARID-20 subset of the object clutter indoor dataset (OCID) and the GraspNet-1Billion dataset. Detailed results are reported in the appendix. We found that the different segmentation algorithms have advantages in different settings. In domains where objects have simple geometries and are scattered on the table, a Euclidean-based approach produces reliable predictions. But in a more cluttered domain, the learned approach often outperforms the Euclidean-based approach. In challenging situations where objects have more complicated geometry, the performance of the learned approach drops, but it still outperforms the pure Euclidean-based approach. In all experiments, the combined approach performs better than the pure learned approach, indicating the effectiveness of applying dbscan and filtering to neural-network-predicted results. In our system experiments, we use the combined method.
VII-B Shape estimation
A subroutine of our implementation of both the predict-placements and predict-cfree stream operations (Section V-D) is shape estimation, which takes in a partial point cloud as input and predicts a completed volumetric mesh. We again explore a combination of neural-network-based and geometric methods.
The morphing and sampling network (MSN) is a neural-network model that takes as input a partial point cloud and predicts a completed point cloud. Our geometric method works by augmenting the partial point cloud by computing the projection of the visible points onto the table plane. This simple heuristic is motivated by the intuition that the base of an object must be large enough to stably support the visible portion of an object and is particularly useful given a viewpoint that tends to observe objects from above. As a post-processing step for both methods, we filter the result by back-projecting predicted points onto the depth image and pruning any predicted points that lie closer to the camera than the observed depth value, since such points would have been visible.
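The projection heuristic can be sketched in a few lines. This is a minimal illustration that assumes the table is the horizontal plane at height `table_z` in the frame of the points; the function name `project_to_table` is hypothetical, and the depth-image filtering step is not shown.

```python
def project_to_table(visible_points, table_z=0.0):
    """Augment a partial cloud with the shadow of its points on the table plane."""
    visible = list(visible_points)
    projected = {(x, y, table_z) for x, y, _ in visible}
    # Keep the visible points first, then add only the new projected points.
    return visible + sorted(projected - set(visible))


visible = [(0.0, 0.0, 0.1), (0.1, 0.0, 0.1), (0.0, 0.0, 0.2)]
augmented = project_to_table(visible)
```

Points that share the same horizontal position collapse to a single projected point, so the augmentation adds at most one point per distinct footprint location.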
VII-B1 Mesh interpolation
While it is possible to directly use the estimated point cloud in downstream operations, for example by treating the points as spheres or downsampling them into a voxel grid, it is more accurate and efficient to interpolate among the points to produce a volumetric mesh. The simplest way to do this is to take the convex hull of the points; however, this can substantially overestimate the volume when the object is non-convex and fail to find feasible plans when attempting to grasp non-convex objects such as bowls. Instead, we produce the final volume by computing a “concave hull” in the form of an alpha shape, a non-convex generalization of a convex hull, from the union of the visible, network-predicted, and projected points. To enable efficient collision checking in the predict-cfree stream, we build an additional representation that approximates the mesh as the union of several convex meshes, implemented by volumetric hierarchical approximate convex decomposition (V-HACD).
Figure 6 visualizes the estimated meshes produced by four of the shape estimation strategies in an uncluttered scene with a diverse set of objects. The first two images compare creating a mesh by taking the convex hull (Figure 6 left) versus a concave hull (Figure 6 middle-left) of the set of visible points (V). The convex hull can significantly overestimate non-convex objects in certain areas, as evidenced by the spray bottle in the top left of the image and the real-world power drill in the right side of the image. The last three images compare three strategies for populating the set of points to be used as the input to a concave hull. Adding the shape-completed points from MSN (VLF) fills in some but not all of the occluded volume of each object, as shown by the cracker box in the middle of the image (Figure 6 middle-right). Also including the projection of the points to the table (VLPF) better fills in the occluded volume at the cost of overestimating the volume when the ground truth base projection is smaller than the visible base projection (Figure 6 right). We evaluated the performance of these methods in four different domains, each on 2000 images taken from a randomly-sampled camera pose; details of the experiments and results can be found in Appendix B. The fully combined method (VLPF) in general performed the best across the domains and is the one we use in the system experiments.
VII-C Grasp affordances
Grasp affordances are transformations between the robot’s hand and an object’s reference frame such that, if the robot’s hand were at that pose and closed its fingers, it would acquire the object in a stable grasp. They are purely local and do not take reachability, obstacles, or other constraints into account. The modularity of our planning framework enables us to consider three interchangeable grasping methods for implementing the predict-grasps stream, each of which takes a partial point cloud as input. Grasp Pose Detection (GPD) first generates grasp candidates by aligning one of the robot’s fingers to be parallel to an estimated surface in the partial point cloud and then scores these candidates using a convolutional neural network, which is trained on successful grasps for real objects. GraspNet uses a variational autoencoder (VAE) to learn a latent space of grasps that, when conditioned on a partial point cloud, yields grasps.
We also developed a method, estimated mesh antipodal (EMA), that performs shape estimation using the methods described in Section VII-B and then identifies antipodal contact points on the estimated mesh. Specifically, to generate a new grasp, EMA samples two points on the surface of the estimated mesh that are candidate contact points for the center of the robot’s fingers. The pair of points is rejected if the distance between them exceeds the gripper’s maximum width or if the surface normal at either of the corresponding faces is not approximately parallel to the line between the two points. Then, EMA samples a rotation for the gripper about this line and yields the resulting grasp if the gripper, when open and at this transformation, does not collide with the estimated mesh. A key distinction between EMA’s and GPD’s candidate grasp generation process is that, by using shape estimation, EMA is able to directly reason about the occluded regions of an object instead of just the visible partial point cloud. Additionally, it can take into account unsafe contacts between the robot’s gripper and the object.
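The rejection test applied to a sampled pair of contact points can be sketched as follows. This is an illustrative reading of the two criteria stated above (gripper width and normal alignment); the function name `is_antipodal` and the 15-degree tolerance are assumptions, and the subsequent rotation sampling and gripper collision check are not shown.

```python
import math


def unit(vector):
    """Scale a 3D vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vector))
    return tuple(c / norm for c in vector)


def is_antipodal(p1, n1, p2, n2, max_width, angle_tol_deg=15.0):
    """Accept a contact pair if it fits in the gripper and both surface
    normals are approximately parallel to the line between the contacts."""
    axis = tuple(b - a for a, b in zip(p1, p2))
    width = math.sqrt(sum(c * c for c in axis))
    if width == 0.0 or width > max_width:
        return False
    axis = unit(axis)
    cos_tol = math.cos(math.radians(angle_tol_deg))
    for normal in (n1, n2):
        alignment = abs(sum(a * b for a, b in zip(unit(normal), axis)))
        if alignment < cos_tol:
            return False
    return True


# Two opposite faces of a 5 cm wide box, with normals along the x axis.
good = is_antipodal((0.0, 0.0, 0.0), (-1.0, 0.0, 0.0),
                    (0.05, 0.0, 0.0), (1.0, 0.0, 0.0), max_width=0.08)
too_wide = is_antipodal((0.0, 0.0, 0.0), (-1.0, 0.0, 0.0),
                        (0.05, 0.0, 0.0), (1.0, 0.0, 0.0), max_width=0.04)
misaligned = is_antipodal((0.0, 0.0, 0.0), (-1.0, 0.0, 0.0),
                          (0.05, 0.0, 0.0), (0.0, 1.0, 0.0), max_width=0.08)
```

Because the test runs on the estimated mesh, contact points may be drawn from occluded faces that a method restricted to the visible cloud could not propose.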
Figure 7 illustrates some of the grasps produced by these three approaches in three scenes with varying amounts of clutter, where clutter introduces additional opportunities for occlusion. We performed a real-world experiment to compare the success rates of GPD, GraspNet, and EMA. The details of the experiment and the results can be found in Appendix -B. GPD and EMA outperformed GraspNet in our experiments, both in speed and accuracy, with EMA having an edge over GPD. We used EMA in our system experiments.
VII-D Object properties
In our implementation of the detect-property stream, we considered detectors for two object properties: category and color. We use Faster R-CNN trained on the bowl-cup subset of IIT-AFF to detect bowls and cups so that the robot can identify which objects can contain other objects. Additionally, we use Mask R-CNN, trained on both real images in the Yale-CMU-Berkeley (YCB) Video Dataset and a synthetic dataset we generated using PyBullet, to classify any YCB objects that are mentioned in the goal formula. We also have simple modules that aggregate color statistics directly from segmented RGB images.
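One plausible form for such a color module is sketched below. The text does not specify which statistics are aggregated, so everything here is an assumption: the sketch computes a saturation-weighted circular mean hue over an object's segmented pixels and selects the object closest to a target hue, as would be needed for a goal such as "closest in hue to red". The function names are hypothetical.

```python
import colorsys
import math


def mean_hue(pixels):
    """Circular mean hue, as a fraction of a full turn, of (r, g, b) pixels (0-255)."""
    x = y = 0.0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Weight by saturation * value so grey or dark pixels contribute little.
        w = s * v
        x += w * math.cos(2.0 * math.pi * h)
        y += w * math.sin(2.0 * math.pi * h)
    return (math.atan2(y, x) / (2.0 * math.pi)) % 1.0


def hue_distance(h1, h2):
    """Shortest distance between two hues on the color wheel (0 to 0.5)."""
    d = abs(h1 - h2) % 1.0
    return min(d, 1.0 - d)


def closest_in_hue(objects, target_hue):
    """objects maps an object name to the list of its segmented pixels."""
    return min(objects, key=lambda name: hue_distance(mean_hue(objects[name]), target_hue))
```

A circular mean is used because hue wraps around: averaging reds at 0.98 and 0.02 arithmetically would give cyan rather than red.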
VIII Whole-system experiments
Finally, we evaluated the whole M0M system by testing its ability to solve challenging real-world manipulation tasks. As an example, Figure 8 illustrates a task where the goal is for a mustard bottle to be on a blue target region:
To solve this task, the robot moved two obstructing objects out of the way so that it could safely pick the mustard bottle and place it on the goal region. Additionally, although not pictured, the robot’s first attempt to pick the mustard bottle failed, causing the system to abort execution, re-observe, re-plan, and execute a new grasp, which succeeded. The full video of the trial can be seen at https://youtu.be/tNHjpXP8RFo.
VIII-A Repeated trials
We performed experiments consisting of five repeated real-world trials for five tasks, obtaining the results shown in Figure 9. Here, tasks are loosely defined as a set of problems with the same goal formula and qualitatively similar initial states. Recall that, outside of the PR2’s description, the goal formula is the only input to each trial. We summarize the results here and describe the tasks in subsequent subsections. In addition to these tasks, we applied the system to over 25 individual problem instances. Appendix -C highlights several of these tasks along with how the system behaved to solve them. See https://tinyurl.com/open-world-tamp for full videos of our system solving these problems.
The column Iterations refers to the average number of combined estimation and planning iterations that were performed per trial. Unless the initial state satisfies the goal conditions, the system always takes two or more iterations because it must at least achieve the goal and then validate that the goal is in fact satisfied. Sometimes the system will perform more than two iterations in the event that the perception module identifies a new object due to undersegmentation, an action is aborted due to a failed grasp, or an action has unanticipated effects.
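The iteration count described above follows directly from the closed-loop structure of the policy, which can be sketched as below. This is a simplified illustration under stated assumptions: the five callbacks (`observe`, `estimate`, `goal_satisfied`, `plan`, `execute`) are hypothetical stand-ins for the perception, planning, and control modules, and `max_iterations` is an invented safeguard.

```python
def run_policy(observe, estimate, goal_satisfied, plan, execute, max_iterations=20):
    """Repeat estimation and planning until the goal is observed to hold.

    Returns (success, iterations), where iterations counts combined
    estimation-and-planning rounds.
    """
    for iteration in range(1, max_iterations + 1):
        # Estimation: build a fresh world-state estimate from the current view.
        state = estimate(observe())
        # Validation: terminate only once the goal is observed to be satisfied.
        if goal_satisfied(state):
            return True, iteration
        # Planning: give up if no plan is found (e.g., after a timeout).
        actions = plan(state)
        if actions is None:
            return False, iteration
        # Execution: abort the rest of the plan as soon as an action fails,
        # then re-observe and re-plan on the next iteration.
        for action in actions:
            if not execute(action):
                break
    return False, max_iterations
```

In this sketch a trial whose initial state already satisfies the goal returns after one iteration, a trial with no surprises returns after two (act, then validate), and each aborted action adds one more.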
The columns Estimation, Planning, and Execution report the average time spent perceiving, planning, and executing per iteration. Every module was implemented in Python so that alternative implementations could be swapped in easily. Many of the perceptual operations that manipulate raw point clouds could be sped up by using C++ instead of Python and by deploying the system on state-of-the-art graphics hardware. During planning, a majority of the time is spent checking for collisions, particularly when the robot is planning free-space motions. The overall runtime could be reduced by simultaneously planning motions for later actions while executing earlier actions. The column Successes reports the number of times out of five trials that the system terminated having identified that it achieved the goal. Our system achieved the goal on every trial except for a single trial of Task 3. These results show that this single system can perform a diverse set of long-horizon manipulation tasks robustly and reliably.
VIII-A1 Task 1
This task evaluates our system’s ability to grasp and stably place novel objects that are not well approximated by a simple box. Almost all existing TAMP approaches assume that the manipulable objects can be faithfully modeled using a simple shape primitive for the purpose of manually specifying grasps. The goal in this task is for all objects to be on a blue target region, which corresponds to the following logical formula:
In each trial, a single object is placed arbitrarily on the table. The five objects we used across the five trials were a bowl, a real power drill, a plastic banana, a cup, and a tennis ball. Figure 10 demonstrates a successful trial where the object was a bowl. A video of this trial is available at: https://youtu.be/PREUU8nVetI.
VIII-A2 Task 2
This task evaluates our system’s ability to safely place multiple objects in tight regions. The goal in this task is also for all objects to be on a blue target region. Two objects are initially present on the table, so the robot must plan a pair of placements and motions for the objects that avoid collision. Figure 11 visualizes a successful trial involving a mustard bottle and a toy drill. A video of this trial is available at: https://youtu.be/BPa_Mpkf31M.
VIII-A3 Task 3
This task evaluates our system’s ability to react to unexpected observations. The goal in this task is also for all objects to be on a blue target region. The task was presented in the introduction to the paper (see Figure 1).
This task had the only failed trial among all five tasks. Figure 12 visualizes the failed trial. The cracker box made contact with the occluded objects when lifted and knocked the tennis ball to the end of the table. Upon re-observation, the robot identifies all three objects and deduces that the goal conditions are not satisfied. However, the tennis ball is now outside the robot’s reachable workspace, so the robot fails to find a plan that achieves the goal within a generous timeout and does not complete the task. A video of this trial is available at: https://youtu.be/NrTug_1EluI.
VIII-A4 Task 4
This task evaluates our system’s ability to reason about collisions when placing a target object. The goal in this task is for the object that is closest in hue to red to be on a blue target region, which corresponds to the following logical formula:
Initially, two objects that are far away in hue from red are located on the blue target region, occupying most of the region, and a third object that is close in hue to red is on the table. Figure 13 shows a successful trial. The robot picks and relocates a sugar bottle and mustard bottle that initially cover the blue target region in order to make room for the toy drill to be safely placed on the region. A video of this trial is available at: https://youtu.be/uqZT5gUBOo0.
VIII-A5 Task 5
This task evaluates our system’s ability to reason about collisions when attempting to pick objects. The goal is for the object that is closest in color to yellow to be on a blue target region; the goal has a similar form to that in Task 4. Initially, a mustard bottle is placed near the exterior of the table, surrounded by three potentially obstructing objects placed near the table’s interior.
Figure 14 displays a successful trial. First, the robot picks and relocates the obstructing water bottle. Second, the robot picks and relocates the obstructing detergent bottle. It falls over during placement; however, the robot is able to infer this during its next observation and update its estimates. Finally, the robot picks the mustard bottle and places it on the blue target region. A video of this trial is available at: https://youtu.be/qBD2FyR2ktc.
IX Extensions to M0M
We have presented a simple instance of the M0M framework, which is already quite capable, as illustrated by experimental results in Section VIII. It does have several assumptions and simplifications which can be removed, providing a path to even more general and capable systems.
Object categories can play an important role in supporting the inference of latent object properties. For example, recognizing that an object is likely to be an instance of the coffee mug category, based on its shape and appearance, might allow us to make additional inferences about its material, functional properties (can contain hot liquid), and parts (its opening and handle). It is straightforward to augment the planning model so that if an object is perceived to have some property or class membership, then additional properties are inferred. This capability enables examples, illustrated in Appendix -C, in which, by recognizing an object as belonging to the category bowl, we infer that objects can be dropped into it.
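The kind of augmentation described above can be illustrated with a small forward-chaining sketch. It is not the planner's actual mechanism: the rule format and the predicate names (`Bowl`, `Cup`, `Container`, `CanDropInto`) are hypothetical, chosen only to mirror the bowl example.

```python
def infer_properties(facts, rules):
    """Forward-chain unary rules over a set of (predicate, object) facts.

    rules is a list of (premise, conclusion) predicate-name pairs, meaning
    "any object with the premise property also has the conclusion property".
    """
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for predicate, obj in list(derived):
                if predicate == premise and (conclusion, obj) not in derived:
                    derived.add((conclusion, obj))
                    changed = True
    return derived


# Hypothetical rules: bowls and cups are containers; containers can be dropped into.
RULES = [("Bowl", "Container"), ("Cup", "Container"), ("Container", "CanDropInto")]
```

Given a detected fact such as `("Bowl", "obj1")`, the derived set would also contain `("CanDropInto", "obj1")`, which a drop action could then use as a precondition.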
Additional manipulation operations
Extending the system to have additional manipulation operations is somewhat more complex, but the work is substantially amortized, again, over object arrangements, shapes, and goals. Added operations can be smoothly combined with the existing prehensile operation to generate a rich class of plans. For example, to add the ability to push an object, it would be necessary to add:
a pushing controller to the robot description (although open-loop pushing could be accomplished with an existing position controller);
a push operator description that models the predicted change in object pose after a push;
a plan-push sampler, which can generate diverse choices of possible paths along which an object can be pushed, subject to some constraints which may include start and target poses.
Similar characterizations can be given for operations such as pouring and scooping, opening a child-proof bottle using impedance control, moving kinematic objects such as drawers and doors, and many more.
In the basic system, there is no memory; actions are selected based only on the current view, which, as in many other vision-based manipulation approaches, is assumed to be imperfect but sufficient for acting in the world. For robustness, it is critical to integrate observations over time (e.g., to remember objects that were once visible but are now not) and to integrate the predicted effects of actions (e.g., to increase the belief that an object is located in a bowl after the robot drops it there, even if it cannot be observed inside the bowl from the current angle). In addition, it can be beneficial to be able to fuse information from other sensory modalities, including tactile and auditory sensing.
The current perception system generates a single hypothesis about the world state, which is used by the planner to select actions as if it were true. Because it does not take into account the degree to which the robot is uncertain about the world state when it selects actions, it cannot decide that in some situations it would be better to do explicit information-gathering rather than pursue its goal more directly given a point estimate of the state. Previous work [9, 10, 11] has provided methods for TAMP in belief space, but addressed uncertainty only in robot base and object instance pose, not in object shape or other properties. Future work involves integrating these approaches with the proposed approach.
We have demonstrated an instance of a strategy for designing and building very general robot manipulation systems using a combination of analytical and empirical methods. The system is a closed-loop policy that maps from images to position commands and generalizes over a broad class of objects, object arrangements, and goals. It is able to solve a larger class of open-world sequential manipulation problems than methods that are either purely analytical (using classic hand-built algorithms for perception, planning, and control) or purely empirical (using modern methods for learning goal-conditioned policies).
We gratefully acknowledge support from NSF grant 1723381; from AFOSR grant FA9550-17-1-0165; from ONR grant N00014-18-1-2847; from the Honda Research Institute; and from MIT-IBM Watson Lab. Caelan Garrett and Aidan Curtis are supported by NSF GRFP fellowships. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.
-  L. Kaelbling and T. Lozano-Perez, “Hierarchical task and motion planning in the now,” ICRA, 2011.
-  S. Srivastava, E. Fang, L. Riano, R. Chitnis, S. Russell, and P. Abbeel, “Combined task and motion planning through an extensible planner-independent interface layer,” ICRA, 2014.
-  M. Toussaint, “Logic-geometric programming: An optimization-based approach to combined task and motion planning,” in IJCAI, 2015.
-  N. T. Dantam, Z. K. Kingston, S. Chaudhuri, and L. Kavraki, “An incremental constraint-based framework for task and motion planning,” IJRR, vol. 37, 2018.
-  C. R. Garrett, T. Lozano-Perez, and L. P. Kaelbling, “PDDLStream: Integrating symbolic planners and blackbox samplers,” in ICAPS, 2020.
-  M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. W. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba, “Learning dexterous in-hand manipulation,” IJRR, vol. 39, 2020.
-  A. Nagabandi, K. Konolige, S. Levine, and V. Kumar, “Deep dynamics models for learning dexterous manipulation,” in Conference on Robot Learning (CoRL), 2019.
-  C. Garrett, R. Chitnis, R. Holladay, B. Kim, T. Silver, L. P. Kaelbling, and T. Lozano-Perez, “Integrated task and motion planning,” Annual Review: Control, Robotics, Autonomous Systems, vol. 4, 2021.
-  L. Kaelbling and T. Lozano-Perez, “Integrated task and motion planning in belief space,” IJRR, vol. 32, 2013.
-  D. Hadfield-Menell, E. Groshev, R. Chitnis, and P. Abbeel, “Modular task and motion planning in belief space,” IROS, 2015.
-  C. R. Garrett, C. Paxton, T. Lozano-Pérez, L. P. Kaelbling, and D. Fox, “Online Replanning in Belief Space for Partially Observable Task and Motion Problems,” in ICRA, 2020.
-  M. Gualtieri and R. W. Platt, “Robotic pick-and-place with uncertain object instance segmentation and shape completion,” IEEE Robotics and Automation Letters, vol. 6, pp. 1753–1760, 2021.
-  M. Gualtieri, A. ten Pas, and R. Platt, “Pick and place without geometric object models,” ICRA, 2018.
-  C. Mitash, R. Shome, B. Wen, A. Boularias, and K. Bekris, “Task-driven perception and manipulation for constrained placement of unknown objects,” Robotics and Automation Letters, vol. 5, 2020.
-  T. Patten, M. Zillich, and M. Vincze, “Action selection for interactive object segmentation in clutter,” IROS, 2018.
-  L. Wong, L. Kaelbling, and T. Lozano-Perez, “Manipulation-based active search for occluded objects,” ICRA, 2013.
-  J. Li, D. Hsu, and W. S. Lee, “Act to see and see to act: POMDP planning for objects search in clutter,” IROS, 2016.
-  A. Kurenkov, J. Taglic, R. Kulkarni, M. Dominguez-Kuhne, A. Garg, R. Martín-Martín, and S. Savarese, “Visuomotor mechanical search: Learning to retrieve target objects in clutter,” ArXiv, vol. abs/2008.06073, 2020.
-  Z. Wang, C. R. Garrett, L. Kaelbling, and T. Lozano-Perez, “Learning compositional models of robot skills for task and motion planning,” IJRR, 2020.
-  R. Holladay, T. Lozano-Pérez, and A. Rodriguez, “Planning for multi-stage forceful manipulation,” in ICRA, 2021.
-  E. P. D. Pednault, “ADL: Exploring the middle ground between STRIPS and the situation calculus,” KR, vol. 89, pp. 324–332, 1989.
-  S. Thiébaux, J. Hoffmann, and B. Nebel, “In defense of PDDL axioms,” Artificial Intelligence, vol. 168, no. 1-2, pp. 38–69, 2005.
-  R. Diankov, “Automated construction of robotic manipulation programs,” Ph.D. dissertation, Robotics Institute, Carnegie Mellon University, 2010.
-  J. J. Kuffner Jr. and S. M. LaValle, “RRT-Connect: An efficient approach to single-query path planning,” in IEEE International Conference on Robotics and Automation (ICRA), 2000.
-  M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” CACM, vol. 24, no. 6, 1981.
-  K. Goldberg, B. V. Mirtich, Y. Zhuang, J. Craig, B. R. Carlisle, and J. Canny, “Part pose statistics: Estimators and experiments,” IEEE Transactions on Robotics and Automation, vol. 15, no. 5, pp. 849–857, 1999.
-  E. G. Gilbert, D. W. Johnson, and S. S. Keerthi, “A fast procedure for computing the distance between complex objects in three-dimensional space,” IEEE Journal on Robotics and Automation, vol. 4, no. 2, pp. 193–203, apr 1988.
-  E. Coumans and Y. Bai, “PyBullet, a Python module for physics simulation for games, robotics and machine learning,” 2016. [Online]. Available: http://pybullet.org
-  C. Xie, Y. Xiang, A. Mousavian, and D. Fox, “Unseen object instance segmentation for robotic environments,” in arXiv:2007.08073, 2020.
-  M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in KDD, 1996.
-  M. Suchi, T. Patten, D. Fischinger, and M. Vincze, “EasyLabel: A semi-automatic pixel-wise object annotation tool for creating robotic RGB-D datasets,” in ICRA, 2019.
-  H.-S. Fang, C. Wang, M. Gou, and C. Lu, “Graspnet-1billion: A large-scale benchmark for general object grasping,” in CVPR, 2020.
-  M. Liu, L. Sheng, S. Yang, J. Shao, and S.-M. Hu, “Morphing and sampling network for dense point cloud completion,” in AAAI, vol. 34, no. 07, 2020.
-  H. Edelsbrunner, D. Kirkpatrick, and R. Seidel, “On the shape of a set of points in the plane,” IEEE Transactions on Information Theory, vol. 29, no. 4, 1983.
-  M. Müller, N. Chentanez, and T.-Y. Kim, “Real time dynamic fracture with volumetric approximate convex decompositions,” ACM Transactions on Graphics (TOG), vol. 32, no. 4, pp. 1–10, 2013.
-  M. Gualtieri, A. T. Pas, K. Saenko, and R. Platt, “High precision grasp pose detection in dense clutter,” IROS, 2016.
-  A. Mousavian, C. Eppner, and D. Fox, “6-dof graspnet: Variational grasp generation for object manipulation,” CoRR, vol. abs/1905.10520, 2019.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: towards real-time object detection with region proposal networks,” PAMI, vol. 39, no. 6, 2016.
-  A. Nguyen, D. Kanoulas, D. G. Caldwell, and N. G. Tsagarakis, “Object-based affordances detection with convolutional neural networks and dense conditional random fields,” in IROS, 2017.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in ICCV, 2017.
-  Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “PoseCNN: A convolutional neural network for 6d object pose estimation in cluttered scenes,” in RSS.
-  B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and A. M. Dollar, “The YCB object and model set: Towards common benchmarks for manipulation research,” in ICAR, 2015.
-  A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “ShapeNet: An Information-Rich 3D Model Repository,” arXiv:1512.03012, 2015.
-  C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3d classification and segmentation,” in CVPR, 2017.