Log In Sign Up

Object-Centric Task and Motion Planning in Dynamic Environments

by   Toki Migimatsu, et al.

We address the problem of applying Task and Motion Planning (TAMP) in real world environments. TAMP combines symbolic and geometric reasoning to produce sequential manipulation plans, typically specified as joint-space trajectories, which are valid only as long as the environment is static and perception and control are highly accurate. In case of any changes in the environment, slow re-planning is required. We propose a TAMP algorithm that optimizes over Cartesian frames defined relative to target objects. The resulting plan then remains valid even if the objects are moving and can be executed by reactive controllers that adapt to these changes in real time. We apply our TAMP framework to a torque-controlled robot in a pick and place setting and demonstrate its ability to adapt to changing environments, inaccurate perception, and imprecise control, both in simulation and the real world.


Dynamically Constrained Motion Planning Networks for Non-Holonomic Robots

Reliable real-time planning for robots is essential in today's rapidly e...

Simultaneous Scene Reconstruction and Whole-Body Motion Planning for Safe Operation in Dynamic Environments

Recent work has demonstrated real-time mapping and reconstruction from d...

Provably Constant-time Planning and Replanning for Real-time Grasping Objects off a Conveyor Belt

In warehouse and manufacturing environments, manipulation platforms are ...

Cautious Planning with Incremental Symbolic Perception: Designing Verified Reactive Driving Maneuvers

This work presents a step towards utilizing incrementally-improving symb...

Predicted Composite Signed-Distance Fields for Real-Time Motion Planning in Dynamic Environments

We present a novel framework for motion planning in dynamic environments...

Towards Plan Transformations for Real-World Pick and Place Tasks

In this paper, we investigate the possibility of applying plan transform...

Real-time Model Predictive Control and System Identification Using Differentiable Physics Simulation

Developing robot controllers in a simulated environment is advantageous ...

I Introduction

Robot manipulation tasks often require a combination of symbolic and geometric reasoning. For example, if a robot wants to grasp an object outside of its workspace, a symbolic planner might devise a plan to use a long stick to bring the object closer, while a motion planner would determine where to grasp the stick and how to use it to push the object. Task and Motion Planning (TAMP) algorithms address this challenge of interleaving symbolic and geometric reasoning. They often produce full trajectories for the robot to execute. However, due to the PSPACE-hard complexity of motion planning [reif1979complexity], generating these trajectories can take on the order of minutes, and if the environment changes, these trajectories may become invalid and require replanning.

In this paper, we combine TAMP with reactive control to enable robot manipulation in real-world scenarios where the environment may change during execution, perception may be inaccurate, and robot control may be imprecise. Instead of generating full joint-space trajectories like most TAMP algorithms, our approach returns desired object poses at key timepoints. These object poses are defined relative to target frames, which may be moving. Optimizing over relative poses instead of joint configurations at this stage of planning facilitates integration with local controllers that can react to changes in the environment in real time. For example, if the robot needs to pick up an object but specifies this pick action using a desired joint configuration, this joint configuration becomes invalid as soon as the object’s position changes, perhaps due to perception uncertainty, leading to unintended contact or external disturbances. However, if the planner outputs a desired pose of the end-effector relative to the object, then this relative pose is still valid no matter where the object moves, eliminating the need to stop and replan. Figure 1 provides an overview of the proposed framework.

In addition to facilitating integration with reactive controllers, planning over object poses also has the benefit of being a more natural representation for object manipulation. The geometric constraints for actions such as pick, place, and push are straightforward in Cartesian space, which simplifies the process of defining new manipulation actions.

The main theoretical contributions of this work are twofold. First, we introduce a formulation of TAMP using object-centric frames that work with reactive controllers to accommodate for changing environments, inaccurate perception, and imprecise control. Second, we derive the SE(3) functions necessary for optimizing object-centric frames in pick and place applications. We then apply this TAMP framework to a manipulation setting where a robot can pick, place, and push, and demonstrate it working on a real robot where the environment state is visually tracked through RGB-D cameras.

Fig. 1: The proposed TAMP algorithm plans over relative Cartesian frames for integration with reactive controllers that can adapt to changes in the environment in real-time. The top row demonstrates the robot’s ability to complete the block stacking task without stopping to re-plan even when the poses of the grasped object and target location change. The bottom row summarizes our TAMP framework and shows an example execution on the real robot (see supplementary material for a video).

Ii Related Work

Task and motion planning has been widely studied since the introduction of STRIPS, the first automated task planner [fikes1971strips]. One common approach of TAMP algorithms is to interleave the symbolic and geometric search processes by calling a motion planner at every step of the symbolic search and tentatively assigning geometric parameters to the current symbolic state before advancing to the next one [plaku2010sampling, cambon2009hybrid].

One issue that arises from interleaving symbolic and geometric search is when a state found in the search process is symbolically legal but geometrically infeasible due to a previous instantiation of geometric parameters. lagriffoul2012constraint, lagriffoul2014efficiently introduce the concept of geometric backtracking, where the planner must search for alternative geometric instances until a solution is found. They propose an algorithm that maintains interval bounds over placement parameters to reduce the search space of geometric backtracking. bidot2017geometric

limit geometric backtracking with heuristics based on statistics for kinematic violations and collisions with movable objects.

de2013towards use Hierarchical Task Networks (HTNs) to perform symbolic search using hierarchically abstracted tasks, performing HTN backtracking whenever geometric backtracking fails. Hierarchical Planning in the Now (HPN) [kaelbling2010hierarchical] similarly uses a hierarchical approach, but interleaves planning with execution, such that primitive actions at the lowest level of the hierarchy are executed as soon as they are reached by the search algorithm. This requires that the actions themselves be reversible when backtracking is necessary.

While integrating geometric search with symbolic search can prune out large sections of the symbolic space, calling motion planning with every symbolic search increment can be costly, and even disadvantageous if most states are geometrically feasible [lagriffoul2013combining]. An alternative approach is to perform geometric search only on full candidate symbolic plans, or action skeletons. One method of conveying geometric feasibility back to the symbolic search is to construct predicates from the geometric parameters [lozano2014constraint, ferrer2017combined, srivastava2014combined]. dantam2016incremental use an incremental satisfiable modulo theory (SMT) solver to incrementally generate action skeletons and invoke a motion planner in between for validation. Similarly, zhang2016co formulate the TAMP problem as a Traveling Salesman Problem (TSP), using a motion planner in the inner loop of a symbolic planner to update the weights of the TSP.

While most TAMP algorithms use sampling-based motion planners, Toussaint et al. [toussaint2015logic, toussaint2018differentiable] use nonlinear optimization to identify geometric parameters for given action skeletons. This approach, called Logic Geometric Programming (LGP), is what we extend in this work to handle dynamic environments. Advantages of using nonlinear optimization include the ability to consider all geometric parameters simultaneously without the need for backtracking, as well as the ability to encode complex goals in the objective such as minimizing the energy of the robot or maximizing the height of a stack of blocks.

A common limitation of the TAMP methods above is that planning is slow, taking on the order of tens of seconds to minutes. To this end, garrett2018sampling reduce the sampling space of the motion planning problem by performing conditional sampling on factoring patterns in the action skeletons. wells2019learning use a Support Vector Machine

(SVM) to estimate the feasibility of actions to guide the symbolic search, only calling the motion planner on action skeletons classified by the SVM as fully feasible.

Even with these improvements in efficiency, the theoretical complexity of TAMP makes it impractical to execute inside a real-time closed-loop controller. Therefore, common assumptions in TAMP approaches are that objects in the environment are static, object poses are accurately perceived before planning, and robot commands are executed precisely. Otherwise, errors can accumulate, becoming problematic when a robot needs to grasp an object that it imprecisely placed earlier in the plan, for example. To deal with perception errors, suarez2018interleaving incorporate a symbolic action where the robot physically examines an object up close when perception uncertainty is high. However, this approach is still unable to handle dynamic environments.

We address these limitations by planning over relative object poses, so that an open-loop plan generated by the TAMP algorithm can still be executed by a real-time closed-loop controller even if object poses change during execution. This new planner is presented in the next section.

Iii LGP Background

Iii-a General LGP Formulation

An LGP is comprised of two subproblems: a STRIPS problem [fikes1971strips] specified using first-order logic that operates in a discrete domain, and a nonlinear trajectory optimization problem that operates in a continuous domain. The goal of an LGP is first to find a sequence of discrete actions that results in a discrete state trajectory such that the final state satisfies a set of goal propositions . Second, the LGP needs to find a sequence of continuous control inputs across time such that the resulting state trajectory satisfies the requirements of the discrete states . This can be framed as the following optimization problem: [1] a_1:K, u(t)h(x(T)) + ∫_0^T g(x(t), u(t))  dt x(0)= x_init,  s_K ⊨g ˙x(t) = f_path(x(t), u(t), s_k(t)) t ∈[0, T] ˙x(t_k)= f_switch(x(t_k), u(t_k), a_k) k = 1, …, K s_k∈succ(s_k-1, a_k) k = 1, …, K

is the terminal cost and is the trajectory cost typical in optimal control problems. describes the continuous system dynamics, which change depending on the discrete state at a given timestep . An example of this dependency is when a robot manipulator throws a ball—while the manipulator is holding the ball, the ball’s trajectory is determined by the manipulator’s dynamics, but after release, it follows unconstrained projectile dynamics. describes the instantaneous system dynamics at timesteps when the discrete action changes. An example of such a constraint is when a robot uses a stick to hit a ball—at the time of impact, the ball’s acceleration is determined by the instantaneous impulse imparted by the stick.

In STRIPS planning, each action defines a set of preconditions that must be true before the action is performed and a set of postconditions that will be true after the action is performed. These discrete transition dynamics are encoded in the constraint using first-order logic.

While the LGP could be solved as a Mixed Integer Program, the combined complexity of STRIPS planning and nonlinear optimal control make it intractable to solve exactly. toussaint2015logic proposes an algorithm that approximately solves LGPs by breaking it into a multi-part process. First, a tree search is performed in the STRIPS domain to find a candidate action skeleton . Next, the action skeleton spawns a trajectory optimization problem whose constraints are defined by the action skeleton and corresponding symbolic states. This trajectory optimization is first solved over key timesteps for , and then subsequently over the full trajectory . This optimization will either produce a full trajectory with an objective score, or fail to find a solution, which means the candidate action skeleton is not physically feasible. The tree search continues until the user decides to terminate, at which point the trajectory with the smallest objective score gets returned.

Iii-B Joint Space Formulation

toussaint2018differentiable formulate the trajectory optimization subproblem as a -order Markov Optimization (KOMO) problem, which discretizes time and represents time derivatives of position (up to th order) with their discrete equivalents. The state space and control inputs are defined to be the manipulator configuration extended with an extra 6-dof free body joint for each object the robot is manipulating. This augmented configuration is represented by , where is the number of objects the robot is manipulating at timestep . Note that the size of changes with . ¯q_0:Th(¯q_T) + ∑_t=0^T g(¯q_t-2:t) ¯q_-2:0= ¯q_init f_path_k(t)(¯q_-2:t)= 0 t = 1, …, T f_switch_k(¯q_-2:t_k)= 0 k = 1, …, K

The objective term is a function that can take into account the position, velocity, and acceleration at each timestep. The sum of these functions across all timesteps results in a banded Hessian that can be inverted efficiently for Gauss-Newton methods. However, the constraint functions depend on the entire trajectory of joint positions—for example, the constraint associated with placing an object on a table is affected by how it was picked up at a previous timestep, since this determines the configuration of the object in the robot’s end-effector. This coupling between timesteps makes the Jacobians for these constraints complex, since the constraint functions are dependent on all previous timesteps of the trajectory, not just the timestep of the constraint itself.

Iv Cartesian Formulation

In this section, we introduce the new Cartesian frame formulation that enables adaptation to dynamic environments. At the most basic level, a manipulation task can be reduced to controlling a point relative to some reference frame, e.g., controlling the tip of a hammer relative to the head of a nail. Instead of optimizing over augmented joint configurations of the robot and manipulated objects, the optimization variables in this new formulation are defined to be the relative pose between the control and target frames at each timestep. In a real system, this pose could be observed from a combination of 3D pose estimation and the forward kinematics of the robot.

Let and be the control and target frames determined by the action at time in the action skeleton. The relative pose of in is represented by a 6-dof variable , where is the relative position and is the relative orientation in axis-angle form. The axis-angle representation is chosen for the orientation because it does not have any constraints over its parameters and does not suffer from kinematic singularities like gimbal lock. Although it has representation singularities at rotations of , this should not pose a problem for the optimizer. Note that is a representation of the Euclidian lie algebra . ξ_0:Th(ξ_0:T) + ∑_t=1^T g_t(ξ_0:t) ξ_0= ξ_init f_path_k(t)(ξ_t)= 0 t = 1, …, T f_switch_k(ξ_t_k)= 0 k = 1, …, K

In this Cartesian formulation of the problem, we only consider the positions of the frames in the trajectory and ignore velocities and accelerations. Dynamic motions that require controlling velocities and accelerations are more likely to require high control frequencies that this stage of the trajectory planning cannot offer. Therefore, to enable real-time reactive behavior, the trajectory planner considers only key positions in the trajectory (e.g., at all the switch timesteps) and leaves the higher derivatives up to controllers designed for each action.

Fig. 2: Optimization variables define the 6-dof pose of the control frame relative to the target frame. The place task is unaffected by the grasp executed by the preceding pick task, since it only involves the relative pose between the two boxes. This decouples the action constraints in the optimization problem.

One advantage of optimizing over local frames, in addition to facilitating integration with local controllers and enabling reactive behavior to changing object positions, is that the optimization constraints become decoupled in time. For example, with joint configuration variables, the constraint for placing an object on a table depends on the configuration used to pick up the object, since this determines the pose of the object relative to the end-effector. However, in the Cartesian formulation, if we define the variable for the place constraint to be the relative pose between the object and the table, then this constraint no longer depends on the configuration used to pick up the object, since the relative pose of the object in the end-effector is irrelevant (see Figure 2). This decoupling simplifies the Jacobians for the constraints, since each constraint is only a function of a configuration variable at a single time step.

Instead, the cross-time dependency gets pushed onto the objective function, which may measure something like the total distance travelled by the end-effector, described below.

Iv-a Objective Function

The objective function for our trajectory optimization problems measures a combination of the linear and angular distance travelled by the end-effector. Let be the vector concatenation of . Given scaling factors , the combined distance between timesteps and is


where and compute the absolute position and orientation, respectively, of the end-effector at time given the entire history of relative pose variables . is the SO(3) logarithmic map that in this case computes the shortest axis-angle rotation between and .

Such an objective function requires mapping the relative variables to the absolute pose of the end-effector in the world frame, which may depend on the control actions leading up to the current timestep. This makes the gradient for the objective function more complex, but this may be outweighed by the benefit of simpler constraint Jacobians, which greatly simplifies the process of defining new actions for the optimization framework. The mapping from relative poses to absolute poses is described in detail below.

Iv-B Absolute Pose

Fig. 3: Diagram of the kinematic tree. In the initial state, the hook and shelf are on the table, and the table and end-effector are positioned with respect to the world. During , defines the end-effector’s pose relative to the hook such that the hook becomes the end-effector’s parent. During , defines the hook’s pose relative to the shelf, but the end-effector is still fixed to the hook via . Thus, computing the absolute pose of the end-effector at involves composing the relative poses down the chain from the end-effector to the world.

The variable defines the pose of relative to . This means that is kinematically attached to , because any pose change of will result in the same pose change of . This means that computing the absolute world pose of requires computing the pose of first. The pose of may be defined relative to its own kinematic parent, forming a tree structure with the world frame at the root.

The kinematic tree changes at each timestep, when a manipulation action changes the kinematic parent of the control frame to a new target frame (see Figure 3). This change lasts for all subsequent timesteps until the same control frame is manipulated again. For example, an action that places a ball on the table at time defines to be the ball frame and the table frame. Until the ball is manipulated by another action, such as , the ball will remain fixed to the table, with its relative pose defined by .

At any given timestep , the pose of frame relative to its parent is either equivalent to its relative pose at the previous timestep or is defined by a new pose . Formally, this can be written with the recursive definition


where and are the mappings from relative poses to the position and orientation, respectively, of frame in its parent . is the SO(3) exponential map that converts the axis-angle representation to its rotation matrix representation.

We can compute the pose of frame in any frame by applying (2) through the kinematic tree from to . The absolute world position and orientation and of the end-effector is computed in this manner.

Any objective function involving and requires the Jacobians and for gradient-based optimization methods. Because the combination of and is a representation of the special Euclidian group SE(3), and is a representation of the associated lie algebra , deriving these Jacobians is non-trivial. Their closed form solutions are given in Appendix A.

V Manipulation Primitives

In this section, we define the constraint functions for each primitive manipulation action as visualized in Figure 4.

V-a Pick

The action allows the manipulator to pick up a movable object . The control frame is the end-effector and the target frame is object . The preconditions are that the end-effector cannot be holding any objects before, and the postconditions are that is in the hand and no longer placed on another object. The symbolic definition is:

Parameters: : ,



The constraint associated with requires that the end-effector control point be inside the object, or the signed distance between the control point and object is negative. Here, we assume that a lower level controller will be able to perform the actual grasp given a rough desired position on the object.

Given function that finds the closest point on the mesh of object to point , and function that finds the mesh normal of object at point , we can compute the signed squared distance as follows:


Note that represents . A closed-form solution for the Jacobian of the signed squared distance can be derived by linearizing the mesh at the projection point. The Jacobian then points in the direction of the normal at the projection point.

Fig. 4: The constraint requires that the signed distance between the end-effector and target is negative. The constraint requires that is above and that its center of mass is inside the support area of . The constraint requires that the contact normal between and passes through the center of mass of and points in the desired direction of the push.

V-B Place

The action allows the manipulator to place object on object . The control frame is object and the target frame is object . The preconditions are that must be in the end-effector and must be in the workspace of the robot. The postconditions are that is no longer in the end-effector and is on . The symbolic definition is:

Parameters: : , ; :



The constraint is a concatenation of , , and , which constrains the pose of such that it is touching , inside the support area of , and supported by a contact normal in the opposite direction of gravity. These constraints are outlined below.

V-B1 Touch

Suppose we have a function that computes the signed distance between the closest points and between and (or the farthest points of overlap if they are interpenetrating). This can be computed with the GJK algorithm [gilbert1988fast]. The constraint function is:


Like the action, the linear portion of the Jacobian can be computed by linearizing the mesh at the contact point. The angular portion is approximated by finite differencing.


V-B2 Support Area

To ensure that is supported by , we check that the center of mass of is above the 2d projection of in the plane perpendicular to gravity. If we assume the frame of is positioned at its center of mass, then represents , the center of mass of in at timestep . This constraint is then equivalent to , except in 2d.


In the case where is non-convex, we split into convex regions and compute the above function with each region’s center of mass. The Jacobian is:


If is non-convex, the angular portion of the Jacobian is computed with finite differencing. Otherwise, it is zero.

V-B3 Support Normal

To ensure that the contact normal between and is pointing in the right direction (against gravity), we use a heuristic by constraining the height of to be above the center of mass of . This simplification assumes that is convex and relatively flat, which means this constraint may produce a physically infeasible solution if is something like a sphere, where a surface normal above the center of mass may be pointing sideways, not against gravity. However, objects with such geometry are also difficult to use as placement targets in the real world, and can be filtered out from the optimization with symbolic preconditions.


V-C Push

The action allows the manipulator to use to push on top of in cases where is outside the manipulator’s workspace. The preconditions are that has to be held by the end-effector, has to be on , and has to be at least partially in the workspace of the manipulator. The postcondition is that is in the manipulator’s workspace, which assumes that the trajectory optimizer will be able to find a target push location inside the workspace. If such a location cannot be found, the optimizer will fail and a different action skeleton will be chosen. The symbolic definition is:

Parameters: : , ; : , , :



This action is composed of two timesteps. First, the manipulator must position so that it is ready to push . Second, it must push to the target position. In the first timestep, the control frame is and the target frame is . In the second timestep, the control is and the target is . Because is a child of , it will follow wherever it is controlled.

The constraint at the first timestep is a concatenation of from and a constraint that constrains the vector from the point of contact to the object’s center of mass to be in the direction of the push. The constraint at the second timestep restricts the target push position to be along the surface of , which is assumed to be flat. Because this action involves three frames , , and across two timesteps, it introduces coupling between three timesteps to compute the relative poses of the three frames. The first timestep relates frame to , and the second relates to , but to relate to , it is necessary to look at when was last placed in . This coupling is limited to at most 3 timesteps, meaning the Jacobian depends on at most 18 variables.

V-C1 Push Normal

Let be a function that finds the point of intersection between a ray with origin and direction and object .


The linear portion of the Jacobian at the first timestep can be simplified by assuming the point of contact on will stay fixed to for small changes in the position of . Then, the Jacobian is simply


The rest of the Jacobian can be computed with finite differencing in the remaining variables.

V-C2 Push Direction

If we assume that has a flat surface whose normal points in the direction, this constraint can simply constrain movement along and rotation about and to be zero. Let be the last timestep when was positioned relative to , where -1 indicates that has not been manipulated yet. is the corresponding variable, or if , then the initial pose of in .


The Jacobian for the second timestep is


and the Jacobian for , if it exists, is the negation of (13).

V-D Collision

Every manipulation action includes an additional constraint to ensure objects do not collide with each other. The objects can be divided into two sets: the manipulated objects , which consist of all objects between the end-effector and the control frame, and the environment objects , which consist of all other objects. Let be the signed distance between the closest points between and (equivalent to the first value in the tuple returned by . The constraint function is:


The Jacobian is computed with numerical differentiation.

Vi Implementation

Vi-a Planning

Our STRIPS planner takes in a PDDL specification and performs a simple breadth first tree search to find candidate action skeletons. For more complex symbolic problems, a state of the art planner such as FD [helmert2006fast] might be considered.

Our Cartesian frame optimizer currently leverages two different nonlinear optimizers: IPOPT [biegler2009large] and NLOPT [johnson2014nlopt]. Any state of the art nonlinear optimizer can be plugged in at this stage by writing wrappers for our LGP optimization API. IPOPT is an interior point method, while NLOPT features many methods. We chose the Augmented Lagrangian method with Sequential Quadratic Programming for its speed and robustness. All optimization variables are initialized to zero.

Vi-B Control

The , , and actions use operational space control [khatib1987unified]. An attractive potential field in Cartesian space is created between the control frame and target frame to allow the robot to track moving targets. Repulsive potential fields are used to avoid obstacles. We choose to use torque control so that the robot can safely interact with the environment even with large perception uncertainty. Details of this controller are specified in Appendix B.

For grasping, the frame optimization simply outputs a position inside the object. To convert this position into a grasp pose for a two-fingered gripper, we use a simple heuristic that finds the top-down orientation at the optimized position with the smallest object width between the fingers. This heuristic may fail if the optimizer outputs a position that is not graspable, but it is sufficient for the simple objects in our application.

Vi-C Perception

To track the 6-dof poses of objects, we use DBOT with particle filtering [wuthrich2013probabilistic, issac2016depth] using depth data from a Kinect v2 downsampled to . Although DBOT can track objects with centimeter accuracy, even if they are partially occluded, it requires an initialization of the object poses. Furthermore, the particle filter can lose sight of objects if they move too quickly while occluded. To alleviate these two issues, we use ArUco fiducial marker tracking [garrido2014automatic] as a control input to DBOT to bias the tracking particles towards the expected global pose.

Vii Results

We present results for two different planning problems: Tower of Hanoi and Workspace Reach. The actions used in both of these problems are the ones described in Section V. We test the framework on a 7-dof Franka Panda fitted with a Robotiq 2F-85 gripper in simulation, and then demonstrate the Workspace Reach problem in the real world. Videos of the results can be found in the supplementary material.

Solution Optimal score Time [s]
Tower on left plate 0.444 3.63
Tower on middle plate 0.404 0.96
Hook on table 0.254 0.274 1.18 0.87
Hook on shelf 0.273 0.358 3.71 1.42
Hook on box 0.287 0.283 1.14 0.85
Table I: IPOPT vs. NLOPT: Optimization Scores and Times

Vii-a Tower of Hanoi

In the Tower of Hanoi problem, there are three plates upon which towers of blocks can be formed decreasing in size from bottom to top. The task is to transfer a tower of blocks from a plate on the right side to one of two other plates: in the middle or on the left. For a tower of three blocks, the minimum number of actions needed to complete the task is 14. This problem suffers from long-term coupling between timesteps, since the placement of blocks at the bottom of the stack affects later blocks. Using relative pose variables, however, removes this coupling from the pick and place constraints.

Given a maximum symbolic tree search depth of 14, STRIPS planning finds two candidate action skeletons, each one transferring the tower to a different plate. Transferring the tower to the middle plate yields a lower optimization score than transferring to the left. This difference is also reflected in the robot’s execution time for each action skeleton: transferring to the middle takes 25s while transferring to the left takes 28s.

The combined TAMP optimization takes a total of 4.02s for this problem when using IPOPT as the nonlinear solver. As shown in Table I, NLOPT failed to produce a feasible solution for this problem.

Vii-B Workspace Reach

Fig. 5: Optimization convergence with IPOPT. Convergence is fast, but the highly nonlinear nature of the problem makes it susceptible to local minima. The spikes occur when the IPOPT algorithm tries to escape infeasible minima.

In this problem, the robot’s goal is to place a box on a shelf, but the box is outside its workspace. The world also contains a hook within the robot’s workspace that can be used to bring the box closer.

With a maximum search depth of 5, there are three candidate action skeletons of the following form: 1. 2. 3. 4. 5. . The three action skeletons differ in where the hook gets placed after the action: on the table, shelf, or box. Frame optimization reveals that placing the hook on the table minimizes the total distance travelled by the end-effector.

The combined TAMP optimization takes a total of 3.76s for this problem with IPOPT. Figure 5 shows the convergence trends for the three different optimization subproblems, and Table I compares optimization scores and times between IPOPT and NLOPT. In general, we found IPOPT to produce more optimal results with fewer constraint violations. However, it is sensitive to the initial poses of objects in the environment; it can take up to a minute to find a feasible solution, if at all, depending on the problem instance.

Viii Conclusion

In this paper, we proposed a new TAMP algorithm based on the optimization of Cartesian frames defined relative to target objects. By design, these plans can be easily integrated with reactive closed-loop controllers that can adapt to changes in the environment in real-time and handle imperfect perception and control. While previous works on TAMP are limited to simulation or tightly controlled environments with accurate perception, we demonstrate our approach working on a real robot with noisy perception from RGB-D cameras in an environment with moving objects subject to external disturbances.

One limitation of using Cartesian frames instead of joint configurations is that robot collisions and joint limits are difficult to encode. Our TAMP method only checks object-level collisions, relying on lower level motion planners or controllers to avoid collisions with robot links and joint limits. In extreme cases, however, this heuristic would not be sufficient, and an approach that combines Cartesian and configuration variables in an augmented representation as done in [lozano2014constraint] may be more appropriate.

Manipulation actions in our TAMP framework are defined by their STRIPS preconditions and postconditions, their control and target frames, and their constraint functions for optimization. The effort required to define new actions makes scalability an issue. Future work includes using learning-based approaches to alleviate the engineering bottlenecks, such as learning the preconditions and postconditions of the symbolic actions, as is considered by huang2019continuous, or learning the constraint functions from demonstration.


Toyota Research Institute (”TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.


Appendix A SE(3) Jacobians

Objective functions used with our optimization framework will likely require the computation of object poses in world space, such as minimizing the distance travelled by the end-effector or maximizing the height of a tower of blocks. In order to use gradient-based optimization methods, it is necessary to compute the Jacobians of the object poses (in the special Euclidian group ) with respect to the optimization variables (in the Euclidian lie algebra ). Below, we give the closed form solutions of the end-effector Jacobians and . These Jacobians can be generalized to any object in the kinematic tree.

A-a Position Jacobian

From (2), it is straightforward to derive the block of the Jacobian corresponding to frame

using the chain rule. First, let

be the set of all ancestor frames of frame at time , including .


Let be the last timestep before where frame was manipulated, and let be the set of all last manipulated timesteps of the ancestors of .


Intuitively, if , then variable affects the global pose of frame at time . Now, the Jacobian block is:


The matrix in the equation above is a Jacobian of the exponential map of . Let and . This Jacobian above relates the change in the rotated vector to the change in coordinates . Let be the magnitude of , or equivalently, the angle of as an axis-angle representation. The closed form solution of the exponential map Jacobian is given below.


In the case where , we can use the approximation:


A-B Orientation Jacobian

The angular portion of the objective function measures the angle between the orientations of two consecutive timesteps and . Let . By using the formula for finding the angle of a rotation matrix, we can compute the derivative of the norm of the logarithmic map as the derivative of a matrix trace.


Because the frame orientations do not depend on the position variables for , the Jacobian segments are 0. Let . Using the chain rule for matrices from [petersen2008matrix], we can find write the th element of the gradient of the trace as:


The rotation as a function of takes on one of the following forms, depending on the kinematic tree:


where are constant in terms of . Letting , the corresponding Jacobians are:


Finally, the Jacobian can be derived from (18) by setting to be each of the unit axis vectors.


Here, is the th column of .

Appendix B Operational Space Controller

The reactive behavior of operational space control at the object level makes it a natural choice for the simpler manipulation actions, like , , and . An attractive potential field is constructed between the control frame and target frame, and repulsive fields are constructed between the control frame and all other objects.

The attractive potential field is computed using the pose of the control and target frames obtained from perception.


Here, and are the position and damping gains, respectively, of the PD controller. The desired angular acceleration can be computed with analagous equations for orientation.

Because the control frame needs to be able to make contact with the target object, the repulsive field should only dampen the end-effector’s velocity towards the closest obstacle.


is a gain controlling the strength of damping towards the obstacle and is an indicator function that returns 1 when the condition is true and 0 otherwise.

The desired end-effector acceleration passed to the operational space controller is .