
kPAM: KeyPoint Affordances for Category-Level Robotic Manipulation

by   Lucas Manuelli, et al.

We would like robots to achieve purposeful manipulation by placing any instance from a category of objects into a desired set of goal states. Existing manipulation pipelines typically specify the desired configuration as a target 6-DOF pose and rely on explicitly estimating the pose of the manipulated objects. However, representing an object with a parameterized transformation defined on a fixed template cannot capture large intra-category shape variation, and specifying a target pose at a category level can be physically infeasible or fail to accomplish the task -- e.g. knowing the pose and size of a coffee mug relative to some canonical mug is not sufficient to successfully hang it on a rack by its handle. Hence we propose a novel formulation of category-level manipulation that uses semantic 3D keypoints as the object representation. This keypoint representation enables a simple and interpretable specification of the manipulation target as geometric costs and constraints on the keypoints, which flexibly generalizes existing pose-based manipulation methods. Using this formulation, we factor the manipulation policy into instance segmentation, 3D keypoint detection, optimization-based robot action planning and local dense-geometry-based action execution. This factorization allows us to leverage advances in these sub-problems and combine them into a general and effective perception-to-action manipulation pipeline. Our pipeline is robust to large intra-category shape variation and topology changes as the keypoint representation ignores task-irrelevant geometric details. Extensive hardware experiments demonstrate our method can reliably accomplish tasks with never-before seen objects in a category, such as placing shoes and mugs with significant shape variation into category level target configurations.





I Introduction

This paper focuses on pose-aware robotic pick and place at a category level. Contrary to single-instance pick and place, the manipulation policy should generalize to potentially unknown instances in the category with different shape, size, appearance, and topology. These tasks can be easily described using natural language, for example “put the mugs upright on the shelf,” “hang the mugs on the rack by their handle” or “place the shoes onto the shoe rack.” However, converting these intuitive descriptions into concrete robot actions remains a significant challenge. Accomplishing these types of tasks is of significant importance to both industrial applications and interactive assistant robots.

While a large body of work addresses robotic picking for arbitrary objects [5, 14, 27, 7], existing methods have not demonstrated pick and place with an interpretable and generalizable approach. One way to achieve generalization is at the object category level, and perhaps the most straightforward approach is to attempt to extend existing instance-level pick and place pipelines with category-level pose estimators  [17, 25]. However, as detailed in Sec. IV, representing an object with a parameterized pose defined on a fixed geometric template, as these works do, may not adequately capture large intra-class shape or topology variations, and can lead to physical infeasibility for certain instances in the category. Other recent work has developed dense correspondence visual models, including at a category level, as a general representation for robot manipulation [2], but did not formulate how to specify and solve the task of manipulating objects into specific configurations. As a different route to address category-level pick and place, without an explicit object representation, [4] trains end-to-end policies in simulation to generalize across the object category. It is unclear, however, how to measure the reward function for this type of approach in a fully general way without an object representation that can adequately capture the human’s intention for the task.


Fig. 1: kPAM is a framework for defining and accomplishing category level manipulation tasks. The key distinction of kPAM is the use of semantic 3D keypoints as the object representation (a), which enables flexible specification of manipulation targets as geometric costs/constraints on keypoints. Using this framework we can handle wide intra-class shape variation (a) and reliably accomplish category-level manipulation tasks, such as perceiving (b), grasping (c), and placing (d) any mug on a rack by its handle. A video demo for this task is available on our project page.

Our main contribution is a novel formulation of the category-level pick and place task which uses semantic 3D keypoints as the object representation. This keypoint representation enables a simple and interpretable specification of the manipulation target as geometric costs and constraints on the keypoints, which flexibly generalizes existing pose-based manipulation targets. Using this formulation, we contribute a manipulation pipeline that factors the problem into 1) instance segmentation, 2) 3D keypoint detection, 3) optimization-based robot action planning 4) geometric grasping and action execution. This factorization allows us to leverage well-established solutions for these submodules and combine them into a general and effective manipulation pipeline. The keypoint representation ignores task-irrelevant geometric details of the object, making our method robust to large intra-category shape and topology variations. In addition to the usage in our pipeline, the keypoint representation can potentially contribute to various learning-based manipulation approaches as 1) a reward function to flexibly specify the manipulation target or 2) an alternative input to the policy/value neural network, which is more robust to shape variation and large deformation than the widely-used pose representation. We experimentally demonstrate the use of this keypoint representation with our manipulation pipeline on several category-level pick and place tasks implemented on real hardware. We show that our approach generalizes to novel objects in the category, and that this generalization is accurate enough to accomplish tasks requiring centimeter level precision.

This paper is organized as follows: in Sec. II we review related works. Sec. III describes our manipulation formulation. Sec. III-A introduces the formulation using a concrete example, while Sec. III-B describes the general formulation. Sec. IV compares our formulation with pose-based pick and place pipelines to highlight the flexibility and generality of our method. Sec. V demonstrates our methods on several pose-aware manipulation tasks and shows generalization to novel instances. Sec. VI discusses limitations and future work and Sec. VII concludes.

II Related Work

II-A Object Representations and Perception for Manipulation

There exist a number of object representations, and methods for perceiving these representations, that have been demonstrated to be useful for robot manipulation. The default solution to the pick and place of a known object is to estimate its 6 DOF pose. The robot then moves the object from its estimated pose to the target pose. Pose estimation is an extensively studied topic in computer vision and robotics, and existing methods can be generally classified into geometric based algorithms [15, 3] and learning based approaches [23, 25, 17]. Several datasets [25, 26] are annotated with pre-aligned geometric templates, and pose estimators [17, 25] trained on these datasets can produce a category-level pose estimation. Consequently, a straightforward approach to category-level pose-aware manipulation is to combine single object pick and place pipelines with these perception systems. However, pose estimation can be ambiguous under large intra-category shape variations, and moving the object to the specified target pose for the geometric template can lead to incorrect or physically infeasible states for different instances within a category of objects. For example knowing the pose and size of a coffee mug relative to some canonical mug is not sufficient to successfully hang it on a rack by its handle. A more technical discussion is presented in Sec. IV.

Other work has developed and used representations that may be more generalizable than object-specific pose estimation. Recent work has demonstrated dense visual descriptors [18] as a fully self-supervised object representation for manipulation that can generalize at the category level [2]. In comparison with our present work based on 3D keypoints: (i) it is unclear how to extend dense visual descriptors to represent the full object configuration due to self-occlusions which would require layers of occluded descriptors, (ii) the sparse keypoint representation may in practice be more effective at establishing task-relevant correspondence across significant topology variation, and (iii) correspondence alone may not fully define a class-general configuration-change manipulation task, but the addition of human-specified geometric costs and constraints on 3D keypoints may. Keypoints have also been used in prior works as components of manipulation pipelines. Several existing works demonstrate the manipulation of deformable objects where the keypoint detection is used in their perception pipelines. The detected keypoints are typically used as grasp points [10, 20] or building blocks for other shape parameterization, for instance the polygons in [24, 12, 13] on which the manipulation policy is defined. These works accomplished various challenging manipulation tasks such as bed making and towel folding. Compared with these works, we propose a novel manipulation target specification as costs and constraints on 3D keypoints. In addition, the state-machine [10, 20] and manipulation primitives [24, 13] are specific to the cloth and our manipulation task is out of the scope of these approaches.

II-B Grasping Algorithms

Grasping algorithms enable finding stable grasp poses that allow robots to reliably pick up objects. Among various approaches for grasping, model-based methods [28, 9] typically rely on a pre-built grasp database of common 3D object models labeled with sets of feasible grasps. During execution, these methods associate the sensor input with an object entry in the database for grasp planning. In contrast, model-free methods [27, 5, 14, 8] directly evaluate the grasp quality from raw sensor inputs. Many of these approaches achieved promising robustness and generality in the Amazon Picking Challenge [28, 19, 27]. Several works also incorporate object semantic information using instance masks [19], or non-rigid registrations [16] to accomplish tasks such as picking up a specific object or transferring a grasp pose to novel instances.

In this work we focus on placing objects into desired goal states. This is a task that requires much more than just being able to find a grasp on the object, and is out of scope for the above mentioned methods.

Fig. 2: An overview of our manipulation formulation using the “put mugs upright on the table” task as an example: (a) we train a category level keypoint detector that produces two keypoints: $p_{bottom\_center}$ and $p_{top\_center}$. The axis of the mug $v_{mug\_axis}$ is a unit vector from $p_{bottom\_center}$ to $p_{top\_center}$. (b) Given an observed mug, its two keypoints on bottom center and top center are detected. The rigid transform $T_{action}$, which represents the robotic pick-and-place action, is solved to move the bottom center of the mug to the target location and align the mug axis with the target direction $v_{target\_axis}$.

II-C End-to-End Reinforcement Learning

There have been impressive contributions [4, 1] in end-to-end reinforcement learning with applications to robotic manipulation. In particular, [4] has demonstrated robotic pick and place across different instances and is the most related to our work. These end-to-end methods encode a manipulation task into a reward function and train the policy using trial-and-error.

However, in order to accomplish the category level pose-aware manipulation task, these end-to-end methods lack a general, flexible, and interpretable way to specify the desired configuration, which is required for the reward function. In [4], the target configuration is implemented specific to the demonstrated task and object category. Extending it to other desired configurations, object categories and tasks is not obvious. In this way, using end-to-end reinforcement learning allows the policy to be learned from experience without worrying about the details of shape variation, but only transfers the burden of shape variation to the choice and implementation of the reward function. Our proposed object representation of 3D keypoints could be used as a solution to this problem.

III Manipulation Formulation

In this section, we describe our formulation of the category level manipulation problem. Sec.  III-A describes the approach using a concrete example and Sec.  III-B presents the general formulation.

III-A Concrete Motivating Example

Consider the task of “put the mug upright on the table”. We want to come up with a manipulation policy that will accomplish this task for mugs with different size, shape, texture and topology.

To accomplish this task, we pick 2 semantic keypoints on the mugs: the bottom center $p_{bottom\_center}$ and the top center $p_{top\_center}$, as shown in Fig. 2 (a). Additionally, we assume we have a keypoint detector, discussed in Section III-B, that takes raw observations (typically RGBD images or point clouds) and outputs the 3D locations of the specified keypoints. Note that there is no restriction that the keypoints be on the object surface, as evidenced by keypoint $p_{top\_center}$ in Fig. 2 (a). The 3D keypoints are usually expressed in the camera frame, but they can be transformed to an arbitrary frame using the known camera extrinsics. In the following text, we use $p \in \mathbb{R}^{3 \times N}$ to denote the detected keypoint positions in world frame, where $p_i$ is the $i$-th detected keypoint, and $N$ is the total number of keypoints. In this example $N = 2$.

For robotic pick-and-place of mostly rigid objects, we represent the robot action as a rigid transform $T_{action} \in SE(3)$ on the manipulated object. Thus, the keypoints associated with the manipulated object will be transformed as $T_{action}\, p$ using the robot action. In practice, this action is implemented by first grasping the object using the algorithm detailed in Sec. III-B and then planning and executing a trajectory which ends with the object in the desired target location. This trajectory may require approaching the target from a specific direction, for example in the “mug upright on the table” task the mug must approach the table from above.

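Given the above analysis, the manipulation task we want to accomplish can be formulated as: find a rigid transformation $T_{action}$ such that

  1. The transformed mug bottom center keypoint should be placed at some target location:

    $$\| T_{action}\, p_{bottom\_center} - p_{target} \| = 0 \qquad (1)$$

  2. The transformed direction from the mug bottom center to the top center should be aligned with the upright direction. This is encoded by adding a cost to the objective function

    $$1 - \langle v_{target\_axis},\ \mathrm{rot}(T_{action})\, v_{mug\_axis} \rangle \qquad (2)$$

    where $\mathrm{rot}(T_{action})$ is the rotational component of the rigid transformation $T_{action}$, the target orientation $v_{target\_axis} = [0, 0, 1]^T$, and

    $$v_{mug\_axis} = \mathrm{normalized}(p_{top\_center} - p_{bottom\_center}) \qquad (3)$$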

An illustration is presented in Fig. 2 (b). The above problem is an inverse kinematics problem with $T_{action}$ as the decision variable, a constraint given by Eq. (1) and cost given by Eq. (2). This inverse kinematics problem can be reliably solved using off-the-shelf optimization solvers such as [22]. We then pick up the object using robotic grasping algorithms [9, 14, 5] and execute a robot trajectory which applies the manipulation action $T_{action}$ to the grasped object.
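For this two-keypoint task the position constraint and the alignment cost can both be driven to zero in closed form, which makes a convenient sanity check of the formulation. The sketch below is not the solver used in this paper (which relies on an off-the-shelf optimizer); it assumes the upright direction is the world z-axis, and the function names and the example keypoint coordinates are illustrative only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalized(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def matvec(R, v):
    return [dot(row, v) for row in R]

def rotation_aligning(a, b):
    """Rotation matrix that maps unit vector a onto unit vector b (Rodrigues formula)."""
    v, c = cross(a, b), dot(a, b)
    s2 = dot(v, v)
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    if s2 < 1e-12:
        if c > 0.0:
            return eye  # already aligned
        # Antiparallel case: rotate by pi about any axis k perpendicular to a.
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        k = normalized(cross(a, helper))
        return [[2.0 * k[i] * k[j] - eye[i][j] for j in range(3)] for i in range(3)]
    K = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    K2 = [[sum(K[i][m] * K[m][j] for m in range(3)) for j in range(3)] for i in range(3)]
    f = (1.0 - c) / s2
    return [[eye[i][j] + K[i][j] + f * K2[i][j] for j in range(3)] for i in range(3)]

def solve_mug_upright(p_bottom_center, p_top_center, p_target,
                      v_target_axis=(0.0, 0.0, 1.0)):
    """Return (R, t) with R * p_bottom_center + t = p_target and the mug axis upright."""
    v_mug_axis = normalized([a - b for a, b in zip(p_top_center, p_bottom_center)])
    R = rotation_aligning(v_mug_axis, list(v_target_axis))
    t = [pt - rp for pt, rp in zip(p_target, matvec(R, p_bottom_center))]
    return R, t

# Made-up detections (metres): a mug lying on its side along the world x-axis.
R, t = solve_mug_upright([0.5, 0.0, 0.05], [0.6, 0.0, 0.05], p_target=[0.3, 0.2, 0.0])
placed_bottom = [r + ti for r, ti in zip(matvec(R, [0.5, 0.0, 0.05]), t)]
axis_cost = 1.0 - dot([0.0, 0.0, 1.0], matvec(R, [1.0, 0.0, 0.0]))
```

The rotation found this way is not unique: any additional spin about the upright axis leaves both the constraint and the cost unchanged, which is one reason a general solver with further costs and constraints is useful in practice.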

Fig. 3: An overview of the category level pick and place pipeline using our manipulation formulation. Given an RGBD image with instance segmentation, the semantic 3D keypoints of the object in question are detected. We then feed these 3D keypoints into an optimization based planning algorithm to compute the robot pick and place action, which is represented by a rigid transformation $T_{action}$. Finally, we use an object-agnostic grasp planner to pick up the object and apply the computed robot action.

III-B General Formulation

For an arbitrary category level manipulation task we can represent an object using task-relevant semantic 3D keypoints. The task is then specified via geometric costs and constraints on these keypoints, which affords a flexible way of formulating the manipulation problem. The user selects keypoints, e.g. $p_{bottom\_center}$ and $p_{top\_center}$ in the example of Sec. III-A, together with costs and constraints, e.g. (1) and (2), which fully specify the task. Once we have chosen this as the problem specification, there exist natural formulations for each remaining piece of the manipulation pipeline. This allows us to factor the manipulation policy into 4 subproblems: 1) object instance segmentation, 2) category level 3D keypoint detection, 3) a kinematic optimization problem to determine the manipulation action $T_{action}$, and 4) grasping the object and executing the desired manipulation action $T_{action}$. An illustration of our complete manipulation pipeline is shown in Fig. 3. In the following sections, we describe each component of our manipulation pipeline in detail.

Instance Segmentation and Keypoint Detection As discussed in Section III-A, the kPAM pipeline requires detecting category-level 3D keypoints from RGBD images of specific object instances. Here we present the specific approach we used for the keypoint detection problem, but note that any technique that can detect these 3D keypoints could be used instead.

We use the state-of-the-art integral network [21] for 3D keypoint detection. For each keypoint, the network produces a probability heatmap and a depth prediction map as the raw outputs. The 2-D image coordinates and depth value are extracted using the integral operation [21]. The 3-D keypoints are recovered using the calibrated camera intrinsic parameters. These keypoints are then transformed into world frame using the camera extrinsics.
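As a rough illustration of the integral operation and the subsequent back-projection, the following sketch takes the expectation of pixel coordinates under a normalized heatmap and then applies a pinhole camera model. It assumes the heatmap is given as unnormalized scores, that the depth value is the heatmap-weighted average of the depth prediction map, and that the camera is an ideal pinhole; the function names, the toy heatmap and the intrinsics are all made up for the example and are not taken from [21].

```python
import math

def integral_keypoint(heatmap_logits, depth_map):
    """Soft-argmax over a 2D heatmap: expected (u, v) pixel and expected depth."""
    peak = max(max(row) for row in heatmap_logits)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [[math.exp(x - peak) for x in row] for row in heatmap_logits]
    total = sum(sum(row) for row in weights)
    u = v = depth = 0.0
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            prob = w / total
            u += prob * c                   # column index is the image u coordinate
            v += prob * r                   # row index is the image v coordinate
            depth += prob * depth_map[r][c]
    return u, v, depth

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth into the camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

# Toy 3x4 heatmap with a sharp peak at row 1, column 2 (values are invented).
logits = [[0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 50.0, 0.0],
          [0.0, 0.0, 0.0, 0.0]]
depths = [[1.0, 1.0, 1.0, 1.0],
          [1.0, 1.0, 0.8, 1.0],
          [1.0, 1.0, 1.0, 1.0]]
u, v, d = integral_keypoint(logits, depths)
point_camera = backproject(u, v, d, fx=500.0, fy=500.0, cx=2.0, cy=1.0)
```

Because the expectation is differentiable, this kind of readout can be trained end to end, unlike a hard argmax over the heatmap.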

We collect the training data for keypoint detection using a pipeline similar to LabelFusion [11]. Given a scene containing the object of interest we first perform a 3D reconstruction. Then we manually label the keypoints on the 3D reconstruction. We note that this does not require pre-built object meshes. Keypoint locations in image space can be recovered by projecting the 3D keypoint annotations into the camera image using the known camera calibration. Training dataset statistics are provided in Fig. 7 (c). In total labeling our 117 training scenes took less than four hours of manual annotation time and resulted in over 100,000 labeled images. Even with this relatively small amount of human labeling time we were able to achieve centimeter accurate keypoint detections, enabling us to accomplish challenging tasks requiring high precision, see Section V.

The keypoint detection network [21] requires object instance segmentation as the input, and we integrate Mask R-CNN [6] into our manipulation pipeline to accomplish this step. The training data mentioned above for the keypoint detector [21] can also be used to train the instance segmentation network [6]. Please refer to the supplemental material for more detail.

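kPAM Optimization The optimization used to find the desired robot action $T_{action}$ can in general be written as

$$\begin{aligned} \min_{T_{action}} \quad & f(T_{action}) \\ \text{subject to} \quad & g(T_{action}) = 0, \\ & h(T_{action}) \le 0 \end{aligned} \qquad (4)$$

where $f$ is a scalar cost function, and $g$ and $h$ are the equality and inequality constraints, respectively. The robot action $T_{action}$ is the decision variable of the optimization problem, and the detected keypoint locations $p$ enter the optimization parametrically.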

In addition to the constraints used in Sec. III-A, a wide variety of costs and constraints can be used in the optimization (4). This allows the user to flexibly specify a wide variety of manipulation tasks. In practice we found that this specification was rich enough to cover all of our desired use cases. Although an exhaustive list is infeasible, we present several costs/constraints used in our experiments:

  1. L2 distance cost between the transformed keypoint and its nominal target location:

    $$\| T_{action}\, p_i - p_{target,i} \|^2 \qquad (5)$$

    This is a relaxation of the target position constraint presented in Sec. III-A.

  2. Half space constraint on the keypoint:

    $$\langle n_{plane},\ T_{action}\, p_i \rangle + b_{plane} \le 0 \qquad (6)$$

    where $n_{plane}$ and $b_{plane}$ define the separating plane of the half space. Using the mug in Sec. III-A as an example, this constraint can be used to ensure all the keypoints are above the table to avoid penetration.

  3. The point-to-plane distance cost of the keypoint

    $$\| \langle n_{plane},\ T_{action}\, p_i \rangle + b_{plane} \|^2 \qquad (7)$$

    where $n_{plane}$ and $b_{plane}$ define the plane that the keypoint should be in contact with. By using this cost with keypoints that should be placed on the contact surface, for instance the $p_{bottom\_center}$ of the mug in Sec. III-A, the optimization (4) can prevent the object from floating in the air.

  4. The robot action should be within the robot’s workspace and avoid collisions.
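The keypoint costs and constraints listed above are simple functions of the transformed keypoint, which the following sketch makes concrete. It represents the robot action as a rotation matrix and a translation vector, and assumes the sign convention that a half-space constraint is satisfied when the plane expression is non-positive; the function names, the table plane and the keypoint values are illustrative assumptions rather than the implementation used in our experiments.

```python
def apply_transform(R, t, p):
    """Apply the rigid transform (R, t) to a 3D point p."""
    return [sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)]

def l2_cost(R, t, p, p_target):
    """Squared distance between the transformed keypoint and its target."""
    q = apply_transform(R, t, p)
    return sum((a - b) ** 2 for a, b in zip(q, p_target))

def half_space_value(R, t, p, n_plane, b_plane):
    """Plane expression <n, T p> + b; the constraint holds when this is <= 0."""
    q = apply_transform(R, t, p)
    return sum(n * x for n, x in zip(n_plane, q)) + b_plane

def point_to_plane_cost(R, t, p, n_plane, b_plane):
    """Squared signed distance of the transformed keypoint to the plane (unit normal)."""
    return half_space_value(R, t, p, n_plane, b_plane) ** 2

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Table surface z = 0; "above the table" means -z <= 0, i.e. n = (0, 0, -1), b = 0.
N_TABLE, B_TABLE = (0.0, 0.0, -1.0), 0.0
bottom_center = [0.3, 0.2, 0.0]  # made-up keypoint already resting on the table

resting = half_space_value(IDENTITY, [0.0, 0.0, 0.0], bottom_center, N_TABLE, B_TABLE)
sunk = half_space_value(IDENTITY, [0.0, 0.0, -0.02], bottom_center, N_TABLE, B_TABLE)
floating_cost = point_to_plane_cost(IDENTITY, [0.0, 0.0, 0.05], bottom_center,
                                    N_TABLE, B_TABLE)
```

Here `resting` satisfies the constraint, `sunk` violates it because the action pushes the keypoint 2 cm below the table, and `floating_cost` penalizes an action that leaves the keypoint hovering 5 cm above the surface.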

Robot Grasping Robotic grasping algorithms [14, 5, 9] can be used to apply the abstracted robot action $T_{action}$ produced by the kPAM optimization (4) to the manipulated object. If the object is rigid and the grasping is tight (no relative motion between the gripper and object), applying a rigid transformation to the robot gripper will apply the same transformation to the manipulated object. These grasping algorithms [14, 5, 9] are object-agnostic and can robustly generalize to novel instances within a given category.
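The tight-grasp argument can be checked numerically: if the goal gripper pose is obtained by left-multiplying the current gripper pose by the planned action, any point rigidly attached to the gripper moves by exactly that action. The sketch below uses 4x4 homogeneous transforms stored as nested lists; the poses, the keypoint and the helper names are invented for illustration.

```python
def matmul(A, B):
    """Product of two 4x4 homogeneous transforms."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def apply(T, p):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    return [T[i][0] * p[0] + T[i][1] * p[1] + T[i][2] * p[2] + T[i][3] for i in range(3)]

def rigid_inverse(T):
    """Inverse of a rigid transform: (R, t) -> (R^T, -R^T t)."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[0] + [t_inv[0]], Rt[1] + [t_inv[1]], Rt[2] + [t_inv[2]],
            [0.0, 0.0, 0.0, 1.0]]

# Current gripper pose in the world frame (made-up values, identity rotation).
T_world_gripper = [[1.0, 0.0, 0.0, 0.5],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.0, 1.0]]
# Planned action: rotate 90 degrees about the world z-axis, then shift 0.2 m in y.
T_action = [[0.0, -1.0, 0.0, 0.0],
            [1.0, 0.0, 0.0, 0.2],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

keypoint_world = [0.55, 0.0, 0.25]
# With a tight grasp the keypoint is fixed in the gripper frame.
keypoint_in_gripper = apply(rigid_inverse(T_world_gripper), keypoint_world)

T_world_gripper_goal = matmul(T_action, T_world_gripper)
keypoint_after = apply(T_world_gripper_goal, keypoint_in_gripper)
keypoint_expected = apply(T_action, keypoint_world)
```

Any slip between gripper and object breaks the assumption that `keypoint_in_gripper` stays constant, and the keypoint then ends up away from `keypoint_expected`.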

For the purposes of this work we developed a grasp planner which uses the detected keypoints, together with local dense geometric information from a pointcloud, to find high quality grasps. This local geometric information is incorporated with an algorithm similar to the baseline method of [27]. In general the keypoints used to specify the manipulation task aren’t sufficient to determine a good grasp on the object. Thus incorporating local dense geometric information from a depth image or pointcloud can be advantageous. This geometric information is readily available from the RGBD image used for keypoint detection, and doesn’t require object meshes. Our grasp planner leverages the detected keypoints to reduce the search space of grasps, allowing us to focus our search on, for example, the heel of a shoe or the rim of a mug. Once we know which aspect of the local geometry to focus on, a high quality grasp can be found by any variety of geometric or learning-based grasping algorithms [14, 5, 8].
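One simple way to use a keypoint to narrow the grasp search is to keep only the point cloud points near that keypoint and hand the cropped cloud to a grasp scorer. The sketch below shows only this cropping step under that assumption; it is not our grasp planner, and the function name, radius and points are hypothetical.

```python
def crop_near_keypoint(points, keypoint, radius):
    """Keep the points whose Euclidean distance to the keypoint is at most radius."""
    r2 = radius * radius
    return [p for p in points
            if sum((a - b) ** 2 for a, b in zip(p, keypoint)) <= r2]

# Made-up shoe point cloud (metres) and a made-up heel keypoint.
cloud = [(0.00, 0.00, 0.02), (0.03, 0.01, 0.02), (0.20, 0.00, 0.03), (0.25, 0.02, 0.01)]
heel_keypoint = (0.0, 0.0, 0.0)
heel_region = crop_near_keypoint(cloud, heel_keypoint, radius=0.05)
```

Only the two points within 5 cm of the heel survive, so a downstream geometric or learned grasp evaluator would search a much smaller region.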

We stress that keypoints are a sparse representation of the object sufficient for describing the manipulation task. However, grasping, which depends on the detailed local geometry, can benefit from denser RGBD and pointcloud data. This doesn’t detract from keypoints as an object representation for manipulation, but rather shows the benefits of different representations for different pieces of the manipulation pipeline.

Fig. 4: A pose representation cannot capture large intra-category variations. Here we show different alignment results from a shoe template (blue) to a boot observation (red). (a) and (b) are produced by [3] with variation on the random seed, and the estimated transformation consists of a rigid pose and a global scale. In (c), the estimated transformation is a fully non-rigid deformation field in [15]. In these examples, the shoe template and transformations cannot capture the geometry of the boot observation. Additionally, there may exist multiple suboptimal alignments which make the pose estimator ambiguous. The subsequent robotic pick and place actions derived from these estimations differ, even though each alignment is geometrically reasonable.

Fig. 5: A comparison of the keypoint based manipulation with pose based manipulation for two different tasks involving mugs. The first row considers the mug on rack task, where a mug must be hung on a rack by its handle. (a) shows a reference mug in the goal state; (b) and (c) show a scaled down mug instance that could be encountered at test time. (b) uses keypoint based optimization with a constraint on the handle keypoint to find the target state for the mug. The optimized goal state successfully achieves the task of hanging the mug on the rack. In contrast, (c) shows the scaled mug instance at the pose defined by (a), which leads to the handle of the mug completely missing the rack, a failure of the task. The second row shows the task of putting a mug on a table. Again (d) shows a reference mug in a goal state; (e) and (f) show a scaled up mug that could be encountered at test time. (e) uses keypoint based optimization with costs/constraints on the bottom and top keypoints to place the mug in a valid goal state. (f) directly uses the pose from (d) on the new mug instance, which leads to an invalid goal state where the mug is penetrating the table.

IV Comparison and Discussions

In this section we compare our approach, as outlined in Sec. III, to existing robotic pick and place methods that use pose as the object representation.

IV-A Keypoint Representation vs Pose Representation

At the foundation of existing pose-estimation methods is the assumption that the geometry of the object can be represented as a parameterized transformation defined on a fixed template. Commonly used parameterized pose families include rigid, affine, articulated or general deformable. For a given observation (typically an RGBD image or pointcloud), these pose estimators produce a parameterized transformation that aligns the geometric template to the observation.

However, the pose representation is not able to capture large intra-category shape variation. An illustration is presented in Fig. 4, where we try to align a shoe template (blue) to a boot observation (red). Fig. 4 (a) and (b) are produced by  [3] where the estimated transformation consists of a rigid pose and a global scale. Fig. 4 (c) is produced by [15] and the estimated transformation is a fully non-rigid deformation field. In these examples, the shoe template and transformations cannot capture the geometry of the boot observation. Additionally, there may exist multiple suboptimal alignments which make the pose estimator ambiguous, as shown in Fig. 4. Feeding these ambiguous estimations into a pose-based manipulation pipeline will produce different pick and place actions and final configurations of the manipulated object.

In contrast, we use semantic 3D keypoints as a sparse but task-specific object representation for the manipulation task. Many existing works demonstrate accurate 3D keypoint detection that generalizes to novel instances within the category. We leverage these contributions to build a robust and flexible manipulation pipeline.

Conceptually, a pose representation can also be transformed into keypoint representation given keypoint annotations on the template. However, in practice the transformed keypoints can be inaccurate as the template and the pose cannot fully capture the geometry of new instances. Using the shoe keypoint annotation in Fig. 6 as an example, transforming the annotated template keypoints to a boot using the shoe to boot alignment in Fig. 4 would result in erroneous keypoint detections. A general non-rigid kinematic model (and the associated estimator) that can handle large variations of shape and topology, such as in the example of Fig. 4, remains an open problem. Our method avoids this problem by sidestepping the geometric alignment phase and directly detecting the 3D keypoint locations.

IV-B Keypoint Target vs Pose Target

For existing pose-based pick and place pipelines, the manipulation task is defined as a target pose of the objects. For a given scene where the pose of each object has been estimated, these pipelines grasp the object in question and use the robot to move the objects from their current pose to the target pose.

The proposed method can be regarded as a generalization of the pose-based pick and place algorithms. If we detect 3 or more keypoints and assign their target positions as the manipulation goal, then this is equivalent to pose-based manipulation. In addition, our method can specify more flexible manipulation problems with explicit geometric constraints, such as the bottom of the cup must be on the table and its orientation must be aligned with the upright direction, see Sec.  III-A. The proposed method also naturally generalizes to other objects within the given category, as the keypoint representation ignores many task-irrelevant geometric details.

A pose target, by contrast, is object-specific, and defining a target pose at the category level can lead to manipulation actions that are physically infeasible. Consider the mug on table task from Section III-A. Fig. 5 (d) shows the target pose for the reference mug model. Directly applying this pose to the scaled mug instance in Fig. 5 (f) leads to a physically infeasible state where the mug is penetrating the table. In contrast, using the optimization formulation of Section III results in the mug resting stably on the table, shown in Fig. 5 (e).
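A small numerical example shows why reusing a pose target penetrates the table. The sketch assumes, purely for illustration, that the object frame sits at the geometric center of the mug and that the new instance is a uniformly scaled copy of the reference; the heights, the scale factor and the function names are invented.

```python
def bottom_z_with_pose_target(reference_height, scale, table_z=0.0):
    """Height of the mug bottom when the reference target pose is reused as-is.

    The target pose places the object-frame origin (assumed to be the mug center)
    where the center of the reference mug sat when it rested on the table.
    """
    center_z = table_z + reference_height / 2.0
    scaled_height = scale * reference_height
    return center_z - scaled_height / 2.0

def bottom_z_with_keypoint_target(table_z=0.0):
    """Height of the mug bottom when the bottom center keypoint is constrained to the table."""
    return table_z

# A 10 cm reference mug and a test mug scaled up by 1.5 (made-up numbers).
pose_based = bottom_z_with_pose_target(reference_height=0.10, scale=1.5)
keypoint_based = bottom_z_with_keypoint_target()
```

With these numbers the pose target leaves the bottom of the larger mug about 2.5 cm below the table surface, while the keypoint target places it exactly on the surface regardless of the mug's size.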

In addition to leading to states which are physically infeasible, pose-based targets at a category level can also lead to poses which are physically feasible but fail to accomplish the manipulation task. Figures 5 (a) - (c) show the mug on rack task. In this task the goal is to hang a mug on a rack by its handle. Fig. 5 (a) shows the reference model in the goal state. Fig. 5 (c) shows the result of applying the pose based target to the scaled down mug instance. As can be seen, even though the pose matches the target pose exactly, this state doesn’t accomplish the manipulation task since the mug handle completely misses the rack. Fig. 5 (b) shows the result of our kPAM approach. Simply by adding a constraint that the handle center keypoint should be on the rack, a valid goal state is returned by the kPAM optimization.

Fig. 6: An overview of our experiments. (a) and (b) are the semantic keypoints we used for the manipulation of shoes and mugs. We use three manipulation tasks to evaluate our pipeline: (c) put shoes on a shelf; (d) put mugs on a mug shelf; (e) hang mugs on a rack by the mug handles. Videos of these experiments are available on our project page.

V Results

In this section, we demonstrate a variety of pose-aware pick and place tasks using our keypoint-based manipulation pipeline. The particular novelty of these demonstrations is that our method is able to handle large intra-category variations without any instance-wise tuning or specification. We utilize a 7-DOF robot arm (Kuka IIWA LBR) mounted with a Schunk WSG 50 parallel jaw gripper. An RGBD sensor (Primesense Carmine 1.09) is also mounted on the end effector. The video demo on our project page best demonstrates our solution to these tasks. More details about the experimental setup are included in the supplemental material.

Fig. 7: Quantitative results from the 3 hardware experiments. (a) and (b) show some of the test objects for the experiments. (c) statistics of the training data (d) We report the average heel and toe errors (along the horizontal direction) from their desired locations as well as the standard deviation. (e) The reported errors for the mug on shelf task are the distance from the bottom center keypoint to the target location of that keypoint in the optimization program. (f) reports success rates for the mug on rack task for different sized mugs. Mugs with handles having either height or width less than 2cm are classified as “small” (more details in supplementary material). A trial was deemed successful if the mug ended up hanging on the rack by the mug handle. Videos of the experiments are available on our project page.

V-A Put shoes on a shoe rack

Task Description Our first manipulation task is to put shoes on a shoe rack, as shown in Fig. 6 (c). We use shoes with different appearance and geometry to evaluate the generality and robustness of our manipulation policy. The six keypoints used in this manipulation task are illustrated in Fig. 6 (a), and the costs and constraints in the optimization (4) are

  1. The L2 distance cost (5) between keypoints , , and to their nominal target locations.

  2. The sole of the shoe should be in contact with the rack surface. In particular, the point-to-plane cost (7) is used to penalize the deviation of keypoints , and from the supporting surface.

  3. All the keypoints should be above the supporting surface to avoid penetration. A half-space constraint (6) is used to enforce this condition.
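The three terms listed above can be evaluated for a candidate robot action with a few lines of code. The following is a minimal sketch, not the optimization program used in the experiments: the exact forms of (5), (6) and (7) are assumed to be a squared distance, a signed plane-distance inequality and a squared plane distance respectively, the action is restricted to a yaw rotation plus a translation, and all keypoint coordinates, targets and function names are hypothetical.

```python
import math

def transform(point, yaw, t):
    """Apply a rotation about the vertical (z) axis followed by a translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    x, y, z = point
    return (c * x - s * y + t[0], s * x + c * y + t[1], z + t[2])

def l2_cost(points, targets):
    """Sum of squared distances between keypoints and their targets."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in zip(points, targets))

def signed_plane_distance(point, normal, offset):
    """Signed distance n . p + offset to a plane with unit normal n."""
    return sum(n * p for n, p in zip(normal, point)) + offset

def point_to_plane_cost(points, normal, offset):
    """Sum of squared distances from the points to the supporting plane."""
    return sum(signed_plane_distance(p, normal, offset) ** 2 for p in points)

def above_plane(points, normal, offset, tol=1e-9):
    """Half-space check: every point lies on or above the plane."""
    return all(signed_plane_distance(p, normal, offset) >= -tol for p in points)

# Hypothetical shoe keypoints in the shoe frame (meters); sole points have z = 0.
sole_keypoints = [(0.00, 0.0, 0.0), (0.12, 0.0, 0.0), (0.25, 0.0, 0.0)]
upper_keypoints = [(0.02, 0.0, 0.08), (0.10, 0.0, 0.09), (0.22, 0.0, 0.05)]

# Rack surface: horizontal plane z = 0.10, i.e. n = (0, 0, 1), offset = -0.10.
normal, offset = (0.0, 0.0, 1.0), -0.10

# Hypothetical nominal targets for the sole keypoints on the rack.
targets = [(0.40, 0.0, 0.10), (0.52, 0.0, 0.10), (0.65, 0.0, 0.10)]

def evaluate(yaw, t):
    """Evaluate the two costs and the feasibility check for one candidate action."""
    sole = [transform(p, yaw, t) for p in sole_keypoints]
    upper = [transform(p, yaw, t) for p in upper_keypoints]
    return {
        "l2": l2_cost(sole, targets),
        "plane": point_to_plane_cost(sole, normal, offset),
        "feasible": above_plane(sole + upper, normal, offset),
    }

good = evaluate(0.0, (0.40, 0.0, 0.10))  # shoe resting on the rack
bad = evaluate(0.0, (0.40, 0.0, 0.08))   # shoe pushed 2 cm into the rack
print(good, bad)
```

A solver would search over the action parameters to minimize the two costs subject to the feasibility check; here the candidates are simply evaluated, which is enough to see that the penetrating placement is both more costly and infeasible.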

Experimental Results The shoe keypoint detection network was trained on a labeled dataset of 10 shoes, detailed in Fig. 7 (c). Experiments were conducted with a held-out test set of 20 shoes with large variations in shape, size and visual appearance (more details in the video and supplemental material). For each shoe we ran 5 trials of the manipulation task. Each trial consisted of a single shoe being placed on the table in front of the robot. Using the kPAM pipeline, the robot would pick up the shoe and place it on a shoe rack. The shoe rack was marked so that the horizontal deviation of the shoe’s toe and heel bottom keypoints ( and respectively in Fig. 6) from their nominal target locations could be determined. Out of 100 trials, only twice did the pipeline fail to place the shoe on the rack. Both failures were due to inaccurate keypoint detections: one led to a failed grasp and the other to an incorrect planned robot action. For trials that ended with the shoe on the rack, average errors for the heel and toe keypoint locations are given in Fig. 7 (d).

During the course of our experiments we noticed that the majority of these errors arise because, when the robot grasps the shoe by the heel, the closing of the gripper often shifts the object from the position it was in when the RGBD image used for keypoint detection was captured. Had we been able to apply the planned robot action exactly to the object, the keypoint detections and the resulting action would almost always have produced heel and toe errors of less than 1 cm. Since our experimental setup relies on a wrist-mounted camera, we are not able to re-perceive the object after grasping it. We believe that these errors could be further reduced by adding an external camera that would allow us to re-run our keypoint detection after grasping the object to account for any object movement during the grasp. Overall, the kPAM approach was very successful at the shoes on rack task, with a greater than 97% success rate.

V-B Put mugs upright on a shelf

Task Description We also perform a real-world demonstration of the “put mugs upright on a shelf” task described in Sec. III-A, as shown in Fig. 6 (d). The keypoints used in this task are illustrated in Fig. 6 (b). The costs and constraints for this task include the target position constraint (1) and the axis alignment constraint (2). If a target orientation is also specified with respect to the yaw axis of the mug, we also add an L2 cost (5) between the keypoint and its target location. This task is similar to the mugs task in [4].
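To make the two constraints concrete, the sketch below checks them for one candidate action on a mug that starts lying on its side. It is an illustration under stated assumptions rather than the formulation of constraints (1) and (2): the axis alignment is measured here as the cosine between the bottom-to-top keypoint axis and the world vertical, the top-center keypoint name is assumed, and all coordinates and helper names are hypothetical.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rotate_about_x(point, angle):
    """Rotate a point about the world x axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = point
    return (x, c * y - s * z, s * y + c * z)

def axis_alignment(bottom, top, up=(0.0, 0.0, 1.0)):
    """Cosine of the angle between the mug axis (bottom -> top) and `up`."""
    axis = sub(top, bottom)
    return dot(axis, up) / math.sqrt(dot(axis, axis))

# Hypothetical detected keypoints for a mug lying on its side (meters).
bottom_center = (0.30, 0.00, 0.04)
top_center = (0.30, 0.10, 0.04)
target_bottom = (0.60, 0.20, 0.25)  # hypothetical spot on the shelf

# Candidate action: rotate 90 degrees about x, then translate so that the
# bottom-center keypoint lands on its target.
angle = math.pi / 2
rotated_bottom = rotate_about_x(bottom_center, angle)
rotated_top = rotate_about_x(top_center, angle)
translation = sub(target_bottom, rotated_bottom)
new_bottom = add(rotated_bottom, translation)
new_top = add(rotated_top, translation)

before = axis_alignment(bottom_center, top_center)  # mug axis is horizontal
after = axis_alignment(new_bottom, new_top)         # mug axis points up
print(f"alignment before: {before:.3f}, after: {after:.3f}")
```

Because the check depends only on which detected point is labeled bottom and which is labeled top, swapping the two labels would make an upside-down placement satisfy the same conditions.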

Experimental Results The mug keypoint detection network was trained on a dataset of standard sized mugs, detailed in Fig. 7 (c). Experiments for the mug on shelf task were conducted using a held-out test set of 40 mugs with large variations in shape, size and visual appearance (more details in the video and supplemental material). All mugs could be grasped when in the upright orientation, but due to the limited stroke of our gripper (7.5cm when fully open) only 19 of these mugs could be grasped when lying horizontally. For mugs that could be grasped horizontally we ran two trials with the mug starting from a horizontal orientation, and two trials with the mug starting in an upright orientation. For the remaining mugs we ran two trials for each mug with the mug starting in an upright orientation. Quantitative performance was evaluated by recording whether the mug ended up upright on the shelf, and the distance of the mug’s bottom center keypoint to the target location. Results are shown in Fig. 7 (e). Overall our system was very reliable, managing to place the mug on the shelf within 5cm of the target location in all but 2 trials. In one of these failures the mug was placed upside down. Fig. 8 shows the RGB image used in keypoint detection along with the final object placement. In this case the keypoint detection mixed up the top and bottom of the mug, causing it to be placed upside down. The keypoint detection error is understandable in this case, since it is very difficult to distinguish the top from the bottom of this mug in the single RGBD image. In addition, this particular instance was a small kid-sized mug, whereas the training data for mugs contained only regular sized mugs.
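The trial protocol above implies a total trial count that is not stated explicitly. The short calculation below derives it from the counts given in the text, assuming that the "vertical" starting orientation is the same as the upright one; the variable names are ours.

```python
# Trial bookkeeping for the mug on shelf task, using only counts stated in the text.
total_mugs = 40
horizontal_graspable = 19  # mugs that fit the gripper when lying horizontally
upright_only = total_mugs - horizontal_graspable

# Horizontally graspable mugs: two horizontal-start and two upright-start trials.
# Remaining mugs: two upright-start trials each.
horizontal_trials = 2 * horizontal_graspable
upright_trials = 2 * horizontal_graspable + 2 * upright_only
total_trials = horizontal_trials + upright_trials

failures = 2  # trials that did not end within 5cm of the target
within_5cm_rate = (total_trials - failures) / total_trials
print(total_trials, horizontal_trials, upright_trials, f"{within_5cm_rate:.1%}")
```

Under this reading the protocol amounts to 118 trials, 38 of them starting horizontally and 80 starting upright, so the two failures correspond to roughly 98% of trials ending within 5cm of the target.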

Overall the accuracy in the mug on shelf task was very high, with of upright trials, and of horizontal trials resulting in bottom keypoint final location errors of less than 3cm. Qualitatively the majority of this error arose from the object moving slightly during the grasping process with the rest attributed to the keypoint detection.

Fig. 8: (a) The RGB image for the single failure trial of the mug on shelf task that led to the mug being put in an incorrect orientation. In this case the keypoint detection confused the top and bottom of the mug and it was placed upside down. (b) The resulting upside down placement of the mug.

V-C Hang the mugs on the rack by their handles

Task Description To demonstrate the accuracy and robustness of our method we tasked the robot with autonomously hanging mugs on a rack by their handles. An illustration of this task is provided in Fig. 6 (e). The relatively small mug handles (2-3 centimeters) challenge the accuracy of our manipulation pipeline. The costs and constraints in this task are

  1. The target location constraint (1) between to its target location on the rack axis.

  2. The keypoint L2 distance cost (5) from and to their nominal target locations.

Experimental Results For the mug on rack experiments we used the same keypoint detection network as for the mug on shelf experiments. Experiments were conducted using a held-out test set of 30 mugs with large variation in shape, texture and topology. Of these, 5 were very small mugs whose handles had a minimum dimension (either height or width) of less than 2cm (see the supplementary material for more details). We note that the training data did not contain any such “small” mugs. Each trial consisted of placing a single mug on the table in front of the robot. Then the kPAM pipeline was run and a trial was recorded as successful if the mug ended up hanging on the rack by its handle. 5 trials were run for each mug. Quantitative results are given in Fig. 7 (f). For regular sized mugs we were able to hang them on the rack with a 100% success rate. The small mugs were much more challenging but we still achieved a 50% success rate. The small mugs have very tiny handles, which stresses the accuracy of the entire system. In particular, the total error of the keypoint detection, grasping and execution needed to successfully complete the task was on the order of 1-1.5 cm. Two main factors contributed to failures in the mug on rack task. The first, similar to the shoe on rack task, is that during grasping the closing of the gripper often moves the object from the location at which it was perceived. Even a small disturbance (e.g. 1cm) can lead to a failure in the mug on rack task since the required tolerances are very small. The second contributing factor to failures is inaccurate keypoint detections. Again, an inaccurate detection of even 0.5-1cm can be sufficient for the mug handle to miss the rack entirely. As discussed previously, the movement of the object during grasping could be alleviated by the addition of an external camera that would allow us to re-perceive the object after grasping.
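A rough error budget shows why the small mugs are so sensitive. The sketch below uses the example magnitudes quoted above; treating the two error sources as adding linearly in the same direction is a worst-case assumption of ours, not a measured property of the system.

```python
# Worst-case error budget for the small-mug case. Magnitudes are the example
# values quoted in the text; summing them linearly is an assumption.
tolerance_cm = (1.0, 1.5)        # total error the task can absorb
grasp_shift_cm = 1.0             # example disturbance from the gripper closing
detection_error_cm = (0.5, 1.0)  # example keypoint detection error range

def worst_case_total(grasp, detection):
    """Errors pointing in the same direction simply add up."""
    return grasp + detection

best = worst_case_total(grasp_shift_cm, detection_error_cm[0])
worst = worst_case_total(grasp_shift_cm, detection_error_cm[1])
exceeds_loose_tolerance = worst > tolerance_cm[1]
print(f"combined error: {best:.1f}-{worst:.1f} cm, "
      f"tolerance: {tolerance_cm[0]:.1f}-{tolerance_cm[1]:.1f} cm")
```

Under this worst-case reading, the two error sources together already reach or exceed the available tolerance, which is consistent with either one alone being enough to cause a miss on the smallest handles.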

VI Limitations and Future Work

Our current data collection pipeline in Sec. III-B requires human annotation, although 3D reconstruction alleviates this manual labor. In the future, we plan to train our keypoint detector using synthetic data, as demonstrated in [23, 25].

Representing the robot action with a rigid transformation is valid for robotic pick-and-place. However, this abstraction does not work for deformable objects or more dexterous manipulation actions on rigid objects, such as the in-hand manipulation in [1]. Combining these learning-based or model-based approaches with the keypoint representation to build a manipulation policy that generalizes to categories of objects is a promising direction for future work.

VII Conclusion

In this paper we contribute a novel formulation of category-level manipulation which uses semantic 3D keypoints as the object representation. Using keypoints to represent the object enables us to simply and interpretably specify the manipulation target as geometric costs and constraints on the keypoints, which flexibly generalizes existing pose-based manipulation methods. This formulation naturally allows us to factor the manipulation policy into 3D keypoint detection, optimization-based robot action planning and grasping-based action execution. By factoring the problem we are able to leverage advances in these sub-problems and combine them into a general and effective perception-to-action manipulation pipeline. Through extensive hardware experiments, we demonstrate that our pipeline is robust to large intra-category shape variation and can accomplish manipulation tasks requiring centimeter-level precision.


The authors thank Ethan Weber (instance segmentation training data generation) and Pat Marion (visualization) for their help. This work was supported by: National Science Foundation, Award No. IIS-1427050; Draper Laboratory Incorporated, Award No. SC001-0000001002; Lockheed Martin Corporation, Award No. RPP2016-002; Amazon Research Award.