Learning to Generate 6-DoF Grasp Poses with Reachability Awareness

10/14/2019 ∙ by Xibai Lou, et al. ∙ 8

Motivated by the stringent requirements of unstructured real-world where a plethora of unknown objects reside in arbitrary locations of the surface, we propose a voxel-based deep 3D Convolutional Neural Network (3D CNN) that generates feasible 6-DoF grasp poses in unrestricted workspace with reachability awareness. Unlike the majority of works that predict if a proposed grasp pose within the restricted workspace will be successful solely based on grasp pose stability, our approach further learns a reachability predictor that evaluates if the grasp pose is reachable or not from robot's own experience. To avoid the laborious real training data collection, we exploit the power of simulation to train our networks on a large-scale synthetic dataset. To our best knowledge, this work is the first attempt to take into account the reachability when proposing feasible grasp poses. Experimental results in both simulation and real-world demonstrate that our reachability aware 3D CNN grasping outperforms several other approaches and achieves 82.5% grasping success rate on unknown objects.



There are no comments yet.


page 1

page 5

page 6

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Real-world applications demand robotic manipulation algorithms that are efficient in arbitrary workspace where objects may not be reachable. Fig. 1 illustrates a scenario where such an algorithm needs to 1) decide which of the sampled grasp pose candidates are more reachable and 2) grasp as many objects as possible from the dense clutter with minimal efforts.

The predominant top-down grasping is often restricted in narrowly prepared workspace [zeng2018learning], whereas practical problems are often in extended and obstacle-rich environments that require flexible 6-DoF grasp poses to reach objects. Albeit extensive researches have been conducted on this topic, the grasping reachability problem remains relatively unexplored. Current 6-DoF approaches grasp within restricted workspace and only predict successful grasps by analyzing grasp poses and object shapes [ten2017grasp]. When applied to unrestricted workspace, however, these approaches experience excessive planning failures that jeopardize grasping efficiency. We present a reachability aware 3D deep Convolutional Neural Network (3D CNN) that addresses these concerns by proposing feasible 6-DoF grasp poses that are both stable and reachable.

Our approach consists of a 3D CNN and a Reachability Predictor (RP). 3D CNN learns spatial information [maturana2015voxnet] which is effective in learning stable 6-DoF grasp poses and generalizing to novel objects [choi2018learning]

. Furthermore, the relatively small sim-to-real gap brought by depth is promising for direct real-world application. RP is effective in estimating the reachability of the sampled grasp poses without going through the computationally expensive motion planning algorithms. This attribute is jointly determined by the grasp pose and the kinematic constraints of a robot arm. Our approach discovers this intervened relationship and improves grasping efficiency by learning to approximate the grasping reachability from self-exploring experience. The immediate challenge here is how to train these models, which typically require hundreds of thousands of labeled data in order to generalize. Many learning-based grasping approaches suffered from insufficient training data since real robot data are notoriously expensive to collect. We exploit the power of a robot simulator and solve this problem with large-scale self-supervision.

Fig. 1: Example of searching for a feasible grasp pose. Two clusters of objects are randomly arranged on the table. The green masks represent reachable approaching directions whereas the red mask is for unreachable ones. Note that in the left figure the robot has reached its limit, wherein the chance of finding a stable and reachable grasp in clutter A is much greater than that in clutter B.

Our work is inspired by human behavior; we naturally prefer to grasp closer objects with appropriate hand and arm poses, whereas for distant objects we either adjust our hand pose or abort grasping. Likewise, we propose an approach to mimic such a highly efficient grasping strategy. We conducted several experiments and ablation studies in simulation as well as real-world where our approach outperforms several comparable approaches in densely cluttered settings and generalizes to novel objects. To the best of our knowledge, our grasping strategy is the first attempt to generate reachability aware 6-DoF poses in dense clutter using a voxel-based 3D CNN. The main contributions of our work are bi-folded:

  • 3D CNN-based grasp pose generation in 6-DoF takes advantage of our grasp pose sampling algorithm that uniformly samples over the entire 3-dimensional space. In order to predict feasible 6-DoF grasp pose and generalize to novel objects, we exploit large-scale synthetic data collection via self-supervision. Furthermore, the domain-invariant nature of 3D CNN facilitates the direct application to real robot.

  • Reachability Predictor learns the robot capability from extensive self-exploration and eliminates the need for human-imposed constraints such as workspace restrictions and approaching direction filtering. It is able to predict the reachability of the sampled grasp pose candidates, thus increase the grasping and planning success rate of 6-DoF grasp pose in unrestricted workspace. Since it is decoupled from grasp learning, RP is adaptable to other robots.

Ii Related Work

Ii-a Object Grasping

Though there are different taxonomies of robotic grasping [sahbani2012overview], [bohg2013data], the existing works of robotic grasping are commonly divided into two groups: traditional model-based and modern learning-based approaches.

Traditional model-based approaches often involve physical modeling of objects and thus require full knowledge of the objects such as shapes, weights, friction coefficients [miller2004graspit], [ferrari1992planning]. More recent approaches utilize grasp quality metrics to select the force closure grasps from pre-planned sets by analyzing the contact wrench space [weisz2012pose]. These approaches grasp efficiently with accurate measurements and models, however, the prerequisite efforts scale-up fast when implemented in real unstructured environment where novel objects are prevalent.

Recent learning-based approaches apply deep neural networks to robotic grasping [lenz2015deep], [Kappler2015LeveragingBD], [pinto2016supersizing], [mahler2017dex], [ten2017grasp], [choi2018learning], [zeng2018learning]. The majority of these approaches employ convolutional neural networks (CNNs) that take monocular RGB images or 2.5D depth image as input and map the extracted features to a less complicated 3-DoF grasp pose that includes a grasping point on 2D image plane and a corresponding wrist orientation [lenz2015deep], [pinto2016supersizing], [mahler2017dex], [zeng2018learning], [levine2018learning]. [gualtieri2016high] and [ten2017grasp] extend beyond the standard 3-DoF approaches. ten Pas et al. trained the networks with RGB-D images to evaluate a set of sampled grasp candidates. Since their sampling algorithm is based on geometric reasoning, they were able to find 6-DoF grasp poses. Due to the data hungry nature of learning-based approaches, generating large-scale training dataset is necessary. One approach is to label robot trails manually [levine2018learning], [choi2018learning]. A less laborious way is to collect data in simulation [pinto2016supersizing], [Kappler2015LeveragingBD], [mahler2017dex]. Although fast and scalable, the sim-to-real gap may render features learned in simulation inaccurate in real-world. Some re-train the network with real-world data [zeng2018learning], others solve the problem by either fine-tuning or domain adaptation [fang2018learning], [chebotar2018closing], [bousmalis2018using].

Originated from computer vision tasks such as object recognition

[maturana2015voxnet], 3D data segmentation [su2018splatnet], and scene completion [song2017semantic], voxel-based 3D CNN have also been applied to robotic grasping [choi2018learning]

. Choi et al. showed 3D CNN trained with real-robot data is effective in classifying discrete grasp poses of a soft hand for a single object. Our works differ in two ways. First, our problem is much more difficult in that we search in a 6-DoF space, which is necessary for non-compliant grippers. Second, generalization of 6-DoF grasping requires exponentially more data that are only possible to collect in simulation, where we develop a data collection framework and show our approach can be directly applied to real robot.

Ii-B Grasp Planning

Robotic manipulation such as grasping depends on solving an inverse kinematic problem to find a path for the manipulation tool. Modern approaches employ fast motion planning algorithms such as RRT [LaValle00rapidly-exploringrandom], and its variants [kuffner2000rrt] . Traditionally the grasping range of a given robot arm is determined by its maximum manipulability of the closed kinematic chain [park1998manipulability]. This problem is not trivial and an analytical solution is often not easy to find. Furthermore, due to the random exploring nature of the algorithms, a solution for valid grasp poses is not guaranteed. In practice, the majority of approaches in robotic grasping restricts testing objects in a reachable but restricted workspace [ten2017grasp], [choi2018learning], [zeng2018learning], forcing it to operate in a reduced capability. Our work is one of the early explorations of learning grasping reachability, which allows the robot to work at full capacity.

Iii Problem Formulation

Fig. 2: Grasping Pipeline. Object point cloud is obtained from planar segmentation of point cloud . For each of the sampled grasp poses , the object point cloud is voxelized to voxel grid and transformed by the corresponding grasp pose candidate . The input voxel grid is then passed to 3D CNN while the grasp candidate

is fed to RP for evaluation. The most probable grasp pose is chosen and executed by the robot manipulator.

We aim to find feasible 6-DoF grasp poses that are simultaneously stable and reachable. The problem is carried out in two stages. At the first stage, the robot finds a set of stable grasp candidates, some of which may not have valid motion plans due to invalid inverse-kinematic solution. At the second stage, the robot evaluates each grasp pose and excludes the unreachable ones. The problem can be formalized as follows:

Definition 1.

A grasp pose is stable if it is able to form a force closure grasp. This attribute is independent of kinematic constraints.

Definition 2.

A grasp pose is reachable if the given robot arm is able to achieve the pose without violating the physical limits of itself and the environment.

Definition 3.

Given a point cloud , the goal is to find a feasible grasp pose that is both stable and reachable.

The grasp pose is defined with respect to the robot coordinate frame. The point cloud is obtained via a depth sensor with known extrinsic parameters, by which the cloud is transformed from the sensor coordinate frame to the robot coordinate frame. An important assumption is:

Assumption 1.

The set of 6-DoF grasp candidates is randomly generated over the points in the point cloud , i.e., .

Given a point cloud and a grasp candidate , where denotes a set of uniformly generated 6-DoF grasp candidates, let {0,1} denote a binary-valued stability metric where indicates that the grasp is stable according to Definition 1. Our goal is to estimate the grasping stability

by self-supervised learning. To train the network more efficiently, we constrain our grasp sampling algorithm as follows:

Constraint 1.

The grasp pose is constrained in that the sampled grasp candidates are limited so as to approach the target object from the top hemisphere.

For an arbitrary grasp candidate , the wrist orientation has no influence over its reachability since the joint limit is . Therefore, a valid robot grasp is determined by the combination of grasp location and approaching direction . We construct the reachability determinant . To allow the robot fully explore the workspace, we assume that:

Assumption 2.

The robot workspace is unrestricted and can be anywhere within the camera observation space.

Let denote the robot and {0,1} a binary-valued reachability metric, where indicates that the grasp is reachable. We wish to learn to predict the grasping reachability from self-supervised exploration. It is important to notice that the stability metric of a pose is independent of its reachability metric, i.e., does not indicate and vice versa.

Iv Proposed Approach

Iv-a Generating Feasible Grasp Poses

The objective of our approach is to predict the most feasible grasp pose from a set of randomly sampled candidates in multiple settings. According to Definition 3, the selected grasp pose should be both stable and reachable. Training one generic model on this task leads to unsatisfactory performance, as the network may falsely ascribe the reason of a failed grasp to unstable grasp poses, while the true cause is poor reachability, and vice versa. As mentioned in the previous section, these two prerequisites of a feasible grasp pose is entirely independent of each other. This allows us to solve the credit assignment problem by decoupling the task into two independent sub-problems. The grasping success probability can be decomposed as follow:

We trained a 3D CNN grasp pose predictor and a reachability predictor to estimate the resulting grasping stability and grasping reachability respectively.

The grasping pipeline is described in Fig. 2. The system first obtains the object point cloud from planar segmentation of point cloud and samples grasp candidates over via the sampling algorithm described in Section IV-B. For each sampled grasp candidate, objects point cloud is voxelized to a 3D voxel grid , where each voxel in the grid is either 0 (not occupied) or 1 (occupied). The total physical edge length of the voxel grid is , equivalent to crop the object point cloud by a cubic box that centered at grasping point . We choose cropping rather than segmenting an object from the point cloud as it preserves surrounding geometry that helps avoid collisions in dense clutter. The object voxel grid may partially contain voxels of adjacent objects, which contributes to lower grasping stability predictions. Therefore, less surrounded objects, i.e., objects on the peripherals, are prioritized, resulting in an onion-peeling grasping pattern.

Our 3D CNN architecture is similar to those in [maturana2015voxnet] and [choi2018learning]. We embed the grasp pose within the voxel grid by transforming it with grasp candidate . The 3D CNN will predict the grasping stability based on the transformed voxel grid. The reachability predictor then extracts the reachability determinant from each randomly generated pose and estimates the grasping reachability . The grasping success probability is calculated by multiplying the two terms, Algorithm 1 explains this prediction procedure in detail.

Input: point cloud , 3D CNN Grasp model , reachability predictor
      Output: feasible grasp pose

3:for  do
Algorithm 1 Reachability Aware 3D CNN Grasping
Fig. 3: Examples of randomly sampled grasp candidates. The grasp candidates are sampled over a triangle shape. For each candidate, the wrist orientation is sampled in while approaching direction in . Different thresholds of the approaching direction are visualized here. Left figure is searching in full 6-DoF by setting = 0°and = 90°, the middle one shows a search constrained by = 30°and and on the right is a top grasp pose by setting both and to 0.

Iv-B Grasp Pose Sampling Algorithm

Unlike the geometric reasoning-based sampling algorithm proposed in [ten2017grasp], to ensure that the robot comprehensively explores the 6-DoF action space, we present a flexible algorithm that uniformly samples grasp candidates over the target objects within two threshold values. Given a desired number of samples

and approaching vector thresholds

, where , the algorithm uniformly selects grasp points from the input point cloud . For each grasp point, a random pose is associated. Fig. 3 shows three examples of the sampled grasp pose candidates. The sampling algorithm gives little constraints to the generated grasp candidates, so the network is able to learn the reasons of grasp failures from the unsuccessful data, such as grasping a corner or colliding with the object.

Iv-C Network Architecture

We borrow the network structure from [choi2018learning]; the first layer has 32 filters of size , the second layer has 32 filters of

. The features are condensed by a Max Pooling layer of

, followed by two dense layers of 128 and 1. A given grasp pose can either be 1 (stable grasp) or 0 (unstable grasp), so we use binary cross-entropy as the loss function. We use the sigmoid activation function in the final layer to predict the grasping stability for the voxel grid


Reachability predictor consists of an input layer, one hidden layer of size 16, one hidden layer of size 8 and a final output layer. The input to the neural network is a 6-dimensional reachability determinant , and the output is a binary classification result, where 1 denotes the grasp pose is reachable.

To integrate these two networks, we embed the grasping reachability into the grasping success probability by multiplying the two results. The final output indicates the grasping success probability of .

Iv-D Data Collection and Training

Fig. 4: Prediction accuracy w.r.t. the size of dataset. 3D CNN demands large-scale labeled data, tailored to the task, and such requirement is only attainable in simulation.

Training 3D CNN with multiple objects is ineffective because their shape varies for every grasp pose, preventing 3D CNN from generalizing object geometries. We used 8 primitive shapes from [zeng2018learning] as our training objects. The self-supervised robot interacts with a single object and collects 60,000 labeled voxel grids to train the 3D CNN and 10,000 data to train the RP. A testing dataset of 1000 data is reserved to evaluate the prediction accuracy of 3D CNN. Fig. 4 compiles this result with respect to the training data size from 1,000 to 60,000. It is clear that the prediction accuracy benefits from large-scale training data. We also show two prediction results of the RP in Fig. 5, as expected, an approaching direction toward the robot itself results in narrower reachable space. Our approach learns to select candidates with more vertical approaching direction as they are generally more stable and reachable in our training and testing environments.

Fig. 5: Visualization of Reachability Predictor. We visualize the predictions for two poses over an arbitrary workspace, colored in black. Dark blue indicates poor reachability.

V Experiments

Our experiments aim to answer the following four questions: i) How much does our approach outperform other approaches? ii) Can our approach generalize to cluttered objects and novel objects? iii) Is the reachability predictor effectively targeting reachable objects? iv) Do our networks trained in simulation work well in real robot?

To answer these questions, we conducted experiments in both simulated and real settings. We compare our approach with 3 other approaches: 1) RAND, a baseline that randomly generates a grasp pose without any learning, 2) GPD

, Grasp Pose Detection proposes 6-DoF grasp candidates based-on geometric reasoning and evaluates them with RGB-D images trained CNN

[ten2017grasp] and 3) VPG, Visual Pushing for Grasping [zeng2018learning]

learns the synergy between pushing and grasping with a reinforcement learning to clean cluttered objects.

(a) Standard
(b) Challenging
(c) Ablation
Fig. 6: Simulation experiments. Our approach is evaluated in (a) the standard scenario with randomly arranged 10 objects, (b) the challenging scenario with 30 randomly dropped objects and (c) ablation study.
Fig. 7: Performance in simulation. The grasping success rate (top) and completion rate (bottom) of each approach in three different settings. The plots clearly show the effectiveness of our approach, which achieves 71.5% grasping success rate in challenging scenario and 100% completion rate in all experiments.

V-a Simulation Experiments

We adopt the same testing objects and simulated environment as reported in [zeng2018learning] for a fair comparison. The V-REP [rohmer2013v] scene includes a UR5 robot arm equipped with an RG2 gripper, and the dynamics are simulated with Bullet [Coumans:2015:BPS:2776880.2792704] Physics engine 2.83. We noticed GPD under this one camera setup (GPD1c) can not perform well as their geometric-based grasp candidates search requires a more complete point cloud. We thus added another camera for GPD to reproduce the “passive point cloud” setup described in [ten2017grasp]. Gripper grasping point and orientation for each approach are calibrated respectively since their hand configurations are different. The experiments are illustrated in Fig. 6. For all the experiments, a grasp is successful only if the object is lifted by 15 cm. If the robot failed 10 times (including planning failures) consecutively or all the objects have been removed, this run is completed.

Fig. 7 presents the average grasping success rates and completion rates over 30 runs for each method, where grasping success rate and completion rate . We first compare our performance with others in a standard multiple objects grasping scenario. The goal is to grasp 10 objects that are randomly dropped to the center of the ground. We notice that the RP effectively penalizes approaching directions toward the robot when grasping point exceeds its learned thresholds, as shown in Fig. 5. To fairly compare our approach and GPD with VPG, we also report the top grasping success rate for each method. An interesting observation is that the performance of GPD improves significantly as top-down grasping poses are mostly stable and reachable in this setting. This indirectly corroborates the importance of reachability awareness when proposing 6-DoF poses. The challenging scenario compares our approach to the others in a densely cluttered setting where 30 objects are randomly dropped to the center. This triples the workspace density thus demonstrates our ability to generalize to more cluttered scenarios. With only one camera and no help of pushing to declutter the challenging scenario, our approach achieved the highest grasping success rate and completion rate. Our approach suggests feasible grasp poses as long as the object is within the view, while VPG may accidentally force objects out of its workspace.

Grasping Success Rate 13.33 32.67 82.67
Planning Success Rate 26.67 37.33 96.00
TABLE I: Ablation Study of Reachability Predictor

We report an ablation study of the reachability predictor by grasping from two clusters of objects, one of which is partially unreachable. We tested our approach with (OURS) and without (3DCNN) the RP to grasp the objects. Given only 5 chances for each run, the robot has to choose the most stable and reachable pose to succeed. We report the average grasping success rate and planning rate over 30 runs for RAND, 3DCNN and OURS in Table I, for a total of 150 grasps. Planning efficiency . Our reachability aware 3D CNN is able to achieve 96% planning rate, an improvement of 58.67% compared to 3D CNN only.

Grasping Success Rate 13.10 31.23 64.76 75.20
Completion Rate 14.58 12.75 94.33 100.0
TABLE II: Challenging Scenario in Real-world
Fig. 8: Real-world novel objects. We use five novel objects to show the generalization capability of our approach. The shape, texture and size of the testing objects are different from the training objects in simulation as well.
Fig. 9: Grasping generalizability. The grasping success rate of four approaches on the 5 test objects. The average grasping success rate of our approach is 82.5%, surpassing all others on novel objects.
Fig. 10: Real-world experiment pictures. The voxel-based 3D CNN approach minimizes the gap between simulation and real-world. Our simulation trained network is able to clean dense clutters (top), differentiate reachable grasps (middle) and grasp an object in a constrained environment which is not feasible by the top grasping approaches, such as VPG (bottom).

V-B Real Robot Experiments

We evaluate our approach on a Franka Emika Panda robot arm without any fine-tuning. Fig. 10 shows the real robot experiments. We use 30 toy blocks to reproduce the simulated challenging scenario and ablation study, five random household objects to test the generalization capability and a chair positioned over the object to show the flexibility of our approach.

We compare our approach with RAND, GPD, and VPG in the challenging scenario. Due to hardware limitations, we were not able to implement GPD with two cameras. Table II compiles the performance of each method from averaging the results of 10 runs. Our approach is able to perform consistently in real-world and achieves the best grasping success rate and completion rate. Despite the robustness of 3D CNN to sensor noise [choi2018learning], our grasp sampling algorithm may propose a vacant grasping point that contributes to a misaligned pose. VPG performs efficiently as long as the object is not forced out of the constrained workspace. GPD1c suffers from predicting unreachable poses and incomplete point cloud.

In real robot ablation study, our approach demonstrates an efficiency improvement as seen previously, but the gap between the simulated UR5 and the real Panda arm deteriorates the performance. Quantitative results are shown in Table III.

Grasping Success Rate 3.33 23.33 66.67
Planning Efficiency 26.67 34.67 88.67
TABLE III: Reachability Predictor in Real-world

To test our real-world generalizability, we further experiment with objects that have never been seen during training, shown in Fig. 8. We ran 10 tests on each object, the result is summarized in Fig. 9. Our approach is able to extract 3D geometric features from novel shapes and select grasp pose candidates accordingly. We noticed that when grasping considerably larger objects such as the Perrier bottle, our approach prefers smaller shapes such as the bottle neck and cap. This is due to the limited physical voxel grid size that can only fit in shapes that are similar in size to our training objects. We also give an example of our system’s grasping flexibility by positioning an object under a chair to reflect real-world challenges. As shown in Fig. 10, the robot is able to complete the task by selecting a feasible grasp which is not possible with the top grasping approaches such as VPG.

Vi Conclusions

In this work, we presented a deep learning approach to generate 6-DoF grasp poses with reachability awareness. A 3D CNN model that estimates grasping stability was trained with a large-scale dataset obtained from simulated self-supervision. A reachability predictor that improves reachability awareness of 3D CNN was trained similarly. We outperformed several comparable deep learning approaches in both simulation and real-world. Furthermore, our method achieved 82.5% grasping success rate on unknown objects. Ablation study showed the RP significantly increased the planning efficiency of 3D CNN by 54% in real-world experiments.

The limitations of our work suggest two directions for future work. First, our sampling algorithm selects grasping points from the point cloud, which is susceptible to sensor noise. It is of our interest to investigate whether a voxel-based method is able to enhance its robustness. Second, we only applied reachability predictor to grasping and would like to explore if RP can improve other manipulation tasks.