I Introduction
Real-world applications demand robotic manipulation algorithms that operate efficiently in arbitrary workspaces where objects may not be reachable. Fig. 1 illustrates a scenario where such an algorithm needs to 1) decide which of the sampled grasp pose candidates are more reachable and 2) grasp as many objects as possible from the dense clutter with minimal effort.
The predominant top-down grasping is often restricted to narrowly prepared workspaces [zeng2018learning], whereas practical problems often arise in extended and obstacle-rich environments that require flexible 6DoF grasp poses to reach objects. Although extensive research has been conducted on this topic, the grasping reachability problem remains relatively unexplored. Current 6DoF approaches grasp within restricted workspaces and only predict successful grasps by analyzing grasp poses and object shapes [ten2017grasp]. When applied to unrestricted workspaces, however, these approaches experience excessive planning failures that jeopardize grasping efficiency. We present a reachability-aware 3D deep Convolutional Neural Network (3D CNN) that addresses these concerns by proposing feasible 6DoF grasp poses that are both stable and reachable.
Our approach consists of a 3D CNN and a Reachability Predictor (RP). The 3D CNN learns spatial information [maturana2015voxnet], which is effective for learning stable 6DoF grasp poses and generalizing to novel objects [choi2018learning]. Furthermore, the relatively small sim-to-real gap of depth data is promising for direct real-world application. The RP estimates the reachability of sampled grasp poses without running computationally expensive motion planning algorithms. Reachability is jointly determined by the grasp pose and the kinematic constraints of the robot arm; our approach discovers this intertwined relationship and improves grasping efficiency by learning to approximate grasping reachability from self-exploration experience. The immediate challenge is how to train these models, which typically require hundreds of thousands of labeled samples in order to generalize. Many learning-based grasping approaches suffer from insufficient training data, since real robot data are notoriously expensive to collect. We exploit the power of a robot simulator and solve this problem with large-scale self-supervision.
Our work is inspired by human behavior: we naturally prefer to grasp closer objects with appropriate hand and arm poses, whereas for distant objects we either adjust our hand pose or abort the grasp. We propose an approach that mimics this highly efficient grasping strategy. We conducted several experiments and ablation studies in simulation as well as the real world, where our approach outperforms several comparable approaches in densely cluttered settings and generalizes to novel objects. To the best of our knowledge, our grasping strategy is the first attempt to generate reachability-aware 6DoF poses in dense clutter using a voxel-based 3D CNN. The main contributions of our work are twofold:

3D CNN-based grasp pose generation in 6DoF takes advantage of our grasp pose sampling algorithm, which samples uniformly over the entire three-dimensional space. To predict feasible 6DoF grasp poses and generalize to novel objects, we exploit large-scale synthetic data collection via self-supervision. Furthermore, the domain-invariant nature of the 3D CNN facilitates direct application to a real robot.

The Reachability Predictor learns the robot's capability from extensive self-exploration and eliminates the need for human-imposed constraints such as workspace restrictions and approaching-direction filtering. It predicts the reachability of sampled grasp pose candidates, thus increasing the grasping and planning success rates of 6DoF grasp poses in unrestricted workspaces. Since it is decoupled from grasp learning, the RP is adaptable to other robots.
II Related Work
II-A Object Grasping
Though there are different taxonomies of robotic grasping [sahbani2012overview], [bohg2013data], existing works on robotic grasping are commonly divided into two groups: traditional model-based and modern learning-based approaches.
Traditional model-based approaches often involve physical modeling of objects and thus require full knowledge of object properties such as shape, weight, and friction coefficients [miller2004graspit], [ferrari1992planning]. More recent approaches utilize grasp quality metrics to select force-closure grasps from pre-planned sets by analyzing the contact wrench space [weisz2012pose]. These approaches grasp efficiently given accurate measurements and models; however, the prerequisite effort scales up quickly in real unstructured environments where novel objects are prevalent.
Recent learning-based approaches apply deep neural networks to robotic grasping [lenz2015deep], [Kappler2015LeveragingBD], [pinto2016supersizing], [mahler2017dex], [ten2017grasp], [choi2018learning], [zeng2018learning]. The majority employ convolutional neural networks (CNNs) that take monocular RGB or 2.5D depth images as input and map the extracted features to a simpler 3DoF grasp pose consisting of a grasping point on the 2D image plane and a corresponding wrist orientation [lenz2015deep], [pinto2016supersizing], [mahler2017dex], [zeng2018learning], [levine2018learning]. [gualtieri2016high] and [ten2017grasp] extend beyond the standard 3DoF approaches: ten Pas et al. trained networks with RGB-D images to evaluate a set of sampled grasp candidates, and since their sampling algorithm is based on geometric reasoning, they were able to find 6DoF grasp poses. Due to the data-hungry nature of learning-based approaches, generating large-scale training datasets is necessary. One approach is to label robot trials manually [levine2018learning], [choi2018learning]. A less laborious way is to collect data in simulation [pinto2016supersizing], [Kappler2015LeveragingBD], [mahler2017dex]. Although fast and scalable, the sim-to-real gap may render features learned in simulation inaccurate in the real world. Some retrain the network with real-world data [zeng2018learning]; others address the problem with fine-tuning or domain adaptation [fang2018learning], [chebotar2018closing], [bousmalis2018using].
Originating in computer vision tasks such as object recognition [maturana2015voxnet], 3D data segmentation [su2018splatnet], and scene completion [song2017semantic], voxel-based 3D CNNs have also been applied to robotic grasping [choi2018learning]. Choi et al. showed that a 3D CNN trained with real-robot data is effective in classifying discrete grasp poses of a soft hand for a single object. Our work differs in two ways. First, our problem is much more difficult in that we search a 6DoF space, which is necessary for non-compliant grippers. Second, generalizing 6DoF grasping requires exponentially more data, which is only feasible to collect in simulation; we develop a data collection framework and show that our approach can be directly applied to a real robot.
II-B Grasp Planning
Robotic manipulation such as grasping depends on solving an inverse kinematics problem to find a path for the manipulation tool. Modern approaches employ fast motion planning algorithms such as RRT [LaValle00rapidlyexploringrandom] and its variants [kuffner2000rrt]. Traditionally, the grasping range of a robot arm is determined by the maximum manipulability of its closed kinematic chain [park1998manipulability]. This problem is not trivial, and an analytical solution is often hard to find. Furthermore, due to the random exploration of these algorithms, a solution for a valid grasp pose is not guaranteed. In practice, the majority of robotic grasping approaches restrict test objects to a reachable but limited workspace [ten2017grasp], [choi2018learning], [zeng2018learning], forcing the robot to operate at reduced capability. Our work is one of the early explorations of learning grasping reachability, which allows the robot to work at full capacity.
III Problem Formulation
We aim to find feasible 6DoF grasp poses that are simultaneously stable and reachable. The problem is solved in two stages. In the first stage, the robot finds a set of stable grasp candidates, some of which may not have valid motion plans due to invalid inverse kinematics solutions. In the second stage, the robot evaluates each grasp pose and excludes the unreachable ones. The problem can be formalized as follows:
Definition 1.
A grasp pose is stable if it is able to form a force closure grasp. This attribute is independent of kinematic constraints.
Definition 2.
A grasp pose is reachable if the given robot arm is able to achieve the pose without violating the physical limits of itself and the environment.
Definition 3.
Given a point cloud, the goal is to find a feasible grasp pose that is both stable and reachable.
The grasp pose is defined with respect to the robot coordinate frame. The point cloud is obtained via a depth sensor with known extrinsic parameters, by which the cloud is transformed from the sensor coordinate frame to the robot coordinate frame. An important assumption is:
Assumption 1.
The set G of 6DoF grasp candidates is randomly generated over the points of the point cloud P, i.e., the grasping point of every candidate in G lies in P.
Given a point cloud P and a grasp candidate g ∈ G, where G denotes a set of uniformly generated 6DoF grasp candidates, let s ∈ {0, 1} denote a binary-valued stability metric, where s = 1 indicates that the grasp is stable according to Definition 1. Our goal is to estimate the grasping stability s from (P, g) by self-supervised learning. To train the network more efficiently, we constrain our grasp sampling algorithm as follows:
Constraint 1.
The sampled grasp candidates are constrained to approach the target object from the top hemisphere.
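A minimal sketch of uniform sampling under this hemisphere constraint: grasp points are drawn uniformly from the cloud and paired with random downward approach vectors. The concrete threshold value and all function and variable names are our assumptions, not the paper's.

```python
import numpy as np

def sample_grasps(points, n_samples, theta_max_deg=75.0, rng=None):
    """Uniformly sample grasp candidates over a point cloud.

    Each candidate pairs a uniformly chosen grasp point with a random
    approaching direction from the top hemisphere, rejected if it tilts
    more than `theta_max_deg` from vertical (threshold is illustrative).
    A random wrist roll completes the 6DoF pose.
    """
    rng = np.random.default_rng(rng)
    grasps = []
    while len(grasps) < n_samples:
        point = points[rng.integers(len(points))]   # uniform grasp point
        v = rng.normal(size=3)
        v[2] = -abs(v[2])                           # approach downward
        v /= np.linalg.norm(v)
        tilt = np.degrees(np.arccos(np.clip(-v[2], -1.0, 1.0)))
        if tilt > theta_max_deg:
            continue                                # outside threshold
        roll = rng.uniform(0, 2 * np.pi)            # free wrist rotation
        grasps.append((point, v, roll))
    return grasps
```

Rejection sampling keeps the accepted approach directions uniform within the allowed cone around the vertical.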
For an arbitrary grasp candidate, the wrist orientation has no influence over its reachability, since the wrist joint covers its full rotation range. A valid robot grasp is therefore determined by the combination of the grasp location and the approaching direction, which we concatenate into a six-dimensional reachability determinant. To allow the robot to fully explore the workspace, we assume that:
Assumption 2.
The robot workspace is unrestricted and can be anywhere within the camera observation space.
Let r ∈ {0, 1} denote a binary-valued reachability metric for the given robot, where r = 1 indicates that the grasp is reachable. We wish to learn to predict the grasping reachability r from the reachability determinant through self-supervised exploration. It is important to note that the stability metric of a pose is independent of its reachability metric, i.e., a stable grasp is not necessarily reachable, and vice versa.
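For concreteness, the six-dimensional reachability determinant can be assembled from a sampled pose as below; a minimal NumPy sketch in which the function name and pose representation are ours, not the paper's.

```python
import numpy as np

def reachability_determinant(position, rotation):
    """Build the 6-D reachability determinant from a sampled grasp pose.

    position : (3,) grasp location (x, y, z) in the robot frame.
    rotation : (3, 3) rotation matrix of the grasp pose.

    The wrist roll is ignored (it does not affect reachability), so the
    determinant keeps only the location and the approaching direction,
    taken here as the gripper's local z-axis in the robot frame.
    """
    approach = rotation[:, 2]                    # third column = local z-axis
    approach = approach / np.linalg.norm(approach)
    return np.concatenate([position, approach])  # shape (6,)
```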
IV Proposed Approach
IV-A Generating Feasible Grasp Poses
The objective of our approach is to predict the most feasible grasp pose from a set of randomly sampled candidates in multiple settings. According to Definition 3, the selected grasp pose should be both stable and reachable. Training one generic model on this task leads to unsatisfactory performance, as the network may falsely ascribe a failed grasp to an unstable grasp pose when the true cause is poor reachability, and vice versa. As mentioned in the previous section, these two prerequisites of a feasible grasp pose are entirely independent of each other. This allows us to solve the credit assignment problem by decoupling the task into two independent subproblems: the grasping success probability decomposes into the product of the grasping stability probability and the grasping reachability probability.
We trained a 3D CNN grasp pose predictor and a reachability predictor to estimate the resulting grasping stability and grasping reachability, respectively.
The grasping pipeline is described in Fig. 2. The system first obtains the object point cloud via planar segmentation of the raw point cloud and samples grasp candidates over it using the algorithm described in Section IV-B. For each sampled grasp candidate, the object point cloud is voxelized into a 3D voxel grid, where each voxel is either 0 (not occupied) or 1 (occupied). The voxel grid spans a fixed physical edge length, which is equivalent to cropping the object point cloud with a cubic box centered at the grasping point. We choose cropping rather than segmenting an object out of the point cloud because cropping preserves the surrounding geometry that helps avoid collisions in dense clutter. The object voxel grid may partially contain voxels of adjacent objects, which contributes to lower grasping stability predictions. Therefore, less surrounded objects, i.e., objects on the periphery, are prioritized, resulting in an onion-peeling grasping pattern.
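The crop-and-voxelize step can be sketched as follows; a minimal NumPy version in which the grid resolution (32 voxels per side) and all names are our assumptions.

```python
import numpy as np

def voxelize_crop(points, grasp_point, edge_len, grid=32):
    """Crop `points` to a cube of physical side `edge_len` centered at
    `grasp_point` and rasterize it into a binary occupancy grid.

    points      : (N, 3) object point cloud in the robot frame.
    grasp_point : (3,) sampled grasping point.
    grid        : voxels per side (32 is our assumption, not the paper's).
    """
    half = edge_len / 2.0
    local = points - grasp_point                   # center on the grasp point
    inside = np.all(np.abs(local) < half, axis=1)  # cubic crop
    idx = ((local[inside] + half) / edge_len * grid).astype(int)
    idx = np.clip(idx, 0, grid - 1)
    occupancy = np.zeros((grid, grid, grid), dtype=np.uint8)
    occupancy[idx[:, 0], idx[:, 1], idx[:, 2]] = 1  # mark occupied voxels
    return occupancy
```

Because the crop is a fixed-size box rather than an object mask, voxels belonging to neighboring objects survive, which is what lets the stability predictor penalize tightly surrounded targets.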
Our 3D CNN architecture is similar to those in [maturana2015voxnet] and [choi2018learning]. We embed the grasp pose within the voxel grid by transforming the grid with the candidate's pose. The 3D CNN then predicts the grasping stability from the transformed voxel grid. The reachability predictor extracts the reachability determinant from each randomly generated pose and estimates the grasping reachability. The grasping success probability is calculated by multiplying the two terms; Algorithm 1 explains this prediction procedure in detail.
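The prediction procedure amounts to scoring each candidate by the product of the two network outputs and keeping the best; a minimal sketch with placeholder models, in which all names are ours.

```python
import numpy as np

def select_grasp(candidates, voxel_grids, stability_net, reach_net):
    """Score each sampled candidate and return the most feasible one.

    candidates    : list of (position, rotation) grasp poses.
    voxel_grids   : matching list of pose-transformed occupancy grids.
    stability_net : callable grid -> p(stable)      (stands in for the 3D CNN).
    reach_net     : callable 6-vector -> p(reachable) (stands in for the RP).
    """
    best, best_score = None, -1.0
    for (pos, rot), grid in zip(candidates, voxel_grids):
        p_stable = stability_net(grid)
        determinant = np.concatenate([pos, rot[:, 2]])  # location + approach
        p_reach = reach_net(determinant)
        score = p_stable * p_reach      # decomposed success probability
        if score > best_score:
            best, best_score = (pos, rot), score
    return best, best_score
```

The multiplication means a candidate is discarded if either network is pessimistic: a perfectly stable but unreachable pose scores near zero.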
IV-B Grasp Pose Sampling Algorithm
Unlike the geometric-reasoning-based sampling algorithm proposed in [ten2017grasp], we present a flexible algorithm that uniformly samples grasp candidates over the target objects within two threshold values, ensuring that the robot comprehensively explores the 6DoF action space. Given a desired number of samples and a pair of approaching-vector thresholds, the algorithm uniformly selects grasp points from the input point cloud and associates a random pose with each point. Fig. 3 shows three examples of sampled grasp pose candidates. The sampling algorithm imposes few constraints on the generated candidates, so the network is able to learn the causes of grasp failure, such as grasping a corner or colliding with the object, from the unsuccessful data.

IV-C Network Architecture
We borrow the network structure from [choi2018learning]: two convolutional layers of 32 filters each, a max-pooling layer that condenses the features, and two dense layers of size 128 and 1. A given grasp pose can either be 1 (stable grasp) or 0 (unstable grasp), so we use binary cross-entropy as the loss function and a sigmoid activation in the final layer to predict the grasping stability of the voxel grid. The reachability predictor consists of an input layer, one hidden layer of size 16, one hidden layer of size 8, and a final output layer. Its input is the 6-dimensional reachability determinant, and its output is a binary classification result, where 1 denotes a reachable grasp pose.
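As a dependency-free illustration of the RP side (the 3D CNN needs a deep learning framework and is omitted), below is a forward pass with the layer sizes stated above; the weights are random, training is not shown, and all names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ReachabilityPredictor:
    """MLP over the 6-D reachability determinant: 6 -> 16 -> 8 -> 1,
    matching the layer sizes in the text. Weights here are random;
    training would minimize binary cross-entropy on self-explored labels."""

    def __init__(self, rng=None):
        rng = np.random.default_rng(rng)
        self.layers = [
            (rng.normal(scale=0.1, size=(6, 16)), np.zeros(16)),
            (rng.normal(scale=0.1, size=(16, 8)), np.zeros(8)),
            (rng.normal(scale=0.1, size=(8, 1)), np.zeros(1)),
        ]

    def predict(self, determinant):
        """determinant: (6,) location + approach direction -> p(reachable)."""
        h = np.asarray(determinant, dtype=float)
        for i, (w, b) in enumerate(self.layers):
            h = h @ w + b
            h = np.maximum(h, 0.0) if i < 2 else sigmoid(h)  # ReLU, then sigmoid
        return float(h[0])
```

Being this small, the RP adds negligible cost per candidate compared to invoking a motion planner.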
To integrate the two networks, we embed the grasping reachability into the grasping success probability by multiplying the two predictions. The final output indicates the grasping success probability of the grasp candidate.
IV-D Data Collection and Training
Training the 3D CNN with multiple objects is ineffective because the observed shape varies for every grasp pose, preventing the 3D CNN from generalizing over object geometries. We used 8 primitive shapes from [zeng2018learning] as our training objects. The self-supervised robot interacts with a single object and collects 60,000 labeled voxel grids to train the 3D CNN and 10,000 samples to train the RP. A testing dataset of 1,000 samples is reserved to evaluate the prediction accuracy of the 3D CNN. Fig. 4 compiles this result for training set sizes from 1,000 to 60,000; it is clear that the prediction accuracy benefits from large-scale training data. We also show two prediction results of the RP in Fig. 5; as expected, an approaching direction toward the robot itself results in a narrower reachable space. Our approach learns to select candidates with more vertical approaching directions, as they are generally more stable and reachable in our training and testing environments.
V Experiments
Our experiments aim to answer the following four questions: i) How much does our approach outperform other approaches? ii) Can our approach generalize to cluttered and novel objects? iii) Is the reachability predictor effectively targeting reachable objects? iv) Do our networks trained in simulation work well on a real robot?
To answer these questions, we conducted experiments in both simulated and real settings. We compare our approach with 3 other approaches: 1) RAND, a baseline that randomly generates a grasp pose without any learning; 2) GPD (Grasp Pose Detection) [ten2017grasp], which proposes 6DoF grasp candidates based on geometric reasoning and evaluates them with a CNN trained on RGB-D images; and 3) VPG (Visual Pushing for Grasping) [zeng2018learning], which learns the synergy between pushing and grasping via reinforcement learning to clear cluttered objects.
V-A Simulation Experiments
We adopt the same test objects and simulated environment as reported in [zeng2018learning] for a fair comparison. The V-REP [rohmer2013v] scene includes a UR5 robot arm equipped with an RG2 gripper, and the dynamics are simulated with the Bullet physics engine 2.83 [Coumans:2015:BPS:2776880.2792704]. We noticed that GPD under this one-camera setup (GPD1c) cannot perform well, as its geometry-based grasp candidate search requires a more complete point cloud. We therefore added another camera for GPD to reproduce the “passive point cloud” setup described in [ten2017grasp]. The gripper grasping point and orientation for each approach are calibrated separately, since the hand configurations differ. The experiments are illustrated in Fig. 6. For all experiments, a grasp is successful only if the object is lifted by 15 cm. A run is complete if the robot fails 10 times (including planning failures) consecutively or all objects have been removed.
Fig. 7 presents the average grasping success rates and completion rates over 30 runs for each method, where the grasping success rate is the ratio of successful grasps to grasp attempts and the completion rate is the ratio of removed objects to total objects. We first compare our performance with the others in a standard multiple-object grasping scenario, where the goal is to grasp 10 objects that are randomly dropped at the center of the ground. We notice that the RP effectively penalizes approaching directions toward the robot when the grasping point exceeds its learned thresholds, as shown in Fig. 5. To fairly compare our approach and GPD with VPG, we also report the top-down grasping success rate for each method. An interesting observation is that the performance of GPD improves significantly, as top-down grasping poses are mostly stable and reachable in this setting; this indirectly corroborates the importance of reachability awareness when proposing 6DoF poses. The challenging scenario compares our approach to the others in a densely cluttered setting where 30 objects are randomly dropped at the center. This triples the workspace density and thus demonstrates our ability to generalize to more cluttered scenes. With only one camera and no pushing to declutter the challenging scenario, our approach achieved the highest grasping success rate and completion rate. Our approach suggests feasible grasp poses as long as the object is within view, while VPG may accidentally force objects out of its workspace.
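A small sketch of computing these per-run metrics as defined above; the function and argument names are ours.

```python
def grasping_metrics(attempts, successes, objects_total, objects_removed):
    """Per-run evaluation metrics as defined in the text:
    success rate over grasp attempts, completion over objects present."""
    success_rate = successes / attempts if attempts else 0.0
    completion_rate = objects_removed / objects_total
    return success_rate, completion_rate
```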
TABLE I: Ablation study in simulation (%)

                        RAND    3DCNN   OURS
Grasping Success Rate   13.33   32.67   82.67
Planning Success Rate   26.67   37.33   96.00
We report an ablation study of the reachability predictor by grasping from two clusters of objects, one of which is partially unreachable. We tested our approach with (OURS) and without (3DCNN) the RP. Given only 5 chances per run, the robot has to choose the most stable and reachable pose to succeed. Table I reports the average grasping success rate and planning success rate over 30 runs for RAND, 3DCNN, and OURS, for a total of 150 grasps, where the planning success rate is the ratio of grasp attempts with valid motion plans to total attempts. Our reachability-aware 3D CNN achieves a 96% planning success rate, an improvement of 58.67% over the 3D CNN alone.
TABLE II: Challenging scenario on the real robot (%)

                        RAND    GPD1c   VPG     OURS
Grasping Success Rate   13.10   31.23   64.76   75.20
Completion Rate         14.58   12.75   94.33   100.0
V-B Real Robot Experiments
We evaluate our approach on a Franka Emika Panda robot arm without any fine-tuning. Fig. 10 shows the real robot experiments. We use 30 toy blocks to reproduce the simulated challenging scenario and the ablation study, five random household objects to test generalization, and a chair positioned over the object to show the flexibility of our approach.
We compare our approach with RAND, GPD, and VPG in the challenging scenario. Due to hardware limitations, we were not able to implement GPD with two cameras. Table II compiles the performance of each method, averaged over 10 runs. Our approach performs consistently in the real world and achieves the best grasping success rate and completion rate. Despite the robustness of the 3D CNN to sensor noise [choi2018learning], our grasp sampling algorithm may propose a vacant grasping point that leads to a misaligned pose. VPG performs efficiently as long as objects are not forced out of its constrained workspace. GPD1c suffers from predicting unreachable poses and from the incomplete point cloud.
In the real-robot ablation study, our approach demonstrates an efficiency improvement as seen previously, but the gap between the simulated UR5 and the real Panda arm degrades the performance. Quantitative results are shown in Table III.
TABLE III: Ablation study on the real robot (%)

                        RAND    3DCNN   OURS
Grasping Success Rate   3.33    23.33   66.67
Planning Efficiency     26.67   34.67   88.67
To test real-world generalizability, we further experiment with objects that were never seen during training, shown in Fig. 8. We ran 10 tests on each object; the results are summarized in Fig. 9. Our approach is able to extract 3D geometric features from novel shapes and select grasp pose candidates accordingly. We noticed that when grasping considerably larger objects such as the Perrier bottle, our approach prefers smaller parts such as the bottle neck and cap. This is due to the limited physical voxel grid size, which only fits shapes similar in size to our training objects. We also demonstrate the system's grasping flexibility by positioning an object under a chair to reflect real-world challenges. As shown in Fig. 10, the robot completes the task by selecting a feasible grasp, which is not possible with top-down grasping approaches such as VPG.
VI Conclusions
In this work, we presented a deep learning approach to generate 6DoF grasp poses with reachability awareness. A 3D CNN model that estimates grasping stability was trained on a large-scale dataset obtained through simulated self-supervision. A reachability predictor that improves the reachability awareness of the 3D CNN was trained similarly. We outperformed several comparable deep learning approaches in both simulation and the real world. Furthermore, our method achieved an 82.5% grasping success rate on unknown objects. The ablation study showed that the RP significantly increased the planning efficiency of the 3D CNN, by 54% in real-world experiments.
The limitations of our work suggest two directions for future research. First, our sampling algorithm selects grasping points from the point cloud, which is susceptible to sensor noise; we plan to investigate whether a voxel-based sampling method can enhance robustness. Second, we applied the reachability predictor only to grasping and would like to explore whether the RP can improve other manipulation tasks.