Grasping is among the most fundamental and long-standing problems in robotics research. While classical model-based methods using mechanical analysis tools [2, 21, 8] can already grasp objects of known geometry, how to grasp generic objects in complex scenes remains an open problem.
Recently, data-driven approaches have shed light on the generic grasping problem using machine learning tools [15, 10, 31, 27]. To generalize readily to unseen objects and layouts, a large body of recent work has focused on 3/4-DoF (degree of freedom) grasping, where the gripper is forced to approach objects vertically from above [14, 13]. Although this greatly simplifies the problem for pick-and-place tasks, it also inevitably restricts the ways to interact with objects. For example, such grasping is unable to grab a horizontally placed plate. Worse still, top-down grasping often encounters difficulties in cluttered scenes with casually heaped objects, where extra degrees of freedom are required to grasp buried objects. The limitation of 3/4-DoF grippers thus motivates the study of 6-DoF grippers that can approach the object from arbitrary directions. We note that a 6-DoF end-effector is essential for dexterous object manipulation tasks [29, 9].
This paper studies the 6-DoF grasping problem in a realistic yet challenging setting, assuming that a set of household objects from unknown categories are casually scattered on a table. A commodity depth camera is mounted with a fixed pose to capture this scene from only a single viewpoint, which gives a partial point cloud of the scene. The grasp is performed by a parallel gripper.
The setting is highly challenging for both perception and planning. First, scene clutter limits viable grasp poses and may even cause motion planning algorithms to fail to reach certain grasps. This challenge rules out 3/4-DoF grasp detection and leads us to the more powerful yet more sophisticated 6-DoF detection approach. Second, we make no assumptions about object categories. This open-set setting puts us in a different category from existing semantic grasping methods, such as DexNet. We require a higher level of generalizability based on a better representation of the perceived content. Most existing methods can only work in simpler scenarios, by introducing high-quality and expensive 3D sensors for accurate scene capture, by sensing the complete environment with multiple cameras, or by assuming a scene with only a single object. This challenge demands that our grasp detection be noise-resistant and amodal, i.e., able to make an educated guess of viable grasps from only a partial point cloud.
We address the challenges in a learning-based framework. At a high level, we rely on a single-shot grasp proposal network, trained with synthetic data and tested in real-world scenarios. Our design involves (1) a single-shot neural network architecture for amodal grasp proposal; and (2) a scene-level training data synthesis pipeline leveraging an innovative gripper contact model.
By its single-shot nature, our grasp proposal network enjoys better efficiency and accuracy compared with existing deep networks in the 6-DoF grasping literature. Existing work samples grasp candidates following some heuristics and assesses their quality using networks. However, the running time grows quickly as the number of sampled grasps increases, which makes grasp optimization too slow. Unlike these approaches, we propose to directly regress 6-DoF grasps from the entire scene point cloud in one pass. Specifically, we are the first to propose a per-point scoring and pose regression method for 6-DoF grasping.
3D data from low-cost commercial depth sensors are partial, noisy, and corrupted. To handle the imperfection of the input 3D data, SG is trained on hallucinated point clouds with similar patterns and learns to extract robust features for grasp prediction from the corrupted data. We propose a simple yet effective gripper contact model to generate good grasps and associate these grasps with the point cloud. At inference time, we select high-quality grasps based on the proposals of the network. Note that we are the first to generate synthetic scenes of many objects, rather than a single object, in the 6-DoF grasping literature.
The core novel insight of our SG is that we learn to propose possible grasps in the space of 6-DoF grasp poses by regression. We believe learning to regress grasp proposals will become the trend: in object detection, a problem with a similar setting, the community has evolved from sliding windows to learning to generate object proposals. A second novelty is that, instead of generating training data from scenes of only a single object, we include multiple objects in the scene, with grasp proposals analyzed using a gripper contact model that considers the shape and size of the touching area. Supplementary videos and code are at https://sites.google.com/view/s4ggrapsing.
2 Related work
Deep Learning based Grasping Methods
Caldera et al. gave a thorough survey of deep learning methods for robotic grasping, which demonstrates the effectiveness of deep learning on this task. In our paper, we focus on the problem of 6-DoF grasp proposal. Collet et al., Zeng et al., and Mousavian et al. tackled this problem by fitting the object model to the scanned point cloud to retrieve the 6-DoF pose. Although this has shown promising results in industrial applications, its feasibility is limited in generic robotic application scenarios, e.g., household robots, where exact 3D models of numerous objects are not accessible. ten Pas et al. proposed to generate grasp hypotheses based only on local geometry priors and attained better generalizability on novel objects, which was further extended by Liang et al. by replacing multi-view projection features with a direct point cloud representation. Because potential viable 6-DoF grasp poses are infinite, these methods guide the sampling process by constructing a Darboux frame aligned with the estimated surface normal and principal curvature and searching in its 6D neighbourhood. However, they may fail to find feasible grasps for thin structures, such as plates or bowls, where computing normals analytically from partial and noisy observations is challenging. In contrast to these sampling approaches, our framework is a single-shot grasp proposal framework [6, 3], i.e., a direct regression approach for predicting viable grasp frames, which can handle flawed input well thanks to the knowledge learned by the network. Moreover, by jointly analyzing local and global geometry information, our method considers not only the object of interest but also its surroundings, which allows the generation of collision-free grasps in dense clutter.
Training Data Synthesis for Grasping
Deep learning methods require an enormous volume of labelled data for training; however, manually annotating 6-DoF grasp poses is not practical. Therefore, analytic grasp synthesis is indispensable for ground-truth data generation. These analytic models provide guaranteed measurements of grasp properties when complete and precise geometric models of objects are available. In practice, the observations from sensors are partial and noisy, which undermines the metric accuracy. To serve our single-shot grasp detection framework, we first use analytic methods to generate viable grasps for each single object and then reject infeasible grasps in densely cluttered scenes. To the best of our knowledge, the dataset we generated is the first large-scale synthetic 6-DoF grasp dataset for dense clutter.
Deep Learning on 3D Data
Qi et al. [24, 26] proposed PointNet and PointNet++, novel 3D deep learning architectures capable of extracting useful representations from 3D point clouds. Compared with other architectures [20, 25], PointNets are robust to varying sampling densities, which is important for real robotic applications. In this paper, we utilize PointNet++ as the backbone of our single-shot grasp detection and demonstrate its effectiveness.
3 Problem Setting
We denote the single-view point cloud by and the gripper description by . A parallel gripper can be parameterized by the frame whose origin lies at the middle of the line segment connecting the two fingertips and whose orientation aligns with the gripper axes. We therefore denote a grasp configuration as , where and is a score measuring the quality of .
4 Training Data Generation
To train our SG, a large-scale dataset capturing cluttered scenes, with viable grasps and quality scores as ground truth, is indispensable. Fig. 2 illustrates the training data generation pipeline. We use the YCB object dataset for our data generation. Since SG directly takes a single-view point cloud from the depth sensor as input and outputs collision-free grasps in a densely cluttered environment, we need to generate such scenarios with both the complete scene point cloud and the corresponding partially observed point cloud. Each point in the point cloud is assigned several grasps, as introduced in Sec. 4.3, and each ground-truth grasp has a pose, an antipodal score, a collision score, an occupancy score, and a robustness score, which we introduce later. The scene point cloud, on the other hand, does not interact with the network explicitly; it serves as a reference for evaluating grasps in the point cloud.
4.1 Gripper Contact Model
A vast literature exists on finding regions suitable for grasping by analyzing the 3D geometry. Among these methods, force closure has been widely used to synthesize grasps and can be reduced to calculating angles between face normals, known as antipodal grasps [23, 5]. Here we introduce our gripper contact model based on force closure analysis to find feasible grasps.
To be more specific, we first detect all possible contact pairs with high antipodal score , where is the angle between the outward normal and the line connecting two contact points. As illustrated in Fig. 3, for each contact pair (, ), the normal at point is smoothed with radius mm. Note that this step is important for grasping objects with rugged surfaces and high-frequency normal variation. However, we do not directly use a ball query to find its neighbors, which would lead to undesirable results at corners and edges. Instead, within the query ball of radius used for normal calculation, we remove the neighbors whose distance along the normal direction ( calculated as ) is larger than mm, where is the -th neighbor of point .
These two hyper-parameters have definite physical meanings, unlike the gripper contact model hyper-parameters in GPD, which are obtained through extensive parameter tuning. As shown in Fig. 3, our gripper only interacts with the object through its soft rubber pad, which allows deformation within mm. The normal smoothing radius is set to the gripper width mm.
In fact, our gripper model has a clear advantage over Darboux frame based methods, especially on rugged surfaces and flat surfaces. For rugged surfaces, there is no principled way to decide the radius for normal smoothing, since the radius is relevant not only to the gripper but also to the object to grasp. For flat surfaces, the principal curvature directions are under-determined. In practice, we do observe issues in these cases. For example, for plates and mugs, a Darboux frame based method is likely to fail to generate a successful grasp pose for the thin wall.
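The neighbor filtering used for normal smoothing can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names, the use of NumPy, and the choice of averaging the retained neighbor normals are assumptions, and `radius` and `max_normal_offset` stand in for the two hyper-parameters discussed above.

```python
import numpy as np

def filtered_neighbors(points, normals, i, radius, max_normal_offset):
    """Indices of points inside the query ball around point i whose offset
    along the normal of point i is at most max_normal_offset."""
    diff = points - points[i]                       # offsets from point i, shape (N, 3)
    in_ball = np.linalg.norm(diff, axis=1) <= radius
    along_normal = np.abs(diff @ normals[i])        # |n_i . (p_j - p_i)|
    return np.flatnonzero(in_ball & (along_normal <= max_normal_offset))

def smoothed_normal(points, normals, i, radius, max_normal_offset):
    """Assumed smoothing rule: average the normals of the retained neighbors."""
    idx = filtered_neighbors(points, normals, i, radius, max_normal_offset)
    n = normals[idx].mean(axis=0)
    length = np.linalg.norm(n)
    # Fall back to the raw normal if the neighbor normals cancel out.
    return normals[i] if length < 1e-12 else n / length
```

With a plain ball query, a point lying on a neighboring face near an edge would be averaged in; the offset test along the normal is what excludes it.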
Besides the direction of contact force, we also consider the stability of the grasp. The occupancy score , which represents the volume of object within the gripper closing region , is calculated by
where is the number of points within the closing region. If is small, the gripper contact analysis will be unreliable. To make sure that the point cloud occupancy correctly represents the volume, we down-sample the point cloud using a voxel grid filter with a leaf size of mm.
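As an illustration of the down-sampling step, a voxel grid filter can be sketched as follows. This is a minimal NumPy version under stated assumptions: the function name is hypothetical, each occupied voxel is replaced by the centroid of its points, and `leaf_size` is a free parameter rather than the value used in our experiments.

```python
import numpy as np

def voxel_downsample(points, leaf_size):
    """Replace all points falling in the same voxel by their centroid."""
    keys = np.floor(points / leaf_size).astype(np.int64)     # integer voxel index per point
    _, inverse, counts = np.unique(
        keys, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)                             # flat voxel id per point
    sums = np.zeros((counts.size, points.shape[1]))
    np.add.at(sums, inverse, points)                          # accumulate points per voxel
    return sums / counts[:, None]
```

After this step the point density is roughly uniform, so counting points inside the closing region becomes comparable across objects and viewpoints.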
4.2 Physically-plausible Scene Synthesis from Objects
Since our network is trained on synthetic data and directly applied to real-world scenarios, it is necessary to generate training data that is close to reality both physically and visually.
We need physically plausible layouts of various scenes, where each object is in equilibrium under gravity and contact forces. Therefore, we adopt the MuJoCo engine and V-HACD to generate scenes where each object is in equilibrium. Objects initialized with random elevations and poses fall onto a table in the simulator and converge to static equilibrium due to friction. We record the poses and positions of the objects and reconstruct the 3D scene (Fig. 2).
Besides the scene point cloud, we also need to generate the viewed point clouds that are fed into the neural network. To simulate the noise of the depth sensor, we apply a noise model to the distance from the camera optical center to each point as , where is the noiseless distance captured by a ray tracer and is the distance used to generate viewed point clouds. We employ in this paper.
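A depth-noise model of this kind can be sketched as below. The sketch rests on an assumption: it uses zero-mean Gaussian noise that scales the camera-to-point distance multiplicatively, which is consistent with the statement in Sec. 6.3 that the simulated noise is proportional to the depth. The function name and the parameter `sigma` are hypothetical.

```python
import numpy as np

def add_depth_noise(points, camera_center, sigma, rng=None):
    """Perturb each point along its viewing ray.

    Assumed model: noisy_distance = (1 + eps) * distance, eps ~ N(0, sigma^2).
    """
    rng = np.random.default_rng() if rng is None else rng
    rays = points - camera_center                        # camera-to-point vectors
    scale = 1.0 + rng.normal(0.0, sigma, size=(points.shape[0], 1))
    return camera_center + rays * scale                  # points stay on their rays
```

Scaling the ray vector changes only the distance to the optical center, so every noisy point still projects to the same pixel as its noiseless counterpart.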
4.3 Robust Grasp Generation by Scene Analysis
Given the scene point cloud, we can perform collision detection for each grasp configuration. The collision score is a scene-specific boolean mask indicating whether a collision occurs between the proposed gripper pose and the complete scene. As shown in our experiments, our network can better predict collisions with invisible parts.
It is common that a robot end-effector cannot move precisely to a given pose due to sensor noise, hand-eye calibration error, and mechanical transmission noise. To perform a successful grasp under imperfect conditions, the proposed grasp should be robust against uncertainty in the gripper's pose. In this paper, we add a small perturbation to the grasp pose and evaluate the antipodal score, occupancy score, and collision score for the perturbed pose. The final scalar score of each grasp can be derived as:
where is the pose perturbation and is the exponential mapping. The final viewed point cloud with ground truth grasps and scores will serve as training data for our SG.
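The perturbation analysis can be illustrated with a short sketch. The exponential map below is the standard closed form from se(3) to SE(3); everything else is an assumption made for illustration: the worst-case (minimum) aggregation, the left-multiplication convention, and the hypothetical callable `score_fn`, which stands in for the combination of antipodal, occupancy, and collision scores.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix such that hat(w) @ x == np.cross(w, x)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_se3(xi):
    """Exponential map of a twist xi = (v, w) in R^6 to a 4x4 rigid transform."""
    v, w = np.asarray(xi[:3], float), np.asarray(xi[3:], float)
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-8:                       # first-order approximation near zero rotation
        R = np.eye(3) + W
        V = np.eye(3) + 0.5 * W
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + A * W + B * (W @ W)
        V = np.eye(3) + B * W + C * (W @ W)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

def robust_score(pose, score_fn, perturbations):
    """Assumed aggregation: worst score over all perturbed grasp poses."""
    return min(score_fn(exp_se3(d) @ pose) for d in perturbations)
```

A grasp that scores well only at its nominal pose is penalized under this aggregation, which matches the intent of preferring grasps that tolerate execution error.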
5 Single-Shot Grasp Generation
5.1 PointNet++ based Grasp Proposal
We design the single-shot grasp proposal network based on the segmentation version of PointNet++, which has demonstrated state-of-the-art accuracy and strong robustness to clutter, corruption, non-uniform point density, and adversarial attacks.
Fig. 4 shows the architecture of SG, which takes the single-view point cloud as input and assigns each point two attributes. The first attribute is a good grasp (if one exists) associated with the point by inverse indexing, and the second attribute is the quality score of the stored grasp. The generation of the grasp and quality score is described in Sec. 4.3.
The hierarchical architecture not only allows us to extract local features and predict reasonable local frames when the observation is partial and noisy, but also combines local and global features to effectively infer the geometry relationship between objects in the scene.
Compared with sampling and grasp classification [29, 16], the single-shot direct regression of 6-DoF grasps is more challenging for networks to learn, because widely adopted rotation representations such as quaternions and Euler angles are discontinuous. In this paper, we use a 6D representation of the 3D rotation matrix because of its continuity: for every , it is represented by , , such that the mapping is
where denotes the normalization function. Because the gripper is symmetric with respect to rotation around the axis, we use a loss function that handles the ambiguity by treating both equivalent rotation matrices as ground-truth options. Given the ground-truth rotation matrix, we define the rotation loss function as
The prediction of translation vectors is treated as a regression task and a regression loss is applied. By dividing the ground-truth score into multiple levels, the grasp quality score prediction is treated as a multi-class classification task, and a weighted cross-entropy loss is applied to handle the imbalance between positive and negative data. We only supervise the pose prediction for those points assigned viable grasps, and the total loss is defined as:
where represent the point set with viable grasps and the whole scene point cloud, respectively. are set to 5.0, 20.0, 1.0 in experiments.
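The rotation handling described above can be sketched in NumPy. The 6D-to-rotation mapping follows the Gram-Schmidt construction of the cited work on the continuity of rotation representations. The loss is only an illustrative assumption: it uses a squared Frobenius distance, and the symmetry axis is exposed as the hypothetical parameter `axis` (index of the gripper's local axis of symmetry).

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def rotation_from_6d(a1, a2):
    """Map two 3-vectors to a rotation matrix by Gram-Schmidt orthogonalization."""
    b1 = normalize(a1)
    b2 = normalize(a2 - np.dot(b1, a2) * b1)   # remove the component along b1
    b3 = np.cross(b1, b2)                      # completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)      # b1, b2, b3 are the columns

def symmetric_rotation_loss(R_pred, R_gt, axis=0):
    """Distance to the closer of the two equivalent ground-truth rotations."""
    flip = -np.eye(3)
    flip[axis, axis] = 1.0                     # 180-degree rotation about the local axis
    candidates = (R_gt, R_gt @ flip)
    return min(float(np.sum((R_pred - R) ** 2)) for R in candidates)
```

Taking the minimum over both candidates means the network is not penalized for predicting either of the two physically identical gripper orientations.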
5.2 Non-maximum Suppression and Grasp Sampling
Algorithm 1 describes the strategy for choosing one grasp h to execute from the network predictions.
Because the network generates one grasp for each point, there are numerous similar grasps in each grasp's neighborhood. We use non-maximum suppression (NMS) to select the grasps h with locally maximal scores, which form the executable grasp set. Weighted random sampling is then applied to sample one grasp to execute according to its grasp quality score.
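The selection strategy can be sketched as follows. This is a minimal illustration rather than a transcription of Algorithm 1: the function names are hypothetical, neighborhoods are defined by the distance between grasp positions only (an assumption), and scores are assumed to be non-negative with a positive sum.

```python
import numpy as np

def grasp_nms(positions, scores, radius):
    """Greedy NMS: keep the best grasp, suppress all grasps within `radius`, repeat."""
    order = np.argsort(-scores)                       # indices by descending score
    suppressed = np.zeros(len(scores), dtype=bool)
    keep = []
    for i in order:
        if suppressed[i]:
            continue
        keep.append(i)
        dist = np.linalg.norm(positions - positions[i], axis=1)
        suppressed |= dist < radius                   # also marks grasp i itself
    return np.array(keep, dtype=np.int64)

def sample_grasp(keep, scores, rng=None):
    """Draw one kept grasp with probability proportional to its score."""
    rng = np.random.default_rng() if rng is None else rng
    weights = scores[keep]
    return int(rng.choice(keep, p=weights / weights.sum()))
```

Sampling instead of always taking the top grasp gives the robot a different candidate on the next attempt if the highest-scoring grasp fails.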
6.1 Implementation Details
The input point cloud is first preprocessed, including workspace filtering, outlier removal, and voxel grid down-sampling. For training and validation, we sample points from the point set with viable grasps and from the remaining point set, and combine them as the input of the network. For evaluation, we sample points at random from the preprocessed point cloud. The number of sampled points is set to 25600 in our experiments. We implement our network in PyTorch and train it for 100 epochs using Adam as the optimizer, with an initial learning rate that is decreased by a fixed factor at regular epoch intervals.
6.2 Superiority of 6-DoF grasp
We first evaluated the grasp quality performance of our proposed network on simulated data. To demonstrate the superiority of 6-DoF grasps over 3/4-DoF grasps, we give a quantitative analysis over 6k scenes with around 2.6M generated grasps (Fig. 5). In our experiments, grasps are uniformly divided into 6 groups according to the angle between the approach vector and the vertical direction. We use the recall rate as the metric, defined as the percentage of objects that can be grasped using grasps whose angle lies between vertical and a certain angle. We evaluate the recall rate on scenes of three different densities: simple (1-5 objects in the scene), semi-dense (6-10 objects), and dense (11-15 objects). The overall recall rate is the weighted average over the three scene types. We find that only a fraction of the objects can be grasped by nearly vertical grasps. As scene complexity increases, the advantage of 6-DoF grasps becomes more remarkable.
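The recall computation can be illustrated with a small sketch. The names are hypothetical, and two conventions are assumed rather than taken from the text: the vertical direction is the downward unit vector (0, 0, -1), and the six groups are given by bin edges at 15, 30, ..., 90 degrees.

```python
import numpy as np

def approach_angle_deg(approach, vertical=(0.0, 0.0, -1.0)):
    """Angle in degrees between each approach vector and the vertical direction."""
    a = approach / np.linalg.norm(approach, axis=1, keepdims=True)
    cosang = np.clip(a @ np.asarray(vertical), -1.0, 1.0)
    return np.degrees(np.arccos(cosang))

def cumulative_recall(angles_deg, object_ids, num_objects, bin_edges):
    """Fraction of objects that have at least one grasp within each angle limit."""
    recalls = []
    for edge in bin_edges:
        reachable = np.unique(object_ids[angles_deg <= edge])
        recalls.append(reachable.size / num_objects)
    return recalls
```

The first entry of the returned list corresponds to nearly vertical grasps, and the last entry to the full 6-DoF grasp set.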
6.3 Simulation Experiments
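Table 1: Simulation results without (w/o) and with (w/) sensor noise.
| Method | Antipodal Score (w/o Noise) | Collision-free (w/o Noise) | Antipodal Score (w/ Noise) | Collision-free (w/ Noise) |
|---|---|---|---|---|
| GPD (3 channels) | 0.5947 | 47.07% | 0.5802 | 40.00% |
| GPD (12 channels) | 0.5883 | 45.27% | 0.5946 | 40.44% |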
GPD and PointNetGPD adopt Darboux frame analysis to sample grasp poses and train a classifier to evaluate their quality, achieving state-of-the-art performance in 6D grasp detection. We choose GPD (3 channels), GPD (12 channels), and PointNetGPD as our baseline methods. To train the baseline methods, we adopt their grasp sampling strategy to generate grasp candidates for each scene until we obtain 300 collision-free grasps. We generate grasps over 6.5k scenes and obtain more than 2M grasps, which is more than the 300K grasps in the original paper. Note that the scenes used to generate training data for the baseline methods are exactly the same as those used for our method.
For the evaluation of the baseline methods, we first sample 1000 points at random from the point cloud and calculate the Darboux frame for the grasp candidates, which are then classified and ranked. The top 10 grasps are evaluated for both the baseline methods and our method.
To evaluate our method in finding collision-free grasps, we compare two metrics that affect the final grasping success rate: (1) the antipodal score, which describes the force closure property of grasps, and (2) the probability of collision with other objects not observable by the depth sensor. The evaluation is performed in the simulator with two settings: (1) no noise, where the point cloud from the depth sensor simulator aligns perfectly with the complete point cloud; and (2) with noise, where the noise of the depth simulator is proportional to the depth. Please note that for the noise setting and the real-world experiments, both the baselines and our method are trained on noisy data. Table 1 shows the comparison results. Since the 6-DoF grasp pose is regressed by our SG instead of being computed from local normals and curvatures, it is less sensitive to partial and noisy depth observations; in addition, our SG is able to generate more collision-free grasps by inferring from local and global geometry information jointly.
6.4 Robotic Experiments
We validate the effectiveness and reliability of our method in real robotic experiments. We carried out all the experiments on a Kinova MOVO, a mobile manipulator with a Jaco2 arm equipped with a 2-finger gripper (Fig. 1 (a)). To stay close to real domestic robot application scenarios, we use one KinectV2 depth sensor mounted on the head of the manipulator, which makes the observation heavily occluded and raises the difficulty of the experiments. We use 30 objects of various shapes and weights (see the Supplementary Material), none of which appear in the training dataset.
The experimental procedure is as follows: (1) choose 10 of the 30 objects at random and put them on the table to form a cluttered scene; (2) the robot attempts multiple grasps until all objects are grasped or 15 grasps have been attempted; (3) steps (1) and (2) are repeated 4 times for each method. More details are presented in the supplementary material. Note that none of the objects used in the real robot experiments appear in the training data.
As shown in Table 2, our method outperforms the baseline methods in terms of success rate, completion rate, and time efficiency, which suggests that the single-shot regressed 6-DoF grasps have better force closure quality than the grasps sampled by the baselines, as demonstrated in Fig. 6. Unlike our method, the baseline methods also need to detect collisions and extract local geometry for every sampled grasp, which takes around 20 seconds, so they are much more time-consuming.
Our experimental setting is much more challenging than that of the baseline papers. In the original paper, GPD uses two depth sensors on both sides of the arena to capture a nearly complete point cloud, whereas in our experiments only one depth sensor is used. In both baselines, grasps are sampled in the neighbourhood of the Darboux frame. This works well on convex objects (boxes and balls) but poorly on non-convex or thin-structure objects, such as the mugs and bowls in our experiments, because the heuristic sampling requires accurate normals and curvatures, and estimating them from a noisy point cloud is challenging. In contrast, PointNet++ has been demonstrated to be robust against adversarial changes to the input data, and can therefore better capture the geometric structure under noise.
Table 2: Results of the robotic experiments. Success rate and completion rate are used as the evaluation metrics, which represent accuracy and completeness, respectively.
| Method | Success rate | Completion rate | Processing time | Inference time | Total time |
|---|---|---|---|---|---|
| GPD (3 channels) | 40.0% | 60.0% | 24106 ms | 1.50 ms | 24108 ms |
| GPD (12 channels) | 33.3% | 50.0% | 27195 ms | 1.70 ms | 27197 ms |
| Ours | 77.1% | 92.5% | 5804 ms | 12.60 ms | 5817 ms |
We studied the problem of 6-DoF grasping by a parallel gripper in a cluttered scene captured using a commodity depth sensor from a single viewpoint. Our learning-based approach, trained on synthetic scenes, works well in real-world scenarios, with improved speed and success rate compared with state-of-the-art methods. This success shows that our design choices, including a single-shot grasp proposal and a novel gripper contact model, are effective.
We would like to acknowledge the National Science Foundation for the grant RI-1764078 and Qualcomm for the generous support. We especially thank Jiayuan Gu for the discussion on network architecture design and Fanbo Xiang for the idea of using a single-object cache to accelerate training data generation.
-  (2000) Robotic grasping and contact: a review. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), Vol. 1, pp. 348–353. Cited by: §2.
-  (2013) Data-driven grasp synthesis—a survey. IEEE Transactions on Robotics 30 (2), pp. 289–309. Cited by: §1.
-  (2018) Review of deep learning methods in robotic grasp detection. Multimodal Technologies and Interaction 2 (3), pp. 57. Cited by: §2.
-  (2015) The ycb object and model set: towards common benchmarks for manipulation research. In 2015 international conference on advanced robotics (ICAR), pp. 510–517. Cited by: §A.2, §4.
-  (1993) Finding antipodal point grasps on irregularly shaped objects. IEEE transactions on Robotics and Automation 9 (4), pp. 507–512. Cited by: §4.1.
-  (2018) Real-world multiobject, multigrasp detection. IEEE Robotics and Automation Letters 3 (4), pp. 3355–3362. Cited by: §2.
-  The moped framework: object recognition and pose estimation for manipulation. The International Journal of Robotics Research 30 (10), pp. 1284–1306. Cited by: §2.
-  (2012) Semantic grasping: planning robotic grasps functionally suitable for an object manipulation task. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1311–1317. Cited by: §1.
-  (2018) Learning 6-dof grasping and pick-place using attention focus. arXiv preprint arXiv:1806.06134. Cited by: §1.
-  (2018) Grasp2vec: learning object representations from self-supervised grasping. arXiv preprint arXiv:1811.06964. Cited by: §1.
-  (2016) Deep learning a grasp function for grasping under gripper pose uncertainty. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4461–4468. Cited by: §1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.1.
-  (2017) Robotic grasp detection using deep convolutional neural networks. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 769–776. Cited by: §1, §2.
-  (2015) Deep learning for detecting robotic grasps. The International Journal of Robotics Research 34 (4-5), pp. 705–724. Cited by: §1.
-  (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 (4-5), pp. 421–436. Cited by: §1.
-  (2019) PointNetGPD: detecting grasp configurations from point sets. In IEEE International Conference on Robotics and Automation (ICRA), Cited by: §2, §5.1, §6.3.
-  (2019) Extending adversarial attacks and defenses to deep 3d point cloud classifiers. arXiv preprint arXiv:1901.03006. Cited by: §5.1, §6.4.
-  (2019) Learning ambidextrous robot grasping policies. Science Robotics 4 (26), pp. eaau4984. Cited by: §1.
-  (2009) A simple and efficient approach for 3d mesh approximate convex decomposition. In 2009 16th IEEE international conference on image processing (ICIP), pp. 3501–3504. Cited by: §4.2.
-  (2015) Voxnet: a 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928. Cited by: §2.
-  (2004) Graspit! a versatile simulator for robotic grasping. Cited by: §1.
-  (2019) 6-dof graspnet: variational grasp generation for object manipulation. arXiv preprint arXiv:1905.10520. Cited by: §2.
-  (1988) Constructing force-closure grasps. The International Journal of Robotics Research 7 (3), pp. 3–16. Cited by: §4.1.
-  (2016) PointNet: deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593. Cited by: §2.
-  (2016) Volumetric and multi-view cnns for object classification on 3d data. In , pp. 5648–5656. Cited by: §2.
-  (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108. Cited by: §2, Figure 4, §5.1.
-  (2018) Deep reinforcement learning for vision-based robotic grasping: a simulated comparative evaluation of off-policy methods. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6284–6291. Cited by: §1.
-  (2012) An overview of 3d object grasp synthesis algorithms. Robotics and Autonomous Systems 60 (3), pp. 326–336. Cited by: §4.1.
-  (2017) Grasp pose detection in point clouds. The International Journal of Robotics Research 36 (13-14), pp. 1455–1473. Cited by: §1, §1, §1, §2, §4.1, §5.1, §6.3.
-  (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §4.2.
-  (2019) Multi-modal geometric learning for grasping and manipulation. In 2019 International Conference on Robotics and Automation (ICRA), pp. 7339–7345. Cited by: §1.
-  (2017) Multi-view self-supervised deep learning for 6d pose estimation in the amazon picking challenge. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1386–1383. Cited by: §2.
-  (2018) On the continuity of rotation representations in neural networks. arXiv preprint arXiv:1812.07035. Cited by: §5.1.
Appendix A Supplementary Material
A.1 Network Details
We use 3 point set abstraction layers, each of which is a 3-layer MLP, containing , , units, respectively. ReLU is used as the activation function. Farthest Point Sampling (FPS) is adopted for better and more uniform coverage: a subset of points is chosen from the input point set such that each newly chosen point is the one farthest from the points already selected. Compared with random sampling, FPS has better coverage of the entire point set. It is performed iteratively to obtain the centroids for grouping from the output of the former stage.
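For reference, the iterative procedure can be sketched as follows. This is a minimal NumPy version of the standard greedy FPS algorithm; the function name and the choice of the first point (`start`) are assumptions, and a practical implementation would typically run on the GPU.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    n = points.shape[0]
    selected = np.empty(num_samples, dtype=np.int64)
    min_dist = np.full(n, np.inf)          # squared distance to the nearest chosen point
    current = start
    for k in range(num_samples):
        selected[k] = current
        d = np.sum((points - points[current]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, d)
        current = int(np.argmax(min_dist)) # farthest remaining point becomes the next pick
    return selected
```

Each iteration costs one pass over the point set, so selecting m centroids from n points takes O(mn) time.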