1 Introduction
Grasp generation is one of the most important problems in robot manipulation. Here, a robot observes an object and needs to decide where to move its gripper in order to pick up the object (see Fig. 1). Object grasping is mainly formulated in two different ways: planar grasping and 6-DOF grasping. In planar grasping, the grasp representation exploits the fact that the view is perpendicular to the support surface; as a result, each grasp can be recovered from the 2D location of the grasp center in image space and the in-plane rotation of the gripper. While this representation is compact, it limits the workspace of the robot because it is only valid in the area where the ray corresponding to the grasp center is perpendicular to the surface. In 6-DOF grasping, the grasp representation is more complex and is defined by a 3D rotation and a 3D translation. Such a representation is more versatile and does not constrain the robot workspace, but it makes the problem more challenging.
Grasp generation is complex because the stability of a grasp depends on object and gripper geometry, object mass distribution, and surface friction. The geometry around an object poses additional constraints on which grasp points are reachable without causing the robot manipulator to collide with other objects in a scene (see Fig. 2). Typically, this problem is approached with geometry-inspired heuristics that select promising grasp points around an object, possibly followed by a more in-depth geometric analysis of the stability and reachability of a sampled grasp [30]. Many of these approaches rely on the availability of complete 3D models of an object, which is a severe limitation in realistic scenarios where a robot only observes a scene with a noisy depth camera, for example. To overcome this limitation, one could move the camera to generate a full object model or perform shape completion, followed by geometry-based grasp analysis. However, moving the camera might be impossible in constrained spaces, and shape completion might not be sufficiently accurate for grasp generation and evaluation.
Recently, several groups have introduced deep learning techniques to evaluate the quality of grasps from raw point cloud data [20, 18, 30, 15]. While these approaches provide good grasp assessments, they still use manually designed heuristics to sample grasps for evaluation or rely on black-box optimization techniques such as CEM [18, 34]. Additionally, they do not provide efficient means for improving sampled grasps. In this paper, we introduce the first learning-based framework for efficiently generating diverse sets of stable grasps for unknown objects. Our approach introduces two network architectures that sample, evaluate, and improve grasps. The key contributions of this paper are:
- A variational auto-encoder (VAE) that can be trained to map the partial point cloud of an observed object to a diverse set of grasps for the object. Importantly, our VAE provides high coverage of all possible successful grasps while generating only a small number of failing grasps.
- To improve the precision of the VAE samples, we introduce a grasp evaluator network that maps a point cloud of the observed object and the robot gripper to a quality assessment of the 6D gripper pose. Crucially, we show that the gradient of this network can be used to improve grasp samples, for instance by moving the gripper out of collision or ensuring that the gripper is well aligned with the object.
- We demonstrate that our approach outperforms previous approaches and enables a robot to pick up 17 objects with a success rate of 88%. Generating diverse grasps is important because not all grasps are kinematically feasible for the robot to execute. We furthermore show that our approach generates diverse sets of grasp samples while maintaining a high success rate.
The paper is organized as follows. We first contrast related approaches to grasping that use deep learning, and then explain the different components of our approach: grasp sampling, evaluation, and refinement. Finally, we evaluate our method on a real robotic platform and show the effect of different hyperparameters in various ablation studies.
2 Related Work
Learning 6-DOF Grasps
The prevailing approaches to the robot grasping problem are data-driven [2]. While earlier methods were based on hand-crafted feature vectors [26, 1, 7], recent methods exploit convolutional architectures to operate on raw visual measurements [13, 24, 20, 18, 14]. Most of these grasp synthesis approaches are enabled by representing the grasp as an oriented rectangle in the image [8]. This 3-DOF representation constrains the gripper pose to be parallel to the image plane. The drawbacks of such a representation are manifold: since it limits the grasp diversity, picking up an object might be impossible given additional constraints imposed by the arm or task. In the case of a static image sensor, it also leads to a severely restricted workspace [18]. Our approach tackles the problem of predicting the full 6-DOF pregrasp pose. This is challenging due to occluded object parts that affect grasp success. Yan et al. [34] circumvent this problem by including the auxiliary task of reconstructing the geometry of the target object. The main task of predicting the 6-DOF grasp outcome can then use local geometry that is not part of the measurement. Similar to our evaluator network, Zhou et al. [36] learn a grasp score function which they also use for grasp refinement. In contrast to our approach, both methods [34, 36] are only evaluated in simulation.
A few methods formulate the problem as a regression to a single best grasp pose [27, 16]. They inherently lack the ability to predict a diverse distribution of possible grasps. Choi et al. [4] classify 24 pre-defined orientations to choose a 6-DOF pre-grasp pose. Such a coarse resolution of the orientation space necessarily leads to a limited diversity of the predicted grasps. In contrast, the grasp pose detection (GPD) method [30, 15] uses a denser sampling of candidate grasps: a point in the observed point cloud is sampled randomly and a Darboux frame is constructed which is aligned with the estimated surface normal and the local direction of the principal curvature. Although this heuristic creates a quite diverse set of candidate grasps, it fails to generate grasps along thin structures such as the rims of mugs, plates, or bowls, since estimating surface normals there from noisy measurements is challenging. Our learned grasp sampler does not suffer from such bias. As a result, our proposed method finds grasps where GPD is not able to (see Sec. 4.2). Apart from supervised learning, grasping has also been formulated as a reinforcement learning problem [9, 35] or approximations of it [14]. The learned grasp policies are more expressive than describing only the final grasp pose. Still, the action space of these methods is usually planar, limiting the diversity to top-down grasps.
Deep Neural Networks for Learning from 3D Data
The success of deep learning on 3D point cloud data came much later than its success on RGB images. Early approaches represented 3D data as 3D voxels [19] or extracted features from 2.5D depth images [6] and processed them similarly to RGB images using convolutional neural networks, which often led to only marginal improvements. Qi et al. [22, 23] introduced new architectures, called PointNet and PointNet++, that are capable of representing 3D data and extracting that representation efficiently. The success of PointNet led to a variety of network architectures [32, 29] that represent 3D data, showing significant improvements on 3D object pose estimation, semantic segmentation, and part segmentation [29, 23, 21, 33]. In order to estimate a successful grasp, the 6-DOF pose of the grasp needs to be accurate. Operating on a single RGB image does not provide the required accuracy since the input and output are not in the same domain. Therefore, we use 3D point cloud data to generate grasps. We use PointNet++ [23] for learning the representation to generate and evaluate grasps in $SE(3)$.
Variational Autoencoders
Variational autoencoders (VAEs) [10] are one of the main categories of deep generative models. VAEs can be trained in an unsupervised manner to maximize the likelihood of the training data. They have been applied to a variety of tasks such as future prediction [12, 31], generating novel view points [11], and object segmentation [28]. In this work, we use a VAE to sample a diverse set of grasps in $SE(3)$.
The overall architecture of our model is similar to that of GANs [5]. The generator module is a VAE that, conditioned on the observed point cloud $X$ and different samples $z$ from a latent space, generates different grasp proposals, and the evaluation network (discriminator) accepts or rejects them based on how likely they are to succeed. Both generator and discriminator take the 3D point cloud of the object as part of their input.
Figure 3: (a) Grasp coordinate frame; (b) Overview of our method.
3 6-DOF Grasp Pose Generation
We formulate grasp pose generation as the process of producing sets of robot gripper poses such that closing the gripper at any of these poses results in a stable grasp of an object. Furthermore, the process should generate diverse sets of poses that ultimately cover all possible ways an object could be grasped. Robot gripper poses are given in $SE(3)$, specifying the 3D translation and 3D orientation of the gripper. Here, we focus on generating grasp poses for single objects; additional constraints due to a manipulator's reach and due to other objects in a scene are beyond the scope of this work and can be handled by trajectory optimization techniques. Grasp pose generation is challenging due to the narrow subspace of successful grasps in the space of all possible grasps. Small perturbations in the pose of a grasp can transform a successful grasp into a failure. To generate diverse sets of stable grasps, our approach samples grasp poses using a variational auto-encoder network, followed by an iterative evaluation and refinement process. The input to our approach is a point cloud of the object the robot should pick up.
Specifically, we aim to learn the posterior distribution $P(G^* \mid X)$, where $G^*$ represents the space of all successful grasps and $X$ is the partial point cloud of the object observed by a camera. Each grasp $g \in G^*$ is represented by $(R, T) \in SE(3)$, where $R \in SO(3)$ and $T \in \mathbb{R}^3$ are the rotation and translation of grasp $g$. Grasps are defined in the object reference frame, whose origin is the center of mass of the observed point cloud and whose axes are parallel to those of the camera frame (see Fig. 3-a). The distribution of successful grasps $P(G^* \mid X)$ can be complex and discontinuous. For example, the distribution for a mug has multiple modes along the rim, handle, and bottom. Within each mode, the space of successful grasps is continuous, but grasps of different modes can be separated from each other. The total number of separate modes for each object category varies based on the shape and scale of objects.
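To make the frame convention concrete, the following minimal sketch (our own illustration, not the authors' code; the function name is ours) re-expresses a camera-frame point cloud and grasp in the object reference frame described above:

```python
# A minimal sketch of the grasp representation: grasps live in an object frame
# whose origin is the centroid of the observed point cloud and whose axes stay
# parallel to the camera axes.
import numpy as np

def to_object_frame(points_cam: np.ndarray, grasp_R: np.ndarray, grasp_t: np.ndarray):
    """points_cam: (N, 3) partial point cloud in the camera frame.
    grasp_R, grasp_t: grasp rotation (3, 3) and translation (3,) in the camera frame.
    Returns the centered point cloud and the grasp re-expressed in the object frame."""
    origin = points_cam.mean(axis=0)      # centroid of the observed point cloud
    points_obj = points_cam - origin      # axes remain parallel to the camera axes
    return points_obj, grasp_R, grasp_t - origin  # rotation unchanged, translation shifted

# Example with random data
pc = np.random.rand(1024, 3)
R, t = np.eye(3), np.array([0.1, 0.0, 0.4])
pc_obj, R_obj, t_obj = to_object_frame(pc, R, t)
```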
Since the number of modes of $P(G^* \mid X)$ is not known beforehand, we propose to learn a generator module that maximizes the likelihood $P(g \mid X)$ of successful grasps $g \in G^*$. Since the generator only observes successful grasps during training, it may also generate failed grasps. In order to detect and refine these negative grasps, an evaluation module is trained to predict $P(S \mid g, X)$, i.e., the probability of success $S$ for a grasp $g$ given the observed point cloud $X$. Applied to a sampled grasp, the evaluation module predicts grasp success and propagates the success gradient back through the network to generate an improved grasp pose. This process can be repeated. Discarding all grasps that remain below a threshold provides the final set of high-quality grasps. The overview of our method is shown in Fig. 3-b.
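The sketch below outlines this sample-evaluate-refine-threshold loop; `sampler`, `evaluator`, and `refine` are hypothetical stand-ins for the trained networks and the refinement step described in the following sections, the default sample and iteration counts mirror the settings used later in the experiments, and the score threshold is a placeholder value:

```python
# A hypothetical end-to-end sketch of the pipeline described above.
import numpy as np

def generate_grasps(sampler, evaluator, refine, point_cloud,
                    num_samples=200, refine_steps=10, threshold=0.8):
    # 1) Sample candidate grasps from latent values z ~ N(0, I) (2D latent, as chosen later).
    grasps = [sampler(point_cloud, z=np.random.normal(size=2)) for _ in range(num_samples)]
    # 2) Iteratively refine each grasp using the evaluator's success gradient.
    for _ in range(refine_steps):
        grasps = [refine(evaluator, point_cloud, g) for g in grasps]
    # 3) Keep only grasps whose predicted success probability clears the (placeholder) threshold.
    scored = [(g, evaluator(point_cloud, g)) for g in grasps]
    return [(g, s) for g, s in scored if s >= threshold]
```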
3.1 Variational Grasp Sampler
The grasp sampler, shown in Fig. 4, is a generative model that maximizes $P(G^* \mid X)$, the likelihood of a set of pre-defined successful grasps $G^*$. Given a point cloud $X$ and a latent variable $z$, the sampler is a deterministic function that predicts a grasp. It is assumed that $P(z)$, the probability density function of the latent space, is known and chosen beforehand; in our approach, we use $P(z) = \mathcal{N}(0, I)$. Given a point cloud $X$, different grasps are generated by sampling different $z \sim P(z)$. The likelihood of the generated grasps can be written as:

$$P(G^* \mid X) = \int P(G^* \mid X, z)\, P(z)\, dz \qquad (1)$$
Optimizing Eq. (1) for each positive grasp requires integrating over all values of the latent space, which is intractable. In order to make Eq. (1) tractable, the encoder $Q(z \mid X, g)$ maps each pair of point cloud $X$ and grasp $g$ to a small subspace of the latent space. Given the sampled $z \sim Q(z \mid X, g)$, the decoder reconstructs the grasp $\hat{g}$. During training, the encoder and decoder are optimized to minimize the reconstruction loss $\mathcal{L}$ between ground truth grasps $g$ and reconstructed grasps $\hat{g}$. Furthermore, the KL-divergence between the distribution $Q(z \mid X, g)$ and the normal distribution $\mathcal{N}(0, I)$ is minimized to ensure a normally distributed latent space with unit variance. The loss function is defined as:

$$\mathcal{L}_{\text{vae}} = \sum_{z \sim Q,\, g \sim G^*} \mathcal{L}(\hat{g}, g) + \alpha\, D_{KL}\!\left[\, Q(z \mid X, g) \,\|\, \mathcal{N}(0, I) \,\right] \qquad (2)$$
Eq. (2) is optimized using stochastic gradient descent. For each mini-batch, a point cloud $X$ is sampled for an object observed from a random viewpoint. For the sampled point cloud $X$, grasps are sampled from the set of ground truth grasps $G^*$ using stratified sampling. To combine the orientation and translation loss, we define the reconstruction loss as:

$$\mathcal{L}(\hat{g}, g) = \left\| T(\hat{g}; p) - T(g; p) \right\|_1 \qquad (3)$$

where $T(g; p)$ is the transformation of a set of predefined points $p$ on the robot gripper by grasp $g$. During training, the decoder learns to decode latent values $z$ sampled from $Q(z \mid X, g)$ into grasps, while the encoder learns to output $z$ such that it contains enough information to reconstruct the grasp pose while maintaining the normal distribution. During inference, the encoder is removed and latent values are sampled from $\mathcal{N}(0, I)$.
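A minimal PyTorch sketch of these two losses follows; the gripper control points, the weight $\alpha$, and the use of a mean over control points are placeholder choices of ours, not values taken from the paper:

```python
# Sketch of the sampler losses: control-point reconstruction loss (Eq. 3) and
# KL regularization (Eq. 2). All numeric constants are placeholders.
import torch

# A few predefined points on the gripper, in the gripper frame (placeholder values).
GRIPPER_POINTS = torch.tensor([[0.00, 0.00, 0.00],
                               [0.04, 0.00, 0.06],
                               [-0.04, 0.00, 0.06]])

def transform_points(R, t, points=GRIPPER_POINTS):
    """Apply a grasp (R: 3x3 rotation, t: (3,) translation) to the gripper control points."""
    return points @ R.T + t

def reconstruction_loss(R_hat, t_hat, R_gt, t_gt):
    """Eq. (3): L1 distance between transformed control points of predicted and ground-truth grasps."""
    return torch.abs(transform_points(R_hat, t_hat) - transform_points(R_gt, t_gt)).mean()

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL divergence between N(mu, diag(exp(logvar))) and N(0, I)."""
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)

def vae_loss(R_hat, t_hat, R_gt, t_gt, mu, logvar, alpha=0.01):
    """Eq. (2): reconstruction loss plus weighted KL term (alpha is a placeholder value)."""
    return reconstruction_loss(R_hat, t_hat, R_gt, t_gt) + alpha * kl_to_standard_normal(mu, logvar)
```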
Both encoder and decoder are based on the PointNet++ [23] architecture. In this architecture, each point has a 3D coordinate and a feature vector, and the features at each layer are computed from the features of each point and the 3D relation of the points with respect to each other. In the encoder, the features of each input point are concatenated with the grasp $g$; in the decoder, each point feature is concatenated with the latent variable $z$. The encoder learns to compress the information of point cloud and grasp into the latent variable in such a way that the grasp can be reconstructed by the decoder. Fig. 5 qualitatively shows that the latent space has a strong correlation with the grasp pose.
3.2 Grasp Pose Evaluation
The grasp sampler learns the continuous posterior distribution $P(g \mid X)$ using only positive grasps. As a result, it might produce failed grasps that lie between the modes of the distribution. These transitional grasps and other false positives need to be identified and pruned out. To do so, we need a grasp evaluation network that assigns a probability of success to each grasp. This network needs to reason about grasps relative to the observed point cloud $X$, but it must also be able to extrapolate to unobserved parts of the object. Other methods learn to classify grasps based only on the locally observed parts of an object [30, 18]. In practice, the observed point cloud of an object has imperfections such as missing or noisy depth values. To mitigate this problem, previous methods resort to using high-quality depth sensors [18] or multiple views [30], which limits the deployment of the system outside of controlled environments. In this work, we classify each grasp using only the imperfect observed point cloud of the object.
The success of a grasp pose depends on the relative pose of the grasp with respect to the object. The inputs to the evaluator network are the point cloud $X$ and the grasp $g$. Similar to the grasp sampler, we use the PointNet++ [23] architecture for the grasp evaluator. There are multiple ways to classify grasps. A first, simple approach is to attach the 6D pose of the grasp to the features of each point in the first layer. Our experiments showed that such a representation leads to poor accuracy in grasp classification. Instead, we propose to represent a grasp in a way more closely tied to the object point cloud: we approximate the robot gripper by a point cloud $X_g$, rendered according to the 6D grasp pose $g$. The object point cloud $X$ and the gripper point cloud $X_g$ are combined into a single point cloud using an extra binary feature that indicates whether a point belongs to the object or to the gripper. In the PointNet++ architecture, the features for each point are functions of the features of the point itself and its neighbors plus the relative spatial relation of the points. Using the unified point cloud makes it natural to use all the relative information between grasp pose and object point cloud for classifying the grasps. The grasp evaluator is optimized with the cross-entropy loss

$$\mathcal{L}_{\text{evaluator}} = -\left( y \log(s) + (1 - y) \log(1 - s) \right) \qquad (4)$$

where $y$ is the ground truth binary label of the grasp, indicating whether the grasp is successful or not, and $s$ is the probability of success predicted by the evaluator.
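The sketch below (our own construction) illustrates this input representation and loss: the object and gripper clouds are merged with a binary indicator feature, and the predicted success probability is scored with binary cross-entropy:

```python
# Sketch of the evaluator input and loss described above.
import torch
import torch.nn.functional as F

def build_evaluator_input(object_pc: torch.Tensor, gripper_pc: torch.Tensor) -> torch.Tensor:
    """object_pc: (N, 3), gripper_pc: (M, 3) rendered at the 6D grasp pose.
    Returns an (N + M, 4) cloud whose last column is 0 for object points and 1 for gripper points."""
    obj = torch.cat([object_pc, torch.zeros(object_pc.shape[0], 1)], dim=1)
    grip = torch.cat([gripper_pc, torch.ones(gripper_pc.shape[0], 1)], dim=1)
    return torch.cat([obj, grip], dim=0)

def evaluator_loss(predicted_success: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Eq. (4): binary cross-entropy between predicted success probability s and label y."""
    return F.binary_cross_entropy(predicted_success, labels)
```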
In order to train a robust evaluator, the model needs to be trained with both positive and negative grasps. Since the space of all possible 6D grasp poses is combinatorially large, it is not possible to sample all the negative grasps. Instead, we perform hard negative mining to sample negative grasps. The set of hard negative grasps $G^-$ is defined as the grasps that have a similar pose to a positive grasp but that are either in collision with the object or too far from the object to grasp it. More formally, $G^-$ is defined as:

$$G^- = \left\{\, g \mid \exists\, g^* \in G^* : \mathcal{L}(g, g^*) < \epsilon \ \wedge\ g \notin G^* \,\right\} \qquad (5)$$

where $\mathcal{L}$ is defined in Eq. (3). During training, $G^-$ is sampled from a set of pre-generated negative grasps and by randomly perturbing positive grasps such that the gripper mesh either collides with the object mesh or moves too far from the object.
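A hedged sketch of this hard-negative mining follows; `gripper_collides` and `gripper_too_far` are hypothetical geometric checks, and the perturbation magnitudes are placeholder values:

```python
# Sketch of hard-negative generation: perturb positive grasps and keep those that
# collide with the object or end up too far from it.
import numpy as np

def perturb_grasp(R, t, max_angle=0.3, max_shift=0.05, rng=np.random.default_rng()):
    """Apply a small random rotation (about a random axis) and translation to a positive grasp."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = rng.uniform(-max_angle, max_angle)
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    dR = np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)  # Rodrigues' formula
    return dR @ R, t + rng.uniform(-max_shift, max_shift, size=3)

def mine_hard_negatives(positives, gripper_collides, gripper_too_far, num_tries=1000):
    """positives: list of (R, t) grasps. The two predicates are hypothetical geometric tests."""
    negatives = []
    rng = np.random.default_rng()
    for _ in range(num_tries):
        R, t = positives[rng.integers(len(positives))]
        R_n, t_n = perturb_grasp(R, t, rng=rng)
        if gripper_collides(R_n, t_n) or gripper_too_far(R_n, t_n):
            negatives.append((R_n, t_n))
    return negatives
```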
3.3 Iterative Grasp Pose Refinement
Although the evaluation network rejects implausible grasps, a large portion of the rejected grasps can be close to successful ones. This insight can be exploited by searching for a transformation that turns an unsuccessful grasp into a successful one. More formally, we are looking for a refining transformation that increases the probability of success, i.e., . The evaluation network represents a differentiable function of success based on the point cloud and grasp . The refinement transformation that leads to maximum improvement in success probability can be computed by taking the derivative of success with respect to the grasp transformation: . The partial derivative provides the transformation for each point in the gripper point cloud so as to increase the probability of success. Since the derivative is computed with respect to each point on the gripper independently, it can lead to non-rigid transformations for . To enforce the rigidity constraint, the transformed gripper point cloud is defined as a function of orientation of the grasp defined in Euler angles and translation
. Using the chain rule,
is computed as follows:(6) |
Since the partial derivative is only a valid approximation in the local neighborhood, we use hyper-parameter to limit the magnitude of updates at each step. In practice, we choose in such a way that the maximum translation update of the grasp does not exceed . Fig. 6 shows the refinement of estimated grasps at different iterations.
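A minimal PyTorch sketch of one refinement step is shown below; `evaluator` is a hypothetical differentiable stand-in for the trained evaluator, and the translation cap is a placeholder value:

```python
# Sketch of gradient-based grasp refinement: back-propagate the predicted success
# probability to the grasp parameters and take a small, capped step.
import torch

def refine_grasp(evaluator, point_cloud, euler, translation, max_trans_step=0.02):
    euler = euler.clone().requires_grad_(True)               # (3,) grasp orientation (Euler angles)
    translation = translation.clone().requires_grad_(True)   # (3,) grasp translation
    success = evaluator(point_cloud, euler, translation)     # scalar P(S | g, X)
    success.backward()                                       # dS/d(euler), dS/d(translation)
    with torch.no_grad():
        # Choose eta so the translation update does not exceed max_trans_step (placeholder value).
        eta = max_trans_step / (translation.grad.norm() + 1e-8)
        euler += eta * euler.grad
        translation += eta * translation.grad
    return euler.detach(), translation.detach()
```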
4 Experiments
Training Data for Grasping
To generate reference sets of successful grasps, we use the physics simulator FleX [17], which provides realistic simulation of grasps for arbitrary object shapes. Candidate grasps are sampled based on the object geometry: we sample random points on the object mesh surface and align the gripper's z-axis (see Fig. 3-a) with the surface normal. The distance between the gripper and the object surface is sampled uniformly between zero and the gripper's finger length. The orientation around the z-axis is also drawn from a uniform distribution. We only simulate grasps that are not in collision and whose closing volume between the fingers intersects the object. In total, we use 206 objects from six categories in ShapeNet [3]: boxes and cylinders (randomly generated), as well as bowls, bottles, and mugs. A total of 10,816,720 candidate grasps are sampled, of which we simulate 7,074,038 (65.4%), i.e., those that pass the non-empty closing volume test. The simulation consists of a free-floating parallel-jaw gripper and the free-floating object without gravity (similar to [36]). Surface friction and object density are kept constant. After closing its fingers, the gripper executes a predefined shaking motion. A grasp is labeled successful if the object is kept between both fingers. Overall, we generate 2,104,894 successful grasps (19.4%). The resulting positive grasp labels are densely distributed, as shown in the examples in Fig. 7.
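The sketch below (our own, in numpy) illustrates this candidate sampling, assuming surface points and normals are provided by a mesh-sampling routine elsewhere; the approach-direction sign convention and the finger length are our assumptions, and the collision and closing-volume tests are omitted:

```python
# Sketch of candidate-grasp sampling: align the gripper z-axis with the surface
# normal, draw the stand-off and the roll about z uniformly.
import numpy as np

def sample_candidate_grasp(surface_point, surface_normal, finger_length=0.05,
                           rng=np.random.default_rng()):
    # Assumed convention: the gripper approaches along the negative surface normal.
    z = -surface_normal / np.linalg.norm(surface_normal)
    # Build an arbitrary orthonormal frame around z.
    helper = np.array([1.0, 0.0, 0.0]) if abs(z[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    x = np.cross(helper, z); x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.column_stack([x, y, z])
    # Uniform roll about the approach (z) axis.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    roll = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                     [np.sin(theta),  np.cos(theta), 0.0],
                     [0.0, 0.0, 1.0]])
    R = R @ roll
    # Stand-off between zero and the finger length (placeholder finger length).
    standoff = rng.uniform(0.0, finger_length)
    t = surface_point - standoff * z
    return R, t
```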
Training
Both the grasp generator and the evaluator network use PointNet++ [23] and have similar architectures. Both modules consist of three set-abstraction layers followed by fully connected layers. Each training batch for the generator network consists of a rendering of the object from a random view and 64 grasps that are sampled using stratified sampling to ensure that the sampled grasps have enough diversity. The KL-divergence term in Eq. (2) is weighted by $\alpha$. Each training batch for the evaluator network consists of positive grasps, negative grasps, and hard negative grasps. Hard negative grasps are generated by perturbing positive grasps with random rotations about each axis and random translations. Both models are trained with the Adam optimizer. All grasps are generated in simulation, and no real data was used to train any of the models (see Sec. 4).
Network Architecture Details
Both the grasp generator and the evaluator are based on the PointNet++ architecture and consist of three set abstraction layers. The set abstraction layers sample 128 points, 32 points, and all of the points, respectively, and group the points within a fixed radius around each sampled point. Each set abstraction layer uses three fully connected layers to compute the features. The set abstraction layers are followed by two fully connected layers. The grasp generator network outputs a rotation represented as a unit quaternion $q \in \mathbb{R}^4$ and a translation $t \in \mathbb{R}^3$. Quaternions are generated by applying an L2 norm to the output of a linear fully connected layer; no normalization is applied to the translation. The evaluator network predicts the score of each grasp using a softmax layer.
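A small PyTorch sketch of these output heads follows; the feature dimensionality is a placeholder of ours, since the exact layer sizes are not reproduced here:

```python
# Sketch of the output heads: L2-normalized quaternion and raw translation for the
# generator, and a softmax success score for the evaluator.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEATURE_DIM = 128  # placeholder size of the final fully connected features

class GeneratorHead(nn.Module):
    def __init__(self, feature_dim=FEATURE_DIM):
        super().__init__()
        self.quat_fc = nn.Linear(feature_dim, 4)   # rotation as a quaternion
        self.trans_fc = nn.Linear(feature_dim, 3)  # translation, left unnormalized

    def forward(self, features):
        q = F.normalize(self.quat_fc(features), p=2, dim=-1)  # unit quaternion via L2 norm
        t = self.trans_fc(features)
        return q, t

class EvaluatorHead(nn.Module):
    def __init__(self, feature_dim=FEATURE_DIM):
        super().__init__()
        self.score_fc = nn.Linear(feature_dim, 2)  # two logits: failure / success

    def forward(self, features):
        return F.softmax(self.score_fc(features), dim=-1)[..., 1]  # probability of success
```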
Evaluation Metrics
We use two metrics to quantitatively evaluate grasping methods: success rate and coverage rate. The success rate is the ratio of successful grasps among all predicted grasps. This metric only considers the grasp that is executed and does not contain any information about the other grasps. Predicting only one grasp is not suitable for 3D grasping, because the predicted grasp may lead to a collision of the robot with other objects in the environment, or there may not be any valid robot joint configuration that reaches the predicted grasp. In order to achieve an executable successful grasp, we need to generate a diverse set of grasps from different translations and directions to check for kinematic feasibility and collision avoidance. As a result, we introduce the coverage rate, which captures the diversity of the grasps and measures how well the space of positive grasps $G^*$ is covered by the generated grasps. A positive grasp $g^* \in G^*$ is covered by the set of predicted grasps $G$ if there exists a grasp $g \in G$ whose translation is within a small distance threshold of $g^*$. Positive grasps that have similar translations in the object frame also have similar orientations; as a result, we use the distance in translation between grasps as the criterion for deciding whether a grasp is covered. Since grasps are defined in $SE(3)$, $G^*$ is uncountably infinite, so $G^*$ is approximated by the grasps sampled during data generation. Success rate and coverage rate are analogous to precision and recall in the context of binary classification. Similar to precision-recall curves, we use success-coverage curves for analyzing and evaluating our method, and we use the AUC of the success-coverage curve for ablation studies and analysis.
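The numpy sketch below (ours) computes the two metrics and the AUC of the success-coverage curve; the coverage distance threshold is a placeholder value, and all inputs are assumed to be numpy arrays:

```python
# Sketch of success rate, coverage rate, and the success-coverage AUC.
import numpy as np

def success_rate(labels):
    """labels: binary outcomes (1 = successful) of the predicted grasps."""
    return np.mean(labels)

def coverage_rate(positive_translations, predicted_translations, threshold=0.02):
    """Fraction of sampled positive grasps with a predicted grasp within `threshold` meters."""
    covered = 0
    for gt in positive_translations:
        dists = np.linalg.norm(predicted_translations - gt, axis=1)
        covered += int(dists.min() <= threshold)
    return covered / len(positive_translations)

def success_coverage_auc(scores, labels, positive_translations, predicted_translations):
    """Sweep a score threshold, record (coverage, success) pairs, and integrate the curve."""
    coverages, successes = [], []
    for tau in np.sort(np.unique(scores)):
        keep = scores >= tau
        if keep.sum() == 0:
            continue
        successes.append(success_rate(labels[keep]))
        coverages.append(coverage_rate(positive_translations, predicted_translations[keep]))
    order = np.argsort(coverages)
    return np.trapz(np.asarray(successes)[order], np.asarray(coverages)[order])
```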
We evaluate the effect of different parameters and modules quantitatively using the same physics simulation as in the generation of training data (Sec. 4). For the ablation studies, we generate 86 object point cloud observations for 10 different objects that are held out during training. For each point cloud, 200 latent values are sampled and refined over 10 iterations, resulting in 2200 grasps per viewpoint and 182,600 grasps in total.
Dimensionality of the Latent Space
There is an inherent tension when choosing the dimensionality of the latent space, which affects the quality of the generated grasps. The latent space needs enough capacity to allow the VAE to reconstruct the grasps. At the same time, a high-dimensional latent space leads to over-fitting and requires significantly more training data to be covered. It also deteriorates the quality of sampled grasps during inference, especially when the sampled latent values were not seen by the generator network during training. To analyze this effect, we evaluate the generator network with increasing numbers of dimensions for the latent space. Fig. 8 shows the resulting success-coverage curves. A dimensionality of one has the lowest AUC because the latent space does not have enough capacity. Although 3-dimensional and 4-dimensional latent spaces lead to a slightly better AUC on the training data, they perform worse during inference because the VAE cannot cover the latent space densely during training. Given these results, we choose a two-dimensional latent space for all subsequent evaluations.
Effect of Refinement on Grasp Quality
While the grasp refinement increases the probability of success according to the evaluator network, this does not necessarily mean that the refined grasps succeed at test time. To analyze the actual improvement induced by each refinement step, we evaluate the grasps in simulation. Fig. 9 shows the success-coverage curve of the grasps computed at each refinement iteration. Not only does the success rate of the generated grasps increase, the coverage rate increases as well, because refined grasps move closer to the sampled positive grasps. The AUC of the curves plateaus after the 10th refinement iteration.
Effect of Sampled Grasps on Coverage
In the previous sections, we conducted ablation studies using 200 random latent values because that is the maximum batch size that fits in GPU memory and it is the same setting we use in our robot experiments. Since the number of sampled latent values was limited, the coverage rate in Fig. 9 remained below 0.5 even after 10 refinement steps. In order to investigate how the number of sampled grasps affects the coverage, 2000 grasps are sampled in 10 different batches on the same point clouds used in the previous ablation studies. Fig. 10 shows how more samples increase the coverage rate.
4.2 Robot Experiments
| Method | Box | Cylinder | Bowl | Mug | Average Success Rate | Success Rate |
|---|---|---|---|---|---|---|
| 6-DOF GraspNet | 83% | 89% | 100% | 86% | 90% | 88% |
| GPD [30] | 50% | 78% | 78% | 6% | 52% | 47% |
The ultimate test of the generated grasps is to execute them in the real world and deal with imperfect perception, robot joint limits, control errors, and physical phenomena such as friction that are difficult to model. We want to show that: (1) our method scales to the real world despite being trained purely in simulation; (2) the generated grasp distribution is diverse enough to find successful grasps even after discarding those that violate robot kinematics and collision constraints; (3) our method’s diverse grasp sampling leads to higher success rate in comparison to a state-of-the-art 6-DOF grasp planner [30] (GPD).
All experiments are done using a 7-DOF Franka Panda manipulator with an Intel RealSense D415 camera mounted on its parallel-jaw gripper. We choose a set of commonly used objects that are challenging both visually and physically, ranging in weight from the lightest (a pepper shaker) to the heaviest (a mustard bottle). The hardware setup and object test set are shown in Fig. 11.
Protocol
Each object is placed in three different stable poses on a table in front of the robot. The robot’s end-effector is moved such that the hand-mounted depth camera has an unobstructed view of the table top. A grasp is considered successful if the robot can lift the object without dropping it.
We filter the measured point cloud, remove the table plane and cluster the remaining points [25]. This extracted object point cloud serves as an input to our approach and also for GPD. Both methods return a list of scored grasps. We use a motion planner to check for a collision-free path to each grasp pose and execute the one with the highest score. If no grasp in the returned set can be executed we consider the trial a failure. In total we run 51 trials per method.
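As an illustration of this preprocessing step, the following hedged sketch uses Open3D as a stand-in for the PCL-based pipeline [25]; all thresholds are placeholder values:

```python
# Sketch of object-cloud extraction: remove the dominant table plane with RANSAC
# and keep the largest remaining cluster as the object point cloud.
import numpy as np
import open3d as o3d

def extract_object_cloud(points: np.ndarray) -> np.ndarray:
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    # Fit the table plane and drop its inliers.
    _, plane_idx = pcd.segment_plane(distance_threshold=0.01, ransac_n=3, num_iterations=1000)
    remaining = pcd.select_by_index(plane_idx, invert=True)
    # Cluster what is left and keep the largest cluster.
    labels = np.asarray(remaining.cluster_dbscan(eps=0.02, min_points=20))
    if labels.size == 0 or labels.max() < 0:
        return np.empty((0, 3))
    largest = np.argmax(np.bincount(labels[labels >= 0]))
    return np.asarray(remaining.points)[labels == largest]
```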
Results
Table 1 shows that our method outperforms GPD [30] in success rate across all objects. One of the reasons is that our method generates diverse grasps, which facilitates finding kinematically feasible ones. In contrast, GPD does not generate many different grasps, which sometimes leads to situations in which no kinematically feasible grasp can be found. Mugs are particularly difficult for GPD because it does not generate any grasps from the rim (see Fig. 12). Another source of failure are grasps where a gripper finger is tangent to one of the surfaces of the object; in these cases, the slightest error in executing the grasp can turn it from grabbing the object into pushing it. The detailed outcomes of the experiments are shown in Table 2.
| Category | Object | 6-DOF GraspNet #1 | #2 | #3 | Success Rate | GPD [30] #1 | #2 | #3 | Success Rate |
|---|---|---|---|---|---|---|---|---|---|
| Box | Jello Chocolate | ✓ | ✓ | ✓ | 3/3 | ✓ | ✓ | ✓ | 3/3 |
| Box | Spam Meat Can | ✓ | ✗ | ✓ | 2/3 | ✗ | ✓ | ✗ | 1/3 |
| Box | Jello Strawberry | ✓ | ✓ | ✓ | 3/3 | ✓ | ✓ | ✗ | 2/3 |
| Box | Sponge | ✓ | ✓ | ✓ | 3/3 | ✓ | ✗ | ✗ | 1/3 |
| Box | Cheezit Box | ✗ | ✗ | ✓ | 1/3 | ✓ | ✗ | ✗ | 1/3 |
| Box | Sugar Box | ✓ | ✓ | ✓ | 3/3 | ✗ | ✗ | ✓ | 1/3 |
| Box | Overall | | | | 15/18 | | | | 9/18 |
| Cylinder | Pepper Shaker | ✓ | ✓ | ✗ | 2/3 | ✓ | ✓ | ✓ | 3/3 |
| Cylinder | Pirouline | ✓ | ✓ | ✓ | 3/3 | ✓ | ✓ | ✓ | 3/3 |
| Cylinder | Mustard | ✓ | ✓ | ✓ | 3/3 | ✓ | ✗ | ✗ | 1/3 |
| Cylinder | Overall | | | | 8/9 | | | | 7/9 |
| Bowl | White Bowl | ✓ | ✓ | ✓ | 3/3 | ✗ | ✗ | ✓ | 1/3 |
| Bowl | Green Bowl | ✓ | ✓ | ✓ | 3/3 | ✓ | ✓ | ✓ | 3/3 |
| Bowl | Red Bowl | ✓ | ✓ | ✓ | 3/3 | ✓ | ✓ | ✓ | 3/3 |
| Bowl | Overall | | | | 9/9 | | | | 7/9 |
| Mug | White Mug | ✓ | ✓ | ✓ | 3/3 | ✗ | ✗ | ✗ | 0/3 |
| Mug | Blue Mug | ✓ | ✓ | ✗ | 2/3 | ✗ | ✗ | ✗ | 0/3 |
| Mug | Orange Mug | ✓ | ✓ | ✓ | 3/3 | ✗ | ✗ | ✓ | 1/3 |
| Mug | Big Red Mug | ✗ | ✓ | ✓ | 2/3 | ✗ | ✗ | ✗ | 0/3 |
| Mug | Red YCB Mug | ✓ | ✓ | ✓ | 3/3 | ✗ | ✗ | ✗ | 0/3 |
| Mug | Overall | | | | 13/15 | | | | 1/15 |
5 Conclusions
In this work, we introduced 6-DOF GraspNet for generating diverse sets of grasps for unknown objects. Our method consists of a trained VAE that samples a variety of grasps for an object. While the VAE is able to capture the complex distributions of successful grasp poses, it does not quite provide the accuracy required for highly robust grasp generation. To overcome this limitation, we additionally introduce a grasp evaluator network that assesses grasp quality and can refine grasps in an iterative process. To the best of our knowledge, neither a learned grasp sampler nor a gradient-based refinement process have been introduced before.
The training of our model is done using synthetic grasp data generated by a physics simulator. Therefore, our model can scale to large sets of objects without requiring the collection of data in the real world. We demonstrated that our method can transfer to the real world on objects with unknown 3D models by deploying the method on a real robot platform and an on-board RGB-D camera. We performed robot experiments on 17 objects with unknown 3D models and achieved state-of-the-art results in 3D grasping. We also performed a thorough analysis of the generated grasps in terms of success rate and coverage via ablation studies in a realistic physics simulator.
This approach opens up a number of interesting directions in computer vision and robotics. In our method, latent values are sampled uniformly and grasps are then discarded based on collision checks and kinematic feasibility. A potential extension is to train the sampler or the evaluator in a way that considers not only the object of interest but also the surrounding objects, to directly avoid generating colliding or infeasible grasps. Another interesting direction is to use the evaluator not only to refine sampled grasps but to provide real-time feedback for guiding a manipulator approaching an object. Our experiments provide evidence that our gradient-based approach could succeed in moving a manipulator closer and closer to a successful grasp.
References
- [1] J. Bohg and D. Kragic. Learning grasping points with shape context. Robotics and Autonomous Systems, 58(4):362–377, 2010.
- [2] J. Bohg, A. Morales, T. Asfour, and D. Kragic. Data-driven grasp synthesis—a survey. IEEE Transactions on Robotics, 30(2):289–309, 2014.
- [3] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
- [4] C. Choi, W. Schwarting, J. DelPreto, and D. Rus. Learning object grasping for soft robot hands. IEEE Robotics and Automation Letters, 3(3):2370–2377, 2018.
- [5] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial networks. Neural Information Processing Systems (NeurIPS), 2014.
- [6] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In European Conference on Computer Vision (ECCV), 2014.
- [7] A. Herzog, P. Pastor, M. Kalakrishnan, L. Righetti, T. Asfour, and S. Schaal. Template-based learning of grasp selection. In 2012 IEEE International Conference on Robotics and Automation, pages 2379–2384. IEEE, 2012.
- [8] Y. Jiang, S. Moseson, and A. Saxena. Efficient grasping from rgbd images: Learning using a new rectangle representation. In ICRA. IEEE, 2011.
-
[9]
D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen,
E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine.
Scalable deep reinforcement learning for vision-based robotic
manipulation.
In A. Billard, A. Dragan, J. Peters, and J. Morimoto, editors, Proceedings of The 2nd Conference on Robot Learning, volume 87 of
Proceedings of Machine Learning Research
, pages 651–673. PMLR, 29–31 Oct 2018. - [10] D. P. Kingma and M. Welling. Auto-encoding variational bayes. International Conference on Learning Representations (ICLR), 2014.
- [11] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28. 2015.
-
[12]
N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. S. Torr, and M. Chandraker.
Desire: Distant future prediction in dynamic scenes with interacting
agents.
In
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, 2017. - [13] I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 34(4-5):705–724, 2015.
- [14] S. Levine, P. P. Sampedro, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. 2017.
- [15] H. Liang, X. Ma, S. Li, M. Görner, S. Tang, B. Fang, F. Sun, and J. Zhang. Pointnetgpd: Detecting grasp configurations from point sets. ICRA, 2019.
- [16] M. Liu, Z. Pan, K. Xu, K. Ganguly, and D. Manocha. Generating Grasp Poses for a High-DOF Gripper Using Neural Networks. arXiv e-prints, page arXiv:1903.00425, Mar 2019.
- [17] M. Macklin, M. Müller, N. Chentanez, and T.-Y. Kim. Unified particle physics for real-time applications. ACM Transactions on Graphics (TOG), 33(4):153, 2014.
- [18] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg. Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. 2017.
- [19] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 922–928, September 2015.
- [20] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3406–3413, May 2016.
- [21] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum pointnets for 3d object detection from rgb-d data. Computer Vision and Pattern Recognition (CVPR), 2018.
- [22] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Computer Vision and Pattern Recognition (CVPR), 2016.
- [23] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Neural Information Processing Systems (NeurIPS), 2017.
- [24] J. Redmon and A. Angelova. Real-time grasp detection using convolutional neural networks. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1316–1322. IEEE, 2015.
- [25] R. B. Rusu and S. Cousins. 3d is here: Point cloud library (pcl). 2011.
- [26] A. Saxena, L. L. Wong, and A. Y. Ng. Learning grasp strategies with partial shape information. In AAAI, volume 3, pages 1491–1494, 2008.
- [27] P. Schmidt, N. Vahrenkamp, M. Wächter, and T. Asfour. Grasping of unknown objects using deep convolutional neural networks based on depth images. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6831–6838. IEEE, 2018.
- [28] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems 28. 2015.
- [29] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M.-H. Yang, and J. Kautz. SPLATNet: Sparse lattice networks for point cloud processing. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2530–2539, 2018.
- [30] A. ten Pas, M. Gualtieri, K. Saenko, and R. Platt. Grasp pose detection in point clouds. The International Journal of Robotics Research, 36(13-14):1455–1473, 2017.
- [31] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision, 2016.
- [32] Y. Wang, Y. Sun, Z. Liu, S. Sarma, M. Bronstein, and J. Solomon. Dynamic graph cnn for learning on point clouds. Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [33] D. Xu, D. Anguelov, and A. Jain. PointFusion: Deep sensor fusion for 3d bounding box estimation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [34] X. Yan, J. Hsu, M. Khansari, Y. Bai, A. Pathak, A. Gupta, J. Davidson, and H. Lee. Learning 6-dof grasping interaction via deep geometry-aware 3d representations. arXiv preprint arXiv:1708.07303, 2017.
- [35] A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez, and T. Funkhouser. Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4238–4245. IEEE, 2018.
- [36] Y. Zhou and K. Hauser. 6dof grasp planning by optimizing a deep learning scoring function. In Robotics: Science and Systems (RSS) Workshop on Revisiting Contact-Turning a Problem into a Solution, 2017.