6-DOF GraspNet: Variational Grasp Generation for Object Manipulation

by Arsalan Mousavian, et al.

Generating grasp poses is a crucial component for any robot object manipulation task. In this work, we formulate the problem of grasp generation as sampling a set of grasps using a variational auto-encoder and assess and refine the sampled grasps using a grasp evaluator model. Both Grasp Sampler and Grasp Refinement networks take 3D point clouds observed by a depth camera as input. We evaluate our approach in simulation and real-world robot experiments. Our approach achieves 88% success rate on various commonly used objects with diverse appearances, scales, and weights. Our model is trained purely in simulation and works in the real world without any extra steps. The video of our experiments can be found at: https://youtu.be/KNnDpGEE_NE






1 Introduction

Grasp generation is one of the most important problems in robot manipulation. Here, a robot observes an object and needs to decide where to move its gripper in order to pick up the object (see Fig. 1). Object grasping is mainly formulated in two ways: planar grasping and 6-DOF grasping. In planar grasping, the representation of each grasp exploits the fact that the view is perpendicular to the support surface; as a result, each grasp can be recovered from the 2D location of the grasp center in image space and the planar rotation of the grasp. While the grasp representation is more compact in the case of planar grasping, it limits the workspace of the robot because the grasp representation is only valid in the area where the ray corresponding to the grasp center is perpendicular to the surface. In 6-DOF grasping, the grasp representation is more complex and is defined by a 3D rotation and a 3D translation. Such a representation is more versatile and does not constrain the robot workspace, but it makes the problem more challenging.

Grasp generation is complex since the stability of grasps depends on object and gripper geometry, object mass distribution, and surface frictions. The geometry around an object poses additional constraints on which grasp points are reachable without causing the robot manipulator to collide with other objects in a scene (see Fig. 2). Typically, this problem is approached by geometry-inspired heuristics to select promising grasp points around an object, possibly followed by a more in-depth geometric analysis of the stability and reachability of a sampled grasp [30]. Many of these approaches rely on the availability of complete 3D models of an object, which is a severe limitation in realistic scenarios where a robot only observes a scene with a noisy depth camera, for example. To overcome this limitation, one could move the camera to generate a full object model or perform shape completion, followed by geometry-based grasp analysis. However, moving the camera might be impossible in constrained spaces, and shape completion might not be sufficiently accurate for grasp generation and evaluation.

Figure 1: The Franka Panda manipulator used in our experiments. Our approach is able to efficiently generate diverse sets of grasps that lead to successful pickups of unknown objects.
Figure 2: Visualization of the predicted grasps for the mug. (middle) All the grasps that are generated by our method. (right) Grasps that are both kinematically feasible and collision free color-coded by the predicted scores. Green is the highest and red is the lowest.

Recently, several groups have introduced deep learning techniques to evaluate the quality of grasps from raw point cloud data [20, 18, 30, 15]. While these approaches provide good grasp assessments, they still use manually designed heuristics to sample grasps for evaluation or rely on black-box optimization techniques such as CEM [18, 34]. Additionally, they do not provide efficient means for improving sampled grasps. In this paper, we introduce the first learning-based framework for efficiently generating diverse sets of stable grasps for unknown objects. Our approach introduces two network architectures that sample, evaluate, and improve grasps. The key contributions of this paper are:

  • A variational auto-encoder (VAE) that can be trained to map the partial point cloud of an observed object to a diverse set of grasps for the object. Importantly, our VAE provides high coverage of all possible, functioning grasps while generating only a small number of failing grasps.

  • To improve the precision of the VAE samples, we introduce a grasp evaluator network that maps a point cloud of the observed object and the robot gripper to a quality assessment of the 6D gripper pose. Crucially, we show that the gradient of this network can be used to improve grasp samples, for instance by moving the gripper out of collision or ensuring that the gripper is well aligned with the object.

  • We demonstrate that our approach outperforms previous approaches and enables a robot to pick up 17 objects with a success rate of 88%. Generating diverse grasps is important because not all grasps are kinematically feasible for the robot to execute. We furthermore show that our approach generates diverse sets of grasp samples while maintaining a high success rate.

The paper is organized as follows. We first contrast related approaches to grasping that use deep learning, and then explain the different components of our approach: grasp sampling, evaluation, and refinement. Finally, we evaluate our method on a real robotic platform and show the effect of different hyperparameters in various ablation studies.

2 Related Work

Learning 6-DOF Grasps

The prevailing approaches to solve the robot grasping problem are data-driven [2]. While earlier methods were based on hand-crafted feature vectors [26, 1, 7], recent methods exploit convolutional architectures to operate on raw visual measurements [13, 24, 20, 18, 14]. Most of these grasp synthesis approaches are enabled by representing the grasp as an oriented rectangle in the image [8]. This 3-DOF representation constrains the gripper pose to be parallel to the image plane. The drawbacks of such a representation are manifold: since it limits the grasp diversity, picking up an object might be impossible given additional constraints imposed by the arm or task. In the case of a static image sensor, it also leads to a severely restricted workspace [18].

Our approach tackles the problem of predicting the full 6-DOF pregrasp pose. This is challenging due to occluded object parts that affect grasp success. Yan et al. [34] circumvent this problem by including the auxiliary task of reconstructing the geometry of the target object. The main task of predicting the 6-DOF grasp outcome can then use local geometry that is not part of the measurement. Similar to our evaluator network, Zhou et al. [36] learn a grasp score function which they also use for grasp refinement. In contrast to our approach, both methods [34, 36] are only evaluated in simulation.

Few methods formulate the problem as a regression to a single best grasp pose [27, 16]. They inherently lack the ability to predict a diverse distribution of possible grasps. Choi et al. [4] classify 24 pre-defined orientations to choose a 6-DOF pre-grasp pose. Such a coarse resolution of the orientation space will necessarily lead to a limited diversity of the predicted grasps. In contrast, the grasp point detection method (GPD) [30, 15] uses a denser sampling of candidate grasps: a point in the observed point cloud is sampled randomly and a Darboux frame is constructed which is aligned with the estimated surface normal and the local direction of the principal curvature. Although this heuristic creates a quite diverse set of candidate grasps, it fails to generate grasps along thin structures such as rims of mugs, plates, or bowls, since estimating those surface normals from noisy measurements is challenging. Our learned grasp sampler does not suffer from such bias. As a result, our proposed method finds grasps where GPD is not able to (see the robot experiments in Sec. 4.2).


Apart from using supervised learning, grasping has also been formulated as a reinforcement learning problem [9, 35] or approximations of it [14]. The learned grasp policies are more expressive than describing only the final grasp pose. Still, the action space of these methods is usually $SE(2)$, limiting the diversity to top-down grasps.

Deep Neural Networks for Learning from 3D Data

The success of deep learning on 3D point cloud data started much later than its huge success on RGB images. In the early days, 3D data were represented as 3D voxels [19] or as features extracted from 2.5D depth images [6] and processed similarly to RGB images using convolutional neural networks, which oftentimes led to marginal improvements. Qi et al. [22, 23] introduced new architectures, called PointNet and PointNet++, that are capable of representing 3D data and extracting the representation efficiently. The success of PointNet led to the introduction of different variations of network architectures [32, 29] that represent 3D data, showing significant improvement on 3D object pose estimation, semantic segmentation, and part segmentation [29, 23, 21, 33]. In order to estimate a successful grasp, the 6-DOF pose of the grasp needs to be accurate. Operating on a single RGB image does not provide the required accuracy since the input and output are not in the same domain. Therefore, we use 3D point cloud data to generate grasps. We use PointNet++ [23] for learning the representation to generate and evaluate grasps in $SE(3)$.

Variational Autoencoders

Variational autoencoders [10] (VAE) are one of the main categories of deep generative models. VAEs can be trained in an unsupervised manner to maximize the likelihood of the training data. They have been applied to a variety of tasks such as future prediction [12, 31], generating novel viewpoints [11], and object segmentation [28]. In this work, we use a VAE to sample a diverse set of grasps in $SE(3)$.

The overall architecture of our model is similar to GANs [5]. The generator module is a VAE that is based on different samples from a latent space and the observed point cloud $X$. It generates different grasp proposals, and the evaluation network (discriminator) accepts or rejects them based on how likely it is that they are successful. Both generator and discriminator take the 3D point cloud of the object as part of the input.

(a) Grasp coordinate frame (b) Overview of our method
Figure 3: (a) Grasps are estimated with respect to the center of mass of the object point cloud. The axes of the grasp coordinate frame are parallel to those of the camera. (b) The object point cloud $X$ is extracted from a depth image using plane fitting. The Grasp Sampler Network takes the point cloud and proposes different grasps. The evaluator network assesses the grasps based on the object point cloud and the proposed grasp. Grasps are improved iteratively using the gradient of the evaluator network.

3 6-DOF Grasp Pose Generation

We formulate grasp pose generation as the process of producing sets of robot gripper poses such that closing the gripper at any of these poses results in a stable grasp of an object. Furthermore, the process should generate diverse sets of poses that ultimately cover all possible ways an object could be grasped. Robot gripper poses are given in $SE(3)$, specifying the 3D translation and 3D orientation of the gripper. Here, we focus on generating grasp poses for single objects; additional constraints due to a manipulator’s reach and due to other objects in a scene are beyond the scope of this work and can be handled by trajectory optimization techniques. Grasp pose generation is challenging due to the narrow subspace of successful grasps in the space of all possible grasps. Small perturbations in the pose of a grasp can transform a successful grasp into a failure. To generate diverse sets of stable grasps, our approach samples grasp poses using a variational auto-encoder network followed by an iterative evaluation and refinement process. The input to our approach is a point cloud of the object the robot should pick up.

Specifically, we aim to learn the posterior distribution $P(G^* \mid X)$, where $G^*$ represents the space of all successful grasps and $X$ is the partial point cloud of the object observed by a camera. Each grasp $g \in G^*$ is represented by $(R, T) \in SE(3)$, where $R \in SO(3)$ and $T \in \mathbb{R}^3$ are the rotation and translation of grasp $g$. Grasps are defined in the object reference frame, whose origin is the center of mass of the observed point cloud. Its axes are parallel to those of the camera frame (see Fig. 3-a). The distribution of successful grasps $G^*$ can be complex and discontinuous. For example, the distribution of $G^*$ for a mug has multiple modes along the rim, handle, and bottom. Within each mode, the space of successful grasps is continuous, but grasps of different modes can be separated from each other. The total number of separate modes for each object category varies based on the shape and scale of objects.

Since the number of modes of $G^*$ is not known beforehand, we propose to learn a generator module that maximizes the likelihood of successful grasps $g \in G^*$. Since the generator only observes successful grasps during training, it is possible that it also generates failed grasps $G^-$. In order to detect and refine these negative grasps, an evaluation module is trained to predict $P(S \mid g, X)$, i.e., the probability of success for a grasp $g$ and the observed point cloud $X$. Applied to a sampled grasp, the evaluation module predicts grasp success and propagates the success gradient back through the network to generate an improved grasp pose. This process can be repeated. Discarding all grasps that remain below a threshold provides the final set of high quality grasps. The overview of our method is shown in Fig. 3-b.
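The sample, refine, and filter loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the grasp is reduced to a single number, and `toy_sampler`, `toy_evaluator`, `toy_refine`, and the 0.5 threshold are hypothetical stand-ins for the learned sampler, the learned evaluator, its gradient step, and the acceptance threshold.

```python
import math
import random


def sample_refine_filter(sample_grasp, evaluate, refine,
                         n_samples=200, n_iters=10, threshold=0.5, seed=0):
    """Sample grasps, refine them iteratively, keep those scoring above a threshold."""
    rng = random.Random(seed)
    grasps = [sample_grasp(rng) for _ in range(n_samples)]
    for _ in range(n_iters):
        grasps = [refine(g) for g in grasps]
    scored = [(evaluate(g), g) for g in grasps]
    kept = [(s, g) for s, g in scored if s >= threshold]
    kept.sort(key=lambda sg: sg[0], reverse=True)  # best grasp first
    return kept


# Toy stand-ins (hypothetical): a 1-D "pose" whose best value is 1.0.
def toy_sampler(rng):
    return rng.uniform(-2.0, 4.0)


def toy_evaluator(g):
    return math.exp(-(g - 1.0) ** 2)


def toy_refine(g, step=0.1):
    # Move along the derivative of the toy success score.
    grad = -2.0 * (g - 1.0) * toy_evaluator(g)
    return g + step * grad


kept = sample_refine_filter(toy_sampler, toy_evaluator, toy_refine)
```

In the toy setting, every refinement step moves a sample toward the single mode, so more samples clear the threshold after refinement than before.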

3.1 Variational Grasp Sampler

The grasp sampler, shown in Fig. 4, is a generative model that maximizes $P(G^* \mid X)$, the likelihood of a set of pre-defined successful grasps $G^*$. Given a point cloud $X$ and a latent variable $z$, the sampler is a deterministic function that predicts a grasp. It is assumed that $P(z)$, the probability density function of the latent space, is known and chosen beforehand. In our approach, we use $P(z) = \mathcal{N}(0, I)$. Given a point cloud $X$, different grasps are generated by sampling different $z$ from $P(z)$. The likelihood of the generated grasps can be written as follows:

$$P(G \mid X) = \int P(G \mid X, z)\, P(z)\, dz \tag{1}$$

Optimizing Eq. (1) for each positive grasp $g \in G^*$ requires integrating over all the values of the latent space, which is intractable. In order to make Eq. (1) tractable, the encoder $Q(z \mid X, g)$ maps each pair of point cloud $X$ and grasp $g$ to a small subspace in the latent space. Given the sampled $z$, the decoder reconstructs the grasp $\hat{g}$. During training, the encoder and decoder are optimized to minimize the reconstruction loss $\mathcal{L}(\hat{g}, g)$ between ground truth grasps $g \in G^*$ and the reconstructed grasps $\hat{g}$. Furthermore, the KL-divergence $\mathcal{D}_{KL}$ between the distribution $Q(z \mid X, g)$ and the normal distribution $\mathcal{N}(0, I)$ is minimized to ensure a normally distributed latent space with unit variance. The loss function is defined as follows:

$$\mathcal{L}_{\text{vae}} = \sum_{z \sim Q,\; g \sim G^*} \mathcal{L}(\hat{g}, g) + \alpha\, \mathcal{D}_{KL}\left[Q(z \mid X, g),\; \mathcal{N}(0, I)\right] \tag{2}$$

Eq. (2) is optimized using stochastic gradient descent. For each mini-batch, the point cloud $X$ is sampled for an object observed from a random viewpoint. For the sampled point cloud $X$, grasps $g$ are sampled from the set of ground truth grasps $G^*$ using stratified sampling. To combine the orientation and translation loss, we define the reconstruction loss as follows:

$$\mathcal{L}(g, \hat{g}) = \frac{1}{n} \sum \left\lVert \mathcal{T}(g; p) - \mathcal{T}(\hat{g}; p) \right\rVert_1 \tag{3}$$

where $\mathcal{T}(\cdot\,; p)$ is the transformation of a set of $n$ predefined points $p$ on the robot gripper. During training, the decoder learns to decode the latent value $z$ that is sampled from $Q(z \mid X, g)$ and generates grasps, while the encoder learns to output $z$ such that it contains enough information to reconstruct the grasp pose while maintaining the normal distribution. During inference, the encoder $Q$ is removed and latent values are sampled from $\mathcal{N}(0, I)$.
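As a rough sketch of how the two terms of the VAE objective fit together, the snippet below computes a control-point reconstruction loss and the closed-form KL divergence of a diagonal Gaussian from a standard normal. Several details are assumptions made for illustration and are not taken from the text: the grasp is stored as a unit quaternion `(w, x, y, z)` plus a translation, the encoder is assumed to output a mean and log-variance, the point distance uses the L1 norm, and `GRIPPER_POINTS` and `alpha=0.1` are hypothetical placeholders.

```python
import math

# Hypothetical control points on the gripper, in the gripper frame (metres).
GRIPPER_POINTS = [(0.0, 0.0, 0.0), (0.04, 0.0, 0.1), (-0.04, 0.0, 0.1)]


def quat_rotate(q, p):
    """Rotate 3D point p by the unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    px, py, pz = p
    # t = 2 * cross(q_vec, p); rotated = p + w * t + cross(q_vec, t)
    tx = 2.0 * (y * pz - z * py)
    ty = 2.0 * (z * px - x * pz)
    tz = 2.0 * (x * py - y * px)
    return (px + w * tx + (y * tz - z * ty),
            py + w * ty + (z * tx - x * tz),
            pz + w * tz + (x * ty - y * tx))


def transform_points(grasp, points):
    """Apply the rigid transform of a grasp (quaternion, translation) to points."""
    q, t = grasp
    return [tuple(r + ti for r, ti in zip(quat_rotate(q, p), t)) for p in points]


def reconstruction_loss(g, g_hat, points=GRIPPER_POINTS):
    """Mean L1 distance between control points transformed by g and g_hat."""
    a = transform_points(g, points)
    b = transform_points(g_hat, points)
    total = sum(abs(u - v) for pa, pb in zip(a, b) for u, v in zip(pa, pb))
    return total / len(points)


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def vae_loss(g, g_hat, mu, log_var, alpha):
    """Reconstruction loss plus alpha-weighted KL term."""
    return reconstruction_loss(g, g_hat) + alpha * kl_to_standard_normal(mu, log_var)


identity = (1.0, 0.0, 0.0, 0.0)
g_true = (identity, (0.0, 0.0, 0.0))
g_pred = (identity, (0.01, 0.0, 0.0))  # 1 cm translation error
loss = vae_loss(g_true, g_pred, mu=[1.0, 0.0], log_var=[0.0, 0.0], alpha=0.1)
```

Measuring the error on transformed gripper points puts rotation and translation errors in the same units, which is the stated purpose of the combined reconstruction loss.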

Figure 4: During training, the encoder maps each grasp to a point in a latent space. The distribution of the latent space is regularized toward a normal distribution. The decoder takes the point cloud and latent values and reconstructs the 6D grasps, visualized here as gripper poses.

Both encoder and decoder are based on the PointNet++ [23] architecture. In this architecture, each point has a 3D coordinate and a feature vector. The features at each layer are computed based on the features of each point and the 3D relation of the points with respect to each other. In the encoder, the features of each input point $x_i \in X$ are concatenated with the grasp $g$. In the decoder, each point feature is concatenated with the latent variable $z$. The encoder learns to compress the relative information of the point cloud $X$ and the grasp $g$ into the latent variable in such a way that the grasp can be reconstructed by the decoder. Fig. 5 qualitatively shows that the latent space has a strong correlation with the grasp pose.

Figure 5: Relation between grasp pose and corresponding latent value: Each grasp is colored by the corresponding latent value. Red and blue channels are set to the normalized latent value and green channel is set to 0. The smoothness of color transition between each grasp shows the strong correlation between the latent space and the grasp pose.

3.2 Grasp Pose Evaluation

The grasp sampler learns the continuous posterior distribution $P(G \mid X)$ using only positive grasps. As a result, its samples might contain failed grasps that lie in between the modes of the distribution. These transitional grasps and other false positives need to be identified and pruned out. To do so, we need a grasp evaluation network that assigns a probability of success to each grasp. This network needs to reason about grasps relative to the observed point cloud $X$, but it must also be able to extrapolate to unobserved parts of the object. Other methods learn to classify grasps based only on local observed parts of an object [30, 18]. In practice, the observed point cloud of an object has imperfections such as missing or noisy depth values. To mitigate this problem, previous methods resort to using high-quality depth sensors [18] or multiple views [30], which limits the deployment of the system outside of controlled environments. In this work, we classify each grasp using only the imperfect observed point cloud of the object.

Success of a grasp pose depends on the relative pose of the grasp with respect to the object. The inputs to the evaluator network are the point cloud $X$ and the grasp $g$. Similar to the Grasp Sampler, we use the PointNet [22] architecture for the Grasp Evaluator. There are multiple ways of classifying grasps. The first, simple approach is to associate the 6D pose of the grasp $g$ to the features of each point in the first layer. Our experiments showed that such a representation leads to poor accuracy in grasp classification. Instead, we propose to represent a grasp $g$ in a way more closely tied to the object point cloud: we approximate the robot gripper by a point cloud, $X_g$, rendered according to the 6D grasp pose $g$. The object point cloud $X$ and gripper point cloud $X_g$ are combined into a single point cloud by using an extra binary feature that indicates whether a point belongs to the object or to the gripper. In the PointNet architecture, the features for each point are functions of features of the point itself and its neighbors plus the relative spatial relation of the points. Using the unified point cloud $X \cup X_g$ makes it natural to use all the relative information between grasp pose $g$ and object point cloud $X$ for classifying the grasps. The grasp evaluator is trained by minimizing the cross-entropy loss

$$\mathcal{L}_{\text{evaluator}} = -\left(y \log(s) + (1 - y) \log(1 - s)\right) \tag{4}$$

where $y$ is the ground truth binary label of the grasp indicating whether the grasp is successful or not and $s$ is the probability of success predicted by the evaluator.
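A minimal sketch of the evaluator's input encoding and loss follows. The flag convention (0 for object points, 1 for gripper points) and the clamping constant `eps` are assumptions for illustration; the text only states that one extra binary feature distinguishes the two sets.

```python
import math


def merge_clouds(object_points, gripper_points):
    """Build one point cloud whose 4th channel flags gripper (1.0) vs. object (0.0)."""
    merged = [(x, y, z, 0.0) for x, y, z in object_points]
    merged += [(x, y, z, 1.0) for x, y, z in gripper_points]
    return merged


def binary_cross_entropy(y, s, eps=1e-7):
    """Cross-entropy between a binary label y and a predicted success probability s."""
    s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(s) + (1.0 - y) * math.log(1.0 - s))


cloud = merge_clouds(
    object_points=[(0.0, 0.0, 0.5), (0.01, 0.0, 0.5)],
    gripper_points=[(0.0, 0.0, 0.4)],
)
confident_right = binary_cross_entropy(1.0, 0.9)
confident_wrong = binary_cross_entropy(1.0, 0.1)
```

Because both point sets live in the same coordinate frame, a network that looks at local neighborhoods sees the gripper geometry directly next to the object surface it would touch.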

In order to train a robust evaluator, the model needs to be trained with both positive and negative grasps. Since the space of all possible 6D grasp poses is combinatorially large, it is not possible to sample all the negative grasps. Instead, we do hard negative mining to sample negative grasps. The set of hard negative grasps $G^-$ is defined as the grasps that have a similar pose to a positive grasp but that are either in collision with the object or too far from the object to grasp it. More formally, $G^-$ is defined as:

$$G^- = \left\{ g^- \;\middle|\; \exists\, g \in G^* : \mathcal{L}(g, g^-) < \epsilon \right\} \tag{5}$$

where $\mathcal{L}$ is defined in Eq. (3). During training, $g^-$ is sampled from a set of pre-generated negative grasps and by randomly perturbing positive grasps to make the mesh of the gripper either collide with the object mesh or move far from the object.
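The hard negative selection can be sketched as follows. This is an illustration only: grasps are stored as three Euler angles plus a translation, `GRIPPER_POINTS` is a hypothetical set of control points, the pose distance uses a mean L1 point distance, and `is_success` is a hypothetical callback standing in for the simulator's collision and grasp checks.

```python
import math
import random

GRIPPER_POINTS = [(0.0, 0.0, 0.0), (0.04, 0.0, 0.1), (-0.04, 0.0, 0.1)]  # hypothetical


def euler_to_matrix(rx, ry, rz):
    """Rotation matrix R = Rz(rz) * Ry(ry) * Rx(rx)."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [[cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
            [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
            [-sy, cy * sx, cy * cx]]


def transform(grasp, points):
    """Apply the grasp (rx, ry, rz, tx, ty, tz) to the control points."""
    rot = euler_to_matrix(*grasp[:3])
    t = grasp[3:]
    return [tuple(sum(rot[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
            for p in points]


def pose_distance(g_a, g_b, points=GRIPPER_POINTS):
    """Mean L1 distance between the control points of two grasps."""
    a, b = transform(g_a, points), transform(g_b, points)
    return sum(abs(u - v) for pa, pb in zip(a, b) for u, v in zip(pa, pb)) / len(points)


def perturb(grasp, rng, max_angle, max_shift):
    """Randomly perturb a grasp within the given angle and translation bounds."""
    noise = [rng.uniform(-max_angle, max_angle) for _ in range(3)]
    noise += [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    return tuple(g + n for g, n in zip(grasp, noise))


def hard_negatives(positives, candidates, is_success, eps):
    """Keep failing candidates that lie within eps of at least one positive grasp."""
    return [c for c in candidates
            if not is_success(c)
            and any(pose_distance(p, c) < eps for p in positives)]


positives = [(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)]
near_fail = (0.0, 0.0, 0.0, 0.01, 0.0, 0.0)
far_fail = (0.0, 0.0, 0.0, 1.0, 0.0, 0.0)
near_ok = (0.0, 0.0, 0.0, 0.0, 0.01, 0.0)
mined = hard_negatives(positives, [near_fail, far_fail, near_ok],
                       is_success=lambda g: g == near_ok, eps=0.05)
```

Only `near_fail` survives: it fails and is close to a positive, while `far_fail` is an easy negative and `near_ok` is not a negative at all.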

Figure 6: Iterative Grasp Refinement: (left) Image of the object. (right) Grasps colored according to refinement iteration. Dark blue are grasps initially generated from the VAE and yellow are final, refined grasps. Note that even though there are no points between the gripper fingers for the initial bowl grasp (blue), the evaluation network is able to push the gripper to a successful grasp pose.

3.3 Iterative Grasp Pose Refinement

Although the evaluation network rejects implausible grasps, a large portion of the rejected grasps can be close to successful ones. This insight can be exploited by searching for a transformation $\Delta g$ that turns an unsuccessful grasp into a successful one. More formally, we are looking for a refining transformation $\Delta g$ that increases the probability of success, i.e., $P(S = 1 \mid g + \Delta g, X) > P(S = 1 \mid g, X)$. The evaluation network represents a differentiable function of success $S$ based on the point cloud $X$ and grasp $g$. The refinement transformation that leads to the maximum improvement in success probability can be computed by taking the derivative of success with respect to the grasp transformation: $\partial S / \partial g$. The partial derivative $\partial S / \partial X_g$ provides the transformation for each point in the gripper point cloud $X_g$ so as to increase the probability of success. Since the derivative is computed with respect to each point on the gripper independently, it can lead to non-rigid transformations for $X_g$. To enforce the rigidity constraint, the transformed gripper point cloud $X_g = \mathcal{T}(g; p)$ is defined as a function of the grasp orientation, expressed in Euler angles, and the grasp translation $T$. Using the chain rule, $\Delta g$ is computed as follows:

$$\Delta g = \eta\, \frac{\partial S}{\partial g} = \eta \times \frac{\partial S}{\partial \mathcal{T}(g; p)} \times \frac{\partial \mathcal{T}(g; p)}{\partial g} \tag{6}$$

Since the partial derivative is only a valid approximation in the local neighborhood, we use the hyper-parameter $\eta$ to limit the magnitude of updates at each step. In practice, we choose $\eta$ in such a way that the maximum translation update of the grasp does not exceed a fixed bound. Fig. 6 shows the refinement of estimated grasps at different iterations.
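The refinement update with a bounded step can be sketched as below. Everything here is a toy stand-in: the grasp is a 6-vector of Euler angles and translation, `toy_success` replaces the evaluator network, the gradient is taken by finite differences rather than backpropagation, and `eta=0.1` and `max_translation=0.01` are hypothetical values chosen only for the example.

```python
import math


def numerical_gradient(success, grasp, h=1e-5):
    """Central finite-difference gradient of success() w.r.t. the 6 grasp parameters."""
    grad = []
    for i in range(len(grasp)):
        up, down = list(grasp), list(grasp)
        up[i] += h
        down[i] -= h
        grad.append((success(up) - success(down)) / (2.0 * h))
    return grad


def refine_step(success, grasp, eta=0.1, max_translation=0.01):
    """One gradient-ascent step; eta is shrunk so the translation update stays bounded."""
    grad = numerical_gradient(success, grasp)
    t_norm = math.sqrt(sum(d * d for d in grad[3:]))  # translation part of the update
    if eta * t_norm > max_translation:
        eta = max_translation / t_norm
    return [g + eta * d for g, d in zip(grasp, grad)]


# Toy success function (hypothetical): peaks at TARGET, smooth everywhere.
TARGET = [0.0, 0.0, 0.0, 0.1, 0.0, 0.0]


def toy_success(grasp):
    return math.exp(-sum((g - t) ** 2 for g, t in zip(grasp, TARGET)))


start = [0.2, 0.0, 0.0, 0.0, 0.0, 0.0]
grasp = list(start)
scores = [toy_success(grasp)]
for _ in range(10):
    grasp = refine_step(toy_success, grasp)
    scores.append(toy_success(grasp))
```

Updating the six pose parameters, rather than individual gripper points, is what keeps the refined gripper rigid.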

4 Experiments

Training Data for Grasping

To generate reference sets of successful grasps, we use the physics simulation FleX [17], which provides realistic simulation of grasps for arbitrary object shapes. Candidate grasps are sampled based on the object geometry. We sample random points on the object mesh surface and align the gripper’s z-axis (see Fig. 3-a) with the surface normal. The distance between the gripper and the object surface is sampled uniformly between zero and the gripper’s finger length. The orientation around the z-axis is also drawn from a uniform distribution. We only simulate grasps that are not in collision and whose closing volume between the fingers intersects the object. In total we use 206 objects from six categories in ShapeNet [3]: boxes and cylinders (randomly generated), as well as bowls, bottles and mugs. A total of 10,816,720 candidate grasps are sampled of which we simulate 7,074,038 (65.4%), i.e., those that pass the non-empty closing volume test. The simulation consists of a free-floating parallel-jaw gripper and the free-floating object without gravity (similar to [36]). Surface friction and object density are kept constant. After closing its fingers the gripper executes a predefined shaking motion. A grasp is labeled successful if the object is kept between both fingers. Overall, we generate 2,104,894 successful grasps (19.4%). The resulting positive grasp labels are densely distributed as shown in the examples in Fig. 7.
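The candidate sampling step can be sketched geometrically as follows. The sketch assumes that the gripper's z-axis is the approach direction and that it points into the object (the negated outward surface normal); the text does not state the sign convention, and `sample_candidate` and its parameters are hypothetical names.

```python
import math
import random


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def sample_candidate(surface_point, approach, finger_length, rng):
    """Sample a gripper frame whose z-axis follows `approach`, with random roll and standoff."""
    z = normalize(approach)
    # Any vector not parallel to z works as a helper to build the frame.
    helper = (1.0, 0.0, 0.0) if abs(z[0]) < 0.9 else (0.0, 1.0, 0.0)
    x0 = normalize(cross(helper, z))
    y0 = cross(z, x0)
    roll = rng.uniform(0.0, 2.0 * math.pi)  # uniform rotation about the z-axis
    x = tuple(math.cos(roll) * a + math.sin(roll) * b for a, b in zip(x0, y0))
    y = cross(z, x)
    standoff = rng.uniform(0.0, finger_length)  # uniform distance to the surface
    position = tuple(p - standoff * c for p, c in zip(surface_point, z))
    return (x, y, z), position


rng = random.Random(0)
(x_axis, y_axis, z_axis), position = sample_candidate(
    surface_point=(0.0, 0.0, 0.1), approach=(0.0, 0.0, -1.0),
    finger_length=0.1, rng=rng)
```

Each sampled candidate would then be passed to the collision and closing-volume checks before being simulated.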

Figure 7: We use training data generated with a physics simulator. The colored dots around the objects depict successful grasps for a bowl (left) and a box (right). For each continuous grasp subspace an exemplary gripper pose is shown.


Both the grasp generator and evaluator networks use PointNet++ [23] and have similar architectures. Both modules consist of three set-abstraction layers followed by fully connected layers. Each batch of the training data for the generator network consists of a rendering of the object from a random view and 64 grasps that are sampled using stratified sampling to make sure the sampled grasps have enough diversity. The weight for the KL-divergence loss ($\alpha$ in Eq. (2)) is set to . Each batch of training data for the evaluator network consists of  positive,  negative grasps, and  hard negative grasps. Hard negative grasps are selected from perturbed positive grasps by applying radians in each axis and to translation. Both models are trained with the Adam optimizer using a learning rate of . All the grasps are generated in simulation and no real data was used to train any of the models (see Sec. 4).

Network Architecture Details

Both the grasp generator and the evaluator are based on the PointNet++ architecture. Both models consist of three set abstraction layers, which sample 128, 32, and all of the points, respectively. Each set abstraction layer samples the points that are within , , and radius of the sampled points. Each set abstraction layer uses 3 fully connected layers to compute the features. The number of channels for each set abstraction layer are , and respectively. The set abstraction layers are followed by two fully connected layers with units. The grasp generator network outputs a rotation $R$ that is represented as a unit quaternion and a translation $T$. Quaternions are generated by L2-normalizing the output of a linear fully connected layer. No normalization is done for the translation $T$. The evaluator network predicts the score for each grasp using a softmax layer.
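The two output heads can be illustrated with plain arithmetic. This sketch assumes a 4-vector rotation output and a two-logit evaluator output (failure, success); the identity fallback for a near-zero vector and the function names are hypothetical additions.

```python
import math


def to_unit_quaternion(raw, eps=1e-12):
    """L2-normalize a raw 4-vector so that it represents a valid rotation."""
    norm = math.sqrt(sum(v * v for v in raw))
    if norm < eps:
        return (1.0, 0.0, 0.0, 0.0)  # assumption: fall back to the identity rotation
    return tuple(v / norm for v in raw)


def success_probability(logit_fail, logit_success):
    """Two-class softmax, returning the probability of the success class."""
    m = max(logit_fail, logit_success)  # subtract the max for numerical stability
    e_fail = math.exp(logit_fail - m)
    e_success = math.exp(logit_success - m)
    return e_success / (e_fail + e_success)


q = to_unit_quaternion((1.0, 2.0, 3.0, 4.0))
score = success_probability(0.0, 0.0)
```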

Evaluation Metrics

We used two metrics to quantitatively evaluate grasping methods: success rate and coverage rate. Success rate is the ratio of successful grasps among all predicted grasps. This metric only considers the grasp that is executed and does not contain any information about the other grasps. Predicting only one grasp is not suitable for 3D grasping, because the predicted grasp may lead to collision of the robot with other objects in the environment or there may not be any valid robot joint configuration that can reach the predicted grasp. In order to achieve an executable successful grasp, we need to generate a diverse set of grasps from different translations and directions to check for kinematic feasibility and collision avoidance. As a result, we introduce the coverage rate, which captures the diversity of the grasps and measures how well the space of positive grasps is covered by the generated grasps. A positive grasp $g \in G^*$ is covered by the set of predicted grasps, $\hat{G}$, if there exists a grasp $\hat{g} \in \hat{G}$ that is within a fixed distance threshold of the grasp $g$. Positive grasps that have similar translations in the object frame have similar orientations. As a result, we chose to use the distance in translation of the grasps as the criterion for evaluating whether a grasp is covered or not. Since grasps are defined in $SE(3)$, $G^*$ is uncountably infinite. As a result, $G^*$ is approximated by sampling grasps while generating data. Success rate and coverage rate are analogous to precision and recall in the context of binary classification. Similar to precision-recall curves, we use the curves of success rate and coverage rate for analyzing and evaluating our method. We use the AUC of the success-coverage curve for ablation studies and analysis.
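The two metrics and the area under the success-coverage curve can be sketched as follows. The Euclidean distance on translations, the `radius` argument, and the trapezoidal rule for the AUC are assumptions made for the example.

```python
import math


def success_rate(outcomes):
    """Fraction of predicted grasps that succeeded."""
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def coverage_rate(positive_translations, predicted_translations, radius):
    """Fraction of positive grasps with a predicted grasp within `radius` in translation."""
    covered = 0
    for p in positive_translations:
        if any(math.dist(p, q) <= radius for q in predicted_translations):
            covered += 1
    return covered / len(positive_translations)


def area_under_curve(coverages, successes):
    """Trapezoidal area under a success-vs-coverage curve."""
    pts = sorted(zip(coverages, successes))
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


positives = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.01, 0.0, 0.0)]
cov = coverage_rate(positives, predicted, radius=0.02)
suc = success_rate([True, True, False, True])
auc = area_under_curve([0.0, 0.5, 1.0], [1.0, 0.5, 0.0])
```

Sweeping a threshold on the evaluator score and recomputing both rates at each threshold would trace out the curve whose area is reported.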

4.1 Analysis and Ablation Studies

We evaluate the effect of different parameters and modules quantitatively using the same physics simulation as in the generation of training data (Sec. 4). For the ablation studies, we generate 86 object point cloud observations for 10 different objects that are held out during training. For each point cloud, 200 latent values are sampled and refined over 10 iterations, resulting in 2200 grasps per viewpoint and 182,600 grasps in total.

Dimensionality of the Latent Space

There is an inherent tension when deciding the dimensionality of the latent space, which affects the quality of the generated grasps. The latent space needs to have enough capacity to allow the VAE to reconstruct the grasps. At the same time, a high-dimensional latent space leads to over-fitting and requires significantly more training data to be covered. It also deteriorates the quality of sampled grasps during inference, especially when the latent values sampled during inference were not seen by the generator network during training. To analyze this effect, we evaluate the generator network with increasing numbers of dimensions for the latent space. Fig. 8 shows the resulting success-coverage curves. As can be seen, a dimensionality of one has the lowest AUC because the latent space does not have enough capacity. Although 3-dimensional and 4-dimensional latent spaces lead to a slightly better fit on the training data, they perform worse during inference because the VAE cannot cover the latent space densely during training. Given these results, we choose a two-dimensional latent space for all subsequent evaluations.

Figure 8: Effect of latent space dimensionality on success rate and coverage of the grasps. The numbers in the box give the AUC values.

Effect of Refinement on Grasp Quality

While the grasp refinement increases the probability of success based on the evaluator network, it does not necessarily mean that the refined grasps succeed during test time. To analyze the actual improvement induced by each refinement step, we evaluate the grasps in simulation. Fig. 9 shows the success-coverage curve of the grasps that are computed at each refinement iteration. As is shown, not only does the success rate of the generated grasps increase, the coverage rate increases as well. This is because when grasps are improved they get closer to the sampled positive grasps in $G^*$. The AUC of the curves plateaus after the 10th iteration of refinement.

Figure 9: Effect of number of refinement steps on improving the accuracy and coverage of generated grasps. Each curve is calculated over 16,600 grasps.

Effect of Sampled Grasps on Coverage

In previous sections, we conducted ablation studies using 200 random latents because that was the maximum batch size that fits in GPU memory and it is the same setting that we used for our robot experiments. Since the number of sampled latents was limited, the coverage rate in Fig. 9 was less than 0.5 even after 10 refinement steps. In order to investigate how the number of sampled grasps affects the coverage, 2000 grasps are sampled in 10 different batches on the same point clouds that were used in previous ablation studies. Fig. 10 shows how more samples increase the coverage rate.

Figure 10: Effect of number of sampled grasps on coverage rate.
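
To make the relation between sample count and coverage concrete, the sketch below assumes that a reference positive grasp counts as covered when at least one sampled grasp lies within a fixed distance of it. The grasps are reduced to random 3D points in a unit cube, and the radius, set sizes, and function name `coverage_rate` are hypothetical choices for illustration only.

```python
import math
import random


def coverage_rate(positives, samples, radius):
    """Fraction of reference grasps with a sampled grasp within `radius`."""
    covered = sum(
        1 for p in positives
        if any(math.dist(p, s) <= radius for s in samples)
    )
    return covered / len(positives)


rng = random.Random(0)


def random_point():
    # Placeholder for a grasp: a random position in the unit cube.
    return (rng.random(), rng.random(), rng.random())


positives = [random_point() for _ in range(300)]
pool = [random_point() for _ in range(2000)]

# Evaluate nested subsets of the pool, mimicking batches that are accumulated.
sizes = (200, 400, 1000, 2000)
rates = [coverage_rate(positives, pool[:n], radius=0.08) for n in sizes]
for n, r in zip(sizes, rates):
    print(n, round(r, 3))
```

Because each larger subset contains the smaller ones, the coverage computed this way can never decrease as more samples are added.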

4.2 Robot Experiments

Method | Box | Cylinder | Bowl | Mug | Average Success Rate (over categories) | Success Rate (over all trials)
6-DOF GraspNet | 83% | 89% | 100% | 86% | 90% | 88%
GPD [30] | 50% | 78% | 78% | 6% | 52% | 47%

Table 1: Grasping results in real world experiments.

The ultimate test of the generated grasps is to execute them in the real world and deal with imperfect perception, robot joint limits, control errors, and physical phenomena such as friction that are difficult to model. We want to show that: (1) our method scales to the real world despite being trained purely in simulation; (2) the generated grasp distribution is diverse enough to find successful grasps even after discarding those that violate robot kinematics and collision constraints; (3) our method’s diverse grasp sampling leads to higher success rate in comparison to a state-of-the-art 6-DOF grasp planner [30] (GPD).

Figure 11: Each object is evaluated in three different poses. The 3D models of these objects are unknown. The training data consists of mugs, bowls, boxes, cylinders, and bottles with random scales. See the supplements for videos of the grasping trials.

All experiments are done using a 7-DOF Franka Panda manipulator with an Intel RealSense D415 camera mounted on its parallel-jaw gripper. We choose a set of commonly used objects that are challenging visually and physically. The weights of the objects are between (pepper shaker) and (mustard bottle). The hardware setup and object test set are shown in Fig. 11.


Each object is placed in three different stable poses on a table in front of the robot. The robot’s end-effector is moved such that the hand-mounted depth camera has an unobstructed view of the table top. A grasp is considered successful if the robot can lift the object without dropping it.

We filter the measured point cloud, remove the table plane and cluster the remaining points [25]. This extracted object point cloud serves as an input to our approach and also for GPD. Both methods return a list of scored grasps. We use a motion planner to check for a collision-free path to each grasp pose and execute the one with the highest score. If no grasp in the returned set can be executed we consider the trial a failure. In total we run 51 trials per method.
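
The selection rule described above can be sketched as follows. The grasp representation (an identifier paired with a score) and the `has_collision_free_path` callback are hypothetical placeholders for the output of the grasp generator and for the motion planner query; they are not part of any specific planner API.

```python
from typing import Callable, Optional, Sequence, Tuple

Grasp = Tuple[str, float]  # (hypothetical grasp identifier, score)


def select_grasp(
    scored_grasps: Sequence[Grasp],
    has_collision_free_path: Callable[[Grasp], bool],
) -> Optional[Grasp]:
    """Return the highest-scoring grasp that the planner can reach,
    or None if no grasp in the set is executable (a failed trial)."""
    for grasp in sorted(scored_grasps, key=lambda g: g[1], reverse=True):
        if has_collision_free_path(grasp):
            return grasp
    return None


# Toy example: the best-scoring grasp "a" is unreachable, so "b" is executed.
grasps = [("a", 0.9), ("c", 0.4), ("b", 0.8)]
reachable = {"b", "c"}
chosen = select_grasp(grasps, lambda g: g[0] in reachable)
print(chosen)
```

A method that returns a diverse set of grasps is less likely to reach the `None` branch, since more candidates remain after the reachability check.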


Table 1 shows that our method outperforms GPD [30] in success rate across all object categories. One reason is that our method generates diverse grasps, which makes it easier to find kinematically feasible ones. In contrast, GPD does not generate many different grasps, which sometimes leads to situations in which no kinematically feasible grasp can be found. Mugs are particularly difficult for GPD because it does not generate any grasps from the rim (see Fig. 12). Another source of failure is grasps where a gripper finger is tangent to one of the surfaces of the object. In these cases, the slightest error in executing the grasp can change the grasp from grabbing the object to pushing it. The detailed outcomes of the experiments are shown in Table 2.

Figure 12: Visualization of generated grasps by our method vs GPD [30] using 200 samples. (left) Generated grasps using 6-DOF GraspNet on a mug. (right) Generated grasps by GPD. Note that our method generates significantly more samples along the mug rim (and handle in other views). The object would slide out of the gripper for the side grasps.

Category | Object | 6-DOF GraspNet (successes/trials) | GPD [30] (successes/trials)
Box | Jello Chocolate | 3/3 | 3/3
Box | Spam Meat Can | 2/3 | 1/3
Box | Jello Strawberry | 3/3 | 2/3
Box | Sponge | 3/3 | 1/3
Box | Cheezit Box | 1/3 | 1/3
Box | Sugar Box | 3/3 | 1/3
Box | Overall | 15/18 | 9/18
Cylinder | Pepper Shaker | 2/3 | 3/3
Cylinder | Piroluine | 3/3 | 3/3
Cylinder | Mustard | 3/3 | 1/3
Cylinder | Overall | 8/9 | 7/9
Bowl | White Bowl | 3/3 | 1/3
Bowl | Green Bowl | 3/3 | 3/3
Bowl | Red Bowl | 3/3 | 3/3
Bowl | Overall | 9/9 | 7/9
Mug | White Mug | 3/3 | 0/3
Mug | Blue Mug | 2/3 | 0/3
Mug | Orange Mug | 3/3 | 1/3
Mug | Big Red Mug | 2/3 | 0/3
Mug | Red YCB Mug | 3/3 | 0/3
Mug | Overall | 13/15 | 1/15

Table 2: Detailed outcome of the robot experiments, given as the number of successful trials out of three per object. Each trial has one of three outcomes: the trial was successful, the generated grasp was not successful, or none of the generated grasps were kinematically feasible.
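
As a consistency check, the per-category totals of Table 2 can be aggregated into the summary numbers of Table 1. The short script below is a sketch that copies the "Overall" counts from Table 2; the dictionary layout and the helper name `summarize` are illustrative.

```python
# (successes, trials) per category, copied from the "Overall" rows of Table 2.
table2 = {
    "6-DOF GraspNet": {"Box": (15, 18), "Cylinder": (8, 9), "Bowl": (9, 9), "Mug": (13, 15)},
    "GPD": {"Box": (9, 18), "Cylinder": (7, 9), "Bowl": (7, 9), "Mug": (1, 15)},
}


def summarize(per_category):
    """Return per-category rates, their mean, and the pooled counts."""
    rates = {c: s / n for c, (s, n) in per_category.items()}
    category_average = sum(rates.values()) / len(rates)
    successes = sum(s for s, _ in per_category.values())
    trials = sum(n for _, n in per_category.values())
    return rates, category_average, successes, trials


summary = {method: summarize(data) for method, data in table2.items()}
for method, (rates, avg, successes, trials) in summary.items():
    print(f"{method}: {successes}/{trials} = {successes / trials:.1%}, "
          f"category average {avg:.1%}")
```

Pooling all trials gives 45/51 (about 88%) for 6-DOF GraspNet and 24/51 (about 47%) for GPD, matching the last column of Table 1. The per-category rates and their averages agree with Table 1 to within about one percentage point.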

5 Conclusions

In this work, we introduced 6-DOF GraspNet for generating diverse sets of grasps for unknown objects. Our method consists of a trained VAE that samples a variety of grasps for an object. While the VAE is able to capture the complex distribution of successful grasp poses, it does not quite provide the accuracy required for highly robust grasp generation. To overcome this limitation, we additionally introduce a grasp evaluator network that assesses grasp quality and can refine grasps in an iterative process. To the best of our knowledge, neither a learned grasp sampler nor a gradient-based refinement process has been introduced before.

Our model is trained using synthetic grasp data generated by a physics simulator. Therefore, it can scale to large sets of objects without requiring the collection of data in the real world. We demonstrated that our method transfers to the real world on objects with unknown 3D models by deploying it on a real robot platform with an on-board RGB-D camera. We performed robot experiments on 17 objects with unknown 3D models and achieved state-of-the-art results in 3D grasping. We also performed a thorough analysis of the generated grasps in terms of success rate and coverage via ablation studies in a realistic physics simulator.

This approach opens up a number of interesting directions in computer vision and robotics. In our method, all the latent values are sampled uniformly and grasps are then removed based on collision checks and kinematic feasibility. One potential extension is to train the sampler or the evaluator so that it considers not only the object of interest but also the surrounding objects, and thereby directly avoids generating colliding or infeasible grasps. Another interesting direction is to use the evaluator not only to refine sampled grasps but also to provide real-time feedback guidance for a manipulator approaching an object. Our experiments provide evidence that our gradient-based approach could succeed in moving a manipulator closer and closer to a successful grasp.

References


  • [1] J. Bohg and D. Kragic. Learning grasping points with shape context. Robotics and Autonomous Systems, 58(4):362–377, 2010.
  • [2] J. Bohg, A. Morales, T. Asfour, and D. Kragic. Data-driven grasp synthesis—a survey. IEEE Transactions on Robotics, 30(2):289–309, 2014.
  • [3] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  • [4] C. Choi, W. Schwarting, J. DelPreto, and D. Rus. Learning object grasping for soft robot hands. IEEE Robotics and Automation Letters, 3(3):2370–2377, 2018.
  • [5] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial networks. Neural Information Processing Systems (NeurIPS), 2014.
  • [6] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In European Conference on Computer Vision (ECCV), 2014.
  • [7] A. Herzog, P. Pastor, M. Kalakrishnan, L. Righetti, T. Asfour, and S. Schaal. Template-based learning of grasp selection. In 2012 IEEE International Conference on Robotics and Automation, pages 2379–2384. IEEE, 2012.
  • [8] Y. Jiang, S. Moseson, and A. Saxena. Efficient grasping from rgbd images: Learning using a new rectangle representation. In ICRA. IEEE, 2011.
  • [9] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine. Scalable deep reinforcement learning for vision-based robotic manipulation. In A. Billard, A. Dragan, J. Peters, and J. Morimoto, editors, Proceedings of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research, pages 651–673. PMLR, 29–31 Oct 2018.
  • [10] D. P. Kingma and M. Welling. Auto-encoding variational bayes. International Conference on Learning Representations (ICLR), 2014.
  • [11] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28. 2015.
  • [12] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. S. Torr, and M. Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [13] I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 34(4-5):705–724, 2015.
  • [14] S. Levine, P. P. Sampedro, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. 2017.
  • [15] H. Liang, X. Ma, S. Li, M. Görner, S. Tang, B. Fang, F. Sun, and J. Zhang. Pointnetgpd: Detecting grasp configurations from point sets. ICRA, 2019.
  • [16] M. Liu, Z. Pan, K. Xu, K. Ganguly, and D. Manocha. Generating grasp poses for a high-DOF gripper using neural networks. arXiv preprint arXiv:1903.00425, 2019.
  • [17] M. Macklin, M. Müller, N. Chentanez, and T.-Y. Kim. Unified particle physics for real-time applications. ACM Transactions on Graphics (TOG), 33(4):153, 2014.
  • [18] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg. Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. 2017.
  • [19] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 922–928, September 2015.
  • [20] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3406–3413, May 2016.
  • [21] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum pointnets for 3d object detection from rgb-d data. Computer Vision and Pattern Recognition (CVPR), 2018.
  • [22] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Computer Vision and Pattern Recognition (CVPR), 2016.
  • [23] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Neural Information Processing Systems (NeurIPS), 2017.
  • [24] J. Redmon and A. Angelova. Real-time grasp detection using convolutional neural networks. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1316–1322. IEEE, 2015.
  • [25] R. B. Rusu and S. Cousins. 3d is here: Point cloud library (pcl). 2011.
  • [26] A. Saxena, L. L. Wong, and A. Y. Ng. Learning grasp strategies with partial shape information. In AAAI, volume 3, pages 1491–1494, 2008.
  • [27] P. Schmidt, N. Vahrenkamp, M. Wächter, and T. Asfour. Grasping of unknown objects using deep convolutional neural networks based on depth images. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6831–6838. IEEE, 2018.
  • [28] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems 28. 2015.
  • [29] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M.-H. Yang, and J. Kautz. SPLATNet: Sparse lattice networks for point cloud processing. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2530–2539, 2018.
  • [30] A. ten Pas, M. Gualtieri, K. Saenko, and R. Platt. Grasp pose detection in point clouds. The International Journal of Robotics Research, 36(13-14):1455–1473, 2017.
  • [31] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision, 2016.
  • [32] Y. Wang, Y. Sun, Z. Liu, S. Sarma, M. Bronstein, and J. Solomon. Dynamic graph cnn for learning on point clouds. Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [33] D. Xu, D. Anguelov, and A. Jain. Pointfusion: Deep sensor fusion for 3d bounding box estimation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [34] X. Yan, J. Hsu, M. Khansari, Y. Bai, A. Pathak, A. Gupta, J. Davidson, and H. Lee. Learning 6-dof grasping interaction via deep geometry-aware 3d representations. arXiv preprint arXiv:1708.07303, 2017.
  • [35] A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez, and T. Funkhouser. Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4238–4245. IEEE, 2018.
  • [36] Y. Zhou and K. Hauser. 6dof grasp planning by optimizing a deep learning scoring function. In Robotics: Science and Systems (RSS) Workshop on Revisiting Contact-Turning a Problem into a Solution, 2017.