UniGrasp: Learning a Unified Model to Grasp with N-Fingered Robotic Hands

10/24/2019 ∙ by Lin Shao, et al.

To achieve a successful grasp, gripper attributes such as geometry and kinematics play a role as important as the target object geometry. Most previous work has focused on developing grasp methods that generalize over novel object geometry but are specific to a certain robot hand. We propose UniGrasp, an efficient data-driven grasp synthesis method that considers both the object geometry and gripper attributes as inputs. UniGrasp is based on a novel deep neural network architecture that selects sets of contact points from the input point cloud of the object. The proposed model is trained on a large dataset to produce contact points that are in force closure and reachable by the robot hand. By using contact points as output, we can transfer between a diverse set of N-fingered robotic hands. Our model produces over 90% valid contact points in Top10 predictions in simulation and more than 90% successful grasps in real-world experiments for various known two-fingered and three-fingered grippers. Our model also achieves 93% and 83% successful grasps in real-world experiments for a novel two-fingered gripper and a five-fingered anthropomorphic robotic hand, respectively.




I Introduction

The ability of a robot to grasp a variety of objects is of tremendous importance for many application domains such as manufacturing or warehouse logistics. It remains a challenging problem to find suitable grasps for arbitrary objects and grippers, especially when objects are only partially observed through noisy sensors.

The most successful approaches in recent literature are data-driven methods where large-capacity models are trained on labeled training data to predict suitable grasp poses, e.g. [2, 31, 16, 12, 26, 20, 21, 17, 11, 38, 33, 40, 19]. The main objective of these types of approaches is to generalize over novel objects that were not part of the training data. All of these models are trained for one particular end-effector, which is typically a two-finger gripper. There are only a few exceptions that consider more dexterous hands, e.g. [12, 38, 33, 40, 18, 19]. Of those, [38, 40, 19] go beyond a small set of pre-grasp shapes. [21] is the only approach we are aware of that considers two different grippers in one framework, but it still trains separate grasp synthesis models for each one.

Fig. 1: UniGrasp takes as input a kinematic description of a robotic hand and a point cloud of an object. Given this input, UniGrasp is trained on a large dataset to produce contact points on the object surface that are in force closure and reachable by the robotic hand. UniGrasp produces valid contact point sets not only on novel objects but also generalizes to new n-fingered hands that it has not been trained on.

In this paper, we go beyond generalization to novel objects. We propose UniGrasp, a model that additionally generalizes to novel gripper kinematics and geometries with more than two fingers (see Fig. 1). Such a model enables a robot to cope with multiple, interchangeable, or updated grippers without requiring retraining. UniGrasp takes as input a point cloud of the object and a robot hand specification in the Unified Robot Description Format (URDF) [37]. It outputs a set of contact points, one per finger of the gripper. The model is trained to select contact points that are in force closure and reachable by the robot hand. This contact-based grasp representation allows us to infer precision grasps that exploit the dexterity of a multi-fingered hand. It also alleviates the need to define a finite set of grasp pre-shapes and approach directions, e.g. from the top.

This work makes the following contributions: (1) Given a point cloud of an object and a URDF model of a gripper, we map gripper and object features into separate lower-dimensional latent spaces. The resulting features are concatenated and form the input to the part of the model that generates sets of contact points. (2) We propose a novel, multi-stage model that selects sets of contact points from an object point cloud. These contact points are in force closure and reachable by the gripper to produce a 6D grasp. (3) In simulation and the real world, we show that both our mapping scheme and our multi-stage model adapt well to a varying number of fingers. The selected contact point sets are typically clustered in a local neighborhood and are therefore well suited for execution on a real robot, for which it may be hard to precisely align its fingers with an isolated point. (4) Finally, we publish a new large-scale dataset of objects annotated with contact points for various grippers, covering many popular commercial grippers.

II Related Work

In the following, we review related work that shares some of the key assumptions with our work, e.g. the object to be grasped is novel and only observed through a noisy depth camera. For a broader review of the field on data-driven grasp synthesis, we refer to [2]. Furthermore, we briefly discuss learning-based methods for processing point clouds.

II-A Data-Driven Grasp Synthesis for Novel Objects

Learning to grasp novel objects from visual data has been an active field of research since the seminal paper by  Saxena et al. [32]. Since then, many approaches have been developed that use more sophisticated function approximators (CNNs) to map from input data to a good grasp [16, 12, 26, 20, 21, 17, 11, 38, 33, 23, 18, 15]. These approaches differ in the input data format (e.g. 2D images, depth maps or point clouds), the grasp representation (e.g. grasping points, oriented 2D rectangles, a 2D position and wrist orientation or a full 6D pose) and whether they learn to grasp from real world examples or synthetic data.

In our work, we take as input a point cloud of a segmented object and generate a set of contact points, one per finger of the hand. We train our model on synthetically generated data. The key difference to previous work is that we train one model that generalizes over many robot hands, even if they were not part of the training data. Specifically, UniGrasp takes as input a kinematic and geometric description of a robot hand and generates contact points that are in force closure and reachable by this specific hand. All of the aforementioned prior work is trained for one particular hand. Typically, this is a two-fingered gripper, with exceptions including [12, 38, 18, 33, 40, 19]. [21] is the only approach we are aware of that considers two different grippers (a suction cup and a parallel-jaw gripper). However, the authors still train separate grasp synthesis models for each gripper. Another key difference to previous work is that we adopt contact points to represent a grasp. Contact points are rarely used in recent work because it is difficult to precisely make contact with these points under noise in perception and control [2]. For this reason, the dominant grasp representation in related work has been the pose of the end-effector, e.g. [12, 17, 21, 33]. This representation reduces the complexity of precise control to simply closing the fingers after having reached a particular end-effector pose. However, it requires defining a finite set of pre-grasp shapes, which limits the dexterity of a multi-fingered robot hand, e.g. [12, 33, 40, 18, 19]. The resulting grasps are also typically power grasps, which are well suited for pick-and-place. For assembly or similar tasks, however, precision grasps are more desirable. A model that outputs contact points per finger essentially defines such a precision grasp. Through inverse kinematics, it yields an end-effector pose and a finger configuration that exploits the full dexterity of the hand.
Of the aforementioned related works, [38, 40, 19] use a multi-fingered robot hand and go beyond a small set of pre-grasp shapes. [38] suggest using a CNN that produces heatmaps indicating suitable locations for fingertip and palm contacts. These contacts are used as a seed for a grasp planner that determines the final grasp pose for the reconstructed object. In follow-up work [39], a CNN completes object shapes from partial point cloud data to provide the grasp planner with a more realistic object model. [40] propose training a deep conditional variational auto-encoder to predict the contact locations and normals for multi-fingered grasps given an RGB-D image of an object. [19] propose a method that directly optimizes the grasp configurations of a multi-fingered hand by back-propagating through the network with respect to these parameters. All of these methods work for one specific hand, while our method generalizes not only over novel objects but also over novel robotic grippers.

II-B Representation Learning for Point Clouds

Recently, various approaches for processing sparse point clouds have been proposed. [28] propose PointNet to learn object features for classification and segmentation. PointNet++ [29] applies PointNet hierarchically to better capture local structures. In our approach, we adopt PointNet [28] to extract features of robotic hands and PointNet++ to extract object features. [1] train an auto-encoder network to reconstruct point clouds; the learned representations enable shape editing via linear interpolation inside the embedding space. We train a similar auto-encoder network to learn a representation of various robotic hands.

III Technical Approach

Given a point cloud of an object and a URDF of a gripper, we aim to select a set of contact points on the surface of the object such that these contact points satisfy the force closure condition and are reachable by the gripper without collisions as shown in Fig. 2.

If the gripper has N fingers, our model UniGrasp selects a set of N contact points from the object point cloud. Due to the diversity of possible hand designs and object shapes, this is a challenging problem that goes beyond what has previously been considered: related work focuses on generalizing over novel objects, not over novel grippers. In this section, we first discuss how we learn a uniform representation of N-fingered robotic hands. This representation can be constructed from a URDF file, which describes the kinematics and geometry of the hand. Furthermore, we describe how contact points are generated on the object surface. We conclude this section by describing the training of the neural network model, named Point Set Selection Network (PSSN). We report how we generate a large-scale annotated dataset of objects grasped by a diverse set of N-fingered robotic hands in Sec. IV.

Fig. 2: Overview of UniGrasp for a 3-fingered gripper. First, we individually map gripper and object features into a lower-dimensional latent space. Their representations are concatenated and fed into our multi-stage Point Set Selection Network (PSSN) that generates contact points. The contact points are in force closure and yield a collision-free grasp for the 3-fingered gripper.

III-A Gripper Representation

We aim to find a compact representation of the gripper geometry and kinematics as input to UniGrasp. In this section, we first describe how we encode the geometry of robotic hands. Then we describe how we project a novel robot hand into the learned latent space and construct the input feature (see Fig. 3).

III-A1 Unsupervised Learning of a Gripper Representation

We use an autoencoder to learn a low-dimensional latent space that encodes the geometry of robotic hands in a specific joint configuration. For training, we synthesize 20k point clouds from 2000 procedurally generated grippers, each in different joint configurations. Each point cloud consists of 2048 points. We use PointNet [28] without its transformation modules as the encoder. This yields a fixed-length latent feature vector (256-dimensional in our implementation, see Sec. V-A). The decoder transforms the latent vector using 3 fully connected layers. The first two layers use a ReLU activation function; the last layer produces a 2048 × 3 output, the reconstructed point cloud of the gripper. We use the Chamfer distance [7] to measure the discrepancy between the input and reconstructed point clouds.
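The reconstruction objective above can be sketched as the symmetric Chamfer distance between the input and reconstructed clouds. The following is a minimal NumPy illustration; the function name and array shapes are our own, not the paper's implementation:

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point clouds p (M, 3) and
    q (K, 3): average squared nearest-neighbor distance, summed over
    both directions. A hypothetical sketch of the reconstruction loss."""
    # Pairwise squared distances, shape (M, K).
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Two single-point clouds one unit apart along x: distance 1.0 each way.
assert chamfer_distance(np.zeros((1, 3)), np.array([[1.0, 0, 0]])) == 2.0
```

Minimizing this quantity pulls every reconstructed point toward its nearest ground-truth point and vice versa, which is why it is a common choice for unordered point sets.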

III-A2 Representing a Novel Robotic Hand

The autoencoder described in the previous section allows us to encode a robot hand in a specific joint configuration. However, we also want to encode the gripper kinematics and joint ranges. In the following, we describe how we parse the URDF [37] file and construct a gripper feature, using the example of a 3-fingered hand with 3 DoF.

Let us assume each finger of the hand has one revolute joint. We denote the joints as q1, q2 and q3. Each joint has limits; let qi,min and qi,max represent the minimum and maximum angle of joint i. There are 2^3 = 8 joint configurations that outline the boundaries of the configuration space of the hand, e.g. (q1,min, q2,min, q3,min). We refer to these as boundary configurations. In addition, the configuration in which each joint angle takes the mean value of its limits, ((q1,min + q1,max)/2, (q2,min + q2,max)/2, (q3,min + q3,max)/2), is referred to as the central configuration. We generate point clouds sampled on the surface of the gripper model under all boundary configurations and the central configuration. For the above example of a 3-DoF gripper, there are 9 point clouds in total. They represent the kinematic and geometric attributes of this specific robotic hand.

Fig. 3: Top: We train an autoencoder to learn a lower-dimensional representation of our grippers and their random configurations by minimizing the Chamfer distance between the reconstructed and the ground truth gripper point clouds. Bottom: The trained encoder is used to generate a feature representation of the geometry and kinematics of the input gripper.

We feed the 9 point clouds into the autoencoder to extract features (see Fig. 3). Note that robotic hands with a different number of DoF have a different number of boundary configurations. To get a fixed-size feature, we apply three pooling operations (max-, min-, and mean-pooling) along the batch dimension. The output features of these three operations are concatenated to form the final robotic hand feature.
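The enumeration of configurations and the batch pooling can be sketched as follows. This is a NumPy illustration with hypothetical helper names; in the actual system the latent vectors come from the PointNet encoder:

```python
import itertools
import numpy as np

def hand_configurations(joint_limits):
    """All boundary configurations (every min/max combination) plus the
    central configuration for a hand with len(joint_limits) DoF.
    joint_limits is a list of (lo, hi) pairs; names are hypothetical."""
    boundary = [np.array(c) for c in itertools.product(*joint_limits)]
    central = np.array([(lo + hi) / 2.0 for lo, hi in joint_limits])
    return boundary + [central]

def pooled_hand_feature(latent_vectors):
    """Fixed-size hand feature: concatenate max-, min- and mean-pooling
    over the variable-length batch of per-configuration latent vectors."""
    f = np.stack(latent_vectors)                            # (B, D)
    return np.concatenate([f.max(0), f.min(0), f.mean(0)])  # (3*D,)

# A 3-DoF hand yields 2**3 boundary configurations plus 1 central one.
assert len(hand_configurations([(0.0, 1.0)] * 3)) == 9
```

Because the pooling is over the batch dimension, the final feature has the same size regardless of how many configurations (and hence DoF) the hand has.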

III-B Contact Point Set Selection

Fig. 4: The above scheme describes the multi-stage process shown in Fig. 2 in more detail. Given a feature matrix as input (comprised of gripper and object features), PSSN successively expands into N stages to generate the set of contact point coordinates of an N-fingered gripper. The two top rows describe the general flow of information from left to right by repeatedly using the modules from the bottom row. Solid lines indicate inputs to modules, while dotted lines signify read (no edge label) or write (update-feat) access to previous-stage elements. Stage One is primarily used to find a representation of the subset of features that correspond to the most promising contact point sets. This representation is used as a template for the following stages (here: Two and Three), each of which maintains its own copy of it. These subsequent stages inflate their input before deflating it into feature coordinate vectors (in purple). The coordinates indicate a) which template-copy elements should form the input for the next stage and, more importantly, b) the indices of data points in the point cloud representing the final grasp locations. Each stage adds one coordinate.

We now have a feature describing an N-fingered robotic hand and a point cloud of 2048 points representing the object. We aim to select N points from this point cloud such that they form force closure and are reachable by the robotic hand without collisions. The task of training a neural network for point set selection is new and challenging.

Fig. 2 shows the process in which PSSN selects N points in a multi-stage way. In Stage One, our model first selects one point from the original point cloud. Conditioned on this point, the network continues to select the second point in Stage Two. If the robotic hand is two-fingered, the point selection procedure ends and these two points are the contact points of the grasp. For a three-fingered robotic hand, a third point is selected during Stage Three. For N-fingered robotic hands, there are N stages. In each Stage n, the neural network selects the next point conditioned on the previously selected points.

For one object and an N-fingered gripper, we maintain the list of all valid point sets; each item in this list is a set of N points. A set is said to be valid if it forms force closure and is reachable by the associated robotic hand. However, it is challenging to find these valid point sets among the vast number of possible contact point sets in the object point cloud. To add robustness to the point selection procedure, we adopt an approach akin to beam search [41], where during each stage the network selects multiple most promising candidate points.

The 2048 × 3 point cloud of an object forms the input to the point set selection network. We use PointNet++ [29] to extract a feature vector and predict a normal for each point in this point cloud.

In practice, we find it is easier to train the PSSN when a large number of invalid point sets is rejected beforehand. To this end, we leverage simple heuristics [3] to reject invalid point sets based on the point positions and the predicted point normals. For two-fingered robotic hands, the angle between the two contact point normals needs to exceed a threshold (determined by the friction model) to allow force closure [24]. For three-fingered robotic hands, the three contact points form a triangle. We require each side of the triangle to be longer than 1 cm to prevent our model from selecting points that are too close to each other. In addition, we prefer contact points that form a more regular triangle and set two further constraints: the maximum internal angle of the triangle must stay below a threshold, and for each point, the angles between its normal vector and the two incident edge directions must stay below a threshold, which indicates that the point normal points outward from the triangle. Three contact points in force closure generally fulfill these constraints.
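Two of these rejection heuristics can be sketched as follows. This is a NumPy illustration; the default angle threshold is a placeholder of our own, since the exact value depends on the friction model:

```python
import numpy as np

def antipodal_ok(n1, n2, min_angle_deg=150.0):
    """Two-finger heuristic: the angle between the two contact normals
    must exceed a threshold for an antipodal grasp to be possible.
    150 degrees is a placeholder value, not the paper's."""
    cos = np.dot(n1, n2) / (np.linalg.norm(n1) * np.linalg.norm(n2))
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return angle > min_angle_deg

def triangle_ok(p1, p2, p3, min_side=0.01):
    """Three-finger heuristic: each side of the contact triangle must be
    longer than 1 cm so that the contacts are not degenerate."""
    sides = [np.linalg.norm(a - b) for a, b in ((p1, p2), (p2, p3), (p3, p1))]
    return min(sides) > min_side

# Opposing normals pass the antipodal test; parallel normals fail it.
assert antipodal_ok(np.array([0.0, 0, 1]), np.array([0.0, 0, -1]))
assert not antipodal_ok(np.array([0.0, 0, 1]), np.array([0.0, 0, 1]))
```

Checks of this kind are cheap per candidate set, which is what makes them suitable for pruning the combinatorially large pool before the expensive force-closure evaluation.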

The detailed architecture of PSSN is shown in Fig. 4. In the following, we describe each stage in detail.

Stage One

We first select k1 points, denoted S1, from the original point cloud as follows (k1 is a fixed hyperparameter). The per-point object features are fed into a 1D convolution and soft-max layer to calculate the probability of each point being part of a valid set. The model then selects the k1 points with the highest scores and gathers the corresponding features, denoted F1.
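The scoring and top-k selection of Stage One can be sketched as follows. In this NumPy illustration the 1D convolution is reduced to a single shared linear map w; all names are our own:

```python
import numpy as np

def stage_one_select(features, w, k):
    """Score each point with a shared linear map w (standing in for the
    1D convolution), softmax over all points, and keep the top-k indices
    together with their features. A hypothetical sketch."""
    logits = features @ w                     # (M,) one score per point
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over the point cloud
    top = np.argsort(-probs)[:k]              # indices of the k best points
    return top, features[top]                 # selected indices and features
```

Because the softmax is monotone in the logits, ranking by probability is equivalent to ranking by the raw convolution scores.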

Stage Two

This stage aims to select points that, conditioned on the points selected in Stage One (denoted S1, with features F1), yield valid point sets. As a first step, Stage Two selects a fixed number of points from the point cloud in the same way as Stage One. This set of points is referred to as S2; the network collects the corresponding point features F2. Then, the feature maps F1 and F2 are put into a reshape-1d-cnn layer, visualized in Fig. 4. The network performs a pairwise copy and concatenation to produce a combined feature map with one row per (S1, S2) pair. This feature map is then sent to a 1D convolution and soft-max layer to get the final score matrix over all pairs, and we collect the top-scoring elements. For two-fingered robotic hands, these pairs are the contact points predicted by PSSN.
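The pairwise copy-and-concatenation inside the reshape-1d-cnn layer can be sketched as follows (a NumPy illustration with hypothetical names):

```python
import numpy as np

def pairwise_concat(f1, f2):
    """Tile the k1 Stage-One features against the k2 Stage-Two features
    and concatenate along channels, giving one (2*D)-dimensional row per
    candidate pair, i.e. a (k1*k2, 2*D) feature map."""
    k1, _ = f1.shape
    k2, _ = f2.shape
    a = np.repeat(f1, k2, axis=0)    # each Stage-One row repeated k2 times
    b = np.tile(f2, (k1, 1))         # Stage-Two rows cycled k1 times
    return np.concatenate([a, b], axis=1)

assert pairwise_concat(np.zeros((4, 8)), np.ones((5, 8))).shape == (20, 16)
```

Scoring this map with a 1D convolution then assigns one score to every pair, which is exactly the score matrix over (S1, S2) described above.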

Stage Three

For an N-fingered robotic hand (N ≥ 3), the network has to select a third contact point. For this purpose, the network accesses the features of each top-scoring element from Stage Two; each such element refers to the two previously selected points from Stage One and Stage Two. The network accesses the corresponding features and concatenates them. The resulting feature map is fed into a 1D convolutional layer, whose output we denote F12. The third point is selected from the already reduced point set chosen in Stage Two (denoted S2). For this purpose, the network gathers F12 together with the Stage Two feature map F2 and feeds both into a reshape-1d-cnn layer to get a pairwise concatenation. This feature map is input to a 1D convolution and soft-max to calculate a score matrix in which each element corresponds to one pair from the previous two stages combined with one point in S2.

Based on the scores, the network determines the top-ranked contact point sets. These form the candidate grasps for the three-fingered robotic hand. In this paper, we implemented the network only for a maximum of three fingers; for an N-fingered robotic hand, there would be N − 3 more stages.

III-C Loss Function and Optimization

Before running PSSN, we train the neural network to predict point normals. Given the ground truth point normals n_i and the predicted normals n̂_i, the loss of the predicted point normals is defined as the mean over all points of 1 − |⟨n̂_i, n_i⟩|, where ⟨·,·⟩ represents the dot product (a reconstruction of the original formula; the absolute value makes the loss invariant to the orientation of the normal).

[13] found that formulating grasp success prediction as a ranking problem yields higher accuracy than formulating it as a binary classification problem. We therefore formulate point selection at every stage as a learning-to-rank [25] problem. For an object point cloud and a gripper, every point set that is in force closure and reachable by this specific robot hand is annotated as positive; all other point sets are labeled negative. Therefore, each point set has a binary label. In Stage n (where n ≤ N), the network predicts a list of point sets and, for each, the probability that the set is valid. Based on these predicted probabilities, we obtain a ranked list of point sets. We use a variant of the ListNet [5] loss to increase the rank of point sets with positive labels.
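A minimal sketch of a ListNet-style top-one loss for binary validity labels follows. This is our own simplification, not the paper's exact variant:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def listnet_top1_loss(scores, labels):
    """Cross-entropy between the top-one probability distributions induced
    by the binary validity labels and by the predicted scores. Minimizing
    it pushes valid point sets toward the top of the ranking."""
    target = softmax(labels.astype(float))   # valid sets share the mass
    pred = softmax(scores)
    return float(-np.sum(target * np.log(pred + 1e-12)))
```

The loss decreases when high scores line up with positively labeled point sets, which is what drives positively labeled sets up the ranked list.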

We use the Adam optimizer [14] with a fixed learning rate for training the network and split the data 80/20 into training and test sets. We train the neural network stage by stage. First, we train only Stage One: only its loss is computed, and the gradients are back-propagated to the weights of the layers in Stage One. After training Stage One, we fix its weights and continue to train the layers in Stage Two, computing only the loss of Stage Two. The training procedure continues until Stage N.

IV Grasp Dataset Generation

Fig. 5: Overview of our dataset generation process. Point clouds of different Bullet [6] object meshes were generated from images rendered from 8 viewpoints around each object. All point sets passing a force-closure check were added to the list of valid point sets. If such a point set is also reachable and collision-free for a specific gripper, it is added to the final dataset used to train UniGrasp.

To train the UniGrasp model, we require object point clouds annotated with sets of contact points that are in force closure and reachable by specific grippers. We generate this dataset in simulation, as commonly done for other data-driven approaches [12, 22, 20, 21]. To construct the dataset, we select 1000 object models available in Bullet [6] and scale each object to up to five different sizes, yielding 3275 object instances. We use 11 different robotic hands: eight two-fingered grippers and three three-fingered hands (commercially available grippers such as the Sawyer [30] and Panda [8] parallel-jaw grippers, the Robotiq 3-Finger [10], the Kinova KG-3 [9], and the Barrett BH8-282 gripper [35]). The data generation process is visualized in Fig. 5.

We place the object on a horizontal plane and render it from eight viewpoints to generate RGB-D images. We reconstruct a point cloud from these RGB-D images and down-sample it to 2048 points. We find that 2048 points are dense enough to represent the geometry of the object given GPU memory constraints.

Given a robotic hand with N fingers (N = 2, 3, …), we generate all sets of N points that are in force closure and reachable by the given robotic hand. For an object represented by a point cloud of 2048 points and a three-fingered robotic hand, the total number of combinations is (2048 choose 3) ≈ 1.4 × 10^9. We first utilize the heuristics described in Section III-B to reject large numbers of point sets and then use FastGrasp [27] to efficiently evaluate the remaining ones. We compute the grasp quality score ε with a fixed friction coefficient (one value for two-fingered and one for three-fingered grippers) and use a polygonal approximation of the friction cone with 16 faces. We label a point set to be in force closure if ε exceeds a threshold.
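The size of this search space can be verified directly, which makes clear why heuristic pre-filtering is necessary before any exact force-closure evaluation:

```python
import math

# Number of 3-point contact sets on a 2048-point cloud: far too many to
# evaluate force closure exhaustively, hence the heuristic rejection.
n_sets = math.comb(2048, 3)
assert n_sets == 1_429_559_296   # roughly 1.4 * 10**9
```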

If a set of points is in force closure, we use inverse kinematics for each N-fingered hand to determine (a) whether the points are reachable and (b) whether the grasp is collision-free, both specific to the hand. All sets of contact points that are in force closure, reachable, and collision-free are labeled as positive for a specific gripper and negative otherwise.

V Experiments

In this section, we extensively evaluate the proposed grasp synthesis method in terms of the accuracy of contact point selection for various known N-fingered robotic hands, both in simulation and in real-world experiments. We demonstrate the validity of the learned gripper embedding space and show that our model can produce valid contact points for novel robotic hands, both in simulation and in the real world.

V-A Robotic Hand Representation Learning

For training the autoencoder that represents the gripper, we split the gripper point clouds described in Sec. III into training and test sets with a ratio of nine to one. The PointNet [28] encoder uses consecutive 1D convolution layers with feature channel dimensions (64, 64, 128, 128, 256, 256) starting from the input layer. The final feature embedding space has a dimension of 256. We evaluate the reconstruction quality via the Chamfer distance [7] on both the training and the test set.

To demonstrate that the learned representation captures robotic hand attributes such as prismatic and revolute joint movement as well as geometric variety, we show linear interpolation between two points in the latent space. We take two gripper point clouds and feed them into the autoencoder, extracting the corresponding encoder features z1 and z2. We generate new features by interpolating between z1 and z2 and feed them into the decoder to reconstruct point clouds. The results are shown in Fig. 6 and indicate meaningful interpolation not only in configuration space but also in shape space.
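The interpolation step itself is a simple linear blend in the latent space; a NumPy sketch (the decoder call is omitted, and the names are our own):

```python
import numpy as np

def interpolate_latents(z_a, z_b, steps):
    """Linear interpolation between two encoder latents; each intermediate
    vector would be fed to the decoder to reconstruct a gripper cloud."""
    return [(1.0 - t) * z_a + t * z_b for t in np.linspace(0.0, 1.0, steps)]

zs = interpolate_latents(np.zeros(256), np.ones(256), 5)
assert np.allclose(zs[2], 0.5)   # the midpoint latent
```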

Fig. 6: Interpolation in Gripper Feature Space. The top row indicates prismatic joint movement, the middle row represents revolute joint movement. The bottom row shows the geometry changes between two types of 3-fingered robotic hands.
Fig. 7: Examples of contact point selection results on one object with two-fingered grippers and the Kinova KG-3. The black point cloud represents the object in top view. Red, yellow and green spheres indicate the contact points predicted by our PSSN in Stages One, Two and Three, respectively. Blue points represent the gripper. For two-fingered grippers, the left gripper shows the fully closed and the right gripper the fully open configuration. Our method generates contact point sets that are adapted to the robotic hand attributes.

V-B Grasp Point Set Selection Performance

Accuracy     | Stage One Train | Stage One Test | Stage Two Train | Stage Two Test | Stage Three Train | Stage Three Test | Neighbor
             | Top1    Top10   | Top1    Top10  | Top1    Top10   | Top1    Top10  | Top1    Top10     | Top1    Top10    |
T Sawyer     | 99.8    100.0   | 99.6    100.0  | 93.6    97.6    | 92.8    96.8   | -       -         | -       -        | 53
U Sawyer     | -       -       | 97.4    100.0  | -       -       | 86.6    95.4   | -       -         | -       -        | 43
T Franka     | 99.0    99.8    | 98.7    100.0  | 84.7    95.6    | 83.7    91.0   | -       -         | -       -        | 43
T Kinova     | 96.6    99.7    | 96.8    99.8   | 91.6    96.3    | 91.6    96.3   | 76.3    89.5      | 77.1    90.7     | 26
T Robotiq    | 98.2    99.8    | 98.4    99.6   | 95.6    98.3    | 96.3    98.4   | 86.6    94.2      | 86.9    95.1     | 36
T Barrett    | 98.4    100.0   | 98.7    99.8   | 95.6    98.3    | 97.1    98.9   | 90.6    96.4      | 89.6    96.3     | 40
Fig. 8: Summary of our grasping evaluation in simulation for two-fingered and three-fingered grippers. We report prediction accuracies of PSSN at different stages for different grippers and distinguish between two evaluation strategies, denoted by T and U. A gripper prefixed with T (e.g. T Sawyer) indicates that samples of that gripper (among other grippers) were contained in both the training and the test split. A prefix U denotes that samples of the gripper were not used during training.

In simulation, PSSN selects a fixed number of candidate points in each of Stages One, Two and Three. We run six experiments to demonstrate the strength of our proposed method in predicting valid contact points for up to three-fingered robot hands. We adopt two evaluation metrics to reflect prediction accuracy.

Top1 refers to the prediction accuracy of the highest-ranked point set being within 5 mm of a valid point set, i.e. of a point set that forms a force-closure grasp and is reachable without collisions by the input hand. Top10 refers to the percentage of test cases where at least one of the top 10 predicted point sets is within 5 mm of a valid grasp. We allow the 5 mm relaxation for two reasons: a) there is always annotation noise, e.g. in the point normal or force closure estimation, and b) in the real world, fingertips often have a width of 10 mm, so with a 5 mm tolerance the fingertip still covers the valid contact point. A similar relaxation is also used in previous work [15]. Neighbor reports the ratio of valid point sets over all possible point sets within a local neighborhood of our prediction, where the radius of this neighborhood around each contact point is 5 mm. A high number shows that valid points are dense around a predicted contact point. This is helpful during grasp acquisition, where noise in the point clouds and in robot actuation may lead to alignment errors between the robot fingers and the contact points.

To test our model’s generalization ability to novel objects, we train a neural network on the training dataset and report its performance on the test dataset. The results are given in Fig. 8. We parse the URDF of various grippers and generate their features as described in Sec. III. During each training stage, our model selects point sets and ranks them. For example, given a two-fingered robotic hand, the network has two prediction stages.

For two-fingered grippers, our model achieves 92.8% and 83.7% Top1 test accuracy for the Sawyer and Franka gripper, respectively. We also test the grasping performance for three-fingered robotic hands, training our point set selection network on the objects annotated for the Kinova, Robotiq and Barrett hand, respectively. Our model achieves 77.1%, 86.9% and 89.6% Top1 test accuracy.

To test our model’s generalization ability to novel robotic hands, we first train the model on the dataset annotated with all two-fingered grippers except the Sawyer. At test time, we parse the URDF file of the Sawyer, extract the corresponding gripper features, feed them into our model, and evaluate grasp performance on the test dataset annotated for the Sawyer gripper. Our model achieves 86.6% Top1 accuracy, and on average 43% of the point sets in the local neighborhood of a prediction are valid. This shows that our model clusters valid contact point sets in regions on the object surface, which is beneficial for real-world grasp execution under perception and actuation noise.

Fig. 7 illustrates that our model generates distinct contact point sets on the same example object for various robotic hands. This shows that our model generates contact points that are specific to the robotic hand attributes.

V-C Real World Experiments

We evaluate the performance of our model on 22 objects shown in the corresponding video. The object set includes eight adversarial objects [20] and three objects from the YCB object set [4]. Our hand-eye setup uses an RGB-D camera to capture depth images of a given object. The depth images are inpainted [36] using OpenCV to remove invalid values. Using the camera matrix, we reconstruct 3D point clouds. We feed the gripper and object point clouds into our UniGrasp model, which outputs a contact point for each finger. We then solve the inverse kinematics for the robotic gripper and command the robot to approach the object using a Cartesian space controller. We close the fingers and lift the object, both open-loop.
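The depth-to-point-cloud step of this pipeline can be sketched with the standard pinhole back-projection below. The function name is ours; the preceding inpainting step would use OpenCV (e.g. `cv2.inpaint` with the `cv2.INPAINT_TELEA` flag, following Telea [36]) and is not repeated here.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (in meters) into a 3D point cloud using
    the pinhole camera model; pixels with zero (invalid) depth are dropped.
    (fx, fy, cx, cy) are the intrinsics from the camera matrix."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth.reshape(-1)
    u, v = u.reshape(-1), v.reshape(-1)
    valid = z > 0
    x = (u[valid] - cx) * z[valid] / fx
    y = (v[valid] - cy) * z[valid] / fy
    return np.stack([x, y, z[valid]], axis=1)  # (N, 3) points, camera frame
```

The resulting object point cloud, together with the gripper point cloud, forms the input to UniGrasp.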

We apply our model in the real world using Kinova KG-3 and Robotiq-3F grippers. To test the generalization of our model over grippers, we remove one finger of the Kinova KG-3 in the URDF file to create a novel two-fingered gripper. We also apply UniGrasp to the Schunk SVH five-fingered robotic hand [34]. Our UniGrasp model produces three contact points for the thumb, index, and ring fingers. After solving inverse kinematics for those three fingers, we control all five fingers by having the middle finger mimic the thumb and the pinkie mimic the ring finger. The results are shown in Fig. 9. Our method achieves 92% and 95% successful grasps for the Kinova-KG3 and Robotiq-3F grippers, respectively, and 93% for the novel two-fingered gripper. Although the thumb of the Schunk hand was broken during our experiments, our model still achieves 83% successful grasps.
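The mimic scheme for the five-fingered hand reduces to a trivial mapping from the three planned finger configurations to five joint commands. The sketch below illustrates this; the finger names are ours and do not correspond to the actual Schunk SVH driver interface.

```python
def five_finger_command(thumb_q, index_q, ring_q):
    """Map the three UniGrasp-planned finger configurations to all five
    fingers: the middle finger mimics the thumb, the pinkie mimics the
    ring finger. Finger names are illustrative, not the SVH driver API."""
    return {
        "thumb": thumb_q,
        "index": index_q,
        "ring": ring_q,
        "middle": thumb_q,  # mimics the thumb
        "pinkie": ring_q,   # mimics the ring finger
    }
```

This keeps the contact-point interface identical to the three-fingered case while still engaging all five fingers during closure.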

Gripper        Trials  SuccessRate(%)  RunTime(s)
T Kinova-KG3   64      92              0.13
T Robotiq-3F   65      95              0.20
U Kinova-2F    60      93              0.06
U Schunk SVH   60      83              0.21
Fig. 9: Summary of our grasping evaluation in the real world for various grippers. The prefixes T and U have the same meaning as in Fig. 8. RunTime is the computation time spent in the PPSN.

VI Conclusion

We present a novel data-driven grasp synthesis approach named UniGrasp that generates valid grasps for a wide range of grippers, from two-fingered parallel-jaw grippers to articulated multi-fingered hands. UniGrasp takes point clouds of the object and the URDF of the robotic hand as inputs and outputs contact points on the object surface. In quantitative experiments, our method predicts over 90% valid contact points in Top10 predictions for various known two- and three-fingered grippers and achieves over 90% successful grasps in real-world experiments. Our model also achieves 93% and 83% successful grasps for a novel two-fingered gripper and a five-fingered robotic hand, respectively, in real-world experiments.


  • [1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas (2018) Learning representations and generative models for 3D point clouds. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 40–49.
  • [2] J. Bohg, A. Morales, T. Asfour, and D. Kragic (2014) Data-driven grasp synthesis—a survey. IEEE Transactions on Robotics 30 (2), pp. 289–309.
  • [3] C. Borst, M. Fischer, and G. Hirzinger (2003) Grasping the dice by dicing the grasp. In 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vol. 4, pp. 3692–3697.
  • [4] B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar (2015) Benchmarking in manipulation research: using the Yale-CMU-Berkeley object and model set. IEEE Robotics & Automation Magazine 22 (3), pp. 36–52.
  • [5] Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li (2007) Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pp. 129–136.
  • [6] E. Coumans (2015) Bullet physics simulation. In ACM SIGGRAPH 2015 Courses, pp. 7.
  • [7] H. Fan, H. Su, and L. Guibas (2017) A point set generation network for 3D object reconstruction from a single image. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2463–2471.
  • [8] Franka Emika GmbH (2019) Panda 2-finger gripper.
  • [9] Kinova Inc. (2019) Kinova KG-2 and KG-3 gripper.
  • [10] Robotiq Inc. (2019) Robotiq 3-finger gripper.
  • [11] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In Proceedings of The 2nd Conference on Robot Learning, Proceedings of Machine Learning Research, Vol. 87, pp. 651–673.
  • [12] D. Kappler, J. Bohg, and S. Schaal (2015) Leveraging big data for grasp planning. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 4304–4311.
  • [13] D. Kappler, S. Schaal, and J. Bohg (2016) Optimizing for what matters: the top grasp hypothesis. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 2167–2174.
  • [14] D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
  • [15] Q. V. Le, D. Kamm, A. F. Kara, and A. Y. Ng (2010) Learning to grasp objects with multiple contact points. In 2010 IEEE International Conference on Robotics and Automation (ICRA), pp. 5062–5069.
  • [16] I. Lenz, H. Lee, and A. Saxena (2015) Deep learning for detecting robotic grasps. The International Journal of Robotics Research 34 (4-5), pp. 705–724.
  • [17] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 (4-5), pp. 421–436.
  • [18] H. Liang, X. Ma, S. Li, M. Görner, S. Tang, B. Fang, F. Sun, and J. Zhang (2019) PointNetGPD: detecting grasp configurations from point sets. In 2019 IEEE International Conference on Robotics and Automation (ICRA).
  • [19] Q. Lu, K. Chenna, B. Sundaralingam, and T. Hermans (2018) Planning multi-fingered grasps as probabilistic inference in a learned deep network. arXiv preprint arXiv:1804.03289.
  • [20] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg (2017) Dex-Net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In Proceedings of Robotics: Science and Systems (RSS).
  • [21] J. Mahler, M. Matl, V. Satish, M. Danielczuk, B. DeRose, S. McKinley, and K. Goldberg (2019) Learning ambidextrous robot grasping policies. Science Robotics 4 (26), eaau4984.
  • [22] J. Mahler, F. T. Pokorny, B. Hou, M. Roderick, M. Laskey, M. Aubry, K. Kohlhoff, T. Kröger, J. Kuffner, and K. Goldberg (2016) Dex-Net 1.0: a cloud-based network of 3D objects for robust grasp planning using a multi-armed bandit model with correlated rewards. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1957–1964.
  • [23] D. Morrison, P. Corke, and J. Leitner (2018) Closing the loop for robotic grasping: a real-time, generative grasp synthesis approach. In Proceedings of Robotics: Science and Systems (RSS).
  • [24] V. Nguyen (1988) Constructing force-closure grasps. The International Journal of Robotics Research 7 (3), pp. 3–16.
  • [25] R. K. Pasumarthi, X. Wang, C. Li, S. Bruch, M. Bendersky, M. Najork, J. Pfeifer, N. Golbandi, R. Anil, and S. Wolf (2018) TF-Ranking: scalable TensorFlow library for learning-to-rank. arXiv preprint arXiv:1812.00073.
  • [26] L. Pinto and A. Gupta (2016) Supersizing self-supervision: learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 3406–3413.
  • [27] F. T. Pokorny and D. Kragic (2013) Classical grasp quality evaluation: new theory and algorithms. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
  • [28] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 652–660.
  • [29] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108.
  • [30] Rethink Robotics (2019) Sawyer 2-finger gripper.
  • [31] A. Sahbani, S. El-Khoury, and P. Bidaud (2012) An overview of 3D object grasp synthesis algorithms. Robotics and Autonomous Systems 60 (3), pp. 326–336.
  • [32] A. Saxena, J. Driemeyer, and A. Y. Ng (2008) Robotic grasping of novel objects using vision. The International Journal of Robotics Research 27 (2), pp. 157–173.
  • [33] P. Schmidt, N. Vahrenkamp, M. Wächter, and T. Asfour (2018) Grasping of unknown objects using deep convolutional neural networks based on depth images. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6831–6838.
  • [34] Schunk (2019) SCHUNK SVH 5-finger hand.
  • [35] Barrett Technology (2019) Barrett BH8-282 3-fingered gripper.
  • [36] A. Telea (2004) An image inpainting technique based on the fast marching method. Journal of Graphics Tools 9 (1), pp. 23–34.
  • [37] Unified Robot Description Format (URDF).
  • [38] J. Varley, J. Weisz, J. Weiss, and P. Allen (2015) Generating multi-fingered robotic grasps via deep learning. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4415–4420.
  • [39] J. Varley, C. DeChant, A. Richardson, J. Ruales, and P. Allen (2017) Shape completion enabled robotic grasping. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2442–2447.
  • [40] M. Veres, M. Moussa, and G. W. Taylor (2017) Modeling grasp motor imagery through deep conditional generative models. IEEE Robotics and Automation Letters 2 (2), pp. 757–764.
  • [41] Wikipedia contributors. Beam search — Wikipedia, the free encyclopedia.