Deep Differentiable Grasp Planner for High-DOF Grippers

02/04/2020 · by Min Liu, et al.

We present an end-to-end algorithm for training deep neural networks to grasp novel objects. Our algorithm builds all the essential components of a grasping system using a forward-backward automatic differentiation approach, including the forward kinematics of the gripper, the collision between the gripper and the target object, and the metric of grasp poses. In particular, we show that a generalized Q1 grasp metric remains well-defined and differentiable for inexact grasps generated by a neural network, and that the derivatives of our generalized Q1 metric can be computed from a sensitivity analysis of the induced optimization problem. We show that the derivatives of the (self-)collision terms can be computed efficiently from a low-quality watertight triangle mesh. Put together, our algorithm allows the computation of grasp poses for high-DOF grippers in unsupervised mode with no ground truth data, or improves the results in supervised mode using a small dataset. Our new learning algorithm significantly simplifies the data preparation for learning-based grasping systems and leads to higher qualities of learned grasps on common 3D shape datasets, achieving a 22% higher grasp success rate and a 0.12 higher value of the Q1 grasp quality metric.


I Introduction

Robot grasping of unknown objects is an important problem and an essential component of various applications, such as robot object packing [48, 47] and dexterous manipulation [4, 54]. Earlier methods [21, 11, 30, 13] could generate grasp poses for an arbitrary gripper or target object, but they ignored real-world uncertainty. Recent learning-based methods [29, 49, 5, 40, 33, 10, 26] have demonstrated improved robustness in terms of handling sensor noise. Instead of directly inferring the grasp poses, these methods propose learning various intermediary information, such as grasp quality measures [29] or reconstructed 3D object shapes [49], and then use this information to help infer grasp poses. On the positive side, it has been shown that learning this kind of information can improve both the data-efficiency of training and the success rate of predicted grasp poses. On the downside, however, this intermediary information complicates the training procedure, hyper-parameter search, and data preparation [49].

Fig. 1: Using a small dataset, we train an end-to-end neural network to predict grasp poses for novel objects that it has never seen before. The neural network prediction is adjusted using our differentiable grasp quality metric.

Ideally, a learning-based grasp planner should infer the grasp poses directly from raw sensor inputs such as RGB-D images. Such approaches have been developed by many researchers [39, 22]. However, recent methods [29, 5] show that it is preferable to first learn a grasp quality metric function and then optimize the metric at runtime for an unknown target object using sampling-based optimization algorithms, such as multi-armed bandits [28]. Such optimization can be very efficient for low-DOF parallel jaw grippers but less efficient for high-DOF anthropomorphic grippers due to their high-dimensional configuration spaces. In addition, it is possible for the sampling algorithm to generate samples at any point in the configuration space and the learned metric function has to return accurate values for all these samples. To achieve high accuracy, a large amount of training data is needed, such as 6.7 million ground truth grasps in the dataset used by [29].

Various techniques have been proposed to improve the robustness and efficiency of grasp planner training. Prior works [10, 49] proposed to improve the data-efficiency of training by having the neural network recover the 3D volumetric representation of the target object from 2D observations. A 2D-to-3D reconstruction sub-task allows the model to learn intrinsic features about the object. However, a volumetric representation also incurs higher computational and memory costs. In addition, compared with surface meshes, volumetric representations based on signed distance fields cannot resolve delicate, thin features of complex objects [26]. Alternatively, prior works [14, 33] show that higher robustness can also be achieved using adversarial training, which in turn introduces additional training sub-tasks and requires new data.

Fig. 2: Our learning architecture takes multi-view depth images of the object as inputs. The features of these images are extracted using ResNet-50 and, after view pooling [43], fed into fully connected (FC) blocks to directly predict the high-DOF configuration of a gripper. The predicted configuration is then transformed into Euclidean space through a forward kinematics (FK) block. We then execute grasps of these configurations on a physical platform. During the training stage, we formulate various requirements for a grasp planner as loss functions in Euclidean space (red), including (self-)collision avoidance, grasp quality maximization, data consistency, and closeness between the gripper and the target object's surface. Our method can be used as a locally optimal grasp planner guided by analytic gradients, or as an additional loss function to improve the quality of learned grasp poses.

Main Results: We present a differentiable theory of grasp planning, extending ideas from [9], an early attempt to formulate grasp planning as a continuous optimization. Our main contribution is a generalized definition of the grasp quality metric that remains well-defined even when the gripper is not in contact with the target object. We show that this metric function is locally differentiable and that its gradient can be computed from a sensitivity analysis of the optimality condition, in a manner similar to [2]. We also propose a loss function that ensures grasps are (self-)collision-free in a differentiable manner and that can be computed from only the surface meshes of target objects.

Our method can be used as a locally optimal grasp planner similar to simulated annealing [30], but our method is guided by analytic gradients and can quickly find a locally optimal solution. More importantly, our method can be used to improve the quality of learned grasp poses using a simple neural network architecture. Specifically, we use a network that takes as input a set of multi-view depth images of the target object and directly predicts a grasp pose for a high-DOF gripper. This design choice is preferable to prior work [28] because it leads to a higher performance during runtime, as there is no need to optimize a learned grasp metric and we can obtain the grasp pose by a single forward propagation through the neural network.

By adding our differentiable loss, we show that this simple neural network architecture can predict high-quality grasps for the Shadow Hand (fig:results) after training on a dataset of only 400 objects and 40K ground truth grasps. When compared with the supervised-learning baseline [26], our method achieves a higher success rate on physical hardware and a higher value of the grasp quality metric [15]. Our learning architecture is illustrated in fig:architecture.

II Related Work

In this section, we review related works in grasp planning using either model-based or learning-based methods.

Model-based Grasp Planners assume perfect sensing of the environment's geometry and the target object's shape. Given this geometric information, a grasp planner searches for a grasp pose that maximizes a certain grasp quality metric; many techniques have been proposed for defining reasonable grasp quality metrics [50, 15, 41, 37] and designing efficient search algorithms [13, 15, 11, 30]. These methods can be applied to both low- and high-DOF grippers and can be classified into discrete sampling-based techniques [30] and continuous optimization techniques [9]. Sampling-based methods allow virtually any grasp quality metric to be used as the objective function, while continuous methods require the metric to be differentiable with respect to the configuration of the gripper. In practice, continuous optimization techniques are more efficient in terms of finding the (locally) optimal grasp poses.

Some planning methods [15, 51, 52, 17] only compute optimal grasp points, while others [13, 27] compute both the grasp points and the gripper poses. When a gripper pose is needed, the planner uses a two-stage approach: a set of grasp points is first selected on the surface of the target object, and then the pose of the gripper is found by inverse kinematics. Based on the idea of numerically optimizing the grasp quality metric, we extend the definition of the grasp quality metric to be well-defined in the ambient space, i.e., when the gripper is not in contact with the target object, thereby unifying grasp point selection and gripper pose computation.

Learning-Based Grasp Planners can predict grasp points or gripper poses given noisy observations of the environment. Most early works [39, 38] in this direction assume that a parallel-jaw gripper is designed for the target object and that the input is a single depth image of the target object. In this case, the grasping problem boils down to selecting the gripper's initial direction and orientation, which can be solved using an analytic method [22]. However, notable success in this problem has been achieved by DexNet [28, 29], which uses deep convolutional neural networks to learn object similarity functions and grasp quality functions. DexNet can robustly pick a large number of unknown objects, and this is achieved using a dataset of tens of thousands of target objects and millions of ground truth grasp poses.

More recent techniques aim at improving the data efficiency of learning-based planners and also making the planner robust in challenging settings involving arbitrary approaching directions [14], more general gripper types [10, 27], and model discrepancies [20, 45]. It has been shown in [14], among others, that the grasp planning task can be divided into two sub-tasks, object reconstruction and gripper pose prediction, and learning these two sub-tasks can improve the rate of success. [49] showed that adversarial training can also improve the robustness of the learned model. However, these methods either perform extensive data generation or require delicate parameter tuning for the adversarial training.

A common drawback of prior works [28, 29, 14] is that they learn a grasp quality metric function or grasp success predictor, which requires an additional sampling-based optimizer to search for gripper poses. This requirement limits these methods to low-DOF grippers, since the high-dimensional configuration space of high-DOF grippers makes sampling-based optimization computationally costly. Some recent methods [26] overcame this difficulty by directly predicting a nominal gripper pose from an observation of the object. However, the predicted gripper pose is not directly usable and needs to be post-processed. In comparison, our method predicts robust, usable gripper poses using a simple neural-network architecture and a much smaller training dataset. In addition, our method can be combined with previous learning-based methods to improve their results.

As an alternative to supervised learning, reinforcement learning allows a learned grasp planner to discover useful grasp poses by exploration. Such planners have been successfully applied to grasping [35] and other manipulation problems [56]. However, the number of state transitions needed in a typical training run is on the order of millions [35], while we show that robust gripper poses can be predicted by supervised learning on a dataset with 400 example objects and 40K ground truth grasp poses.

III Learning Grasp Poses for High-DOF Grippers

Our goal is to learn a grasp prediction network g(D_1, …, D_K; w) from multiple depth images of the target object, where D_i is the depth image taken from the i-th camera view toward the target object to be grasped and w denotes the learnable parameters. The output of g is both the 6D extrinsic parameters and the joint angles of the gripper, which together form the configuration z. This is in contrast with prior works [14, 29], which learn a grasp quality metric function or grasp success predicate f(D, z), where z is a candidate grasp pose. Afterwards, the grasp pose is found by maximizing f at runtime using sampling-based algorithms such as multi-armed bandits [28].

However, when the gripper is high-DOF, the maximization of f becomes a search in a high-dimensional configuration space, which is time-consuming. As a result, we choose to learn g instead of f. The major challenge in learning g is to resolve the ambiguity in grasp poses: infinitely many grasp poses can have the same grasp quality for a target object, but our neural network can only predict one pose. To resolve this ambiguity, we need the dataset to be consistent. A consistent grasp pose dataset is one where all the ground truth grasp poses can be represented by a single neural network. To enforce consistency, prior work [26] trains g by precomputing multiple grasp poses for each target object and using a Chamfer loss to have g pick the most consistent pose. However, the learned gripper poses cannot be used directly due to their low quality, and post-processing is needed before deploying them on physical hardware.

We aim at further improving the quality of the learned function g without increasing the complexity of training in terms of either the amount of data or the network architecture. Instead, we are inspired by early works [9, 15, 13] that formulate grasp planning as a continuous optimization, and we incorporate all the criteria of good grasps as additional loss functions for stochastic optimization. It has recently been shown that gradients can be propagated through complex numerical algorithms to provide additional guidance. Such domain-specific differentiable models [2, 18, 19] can significantly improve the convergence rate of neural-network training and reduce the amount of data needed. However, we need to overcome several difficulties when using these approaches for grasp planning:

  • All the existing grasp quality metrics have discontinuities [53], so we have to modify them for differentiability.

  • A grasp quality metric is only defined when the gripper and the target object have exact contact, which is generally not the case when gripper poses are being stochastically updated by the training algorithm.

  • Our differentiable loss function is defined for a target object represented using triangle meshes. These triangle meshes come from well-known 3D shape datasets [7, 42, 24, 23], some of which are of low quality. If gradient computation becomes unreliable on low-quality meshes (with nearly degenerate triangles), training will be misled.

We present our design of loss functions and discuss how to address the three challenging problems in the next section.

IV Differentiable Grasp Planner

Our loss function is comprised of three terms: a generalized grasp quality term, a heuristic closeness term, and a collision term. The first term is a generalized Q1 grasp metric [15] that measures the quality of a grasp using physics-based rules. However, when force closure is not satisfied, both the metric value and its gradients are zero. In this degenerate case, we add a second, heuristic term that always provides a non-vanishing gradient. Our third term penalizes both self-collisions and collisions between the gripper and the target object.

IV-A Notation

Throughout the paper, we assume that a target object is defined by a watertight triangle mesh. As illustrated in fig:illusQ, given a point x in the workspace, we define the signed distance from x to the mesh as d(x) and the outward normal of the mesh nearest to x as n(x). In addition, we define n_G(x) as the gripper normal, i.e. the outward normal direction on the gripper mesh. We further assume that the target object's center-of-mass coincides with the origin of the Cartesian coordinates. During grasping, the object will be under an external wrench w.

Fig. 3: An illustration of the variables used to define our generalized Q1 metric.

For a set of grasp points x_1, …, x_N satisfying d(x_i) = 0, with respective grasping forces f_1, …, f_N, the quality of a grasp pose is defined by the Q1 metric [15] as follows:

    Q_1 = max { r : B_M(r) ⊆ W },
    W = { Σ_i (f_i, x_i × f_i) : Σ_i ||f_i|| ≤ 1, ||f_i − (n(x_i)^T f_i) n(x_i)|| ≤ μ n(x_i)^T f_i },    (1)

where μ is the frictional coefficient, M is a user-provided metric tensor, and B_M(r) is the origin-centered 6D sphere of radius r under M. Intuitively, Q_1 is the radius of the largest origin-centered 6D sphere contained in the admissible wrench space W, where an admissible wrench must satisfy two conditions: limited force magnitude and frictional cone constraints.

IV-B Generalized Q1 Metric with Inexact Contacts

In practice, it is infeasible to assume that a grasp metric can be computed in its original form, i.e. eq:Q1, because a learning system will generally not produce grasping points that lie exactly on the surface of the target object. It is well known that incorporating hard constraints into neural networks is difficult [36]: when a stochastic training scheme is used and neural network parameters are randomly perturbed, exact constraint satisfaction is lost. As a result, we have to deal with cases where d(x_i) ≠ 0. Taking these cases into account, we derive a generalized version of Q_1 by modifying the first condition of admissible wrenches in eq:Q1 as follows:

    Σ_i ||f_i|| exp(θ |d(x_i)|) exp(φ (1 − n(x_i)^T n_G(x_i))) ≤ 1,    (2)

which essentially extends Q_1 to the ambient space by an exponential weight function with two terms. The first term ensures that our generalized Q_1 attains larger values when grasp points are closer to the surface of the target object. The second term ensures that our generalized Q_1 attains larger values when the normal direction on the gripper and the normal direction on the target object align. Finally, eq:Q1EXT converges to eq:Q1 as θ, φ → ∞. Like previous works [44, 32] on generalized contact-implicit models, our generalized metric allows a learning algorithm to determine the number of contact points and their positions.
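To make the weighting concrete, the sketch below computes the per-contact force multiplier implied by eq. (2); the parameter names theta and phi and the exact exponent form follow our reconstruction above, not any released code.

```python
# Per-contact force multiplier in the generalized Q1 metric (inverse of the
# weight in our reconstruction of eq. (2)); theta/phi values mirror Table I.
import torch

def contact_weight(d, n_obj, n_grip, theta=6.0, phi=8.0):
    """d: (N,) signed distances of grasp points; n_obj, n_grip: (N,3) unit
    normals on the object and the gripper. Returns a (N,) multiplier that
    approaches 1 for exact, well-aligned contacts and decays otherwise."""
    align = (n_obj * n_grip).sum(dim=1)   # cosine of the normal alignment
    return torch.exp(-theta * d.abs()) * torch.exp(-phi * (1.0 - align))
```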

To train neural networks using the generalized metric, we need to compute its sub-gradient with respect to the grasp points efficiently. Unfortunately, exact computation of the Q_1 metric is difficult because the optimization in eq:Q1 is non-convex; several approximations have been proposed in [41, 13, 53]. We present two different techniques for computing the generalized Q_1 and its gradient. The first method computes an upper bound of the generalized Q_1, which is cheaper to compute but creates zero entries in the gradient vector. The second method computes a smooth lower bound of the generalized Q_1, which propagates non-zero gradient information but is more costly to compute.

IV-B1 Derivatives of the Upper Bound

Our first technique adopts [41], which approximates Q_1 by assuming that the wrench must lie along one of a discrete set of directions w_1, …, w_m on the unit sphere. This assumption results in a tractable upper bound of Q_1 and can be extended to our generalized metric as follows:

    Q_1^+ = min_{k=1,…,m} S_W(w_k),    (3)

which is a min-max optimization. Here the minimization is with respect to a set of discrete indices, for which sub-gradients can be computed. The maximization aims at finding the support S_W of W in the direction w_k, and its optimal solution can be derived in closed form. To show this, we first define the convex wrench space of each contact point as:

    W_i = { (f_i, x_i × f_i) : ||f_i|| exp(θ |d(x_i)|) exp(φ (1 − n(x_i)^T n_G(x_i))) ≤ 1, f_i in the frictional cone }.

Then it is easy to verify that W = conv(∪_i W_i), and the support of a union of convex hulls is the maximum support over the hulls, i.e.:

    S_W(w_k) = max_i S_{W_i}(w_k).

Finally, the support of W_i in the direction w_k can be computed analytically: writing w_k = (w_f, w_τ) and u_i = w_f + w_τ × x_i, maximizing u_i^T f_i over the weighted, norm-capped frictional cone yields

    S_{W_i}(w_k) = c_i ||u_i|| cos( clamp( ∠(u_i, n(x_i)) − atan(μ), 0, π/2 ) ),

with c_i the inverse exponential weight above. In this form, each operation for computing our generalized Q_1^+ can be implemented as a standard math operation with derivatives that can be computed using automatic differentiation tools such as [34].
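The upper bound lends itself to a short differentiable implementation. The sketch below assumes a unit metric tensor M = I, random unit directions in wrench space, and our reconstruction of the eq. (2) weight; parameter defaults mirror Table I, but the implementation is our own, not the authors'.

```python
# Differentiable sketch of the generalized Q1 upper bound (eq. (3)).
import math
import torch

def q1_upper(x, d, n_obj, n_grip, mu=0.7, theta=6.0, phi=8.0, m=64):
    """x: (N,3) grasp points; d: (N,) signed distances; n_obj/n_grip: (N,3)
    unit normals on the object/gripper. Returns a scalar with gradients."""
    w = torch.randn(m, 6, dtype=x.dtype)
    w = w / w.norm(dim=1, keepdim=True)              # directions on S^5
    wf, wt = w[:, :3], w[:, 3:]
    # u[k,i] = wf_k + wt_k x x_i: direction contact i 'sees' under wrench w_k.
    u = wf[:, None, :] + torch.cross(
        wt[:, None, :].expand(-1, x.shape[0], -1),
        x[None, :, :].expand(m, -1, -1), dim=2)
    un = u.norm(dim=2).clamp_min(1e-9)
    cosang = ((u * n_obj[None]).sum(2) / un).clamp(-1 + 1e-6, 1 - 1e-6)
    ang = torch.acos(cosang)
    # Support of a unit-capped friction cone along u: |u| when u lies inside
    # the cone, shrinking to 0 once u is 90 degrees past the cone boundary.
    s = un * torch.cos((ang - math.atan(mu)).clamp(0.0, math.pi / 2))
    c = torch.exp(-theta * d.abs()) * \
        torch.exp(-phi * (1.0 - (n_obj * n_grip).sum(1)))
    return (s * c[None]).max(dim=1).values.min()     # max over contacts, min over directions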

IV-B2 Derivatives of the Lower Bound

We have shown that computing an upper bound of Q_1 reduces to a series of simple operations with well-defined sub-gradients. However, due to the max function over contact points in the computation of the upper bound, the sub-gradient is non-zero for only one of the contact points, which is less efficient for training. To resolve this problem, it has been shown in [13] that sum-of-squares (SOS) optimization can be used to compute a lower bound of Q_1. This theory can be extended to compute our generalized metric. If we define a set of directions on the tangent plane at each contact, then the generalized Q_1 lower bound can be found by solving the following SOS optimization problem:

(4)

where we have extended the definition of admissible forces to account for our generalization (eq:Q1EXT). eq:Q1L can be reduced to a semidefinite programming (SDP) problem, and its gradients can be computed via the chain rule:

    ∂Q_1^− / ∂x_i = (∂Q_1^− / ∂A) (∂A / ∂x_i),

where A collects the coefficients of the SDP. While the second term in the chain rule above can be computed directly via automatic differentiation, the first term requires a sensitivity analysis of an SDP problem, as shown in [31] (see the supplementary material for more details). Since the SDP is a smooth approximation of a non-smooth optimization, the derivatives are generally non-zero at all the contact points. As a result, each neural network update can adjust all the fingers of the gripper to generate better grasp poses, which is more efficient than with the upper bound on Q_1. On the other hand, the cost of solving eq:Q1L is also higher than that of solving eq:Q1U, because eq:Q1L involves an SDP solve. Note that a similar analysis for quadratic programming (QP) problems has previously been exploited for training neural networks in [2].
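For intuition, the following toy optimization layer differentiates an argmin implicitly through its optimality condition, in the spirit of the QP analysis in [2]; it uses an unconstrained QP instead of an SDP, so it illustrates the mechanism only, not the paper's solver.

```python
# Toy sensitivity analysis: backpropagating through z* = argmin_z 0.5 z'Qz + q'z
# (Q symmetric positive definite) via the stationarity condition Q z + q = 0,
# instead of unrolling an iterative solver.
import torch

class QPArgmin(torch.autograd.Function):
    @staticmethod
    def forward(ctx, Q, q):
        z = torch.linalg.solve(Q, -q)        # stationarity: Q z + q = 0
        ctx.save_for_backward(Q, z)
        return z

    @staticmethod
    def backward(ctx, grad_z):
        Q, z = ctx.saved_tensors
        # Implicit differentiation: dz/dq = -Q^{-1}, dz/dQ = -Q^{-1} (dQ) z.
        v = torch.linalg.solve(Q.T, grad_z)
        return -v[:, None] * z[None, :], -v  # grads w.r.t. (Q, q)

# Usage: z_star = QPArgmin.apply(Q, q); z_star.sum().backward()
```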

IV-C Geometry-Related Loss Functions

In this section, we show that geometric terms such as d(x) and n(x) can be computed robustly from a triangle mesh. We also formulate the collision-free requirement as a novel loss term. Geometric terms arise in many places in a grasping system: to compute the Q_1 metric, we need to evaluate d(x) and n(x), and in addition we need to avoid penetrations between the gripper and the target objects. To perform these computations, we can introduce a monotonic loss function of the signed distance, scaled by a per-term weight. To provide sub-gradients for all these terms, we need to plug ∂d/∂x into the chain rule. In this section, we show a robust method to compute these quantities for complex, watertight triangle meshes of target objects, which can be accelerated with the help of a bounding volume hierarchy (BVH). Note that it is easy to compute d(x) and its gradients from a signed distance field (SDF) [3], but we choose to use triangle meshes for two reasons. First, most existing 3D shape datasets, such as [7, 8, 55], use triangle meshes, and converting them to SDFs is time and memory consuming. Second, for very complex meshes, low-resolution SDFs cannot represent thin geometric features, and determining an appropriate SDF resolution is difficult.


Fig. 4: Three cases in the computation of the gradient of n(x). (a): The closest geometric feature is an edge. (b): The closest geometric feature is a vertex. (c): The closest geometric feature is a triangle.

Let's assume that the triangle mesh consists of a set of triangles T_j. Then the distance between x and T_j is the solution of the following QP problem:

    dist(x, T_j) = min_b || x − Σ_{k=1}^3 b_k v_k^j ||  s.t.  b_k ≥ 0, Σ_k b_k = 1,

where v_k^j is the k-th vertex of T_j and b are barycentric coordinates. Finally, the signed distance is defined as:

    d(x) = sign((x − c(x))^T n_j) min_j dist(x, T_j),    (5)

where c(x) is the closest point on the mesh, n_j is the outward normal of the closest triangle, and sign is the sign function. Similarly, we can define the outward normal at x to be:

    n(x) = sign(d(x)) (x − c(x)) / ||x − c(x)||.    (6)

In these formulations, the sign function and the min operator define disjoint convex sets with well-defined sub-gradients. The gradient of the signed distance is simply:

    ∂d/∂x = n(x).

And the gradient of n(x) can be computed from the closest geometric feature on the triangle mesh to which c(x) belongs, as illustrated in fig:hessian. If the closest feature to x is an edge with unit direction ê, then we have:

    ∂n/∂x = (I − ê ê^T − n n^T) / ||x − c(x)||.

If the closest feature to x is a vertex, then we have:

    ∂n/∂x = (I − n n^T) / ||x − c(x)||.

If the closest feature to x is inside a triangle, then ∂n/∂x = 0.

Finally, eq:SDIST and eq:SNOR involve a loop over all triangles to find the one with the smallest distance, which can be accelerated by building a BVH and quickly rejecting nodes whose bounding volumes are farther from x than the current best distance [1].
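As a concrete (if unoptimized) realization of eqs. (5)-(6), the sketch below uses the trimesh library for the closest-point and inside-outside queries; it stands in for the paper's BVH-accelerated, exact-arithmetic implementation and is our assumption, not their code.

```python
# Signed distance d(x), outward direction n(x), and grad d = n(x) from a
# watertight mesh, using trimesh for the geometric queries.
import numpy as np
import trimesh

def signed_distance(mesh: trimesh.Trimesh, x: np.ndarray):
    closest, dist, tri = trimesh.proximity.closest_point(mesh, x[None])
    c, dd = closest[0], dist[0]
    s = -1.0 if mesh.contains(x[None])[0] else 1.0   # sign from watertightness
    r = x - c
    if np.linalg.norm(r) < 1e-9:                     # on the surface: fall
        n = mesh.face_normals[tri[0]]                # back to the face normal
    else:
        n = s * r / np.linalg.norm(r)                # outward direction n(x)
    return s * dd, n                                 # d(x) and its gradient
```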

In our experiments, the technique described above is computationally efficient but prone to floating-point truncation error. If a point is close to a triangle's plane, finite-precision floating-point arithmetic has difficulty deciding whether the point lies inside the triangle mesh or not. To solve this problem, we use the exact rational arithmetic implemented in [16] to perform all the computations in this section and convert the results back to inexact, finite-precision floating-point numbers at the end of the computation.

IV-D Self-Collision of the Gripper

In addition to collisions between the gripper and the object, we add a term to penalize any collisions between different links of the gripper. Assuming the gripper has L links, we first approximate the shape of each link using its convex hull and define the self-collision term over pairs of links, which can be trivially computed from the H-representations of the convex hulls and can be accelerated using a bounding volume hierarchy. In practice, we use a small set of sample points to compute the generalized Q_1 metric and another, larger set of sample points to compute the self-collision term, achieving better resolution of self-collisions.
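A minimal sketch of one pairwise penalty term follows; the hinge form and the half-space test are our assumptions about how such a term can be implemented from an H-representation.

```python
# Self-collision penalty between two gripper links: sample points of one link
# tested against the other link's convex hull given as half-planes A v <= b
# (e.g. from scipy.spatial.ConvexHull.equations). A point is inside the hull
# iff max_j (a_j . x - b_j) <= 0; only penetrating points are penalized.
import torch

def link_collision_loss(x, A, b):
    """x: (P,3) surface samples on one link; A: (F,3), b: (F,)."""
    s = (x @ A.T - b[None]).max(dim=1).values   # signed-distance proxy
    return torch.relu(-s).sum()                 # positive only inside the hull
```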

IV-E Defending Against Degenerate Cases and Local Minima

Our generalized Q_1 metric is similar to the standard Q_1 metric in that it implies force closure. However, if an initial guess for the gripper pose has no force closure, then Q_1 = 0 and no gradient information is available. In this case, we add a heuristic term that guides the optimization toward a force-closed pose with high probability by ensuring that all the grasp points are as close to the object as possible. In addition, our generalized Q_1 has many local minima due to nonlinearity and the complex geometries of objects. To defend our neural network against these sub-optimal solutions, we add a data loss to guide the training. Following previous works [10, 26], we use a Chamfer loss for our data term: the distance between the predicted pose and its closest ground truth pose, measured in the gripper's configuration space. In other words, we precompute many ground truth grasp poses for each target object and let the neural network pick the grasp pose that leads to the minimal distance.
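A minimal sketch of such a data term, assuming a squared Euclidean distance in configuration space as the per-pose measure:

```python
# Chamfer-style data term: the network is penalized only against the nearest
# of the K precomputed ground-truth grasps for the object.
import torch

def chamfer_data_loss(z_pred, z_gt):
    """z_pred: (D,) predicted configuration; z_gt: (K,D) ground-truth grasps."""
    return (z_gt - z_pred[None]).pow(2).sum(dim=1).min()
```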

IV-F Forward Kinematics

Our neural network predicts the configuration z, which consists of the global rigid transformation and the joint angles that define the pose of the gripper. The gradient with respect to the grasp points is propagated backward to z via a forward kinematics (FK) layer, similar to [26, 46]. We make a minor modification to account for joint limits with non-vanishing gradients: if a joint has limits in the range [l, u], we transform the corresponding raw network output through a smooth squashing map onto [l, u], which is guaranteed to satisfy the constraints and, unlike hard clamping, has non-vanishing gradients.
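A sketch of one possible squashing map; the sigmoid choice is our assumption, since the text only requires that the limits be satisfied with non-vanishing gradients:

```python
# Joint-limit reparameterization: maps an unconstrained network output into
# [lower, upper] with a gradient that never becomes exactly zero, unlike clamp.
import torch

def to_joint_limits(x, lower, upper):
    return lower + (upper - lower) * torch.sigmoid(x)
```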

In summary, our learning system uses a compound loss function: a weighted sum of the generalized Q_1 term, the heuristic closeness term, the collision terms, and the data term, with the weights listed in table:param.

V Experimental Setup

Data Preparation: Following [10], we prepare a small dataset of 500 watertight objects by combining existing grasping datasets [7, 42, 24, 23]; we split it into a training set (400 objects) and a test set (100 objects). It is known that predicting a single grasp pose for a single target object is an ambiguous problem because many grasp poses are equally effective [26]. Therefore, we use [30] to precompute a set of grasp poses for each target object and then use the Chamfer data loss to let the neural network pick the most representative grasp pose. This gives a dataset of 40K grasps, from which our neural network will select its ground truth. For our 24-DOF gripper, collecting these data requires about 150 CPU hours of computation on a cluster using a sampling-based grasp planner [30]. Finally, we assume that the neural network observes objects from a set of multi-view depth cameras of fixed resolution. These images are obtained by rendering the triangle mesh of each target object into the depth channel. As a result, each sample in our dataset is a tuple of depth images, a triangle mesh, and ground truth grasp poses. After collecting our dataset, we augment it by rotating each target object and gripper 8 times about symmetric axes.

Gripper Setup: In all our simulated and real-world experiments, we use a (6+18)-DOF Shadow Hand as our gripper, as shown in fig:gripper, mounted on a UR10 arm. However, during the training phase, the DOFs of the arm are not predicted by our neural network; these DOFs are computed at runtime using a conventional motion planner. We use the SrArmCommander [12] to move the UR10 arm to the target poses and the SrHandCommander [12] to move the Shadow Hand fingers to the target joint states. During the training phase, we manually label 45 potential grasp points on the gripper and, to detect self-collisions, we use a denser sample of 15,555 potential contact points generated via Poisson disk sampling, as illustrated in fig:gripper.

Fig. 5: Left: The real Shadow Hand. Middle: Original meshes of the Shadow Hand. Right: Convex hulls of each part of the Shadow Hand meshes, the sampled potential grasp points (red), and the sampled potential contact points (green) via Poisson disk sampling.

Neural Network: We deploy ResNet-50 as a feature extractor for the multi-view depth images. Each depth image is duplicated to 3 channels to meet the input requirement of ResNet-50. A shared ResNet-50 takes the multi-view depth images as input and outputs a 2,048-dimensional feature vector per view. These vectors are combined by max-pooling and passed to a fully-connected layer whose output dimension equals the gripper's DOF (6+18 for the Shadow Hand). The outputs of the fully-connected layer are the predicted gripper configurations.
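The description above corresponds to a model along the following lines; layer names and any details beyond those stated in the text are our assumptions.

```python
# Sketch of the architecture in Fig. 2: a shared ResNet-50 per depth view,
# max-pooling across views, and one FC layer to the gripper configuration.
import torch
import torch.nn as nn
import torchvision

class GraspPoseNet(nn.Module):
    def __init__(self, dof=24):            # 6 + 18 for the Shadow Hand
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Identity()        # keep the 2048-d feature
        self.backbone = backbone
        self.head = nn.Linear(2048, dof)

    def forward(self, views):              # views: (B, V, 1, H, W) depth maps
        B, V = views.shape[:2]
        x = views.repeat(1, 1, 3, 1, 1)    # duplicate depth to 3 channels
        f = self.backbone(x.flatten(0, 1)) # (B*V, 2048)
        f = f.view(B, V, -1).max(dim=1).values   # view pooling (max)
        return self.head(f)                # predicted gripper configuration
```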

Training Configurations: We use the parameters listed in table:param in both settings. Our neural network is trained using the ADAM algorithm [25] with a batch size of 16. The initial learning rate is set to 1e-4 and decayed by 0.9 every 20 epochs. All experiments are carried out on a desktop with 2 Intel Xeon Silver 4208 CPUs, 32 GB RAM, and 2 NVIDIA RTX 2080 GPUs.
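In PyTorch terms, the stated schedule corresponds to something like the following (StepLR is our assumption for the decay rule):

```python
# Optimizer and learning-rate schedule: Adam at 1e-4, decayed by 0.9 every
# 20 epochs.
import torch

net = GraspPoseNet()                        # from the sketch above
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=20, gamma=0.9)
```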

Parameters m
Value 6.0 8.0 0.7 0.001 64 1.0 1.0 0.1 1.0
TABLE I: Parameter settings in our training configuration.

VI Experimental Results

In this section, we evaluate the performance of different settings for high-DOF grasp planning. Our method can be used either as a standalone grasp planner or as a method to train grasp predicting neural networks.

VI-A Grasp Planning Without Ground Truth

The differentiable grasp metric and collision loss allow our method to be used as a standalone, locally optimal grasp planner. To set up this experiment, we replace the neural network with a 24-DOF optimizable vector of gripper pose, set the data term's weight to zero, and minimize the remaining loss with respect to this vector. Compared with [30], our planner only provides locally optimal poses, at a comparable computational cost. An example is illustrated in fig:hand_mode_more, where we use a trivial initialization, shown as the transparent green poses; after a few minutes of optimization, our optimizer converges to the gray poses. However, without the guidance of data, our planner can fall into local minima without force closure (Q_1 = 0), as shown in fig:hand_mode.

Fig. 6: We optimize the gripper pose without ground truth data. The initial pose is shown in transparent green, where all the joint angles are set to a fixed initial value and we position the gripper right on top of the object. The final poses are shown in solid gray.
Fig. 7: Guided by the small dataset, we train our neural network by first pre-training on the dataset using the Chamfer loss and then fine-tuning using our method as additional loss functions. Some predicted grasp poses for unseen objects are shown. These grasp poses do not require post-processing and can be realized directly on the physical platform. In the last picture, we show the predicted grasp pose for the failure case in fig:hand_mode.

VI-B Learning Grasp Poses With Ground Truth

In our second benchmark, we use our method to guide the training of a grasp-pose-predicting neural network. The training is performed in two phases. First, we pre-train with our differentiable loss excluded, i.e. with its weights set to zero; this step brings the neural network close to nearly optimal values, and we run 35 epochs of learning in this first stage. Second, we fine-tune the network by adding our differentiable loss, using the weights summarized in table:param; we run 71 epochs of learning in this second stage. After training, a set of predicted grasp poses on the test set is shown in fig:post-process, which shows that the quality of grasps is drastically improved when guided by our differentiable loss. The pre-training takes 4 hours and the fine-tuning takes 36 hours. On average, each forward-backward propagation takes 0.85s with our additional loss functions and 0.61s without them, which shows that our additional loss functions impose only a marginal cost on gradient computation.

Fig. 8: A failure case where the gripper converges to a pose with no force closure (Q_1 = 0).

VI-C Comparison

We compare the (standard) Q_1 metric [15] of our method with that of the sampling-based grasp planner [30] in fig:NN_mode_more. The results show that the qualities of our grasp poses are on par with those of [30]. We have also compared with prior work [26], which also trains a grasp-pose-predicting neural network on a small dataset of a size similar to ours. However, the algorithm in [26] requires post-processing to resolve penetrations and collisions, whereas the grasp poses predicted using our method can be deployed directly on physical hardware without post-processing. As illustrated in fig:post-process and table:comp_liu, our method significantly improves the quality of grasp poses.


Fig. 9: Improvement of predicted grasp poses due to our differentiable loss: We compare the grasp pose predicted using our method (b) and the grasp pose predicted using [26] (a). Our method can drastically improve the quality of grasp and reduce penetrations or collisions, and the visual difference made by our differentiable loss is evident.
Method   Q1 Metric   Penetration   Success Plan   Success Grasp
Ours     0.23        3.5 mm        38             33
[26]     0.11        14.2 mm       35             27
TABLE II: On the YCB objects of the testing set, we compare the predicted grasp poses in terms of the Q1 metric, penetration depth, and the rates of success for the motion planner and for grasping on physical hardware.

VI-D Grasping With the Arm on Physical Hardware

As our final evaluation, we deploy our learned neural network on our physical platform. Our method does not require RGB input and only uses the depth channel; therefore, we do not perform any sim-to-real transfer. Our neural network only predicts the gripper pose and does not predict the configuration of the UR10 arm needed to achieve the predicted position and orientation; these arm configurations are computed using a motion planner at runtime. We choose 50 YCB objects from our 100 test objects. Our depth cameras are calibrated beforehand to make the camera poses exactly the same as those used for training.

To profile the rate of success on the 50 YCB objects, we use the two metrics summarized in table:comp_liu. First, we record how many times the motion planner can successfully move the gripper to the predicted position (Success Plan). This metric measures the ability of our method to avoid penetrations and collisions with the desk that objects are placed on, since a pose with penetrations or desk collisions cannot be achieved by a motion planner. Second, we record how many times the grasp planner can successfully lift the object (Success Grasp). This metric measures the ability of our method to improve grasp quality. Our method outperforms [26] in terms of both metrics: we observe an improvement in Success Plan (38 vs. 35) and in Success Grasp (33 vs. 27). Our neural network failed on the remaining objects mainly due to slippage.

VII Conclusion and Limitations

We present a differentiable grasp planner that enables a neural network to be trained with a small dataset and a simplified architecture. Our differentiable loss accounts for various requirements for a good grasp, including high grasp metric values and collision-free gripper poses. We use a generalized definition to allow inexact contact and we show that the sub-gradients of each loss term are well-defined and can be efficiently computed from target object shapes represented using watertight triangle meshes. We show that our method can be used both as a standalone grasp planner and as a neural network training algorithm. Finally, we show that the trained neural network performs robustly on unseen objects and hardware platforms.

Our current implementation suffers from several limitations. First, our method requires the target objects to be watertight and have non-zero volume. Although we do not require a signed distance field transformation, our method still computes a signed distance value, which is impossible when the target object is a thin shell. A related limitation is that our method suffers from tunneling: when the target object is very thin, a stochastic update of our neural network might move the hand from one side of the object to the other, leading to missed solutions. In the future, this problem can be resolved using continuous collision detection [6].

Second, our experimental setup and neural network architecture prevent the neural network from predicting multiple grasp poses for a single object. If there are other constraints in the workspace preventing a grasp pose from being achieved, then our method will fail. However, this problem can be resolved by using adversarial training similar to [14, 33], where a distribution of grasp poses is learned. We emphasize that more sophisticated learning algorithms are orthogonal to our approach and can be combined with it.

VIII Appendix

In this document, we provide details on computing the lower bound of Q_1 and its derivatives. First, we derive a slightly different formulation of the lower bound using quadratic frictional cones. Compared with the linearized frictional cones used in [13], quadratic frictional cones reduce the size of the induced semidefinite program.

1 Q1 Lower Bound Using Quadratic Frictional Cone

We re-derive the lower bound of Q_1 using SOS optimization as done in [13], but using quadratic frictional cones. For a set of points x_i, with normals n_i and two tangents t_i1, t_i2, the quadratic frictional cones are defined as:

    F_i = { f = f_n n_i + f_t1 t_i1 + f_t2 t_i2 : sqrt(f_t1^2 + f_t2^2) ≤ μ f_n }.

It is easy to find that the dual cones of F_i are defined as:

    F_i* = { f = f_n n_i + f_t1 t_i1 + f_t2 t_i2 : sqrt(f_t1^2 + f_t2^2) ≤ f_n / μ }.

The induced SOS problem is:

(7)

Note that eq:SOS will induce an SDP problem with exactly the same order (of polynomials) as the original SDP problem induced in [13], but with fewer cones and a smaller linear system when performing sensitivity analysis. Finally, we briefly prove the correctness of the dual cones.

Lemma VIII.1

The dual cone of F_i is F_i* as defined above.

Proof: If f* ∈ F_i*, then for any f ∈ F_i, we have:

    f^T f* = f_n f_n* + f_t1 f_t1* + f_t2 f_t2* ≥ f_n f_n* − ||f_t|| ||f_t*|| ≥ f_n f_n* − (μ f_n)(f_n*/μ) = 0.

For the other direction, if there is an f* ∉ F_i*, i.e. sqrt(f_t1*^2 + f_t2*^2) > f_n*/μ, then we can pick a point f in F_i whose tangential part is anti-parallel to that of f*, with f_n = 1 and ||f_t|| = μ, such that:

    f^T f* = f_n* − μ sqrt(f_t1*^2 + f_t2*^2) < 0.
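As a quick numerical sanity check of the lemma (our addition), one can verify that random samples from the two cones always have non-negative inner products:

```python
# Every pair drawn from K_mu = {(ft, fn): ||ft|| <= mu*fn} and K_{1/mu} has
# a non-negative inner product, i.e. K_{1/mu} lies in the dual cone of K_mu.
import numpy as np

rng = np.random.default_rng(0)

def sample_cone(mu, k=1000):
    """Random points (ft1, ft2, fn) with ||ft|| <= mu * fn."""
    fn = rng.random(k)
    ft = rng.normal(size=(k, 2))
    ft *= (rng.random(k) * mu * fn / np.linalg.norm(ft, axis=1))[:, None]
    return np.hstack([ft, fn[:, None]])

K, K_dual = sample_cone(0.7), sample_cone(1.0 / 0.7)
assert (K @ K_dual.T).min() > -1e-9   # <x, y> >= 0 for all pairs
```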

2 SDP Sensitivity Analysis for the Lower Bound of Q1

In this section, we present an efficient way to perform sensitivity analysis for SOS problems. We use the same notation as [31]. A standard SDP takes the form:

    min_x c^T x  s.t.  A_j(x) ⪰ 0,  j = 1, …, J,

where there are J PSD cones in our problem (their number depends on whether linearized or quadratic frictional cones are used). The dual variable of the j-th cone is S_j, with S_j ⪰ 0, and we define the coefficient matrix of the problem accordingly.

In an SOS problem, the first PSD cone and the other cones are of two different types: the first PSD cone specifies the conditional polynomial positivity condition, while the other cones specify the positivity of the Lagrangian multipliers. We also observe that some variables only affect the first PSD cone, while other variables affect the remaining PSD cones. Therefore, we can write the coefficient matrix in block form, where it is trivial to verify that we can choose variables to make its bottom-right block an identity matrix. When the SDP is solved using a primal-dual interior point method, the primal and dual solutions are computed simultaneously, with the dual variables decomposed over the cones in the same way.

Next, we apply the optimality condition of the SDP, which involves the symmetric Kronecker product operator, and perform sensitivity analysis with respect to an arbitrary parameter by differentiating this condition. In the following derivation, we assume that the primal and dual solutions satisfy strict complementarity; note that if strict complementarity is not satisfied, then the SDP problem is not differentiable. Prior work [31] showed that the primal and dual matrices admit simultaneous diagonalization, and so do their differentials. By plugging these identities into the sensitivity equation, we get a system of 8 equations. For the remaining rows, we have:

(8)

By plugging eq:reduction into the 1st row, we obtain the desired derivatives of the lower bound.