GQ-STN: Optimizing One-Shot Grasp Detection based on Robustness Classifier

03/06/2019 ∙ by Alexandre Gariépy, et al. ∙ 22

Grasping is a fundamental robotic task needed for the deployment of household robots or furthering warehouse automation. However, few approaches are able to perform grasp detection in real time (frame rate). To this effect, we present Grasp Quality Spatial Transformer Network (GQ-STN), a one-shot grasp detection network. Being based on the Spatial Transformer Network (STN), it produces not only a grasp configuration, but also directly outputs a depth image centered at this configuration. By connecting our architecture to an externally-trained grasp robustness evaluation network, we can train efficiently to satisfy a robustness metric via the backpropagation of the gradient emanating from the evaluation network. This removes the difficulty of training detection networks on sparsely annotated databases, a common issue in grasping. We further propose to use this robustness classifier to compare approaches, being more reliable than the traditional rectangle metric. Our GQ-STN is able to detect robust grasps on the depth images of the Dex-Net 2.0 dataset with 92.4% accuracy in a single pass of the network. We finally demonstrate in a physical benchmark that our method can propose robust grasps more often than previous sampling-based methods, while being more than 60 times faster.




I Introduction

Grasping, corresponding to the task of grabbing an object initially resting on a surface with a robotic gripper, is one of the most fundamental problems in robotics. Its importance is due to the pervasiveness of operations required to seize objects in an environment, in order to accomplish a meaningful task. For instance, manufacturing systems often perform pick-and-place, but rely on techniques such as template matching to locate pre-defined grasping points [mercier18_learn_objec_local_pose_estim]. In a more open context such as household assistance, where objects vary in shape and appearance, we are still far from a completely satisfying solution. Indeed, in an automated warehouse, it is often one of the few tasks still performed by humans [Correll2016].

To perform autonomous grasping, the first step is to take a sensory input, such as an image, and produce a grasp configuration. The arrival of active 3D cameras, such as the Microsoft Kinect, enriched the sensing capabilities of robotic systems. One could then use analytical methods [Bohg2013] to identify grasp locations, but these often assume that we already have a model. They also tend to perform poorly in the face of sensing noise. Instead, recent methods have explored data-driven approaches. Although sparse coding has been used [Trottier], the vast majority of new data-driven grasping approaches employ machine learning, more specifically deep learning [zhou18_fully_convol_grasp_detec_networ, park18_real_time_highl_accur_robot, park18_class_based_grasp_detec_using, chu2018, Chen2017, trottier_resnet]. A major drawback to this is that deep learning approaches require a significant amount of training data. Currently, grasping training databases based on real data are scant, and generally tailored to specific robotic hardware [Pinto2016, Levine2016]. Given this issue, others have explored the use of simulated data [Mahler2017, bousmalis17:_using_simul_domain_adapt_improv].

Similarly to computer vision, data-driven approaches in grasping can be categorized into classification and detection methods. In classification, a network is trained to predict if the sensory input (a cropped and rotated part of the image) corresponds to a successful grasp location. For the detection case, the network directly outputs the best grasp configuration for the whole input image. One issue with classification-based approaches is that they require a search on the input image, in order to find the best grasping location. This search can be exhaustive, and thus suffers from the curse of dimensionality [Lenz2015]. To speed up the search, one might use informed proposals [Mahler2017, park18_class_based_grasp_detec_using], in order to focus on the most promising parts of the input image. Even so, the approach remains relatively slow, depending on the number of proposals to evaluate.

Fig. 1: Overview of our method. Classical one-shot methods for grasping supervise a prediction (in red) using geometric supervision from randomly selected ground truth (in green). We instead suggest to use robustness supervision (in cyan) to learn fine-grained adjustments, without requiring a ground truth annotation at this specific grasp location.

While heavily inspired by computer vision techniques, training a network for detection in a grasping context is significantly trickier. As opposed to classic vision problems, for which detection targets are well-defined instances of objects in a scene, grasping configurations are continuous. This means that there exists a potentially infinite number of successful grasping configurations. Thus, one cannot exhaustively generate all possible valid grasps in an input image. Another issue is that grasping databases do not provide the absolute best grasping configuration for a given image of an object, but rather a (limited) number of valid grasping configurations.

In this paper, we propose a one-shot grasping detection architecture for parallel grippers, based on deep learning. Importantly, our detection approach on depth images can be trained from sparse grasping annotations meant to train a classifier. As such, it does not require the best grasping location to be part of the training dataset. To achieve this, we leverage a pre-existing grasp robustness classifier, called Grasp Quality CNN (GQ-CNN) [Mahler2017]. This is made possible by the fact that our network architecture directly outputs an image corresponding to a grasp proposal, allowing it to be fed directly to an image-based grasp robustness classifier. Our architecture makes extensive use of the STN [jaderberg15_spatial_trans_networ], which can learn to perform geometric transformations in an end-to-end manner. Because our network is based on STNs, the gradient generated by the GQ-CNN robustness classifier will propagate throughout our architecture. Our network is thus able to climb the robustness gradient, as opposed to simply regressing towards grasp configurations, which are limited in the training database. In some sense, our network is able to learn from the implicit knowledge of the quality of a grasp, knowledge that was captured by GQ-CNN.

In short, our contributions are the following:

  1. Describing one of the first techniques to train a one-shot detection network on the detection version of the Dex-Net 2.0 dataset. Our network is based on an attention mechanism, the STN, to perform one-shot grasping detection, resulting in our Grasp Quality Spatial Transformer Network (GQ-STN) architecture;

  2. Using the Grasp Quality CNN (GQ-CNN) as a supervisor to train this one-shot detection network, thus enabling it to learn from a limited number of grasp annotations and to achieve a high robustness classification score; and

  3. Showing that our method generalizes well to real-world conditions in a physical benchmark, where our GQ-STN proposes a high rate of robust grasps.

II Related Work

Over the years, many network architectures have been proposed to solve the grasping problem. In this section, we present them grouped by themes, either based on their overall method of operation or on the type of generated output.

II-A Proposal + Classification Approaches

Drawing inspiration from previous data-driven methods [Bohg2013], some approaches work in a two-stage manner, first by proposing grasp candidates then by choosing the best one via a classification score. Note that this section does not include architectures employing a Region Proposal Network (RPN), as these are applied on a fixed-grid pattern, and can be trained end-to-end. They are discussed later.

Early work in applying deep learning on the grasping problem employed such a classification approach. For instance, Lenz2015 employed a cascaded approach of two fully-connected neural networks. The first one was designed to be small and fast to evaluate and perform the exhaustive search. The second and larger network then evaluated the best 100 proposals of the previous network. This architecture achieved 93.7% accuracy on the Cornell Grasping Dataset (CGD).

Pinto2016 reduced the search space of grasp proposals by only sampling grasp locations and cropping a patch of the image around this location. To find the grasp angle, the authors proposed to have 18 outputs, separating the angle prediction into 18 discrete angles by 10° increments.

The EnsembleNet [asifensemblenet] worked in a radically different manner. It trained four distinct networks to propose different grasp representations (regression grasp, joint regression-classification grasp, segmentation grasp, and heuristic grasp). Each of these proposals was then ranked by the SelectNet, a grasp robustness predictor trained on grasp rectangles.

To alleviate the issue of small training datasets labelled manually, Mahler2017 relied entirely on a simulator setup to generate a large database of grasp examples called Dex-Net 2.0 (see section III-B). Each grasp example was rated using a rule-based grasp robustness metric named Robust Ferrari Canny. By thresholding this metric, they trained a deep neural network, dubbed Grasp-Quality CNN (GQ-CNN), to predict grasp success or failure. The GQ-CNN takes as input a depth image centered on the grasp point, taken from a top view to reduce the dimensionality of the grasp prediction. For grasp detection in an image, they used an antipodal sampling strategy. This way, 1000 antipodal points on the object surface were proposed and ranked with GQ-CNN. Even though their system is mostly trained using synthetic data, it performed well in a real-world setting. For example, it achieves a 93% success rate on objects seen during the training time and 80% success rate on novel objects on a physical benchmark.
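The sample-then-rank scheme described above can be sketched in a few lines. This is only an illustration of the control flow: the `(x, y, theta)` grasp encoding, the uniform sampler standing in for antipodal sampling, and `toy_score` standing in for a learned robustness classifier are all hypothetical placeholders, not the Dex-Net 2.0 implementation.

```python
import math
import random
from typing import Callable, List, Tuple

# Hypothetical minimal grasp encoding: (x, y, theta) in normalized image units.
Grasp = Tuple[float, float, float]


def sample_proposals(n: int, seed: int = 0) -> List[Grasp]:
    """Draw n random grasp candidates (a stand-in for antipodal sampling)."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random(), rng.uniform(0.0, math.pi)) for _ in range(n)]


def best_grasp(proposals: List[Grasp], score: Callable[[Grasp], float]) -> Grasp:
    """Score every candidate with the classifier and keep the top-scoring one."""
    return max(proposals, key=score)


def toy_score(g: Grasp) -> float:
    """Placeholder for a learned classifier: prefers grasps near the image center."""
    x, y, _ = g
    return -((x - 0.5) ** 2 + (y - 0.5) ** 2)


proposals = sample_proposals(1000)
best = best_grasp(proposals, toy_score)
```

The classifier is called once per candidate, so inference cost grows linearly with the number of proposals, which is the cost that one-shot detection avoids.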

park18_class_based_grasp_detec_using decomposed the search for grasps into different steps, using STNs. The first STN acted as a proposal mechanism, by selecting 4 crops as candidate grasp locations in the image. Then, each of these 4 crops was fed into a single network, comprising a cascade of two STNs: one estimated the grasp angle and the last STN chose the image’s scaling factor and crop. The latter crop can be seen as a fine adjustment of the grasping location. The four final images were then independently fed to a classifier, to find the best one. Each component, namely the STNs and the classifier, was trained separately on CGD using ground truth data, and all components were then fine-tuned together. This is a major distinction from other Proposal + Classification approaches, as the others cannot jointly train the proposal and classification sub-systems.

II-B Single-shot Approaches

II-B1 Regression Approaches

To eliminate the need to perform the exhaustive search of grasp configurations, Redmon2015Grasp proposed the first one-shot detection approach. To this effect, the authors proposed different CNN architectures, in which they always used AlexNet [Krizhevsky2012] pretrained on ImageNet as the feature extractor. To exploit depth, they fed the depth channel from the RGB-D images into the blue color channel, and fine-tuned. The first architecture, named Direct Regression, directly regressed from the input image the best grasp rectangle represented by the tuple $(x, y, \theta, h, w)$. The second architecture, Regression + Classification, added object class prediction to test its regularization effect. Kumra2016 further developed this one-shot detection approach by employing the more powerful ResNet-50 architecture [He2015]. They also explored a different strategy to integrate the depth modality, while seeking to preserve the benefits of ImageNet pre-training. As a solution, they introduced the multi-modal grasp architecture which separated RGB processing and depth processing in two different ResNet-50 networks, both pre-trained on ImageNet. Their architecture then performed late fusion, before the fully connected layers performed direct grasp regression.

II-B2 Multibox Approaches

Redmon2015Grasp also proposed a third architecture, MultiGrasp, separating the image into a regular grid (dubbed Multi-box). At each grid cell, the network predicted the best grasping rectangle, as well as the probability of this grasp being positive. The grasp rectangle with the highest probability was then chosen. trottier_resnet improved results by employing a custom ResNet architecture for feature extraction. Another advantage was the reduced need for pre-training on ImageNet. chen2019convolutional remarked that grasp annotations in grasping datasets are not exhaustive. Consequently, they developed a method to transform a series of discrete grasp rectangles to a continuous grasp path. Instead of matching a prediction to the closest ground truth to compute the loss function, they mapped the prediction to the closest grasp path. This means that a prediction that falls directly between two annotated ground truths can still have a low loss value, thus (partially) circumventing the limitations of the Intersection-over-Union (IoU) metric when used with sparse annotation, as long as the training dataset is sufficiently densely labeled (see Figure 2). The authors re-used the MultiGrasp architecture from Redmon2015Grasp for their experimentation.

II-B3 Anchor-box Approaches

zhou18_fully_convol_grasp_detec_networ introduced the notion of oriented anchor-box, inspired by YOLO9000 [Redmon2016]. This approach is similar to MultiGrasp (as the family of YOLO object detectors is a direct descendant of MultiGrasp [Redmon2015Grasp]) with the key difference of predicting offsets to predefined anchor boxes for each grid cell, instead of directly predicting the best grasp at each cell. A similar anchor-box approach is presented in park18_real_time_highl_accur_robot. chu2018 extends MultiGrasp to multiple object grasp detection by using region-of-interest pooling layers [Ren2015]. Similar work is presented in zhang18_roi_based_robot_grasp_detec.

II-B4 Discrete Approaches

johns16_deep_learn_grasp_funct_grasp proposed to discretize the grasp space with a granularity of 1 cm in position and a fixed increment in angle. In a single pass of the network, the model predicts a score at each grid location. Their method can explicitly account for gripper pose uncertainty. If a grasp configuration has a high score, but the neighboring configurations on the grid have a low score, it is probable that a gripper that has a Gaussian error on its position will fail to grasp at this location. The authors explicitly handled this problem by smoothing the 3D grid (two spatial axes, one rotation axis) by a Gaussian kernel corresponding to the gripper error.
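The effect of smoothing a score grid with a Gaussian kernel can be illustrated with a small pure-Python sketch. The grid size, the kernel width, the zero padding on the spatial axes and the wrap-around on the angle axis are all assumptions chosen for the example, not the settings of the cited work.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights over offsets -radius..radius."""
    w = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(w)
    return [v / total for v in w]


def smooth_1d(values, kernel, wrap=False):
    """Convolve a list with the kernel; zero-padded, or circular if wrap is True."""
    n, r = len(values), len(kernel) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - r
            if wrap:
                acc += w * values[j % n]
            elif 0 <= j < n:
                acc += w * values[j]
        out.append(acc)
    return out


def smooth_grid(grid, k_xy, k_angle):
    """Separable smoothing of grid[x][y][angle]; the input grid is not modified."""
    nx, ny, na = len(grid), len(grid[0]), len(grid[0][0])
    g = [[smooth_1d(grid[x][y], k_angle, wrap=True) for y in range(ny)] for x in range(nx)]
    for x in range(nx):
        for a in range(na):
            col = smooth_1d([g[x][y][a] for y in range(ny)], k_xy)
            for y in range(ny):
                g[x][y][a] = col[y]
    for y in range(ny):
        for a in range(na):
            col = smooth_1d([g[x][y][a] for x in range(nx)], k_xy)
            for x in range(nx):
                g[x][y][a] = col[x]
    return g


# An isolated high score at (1, 1, 0) versus a broad plateau of slightly lower scores.
nx, ny, na = 7, 7, 4
grid = [[[0.0] * na for _ in range(ny)] for _ in range(nx)]
grid[1][1][0] = 1.0
for x in range(3, 6):
    for y in range(3, 6):
        for a in range(na):
            grid[x][y][a] = 0.8

kernel = gaussian_kernel(sigma=1.0, radius=1)
smoothed = smooth_grid(grid, kernel, kernel)
```

Before smoothing, the isolated cell has the best score. After smoothing, the center of the plateau wins, because a grasp there remains good under small pose errors.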

Satish2019 introduced a fully-convolutional successor to GQ-CNN. It extends GQ-CNN to a $K$-class classification where the $k$-th output is the probability of a good grasp at the discrete angle $\theta_k$, similar to [Pinto2016]. They train their network for this classification task. They then transform the fully-connected layer into a convolutional layer, enabling classification at each location of the feature map. This effectively evaluates each discrete location for graspability.

Fig. 2: (Left) Training example from the Dex-Net 2.0 detection dataset. Notice how there are very few annotations, thus not covering all of the possible grasp positions on the entire object. (Middle) Training example from the Cornell Grasping Dataset (CGD). These manually-labeled grasp annotations tend to cover a larger fraction of the object, but for a much more limited number of examples. Figure from [Redmon2015Grasp].
(Right) Grasp path proposed by chen2019convolutional to augment the grasp rectangle representation on the CGD. A grasp prediction (green) is projected to a grasp path that lies between two ground-truth annotations. This allows for better evaluation of detection approaches. Figure from [chen2019convolutional].

III Problem Description

III-A One-shot Grasp Detection

Given the depth image of an object on a flat surface, we want to find a grasp configuration that maximizes the probability of lifting the object with a parallel-plate gripper. We aimed at performing this detection in a one-shot manner, i.e. with a single pass of the depth image through our network. As prediction output, we used the 5D grasp representation $g = \{x, y, z, \theta, w\}$, where $(x, y, z)$ captures the 3D coordinates of the grasp, $\theta$ the angle of the gripper and $w$ its opening. This representation considers grasps taken from above the object, perpendicular to the table’s surface, as in [Mahler2017, Redmon2015Grasp]. As our network is trained using both the dataset and the grasp robustness classifier GQ-CNN of Dex-Net 2.0 [Mahler2017], we detail them below.

Fig. 3: Our complete one-shot STN-based architecture. The three STNs learn respectively translation to the grasp’s center, rotation to the grasp’s angle and scaling to the grasp’s opening. The intermediary outputs of the STNs are fully observable and are used to determine the grasp location. The last STN feeds into GQ-CNN, which predicts a grasp robustness label. A detailed view of a STN block is depicted in Fig. 4.

III-B Dex-Net 2.0 Dataset

Dex-Net 2.0 is a large-scale simulated dataset for parallel-gripper grasping. It contains 6.7 million grasps on pre-rendered depth images of 3D models. These 3D models come from two different sources. 1,371 models come from 3DNet [wohlkinger20123dnet], a synthetic model dataset built for classification and pose estimation. The other 129 additional models are laser scans from KIT [kasper2012kit]. All of the 3D models were resized to fit within a 5 cm parallel gripper.

The grasp labels in the Dex-Net 2.0 dataset were acquired via random sampling of antipodal grasp candidates. A heuristic-based approach developed in previous work (Dex-Net 1.0 [Mahler2016]) was used to compute a robustness metric. This metric was thresholded to determine the grasp robustness label, i.e. robust vs. non-robust.

Learning one-shot grasp detection on the Dex-Net 2.0 dataset is in itself a challenging task, because of the few positive annotations per image. Annotations are very sparse compared to the Cornell Grasping Dataset (CGD), a standard dataset used in one-shot grasp detection. For instance, it can be seen from Figure 2 that the ground truth annotation of Dex-Net 2.0 is clearly sparser than that of CGD. This prevents grasp annotation augmentation methods such as grasp path [chen2019convolutional] from being employed on the former.

There are two available versions of the Dex-Net 2.0 dataset. The first version is a classification dataset. It was used by Mahler2017 to train GQ-CNN. It contains depth images of grasp candidates with associated grasp robustness metrics, which are thresholded to obtain robustness labels. The authors also released a detection version of the dataset. This version contains the centered depth images of the object, at full resolution.

Please note that in this work, we used the original Dex-Net 2.0 annotations. Recently published work [Satish2019] developed a sampling method for generating additional annotations for the Dex-Net 2.0 images. Our approach could potentially benefit from more detection annotations on images contained in the Dex-Net 2.0 dataset. Still, for a given object, there is an infinity of possible grasp configurations which cannot all be annotated. Instead of improving learning at the annotation level, our approach, described in the following section, explicitly handles this inherent constraint.

IV GQ-STN Network Architecture

In this paper, we propose Grasp Quality Spatial Transformer Network (GQ-STN), a neural network architecture for one-shot grasp detection based on the Spatial Transformer Network (STN). This architecture enables us to train directly on a robustness label outputted by GQ-CNN, unlike previous one-shot grasp detection methods that enforce robustness implicitly through geometric regression on annotated locations.

IV-A Spatial Transformer Network

The main component in our single-shot detection architecture is the Spatial Transformer Network (STN) [jaderberg15_spatial_trans_networ], depicted in Figure 3. In some sense, it acts as an attention mechanism, by narrowing/reorienting objects in a more canonical representation for the task at hand. It is a drop-in block that can be inserted between two feature maps of a Convolutional Neural Network (CNN) to learn a spatial transformation of the input feature map. The STN consists of three parts: a localization network, a grid generator and a sampler. The localization network learns a $2 \times 3$ affine transformation matrix $T$ based on the input feature map. The grid generator and the sampler transform the input feature map by the geometric transformation specified by $T$. They do so in a fully differentiable manner, in a process similar to texture mapping. The STN can thus stretch, rotate, or skew the input feature map, resulting in a new feature map as output. A pure rotation transformation is illustrated at the top of Figure 4.
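The grid generator and sampler can be sketched as follows, in plain Python and without gradients. The sketch assumes the usual STN convention in which the 2×3 matrix maps normalized output coordinates in [-1, 1] to normalized input coordinates, followed by bilinear interpolation with zeros outside the image. The function names are hypothetical, and a real implementation would use a differentiable tensor library.

```python
import math


def bilinear(img, px, py):
    """Bilinearly interpolate img at pixel coordinates (px, py); zero outside."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(px), math.floor(py)
    wx, wy = px - x0, py - y0

    def at(x, y):
        return img[y][x] if 0 <= x < w and 0 <= y < h else 0.0

    return ((1 - wx) * (1 - wy) * at(x0, y0) + wx * (1 - wy) * at(x0 + 1, y0)
            + (1 - wx) * wy * at(x0, y0 + 1) + wx * wy * at(x0 + 1, y0 + 1))


def affine_sample(img, T, out_h, out_w):
    """Resample img through the 2x3 affine matrix T (out_h and out_w must be >= 2)."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        yt = 2.0 * i / (out_h - 1) - 1.0          # normalized target row in [-1, 1]
        row = []
        for j in range(out_w):
            xt = 2.0 * j / (out_w - 1) - 1.0      # normalized target column
            xs = T[0][0] * xt + T[0][1] * yt + T[0][2]
            ys = T[1][0] * xt + T[1][1] * yt + T[1][2]
            px = (xs + 1.0) * (in_w - 1) / 2.0    # back to pixel coordinates
            py = (ys + 1.0) * (in_h - 1) / 2.0
            row.append(bilinear(img, px, py))
        out.append(row)
    return out


img = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
shift_x = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]   # sample one pixel to the right on a 3x3 image
same = affine_sample(img, identity, 3, 3)
shifted = affine_sample(img, shift_x, 3, 3)
```

Because every output value is a weighted sum of input pixels whose weights depend smoothly on the matrix entries, gradients can flow back to the localization network that produced the matrix.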


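An STN can be constrained to only represent specific geometric transformations, instead of freely learning the six elements of the transformation matrix. In our approach, we will employ three different STNs, one for each basic transformation:

$$T_t = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \end{bmatrix}, \quad T_\theta = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \end{bmatrix}, \quad T_s = \begin{bmatrix} s & 0 & 0 \\ 0 & s & 0 \end{bmatrix}$$

Here, $T_t$ represents a relative translation by a factor of $(t_x, t_y)$, $T_\theta$ a rotation by an angle $\theta$ and $T_s$ an isotropic scaling by a factor of $s$.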
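A minimal sketch of the three constrained transformations, written as 2×3 affine matrices, is given below. The function names and the composition order in `crop_to_input` are assumptions: the sketch follows the STN convention where each matrix maps output coordinates to input coordinates, so a point of the final crop is mapped back through scaling, then rotation, then translation.

```python
import math


def translation(tx, ty):
    """2x3 affine matrix for a pure translation."""
    return [[1.0, 0.0, tx], [0.0, 1.0, ty]]


def rotation(theta):
    """2x3 affine matrix for a pure rotation by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0]]


def scaling(s):
    """2x3 affine matrix for an isotropic scaling."""
    return [[s, 0.0, 0.0], [0.0, s, 0.0]]


def apply(T, point):
    """Apply a 2x3 affine matrix to a 2D point."""
    x, y = point
    return (T[0][0] * x + T[0][1] * y + T[0][2],
            T[1][0] * x + T[1][1] * y + T[1][2])


def crop_to_input(point, tx, ty, theta, s):
    """Map a point of the final crop back to normalized input-image coordinates."""
    p = apply(scaling(s), point)
    p = apply(rotation(theta), p)
    return apply(translation(tx, ty), p)


center = crop_to_input((0.0, 0.0), 0.2, -0.1, 0.7, 0.5)
edge = crop_to_input((1.0, 0.0), 0.2, -0.1, math.pi / 2, 0.5)
```

Under this convention the center of the final crop always maps to the predicted translation, which is why the intermediate outputs can be read directly as a grasp location, angle and opening.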

IV-B Full architecture

Instead of predicting all transformations in a single network, we used a cascade of three STN blocks, STN$_t$, STN$_\theta$ and STN$_s$, which are respectively constrained to a translation, a rotation and an isotropic scaling. In other words, STN$_t$ learns the translation to the grasp center, STN$_\theta$ learns the rotation of the gripper and STN$_s$ learns a scaling representing the opening of the gripper. A motivation behind this architecture is to isolate the regression of the angle $\theta$, which is a challenging task for a one-shot network according to [park18_rotat_ensem_modul_detec_rotat_invar_featur]. All STNs were applied directly to the 1-channel depth map; contrary to Kumra2016, we found no benefit in using a 3-channel version pre-trained on ImageNet for the STNs. All STNs also output a depth image, meaning that the communication between blocks of the network is not conducted via high-level feature maps, but via fully-observable depth images.

Fig. 4: A Spatial Transformer Network (STN) block performing a rotation of $\theta$ on an input depth image, aligning the image to the grasp’s axis. A ResNet-34 localization network predicts the transformation matrix. This is the second of the three STNs shown in Fig. 3.

We used ResNet-34 as localization networks in all three STNs, as in [park18_class_based_grasp_detec_using]. This yielded slightly better results than the smaller ResNet-18 while maintaining a reasonable training time. Drawing from [Redmon2015Grasp] and [Redmon2016], the output layers of the ResNet-34 computed the elements of the transformation matrices as follows:

Each localization network (of STN$_t$, STN$_\theta$ and STN$_s$, respectively) produces a tuple of raw outputs. To break the two-fold rotational symmetry of the angle prediction, we predict $(\sin 2\theta, \cos 2\theta)$, which are respectively the sine and cosine of twice the angle $\theta$, as in [Redmon2015Grasp]. $\bar{s}$ is the mean scaling factor in the training set. In conjunction with the scaling $s$, the last STN’s localization network also predicts the normalized gripper’s height $z$.
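The double-angle encoding can be illustrated with a short sketch. A parallel gripper rotated by 180° produces the same grasp, so the angle and the angle plus π should map to the same regression target; encoding twice the angle achieves this. The helper names are hypothetical.

```python
import math


def encode_angle(theta):
    """Encode a gripper angle as (sin 2*theta, cos 2*theta)."""
    return (math.sin(2.0 * theta), math.cos(2.0 * theta))


def decode_angle(sin2, cos2):
    """Recover an angle in (-pi/2, pi/2] from the double-angle encoding."""
    return 0.5 * math.atan2(sin2, cos2)


theta = 0.3
same_target = encode_angle(theta + math.pi)   # grasp rotated by half a turn
recovered = decode_angle(*encode_angle(theta))
recovered_flipped = decode_angle(*same_target)
```

Regressing the sine and cosine rather than the raw angle also avoids the discontinuity at the ends of the angular range, since nearby grasps always have nearby targets.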

The input of the complete network, illustrated in Figure 3, is a depth image. The translation and rotation STNs both generate a depth image of the same size as the input, while the scaling STN generates a depth image at a resolution of $32 \times 32$, the input size of GQ-CNN. The scaling STN is followed by GQ-CNN. The latter predicts a grasp robustness label given the image output by the scaling STN. We use pre-trained weights made available by Mahler2017 for GQ-CNN. These weights are frozen throughout training. At evaluation time, GQ-CNN is not required for grasp detection. However, because evaluating a single grasp on GQ-CNN is low-cost, we keep GQ-CNN to avoid a GPU memory transfer cost later if we need a robustness label associated with a detection.

IV-C Training

At each step of training, we randomly select a ground truth positive grasp example from the Dex-Net 2.0 dataset, thus obtaining target values for the translation, the rotation angle and the scaling. We train the network using two types of supervision:

  • Localization loss $L_{loc}$: the loss on the predictions of the localization networks of the STNs using these target values;

  • Robustness loss $L_{rob}$: the cross-entropy loss on the output of GQ-CNN, where the expected value is a positive grasp label.

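The total loss, governed by a loss mixing parameter $\lambda \in [0, 1]$, is given by:

$$L = (1 - \lambda)\, L_{loc} + \lambda\, L_{rob}$$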

Note that every block in the architecture is fully differentiable, thus allowing us to leverage information from the error on the grasp robustness label, by back-propagating from the grasp robustness label all the way back to the first STN. A significant advantage of using a CNN for robustness classification is that, being fully differentiable, it can be used for end-to-end training of a neural network in a straightforward manner.

The training regimen begins with $\lambda = 0$ (localization loss only) and we gradually slide the loss mixing parameter toward $\lambda = 1$ (robustness loss only). This way, we bootstrap the learning of our architecture with ground-truth grasp positions. These provide strong cues to the STNs, via the localization loss. As we reach $\lambda = 1$, the network training then focuses on directly improving the grasp quality metric, irrespective of grasp positions. Importantly, this allows our one-shot detection network to learn from sparsely labeled ground truth, by eventually strictly focusing on a grasp robustness metric provided by GQ-CNN. The bootstrapping induced by the localization loss was necessary for the network training to converge, enabling a proper focus on the object. It can be seen in Figure 4 that transformations on the depth image introduce artifacts on the edges. If one were to start training with the robustness loss alone, the network would enter a degenerate state where edge artifacts are mistaken for object edges.
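The mixing of the two supervision signals can be sketched as a convex combination. The symbol `lam`, the convention that `lam = 0` means localization loss only, and the linear ramp in `linear_schedule` are assumptions made for illustration; the actual schedule is staged by epochs, as listed in the experiments section.

```python
import math


def robustness_loss(p_robust):
    """Cross-entropy against a positive (robust) label, given the predicted probability."""
    return -math.log(p_robust)


def total_loss(loc_loss, rob_loss, lam):
    """Convex combination: lam = 0 is localization only, lam = 1 is robustness only."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * loc_loss + lam * rob_loss


def linear_schedule(epoch, start_epoch, end_epoch):
    """Hypothetical ramp: 0 up to start_epoch, 1 from end_epoch, linear in between."""
    if epoch <= start_epoch:
        return 0.0
    if epoch >= end_epoch:
        return 1.0
    return (epoch - start_epoch) / (end_epoch - start_epoch)


# Early training relies on geometric supervision, late training on the classifier.
early = total_loss(loc_loss=2.0, rob_loss=0.5, lam=linear_schedule(5, 10, 30))
late = total_loss(loc_loss=2.0, rob_loss=0.5, lam=linear_schedule(40, 10, 30))
```

Since the robustness classifier's weights are frozen, the robustness term only updates the localization networks, pushing the predicted crop toward regions the classifier considers robust.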

During the early stages of bootstrapping, when the localization loss dominates, training tends to be quite unstable. There is an accumulation of error where, for instance, the scaling STN cannot provide a good prediction because of errors made by the translation and rotation STNs, resulting in a high loss. We solved this issue by using a teacher forcing approach [Goodfellow-et-al-2016] where the STNs are trained in a disjoint manner. Instead of using the translation and rotation predicted by the first and second localization networks respectively, we directly transform the images using the ground truth translation and rotation. Teacher forcing allows the three STNs to be trained simultaneously, instead of training them in sequence as proposed in [park18_class_based_grasp_detec_using], resulting in a shorter training time. Teacher forcing is disabled after this early stage, allowing a joint training of all parameters.

V Experiments and Evaluation

We compared our architecture against two baselines: the single-shot MultiGrasp architecture [Redmon2015Grasp] and the approach based on Proposal+Classification from Dex-Net 2.0 [Mahler2017] that we will refer to as Prop+GQ-CNN. For MultiGrasp, we replaced the AlexNet feature extractor by a ResNet feature extractor, as seen in Kumra2016. We trained both our GQ-STN model and MultiGrasp on a training split of the Dex-Net 2.0 dataset and held out the remainder as a test set. For the Dex-Net 2.0 approach, we used the pre-trained model made available by the authors.

Fig. 5: (Left) Physical setup used for evaluation. It contains a UR5 arm, a Robotiq 85 gripper and a Microsoft Kinect sensor. (Right) Set of 12 household and office objects used in tests.

We implemented both GQ-STN and MultiGrasp using the TensorFlow library. We trained both models for 40 epochs with the Adam optimizer. For GQ-STN, we had the following scheduling for the loss mixing parameter and the learning rate:

  • epochs at ;

  • epochs at ;

  • epochs at ;

  • epochs at ; and

  • Fine-tuning stage of epochs at using early stopping.

Teacher forcing was turned on for only the first 12 epochs. For MultiGrasp, we had the same schedule, though it converged faster and the last fine-tuning step did not improve results. We kept the MultiGrasp model that had the highest rectangle metric score (see V-B) in validation. Both models used the same regularization factor.

We compared the quality of predictions of MultiGrasp and our GQ-STN network using the robustness classification metric (sec. V-A). We also evaluated both MultiGrasp and GQ-STN networks according to the rectangle metric (sec. V-B), to pinpoint its weaknesses. Finally, we conducted real-world grasping experiments (sec. V-D) where we evaluated all three approaches (MultiGrasp, our GQ-STN, and Prop+GQ-CNN). All experiments and training were conducted on a desktop computer with a 4 GHz Intel i7-6700k and an NVIDIA Titan X GPU.

V-A Robustness Classification via GQ-CNN

asifensemblenet used SelectNet, a CNN trained for grasp evaluation. However, SelectNet was trained based on a metric similar to Jaccard, which is problematic (see Sec. V-B) and would thus provide a poor evaluation. In our situation, we preferred instead to use the pre-trained classifier GQ-CNN [Mahler2017] for robustness evaluation of predicted grasp configurations. Indeed, this classifier was trained with a heuristic-based robustness evaluation metric named Robust Ferrari-Canny. Moreover, the GQ-CNN was found experimentally to be an excellent predictor of grasp success, on both known and unknown objects [Mahler2017]. As a reminder, the GQ-CNN takes as an input a depth image centered around the grasp location and classifies whether or not it is a robust grasp location.

We evaluated both our architecture and our baseline MultiGrasp architecture (i.e. with a ResNet feature extractor) using this robustness evaluation methodology. For the MultiGrasp architecture, we extracted a depth image around the grasp rectangle and fed it to GQ-CNN for classification. The output image crop generated automatically by our GQ-STN architecture was used directly for evaluation. For all architectures, a grasp configuration was considered positive if it was classified as robust by GQ-CNN. Robustness classification results are found in Table I.

V-B Rectangle Metric

Fig. 6: Examples of grasp predictions (in red) and ground truth annotations (in green) depicting the limitations of the grasp rectangle as an evaluation metric. (Left) Examples of grasps that are negative under the rectangle metric but classified robust by GQ-CNN. (Right) Examples of grasps that are positive under the rectangle metric but classified non-robust by GQ-CNN.

The rectangle metric is a standard evaluation metric for grasping systems introduced in [Jiang2011]. Given a grasp prediction $\hat{g}$ and its closest ground truth $g$, $\hat{g}$ is considered correct if both:

  1. the angle difference between $\hat{g}$ and $g$ is below $30^\circ$;

  2. the Jaccard index of $\hat{g}$ and $g$ is greater than $25\%$.

The Jaccard index is given by:

$$J(\hat{g}, g) = \frac{|\hat{g} \cap g|}{|\hat{g} \cup g|}$$

Note that the Dex-Net 2.0 dataset does not contain the rectangle height $h$ required by the rectangle grasp representation. For these annotations, we simply assumed a fixed height that corresponds to the size of the gripper’s fingertips. Our architecture does not predict the rectangle width $w$ directly, but an analogous scaling factor $s$. We converted this scaling factor to a width in a way that corresponds to how grasps are represented in the Dex-Net 2.0 dataset. Both MultiGrasp and GQ-STN predict a gripper height in addition to the 2D grasp configuration. For the rectangle metric evaluation purposes, this parameter is ignored.
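The rectangle metric can be sketched for a single prediction and ground-truth pair as follows. The sketch assumes rectangles given as `(x, y, w, h, theta)` tuples with the angle in radians, computes the intersection of the two rotated rectangles with Sutherland-Hodgman clipping, and uses thresholds of 30° and 25%, the values commonly used in the grasp detection literature, as default parameters. All function names are hypothetical.

```python
import math


def rect_corners(x, y, w, h, theta):
    """Corners of a w-by-h rectangle centered at (x, y), rotated by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    half = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in half]


def polygon_area(poly):
    """Shoelace formula; returns 0 for degenerate polygons."""
    if len(poly) < 3:
        return 0.0
    total = 0.0
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def side(a, b, p):
    """Non-negative when p lies on or to the left of the directed edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def clip_convex(subject, clipper):
    """Sutherland-Hodgman clipping of a convex polygon by a convex one (same winding as rect_corners)."""
    output = list(subject)
    for a, b in zip(clipper, clipper[1:] + clipper[:1]):
        if not output:
            break
        current, output = output, []
        for p, q in zip(current, current[1:] + current[:1]):
            sp, sq = side(a, b, p), side(a, b, q)
            if (sp >= 0) != (sq >= 0):            # the edge p -> q crosses the clip line
                t = sp / (sp - sq)
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
            if sq >= 0:
                output.append(q)
    return output


def jaccard(r1, r2):
    """Intersection over union of two rotated rectangles."""
    inter = polygon_area(clip_convex(rect_corners(*r1), rect_corners(*r2)))
    union = r1[2] * r1[3] + r2[2] * r2[3] - inter
    return inter / union if union > 0 else 0.0


def angle_difference(t1, t2):
    """Smallest difference between two gripper angles, modulo a half turn."""
    d = abs(t1 - t2) % math.pi
    return min(d, math.pi - d)


def rectangle_metric(pred, truth, max_angle=math.radians(30), min_jaccard=0.25):
    """True when both the angle and the Jaccard conditions are satisfied."""
    return angle_difference(pred[4], truth[4]) < max_angle and jaccard(pred, truth) > min_jaccard


truth = (0.0, 0.0, 2.0, 1.0, 0.0)
shifted = (1.0, 0.0, 2.0, 1.0, 0.0)              # overlaps half of the ground truth
turned = (0.0, 0.0, 2.0, 1.0, math.pi / 2)       # same center, perpendicular gripper
```

Note that nothing in this computation refers to the object itself: the test only compares two rectangles in image space, which is the weakness discussed in the remainder of this section.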

We evaluate both MultiGrasp and GQ-STN on the rectangle metric. Table I shows that MultiGrasp performs slightly better than GQ-STN on the rectangle metric. This is understandable since it was specifically trained for rectangle regression. However, MultiGrasp has a poor Robustness Classification Metric score.

The rectangle metric is known to have a number of issues [ghazaei18_dealin_with_ambig_robot_grasp, chen2019convolutional]. First and foremost, the score bears no physical meaning in terms of grasp robustness, as it is purely computed in the image space. For example, a grasp rectangle can be considered as valid (high Jaccard index), even if a finger collides with the object. Second, for a grasp prediction to be evaluated, there needs to be a ground truth annotation near the exact position of the prediction. In other words, the validity of a grasp prediction depends on whether or not it was annotated in the dataset. This is particularly problematic when evaluating grasp detection frameworks, as for a given object, there is an infinity of possible grasp configurations which cannot all be annotated. In a classification framework, one does not suffer from this issue, since only labeled examples are used during evaluation.

To observe the lack of correlation between the rectangle metric and grasp robustness, we first examined the quantity of grasp rectangles that are considered positive by the rectangle metric but are not robust according to the robustness classification metric of GQ-CNN, described in Sec. V-A. These account for and of grasps detected by respectively MultiGrasp and GQ-STN. Conversely, we examined the grasps that are considered negative according to the rectangle metric but are robust according to the robustness classification metric. These account for and of grasps detected by respectively MultiGrasp and GQ-STN. These represent grasp rectangles that would be positive if they were annotated in the dataset. Examples are shown in Figure 6. These auxiliary results show that, especially in the context of sparse grasp annotations such as with the Dex-Net 2.0 dataset, the rectangle metric does not properly represent the performance of a grasping system. This further motivates our choice of evaluating with a robustness classification metric.
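The comparison between the two metrics described above amounts to a 2 x 2 cross-tabulation of per-grasp verdicts. A minimal sketch follows, assuming two equally long lists of booleans (one verdict per detected grasp); the function name and dictionary keys are illustrative only.

```python
from collections import Counter

def cross_tabulate(rect_positive, robust):
    """Percentage of grasps in each (rectangle metric, robustness) cell."""
    if len(rect_positive) != len(robust):
        raise ValueError("both lists must contain one verdict per grasp")
    counts = Counter(zip(rect_positive, robust))
    total = len(robust)
    return {
        "rect_pos_robust": 100.0 * counts[(True, True)] / total,
        # Accepted by the rectangle metric although classified non-robust.
        "rect_pos_not_robust": 100.0 * counts[(True, False)] / total,
        # Rejected by the rectangle metric although classified robust.
        "rect_neg_robust": 100.0 * counts[(False, True)] / total,
        "rect_neg_not_robust": 100.0 * counts[(False, False)] / total,
    }
```

The two off-diagonal cells are the quantities of interest here: a large `rect_neg_robust` share is what one would expect when annotations are sparse.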

V-C Metric Results

Table I shows that overall, our approach is able to return a significantly higher percentage of high-quality grasps (92.4%) than the one-shot detection approach based on MultiGrasp (69.4%). This large performance gap can be explained by the fact that our approach enables us to optimize directly on the robustness classification metric, which is impossible for MultiGrasp. For both approaches, the rectangle metric tends to under-estimate the performance, which is explainable by sparse grasp annotations of the Dex-Net 2.0 dataset, as discussed in Section V-B.

| Model | Rectangle precision (%) | Robust precision (%) |
|---|---|---|
| MultiGrasp | 48.4 | 69.4 |
| GQ-STN (ours) | 46.7 | 92.4 |

TABLE I: Comparison of one-shot methods on evaluation metrics.

V-D Physical benchmark

Fig. 7: Examples of robust and non-robust grasp detection made by GQ-STN and Dex-Net 2.0 in our physical benchmark.
| Model | Success rate (%) | Robust pred. rate (%) | Grasp detect. time (s) |
|---|---|---|---|
| MultiGrasp | 95 | 21.7 | 0.014 |
| GQ-STN (ours) | 96.7 | 61.7 | 0.024 |
| Prop+GQ-CNN | 98.3 | 48.3 | 1.5 |

TABLE II: Comparison of methods on our physical benchmark.

We evaluated all three methods in real-world conditions using the physical setup seen in Figure 5. It comprised a Universal Robots UR5 arm, a Robotiq 85 gripper and a Microsoft Kinect sensor. The Kinect sensor was mounted 70 cm above the table, facing perpendicular to its surface. Grasp prediction was based on a single rectified depth image, where we replaced invalid depth pixels using inpainting [johns16_deep_learn_grasp_funct_grasp].

We selected 12 household and office objects for testing, shown in Figure 5. We chose objects that have a good variety of shape, material and texture and are similar to the ones used in [Mahler2017]. During testing, we placed the target object at a random position near the center of the table, by shaking it under a box to ensure random orientation, as in [Mahler2017]. We then estimated the grasp configuration with one of the three methods, and used a custom path planner to execute the grasp motion. The gripper default opening was 8.5 cm. It closed on the object until a maximum force feedback was reached. Upon closure, the object was lifted from the table and the success evaluated manually. Each of the 12 objects was tested 5 times for each compared method. In total, we performed 180 grasp attempts.

We computed three metrics in this physical benchmark:

  1. Success rate: Percentage of the lift attempts that resulted in a success. We execute the detected grasp even if it is not classified robust by the robustness classification metric.

  2. Robust prediction rate: Percentage of the time the detected grasp (or the top grasp candidate for the sampling-based Prop+GQ-CNN) is robust according to the robustness classification metric.

  3. Grasp detection time: Time in seconds between capturing an image and returning a grasp location. Here, we ignore time taken for inpainting.
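The three metrics above could be computed from per-attempt records along the following lines. This is a sketch: the `Attempt` record and its field names are hypothetical, and reporting the detection time as a mean over attempts is an assumption, since the aggregation is not specified.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Attempt:
    lifted: bool        # manual judgment: was the object lifted successfully?
    robust: bool        # verdict of the robustness classifier on the detected grasp
    detect_time: float  # seconds from image capture to grasp output, inpainting excluded

def summarize(attempts):
    """Success rate (%), robust prediction rate (%) and mean detection time (s)."""
    n = len(attempts)
    return {
        "success_rate": 100.0 * sum(a.lifted for a in attempts) / n,
        "robust_pred_rate": 100.0 * sum(a.robust for a in attempts) / n,
        "detect_time": mean(a.detect_time for a in attempts),
    }
```

Note that `lifted` and `robust` are recorded independently, since a grasp is executed even when the classifier does not consider it robust.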

As we can see in Table II, all three methods achieved similar success rates, within the uncertainty due to the small number of samples. However, our method returned a robust grasp 61.7% of the time, which is significantly more than MultiGrasp (21.7%) and above Prop+GQ-CNN (48.3%).

Qualitatively, Prop+GQ-CNN seemed to perform slightly better during real experiments, especially with larger objects such as the red chips clip. In some sense, this is not surprising, as it evaluated the grasp quality over 1000 positions. Figure 7 shows examples of grasp detection on our physical benchmark. Even though the methods were trained only on simulated data, the large amount of training data helped generalization to real-world conditions, as also noted in [Mahler2017]. Note that no domain randomization was used here, contrary to [bousmalis17:_using_simul_domain_adapt_improv].

In terms of timing, our GQ-STN approach is of the same order of magnitude as the MultiGrasp approach, even though we run an image through three ResNet networks (one per Localization Network inside the STN). The detection time for Prop+GQ-CNN is two orders of magnitude larger than that of our approach, i.e. around 60 times slower. This limits its ability to perform real-time grasp detection.
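As a quick check using the detection times reported in Table II, the slowdown of Prop+GQ-CNN relative to GQ-STN and the detection rate of GQ-STN work out to:

```latex
\frac{t_{\text{Prop+GQ-CNN}}}{t_{\text{GQ-STN}}} = \frac{1.5\ \text{s}}{0.024\ \text{s}} = 62.5,
\qquad
\frac{1}{t_{\text{GQ-STN}}} = \frac{1}{0.024\ \text{s}} \approx 41.7\ \text{Hz}
```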

Even though GQ-STN returns a single grasp and does so much faster, it finds a robust grasp more often than Prop+GQ-CNN’s sampling. Considering the high precision of the robustness classification metric, this enables GQ-STN to be used in a framework where we first evaluate the fast GQ-STN, then fall back to a slow sampling method if we have not found a robust grasp, improving the overall average planning time.
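Such a fallback framework could look like the sketch below. The three callables are hypothetical placeholders (a one-shot detector, a sampling-based planner and a robustness check), and the expected-time formula assumes that the robustness check itself takes negligible time and that the slow planner runs only when the fast grasp is rejected.

```python
def detect_with_fallback(depth, fast_detector, slow_sampler, is_robust):
    """Try the one-shot detector first; fall back to sampling if its grasp is not robust.

    fast_detector, slow_sampler: hypothetical callables mapping a depth image to a grasp.
    is_robust: hypothetical callable (depth, grasp) -> bool.
    """
    grasp = fast_detector(depth)
    if is_robust(depth, grasp):
        return grasp, "fast"
    return slow_sampler(depth), "fallback"

def expected_planning_time(t_fast, t_slow, p_robust):
    """E[t] = t_fast + (1 - p_robust) * t_slow, with times in seconds."""
    return t_fast + (1.0 - p_robust) * t_slow
```

Plugging in the Table II values (0.024 s, 1.5 s and a robust prediction rate of 61.7%) gives 0.024 + 0.383 × 1.5 ≈ 0.60 s on average, compared with 1.5 s for sampling alone, provided the robust prediction rate observed in the benchmark carries over.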

VI Conclusion

In this paper, we presented a novel architecture for one-shot grasp detection, based on the Spatial Transformer Network (STN) architecture. With it, we have demonstrated how one can use supervision from a robustness classifier to train one-shot grasp detection. On the Dex-Net 2.0 dataset, our method returns robust grasps more often than a baseline model that is only trained using the geometric supervision. We showed in a physical benchmark that our method can find robust grasps in real-world conditions more often than sampling methods, while still running in real time (over 40 Hz), which is faster than the frame rate of a Kinect.

This speed opens up the possibility of carrying out visual servoing for grasping, for instance with moving objects. If an in-hand camera is used, it becomes possible to explore an object in real time, similarly to a next-best-view approach, akin to [Levine2016]. There are other interesting research avenues at the network architecture level for future work. For instance, since all inputs of the Spatial Transformer Network (STN) are similar depth images, one could imagine a parameter sharing mechanism to speed up the training time and reduce the model size.