Closing the Loop for Robotic Grasping: A Real-time, Generative Grasp Synthesis Approach

04/14/2018 ∙ by Douglas Morrison, et al. ∙ QUT

This paper presents a real-time, object-independent grasp synthesis method which can be used for closed-loop grasping. Our proposed Generative Grasping Convolutional Neural Network (GG-CNN) predicts the quality of grasps at every pixel. This one-to-one mapping from a depth image overcomes limitations of current deep learning grasping techniques, specifically by avoiding discrete sampling of grasp candidates and long computation times. Additionally, our GG-CNN is orders of magnitude smaller while detecting stable grasps with equivalent performance to current state-of-the-art techniques. The lightweight and single-pass generative nature of our GG-CNN allows for closed-loop control at up to 50 Hz, enabling accurate grasping in non-static environments where objects move and in the presence of robot control inaccuracies. In our real-world tests, we achieve an 83% grasp success rate on a set of previously unseen objects with adversarial geometry and 88% on a set of household objects that are moved during the grasp attempt. We also achieve an 81% success rate when grasping in dynamic clutter.




Code Repositories


Generative Grasping CNN from "Closing the Loop for Robotic Grasping: A Real-time, Generative Grasp Synthesis Approach" (RSS 2018)


I Introduction

In order to perform grasping and manipulation tasks in the unstructured environments of the real world, a robot must be able to compute grasps for the almost unlimited number of objects it might encounter. In addition, it needs to be able to act in dynamic environments, whether that be changes in the robot’s workspace, noise and errors in perception, inaccuracies in the robot’s control, or perturbations to the robot itself.

Robotic grasping has been investigated for decades, yielding a multitude of different techniques [2, 3, 27, 29]. Most recently, deep learning techniques have enabled some of the biggest advancements in grasp synthesis for unknown items. These approaches allow learning of features that correspond to good quality grasps that exceed the capabilities of human-designed features [12, 17, 21, 23].

Fig. 1: Our real-time, generative grasping pipeline. A camera mounted to the wrist of the robot captures depth images containing an object to be grasped. Our Generative Grasping Convolutional Neural Network (GG-CNN) generates antipodal grasps – parameterised as a grasp quality, angle and gripper width – for every pixel in the input image in a fraction of a second. The best grasp is calculated and a velocity command ($\mathbf{v}$) is issued to the robot. The closed-loop system is capable of grasping dynamic objects and reacting to control errors.

However, these approaches typically use adapted versions of Convolutional Neural Network (CNN) architectures designed for object recognition [12, 15, 23, 25], and in most cases sample and rank grasp candidates individually [17, 21, 23], resulting in long computation times in the order of a second [21] to tens of seconds [17]. As such, these techniques are rarely used in closed-loop grasp execution and rely on precise camera calibration and precise robot control to grasp successfully, even in static environments.

We propose a different approach to selecting grasp points for previously unseen items. Our Generative Grasping Convolutional Neural Network (GG-CNN) directly generates an antipodal grasp pose and quality measure for every pixel in an input depth image and is fast enough for closed-loop control of grasping in dynamic environments (Fig. 1). We use the term “generative” to differentiate our direct grasp generation method from methods which sample grasp candidates.

The advantages of GG-CNN over other state-of-the-art grasp synthesis CNNs are twofold. Firstly, we do not rely on sampling of grasp candidates, but rather directly generate grasp poses on a pixelwise basis, analogous to advances in object detection where fully-convolutional networks are commonly used to perform pixelwise semantic segmentation rather than relying on sliding windows or bounding boxes [19]. Secondly, our GG-CNN has orders of magnitude fewer parameters than other CNNs used for grasp synthesis, allowing our grasp detection pipeline to execute in only 19 ms on a GPU-equipped desktop computer, fast enough for closed-loop grasping.

We evaluate the performance of our system in different scenarios by performing grasping trials with a Kinova Mico robot, with static, dynamic and cluttered objects. In dynamic grasping trials, where objects are moved during the grasp attempt, we achieve 83% grasping success rate on a set of eight 3D-printed objects with adversarial geometry [21] and 88% on a set of 12 household items chosen from standardised object sets. Additionally, we reproduce the dynamic clutter grasping experiments of [32] and show an improved grasp success rate of 81%. We further illustrate the advantages of using a closed-loop method by reporting experimental results when artificial inaccuracies are added to the robot’s control.

[17] [25] [23] [12] [15] [21] [18] [32] Ours

Real Robot Experiments
Objects from Standard Sets - -
Adversarial Objects [21] - -
Closed-loop - -
Dynamic Objects - -
Code Available *
Training Data Available
Training Data Type | Real (Cornell [17]) | Real (Cornell) | Real (Trial) | Synthetic | Real (Cornell) | Synthetic | Real (Trial) | Synthetic | Real (Cornell)

TABLE I: A comparison of our work to related deep learning approaches to grasp synthesis.
* Code is available at

II Related Work

Grasping Unknown Objects

Grasp synthesis refers to the formulation of a stable robotic grasp for a given object, a widely researched topic that has produced a plethora of techniques. Broadly, these can be classified into analytic methods and empirical methods [3, 27]. Analytic methods use mathematical and physical models of geometry, kinematics and dynamics to calculate grasps that are stable [2, 24], but tend to not transfer well to the real world due to the difficulty in modelling physical interactions between a manipulator and an object [2, 26, 27].

In contrast, empirical methods focus on using models and experience-based approaches. Some techniques work with known items, associating good grasp points with an offline database of object models or shapes [6, 8, 22], or familiar items, based on object classes [28] or object parts [7], but are unable to generalise to new objects.

For grasping unknown objects, large advancements have been seen recently with a proliferation of vision-based deep-learning techniques [17, 21, 23, 25, 33]. Many of these techniques share a common pipeline: classifying grasp candidates sampled from an image or point cloud, then ranking them individually using Convolutional Neural Networks (CNN). Once the best grasp candidate is determined, a robot executes the grasp open-loop (without any feedback) which requires precise calibration between the camera and the robot, precise control of the robot and a completely static environment.

Execution time is the primary reason that grasps are executed open-loop. In many cases, deep-learning approaches use large neural networks with millions of parameters [12, 21, 23] and process grasp candidates using a sliding window at discrete intervals of offset and rotation [17, 23], which is computationally expensive and results in grasp planning times in the order of a second [21] to tens of seconds [17].

Some approaches reduce execution time by pre-processing and pruning the grasp candidates [17, 33] or predicting the quality of a discrete set of grasp candidates simultaneously [12, 23], trading off execution time against the number of grasps which are sampled, but ignoring some potential grasps.

Instead of sampling grasp candidates, both [15] and [25] use a deep CNN to regress a single best grasp pose for an input image. However, these regression methods are liable to output the average of the possible grasps for an object, which itself may not be a valid grasp [25].

Similar to our method, Varley et al. [31] use a neural network to generate pixelwise heatmaps for finger placement in an image, but still rely on a grasp planner to determine the final grasp pose.

We address the issues of execution time and grasp sampling by directly generating grasp poses for every pixel in an image simultaneously, using a comparatively small neural network.

Closed-Loop Grasping

Closed-loop control of a robot to a desired pose using visual feedback is commonly referred to as visual servoing. The advantages of visual servoing methods are that they are able to adapt to dynamic environments and do not necessarily require fully accurate camera calibration or position control. A number of works apply visual servoing directly to grasping applications, with a survey given in [14]. However, the nature of visual servoing methods means that they typically rely on hand-crafted image features for object detection [13, 30] or object pose estimation [11], so they do not perform any online grasp synthesis but instead converge to a pre-determined goal pose, and are not applicable to unknown objects.

CNN-based controllers for grasping have very recently been proposed to combine deep learning with closed-loop grasping [18, 32]. Rather than explicitly performing grasp synthesis, both systems learn controllers which map potential control commands to the expected quality of or distance to a grasp after execution of the control, requiring many potential commands to be sampled at each time step. In both cases, the control executes at no more than approximately 5 Hz. While both are closed-loop controllers, grasping in dynamic scenes is only presented in [32] and we reproduce these experiments.

The grasp regression methods [15, 25] report real-time performance, but are not validated with robotic experiments.

Benchmarking for Robotic Grasping

Directly comparing results between robotic grasping experiments is difficult due to the wide range of grasp detection techniques used, the lack of standardisation between object sets, and the limitations of different physical hardware, e.g. robot arms, grippers or cameras. Many authors report grasp success rates on sets of “household” objects, which vary significantly in the number and types of objects used.

The ACRV Picking Benchmark (APB) [16] and the YCB Object Set [5] define item sets and manipulation tasks, but benchmark on tasks such as warehouse order fulfilment (APB) or table setting and block stacking (YCB) rather than raw grasp success rate as is typically reported. Additionally, many of the items from these two sets are impractically small, large or heavy for many robots and grippers, so have not been widely adopted for robotic grasping experiments.

We propose a set of 20 reproducible items for testing, comprising eight 3D-printed adversarial objects from [21] and 12 items from the APB and YCB object sets, which we believe provide a wide enough range of sizes, shapes and difficulties to effectively compare results while not excluding use by any common robots, grippers or cameras.

In Table I we provide a summary of the recent related work on grasping for unknown objects, and how they compare to our own approach. This is not intended to be a comprehensive review, but rather to highlight the most relevant work.

III Grasp Point Definition

Fig. 2: Left: A grasp $\mathbf{g}$ is defined by its Cartesian position $(x, y, z)$, rotation around the z-axis $\phi$ and gripper width $w$ required for a successful grasp. Right: In the depth image the grasp pose $\tilde{\mathbf{g}}$ is defined by its centre pixel $(u, v)$, its rotation $\tilde{\phi}$ around the image axis and perceived width $\tilde{w}$.

Like much of the related literature [12, 17, 21, 23, 32], we consider the problem of detecting and executing antipodal grasps on unknown objects, perpendicular to a planar surface, given a depth image of the scene (Fig. 2).

Let $\mathbf{g} = (\mathbf{p}, \phi, w, q)$ define a grasp, executed perpendicular to the $x$-$y$ plane. The grasp is determined by its pose, i.e. the gripper’s centre position $\mathbf{p} = (x, y, z)$ in Cartesian coordinates, the gripper’s rotation $\phi$ around the $z$ axis and the required gripper width $w$. A scalar quality measure $q$, representing the chances of grasp success, is added to the pose. The addition of the gripper width enables a better prediction and better performance over the more commonly used position and rotation only representation.

We want to detect grasps given a 2.5D depth image $\mathbf{I} \in \mathbb{R}^{H \times W}$ with height $H$ and width $W$, taken from a camera with known intrinsic parameters. In the image $\mathbf{I}$ a grasp is described by

$$\tilde{\mathbf{g}} = (\mathbf{s}, \tilde{\phi}, \tilde{w}, q),$$

where $\mathbf{s} = (u, v)$ is the centre point in image coordinates (pixels), $\tilde{\phi}$ is the rotation in the camera’s reference frame and $\tilde{w}$ is the grasp width in image coordinates. A grasp in the image space $\tilde{\mathbf{g}}$ can be converted to a grasp in world coordinates $\mathbf{g}$ by applying a sequence of known transforms,

$$\mathbf{g} = t_{RC}(t_{CI}(\tilde{\mathbf{g}})) \qquad (1)$$

where $t_{RC}$ transforms from the camera frame to the world/robot frame and $t_{CI}$ transforms from 2D image coordinates to the 3D camera frame, based on the camera intrinsic parameters and known calibration between the robot and camera.
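As a rough illustration of the two-step conversion in Eq. (1), the sketch below back-projects an image-space grasp with a pinhole camera model and then applies a 4×4 homogeneous camera-to-world transform. The pinhole model, the function names, the intrinsics tuple `(fx, fy, cx, cy)` and the example transform `T_DOWN` are all assumptions made for this example and are not taken from the paper's implementation.

```python
import math

def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) at the measured depth into the 3D camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def apply_transform(T, point):
    """Apply a 4x4 homogeneous transform (nested lists) to a 3D point."""
    x, y, z = point
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))

def rotate_angle(T, angle):
    """Rotate an in-plane grasp angle by the rotation part of T, wrapped to [-pi/2, pi/2]."""
    dx, dy = math.cos(angle), math.sin(angle)
    wx = T[0][0] * dx + T[0][1] * dy
    wy = T[1][0] * dx + T[1][1] * dy
    a = math.atan2(wy, wx)
    if a > math.pi / 2:
        a -= math.pi
    elif a < -math.pi / 2:
        a += math.pi
    return a

def image_grasp_to_world(u, v, angle, width_px, depth, intrinsics, T_world_cam):
    """Convert an image-space grasp (pixel centre, angle, pixel width) to world coordinates."""
    fx, fy, cx, cy = intrinsics
    position = apply_transform(T_world_cam, pixel_to_camera(u, v, depth, fx, fy, cx, cy))
    width = width_px * depth / fx  # pixel width to metres at the measured depth
    return position, rotate_angle(T_world_cam, angle), width

# Hypothetical camera 0.5 m above the table, looking straight down.
T_DOWN = [[1, 0, 0, 0.0],
          [0, -1, 0, 0.0],
          [0, 0, -1, 0.5],
          [0, 0, 0, 1.0]]
position, angle, width = image_grasp_to_world(
    200, 150, math.pi / 4, 50, 0.4, (500.0, 500.0, 150.0, 150.0), T_DOWN)
```

The same depth-scaled division by the focal length is one way to turn a gripper width in pixels into a physical width.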

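We refer to the set of grasps in the image space as the grasp map, which we denote

$$\mathbf{G} = (\mathbf{\Phi}, \mathbf{W}, \mathbf{Q}) \in \mathbb{R}^{3 \times H \times W},$$

where $\mathbf{\Phi}$, $\mathbf{W}$ and $\mathbf{Q}$ are each $\in \mathbb{R}^{H \times W}$ and contain values of $\tilde{\phi}$, $\tilde{w}$ and $q$ respectively at each pixel $\mathbf{s}$.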

Instead of sampling the input image to create grasp candidates, we wish to directly calculate a grasp $\tilde{\mathbf{g}}$ for each pixel in the depth image $\mathbf{I}$. To do this, we define a function $M$ from a depth image to the grasp map in the image coordinates: $M(\mathbf{I}) = \mathbf{G}$. From $\mathbf{G}$ we can calculate the best visible grasp in the image space $\tilde{\mathbf{g}}^* = \max_{\mathbf{Q}} \mathbf{G}$, and calculate the equivalent best grasp in world coordinates $\mathbf{g}^*$ via Eq. (1).
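Selecting the best visible grasp from a grasp map amounts to finding the pixel with the highest quality and reading the angle and width at that pixel. A minimal sketch, assuming the three maps are stored as equally sized nested lists (the function and key names are hypothetical):

```python
def best_grasp(quality, angle, width):
    """Return the grasp at the pixel with the highest quality value."""
    best_u, best_v = 0, 0
    for v, row in enumerate(quality):
        for u, q in enumerate(row):
            if q > quality[best_v][best_u]:
                best_u, best_v = u, v
    return {
        "u": best_u,
        "v": best_v,
        "angle": angle[best_v][best_u],
        "width": width[best_v][best_u],
        "quality": quality[best_v][best_u],
    }

# Toy 2x3 maps (rows are v, columns are u).
quality_map = [[0.1, 0.2, 0.0],
               [0.3, 0.9, 0.4]]
angle_map = [[0.0, 0.1, 0.2],
             [0.3, 0.4, 0.5]]
width_map = [[10.0, 20.0, 30.0],
             [40.0, 50.0, 60.0]]
grasp = best_grasp(quality_map, angle_map, width_map)
```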

IV Generative Grasping Convolutional Neural Network

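We propose the use of a neural network to approximate the complex function $M: \mathbf{I} \rightarrow \mathbf{G}$. $M_\theta$ denotes a neural network with $\theta$ being the weights of the network.

We show that $M_\theta(\mathbf{I}) = (\mathbf{Q}_\theta, \mathbf{\Phi}_\theta, \mathbf{W}_\theta) \approx M(\mathbf{I})$ can be learned with a training set of inputs $\mathbf{I}_T$ and corresponding outputs $\mathbf{G}_T$ and applying the L2 loss function $\mathcal{L}$, such that

$$\theta = \underset{\theta}{\operatorname{argmin}}\ \mathcal{L}(\mathbf{G}_T, M_\theta(\mathbf{I}_T)).$$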
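For concreteness, an L2 loss over the predicted and ground-truth grasp maps can be sketched as a mean squared error per map, summed over the maps. The reduction (mean per map, then sum) and the dictionary layout are assumptions of this sketch; the text only states that an L2 loss is used.

```python
def l2_loss(predicted, target):
    """Mean squared error between two equally sized 2D maps."""
    total, count = 0.0, 0
    for p_row, t_row in zip(predicted, target):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count

def grasp_map_loss(predicted_maps, target_maps):
    """Sum the per-map L2 losses over every map in the target."""
    return sum(l2_loss(predicted_maps[name], target_maps[name]) for name in target_maps)

# Toy maps with a single row of two pixels each.
predicted_maps = {"quality": [[0.0, 1.0]], "width": [[0.5, 0.5]]}
target_maps = {"quality": [[1.0, 1.0]], "width": [[0.5, 0.0]]}
loss = grasp_map_loss(predicted_maps, target_maps)
```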

IV-A Grasp Representation

$\mathbf{G}$ estimates the parameters of a set of grasps, executed at the Cartesian point $\mathbf{p}$, corresponding to each pixel $\mathbf{s}$. We represent the grasp map $\mathbf{G}$ as a set of three images, $\mathbf{Q}$, $\mathbf{\Phi}$ and $\mathbf{W}$. The representations are as follows:

$\mathbf{Q}$ is an image which describes the quality of a grasp executed at each point $(u, v)$. The value is a scalar in the range $[0, 1]$ where a value closer to 1 indicates higher grasp quality, i.e. higher chance of grasp success.

$\mathbf{\Phi}$ is an image which describes the angle of a grasp to be executed at each point. Because the antipodal grasp is symmetrical around $\pm\frac{\pi}{2}$ radians, the angles are given in the range $[-\frac{\pi}{2}, \frac{\pi}{2}]$.

$\mathbf{W}$ is an image which describes the gripper width of a grasp to be executed at each point. To allow for depth invariance, values are in the range of $[0, 150]$ pixels, which can be converted to a physical measurement using the depth camera parameters and measured depth.

Fig. 3: Generation of training data used to train our GG-CNN. Left: The cropped and rotated depth and RGB images from the Cornell Grasping Dataset [17], with the ground-truth positive grasp rectangles representing antipodal grasps shown in green. The RGB image is for illustration and is not used by our system. Right: From the ground-truth grasps, we generate the Grasp Quality ($\mathbf{Q}_T$), Grasp Angle ($\mathbf{\Phi}_T$) and Grasp Width ($\mathbf{W}_T$) images to train our network. The angle is further decomposed into $\cos(2\mathbf{\Phi}_T)$ and $\sin(2\mathbf{\Phi}_T)$ for training as described in Section IV-B.

IV-B Training Dataset

To train our network, we create a dataset (Fig. 3) from the Cornell Grasping Dataset [17]. The Cornell Grasping Dataset contains 885 RGB-D images of real objects, with 5110 human-labelled positive and 2909 negative grasps. While this is a relatively small grasping dataset compared to some more recent, synthetic datasets [20, 21], the data best suits our pixelwise grasp representation as multiple labelled grasps are provided per image. This is a more realistic estimate of the full pixelwise grasp map than using a single image to represent one grasp, such as in [21]. We augment the Cornell Grasping Dataset with random crops, zooms and rotations to create a set of 8840 depth images and associated grasp map images $\mathbf{G}_T$, effectively incorporating 51,100 grasp examples.

The Cornell Grasping Dataset represents antipodal grasps as rectangles using pixel coordinates, aligned to the position and rotation of a gripper [35]. To convert from the rectangle representation to our image-based representation $\mathbf{G}$, we use the centre third of each grasping rectangle as an image mask which corresponds to the position of the centre of the gripper. We use this image mask to update sections of our training images, as described below and shown in Fig. 3. We consider only the positive labelled grasps for training our network and assume any other area is not a valid grasp.

Grasp Quality: We treat each ground-truth positive grasp from the Cornell Grasping Dataset as a binary label and set the corresponding area of $\mathbf{Q}_T$ to a value of 1. All other pixels are 0.

Angle: We compute the angle of each grasping rectangle in the range $[-\frac{\pi}{2}, \frac{\pi}{2}]$, and set the corresponding area of $\mathbf{\Phi}_T$. We encode the angle as two vector components on a unit circle, producing values in the range $[-1, 1]$ and removing any discontinuities that would occur in the data where the angle wraps around $\pm\frac{\pi}{2}$ if the raw angle was used, making the distribution easier for the network to learn [9]. Because the antipodal grasp is symmetrical around $\pm\frac{\pi}{2}$ radians, we use two components $\sin(2\mathbf{\Phi}_T)$ and $\cos(2\mathbf{\Phi}_T)$, which provide values which are unique within $\mathbf{\Phi}_T \in [-\frac{\pi}{2}, \frac{\pi}{2}]$ and symmetrical at $\pm\frac{\pi}{2}$.

Width: Similarly, we compute the width in pixels (maximum of 150) of each grasping rectangle representing the width of the gripper and set the corresponding portion of $\mathbf{W}_T$. During training, we scale the values of $\mathbf{W}_T$ by $\frac{1}{150}$ to put it in the range $[0, 1]$. The physical gripper width can be calculated using the parameters of the camera and the measured depth.

Depth Input: As the Cornell Grasping Dataset is captured with a real camera it already contains realistic sensor noise and therefore no noise addition is required. The depth images are inpainted using OpenCV [4] to remove invalid values. We subtract the mean of each depth image, centring its value around 0 to provide depth invariance.
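The conversion from one labelled grasp rectangle to the quality, angle and width training images can be sketched as follows. The rectangle parameterisation (centre, angle, opening width and jaw size in pixels) and the choice to take the centre third along the gripper opening axis are assumptions of this sketch, as is the function name; the raw angle is stored here and would be encoded as sine and cosine components in a separate step.

```python
import math

MAX_WIDTH_PX = 150.0  # maximum gripper width in pixels, as stated in the text

def draw_grasp(q_map, ang_map, w_map, centre, angle, width_px, jaw_px):
    """Mark the centre third of one positive grasp rectangle in the three training maps."""
    cu, cv = centre
    ca, sa = math.cos(angle), math.sin(angle)
    for v in range(len(q_map)):
        for u in range(len(q_map[0])):
            du, dv = u - cu, v - cv
            along = du * ca + dv * sa     # offset along the gripper opening axis
            across = -du * sa + dv * ca   # offset across it (along the jaws)
            if abs(along) <= width_px / 6.0 and abs(across) <= jaw_px / 2.0:
                q_map[v][u] = 1.0                                       # binary quality label
                ang_map[v][u] = angle                                   # raw angle in radians
                w_map[v][u] = min(width_px, MAX_WIDTH_PX) / MAX_WIDTH_PX  # scaled to [0, 1]

def zeros(height, width):
    return [[0.0] * width for _ in range(height)]

# One axis-aligned grasp, 12 px wide with 2 px jaws, centred in a 9x9 image.
q_map, ang_map, w_map = zeros(9, 9), zeros(9, 9), zeros(9, 9)
draw_grasp(q_map, ang_map, w_map, centre=(4, 4), angle=0.0, width_px=12.0, jaw_px=2.0)
```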

IV-C Network Architecture

Fig. 4: (a) The Generative Grasping CNN (GG-CNN) takes an inpainted depth image ($\mathbf{I}$), and directly generates a grasp pose for every pixel (the grasp map $\mathbf{G}_\theta$), comprising the grasp quality $\mathbf{Q}_\theta$, grasp width $\mathbf{W}_\theta$ and grasp angle $\mathbf{\Phi}_\theta$. (b) From the combined network output, we can compute the best grasp point to reach for, $\tilde{\mathbf{g}}^*_\theta$.

Our GG-CNN is a fully convolutional topology, shown in Fig. 4a. It is used to directly approximate the grasp map $\mathbf{G}_\theta$ from an input depth image $\mathbf{I}$. Fully convolutional networks have been shown to perform well at computer vision tasks requiring transfer between image domains, such as image segmentation [1, 19] and contour detection [34].

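The GG-CNN computes the function $M_\theta(\mathbf{I}) = (\mathbf{Q}_\theta, \cos(2\mathbf{\Phi}_\theta), \sin(2\mathbf{\Phi}_\theta), \mathbf{W}_\theta)$, where $\mathbf{I}$, $\mathbf{Q}_\theta$, $\mathbf{\Phi}_\theta$ and $\mathbf{W}_\theta$ are represented as 300×300 pixel images. As described in Section IV-B, the network outputs two images representing the unit vector components of $2\mathbf{\Phi}_\theta$, from which we calculate the grasp angles by $\mathbf{\Phi}_\theta = \frac{1}{2} \arctan \frac{\sin(2\mathbf{\Phi}_\theta)}{\cos(2\mathbf{\Phi}_\theta)}$.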
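The doubled-angle encoding and its inverse can be checked with a few lines of Python. This sketch uses `math.atan2` rather than a plain arctangent of the ratio so that the decoded angle stays well defined when the cosine component is zero or negative; the function names are illustrative only.

```python
import math

def encode_angle(phi):
    """Encode a grasp angle in [-pi/2, pi/2] as (cos(2*phi), sin(2*phi))."""
    return math.cos(2.0 * phi), math.sin(2.0 * phi)

def decode_angle(cos_2phi, sin_2phi):
    """Recover the grasp angle from its two unit-circle components."""
    return 0.5 * math.atan2(sin_2phi, cos_2phi)

# Round trip for a few angles inside the valid range.
round_trip_errors = [abs(decode_angle(*encode_angle(phi)) - phi)
                     for phi in (-1.5, -0.7, 0.0, 0.3, 1.5)]

# The two ends of the range describe the same antipodal grasp,
# so their encodings coincide (up to floating-point error).
end_a = encode_angle(math.pi / 2)
end_b = encode_angle(-math.pi / 2)
```

Because both ends of the range map to the same point on the unit circle, the network never has to regress across a jump in the target values.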

Our final GG-CNN contains 62,420 parameters, making it significantly smaller and faster to compute than the CNNs used for grasp candidate classification in other works which contain hundreds of thousands [10, 18] or millions [12, 21, 23, 25] of parameters. Our code is available at

IV-D Training

We train our network on 80% of our training dataset, and keep 20% as an evaluation dataset. We trained 95 networks with similar architectures but different combinations of convolutional filters and stride sizes for 100 epochs each.

To determine the best network configuration, we compare relative performance between our trained networks by evaluating each on detecting ground-truth grasps in our 20% evaluation dataset containing 1710 augmented images.

V Experimental Set-up

V-A Physical Components

To perform our grasping trials we use a Kinova Mico 6DOF robot fitted with a Kinova KG-2 2-fingered gripper.

Our camera is an Intel RealSense SR300 RGB-D camera. The camera is mounted to the wrist of the robot, approximately 80 mm above the closed fingertips and inclined towards the gripper. This set-up is shown in Fig. 4a.

The GG-CNN computations were performed on a PC running Ubuntu 16.04 with a 3.6 GHz Intel Core i7-7700 CPU and NVIDIA GeForce GTX 1070 graphics card. On this platform, the GG-CNN takes 6 ms to compute for a single depth image, and computation of the entire grasping pipeline (Section V-C) takes 19 ms, with the code predominantly written in Python.

V-A1 Limitations

The RealSense camera has a specified minimum range of 200 mm. In reality, we find that the RealSense camera is unable to produce accurate depth measurements from a distance closer than 150 mm, as the separation between the camera’s infra-red projector and camera causes the object to cast a shadow in the depth image. For this reason, when performing closed-loop grasping trials (Section V-D2), we stop updating the target grasp pose at this point, which equates to the gripper being approximately 70 mm from the object. Additionally, we find that the RealSense is unable to provide any valid depth data on many black or reflective objects.

The Kinova KG-2 gripper has a maximum stroke of 175 mm, which could easily envelop many of the test items. To encourage more precise grasps, we limit the maximum gripper width to approximately 70 mm. The fingers of the gripper have some built-in compliance and naturally splay slightly at the tips, so we find that objects with a height less than 15 mm (especially those that are cylindrical, like a thin pen) cannot be grasped.

V-B Test Objects

Fig. 5: The objects used for grasping experiments. Left: The 8 adversarial objects from [21]. Right: The 12 household objects selected from [5] and [16].

There is no set of test objects which is commonly used for robotic grasping experiments, with many authors using random “household” objects which are not easily reproducible. We propose here two sets of reproducible benchmark objects (Fig. 5) on which we test the grasp success rate of our approach.

Adversarial Set The first set consists of eight 3D-printed objects with adversarial geometry, which were used by Mahler et al. [21] to verify the performance of their Grasp Quality CNN. The objects all have complex geometry, meaning there is a high chance of a collision with the object in the case of an inaccurate grasp, as well as many curved and inclined surfaces which are difficult or impossible to grasp. The object models are available online as part of the released datasets for Dex-Net 2.0 [21].

Household Set This set of items contains twelve household items of varying sizes, shapes and difficulty with minimal redundancy (i.e. minimal objects with similar shapes). The objects were chosen from the standard robotic grasping datasets the ACRV Picking Benchmark (APB) [16] and the YCB Object Set [5], both of which provide item specifications and online purchase links. Half of the item classes (mug, screwdriver, marker pen, die, ball and clamp) appear in both data sets. We have made every effort to produce a balanced object set containing objects which are deformable (bear and cable), perceptually challenging (black clamp and screwdriver handle, thin reflective edges on the mug and duct tape, and clear packaging on the toothbrush), and objects which are small and require precision (golf ball, duck and die).

While both the APB and YCB object sets contain a large number of objects, many are physically impossible for our robot to grasp due to being too small and thin (e.g. screws, washers, envelope), too large (e.g. large boxes, saucepan, soccer ball) or too heavy (e.g. power drill, saucepan). While manipulating these objects is an open problem in robotics, we do not consider them for our experiments in order to compare our results to other work which use similar object classes to ours [12, 17, 18, 21, 23].

V-C Grasp Detection Pipeline

Our grasp detection pipeline comprises three stages: image processing, evaluation of the GG-CNN and computation of a grasp pose.

The depth image is first cropped to a square, and scaled to 300×300 pixels to suit the input of the GG-CNN. We inpaint invalid depth values using OpenCV [4].

The GG-CNN is then evaluated on the processed depth image, to produce the grasp map $\mathbf{G}_\theta$. We filter $\mathbf{Q}_\theta$ with a Gaussian kernel, similar to [12], and find this helps improve our grasping performance by removing outliers and causing the local maxima of $\mathbf{G}_\theta$ to converge to regions of more robust grasps.

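Finally, the best grasp pose in the image space $\tilde{\mathbf{g}}^*_\theta$ is computed by identifying the maximum pixel $\mathbf{s}^*$ in $\mathbf{Q}_\theta$, and the rotation and width are computed from $\mathbf{\Phi}_\theta|_{\mathbf{s}^*}$ and $\mathbf{W}_\theta|_{\mathbf{s}^*}$ respectively. The grasp in Cartesian coordinates $\mathbf{g}^*_\theta$ is computed via Eq. (1) (Fig. 4b).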
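The effect of smoothing the quality map before taking the maximum can be seen in a small pure-Python sketch. A single high-valued outlier pixel wins the raw maximum, but after a separable Gaussian filter the maximum moves to a broader region of consistently good quality. The kernel size, the value of sigma and the border handling (clamping) are assumptions chosen for the example; the text does not specify the filter parameters.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalised 1D Gaussian kernel with 2*radius + 1 taps."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(img, kernel):
    """Convolve every row with the kernel, clamping indices at the image border."""
    radius = len(kernel) // 2
    width = len(img[0])
    return [[sum(k * row[min(max(u + j - radius, 0), width - 1)]
                 for j, k in enumerate(kernel))
             for u in range(width)]
            for row in img]

def transpose(img):
    return [list(col) for col in zip(*img)]

def gaussian_filter(img, sigma=1.0, radius=3):
    """Separable 2D Gaussian filter: blur along rows, then along columns."""
    kernel = gaussian_kernel(sigma, radius)
    return transpose(blur_rows(transpose(blur_rows(img, kernel)), kernel))

def argmax2d(img):
    """Return (u, v) of the largest value (first one found on ties)."""
    best = (0, 0)
    for v, row in enumerate(img):
        for u, value in enumerate(row):
            if value > img[best[1]][best[0]]:
                best = (u, v)
    return best

# 11x11 quality map: one outlier pixel and one 3x3 region of good grasps.
quality = [[0.0] * 11 for _ in range(11)]
quality[1][1] = 1.0
for v in range(6, 9):
    for u in range(6, 9):
        quality[v][u] = 0.8

raw_peak = argmax2d(quality)
smoothed_peak = argmax2d(gaussian_filter(quality))
```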

V-D Grasp Execution

We evaluate the performance of our system using two grasping methods. Firstly, an open-loop grasping method similar to [17, 23, 21], where the best grasp pose is calculated from a single viewpoint and executed by the robot open-loop. Secondly, we implement a closed-loop visual servoing controller which we use for evaluating our system in dynamic environments.

V-D1 Open Loop Grasping

To perform open-loop grasps, the camera is positioned approximately 350 mm above and parallel to the surface of the table. An item is placed in the field of view of the camera. A depth image is captured and the pose of the best grasp is computed using the grasp detection pipeline. The robot moves to a pre-grasp position, with the gripper tips aligned with and approximately 170 mm above the computed grasp. From here, the robot moves straight down until the grasp pose is met or a collision is detected via force feedback in the robot. The gripper is closed and lifted, and the grasp is recorded as a success if the object is successfully lifted to the starting position.

V-D2 Closed Loop Grasping

To perform closed-loop grasping, we implement a Position Based Visual Servoing (PBVS) controller [14]. The camera is initially positioned approximately 400 mm above the surface of the table, and an object is placed in the field of view. Depth images are generated at a rate of 30 Hz and processed by the grasp detection pipeline to generate grasp poses in real time. There may be multiple similarly-ranked good quality grasps in an image, so to avoid rapidly switching between them, which would confuse the controller, we compute three grasps from the highest local maxima of $\mathbf{G}_\theta$ and select the one which is closest (in image coordinates) to the grasp used on the previous iteration. As the control loop is fast compared to the movement of the robot, there is unlikely to be a major change between frames. The system is initialised to track the global maximum of $\mathbf{Q}_\theta$ at the beginning of each grasp attempt. We represent the poses of the grasp $T_{\mathbf{g}^*_\theta}$ and the gripper fingers $T_f$ as 6D vectors comprising the Cartesian position and roll, pitch and yaw Euler angles $(x, y, z, \alpha, \beta, \gamma)$, and generate a 6D velocity signal for the end-effector:

$$\mathbf{v} = \lambda (T_{\mathbf{g}^*_\theta} - T_f)$$

where $\lambda$ is a 6D scale for the velocity, which causes the gripper to converge to the grasp pose. Simultaneously, we control the gripper fingers to the computed gripper width via velocity control. Control is stopped when the grasp pose is reached or a collision is detected. The gripper is closed and lifted and the grasp is recorded as a success if the object is successfully lifted to the starting position.
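The two control-loop steps described above, picking which grasp to track and turning the pose error into a velocity, can be sketched as follows. Candidate grasps are assumed to arrive as `(u, v, quality)` tuples for the detected local maxima, and poses as 6-element lists; these data layouts and the function names are hypothetical. The sketch applies the proportional law element-wise and does not handle Euler angle wrap-around, which a real controller would need to consider.

```python
import math

def select_tracked_grasp(candidates, previous):
    """Pick the grasp to track from (u, v, quality) local maxima.

    With no previous grasp, track the global maximum. Otherwise take the three
    highest-quality candidates and keep the one closest to the previous grasp
    in image coordinates.
    """
    top = sorted(candidates, key=lambda c: c[2], reverse=True)[:3]
    if previous is None:
        return top[0]
    return min(top, key=lambda c: math.hypot(c[0] - previous[0], c[1] - previous[1]))

def pbvs_velocity(grasp_pose, finger_pose, gain):
    """6D velocity command: per-axis gain times the pose error."""
    return [k * (g - f) for k, g, f in zip(gain, grasp_pose, finger_pose)]

candidates = [(10, 10, 0.90), (100, 100, 0.95), (50, 50, 0.50), (12, 11, 0.85)]
first = select_tracked_grasp(candidates, previous=None)
tracked = select_tracked_grasp(candidates, previous=(11, 10))

velocity = pbvs_velocity(
    grasp_pose=[0.1, 0.0, 0.2, 0.0, 0.0, 0.5],
    finger_pose=[0.0, 0.0, 0.4, 0.0, 0.0, 0.0],
    gain=[0.5] * 6,
)
```

In the example, the tracker stays on the grasp near the previous one even though a slightly higher-quality grasp exists elsewhere in the image.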

V-E Object Placement

To remove bias related to object pose, objects are shaken in a cardboard box and emptied into the robot’s workspace for each grasp attempt. The workspace is an approximately 250 × 300 mm area in the robot’s field of view in which the robot’s kinematics allow it to execute a vertical grasp.

VI Experiments

| | [17] | [23] | [12] | [21] | [18] | [32] | Ours |
|---|---|---|---|---|---|---|---|
| Grasp Success Rate (%) # | | | | | | | |
| Household Objects (Static) | 89 | 73 | 80 | 80 | 80 | | 92±5 |
| Adversarial Objects (Static) | | | | 93* | | | 84±8 |
| Household Objects (Dynamic) | | | | | | | 88±6 |
| Adversarial Objects (Dynamic) | | | | | | | 83±8 |
| Objects from [32] (Single) | | | | | | 98 | 100 |
| Objects from [32] (Clutter) | | | | | | 89 | 87±7 |
| Objects from [32] (Clutter, Dynamic) | | | | | | 77 | 81±8 |
| Network Parameters (approx.) | | 60M | 60M | 18M | 1M | | 62k |
| Computation Time (to generate pose or command) | 13.5s | | | 0.8s | 0.2-0.5s | 0.2s | 19ms |

TABLE II: Results from grasping experiments with 95% confidence intervals, and comparison to other deep learning approaches where available.

# Note that all experiments use different item sets and experimental protocol, so comparative performance is indicative only.
* Contrary to our approach, [21] train their grasp network on the adversarial objects.

To evaluate the performance of our grasping pipeline and GG-CNN, we perform several experiments comprising over 2000 grasp attempts. In order to compare our results to others, we aim to reproduce similar experiments where possible, and also aim to present experiments which are reproducible in themselves by using our defined set of objects (Section V-B) and defined dynamic motions.

Firstly, to most closely compare to existing work in robotic grasping, we perform grasping on singulated, static objects from our two object sets. Secondly, to highlight our primary contribution, we evaluate grasping on objects which are moved during the grasp attempt, to show the ability of our system to perform dynamic grasping. Thirdly, we show our system’s ability to generalise to dynamic cluttered scenes by reproducing the experiments from [32] and show improved results. Finally, we further show the advantage of our closed-loop grasping method over open-loop grasping by performing grasps in the presence of simulated kinematic errors of our robot’s control.

Table II provides a summary of our results in different grasping tasks and comparisons to other work where possible.

VI-A Static Grasping

To evaluate the performance of our GG-CNN under static conditions, we performed grasping trials using both the open- and closed-loop methods on both sets of test objects, using the set-up shown in Fig. 6a. We perform 10 trials on each object. For the adversarial object set, the grasp success rates were 84% (67/80) and 81% (65/80) for the open- and closed-loop methods respectively. For the household object set, the open-loop method achieved 92% (110/120) and the closed-loop 91% (109/120).
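Table II reports success rates with 95% confidence intervals. The text does not say how the intervals were computed; one common choice is the normal approximation to the binomial proportion, sketched below as an assumption. Applied to the counts above, it gives roughly ±5 percentage points for 110/120 and ±8 for 67/80.

```python
import math

def success_rate_ci(successes, trials, z=1.96):
    """Success rate and half-width of its confidence interval, both in percent.

    Uses the normal approximation p +/- z * sqrt(p * (1 - p) / n);
    z = 1.96 corresponds to a 95% interval.
    """
    p = successes / trials
    half_width = z * math.sqrt(p * (1.0 - p) / trials)
    return 100.0 * p, 100.0 * half_width

household_rate, household_ci = success_rate_ci(110, 120)
adversarial_rate, adversarial_ci = success_rate_ci(67, 80)
```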

A comparison to other work is provided in Table II. We note that the results may not be directly comparable due to the different objects and experimental protocol used, however we aim to show that we achieve comparable performance to other works which use much larger neural networks and have longer computation times. A noteworthy difference in method is [18], which does not require precise camera calibration, but rather learns the spatial relationship between the robot and the objects using vision.

VI-B Dynamic Grasping

Fig. 6: Grasping experiments. (a) Set-up for static grasping, and initial set-up for dynamic grasping. (b) During a dynamic grasp attempt, the object is translated at least 100 mm and rotated at least 25°, measured by the grid on the table. (c) Set-up for static grasping in clutter, and initial set-up for dynamic grasping in clutter. (d) During a dynamic grasp attempt, the cluttered objects are translated at least 100 mm and rotated at least 25°, measured by the grid on the table.

To perform grasps on dynamic objects we take inspiration from recent work in [32], where items are moved once by hand randomly during each grasp attempt. To assist reproducibility, we define this movement to consist of a translation of at least 100 mm and a rotation of at least 25° after the grasp attempt has begun, shown in Fig. 6a-b, which we measure using a grid on the table.

We perform 10 grasp attempts on each adversarial and household object using our closed-loop method, and achieve grasp success rates of 83% (66/80) for the adversarial objects and 88% (106/120) for the household objects. These results are not significantly different from our results on static objects (they fall within the corresponding 95% confidence bounds), showing our method’s ability to maintain a high level of accuracy when grasping dynamic objects.

We do not compare directly to an open-loop method, as the movement displaces the object far enough from its original position that no open-loop grasp would succeed.

VI-C Dynamic Grasping in Clutter

Fig. 7: Left: The objects used to reproduce the dynamic grasping in clutter experiment of [32]. Right: The test objects used by [32]. We have attempted to recreate the object set as closely as possible.

Viereck et al. [32] demonstrate a visuomotor controller for robotic grasping in clutter that is able to react to disturbances to the objects being grasped. As this work is closely related to our own, we have made an effort to recreate their experiments using objects as close as possible to their set of 10 (Fig. 7) to perform a comparison. Even though our GG-CNN has not been trained on cluttered environments, we show here its ability to perform grasping in the presence of clutter. We recreate the three grasping experiments from [32] as follows:

VI-C1 Isolated Objects

We performed 4 grasps on each of the 10 test objects (Fig. 7) in isolation, and achieved a grasp success rate of 100%, compared to 98% (39/40) in [32].

VI-C2 Cluttered Objects

The 10 test objects are shaken in a box and emptied in a pile below the robot (Fig. 6c). The robot attempts multiple grasps, and any objects that are grasped are removed. This continues until all objects are grasped, three consecutive grasps are failures or all objects are outside the workspace of the robot. We run this experiment 10 times.
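The stopping rule for a single run can be written as a short loop. This is a minimal sketch only: `run_clutter_trial`, `attempt_grasp` and `in_workspace` are hypothetical names standing in for the real robot and perception code, and the simulated 85% success probability in the usage example is an arbitrary placeholder, not a measured value.

```python
import random


def run_clutter_trial(objects, attempt_grasp, in_workspace,
                      max_consecutive_failures=3):
    """Run one clutter-clearing trial and return (attempts, successes).

    attempt_grasp(remaining) -> the grasped object, or None on failure.
    in_workspace(obj)        -> True if the object can still be reached.
    Both callables are hypothetical stand-ins for the real system.
    """
    remaining = list(objects)
    attempts = successes = consecutive_failures = 0

    while (
        remaining                                            # objects left
        and consecutive_failures < max_consecutive_failures  # not stuck
        and any(in_workspace(obj) for obj in remaining)      # reachable
    ):
        attempts += 1
        grasped = attempt_grasp(remaining)
        if grasped is not None:
            remaining.remove(grasped)   # grasped objects are removed
            successes += 1
            consecutive_failures = 0
        else:
            consecutive_failures += 1

    return attempts, successes


# Usage with a simulated gripper (placeholder success probability).
rng = random.Random(0)


def simulated_grasp(remaining):
    return remaining[0] if rng.random() < 0.85 else None


attempts, successes = run_clutter_trial(range(10), simulated_grasp,
                                        lambda obj: True)
print(f"{successes}/{attempts} grasps succeeded")
```

Summing attempts and successes over the 10 runs gives the aggregate success rate reported for the experiment.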

Despite our GG-CNN not being trained on cluttered scenes, we achieved a grasp success rate of 87% (83/96) compared to 89% (66/74) in [32]. Our most common failure cause was collision of the gripper with two objects that had fallen up against each other. 8 out of the 13 failed grasps were from two runs where objects had fallen into an ungraspable position and failed repeatedly. 8 out of the 10 runs finished with 0 or 1 grasp failures.

VI-C3 Dynamic Cluttered Objects

For dynamic scenes, we repeat the procedure as above with the addition of a random movement of the objects during the grasp attempt. Viereck et al. [32] do not give specifications for their random movement, so we use the same procedure as in Section VI-B, where we move the objects randomly, by at least 100 mm and 25°, during each grasp attempt (Fig. 6d).

In 10 runs of the experiment, we performed 94 grasp attempts of which 76 were successful (81%), compared to 77% (58/75) in [32]. Like the static case, 8 of the 18 failed grasps were from two runs where the arrangement of the objects resulted in repeated failed attempts. In the other 8 runs, all available objects (i.e. those that didn’t fall/roll out of the workspace) were successfully grasped with 2 or fewer failed grasps.

Although our GG-CNN was not trained on cluttered scenes, these results show our approach’s ability to perform grasping in clutter and to react to dynamic scenes, with only a 5% decrease in performance for the dynamic case compared to 12% in [32].

For the same experiments, [32] shows that an open-loop baseline approach which achieves a 95% grasp success rate for the static cluttered scenes achieves only a 23% grasp success rate for dynamic scenes, as it is unable to react to the change in item location.

VI-D Robustness to Control Errors

The control of a robot may not always be precise. For example, when performing grasping trials with a Baxter Research Robot, Lenz et al. [17] found that positioning errors of up to 20 mm were typical. A major advantage of using a closed-loop controller for grasping is the ability to perform accurate grasps despite inaccurate control. We show this by simulating an inaccurate kinematic model of our robot by introducing a cross-correlation between Cartesian ($x$, $y$ and $z$) velocities:

$$v_c = v \cdot \begin{bmatrix} 1 + c_{xx} & c_{xy} & c_{xz} \\ c_{yx} & 1 + c_{yy} & c_{yz} \\ c_{zx} & c_{zy} & 1 + c_{zz} \end{bmatrix}$$

where each $c \sim \mathcal{N}(0, \sigma^2)$ is sampled at the beginning of each grasp attempt. While a real kinematic error (e.g. a link length being incorrectly configured) would result in a more non-linear response, our noise model provides a good approximation which is independent of the robot’s kinematic model, so has a deterministic effect with respect to end-effector positioning and is more easily replicated on a different robotic system.
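A velocity cross-correlation of this kind can be sketched in a few lines. The code below assumes the perturbation takes the form of an identity matrix plus independent zero-mean Gaussian coefficients with standard deviation `sigma`, applied to the commanded velocity as a row vector; the function names and the example velocity are hypothetical.

```python
import random


def sample_cross_correlation(sigma, rng):
    """Return a 3x3 matrix: identity plus N(0, sigma^2) noise per entry.

    Assumed form of the noise model; sampled once per grasp attempt so the
    error is fixed (deterministic) for the duration of that attempt.
    """
    return [
        [(1.0 if i == j else 0.0) + rng.gauss(0.0, sigma) for j in range(3)]
        for i in range(3)
    ]


def corrupt_velocity(v, matrix):
    """Apply the perturbation with a row-vector convention: v_c = v . M."""
    return [sum(v[i] * matrix[i][j] for i in range(3)) for j in range(3)]


rng = random.Random(0)
matrix = sample_cross_correlation(0.1, rng)   # one sample per grasp attempt

v_commanded = [0.0, 0.0, -0.05]               # hypothetical (x, y, z) in m/s
v_executed = corrupt_velocity(v_commanded, matrix)
print("commanded:", v_commanded)
print("executed: ", [round(c, 4) for c in v_executed])
```

With a purely vertical command, the off-diagonal coefficients leak part of the motion into the lateral axes, which is the positioning error an open-loop controller cannot correct.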

We test grasping on both object sets with 10 grasp attempts per object for both the open- and closed-loop methods with noise standard deviation $\sigma$ = 0.0 (the baseline case), 0.05, 0.1 and 0.15. In the case of our open-loop controller, where we only control velocity for 170 mm in the $z$ direction from the pre-grasp pose (Section V-D1), this corresponds to having a robot with an end-effector precision described by a normal distribution with zero mean and standard deviation 0.0, 8.5, 17.0 and 25.5 mm respectively, by the relationship for scalar multiplication of the normal distribution:

$$x \sim \mathcal{N}(\mu, \sigma^2) \;\Rightarrow\; kx \sim \mathcal{N}(k\mu, k^2\sigma^2)$$
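The quoted standard deviations follow from scaling a zero-mean normal variable by the approach distance. The sketch below assumes that distance is 170 mm and that the noise levels are the standard deviations of the sampled coefficients; it computes the analytic values and compares one of them against a sampled estimate. The function names are illustrative.

```python
import math
import random
import statistics

APPROACH_DISTANCE_MM = 170.0          # assumed open-loop approach distance
SIGMAS = (0.0, 0.05, 0.1, 0.15)       # noise levels tested


def positioning_std_mm(sigma, distance_mm=APPROACH_DISTANCE_MM):
    """Std of k*c for c ~ N(0, sigma^2), i.e. |k| * sigma."""
    return abs(distance_mm) * sigma


def empirical_std_mm(sigma, samples, rng, distance_mm=APPROACH_DISTANCE_MM):
    """Monte Carlo estimate of the same quantity from sampled coefficients."""
    errors = [distance_mm * rng.gauss(0.0, sigma) for _ in range(samples)]
    return statistics.pstdev(errors)


stds = [positioning_std_mm(s) for s in SIGMAS]
for sigma, std in zip(SIGMAS, stds):
    print(f"sigma = {sigma:.2f} -> positioning std = {std:.1f} mm")

estimate = empirical_std_mm(0.05, 20000, random.Random(0))
print(f"sampled estimate for sigma = 0.05: {estimate:.2f} mm")
```

The sampled estimate should sit close to the analytic value for the same noise level, differing only by Monte Carlo error.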

The results are illustrated in Fig. 8, and show that the closed-loop method outperforms the open-loop method in the presence of control error. This highlights a major advantage of being able to perform closed-loop grasping, as the open-loop methods are unable to respond, achieving only 38% grasp success rate in the worst case. In comparison, the closed-loop method achieves 68% and 73% grasp success rate in the worst case on the adversarial and household objects respectively.

The decrease in performance of the closed-loop method is due to the limitation of our camera (Section V-A1): we are unable to update the grasp pose when the gripper is within 70 mm of the object, so cannot correct for errors in this range.

The addition of control inaccuracy most affects objects which require precise grasps (e.g. the adversarial objects, and small objects such as the die and ball). Simpler objects which are more easily caged by the gripper, such as the pen, still report good grasp results in the presence of kinematic error.

Fig. 8: Comparison of grasp success rates for open-loop and closed-loop control methods with velocity cross-correlation added to simulate kinematic errors (see Section VI-D for full details). The closed-loop method outperforms the open-loop method in all cases where kinematic errors are present. 10 trials were performed on each object in both the adversarial and household object sets.

VII Conclusion

We present our Generative Grasping Convolutional Neural Network (GG-CNN), an object-independent grasp synthesis model which directly generates grasp poses from a depth image on a pixelwise basis, instead of sampling and classifying individual grasp candidates like other deep learning techniques. Our GG-CNN is orders of magnitude smaller than other recent grasping networks, allowing us to generate grasp poses at a rate of up to 50 Hz and perform closed-loop control. We show through grasping trials that our system achieves state-of-the-art results in grasping unknown, dynamic objects, including objects in dynamic clutter. Additionally, our closed-loop grasping method significantly outperforms an open-loop method in the presence of simulated robot control error.

We encourage reproducibility in robotic grasping experiments by using two standard object sets, a set of eight 3D-printed objects with adversarial geometry [21] plus a proposed set of twelve household items from standard robotic benchmark object sets, and by defining the parameters of our dynamic grasping experiments. On our two object sets we achieve 83% and 88% grasp success rate respectively when objects are moved during the grasp attempt, and 81% for objects in dynamic clutter.


This research was supported by the Australian Research Council Centre of Excellence for Robotic Vision (project number CE140100016).