Antipodal Robotic Grasping using Generative Residual Convolutional Neural Network

by   Sulabh Kumra, et al.
Rochester Institute of Technology

In this paper, we tackle the problem of generating antipodal robotic grasps for unknown objects from an n-channel image of the scene. We propose a novel Generative Residual Convolutional Neural Network (GR-ConvNet) model that can generate robust antipodal grasps from n-channel input at real-time speeds (about 20 ms). We evaluate the proposed model architecture on standard datasets and previously unseen household objects. We achieved state-of-the-art accuracy of 97.7% and 94.6% on the Cornell and Jacquard grasping datasets respectively, and demonstrate a 93.5% grasp success rate on previously unseen real-world objects. An open-source implementation of GR-ConvNet is available.




I Introduction

Robotic manipulators are constantly compared to humans, who instinctively grasp unknown objects rapidly and with ease based on their own experience. As more research aims to make robots more intelligent, there is a demand for a generalized technique to infer fast and robust grasps for any kind of object that the robot encounters. The major challenge is precisely transferring the knowledge that the robot learns to novel real-world objects.

In contrast to other techniques used in the past [20, 28, 27], we present a different approach to the problem of grasping unknown objects. The core of our research is a Generative Residual Convolutional Neural Network (GR-ConvNet) that provides a generalized solution for grasping novel objects. GR-ConvNet generates antipodal grasps for every pixel in an n-channel input image by generating images of grasp score, angle and width. We use the term generative to distinguish our method from techniques that output a grasp probability or classify grasp candidates in order to predict the best grasp. Fig. 1 shows the proposed system architecture. An RGB-D camera captures the scene as RGB-D images. The pre-processing unit conditions this input and produces an RGB image and its aligned depth image in the format required by GR-ConvNet. The network then outputs quality, angle and width images, which are used to infer grasp rectangles on novel real objects.

In robotic grasping, it is essential to generate grasps that are not only robust but also require little computation time. Our technique demonstrates both, generating robust grasps with the lowest recorded inference time of 19 ms on the Cornell Grasp dataset as well as on the new Jacquard dataset. We also demonstrate that our technique works equally well on novel real objects.

The contributions of this paper can be summarized as follows:

  • We propose a novel generative residual convolutional neural network based model architecture which detects objects in the camera’s field of view and predicts a suitable antipodal grasp configuration for the objects in the image.

  • We evaluate our model on publicly available grasping datasets and achieve state-of-the-art accuracy of 97.7% and 94.6% on the Cornell and Jacquard grasping datasets respectively.

  • We demonstrate that the proposed model can be deployed in the real world using an Intel RealSense camera and predicts grasps with a success rate of 93.5% on previously unseen objects.

Fig. 1: Proposed system overview

II Related Work

Robotic Grasping: There has been extensive ongoing research in the field of robotics, especially robotic grasping. Although the problem may seem to be simply finding a suitable grasp for an object, the actual task involves multifaceted elements such as the object to be grasped, its shape and physical properties, and the gripper with which it is to be grasped. Early research in this field involved hand-engineering the features [23, 18], which can be tedious and time-consuming but can be helpful for learning to grasp objects with multiple fingers, as in [17, 7].

Initially, to obtain a stable grasp, the mechanics and contact kinematics of the end effector in contact with the object were studied and grasp analysis was performed, as seen in the surveys [6, 5, 32]. Prior work [30] on robotic grasping of novel objects used supervised learning trained on synthetic data, but it was limited to environments such as an office, a kitchen and a dishwasher. Satish et al. [29] introduced a Fully Convolutional Grasp Quality Convolutional Neural Network (FC-GQ-CNN) which predicted a robust grasp quality by using a data collection policy and a synthetic training environment. This method increased the number of grasps considered by a factor of 5000 in 0.625 s. Bousmalis et al. [8] discussed domain adaptation and simulation in order to bridge the gap between simulated and real-world data. In that pixel-level domain adaptation model, GraspGAN was used to generate adapted images that are similar to real ones and are differentiated by the discriminator network. Tremblay et al. [34] worked on a similar problem to Bousmalis et al. They used a deep network trained only on synthetic images to estimate the 6-DoF pose of known objects. However, this has been shown to work on household items only. James et al. [13] discuss a Randomized-to-Canonical Adaptation Networks (RCANs) method that learns to translate images from randomized simulated environments to their equivalent simulated canonical images using an image-conditioned GAN. They then use this to train their RL algorithm for real-world images. Further, [38] presents an actor-critic approach in which an actor network samples grasps directly and a critic network re-scores the actor's results to find stable and robust grasps. The current research relies entirely on RGB-D data to acquire good grasps, and these approaches depend wholly on machine learning techniques.

Deep learning for grasping: Deep learning has been a hot topic of research since the success on ImageNet and the adoption of GPUs and other fast computational techniques. The availability of affordable RGB-D sensors has also enabled the use of deep learning techniques to learn the features of objects directly from image data. Recent experiments with convolutional neural networks [28, 31, 40] have demonstrated that they can be used to efficiently compute stable grasps. Pinto et al. [27] used an architecture similar to AlexNet to show that, by increasing the size of the data, their CNN was able to generalize better to new data. Varley et al. [35] propose an interesting approach to grasp planning through shape completion, where a 3D CNN was trained on 3D prototypes of objects in their own dataset captured from various viewpoints. Guo et al. [12] used tactile data along with visual data to train a hybrid deep architecture. Mahler et al. [22] proposed a Grasp Quality Convolutional Neural Network (GQ-CNN) that predicts grasps from synthetic point cloud data and is trained on the Dex-Net 2.0 grasp planner dataset. Levine et al. [21] discuss the use of monocular images for hand-eye coordination in robotic grasping using a deep learning framework. They use a CNN for grasp success prediction and continuous servoing of the manipulator to correct mistakes. Asif et al. [3] introduce a consolidated framework known as EnsembleNet, in which a grasp generation network generates four grasp representations and EnsembleNet synthesizes these generated grasps to produce grasp scores, from which the grasp with the highest score is selected. Antanas et al. [1] discuss an interesting approach, a probabilistic logic framework, that is said to improve the grasping capability of a robot with the help of semantic object parts. This framework combines high-level reasoning with low-level grasping. The high-level reasoning comprises object affordances, categories and task-based information, while the low-level reasoning uses visual shape features. This has been observed to work well in kitchen-related scenarios.

Grasping using uni-modal data: Johns et al. [15] used a simulated depth image to predict a grasp outcome for every predicted grasp pose and selected the best grasp by smoothing the predicted pose using a grasp uncertainty function. A generative approach to grasping is discussed by Morrison et al. [24]. Their Generative Grasp CNN architecture generates grasp poses from a depth image, computing grasps on a pixel-wise basis, which the authors argue reduces the existing shortcomings of discrete sampling and computational complexity. Another recent approach that relies solely on depth data as input to a deep CNN is presented in [31].

Fig. 2: Proposed Generative Residual Convolutional Neural Network

Grasping using multi-modal data: There are different ways of handling the multiple modalities of objects. Many approaches learn separate features for each modality, which can be computationally exhaustive. Wang et al. [36] proposed methods that treat multimodal information as the same. Jiang et al. [14] used RGB-D images to infer grasps with a two-step learning process: the first step narrowed down the search space, and the second step computed the optimal grasp rectangle from the top grasps obtained in the first step. Lenz et al. [20] used a similar two-step approach with a deep learning architecture, which however did not work well on all types of objects and often predicted a grasp location that was not the best grasp for the object. For example, in [14] the algorithm predicted a grasp on the laces of a shoe, which failed in practice when the robot tried to grasp the shoe by its laces, while in [20] the algorithm sometimes could not predict more practical grasps because it used only local information and because of the RGB-D sensor used. Yan et al. [39] used a point cloud prediction network to generate a grasp by first pre-processing the data to obtain the color, depth and masked images, and then obtaining a 3D point cloud of the object to be fed into a critic network to predict a grasp. Chu et al. [9] propose a novel architecture that can predict multiple grasps for multiple objects simultaneously instead of for a single object. For this they used a multi-object dataset of their own, and the model was also tested on the Cornell Grasp Dataset. Ogas et al. [26] discuss a robotic grasping method that consists of a ConvNet for object recognition and a grasping method for manipulating the objects. The grasping method assumes an industrial assembly line where the object parameters are known in advance. Kumra et al. [19] proposed a deep CNN architecture that uses residual layers for predicting robust grasps. The paper demonstrates that a deeper network along with residual layers learns better features and performs faster. Our work is based on similar concepts and is designed to advance the research done in this area.

III Problem Formulation

In this work, we define the problem of robotic grasping as predicting antipodal grasps for unknown objects from an n-channel image of the scene and executing them on a robot.

Instead of the 5-dimensional grasp representation used in [20, 28, 19], we use an improved grasp representation similar to the one proposed by Morrison et al. [24]. We denote the grasp pose in the robot frame as:

$$G_r = (P, \Theta_r, W_r, Q) \tag{1}$$

where $P = (x, y, z)$ is the tool tip's center position, $\Theta_r$ is the tool's rotation around the z-axis, $W_r$ is the required width for the tool, and $Q$ is the grasp quality score.

We detect a grasp from an n-channel image $I \in \mathbb{R}^{n \times h \times w}$ with height $h$ and width $w$. A grasp in the image can be defined as:

$$G_i = (x, y, \Theta_i, W_i, Q) \tag{2}$$

where $(x, y)$ corresponds to the center of the grasp in image coordinates, $\Theta_i$ is the rotation in the camera's frame of reference, $W_i$ is the required width in image coordinates, and $Q$ is the same scalar as in (1).

The grasp quality score $Q$ is the quality of the grasp at every point in the image, expressed as a value between 0 and 1, where a value closer to 1 indicates a greater chance of grasp success. $\Theta_i$ indicates the angular rotation required at each point to grasp the object of interest and is represented as a value in the range $[-\frac{\pi}{2}, \frac{\pi}{2}]$. $W_i$, the required width, is represented as a measure of uniform depth in the range $[0, W_{max}]$ pixels, where $W_{max}$ is the maximum width of the antipodal gripper.

To execute a grasp obtained in the image space on a robot, we can apply the following transformations to convert the image coordinates to the robot's frame of reference:

$$G_r = T_{rc}(T_{ci}(G_i)) \tag{3}$$

where $T_{ci}$ is a transformation that converts image space into the camera's 3D space using the intrinsic parameters of the camera, and $T_{rc}$ converts camera space into robot space using the camera pose calibration value.
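As a rough illustration of the two transformations described above, the sketch below back-projects a grasp center from pixel coordinates into the camera frame with a pinhole model and then maps it into the robot frame with a homogeneous transform. The intrinsics (`fx`, `fy`, `cx`, `cy`), the depth value and the 4×4 camera pose are placeholder values, and the function names are hypothetical; the paper does not specify this implementation.

```python
import numpy as np

def pixel_to_camera(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth in metres into the camera frame (pinhole model)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def camera_to_robot(p_cam, T_robot_cam):
    """Apply a 4x4 homogeneous camera-to-robot transform to a 3D point."""
    p_h = np.append(p_cam, 1.0)
    return (T_robot_cam @ p_h)[:3]

# Placeholder calibration values (assumptions, not taken from the paper).
fx = fy = 600.0
cx, cy = 320.0, 240.0
T_robot_cam = np.eye(4)
T_robot_cam[:3, 3] = [0.5, 0.0, 0.8]  # camera position relative to the robot base, metres

p_cam = pixel_to_camera(u=380, v=240, depth_m=0.6, fx=fx, fy=fy, cx=cx, cy=cy)
p_robot = camera_to_robot(p_cam, T_robot_cam)
```

The grasp angle and width need analogous conversions from image to robot coordinates, which are omitted here for brevity.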

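This notation can be scaled for multiple grasps in an image. The collective group of all the grasps can be denoted as:

$$\mathbf{G} = (\mathbf{\Theta}, \mathbf{W}, \mathbf{Q}) \in \mathbb{R}^{3 \times h \times w} \tag{4}$$

where $\mathbf{\Theta}$, $\mathbf{W}$ and $\mathbf{Q}$ represent three images in the form of grasp angle, grasp width and grasp quality score respectively, calculated at every pixel of an image using (2).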

IV Approach

Deep learning has redefined how robotic grasping is approached. CNNs have improved the way object detection and classification problems are handled in computer vision, and state-of-the-art results have been obtained by using residual networks for deeper architectures [19, 41]. Our approach is inspired by these two network architectures and combines them into a novel and improved architecture. We propose a Generative Residual Convolutional Neural Network (GR-ConvNet) that is used to predict a suitable grasp configuration for the objects detected in the camera's field of view.

IV-A Overview

We present a unique and generalized solution for the problem of robotic grasping. Unlike previous work in robotic grasping [3, 33], where the required grasp is predicted as a grasp rectangle calculated by choosing the best grasp from multiple grasp probabilities, our network generates three images from which we can infer grasp rectangles for multiple objects. Additionally, multiple grasp rectangles for multiple objects can be inferred from the output of GR-ConvNet in one shot, thereby decreasing the overall computational time.

An overview of our system's architecture is shown in Fig. 1. Our network uses an n-channel input and is not limited to a particular input modality such as a depth-only or RGB-only image, which makes it general across input modalities. Our approach consists of three parts. First, the input data is pre-processed: it is cropped, resized and normalized. If the input has a depth image, it is inpainted to obtain a depth representation [37]. The n-channel processed input image is fed into the GR-ConvNet. Second, the network generates three images as output, namely grasp angle, grasp width and grasp quality score, using the features extracted from the pre-processed image. Third, grasp rectangles are inferred from the three output images.
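The first of these three parts can be sketched as follows. This is a minimal illustration with NumPy only: it covers cropping and normalization and assembles a 4-channel (depth plus RGB) input. The 224-pixel crop size, the specific normalization scheme and all function names are assumptions made for the example, and the resizing and depth inpainting steps are left out.

```python
import numpy as np

def center_crop(img, size):
    """Crop a (H, W) or (H, W, C) array to a centered size x size window."""
    h, w = img.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]

def normalize_rgb(rgb_uint8):
    """Scale 8-bit RGB to [0, 1] and subtract the per-image mean (assumed scheme)."""
    rgb = rgb_uint8.astype(np.float32) / 255.0
    return rgb - rgb.mean()

def normalize_depth(depth):
    """Zero-center depth and clip to [-1, 1] (assumed scheme)."""
    d = depth.astype(np.float32)
    return np.clip(d - d.mean(), -1.0, 1.0)

def build_input(rgb_uint8, depth, size=224):
    """Stack normalized depth and RGB into a channel-first (4, size, size) array."""
    rgb = normalize_rgb(center_crop(rgb_uint8, size))
    d = normalize_depth(center_crop(depth, size))
    return np.concatenate([d[None, ...], np.transpose(rgb, (2, 0, 1))], axis=0)

# Synthetic 640x480 RGB-D frame standing in for a camera capture.
rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
depth = rng.uniform(0.4, 1.2, size=(480, 640))
x = build_input(rgb, depth)
```

Dropping the depth channel or the RGB channels from the stack gives the RGB-only (n = 3) or depth-only (n = 1) variants of the same input.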

IV-B Model architecture

Fig. 2 shows the proposed GR-ConvNet model, which is a generative architecture that takes in an n-channel input image and generates pixel-wise grasps in the form of three images. The n-channel image is passed through three convolutional layers, followed by five residual layers, and finally through convolution transpose layers that generate four images. These output images consist of the grasp quality score, the required angle in the form of $\cos 2\Theta$ and $\sin 2\Theta$, and the required width of the end effector. Since the antipodal grasp is symmetric around $\pm\frac{\pi}{2}$, we extract the angle in the form of two elements, $\cos 2\Theta$ and $\sin 2\Theta$, which output distinct values that are then combined to infer the required angle image.
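The angle recovery described above can be sketched in a few lines. The example assumes the network outputs are available as NumPy arrays; the function names and the simple "highest quality pixel" selection rule are illustrative assumptions, not the authors' published code. Doubling the angle makes a grasp at $\Theta$ and one at $\Theta + \pi$ map to the same pair of values, and halving the result of `arctan2` brings the angle back into $[-\frac{\pi}{2}, \frac{\pi}{2}]$.

```python
import numpy as np

def decode_angle(cos2t, sin2t):
    """Recover theta in [-pi/2, pi/2] from pixel-wise cos(2*theta) and sin(2*theta)."""
    return 0.5 * np.arctan2(sin2t, cos2t)

def best_grasp(quality, cos2t, sin2t, width):
    """Return (row, col, theta, width, q) at the pixel with the highest quality score."""
    row, col = np.unravel_index(np.argmax(quality), quality.shape)
    theta = decode_angle(cos2t[row, col], sin2t[row, col])
    return int(row), int(col), float(theta), float(width[row, col]), float(quality[row, col])

# Toy 4x4 output maps with a single clear peak (synthetic values).
h = w = 4
true_theta = np.pi / 3
quality = np.zeros((h, w))
quality[2, 1] = 0.9
cos2t = np.full((h, w), np.cos(2 * true_theta))
sin2t = np.full((h, w), np.sin(2 * true_theta))
width = np.full((h, w), 30.0)

row, col, theta, wd, q = best_grasp(quality, cos2t, sin2t, width)
```

To obtain several grasps from one forward pass, the single `argmax` could be replaced by a local-maximum search over the quality image.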

The convolutional layers extract the features from the input image. The output of the convolutional layers is then fed into 5 residual layers. Accuracy generally increases with the number of layers. However, this no longer holds beyond a certain depth, where vanishing gradients and dimensionality errors cause saturation and degradation in accuracy. Residual layers address this by using skip connections, which make it easier to learn identity functions. After passing through these convolutional and residual layers, the size of the image is reduced to $56 \times 56$, which can be difficult to interpret. Therefore, to make the output easier to interpret and to retain the spatial features of the image after the convolution operations, we up-sample the image using convolution transpose operations. Thus, the output image has the same size as the input.
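The down-sampling and up-sampling described above can be checked with the standard output-size formulas for convolution and transposed convolution. The kernel sizes, strides, paddings and the 224-pixel input below are hypothetical settings chosen for the example, not values stated in the paper; the residual layers are assumed to preserve spatial size.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride, padding, output_padding=0):
    """Output length of a transposed convolution along one spatial dimension."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

# Hypothetical layer settings as (kernel, stride, padding[, output_padding]).
down = [(9, 1, 4), (4, 2, 1), (4, 2, 1)]
up = [(4, 2, 1, 0), (4, 2, 1, 0), (9, 1, 4, 0)]

size = 224  # assumed input resolution
sizes = [size]
for k, s, p in down:
    size = conv_out(size, k, s, p)
    sizes.append(size)
for k, s, p, op in up:
    size = conv_transpose_out(size, k, s, p, op)
    sizes.append(size)
```

With these settings the feature map shrinks by a factor of four before the residual layers and is restored to the input resolution by the transposed convolutions.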

Our network has a total of 1,900,900 parameters, which makes it comparatively small next to other networks [19, 41]. It is therefore computationally less expensive and faster than other architectures that use similar grasp prediction techniques but contain many millions of parameters and more complex structures. The lightweight nature of the model makes it suitable for closed-loop control at a rate of up to 50 Hz.

IV-C Training methodology

For a dataset having objects $D = \{D_1 \dots D_n\}$, input scene images $I = \{I_1 \dots I_n\}$ and successful grasps in image frame $G_i = \{g^1_1 \dots g^1_{m_1} \dots g^n_1 \dots g^n_{m_n}\}$, we can train our model end-to-end to learn the mapping function $\gamma(I, D) = G_i$ by minimizing the negative log-likelihood of $G_i$ conditioned on the input image scene $I$, which is given by:

$$-\frac{1}{n} \sum_{i=1}^{n} \frac{1}{m_i} \sum_{j=1}^{m_i} \log \gamma(g^i_j \mid I_i) \tag{5}$$

The models were trained using the Adam optimizer [11] with standard backpropagation and the mini-batch SGD technique [25]. The learning rate was set to $10^{-3}$ and a mini-batch size of 8 was used. We trained each model using three random seeds and report the average over the three seeds.

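IV-D Loss function

We analyzed the performance of various loss functions for our network and, after running a few trials, found that the smooth L1 loss, also known as Huber loss, works best for handling exploding gradients. We define our loss as:

$$L(G_i, \hat{G}_i) = \frac{1}{n} \sum_{k} z_k \tag{6}$$

where $z_k$ is given by:

$$z_k = \begin{cases} 0.5\,(G_{i_k} - \hat{G}_{i_k})^2, & \text{if } |G_{i_k} - \hat{G}_{i_k}| < 1 \\ |G_{i_k} - \hat{G}_{i_k}| - 0.5, & \text{otherwise} \end{cases} \tag{7}$$

$G_i$ is the grasp generated by the network and $\hat{G}_i$ is the ground truth grasp.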
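For reference, a minimal NumPy sketch of the smooth L1 loss named in this section is given below, with the usual transition point at an absolute difference of 1. The function name and the sample values are illustrative only.

```python
import numpy as np

def smooth_l1(pred, target):
    """Mean smooth L1 loss: quadratic for small errors, linear for large ones."""
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    z = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return z.mean()

pred = np.array([0.2, 1.0, 3.0])
target = np.array([0.0, 0.5, 0.0])
loss = smooth_l1(pred, target)  # per-element terms: 0.02, 0.125, 2.5
```

The linear branch bounds the gradient magnitude for large errors, which is the property that helps against exploding gradients.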

(a) Cornell Dataset
(b) Jacquard Dataset
Fig. 3: Results on Grasping Datasets

V Evaluation

V-A Datasets

There are a limited number of publicly available antipodal grasping datasets. Table I summarizes them. We used two of these datasets for training and evaluating our model. The first is the Cornell grasp dataset [14], which is the most common grasping dataset used to benchmark results, and the second is the more recent Jacquard grasping dataset [10], which is more than 50 times larger than the Cornell grasp dataset.

| Dataset | Modality | Objects | Images | Grasps |
|---|---|---|---|---|
| Cornell | RGB-D | 240 | 1035 | 8019 |
| Dexnet | Depth | 1500 | 6.7M | 6.7M |
| Jacquard | RGB-D | 11k | 54k | 1.1M |

TABLE I: Summary of Antipodal Grasping Datasets

The extended version of the Cornell Grasp Dataset comprises 1035 RGB-D images with a resolution of $640 \times 480$ pixels of 240 different real objects, with 5110 positive and 2909 negative grasps. The annotated ground truth consists of several grasp rectangles representing grasping possibilities per object. However, it is a small dataset for training our GR-ConvNet model, so we create an augmented dataset using random crops, zooms and rotations, which effectively has 51k grasp examples. Only positively labelled grasps from the dataset were considered during training.

The Jacquard Grasping Dataset is built on a subset of ShapeNet, which is a large dataset of CAD models. It consists of 54k RGB-D images and annotations of successful grasping positions based on grasp attempts performed in a simulated environment. In total, it has 1.1M grasp examples. As this dataset was large enough to train our model, no augmentation was performed.

V-B Grasp Detection Metric

For a fair comparison of our results, we use the rectangle metric [14] proposed by Jiang et al. to report the performance. According to the proposed rectangle metric, a grasp is considered as a valid one when it satisfies the following two conditions:

  • The intersection over union (IoU) score between the ground truth grasp rectangle and the predicted grasp rectangle is more than 25%.

  • The offset between the grasp orientation of the predicted grasp rectangle and the ground truth rectangle is less than 30°.

This metric requires a grasp rectangle representation, but our model predicts image based grasp representation using equation 2. Therefore, in order to convert from image based grasp representation to rectangle representation, the value corresponding to each pixel in the output image is mapped to its equivalent rectangle representation.
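A minimal sketch of the rectangle metric is shown below. It assumes the thresholds commonly used with this metric (IoU above 25% and orientation offset below 30°) and approximates the IoU by rasterizing both rotated rectangles onto a pixel grid rather than computing exact polygon intersections. The `(cx, cy, theta, length, height)` rectangle parameterization and all function names are hypothetical choices for the example.

```python
import numpy as np

def rect_mask(cx, cy, theta, length, height, shape):
    """Boolean mask of pixels inside a rectangle centred at (cx, cy) and rotated by theta."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    dx, dy = xs - cx, ys - cy
    # Rotate pixel offsets into the rectangle's own axes.
    u = dx * np.cos(theta) + dy * np.sin(theta)
    v = -dx * np.sin(theta) + dy * np.cos(theta)
    return (np.abs(u) <= length / 2) & (np.abs(v) <= height / 2)

def angle_diff(a, b):
    """Smallest difference between two grasp angles, treating theta and theta + pi as equal."""
    d = abs(a - b) % np.pi
    return min(d, np.pi - d)

def is_valid_grasp(pred, gt, shape=(224, 224), iou_thresh=0.25, angle_thresh=np.deg2rad(30)):
    """Rectangle metric: IoU above the threshold and orientation offset below the threshold."""
    m_pred, m_gt = rect_mask(*pred, shape), rect_mask(*gt, shape)
    union = np.logical_or(m_pred, m_gt).sum()
    iou = np.logical_and(m_pred, m_gt).sum() / union if union else 0.0
    return bool(iou > iou_thresh and angle_diff(pred[2], gt[2]) < angle_thresh)

# Rectangles as (cx, cy, theta, length, height); synthetic values.
gt = (112, 112, 0.0, 60, 20)
close = (116, 112, np.deg2rad(10), 60, 20)
rotated = (112, 112, np.deg2rad(80), 60, 20)
```

Here `is_valid_grasp(close, gt)` passes both conditions, while `rotated` is rejected because its orientation differs from the ground truth by 80°. In evaluation, a prediction is typically counted as correct if it matches any one of the ground truth rectangles annotated for the object.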

V-C Setup

To get the scene image for the real-world experiments, we used the Intel RealSense Depth Camera D435, which uses stereo vision to calculate depth. It consists of an RGB sensor, a pair of depth sensors and an infrared projector.

The execution times for our proposed GR-ConvNet are measured on a system running Ubuntu 16.04 with an Intel Core i7-7800X CPU clocked at 3.50 GHz and an NVIDIA GeForce GTX 1080 Ti graphics card with CUDA 10.

V-D Experiments

In our experiments, we evaluate our approach on two datasets and on real-world objects. A number of experiments were performed by training the proposed GR-ConvNet on the two datasets while varying parameters including filter size, batch size, learning rate and the number of layers. After evaluating 12 different architectures, we selected the architecture that gave state-of-the-art results along with the lowest recorded inference time. This confirmed that GR-ConvNet performs efficiently in real-world scenarios, which is of utmost importance.

Further, we demonstrate the viability of our method in comparison to other methods by gauging the performance of our network on different types of input objects. Additionally, we evaluate the performance of our network on different input modalities. The modalities that the model was tested on included uni-modal input such as depth only and RGB only input images and multi-modal input such as RGB-D images. Our network performed better on multi-modal data since multiple input modalities enabled better learning of the input features.

Furthermore, the performance was validated on actual objects using an Intel RealSense depth camera. Random household objects were chosen to validate the results obtained on the two datasets. The effectiveness of the network was tested in different environments, including varying light conditions, for grasping individual objects and for grasping objects in a cluster.

VI Results

In this section, we discuss the results from our experiments. We evaluate GR-ConvNet on both the Cornell and the Jacquard datasets to examine the outcomes for each dataset based on factors such as the size of the dataset and the type of training data, and to demonstrate our model's capacity to generalize to any kind of object. Further, we show that our model is also able to generate multiple grasps for multiple objects in a cluster. Fig. 3(a) and 3(b) show the results obtained on previously unseen objects in the Cornell Grasp dataset and the Jacquard dataset respectively. The figure shows the output in image representation in the form of the grasp quality score $Q$, the required angle for grasping $\Theta_i$ and the required gripper width $W_i$. It also includes the output in the form of the rectangle grasp representation projected on the RGB and depth images. We also report the grasp score for the rectangle representation on both the RGB and depth images for both datasets. Additionally, we include our results for real-world objects in Fig. 5. The results for each dataset and for real objects are discussed in detail in the following sections.

(a) Cornell dataset
(b) Jacquard dataset
Fig. 4: Validation loss
| Authors | Algorithm | Accuracy IW (%) | Accuracy OW (%) | Speed (ms) |
|---|---|---|---|---|
| Jiang [14] | Fast Search | 60.5 | 58.3 | 5000 |
| Lenz [20] | SAE, struct. reg. | 73.9 | 75.6 | 1350 |
| Redmon [28] | AlexNet, MultiGrasp | 88.0 | 87.1 | 76 |
| Wang [36] | Two-stage closed-loop | 85.3 | - | 140 |
| Asif [2] | STEM-CaRFs | 88.2 | 87.5 | - |
| Kumra [19] | ResNet-50x2 | 89.2 | 88.9 | 103 |
| Morrison [24] | GG-CNN | 73.0 | 69.0 | 19 |
| Guo [12] | ZF-net | 93.2 | 89.1 | - |
| Zhou [41] | FCGN, ResNet-101 | 97.7 | 96.6 | 117 |
| Karaoguz [16] | GRPN | 88.7 | - | 200 |
| Asif [4] | GraspNet | 90.2 | 90.6 | 24 |
| Ours | GR-ConvNet-D | 92.8 | 90.6 | 19 |
| Ours | GR-ConvNet-RGB | 94.5 | 93.2 | 19 |
| Ours | GR-ConvNet-RGB-D | 97.7 | 96.8 | 20 |

TABLE II: Accuracy on the Cornell Dataset

VI-A Cornell Dataset

For the Cornell grasp dataset we follow a cross-validation setup as in previous works [20, 28, 19, 4, 12], using image-wise (IW) and object-wise (OW) data splits. The results obtained on the previously unseen objects in the dataset show that our network can predict robust grasps for different types of objects in the validation set. The data augmentation performed on the Cornell Grasp dataset improved the overall performance of the network. Table II shows the performance of our system for multiple modalities in comparison to other techniques used for grasp prediction. We obtained state-of-the-art accuracy of 97.7% on the image-wise split and 96.8% on the object-wise split on RGB-D data, matching the best prior image-wise result and exceeding all prior object-wise results listed in Table II. Further, the recorded execution time of 20 ms suggests that GR-ConvNet is suitable for real-world closed-loop applications. Fig. 4(a) shows the validation loss for the Cornell dataset on RGB-only, depth-only and RGB-D data. The plot shows that the loss decreases and is lowest for RGB-D input, and it also indicates that the model is not overtrained.

VI-B Jacquard Dataset

For the Jacquard dataset, we trained our network on 90% of the dataset images and validated on the remaining 10%. As the Jacquard dataset is much larger than the Cornell dataset, no data augmentation was required. We performed similar experiments using multiple modalities on the Jacquard dataset and obtained state-of-the-art results with an accuracy of 94.6% on RGB-D data. This shows that our network not only gives the best results on the Cornell Grasp dataset but also outperforms prior methods on the Jacquard dataset. Fig. 4(b) shows the validation loss for the Jacquard dataset on RGB-only, depth-only and RGB-D data. The loss plot indicates that the loss is equally low for depth-only and RGB-D data and is higher for RGB-only input.

| Authors | Algorithm | Accuracy (%) |
|---|---|---|
| Depierre [10] | Jacquard | 74.2 |
| Morrison [24] | GG-CNN2 | 84 |
| Zhou [41] | FCGN, ResNet-101 | 91.8 |
| Ours | GR-ConvNet-D | 93.7 |
| Ours | GR-ConvNet-RGB | 91.8 |
| Ours | GR-ConvNet-RGB-D | 94.6 |

TABLE III: Accuracy on the Jacquard Dataset
(a) Novel objects
(b) Objects in cluster
Fig. 5: Results on novel real-world objects

VI-C Novel real-world objects

Along with state-of-the-art results on two standard datasets, we also demonstrate that our network performs equally well on novel real-world objects. We used 20 random household objects, previously unseen by the network, to evaluate the performance of our model in the physical world. Table IV shows the results for 10 of these objects using models trained on the Cornell and Jacquard datasets. The results in Table IV and Fig. 5(a) indicate that GR-ConvNet is able to generalize well to new objects that it has never seen before. The model was able to generate good grasps for all the objects except a transparent bottle. As seen in Fig. 5(a), the depth map for the transparent bottle was poor, with black spots indicating no depth information. This could be because the Intel RealSense camera was unable to capture depth data due to possible object reflections. However, using the depth data together with the RGB data, the model was still able to generate a fairly good grasp for the transparent bottle in some cases.

VI-D Objects in cluster

Along with predicting good grasps for novel real objects, our model is able to predict multiple antipodal grasps for multiple objects in a cluster. Fig. 5(b) displays the results obtained for different household objects placed together. Despite being trained only on isolated objects, the model is able to efficiently predict grasps for multiple objects. Moreover, as seen in the figure, it can even predict multiple grasps for a single object regardless of the environment. This demonstrates that GR-ConvNet generalizes well across the object types we tested and can predict robust grasps for multiple objects.

| Object | Accuracy (%), trained on Cornell | Accuracy (%), trained on Jacquard |
|---|---|---|
| Toy | 100 | 100 |
| Earphones | 100 | 95 |
| Bottle | 40 | 65 |
| Opener | 100 | 100 |
| Shaker | 100 | 100 |
| Controller | 100 | 100 |
| Remote | 100 | 90 |
| Pen | 100 | 100 |
| Headphones | 100 | 90 |
| Scissor | 95 | 90 |

TABLE IV: Accuracy on novel objects

VII Conclusions

We present a generalized solution for generating antipodal grasps for novel objects using our Generative Residual Convolutional Neural Network, which uses n-channel input data to generate images that can be used to infer grasp rectangles for each pixel in an image. We evaluated GR-ConvNet on two standard datasets, the Cornell grasp dataset and the Jacquard dataset, and obtained state-of-the-art results on both. We also validated the proposed system on novel real objects and on objects in clusters. The results demonstrate that our system provides accurate and generalized predictions for previously unseen real-world objects. Moreover, our method achieves high accuracy along with low inference time, which indicates that our system is suitable for closed-loop applications.

In future work, we would like to extend our solution for different types of grippers used such as single and multiple suction cup and multi-fingered grippers. We would also like to use depth prediction techniques to accurately predict depth for reflective objects, which can aid in improving the grasp prediction accuracy for reflective objects like the bottle.


  • [1] L. Antanas, P. Moreno, M. Neumann, R. P. de Figueiredo, K. Kersting, J. Santos-Victor, and L. De Raedt (2019) Semantic and geometric reasoning for robotic grasping: a probabilistic logic approach. Autonomous Robots 43 (6), pp. 1393–1418.
  • [2] U. Asif, M. Bennamoun, and F. A. Sohel (2017) RGB-D object recognition and grasp detection using hierarchical cascaded forests. IEEE Transactions on Robotics.
  • [3] U. Asif, J. Tang, and S. Harrer (2018) EnsembleNet: improving grasp detection using an ensemble of convolutional neural networks. In BMVC.
  • [4] U. Asif, J. Tang, and S. Harrer (2018) GraspNet: an efficient convolutional neural network for real-time grasp detection for low-powered devices. In IJCAI, pp. 4875–4882.
  • [5] A. Bicchi and V. Kumar (2000) Robotic grasping and contact: a review. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), Vol. 1, pp. 348–353.
  • [6] A. Bicchi (1995) On the closure properties of robotic grasping. The International Journal of Robotics Research 14 (4), pp. 319–334.
  • [7] J. Bohg, A. Morales, T. Asfour, and D. Kragic (2013) Data-driven grasp synthesis—a survey. IEEE Transactions on Robotics 30 (2), pp. 289–309. Cited by: §II.
  • [8] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, et al. (2018) Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 4243–4250. Cited by: §II.
  • [9] F. Chu, R. Xu, and P. A. Vela (2018) Real-world multiobject, multigrasp detection. IEEE Robotics and Automation Letters 3 (4), pp. 3355–3362. Cited by: §II.
  • [10] A. Depierre, E. Dellandréa, and L. Chen (2018) Jacquard: a large scale dataset for robotic grasp detection. In IEEE/RSJ International Conference on Intelligent Robots and Systems. Cited by: §V-A, TABLE III.
  • [11] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. International Conference on Learning Representations. Cited by: §IV-C.
  • [12] D. Guo, F. Sun, H. Liu, T. Kong, B. Fang, and N. Xi (2017) A hybrid deep architecture for robotic grasp detection. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1609–1614. Cited by: §II, §VI-A, TABLE II.
  • [13] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis (2019) Sim-to-real via sim-to-sim: data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12627–12637. Cited by: §II.
  • [14] Y. Jiang, S. Moseson, and A. Saxena (2011) Efficient grasping from rgbd images: learning using a new rectangle representation. In 2011 IEEE International Conference on Robotics and Automation, pp. 3304–3311. Cited by: §II, §V-A, §V-B, TABLE II.
  • [15] E. Johns, S. Leutenegger, and A. J. Davison (2016) Deep learning a grasp function for grasping under gripper pose uncertainty. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4461–4468. Cited by: §II.
  • [16] H. Karaoguz and P. Jensfelt (2019) Object detection approach for robot grasp detection. In 2019 International Conference on Robotics and Automation (ICRA), pp. 4953–4959. Cited by: TABLE II.
  • [17] M. Kopicki, R. Detry, M. Adjigble, R. Stolkin, A. Leonardis, and J. L. Wyatt (2016) One-shot learning and generation of dexterous grasps for novel objects. The International Journal of Robotics Research 35 (8), pp. 959–976. Cited by: §II.
  • [18] D. Kragic and H. I. Christensen (2003) Robust visual servoing. The international journal of robotics research 22 (10-11), pp. 923–939. Cited by: §II.
  • [19] S. Kumra and C. Kanan (2017) Robotic grasp detection using deep convolutional neural networks. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 769–776. Cited by: §II, §III, §IV-B, §IV, §VI-A, TABLE II.
  • [20] I. Lenz, H. Lee, and A. Saxena (2015) Deep learning for detecting robotic grasps. The International Journal of Robotics Research 34 (4-5), pp. 705–724. Cited by: §I, §II, §III, §VI-A, TABLE II.
  • [21] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 (4-5), pp. 421–436. Cited by: §II.
  • [22] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg (2017) Dex-net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv preprint arXiv:1703.09312. Cited by: §II.
  • [23] J. Maitin-Shepard, M. Cusumano-Towner, J. Lei, and P. Abbeel (2010) Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In 2010 IEEE International Conference on Robotics and Automation, pp. 2308–2315. Cited by: §II.
  • [24] D. Morrison, P. Corke, and J. Leitner (2019) Learning robust, real-time, reactive robotic grasping. The International Journal of Robotics Research, pp. 0278364919859066. Cited by: §II, §III, TABLE II, TABLE III.
  • [25] A. Nitanda (2014) Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, pp. 1574–1582. Cited by: §IV-C.
  • [26] E. Ogas, L. Avila, G. Larregay, and D. Moran (2019) A robotic grasping method using convnets. In 2019 Argentine Conference on Electronics (CAE), pp. 21–26. Cited by: §II.
  • [27] L. Pinto and A. Gupta (2016) Supersizing self-supervision: learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE international conference on robotics and automation (ICRA), pp. 3406–3413. Cited by: §I, §II.
  • [28] J. Redmon and A. Angelova (2015) Real-time grasp detection using convolutional neural networks. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 1316–1322. Cited by: §I, §II, §III, §VI-A, TABLE II.
  • [29] V. Satish, J. Mahler, and K. Goldberg (2019) On-policy dataset synthesis for learning robot grasping policies using fully convolutional deep networks. IEEE Robotics and Automation Letters 4 (2), pp. 1357–1364. Cited by: §II.
  • [30] A. Saxena, J. Driemeyer, and A. Y. Ng (2008) Robotic grasping of novel objects using vision. The International Journal of Robotics Research 27 (2), pp. 157–173. Cited by: §II.
  • [31] P. Schmidt, N. Vahrenkamp, M. Wächter, and T. Asfour (2018) Grasping of unknown objects using deep convolutional neural networks based on depth images. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6831–6838. Cited by: §II, §II.
  • [32] K. B. Shimoga (1996) Robot grasp synthesis algorithms: a survey. The International Journal of Robotics Research 15 (3), pp. 230–266. Cited by: §II.
  • [33] J. Tobin, L. Biewald, R. Duan, M. Andrychowicz, A. Handa, V. Kumar, B. McGrew, A. Ray, J. Schneider, P. Welinder, et al. (2018) Domain randomization and generative models for robotic grasping. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3482–3489. Cited by: §IV-A.
  • [34] J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield (2018) Deep object pose estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790. Cited by: §II.
  • [35] J. Varley, C. DeChant, A. Richardson, J. Ruales, and P. Allen (2017) Shape completion enabled robotic grasping. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2442–2447. Cited by: §II.
  • [36] Z. Wang, Z. Li, B. Wang, and H. Liu (2016) Robot grasp detection using multimodal deep convolutional neural networks. Advances in Mechanical Engineering 8 (9), pp. 1687814016668077. Cited by: §II, TABLE II.
  • [37] H. Xue, S. Zhang, and D. Cai (2017) Depth image inpainting: improving low rank matrix completion with low gradient regularization. IEEE Transactions on Image Processing 26 (9), pp. 4311–4320. Cited by: §IV-A.
  • [38] M. Yan, A. Li, M. Kalakrishnan, and P. Pastor (2019) Learning probabilistic multi-modal actor models for vision-based robotic grasping. arXiv preprint arXiv:1904.07319. Cited by: §II.
  • [39] X. Yan, M. Khansari, J. Hsu, Y. Gong, Y. Bai, S. Pirk, and H. Lee (2019) Data-efficient learning for sim-to-real robotic grasping using deep point cloud prediction networks. arXiv preprint arXiv:1906.08989. Cited by: §II.
  • [40] A. Zeng, S. Song, K. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo, et al. (2018) Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. Cited by: §II.
  • [41] X. Zhou, X. Lan, H. Zhang, Z. Tian, Y. Zhang, and N. Zheng (2018) Fully convolutional grasp detection network with oriented anchor box. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7223–7230. Cited by: §IV-B, §IV, TABLE II, TABLE III.