MetaGrasp: Data Efficient Grasping by Affordance Interpreter Network

02/18/2019 ∙ by Junhao Cai, et al. ∙ 0

Data-driven approaches to grasping have advanced significantly in recent years, but they usually require large amounts of training data. To make grasp data collection more efficient, this paper presents a novel grasp training system that covers the whole pipeline from data collection to model inference. The system collects effective grasp samples with a corrective strategy assisted by the antipodal grasp rule, and we design an affordance interpreter network to predict a pixelwise grasp affordance map. We define graspability, ungraspability and background as grasp affordances. The key advantage of our system is that the pixel-level affordance interpreter network, trained with only a small number of grasp samples under the antipodal rule, achieves strong performance on totally unseen objects and backgrounds. The training samples are collected only in simulation. Extensive qualitative and quantitative experiments demonstrate the accuracy and robustness of our proposed approach. In the real-world grasp experiments, using only RGB images, we achieve a grasp success rate of 93% on a set of household items, 91% on a set of adversarial items, and 87% on household items in clutter. When the background texture is changed, the system still performs well and reaches 94% on the adversarial items, which outperforms current state-of-the-art methods.




I Introduction

Recently, vision-based data-driven grasp synthesis has achieved strong performance on robotic grasping tasks. Many works have shown that deep learning models trained on data labelled by humans or collected by trial and error in a physical environment perform well on both seen and novel objects [1, 2, 3, 4, 5, 6, 7, 8, 9].

However, manually annotating data [7, 8] and running trial and error in a real environment [3, 4, 9] are time-consuming. An alternative is to generate data in a virtual environment, but much simulated data cannot be used directly because of the domain shift problem [10]. Some approaches therefore combine simulated RGB data with domain adaptation to reduce the number of real-world samples while keeping comparable performance, yet the number of real-world samples required is still large [11, 12].

Fig. 1: The proposed grasp system pipeline. Given a global workspace RGB image, the model predicts horizontal grasp affordance maps with respect to different rotations in the camera’s reference frame. Unlike [7], we directly obtain pixelwise affordance maps with only RGB image instead of upsampling the model output. The greenest region represents the most stable horizontal grasp point, red region stands for negative grasp location and blue is background. The right side shows a grasp episode.

On the other hand, many approaches have shown that models trained only on synthesized depth data work well in the physical world [5, 6, 13]. In particular, the promising grasping performance achieved by [5, 14] is partly credited to the antipodal grasp sampling method in depth images [15], a grasp rule that generates robust grasp samples. However, in real-world settings the performance degrades when grasping tiny, meshed or translucent objects, because depth information is missing for these items due to their physical properties. Moreover, camera configurations, such as the camera height relative to the workspace, must be consistent between the simulator and the real world. In addition, the amount of training data in [5] is extremely large, and the samples collected in the simulator contain only local depth images, whose information is limited; hence they cannot use the context of global visual information to facilitate grasp planning.

Therefore, we explore another kind of grasp training system, one that captures grasp patterns from RGB images only. The system effectively utilizes grasp samples collected from a simulator. In our proposed system, the generated grasp samples are based on the antipodal grasp heuristic instead of random grasping as in [3, 4, 9]. We also use global RGB visual perception rather than the local depth data used in [5]. We propose a well-interpreted model that evaluates accurate grasp patterns and is trained using only a small number of collected samples, without any real-world data.

For the data collection system, we propose a novel way to generate simulated data guided by the antipodal grasp rule. Unlike other self-supervised methods, which collect data only by random grasping [3, 4, 9], we use a corrective grasp strategy to generate antipodal grasp samples without the cumbersome sampling method of [5]. The data collected by our system emphasizes the antipodal grasp pattern and contains less perturbing information than data generated by random grasping trials, which benefits model training. In our system, we collect only the global RGB image containing the whole workspace, the object positions in the image and the corresponding grasp label. We show that a model trained with only about 6,300 simulated samples (approximately 1000 times fewer than [5]) can be applied directly to the real-world grasping task and achieves performance equivalent to current state-of-the-art techniques.

For the grasp inference model, we employ an end-to-end model like [7] to learn grasp meta knowledge. In contrast to [7], we extend the architecture to directly predict a pixelwise grasp affordance map without an upsampling operation on the output. In this work, we define graspability, ungraspability and background as grasp affordances, i.e., each pixel of the affordance map belongs to one of these grasp affordances. In addition, we train the network with a novel loss function to better deal with the label sparsity problem. Extensive experiments show that our model predicts more accurate affordance maps and achieves higher performance compared to [7].


To test the performance of our system, we perform grasp trials and evaluate the grasp success rate in several different scenarios, where we achieve strong grasp results. The grasped items are totally unseen by the model. To demonstrate the robustness of our approach, we also make grasp attempts on the set of adversarial items [5] under different background textures. The model is still able to predict the correct grasp configuration and reaches a 94% success rate on a background texture that is totally different from the simulated background.

In summary, our contributions mainly are:

  • We design a data collection system guided by the antipodal grasp rule in a virtual environment. Each collected sample is composed of the RGB image containing the whole workspace, in which only a single object is placed, the object positions in the image, and the grasp labels. We show that we can achieve strong performance using only a small number of samples without any real-world data.

  • We construct an end-to-end affordance interpreter network to achieve pixel-level grasp configuration and demonstrate that the network is able to learn antipodal grasp pattern from extremely sparse labeled dataset containing only about 6,300 samples without extensive domain randomization (i.e., we fix background, gripper parameters, camera location and lighting condition, and use only 35 objects from 3DNet [16]).

  • We evaluate our model performance on extensive experiments that consist of grasping unseen household and adversarial objects [5] in different scenarios, which shows better performance compared to current state-of-the-art methods.

II Related Work

Grasp Planning. Robotic grasping is one of the most widely studied topics in the area of object manipulation. In a grasping task, we aim to achieve a desired object constraint in the presence of external disturbances [17]. Existing grasping techniques can be divided into analytic and empirical methods. Analytic methods evaluate a grasp configuration according to some metric [18, 19, 20, 21] under the assumptions of a simplified contact model, Coulomb friction and rigid-object modeling [22]. However, these methods often require a known object model and location [23]. In contrast to analytic approaches, empirical methods obtain object representations from data, which can be used to evaluate grasp configurations with heuristics [24].

Deep Learning for Grasping. Because of the success of deep learning in visual perception [25, 26, 27], more and more grasping systems make use of a vision-based deep learning framework to predict grasp configurations [1, 2, 3, 4, 5, 6, 7, 8, 9]. Levine et al. spent two months collecting 800k grasp samples using 6-14 robots to train an evaluation network [4]. Zeng et al. proposed an end-to-end network to predict grasp affordance maps using manually annotated data [7]. Morrison et al. utilized the Cornell grasping dataset to train a light-weight convolutional neural network that achieves pixelwise grasp pose prediction [8]. These methods achieve strong performance, but collecting data through hand-designed procedures or pure trial and error in a real physical environment is time-consuming and less effective, which reduces their practicality.

Virtual to Real-world Transfer in Grasping. Because a large amount of annotated data is easily accessible in virtual environments, many works have turned to simulators to generate training samples, especially for object manipulation tasks [28, 29, 30, 31]. In the grasping scenario, Viereck et al. trained a distance function between the current pose and the nearest optimal pose using only simulated data [6]. Mahler et al. obtained synthesized data based on an antipodal grasp sampling method and trained a grasp quality network to evaluate robust grasp configurations [5]. Both use only simulated data and work well on real physical systems. A key factor is that the visual data collected from the simulator is depth data instead of RGB images. Viereck et al. argued that a depth image contains less information than an RGB image [6], so it is easier to bridge the gap when transferring models trained in a simulator to the physical world. However, in the physical world it is difficult for a depth camera to measure thin, translucent and dark-colored objects due to their physical properties. Under these circumstances, grasp performance is no longer guaranteed.

Approach of Data Collection for Grasping. A large amount of available annotated data is one of the key factors for the success of deep learning for grasping. The works in [29, 4, 9, 12, 11] proposed trial-and-error methods to collect data in the real world. Although most of these works achieve good results, the number of required samples is relatively large (from 50k to 800k). One of the most important reasons is that the data collection systems execute grasp trials without any heuristic. Dex-Net 2.0 generated grasp samples with an antipodal grasp sampling approach [15] to obtain robust grasps [5]. However, it collects only local depth information and needs 6.7 million samples to train the model. Zeng et al. collected global visual samples and annotated them manually, which is less efficient and error-prone [7]. We believe that better grasp patterns can be obtained from global visual perception, which can dramatically reduce the required number of training samples. Therefore, a better data collection system should be able to efficiently execute grasp trials based on a specific rule and obtain global visual information.

III Approach

III-A Problem Formulation

Similar to the related literature [3, 5, 8], given an RGB image, we consider the grasping problem as executing parallel-jaw grasp in planar workspace.

Let $g = (p, \varphi, a)$ define a perpendicular parallel-jaw grasp, where $p = (x, y, z)$ denotes the location of the grasp point in Cartesian coordinates, $\varphi$ represents the grasp angle with respect to the end effector, and $a$ is a one-hot vector which indicates the grasp affordance. When projected to image space, the grasp in image $I$ can be represented as $\tilde{g} = (u, v, \tilde{\varphi}, a)$, where $(u, v)$ depicts the grasp location in the image and $\tilde{\varphi}$ represents the discretized grasp angle. The discretization reduces the complexity of the learning procedure. (Note that in the inference process, we can equivalently replace the rotation of the gripper by rotating the input image. Therefore, we need to consider only the horizontal grasp for the image in the training process.) Hence, we can define a grasp affordance for each image pixel, and for an image of height $H$ and width $W$ the grasp affordance maps can be written as $A = \{A_{\tilde{\varphi}}\}$, where $A_{\tilde{\varphi}} \in \mathbb{R}^{H \times W \times 3}$ denotes the grasp affordances for the whole image given the angle $\tilde{\varphi}$. The 3 channels correspond to positive grasp, negative grasp and background. From the affordance maps, we extract the first channel (index 0 is the positive affordance map in $A_{\tilde{\varphi}}$) for each given angle and compose them together as $Q$: $Q(u, v, \tilde{\varphi}) = A_{\tilde{\varphi}}(u, v, 0)$. Therefore, the most robust grasp configuration can be evaluated by

$$(u^*, v^*, \tilde{\varphi}^*) = \arg\max_{u, v, \tilde{\varphi}} Q(u, v, \tilde{\varphi}), \tag{1}$$

where $Q(u, v, \tilde{\varphi})$ denotes the value of the positive affordance given the pixel position and rotation angle. The pair $(u^*, v^*)$ is the location the end-effector should reach in image space, and $\tilde{\varphi}^*$ indicates that the gripper rotates $\tilde{\varphi}^*$ degrees before executing the grasp.

In the training process, we define a parameterized function $f_\theta$ to achieve pixel-level mapping, which is denoted as:

$$A_{\tilde{\varphi}} = f_\theta(I_{\tilde{\varphi}}), \tag{2}$$

where $I_{\tilde{\varphi}}$ is the image obtained by rotating $I$ by $\tilde{\varphi}$ degrees, $A_{\tilde{\varphi}}$ is the output affordance map corresponding to $I_{\tilde{\varphi}}$, and $f_\theta$ is implemented using a neural network. Applying the softmax loss function $\mathcal{L}$, our training objective can be formulated as:

$$\theta^* = \arg\min_{\theta} \mathcal{L}\big(f_\theta(I_{\tilde{\varphi}}), Y\big), \tag{3}$$

where $Y$ represents the label map. Details are discussed in Sec. III-D.

Fig. 2: Antipodal grasp in simulator. $c_1$ and $c_2$ (yellow points) denote the contact points between the parallel-jaw gripper and a rigid object. $n_1$ and $n_2$ (green directed lines) are the normal vectors at the contact points. $d$ (blue line) represents the grasp angle in image space. (a) The grasp direction is not parallel to the normal vectors; although the object may be grasped successfully by a gripper with a wide opening, this is not a robust grasp. (b) The grasp direction is exactly parallel to the normal vectors; we consider this a robust antipodal grasp.

III-B Data Collection from Simulator

Definition of antipodal grasp in image space. Similar to [5], we formulate the antipodal grasp in image space as follows. We consider the scenario in which the robot executes grasp trials in a workspace containing only a single object. We define $c_1$ and $c_2$ as the contact points between the parallel-jaw gripper and the object, $n_1$ and $n_2$ as the normal vectors at the contact points, and $d$ as the grasp direction in image space. $c_1$, $c_2$, $n_1$, $n_2$ and $d$ are all in the 2-D image space illustrated in Fig. 2. As we can see,

$$\omega_1 = \arccos\frac{n_1 \cdot d}{\|n_1\|\,\|d\|}, \qquad \omega_2 = \arccos\frac{n_2 \cdot d}{\|n_2\|\,\|d\|}, \tag{4}$$

where $\|\cdot\|$ denotes the norm. We consider a grasp as an antipodal grasp if

$$\omega_1 \le \tau_1, \qquad \omega_2 \ge \tau_2 \tag{5}$$

are all satisfied, where $\tau_1$ and $\tau_2$ are non-negative values that are close to $0$ and $\pi$ respectively. In short, a grasp can be regarded as a stable antipodal grasp when the grasp direction is parallel to the normal vectors at the contact points in image space. We will describe how to use this rule to guide our corrective strategy for grasp data collection.
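The parallelism test described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names, the use of outward-pointing surface normals, the convention that the grasp direction points from the second contact to the first, and the 10-degree tolerance are all hypothetical choices made here.

```python
import math


def angle_between(a, b):
    """Angle in radians between two non-zero 2-D vectors."""
    dot = a[0] * b[0] + a[1] * b[1]
    norm = math.hypot(a[0], a[1]) * math.hypot(b[0], b[1])
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def is_antipodal(c1, c2, n1, n2, tol=math.radians(10)):
    """Check whether a two-contact grasp is approximately antipodal.

    c1, c2: contact points in image space.
    n1, n2: outward surface normals at the contacts (assumed convention).
    tol:    angular tolerance in radians (assumed value).
    """
    # Grasp direction, taken here as pointing from c2 to c1.
    d = (c1[0] - c2[0], c1[1] - c2[1])
    omega1 = angle_between(n1, d)  # should be close to 0
    omega2 = angle_between(n2, d)  # should be close to pi
    return omega1 <= tol and omega2 >= math.pi - tol
```

For example, a box gripped on two opposite vertical faces gives `is_antipodal((0, 5), (10, 5), (-1, 0), (1, 0)) == True`, while the same contacts with normals tilted by 45 degrees are rejected.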

Fig. 3: Illustration of data generation. The first three scenes on the left side show the generation of an antipodal sample by recording the global RGB image in which the object has been adjusted by the gripper. A saved sample contains 1) the RGB image, 2) the grasp label and 3) the object mask, where the white region represents the locations of the object.
Fig. 4: Sample generated by the corrective grasp strategy. The first column is the global RGB image captured by the camera. The green lines and red point represent the grasp angle and location respectively. Column 3 is the object mask that indicates the positions of the object. Row 1 displays the pattern of the sample collected before the grasp trial. Row 2 shows that the grasp trial adjusts the object's pose so that the contact points between object and gripper become antipodal. Therefore, we renew the pattern of the sample by re-recording the global image and the corresponding object mask.

Corrective grasp strategy. In our implementation, we directly collect antipodal grasp samples using a corrective grasp strategy. The overview of a grasp attempt is illustrated in Fig. 3. We first randomly select a grasp angle $\tilde{\varphi}$ and a position $(u, v)$ in image space where the pixel belongs to the object instead of the background, and use the global camera to record the current RGB image $I$. We also collect all the pixel coordinates $M$ that belong to the object. The angle, position, image and coordinates are saved as the pattern of a sample $s = (\tilde{\varphi}, (u, v), I, M)$. Then the end-effector moves to the position $p$ corresponding to $(u, v)$, rotates by the angle $\tilde{\varphi}$ and executes the grasp. If the gripper successfully grasps the object (this can be determined by the distance between the two fingertips of the gripper), we open the gripper and move the end-effector to the initial position. Due to the grasp trial, the object's pose is adjusted so that the contact points between object and gripper become approximately antipodal. Hence we renew the pattern of the sample by using the global camera to re-record the global visual information $I'$ and the pixel coordinates of the object $M'$, and then append the positive grasp label $l^{+}$ to the sample, which becomes $s = (\tilde{\varphi}, (u, v), I', M', l^{+})$. If the gripper fails to grasp the object, we simply add the grasp label (the negative label $l^{-}$ in this case) to the sample with the other information unchanged, so that $s = (\tilde{\varphi}, (u, v), I, M, l^{-})$. Fig. 4 shows a sample generated by the corrective grasp strategy.
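The collection procedure above can be sketched as follows. This is a minimal sketch, not the authors' code: `ToySim` is a hypothetical stand-in that does not use any real simulator API, the image is represented by a frame identifier, the angle range of 0 to 360 degrees and the label encoding (1 for positive, 0 for negative) are assumptions, and grasp success is random rather than measured from the fingertip distance.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GraspSample:
    image: str                             # global RGB image (a frame id in this toy)
    object_pixels: List[Tuple[int, int]]   # pixel coordinates belonging to the object
    position: Tuple[int, int]              # grasp location (u, v) in the image
    angle_deg: float                       # grasp angle
    label: int                             # 1 = positive, 0 = negative (assumed encoding)


class ToySim:
    """Hypothetical simulator stub used only to make the sketch runnable."""

    def __init__(self, rng: random.Random, success_prob: float):
        self.rng = rng
        self.success_prob = success_prob
        self.frame = 0

    def capture(self):
        # Returns the current "image" and the object's pixel coordinates.
        pixels = [(u, v) for u in range(3, 6) for v in range(3, 6)]
        return f"frame-{self.frame}", pixels

    def execute_grasp(self, position, angle_deg) -> bool:
        # Any grasp attempt may move the object, so the scene changes.
        self.frame += 1
        return self.rng.random() < self.success_prob


def collect_sample(sim: ToySim, rng: random.Random) -> GraspSample:
    # 1) Record the scene and pick a random grasp on an object pixel.
    image, pixels = sim.capture()
    position = rng.choice(pixels)
    angle_deg = rng.uniform(0.0, 360.0)
    # 2) Execute the grasp trial.
    if sim.execute_grasp(position, angle_deg):
        # 3a) Corrective step: the gripper has aligned the object, so the
        #     image and object pixels are re-recorded before labelling.
        image, pixels = sim.capture()
        label = 1
    else:
        # 3b) Failure: keep the pre-grasp observation and label it negative.
        label = 0
    return GraspSample(image, pixels, position, angle_deg, label)
```

The point of the design is visible in step 3a: a positive sample stores the observation taken after the gripper has corrected the object's pose, so every positive label is paired with an approximately antipodal configuration, while a negative sample keeps the original observation.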

Note that although the method in [5] also employs antipodal grasp rule in the data collection process, our method is based on trial-and-error method followed by a corrective strategy. On one hand, this can avoid the limitation of the heuristic rule and explore more potential graspable points. On the other hand, the corrective strategy improves the quality of the grasp samples. In particular, the synthesized samples collected from our system effectively highlight the grasp features, hence the model can easily capture the most useful information for grasping, ignore other perturbed information from the synthesized data, and generalize to different scenes.

In this work, we construct virtual scene using the simulator V-REP [32]. We can effectively collect antipodal grasp sample using the above method without any sampling procedure. Specifically, we use 6DOF UR5 robot mounted with RG2 gripper, while we mount ROBOTIQ85 gripper to the wrist of UR5 in physical world. We show that different grippers between virtual and real environments have no influence on grasp performance.

Data Structure. As shown in Fig. 3, a grasp sample consists of an RGB image of the whole workspace, the grasp label that indicates which locations are graspable or not as well as how many degrees the gripper should rotate, and object coordinates that give all object pixel locations in the RGB image.

Fig. 5: Affordance interpreter network. Given a rotated RGB image, the network outputs pixelwise affordance map in which blue region represents background, green denotes positive horizontal grasp point, red stands for negative grasp configuration.

III-C Model Architecture

The network architecture is shown in Fig. 5. Similar to [7], we design a fully convolutional residual network as the affordance interpreter network to predict horizontal grasp affordance maps. In contrast to [7], in order to achieve pixelwise prediction, we append a bilinear interpolation block to each 3x3 conv layer and use a 5x5 conv layer to refine the affordance maps. This architecture evaluates the graspable area more accurately than that proposed by [7].

We find that the model can avoid the domain shift problem even though it is trained with only simulated RGB images. A key factor is that all the positive grasp samples follow the antipodal pattern given by the antipodal grasp heuristic, so the model has no need to learn other confusing grasp patterns and is able to pay more attention to the grasp meta knowledge. Under this circumstance, a small number of antipodal samples is sufficient to train the model. Another key factor is that the model observes global visual information, so it can predict robust results even in complex scenarios such as multiple objects and clutter.

III-D Training Process

Fig. 6: Label mask. According to collected object positions, we can generate label mask in which 0 is assigned to unlabelled object regions. We can see that there are several positive (green) or negative (red) labels within the black region.

Because 1) most of the labels over all image pixels are background, and 2) positive and negative grasp labels are extremely sparse with respect to the object pixels, directly training the network is difficult in our experiments. Therefore, we introduce the object mask into the evaluation of the loss function. As illustrated in Fig. 6, we set each pixel of the unlabeled object region (i.e., the black region in Fig. 6) to $0$ and the others to $1$ to generate a training mask $M$. Let $X$ denote the feature maps output by the last convolutional layer. The corresponding loss function for network training is:

$$\mathcal{L}(X, Y) = -\sum_{i,j} M_{ij} \sum_{k} Y_{ijk} \log \frac{\exp(X_{ijk})}{\sum_{k'} \exp(X_{ijk'})}, \tag{6}$$

where $Y$ represents the label map as in Eqn. (3).

To reduce the impact of the label sparsity problem, we place more weight on the positive and negative label maps and shrink the loss of the background. Concretely, we empirically multiply the positive and negative maps by a scale factor of 120 and the background by 0.1 to adjust the influence of each label on the model parameters, i.e., each pixel of the modified mask $\hat{M}$ is formulated as:

$$\hat{M}_{ij} = \begin{cases} 120, & \text{pixel } (i, j) \text{ has a positive or negative label}, \\ 0.1, & \text{pixel } (i, j) \text{ is background}, \\ 0, & \text{pixel } (i, j) \text{ is an unlabeled object pixel}. \end{cases} \tag{7}$$
We show that these approaches significantly reduce the difficulty of model training.
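To make the weighting concrete, the following is a small pure-Python sketch of a masked, weighted softmax cross-entropy. It is an illustration only: the function names, the use of `-1` to mark unlabeled object pixels, and the choice to sum the weighted per-pixel losses without normalization are assumptions made here; only the class order (positive, negative, background) and the weights 120, 0.1 and 0 come from the text.

```python
import math

# Channel order follows the text: positive, negative, background.
POS, NEG, BG = 0, 1, 2
UNLABELED = -1  # hypothetical marker for unlabeled object pixels


def pixel_weight(label):
    """Per-pixel weight of the modified mask."""
    if label == UNLABELED:
        return 0.0   # unlabeled object pixels do not contribute
    if label == BG:
        return 0.1   # background loss is shrunk
    return 120.0     # sparse positive / negative labels are emphasized


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one pixel, computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[label]


def masked_loss(logit_map, label_map):
    """Weighted sum of per-pixel losses over an H x W map (no normalization)."""
    total = 0.0
    for logit_row, label_row in zip(logit_map, label_map):
        for logits, label in zip(logit_row, label_row):
            w = pixel_weight(label)
            if w > 0.0:
                total += w * softmax_cross_entropy(logits, label)
    return total
```

With uniform logits every labeled pixel has a cross-entropy of `log(3)`, so a 2x2 map with one positive, one negative, one background and one unlabeled pixel yields `(120 + 120 + 0.1) * log(3)`, which shows how strongly the two grasp labels dominate the background term.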

III-E Grasp Execution

For each grasp execution, we first capture the RGB image containing the whole workspace. We then rotate the image into 16 orientations, each a different multiple of a fixed angular step, and feed the rotated images into the affordance interpreter network to predict pixelwise horizontal grasp affordance maps. Next, we move the end effector to the position whose corresponding image location has the highest graspable affordance value and rotate the gripper by the angle given by the orientation of that rotated image. Finally, we approach the object and execute the grasp.
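The selection step can be sketched as a search for the highest positive-affordance value over all rotated maps. This is a hedged sketch rather than the authors' code: `select_grasp` is a hypothetical name, the maps are plain nested lists standing in for network outputs, and the 22.5-degree step assumes that the 16 orientations evenly cover a full turn, which the text here does not state explicitly.

```python
def select_grasp(positive_maps, step_deg=22.5):
    """Return (score, u, v, angle_deg) of the best grasp.

    positive_maps[k][v][u] is the positive-affordance score at pixel (u, v)
    of the image rotated by k * step_deg degrees.
    """
    best = None
    for k, amap in enumerate(positive_maps):
        for v, row in enumerate(amap):
            for u, score in enumerate(row):
                if best is None or score > best[0]:
                    best = (score, u, v, k * step_deg)
    return best


# Toy input: 16 rotations of a 2 x 2 map, with one clear maximum.
maps = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(16)]
maps[3][1][0] = 0.9
best_score, best_u, best_v, best_angle = select_grasp(maps)
```

Note that the returned pixel lives in the frame of the rotated image; mapping it back to the unrotated workspace image requires the inverse rotation about the image centre, which is omitted here.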

IV Experiments

IV-A Experimental Setting

Physical Component. We use a 6DOF UR5 robot with a ROBOTIQ85 gripper mounted at the end effector and an Intel RealSense SR300 at the robot's wrist. The SR300 is able to capture depth information. Note that we use the depth image only to obtain a suitable gripper reaching height, not for grasp affordance inference. The workspace contains a planar area with a grey texture. We change the workspace texture in the robustness experiments.

Fig. 7: Grasp objects, scenarios and background textures. (a) Training objects. (b) Test objects. (c) Scenarios: singular-object, multiple-object, cluttered and texture-change, in clockwise order. (d) Textures: the first texture is used as the background of the simulated workspace; the others are used in the real world. In clockwise order, these are the background texture, texture1, texture2 and texture3.

Training Dataset. The training dataset is collected using a total of 35 objects from 3DNet [16] in the V-REP simulator [32]. The set of objects can be seen in Fig. 7(a). In total we collect 6370 samples, of which 3264 are positive grasp samples and 3106 are negative samples.

Training Details. We train our network with stochastic gradient descent, using a learning rate of 0.001 with exponential decay and a batch size of 8. We augment our data with vertical flips and slight rotations: each sample is rotated by at most 5 degrees to the left and at most 5 degrees to the right. The augmented set contains 140,140 samples.
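The size of the augmented set can be checked with a short calculation. The breakdown below is an inference, not something the text states: it assumes rotations in integer steps from -5 to +5 degrees (11 variants, including the unrotated one), each with and without a vertical flip. Under that assumption the count matches the reported total exactly.

```python
base_samples = 6370                # collected samples (from the text)
rotations = list(range(-5, 6))     # assumed: -5, -4, ..., 0, ..., +5 degrees
flips = [False, True]              # without / with vertical flip

variants_per_sample = len(rotations) * len(flips)   # 11 * 2 = 22
augmented_total = base_samples * variants_per_sample

print(variants_per_sample, augmented_total)
```

Other combinations of rotation step and flip could in principle give the same factor of 22, so this should be read as one consistent explanation of the reported number.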

Test Objects. We divide our test objects into three sets. One is household set containing 16 items of different sizes, shapes and physical properties. Another one is adversarial set consisting of 8 3D-printed items with adversarial geometry which were used by [5] and [8]. The last one is building blocks which have regular shapes, we use these objects to facilitate our qualitative analysis. All of the test objects are novel for our model. The objects are shown in Fig. 7(b).

Scenario Configuration. We evaluate the model performance in four different scenarios, presented in Fig. 7(c). The first is the singular-object scenario, where only one object is placed in the workspace. Once the gripper successfully grasps the object, it randomly rotates the object and places it at another position. The second is the multiple-object scenario, in which we test whether the model is able to predict robust grasp configurations in a scene with multiple objects. The third is a clutter scene with 10 household objects. The fourth is similar to the first scene except for the background texture; with this experiment we demonstrate the robustness of our proposed approach.

IV-B Qualitative Results

Fig. 8: Horizontally antipodal grasp analysis. Top: Given the input RGB image, the model outputs the pixelwise horizontal grasp affordance map. Bottom: Probability curves of two rows of the affordance map.

Analysis of Horizontally Antipodal Grasp Pattern. We predict affordance map given an RGB image in which there are 6 objects placed. The result is shown in Fig. 8. We select two rows from affordance map and plot graspability and ungraspability curves of them. Row one shows that the positions of objects are able to be correctly located and labelled as non-background region. As for row two, we can see that the blue rectangle block placed vertically achieves higher probability of graspability, while the slant one in the lower right is prone to be ungraspable. The result demonstrates that the model can correctly find antipodal grasp pattern in horizontal direction.

Analysis of Grasp Location. Fig. 9 visually compares our approach with other methods. We train a variant of the model using the technique proposed by [7] with our collected data. In [7], the model was trained with zero loss propagation for the background regions, and the affordance map is 8 times smaller than the input image, so the output is upsampled 8 times using bilinear interpolation to match the input image. We can see that the affordance map predicted by [7] is ambiguous (Fig. 9(a)), and most of the background regions are misclassified. Therefore, we modify the training loss to retain the background loss, while the shape of the affordance map remains unchanged. The predicted affordance map, shown in Fig. 9(b), is coarse due to the interpolation and has some errors in grasp location. We also train the model using our proposed method but with data collected by the random grasp rule instead of the antipodal heuristic. The result is presented in Fig. 9(c). We can see that the model predicts as graspable some positions where the objects are not perpendicular to the horizontal direction. By contrast, our proposed network architecture is able to predict more accurate grasp locations.

IV-C Quantitative Results

We execute extensive grasp experiments in the physical world. Some of the visual results are shown in Fig. 10(a). Further results can be seen in the supplementary video.

Fig. 9: Grasp location analysis. Input the RGB image, the models predict affordance map respectively. Red represents ungraspable, green denotes graspable and blue is background. (a) is the result of the model implemented according to [7], (b) is also the result of [7] except the training loss. Column (c) is the affordance maps generated by the model trained with data collected by random grasp rule. The last column is our result. It shows that our approach can evaluate more accurate grasp location.
Set   MAG    [5]   Ours(No CGS)   Ours(T1)   Ours(T2)   Ours(T3)
H     77.0   -     85.6           93.6       92.3       90.4
A     80.6   93    89.8           91.3       92.7       94.1

Note: H means household set and A is adversarial set. T1, T2 and T3 represent texture1, texture2 and texture3 respectively.

TABLE I: Singular-object grasp success rate (%)

Singular Object Grasp. In the singular-object experiment, we execute 20 grasps on each object and then average the grasp success. For the household set, we achieve a 93% success rate; for the adversarial set, the success rate is 91%. Both are reported with 95% confidence intervals. We also evaluate two alternatives: the multi-affordance grasping (MAG) proposed by [7], and our method without the corrective grasp strategy (CGS) for data collection. The results are presented in Table I. MAG achieves 77% and 80% on household and adversarial items respectively, which shows that our pixel-level affordance interpreter network is able to predict more accurate grasp configurations. Moreover, although the model trained without the corrective grasp strategy also achieves good performance, the visualization of the affordance maps in Fig. 10(b) and Fig. 10(c) shows that the model trained with data collected under the antipodal grasp rule predicts more precise affordances than the model trained under the random grasp rule.

Fig. 10: Visualization of affordance maps. (a) Affordance maps. (b) Random grasp rule. (c) Antipodal grasp rule. The affordance maps of the random grasp rule in (b) are ambiguous when the object is slanted (i.e., both red and green colors appear at the position of the object). Conversely, the results in (c) show that the model trained with data collected under the antipodal rule predicts the slanted object as ungraspable more accurately. Best viewed in color.

Multiple Objects Grasp in Isolation. In the multiple-object experiments, we randomly select 6 objects, place them on the workspace in isolation and grasp them into the bin one after another. This task is repeated 10 times. As a result, we achieve a 90% (60/67) success rate on the household set and 94% (60/64) on the adversarial set. Although our training data contains only the singular-object scenario, we achieve strong performance on the multiple-object scene.

Grasp in Clutter. We run 10 trials of removing 10 household items cluttered on the workspace. The 10 objects are first shaken in a box and then emptied in a pile onto the workspace. This configuration is the same as in [8]. Although our data collection does not involve objects in clutter, we demonstrate that our model performs well not only on objects in isolation but also in the clutter scenario. We achieve an 87% (93/107) success rate.
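The percentages quoted for the multiple-object and clutter experiments follow directly from the success and attempt counts given in parentheses. The snippet below simply redoes that arithmetic; the dictionary keys are labels chosen here for readability.

```python
def success_rate(successes, attempts):
    """Grasp success rate in percent."""
    return 100.0 * successes / attempts


# (successful grasps, total attempts) as reported in the text.
reported = {
    "multiple objects, household": (60, 67),
    "multiple objects, adversarial": (60, 64),
    "clutter, household": (93, 107),
}

rates = {name: round(success_rate(s, n)) for name, (s, n) in reported.items()}
failed = {name: n - s for name, (s, n) in reported.items()}
```

Reading the counts this way also shows how many attempts failed in each setting: 7 and 4 for the two multiple-object runs, and 14 in clutter.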

Robustness to Different Backgrounds. To demonstrate the robustness of our approach, we perform 20 grasp trials for each object under different background textures. We run grasp experiments on 3 different textures, which are illustrated in Fig. 7(d). The success rates can be seen in Table I. The results show that our approach is robust to different background textures and that the model effectively focuses on learning the antipodal grasp pattern while ignoring the disturbing information in the RGB image.

V Conclusion and Future Work

In this work, we propose a new grasping data collection method guided by antipodal grasping rule in virtual environment. The collected data can effectively contain horizontal antipodal grasp pattern. We apply only about 6,300 simulated samples to an end-to-end fully convolutional network to predict pixelwise grasp affordance map. Extensive experiments show that our proposed approach can achieve equivalent performance compared with current state-of-the-art methods. In this work we place emphasis on finding grasp pattern in synthesized RGB image, while there are also limitations to RGB image. In future work we will make the most of both RGB and depth information.


  • [1] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015.
  • [2] E. Johns, S. Leutenegger, and A. J. Davison, “Deep learning a grasp function for grasping under gripper pose uncertainty,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2016, pp. 4461–4468.
  • [3] L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” in 2016 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2016, pp. 3406–3413.
  • [4] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 421–436, 2018.
  • [5] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” in Robotics: Science and Systems (RSS), 2017.
  • [6] U. Viereck, A. Pas, K. Saenko, and R. Platt, “Learning a visuomotor controller for real world robotic grasping using simulated depth images,” in Conference on Robot Learning, 2017, pp. 291–300.
  • [7] A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo et al., “Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018.
  • [8] D. Morrison, P. Corke, and J. Leitner, “Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach,” in Robotics: Science and Systems (RSS), 2018.
  • [9] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke et al., “Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” arXiv preprint arXiv:1806.10293, 2018.
  • [10] M. Sugiyama, N. D. Lawrence, A. Schwaighofer et al., Dataset shift in machine learning. The MIT Press, 2017.
  • [11] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige et al., “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018.
  • [12] K. Fang, Y. Bai, S. Hinterstoisser, and M. Kalakrishnan, “Multi-task domain adaptation for deep learning of instance grasping from simulation,” 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018.
  • [13] J. Tobin, W. Zaremba, and P. Abbeel, “Domain randomization and generative models for robotic grasping,” arXiv preprint arXiv:1710.06425, 2017.
  • [14] J. Mahler and K. Goldberg, “Learning deep policies for robot bin picking by simulating robust grasping sequences,” in Conference on Robot Learning, 2017, pp. 515–524.
  • [15] K. Goldberg, B. V. Mirtich, Y. Zhuang, J. Craig, B. R. Carlisle, and J. Canny, “Part pose statistics: Estimators and experiments,” IEEE Transactions on Robotics and Automation, vol. 15, no. 5, pp. 849–857, 1999.
  • [16] W. Wohlkinger, A. Aldoma, R. B. Rusu, and M. Vincze, “3dnet: Large-scale object class recognition from cad models,” in 2012 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2012, pp. 5384–5391.
  • [17] R. Suárez, J. Cornella, and M. R. Garzón, Grasp quality measures.   Institut d’Organització i Control de Sistemes Industrials Barcelona, Spain, 2006.
  • [18] C. Ferrari and J. Canny, “Planning optimal grasps,” in 1992 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 1992, pp. 2290–2295.
  • [19] V.-D. Nguyen, “Constructing force-closure grasps,” The International Journal of Robotics Research, vol. 7, no. 3, pp. 3–16, 1988.
  • [20] J. Weisz and P. K. Allen, “Pose error robust grasping from contact wrench space metrics,” in 2012 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2012, pp. 557–562.
  • [21] F. T. Pokorny and D. Kragic, “Classical grasp quality evaluation: New algorithms and theory,” in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2013, pp. 3493–3500.
  • [22] B. Siciliano and O. Khatib, Springer handbook of robotics.   Springer, 2016.
  • [23] D. Prattichizzo and J. C. Trinkle, “Grasping,” in Springer handbook of robotics.   Springer, 2008, pp. 671–700.
  • [24] J. Bohg, A. Morales, T. Asfour, and D. Kragic, “Data-driven grasp synthesis a survey,” IEEE Transactions on Robotics, vol. 30, no. 2, pp. 289–309, 2014.
  • [25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [26] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • [27] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • [28] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 23–30.
  • [29] L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel, “Asymmetric actor critic for image-based robot learning,” arXiv preprint arXiv:1710.06542, 2017.
  • [30] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” arXiv preprint arXiv:1710.06537, 2017.
  • [31] J. Matas, S. James, and A. J. Davison, “Sim-to-real reinforcement learning for deformable object manipulation,” arXiv preprint arXiv:1806.07851, 2018.
  • [32] E. Rohmer, S. P. Singh, and M. Freese, “V-rep: A versatile and scalable robot simulation framework,” in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2013, pp. 1321–1326.