DemoGrasp: Few-Shot Learning for Robotic Grasping with Human Demonstration

The ability to successfully grasp objects is crucial in robotics, as it enables several interactive downstream applications. To this end, most approaches either compute the full 6D pose for the object of interest or learn to predict a set of grasping points. While the former approaches do not scale well to multiple object instances or classes yet, the latter require large annotated datasets and are hampered by their poor generalization capabilities to new geometries. To overcome these shortcomings, we propose to teach a robot how to grasp an object with a simple and short human demonstration. Hence, our approach neither requires many annotated images nor is it restricted to a specific geometry. We first present a small sequence of RGB-D images displaying a human-object interaction. This sequence is then leveraged to build associated hand and object meshes that represent the depicted interaction. Subsequently, we complete missing parts of the reconstructed object shape and estimate the relative transformation between the reconstruction and the visible object in the scene. Finally, we transfer the a-priori knowledge from the relative pose between object and human hand with the estimate of the current object pose in the scene into necessary grasping instructions for the robot. Exhaustive evaluations with Toyota's Human Support Robot (HSR) in real and synthetic environments demonstrate the applicability of our proposed methodology and its advantage in comparison to previous approaches.




I Introduction

Grasping of objects is a fundamental problem in robotics as it enables numerous applications [1]. Robotic manipulators are already an integral part of modern workplaces, where they are often used for repetitive tasks [2]. While human-robot collaboration can even help in medical applications [3, 4], it is often restricted to cases with clearly defined a priori motion patterns. When the interaction is more intricate and active manipulation is required, a priori definition becomes non-trivial, such as in the less structured domestic environments where service robots typically operate. Current approaches for robotic grasping lack generalization capabilities as they either concentrate on estimating the object pose [5, 6] or learn grasping points [7], which require detailed prior information about the object or a large set of annotations, respectively. Just like human hands, robotic grippers and arms have natural limits in their range of motion and limited degrees of freedom, which restrict their possible grasping poses. While the motion models for robotic grippers and human hands can differ greatly, it should be possible to distill information from human manipulation and deduce adequate gripping commands for the target robot from it. Doing so with a limited set of human demonstrations would enable a robot to imitate human behaviour and thus to grasp objects seamlessly.

Fig. 1: DemoGrasp pipeline. After demonstrating human interaction with the object of interest, our method leverages this knowledge to infer correct human-object interaction for the current observation (1). Subsequently, grasping instructions are derived from the inferred interaction (2).

We focus on this imitation, where the robot mirrors the human interaction as illustrated in Fig. 1. The task can be partitioned into a visual perception part and an interpretation part: a human instructor demonstrates the manipulation a priori (Demo), from which the robot deduces the grasping information necessary to manipulate the current scene (Grasp). Given an adequate mapping from the human hand to the robotic gripper, decomposing the task into these two stages allows our method to scale to a large number of different grippers. Eventually, this paves the way to teaching the robot by natural human demonstration, which allows a higher level of automation to be realized, especially in less structured environments.

During demonstration of the object to the robot (Demo) from various different viewpoints, our method constantly tracks both the hand and the object which are fused into a Truncated Signed Distance Field (TSDF) for 3D reconstruction [8, 9]. Using semantic segmentation of hand and object [10], the reconstruction can be separated and further processed to retrieve a full 3D representation of both object and hand. We then extract the associated 3D hand mesh leveraging the MANO [11] hand model, which we tightly align with the reconstructed object. During inference, we then use PPF-FoldNet [12] to predict if an object is present together with its relative transformation from object to camera space. The final grasp instructions are then derived from the estimated hand mesh applying the estimated pose.

To summarize, we propose DemoGrasp, the first learning-by-demonstration pipeline that can infer a robotic grasping pose directly from a short human hand demonstration sequence of RGB-D images. Further, we set up a new synthetic evaluation benchmark based on the HSR to systematically evaluate grasping via demonstration. We make the dataset publicly available to encourage future research in this field.

II Related Work

Methods for robotic grasping can be divided into model-based and model-free approaches [5]. While model-based approaches require a 3D CAD model for each object of interest, model-free methods instead directly infer final grasp instructions without any knowledge about the object’s geometry.

II-A Model-based Grasping

The task of model-based grasping commonly involves solving for the 6D pose, i.e. 3D rotation and 3D translation, of the object of interest [13, 14]. Most traditional methods for object pose estimation rely on local image features such as SIFT [15, 16] or template matching [17]. With the rise of consumer depth cameras, the trend shifted towards 6D pose estimation from RGB-D data. With the additional depth map, template matching [18] has again been used, while others proposed to use handcrafted 3D descriptors such as SHOT [19] or point pair features [20, 21], or to learn the pose task [22, 23]. With the advent of deep learning, 6D pose estimation received another boost in attention as consistently faster and more accurate approaches have been introduced [24]. In essence, there are three different lines of work for estimating the 6D pose. The first is grounded on 3D descriptors [25, 12]. These descriptors can be computed for example via metric learning [26] or by means of auto-encoding [27]. Other methods directly infer the 6D pose [28, 13]. While [29, 28] regress the output pose, Kehl et al. turn it into a classification problem [13]. A few methods also solve for the pose by means of pose refinement [30, 31, 32]. The last and most prominent branch of works establishes 2D-3D correspondences and optimizes for the pose using a variant of the PnP/RANSAC paradigm [33, 34, 35, 36].

While the accuracy of 6D pose estimation keeps steadily increasing, it also comes with a heavy burden in annotating data for the 3D CAD model and 6D pose, which scales poorly to multiple objects, often rendering it impractical for real applications [37, 38, 39]. In contrast, our method only needs a short live demonstration of a human-object interaction in order to reliably interact with novel objects.

II-B Model-free Grasping

Model-free approaches are generally very attractive compared to model-based approaches due to their ability to generalize to previously unseen geometry. In essence, model-free grasping can be divided into discriminative approaches [7, 40] and generative approaches [41, 42]. Discriminative methods sample grasping instructions which are then scored by a neural network, whereas generative models directly output grasping configurations.

Different training data modalities are used to train discriminative approaches. Dex-Net [40, 43] collects a large number of samples from a simulator, which are then used to train the proposed Grasp Quality Convolutional Neural Network (CNN). The authors further extend Dex-Net to support suction grippers [44] and dual-arm robots [45]. While the previous methods all require depth data, [46] and [47] only use RGB inputs to score grasp candidates. To train on real data, Levine et al. [46] collect over 800k real grasps over the course of two months. They then train a CNN which scores the grasp success probability given the corresponding motor command. In [48] the authors use deep learning to train a robot policy which is capable of fast evaluation of millions of grasp candidates. Mousavian et al. [7] leverage a variational auto-encoder to map from partial point cloud observations to a set of possible grasps for the object of interest. In contrast to our proposed method and generative approaches, discriminative models are usually computationally expensive as each of the proposals needs to be evaluated before grasping can be attempted.

As for generative models, Jiang et al. [49] infer oriented rectangles in the image plane, representing plausible gripper positioning. Redmon et al. [41] simultaneously predict the object class and, similar to [49], oriented rectangles depicting the grasping instructions. They further extend their method to return a set of multiple possible grasps, as most objects can usually be grasped at several locations. Lenz et al. [42] propose a cascade network, where the first network produces candidate grasps, which are subsequently scored by the second network. In [50, 51], the authors leverage a small CNN to generate antipodal grasps. They predict the grasping quality, angle and gripper width for each individual pixel. SG use a Single-Shot Grasp Proposal Network grounded on PointNet++ [25] to efficiently predict amodal grasp proposals [52]. Generative methods are thereby usually tailored towards the robot or gripper used during training. In contrast, we infer the full hand-object interaction. Given the appropriate mapping from human hand to gripper, this can be applied to various robots and grippers.

A few methods also rely on reinforcement learning to teach grasping to a robot [53, 54, 46, 55]. However, reinforcement learning based approaches require a large amount of training data and feedback from real robots to learn grasping of objects. To simplify the training process and enable the robot to grasp specific novel objects, several works utilize human demonstration to learn robotic grasping. Early works such as [56], [57] store trajectories in robot configuration space during the demonstration phase. The trajectories are recorded using either kinesthetic teaching [56] or teleoperation [57]. Follow-up papers [58], [55] adopt deep learning approaches to solve the task. Similarly, Yu et al. [58] teach the robot to perform tasks from a demonstration video. In this work, the robot policy is directly predicted from hidden layers of the network. In comparison, DemoGrasp focuses on inferring the detailed hand pose and object geometry from the network. Although in this paper we transform the human hand pose to a 2-finger gripper pose for evaluation purposes, our pipeline has the potential to be leveraged for dexterous robotic hands in humanoid robots.

III Methodology

Fig. 2: Schematic overview of our proposed methodology. Teaching Phase: After human demonstrations, the hand-object interaction is reconstructed by fusing the RGB-D image pairs from multiple views with the help of segmentation masks. Meanwhile, the MANO hand pose [59] is retrieved from the first RGB image. After registration, the grasp pose is learned as well as the object shape. Grasp Inference: Points from the object and the scene are matched with PPF-FoldNet [12] features, after which the object is registered to the scene. Leveraging the grasp pose registered with the object in the scene, the robot is able to grasp the object with its gripper.

Our DemoGrasp pipeline consists of four consecutive steps, detailed hereafter. In stage one, we segment the hand and object on a set of human demonstration RGB-D images and reconstruct their incomplete shapes leveraging the recorded depth maps. In stages two and three, we complete the shape of the object and fit a hand pose. In the last stage, we estimate the 6D pose of the current object in the robot’s view, which enables transforming the hand model accordingly. The model can subsequently be used to infer plausible grasping instructions for the robot. See Fig. 2 for a schematic overview of our method.

III-A 3D Reconstruction of Human-Object Interaction

To reconstruct the shown hand-object interaction, we first segment the hand and object using Mask R-CNN [10]. In the absence of a dataset with real labels for human-object interaction, we rely on synthetic images from the ObMan dataset [59]. To enhance segmentation quality, we follow [60] and apply a binary cross entropy for each class independently, thus preventing inter-class competition. During the demonstration phase we feed our trained Mask R-CNN with the RGB image sequence as perceived by the robot camera. The segmented hand and object depth images are back-projected into 3D, where KinectFusion [8] creates a corresponding TSDF volume. Drift-free tracking is achieved by continuously aligning the current input frame against the TSDF volume using ICP [61, 62]. Nevertheless, as most household objects possess rather simple geometry, camera tracking can become challenging. We thus track the hand and object together in a shared TSDF, as the additional hand geometry significantly stabilizes the process and makes the camera pose estimation more robust. Finally, we separate hand from object by means of two individual TSDF reconstructions using the previously estimated relative poses together with the segmentation results. The additional information from the 2D segmentation is essential in this process, as the dynamic motion of the human demonstrator’s arm and body in relation to the static environment can be filtered out so as to solely fuse the relevant hand and object information.
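The back-projection of segmented depth pixels into 3D can be illustrated with a minimal sketch. It assumes a standard pinhole camera model; the function name `backproject_masked_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and only serve the illustration, they are not taken from the paper.

```python
import numpy as np

def backproject_masked_depth(depth, mask, fx, fy, cx, cy):
    """Lift masked depth pixels (in metres) to 3D camera-frame points.

    Assumes a pinhole camera; pixels with zero depth are treated as invalid.
    """
    valid = mask.astype(bool) & (depth > 0)
    v, u = np.nonzero(valid)          # v: pixel rows, u: pixel columns
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Toy example: a flat surface 2 m away, with a 2x2 segmentation mask.
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
points = backproject_masked_depth(depth, mask, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
```

In a pipeline like the one described, such a function would be applied once with the hand mask and once with the object mask, so that only the relevant points are fused into the respective TSDF volume.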

III-B Object Shape Completion

After reconstruction, the object model is not yet complete due to self-occlusion and partial visibility. As a detailed and complete mesh has a positive influence on the subsequent tasks, we first apply shape completion. While there are different approaches to achieve this, in this work we harness a 3D CNN to directly correct the TSDF volume, prior to shape extraction via marching cubes [9]. Despite the cubic complexity scaling of TSDFs, they still provide a suitable solution in our case where both object and hand have a limited spatial extent. We apply a 3D variant of U-Net [63] with skip-connections for improved accuracy and feed it with the fused object TSDF volume. The 3D volume is encoded to a 512-dimensional feature descriptor, which is concatenated with RGB features to provide complementary image information in addition to the geometry. The decoder consists of 6 up-convolutions followed by a max-pooling layer to eventually reach the input resolution. The object mesh prediction is modeled as a classification task where each voxel in the output volume carries a prediction score denoting whether the voxel is occupied. Since only a small portion of the voxels in the volume is occupied, the training data is highly imbalanced. We thus apply a class average loss and additionally use the focal loss [60] as objective function to re-balance the loss contributions: the loss is averaged separately over Pos and Neg, which respectively represent the occupied and empty voxels, and the focal loss is parameterized as originally proposed by Lin et al. [60].

To train our shape completion model we render multiple images from different views of various hand-object interactions using the synthetic simulator of [59]. We then extract the associated object TSDFs using the previously described procedure. These extracted object TSDFs are eventually fed into the network as training input, while the object meshes from simulation are transformed to their voxel representation in order to obtain the associated ground truth.
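One plausible reading of a class-averaged focal loss over occupied and empty voxels is sketched below. This is an assumption for illustration, not the authors' exact objective: the function name is hypothetical, and the focusing parameter `gamma=2.0` is the default reported by Lin et al. [60], which may differ from the value used in this work.

```python
import numpy as np

def class_averaged_focal_loss(p, occupied, gamma=2.0, eps=1e-7):
    """Focal loss averaged separately over occupied (Pos) and empty (Neg) voxels.

    p        : predicted occupancy probabilities, any shape
    occupied : boolean ground-truth occupancy, same shape as p
    gamma    : focusing parameter (assumed value, see lead-in)
    """
    p = np.clip(p, eps, 1.0 - eps)
    pos = occupied.astype(bool)
    neg = ~pos
    # Focal term: well-classified voxels are down-weighted by (1 - p_t)^gamma.
    loss_pos = -((1.0 - p[pos]) ** gamma) * np.log(p[pos])
    loss_neg = -(p[neg] ** gamma) * np.log(1.0 - p[neg])
    # Averaging per class keeps the many empty voxels from dominating the loss.
    mean_pos = loss_pos.mean() if loss_pos.size else 0.0
    mean_neg = loss_neg.mean() if loss_neg.size else 0.0
    return mean_pos + mean_neg

# Toy example: one occupied voxel and three empty ones.
pred = np.array([0.9, 0.1, 0.1, 0.1])
gt = np.array([True, False, False, False])
loss = class_averaged_focal_loss(pred, gt)
```

Because each class is averaged on its own, adding further correctly classified empty voxels leaves the contribution of the occupied voxels unchanged, which is the re-balancing effect the text refers to.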

III-C Hand Pose Estimation

We retrieve parameters for a statistical hand model from the reconstructed hand shape. To this end we use the MANO [11] hand model which maps parameters for hand pose and shape to a mesh. The MANO hand model is developed from multiple scanned real hands featuring realistic deformation.

Fig. 3: Registration of MANO Hand. Left: Fused hand point cloud. Right: The MANO hand mesh (in green) registered to the point cloud.

As prediction under occlusion is difficult, the authors of [59] propose to jointly train their CNN for hand pose and object mesh estimation using auxiliary contact and collision losses to encourage natural predictions without collision, which mutually improves both tasks. We leverage the hand branch of their architecture for a complete hand mesh reconstruction. As the hand is part of a joint reconstruction, the pose is already close to the actual grasping location. Nevertheless, to further improve grasping, we additionally employ ICP to tightly align the hand mesh with the point cloud extracted from the TSDF volume of the partial hand reconstruction. This allows us to calculate the transformation of the hand mesh in robot view coordinates, ready for computing the grasping instructions. The final hand and object mesh after alignment is visualized in Fig. 3.
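The ICP refinement mentioned above can be sketched as a basic point-to-point variant. This is a minimal illustration under simplifying assumptions (brute-force nearest neighbours, no outlier rejection, a fixed number of iterations); the function names are hypothetical and the authors' actual implementation is not specified in the text.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares R, t such that R @ src_i + t approximates dst_i (Kabsch)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(src, dst, iterations=10):
    """Align src (e.g. hand mesh vertices) to dst (e.g. fused hand point cloud)."""
    R_total, t_total = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iterations):
        # Brute-force nearest neighbour in dst for every current point.
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        matches = dst[d2.argmin(axis=1)]
        R, t = best_rigid_transform(cur, matches)
        cur = cur @ R.T + t
        # Accumulate the incremental update into the overall transform.
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total

# Toy example: a 3x3x3 grid of points, slightly rotated and shifted.
g = np.linspace(-0.5, 0.5, 3)
src_pts = np.array(np.meshgrid(g, g, g)).reshape(3, -1).T
angle = np.deg2rad(2.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.01, -0.01, 0.02])
dst_pts = src_pts @ R_true.T + t_true
R_est, t_est = icp(src_pts, dst_pts)
```

ICP only converges to the correct alignment from a good initialization, which matches the observation in the text that the hand pose from the joint reconstruction is already close to the actual grasping location.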

III-D Grasping Point & Instruction Retrieval

After accurate calculation of both object and hand shape from the demonstration, we want to determine the grasping instructions given the robot view. We first retrieve the object pose which is then used to transform the hand mesh and calculate the grasping points with thumb and index finger of the hand model. The first step is realized with PPF-FoldNet [12] which encodes the information of the local geometry in a high dimensional feature descriptor. Thereby, we calculate features for both the demo reconstruction and the current robotic observation. We then align both point clouds through robust feature matching using RANSAC.

As our extracted hand mesh is aligned with the object pose, we can infer the associated human grasp for the given input. Now that we have estimated hand and object pose in the robotic view coordinates, we can determine the grasping instructions depending on the gripper hardware. We exemplify this with the Toyota Human Support Robot (HSR) and a two-sided gripper in the evaluation section below.
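Transferring the demonstrated hand into the current scene amounts to applying the estimated rigid object transform to the hand geometry. The sketch below assumes the registration yields a 4x4 homogeneous matrix mapping the demo reconstruction frame to the robot camera frame; the names `make_pose` and `transfer_hand_to_scene` are hypothetical.

```python
import numpy as np

def make_pose(R, t):
    """Build a 4x4 homogeneous transform from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def transfer_hand_to_scene(T_scene_from_demo, hand_points_demo):
    """Map hand points (N, 3) from the demo frame into the robot camera frame."""
    ones = np.ones((len(hand_points_demo), 1))
    homogeneous = np.hstack([hand_points_demo, ones])
    return (homogeneous @ T_scene_from_demo.T)[:, :3]

# Toy example: the object (and therefore the hand) is rotated by 90 degrees
# about the z-axis and shifted by 1 m along x relative to the demonstration.
R_z90 = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
T = make_pose(R_z90, np.array([1.0, 0.0, 0.0]))
hand_demo = np.array([[1.0, 0.0, 0.0],
                      [0.0, 0.0, 0.5]])
hand_scene = transfer_hand_to_scene(T, hand_demo)
```

Since the hand mesh is rigidly attached to the object reconstruction, the same transform moves both, so no separate hand detection is needed at inference time.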

IV Evaluation

We evaluate DemoGrasp in both simulation and the real world. To enable a comparison with the state of the art, we introduce a new synthetic dataset and an evaluation benchmark which we make publicly available.

IV-A Evaluation Protocol

We select four objects of different shape (drill, hole punch, cookie box and shampoo bottle) for our in silico tests (see Fig. 5). To obtain the ground truth object shapes, we scanned the cookie box and shampoo bottle using a 3D scanner (Shining3D EinScan-SP), and acquired the hole punch and drill mesh models from the LineMOD dataset [64]. For evaluation we use the Toyota HSR, a service robot platform that provides several integrated functionalities for a wide range of applications. We leverage the open-sourced simulation environment for HSR, on which we build our benchmark.


Fig. 4: Exemplary training images. We show four example images from the demonstration sequence of the drill object together with our segmentation results for hand and object, which are used to reconstruct the overall interaction (right). The demonstration sequences are on average 10-30 seconds long and cover around 100-400 images.
Fig. 5: Test objects used in the simulation. From left to right: shampoo bottle, drill, hole punch, cookie box.

Each test scene is set up with a table one meter in front of the robot, in the view of the robot's head RGB-D camera. The scanned object models are placed at different locations and viewing angles on the table. We consider a grasp successful only if the robot is able to grasp the object with its gripper, retrieve it from the table, and hold it still for three seconds without dropping it. In addition, we also record a human demonstration for each object, which we release with the benchmark, consisting of short RGB-D sequences of less than 1 minute in length (100-400 frames) demonstrating a hand-object interaction at a distance of 0.5 m. During demonstration we rotated the objects once around by hand as shown in Fig. 4.

As for the real experiments, we evaluate our pipeline in a physical grasping task using the Toyota HSR. The goal of the task is again for HSR to grasp a previously learned household object and retrieve it from the table. We use four household objects: a cup, a can, a bottle, and a remote control (see Fig. 8). For each of these objects, we record two human demonstration sequences, one in which the demonstrator holds the object from its side, and one from the top. For each demonstration sequence, the experiments are conducted as follows: we record the RGB-D demonstration sequence using HSR’s head camera. Then, we run our pipeline to create a model for the object and the corresponding hand model for grasping the object. Finally, we place the object on a table in front of the robot and let HSR grasp the object and retrieve it from the table. To assess the robustness of the learned models, we vary the placement of the objects on the table. In particular, we divide the front half of the table (that is, the half closer to the initial position of HSR) into three equal areas. In each run, we select one of the three areas, randomly place the object there and let HSR attempt to grasp it. We repeat the grasping experiment three times for each of the areas for each demonstration sequence. We thus perform 9 runs per demonstration sequence, 18 runs per object, and 72 runs overall. A run is considered successful if HSR is able to grasp the object, retrieve it from the table, and move back to a neutral position with the object still in its gripper. Notice that the rear half of the table is excluded from the experiments, as reaching it would create challenges for the path planner of HSR. A representation of the experimental environment is provided in Fig. 6.

Fig. 6: Our experimental setup. We randomly place a demonstrated object on the table in front of HSR. Afterwards, we let HSR retrieve the object using the grasp instructions inferred by our DemoGrasp pipeline.

IV-B Extracting Grasping Instructions for Toyota HSR

Given the hand predicted in the test scene by DemoGrasp, we generate the target 6D grasp pose for HSR’s gripper based on the locations of the index finger, the thumb, and the wrist. The final grasp location is then calculated as the 3D middle point between the index and thumb locations.

For rotation, we create a coordinate system using three vectors:

  • The vector v1 between the wrist location and the calculated middle point between the index and thumb locations, which is the vector that HSR’s hand will follow to approach the object;

  • The vector v2 perpendicular to the plane described by v1 and the vector between the thumb and the index finger locations, which is the direction in which HSR will tighten the grip; and

  • The vector v3 perpendicular to the plane described by the vectors v1 and v2, i.e. v3 = v1 × v2.

An illustration of this logic is provided in Fig. 7.
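The construction of the grasp frame from the three hand keypoints can be sketched as follows. The symbols v1, v2 and v3, the sign conventions of the cross products and the ordering of the axes in the rotation matrix are assumptions made for illustration; the exact conventions depend on the gripper frame of the target robot and are not specified in the text.

```python
import numpy as np

def unit(v):
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)

def grasp_pose_from_hand(wrist, thumb, index):
    """Derive a grasp position and rotation from wrist, thumb tip and index tip."""
    center = 0.5 * (thumb + index)            # grasp location: thumb-index midpoint
    v1 = unit(center - wrist)                 # approach direction
    v2 = unit(np.cross(v1, index - thumb))    # normal of the plane (v1, thumb-index)
    v3 = np.cross(v1, v2)                     # completes a right-handed frame
    R = np.column_stack([v1, v2, v3])         # assumed axis order: v1, v2, v3
    return center, R

# Toy example: the wrist sits at the origin, the fingertips lie 1 m ahead along x.
wrist = np.array([0.0, 0.0, 0.0])
thumb = np.array([1.0, -0.5, 0.0])
index = np.array([1.0, 0.5, 0.0])
center, R = grasp_pose_from_hand(wrist, thumb, index)
```

The construction degenerates when the approach direction is parallel to the thumb-index vector, since the cross product then vanishes; a practical implementation would need to check for this case.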

Successful Grasps (%) / Average Planning Attempts
Method | Shampoo | Drill | Hole Punch | Cookie Box | Mean
6-DOF GraspNet [7] | 86.7 / 5.6 | 73.3 / 7.2 | 86.7 / 11.2 | 60.0 / 108.3 | 76.7 / 41.8
DemoGrasp w/o Shape Completion | 73.3 / – | 100.0 / – | 66.7 / – | 100.0 / – | 85.0 / –
DemoGrasp | 86.7 / – | 93.3 / – | 86.7 / – | 100.0 / – | 91.7 / –
TABLE I: Grasping results on synthetic evaluation suite. We compare DemoGrasp with the state-of-the-art method 6-DOF GraspNet [7]. We additionally report the average number of attempts for the motion planning in 6-DOF GraspNet, as [7] infers several possible grasps which are consecutively ranked. The second row shows an ablation of our DemoGrasp pipeline without the shape completion module.
Fig. 7: Calculation of the 6D grasp pose. Illustration of the logic that we use to calculate the grasp point and rotation given the hand predicted by DemoGrasp (top). The lower part shows a representation of the grasp pose predicted by DemoGrasp in the synthetic environment.

IV-C Evaluation Results in Simulation

During evaluation we use the aforementioned training sequences to train our method. We then run DemoGrasp according to the described evaluation protocol and report the average success rate after 15 different grasping trials in TABLE I. In particular, we note robust grasping results with an average success rate of 91.7% over all objects, clearly showing that DemoGrasp is capable of producing reliable grasps. Disabling the shape completion module, we saw a decrease in our success rate of approximately 7 percentage points (from 91.7% to 85.0%), which indicates the importance of completion for reliable grasping instructions. Interestingly, we are on par or better for all objects when using shape completion except for the drill. As this object is fairly large, matching results with PPF-FoldNet are reliable even without completion. Completion can add some noise in this case, which lowers the performance for the large object. Nevertheless, for most hand-sized objects such as the shampoo bottle and the hole punch, our results with shape completion are clearly superior, as the hand covers most of the object and thus matching without the completion cannot be reliably conducted.

We additionally compare DemoGrasp with 6-DOF GraspNet [7], a state-of-the-art approach for model-free grasping. We outperform 6-DOF GraspNet by 15.0 percentage points, with its average success rate of 76.7% compared to our 91.7%. Moreover, 6-DOF GraspNet is a discriminative approach that samples up to 900 plausible grasps with an associated confidence. These grasp poses are then sorted from high to low scores and tested with a motion planning algorithm one after the other to assess if the pose is feasible. In our experiments, 6-DOF GraspNet tested 41.8 candidate grasps per trial, on average. In contrast, our method evaluates only one single grasp pose, which makes the motion planning on average about 40 times faster. To enable future comparison, we will publicly release our benchmark suite as well as the training sequences that we recorded.

IV-D Real World Evaluation Results on Toyota HSR

Fig. 8: DemoGrasp success rate on a real HSR. Overall success rate for different grasp sides (top). The diagram shows individual grasping results for each object when they are grasped from the side and from top (middle). The four objects that we used for our real-world experiments (bottom).

Fig. 8 summarizes the results of our real experiments. The grasp success rate is higher when grasping objects from their side than when grasping them from the top (see Fig. 8, top). We postulate that this difference is due to the fact that i) most of the objects considered are horizontally symmetrical, which can sometimes cause our pipeline to place the predicted hand at the bottom rather than at the top of the object, eventually leading to a grasping failure, and ii) the considered objects offer a larger surface for grasping on their side compared to their top. The drop in success rate for the bottle is mostly due to the partial transparency of the object, which makes the object shape reconstruction, and consequently the registration, more challenging. Nonetheless, HSR was able to successfully grasp the objects in the share of runs reported in Fig. 8 simply by observing a single interaction between a person and the object. In addition, a live demo of the real experiment is shown in the attached video.

Success Rate (%)
Demo Length | Shampoo | Drill | Hole Punch | Cookie Box | Average
100% | 86.7 | 93.3 | 86.7 | 100.0 | 91.7
50% | 80.0 | 100.0 | 60.0 | 33.3 | 68.3
30% | 66.7 | 100.0 | 13.3 | 6.7 | 46.7
TABLE II: Grasping results for learning sequences with partial visibility. We reduce the demonstration sequence length and compare the corresponding DemoGrasp success rates in our simulation environment.

IV-E Restricting the Learning Sequence to Partial Visibility

During the teaching phase, DemoGrasp records and fuses images by demonstrating the object in a single motion from one side to the other, providing different views of the hand-object interaction. To understand the effect and limitation of partial hand-object demonstration on the result of DemoGrasp, we run the teaching phase for our simulated grasp experiments with decreasing demonstration length, using different fractions of the original demonstration sequences. For each of the fractions, we then run the grasp experiments with the Toyota HSR in our simulation environment, and compare the success rates in TABLE II. It is worth noting that random seeding for RANSAC caused one grasping mistake in our test for the large drill object. The results show that using fewer frames as learning sequence leads to a drop in the average success rate over the objects by 23.4 points using half of the demonstration video and by 45.0 points with only 30%. The incomplete demonstration of both the object and the hand geometry results in a less accurate reconstruction, which has a direct effect on the success of the entire pipeline: it eventually leads to a failure in registering the reconstructed MANO hand model with the hand point cloud, which severely affects the grasping of the hole punch and the cookie box. We observe that demonstration sequences that show all the fingers of the hand at least once to the camera are key to robust grasping.

V Conclusions

In this paper we introduced DemoGrasp, a novel method for inferring grasping instructions from a short human demonstration. At its core, a human demonstrates a hand-object interaction in a short RGB-D image sequence. The sequence is leveraged to jointly reconstruct the 3D object as well as the hand mesh. Then, we localize the object in 3D using point-pair features and estimate reliable grasping instructions from the previously reconstructed interaction. Exhaustive evaluations in real and synthetic environments demonstrate the applicability of our approach to learning grasping instructions from a single human demonstration. In the current pipeline, our model is restricted to the prediction of a single object in the scene. Future research will focus on extending DemoGrasp with incremental learning: the grasping instructions for new objects can be incrementally learned and added to the existing object library. Overall, we believe that our methodology can pave the way towards natural and interpretable human-robot collaboration by imitation, and our new dataset can serve as a basis to compare future approaches.


  • [1] M. Alonso, A. Izaguirre, and M. Graña, “Current research trends in robot grasping and bin picking,” in Soft Computing Models, 2018.
  • [2] C.-J. Liang, V. R. Kamat, and C. C. Menassa, “Teaching robots to perform quasi-repetitive construction tasks through human demonstration,” Automation in Construction, 2020.
  • [3] M. Esposito, B. Busam, C. Hennersperger, J. Rackerseder, A. Lu, N. Navab, and B. Frisch, “Cooperative robotic gamma imaging: Enhancing us-guided needle biopsy,” in MICCAI, 2015.
  • [4] B. Busam, M. Esposito, S. Che’Rose, N. Navab, and B. Frisch, “A stereo vision approach for cooperative robotic movement therapy,” in ICCVW, 2015.
  • [5] K. Kleeberger, R. Bormann, W. Kraus, and M. F. Huber, “A survey on learning-based robotic grasping,” Current Robotics Reports, 2020.
  • [6] C. Sahin, G. Garcia-Hernando, J. Sock, and T.-K. Kim, “A review on object pose recovery: from 3d bounding box detectors to full 6d pose estimators,” Image and Vision Computing, 2020.
  • [7] A. Mousavian, C. Eppner, and D. Fox, “6-dof graspnet: Variational grasp generation for object manipulation,” in ICCV, 2019.
  • [8] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, et al., “Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera,” in symposium on User interface software and technology, 2011.
  • [9] W. E. Lorensen and H. E. Cline, “Marching cubes: A high resolution 3d surface construction algorithm,” ACM siggraph computer graphics, 1987.
  • [10] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV, 2017.
  • [11] J. Romero, D. Tzionas, and M. J. Black, “Embodied hands: Modeling and capturing hands and bodies together,” ACM ToG, 2017.
  • [12]

    H. Deng, T. Birdal, and S. Ilic, “Ppf-foldnet: Unsupervised learning of rotation invariant 3d local descriptors,” in

    ECCV, 2018.
  • [13] W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab, “Ssd-6d: Making rgb-based 3d detection and 6d pose estimation great again,” in ICCV, 2017.
  • [14] X. Deng, Y. Xiang, A. Mousavian, C. Eppner, T. Bretl, and D. Fox, “Self-supervised 6d object pose estimation for robot manipulation,” in ICRA, 2020.
  • [15] D. G. Lowe, “Object recognition from local scale-invariant features,” in ICCV, 1999.
  • [16] A. C. Romea, M. M. Torres, and S. Srinivasa, “The moped framework: Object recognition and pose estimation for manipulation,” International Journal of Robotics Research, 2011.
  • [17] S. Hinterstoisser, C. Cagniart, S. Ilic, P. Sturm, N. Navab, P. Fua, and V. Lepetit, “Gradient response maps for real-time detection of textureless objects,” TPAMI, 2012.
  • [18] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab, “Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes,” in ACCV, 2012.
  • [19] F. Tombari, S. Salti, and L. Di Stefano, “Unique signatures of histograms for local surface description,” in ECCV, 2010.
  • [20] B. Drost, M. Ulrich, N. Navab, and S. Ilic, “Model globally, match locally: Efficient and robust 3d object recognition,” in

    2010 IEEE computer society conference on computer vision and pattern recognition

    , 2010.
  • [21] S. Hinterstoisser, V. Lepetit, N. Rajkumar, and K. Konolige, “Going further with point pair features,” in ECCV, 2016.
  • [22] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother, “Learning 6D object pose estimation using 3D object coordinates,” in ECCV, 2014.
  • [23] A. Krull, E. Brachmann, F. Michel, M. Ying Yang, S. Gumhold, and C. Rother, “Learning analysis-by-synthesis for 6D pose estimation in RGB-D images,” in ICCV, 2015.
  • [24] T. Hodan, F. Michel, E. Brachmann, W. Kehl, A. GlentBuch, D. Kraft, B. Drost, J. Vidal, S. Ihrke, X. Zabulis, et al., “Bop: Benchmark for 6d object pose estimation,” in ECCV, 2018.
  • [25] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in CVPR, 2017.
  • [26] P. Wohlhart and V. Lepetit, “Learning descriptors for object recognition and 3d pose estimation,” in CVPR, 2015.
  • [27] M. Sundermeyer, Z.-C. Marton, M. Durner, M. Brucker, and R. Triebel, “Implicit 3d orientation learning for 6d object detection from rgb images,” in ECCV, 2018.
  • [28] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes,” RSS, 2017.
  • [29] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese, “Densefusion: 6d object pose estimation by iterative dense fusion,” in CVPR, 2019.
  • [30] Y. Li, G. Wang, X. Ji, Y. Xiang, and D. Fox, “Deepim: Deep iterative matching for 6d pose estimation,” in ECCV, 2018.
  • [31] F. Manhardt, W. Kehl, N. Navab, and F. Tombari, “Deep model-based 6d pose refinement in rgb,” in ECCV, 2018.
  • [32] Y. Labbé, J. Carpentier, M. Aubry, and J. Sivic, “Cosypose: Consistent multi-view multi-object 6d pose estimation,” in ECCV, 2020.
  • [33] M. Rad and V. Lepetit, “Bb8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth,” in ICCV, 2017.
  • [34] B. Tekin, S. N. Sinha, and P. Fua, “Real-time seamless single shot 6d object pose prediction,” in CVPR, 2018.
  • [35] S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao, “Pvnet: Pixel-wise voting network for 6dof pose estimation,” in CVPR, 2019.
  • [36] T. Hodan, D. Barath, and J. Matas, “Epos: estimating 6d pose of objects with symmetries,” in CVPR, 2020.
  • [37] T. Hodaň, V. Vineet, R. Gal, E. Shalev, J. Hanzelka, T. Connell, P. Urbina, S. N. Sinha, and B. Guenter, “Photorealistic image synthesis for object instance detection,” in ICIP, 2019.
  • [38] B. Busam, H. J. Jung, and N. Navab, “I like to move it: 6d pose estimation as an action decision process,” arXiv preprint arXiv:2009.12678, 2020.
  • [39] G. Wang, F. Manhardt, J. Shao, X. Ji, N. Navab, and F. Tombari, “Self6d: Self-supervised monocular 6d object pose estimation,” in ECCV, 2020.
  • [40] J. Mahler, F. T. Pokorny, B. Hou, M. Roderick, M. Laskey, M. Aubry, K. Kohlhoff, T. Kröger, J. Kuffner, and K. Goldberg, “Dex-net 1.0: A cloud-based network of 3d objects for robust grasp planning using a multi-armed bandit model with correlated rewards,” in ICRA, 2016.
  • [41] J. Redmon and A. Angelova, “Real-time grasp detection using convolutional neural networks,” in ICRA, 2015.
  • [42] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The International Journal of Robotics Research, 2015.
  • [43] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” arXiv preprint arXiv:1703.09312, 2017.
  • [44] J. Mahler, M. Matl, X. Liu, A. Li, D. Gealy, and K. Goldberg, “Dex-net 3.0: Computing robust vacuum suction grasp targets in point clouds using a new analytic model and deep learning,” in ICRA, 2018.
  • [45] J. Mahler, M. Matl, V. Satish, M. Danielczuk, B. DeRose, S. McKinley, and K. Goldberg, “Learning ambidextrous robot grasping policies,” Science Robotics, 2019.
  • [46] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” The International Journal of Robotics Research, 2018.
  • [47]

    L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” in

    ICRA, 2016.
  • [48] V. Satish, J. Mahler, and K. Goldberg, “On-policy dataset synthesis for learning robot grasping policies using fully convolutional deep networks,” RAL, 2019.
  • [49] Y. Jiang, S. Moseson, and A. Saxena, “Efficient grasping from rgbd images: Learning using a new rectangle representation,” in ICRA, 2011.
  • [50] D. Morrison, P. Corke, and J. Leitner, “Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach,” arXiv preprint arXiv:1804.05172, 2018.
  • [51] ——, “Learning robust, real-time, reactive robotic grasping,” The International Journal of Robotics Research, 2020.
  • [52] Y. Qin, R. Chen, H. Zhu, M. Song, J. Xu, and H. Su, “S4g: Amodal single-view single-shot se (3) grasp detection in cluttered scenes,” in Conference on robot learning, 2020.
  • [53] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al., “Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” arXiv preprint arXiv:1806.10293, 2018.
  • [54] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis, “Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks,” in CVPR, 2019.
  • [55] S. Song, A. Zeng, J. Lee, and T. Funkhouser, “Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations,” IEEE Robotics and Automation Letters, 2020.
  • [56] P. Pastor, L. Righetti, M. Kalakrishnan, and S. Schaal, “Online movement adaptation based on previous sensor experiences,” in IROS, 2011.
  • [57] S. Calinon, P. Evrard, E. Gribovskaya, A. Billard, and A. Kheddar, “Learning collaborative manipulation tasks by demonstration using a haptic interface,” in International Conference on Advanced Robotics, 2009.
  • [58] T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine, “One-shot imitation from observing humans via domain-adaptive meta-learning,” arXiv preprint arXiv:1802.01557, 2018.
  • [59] Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black, I. Laptev, and C. Schmid, “Learning joint reconstruction of hands and manipulated objects,” in CVPR, 2019.
  • [60] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in ICCV, 2017.
  • [61] P. J. Besl and N. D. McKay, “Method for registration of 3-d shapes,” in Sensor fusion IV: control paradigms and data structures.    International Society for Optics and Photonics, 1992.
  • [62] S. Rusinkiewicz and M. Levoy, “Efficient variants of the icp algorithm,” in Proceedings third international conference on 3-D digital imaging and modeling, 2001.
  • [63] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015.
  • [64] S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab, and V. Lepetit, “Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes,” in 2011 international conference on computer vision, 2011.
  • [65] T. Yamamoto, K. Terada, A. Ochiai, F. Saito, Y. Asahara, and K. Murase, “Development of human support robot as the research platform of a domestic mobile manipulator,” ROBOMECH journal, 2019.