Grasping of objects is a fundamental problem in robotics as it enables numerous applications. Robotic manipulators are already an integral part of modern workplaces, where they are often used for repetitive tasks. While human-robot collaboration can even help in medical applications [3, 4], it is often restricted to cases with clearly defined a priori motion patterns. When the interaction is more intricate and active manipulation is required, a priori definition becomes non-trivial, such as in the less structured domestic environments where service robots typically operate. Current approaches for robotic grasping lack generalization capabilities as they either concentrate on estimating the object pose [5, 6] or learn grasping points, which requires detailed prior information about the object or a large set of annotations. Just like human hands, robotic grippers and arms have natural limits in their range of motion and limited degrees of freedom, which restrict their possible grasping poses. While the motion models of robotic grippers and human hands can differ greatly, it should be possible to distill information from human manipulation and deduce adequate gripping commands for the target robot from it. Doing so with a limited set of human demonstrations would enable a robot to imitate human behaviour and thus seamlessly grasp objects.
We focus on this imitation setting, where the robot mirrors the human interaction as illustrated in Fig. 1. The task can be partitioned into a visual perception part and an interpretation part: a human instructor demonstrates the manipulation a priori (Demo), from which the robot deduces the grasping information necessary to manipulate the current scene (Grasp). Given an adequate mapping from the human hand to the robotic gripper, decomposing the task into these two stages allows our method to scale to a large variety of grippers. Eventually, this paves the way to teaching the robot by natural human demonstration, which enables a higher level of automation, especially in less structured environments.
While the object is demonstrated to the robot (Demo) from various viewpoints, our method constantly tracks both the hand and the object, which are fused into a Truncated Signed Distance Field (TSDF) for 3D reconstruction [8, 9]. Using semantic segmentation of hand and object, the reconstruction can be separated and further processed to retrieve a full 3D representation of both. We then extract the associated 3D hand mesh leveraging the MANO hand model, which we tightly align with the reconstructed object. During inference, we use PPF-FoldNet to predict whether an object is present, together with its relative transformation from object to camera space. The final grasp instructions are then derived by applying the estimated pose to the estimated hand mesh.
To summarize, we propose the first learning-by-demonstration pipeline that can infer robotic grasping poses directly from a short human hand demonstration sequence of RGB-D images. Further, we set up a new synthetic evaluation benchmark based on the HSR to systematically evaluate grasping via demonstration. We make the dataset publicly available to encourage future research in this field.
II Related Work
Methods for robotic grasping can be divided into model-based and model-free approaches. While model-based approaches require a 3D CAD model for each object of interest, model-free methods directly infer final grasp instructions without any knowledge of the object's geometry.
II-A Model-based Grasping
Most traditional methods for object pose estimation rely on local image features such as SIFT [15, 16] or template matching. With the rise of consumer depth cameras, the trend shifted towards 6D pose estimation from RGB-D data. With the additional depth map, template matching has again been used, while others proposed handcrafted 3D descriptors such as SHOT or point pair features [20, 21], or learn the pose task directly [22, 23].
With the advent of deep learning, 6D pose estimation received another boost in attention, as consistently faster and more accurate approaches have been introduced. In essence, there are three different lines of work for estimating the 6D pose. The first is grounded on 3D descriptors [25, 12]. These descriptors can be computed, for example, via metric learning or by means of auto-encoding. Other methods directly infer the 6D pose [28, 13]. While [29, 28] regress the output pose, Kehl et al. turn it into a classification problem. A few methods also solve for the pose by means of pose refinement [30, 31, 32]. The last and most prominent branch of works establishes 2D-3D correspondences and optimizes for the pose using a variant of the PnP/RANSAC paradigm [33, 34, 35, 36].
While the accuracy of 6D pose estimation keeps steadily increasing, it comes with the heavy burden of annotating 3D CAD models and 6D poses, which scales poorly to multiple objects and often renders it impractical for real applications [37, 38, 39]. In contrast, our method only needs a short live demonstration of a human-object interaction in order to reliably interact with novel objects.
II-B Model-free Grasping
Model-free approaches are generally very attractive compared to model-based approaches due to their ability to generalize to previously unseen geometry. In essence, model-free grasping can be divided into discriminative approaches [7, 40] and generative approaches [41, 42]. Discriminative methods sample grasping instructions which are then scored by a neural network, whereas generative models directly output grasping configurations.
Dex-Net collects a large amount of samples from a simulator, which is then used to train the proposed Grasp Quality Convolutional Neural Network (CNN). The authors further extend Dex-Net to support suction grippers and dual-arm robots. While these methods all require depth data, others only use RGB inputs to score grasp candidates. To train on real data, Levine et al. collect over 800k real grasps over the course of two months. They then train a CNN which scores the grasp success probability given the corresponding motor command. Other works use deep learning to train a robot policy capable of fast evaluation of millions of grasp candidates. Mousavian et al. leverage a variational auto-encoder to map from partial point cloud observations to a set of possible grasps for the object of interest. In contrast to our proposed method and to generative approaches, discriminative models are usually computationally expensive, as each proposal needs to be evaluated before grasping can be attempted.
As for generative models, Jiang et al. infer oriented rectangles in the image plane, representing plausible gripper positionings. Redmon et al. simultaneously predict the object class and, similar to Jiang et al., oriented rectangles depicting the grasping instructions. They further extend their method to return a set of multiple possible grasps, as most objects can usually be grasped at several locations. Lenz et al. propose a cascade network, where the first network produces candidate grasps which are subsequently scored by the second network. In [50, 51], the authors leverage a small CNN to generate antipodal grasps, predicting the grasping quality, angle and gripper width for each individual pixel. S4G uses a Single-Shot Grasp Proposal Network grounded on PointNet++ to efficiently predict amodal grasp proposals. Generative methods are thereby usually tailored towards the robot or gripper used during training. In contrast, we infer the full hand-object interaction; given an appropriate mapping from human hand to gripper, this can be applied to various robots and grippers.
A few methods also rely on reinforcement learning to teach grasping to a robot [53, 54, 46, 55]. However, reinforcement learning based approaches require a large amount of training data and feedback from real robots to learn grasping of objects. To simplify the training process and enable the robot to grasp specific novel objects, several works utilize human demonstration to learn robotic grasping. Early works store trajectories in robot configuration space during the demonstration phase, recorded using either kinesthetic teaching or teleoperation. Follow-up papers adopt deep learning approaches to solve the task. Similarly, Yu et al. teach the robot to perform tasks from a demonstration video, with the robot policy directly predicted from hidden layers of the network. In comparison, our method focuses on inferring the detailed hand pose and object geometry. Although in this paper we transform the human hand pose to a 2-finger gripper pose for evaluation purposes, our pipeline has the potential to be leveraged for dexterous robotic hands in humanoid robots.
Our pipeline consists of four consecutive stages, detailed hereafter. In stage one, we segment hand and object in a set of human demonstration RGB-D images and reconstruct their incomplete shapes leveraging the recorded depth maps. In stages two and three, we complete the shape of the object and fit a hand pose. In the last stage, we estimate the 6D pose of the current object in the robot's view, which enables transforming the hand model accordingly. The model can subsequently be used to infer plausible grasping instructions for the robot. See Fig. 2 for a schematic overview of our method.
III-A 3D Reconstruction of Human-Object Interaction
To reconstruct the demonstrated hand-object interaction, we first segment the hand and object using Mask R-CNN. In the absence of a dataset with real labels for human-object interaction, we rely on synthetic images from the ObMan dataset. To enhance segmentation quality, we apply a binary cross-entropy loss for each class independently, thus preventing inter-class competition. During the demonstration phase, we feed our trained Mask R-CNN with the RGB image sequence as perceived by the robot camera. The segmented hand and object depth images are back-projected into 3D, where KinectFusion creates a corresponding TSDF volume. Drift-free tracking is achieved by continuously aligning the current input frame against the TSDF volume using ICP [61, 62]. Nevertheless, as most household objects possess rather simple geometry, camera tracking can become challenging. We thus track hand and object together in a shared TSDF, as the additional hand geometry significantly stabilizes the process and makes the camera pose estimation more robust. Finally, we separate hand from object by means of two individual TSDF reconstructions, using the previously estimated relative poses together with the segmentation results. The additional information from the 2D segmentation is essential in this process, as the dynamic motion of the human demonstrator's arm and body relative to the static environment can be filtered out so as to solely fuse the relevant hand and object information.
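The fusion step can be illustrated with a minimal single-frame TSDF integration in numpy. This is a sketch under simplifying assumptions (camera at the identity pose, nearest-neighbour depth lookup, illustrative truncation distance), not the KinectFusion implementation used in the pipeline:

```python
import numpy as np

def fuse_depth_into_tsdf(tsdf, weights, origin, voxel_size, depth, mask, K,
                         trunc=0.04):
    """Integrate one segmented depth frame into a TSDF volume.

    tsdf, weights : (nx, ny, nz) running TSDF values and integration weights.
    origin        : world position of voxel (0, 0, 0); the camera is assumed
                    to sit at the identity pose for brevity.
    depth, mask   : (h, w) metric depth map and boolean segmentation mask,
                    so background geometry is never fused.
    K             : 3x3 pinhole intrinsics.
    """
    nx, ny, nz = tsdf.shape
    ii, jj, kk = np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz),
                             indexing="ij")
    pts = origin + voxel_size * np.stack([ii, jj, kk], axis=-1)  # voxel centres
    z = pts[..., 2]
    zs = np.where(z > 0, z, 1.0)                 # avoid division by zero
    u = np.round(K[0, 0] * pts[..., 0] / zs + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts[..., 1] / zs + K[1, 2]).astype(int)
    h, w = depth.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    uc, vc = np.clip(u, 0, w - 1), np.clip(v, 0, h - 1)
    d = np.where(valid, depth[vc, uc], 0.0)
    seg = np.where(valid, mask[vc, uc], False)
    sdf = d - z                                  # signed distance along the ray
    update = valid & seg & (d > 0) & (sdf > -trunc)
    new = np.clip(sdf / trunc, -1.0, 1.0)
    w_old = weights[update]
    tsdf[update] = (tsdf[update] * w_old + new[update]) / (w_old + 1.0)
    weights[update] += 1.0
```

In the full system, the per-frame camera pose estimated via ICP would transform the voxel centres into camera space before projection, and running the integration once with the hand mask and once with the object mask yields the two separate TSDF reconstructions described above.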
III-B Object Shape Completion
After reconstruction, the object model is not yet complete due to self-occlusion and partial visibility. As a detailed and complete mesh has a positive influence on the subsequent tasks, we first apply shape completion. While there are different approaches to achieve this, in this work we harness a 3D CNN to directly correct the TSDF volume prior to shape extraction via marching cubes. Despite the cubic memory scaling of TSDFs, they still provide a suitable solution in our case, where both object and hand have a limited spatial extent. We apply a 3D variant of U-Net with skip-connections for improved accuracy and feed it with the fused object TSDF volume. The 3D volume is encoded into a 512-dimensional feature descriptor, which is additionally concatenated with RGB features to provide complementary image information on top of the geometry. The decoder consists of 6 up-convolutions followed by a max-pooling layer to eventually reach the input resolution. The object mesh prediction is modeled as a classification task, where each voxel in the output volume carries a prediction score denoting whether the voxel is occupied. Since only a small portion of voxels in the volume is occupied, the training data is highly imbalanced. We thus apply a class-average loss and additionally use the focal loss as objective function to re-balance the loss contributions according to
L = -(1/|Pos|) Σ_{i∈Pos} (1-p_i)^γ log(p_i) - (1/|Neg|) Σ_{i∈Neg} p_i^γ log(1-p_i),

where Pos and Neg respectively represent the occupied and empty voxels, p_i denotes the predicted occupancy probability of voxel i, and the focusing parameter γ is set as originally proposed by Lin et al. To train our shape completion model, we render multiple images from different views of various hand-object interactions using a synthetic simulator. We then extract the associated object TSDFs using the previously described procedure. These extracted object TSDFs are eventually fed into the network as training input, while the object meshes from simulation are transformed to their voxel representation in order to obtain the associated ground truth.
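A minimal numpy sketch of such a class-averaged focal objective (γ and the clipping constant are illustrative defaults, not necessarily the exact training settings):

```python
import numpy as np

def balanced_focal_loss(p, occupied, gamma=2.0, eps=1e-7):
    """Focal loss averaged separately over occupied (Pos) and empty (Neg) voxels.

    p        : predicted occupancy probability per voxel (any array shape).
    occupied : boolean ground-truth occupancy of the same shape.
    Averaging each class independently keeps the sparse occupied voxels
    from being drowned out by the dominant empty ones.
    """
    p = np.clip(np.asarray(p).ravel(), eps, 1.0 - eps)
    occ = np.asarray(occupied).ravel()
    pos = -((1.0 - p[occ]) ** gamma) * np.log(p[occ])      # occupied voxels
    neg = -(p[~occ] ** gamma) * np.log(1.0 - p[~occ])      # empty voxels
    loss = 0.0
    if pos.size:
        loss += pos.mean()
    if neg.size:
        loss += neg.mean()
    return loss
```

The focal weighting ((1-p)^γ for positives, p^γ for negatives) down-weights voxels the network already classifies confidently, so training gradient concentrates on the hard, ambiguous regions of the volume.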
III-C Hand Pose Estimation
We retrieve the parameters of a statistical hand model from the reconstructed hand shape. To this end, we use the MANO hand model, which maps hand pose and shape parameters to a mesh. The MANO model is built from multiple scans of real hands and features realistic deformation.
As prediction under occlusion is difficult, prior work on joint hand-object reconstruction proposes to train a CNN for hand pose and object mesh estimation using auxiliary contact and collision losses that encourage natural, collision-free predictions, which mutually improves both tasks. We leverage the hand branch of this architecture for a complete hand mesh reconstruction. As the hand is part of a joint reconstruction, its pose is already close to the actual grasping location. Nevertheless, to further improve grasping, we additionally employ ICP to tightly align the hand mesh with the point cloud extracted from the TSDF volume of the partial hand reconstruction. This allows us to calculate the transformation of the hand mesh in robot view coordinates, from which grasping instructions can be derived. The final hand and object meshes after alignment are visualized in Fig. 3.
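This refinement can be sketched as a compact point-to-point ICP with a Kabsch alignment step. The sketch uses brute-force nearest neighbours for brevity; a production version would use a k-d tree and outlier rejection:

```python
import numpy as np

def icp_align(src, dst, iters=20):
    """Rigidly align src (N, 3) onto dst (M, 3) with point-to-point ICP.

    Returns R, t such that src @ R.T + t approximates dst.
    """
    R, t = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        # brute-force nearest neighbour in dst for every current src point
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        nn = dst[d2.argmin(axis=1)]
        # Kabsch step: optimal rotation between the centred point sets
        mu_s, mu_d = cur.mean(0), nn.mean(0)
        H = (cur - mu_s).T @ (nn - mu_d)          # 3x3 cross-covariance
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R_step = Vt.T @ D @ U.T                   # proper rotation (det = +1)
        t_step = mu_d - R_step @ mu_s
        cur = cur @ R_step.T + t_step
        R, t = R_step @ R, R_step @ t + t_step    # accumulate the transform
    return R, t
```

Here `src` would be the sampled MANO mesh vertices and `dst` the partial hand point cloud extracted from the TSDF; since the joint reconstruction already provides a good initialization, a few ICP iterations suffice for tight alignment.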
III-D Grasping Point & Instruction Retrieval
After accurately recovering both object and hand shape from the demonstration, we want to determine the grasping instructions given the robot view. We first retrieve the object pose, which is then used to transform the hand mesh and calculate the grasping points with the thumb and index finger of the hand model. The first step is realized with PPF-FoldNet, which encodes the local geometry in a high-dimensional feature descriptor. We calculate features for both the demo reconstruction and the current robotic observation, and then align the two point clouds through robust feature matching using RANSAC.
As our extracted hand mesh is aligned with the object pose, we can infer the associated human grasp for the given input. Having estimated hand and object pose in robot view coordinates, we can determine the grasping instructions depending on the gripper hardware. We exemplify this with the Toyota Human Support Robot (HSR) and its two-finger gripper in the evaluation section below.
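While the learned PPF-FoldNet descriptors are beyond the scope of a snippet, the point pair feature they build on is simple to state. The sketch below computes the classic 4D PPF; its invariance to rigid motions is what makes matching between the demo reconstruction and the robot view possible regardless of object pose:

```python
import numpy as np

def point_pair_feature(p1, n1, p2, n2):
    """Classic 4D point pair feature: (||d||, angle(n1,d), angle(n2,d), angle(n1,n2)).

    p1, p2 are 3D points and n1, n2 their surface normals. The feature is
    invariant under any rigid transformation of the pair.
    """
    d = p2 - p1
    def angle(a, b):
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        return np.arccos(np.clip(cos, -1.0, 1.0))
    return np.array([np.linalg.norm(d), angle(n1, d), angle(n2, d), angle(n1, n2)])
```

Descriptors computed from such features in the demo reconstruction and in the current observation can be matched, and RANSAC over the matches then yields the rigid transform between the two point clouds.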
We evaluate in both simulation and the real world. To enable comparison with the state of the art, we introduce a new synthetic dataset and an evaluation benchmark, which we make publicly available.
IV-A Evaluation Protocol
We select four objects of different shape (drill, hole punch, cookie box and shampoo bottle) for our in silico tests (see Fig. 5). To obtain the ground-truth object shapes, we scanned the cookie box and shampoo bottle using a 3D scanner (Shining3D EinScan-SP) and acquired the hole punch and drill mesh models from the LineMOD dataset. For evaluation we use the Toyota HSR, a service robot platform that provides several integrated functionalities for a wide range of applications. We build our benchmark on the open-sourced simulation environment for the HSR.
Each test scene is set up with a table one meter in front of the robot, within the view of the robot's head RGB-D camera. The scanned object models are placed at different locations and viewing angles on the table. We consider a grasp successful only if the robot is able to grasp the object with its gripper, retrieve it from the table, and hold it still for three seconds without dropping it. In addition, we record a human demonstration for each object, which we release with the benchmark, consisting of short RGB-D sequences of less than one minute (~100-400 frames) showing a hand-object interaction at a distance of 0.5 m. During the demonstration, we rotated each object once around by hand, as shown in Fig. 4.
As for the real experiments, we evaluate our pipeline in a physical grasping task using the Toyota HSR. The goal of the task is again for the HSR to grasp a previously learned household object and retrieve it from the table. We use four household objects: a cup, a can, a bottle, and a remote control (see Fig. 8). For each of these objects, we record two human demonstration sequences, one in which the demonstrator holds the object from its side, and one from the top. For each demonstration sequence, the experiments are conducted as follows: we record the RGB-D demonstration sequence using the HSR's head camera. Then, we run our pipeline to create a model of the object and the corresponding hand model for grasping it. Finally, we place the object on a table in front of the robot and let the HSR grasp the object and retrieve it from the table. To assess the robustness of the learned models, we vary the placement of the objects on the table. In particular, we divide the front half of the table, i.e., the closer half with respect to the initial position of the HSR, into three equal areas. In each run, we select one of the three areas, randomly place the object there, and let the HSR attempt to grasp it. We repeat the grasping experiment three times for each area and each demonstration sequence, thus performing 9 runs per demonstration sequence, 18 runs per object, and 72 runs overall. A run is considered successful if the HSR is able to grasp the object, retrieve it from the table, and move back to a neutral position with the object still in its gripper. Note that the rear half of the table is excluded from the experiments, as reaching it would create challenges for the HSR's path planner. A representation of the experimental environment is provided in Fig. 6.
IV-B Extracting Grasping Instructions for Toyota HSR
Given the hand predicted in the test scene, we generate the target 6D grasp pose for the HSR's gripper based on the locations of the index finger, the thumb, and the wrist. The final grasp location is calculated as the 3D middle point between the index and thumb locations.
For the rotation, we create a coordinate system using three vectors:
-  the vector v1 from the wrist to the calculated middle point between index and thumb, which is the vector that the HSR's hand will follow to approach the object;
-  the vector v2 perpendicular to the plane described by v1 and the vector between the thumb and the index finger location, which is the direction in which the HSR will tighten the grip; and
-  the vector v3 perpendicular to the plane described by v1 and v2, computed as their cross product.
An illustration of this logic is provided in Fig. 7.
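Using hypothetical keypoint variables for the wrist, index tip and thumb tip, the construction above can be transcribed literally in numpy:

```python
import numpy as np

def gripper_pose_from_hand(wrist, index_tip, thumb_tip):
    """6D grasp pose for a parallel gripper from three hand keypoints.

    Returns the grasp position (midpoint of index and thumb) and a rotation
    matrix whose columns are the three vectors described in the text.
    """
    grasp_pos = 0.5 * (index_tip + thumb_tip)
    v1 = grasp_pos - wrist                  # approach direction of the hand
    v1 /= np.linalg.norm(v1)
    baseline = thumb_tip - index_tip        # thumb-to-index vector
    v2 = np.cross(v1, baseline)             # normal of the (v1, baseline) plane
    v2 /= np.linalg.norm(v2)
    v3 = np.cross(v1, v2)                   # completes the right-handed frame
    R = np.stack([v1, v2, v3], axis=1)      # columns: v1, v2, v3
    return grasp_pos, R
```

Since v2 is built perpendicular to v1 and v3 = v1 x v2, the columns form a right-handed orthonormal basis, i.e. a valid rotation matrix for the target gripper pose.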
| Successful Grasps (%) / Average Planning Attempts | Shampoo | Drill | Hole Punch | Cookie Box | Mean |
|---|---|---|---|---|---|
| | 86.7 / 5.6 | 73.3 / 7.2 | 86.7 / 11.2 | 60.0 / 108.3 | 76.7 / 41.8 |
| w/o Shape Completion | 73.3 / – | 100.0 / – | 66.7 / – | 100.0 / – | 85.0 / – |
| | 86.7 / – | 93.3 / – | 86.7 / – | 100.0 / – | 91.7 / – |
IV-C Evaluation Results in Simulation
During evaluation, we use the aforementioned demonstration sequences to train our method. We then follow the described evaluation protocol and report the average success rate over 15 different grasping trials in TABLE I. In particular, we note robust grasping results with an average success rate of 91.7% over all objects, clearly showing that our method is capable of producing reliable grasps. Disabling the shape completion module, we saw a decrease in our success rate of approximately 7 points, which indicates the importance of completion for reliable grasping instructions. Interestingly, with shape completion we are on par or better for all objects except the drill. As this object is fairly large, matching results with PPF-FoldNet are reliable even without completion, and completion can add noise that lowers the performance for this large object. Nevertheless, for most hand-sized objects such as the shampoo bottle and hole punch, our results with shape completion are clearly superior, as the hand covers most of the object and matching without completion cannot be conducted reliably.
We additionally compare with a state-of-the-art approach for model-free grasping. We outperform this baseline by 15 points, with an average success rate of 76.7% for the baseline compared to our 91.7%. Moreover, the baseline is a discriminative approach that samples up to 900 plausible grasps with associated confidences. These grasp poses are then sorted from high to low scores and tested with a motion planning algorithm one after the other to assess whether the pose is feasible. In our experiments, the baseline tested 41.8 candidate grasps per trial on average. In contrast, our method evaluates only a single grasp pose, which makes the motion planning on average about 40 times faster. To enable future comparison, we will publicly release our benchmark suite as well as the demonstration sequences that we recorded.
IV-D Real World Evaluation Results on Toyota HSR
Fig. 8 summarizes the results of our real experiments. The grasp success rate is higher when grasping objects from their side than from the top. We postulate that this difference is due to the fact that i) most of the considered objects are horizontally symmetrical, which can sometimes cause our pipeline to place the predicted hand at the bottom rather than the top of the object, eventually leading to a grasping failure, and ii) the considered objects offer a larger surface for grasping on their side than on their top. The drop in success rate for the bottle is mostly due to the partial transparency of the object, which makes the shape reconstruction, and consequently the registration, more challenging. Nonetheless, the HSR was still able to successfully grasp the object in a large share of the runs simply by observing a single interaction between a person and the object. In addition, a live demo of the real experiment is shown in the attached video.
IV-E Restricting the Learning Sequence to Partial Visibility
During the teaching phase, we record and fuse images by demonstrating the object in a single motion from one side to the other, providing different views of the hand-object interaction. To understand the effect and limitations of a partial hand-object demonstration on the final result, we re-run the teaching phase for our simulated grasp experiments with decreasing demonstration length, using different fractions of the original demonstration sequences. For each fraction, we then run the grasp experiments with the Toyota HSR in our simulation environment and compare the success rates in TABLE II. It is worth noting that the random seeding of RANSAC caused one grasping mistake in our test for the large drill object. The results show that using fewer frames as the learning sequence leads to a drop in the average success rate, which grows as the demonstration is shortened from half of the video to even smaller fractions. The incomplete demonstration of both the object and the hand geometry results in a less accurate reconstruction, which directly affects the success of the entire pipeline: registration of the reconstructed MANO hand model with the hand point cloud can fail, which severely affects the grasping of the hole punch and cookie box. We observe that demonstration sequences showing all fingers of the hand to the camera at least once are key to robust grasping.
In this paper we introduced a novel method for inferring grasping instructions from a short human demonstration. At its core, a human demonstrates a hand-object interaction in a short RGB-D image sequence. The sequence is leveraged to jointly reconstruct the 3D object as well as the hand mesh. Then, we localize the object in 3D using point pair features and estimate reliable grasping instructions from the previously reconstructed interaction. Exhaustive evaluations in real and synthetic environments demonstrate the applicability of our approach to learning grasping instructions from a single human demonstration. In the current pipeline, our model is restricted to the prediction of a single object in the scene. Future research will focus on extending the pipeline with incremental learning: the grasping instructions for new objects can be incrementally learned and added to the existing object library. Overall, we believe that our methodology can pave the way towards natural and interpretable human-robot collaboration by imitation, and that our new dataset can serve as a basis to compare future approaches.
-  M. Alonso, A. Izaguirre, and M. Graña, “Current research trends in robot grasping and bin picking,” in Soft Computing Models, 2018.
-  C.-J. Liang, V. R. Kamat, and C. C. Menassa, “Teaching robots to perform quasi-repetitive construction tasks through human demonstration,” Automation in Construction, 2020.
-  M. Esposito, B. Busam, C. Hennersperger, J. Rackerseder, A. Lu, N. Navab, and B. Frisch, “Cooperative robotic gamma imaging: Enhancing us-guided needle biopsy,” in MICCAI, 2015.
-  B. Busam, M. Esposito, S. Che’Rose, N. Navab, and B. Frisch, “A stereo vision approach for cooperative robotic movement therapy,” in ICCVW, 2015.
-  K. Kleeberger, R. Bormann, W. Kraus, and M. F. Huber, “A survey on learning-based robotic grasping,” Current Robotics Reports, 2020.
-  C. Sahin, G. Garcia-Hernando, J. Sock, and T.-K. Kim, “A review on object pose recovery: from 3d bounding box detectors to full 6d pose estimators,” Image and Vision Computing, 2020.
-  A. Mousavian, C. Eppner, and D. Fox, “6-dof graspnet: Variational grasp generation for object manipulation,” in ICCV, 2019.
-  S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, et al., “Kinectfusion: Real-time 3d reconstruction and interaction using a moving depth camera,” in ACM Symposium on User Interface Software and Technology (UIST), 2011.
-  W. E. Lorensen and H. E. Cline, “Marching cubes: A high resolution 3d surface construction algorithm,” ACM SIGGRAPH Computer Graphics, 1987.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV, 2017.
-  J. Romero, D. Tzionas, and M. J. Black, “Embodied hands: Modeling and capturing hands and bodies together,” ACM ToG, 2017.
-  H. Deng, T. Birdal, and S. Ilic, “Ppf-foldnet: Unsupervised learning of rotation invariant 3d local descriptors,” in ECCV, 2018.
-  W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab, “Ssd-6d: Making rgb-based 3d detection and 6d pose estimation great again,” in ICCV, 2017.
-  X. Deng, Y. Xiang, A. Mousavian, C. Eppner, T. Bretl, and D. Fox, “Self-supervised 6d object pose estimation for robot manipulation,” in ICRA, 2020.
-  D. G. Lowe, “Object recognition from local scale-invariant features,” in ICCV, 1999.
-  A. C. Romea, M. M. Torres, and S. Srinivasa, “The moped framework: Object recognition and pose estimation for manipulation,” International Journal of Robotics Research, 2011.
-  S. Hinterstoisser, C. Cagniart, S. Ilic, P. Sturm, N. Navab, P. Fua, and V. Lepetit, “Gradient response maps for real-time detection of textureless objects,” TPAMI, 2012.
-  S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab, “Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes,” in ACCV, 2012.
-  F. Tombari, S. Salti, and L. Di Stefano, “Unique signatures of histograms for local surface description,” in ECCV, 2010.
-  B. Drost, M. Ulrich, N. Navab, and S. Ilic, “Model globally, match locally: Efficient and robust 3d object recognition,” in CVPR, 2010.
-  S. Hinterstoisser, V. Lepetit, N. Rajkumar, and K. Konolige, “Going further with point pair features,” in ECCV, 2016.
-  E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother, “Learning 6D object pose estimation using 3D object coordinates,” in ECCV, 2014.
-  A. Krull, E. Brachmann, F. Michel, M. Ying Yang, S. Gumhold, and C. Rother, “Learning analysis-by-synthesis for 6D pose estimation in RGB-D images,” in ICCV, 2015.
-  T. Hodan, F. Michel, E. Brachmann, W. Kehl, A. GlentBuch, D. Kraft, B. Drost, J. Vidal, S. Ihrke, X. Zabulis, et al., “Bop: Benchmark for 6d object pose estimation,” in ECCV, 2018.
-  C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in CVPR, 2017.
-  P. Wohlhart and V. Lepetit, “Learning descriptors for object recognition and 3d pose estimation,” in CVPR, 2015.
-  M. Sundermeyer, Z.-C. Marton, M. Durner, M. Brucker, and R. Triebel, “Implicit 3d orientation learning for 6d object detection from rgb images,” in ECCV, 2018.
-  Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes,” RSS, 2017.
-  C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese, “Densefusion: 6d object pose estimation by iterative dense fusion,” in CVPR, 2019.
-  Y. Li, G. Wang, X. Ji, Y. Xiang, and D. Fox, “Deepim: Deep iterative matching for 6d pose estimation,” in ECCV, 2018.
-  F. Manhardt, W. Kehl, N. Navab, and F. Tombari, “Deep model-based 6d pose refinement in rgb,” in ECCV, 2018.
-  Y. Labbé, J. Carpentier, M. Aubry, and J. Sivic, “Cosypose: Consistent multi-view multi-object 6d pose estimation,” in ECCV, 2020.
-  M. Rad and V. Lepetit, “Bb8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth,” in ICCV, 2017.
-  B. Tekin, S. N. Sinha, and P. Fua, “Real-time seamless single shot 6d object pose prediction,” in CVPR, 2018.
-  S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao, “Pvnet: Pixel-wise voting network for 6dof pose estimation,” in CVPR, 2019.
-  T. Hodan, D. Barath, and J. Matas, “Epos: estimating 6d pose of objects with symmetries,” in CVPR, 2020.
-  T. Hodaň, V. Vineet, R. Gal, E. Shalev, J. Hanzelka, T. Connell, P. Urbina, S. N. Sinha, and B. Guenter, “Photorealistic image synthesis for object instance detection,” in ICIP, 2019.
-  B. Busam, H. J. Jung, and N. Navab, “I like to move it: 6d pose estimation as an action decision process,” arXiv preprint arXiv:2009.12678, 2020.
-  G. Wang, F. Manhardt, J. Shao, X. Ji, N. Navab, and F. Tombari, “Self6d: Self-supervised monocular 6d object pose estimation,” in ECCV, 2020.
-  J. Mahler, F. T. Pokorny, B. Hou, M. Roderick, M. Laskey, M. Aubry, K. Kohlhoff, T. Kröger, J. Kuffner, and K. Goldberg, “Dex-net 1.0: A cloud-based network of 3d objects for robust grasp planning using a multi-armed bandit model with correlated rewards,” in ICRA, 2016.
-  J. Redmon and A. Angelova, “Real-time grasp detection using convolutional neural networks,” in ICRA, 2015.
-  I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The International Journal of Robotics Research, 2015.
-  J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” arXiv preprint arXiv:1703.09312, 2017.
-  J. Mahler, M. Matl, X. Liu, A. Li, D. Gealy, and K. Goldberg, “Dex-net 3.0: Computing robust vacuum suction grasp targets in point clouds using a new analytic model and deep learning,” in ICRA, 2018.
-  J. Mahler, M. Matl, V. Satish, M. Danielczuk, B. DeRose, S. McKinley, and K. Goldberg, “Learning ambidextrous robot grasping policies,” Science Robotics, 2019.
-  S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” The International Journal of Robotics Research, 2018.
-  L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” in ICRA, 2016.
-  V. Satish, J. Mahler, and K. Goldberg, “On-policy dataset synthesis for learning robot grasping policies using fully convolutional deep networks,” RAL, 2019.
-  Y. Jiang, S. Moseson, and A. Saxena, “Efficient grasping from rgbd images: Learning using a new rectangle representation,” in ICRA, 2011.
-  D. Morrison, P. Corke, and J. Leitner, “Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach,” arXiv preprint arXiv:1804.05172, 2018.
-  ——, “Learning robust, real-time, reactive robotic grasping,” The International Journal of Robotics Research, 2020.
-  Y. Qin, R. Chen, H. Zhu, M. Song, J. Xu, and H. Su, “S4g: Amodal single-view single-shot se (3) grasp detection in cluttered scenes,” in Conference on robot learning, 2020.
-  D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al., “Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” arXiv preprint arXiv:1806.10293, 2018.
-  S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis, “Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks,” in CVPR, 2019.
-  S. Song, A. Zeng, J. Lee, and T. Funkhouser, “Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations,” IEEE Robotics and Automation Letters, 2020.
-  P. Pastor, L. Righetti, M. Kalakrishnan, and S. Schaal, “Online movement adaptation based on previous sensor experiences,” in IROS, 2011.
-  S. Calinon, P. Evrard, E. Gribovskaya, A. Billard, and A. Kheddar, “Learning collaborative manipulation tasks by demonstration using a haptic interface,” in International Conference on Advanced Robotics, 2009.
-  T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine, “One-shot imitation from observing humans via domain-adaptive meta-learning,” arXiv preprint arXiv:1802.01557, 2018.
-  Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black, I. Laptev, and C. Schmid, “Learning joint reconstruction of hands and manipulated objects,” in CVPR, 2019.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in ICCV, 2017.
-  P. J. Besl and N. D. McKay, “Method for registration of 3-d shapes,” in Sensor Fusion IV: Control Paradigms and Data Structures, 1992.
-  S. Rusinkiewicz and M. Levoy, “Efficient variants of the icp algorithm,” in Third International Conference on 3-D Digital Imaging and Modeling, 2001.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015.
-  S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab, and V. Lepetit, “Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes,” in ICCV, 2011.
-  T. Yamamoto, K. Terada, A. Ochiai, F. Saito, Y. Asahara, and K. Murase, “Development of human support robot as the research platform of a domestic mobile manipulator,” ROBOMECH journal, 2019.