Versatile manipulation benefits from the capacity to flexibly control an end-effector in 3D space and dynamically react to changes in the environment. In the case of grasping, 6 degrees of freedom (6DoF: where the gripper is free to change in x, y, z position and in roll, pitch, yaw) closed-loop algorithms enable robots to pick up objects from a wider range of unstructured settings beyond tabletop scenarios: from moving in 6DoF to retrieve diagonally positioned plates in a dishwasher or harvest berries from a bush, to using closed-loop visual feedback for grasping objects moving along a conveyor belt or handed off by people. Despite the practical value of both 6DoF control and closed-loop feedback, most data-driven grasping algorithms today are only able to achieve one of these capabilities. Most methods only infer top-down grasps (4DoF: x, y, z, yaw) in simple tabletop settings [23, 37, 36], or detect grasps in 6DoF but with open-loop execution [17, 21, 7].
One major obstacle for achieving both 6DoF and closed-loop grasping is the challenge of acquiring effective training data. Collecting data on real robots through self-supervised trial and error is expensive. As the action space approaches higher dimensions (e.g. 4DoF to 6DoF grasping) and as the state space reaches higher diversity (e.g. images of static scenes to dynamic scenes), the exploration search space grows exponentially. In this large search space, the chances of stumbling on useful grasping trajectories through random search become exponentially slim. While prior work alleviates some of these issues by training on demonstration data collected from human teleoperation of robots, these approaches remain limited to a small range of environments that are physically accessible for those robots.
In this work, we develop a system for collecting grasping demonstrations in the wild by equipping a handheld grabbing tool with an RGB-D camera mounted on its “wrist” in the same way it would be on a real robot arm (Fig. 1). This device (which in total costs $600) is a low-friction tool that can be used by people to pick up objects while carrying out everyday tasks in real-world environments (e.g. picking up trash, sorting dishes, etc.). During these tasks, the camera captures RGB-D gripper-centric videos from which we recover full 6DoF grasping trajectories using classic visual tracking algorithms. This setup provides grasping demonstration data with substantially more data diversity and lower cost than prior work.
Leveraging this data, we show that it is possible to bootstrap and train a robust end-to-end 6DoF closed-loop grasping model with reinforcement learning that transfers to real robot platforms. The system uses a deep network to model a value function that maps from a visual observation of the state (i.e. gripper-centric images) to the expected rewards in that state. A key aspect of our grasping model is that it uses “action-view” based rendering to simulate future states with respect to different possible actions (e.g. what the gripper camera would see if it moves forward or sideways). It evaluates these states using the learned value function in a closed-loop while executing grasps to predict how the gripper should move in the next time-step to maximize rewards.
In summary, our main contributions are 1) a new low-cost hardware interface for collecting grasping demonstrations in diverse environments, and 2) a visual 6DoF closed-loop grasping algorithm that uses action-view based rendering to achieve 92% grasping success rates in static scenes and 88% in dynamic scenes with moving objects. Our experiments demonstrate that the capacity to move in 6DoF enables our system to grasp novel objects in a variety of environments: from grasping objects sideways from a wall to picking from inclined bins. We also show that the performance and learning efficiency substantially improve by training on demonstration data collected with our tool. Qualitative results are available in our supplemental video.
2 Related Work
In this section, we review relevant work on vision-based grasping and data collection for data-driven grasping.
Classic vision-based grasping solutions often explicitly model contact forces with prior knowledge of object geometry, pose, and dynamics [25, 31, 6, 38]. However, this kind of prior knowledge is difficult to obtain for novel objects in unstructured environments.
More recent data-driven methods explore the prospects of training object-agnostic grasping policies that detect grasps by exploiting learned visual features, without explicitly using object-specific knowledge [26, 23, 24, 8, 18, 37, 17, 21, 7]. This problem formulation enables these methods to generalize to novel objects without the need for scanning the objects to obtain 3D models or estimate their poses. However, since most of these approaches perform open-loop grasp execution, they are sensitive to calibration errors and fail to handle dynamic environments.
Another line of work tackles closed-loop grasping by designing algorithms that continuously gather visual observations during grasp execution and predict next actions using visual servoing [30, 20] or reinforcement learning. However, these methods are characterized by constrained state-action spaces in order to reduce the amount of training data required. For example, QT-Opt learns only top-down grasping policies (action space) with images from a fixed static camera (state space). As a result, the system cannot immediately generalize to different task configurations (e.g. grasping from shelves) without extensive retraining. Specifically, QT-Opt trains using a total of 580k off-policy + 28k on-policy grasping trials to learn an effective policy for the current setup, which makes it challenging to generalize to larger state-action spaces. In this work, we propose to use human demonstration data and view-based action representations to improve learning efficiency.
Grasping data acquisition.
Learning-based grasping algorithms heavily depend on acquiring high-quality training data. However, most prior self-supervised grasping systems are constrained to learning in simulation [30, 18, 19, 27] or structured lab environments [23, 36, 15, 14]. Gupta et al. improve the data collection process by physically moving a robot into different environments. However, the data is still limited to simple scenarios (e.g. picking up toys from the floor) due to inefficient bootstrap exploration algorithms (with low initial grasping success rates) and constrained physical robot access to diverse environments.
Learning from demonstration is a popular approach to address sample efficiency problems. With human experts directly annotating the training data [37, 14] or controlling the robot via teleoperation [39, 28], the system can quickly obtain positive examples to speed up the training process. However, both settings (annotation or teleoperation) require human experts to be familiar with the robot hardware and grasping mechanisms in order to correctly annotate the grasp poses or successfully teleoperate the robot. Training human experts for such tasks can be expensive and difficult to scale. On the other hand, recording videos of direct interactions between human hands and objects does not require expert knowledge from the subject [3, 1]. However, there is often a large gap between the kinematics of the human hand and those of the robot gripper, which makes it challenging to learn knowledge that transfers to robot manipulation policies. In this paper, using a handheld grabber, our data collection process is designed to be accessible to inexperienced users, scalable to any environment, applicable to any task, and transferable to real robot manipulation.
Our goal is to achieve reliable 6DoF closed-loop grasping in a framework that is flexible enough to handle novel objects and dynamic scene configurations. We show that this goal is achievable by training visual grasping value functions (using view-based rendering for data augmentation) on a large dataset of human demonstrations (collected from a handheld gripper equipped with a wrist-mounted camera). Sec. 4 describes our hardware setup and data collection process for gathering human grasping demonstrations from a diverse set of tasks and environments (i.e. in-the-wild). Sec. 5 describes our 6DoF closed-loop grasping model and how it is trained with this demonstration data.
4 Grasping Demonstrations In-the-Wild
To collect grasping data from human demonstrations, we built a low-cost portable handheld grabber tool equipped with a wrist-mounted RGB-D camera (illustrated in Fig. 2). We then asked willing participants to use the tool in place of their hands for everyday pick-and-place tasks, e.g. picking items from shelves, bins, refrigerators, sorting dishes in a dishwasher, or picking trash on the floor, etc. Our data collection system is driven by 3 key motivations:
Accessibility for diversity. Our handheld tool is a low-friction interface that allows untrained people to collect manipulation data in almost any environment (e.g. various homes, offices, warehouses, grocery stores), many of which would otherwise be difficult for robots to acquire physical access to. This substantially improves the diversity of the robot learning data that we can acquire.
Data for challenging tasks. For more challenging manipulation tasks like searching for dishes in a dishwasher, data collection through robot trial and error can be expensive – robot failures may lead to negative irreversible consequences (e.g. broken dishes). In contrast, our setup enables skilled humans to easily collect manipulation data for these tasks with negligible failure rates.
Minimized domain gap. Our gripper tool is designed to be as similar as possible to our real robot’s end effector: binary actuated parallel-jaw fingers with a wrist-mounted RGB-D camera. This similarity narrows the domain gap between the data collected from human demonstrations and the data that the robot encounters at test time.
4.1 Hardware Setup
Our handheld data collection device (Fig. 2) consists of: 1) a Royal Medical Solutions (RMS) plastic grabber reacher tool forearm, 2) a Dynamixel servo that twists the grabber’s internal cable to control the opening of the fingers, 3) a 3D printed grip that attaches to the back end of the grabber, 4) a binary push button on the grip that connects to an Arduino to trigger the Dynamixel servo, 5) an Intel RealSense D415 camera mounted 25cm from the gripper fingertips, streaming RGB-D images to 6) an Intel compute stick running Linux OS with data capturing software, 7) a portable 12V battery to power the tool for 5 hours on a single charge, and 8) an optional touch screen monitor. All components are either purchased off-the-shelf or 3D printed with PLA. The cost of the entire unit sums to around $600.
We designed the handheld gripper to be analogous to the end effector of the real robot setup (shown in Fig. 2 Right), which consists of a 6DoF UR5 robot arm with a binary RG2 gripper and a wrist-mounted Intel RealSense D415 camera. The handheld gripper uses binary control (triggered by the push button) to mimic the RG2’s binary open/close behavior: on button push, the handheld gripper fingers close; on button release, the fingers open.
4.2 Data Collection and Processing
We distributed data collection among 8 participants, who were tasked with collecting grasping data while performing various pick-and-place tasks (e.g. picking from shelves, picking from bins, rearranging objects, picking up trash, etc.) in different environments (e.g. apartments, kitchens, offices, warehouses). The varying tasks and environments naturally encourage human demonstrators to perform different grasping strategies, which subsequently lead to more diverse demonstration data. Our dataset in total contains 12 hours of recorded gripper-centric RGB-D videos, labeled with the binary signal of when the user was pushing the button to close the gripper.
To recover 6DoF grasping trajectories from the RGB-D videos of demonstrations, we use classic frame-to-frame visual tracking to estimate the camera pose and trajectory over time. Since the camera is fixed on the gripper and the rigid transform between the camera and gripper is calibrated and known beforehand, this tracking process also enables us to recover the gripper pose and trajectory over time. Specifically, to estimate the relative pose transform between two RGB-D frames, we detect SIFT keypoints on both frames and use RANSAC on correspondences with SVD to compute a rigid transform. We then refine that estimate by using ICP on the 3D point clouds projected from the frames. This algorithm assumes that the environment is static; hence, to reduce noisy estimates, we mask out the pixels that belong to the gripper and grasped objects.
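The RANSAC-with-SVD step can be illustrated with a minimal sketch, assuming the matched keypoints have already been lifted to 3D points using the depth channel. The function names, iteration count, and inlier threshold below are illustrative assumptions and are not taken from the paper; keypoint detection and ICP refinement are omitted.

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t."""
    src_centroid = src.mean(axis=0)
    dst_centroid = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_centroid).T @ (dst - dst_centroid)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_centroid - R @ src_centroid
    return R, t

def ransac_rigid_transform(src, dst, iters=200, inlier_thresh=0.01, seed=0):
    """Robustly fit a rigid transform to 3D correspondences with outliers.

    iters and inlier_thresh (in the units of the points) are hypothetical
    defaults chosen for this sketch.
    """
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        # Three correspondences are the minimal sample for a rigid transform.
        idx = rng.choice(len(src), size=3, replace=False)
        R, t = estimate_rigid_transform(src[idx], dst[idx])
        residuals = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = residuals < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all inliers of the best hypothesis.
    R, t = estimate_rigid_transform(src[best_inliers], dst[best_inliers])
    return R, t, best_inliers
```

In a full pipeline, the returned transform would serve as the initial estimate that ICP then refines on the dense point clouds.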
Additionally, we split the RGB-D videos into short clips that correspond to each picking attempt by using a set of heuristics on the binary gripper closing signal. The frames that occur before a button push (to close handheld gripper fingers) record the pre-grasp trajectory as the gripper approaches the target object, while the frames that occur between the button push and the following button release record the post-grasp trajectory as the gripper acquires the target object. We can also recover and track the pixel mask of the target object by using background subtraction to detect pixel regions in the images that are stationary throughout the frames captured between button push and release.
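As a rough illustration of this segmentation, the sketch below splits a per-frame button signal into picking attempts. The paper's actual heuristics are not specified, so the rule used here (an attempt starts at the previous release and ends at the next release) and the function name are assumptions.

```python
def split_into_attempts(button_pressed):
    """Split a per-frame button signal into picking attempts.

    Returns a list of (pre_grasp, post_grasp) pairs, where each element is a
    (start, end) frame range with an exclusive end. A press that is never
    released before the video ends is dropped.
    """
    attempts = []
    segment_start = 0
    press_frame = None
    for i, pressed in enumerate(button_pressed):
        if pressed and press_frame is None:
            # Button push: the gripper closes, ending the pre-grasp approach.
            press_frame = i
        elif not pressed and press_frame is not None:
            # Button release: the object is let go, ending the post-grasp part.
            attempts.append(((segment_start, press_frame), (press_frame, i)))
            segment_start = i
            press_frame = None
    return attempts
```

For example, the signal `[0, 0, 1, 1, 0, 0, 0, 1, 0]` yields two attempts, with pre-grasp frame ranges `(0, 2)` and `(4, 7)`.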
In summary, we extract the following information from each RGB-D video segment corresponding to each picking attempt: 1) pre-grasp gripper trajectory, 2) final gripper grasping pose, 3) target object pixel mask, 4) post-grasp (placing) gripper trajectory, and 5) picking order. In total, the dataset contains 7,797 valid picking attempts and grasping trajectories. Fig. 3 illustrates several example demonstrations in the dataset and the grasping trajectories computed from visual tracking.
5 6DoF Closed-loop Vision-based Grasping
The task of closed-loop grasping requires an action policy that enables the robot to move its gripper towards an object, approach it from an angle that is likely to lead to a stable grasp, and potentially execute beneficial pre-grasp manipulations along the way (e.g. pushing an object into position between the fingers). This pre-grasp approaching process is a time-varying sequence of actions, for which rewards are loosely defined, and has previously been shown to be more effectively learned through reinforcement than from direct supervision [36, 13].
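We formulate this vision-based grasping problem as a Markov decision process: given state $s_t$ at time $t$, the robot chooses and executes an action $a_t$ according to a policy $\pi(s_t)$, then transitions to a new state $s_{t+1}$ and receives a reward $r_t$. The goal of reinforcement learning is to find an optimal policy $\pi^*$ that selects actions which maximize the total expected rewards $R_t = \sum_{i=t}^{\infty} \gamma^{i-t} r_i$, i.e. the $\gamma$-discounted sum over an infinite horizon of future returns from time $t$ to $\infty$. In this work, we use off-policy Q-learning to learn the optimal parameterized Q-function $Q_\theta(s_t, a_t)$ (i.e. state-action value function), where $\theta$ might denote weights of a neural network. Formally, our learning objective is to iteratively minimize the temporal difference error $\delta_t$ between $Q_\theta(s_t, a_t)$ and a target value $y_t$:

$$\delta_t = D\big(Q_\theta(s_t, a_t),\, y_t\big), \qquad y_t = r_t + \gamma \max_{a' \in \mathcal{A}} Q_\theta(s_{t+1}, a'),$$

where $D$ measures the discrepancy between the predicted and target values, and $\mathcal{A}$ is the set of all available actions at time $t+1$.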
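The temporal difference objective can be made concrete with a small sketch. It assumes the standard one-step Q-learning target (reward plus discounted maximum of the next-state Q-values) and, following the later remark that Q-values are bounded in [0, 1] and trained with cross-entropy, uses binary cross-entropy as the error measure. Function names and the clamping constant are illustrative.

```python
import math

def td_target(reward, next_q_values, gamma, terminal):
    """One-step Q-learning target: r_t + gamma * max over next-state Q-values."""
    if terminal or not next_q_values:
        # No future return after the episode ends.
        return reward
    return reward + gamma * max(next_q_values)

def cross_entropy(q_pred, target, eps=1e-7):
    """Binary cross-entropy between a predicted Q-value and a target in [0, 1]."""
    # Clamp the prediction away from 0 and 1 to keep the logarithms finite.
    q = min(max(q_pred, eps), 1.0 - eps)
    return -(target * math.log(q) + (1.0 - target) * math.log(1.0 - q))
```

With `gamma = 0.9`, zero reward, and next-state Q-values `[0.2, 0.7, 0.5]`, the target is `0.9 * 0.7`.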
Within our formulation, we represent each state $s_t$ as a visual observation (i.e. downsampled RGB-D image) captured from the wrist-mounted camera on the robot’s end effector at time $t$. We parameterize each action $a_t$ as a 6DoF rigid transform that encodes the relative rotation and translation from the current robot end effector pose to the next target pose. Motion planning between end effector poses is autonomously executed on the real robot using standard proportional-derivative (PD) control with inverse kinematics (IK) solvers. Each episode (i.e. grasping trajectory) begins with the end effector initially positioned approximately 50cm away overlooking the scene of objects, and terminates after 40 state transitions or after a successful grasp (mechanically detected by thresholding on the distance between gripper fingers). Rewards are provided for successful grasps and otherwise.
5.1 View-based Rendering as Predictive Models
The key aspect of our formulation is that at each time step $t$, we use view-based rendering to forward-simulate the set of possible future states $s_{t+1}$ conditioned on the current state $s_t$ and action taken $a_t$. In other words, view-based rendering is used as a predictive model $f$, where $f(s_t, a_t)$ approximates $s_{t+1}$. Since states are represented by wrist-mounted camera views, and possible actions represent relative 6DoF rigid transforms of the end effector from its current pose, forward-simulating future states amounts to rendering a new camera view as if the end effector had moved according to $a_t$. We train our Q-function (modeled with a deep network) from human demonstration data and fine-tune with real world trial and error (described in Sec. 5.2). During test time, at any given state $s_t$, our system evaluates state-action pairs using our trained Q-function (which uses the view-based rendering model), and executes the action that maximizes the predicted Q-values, i.e. $\arg\max_{a_t} Q_\theta(s_t, a_t)$.
This model improves the data efficiency of our deep Q-function by removing the need to learn to interpret (or in many cases, memorize) how an action (e.g. gripper movement) should correspond to changes in the state space. For cases in which actions are represented by continuous values that hold abstract meaning (e.g. end effector Cartesian offsets, joint angles, or motor torques), the mapping between the action space and state (e.g. image) space needs to be explicitly decoded or learned. Our view-based rendering model helps to bypass this requirement. While predictive models like these have generally been shown in prior work (e.g. model predictive control) to improve the sample efficiency of reinforcement learning algorithms [5, 33], in this work we show that view-based rendering can serve as a strong proxy for predictive models in ego-centric visual grasping, where much of the task involves actively servoing to the best grasp on a target object.
Our action-view based grasping system consists of three components: 1) a 3D reconstruction pipeline that accumulates camera observations over time to generate a complete and persistent 3D representation of the scene, 2) a method for quickly rendering 3D scenes from arbitrary viewpoints using ray-casting, and 3) a deep neural network that models the value function $Q_\theta$. The following paragraphs describe the details of these components:
Aggregating visual observations.
As the end effector approaches a target object, the wrist-mounted camera continually gathers new visual observations (RGB-D images) of the scene. Each observation is partial due to object occlusions and clutter, hence the system requires an algorithm that can aggregate these partial observations into a complete and persistent 3D scene representation. Meanwhile, the representation should be flexible and continually update itself with new observations to handle dynamic environments.
To this end, we adopt the Truncated Signed Distance Function (TSDF) representation for fusing observations into a 3D voxel grid, where each voxel stores a value that represents its distance to the closest surface. The sign of that value indicates whether the voxel is in free space or occluded space [4, 22, 35]. We extend classic implementations by storing the color of the closest surface in each voxel as well, to support ray casting for downstream view-based rendering. At the beginning of each grasping attempt (episode), our system initializes a 3D voxel grid in robot coordinates, with voxel size set to 5mm. Given each new visual observation (RGB-D image) and estimated camera extrinsics (by computing end effector pose with known robot IK and using a previously calibrated transformation between the camera and end effector), the system transforms the observed surface from camera coordinates into TSDF voxel grid coordinates, and updates the TSDF values of all observed voxels using an exponential moving average that biases towards new observations. Regions that are not directly observed by the camera (missing depth, occluded, or outside the camera field of view) remain unchanged with their original TSDF values. The fusion algorithm is implemented with GPU acceleration and runs at 30 frames per second asynchronously with the rest of the grasping framework.
In this way, the algorithm is not only able to build a more complete 3D representation of a static scene by aggregating past observations, but is also able to update the scene representation for dynamic environments with new observations. Compared to other methods of aggregating past observations such as using recurrent neural networks or LSTMs, our TSDF fusion explicitly leverages accurate industrial-grade robot motion in order to reduce the burden of learning view point registration or 3D reconstruction inside the network. Moreover, this explicit 3D scene representation also enables us to easily render views from arbitrary camera viewpoints by ray-casting the zero level set of the voxel grid, which supports the action-view generation.
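The per-voxel update described above can be sketched as follows. This is a minimal CPU illustration, assuming a sparse dictionary in place of the dense GPU voxel grid; the blending weight `alpha` is a hypothetical value, since the weight used by the system is not given here, and color fusion is omitted.

```python
def fuse_tsdf(tsdf, observed, alpha=0.5):
    """Blend new truncated signed distances into a voxel grid.

    tsdf:     dict mapping voxel index (i, j, k) -> stored TSDF value
    observed: dict mapping voxel index -> TSDF value from the new frame;
              voxels absent from this dict were not observed in the frame
    alpha:    weight on the new observation (hypothetical value)
    """
    for voxel, new_value in observed.items():
        if voxel in tsdf:
            # Exponential moving average: larger alpha favors recent frames,
            # which lets the volume follow moving objects.
            tsdf[voxel] = (1.0 - alpha) * tsdf[voxel] + alpha * new_value
        else:
            # First observation of this voxel.
            tsdf[voxel] = new_value
    # Unobserved voxels are left untouched.
    return tsdf
```

A larger `alpha` makes the volume react faster to scene changes at the cost of less noise averaging over past frames.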
At each time step , our formulation chooses between a set of possible action candidates where and each action encodes the relative rotation and translation between the current end effector pose and the next target pose ( in our experiments). Translations are bounded between in the x, y, and z axes respectively, while rotations are bounded between for Euler angle rotations around the x, y, and z axes. These bounds linearly decrease with respect to the median depth value observed in . Actions that cause the robot to collide with itself or move outside the workspace are automatically filtered from the list of action candidates.
By ray-casting through our TSDF representation of the scene, we render a virtual observation captured from the view of the camera on the robot’s end effector as if it had moved according to a candidate action. The rendered view contains an RGB-D image and a surface normal image (i.e. per-pixel normal vectors encoded into 3 channels). All generated views are then fed into our trained Q-function. The state-action pair with the highest Q-value is selected and executed on the robot.
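The selection loop over rendered action-views can be sketched as below. The callables `render_view` and `q_function` are hypothetical stand-ins for the TSDF ray-caster and the trained network, and the Q-function is simplified to return a single scalar per view rather than a dense pixel-wise map.

```python
def select_action(state, candidate_actions, render_view, q_function):
    """Return the candidate action whose rendered view has the highest Q-value.

    render_view(state, action) -> predicted view after taking the action
    q_function(state, view)    -> scalar Q-value for that state/view pair
    Both callables are placeholders supplied by the caller.
    """
    best_action, best_q = None, float("-inf")
    for action in candidate_actions:
        # Forward-simulate the observation the camera would capture.
        view = render_view(state, action)
        q = q_function(state, view)
        if q > best_q:
            best_action, best_q = action, q
    return best_action, best_q
```

Because the rendering step plays the role of the predictive model, the network only needs to score views, not to learn how numeric action values change the image.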
Given a set of candidate views, the goal of the network is to evaluate the Q-value with respect to each candidate and select the best corresponding action to perform. We model our Q-function as a feed-forward fully convolutional network that has two input branches and one output branch. One input branch takes as input the visual observation of the current state $s_t$, while the other branch takes in one of the candidate views. The encoded current and future state features are then concatenated and fed into the action selection network to output a dense pixel-wise map of Q-values with the same image size and resolution as the input view. Both the state encoding branch and the action selection networks are modeled by ResNet-18 network architectures. The training objective is to minimize the error between the predicted and target Q-values. Similar to Kalashnikov et al., we constrain the Q-value to be bounded in [0, 1] and use the cross-entropy function for $D$ for training stability. Section 5.2 provides more details on how the target Q-value is assigned during training and fine-tuning. During testing, the network evaluates the Q-value for all valid candidate actions and then picks the one with the highest Q-value to execute.
5.2 Learning from Human Demonstrations
Our system learns the value function for the real-world robot platform from human demonstration data. While human demonstrations provide a diverse set of examples for learning grasping strategies, there are still two major issues that need to be addressed in order to make these demonstrations an effective data source for training robot grasping algorithms: first, like most learning from demonstration datasets (e.g. used for self-driving cars), the training data distribution is naturally unbalanced: it consists of mostly positive examples, with very few negative examples. Second, despite efforts on making the hardware setup similar, there is still a small domain gap between the demonstration data and real robot data. We address the first issue through negative trajectories synthesis, and tackle the second issue by pre-training on demonstrations then fine-tuning on the real robot platform using trial and error.
Synthesizing negative trajectories via rendering.
Each successful grasping demonstration trajectory (i.e. episode) consists of a sequence of RGB-D images captured up until the gripper closing signal that terminates the episode. Each RGB-D image is associated with a 6DoF camera pose computed from RGB-D visual tracking (described in Sec. 4). At each time step of the sequence, we use TSDF fusion to aggregate camera observations up until the current frame, then use view-based rendering with the fused volume to generate a set of action-views around the current camera pose (in the same fashion as our algorithm described in Sec. 5). The action-views that move the gripper closer to the ground truth trajectory are labeled as positive examples, while all other action-views are treated as negative examples. To balance training, we randomly sample negative views to maintain a 1:5 positive to negative example ratio. The target value of positive views are assigned as , where is number of steps in this grasping attempt, is the total step length of the grasping episode, and our discount factor . The value for all negative actions are assigned as . Rather than predict one Q-value per image observation, we predict dense pixel-wise Q-values where supervision is provided to the pixel of the final grasping pose (i.e. 3D gripper position) back-projected onto the current action-view image. This formulation serves as an attention mechanism that provides stronger supervision for our Q-function by specifically backpropagating gradients on the local visual features that contribute most to its Q-value prediction.
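A sketch of how labeled training examples might be assembled for one time step is given below. Only the 1:5 positive to negative ratio comes from the text. The discounted form of the positive target (`gamma ** (total_steps - step)`), the zero target for negatives, and the default `gamma` are assumptions made for illustration, as are all names.

```python
import random

def build_training_examples(positives, negatives, step, total_steps,
                            gamma=0.9, neg_per_pos=5, seed=0):
    """Pair action-views with target values for one demonstration time step.

    positives / negatives: lists of action-views (any hashable or opaque items)
    step, total_steps:     current step index and episode length
    gamma:                 discount factor (hypothetical value)
    """
    rng = random.Random(seed)
    # Assumed target: views closer to the final grasp receive higher values.
    pos_target = gamma ** (total_steps - step)
    # Subsample negatives to keep a 1:neg_per_pos positive:negative ratio.
    n_neg = min(len(negatives), neg_per_pos * len(positives))
    sampled_negatives = rng.sample(negatives, n_neg)
    examples = [(view, pos_target) for view in positives]
    # Assumed target of zero for every negative action-view.
    examples += [(view, 0.0) for view in sampled_negatives]
    return examples
```

Subsampling negatives per time step keeps each batch balanced even though far more candidate views move away from the demonstrated trajectory than towards it.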
Fine-tuning with robot trial and error.
To address the domain gap between data collected from human demonstrations and data from the real robot, we further fine-tune our grasping models on the real robot platform through trial and error. During fine-tuning, the robot executes grasping trajectories that follow the action-view Q-function predictions (pretrained from human demonstrations) with $\epsilon$-greedy exploration, where $\epsilon$ is fixed at 0.1. This exploration step enables the algorithm to explore other possible grasping trajectories beyond what it has learned from human demonstrations. After each grasping attempt (i.e. episode), the observations, action trajectories, and the final binary grasping label (success or failure) are stored in the replay buffer for fine-tuning. In both training and fine-tuning, the model is trained using the Adam optimizer with fixed learning rates and weight decay. Our models are implemented in PyTorch. Both models with and without this fine-tuning step are evaluated in our experiments.
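The exploration rule used during fine-tuning can be sketched in a few lines. The exploration rate of 0.1 is taken from the text; the function name and the representation of candidates as a list of Q-values are illustrative assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Pick an action index: random with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all candidate actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the candidate with the highest predicted Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Passing a seeded `random.Random` instance as `rng` makes the exploration sequence reproducible.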
In this section we evaluate the effectiveness of our proposed algorithm compared to alternative approaches, as well as its ability to adapt to different test environment settings. For all the experiments, our evaluation metric is the grasping success rate:

$$\text{grasping success rate} = \frac{\#\,\text{successful grasps}}{\#\,\text{grasping attempts}}.$$
Grasping in a variety of settings.
We first investigate our algorithm’s grasping performance across various static environment settings and scene configurations:
Tabletop. Robot grasps from a pile of objects randomly dumped on a flat tabletop.
Bin. Robot grasps from a pile of objects randomly dumped into a bin. This is more challenging than the Tabletop setting as it requires the grasping algorithm to avoid collisions with the bin while grasping.
Wall. Robot grasps from objects hung on a flat wall 1m in front of the robot.
Random. Robot grasps from a pile of objects randomly dumped into a bin that is randomly positioned in the workspace with a random height (0-15cm to tabletop) and random tilt angle (0-30° to tabletop).
For each configuration, we run a total of 10 test runs, where each run consists of 10 (Wall) or 20 (others) grasping episodes. Objects are replaced in the scene after each test run. Each grasping episode begins with the robot’s initial gripper positioned in a pose such that all target objects are visible to the wrist mounted camera.
Since the algorithm formulation predicts only relative 6DoF position, it works out-of-the-box with any initial starting position. Row [pretrain only] in Tab. 2 shows the same model trained with only human demonstration data without any fine-tuning on the real robot. We can see that this model is able to perform reasonably well out-of-the-box across different scene configurations, due to the diversity of the demonstrations. Fine-tuning under each specific setting further improves the algorithm’s performance on average ([+finetune] in Tab. 2).
Grasping in dynamic settings.
We also test our algorithm’s grasping performance in dynamic settings using the same experimental setup as Morrison et al. During each test run, we arrange a pile of 10 objects (Fig. 6) on a movable sheet on a tabletop. The robot attempts multiple grasps, and any objects that are grasped are removed. During each grasping attempt (i.e. episode), the pile is moved once by hand randomly (using the movable sheet). The movements include both translations and rotations (Fig. 6). This continues until all objects in the pile are grasped, or at least three consecutive grasps fail. We execute 10 test runs and average the grasping performance across the runs. Tab. 3 column [Dynamic] reports these results and their comparisons to alternative approaches in the same dynamic setting. These results show that our algorithm is able to achieve higher grasping success rates compared to alternative approaches for both static and dynamic settings.
Effect of pretraining with demonstration data.
To evaluate the benefits of pretraining on human demonstration data, we compare our algorithm’s performance with a model directly trained from on-robot self-supervised trial and error (described in Sec. 5). Fig. 7 plots grasping success vs. training iterations, where each iteration consists of one trial and error grasping episode. The diverse training data collected from human demonstrations not only helps the algorithm learn faster (higher performance in the early training stage), but also helps the algorithm learn better (higher performance after fine-tuning). This experiment shows that human demonstration data is more effective than trial and error data since the demonstration data contains significantly more positive and more diverse grasping examples than the trial and error data collected on the robot. This diversity is important for pretraining grasping policies that can generalize to different grasping scenarios.
7 Conclusions and Future Work
We introduce a new low-cost hardware interface for collecting grasping demonstrations in diverse environments, and a visual 6DoF closed-loop grasping algorithm that uses action-view based rendering. Our experiments demonstrate that training on the demonstration data improves both grasping performance and learning efficiency, and that the capacity to move in 6DoF together with adaptive closed-loop control enables the algorithm to handle a variety of environments.
Our system is not without limitations. First, our approach uses simple view-based rendering as a forward predictive model. While this approach is fast and accurate in modeling possible motions and passive observations, it does not model the physics of objects, which may be important during in-contact manipulation. As future work, it would be interesting to extend our predictive model with a learnable function that considers object and contact physics. More broadly, view-based rendering may also be applicable for other tasks with ego-centric visual states and locomotive action spaces; investigating its benefits for other applications (e.g. in navigation [29, 2]) would be interesting future work. Second, we only use pre-grasping trajectories from the demonstration data to learn a 6DoF closed-loop grasping model. It would be interesting to investigate how to make use of the other information captured in this dataset, such as picking order and placing trajectories.
Acknowledgements. We would like to thank Stefan Welker and Ivan Krasin for their help on designing the data collection device, Adrian Wong, Julian Salazar, and Sean Snyder for operational support, Chad Richards for helpful feedback on writing, and Ryan Hickman for managerial support. We are also grateful for financial support from Google and Amazon.
References
-  (2012) Generalization of human grasping for multi-fingered robot hands. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2043–2050. Cited by: §2.
-  (2017) Matterport3d: learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158. Cited by: §7.
-  (2014) Pre-grasp interaction for object acquisition in difficult tasks. In The Human Hand as an Inspiration for Robot Hand Development, Cited by: §2.
-  (1996) A volumetric method for building complex models from range images. In Special Interest Group on Computer GRAPHics and Interactive Techniques (SIGGRAPH), Cited by: §5.1.
-  (2018) Visual foresight: model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568. Cited by: §5.1.
-  (2009) The columbia grasp database. In ICRA, Cited by: §2.
-  (2018) Learning 6-dof grasping and pick-place using attention focus. arXiv preprint arXiv:1806.06134. Cited by: §1, §2, Table 1.
- High precision grasp pose detection in dense clutter. In IROS, Cited by: §2, Table 1.
-  (2018) Robot learning in homes: improving generalization and reducing dataset bias. In Advances in Neural Information Processing Systems, pp. 9094–9104. Cited by: §2.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: Appendix B.
-  (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §5.1.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: Appendix B.
-  (2018) Qt-opt: scalable deep reinforcement learning for vision-based robotic manipulation. CoRL. Cited by: §2, Table 1, §5.1, §5.
-  (2015) Deep learning for detecting robotic grasps. The International Journal of Robotics Research 34 (4-5), pp. 705–724. Cited by: §2, §2, Table 1.
-  (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 (4-5), pp. 421–436. Cited by: §2, Table 1.
-  (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §4.2.
-  (2018) Planning multi-fingered grasps as probabilistic inference in a learned deep network. arXiv preprint arXiv:1804.03289. Cited by: §1, §2, Table 1.
-  (2017) Dex-net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. RSS. Cited by: §2, §2, Table 1.
-  (2016) Dex-net 1.0: a cloud-based network of 3d objects for robust grasp planning using a multi-armed bandit model with correlated rewards. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1957–1964. Cited by: §2, Table 1.
-  (2018) Closing the loop for robotic grasping: a real-time, generative grasp synthesis approach. Robotics: Science and Systems. Cited by: §2, Table 1, Figure 6, §6, Table 3.
-  (2019) 6-dof graspnet: variational grasp generation for object manipulation. arXiv preprint arXiv:1905.10520. Cited by: §1, §2, Table 1.
-  (2011) KinectFusion: real-time dense surface mapping and tracking. In IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Cited by: §5.1.
- Supersizing self-supervision: learning to grasp from 50k tries and 700 robot hours. In ICRA, Cited by: §1, §2, §2, Table 1.
-  (2017) Learning to push by grasping: using multiple tasks for effective learning. In ICRA, Cited by: §2.
-  (2008) Grasping. In Springer Handbook of Robotics, Cited by: §2.
- Real-time grasp detection using convolutional neural networks. In ICRA, Cited by: §2.
-  (2019) ClearGrasp: 3d shape estimation of transparent objects for manipulation. arXiv preprint arXiv:1910.02550. Cited by: §2.
-  (2018) Multiple interactions made easy (mime): large scale demonstrations data for imitation. arXiv preprint arXiv:1810.07121. Cited by: §2.
-  (2018) Im2pano3d: extrapolating 360 structure and semantics beyond the field of view. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3847–3856. Cited by: §7.
-  (2017) Learning a visuomotor controller for real world robotic grasping using simulated depth images. CoRL. Cited by: §2, §2, Table 1, Figure 6, Table 3.
-  (2012) Pose error robust grasping from contact wrench space metrics. In ICRA, Cited by: §2.
-  (2013) Sun3d: a database of big spaces reconstructed using sfm and object labels. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1625–1632. Cited by: §4.2.
-  (2019) DensePhysNet: learning dense physical object representations via multi-step dynamic interactions. arXiv preprint arXiv:1906.03853. Cited by: §5.1, §7.
-  (2019) Form2Fit: learning shape priors for generalizable assembly from disassembly. arXiv preprint arXiv:1910.13675. Cited by: §7.
-  (2017) 3DMatch: learning local geometric descriptors from rgb-d reconstructions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.1.
-  (2018) Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. arXiv preprint arXiv:1803.09956. Cited by: §1, §2, Table 1, §5.
-  (2018) Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. ICRA. Cited by: §1, §2, §2, Table 1, Table 3.
-  (2017) Multi-view self-supervised deep learning for 6d pose estimation in the amazon picking challenge. In ICRA, Cited by: §2.
- Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. Cited by: §1, §2.
Appendix A Data Collection Device: Hardware Details
Table 4 provides a list of hardware components (and associated costs) used to build our handheld data collection device. Figure 8 shows CAD models for 3D printed parts, which can be downloaded from our project webpage.
|Part Name|Price ($)|Link|
|---|---|---|
|3D Printed Parts|30|-|
|Intel Compute Stick|280|link|
|Intel RealSense D415|150|link|
|Buck Converter 12V->5V (5A)|3|link|
|Battery 12V 6000mAh/5V 12000mAh|34|link|
|Monitor (1024x600 Touch Screen)|60|link|
|Dynamixel AX-12A Serial Servo|45|link|
|PP-Nest 12mm Push Button|1|link|
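As a quick arithmetic check, the listed prices can be summed to confirm that the bill of materials is consistent with the roughly $600 total quoted in the introduction. The dictionary below simply copies the prices from the table; the names `PARTS` and `total_cost` are illustrative.

```python
# Prices in USD, copied from the parts table above.
PARTS = {
    "3D Printed Parts": 30,
    "Intel Compute Stick": 280,
    "Intel RealSense D415": 150,
    "Buck Converter 12V->5V (5A)": 3,
    "Battery 12V 6000mAh/5V 12000mAh": 34,
    "Monitor (1024x600 Touch Screen)": 60,
    "Dynamixel AX-12A Serial Servo": 45,
    "PP-Nest 12mm Push Button": 1,
}


def total_cost(parts):
    """Return the summed price of all listed parts."""
    return sum(parts.values())


total = total_cost(PARTS)
print(f"Total: ${total}")  # prints "Total: $603"
```

The compute stick and the depth camera together account for 430 of the 603 dollars, so these two components dominate the cost of the device.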
Appendix B Network Architecture
The input to the current state encoder is an RGB-D image and its corresponding surface normal map. The encoder uses the following network architecture (Conv2d represents one 2D convolution layer, ResBlock represents one residual block [10] with BatchNorm [12]):
The future state encoder uses the following:
Conv2d(input=7, filter=64, kernel=3, stride=2, padding=1)
The action selection network uses the following:
Conv2d(input=64, filter=1, kernel=1, stride=1, padding=1)
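The two layers listed above can be checked with the standard convolution output-size formula, `out = floor((in + 2*padding - kernel) / stride) + 1`. The sketch below is a hedged illustration only: the 64x64 input resolution and the 32x32 feature-map size are assumed values, not taken from the paper, and the helper names are hypothetical. The 7 input channels are consistent with an RGB-D image (4 channels) plus a surface normal map (3 channels).

```python
def conv2d_out(size, kernel, stride, padding):
    """Spatial output size of a 2D convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv2d_params(in_ch, out_ch, kernel):
    """Number of learnable parameters (weights plus one bias per filter)."""
    return in_ch * out_ch * kernel * kernel + out_ch


# Future state encoder layer: Conv2d(input=7, filter=64, kernel=3, stride=2, padding=1)
# applied to an assumed 64x64 input.
enc_size = conv2d_out(64, kernel=3, stride=2, padding=1)
enc_params = conv2d_params(7, 64, kernel=3)

# Action selection layer: Conv2d(input=64, filter=1, kernel=1, stride=1, padding=1)
# applied to an assumed 32x32, 64-channel feature map.
act_size = conv2d_out(32, kernel=1, stride=1, padding=1)
act_params = conv2d_params(64, 1, kernel=1)

print(enc_size, enc_params)  # 32 4096
print(act_size, act_params)  # 34 65
```

With the parameters as listed, the stride-2 layer halves the spatial resolution, while the 1x1 layer with padding=1 enlarges its input by two pixels in each dimension.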