This paper focuses on the problem of learning 6-DOF grasping with a parallel jaw gripper in simulation. We propose the notion of a geometry-aware representation in grasping based on the assumption that knowledge of 3D geometry is at the heart of interaction. Our key idea is constraining and regularizing grasping interaction learning through 3D geometry prediction. Specifically, we formulate the learning of a deep geometry-aware grasping model in two steps: First, we learn to build a mental geometry-aware representation by reconstructing the scene (i.e., 3D occupancy grid) from RGBD input via generative 3D shape modeling. Second, we learn to predict the grasping outcome with its internal geometry-aware representation. The learned outcome prediction model is used to sequentially propose grasping solutions via analysis-by-synthesis optimization. Our contributions are fourfold: (1) To the best of our knowledge, we are presenting for the first time a method to learn a 6-DOF grasping net from RGBD input; (2) We build a grasping dataset from demonstrations in virtual reality with rich sensory and interaction annotations. This dataset includes 101 everyday objects spread across 7 categories. Additionally, we propose a data augmentation strategy for effective learning; (3) We demonstrate that the learned geometry-aware representation leads to about 10 percent relative performance improvement over the baseline CNN on grasping objects from our dataset; (4) We further demonstrate that the model generalizes to novel viewpoints and object instances.
Learning to interact with and grasp objects is a fundamental and challenging problem in robot learning that combines perception, motion planning, and control. The problem is challenging because it not only requires understanding geometry (the global shape of an object, the local surface around the interaction space) but also requires estimating physical properties, such as weight, density, and friction. Furthermore, it requires invariance to illumination, object location, and viewpoint. To handle this, current data-driven approaches [17, 29, 19, 22, 21] use hundreds of thousands of examples to learn a solution.
While further scaling may help improve performance of these methods, we postulate shape is core to interaction and that additional shape signals to focus learning will boost performance. The notion of using shape and geometry has been pioneered in grasping research [11, 18, 1, 20, 36].
Inspired by these approaches, we propose the concept of a deep geometry-aware representation (e.g., [40, 9, 2, 39, 23, 30, 41, 34, 10, 8]) for grasping. Key to our approach is that we first build a mental representation by recognizing and reconstructing the 3D geometry of the scene from RGBD input, as demonstrated in Figure 1. With the built-in 3D geometry-aware representation, we can hallucinate a local view of the object’s geometric surface from the gripper perspective that will be directly useful for grasping interaction. In contrast with black-box models that do not have an explicit notion of 3D geometry and with prior shape-based grasping approaches, our approach has the following features: (1) it performs 3D shape reconstruction as an auxiliary task; (2) it hallucinates the local view using a learning-free physical projection operator; and (3) it explicitly reuses the learned geometry-aware representation for grasping outcome prediction.
In this work, we design an end-to-end deep geometry-aware grasping network for learning this representation. Our geometry-aware network has two components: a shape generation network and a grasping outcome prediction network. The shape generation network learns to recognize and reconstruct the 3D geometry of the scene with an image encoder and voxel decoder. The image encoder transforms the RGBD input into a high-level geometry representation that involves shape, location, and orientation of the object. The voxel decoder network takes in the geometry representation and outputs the occupancy grid of the object. To further hallucinate the local view from gripper perspective, we propose a novel learning-free image projection layer similar to [41, 30]. Building upon the shape generation network, our grasping outcome prediction network learns to produce a grasping outcome (e.g., success or failure) based on the action (i.e. gripper pose), the current visual state (e.g., object and gripper), and the learned geometry-aware 3D representation. Unlike our end-to-end multi-objective learning framework, existing data-driven grasping pipelines [29, 22, 21] can be viewed as models without a shape generation component. They require either an additional camera to capture the global object shape or extra processing steps, such as object detection and patch alignment. Furthermore, these methods learn over a constrained grasp space, typically either 3-DOF or 4-DOF. We relax this constraint to learn fully generalized 6-DOF grasp poses.
We have built a large database consisting of 101 everyday objects with around 150K grasping demonstrations in Virtual Reality with both human and augmented synthetic interactions. For each object, we collect 10-20 grasping attempts with a parallel jaw gripper from right-handed users. For each attempt, we record a pre-grasping status, which includes the location and orientation of the object and gripper, as well as the grasping outcome (success or failure, determined by whether the object is between the gripper fingers after closing and lifting). To acquire sufficient data for learning, we generate additional synthetic data by perturbing the gripper location and orientation from human demonstrations using PyBullet. More information about our geometry-aware grasping project can be found at https://goo.gl/gPzPhm.
Our main contributions are summarized below:
To the best of our knowledge, we are presenting for the first time a method to learn a 6-DOF deep grasping neural network from RGBD input.
We build a database with rich visual sensory data and grasping annotations with a virtual reality system and propose a data augmentation strategy for effective learning with only a modest amount of human demonstrations.
We demonstrate that the proposed geometry-aware grasping network is able to learn the shape as well as the grasping outcome significantly better than models without a notion of geometry.
We demonstrate that the proposed model has advantages in guiding grasping exploration and achieves better generalization to novel viewpoints and novel object instances.
built a robotic system for learning grasping from large-scale real-world trial-and-error experiments. In this work, a deep convolutional neural network was trained on 700 hours of robotic grasping data collected from the system.
Fine-grained grasping planning and control often involves 3D modeling of object shape, modeling dynamics of robot hands, and local surface modeling [11, 18, 14, 37, 20, 36, 22, 21]. Some work focused on analytic modeling of robotic grasps with known object shape information [11, 18]. Varley et al. proposed a shape completion model that reconstructs the 3D occupancy grid for robotic grasping from partial observations, where the ground-truth 3D occupancy grid is used during model training. In comparison, our approach does not require full 3D volume supervision (e.g., an occupancy grid) for training. Similar to our work, prior work uses a learned shape context to help predict grasps. Unlike that work, we use the shape to build a virtual global geometric representation along with a local gripper-centric model to sequentially propose and evaluate grasp proposals. Li et al. investigated hand pose estimation in robotic grasping by decoupling contact points and hand configuration with parametrized object shape. Building upon the compositional aspect of everyday objects, Vahrenkamp et al. proposed a part-based model for robotic grasping that generalizes better to novel objects. Very recently, effort was also made in building DexNet [22, 21], a large-scale point cloud database for planar (top-down) grasping. In addition to general robotic grasping, several recent works investigated semantic or task-specific grasping [4, 15, 25].
In contrast to existing learning frameworks applied to robotic grasping (either top-down grasping or side-grasping), our approach features (1) a method to learn a 6D grasping network from RGBD input; (2) an end-to-end deep learning framework that performs generative 3D shape modeling and leverages it for predictive 6D grasping interaction; and (3) a learning-free projection layer that links the 2D observations with the 3D object shape, which allows for learning the shape representation without explicit 3D volume supervision.
In this section, we develop a multi-objective learning framework that performs 3D shape generation and grasping outcome prediction.
Being able to recognize and reconstruct the 3D geometry given RGBD input is a very important step during grasping planning. In our formulation, we propose a reconstruction of a 3D occupancy grid [40, 9, 2, 39, 30, 41, 34, 10, 8] that encodes the shape, location, and orientation of the object as our geometry-aware representation. Previous work generates normalized 3D occupancy grids centered at the origin. Our formulated geometry-aware representation differs in that (1) it takes location and orientation into consideration (the orientation of a novel object is usually undefined); and (2) it is invariant to camera viewpoint and distance (we obtain the same representation from an arbitrary camera setting).
Given an RGBD input and a corresponding 3D occupancy grid, the task is to learn a functional mapping from the former to the latter. Simply following this formulation, previous work [40, 9, 2, 39, 23] that uses 3D supervision obtained reasonable quality in generating normalized 3D volumes by using thousands of shape instances. However, in our problem setting, these methods would require even more data considering the entangled factors of shape, location, and orientation.
Recent breakthroughs in reconstructing 3D geometry with 2D supervision [30, 41, 34, 43, 10, 8, 6, 35] suggest that (1) the quality of reconstructed 3D geometry is as good as previous work with 3D supervision; (2) the learned representation generalizes better to novel settings than previous work with 3D supervision; and (3) learning becomes more efficient with 2D supervision. Inspired by these findings, we tackle the 3D reconstruction in a weakly supervised manner without explicit 3D shape supervision. An in-network projection layer has previously been introduced for 3D shape learning from 2D masks (e.g., the 2D silhouette of an object). Unfortunately, a 2D silhouette is usually an insufficient supervision signal to reconstruct objects with concave 3D parts (e.g., containers). For these reasons, we chose to use a depth signal in our shape reconstruction. Additionally, RGBD sensors are commonly available on most robot platforms.
To enable depth supervision in our shape generation component, we propose a novel in-network OpenGL projection operator that utilizes a 2D depth map as a supervision signal for learning to reconstruct the 3D geometry. We formulate the projection as an operation that transforms a 3D shape into a 2D depth map using the camera transformation matrix. Here, the camera transformation matrix is composed of the camera intrinsic matrix, the camera rotation matrix, and the camera translation vector. In our implementation, we also use a 2D silhouette as an object mask for learning. Empirically, this additional objective makes the learning stable and efficient.
Following the OpenGL camera transformation standard, each point in the 3D world frame is mapped to its corresponding point in the normalized device coordinate system using the camera transformation matrix. The conversion from depth buffer to real depth depends on the far and near clipping planes of the camera.
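To make the depth-buffer conversion concrete, the following is a minimal sketch that assumes the standard OpenGL perspective projection, where normalized device depth lies in [-1, 1]. The function names, the symbols `alpha` and `beta`, and the clipping-plane values are illustrative assumptions and are not taken from our implementation.

```python
def ndc_to_depth(z_ndc, z_near, z_far):
    """Map an OpenGL NDC depth in [-1, 1] to a positive distance from the camera.

    Assumes the standard OpenGL perspective projection matrix.
    """
    alpha = (z_near - z_far) / (2.0 * z_near * z_far)
    beta = (z_near + z_far) / (2.0 * z_near * z_far)
    return 1.0 / (alpha * z_ndc + beta)


def depth_to_ndc(depth, z_near, z_far):
    """Inverse of ndc_to_depth under the same assumption."""
    return ((z_far + z_near) - 2.0 * z_far * z_near / depth) / (z_far - z_near)


# Hypothetical clipping planes (in metres), chosen only for illustration.
Z_NEAR, Z_FAR = 0.1, 10.0
near_depth = ndc_to_depth(-1.0, Z_NEAR, Z_FAR)  # the near plane
far_depth = ndc_to_depth(1.0, Z_NEAR, Z_FAR)    # the far plane
```

Under this convention, an NDC depth of -1 maps to the near plane and +1 maps to the far plane, and the mapping is non-linear in between.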
Similar to the “transformer networks” proposed in [41, 13], our depth projection can be seen as: (1) performing dense sampling from the input volume (in the 3D world frame) to the output volume (in normalized device coordinates); and (2) flattening the 3D spatial output across one dimension. Again, each point in the output volume and its corresponding point in the input volume are related by the camera transformation matrix. The input and output volumes each have their own width, height, and depth. We define the projection as a dense sampling step followed by a channel-wise flattening step.
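The two steps can be illustrated with a simplified, non-differentiable sketch. It assumes nearest-neighbour sampling instead of interpolation, a voxel grid covering the world cube [-1, 1]^3, a silhouette obtained as the maximum occupancy along each ray, and a depth value taken at the first occupied sample. The function `project_volume` and the callable `ndc_to_world` (a stand-in for the inverse camera transformation) are hypothetical names introduced only for this example.

```python
import math


def project_volume(voxels, ndc_to_world, out_size, n_slices, threshold=0.5):
    """Project a cubic occupancy grid to a silhouette and an NDC depth map.

    voxels: nested list indexed [x][y][z] with occupancies in [0, 1].
    ndc_to_world: callable (x, y, z) -> (X, Y, Z); hypothetical stand-in
        for the inverse camera transformation.
    Returns (mask, depth); depth holds None for background pixels.
    """
    n = len(voxels)

    def centre(j, m):
        # Centre of cell j when [-1, 1] is split into m equal cells.
        return -1.0 + (2.0 * j + 1.0) / m

    def sample(point):
        # Nearest-neighbour lookup; points outside the grid are empty.
        index = []
        for c in point:
            i = math.floor((c + 1.0) * 0.5 * n)
            if i < 0 or i >= n:
                return 0.0
            index.append(i)
        return voxels[index[0]][index[1]][index[2]]

    mask = [[0.0] * out_size for _ in range(out_size)]
    depth = [[None] * out_size for _ in range(out_size)]
    for v in range(out_size):
        for u in range(out_size):
            # Step 1: dense sampling along the ray, from near to far.
            for k in range(n_slices):
                z = centre(k, n_slices)
                occ = sample(ndc_to_world(centre(u, out_size), centre(v, out_size), z))
                # Step 2: flattening across the depth dimension.
                mask[v][u] = max(mask[v][u], occ)
                if depth[v][u] is None and occ > threshold:
                    depth[v][u] = z
    return mask, depth


# Usage with an identity transform and a single occupied voxel.
grid = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
grid[1][2][3] = 1.0
mask, depth = project_volume(grid, lambda x, y, z: (x, y, z), out_size=4, n_slices=4)
```

The depth returned here is still in normalized device coordinates; converting it to metric depth requires the near and far clipping planes.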
In our implementation, we pre-computed the actual depth given the difficulty that is not back-propagatable. As we will see in the following section, the network will be trained to match the predicted depth and mask to their ground-truth counterparts. Please note that our in-network projection layer is learning-free as it implements the exact ray-tracing algorithm without extra free parameters involved. We note that the concept of depth projection is also explored in some very recent work [38, 33, 43], but their implementations are not exactly the same as our OpenGL projection layer.
Learning to reconstruct 3D geometry from single-view RGBD sensory input is a challenging task in computer vision due to shape ambiguity. We adopt shape consistency learning, which enforces viewpoint-invariance across multi-view observations [2, 41, 34]. More specifically, we (1) use the averaged identity units from multiple viewpoints as input to the shape decoder network and (2) provide multiple projections for supervising the 3D shape reconstruction during training. Such shape consistency learning encourages an image taken from one viewpoint to share the same representation with an image taken from another viewpoint. At testing time, we only provide RGBD input from a single viewpoint. Given a series of observations of the scene, the 3D reconstruction is formulated as a function of all observations. Similarly, the projection operator for each viewpoint maps the reconstructed shape to a depth map using the camera transformation matrix of that viewpoint. Finally, we define the shape reconstruction loss in Eq. 2.
Here, the depth and mask prediction terms are each weighted by a constant coefficient.
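The structure of such a multi-view objective can be sketched as follows. This is only an illustration under stated assumptions: the absolute error on foreground depth, the squared error on the silhouette, the averaging over viewpoints, and the default coefficient values are all choices made for this example rather than the exact form of Eq. 2.

```python
def shape_reconstruction_loss(views, lambda_depth=1.0, lambda_mask=1.0):
    """Weighted depth-plus-mask loss averaged over viewpoints.

    views: list of dicts with flat per-pixel lists under the keys
        "pred_depth", "true_depth", "pred_mask", "true_mask".
    The norms and default coefficients are assumptions for illustration.
    """
    total = 0.0
    for view in views:
        # Depth is only penalised on ground-truth foreground pixels.
        foreground = [i for i, m in enumerate(view["true_mask"]) if m > 0.5]
        depth_term = sum(
            abs(view["pred_depth"][i] - view["true_depth"][i]) for i in foreground
        ) / max(len(foreground), 1)
        # The silhouette is penalised on every pixel.
        mask_term = sum(
            (p - t) ** 2 for p, t in zip(view["pred_mask"], view["true_mask"])
        ) / len(view["true_mask"])
        total += lambda_depth * depth_term + lambda_mask * mask_term
    return total / len(views)


# Usage with a single two-pixel view.
example_view = {
    "pred_depth": [1.0, 0.0],
    "true_depth": [0.5, 0.0],
    "pred_mask": [1.0, 0.5],
    "true_mask": [1.0, 0.0],
}
loss = shape_reconstruction_loss([example_view])
```

Restricting the depth term to foreground pixels keeps background depth from dominating the objective, while the mask term supervises where the object is.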
As demonstrated in previous work [26, 7, 5, 42, 28] that learns interactions from demonstrations, prediction of the future state can be a metric for understanding the physical interaction. In our grasping setting, we define the RGBD input as the current state, the 6D pre-grasping parameters (position and orientation of the parallel jaw gripper) as the action, and the grasping outcome (e.g., a binary label representing whether a grasp is successful) as the future state. The future prediction task can be solved by learning a functional mapping from the current state and action to the outcome. We refer to this method as a baseline grasping interaction prediction model, which has been a basis of several recent state-of-the-art grasping methods using deep learning (e.g., [17, 19, 21]). These works managed to learn such a mapping with either (a) millions of randomly generated grasps, (b) an additional view from the eye/hand perspective, or (c) additional processing steps such as object detection and image alignment.
In comparison, our geometry-aware model is an end-to-end architecture that constrains its prediction with geometry information. As we learn to reconstruct the 3D geometry, we argue that the local surface view (typically from a wrist camera perspective) can be directly inferred from our viewpoint-invariant geometry-aware representation through the projection operator. Here, we treat the gripper as a virtual camera whose transformation matrix is determined by its world-space coordinates, given by the 6D pre-grasping parameters. In addition to the local view, our geometry-aware representation provides a global view of the scene that takes the shape prior, location, and orientation of the object into consideration. Finally, given a current observation, a proposed action, and the inferred 3D shape representation, we fit a functional mapping to the binary outcome.
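Treating the gripper as a virtual camera amounts to turning the 6D pre-grasping parameters into a rigid world-to-gripper transformation. The sketch below assumes the orientation is given as roll, pitch, and yaw angles composed in Z-Y-X order; this convention and the helper names are assumptions for illustration, since the parameterization of the gripper orientation is not fixed by the text above.

```python
import math


def rotation_from_euler(roll, pitch, yaw):
    """Rotation matrix R = Rz(yaw) * Ry(pitch) * Rx(roll); angles in radians."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def gripper_view_matrix(position, euler):
    """4x4 world-to-gripper matrix: the inverse of the gripper pose."""
    rot = rotation_from_euler(*euler)
    rot_t = [[rot[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(rot_t[i][j] * position[j] for j in range(3)) for i in range(3)]
    return [rot_t[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def transform_point(matrix, point):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    p = list(point) + [1.0]
    return [sum(matrix[i][j] * p[j] for j in range(4)) for i in range(3)]


# Usage: a gripper at (1, 2, 3) rotated 90 degrees about the world z-axis.
view = gripper_view_matrix([1.0, 2.0, 3.0], (0.0, 0.0, math.pi / 2))
local = transform_point(view, [1.0, 3.0, 3.0])  # point one unit along world y
```

A matrix of this kind can play the role of the camera view matrix when projecting the reconstructed volume from the gripper perspective.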
To implement the two components proposed in the previous sections, we introduce DGGN (deep geometry-aware grasping network) (see Figure 2), composed of a shape generation network and an outcome prediction network. The shape generation network has a 2D convolutional shape encoder and a 3D deconvolutional shape decoder followed by a global projection layer. Our shape encoder network takes RGBD images of resolution 128 × 128 and corresponding 4-by-4 camera view matrices as input; the network outputs identity units as an intermediate representation. Our shape decoder is a 3D deconvolutional neural network that outputs voxels at a resolution of 32 × 32 × 32. We implemented the projection layer (given camera view and projection matrices) that transforms the voxels back into foreground object silhouettes and depth maps at the input resolution (128 × 128). Here, the purpose of generative pre-training is to learn viewpoint-invariant units (e.g., object identity units) through object segmentation and depth prediction. The outcome prediction network has a 2D convolutional state encoder and a fully connected outcome predictor with an additional local shape projection layer. Our state encoder takes RGBD input (the pre-grasp scene) of resolution 128 × 128 and corresponding actions (position and orientation of the gripper end-effector) and outputs state units as an intermediate representation. Our outcome predictor takes both the current state (e.g., the pre-grasp scene and gripper action) and geometry features (e.g., viewpoint-invariant global and local geometry from the local projection layer) into consideration. Note that the local dense sampling transforms the surface area around the gripper fingers into a foreground silhouette and a depth map at resolution 48 × 48.
This section describes our data collection and augmentation process, as well as experimental evaluation on grasping outcome prediction and grasping trials.
We collected grasping demonstrations on seven categories of objects, which include a total of 101 everyday objects. To collect grasping demonstrations, we set up the HTC Vive system in Virtual Reality (VR) and assign target objects randomly to five right-handed users (three males and two females). In total, 1597 human grasps are demonstrated, with an average of 15 grasps per object (with lowest and highest number of grasps at 7 and 39 for a plate and a wine glass, respectively). We randomly split 101 objects into three sets (e.g., training, validation and testing) and make sure each set covers the seven categories (70% for training, 10% for validation and 20% for testing).
In order to collect sufficient grasping demonstrations for model training and evaluation, we generate synthetic grasps by perturbing the human demonstrations using PyBullet . This significantly helps in increasing the number of grasps by adding perturbations to the demonstrations. In total, we collected 150K grasping demonstrations covering 101 objects. Figure 3 illustrates examples of objects in the dataset, successful and unsuccessful grasping trials from human demonstrations, and synthetic grasps (visualized by gripper positions) for successful and unsuccessful trials that were generated by this augmentation process. More details are described in the Appendix.
For each demonstration, we take a snapshot of the pre-grasping scene (e.g., before closing the two gripper fingers) by randomly setting the camera at a distance (ranging between 35 centimetres and 45 centimetres). We draw a camera target position from a normal distribution with its mean at the object center and a desired variance (in our experiment, we use 3 centimetres as the standard deviation). Furthermore, we set up the camera around the target position at 8 different azimuth angles (in steps of 45 degrees) and adjust the elevation to 4 different angles (15, 30, 45, and 60 degrees). Finally, we save a state of the scene without a gripper, which is used for shape pre-training; this will be referred to as the static scene throughout the paper. We include only two elevation angles (15 and 45 degrees) in the training set while leaving the rest for evaluation.
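The viewpoint sampling described above can be sketched as follows. The uniform distribution over the camera distance, the z-up world frame, the per-view resampling of distance and target, and the function name `sample_viewpoints` are assumptions made for this illustration.

```python
import math
import random

AZIMUTHS_DEG = [45 * i for i in range(8)]  # 8 azimuths in steps of 45 degrees
ELEVATIONS_DEG = [15, 30, 45, 60]          # 4 elevation angles


def sample_viewpoints(object_center, rng, target_std=0.03, dist_range=(0.35, 0.45)):
    """Return one camera placement per (elevation, azimuth) pair, in metres.

    Assumes a z-up world frame and a uniformly sampled camera distance.
    """
    views = []
    for elevation in ELEVATIONS_DEG:
        for azimuth in AZIMUTHS_DEG:
            # Camera target: normally distributed around the object center.
            target = [c + rng.gauss(0.0, target_std) for c in object_center]
            distance = rng.uniform(*dist_range)
            el, az = math.radians(elevation), math.radians(azimuth)
            eye = [
                target[0] + distance * math.cos(el) * math.cos(az),
                target[1] + distance * math.cos(el) * math.sin(az),
                target[2] + distance * math.sin(el),
            ]
            views.append(
                {"eye": eye, "target": target, "azimuth": azimuth, "elevation": elevation}
            )
    return views


views = sample_viewpoints([0.0, 0.0, 0.0], random.Random(0))
train_views = [v for v in views if v["elevation"] in (15, 45)]
```

Filtering on the elevation, as in the last line, reproduces the split between the two training elevations and the two held-out elevations.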
We adopt the current data-driven framework as our grasping baseline by removing the shape encoder and shape decoder from our deep geometry-aware grasping model. This baseline can be interpreted as the grasping quality CNN  without an additional view from a top-down camera. We trained the model using the ADAM optimizer with a learning rate of for 200K iterations and a mini-batch of size of 4. As an ablation study, we added view and static scene as an additional input channel on top of the baseline model but didn’t observe significant improvements.
We adopted a two-stage training procedure: First, we pre-trained the shape generation model (shape encoder and shape decoder) using the ADAM optimizer with a learning rate of for 400K iterations and a mini-batch of size of 4. In each batch, we sample 4 random viewpoints for the purpose of multi-view supervision in the training time. We observed that this setting led to a more stable shape generation performance compared to single-view training. In addition, we used loss for foreground depth prediction and loss for silhouette prediction with coefficients and . In the second stage, we fine-tuned the state encoder and outcome predictor using the ADAM optimizer with a learning rate of for 200K iterations and a mini-batch of size of 4. We used cross-entropy as our objective function since the grasping prediction is formulated as a binary classification task.
In our experiments, all the models are trained using 20 GPU workers and 32 parameter servers with asynchronized updates. Both baseline and our geometry-aware model adopt convolutional encoder-decoder architecture with residual connections. The bottleneck layer (e.g., the identity unit in the geometry-aware model) is adimensional vector.
We evaluate the quality of the shape generation model by visualizing the geometry representations through the shape encoder and decoder network. In our evaluations, we used single-view RGBD input and corresponding camera view matrix as input to the network. As shown in Figure 4(a), our shape generation model is able to generate a detailed 3D occupancy grid from single-view input without 3D supervision during training. As shown in Figure 4(b), our model demonstrates reasonable generalization quality even on novel object instances.
|baseline CNN (15)||72.81||73.36||73.26||66.92||72.23||70.45||66.13||71.42|
|our DGGN (15)||78.83||79.32||77.60||68.88||78.25||76.09||73.69||76.55|
|baseline CNN (45)||71.02||74.16||73.50||63.31||74.23||72.70||64.19||71.32|
|our DGGN (45)||78.77||80.63||78.06||70.13||79.29||77.52||72.88||77.25|
|baseline CNN (30)||71.15||72.98||71.65||61.90||71.01||70.06||61.88||69.50|
|baseline CNN (60)||68.45||73.05||72.50||61.27||74.40||71.30||63.25||70.18|
One advantage of our shape generation component is that we can obtain additional local geometry information (see the red-dashed box in Figure 2(c)) from our geometry-aware representation. This is the key difference between our work and related work that requires an additional camera on the gripper. With 3D geometry as part of the intermediate representation, we hallucinate the local geometry by running a projection from the gripper’s perspective (i.e., simply treating the gripper as another virtual camera). To further understand the advantages of our shape generation component, we visualized the intermediate local geometry projected from the generated 3D occupancy grid. As shown in Figure 4(c), our shape generation component provides accurate local geometry estimation that is useful for grasping outcome prediction.
To evaluate the actual advantages in grasping outcome prediction from our modeling, we computed the average classification accuracy over 30K demonstrations from novel object instances (from the testing set) with diverse observation viewpoints. For each human demonstration, we generated 100 synthetic grasps through perturbation (50% of which are successful grasps) and computed the average accuracy on the 100 grasps (i.e., a random guess achieves 50% accuracy). To investigate the model performance under viewpoint changes, we repeat the evaluation experiment for four different elevation angles (15, 30, 45, and 60 degrees). We use parallel computing resources ( machines) during evaluation and the entire evaluation took about day. The results are summarized in Table I and Table II. Overall, the deep geometry-aware model consistently outperforms the deep CNN baseline in grasping outcome classification. As we can see, “teapot” and “plate” are comparatively more challenging categories for outcome prediction, since “teapot” has irregular shape parts (e.g., tip and handle) and “plate” has a fairly flat shape. When it comes to novel elevation angles (e.g., compare Table I and Table II), our deep geometry-aware model is less affected, especially in categories such as “teapot” and “plate” where viewpoint-invariant shape understanding is crucial.
|baseline CNN + CEM||48.60||64.28||55.44||45.99||61.00||53.97||63.08||55.85|
|our DGGN + CEM||56.73||68.84||60.31||50.09||67.21||59.87||69.22||61.46|
|rel. improvement (%)||16.72||7.09||8.77||8.92||10.18||10.92||9.73||10.03|
As we improve the classification accuracy of the grasping outcome, a natural question is whether this improvement can be used to guide better grasping planning. Given a grasping proposal seed (defined as a target gripper pose), we conducted grasping planning by sequentially adjusting the grasping pose, guided by our deep grasping network, until a grasp succeeds. In each optimization step, we performed the cross-entropy method (CEM) [31, 19] as follows. (1) We initialized with a failure grasp in order to force the model to find a better grasping pose. (2) To obtain the gradient direction in the 6D space, we sampled 10 random directions and selected the top one based on the score returned by the neural network (the output of the outcome predictor). We repeat the iterations until success (we set an upper bound of 20 steps). We conducted the same grasping exploration evaluation for both the baseline CNN and our deep geometry-aware model. To account for the variations in observation viewpoints and initial seeds, we repeat the evaluation eight times per testing demonstration in our dataset and report the average success rate after 20 iterations (a trial is marked as a failure only if there is no success in 20 steps). As shown in Table III, CEM guided by our geometry-aware model performs consistently better than CEM guided by the baseline CNN model. We believe the improved performance comes from the explicit modeling of the 3D geometry as an intermediate representation in our deep geometry-aware model. Our model achieved the most significant improvement in the “bottle” category, since a bottle shape is relatively easy to reconstruct. Our improvement in the “bowl” category is less significant, partly due to the difficulty of predicting its concave shape in novel object instances. Figure 5 demonstrates example grasping planning trajectories on different objects.
The baseline CNN is less robust compared to our deep geometry-aware model, which is more likely to transition from one side of the object to the other side with a clear notion of 3D geometry.
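The planning loop can be sketched as follows, with the outcome predictor and the success check abstracted as callables. The names `plan_grasp`, `score_fn`, and `is_success`, the fixed step size, and the use of a single step size for both position and orientation are hypothetical simplifications introduced for this illustration.

```python
import math
import random


def plan_grasp(seed_pose, score_fn, is_success, rng,
               n_candidates=10, max_steps=20, step_size=0.01):
    """Iteratively move a 6D pose along the best of several random directions.

    score_fn: callable pose -> float; stands in for the outcome predictor.
    is_success: callable pose -> bool; stands in for executing the grasp.
    Returns (final_pose, step_of_first_success or None).
    """
    pose = list(seed_pose)
    for step in range(1, max_steps + 1):
        candidates = []
        for _ in range(n_candidates):
            direction = [rng.gauss(0.0, 1.0) for _ in pose]
            norm = math.sqrt(sum(d * d for d in direction)) or 1.0
            candidates.append([p + step_size * d / norm for p, d in zip(pose, direction)])
        # Keep the candidate that the predictor scores highest.
        pose = max(candidates, key=score_fn)
        if is_success(pose):
            return pose, step
    return pose, None


# Toy usage: the score is the negative distance to a hypothetical good pose.
goal = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0]

def toy_score(pose):
    return -math.sqrt(sum((p - g) ** 2 for p, g in zip(pose, goal)))

final_pose, steps = plan_grasp(
    [0.0] * 6, toy_score, lambda pose: toy_score(pose) > -0.05,
    random.Random(0), step_size=0.02,
)
```

A trial counts as a failure when the second return value is `None`, which mirrors the rule that a trial is marked as a failure only if no success occurs within 20 steps.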
In this work, we studied the problem of learning the grasping interaction with deep geometry-aware representation. We proposed a deep geometry-aware network that performs shape generation as well as grasping outcome prediction with a learning-free physical projection layer. Compared to the CNN baseline, experimental results demonstrated improved performance in outcome prediction thanks to generative shape modeling. Guided by the geometry-aware representation, we obtained better planning via analysis-by-synthesis grasping optimization.
We believe the proposed deep geometry-aware grasping framework has considerable potential for advancing robot learning in general. One interesting future direction is to apply the learned geometry-aware representation to perform tasks using other types of hands (e.g., hands with very different kinematics). In addition, we would like to explore some alternative model designs (e.g., learning to grasp without the auxiliary state encoder) such that the learned geometry-aware representation might be easily adapted to other domains (e.g., a real robot setup).
We take advantage of the PyBullet simulator by switching between two modes: simulation and log playback. In essence, we use the following protocol to generate additional grasps:
Start the demonstration log playback in the Bullet physics engine.
Pause the log playback once a grasp is detected (pre-grasp state).
Store the current scene state (position and orientation of all objects).
Repeat random grasping exploration 100 times. (1) Draw a new grasp pose from a normal distribution with its mean being the value of the demonstrated grasp pose and a desired variance (in our experiment we use 5 centimetres as standard deviation for position and 20 Euler degrees as standard deviation for orientation). (2) Switch to simulation mode, open the gripper finger, place it at the new drawn random pose, close the gripper, and lift it. (3) Check whether the object is still between gripper fingers. Based on the outcome, we add a new pose to the list of successful or failed grasps (see Figure 3(b)(c)). (4) Reset the simulation environment to the previously stored state.
Resume log playback until next grasp is detected.
With this protocol, we collected a total of 150K synthetic grasps based on human demonstrations. As shown in Figure 3(d), we visualize the gripper positions using colored dots (we omit the gripper orientations): green dots represent successful grasps and red dots represent failed grasps.
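The random exploration step of this protocol can be sketched as below. The simulator interaction is hidden behind a hypothetical callable `simulate_grasp`, which is assumed to place the gripper, close it, lift it, report whether the object is still held, and reset the scene; no actual PyBullet calls are shown, and the function and parameter names are introduced only for this example.

```python
import random


def augment_demonstration(demo_position, demo_euler_deg, simulate_grasp, rng,
                          n_trials=100, pos_std=0.05, rot_std_deg=20.0):
    """Perturb one demonstrated grasp and sort the results by outcome.

    demo_position: demonstrated gripper position in metres.
    demo_euler_deg: demonstrated gripper orientation as Euler angles in degrees.
    simulate_grasp: hypothetical callable (position, euler_deg) -> bool.
    """
    successes, failures = [], []
    for _ in range(n_trials):
        # Draw a new pose around the demonstration (5 cm and 20 degrees std).
        position = [p + rng.gauss(0.0, pos_std) for p in demo_position]
        euler_deg = [a + rng.gauss(0.0, rot_std_deg) for a in demo_euler_deg]
        if simulate_grasp(position, euler_deg):
            successes.append((position, euler_deg))
        else:
            failures.append((position, euler_deg))
    return successes, failures


# Toy stand-in for the simulator: succeed when the gripper stays within 8 cm.
def toy_simulator(position, euler_deg):
    return sum(p * p for p in position) ** 0.5 < 0.08

good, bad = augment_demonstration(
    [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], toy_simulator, random.Random(0)
)
```

Each demonstrated grasp therefore yields 100 labelled synthetic grasps, split into successful and failed poses.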
Experiments with hierarchical reinforcement learning of multiple grasping policies. In International Symposium on Experimental Robotics, pages 160–172. Springer, 2016.
Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 3406–3413. IEEE, 2016.
Mofa: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In The IEEE International Conference on Computer Vision (ICCV), volume 2, 2017.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.