Being able to grasp objects is a fundamental robot capability. Given a 3D model of an object, there are numerous approaches for computing the end-effector pose for grasping the object [19, 29]. In many applications however, a complete 3D model of the object is not available. Therefore grasps must be computed from sensor input which is typically in the form of an image or a partial point cloud of the object.
Recent research on grasp planning focuses on end-to-end approaches that generate grasp proposals directly from the sensor input. These methods show impressive results and have been especially successful for specific grasp styles and setups, such as top-down grasping with an overhead camera. Extending these methods to 6-DOF grasp planning in more general settings remains an active research topic [21, 33].
In parallel, with the availability of large datasets such as ShapeNet, learning-based methods have been developed to generate complete 3D models from a single image [38, 10, 15, 3, 28, 20, 37, 25, 18]. Such methods make it possible to generate complete 3D object models as an intermediate step for grasp planning. However, most of these methods are object-centric and generate the object shape in a normalized object frame. To obtain a camera-centric reconstruction that a robot can act on, pose registration and scaling of the reconstructed object must be performed to align it to the partial pointcloud observed from the camera. This additional step is computationally expensive and a potential source of errors. Motivated by robotics applications, variations of these methods are trained to directly generate camera-centric reconstructions [36, 40, 17]. However, when trained on small but diverse grasping datasets such as YCB, the models fail to achieve good reconstruction quality and their generalizability is compromised.
In this paper, we propose an intermediate solution. We introduce the object shell representation for 3D geometry and a method to reconstruct the object shell from a single depth image (Figure 1). The depth image of an object captures information about where the camera rays enter the object, expressed as depth from the camera center. The shell representation augments this information with the depth of the points where the rays would exit the object. The set of all entry and exit points forms the Object Shell.
The shell representation has many computational advantages. The entry and exit points can be represented as depth images, which also encode neighborhood information. Since there is a one-to-one correspondence between the entry and exit points, we use a well-established UNet-style image-to-image architecture to infer the object shell in the camera frame directly from the input depth image. The simplicity of the shell representation leads to superior generalizability to novel objects and views. Our experiments show that the shell reconstruction network trained only on depth images of simple synthetic shapes outperforms the state-of-the-art object reconstruction methods when tested on RealSense depth images of novel household objects from the YCB dataset. Furthermore, since the shell depth images already include neighborhood information of the points on the object, they provide partial meshes of the object which can be stitched together in linear time using the contours of the entry and exit images to provide a complete object mesh in the camera frame in a fraction of a second.
Even though the object shell is an approximation to the true object shape, it captures sufficient geometric information for 6-DOF grasp planning. Specifically, we show that an object shell allows generating dense grasp proposals with grasp width and quality estimation (Fig. 2). We geometrically evaluate the accuracy of thousands of densely sampled grasps on object reconstructions from five methods, including our shell reconstruction. With hundreds of grasps in the MuJoCo simulator, we show that the estimated grasp quality from shell reconstruction correlates with the rate of grasp success under external disturbance. Experiments on a real robot demonstrate that the grasps planned using shell reconstruction provide % grasp success rate on singulated objects and over success rate in clutter.
In summary, the main contributions of this paper are:
A novel 3D object representation and a method to generate it from a single depth image. Our method provides a camera frame reconstruction of the object and outperforms state-of-the-art methods in real-world environments.
A data augmentation scheme that improves Sim2Real generalization for both shell and baseline methods.
Grasp planning experiments to characterize the object reconstruction quality for grasp planning, with emphasis on grasp precision and quality. Rigorous evaluations in simulation and on a real setup show that high quality grasps planned on shell reconstruction succeed more than % of the time.
We start with an overview of related work to highlight the novelty and significance of our contributions.
II Related Work
Single-view 3D Reconstruction: Generating a 3D reconstruction from a single view has been extensively studied using various representations such as voxel grids [3, 28, 38, 10, 15], implicit functions [25, 18], point-based representations [1, 7, 20] and mesh-based representations [27, 37, 9]. These methods are shown to produce high-quality reconstructions when trained and tested on large synthetic datasets [5, 39]. The objects in these datasets are placed close to the center of the workspace and rotated only about the vertical (Z) axis of the canonical object frame. Consequently, the methods trained on these datasets learn strong object category and orientation priors and produce reconstructions in the canonical object frame. For robotics applications, e.g. for a robot to pick up an object or to avoid collision with an object, we need object reconstructions in the robot frame, or in the camera frame when the camera extrinsics are known. The pose registration required to align the object-frame reconstruction to the robot or camera frame is computationally expensive and can serve as a source of errors.
Very few works in the literature look into camera-frame object reconstructions. Yao et al. presented a method which estimates the object symmetry plane and predicts front and back orthographic views. Their method, trained separately for every new object class, uses a GAN component which requires a large number of images for training. Nicastro et al. develop a method to estimate the thickness of the object. They report that their method achieves good performance on a subset of YCB objects in the training data, but fails to generalize to novel object types. Merwe et al. follow  for implicit surface representation (PointSDF) of an object, but train it on YCB objects for camera frame reconstruction. In Section V-A, we show that our object shell reconstruction method outperforms PointSDF and provides better generalization capability.
Shape Completion Based Grasping: Shape completion produces a full 3D representation of an object given a partial 3D pointcloud. Varley et al.  and Yan et al.  use voxel-grids to represent the object shape and use GraspIt!  and a custom ConvNet respectively to predict grasps on the reconstructed shapes. Voxel-grids have been shown to be inefficient in terms of memory footprint and reconstruction quality [25, 18, 20]. Merwe et al.  investigated the use of implicit shape representation for grasp planning. They report that while signed distance function (SDF) representation  significantly improves the reconstruction quality, this improvement does not improve the grasp success rate.
Grasp Prediction: An active area of research in grasp planning is to generate grasp proposals directly from an input image. Mahler et al. use millions of grasps in simulation and Pinto et al. use self-supervision over 50K grasp trials to train neural networks to generate a grasp proposal from an image. These methods are limited to top-down 3-DOF grasping. ten Pas et al. demonstrated a method to detect local affordances for 6-DOF grasping on the visible pointcloud of an object. Recent works by Mousavian et al. and Sundermeyer et al. present state-of-the-art methods to generate dense 6-DOF grasps on objects from a single view.
For 6-DOF grasping in a general setting as in  and , the information needed to evaluate whether a robotic gripper could fit over an object is missing in the input from a single view, as shown in Fig. 3. The SOTA grasping methods fail to implicitly reason about the true object shape, and instead the grasps are overfit to the observed partial pointcloud in the input. The goal of this paper is not to propose a new grasp planning framework, but to understand the use of object reconstruction methods for grasp planning. As shown in Fig. 3, our object shell reconstruction can support grasp planning methods to further improve their performance.
III Object Shell Reconstruction
In this section we formalize the shell representation and discuss its advantages. We also describe the training procedure to generate the shell of an object from its (masked) depth image. Lastly, we explain our data augmentation procedure for better Sim2Real performance.
Object Shell Representation - Consider the image of an object from an arbitrary view and the associated backprojection rays originating from the optical center. These rays would intersect with the object as shown in Fig. 1. The Object Shell is given by the entry and exit points on the object surface that the camera rays pass through. The shell can be compactly represented as a pair of depth images corresponding to these entry and exit points respectively.
The shell representation offers a few key characteristics. First, the shell is a view-dependent description of the object. This property enables the critical advantage of generating the reconstruction directly in the camera frame. Second, as we will see in the grasping experiments, the shell representation with only a pair of depth images provides sufficient information for outer grasps of many household objects. Third, and most importantly, the image-based shape representation allows posing the 3D reconstruction problem as a 2D pixel prediction problem, and enables using efficient 2D convolutions and well-proven image-to-image network architectures. Finally, since the shell layers (entry and exit images) contain the neighborhood information given by pixel adjacency, they provide partial meshes of the object which can be stitched together using the contours of the entry and exit images in linear time to generate an object mesh in the camera frame. The details of this stitching process are explained in Appendix A-A with examples from Fig. 6.
We note that the shell representation with a pair of depth images is not limited to convex objects. If we view the image plane as the X-Y plane and the optical axis as the Z axis, the shell representation poses no restrictions on the object’s projection onto the X-Y plane. It does impose a monotonicity constraint along the Z-axis: each ray enters and exits the object only once. We believe that this constraint is not overly restrictive: only 5 (wine glass, mug, bowl, pan, stacking block) out of 77 objects in the YCB dataset violate this monotonicity constraint from some views. In Appendix A-J we compare reconstructions from the shell method and a SOTA method (PointSDF) on adversarial objects and views and find that, qualitatively, shell reconstruction performs better.
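To make the monotonicity constraint concrete, the sketch below treats a single camera ray as a list of occupancy samples ordered by increasing depth. The function names, the sampled-ray setup, and the choice to report the first entry and last exit for rays that violate the constraint are illustrative assumptions, not part of the method described above.

```python
from typing import List, Optional, Tuple


def ray_intervals(occupancy: List[bool], depths: List[float]) -> List[Tuple[float, float]]:
    """Return (entry_depth, exit_depth) for every occupied run along one ray."""
    intervals = []
    start: Optional[int] = None
    for i, inside in enumerate(occupancy):
        if inside and start is None:
            start = i  # the ray enters the object here
        elif not inside and start is not None:
            intervals.append((depths[start], depths[i - 1]))  # the ray exits
            start = None
    if start is not None:  # still inside the object at the last sample
        intervals.append((depths[start], depths[-1]))
    return intervals


def shell_pixel(occupancy: List[bool], depths: List[float]):
    """Entry depth, exit depth, and a flag that is True when the ray
    enters and exits the object exactly once (assumed convention:
    first entry and last exit are reported otherwise)."""
    intervals = ray_intervals(occupancy, depths)
    if not intervals:
        return None  # the ray misses the object
    return intervals[0][0], intervals[-1][1], len(intervals) == 1
```

A ray through a solid object yields one interval and satisfies the constraint, whereas a ray crossing both walls of a mug yields two intervals, so a single entry/exit pair cannot describe the empty space between them.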
Shell Reconstruction Network Architecture - Our reconstruction method generates the object shell representation as a pair of depth images, given a masked depth image of an object observed from a camera. We formulate the shell reconstruction network based on UNet, a popular image-to-image network architecture with skip connections. We use a 4-level UNet architecture and a mean squared error image similarity loss for training. The skip connections in UNet architectures are a powerful tool for tasks with a direct relationship between input and output pixels. They enable feature extraction with large receptive fields while preserving low-level features. In Table III we show that using these skip connections is essential for high quality shell reconstructions and generalization to novel objects.
Training Data and Novel Augmentations - We use synthetically generated simple object mesh models and render depth images for training. We propose a data augmentation scheme for better Sim2Real generalization performance on real-world test images and objects. For the pre-render augmentation stage, random noise is added to the XYZ positions of a subset of mesh vertices of synthetic object geometries, followed by a smoothing operation to better resemble the uneven object surfaces observed in real-world depth images. The post-render augmentations consist of a novel data dropout in addition to the additive and multiplicative Gaussian noise augmentations proposed in . Depth images obtained by commodity sensors (Fig. 4 (a)) often contain missing values around the object boundary as well as in some regions inside the object. Naive modeling of the missing values by dropping out random pixels from the image does not result in realistic images (Fig. 4 (c)). Our dropout scheme involves removing pixels with angles sharper than a randomly sampled threshold and adding a small amount of “pepper” noise, followed by multiple rounds of stochastic erosion of boundary pixels. The final result of the proposed augmentation scheme (Fig. 4 (d)) displays realistic dropout and object edge patterns. Table III shows that these augmentations improve the reconstruction quality on the test dataset. The shape and view generation procedure and data augmentation are described in more detail with figures in Appendix A-B and Appendix A-C respectively.
IV Grasping on Object Reconstruction
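The post-render stage can be sketched roughly as follows. This is a simplified illustration on a small depth image stored as nested lists, with 0.0 marking missing pixels; all function names, noise magnitudes, and probabilities are assumed values, the image border is treated as missing, and the angle-based dropout step is omitted because its details are not given here.

```python
import random
from typing import List

Depth = List[List[float]]  # 0.0 marks a missing or background pixel


def add_noise(depth: Depth, rng: random.Random,
              add_sigma: float = 0.002, mul_sigma: float = 0.01) -> Depth:
    """Multiplicative plus additive Gaussian noise, applied to valid pixels only."""
    return [[z * (1.0 + rng.gauss(0.0, mul_sigma)) + rng.gauss(0.0, add_sigma)
             if z > 0.0 else 0.0 for z in row] for row in depth]


def pepper(depth: Depth, rng: random.Random, prob: float = 0.02) -> Depth:
    """Drop isolated valid pixels with a small probability."""
    return [[0.0 if (z > 0.0 and rng.random() < prob) else z for z in row]
            for row in depth]


def erode_boundary(depth: Depth, rng: random.Random,
                   rounds: int = 2, prob: float = 0.5) -> Depth:
    """In each round, drop every valid pixel that touches a missing pixel
    (4-neighborhood) with probability `prob`."""
    h, w = len(depth), len(depth[0])
    out = [row[:] for row in depth]
    for _ in range(rounds):
        snapshot = [row[:] for row in out]  # decide from the state before this round
        for r in range(h):
            for c in range(w):
                if snapshot[r][c] <= 0.0:
                    continue
                neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
                on_boundary = any(
                    not (0 <= rr < h and 0 <= cc < w) or snapshot[rr][cc] <= 0.0
                    for rr, cc in neighbors)
                if on_boundary and rng.random() < prob:
                    out[r][c] = 0.0
    return out


def augment(depth: Depth, seed: int = 0) -> Depth:
    """Noise, then pepper dropout, then stochastic boundary erosion."""
    rng = random.Random(seed)
    return erode_boundary(pepper(add_noise(depth, rng), rng), rng)
```

Because erosion only removes pixels adjacent to already-missing ones, the dropout concentrates around object edges rather than being spread uniformly, which is the qualitative difference from naive random dropout noted above.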
We explore the application of object shell reconstruction for planning parallel-jaw grasps. A parallel-jaw grasp can be defined by a grasp pose - the position and orientation of the gripper, and grasp width - the distance between the fingers. In this section, we develop methods that exploit the object reconstruction to generate dense grasp proposals with their grasp width and grasp quality.
Grasp Feasibility and Grasp Width Computation: As shown in Fig. 2-(d), we sample grasps on the visible pointcloud of the object and use the object reconstruction to evaluate geometric grasp feasibility. We assign a grasp pose such that the axis joining the fingers (finger axis) is along the point normal and one of the fingers is just over the visible pointcloud with the maximum ( mm) gripper opening. A grasp is geometrically feasible at the sampled pose if the local object geometry is within the gripper opening and does not collide with the gripper. For added grasp stability, we further impose the constraint that the normals at a set threshold number of points on the object reconstruction inside the gripper envelope must be along the finger axis. For a fixed grasp position, the grasp orientation is changed about the finger axis (to eight discrete angles making a full rotation) to evaluate whether any of these grasp poses leads to a feasible grasp. For all the feasible grasps, we compute the required grasp width from the bounds of the reconstruction inside the gripper envelope along the finger axis.
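A minimal sketch of the width computation is given below. It assumes the gripper envelope can be approximated by a cylinder of radius `envelope_radius` around the finger axis and ignores collisions with the gripper body and the surface-normal constraint; these simplifications, and all names and parameters, are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def grasp_width(points: Sequence[Point], center: Point, finger_axis: Point,
                max_opening: float, envelope_radius: float) -> Optional[float]:
    """Extent of the reconstruction along the finger axis inside the
    (cylindrical, assumed) gripper envelope; None if the grasp is infeasible."""
    norm = math.sqrt(dot(finger_axis, finger_axis))
    axis = tuple(a / norm for a in finger_axis)
    extents = []
    for p in points:
        rel = tuple(pi - ci for pi, ci in zip(p, center))
        s = dot(rel, axis)                # signed offset along the finger axis
        perp_sq = dot(rel, rel) - s * s   # squared distance from the axis
        if perp_sq <= envelope_radius ** 2:
            extents.append(s)
    if not extents:
        return None                       # nothing between the fingers
    width = max(extents) - min(extents)
    return width if width <= max_opening else None


def roll_angles(n: int = 8) -> List[float]:
    """Discrete rotations about the finger axis that cover a full turn."""
    return [2.0 * math.pi * k / n for k in range(n)]
```

In a fuller implementation the feasibility test would be repeated for each angle returned by `roll_angles`, with the gripper geometry rotated accordingly, and the grasp kept if any orientation passes.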
In Section V-B, we show that evaluating the accuracy of the grasp width prediction serves as a good metric to understand how object reconstruction accuracy affects grasp planning. From the application point of view, precise grasp width estimation is useful for grasping in clutter and tight arrangements to avoid potential collision with the other objects or picking up multiple objects.
Grasp Quality: A commonly accepted grasp quality metric is a measure of the external wrench the grasp can resist . From the mechanics of a parallel-jaw grasp  (details in Appendix A-E), the grasp quality is inversely proportional to the Euclidean distance of the grasp position from the center of geometry of the object which can be obtained from the object reconstruction. We generate a Grasp Quality Map (Fig. 2e) where every feasible grasp is assigned a relative quality score between 0 and 1 based on its Euclidean distance from the center of the object reconstruction.
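The quality map can be illustrated with a short sketch. It assumes that the center of geometry is the mean of the reconstruction points and that the relative score is obtained by min-max normalization of the distances, so the closest grasp scores 1 and the farthest scores 0; the exact normalization used in the paper may differ.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def centroid(points: Sequence[Point]) -> Point:
    """Mean of the reconstruction points (assumed center of geometry)."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def grasp_quality_map(grasp_positions: Sequence[Point],
                      reconstruction_points: Sequence[Point]) -> List[float]:
    """Relative quality in [0, 1]: higher for grasps closer to the centroid."""
    c = centroid(reconstruction_points)
    dists = [math.dist(g, c) for g in grasp_positions]
    d_min, d_max = min(dists), max(dists)
    if d_max == d_min:
        return [1.0 for _ in dists]  # all grasps are equally far from the center
    return [(d_max - d) / (d_max - d_min) for d in dists]
```

Because the score is relative, it ranks grasps on one object and is not meant to be compared across objects.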
In Section V-C, with experiments in the MuJoCo simulator as well as on a real robot, we show that under external disturbance, grasps of higher estimated quality succeed more often than those of lower quality.
In this section, we experimentally evaluate shell reconstruction against several state-of-the-art methods in terms of geometric accuracy of the reconstruction. With experiments in simulation and on a real robot, we demonstrate the effective application of shell reconstruction for grasp planning.
Baselines: We choose Occupancy Network (OCC) , Higher Order Function Network (HOF)  and PointSDF  as baselines. OCC, a representative implicit-function method, and HOF, a representative direct method, are shown to outperform state-of-the-art TSDF and voxel-based reconstruction methods respectively. For these two baselines, we train a version that produces output directly in the camera frame and a version with output in the object frame, labeled as the “(cam)” and “(obj)” versions respectively. Shell and these baselines are trained on the synthetic dataset explained in Section III. PointSDF  is a camera frame reconstruction method previously explored for grasping applications. For our experiments, we use a pre-trained PointSDF model trained on YCB objects. Since our test object dataset also comprises YCB objects, we expect PointSDF to perform well.
Test Data: We evaluate the shell and baseline methods on a set of real household objects. Our test object set, shown in Fig. 5, contains objects in total, mostly taken from the YCB object set . These objects are not seen by any reconstruction method except PointSDF. For experimental evaluations, we collect 10 different views for each of our test objects using a RealSense camera. The object is moved to a different position and orientation on a table in front of the camera for every new view. The ground truth object pose in the camera frame is recorded using AprilTags  for experimental evaluations.
V-A Reconstruction Quality
Table I shows that shell reconstruction significantly outperforms baseline reconstruction methods in terms of Chamfer scores on the test dataset. For methods that output reconstructions in the object frame, we align those reconstructions to the camera frame using Iterative Closest Point (ICP)  and the more recent Coherent Point Drift (CPD)  methods, with multiple initializations, and take the best-fitting candidate as the final result. We observe that object frame reconstructions with CPD alignment achieve the best Chamfer score for the HOF and OCC baselines. Shell reconstruction outperforms the best baseline [HOF(obj)+CPD] by and the best camera frame baseline (PointSDF) by .
In Fig. 6, we observe that the HOF and OCC methods overfit to training object shapes and provide limited generalization. PointSDF often generates thin reconstructions and loses the geometric details in the input. On the other hand, our shell method demonstrates better reconstruction accuracy and generalization.
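For readers unfamiliar with the metric, a brute-force Chamfer distance between two point sets can be sketched as follows. The variant shown (the average of the two directed mean nearest-neighbor distances, using unsquared Euclidean distance) is an assumption; the exact definition behind the scores in Table I is not restated in this section, and some works use squared distances or a sum instead.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two non-empty point sets.

    Brute force, O(len(a) * len(b)); fine for small illustrative inputs.
    """
    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(a, b) + one_way(b, a))
```

The two directions penalize different errors: one measures how far reconstructed points lie from the ground truth surface, the other how much of the ground truth is left uncovered by the reconstruction.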
In Appendix A-D, we discuss ablation studies with different variations of shell and HOF methods. We observe that removing the proposed data augmentation and keeping only the baseline augmentation adversely affects the Sim2Real robustness. The UNet skip connections considerably improve the performance of shell reconstruction.
V-B Geometric Grasp Evaluation
In this section we study the effect of object reconstruction accuracy on grasp planning. We generate a dense set of feasible grasps and their grasp widths using the object reconstructions on the test dataset. The accuracy of a grasp is evaluated by checking whether the gripper geometry at the grasp pose envelops the ground truth object geometry without colliding with it. We define the geometric grasp precision success rate as the percentage of feasible grasps that pass the collision check with a gripper opening equal to the estimated grasp width + mm tolerance.
As shown in Table II, the grasp precision success rate on shell reconstruction is high across all the objects, often more than %, showing better reconstruction accuracy and generalization. We observe that the performance for HOF and OCC reconstructions suffers for novel objects that are not similar to the training objects. The PointSDF reconstructions are often thinner than the actual object (as seen in Fig. 6), resulting in a smaller estimated grasp width and a large number of grasp failures. The last column shows the success rate when using only the visible pointcloud without reconstruction, but with full gripper opening ( mm). Kleenex box and Paper roll are bigger than the full gripper opening. Thin reconstructions from PointSDF result in false positive grasps on a couple of views, but all other methods correctly classify these objects as infeasible to grasp.
V-C Evaluation in MuJoCo and on a Real Setup
Grasp testing in MuJoCo: The geometric center from the object shell reconstruction allows us to estimate grasp quality. To demonstrate the significance of the estimated grasp quality, we evaluate the stability of grasps under external disturbance in the MuJoCo simulator. Appendix A-G explains the details of the simulation setup and summarizes hundreds of grasp simulations across five different grasp quality ranges for every object. The grasp quality estimated using the shell object reconstruction is reflected well in the grasp success rate. More grasps in the high quality range resist the external disturbance force and succeed than grasps in the low quality range.
Grasp testing on a real robot: We further validate the findings from the MuJoCo experiments with real robot experiments by testing the high quality grasps. Given a masked depth image of a singulated object on a table, we generate the object shell representation, sample five grasps from the high quality range, and execute them using the robot. A grasp attempt is counted as successful if the object is lifted up and held in the grasp for seconds.
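The success-rate definition can be illustrated with a deliberately simplified one-dimensional stand-in for the collision check: a grasp passes if the ground truth extent of the object along the finger axis fits inside the estimated width plus the tolerance. The real check uses the full gripper and object geometry; the function name, the units, and the reduction to widths are assumptions made for this sketch.

```python
from typing import Sequence


def grasp_precision_success_rate(estimated_widths: Sequence[float],
                                 true_widths: Sequence[float],
                                 tolerance: float) -> float:
    """Percentage of feasible grasps whose ground truth extent fits in an
    opening of (estimated width + tolerance). All values share one unit."""
    if not estimated_widths or len(estimated_widths) != len(true_widths):
        raise ValueError("need one ground truth width per estimated width")
    passed = sum(1 for est, true in zip(estimated_widths, true_widths)
                 if true <= est + tolerance)
    return 100.0 * passed / len(estimated_widths)
```

Under this simplification, a reconstruction that is thinner than the real object underestimates the width and fails the check, which is the failure pattern expected for thin reconstructions.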
The high quality grasps succeed more than % of the time with average success rate of (Table in Appendix A-H). On similar experimental setup and objects, Merwe et al. report average success rate of grasps planned using PointSDF reconstructions.
We observe a couple of failure modes. First, if the predicted grasp width is smaller than the ground truth, the approaching gripper pushes the object and cannot grasp it. Second, for a few test views, the noise in the input depth images leads to incorrect surface normals in some parts of the object, which adversely affects the grasp samples and proposals.
V-D Shell reconstruction and Grasping in Clutter
To demonstrate the application of object shell reconstruction for grasp planning, we consider the task of clutter removal. Given an RGBD image of a scene, we use a fine-tuned Mask R-CNN  method to segment the objects in the scene. We pass the masked depth images of the objects for shell reconstruction and populate the scene with the reconstructed meshes. A dense set of grasps with their grasp quality is generated using the shell reconstruction of the target object as explained in Section IV, and the highest quality grasp that is reachable without collision with the other objects is executed. We use the C2G-HOF motion planner  to generate collision-free robot motions.
Fig. 7 shows a scenario with some clutter. Since the fronts of all the objects are mostly visible from the camera, the object shell accurately generates the complete object geometry. Note that the shell reconstruction augments the visible side of the object with the corresponding back side and stitches them together to generate the object's mesh. We observed that the shell reconstruction method does not fill in the occluded parts of the object. For example, in the dense clutter scene shown in Fig. 8-(a), we can see that the Weiman bottle and Kleenex box are mostly hidden behind the Mustard and Clorox bottles respectively. The shell reconstructions for these objects are based only on the small parts of these objects visible in the clutter. (Footnote 1: As noted by Shin et al., we observe that camera frame reconstruction methods such as ours provide superior generalization over object types since they do not learn strong category-level shape priors. Therefore, unless trained on a specific category of objects, object shell reconstruction (and most other methods) cannot generate a complete object reconstruction if the input image contains only a small part of the object due to occlusion or bad views.) Appendix A-J shows object reconstructions from the Shell and PointSDF methods for adversarial objects and views.
For some grasping applications, these partial reconstructions might be sufficient. Practically, as the objects in the foreground are picked up by the robot, the objects in the background become more visible and more accurate shell reconstructions of these objects can be produced. Fig. 9 shows a compilation of the object reconstructions used by the robot for grasp planning as the robot cleared the scene in Fig. 8. As we can see in Fig. 9-(a), most of the object surfaces become visible as the clutter is slowly cleared. This allows the robot to generate accurate shell reconstructions as shown in Fig. 9-(b).
For clutter removal experiments, we generated dense clutter scenes with six to seven objects in each. The robot was able to successfully remove all objects (except the non-graspable big objects) from the clutter and place them in a bin in all but one scene. In the failed scene, the Jello box was toppled during the grasp attempt and was no longer graspable. The robot was able to grasp the objects in the first attempt, except for only grasp attempts where the object was successfully grasped in the second attempt. Average grasp success was (64 out of 68) when considering all the grasp attempts and (61 out of 65) when considering only the first grasp attempt. The supplementary video shows the clutter removal experiments and visualizations of object shell reconstructions in clutter.
In this paper we presented the “object shell” as an effective geometric representation along with a method for generating the shell of an object from a masked depth image. The key merits of our shell reconstruction method are: 1) it eliminates the need for explicit pose estimation since the reconstruction is performed directly in the camera frame, and 2) despite being trained on a relatively small amount of synthetic data, the method generalizes well to novel objects and is robust to noise encountered in real depth images. We showed that both of these advantages directly benefit the grasp planning process and lead to a high grasp success rate across the novel test objects.
We believe the shell representation provides new opportunities to exploit image-to-image network architectures for 3D shape prediction and to support 6-DOF grasp planning. In future work, we will investigate whether we can learn to predict the grasp quality map directly from the input masked depth image by supervising over the grasp quality as well as the underlying shell reconstruction.
Learning representations and generative models for 3d point clouds.
International Conference on Machine Learning (ICML). Cited by: §II.
-  (1987-10) Least-squares fitting of two 3-d point sets. ieee t pattern anal. Pattern Analysis and Machine Intelligence, IEEE Transactions on PAMI-9, pp. 698 – 700. Cited by: §V-A.
Generative and discriminative voxel modeling with convolutional neural networks. Note:
NeurIPS Workshop contribution: 3D Deep LearningCited by: §I, §II.
-  (2015) The ycb object and model set: towards common benchmarks for manipulation research. In 2015 International Conference on Advanced Robotics (ICAR), Vol. , pp. 510–517. Cited by: §I, §V.
-  (2015) ShapeNet: An Information-Rich 3D Model Repository. Technical report Stanford University — Princeton University — Toyota Technological Institute at Chicago. Cited by: §I, §II.
-  (2016) 3D-r2n2: a unified approach for single and multi-view 3d object reconstruction. In ECCV, Cited by: §A-B.
-  (2017) A point set generation network for 3d object reconstruction from a single image. In CVPR, Vol. , pp. 2463–2471. Cited by: §II.
-  (1992) Planning optimal grasps. In Proceedings 1992 IEEE International Conference on Robotics and Automation, pp. 2290–2295 vol.3. Cited by: §A-E, §IV.
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §II.
-  (2017) Hierarchical surface prediction for 3d object reconstruction. In 2017 International Conference on 3D Vision (3DV), Vol. , pp. 412–420. Cited by: §I, §II.
-  (2017) Mask r-cnn. Cited by: §V-D.
-  (2021) Cost-to-go function generating networks for high dimensional motion planning. Cited by: §V-D.
-  (2006) Poisson surface reconstruction. In Proceedings of the Fourth Eurographics Symposium on Geometry Processing, SGP ’06, pp. 61–70. Cited by: §A-C.
-  (1980) Two algorithms for constructing a delaunay triangulation. International Journal of Computer and Information Sciences. Cited by: §A-A.
Deep marching cubes: learning explicit surface representations.
Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II.
-  (2017-07) Dex-net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In Robotics: Science and Systems(RSS), pp. . Cited by: §II, §III.
-  Learning continuous 3d reconstructions for geometrically aware grasping. IEEE ICRA, 2020, pp. 1516–11522. Cited by: §A-F, §I, §II, §II, §V-C, §V.
-  (2019) Occupancy networks: learning 3d reconstruction in function space. In CVPR, Cited by: §I, §II, §II, §V.
-  (2004) Graspit! a versatile simulator for robotic grasping. IEEE Robotics Automation Magazine 11 (4), pp. 110–122. Cited by: §I, §II.
-  (2020) Higher-order function networks for learning composable 3d object representations. In International Conference on Learning Representations (ICLR), Cited by: §I, §II, §II, §V.
-  (2019) 6-dof graspnet: variational grasp generation for object manipulation. In International Conference on Computer Vision (ICCV), Cited by: §A-F, §I, §II.
-  (2010) Point set registration: coherent point drift. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (12), pp. 2262–2275. Cited by: §V-A.
-  (2019) X-section: cross-section prediction for enhanced rgbd fusion. In ICCV, Cited by: §II.
-  (2011) AprilTag: a robust and flexible visual fiducial system. In IEEE ICRA, Cited by: §V.
-  (2019-06) DeepSDF: learning continuous signed distance functions for shape representation. In CVPR, Cited by: §I, §II, §II, §II.
Supersizing self-supervision: learning to grasp from 50k tries and 700 robot hours. In IEEE International Conference on Robotics and Automation (ICRA), pp. 3406–3413. Cited by: §II.
Generating 3D faces using convolutional mesh autoencoders. In ECCV, pp. 725–741. Cited by: §II.
-  (2016) Unsupervised learning of 3d structure from images. In NeurIPS, pp. 5003–5011. Cited by: §I, §II.
-  (2014-07) Grasp quality measures: review and performance. Autonomous Robots 38, pp. 65–88. Cited by: §A-E, §I, §IV.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), Cham, pp. 234–241. Cited by: §III.
-  (2018-06) Pixels, voxels, and views: a study of shape representations for single view 3d object shape prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: footnote 1.
PyVista: 3d plotting and mesh analysis through a streamlined interface for the visualization toolkit (VTK).
Journal of Open Source Software4 (37), pp. 1450. Cited by: §A-B.
-  (2021) Contact-graspnet: efficient 6-dof grasp generation in cluttered scenes. Cited by: §I, Fig. 3, §II, §II.
-  (2017) Grasp pose detection in point clouds. The International Journal of Robotics Research 36 (13-14), pp. 1455–1473. Cited by: §A-F, §II, §II.
-  (2012) MuJoCo: a physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §A-G.
-  (2017) Shape completion enabled robotic grasping. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Cited by: §I, §II.
-  (2018) Pixel2Mesh: generating 3d mesh models from single rgb images. In ECCV, Cited by: §I, §II.
-  (2016) Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In NeurIPS, Vol. 29, pp. 82–90. Cited by: §I, §II.
-  (2015-06) 3D shapenets: a deep representation for volumetric shapes. In CVPR, Cited by: §II.
-  (2018-05) Learning 6-dof grasping interaction via deep geometry-aware 3d representations. ICRA, pp. 1–9. Cited by: §I, §II.
-  Front2Back: single view 3d shape reconstruction via front to back prediction. Cited by: §II.
Appendix A Appendix
A-A Object Mesh Generation from Shell Representation
The pixels in the entry and exit depth images of the object shell can be back-projected into 3D space to produce a point cloud representing the object geometry. The object geometry captured by the object shell is a subset of the complete object geometry. In particular, it does not include the parts of the object that no camera ray intersects. Using this point cloud directly is sufficient for some applications; others, however, require an object mesh.
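The back-projection step can be sketched as follows. This is a minimal illustration assuming a pinhole camera; the intrinsics `fx`, `fy`, `cx`, `cy`, the convention that a depth of 0 marks a pixel outside the object, and the function names are assumptions made for the example and are not specified in the text.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Back-project a depth image into an (N, 3) array of camera-frame points.

    Assumes a pinhole camera and that depth == 0 marks pixels off the object.
    """
    v, u = np.nonzero(depth > 0)      # pixel rows (v) and columns (u) on the object
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def shell_to_points(entry, exit_, fx, fy, cx, cy):
    """Point cloud made of the entry layer followed by the exit layer."""
    return np.concatenate([backproject(entry, fx, fy, cx, cy),
                           backproject(exit_, fx, fy, cx, cy)])
```

Each valid pixel contributes one entry point and one exit point, so the resulting cloud has twice as many points as there are object pixels.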
The shell representation provides an efficient way to stitch together the surfaces encoded by its two layers into an object mesh. Operating on the shell representation is more efficient than operating on point clouds. For example, meshing or triangulating a point cloud involves an iterative search for the best triangle candidates, which takes time superlinear in the number of points $N$. In contrast, triangulation takes $O(N)$ time with the shell representation, because the proximity relationships between points are already encoded as pixel neighborhoods and as the one-to-one correspondence between entry and exit depth pixels. The linear-time meshing algorithm is as follows. First, as shown in Fig. 1 and Fig. 11, the Entry and Exit depth images of the shell are triangulated independently by adding a triangle between each set of 3 neighboring pixels. Next, the Entry and Exit sides are stitched together by adding triangles between Entry and Exit pixels along the object boundary within the two images. This shell-specific mesh generation algorithm is asymptotically faster than point cloud meshing methods, and it is one of the advantages of the shell representation.
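The two steps above can be sketched in a few lines. This is an illustrative sketch rather than the implementation used in the paper: the function name, the fixed choice of diagonal in each 2x2 pixel cell, and the use of pixel coordinates (column, row, depth) for the vertices are assumptions. Boundary edges are found by hash-based counting, so the expected running time stays linear in the number of pixels.

```python
import numpy as np
from collections import Counter

def shell_to_mesh(entry, exit_, mask):
    """Return (vertices, faces) of a closed mesh built from a two-layer shell."""
    h, w = mask.shape
    n = int(mask.sum())
    idx = np.full((h, w), -1, dtype=int)
    idx[mask] = np.arange(n)                  # vertex id of every valid pixel
    rows, cols = np.nonzero(mask)             # same row-major order as idx
    front = np.stack([cols, rows, entry[mask]], axis=1).astype(float)
    back = np.stack([cols, rows, exit_[mask]], axis=1).astype(float)

    # Step 1: triangulate the pixel grid (the same pattern serves both layers).
    grid = []
    for r in range(h - 1):
        for c in range(w - 1):
            a, b = int(idx[r, c]), int(idx[r, c + 1])
            d, e = int(idx[r + 1, c]), int(idx[r + 1, c + 1])
            if min(a, b, d) >= 0:
                grid.append((a, b, d))
            if min(b, e, d) >= 0:
                grid.append((b, e, d))

    # Step 2: boundary edges are the edges used by exactly one grid triangle.
    directed = [(t[k], t[(k + 1) % 3]) for t in grid for k in range(3)]
    count = Counter(frozenset(edge) for edge in directed)
    boundary = [edge for edge in directed if count[frozenset(edge)] == 1]

    # Step 3: entry faces, reversed exit faces, two side triangles per boundary edge.
    faces = list(grid) + [(i + n, k + n, j + n) for i, j, k in grid]
    for i, j in boundary:
        faces += [(j, i, i + n), (j, i + n, j + n)]
    return np.vstack([front, back]), np.array(faces)
```

Exit vertices reuse the entry indices shifted by the number of valid pixels, which is how the one-to-one pixel correspondence between the two layers is exploited during stitching.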
A-B Training Shape Generation
Training shapes are produced by applying a set of augmentations to one of two base shapes: a cube or a cylinder. The first augmentation step randomly resizes (“squishes”) the base shape along each axis. The resizing is constrained to keep the ratio between the smallest and largest sides of the shape's bounding box between : and :. After that, the object is uniformly re-scaled so that the biggest side of the bounding box is between cm and cm. Next, of the objects are augmented further by either shrinking a part of the object by up to or adding a Gaussian protrusion to the mesh of at most m. The set of shapes generated with this procedure is diverse and balanced in terms of shape types and ratios. At the same time, the shape set is simple to generate and does not topologically cover all of the test set objects. This dataset allows us to study the generalization properties of the reconstruction methods. Fig. 10 shows a few examples of objects in our training dataset.
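The resizing and re-scaling steps can be illustrated with a short sketch. The function name and the parameters `min_ratio` and `max_side_range` are hypothetical placeholders for the ratio and size limits, and drawing the per-axis factors uniformly is an assumption made for the example.

```python
import numpy as np

def sample_bbox_dims(rng, min_ratio, max_side_range):
    """Sample bounding-box side lengths for a randomly squished base shape.

    min_ratio: smallest allowed ratio of shortest to longest side (hypothetical).
    max_side_range: (low, high) limits for the longest side (hypothetical).
    """
    # Per-axis squish factors in [min_ratio, 1) keep shortest/longest >= min_ratio.
    scale = rng.uniform(min_ratio, 1.0, size=3)
    # Uniform re-scaling so that the longest side has the sampled length.
    longest = rng.uniform(*max_side_range)
    return scale / scale.max() * longest
```

The returned side lengths would then be applied to the unit cube or cylinder before the optional shrinking or protrusion step.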
We render 5 depth views for each shape, making the dataset of views in total. We produce 2D views of the shapes using the PyVista rendering engine. We place shapes into the world coordinate frame in a random orientation. The shapes are placed such that the center of the shape bounding box is located within a sphere centered around a point m along the camera axis and with a diameter of m. It is worth emphasising that here the variation of the object position is up to 10 times larger than the object size, in contrast with the approach commonly taken in the computer vision community, where the variation of the object position is just a fraction of the object size. Large variation in position makes it more difficult for deep learning models to directly predict the object position, but also makes the resulting models applicable to robotic applications with large workspaces.
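Placing the bounding-box center inside the sphere can be sketched as below. The camera axis is assumed to be +z, sampling is assumed to be uniform over the volume of the sphere, and `center_depth` and `diameter` are hypothetical parameter names standing in for the two distances mentioned above.

```python
import numpy as np

def sample_position_in_sphere(rng, center_depth, diameter):
    """Sample a point uniformly inside a sphere centred on the camera axis (+z)."""
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)        # uniform direction on the unit sphere
    # The cube root makes the density uniform over the volume, not over the radius.
    radius = 0.5 * diameter * rng.random() ** (1.0 / 3.0)
    return np.array([0.0, 0.0, center_depth]) + radius * direction
```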
A-C Data Augmentations
For the pre-render data augmentation, we first sample a random number of points from the mesh surface. The number of points is drawn from a uniform distribution between and . Then, we add noise to the X, Y and Z locations of each of the sampled points. The noise magnitude for each axis of each point is drawn from an i.i.d. uniform distribution ranging from mm to mm. The sampled points are then converted to a mesh using the Poisson surface reconstruction process. We set the depth of the Poisson reconstruction to .
For the post-render data augmentations, we first randomly sample the maximum threshold angle from the range of to . The angle of each pixel is estimated by comparing pixel depth values to pixel neighbor depth values in the depth image. Next, we randomly drop out from to pixels as “pepper” noise. Next, we do several rounds of border erosion. The first step of border erosion is identifying all of the border pixels. Note that the pepper noise added in the previous step will create additional internal borders. Next, a small percentage of the border pixels are removed from the image. This procedure is repeated for to rounds. We find that border erosion produces the most realistic results when the first several rounds are done at a coarser image resolution. Lastly, we choose from to of the pixels removed to be exempt from removal.
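The point sampling and noise steps of the pre-render augmentation can be sketched as follows. The function name and the parameters `n_range` and `noise_range` are hypothetical, sampling is assumed to be uniform over the surface area, and the Poisson surface reconstruction step is deliberately left out of the example.

```python
import numpy as np

def sample_noisy_surface_points(vertices, faces, rng, n_range, noise_range):
    """Sample points on a triangle mesh and perturb each coordinate independently."""
    tri = vertices[faces]                                   # (F, 3, 3) triangle corners
    e1, e2 = tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]
    areas = 0.5 * np.linalg.norm(np.cross(e1, e2), axis=1)
    n = rng.integers(n_range[0], n_range[1] + 1)            # random number of points
    f = rng.choice(len(faces), size=n, p=areas / areas.sum())
    # Uniform barycentric sampling: fold the unit square onto the triangle.
    u, v = rng.random(n), rng.random(n)
    flip = u + v > 1
    u[flip], v[flip] = 1 - u[flip], 1 - v[flip]
    pts = tri[f, 0] + u[:, None] * e1[f] + v[:, None] * e2[f]
    # Independent uniform noise for every axis of every point.
    noise = rng.uniform(noise_range[0], noise_range[1], size=pts.shape)
    return pts + noise
```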
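The pepper noise and border erosion steps can be sketched on a boolean object mask. This is a simplified illustration: the angle threshold, the coarse-resolution rounds and the final exemption step are omitted, border pixels are assumed to be defined by 4-connectivity, and the function and parameter names (`drop_frac`, `erode_frac`, `rounds`) are hypothetical.

```python
import numpy as np

def border_pixels(mask):
    """Valid pixels with at least one invalid or out-of-image 4-neighbour."""
    p = np.pad(mask, 1, constant_values=False)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return mask & ~interior

def pepper_and_erode(mask, rng, drop_frac, erode_frac, rounds):
    """Drop random pixels, then repeatedly remove a fraction of the border pixels."""
    mask = mask.copy()
    mask &= rng.random(mask.shape) >= drop_frac          # "pepper" noise
    for _ in range(rounds):
        border = border_pixels(mask)                     # includes holes from the pepper noise
        mask &= ~(border & (rng.random(mask.shape) < erode_frac))
    return mask
```

Because the border is recomputed in every round, holes introduced by the pepper noise grow together with the outer silhouette, which matches the remark about internal borders.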
A-D Ablation studies with Variations of Reconstruction Methods
In this section we consider variations of our shell reconstruction method and of the HOF method to gain better insights into their performance on our test dataset.
Table III lists the Chamfer scores on our test dataset for the variations of these methods. We observe that removing the proposed data augmentation and keeping only the baseline augmentation adversely affects the Sim2Real robustness. The UNet skip connections considerably improve the performance of shell reconstruction. The HOF(obj)+CPD baseline is able to outperform shell reconstruction on the synthetic validation set. This shows that although HOF achieves good performance in settings similar to training, it lacks the generalization to novel objects in a real-world environment.

| Method | Modification | Chamfer |
|---|---|---|
| Shell | None | 6.43E-3 |
| Shell | Baseline Augmentations | 7.76E-3 |
| Shell | No Skip Connections | 8.99E-3 |
| Shell | Synthetic Validation Set | 5.99E-3 |
| HOF | None | 8.34E-3 |
| HOF | Baseline Augmentations | 8.75E-3 |
| HOF | Trained on ShapeNet | 1.57E-2 |
| HOF | Synthetic Validation Set | 5.42E-3 |
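For readers unfamiliar with the metric, a common form of the Chamfer distance between two point sets is sketched below. The exact variant behind the reported scores (squared or unsquared distances, sum or mean, any normalization) is not stated in this appendix, so the symmetric mean of squared nearest-neighbor distances used here is an assumption.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Brute-force O(N * M) version using mean squared nearest-neighbour distances.
    """
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)   # (N, M) squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

Lower values indicate that the reconstructed and ground-truth surfaces lie closer together.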
A-E Mechanics of Grasp Quality
A commonly accepted grasp quality metric is a measure of the maximum external wrench the grasp can resist. The grasp feasibility evaluation provides a set of feasible grasps over the visible object surface. However, not all feasible grasps are equally robust under external disturbances.
The stability of a parallel-jaw grasp against an external wrench depends on the frictional resistance offered by the grasp (finger contacts) and the distance between the grasp location and the external wrench application point. From the Coulomb friction law, the frictional resistance at finger contacts is proportional to the finger contact area. In the absence of any task-related external wrenches, only gravitational and inertial forces act at the center of mass of the object. Therefore, the grasp stability (quality) is directly proportional to the area of finger contacts and inversely proportional to the Euclidean distance of the grasp location from the center of mass of the object. We assume the objects are of uniform density, so the center of mass is the same as the center of geometry. Since the feasible grasps from the grasp feasibility map have at least the set threshold number of object points inside the grasp envelope, the finger contact area for these grasps is of the same order of magnitude and can be neglected when computing the grasp quality. Finally, the grasp quality is inversely proportional to the Euclidean distance of the grasp from the object’s center of geometry, which can be inferred from the object reconstruction.
Intuitively, an object grasped farther from the center of geometry is likely to rotate and fall out of the grasp under an external disturbance; such a grasp therefore has low quality.
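The resulting quality measure can be sketched as follows. Two choices here are assumptions made for the example: the center of geometry is approximated by the centroid of the reconstructed points, and the scores are normalized so that the best grasp has quality 1. The small constant `eps` only guards against division by zero.

```python
import numpy as np

def grasp_quality(grasp_centers, object_points, eps=1e-6):
    """Quality of each grasp, inversely proportional to its distance from the centroid.

    grasp_centers: (G, 3) grasp locations; object_points: (N, 3) reconstructed points.
    """
    center = object_points.mean(axis=0)              # proxy for the center of geometry
    dist = np.linalg.norm(grasp_centers - center, axis=1)
    quality = 1.0 / (dist + eps)
    return quality / quality.max()                   # best grasp gets quality 1
```

A grasp twice as far from the centroid as the best grasp receives roughly half its quality.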
A-F Significance of Grasp Precision Success Rate
The geometric grasp precision success rate characterizes how closely the object reconstruction matches the ground truth object model, and how this affects the grasp planning process and grasp success. Moreover, it removes the dependence of grasp success on the relative sizes of the gripper and the test objects used for the experiments, which is overlooked in most of the recent work on grasping [34, 21, 17]. A gripper with a large opening can successfully grasp an object between its fingers even with large errors in the object model or grasp proposal. We believe this could partially explain why simple heuristic-based grasping on only the visible pointcloud leads to high success rates in prior work. In our experiments as well, we confirm that grasping based only on the visible pointcloud with full gripper opening achieves a good success rate, as shown in the last column of Table II. Therefore, particularly when evaluating reconstruction-based grasp planning methods, it is more informative to report the grasp precision success rate computed over the entire reconstruction than a top grasp success percentage with no constraint on the gripper opening used.
A-G MuJoCo Setup and Experimental Procedure
We evaluate the physical significance of the quality of grasps estimated using shell reconstruction by testing the stability of those grasps under external disturbance.
|Object||Top [1.0-0.8)||T-Mid [0.8-0.6)||Mid [0.6-0.4)||Mid-L [0.4-0.2)||Low [0.2-0.0)|
We built a setup in the MuJoCo simulator with a Kinova Gen3 robot arm and a Robotiq 2F-85 gripper to replicate our real robot system. To minimize the effect of the arm motion planning limitations on diverse grasp evaluation, we place the test object in the robot’s dexterous workspace. Similar to the geometric grasp experiments, we evaluate the performance of precise grasping in MuJoCo. For every test grasp sample, the robot approaches the object with a gripper opening only mm more than the estimated grasp width. The gripper is closed with a grip force of about N. The friction between the fingers and all the objects is set to . To evaluate the grasp stability, the grasped object is lifted up and an external force of N is applied in multiple directions at the center of geometry of the object. If the object does not move in the grasp by more than degrees, the grasp attempt is counted as a success. Otherwise, the grasp attempt is counted as a failure.
Table IV shows the percentage of successful grasps across five grasp quality ranges from the grasp quality maps. For 10 test views per object, we perform 250 grasp simulations per object in every grasp quality range and report the average success rate in Table IV. The grasp quality estimated from the shell object reconstruction is reflected well in the grasp success rate: more grasps in the high quality range resist the external disturbance force and succeed than grasps in the low quality range. These results highlight that the shell object representation captures key properties of the object geometry, such as the overall shape and the center, well. These geometric features make it possible to generate sets of grasps and estimate their relative quality accurately.
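The per-range success rates can be computed with a short helper such as the one below. The range names and boundaries follow the column headers of Table IV; the function names are hypothetical, and reading a header such as [0.8-0.6) as "at most 0.8 and strictly above 0.6" is an assumption about the notation. Qualities are assumed to lie in [0, 1].

```python
# (name, exclusive lower bound) for each grasp quality range, highest first.
BINS = [("Top", 0.8), ("T-Mid", 0.6), ("Mid", 0.4), ("Mid-L", 0.2), ("Low", 0.0)]

def quality_bin(q):
    """Name of the quality range that contains q, or None if q is not above 0."""
    for name, lower in BINS:
        if q > lower:
            return name
    return None

def success_rate_per_bin(qualities, successes):
    """Fraction of successful grasps in each quality range (None if a range is empty)."""
    totals = {name: [0, 0] for name, _ in BINS}      # name -> [successes, attempts]
    for q, ok in zip(qualities, successes):
        name = quality_bin(q)
        if name is not None:
            totals[name][0] += int(ok)
            totals[name][1] += 1
    return {name: (s / n if n else None) for name, (s, n) in totals.items()}
```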
A-H Grasping Singulated Objects on Real Robot Setup
Fig. 5 shows our experimental setup with all the tested objects put together. Given a masked depth image of a singulated target object on a table, we generate the object shell representation, sample the top five grasps from the high quality range, and execute them using the robot. A grasp attempt is counted as successful if the object is lifted up and held in the grasp for seconds.
For all graspable test objects in our test dataset, we execute high quality grasps with the object placed in four different views. Table V shows the number of successful grasps from these experiments. We see that the high quality grasps estimated from shell reconstruction succeed more than % of the time (average success rate = ).
A-I Detailed Comparison of our Shell Reconstruction Method with PointSDF
In this section we compare our Shell method and the PointSDF baseline in detail with multiple examples.
Table VI compares the reconstructions from these two methods quantitatively as well as qualitatively. We observe that the shell reconstructions capture the details of the object geometry much more accurately. The geometric details from the input are lost in the PointSDF reconstructions. For example, the PointSDF reconstructions for the Clorox and Weiman bottles appear to follow a generic bottle shape prior that the network has learnt. The Chamfer scores quantify the superior reconstruction accuracy of the shell method.
A-J Adversarial object geometries and views for object reconstruction
The shell representation, with its pair of depth images, imposes a monotonicity constraint along the axis perpendicular to the image plane. We believe that this constraint is not overly restrictive: only 5 (wine glass, mug, bowl, pan, stacking block) out of 77 objects in the YCB dataset violate it from some views. We sacrifice performance on this small set of objects in exchange for the properties of the shell representation that enable generalization and superior performance on the larger class of objects. Reconstructing such complex objects is difficult for most reconstruction methods unless they are trained for those specific classes of objects.
For objects and views where camera rays enter and exit the object more than once, the shell reconstruction may generate only the outer geometry of the object or produce an incomplete reconstruction, as shown for a cup object in Table VII. Note, however, that the shell reconstructions look much better than the PointSDF ones, since they keep the overall geometry of the object in the input intact.
As discussed in Section V-D, the object shell reconstruction cannot generate the complete object geometry if only a small portion of the object surface is observed in the input data or if the input data is ambiguous. This limitation is present for any other reconstruction method as well, unless a strong shape prior is used or the method is overfitted to the training data and tested on a subset of the same data.
In Table VII, we show object reconstructions for adversarial views where objects are seen only from their narrower side. Both the Shell reconstruction and PointSDF methods struggle with these views, since neither is trained to have a strong shape prior. The shell reconstructions are qualitatively better than the PointSDF reconstructions since they maintain the details in the input while completing the hidden back of the visible object using the average shape prior learnt. For the examples of the Cheez-It box and the Ziploc box, the input contains information about only one face of the box-shaped object, with no information from any other face to bound the backside of the object reconstruction. Both Shell and PointSDF generate a box of small thickness based on the prior learnt.
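Whether an object view violates the constraint can be checked on a voxelized model by counting how often each ray enters the object. The sketch below assumes an orthographic occupancy grid whose last axis is aligned with the camera rays; the grid layout and function names are assumptions made for the example.

```python
import numpy as np

def entries_per_ray(occ):
    """Number of times each ray enters the object.

    occ: boolean occupancy grid of shape (H, W, D), with depth along the last axis.
    """
    padded = np.pad(occ.astype(np.int8), ((0, 0), (0, 0), (1, 0)))   # empty voxel in front
    return (np.diff(padded, axis=2) == 1).sum(axis=2)                # count 0 -> 1 transitions

def violates_shell_constraint(occ):
    """True if any ray enters the object more than once."""
    return bool((entries_per_ray(occ) > 1).any())
```

A solid block passes the check, whereas a ray crossing both walls of a cup enters the object twice and is flagged.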