Robot grasping in household environments is challenging because of sensor uncertainty, scene complexity, and actuation imprecision. Recent results suggest that Grasp Pose Detection (GPD) using point cloud local features and manually labeled grasp confidence can generate feasible grasp poses over a wide range of objects. However, domestic environments contain a large number of transparent objects, ranging from kitchenware (e.g., wine cups and containers) to house decoration (e.g., windows and tables). The reflective and transparent materials of these objects produce invalid readings from depth cameras. The problem becomes more significant in real-world settings where transparent objects are piled together, which can lead to unexpected manipulation behavior when the robot tries to interact with them. A correct estimation of transparency is necessary to protect the robot from performing hazardous actions and to extend robot applications to more challenging scenarios.
The problem of grasping in transparent clutter is complicated by the fact that robots cannot perceive and describe transparent surfaces correctly. Several previous methods [14, 15] approached this problem by finding invalid values in depth observations, but they were limited to top-down grasping and assumed that target objects form a distinguishable contour (made up of invalid points) in the depth map. Recently, several approaches have employed light field cameras to observe transparency and have shown promising results. Zhou et al. used a single-shot light field image to form a new plenoptic descriptor named the Depth Likelihood Volume (DLV). Given the corresponding object CAD model, they succeeded in estimating the pose of a single transparent object or of an object behind a translucent surface. Building on that work, we extend the idea to a more general-purpose grasp detection scenario with transparent object clutter.
We make several contributions in this paper. First, we propose the GlassLoc algorithm for detecting six-DoF grasp poses of transparent objects in both separated and slightly overlapping cluttered environments. Next, we propose a generalized model for constructing the Depth Likelihood Volume from multi-view light field observations with multi-ray fusion and reflection suppression. Finally, we integrate our algorithm with a robot manipulation pipeline to perform tabletop pick-and-place tasks over eight scenes and five different transparent objects. Our results show that the grasping success rate over all test objects is 81% in 220 grasp trials.
II Related Work
II-A Grasp Perception in Clutter
It remains challenging for robots to perform perception and manipulation in cluttered environments, given the complexity of the real world. We consider two major categories of methods for grasp perception in clutter. The first category is model-based pose estimation. Once object poses are estimated, grasp configurations computed in the local frame of the object model can be transformed into the robot's environment. Collet et al. utilized color information to estimate the poses of objects in cluttered environments. Their algorithm clusters local color patches and then matches them from the object model to robot observations to generate pose hypotheses. Sui et al. [25, 24] constructed generative models to evaluate pose hypotheses against point clouds using object CAD models. The generative models perform object detection followed by particle filtering for robot grasping in highly cluttered tabletop environments. In a similar spirit, Papazov et al. used a RANSAC-based bottom-up approach with Iterative Closest Point registration to fit 3D geometries to the observed point cloud.
On the other hand, rather than associating a grasp pose with a particular object model, Grasp Pose Detection (GPD) characterizes grasp poses based on local geometry or appearance features taken directly from observations. Several early works [12, 21] represented grasp poses as oriented rectangles in RGB-D observations. Given a number of manually labeled grasp candidates, such a system learns to predict whether a sampled rectangle is graspable. One major restriction of these systems is that the approach directions of the generated grasp candidates must be orthogonal to the RGB-D sensor plane. Fischinger and Vincze lessened this restriction by integrating heightmap-based features. They also designed a heuristic for ranking grasp candidates in a cluttered bin setting. ten Pas and Platt directly detected grasp poses in space by estimating curvatures and extracting handle-like features in local point cloud neighborhoods. Gualtieri et al. proposed more types of local point cloud features for grasp representation and projected those features into 2D image space for classification. Our work on GlassLoc extends these ideas to transparent clutter with a different grasp representation and a new plenoptic descriptor.
II-B Light Field Photography
The light field rendering model proposed by Levoy and Hanrahan laid the foundations for light fields captured from multi-view cameras. Building on this work, [18, 6] produced commercial-grade hand-held light field cameras using a microlens array structure. Because a plenoptic camera captures both the intensity and the direction of light rays, light field photography has enabled significant advances in a range of applications. Wang et al. explicitly modeled the angular consistency of light field image pixels to generate accurate depth maps for objects with occlusion edges. Jeon et al. performed sub-pixel shifting in the image frequency domain to tackle the narrow-baseline problem of microlens cameras for accurate depth estimation. Maeno et al. introduced a light field distortion feature to detect and recognize transparent objects. Johannsen et al. leveraged multi-view light field images to reconstruct multi-layer translucent scenes. Skinner and Johnson-Roberson introduced a light propagation model suited to underwater perception using plenoptic observations.
The use of light field perception in robotics is still relatively new. Oberlin and Tellex proposed a time-lapse light field capturing pipeline for static scenes by mounting an RGB camera on the end-effector of the robot and moving it along a designed trajectory. Tsai et al. introduced an algorithm for distinguishing refracted and Lambertian features in light field images. Zhou et al. used a Lytro camera to take a single shot of the scene and constructed a plenoptic descriptor from it. Given the target object model, their method can estimate the six-DoF pose of a single object in layered translucent scenes. Our GlassLoc pipeline extends the idea proposed by Zhou et al. to more general-purpose manipulation in transparent clutter.
III Problem Formulation and Approach
GlassLoc addresses the problem of grasp pose detection for transparent objects in clutter from plenoptic observations. For a given static scene, we assume there is a latent set of end-effector poses that will produce a successful grasp of an object. A successful grasp is assumed to result in the robot obtaining force closure on an object when it moves its gripper to the pose and closes its fingers. The plenoptic grasp pose detection problem is then phrased as estimating a representative set of valid sample grasp poses from this latent set.
Within the grasp pose detection problem, a major challenge is how to classify whether a grasp pose is a member of this latent set and thus will result in a successful manipulation. For grasp pose classification, we assume as given a robot end-effector pose and a collection of observations from a plenoptic sensor. Each observation is assumed to capture a raw light field image of a static scene from its camera viewpoint. The classification result calculated from these inputs is a likelihood that expresses the probability of the end-effector pose resulting in a successful grasp. As described later, our implementation of GlassLoc performs this classification using a neural network.
As illustrated in Figure 3, grasp pose classification within GlassLoc is expressed as a function that maps transparency occupancy likelihood features to a grasp pose confidence. Transparency occupancy features are computed with respect to the subset of a Depth Likelihood Volume (DLV) that lies within the graspable volume of the pose. The DLV estimates how likely a point is to belong to a transparent surface. To test all sampled grasps, a DLV is computed from the observations over the entire grasping workspace within the visual hull of the observations. We assume the grasping workspace is discretized into a set of 3D points.
IV Plenoptic Grasp Pose Detection Methods
An outline of the GlassLoc algorithm is given in Algorithm 1. GlassLoc begins by computing a Depth Likelihood Volume from multi-view light field observations. By integrating different views, we can further post-process the DLV by suppressing reflections caused by non-Lambertian surfaces. Details of DLV construction are presented in Sections IV-A and IV-B. In Step 2, we uniformly sample grasp candidates in the workspace. For each grasp candidate, we extract the grasp representation (see Section IV-C) and the corresponding transparency likelihood features given the robot gripper parameters. The generated features are then classified by a neural network, which assigns each candidate a grasp success label and a confidence score. The training data generation strategy for learning this mapping is introduced in Section IV-D. Given the classified grasp poses, we use a multi-hypothesis particle-based search to find a set of end-effector poses with high confidence for successful grasp execution (see Section IV-E). The final set of grasp poses is then ready for the robot to execute.
IV-A Multi-view Depth Likelihood Volume
The Depth Likelihood Volume (DLV) is a volume-based plenoptic descriptor that represents the depth of a light field image pixel as a likelihood function rather than a deterministic value. The advantage of this representation is that it preserves the structure of a transparent scene by assigning different likelihoods to surfaces with different transparency. In the work of Zhou et al., the DLV is formulated in a specific camera frame indexed by pixel coordinates and depths, which restricts the formulation to single-view scenarios. In this paper, we generalize the expression so that it takes sample points in 3D space as input and integrates multi-view light field observations.
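To make the multi-view setting concrete, the sketch below projects one 3D sample point into two views with a standard pinhole model and returns the pixel location and depth in each view. It is an illustration only: the function name `project_point`, the intrinsic matrix values, and the two camera poses are assumptions chosen for the example and are not taken from our setup.

```python
import numpy as np

def project_point(K, T_cam_world, p_world):
    """Project a 3D world point into one view with a pinhole model.

    K is a 3x3 intrinsic matrix and T_cam_world a 4x4 world-to-camera
    transform. Returns (u, v, depth), or None if the point is behind the camera.
    """
    p_cam = (T_cam_world @ np.append(p_world, 1.0))[:3]
    depth = p_cam[2]
    if depth <= 0:
        return None
    u, v, _ = (K @ p_cam) / depth
    return float(u), float(v), float(depth)

# Hypothetical intrinsics and two hypothetical viewpoints.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
view_1 = np.eye(4)
view_2 = np.eye(4)
view_2[0, 3] = -0.1  # second camera center sits 0.1 m along +x of the first

point = np.array([0.1, -0.2, 2.0])
pixels = [project_point(K, T, point) for T in (view_1, view_2)]
```

The same sample point lands on different center-view pixels in the two views, which is why each view contributes its own ray to the likelihood of that point.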
The DLV is defined as:
Here, the depth likelihood is evaluated at each sampled point. The sub-aperture images extracted from an observation form a set, and for each viewpoint we consider the light ray that passes through, or is emitted from, the sampled point and is received at a pixel in the center-view image plane of that viewpoint. The likelihood is aggregated over all viewpoints in the observations. The triangulation function finds, in the sub-aperture image with a given index, the light ray that corresponds to the center-view ray at a given depth. This depth can be calculated explicitly from the camera intrinsic matrix, the sampled point, and the viewpoint. The ray difference is calculated from color and color gradient differences. A normalization function then maps this color cost to a likelihood. There are multiple choices for the normalization function. In our implementation, we choose:
To better explain the formulation above, consider the example shown in Figure 3. A cluster of transparent objects is placed on a table with an opaque surface. We have two light field observations, each with its own center-view image plane. Two points are sampled in space, and each emits light rays that are captured by both views. In one view, the rays emitted from both points are received by the same pixel, while the remaining two rays are received by two different pixels. We can then express the depth likelihood of a sampled point as:
The ray difference function calculates the color and color gradient differences between the center view (rectangle with solid line in Figure 3) and a sub-aperture view (rectangle with dotted line in Figure 3). The location of the red pixel is calculated by the triangulation function. For a micro-lens-based light field camera, the pixel shift between center and sub-aperture images is usually at the sub-pixel level. The realization of this function is based on the frequency-domain sub-pixel shifting method proposed by Jeon et al.
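The idea behind frequency-domain sub-pixel shifting can be illustrated with the Fourier shift theorem: multiplying the spectrum of an image by a linear phase ramp translates the image by a fractional number of pixels. The sketch below is a generic NumPy illustration of that theorem, not the implementation of Jeon et al. or of our pipeline; the function name `subpixel_shift` and the test image are assumptions.

```python
import numpy as np

def subpixel_shift(image, dy, dx):
    """Shift a 2D image by (dy, dx) pixels using the Fourier shift theorem.

    Positive shifts move content toward larger row/column indices. The shift
    is circular, so content leaving one border re-enters at the opposite one.
    """
    rows, cols = image.shape
    fy = np.fft.fftfreq(rows)[:, None]   # vertical frequencies, cycles per pixel
    fx = np.fft.fftfreq(cols)[None, :]   # horizontal frequencies
    phase = np.exp(-2j * np.pi * (fy * dy + fx * dx))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * phase))

# A band-limited test pattern: two cosine periods across 16 columns.
x = np.arange(16)
image = np.tile(np.cos(2 * np.pi * 2 * x / 16), (16, 1))

shifted = subpixel_shift(image, 0.0, 0.5)   # half-pixel shift to the right
expected = np.tile(np.cos(2 * np.pi * 2 * (x - 0.5) / 16), (16, 1))
```

For integer shifts the result coincides with a circular roll of the array, so in practice images are usually padded or cropped at the margin to hide the wrap-around.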
IV-B Reflection Suppression
A transparent surface produces non-Lambertian reflectance, which introduces specular highlights into light field observations. These shiny spots tend to produce saturated colors or a virtual surface with larger depth than the actual transparent surface. This phenomenon generates a high-likelihood region in the DLV that indicates a non-existent surface. To deal with this problem, we calculate, over the different viewpoints, the variance of ray differences for DLV points that have saturated color and high likelihood:
where can be expressed as:
where is the number of sub-aperture images extracted from the raw light field image. For a point whose variance is larger than a threshold, we check whether it has the largest likelihood value among all points that lie on the light rays it emits. Specifically, we first find the light rays that are emitted from the point and received by a pixel at a depth for which the variance over different viewpoints is large. We then locate all light rays received by the same pixel with a smaller depth, and check whether the following equation holds:
If Equation 7 holds, the light ray has a high probability of coming from a strong reflection area and is excluded from the calculation of the DLV. Figure 4 (left) shows a sliced feature from the DLV before reflection suppression, in which we can observe incorrectly large values caused by specular highlights. Figure 4 (right) shows the result after processing, in which the previously high-valued area is suppressed.
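The variance screening step can be sketched as follows. This is a simplified illustration under stated assumptions: the ray differences of one DLV point are stored as an array of shape (views, sub-aperture images), the per-view value is taken to be the mean over sub-aperture images, and the threshold values and function names are hypothetical. It covers only the screening of candidates, not the subsequent check along the light ray.

```python
import numpy as np

def view_variance(ray_diff):
    """Variance across views of the mean ray difference per view.

    ray_diff has shape (num_views, num_subaperture_images).
    """
    per_view_mean = np.asarray(ray_diff, dtype=float).mean(axis=1)
    return float(per_view_mean.var())

def is_reflection_candidate(ray_diff, likelihood, saturated,
                            var_threshold=0.05, likelihood_threshold=0.5):
    """Flag a point that is saturated, highly likely, and inconsistent across views."""
    return bool(saturated
                and likelihood > likelihood_threshold
                and view_variance(ray_diff) > var_threshold)

# Hypothetical ray differences for two points seen from two views.
consistent = [[0.10, 0.12], [0.11, 0.09]]   # similar cost in both views
highlight = [[0.02, 0.04], [0.95, 0.97]]    # cost changes strongly with the view

flag_consistent = is_reflection_candidate(consistent, likelihood=0.9, saturated=True)
flag_highlight = is_reflection_candidate(highlight, likelihood=0.9, saturated=True)
```

A surface point that really exists should look similar from every viewpoint, whereas a specular highlight moves with the camera, which is what the variance across views is meant to capture.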
IV-C Grasp Representation and Classification
We represent a graspable area as a 3D cuboid defined by its length, width, and height. The width and height of the cuboid are equal to the width and height of the volume swept when the robot fingers close, while the length is extended to capture more of the feature space. The cuboid is voxelized into a grid, and for each grid cell we interpolate the likelihood value from the nearest eight points in the DLV. Rather than feeding a large number of points into the classifier, we extract 2D features from the volume by projection and slicing.
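One common way to interpolate a value from the eight surrounding grid points is trilinear interpolation, sketched below. The interpolation scheme, the function name `trilinear`, and the grid layout (regular grid with a given origin and spacing) are assumptions made for this illustration.

```python
import numpy as np

def trilinear(volume, origin, spacing, point):
    """Interpolate a regular-grid volume at a continuous 3D point.

    volume[i, j, k] holds the value at origin + spacing * (i, j, k).
    Points outside the grid are clamped to its boundary.
    """
    idx = (np.asarray(point, dtype=float) - origin) / spacing
    i0 = np.clip(np.floor(idx).astype(int), 0, np.array(volume.shape) - 2)
    t = np.clip(idx - i0, 0.0, 1.0)   # fractional position inside the cell
    value = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                weight = ((t[0] if dx else 1 - t[0]) *
                          (t[1] if dy else 1 - t[1]) *
                          (t[2] if dz else 1 - t[2]))
                value += weight * volume[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return float(value)

# A small test volume whose values follow a linear function of position.
volume = np.fromfunction(lambda x, y, z: x + 2 * y + 3 * z, (3, 3, 3))
origin = np.zeros(3)
value = trilinear(volume, origin, 1.0, [0.5, 1.25, 0.75])
```

Trilinear interpolation reproduces linear functions exactly, which makes a linear test volume a convenient sanity check.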
We first define the three axes of the graspable volume. The first axis is the approach direction of the gripper. The second axis is the direction along which the gripper fingers close. The third axis is the cross product of the previous two. We then calculate three types of features and project them onto the three axes: a center slice of the likelihood volume; an average likelihood map over all points; and a sliced difference likelihood map, which is calculated by recursively comparing the current slice of the graspable volume with the previous slice. More specifically, the three types of features can be expressed as follows (taking projection onto one axis as an example):
We resize the images to the same size and concatenate them as separate channels. Since we have three types of features and three projection axes, we have nine channels in total.
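The construction of the nine-channel input can be sketched as follows. This is a minimal illustration rather than our implementation: the difference map is assumed to be the accumulated absolute difference between consecutive slices, the resize uses nearest-neighbor sampling, and the 32 x 32 output size and all function names are hypothetical.

```python
import numpy as np

def resize_nearest(img, size):
    """Nearest-neighbor resize of a 2D array to size = (rows, cols)."""
    rows = np.arange(size[0]) * img.shape[0] // size[0]
    cols = np.arange(size[1]) * img.shape[1] // size[1]
    return img[np.ix_(rows, cols)]

def axis_features(volume, axis):
    """Center slice, average map, and slice-difference map along one axis."""
    v = np.moveaxis(volume, axis, 0)   # slices are now stacked along axis 0
    center = v[v.shape[0] // 2]
    average = v.mean(axis=0)
    # Assumed form: sum of absolute differences between consecutive slices.
    difference = np.abs(np.diff(v, axis=0)).sum(axis=0)
    return center, average, difference

def grasp_feature_image(volume, size=(32, 32)):
    """Stack 3 feature types x 3 axes into an image with nine channels."""
    channels = []
    for axis in range(3):
        for feature in axis_features(volume, axis):
            channels.append(resize_nearest(feature, size))
    return np.stack(channels, axis=-1)

# Hypothetical likelihood volume of a graspable cuboid.
rng = np.random.default_rng(0)
features = grasp_feature_image(rng.random((8, 6, 4)))
```

Channels 0 to 2 hold the center, average, and difference maps for the first axis, and the remaining six channels follow the same order for the other two axes.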
IV-D Training Data Generation
For depth-based grasp pose detection algorithms, the training data generation process relies on grasp pose sampling and labeling on point clouds. Unfortunately, depth sensors cannot provide correct point clouds for transparent objects. Instead, we wrap each object with opaque material and generate training samples by mapping grasp poses from the point cloud to the DLV. The detailed steps are illustrated in Figure 9 (a)-(d).
We have two sources for producing training samples from point clouds. The first is depth-based grasp pose detection algorithms. We feed our depth observations to those algorithms and label the resulting grasp candidates as positive. We also recover the grasp poses that those algorithms filtered out and label them as negative. The second is transforming pre-defined grasp poses from the object's local frame into the observation. By checking for gripper collisions with the environment, we label the collision-free grasp poses as positive and the others as negative.
[Panels: scene (a), scene (b), scene (c), scene (d), scene (e), scene (f), scene (g), scene (h)]
IV-E Grasp Search
After classifying our samples, we search for a graspable region with relatively high classification confidence. Our grasp optimization builds on the particle filtering work proposed by Dellaert et al., which is based on the sequential Bayesian filter:
where the weighted particles represent the sampled six-DoF grasp poses with confidence scores given by the classifier. The initial particle hypotheses are generated uniformly in the 3D workspace with identical weights. For each hypothesis, we extract the grasp features and compute its weight by normalizing the confidence score output by the classifier. Importance sampling with resampling is then performed to concentrate grasp hypotheses in high-weight regions. In our case, there is no actual action between two states; instead, we model the state transition in the action model as zero-mean Gaussian noise over the pose. In other words, after we obtain the resampled grasp poses (particles), we diffuse them by adding Gaussian noise to generate the new set of particles. Our convergence criterion is a fixed number of iterations.
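The search loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: a grasp pose is represented as a six-element vector, the classifier is replaced by a hypothetical scoring function `toy_score`, rotations are treated as plain numbers without wrap-around, and the diffusion noise and workspace bounds are placeholder values rather than our experimental settings.

```python
import numpy as np

def grasp_search(score_fn, lower, upper, num_particles=100, num_iters=100,
                 noise_std=None, rng=None):
    """Particle-based search: weight, resample, then diffuse with Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    if noise_std is None:
        noise_std = 0.02 * (upper - lower)   # assumed diffusion scale
    # Initial hypotheses: uniform over the workspace, identical weights.
    particles = rng.uniform(lower, upper, size=(num_particles, lower.size))
    for _ in range(num_iters):
        # Weights are the normalized confidence scores.
        scores = np.array([score_fn(p) for p in particles]) + 1e-12
        weights = scores / scores.sum()
        # Importance resampling concentrates particles in high-weight regions.
        idx = rng.choice(num_particles, size=num_particles, p=weights)
        # Zero-mean Gaussian diffusion stands in for the action model.
        particles = particles[idx] + rng.normal(0.0, noise_std, size=(num_particles, lower.size))
        particles = np.clip(particles, lower, upper)
    return particles

# Hypothetical confidence function peaked at one "good" grasp pose.
target = np.array([0.7, 0.3, 0.5, 0.2, 0.8, 0.4])

def toy_score(pose):
    return np.exp(-np.sum((pose - target) ** 2) / (2 * 0.2 ** 2))

particles = grasp_search(toy_score, np.zeros(6), np.ones(6),
                         rng=np.random.default_rng(0))
```

With a fixed number of iterations as the stopping rule, the returned particle set plays the role of the final grasp candidates from which one pose is selected for execution.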
V-A Experimental Setup
To evaluate GlassLoc, we ran a series of experiments with a first-generation Lytro camera and a Michigan Progress Fetch robot. The Lytro camera is mounted on the wrist of the robot and triggered over its on-chip Wi-Fi to take images. At the same time, the robot records the camera view pose based on the current transformation from the robot base to the camera. The Lytro camera intrinsic calibration and distortion correction are conducted using the toolbox created by Bok et al. The raw light field image is then decomposed into sub-aperture images with resolution of pixels. The boundary sub-aperture images usually have strong color noise because of the lens edge effect. In our implementation, we only keep sub-aperture images, and for each image we crop 4 pixels at the margin.
We use two objects to construct our training samples: a wine cup and a short cup (Figure 10). We generate approximately 10k positive grasp samples and 15k negative grasp samples from 50 scenes containing one or more object instances. For each grasp sample, we extract the corresponding graspable volume from the DLV with actual size and grid density . We further extract gray-scale image features and resize them into . Features are concatenated into nine channels and used to train a LeNet structure. We keep the default structure and parameter settings of the LeNet implementation in TensorFlow except for the number of nodes in the output layer (2 in our case).
The DLV construction algorithm is implemented in MATLAB with parallel computing. A DLV is sampled in a box with grid density at .
In the grasp search step, we use 100 particles and 100 iterations in our experiments. The covariance for diffusing grasp poses after each filtering iteration is set to and for translation and rotation respectively.
Our implementation takes 2 minutes per view to extract sub-aperture images and 10 minutes to construct the DLV with unoptimized MATLAB code. Light field image decoding and ray correspondence are the current bottlenecks.
We evaluate our GlassLoc manipulation pipeline on eight transparent clutter scenes, as shown in Figure 19. In each scene, the number of objects ranges from two to four, with different pose configurations. For each manipulation run, light field images are taken from two camera poses to construct the DLV. After particle filtering reaches the convergence criterion, we randomly select one grasp pose and send it to the execution module. Our robot motion planning and execution module is built on TRAC-IK and MoveIt!. For each scene, we perform 10 manipulation runs. We terminate a run when all objects have been successfully picked or when the number of manipulation trials exceeds the number of objects.
The manipulation results for each scene are reported in Table I. Object grasp percentage is the number of objects successfully picked divided by the total number of objects that should be picked across all runs of a scene. We also show the pick success rate for each object in Table II.
Table I shows that the object grasp percentage is over 75% in most of the scenes. Our GlassLoc algorithm can generate enough reliable grasp poses from the DLV constructed from light field observations, even in complex scenes where four transparent objects are randomly cluttered. The grasp percentages of the two such scenes are 100% and 85%, respectively.
Notably, our overall grasp success rate is 81% over 220 grasps in transparent cluttered environments. During our experiments, we found that the short cup has the lowest grasp success rate. In most failure cases, it was squeezed and then slipped out of the gripper. The reason is twofold: first, the surface of the short cup is sharply tilted, which prevents the robot from achieving a force closure grasp; second, the parallel jaw gripper is not equipped with force sensors and is likely to squeeze the cup.
In this paper, we have contributed the GlassLoc algorithm for robot manipulation in transparent clutter. We use multi-view light field observations to construct the Depth Likelihood Volume as a plenoptic descriptor that characterizes environments with multiple transparent objects. We show that, with our algorithm, the robot is able to perform accurate grasping in tabletop transparent cluttered environments.
- [1] P. Beeson and B. Ames. TRAC-IK: An open-source library for improved solving of generic inverse kinematics. In IEEE-RAS International Conference on Humanoid Robots, 2015.
- [2] Y. Bok, H.-G. Jeon, and I. S. Kweon. Geometric calibration of micro-lens-based light field cameras using line features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(2):287–300, 2017.
- [3] A. Collet, M. Martinez, and S. S. Srinivasa. The MOPED framework: Object recognition and pose estimation for manipulation. Int. J. Rob. Res., 30(10):1284–1306, Sept. 2011.
- [4] F. Dellaert, D. Fox, W. Burgard, and S. Thrun. Monte Carlo localization for mobile robots. In IEEE International Conference on Robotics and Automation (ICRA), May 1999.
- [5] D. Fischinger and M. Vincze. Empty the basket - a shape based learning approach for grasping piles of unknown objects. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2051–2057. IEEE, 2012.
- [6] T. Georgiev, Z. Yu, A. Lumsdaine, and S. Goma. Lytro camera technology: theory, algorithms, performance analysis. In Multimedia Content and Mobile Devices, volume 8667, page 86671J. International Society for Optics and Photonics, 2013.
- [7] M. Gualtieri, A. ten Pas, K. Saenko, and R. Platt. High precision grasp pose detection in dense clutter. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 598–605. IEEE, 2016.
- [8] H.-G. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y.-W. Tai, and I. So Kweon. Accurate depth map estimation from a lenslet light field camera. In , pages 1547–1555, 2015.
- [9] O. Johannsen, A. Sulc, N. Marniok, and B. Goldluecke. Layered scene reconstruction from multiple light field camera views. In S.-H. Lai, V. Lepetit, K. Nishino, and Y. Sato, editors, Computer Vision – ACCV 2016, pages 3–18, Cham, 2017. Springer International Publishing.
- [10] D. Kappler, J. Bohg, and S. Schaal. Leveraging big data for grasp planning. In IEEE International Conference on Robotics and Automation (ICRA), pages 4304–4311. IEEE, 2015.
- [11] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [12] I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 34(4-5):705–724, 2015.
- [13] M. Levoy and P. Hanrahan. Light field rendering. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pages 31–42. ACM, 1996.
- [14] I. Lysenkov. Recognition and pose estimation of rigid transparent objects with a Kinect sensor. Robotics, 273, 2013.
- [15] I. Lysenkov and V. Rabaud. Pose estimation of rigid transparent objects in transparent clutter. In IEEE International Conference on Robotics and Automation (ICRA), pages 162–169. IEEE, 2013.
- [16] K. Maeno, H. Nagahara, A. Shimada, and R.-i. Taniguchi. Light field distortion feature for transparent object recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2786–2793. IEEE, 2013.
- [17] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv preprint arXiv:1703.09312, 2017.
- [18] R. Ng. Digital light field photography. Stanford University, California.
- [19] J. Oberlin and S. Tellex. Time-lapse light field photography for perceiving transparent and reflective objects. 2017.
- [20] C. Papazov, S. Haddadin, S. Parusel, K. Krieger, and D. Burschka. Rigid 3D geometry matching for grasping of known objects in cluttered scenes. The International Journal of Robotics Research, page 0278364911436019, 2012.
- [21] J. Redmon and A. Angelova. Real-time grasp detection using convolutional neural networks. In IEEE International Conference on Robotics and Automation (ICRA), pages 1316–1322. IEEE, 2015.
- [22] K. A. Skinner and M. Johnson-Roberson. Towards real-time underwater 3D reconstruction with plenoptic cameras. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2014–2021. IEEE, 2016.
- [23] I. A. Sucan and S. Chitta. MoveIt! Online. Available: http://moveit.ros.org, 2013.
- [24] Z. Sui, L. Xiang, O. C. Jenkins, and K. Desingh. Goal-directed robot manipulation through axiomatic scene estimation. The International Journal of Robotics Research, 36(1):86–104, 2017.
- [25] Z. Sui, Z. Zhou, Z. Zeng, and O. C. Jenkins. SUM: Sequential scene understanding and manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3281–3288, Sept 2017.
- [26] A. ten Pas and R. Platt. Using geometry to detect grasp poses in 3D point clouds. In International Symposium on Robotics Research, 2015.
- [27] A. ten Pas and R. Platt. Localizing handle-like grasp affordances in 3D point clouds. In Experimental Robotics, pages 623–638. Springer, 2016.
- [28] D. Tsai, D. G. Dansereau, T. Peynot, and P. Corke. Distinguishing refracted features using light field cameras with application to structure from motion. IEEE Robotics and Automation Letters, 4(2):177–184, 2018.
- [29] T.-C. Wang, A. A. Efros, and R. Ramamoorthi. Occlusion-aware depth estimation using light-field cameras. In IEEE International Conference on Computer Vision (ICCV), pages 3487–3495. IEEE, 2015.
- [30] Z. Zhou, Z. Sui, and O. C. Jenkins. Plenoptic Monte Carlo object localization for robot grasping under layered translucency. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–8. IEEE, 2018.