Transparent and translucent objects are prevalent in daily life and household settings. Compared with opaque and Lambertian objects, they present additional challenges to visual perception systems. The first challenge is that transparent objects do not exhibit consistent RGB color features across scenes. Because the visual appearance of these objects depends on a scene’s background, lighting, and arrangement, their visual features can differ drastically between scenes, thereby confounding feature-based perception systems. The second challenge is that RGB-Depth (RGB-D) cameras produce inaccurate depth measurements on transparent or translucent materials due to the lack of reliable reflections. This challenge is especially significant for state-of-the-art pose estimation approaches that require accurate depth measurements as input. To overcome these challenges, computational perception algorithms have been proposed for a variety of visual tasks, including image segmentation, depth completion, and object pose estimation. In this paper, our aim is to complement recent work in transparent object perception by providing a large-scale, real-world RGB-D transparent object dataset. Furthermore, we use the new dataset to perform a benchmark analysis of state-of-the-art perception algorithms on transparent object depth completion and pose estimation tasks.
Table 1 (excerpt): Trans10K | RGB | 10K | 15K (real) | seg only
There are several existing datasets focusing on transparent object perception with commodity RGB stereo or RGB-D sensors, as summarized in Figure 1. While these datasets target transparent object perception, most are relatively small-scale (no more than 50K real-world frames), include few cluttered scenes (typically with fewer than 3 objects per image), do not offer diverse categories of commonplace household objects, and record limited lighting changes. These limitations motivate the introduction of ClearPose, a large-scale real-world transparent object dataset labeled with ground truth pose, depth, instance-level segmentation masks, surface normals, etc. that (1) is comparable in size to ordinary opaque object pose estimation datasets such as YCB-Video; (2) contains challenging, heavily cluttered scenes including multiple layers of occlusion between transparent objects; (3) contains a variety of commonplace household object categories (bottle, cup, wine cup, container, bowl, plate, spoon, knife, fork, and some chemical lab supplies); (4) covers diverse lighting conditions and multiple adversarial testing scenes. Further details of ClearPose in relation to existing datasets are included in Table 1, with sample images from ClearPose shown in Figure 2.
The labeling of such a large-scale dataset requires both efficiency and accuracy. To achieve these qualities in ClearPose, we take advantage of a recently published pipeline named ProgressLabeller. The ProgressLabeller pipeline combines visual SLAM and an interactive graphical interface with multi-view silhouette matching-based object alignment to enable efficient and accurate labeling of transparent object poses from RGB-D videos. This enables rapid data annotation and avoids the broken depth problem caused by transparent objects. Given the unique scale and relevance of ClearPose, we envision that it will serve as a useful benchmark dataset for transparent object perception tasks. In this paper, we include a benchmark analysis of a set of state-of-the-art visual perception algorithms on ClearPose, targeting the tasks of depth completion and RGB-D pose estimation.
2 Related Work
2.1 Transparent Dataset and Annotation
As summarized in Table 1, there are several existing datasets and associated annotation pipelines that focus on transparent objects. With the exception of works such as Trans10k, TransCut, and TOM-Net, which focus on 2D image segmentation or matting, most recent transparent perception datasets are collected in an object-centric 3D setting using RGB-D, stereo, or light-field sensor modalities.
Similar to the case of opaque objects, datasets created in simulation with photo-realistic ray-traced rendering are more readily created and can produce very realistic examples of transparent object appearance. Examples of such simulated datasets include those rendered with Blender, Unreal Engine, and the Nvidia Omniverse platform. While simulated datasets are appealing for their ease of creation, they often lack realistic feature artifacts (e.g. sensor noise, object wear marks, true lighting, etc.), which can impact downstream perception systems that rely on the synthetic dataset (i.e. the syn-to-real gap).
Among existing real-world datasets, TOD and StereObj1M use RGB stereo cameras together with AprilTags. First, camera pose transforms are solved from AprilTag detections, and then several 2D keypoints on the objects are manually annotated in keyframes that are farthest from each other in the sequence. The corresponding 3D keypoint positions are solved by multi-view triangulation from those labeled 2D keypoints, and finally, the object 6D poses are solved from the 3D keypoints as an Orthogonal Procrustes problem and propagated to all frames. In TOD, the authors also introduced a method to record ground truth depth images: they record the positions of transparent objects in the scene and, in separate captures, place opaque counterparts with the same shape at the same poses. This pipeline was also used for real-world data collection in ClearGrasp. However, it is extremely inefficient to repeatedly swap transparent objects and their counterparts. Moreover, it is unclear how or whether this approach could be applied to data collection in complex scenes with cluttered objects, as is typical in household settings. In Xu et al., transparent objects are placed in several fixed locations relative to AprilTag arrays, with pose distribution diversity achieved by attaching a camera to a robot end-effector. This method still restricts the relative position between objects and is inefficient for complex scenes. In Fang et al., all objects are attached to a large visual IR marker so that an optical tracking algorithm can estimate the objects’ 6D poses. In this way, the collection can support dynamic scenes. On the other hand, all object instances in the collected data are accompanied by visually obscuring external markers that may not exist in natural environments.
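The last step of the keypoint-based annotation pipeline described above, recovering a rigid pose from corresponding 3D keypoints, can be illustrated with the standard SVD-based solution to the Orthogonal Procrustes problem (often called the Kabsch algorithm). The following is a minimal sketch rather than the implementation used by TOD or StereObj1M; the function name `rigid_transform_from_keypoints` and the array layout are assumptions made for illustration.

```python
import numpy as np


def rigid_transform_from_keypoints(model_pts, scene_pts):
    """Least-squares rigid transform (R, t) such that scene ≈ R @ model + t.

    model_pts, scene_pts: (N, 3) arrays of corresponding 3D keypoints
    (hypothetical layout: keypoints in the object frame and the same
    keypoints after triangulation in the scene frame).
    """
    model_pts = np.asarray(model_pts, dtype=float)
    scene_pts = np.asarray(scene_pts, dtype=float)
    mu_m = model_pts.mean(axis=0)
    mu_s = scene_pts.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (model_pts - mu_m).T @ (scene_pts - mu_s)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that det(R) = +1 (no reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_s - R @ mu_m
    return R, t
```

At least three non-collinear keypoints are needed for the rotation to be well determined; with noisy triangulated keypoints the result is the least-squares fit rather than an exact pose.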
Overall, existing datasets other than StereObj1M are still not large-scale and require additional hardware effort, such as deploying robotic arms, calibrating multiple sensors, or attaching fiducial or optical markers. Instead of using markers or complex robotic apparatuses, we take advantage of an existing labeling system, ProgressLabeller, which is based on visual SLAM and efficiently produces accurate camera poses from recorded RGB-D videos. Our labeling pipeline makes two assumptions, both of which can be easily met: the scene is static during video capture, and the scene background has adequate RGB features for visual SLAM processing.
2.2 Transparent Depth Completion and Object Pose Estimation
Zhang et al. presented early work on the problem of depth completion from inaccurate depth using deep neural networks. They introduced an approach that estimates surface normals and boundaries from RGB images and then solves for the completed depth using optimization. ClearGrasp adapted the method to transparent objects and demonstrated robotic grasping experiments on transparent objects from completed depth. Tang et al. integrated the ClearGrasp network structure with adversarial learning to improve depth completion accuracy. Zhu et al. proposed a framework that learns local implicit depth functions, inspired by neural radiance fields, and performs self-refinement to complete the depth of transparent objects. Xu et al. proposed to first complete the point cloud obtained by projecting the original depth using a 3D encoder-decoder U-Net, then re-project the completed point cloud back to depth, and finally complete this depth using another encoder-decoder network given the ground truth mask. Fang et al. also used a U-Net architecture to perform depth completion and demonstrated robotic grasping capabilities with their approach.
KeyPose was proposed for keypoint-based transparent object pose estimation on stereo images. It outperformed DenseFusion, even when the latter was given ground truth depth, and achieved high accuracy on the TOD dataset. Chang et al. proposed a 3D bounding box representation and reported results comparable to KeyPose in multi-view pose estimation. StereObj1M benchmarked KeyPose and another RGB-based object pose estimator, PVNet, on more challenging objects and scenes, where both methods achieved lower accuracy with respect to the ADD-S AUC metric (introduced with PoseCNN) with both monocular and stereo input. Xu et al. proposed a two-stage pose estimation framework that performs image segmentation, surface normal estimation, and plane approximation in the first stage. The second stage then combines the first-stage output with color and depth RoI features as input to an RGB-D pose estimator originally designed for opaque objects, which regresses the 6D object poses. This method also outperformed, by a large margin, DenseFusion and that opaque-object estimator when they were fed with ClearGrasp output depth. In this paper, we evaluate how well a state-of-the-art RGB-D method designed for opaque object pose estimation, FFB6D, performs on transparent objects compared with the method of Xu et al.
3.1 Dataset Objects and Statistics
As shown in Figure 3, there are 63 objects included in the ClearPose dataset. There are 49 household objects, including 14 water cups, 9 wine cups (with a thin stem compared with water cups), 5 bottles (with an opening smaller than the cross-section of the cylindrical body), 6 bowls, 5 containers (with 4 corners, whereas bowls are round), and several other objects such as a pitcher, a mug, and a spoon. Moreover, the dataset contains 14 chemical supply objects, including a syringe, a glass stick, 2 reagent bottles, 3 pans, 2 graduated cylinders, a funnel, a flask, a beaker, and 2 droppers.
All images were collected using a RealSense L515 RGB-D camera at a raw resolution of 1280×720. After object pose annotation, the central part of each image is cropped and resized to 640×480 to reduce storage space and speed up CNN training and inference. For the training set, we divided all 63 objects into 5 subsets and collected 4-5 scenes with different backgrounds for each subset. Each scene was scanned with the hand-held camera moving around the tabletop scene at 3 different heights under 3 different lighting conditions (bright room light, dim room light, and dim room light with sidelight from a photography lighting board). For the testing set, as the appearance of transparent objects depends on their context within a scene, we consider 6 different test cases and collect corresponding scenes as follows: (1) different backgrounds: novel backgrounds that never appeared in the training scenes for each object subset; (2) heavy occlusions: cluttered scenes, each with about 25 objects, that form multiple layers of occlusion when viewed from the table’s side; (3) translucent/transparent covers: scenes with all transparent objects placed inside a translucent box; (4) together with opaque objects: transparent objects placed together with opaque YCB and HOPE objects, which did not appear in the training set; (5) filled with liquid: scenes with transparent objects filled with different colored liquids; (6) non-planar configuration: scenes with objects placed on different surfaces at multiple heights. Sample RGB images from both training and test sets are included in Figure 2.
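The crop-and-resize step mentioned above can be sketched as follows. The exact crop window used for ClearPose is not stated, so this sketch assumes the widest centered window with the target 4:3 aspect ratio (960×720 for a 1280×720 frame) followed by nearest-neighbour resampling, which avoids blending depth values across object boundaries. The function name `center_crop_resize` and its return values are hypothetical.

```python
import numpy as np


def center_crop_resize(image, out_w=640, out_h=480):
    """Crop the centered window with the output aspect ratio, then resize.

    Assumes the input is at least as wide (relative to its height) as the output.
    Returns the resized image, the horizontal crop offset in pixels, and the
    scale factor applied after cropping.
    """
    h, w = image.shape[:2]
    crop_w = int(round(h * out_w / out_h))  # 960 for a 720-row image
    x0 = (w - crop_w) // 2                  # 160 for a 1280-column image
    cropped = image[:, x0:x0 + crop_w]
    # Nearest-neighbour sampling at the centers of the output pixels.
    rows = ((np.arange(out_h) + 0.5) * h / out_h).astype(int)
    cols = ((np.arange(out_w) + 0.5) * crop_w / out_w).astype(int)
    resized = cropped[rows[:, None], cols[None, :]]
    return resized, x0, out_w / crop_w
```

Any camera intrinsics used downstream would need the same treatment: under these assumptions the principal point shifts by the crop offset, and focal lengths and principal point are then multiplied by the returned scale.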
We report several statistics of the ClearPose dataset. In total, there are 362,325 RGB-D frames captured in 56 scenes, with 4,927,806 object instance annotations that include 6 DoF poses, segmentation masks, surface normals, and ground truth depth images. The distribution of object instance annotations and the camera viewpoint coverage are shown in Figure 4, colored by object category, where we see that our dataset has roughly even viewpoint coverage for most objects. Viewpoint coverage is calculated by projecting the collected object orientations onto a unit sphere and computing the percentage of the sphere’s surface that is covered. For symmetrical objects, regions with the same object appearance are considered together. Some objects, such as plates, forks, and large bowls, can only be placed in certain orientations, so they have reduced viewpoint coverage. The 2nd water_cup was broken during the data collection process, so it has fewer instance annotations.
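One way to approximate the viewpoint coverage statistic described above is to bin viewing directions into equal-area cells on the unit sphere and report the fraction of occupied cells. The sketch below is an assumption about how such a measure could be computed, not the exact procedure used for Figure 4; the function name `viewpoint_coverage`, the bin counts, and the omission of symmetry handling are all illustrative choices.

```python
import numpy as np


def viewpoint_coverage(view_dirs, n_az=36, n_z=18):
    """Fraction of equal-area sphere cells hit by at least one view direction.

    view_dirs: (N, 3) array of camera directions expressed in the object frame.
    """
    v = np.asarray(view_dirs, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    az = np.arctan2(v[:, 1], v[:, 0])  # azimuth in [-pi, pi]
    z = np.clip(v[:, 2], -1.0, 1.0)    # height on the unit sphere
    # Uniform bins in azimuth and in z give cells of equal area on the sphere.
    az_idx = np.minimum(((az + np.pi) / (2 * np.pi) * n_az).astype(int), n_az - 1)
    z_idx = np.minimum(((z + 1.0) / 2.0 * n_z).astype(int), n_z - 1)
    occupied = np.zeros((n_az, n_z), dtype=bool)
    occupied[az_idx, z_idx] = True
    return float(occupied.mean())
```

With this binning, views that cover only the upper hemisphere of an object yield a coverage of about 0.5. Symmetric objects would additionally require merging cells that show the same appearance, which this sketch does not attempt.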
3.2 Pose Annotation
We use ProgressLabeller to annotate the 6D poses of transparent objects and to render object-wise segmentation masks, ground truth depths, surface normals, etc. from the labeled poses. As shown in Figure 5, the first step of the ProgressLabeller pipeline is to run ORB-SLAM3 on the collected RGB-D video frames to produce camera pose estimates. During data collection, we noticed that ORB-SLAM3 sometimes could not estimate the camera pose well under extreme transparent object clutter, where background RGB features are heavily distorted. In these cases, the camera view needs to capture some background area or other landmark objects that provide stable features. The next step is to import the reconstructed scene (cameras and point cloud) into the Blender workspace, select several camera views from different orientations, and import the object 3D CAD model into the workspace. The object silhouettes/boundaries can then be directly compared and matched with the original RGB images from multiple views simultaneously as the user drags the object across the scene to tune its position and orientation. Figure 5 shows an example case of a matched transparent bottle. Optionally, the user can first place the object on the fitted 2D plane of the support table, etc., as shown at the top-left of Figure 5. After all objects’ poses are labeled in the 3D workspace, their poses in every camera frame are calculated by composing them with the inverse of the estimated camera poses. The output ground truth depth images (fixed depth) are generated by overlaying the rendered object CAD model depth onto the original depth images. The surface normals are then calculated from these depth images. Labeling one scene with this pipeline takes around 30 minutes, including visual SLAM for camera pose estimation, manual object pose alignment, and output image rendering.
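Two of the steps above lend themselves to a short illustration: expressing a labeled object pose in each camera frame, and overlaying rendered object depth onto the raw depth image. The sketch below assumes 4×4 homogeneous matrices that map camera or object coordinates to world coordinates, and assumes that rendered pixels simply replace raw depth wherever an object was rendered. Both conventions and both function names are assumptions for illustration, not ProgressLabeller's actual API.

```python
import numpy as np


def object_pose_in_camera(T_world_cam, T_world_obj):
    """Express an object pose in a camera frame.

    Both inputs are 4x4 homogeneous transforms into world coordinates, so
    T_cam_obj = inverse(T_world_cam) @ T_world_obj.
    """
    return np.linalg.inv(T_world_cam) @ T_world_obj


def overlay_rendered_depth(raw_depth, rendered_depth):
    """Use rendered object depth where an object was rendered (> 0), raw depth elsewhere."""
    raw_depth = np.asarray(raw_depth, dtype=float)
    rendered_depth = np.asarray(rendered_depth, dtype=float)
    return np.where(rendered_depth > 0, rendered_depth, raw_depth)
```

Because the object pose is labeled once in the world frame, the per-frame pose only requires one matrix inversion and product per camera, which is what makes annotating long static-scene videos inexpensive.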
4 Benchmark Experiments
In this section, we provide benchmark results of recent research on depth-related transparent object perception using deep neural networks, including scene-level depth completion, and both instance-level and category-level RGB-D pose estimation. As mentioned in Section 3, we test the generalization capability of such systems on 6 aspects of appearance novelty with transparent objects: new background, heavy occlusion, translucent cover, opaque distractor objects, filled with liquid, and non-planar placement. Specifically, around 200K images are selected for training, and for each of the 6 test cases, 2K images are randomly sampled from the corresponding scenes to compose the testing set for evaluation.
4.1 Depth Completion
We selected two recent, publicly available depth completion works, ImplicitDepth and TransCG, as baseline methods for the depth completion task on transparent object scenes. (ClearGrasp was shown to be less accurate than both works, and TranspareNet was released around the date of this submission.) ImplicitDepth is a two-stage method that learns local implicit depth functions in the first stage through ray-voxel pairs, similar to neural rendering, and refines the depth in the second stage. TransCG is built on DFNet, which was initially developed for image completion. We trained both networks following the training iterations and hyper-parameters of their original papers. Specifically, ImplicitDepth was trained on a 16G RTX3080 GPU with a learning rate of 0.001 for around 2M frames in each of the two stages. TransCG was trained on an 8G RTX3070 GPU with a fixed learning rate of 0.001 for around 900K frames in total. Both works use Adam as the optimizer. We then evaluated the two works on the 6 test sets mentioned in Section 3 with the metrics defined in Eigen et al. The results are shown in Table 2. TransCG surpassed ImplicitDepth in most tests with fewer training iterations, which suggests that methods using DFNet can outperform designs using voxel-based PointNet for transparent depth completion. Across the different tests, both methods performed worst in Translucent Cover scenes and best in New Background scenes. Other scene variations such as Filled Liquid, Opaque Distractor, and Non Planar do not substantially impact the methods’ accuracy. Figure 6 shows examples of qualitative results from both methods compared with the ground truth.
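For readers unfamiliar with depth completion metrics, the sketch below computes error measures that are commonly reported in this line of work: root mean squared error (RMSE), mean absolute relative difference (REL), mean absolute error (MAE), and the fraction of pixels whose depth ratio falls under a threshold. The exact metric set and thresholds used for Table 2 are not restated here, so the thresholds (1.05, 1.10, 1.25), the optional mask argument, and the function name `depth_metrics` are assumptions.

```python
import numpy as np


def depth_metrics(pred, gt, mask=None, thresholds=(1.05, 1.10, 1.25)):
    """Depth completion errors over pixels with valid ground truth (gt > 0).

    mask: optional boolean array restricting evaluation, e.g. to transparent objects.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > 0
    if mask is not None:
        valid &= np.asarray(mask, dtype=bool)
    p, g = pred[valid], gt[valid]
    err = np.abs(p - g)
    out = {
        "rmse": float(np.sqrt(np.mean((p - g) ** 2))),
        "rel": float(np.mean(err / g)),
        "mae": float(np.mean(err)),
    }
    # A predicted depth of zero gives an infinite ratio and fails every threshold.
    with np.errstate(divide="ignore"):
        ratio = np.maximum(p / g, g / p)
    for t in thresholds:
        out[f"delta_{t:.2f}"] = float(np.mean(ratio < t))
    return out
```

Lower is better for RMSE, REL, and MAE, while higher is better for the threshold accuracies.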
4.2 Instance-Level Object Pose Estimation
One recent work, by Xu et al., focuses on transparent object pose estimation using raw RGB and depth images. Its source code is not publicly available, so we re-implemented the proposed method following the original paper for inclusion in our benchmark analysis. The method is implemented as a two-stage pipeline. For the first stage, we trained Mask R-CNN for instance-level segmentation and DeepLabv3 for surface normal estimation, and computed an XYZ 3D coordinate map of the supporting plane as an additional feature. The second stage replicates most of the architecture and loss functions of the opaque-object RGB-D pose estimator on which the method builds, and ultimately regresses dense pixel-wise 3D translation, 3D rotation delta from a set of fixed rotation anchors, and confidence scores. In practice, we trained the networks on an RTX2080-SUPER GPU. Mask R-CNN was trained for 5 epochs with the SGD optimizer, a batch size of 5, and a learning rate of 0.005. DeepLabv3 was trained for 2 epochs with the Adam optimizer, a batch size of 4, and a learning rate of 0.0001, and the second-stage network was trained for around 180K iterations with the Adam optimizer, a batch size of 4, and a learning rate of 0.0005. We compare this method with FFB6D, a state-of-the-art RGB-D pose estimator that was originally designed for opaque objects. The FFB6D estimator follows a two-stream RGB and point cloud encoder-decoder architecture with fusion between blocks. FFB6D was trained on a 16G RTX3080 GPU for 5 epochs with a batch size of 6. All hyper-parameters follow the default values of the original implementation. For our analysis of FFB6D pose estimation performance, we ran experiments with different depth options in training and testing: raw depth, ground truth depth, and completed depth from TransCG, as detailed in Table 3.
Table 3 (column headers): Testset | Metric | Xu et al.
For evaluation, we use two metrics based on the average point distance (ADD and ADD-S), following PoseCNN. ADD is calculated as the average Euclidean distance between corresponding point pairs of the object point cloud placed at the ground truth pose and at the predicted pose. ADD-S is calculated as the average, over points in the predicted point cloud, of the distance to the closest point in the ground truth point cloud. In Table 3, ‘Accuracy’ is the percentage of all pose estimates on the test set with an ADD error of less than 10cm. ‘ADD(-S)’ is the area under the accuracy-threshold curve integrated from 0 to 10cm error, scaled to the range 0 to 100 as a percentage.
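The metrics defined above can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the evaluation code behind Table 3: the function names are hypothetical, the nearest-neighbour search in ADD-S is brute force, and the area under the curve is approximated with a trapezoidal rule over uniformly spaced thresholds.

```python
import numpy as np


def transform(points, R, t):
    """Apply a rigid transform to an (N, 3) point array."""
    return np.asarray(points, dtype=float) @ np.asarray(R).T + np.asarray(t)


def add_error(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding points under the two poses."""
    diff = transform(points, R_gt, t_gt) - transform(points, R_pred, t_pred)
    return float(np.linalg.norm(diff, axis=1).mean())


def add_s_error(points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean distance from each predicted point to its closest ground truth point."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    # Brute-force pairwise distances; adequate for a few thousand model points.
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())


def accuracy(errors, threshold=0.10):
    """Percentage of pose estimates with error below the threshold (meters)."""
    return 100.0 * float(np.mean(np.asarray(errors, dtype=float) < threshold))


def auc(errors, max_threshold=0.10, steps=1000):
    """Area under the accuracy-threshold curve on [0, max_threshold], scaled to 0-100."""
    errors = np.asarray(errors, dtype=float)
    thresholds = np.linspace(0.0, max_threshold, steps + 1)
    acc = (errors[None, :] <= thresholds[:, None]).mean(axis=1)
    area = np.sum((acc[1:] + acc[:-1]) / 2.0) * (max_threshold / steps)
    return 100.0 * float(area) / max_threshold
```

The difference between the two errors is easiest to see on a symmetric object: a square rotated by 90 degrees about its axis has a large ADD, because corresponding corners move, but an ADD-S of zero, because the rotated point set coincides with the original one.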
As shown in Table 3, comparing the training and testing combinations of FFB6D, the upper-bound performance is obtained when the network is both trained and tested on ground truth depth. When both training and testing use raw depth, the metrics drop substantially, which indicates that inaccurate depth is a major difficulty for transparent object pose estimation. Notably, training on ground truth depth and testing on completed depth (from TransCG) yields almost the same accuracy as the raw-depth setting. Although TransCG performs well at depth completion, the remaining discrepancy between ground truth and completed depth appears to negate the benefit for the pose network. In general, the easiest test case is New Background, and the accuracy drops considerably in the other 5 scenarios. Comparing Xu et al. with the FFB6D variants, we find that they are comparable in New Background, Heavy Occlusion, Translucent Cover, and Non Planar scenes, while Xu et al. is much better in Opaque Distractor and Filled Liquid scenes. One possible reason is that unseen colors mixed into the transparent objects add considerable noise to the object keypoint regression used during FFB6D inference, a step that Xu et al. does not rely on. Overall, the pose estimation accuracy of current methods is still much worse than that achieved on opaque objects with RGB features (with ADD-S around 90 on public datasets). Some qualitative examples of pose estimates are shown in Figure 7.
There are some common classes of objects with transparency/translucency not included in our dataset, for example, those with colored transparent/translucent materials, with markers or labels, together with opaque parts, etc. Instead, our focus in the ClearPose dataset is to investigate pure transparency that exhibits relatively few features for perception. On the other hand, we anticipate the open-source ProgressLabeller will facilitate more large-scale customized transparent datasets in the future.
As for benchmarking perception models, we did not include a complete list of recent state-of-the-art approaches due to resource constraints (i.e. compute and time limitations). Based on our current dataset and benchmark results, there are several possible extensions: (1) Comparison of RGB-D methods with RGB-only pose estimators, which are not affected by the broken depth of transparent objects. (2) Category-level pose estimation for transparent objects, for which the ClearPose dataset provides categories such as bowls, bottles, and wine_cups whose members share similar 3D shape and topology. (3) Neural rendering of transparent objects that considers environment context, such as varied lighting and occlusions. (4) Transparent object grasping and manipulation experiments in practical scenes, including the 6 test scenarios used in the benchmark.
In addition, an especially interesting problem emerging from our heavily cluttered and translucent-covered test scenes is the multi-layer appearance of transparent objects. As shown in Figure 8, because of transparency/translucency, some image pixels can belong to more than one object. New detection and segmentation annotation rules, such as the bounding box non-maximum suppression threshold or the segmentation mask format over the image, could be proposed and explored based on our dataset as future work.
In this paper, we presented ClearPose, a new large-scale RGB-D transparent object dataset with annotated poses, masks, and associated labels created using a recently proposed pipeline. We performed a set of benchmarking experiments on depth completion and object pose estimation tasks using state-of-the-art methods over 6 different generalization test cases that are common in practical scenarios. Results from our experiments demonstrate that there is still much room for improvement in some cases, such as heavy clutter, transparent objects filled with liquid, or objects covered by other translucent surfaces. The dataset and benchmark code implementations will be made public with the intention of supporting further research progress in transparent RGB-D visual perception.
- (2021) Orb-slam3: an accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics 37 (6), pp. 1874–1890.
- (2021) GhostPose*: multi-view pose estimation of transparent objects for robot hand grasping. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5749–5755.
- (2018) Tom-net: learning transparent object matting from a single image. In , pp. 9233–9241.
- (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
- (2022) ProgressLabeller: visual data stream annotation for training object-centric 3d perception. arXiv preprint arXiv:2203.00283.
- (2014) Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems 27.
- (2022) TransCG: a large-scale real-world dataset for transparent object depth completion and grasping. arXiv preprint arXiv:2202.08471.
- (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969.
- (2021) Ffb6d: a full flow bidirectional fusion network for 6d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3003–3013.
- (2020) BOP challenge 2020 on 6d object localization. In European Conference on Computer Vision, pp. 577–594.
- (2019) Deep fusion network for image completion. In Proceedings of the 27th ACM international conference on multimedia, pp. 2033–2042.
- (2021) Stereobj-1m: large-scale stereo image dataset for 6d object pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10870–10879.
- (2020) Keypose: multi-view 3d labeling and keypoint estimation for transparent objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11602–11610.
- (2019) Pvnet: pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4561–4570.
- (2020) Clear grasp: 3d shape estimation of transparent objects for manipulation. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 3634–3642.
- (2021) DepthGrasp: depth completion of transparent objects using self-attentive adversarial network with spectral residual for grasping. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5710–5716.
- (2020) Robust 6d object pose estimation by learning rgb-d features. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 6218–6224.
- (2019) Densefusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3343–3352.
- Posecnn: a convolutional neural network for 6d object pose estimation in cluttered scenes. Robotics Science and Systems.
- (2020) Segmenting transparent objects in the wild. In European conference on computer vision, pp. 696–711.
- (2020) 6dof pose estimation of transparent object from a single rgb-d image. Sensors 20 (23), pp. 6790.
- (2021) Seeing glass: joint point cloud and depth completion for transparent objects. arXiv preprint arXiv:2110.00087.
- (2015) Transcut: transparent object segmentation from a light-field image. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3442–3450.
- (2018) Deep depth completion of a single rgb-d image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 175–185.
- (2020) Lit: light-field inference of transparency for refractive object localization. IEEE Robotics and Automation Letters 5 (3), pp. 4548–4555.
- (2021) RGB-d local implicit function for depth completion of transparent objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4649–4658.