BIDCD - Bosch Industrial Depth Completion Dataset

by   Adam Botach, et al.

We introduce BIDCD - the Bosch Industrial Depth Completion Dataset. BIDCD is a new RGBD dataset of metallic industrial objects, collected with a depth camera mounted on a robotic manipulator. The main purpose of this dataset is to facilitate the training of domain-specific depth completion models, to be used in logistics and manufacturing tasks. We trained a State-of-the-Art depth completion model on this dataset, and report the results, setting an initial benchmark.


1 Introduction

Figure 1: Our contributions. Top-left: Data Collection. (a) Our data collection setup consists of a Panda arm (Franka-Emika) and two RealSense D415 (Intel) depth cameras, mounted at different angles and distances from the end-effector. Inputs: (b) RGB, (c) Raw depth - the peripheral holes are out-of-range regions. Multiple Points-of-View (approximately 60) of each static scene were fused into a 3D mesh. Projecting the mesh back to the camera viewpoints yields the Ground-Truth (GT) estimation. The GT was used to train the NLSPN depth-completion model. (d) Depth completion prediction - the internal holes have been closed, and the table top has been extrapolated outwards.

Robotic manipulation is an important component of the Industry 4.0 revolution, both for logistics tasks such as bin-picking and assembly tasks such as peg-in-hole. Operation in unstructured environments requires perception, most commonly implemented with a camera for data acquisition feeding a computer vision model. Perception models need to provide 3D information for task and motion planning, which is typically done in six or more Degrees-of-Freedom (DOF). Digital RGB cameras have been used to estimate 3D geometry using Structure-from-Motion (SfM) [17], single-frame monocular depth estimation [9], and 3D pose estimation of known objects [24].

In contrast to RGB cameras, distance sensors can provide a direct measurement of 3D geometry. In outdoor applications, most notably self-driving cars, several modalities have been utilized for depth sensing, including LiDAR, radar, and ultrasound (US). For indoor applications, a highly prevalent modality for 3D acquisition is depth cameras, which provide RGBD images [8]. These dense depth maps have been successfully employed for grasp prediction of unknown objects [14, 26]. Depth cameras can be implemented in several different ways, including stereo vision, structured light, and Time-of-Flight (ToF), but all these methods suffer from large errors and loss of information ("holes") under optical reflections and deflection. This pitfall is prominent in industrial settings, which routinely include reflective metallic objects.

The shortcomings of depth sensors have been previously addressed with depth completion [11, 13, 18, 25, 20] on single frames. Robotic applications often use mounted cameras, which present the opportunity to utilize multiple views and apply volumetric fusion methods, such as the Truncated Signed Distance Function (TSDF) [5, 21]. A prerequisite for TSDF is the registration between the different points of view. However, industrial scenes tend to be confounded by a multitude of artifacts, which may in turn cause the registration to fail. For this reason and additional operational considerations, single-frame solutions seem to be a more viable alternative.

Depth completion can be thought of as a type of guided interpolation. The goal is to generate a dense depth map using a partial sampling (raw depth), guided by visual RGB cues such as perspective, occlusions, object boundaries, and surface normals. Depth completion datasets include RGB and raw-depth inputs and an estimated dense Ground-Truth (GT) depth map. In driving applications, the depth is typically obtained with LiDAR [15]. In contrast, depth cameras are predominant when imaging indoor scenes [16, 6], household objects [2], and other applications [8]. These two acquisition systems pose distinct challenges for depth completion, suggesting that they warrant different solutions. LiDAR provides sparse depth maps where the values of the samples are typically considered to be sufficiently accurate. Thus, generating a dense depth map is largely a question of partitioning the influence regions around the depth samples using RGB cues such as boundaries. Depth cameras, on the other hand, may yield large connected regions with missing or invalid values, for example at window panes [16, 6]. Moreover, specular reflections and transparent recyclable objects [20] may induce large regions of erroneous values that need to be identified and corrected.

1.1 Industrial Object RGBD Dataset

Here, we present an industrial-objects dataset [22] collected with a depth camera mounted on a robotic manipulator. To generate the ground truth, we recorded each scene from approximately 60 Points-of-View (POV) and fused them to produce an estimated GT depth, as described in the methods section. Using multiple POVs improves the coverage of the scene, mainly by resolving occlusions and reflections. We configured the cameras to a depth resolution of 0.1 mm and a dynamic range of 0.2-1.2 m, following characteristic range and accuracy requirements for robotic manipulation. Consequently, certain peripheral regions of the GT image are left empty (zeros) if they are out-of-range for all POVs. The GT might also have internal holes stemming from localized surface reconstruction failures, typically due to poor observability. Notably, most RGBD datasets were designed to avoid depth acquisition issues (for example, see [2]). In support of industrial applications, we created a dataset rich in highly reflective metallic objects. Our raw depth maps have holes that take on average of the image area, within the pertinent workspace. For details, please refer to Sec. 2.2.2.

We trained a depth completion model on our dataset, to provide a baseline for this task. Most deep models for depth completion are based on auto-encoders [10], which are modified to consume RGBD and output a corrected depth map [11]. In [13], the authors added UNet connections [19] and achieved State-Of-The-Art (SOTA) performance at that time. In [7], we took the model from [13] and carried out an extensive architecture search in order to enhance it.

In this work, we trained the NLSPN model from [18]. The authors of the latter study achieved depth-completion SOTA on the datasets of [15] and [16]. The NLSPN model uses an auto-encoder to predict several intermediate maps and passes them to a Convolutional Spatial Propagation Network [4, 3] for iterative refinement of the depth prediction. Further details can be found in Sec. 2.3. The results are reported in Sec. 3.2.

1.2 Contributions

We illustrate our contributions in Fig. 1 and list them here.

  • Data collection - we randomized over 300 scenes of industrial objects and dozens of POVs, collecting a total of 33k RGBD images using a wrist-mounted (eye-in-hand) depth camera. The challenging nature of this setup yielded many holes in the raw depth images ( of the image area).

  • Depth ground-truth - the RGBD images were processed with a customized pipeline for filtering, registration, and Point-Cloud (PCD) fusion. The fused mesh was back-projected to the original POV to yield an estimated ground-truth depth map.

  • Depth completion - we trained NLSPN [18] on our dataset, setting an initial benchmark score.

2 Methods

In this section, we describe how we created the dataset and present the experiments performed for the depth completion.

2.1 Data Collection

radius [mm]      elevation [deg]   azimuth [deg]
349.7 - 416.8    44.4 - 78.2       -151.3 - 153.4
Table 1: POV geometric characteristics in spherical coordinates: the distance to the work surface (radius) and two orientation angles (elevation and azimuth). Each range spans the 10th to the 90th percentile.

We randomized scenes of industrial objects scattered within a rectangular workspace on top of a table. Each scene consists of up to 10 objects, predominantly metallic items such as screws, cylinders, and heat sinks. The data was collected with two RealSense D415 depth cameras, mounted on a Panda robotic arm from Franka-Emika. The motion path included 30-70 randomized Points-of-View (POV), with the end-effector poses scattered on a hemisphere while respecting the kinematic limitations of the manipulator. The end-effector was kept oriented towards a shared focal point on the table surface. In Table 1, we describe the elevation and azimuth of the POV camera axis, and the distance of the POV from the table surface, measured along the camera view axis.
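The POV geometry described above can be illustrated with a short look-at construction. This is only a sketch under stated assumptions: the uniform sampling inside the Table 1 percentile ranges, the camera convention (+z as the view axis), and the function names `look_at_pose` and `sample_povs` are all hypothetical, and the kinematic limits of the manipulator are ignored.

```python
import numpy as np

def look_at_pose(radius_mm, elevation_deg, azimuth_deg, focal_point=(0.0, 0.0, 0.0)):
    """Return a 4x4 camera-to-world transform for a camera on a hemisphere.

    The camera sits at spherical coordinates (radius, elevation, azimuth)
    around the focal point, and its +z (view) axis points at the focal point.
    """
    el, az = np.deg2rad(elevation_deg), np.deg2rad(azimuth_deg)
    focal = np.asarray(focal_point, dtype=float)
    position = focal + radius_mm * np.array(
        [np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)]
    )
    z_axis = (focal - position) / np.linalg.norm(focal - position)
    up = np.array([0.0, 0.0, 1.0])
    x_axis = np.cross(up, z_axis)          # degenerate only when looking straight down
    x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(z_axis, x_axis)      # completes a right-handed frame
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2] = x_axis, y_axis, z_axis
    pose[:3, 3] = position
    return pose

def sample_povs(n, seed=0):
    """Draw n poses uniformly inside the Table 1 ranges (an assumed distribution)."""
    rng = np.random.default_rng(seed)
    radius = rng.uniform(349.7, 416.8, n)
    elevation = rng.uniform(44.4, 78.2, n)
    azimuth = rng.uniform(-151.3, 153.4, n)
    return [look_at_pose(r, e, a) for r, e, a in zip(radius, elevation, azimuth)]
```

Because every pose is built from the same focal point, all sampled view axes intersect at that point on the table, which matches the shared-focal-point constraint in the text.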

Due to the reflective nature of the scenes, the raw depth includes many holes at highlights and object edges. The reflections also cause misleading artifacts with erroneous depth values, as demonstrated in the middle column in Fig. 2. Another contributing factor is partial occlusion, i.e. "shadows". On average, the raw depth holes within the workspace take up of the image area.

2.2 Ground Truth Generation

For each raw depth map, the ground truth is a corresponding dense depth map containing correct distances, approximately within (see Fig. 2). Pixels with a distance outside the dynamic range of the camera are designated as invalid (zero values). To estimate the ground truth, we applied the pipeline described below to each scene. For clarity, we break it down into three stages: filtering the raw inputs of each single POV, fusing the multiple views into a 3D mesh, and projecting the mesh back to obtain the GT of each separate POV.

Figure 2: Depth Ground Truth. Each row depicts a different sample. From left to right: (a) RGB, (b) raw-depth, (c) GT-depth. In column (b), note how artifacts in the out-of-range regions may appear very close (purple) or very far (yellow). In column (c), we demonstrate how depth-filtering removes artifacts and multi-view fusion adds missing information, and how projecting the pre-defined workspace boundaries onto the depth frames eliminates irrelevant information. In the bottom row of column (c), the GT depth has a small inner hole of less than of the image area.

2.2.1 Depth Filtering

Each depth frame was processed individually to remove artifacts, while retaining as much valid information as possible. In addition to providing "raw-depth" and "gt-depth" (ground truth), our dataset includes an intermediate result denoted as "cleaned-depth". The pipeline is depicted in Algo. 1 and consists of the following steps:

  1. Workspace limits - we project the workspace boundaries onto the raw depth and remove all values outside its perimeter. Tall objects positioned near the outer rim may get trimmed because their tops protrude beyond the projected boundary. Hence, we enlarge the actual workspace by a certain margin when applying this step.

  2. Morphological operations - looking at the binary mask of the depth values, the objects and tabletop typically form one large blob and false artifacts form separate small blobs. As the main blob may touch artifact blobs, we perform thinning and opening to separate them. We then threshold the blob area to filter out small blobs.

  3. Outlier removal - the depth image is first partitioned into a coarse grid and the median depth value is then calculated for each cell. We define a range limit, and depth values exceeding the allowable range from the cell’s median are removed. The reasonable range of depth values within a grid cell depends on grid parameters, POV elevation angle, and object height from the table. We used a fixed range threshold, chosen to accommodate the largest possible range within the data.
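Step 3 of the list above can be sketched as follows. This is a minimal illustration rather than the released pipeline: the function name `remove_grid_outliers`, the cell size, and the range threshold are hypothetical values chosen for the example, and invalid pixels are assumed to be stored as zeros.

```python
import numpy as np

def remove_grid_outliers(depth, cell=32, max_dev=150.0):
    """Zero out depth values that deviate too far from their cell's median.

    depth   : 2D array, where 0 marks invalid pixels
    cell    : grid cell size in pixels (hypothetical value)
    max_dev : allowed deviation from the cell median, in depth units (hypothetical)
    """
    cleaned = depth.copy()
    h, w = depth.shape
    for r in range(0, h, cell):
        for c in range(0, w, cell):
            block = cleaned[r:r + cell, c:c + cell]   # view into `cleaned`
            valid = block > 0
            if not valid.any():
                continue                              # nothing to filter in this cell
            median = np.median(block[valid])
            outliers = valid & (np.abs(block - median) > max_dev)
            block[outliers] = 0                       # writes through to `cleaned`
    return cleaned
```

A single fixed `max_dev` mirrors the fixed range threshold mentioned in the text; a per-cell threshold that depends on the elevation angle would be a natural refinement.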

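Input: N, the number of POVs in the scene; the raw depth maps of the scene; the workspace boundaries; the workspace margin; the blob-area threshold; the grid parameters; the range threshold
for each POV do
      // Workspace limits
      initialize the cleaned depth map with the raw depth values inside the enlarged projected workspace
      // Morphological operations
      apply thinning and binary-open on the binary mask of the depth values
      apply bwlabel on the resulting mask
      initialize an empty mask of retained blobs
      for each labeled blob do
            if the blob area exceeds the area threshold then
                  add the blob to the retained mask
      // Outlier removal
      for each grid cell do
            remove depth values that exceed the allowed range from the cell median
return cleaned depth maps of scene
Algorithm 1 Depth Filtering

Input: N, the number of POVs in the scene; cleaned depth maps of scene; homogeneous transform matrices of the POVs; upper limit on transform distances; minimal cluster size threshold
initialize the pose-graph with N nodes, and no edges
for each node do
      initialize the node pose from the robot kinematics
// Create edges
for each POV i do
      for each POV j following i do
            refine the relative transform between i and j with ICP
            if the refined transform deviates from the kinematic transform by less than the upper limit then
                  add the edge (i, j) to the pose-graph
// Bundle adjustment
apply global optimization on the pose-graph
// TSDF-Volume integration
integrate the RGBD frames into a TSDF volume and extract the mesh
for each cluster of mesh vertices do
      if the cluster is smaller than the minimal cluster size then
            remove the cluster
return 3D model of scene surface
Algorithm 2 Multi-View Fusion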

2.2.2 Multi-View Fusion

After pre-processing the depth channels, we fused the RGBD frames using the "multiway registration" procedure from [27]. For convenience, we outline the steps in Algo. 2. The RGBD frames are organized as a pose-graph, where each pose node corresponds to a POV pose and the robot kinematics are used to initialize it. Each edge of the pose-graph, i.e. the relative transform between a pair of POVs, is refined with an extension of Iterative-Closest-Point (ICP) [1]. This stage is prone to errors. While the authors in [27] recommend pruning edges with low confidence scores, we found this to be insufficient. We therefore replaced the exclusion criterion with a limit on the acceptable ICP update, and pruned transforms deviating beyond a certain distance from the nominal kinematic transforms. The pose-graph then undergoes global optimization with the Levenberg-Marquardt algorithm to resolve conflicting relative transforms (edges). The final pose-graph is used to cast the RGBDs into a shared Point-Cloud (PCD). The PCD is then triangulated to generate a mesh, using [5]. To remove the remaining artifacts, we applied clustering to the mesh vertices and removed small clusters.
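The edge-pruning criterion can be illustrated with plain homogeneous matrices. The sketch below is an assumption about how such a check might look, not the authors' code: the names `relative_transform` and `keep_edge` are hypothetical, poses are assumed to be camera-to-world, and the optional rotation limit is an addition that the text does not mention.

```python
import numpy as np

def relative_transform(pose_i, pose_j):
    """Nominal edge: maps points from the frame of POV j into the frame of POV i."""
    return np.linalg.inv(pose_i) @ pose_j

def keep_edge(T_icp, T_nominal, max_translation, max_rotation_deg=None):
    """Accept an ICP-refined edge only if it stays close to the kinematic estimate."""
    delta = np.linalg.inv(T_nominal) @ T_icp           # residual update applied by ICP
    if np.linalg.norm(delta[:3, 3]) > max_translation:
        return False
    if max_rotation_deg is not None:                   # assumed extra check, not in the text
        cos_angle = np.clip((np.trace(delta[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        if np.degrees(np.arccos(cos_angle)) > max_rotation_deg:
            return False
    return True
```

Edges that fail the check are simply left out of the pose-graph, so the subsequent global optimization only has to reconcile transforms that already agree with the robot kinematics to within the chosen limit.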

2.2.3 Back Projection and Hole Filling

To generate GT depth maps, the mesh is projected back to each POV, as described in Algo. 3. At this stage, the GT depth maps still contain internal holes, which can be filled using RGB-guided interpolation. The authors of [2] considered Bi-Lateral Filtering (BLF) [23] and the method from [12] and chose to apply the latter. For our dataset, BLF is more suitable, mainly due to the large peripheral out-of-range regions. As shown in (1), we implement a BLF for RGB distances and pixel-coordinate distances.


Here, the filter takes the depth map and a binary mask of depth validity; the summation coordinates span the filter's support around the central pixel; "Euc" denotes the Euclidean distance in the respective (RGB or pixel) coordinates; and the two remaining quantities are configuration parameters.
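For orientation, one common form of a validity-masked bilateral filter that is consistent with this description is given below. It is an illustration under assumed notation, not necessarily the exact expression used by the authors: D is the depth, M the binary validity mask, Ω(x, y) the filter's support around the central pixel (x, y), and σ_RGB and σ_xy stand in for the two configuration parameters.

```latex
\hat{D}(x, y) =
  \frac{\sum_{(i,j) \in \Omega(x,y)} M(i,j)\, w_{x,y}(i,j)\, D(i,j)}
       {\sum_{(i,j) \in \Omega(x,y)} M(i,j)\, w_{x,y}(i,j)},
\qquad
w_{x,y}(i,j) = \exp\!\left(
  -\frac{\mathrm{Euc}_{RGB}\big((i,j),(x,y)\big)^2}{2\sigma_{RGB}^2}
  -\frac{\mathrm{Euc}_{xy}\big((i,j),(x,y)\big)^2}{2\sigma_{xy}^2}
\right)
```

The mask M removes invalid neighbours from both sums, so the estimate is a weighted average of valid depth values only.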

The filter was applied only to invalid pixels, keeping valid values as-is. To reduce extrapolation into out-of-range regions, we required a minimum occupancy of valid values within the filter's support.
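A direct, unoptimized sketch of such RGB-guided hole filling is shown below. It assumes Gaussian kernels on the RGB and pixel-coordinate distances; the function name `fill_holes_blf` and all default parameter values (window radius, kernel widths, occupancy threshold) are hypothetical and are not taken from the source.

```python
import numpy as np

def fill_holes_blf(depth, rgb, radius=5, sigma_rgb=20.0, sigma_xy=3.0, min_occupancy=0.3):
    """Fill invalid (zero) depth pixels with a masked bilateral average.

    Valid pixels are kept as-is. An invalid pixel is filled only if the share
    of valid pixels inside its window reaches `min_occupancy`.
    """
    h, w = depth.shape
    valid = depth > 0
    rgb = rgb.astype(float)
    source = depth.astype(float)
    filled = source.copy()
    for y, x in zip(*np.nonzero(~valid)):
        y0, y1 = max(y - radius, 0), min(y + radius + 1, h)
        x0, x1 = max(x - radius, 0), min(x + radius + 1, w)
        mask = valid[y0:y1, x0:x1]
        if mask.mean() < min_occupancy:
            continue                       # too few valid neighbours: leave the hole
        yy, xx = np.mgrid[y0:y1, x0:x1]
        d_xy2 = (yy - y) ** 2 + (xx - x) ** 2
        d_rgb2 = ((rgb[y0:y1, x0:x1] - rgb[y, x]) ** 2).sum(axis=-1)
        weights = mask * np.exp(-d_rgb2 / (2 * sigma_rgb ** 2) - d_xy2 / (2 * sigma_xy ** 2))
        total = weights.sum()
        if total > 0:
            # read from the original depth so filled values do not propagate
            filled[y, x] = (weights * source[y0:y1, x0:x1]).sum() / total
    return filled
```

The occupancy test is what keeps the large peripheral out-of-range regions empty: a pixel deep inside such a region has almost no valid neighbours and is therefore skipped.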

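Input: the 3D model of the scene surface; the POV poses; the camera intrinsic parameters; the RGB frames; the BLF configuration parameters
for each POV do
      project the mesh to the POV to obtain a depth map
      fill the internal holes of the depth map with the RGB-guided BLF
return scene ground truth depth maps
Algorithm 3 Back Projection and Hole Filling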

2.2.4 Manual Inspection and Retries

All the GT depth maps were manually inspected to ensure high fidelity. In rare cases, we vetted a scene but removed up to of the POV due to persistent artifacts. Scenes that failed curation were reprocessed with stricter parameters, i.e. with more aggressive filtering at the cost of higher information loss. We depict this in Fig. 3, where we compare the proportion of invalid GT values for 10 random scenes (total of 650 POV) processed with the strict and lenient configurations. For this subset, the raw-depth holes within the projected workspace ("inner") constitute, on average, of the image area.

Figure 3: Configuration comparison. We depict the proportional area of invalid Ground-Truth values, under two different pipeline configurations. The "lenient" configuration retains more information, while the "strict" configuration rejects more artifacts. The total size of holes includes peripheral out-of-range regions, which are of less interest. Inner holes are only those within the projected workspace (see Sec. 2.2.1).

2.3 Depth Completion

We implemented the model presented in [18], consisting of two stages. The first stage is an auto-encoder that predicts an initial depth map, affinities, and confidence maps. The second stage uses the intermediate outputs to predict the final depth map via iterative non-local spatial propagation. The propagation is akin to convolution with adaptive weights and kernels, determined by confidence and affinities, respectively. In our implementation, we omitted the confidence prediction to simplify the model. We expect this removal to have a negligible impact since the ablation study in [18] showed that it reduced errors only by . The resolution of the input RGBD frames is 720x1280 pixels, with a 9:16 aspect ratio. The NLSPN model was trained on images reduced to 60% in each dimension, i.e. 432x768 pixels.

Figure 4: Depth completion with NLSPN. Each row depicts a different sample. Left-to-right: (a) RGB, (b) Raw-Depth, (c) Depth Prediction, and (d) Ground-Truth. Notice how the model is able to fill shadows and holes on objects. In the peripheral out-of-range regions, the artifacts are subdued and the table top is correctly extended.

3 Results

3.1 Industrial RGBD Dataset

We were able to collect and generate GT for 319 different scenes of industrial objects. Each scene was acquired from POVs (mean 1 std), resulting in over 33k RGBD frames. Our dataset is organized by scenes. For each POV, we provide the RGB and raw-depth inputs and the corresponding GT-depth. As shown in Fig. 2, the GT depth maps have empty peripheral regions which take up of the image area and might include small internal holes taking up less than .

We also provide the intermediate results: a mask of the projected workspace and the cleaned-depth (see Algo. 1). To support 3D approaches, we include in [22] a corresponding 3D dataset containing the camera intrinsic parameters, the POVs given as homogeneous matrices, and the colored meshes calculated in Algo. 2.
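As an illustration of how the intrinsic parameters and the homogeneous POV matrices could be combined, the sketch below back-projects a depth map into a world-frame point cloud with a pinhole model. The function name `depth_to_world_points` is hypothetical, and the conventions are assumptions to be checked against the dataset documentation: a 3x3 intrinsic matrix K, a camera-to-world POV matrix, and zeros marking invalid depth.

```python
import numpy as np

def depth_to_world_points(depth, K, cam_to_world, depth_scale=1.0):
    """Back-project valid depth pixels into world coordinates.

    depth        : HxW depth map, 0 marks invalid pixels
    K            : 3x3 pinhole intrinsic matrix
    cam_to_world : 4x4 homogeneous POV matrix (assumed camera-to-world)
    depth_scale  : factor converting stored depth values to metric units
    """
    v, u = np.nonzero(depth > 0)                       # pixel rows and columns
    z = depth[v, u].astype(float) * depth_scale
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    points_cam = np.stack([x, y, z, np.ones_like(z)])  # 4 x N homogeneous points
    return (cam_to_world @ points_cam)[:3].T           # N x 3 world points
```

Applying this to several POVs of the same scene and stacking the results gives a merged point cloud, which is a quick way to check that the assumed pose convention is the right one.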

3.2 Depth Completion

         RMSE [mm]   MAE [mm]
Raw      122.1       49.0
NLSPN    25.3        8.9
Table 2: Depth completion errors. The errors are measured in millimeters, given as RMSE and MAE. Raw refers to acquisition errors and NLSPN to depth-completion.

We summarize the test results of the NLSPN model in Table 2 and show examples of predictions in Fig. 4. The errors are calculated against the GT and only where the GT is valid (non-zero). We report Root-Mean-Square-Error (RMSE) and Mean-Absolute-Error (MAE); the latter is less sensitive to extreme values. The NLSPN model reduces the acquisition errors by approximately a factor of five (about 4.8x in RMSE and 5.5x in MAE). Further work will be necessary to meet the practical requirement of 1 mm accuracy.
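The masked error computation described above can be written in a few lines. This is a minimal sketch, assuming that both maps are given in millimeters and that invalid GT pixels are stored as zeros; the function name `masked_errors` is hypothetical.

```python
import numpy as np

def masked_errors(pred, gt):
    """Return (RMSE, MAE) computed only where the ground truth is valid (non-zero)."""
    valid = gt > 0
    diff = pred[valid].astype(float) - gt[valid].astype(float)
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    mae = float(np.mean(np.abs(diff)))
    return rmse, mae
```

Passing the raw depth as `pred` yields the acquisition errors (the "Raw" row of Table 2), while passing the model output yields the depth-completion errors.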

4 Conclusions

To our knowledge, our dataset is the first public RGBD dataset for industrial objects. To ensure high variability, we used a few dozen objects with approximately 10 different backgrounds and randomized Points-of-View.

We trained the NLSPN depth-completion model on our data and report its errors. From our experience with grasping models, we assess that the achieved accuracy would be adequate for the more robust models such as [26], but might be insufficient for more sensitive models such as [14].

We hope that our contribution will help push forward robotic manipulation in industrial applications.


  • [1] P. J. Besl and N. D. McKay (1992) A method for registration of 3-d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 14 (2), pp. 239–256. External Links: Document Cited by: §2.2.2.
  • [2] B. Calli, A. Singh, J. Bruce, A. Walsman, K. Konolige, S. Srinivasa, P. Abbeel, and A. M. Dollar (2017) Yale-cmu-berkeley dataset for robotic manipulation research. The International Journal of Robotics Research 36 (3), pp. 261–268. External Links: Document Cited by: §1.1, §1, §2.2.3.
  • [3] X. Cheng, P. Wang, C. Guan, and R. Yang (2019) CSPN++: learning context and resource aware convolutional spatial propagation networks for depth completion. CoRR abs/1911.05377. External Links: Link, 1911.05377 Cited by: §1.1.
  • [4] X. Cheng, P. Wang, and R. Yang (2018) Depth estimation via affinity learned with convolutional spatial propagation network. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 103–119. Cited by: §1.1.
  • [5] B. Curless and M. Levoy (1996) A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’96, New York, NY, USA, pp. 303–312. External Links: ISBN 0897917464, Link, Document Cited by: §1, §2.2.2.
  • [6] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) ScanNet: richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE. Cited by: §1.
  • [7] Y. Feldman, Y. Shapiro, and D. D. Castro (2020) Depth completion with rgb prior. External Links: 2008.07861 Cited by: §1.1.
  • [8] M. Firman (2016) RGBD datasets: past, present and future. In 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 661–673. External Links: Document Cited by: §1, §1.
  • [9] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow (2019-10) Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §1.
  • [10] G. E. Hinton and R. R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507. External Links: Document, ISSN 0036-8075, Link, Cited by: §1.1.
  • [11] M. Jaritz, R. Charette, E. Wirbel, X. Perrotton, and F. Nashashibi (2018-09) Sparse and dense data with cnns: depth completion and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 52–60. External Links: Document Cited by: §1.1, §1.
  • [12] A. Levin, D. Lischinski, and Y. Weiss (2004) Colorization using optimization. In ACM SIGGRAPH 2004 Papers, pp. 689–694. Cited by: §2.2.3.
  • [13] F. Ma, G. V. Cavalheiro, and S. Karaman (2019) Self-supervised sparse-to-dense: self-supervised depth completion from lidar and monocular camera. In 2019 International Conference on Robotics and Automation (ICRA), pp. 3288–3295. Cited by: §1.1, §1.
  • [14] J. Mahler, M. Matl, V. Satish, M. Danielczuk, B. DeRose, S. McKinley, and K. Goldberg (2019) Learning ambidextrous robot grasping policies. Science Robotics 4 (26), pp. eaau4984. Cited by: §1, §4.
  • [15] M. Menze and A. Geiger (2015) Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.1, §1.
  • [16] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from rgbd images. In ECCV, Cited by: §1.1, §1.
  • [17] O. Özyeşil, V. Voroninski, R. Basri, and A. Singer (2017) A survey of structure from motion.. Acta Numerica 26, pp. 305–364. External Links: Document Cited by: §1.
  • [18] J. Park, K. Joo, Z. Hu, C. Liu, and I. S. Kweon (2020) Non-local spatial propagation network for depth completion. External Links: 2007.10042 Cited by: 3rd item, §1.1, §1, §2.3.
  • [19] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Eds.), Cham, pp. 234–241. Cited by: §1.1.
  • [20] S. S. Sajjan, M. Moore, M. Pan, G. Nagaraja, J. Lee, A. Zeng, and S. Song (2019) ClearGrasp: 3d shape estimation of transparent objects for manipulation. External Links: 1910.02550 Cited by: §1, §1.
  • [21] J. G. Schornak (2017) Yak. GitHub. Note: Cited by: §1.
  • [22] Y. Shapiro (2021)(Website) Note: Cited by: §1.1, §3.1.
  • [23] C. Tomasi and R. Manduchi (1998) Bilateral filtering for gray and color images. In Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), Vol. , pp. 839–846. External Links: Document Cited by: §2.2.3.
  • [24] J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield (2018) Deep object pose estimation for semantic robotic grasping of household objects. In Conference on Robot Learning (CoRL), External Links: Link Cited by: §1.
  • [25] T. Wang, F. Wang, J. Lin, Y. Tsai, W. Chiu, and M. Sun (2019) Plug-and-play: improve depth prediction via sparse data propagation. In 2019 International Conference on Robotics and Automation (ICRA), pp. 5880–5886. Cited by: §1.
  • [26] A. Zeng, S. Song, J. Lee, A. Rodriguez, and T. Funkhouser (2019) TossingBot: learning to throw arbitrary objects with residual physics. External Links: 1903.11239 Cited by: §1, §4.
  • [27] Q. Zhou, J. Park, and V. Koltun (2018) Open3D: A modern library for 3D data processing. arXiv:1801.09847. Cited by: §2.2.2.

Appendix A Supplementary Figures