In recent years, depth sensing has become essential for a variety of significant new applications. For example, depth sensors assist autonomous cars in navigation and collision prevention. The physical constraints on active mobile depth-sensing devices, such as light detection and ranging (LiDAR), yield sparse depth measurements per scan. This results in a coarse point cloud and requires additional estimation of the missing data.
Traditional LiDARs have a restricted scanning mechanism. These devices measure distance at specified angle intervals, using a fixed number of horizontal scan-lines (usually 16 to 64), depending on the number of transceivers. A revolutionary new technology is now emerging: solid-state depth sensors. They are based on optical phased-arrays with no mechanical parts, and can thus scan the scene quickly and adaptively (programmable scanning) [9, 29]. In addition, these innovative devices are much cheaper than those currently in use. This calls for the development of new, efficient sampling strategies which reduce the reconstruction error per sample. Since autonomous platforms are almost always equipped with RGB cameras, we investigate the possibility of improving the depth sampling process by taking the RGB information into account.
In this paper, we address the topic of image-guided depth sampling and reconstruction. First, we introduce the concept of adaptive depth sampling and develop an appropriate model of the data. Then, we introduce a fast and practical image-guided algorithm for depth sampling and reconstruction, based on super-pixels. An example output of our algorithm is shown in Fig. 1. We demonstrate experimentally that our framework outperforms state-of-the-art depth completion methods for both indoor and outdoor scenes. Finally, since current solid-state technology does not yet allow reconfiguration of the sampling pattern, we illustrate the concept in real life with a single-pixel depth camera, which was 3D-printed in our lab.
2 Related Work
Depth completion: The task of depth reconstruction from scattered sparse samples is being increasingly investigated. The main methods can be divided into those which require only the sparse depth input (unguided) and those assisted by additional information, e.g., a color image (guided). Some unguided methods rely on classical interpolation techniques, while others rely on more advanced tools such as deep learning [11, 35].
In contrast, guided methods exploit the connection between depth maps and their corresponding color images. Earlier methods used traditional image processing tools [6, 14]. Recently, several deep learning-based methods [10, 15, 17, 18, 20, 21, 25, 26] have achieved state-of-the-art results.
Two previous studies have proposed a non-trivial (i.e., neither uniformly random nor grid-based) sampling pattern as a preliminary step to depth reconstruction. Both studies selected sampling locations which are most likely to have strong depth gradients. Nonetheless, they do not handle well very low sampling budgets of less than 5% of the ground-truth pixels.
3 The Space of Depth Images
Any sampling strategy is based on a model of the signal to be sampled. For example, in classical Fourier analysis, the assumption is of band-limited signals; thus, sampling at the Nyquist frequency guarantees perfect reconstruction. Compressed sensing assumes a sparse underlying model of the signal (such as in terms of edges). Sub-Nyquist sampling is based on the ability to correctly manipulate aliased signals, given prior knowledge of the frequency structure of the data. However, the models above assume a single source of data to be sampled. We would like to examine an appropriate model for depth scenes, as well as their relation to the RGB data of the same scene. We propose a simple depth model and validate it experimentally on benchmark data. We then relate it to RGB.
3.1 Piece-wise planar depth model
Our primary objective is to obtain depth information for autonomous navigation. Thus, an appropriate model should represent well the general geometrical setting (roads, walls, sidewalks) as well as the location of significant landmarks and obstacles (poles, signs, rocks and objects in a room). For objects, we would like to obtain their location but not necessarily their precise geometry. This leads us to a piece-wise planar model, which was mentioned in [5, 33] but not yet formulated and tested. Given a depth image $d$, our hypothesis is that most of the scene can be well represented by a piece-wise planar approximation. More formally, let $\Omega$ be the image domain, where $|\Omega|$ is its area. Let $\Omega_i$, $i = 1, \ldots, N$, be a set of sub-domains which define a partition of $\Omega$. Thus, $\Omega_i \subset \Omega$, $\Omega_i \cap \Omega_j = \emptyset$ for $i \neq j$, and $\bigcup_{i=1}^{N} \Omega_i = \Omega$. Let $f$ be a 2D piece-wise linear function, defined on the domain $\Omega$ by

$$f(x, y) = a_i x + b_i y + c_i, \quad (x, y) \in \Omega_i, \qquad (1)$$

where $a_i, b_i, c_i$ are some constants. Let $v(x, y)$ be a binary function in $\Omega$ which indicates validity of the model, where $v = 1$ indicates validity and $v = 0$ invalidity. We denote by $\Omega_v$ the set of valid points, $\Omega_v = \{(x, y) \in \Omega : v(x, y) = 1\}$. We assume $\Omega_v \subseteq \Omega$, $|\Omega_v| \geq (1 - \delta)|\Omega|$. Our hypothesis is that given some small tolerance parameters $\epsilon, \delta$, a validity map $v$ and a small number of regions $N$, for any depth map $d$ there exists a piece-wise planar approximation, defined by Eq. (1), such that

$$|d(x, y) - f(x, y)| \leq \epsilon \, d(x, y), \quad \forall (x, y) \in \Omega_v. \qquad (2)$$
Thus, $d$ can be well approximated by a 2D piece-wise linear function almost everywhere, provided we know the partition set $\{\Omega_i\}$ and the plane parameters for each $\Omega_i$. Our aim is to approximate $\{\Omega_i\}$ from the RGB image. In order to recover $f$ we need to sample each region 3 times, to estimate its coefficients (in the noiseless case). This gives us a lower bound on the number of samples required to obtain a high quality depth image:

$$B_{\min} = 3N. \qquad (3)$$
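To make the counting argument concrete, the per-region plane fit can be sketched with a simple least-squares estimate. This is an illustrative sketch, not the paper's code; the function name is ours:

```python
import numpy as np

def fit_plane(xs, ys, ds):
    """Least-squares fit of d ~ a*x + b*y + c from sampled points.
    Requires at least 3 non-collinear samples per region -- the lower
    bound discussed above (exactly 3 in the noiseless case)."""
    A = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(float)
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(ds, dtype=float), rcond=None)
    return coeffs  # (a, b, c)

# Toy check: samples drawn exactly from the plane d = 2x - y + 5.
xs = np.array([0, 4, 1, 3])
ys = np.array([0, 0, 5, 2])
ds = 2 * xs - ys + 5
a, b, c = fit_plane(xs, ys, ds)
```

With noiseless samples on a plane, the recovered coefficients match the generating plane exactly.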
We now turn to experimentally check this hypothesis and examine the values of $N$, $\epsilon$ and $\delta$ in indoor and outdoor scenes.
To validate the proposed model, we computed a piece-wise planar depth approximation for two datasets which have dense ground-truth depth. For outdoor scenes, there are few real-life benchmarks with dense depth; we therefore resorted to a high quality emulation, using 787 downsampled images from summer sequence 5 (left stereo, front view) of the Synthia dataset. For indoor scenes we used 654 downsampled and center-cropped images of the NYU-Depth-v2 test set.
Examples of piece-wise planar approximations are shown in Fig. 2 (bottom), compared to the ground truth (middle row). Dark-blue indicates regions not in the valid set $\Omega_v$. It can be observed that the approximation is quite accurate. Statistical results, including the average model parameters ($N$, $\epsilon$, $\delta$) recovered for Synthia and for NYU-Depth-v2, are presented in Fig. 3.
In these cases, according to Eq. (3), Synthia can be well approximated in an optimal scenario by an average of only 200 samples, whereas NYU-v2 by an average of 56 samples (in both cases, this translates to a very small sampling ratio, compared to the ground-truth depth resolution). In Fig. 4 we show examples of objects which fit a piece-wise planar approximation well (top) and counter-examples of highly non-convex structures or ones with high curvature.
3.2 Relation of RGB and depth
Next, we want to examine the possibility of estimating the partition set $\{\Omega_i\}$ from the RGB data. This is a very challenging task, which is an open problem at this point. We thus turn to a simpler problem: checking the relation between RGB edges and depth discontinuities. Given the set of RGB boundaries (edges) $E_{RGB}$ and depth boundaries $E_D$, we would like to calculate empirically, for each coordinate $(x, y)$, the conditional probabilities $P\left((x,y) \in E_D \mid (x,y) \in E_{RGB}\right)$ and $P\left((x,y) \in E_{RGB} \mid (x,y) \in E_D\right)$.
We compute the set $E_{RGB}$ for each image using a generic edge detector, well suited for natural images. For the set $E_D$ we employed a threshold on the depth gradient, normalized by the depth value. We allowed some tolerance in the registration of the images due to misalignment, so for any $(x, y)$, the search is performed in a small pixel neighborhood. For both Synthia and NYU-v2, the high values of $P(E_{RGB} \mid E_D)$ indicate the ability to predict depth discontinuities well, based on RGB edges. The relatively low values of $P(E_D \mid E_{RGB})$ indicate that we should expect many false partitions (which appear only in the RGB, but not in the depth data). Thus we can expect to be able to approximate, to some extent, the partition set based solely on the RGB image, by over-segmentation. As the partition is quite a rough approximation, additional samples are required, above the lower bound expressed in Eq. (3). In Fig. 5 we show examples of boundaries in the RGB and depth for both sets.
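The edge co-occurrence statistic described above can be sketched as follows. This is our illustrative reconstruction, not the authors' code; the function name and the simple wrap-around dilation (adequate away from image borders) are our simplifications:

```python
import numpy as np

def edge_cooccurrence(rgb_edges, depth_edges, tol=1):
    """Empirical P(depth edge | RGB edge) and P(RGB edge | depth edge),
    allowing a (2*tol+1)^2 pixel neighborhood for registration tolerance.
    Both inputs are boolean edge masks of the same shape."""
    def dilate(mask, r):
        # Simple shift-and-OR dilation; np.roll wraps at borders,
        # which is acceptable for this sketch away from image edges.
        out = np.zeros_like(mask)
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
        return out
    d_near = dilate(depth_edges, tol)
    r_near = dilate(rgb_edges, tol)
    p_d_given_r = (rgb_edges & d_near).sum() / max(rgb_edges.sum(), 1)
    p_r_given_d = (depth_edges & r_near).sum() / max(depth_edges.sum(), 1)
    return p_d_given_r, p_r_given_d

# Toy example: an RGB edge at (2, 2) and a depth edge one pixel away.
rgb = np.zeros((5, 5), bool); rgb[2, 2] = True
dep = np.zeros((5, 5), bool); dep[2, 3] = True
p_d, p_r = edge_cooccurrence(rgb, dep, tol=1)
```

With the one-pixel tolerance, the displaced edges are counted as co-occurring, so both probabilities are 1 in this toy case.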
4 Proposed Method

We propose a generic and simple method for sparse depth sampling and dense reconstruction. The following assumptions are made:
Measurements are of high quality, such that the noise of the range measurement is negligible compared to the global error of the dense reconstruction.
The sampling budget is limited to a prescribed number of samples, denoted $B$.
An RGB image of the scene is available to guide the process. The reconstructed depth is registered to this image. Sensitive cameras may be used for night scenes.
Sampling is point-wise. The system can sample at any desired location of the RGB image. The sampling pattern can change for each image.
4.1 Algorithm design requirements
Several requirements are vital to the design of such an algorithm: it needs to capture well the shape and boundaries of objects, to be computationally fast and memory efficient, and to allow control over the number of samples. Surprisingly, all these requirements coincide with those for super-pixels (SPs) - an over-segmentation technique applied to RGB images. This led us to the following algorithm.
4.2 Proposed algorithm
The proposed algorithm is divided into two parts, sampling and reconstruction. It includes the following steps:
Sampling: A super-pixel map is generated from the RGB image using SLIC. The desired number of SPs is set to the sampling budget. The SP compactness parameter is set to a high value to ensure regularly shaped SPs.
For each SP, the center of mass (CoM) is computed by taking the mean of the coordinates of all pixels in the SP. A depth sample is taken at the CoM location. If the CoM is located outside of the SP (for some non-convex SPs), the depth sample is taken at the location within the SP closest to the CoM.
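The sampling step above can be sketched as follows, assuming a precomputed super-pixel label map (e.g., from SLIC); the function name is illustrative:

```python
import numpy as np

def superpixel_sample_points(labels):
    """For each superpixel label, return the sampling coordinate:
    the center of mass (CoM), snapped to the nearest pixel belonging
    to the superpixel when the CoM falls outside it (non-convex SPs).
    `labels` is an (H, W) integer label map."""
    points = {}
    for sp in np.unique(labels):
        ys, xs = np.nonzero(labels == sp)
        cy, cx = ys.mean(), xs.mean()
        ry, rx = int(round(cy)), int(round(cx))
        if labels[ry, rx] == sp:
            points[sp] = (ry, rx)          # CoM lies inside the SP
        else:
            d2 = (ys - cy) ** 2 + (xs - cx) ** 2
            k = int(np.argmin(d2))         # nearest in-SP pixel to CoM
            points[sp] = (int(ys[k]), int(xs[k]))
    return points

# Toy example: two vertical-strip superpixels on a 4x4 grid.
labels = np.zeros((4, 4), int)
labels[:, 2:] = 1
pts = superpixel_sample_points(labels)
```

Each returned coordinate is guaranteed to lie inside its own superpixel, so the depth sample is never taken across a segment boundary.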
Reconstruction: Our reconstruction is based on the samples and SPs of the sampling stage.
For each SP, a single depth measurement is available, thus a zero-order estimation is performed. That is, the entire SP takes the depth value of the sample. Let $\hat{d}$ be the resulting depth image.
$L$ is calculated by $L = \log(\hat{d})$.
A bilateral filter is applied over $L$. The filter's parameters are fixed for a given number of samples and type of scene (road / room). Let $\tilde{L}$ be the bilateral filter result.
The final dense reconstructed depth image, $d^*$, is calculated by $d^* = \exp(\tilde{L})$.
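A minimal sketch of the reconstruction stage follows, with a brute-force bilateral filter standing in for the fast solver used in practice; parameter names and default values are illustrative, not the paper's:

```python
import numpy as np

def reconstruct(labels, samples, sigma_s=2.0, sigma_r=0.1, radius=2):
    """Zero-order fill per superpixel, log transform, edge-preserving
    bilateral smoothing, then exp back to depth. `samples` maps each
    superpixel label to its measured depth; every label is assumed
    to have a sample."""
    # Zero-order estimate: every pixel takes its superpixel's sample.
    dhat = np.zeros(labels.shape, dtype=float)
    for sp, depth in samples.items():
        dhat[labels == sp] = depth
    L = np.log(dhat)              # smoothing becomes relative to depth
    H, W = L.shape
    out = np.empty_like(L)
    for y in range(H):
        for x in range(W):
            y0, y1 = max(0, y - radius), min(H, y + radius + 1)
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            patch = L[y0:y1, x0:x1]
            yy, xx = np.mgrid[y0:y1, x0:x1]
            # Spatial kernel * range kernel: large log-depth jumps
            # (true discontinuities) get near-zero weight.
            w = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2)
                       - (patch - L[y, x]) ** 2 / (2 * sigma_r ** 2))
            out[y, x] = (w * patch).sum() / w.sum()
    return np.exp(out)            # final dense depth estimate

# Toy example: two constant-depth superpixels separated by a step.
labels = np.zeros((6, 6), int)
labels[:, 3:] = 1
rec = reconstruct(labels, {0: 2.0, 1: 20.0})
```

On this toy step scene the range kernel assigns negligible weight across the 2 m / 20 m boundary, so the discontinuity is preserved rather than blurred.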
See Fig. 6 for a high-level diagram of our framework.
4.3 Principles of the algorithm
Sampling based on RGB segmentation. This follows the model and the relations between RGB edges and depth discontinuities discussed in the previous section.
Sampling at the center of mass of a segment. There are several reasons for this choice. First, it reduces an inherent uncertainty near discontinuities. Second, for piece-wise planar depth regions, the center of mass minimizes the RMSE. Moreover, practical depth sensing technologies have a finite spatial resolution and cannot sample well near depth discontinuities. Fig. 7 demonstrates that sampling at the SP CoM leads to more accurate depth reconstruction than sampling a random pixel location inside the SP.
Why super-pixels. Sampling with super-pixels enables measuring small elements. It also limits the reconstruction error, since the size of each segment is limited. When a depth discontinuity is not well reflected in the RGB (we term these camouflaged objects), the sampling reduces to an approximate grid-sampling scheme, which provides a lower bound on the resolution. This is illustrated in Fig. 9.
0-order vs. 2D linear reconstruction in each segment. At first glance, it seems natural to estimate by SPs the sub-domains $\Omega_i$ of the model, which requires 3 samples to obtain a plane approximation in the region of each SP. However, we found that this is not an optimal strategy. A better approach is to increase the number of segments by a factor of 3 and to sample each segment once. This increases the overall resolution (or smallest measurable object size) of the system while still recovering large planar segments reasonably well, with a proper nonlinear filtering operation (see below). The depth of the smallest objects that the system can measure is estimated by a constant value. This facilitates the detection of poles, signs and small obstacles at a low sampling cost. Fig. 7 demonstrates that 0-order reconstruction is much more accurate than linear reconstruction in terms of RMSE, for a given sampling budget. In Fig. 8 the ability to detect small objects well by 0-order estimation is illustrated.
Bilateral filtering. Having more SPs with zero-order estimation allows small objects to be sampled well. However, large flat regions are now heavily degraded by staircasing artifacts. Our proposed solution is to apply a fast, nonlinear, edge-preserving filter. It is designed such that actual depth discontinuities are preserved, whereas false edges, which stem from the 0-order estimation, are smoothed out. Due to the log function, smoothing is relative to the depth. This approximates well the piece-wise planar model for large regions, such as walls and roads, also yielding lower RMSE, as seen in Fig. 7. As can be seen in Fig. 8, the artifacts in the reconstruction of the large planar background region are quite minimal.
4.3.3 MTF analysis
We aim to measure the spatial resolution of our sampling and reconstruction strategy. We use the modulation transfer function (MTF) - a standard tool for characterizing the resolution of imaging systems. Our chart is based on the Siemens star test chart, modified for RGB-guided depth sampling. We modified the original test chart into a road scene, replacing the white regions with typical background content and the black regions with typical foreground content. Depth is assumed to take binary values, as in the original chart. The details of the MTF calculation are given in the referenced literature.
The quantitative MTF results, computed from the reconstructed images in Fig. 11, show a clear increase in resolution for the proposed method, compared to RGB-guided and non-guided depth completion approaches.
5 Experiments

We evaluate the performance of our depth sampling and reconstruction method and compare it to other approaches. For all other examined methods we simulate uniformly random depth samples at different sparsity levels. We define pixel density as the ratio between the number of sampled pixels and the total number of pixels in the image. To demonstrate the generalization of our algorithm, we use two distinct datasets for evaluation - one for outdoor scenarios and one for indoor scenarios. For the outdoor dataset we also evaluate over a subset of image areas focusing on small obstacles. We then make a qualitative comparison between different sampling patterns and show that ours leads to better results. Finally, we show initial experimental results using a real system we built, based on the proposed principles. We measure performance in all experiments with RMSE (root mean squared error), and also report the REL (relative absolute error) metric on NYU-Depth-v2.
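For reference, the evaluation metrics mentioned above can be computed as follows (a straightforward sketch; function names are ours):

```python
import numpy as np

def rmse(pred, gt):
    """Root mean squared error, in the depth units of the maps."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def rel(pred, gt):
    """Mean relative absolute error, |pred - gt| / gt."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.mean(np.abs(pred - gt) / gt))

def pixel_density(num_samples, height, width):
    """Ratio of sampled pixels to total pixels in the image."""
    return num_samples / (height * width)
```

For example, 100 samples on a 100x100 image give a pixel density of 1%.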
5.1 Outdoor data (Synthia)
The Synthia dataset provides synthetic RGB, depth and semantic images for urban driving scenarios. We use synthetic data because, at present, there is no large non-synthetic dataset that provides dense and accurate depth maps. A dense depth map is needed so we can sample at any given point; accuracy is required especially to show the increased resolution we obtain. Thus, two large real-life datasets do not apply: the KITTI depth completion benchmark has only semi-dense depth, and Cityscapes has low resolution depth, which is in some cases inaccurate. More technical details are given in Section 3.1. We evaluate on the 0-100 m depth range, which is similar to the range of a typical vehicle-mounted LiDAR.
Full scene experiment. Quantitative results are presented in Fig. 11(a). We achieve a 30% lower RMSE than the second-best method at 0.45% density, and maintain the best result across all densities. We also evaluated a deep-learning method trained on the KITTI benchmark, but it failed to obtain comparable results. A qualitative comparison is shown in Fig. 13, exhibiting precise and sharp reconstruction, especially of small objects.
Obstacles set. To enable evaluation over important objects in the image and to reduce the impact of the far background on the performance measurement, we derived from Synthia a set of 100 obstacles, which we refer to as the obstacles dataset. We applied sampling and reconstruction over the entire image, but evaluate only over the obstacle mask. Quantitative results are presented in Fig. 11(b), and a qualitative comparison is shown in Fig. 13. Table 1 compares the number of samples required to achieve certain levels of accuracy. We require 3-4 times fewer samples for a given RMSE.
Table 1 (excerpt): | Bilateral solver | 8% | 8% | 8% |
Sampling only. We claim that using only our proposed sampling pattern, any completion method can achieve better results than with other existing sampling patterns, especially for small objects in the scene. Fig. 14 demonstrates this qualitatively for 3 distinct reconstruction methods.
Table 2 (excerpt):

| Samples | Method | RMSE | REL |
| 225 | Liao et al. | 0.442 | 0.104 |
| 200 | Ma et al. | 0.230 | 0.044 |
| 200 | Li et al. | 0.256 | 0.046 |
5.2 Indoor data (NYU-Depth-v2)
The NYU-Depth-v2 dataset includes labeled pairs of aligned RGB and dense depth images collected from different indoor scenes. More technical details are given in Section 3.1. Quantitative results are listed in Table 2. Our method outperforms all other methods. Note that 4 of the other 6 methods are based on deep learning, while ours is not. A qualitative comparison is shown in Fig. 15. Although suffering from slight staircase artifacts, our result preserves edges better and remains precise even for small items.
5.3 Prototype: Single-pixel mechanical sampler
Finally, we performed an experiment on a real scene designed in our laboratory. To enable controllable sampling, we built a sampling device (Fig. 16) assembled from a laser rangefinder, a camera, motors and printed parts. We generated ground-truth images for comparison with a Kinect 2 sensor. Note that the ground truth is in real-world coordinates, while our system measures range values.
We created two scenes and sampled them with 3 different patterns to demonstrate the superiority of our method. Results are presented in Fig. 16(f). While the first scene (top) is a toy example for testing the sampling resolution, the second scene (bottom) is more realistic. In both cases our method is able to sample all objects in the scene (even the thinnest ones) and reconstruct them quite accurately.
In this paper, we introduced a novel approach for image-guided sparse depth sampling and dense reconstruction. We suggested a parametric piece-wise linear model and showed its validity for indoor and outdoor datasets. We demonstrated that the correlation between the depth and color domains allows depth scenes to be approximated well using only an RGB image and a low number of carefully chosen depth samples. A single-pixel depth sampler was constructed as a proof-of-concept, verifying our predictions. We believe that this new direction calls for extensive additional research, in order to develop advanced, cheap and accurate depth sensing systems. In future work, we plan to combine classical and modern learning methods to further improve performance and accuracy.
-  R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Süsstrunk, et al. Slic superpixels compared to state-of-the-art superpixel methods. IEEE transactions on pattern analysis and machine intelligence, 34(11):2274–2282, 2012.
-  A. Aldroubi and K. Gröchenig. Nonuniform sampling and reconstruction in shift-invariant spaces. SIAM review, 43(4):585–620, 2001.
-  I. Amidror. Scattered data interpolation methods for electronic imaging systems: a survey. Journal of electronic imaging, 11(2):157–177, 2002.
-  P. Babu and P. Stoica. Spectral analysis of nonuniformly sampled data–a review. Digital Signal Processing, 20(2):359–378, 2010.
-  S. Baker, R. Szeliski, and P. Anandan. A layered approach to stereo reconstruction. In , pages 434–441. IEEE, 1998.
-  J. T. Barron and B. Poole. The fast bilateral solver. In European Conference on Computer Vision, pages 617–632. Springer, 2016.
-  G. D. Boreman. Modulation transfer function in optical and electro-optical systems, volume 21. SPIE press Bellingham, WA, 2001.
-  E. J. Candes, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8):1207–1223, 2006.
-  P. Cheben, R. Halir, J. H. Schmid, H. A. Atwater, and D. R. Smith. Subwavelength integrated photonics. Nature, 560(7720):565, 2018.
-  Z. Chen, V. Badrinarayanan, G. Drozdov, and A. Rabinovich. Estimating depth from rgb and sparse sensing. arXiv preprint arXiv:1804.02771, 2018.
-  N. Chodosh, C. Wang, and S. Lucey. Deep convolutional compressed sensing for lidar depth completion. arXiv preprint arXiv:1803.08949, 2018.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
-  P. Dollár and C. L. Zitnick. Structured forests for fast edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1841–1848, 2013.
-  G. Drozdov, Y. Shapiro, and G. Gilboa. Robust recovery of heavily degraded depth measurements. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 56–65. IEEE, 2016.
-  A. Eldesokey, M. Felsberg, and F. S. Khan. Propagating confidences through cnns for sparse data regression. arXiv preprint arXiv:1805.11913, 2018.
-  S. Hawe, M. Kleinsteuber, and K. Diepold. Dense disparity maps from sparse disparity measurements. In 13th International Conference on Computer Vision, 2011.
-  Z. Huang, J. Fan, S. Yi, X. Wang, and H. Li. Hms-net: Hierarchical multi-scale sparsity-invariant network for sparse depth completion. arXiv preprint arXiv:1808.08685, 2018.
-  M. Jaritz, R. De Charette, E. Wirbel, X. Perrotton, and F. Nashashibi. Sparse and dense data with cnns: Depth completion and semantic segmentation. In 2018 International Conference on 3D Vision (3DV), pages 52–60. IEEE, 2018.
-  J. Ku, A. Harakeh, and S. L. Waslander. In defense of classical image processing: Fast depth completion on the cpu. arXiv preprint arXiv:1802.00036, 2018.
-  Y. Li, K. Qian, T. Huang, and J. Zhou. Depth estimation from monocular image and coarse depth points based on conditional gan. In MATEC Web of Conferences, volume 175, page 03055. EDP Sciences, 2018.
-  Y. Liao, L. Huang, Y. Wang, S. Kodagoda, Y. Yu, and Y. Liu. Parse geometry from a line: Monocular depth estimation with partial laser observation. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 5059–5066. IEEE, 2017.
-  L.-K. Liu, S. H. Chan, and T. Q. Nguyen. Depth reconstruction from sparse samples: Representation, algorithm, and sampling. IEEE Transactions on Image Processing, 24(6):1983–1996, 2015.
-  C. Loebich, D. Wueller, B. Klingen, and A. Jaeger. Digital camera resolution measurement using sinusoidal siemens stars. In Digital Photography III, volume 6502, page 65020N. International Society for Optics and Photonics, 2007.
-  F. Ma, L. Carlone, U. Ayaz, and S. Karaman. Sparse depth sensing for resource-constrained robots. arXiv preprint arXiv:1703.01398, 2017.
-  F. Ma, G. V. Cavalheiro, and S. Karaman. Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera. arXiv preprint arXiv:1807.00275, 2018.
-  F. Ma and S. Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.
-  F. Marvasti. Nonuniform sampling: theory and practice. Springer Science & Business Media, 2012.
-  M. Mishali and Y. C. Eldar. From theory to practice: Sub-nyquist sampling of sparse wideband analog signals. IEEE Journal of Selected Topics in Signal Processing, 4(2):375–391, 2010.
-  C. V. Poulton, A. Yaacobi, D. B. Cole, M. J. Byrd, M. Raval, D. Vermeulen, and M. R. Watts. Coherent solid-state lidar with silicon photonic optical phased arrays. Optics letters, 42(20):4091–4094, 2017.
-  G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3234–3243, 2016.
-  B. Schwarz. Lidar: Mapping the world in 3d. Nature Photonics, 4(7):429, 2010.
-  N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.
-  H. Tao, H. S. Sawhney, and R. Kumar. A global matching framework for stereo computation. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 1, pages 532–539. IEEE, 2001.
-  C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Computer Vision, 1998. Sixth International Conference on, pages 839–846. IEEE, 1998.
-  J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger. Sparsity invariant cnns. arXiv preprint arXiv:1708.06500, 2017.
-  J. Yen. On nonuniform sampling of bandwidth-limited signals. IRE Transactions on circuit theory, 3(4):251–257, 1956.