
Boosting Monocular Depth Estimation with Lightweight 3D Point Fusion

by   Lam Huynh, et al.

In this paper, we address the problem of fusing monocular depth estimation with a conventional multi-view stereo or SLAM to exploit the best of both worlds, that is, the accurate dense depth of the first one and lightweightness of the second one. More specifically, we use a conventional pipeline to produce a sparse 3D point cloud that is fed to a monocular depth estimation network to enhance its performance. In this way, we can achieve accuracy similar to multi-view stereo with a considerably smaller number of weights. We also show that as few as 32 points are sufficient to outperform the best monocular depth estimation methods, and around 200 points are sufficient to gain the full advantage of the additional information. Moreover, we demonstrate the efficacy of our approach by integrating it with a SLAM system built into mobile devices.



1 Introduction

Depth estimation from 2D images is a classical computer vision problem that has been mostly tackled with methods from multiple view geometry. Conventional stereo, structure-from-motion and SLAM approaches are already well-established and integrated into many practical applications. However, they rely on feature detection and matching, which can be challenging, especially when the scene lacks distinct details, and as a result the 3D reconstruction often becomes sparse and incomplete.

More recently, learning-based approaches have been introduced that enable dense depth estimation by exploiting priors learned from training images. In particular, monocular depth estimation that leverages only a single image together with learned priors has become a popular area of research, where deep neural networks are used to implement models that directly predict a depth map for a given input image [ramamonjisoa2019sharpnet, chen2019structure, Hu2018RevisitingSI, liu2018planenet, Yin2019enforcing, huynh2020guiding]. While the basic idea is simple and attractive, the accuracy of the monocular depth estimation methods is limited by the lack of strong geometric constraints such as parallax. Thus, considerably more accurate depth maps can be achieved with deep learning based multi-view stereo methods [yao2018mvsnet, yao2019recurrent, Luo-VideoDepth-2020]. However, the accuracy comes at the cost of increased computational complexity as multiple images need to be aggregated by the network to produce a single depth map.

In this paper, we adopt a hybrid approach where we combine geometry-based depth estimation with monocular depth. More specifically, we use a sparse point cloud produced by a conventional pipeline, such as a SLAM system, and feed it as an input to a network together with a single RGB image. We argue that in this way, we can retain low computational cost while achieving state-of-the-art accuracy in dense depth estimation. We also point out that there are already many efficient implementations available for 3D reconstruction, such as COLMAP [schoenberger2016mvs, schoenberger2016sfm], that can be plugged into our framework. Moreover, AR frameworks including ARCore [google_arcore_2019], ARKit [apple_arkit_2015] and AREngine [huawei_arengine_2019] run in real time on mobile phones, and provide the 3D point cloud basically for free.

Another problem that learning-based approaches often suffer from is poor generalization to unseen scenes. In our approach, the additional 3D points serve as a skeleton that forces the network to maintain the overall structure. Therefore, we also argue that this property helps our method to generalize better to different environments.

Figure 1: Dense depth prediction from the NYU-v2 [silberman2012indoor] test set. The quality of the output depth is boosted by fusing only 32 sparse points. (Sparse points are enhanced for visualization.)

We make the following contributions: i) we propose a lightweight, yet effective, architecture for estimating the dense depth map from an RGB image and a sparse set of depth measurements; ii) we introduce a 3D point fusion network that extracts and fuses dense 2D representations with geometric 3D features at multiple scales; iii) we demonstrate state-of-the-art results on the NYU-v2 dataset while using only a fraction of the parameters compared to the recent baseline methods. In addition, we show that it is possible to obtain reliable dense depth estimation results already from 32 input depth measurements even on unseen data.

2 Related work

Single image depth estimation (SIDE):

SIDE was first introduced by Saxena et al. [saxena2006learning] and it gained momentum from the work by Eigen et al. [eigen2014depth, eigen2015predicting]. Since then, the number of related studies has grown rapidly [laina2016deeper, fu2018deep, qi2018geonet, ren2019deep, lee2019monocular, jiao2018look, Hu2018RevisitingSI, chen2019structure, facil2019cam, ramamonjisoa2019sharpnet, liu2018planenet, liu2019planercnn, lee2019big, huynh2020guiding]. At first, the proposed SIDE methods improved the accuracy by employing large architectures [laina2016deeper, Hu2018RevisitingSI] and more complex encoding-decoding schemes [chen2019structure]. Then, they started to diverge into using semantic labels [jiao2018look], exploiting the relationship between depth and surface normal [qi2018geonet], reformulating as a classification problem [fu2018deep] or mixing both [ren2019deep]. Other studies suggested estimating relative depth [lee2019monocular] or learning calibration patterns to improve the generalization ability. Recent SIDE approaches exploit monocular priors such as occlusion [ramamonjisoa2019sharpnet], and planar structures either explicitly [liu2018planenet, liu2019planercnn, Yin2019enforcing] or implicitly [huynh2020guiding]. Despite these efforts, SIDE still generalizes quite poorly to unseen data. In this work, we leverage SIDE’s ability to produce dense depth estimations and inject a small set of depth measurements into it to boost the accuracy while further shrinking the network size.

Dense depth estimation from sparse depth:

Depth completion is a related problem where the aim is to densify or inpaint an incomplete depth map. The work of Diebel and Thrun [diebel2006application] is one of the first studies to tackle this problem using Markov random fields. Hawe et al. [hawe2011dense] estimate disparity using wavelet analysis. The problem gained popularity as commodity depth sensors and laser scanners (or LiDARs) became more available. Uhrig et al. [uhrig2017sparsity] proposed sparse convolution to train a sparsity-invariant network. Jaritz et al. [jaritz2018sparse] leveraged semantics to train the network at varying sparsity levels. Ma et al. [mal2018sparse] concatenated the sparse depth map to an RGB image, and used this RGBD volume for training. Xu et al. [xu2019depth] filled in the missing depth values using the depth normal constraint. Imran et al. [imran2019depth] addressed the depth completion problem using depth coefficients as a representation. Qiu et al. [Qiu_2019_CVPR] suggested depth and normal fusion using learned attention maps. Methods based on a spatial propagation network (SPN) iteratively optimize the dense depth map using either local [cheng2018depth, cheng2019learning] or non-local [park2020non] affinity. Chen et al. [chen2019learning] suggested fusing features from an image and 3D points to produce the dense depth. However, these depth completion methods usually target outdoor environments and street views where the points come from a LiDAR.

Figure 2: Overview architecture of the 3D point fusion network.

The difficulty of the depth completion problem depends heavily on the density of the 3D points used as input to the algorithm. For example, LiDARs can produce relatively dense and regularly sampled point clouds without large holes, while passive image-based 3D reconstruction techniques, such as stereo or SLAM, result in a substantially sparser set of points where the sampling is highly irregular and depends on the surface details. Thus, we argue that depth completion becomes a much harder problem when using a sparse point cloud from image-based reconstruction rather than from a LiDAR, and consequently, it also requires better regularization for the depth. To this end, we introduce a novel 3D point fusion network that efficiently learns to fuse image and geometric features to boost the performance of a monocular depth estimation network, particularly in indoor environments, which are often more diverse and challenging than street view scenes. Our work is inspired by [chen2019learning], but instead of sequentially fusing features at the same resolution, we build a deeper model to extract and fuse features at multiple scales. This is crucial since [chen2019learning] was developed for depth completion of LiDAR data and, as shown in our experiments, it fails with a sparse set of points, whereas thanks to the multi-scale approach our method can produce decent depth maps at a low resolution even from a few depth measurements.

3 Method

An overview of our 3D point fusion network is shown in Figure 2. It is a fully convolutional framework that takes an RGB image and sparse 3D points as inputs to estimate a dense depth map. The 3D points serve as constraints to fix the overall geometry of the depth map produced by the network. To deal with the unstructured 3D point cloud, the points are first projected to the image plane and their coordinates are used to create a sparse depth map. Next, the RGB image is stacked with the sparse depth to form an RGBD image. We also apply two convolutional layers to the sparse depth and the RGBD image separately. The two outputs are concatenated to build the low-level input features that are fed to the first fusion-net. The core network consists of five fusion-nets that operate at different feature resolutions. Each fusion-net contains a feature fusion encoder (E), a confidence predictor (C), a decoder (D), and a refinement (R) module, as illustrated in Figure 3. We describe these modules in the following subsections and finish this section by giving some details about our loss function.
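The input preparation described above (projecting the 3D points to the image plane and stacking the result with the RGB image) can be sketched as follows. This is only an illustration under stated assumptions: a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, points already expressed in the camera frame, and a nearest-point rule when several points fall on one pixel. The function names are hypothetical and none of these details are specified in the text.

```python
def project_points_to_sparse_depth(points_xyz, fx, fy, cx, cy, height, width):
    """Project camera-frame 3D points to a sparse depth map (0.0 = no measurement)."""
    depth = [[0.0] * width for _ in range(height)]
    for x, y, z in points_xyz:
        if z <= 0.0:
            continue  # point lies behind the camera, skip it
        # Pinhole projection (assumed camera model).
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            # Assumed collision rule: keep the point closest to the camera.
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z
    return depth


def stack_rgbd(rgb, sparse_depth):
    """Append the sparse depth as a fourth channel of an H x W x 3 image."""
    return [[list(pixel) + [d] for pixel, d in zip(rgb_row, depth_row)]
            for rgb_row, depth_row in zip(rgb, sparse_depth)]


# Toy example: four points, a 4 x 4 image, made-up intrinsics.
points = [(0.0, 0.0, 2.0), (0.5, 0.0, 1.0), (0.0, 0.0, 3.0), (0.0, 0.0, -1.0)]
sparse = project_points_to_sparse_depth(points, 2.0, 2.0, 2.0, 2.0, 4, 4)
rgb = [[(10, 20, 30)] * 4 for _ in range(4)]
rgbd = stack_rgbd(rgb, sparse)
```

Pixels without a projected point keep the value zero, so the depth channel stays sparse and the network has to propagate the few measurements to the rest of the image.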

Figure 3: Details of a Fusion Net , where is the scale resolution and . The main components, namely the feature fusion encoder (E), confidence predictor (C), decoder (D) and refinement (R) module, are color coded as gray, cyan, orange and yellow, respectively.

3.1 Features Fusion Encoder

Convolutional neural networks are good at processing regularly sampled data in tensor form. Because our input point clouds are sparse and, unlike the image data, represent geometric constraints, we cannot rely on simple concatenation to fuse the information; we need better representations. Inspired by a recent depth completion method [chen2019learning], we design a feature fusion encoder to extract low-level features from RGB images and 3D points.

Our feature fusion encoder takes a 3D tensor () and a set of sparse points () as inputs. The output is a 3D tensor with a similar shape to the input tensor. Details of the features fusion encoder are shown in a gray box of Figure 3. It consists of two 2D branches, one 3D branch, and one convolutional layer for feature fusion.

The 2D convolutional branches:

The 2D branches are convolved at two different resolutions to learn multi-scale representations from the input 3D tensor. The first 2D branch has one convolutional layer with stride one to extract features at the same size as the input volume. The second 2D branch is a cascade of a stride two convolutional, a stride one convolutional, and an upsampling layer to obtain coarser features of the input tensor. The two outputs are summed to aggregate appearance features at different resolutions.

The 3D point convolution branch: The 3D branch aims to extract structural features from the sparse points. This is difficult for 2D convolutions that operate on local neighbors, as 3D points are located on an irregular grid. Therefore, we utilize the feature-kernel alignment convolution (FKAConv) [boulch2020fka] that operates directly on 3D points to avoid this problem. The key idea of the FKAConv is to learn a linear transformation to align the neighboring points with the grid-like kernel. After that, it performs a weighted sum between this kernel and the features of the 3D points. One can see that 2D convolution is a special case where the learned linear transformation is always an identity matrix.

As shown in Figure 3, our 3D branch consists of two FKAConv layers. We first extract the features of the 3D points from the input tensor using their projected 2D indices on the image plane. This volume has the size of , where is the number of feature channels, and is the number of 3D points. Next, we feed the point features and their 3D coordinates to the FKAConv layers. FKAConv selects a set of k-neighboring points for every input point and learns a transformation matrix to align the 3D points with its kernel. The point features are then convolved with the aligned 3D points to produce a 2D tensor of shape . The output features are projected back to an empty 3D tensor of size using the projected 2D indices. Features of other positions are set to zero.
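The gather and scatter steps around the FKAConv layers can be illustrated with the sketch below. It assumes a channels x height x width layout stored as nested lists and uses hypothetical function names; the FKAConv layers themselves are not reproduced here and are replaced by a placeholder that simply doubles each feature.

```python
def gather_point_features(feature_map, pixel_indices):
    """Pick the feature vector at each projected point.

    feature_map: channels x height x width nested lists.
    pixel_indices: list of (row, col) positions of the projected 3D points.
    Returns a channels x points nested list.
    """
    return [[channel[r][c] for (r, c) in pixel_indices] for channel in feature_map]


def scatter_point_features(point_features, pixel_indices, height, width):
    """Write channels x points features into a zero-filled channels x height x width tensor."""
    out = [[[0.0] * width for _ in range(height)] for _ in point_features]
    for ch, values in enumerate(point_features):
        for (r, c), value in zip(pixel_indices, values):
            out[ch][r][c] = value
    return out


# Toy example: 2 channels, a 2 x 3 feature map, 2 projected points.
features = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]],
]
indices = [(0, 1), (1, 2)]
gathered = gather_point_features(features, indices)
# Placeholder for the point convolution (NOT FKAConv): scale every feature by 2.
processed = [[2.0 * v for v in channel] for channel in gathered]
scattered = scatter_point_features(processed, indices, height=2, width=3)
```

All positions without a projected point remain zero after scattering, which matches the statement that features of other positions are set to zero.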

2D-3D Feature Fusion: Output volumes from the 2D and 3D branches have the same shape as the input tensor (). Therefore, to fuse these features, we sum them together before applying a 2D convolutional layer to output a 3D tensor of the size . Finally, we add a residual connection to avoid vanishing gradients during training.

3.2 Encoder, Decoder, and Confidence Predictor modules

Encoder and Decoder Module: Designing efficient decoder and refinement modules is essential for the depth estimation problem [fang2020towards, wojna2019devil]. One common practice is to create large and complex decoders to produce accurate depth maps with sharp edges and fine details. However, we argue that by iteratively fusing relevant depth measurements from the 3D points with appearance features from image pixels, we can significantly reduce the size of our decoder and refinement designs. That is, our decoder and refinement modules have only two convolutional layers each. To simplify further, we use the same decoder and refinement designs for all fusion-nets.

As shown in the orange box of Figure 3, the decoder transforms the fused features from the encoder before feeding them to the refinement module (the yellow box in Figure 3). We then initially obtain a depth map and an output volume of the decoder. The estimated confidence map later modifies these two outputs.

Figure 4: Rectification of the depth map using the predicted confidence. Values of the confidence map range from 0.0 to 1.0, where 1.0 is the highest confidence.

| Architecture | #3D pts | #params | REL ↓ | RMSE ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
|---|---|---|---|---|---|---|---|
| SharpNet Ramam.’19 [ramamonjisoa2019sharpnet] | 0 | 80.4M | 0.139 | 0.502 | 0.836 | 0.966 | 0.993 |
| Revisited mono-depth Hu’19 [Hu2018RevisitingSI] | 0 | 157.0M | 0.115 | 0.530 | 0.866 | 0.975 | 0.993 |
| SARPN Chen’19 [chen2019structure] | 0 | 210.3M | 0.111 | 0.514 | 0.878 | 0.977 | 0.994 |
| VNL Yin’19 [Yin2019enforcing] | 0 | 114.2M | 0.108 | 0.416 | 0.875 | 0.976 | 0.994 |
| DAV Huynh’20 [huynh2020guiding] | 0 | 25.1M | 0.108 | 0.412 | 0.882 | 0.980 | 0.996 |
| NLSPN Park’20 [park2020non] | 32 | 25.8M | 0.114 | 0.554 | 0.825 | 0.947 | 0.985 |
| Point-Fusion Ours | 32 | 8.7M | 0.057 | 0.319 | 0.963 | 0.992 | 0.998 |
| Sparse and Dense Jaritz’18 [jaritz2018sparse] | 200 | 58.3M | 0.050 | 0.194 | 0.930 | 0.960 | 0.991 |
| S2D Ma’18 [mal2018sparse] | 200 | 42.8M | 0.044 | 0.230 | 0.971 | 0.994 | 0.998 |
| NLSPN Park’20 [park2020non] | 200 | 25.8M | 0.019 | 0.136 | 0.989 | 0.998 | 0.999 |
| Point-Fusion Ours | 200 | 8.7M | 0.015 | 0.112 | 0.995 | 0.999 | 1.000 |
| FuseNet Chen’19 [chen2019learning] | 500 | 1.9M | 0.318 | 0.859 | 0.688 | 0.789 | 0.887 |
| CSPN Cheng’18 [cheng2018depth] | 500 | 18.5M | 0.016 | 0.117 | 0.992 | 0.999 | 1.000 |
| DeepLiDAR Qiu’19 [Qiu_2019_CVPR] | 500 | 53.4M | 0.022 | 0.115 | 0.993 | 0.999 | 1.000 |
| Depth Coefficients Imran’19 [imran2019depth] | 500 | 45.7M | 0.013 | 0.118 | 0.994 | 0.999 | - |
| DepthNormal Xu’19 [xu2019depth] | 500 | 29.1M | 0.018 | 0.112 | 0.995 | 0.999 | 1.000 |
| CSPN++ Cheng’20 [cheng2019learning] | 500 | 28.8M | - | 0.116 | - | - | - |
| NLSPN Park’20 [park2020non] | 500 | 25.8M | 0.012 | 0.092 | 0.996 | 0.999 | 1.000 |
| Point-Fusion Ours | 500 | 8.7M | 0.014 | 0.090 | 0.996 | 0.999 | 1.000 |
| MVSNet Yao’18 [yao2018mvsnet] | - | 124.5M | 0.043 | 0.162 | 0.940 | 0.972 | 0.996 |
| Consistent depth Luo’20 [Luo-VideoDepth-2020] | - | 178.2M | 0.086 | 0.345 | 0.916 | 0.959 | 0.984 |

Table 1: Evaluation on the NYU-Depth-v2 dataset. For metrics marked ↓, lower is better; for metrics marked ↑, higher is better. Methods with are trained using extra data.

Confidence Predictor: Although the input 3D sparse points provide useful depth measurements, they can also contain noise. Hence, we propose a simple yet efficient confidence predictor to attenuate the effect of noise. As illustrated in the cyan box of Figure 3, the output volumes from the feature fusion encoder are fed to three convolutional layers followed by a sigmoid to output the probability for every pixel. This information is then used to alter the initial depth map and the output features of the decoder. Moreover, we add residual connections at the end of the decoder and refinement blocks to prevent the vanishing gradient problem and regularize the confidence map’s errors. The initial depth map is corrected based on the confidence map, as shown in Figure 4.

3.3 Multi-scale Loss function

We calculate our loss at multiple feature resolutions to train our network. The full loss is defined as:


$$\mathcal{L} = \sum_{i=1}^{N} w_i \left( \mathcal{L}_{depth}^{(i)} + \mathcal{L}_{grad}^{(i)} + \mathcal{L}_{normal}^{(i)} \right) \qquad (1)$$

where $N$ is the number of resolution scales and $w_i$ is the loss weight at scale $i$; $\mathcal{L}_{depth}$ is a variation of the $L_1$ norm that minimizes the error on the sparse depth pixels, $\mathcal{L}_{grad}$ optimizes the error on edge structures, and $\mathcal{L}_{normal}$ penalizes the angular error between the ground truth and predicted surface normals. These loss terms were introduced by Hu et al. [Hu2018RevisitingSI] and are widely adopted by state-of-the-art monocular depth estimation methods [chen2019structure, huynh2020guiding]. Subsection 4.2 describes in detail how the network is trained using these loss functions.
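As a rough illustration of how a loss summed over several resolutions can be assembled, the sketch below combines a depth term and an edge (gradient) term per scale and weights them by scale. The logarithmic penalty ln(|x| + 0.5) is loosely patterned on the terms of Hu et al. [Hu2018RevisitingSI]; its exact form, the offset 0.5, the omission of the surface normal term, and the example scale weights are all assumptions made for brevity, not the configuration used in this work.

```python
import math

ALPHA = 0.5  # assumed offset inside the logarithm


def _log_penalty(x):
    """Logarithmic penalty on the magnitude of x."""
    return math.log(abs(x) + ALPHA)


def depth_loss(pred, gt):
    """Mean log-penalised absolute error over pixels that have ground truth (gt > 0)."""
    terms = [_log_penalty(p - g)
             for p_row, g_row in zip(pred, gt)
             for p, g in zip(p_row, g_row) if g > 0]
    return sum(terms) / len(terms) if terms else 0.0


def gradient_loss(pred, gt):
    """Mean log-penalised horizontal and vertical differences of the error map."""
    err = [[p - g for p, g in zip(p_row, g_row)] for p_row, g_row in zip(pred, gt)]
    height, width = len(err), len(err[0])
    terms = []
    for r in range(height):
        for c in range(width):
            if c + 1 < width:
                terms.append(_log_penalty(err[r][c + 1] - err[r][c]))
            if r + 1 < height:
                terms.append(_log_penalty(err[r + 1][c] - err[r][c]))
    return sum(terms) / len(terms) if terms else 0.0


def multi_scale_loss(preds, gts, scale_weights):
    """Weighted sum over scales of the per-scale loss terms (normal term omitted)."""
    return sum(w * (depth_loss(p, g) + gradient_loss(p, g))
               for p, g, w in zip(preds, gts, scale_weights))


# Toy example with two scales that share the same 2 x 2 ground truth.
gt = [[1.0, 2.0], [3.0, 4.0]]
shifted = [[2.0, 3.0], [4.0, 5.0]]
loss_perfect = multi_scale_loss([gt, gt], [gt, gt], [0.5, 1.0])
loss_shifted = multi_scale_loss([shifted, shifted], [gt, gt], [0.5, 1.0])
```

A constant offset in the prediction raises the depth term but leaves the gradient term unchanged, which shows why the two terms complement each other.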

4 Experimental Evaluation

In this section, we evaluate the performance of the proposed method and compare it with several baselines on the NYU-Depth-v2, real world and KITTI datasets.

4.1 Dataset and Evaluation metrics


The NYU-Depth-v2 dataset contains approximately RGB-D images recorded from 464 indoor scenes. We extract the raw RGB frames from the original videos and reconstruct sparse 3D point clouds using the COLMAP [schoenberger2016sfm, schoenberger2016mvs] structure-from-motion software. COLMAP is also used to extract the camera poses for multi-view stereo methods. The 3D points are back-projected to each input view to obtain a sparse set of depth values. We use 60K videos for training and 654 images from the official test set for evaluating the methods. For outdoor data, we use 1000 images from the validation set of the KITTI depth completion benchmark [uhrig2017sparsity] to test our method.

Evaluation metrics.

We report the results in terms of standard metrics, namely, the mean absolute relative error (REL), root mean square error (RMSE), and thresholded accuracy (). The detailed definitions of the measures are provided in the supplementary material.

Figure 5: Qualitative results on the NYU-v2 test set. Note that all methods use 200 randomly sampled points as the input.

4.2 Implementation details

The proposed model is trained for 150 epochs on a single TITAN RTX using a batch size of 32, the Adam optimizer [kingma2014adam] with , and the loss function presented in Eq. 1. The initial learning rate is , but from epoch 10 the learning rate is reduced by per epochs. We set the number of scales in Eq. (1) to 5, weight loss coefficients to , and the scale weight losses to and respectively. For training, we augment the input RGB images using random rotation ([-5.0, +5.0] degrees), horizontal flip, rectangular window droppings, and colorization. We also add random noise to the XYZ-coordinates of the sparse input points.
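Two of the augmentations listed above, the horizontal flip and the noise on the point coordinates, can be sketched as follows. The noise distribution and its magnitude are not specified in the text, so zero-mean Gaussian noise with a hypothetical standard deviation `sigma` is assumed here; the function names are likewise illustrative.

```python
import random


def jitter_points(points_xyz, sigma=0.01, rng=None):
    """Add zero-mean Gaussian noise (assumed distribution) to every XYZ coordinate."""
    rng = rng or random.Random()
    return [tuple(coord + rng.gauss(0.0, sigma) for coord in point)
            for point in points_xyz]


def horizontal_flip(rows):
    """Mirror an image, or a sparse depth map, stored as a list of rows."""
    return [list(reversed(row)) for row in rows]


# Toy example: two points and a 2 x 3 sparse depth map.
points = [(0.0, 1.0, 2.0), (3.0, 4.0, 5.0)]
noisy_points = jitter_points(points, sigma=0.01, rng=random.Random(0))
flipped_depth = horizontal_flip([[0.0, 0.0, 1.5], [2.0, 0.0, 0.0]])
```

When the RGB image is flipped, the sparse depth map has to be flipped in the same way so that each measurement stays aligned with its pixel.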

4.3 Comparison with State-of-the-art

The proposed method is related to multiple partially overlapping problem areas and, therefore, we compare it with several baseline methods in monocular depth estimation [ramamonjisoa2019sharpnet, Hu2018RevisitingSI, chen2019structure, Yin2019enforcing, huynh2020guiding], depth completion [jaritz2018sparse, mal2018sparse, cheng2018depth, cheng2019learning, Qiu_2019_CVPR, xu2019depth, park2020non], deep multi-view stereo [yao2018mvsnet], and deep structure-from-motion [Luo-VideoDepth-2020]. The baseline results are obtained using the pre-trained models [chen2019structure, Hu2018RevisitingSI, ramamonjisoa2019sharpnet, Yin2019enforcing, park2020non, Luo-VideoDepth-2020], re-training using the official NYU-v2 [yao2018mvsnet, mal2018sparse] code, using our own re-implementations [huynh2020guiding, jaritz2018sparse], and from the original papers [cheng2019learning, Qiu_2019_CVPR, xu2019depth, imran2019depth].


The performance metrics, computed between the estimated depth maps and the ground truth, are provided in Table 1. In addition, we report the number of method parameters, and the number of 3D points used in the estimation. Compared to the monocular depth estimation works, the proposed method provides a substantial improvement according to all metrics. For instance, REL, RMSE and thresholded accuracy () are improved by and , respectively, by using only of the model parameters and 32 additional 3D points.

Compared to the depth completion methods, we achieve state-of-the-art performance while using clearly fewer model parameters. Moreover, our model needs fewer 3D points than [cheng2018depth, cheng2019learning, Qiu_2019_CVPR, xu2019depth] and produces comparable results already with 32 input 3D points. The best performing baselines, Park et al. [park2020non] and Xu et al. [xu2019depth], use roughly 3.0 and 3.3 times as many parameters as our method, respectively (25.8M and 29.1M versus 8.7M in Table 1). Instead of using explicit 3D points, the multi-view stereo [yao2018mvsnet] and structure-from-motion [Luo-VideoDepth-2020] methods utilise multiple RGB images with camera poses. The results in Table 1 indicate that the proposed model also outperforms these methods using only a fraction of the model parameters.
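The parameter and error comparisons in this subsection can be recomputed directly from Table 1. The short sketch below does so for a few rows; the dictionary keys and helper names are hypothetical, and the numbers are copied from the table.

```python
# Parameter counts in millions, copied from Table 1.
PARAMS_M = {
    "Point-Fusion (ours)": 8.7,
    "NLSPN (Park 2020)": 25.8,
    "DepthNormal (Xu 2019)": 29.1,
    "DAV (Huynh 2020)": 25.1,
}


def parameter_ratio(baseline, ours="Point-Fusion (ours)"):
    """How many times more parameters the baseline has than our model."""
    return PARAMS_M[baseline] / PARAMS_M[ours]


def relative_reduction(baseline_value, our_value):
    """Relative reduction of an error metric with respect to a baseline."""
    return (baseline_value - our_value) / baseline_value


ratio_park = parameter_ratio("NLSPN (Park 2020)")
ratio_xu = parameter_ratio("DepthNormal (Xu 2019)")
# REL of DAV (0 points) versus Point-Fusion with 32 points, both from Table 1.
rel_drop = relative_reduction(0.108, 0.057)
```

With these inputs the ratios come out close to 3.0 and 3.3, and the REL of the 32-point model is about 47% lower than that of DAV.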

| Method | #3D pts | #params | REL ↓ | RMSE ↓ | δ < 1.25 ↑ |
|---|---|---|---|---|---|
| Park’20 [park2020non] | 32 | 25.8M | 0.340 | 0.915 | 0.635 |
| Park’20 [park2020non] | 128 | 25.8M | 0.232 | 0.534 | 0.811 |
| Ours | 32 | 8.7M | 0.096 | 0.313 | 0.907 |
| Ours | 128 | 8.7M | 0.059 | 0.271 | 0.988 |

Table 2: Evaluation results on real world data. For metrics marked ↓, lower is better; for metrics marked ↑, higher is better.

Figure 5 shows qualitative results of the predicted depth maps and reconstructed point clouds for our method and for [park2020non]. The baseline [park2020non] results are obtained using the pre-trained model from the original authors. Although both methods produce high quality depth maps, the proposed model is better at recovering fine details in challenging regions and introduces fewer distortions on flat surfaces. Additional results are provided in the supplementary material.

Figure 6: Qualitative comparison of the pattern (left) and quantity (right) of input points. The random sampling points are randomly extracted from the dense ground truth depth map. For COLMAP points, we extract the image frames from raw NYU data and run COLMAP to extract the points. The pattern and number of points are kept similar across all cases. In the left comparison, the number of points in use is 64. As shown in the first row, the random input points are an easier case than the COLMAP points because they can cover flat surfaces like walls, floors or doors. In the right comparison, example results show the predicted depth maps using randomly sampled sets of 32, 128 and 200 points, respectively. Our method performs consistently better than Park’20 [park2020non] in both cases.

Dense depth prediction using COLMAP points.

To assess the generalization properties of the proposed method, we recorded an additional set of videos using Kinect-v2. The test set consists of 597 RGB frames with ground truth depth maps from indoor environments. The frames are pre-processed with COLMAP to obtain the sparse 3D point cloud. The new dataset will be made publicly available upon the publication of the paper. Table 2 contains the performance metrics for our method and for Park et al. [park2020non] using the collected dataset. Compared to the NYU-v2 results, we obtain similar performance, while Park et al. [park2020non] obtain clearly worse metrics. These results suggest that our method is able to generalise to environments unseen at training time.

Figure 8 illustrates qualitative examples from the Kinect-v2 test set. The proposed method clearly preserves scene structure and details compared to the state-of-the-art depth completion baseline.

Figure 7: Comparison with Park [park2020non] on real world data. The 3D points are captured using the ARCore framework [google_arcore_2019]. Experiments use only a set of 32 points as the input. The proposed approach produces more consistent results than the state-of-the-art depth completion method.

Dense depth prediction using ARCore points.

Recent AR frameworks provide 3D points of the environment, which can be utilised for dense depth estimation. To this end, we collected video sequences using an Android phone and used ARCore to produce a sparse 3D point cloud of the scene. The dense depth maps obtained from this data with our method and Park et al. [park2020non] are illustrated in Figure 7. Our method provides a high quality depth map with significantly fewer distortions compared to [park2020non].

Figure 8: Dense depth map predictions from RGB frames and 32 COLMAP points. The RGB frames are captured using Kinect-v2, and the input 3D points are generated using COLMAP.

4.4 Analysis of the number and pattern of input 3D points

To assess how the quantity and spatial distribution of the input 3D points affect the results, we performed an experiment with varying 3D point patterns. For this purpose, we generate sparse point sets by randomly sampling from the dense ground truth or from the COLMAP output.

We expect that sampling from the dense depth map provides better results than using the COLMAP points. This is because the dense depth map also covers flat textureless surfaces such as walls, floors, and doors. However, such points might not be easy to obtain in practice, whereas COLMAP points represent locations that are often reconstructed by SfM or SLAM methods.
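The random sampling pattern can be sketched as follows. The sketch assumes uniform sampling without replacement over pixels with a valid (positive) ground truth depth; the function name and the seed handling are illustrative choices, not details given in the text.

```python
import random


def sample_sparse_depth(dense_depth, num_points, seed=None):
    """Keep num_points randomly chosen valid pixels of a dense depth map; zero the rest."""
    rng = random.Random(seed)
    valid = [(r, c) for r, row in enumerate(dense_depth)
             for c, d in enumerate(row) if d > 0]
    # Uniform sampling without replacement (assumed sampling scheme).
    chosen = rng.sample(valid, min(num_points, len(valid)))
    sparse = [[0.0] * len(row) for row in dense_depth]
    for r, c in chosen:
        sparse[r][c] = dense_depth[r][c]
    return sparse


# Toy example: a 4 x 4 dense map with one invalid pixel, reduced to 5 points.
dense = [[1.0 + r + 0.1 * c for c in range(4)] for r in range(4)]
dense[0][0] = 0.0  # missing ground truth
sparse = sample_sparse_depth(dense, num_points=5, seed=0)
```

Because every valid pixel is equally likely to be chosen, the samples also land on textureless surfaces, which is exactly what a feature-based pipeline such as COLMAP tends to miss.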

Figure 9: Performance with different sparsity and pattern. We test our method on NYU-v2 test set and compare the results with Park et al. [park2020non]. Our method works well on either type of pattern.

Figure 9 presents the RMSE for different numbers of input points for both pattern types. The results confirm the initial assumption that sampling from the dense depth map results in better performance. Moreover, we notice that the proposed method obtains higher accuracy than Park [park2020non] with all point sets. In fact, our method using COLMAP points performs similarly to Park [park2020non] using points from the dense depth map.

Figure 10: Performance when directly testing our model on the KITTI validation set of 1000 images. Points are sampled at , , , and , corresponding to 72, 285, 1150, 4600, and full points.

4.5 From NYU-v2 to KITTI

To further assess the generalization abilities of the proposed method, we directly test our pretrained NYU model on the KITTI validation set. As shown in Figure 11, the proposed method can produce plausible results on completely unseen data even when using a very sparse set of points. Figure 10 shows the RMSE for different numbers of input points. We notice that the proposed method obtains higher accuracy than non-learning based methods [silberman2012indoor, barron2016fast] and Zhang et al. [zhang2018deep] when using 72 points.

Figure 11: Direct testing of the model trained on NYU on the KITTI validation set.

5 Conclusion

We propose a lightweight method that fuses RGB monocular depth estimation with information from a sparse set of 3D points at multiple scales. Experiments show that the proposed method achieves state-of-the-art results on NYU-v2 while having 2-6 times fewer parameters than baseline methods. Evaluation on real world images and the KITTI dataset demonstrates good generalization properties of our approach. We believe that it provides a practical solution for obtaining high-quality depth maps for various applications where dense depth is needed.


A Outdoor experiments using KITTI

Figure 12: Evaluation of baseline methods [roy2016monocular, barron2016fast, zhang2018deep, park2020non, Qiu_2019_CVPR] and our model trained on KITTI at different sparsity levels. Performance figures of [park2020non] are calculated using the pre-trained model, while the [roy2016monocular, barron2016fast, zhang2018deep, Qiu_2019_CVPR] results are obtained from their papers. Points are randomly sampled at , , , , , , , , and , corresponding to 1, 2, 16, 32, 72, 285, 1150, 4600, and full points.

Although our main focus is on indoor environments, we performed an experiment to analyse how the proposed method generalizes to outdoor scenes. For this purpose, we train our model using 85K images from the KITTI depth completion data [uhrig2017sparsity] and then evaluate using 1000 images from the KITTI validation set. Figure 12 shows the results for our method and for [roy2016monocular, barron2016fast, zhang2018deep, park2020non, Qiu_2019_CVPR] using a varying number of input 3D points. The proposed method obtains the best results in all tested setups, and the difference is largest with a small number of input 3D points. Figure 14 illustrates examples of the obtained depth maps. We note that the proposed method is able to obtain a reasonable depth map already with a single input point, as shown in Figure 13. In addition, Figure 15 presents examples with a varying number of input points for our method and for [park2020non]. The results suggest that high-quality depth maps can be obtained by using only a few LiDAR points, enabling more cost-efficient solutions.

Figure 13: Predicted depth maps when using 1 point and full LiDAR points. The red pattern in the RGB image marks the single input point.
Figure 14: Examples from the KITTI validation set. The proposed method produces finer depth details, as emphasized in the highlighted areas.
Figure 15: Examples from the KITTI validation set using 1, 2, 16, 32, 72, 285, 1150, 4600 randomly sampled points. The proposed method clearly produces better depth maps than the state-of-the-art [park2020non], especially with a small number of input 3D points.

B Additional qualitative results


Figure 16 presents additional results for the NYU-Depth-v2 dataset [silberman2012indoor] using a varying number of randomly selected input points. The coarse structures and the overall scene geometry are well-preserved in all tested cases, whereas using 32, 64 or 200 input points also retains the finer details.

Figure 16: Examples from NYU-v2 with a different number of points.

ARCore data

Figure 17 presents a challenging sample from real-world data using the ARCore [google_arcore_2019] points. The results with different numbers of input points confirm that, although coarse details are relatively well-preserved for all sparsity cases, depth estimates are consistently better when using more points.

Figure 17: Examples from real-world data with different sparsity. Input 3D points are overlaid for visualization.

C Definitions of the evaluation metrics

For NYU-Depth-v2 [silberman2012indoor] the evaluation results are calculated for pixels with depth values in the range [0.0, 10.0] while for KITTI [uhrig2017sparsity] the valid range is [0.0, 90.0]. We evaluate the performance for our model and for baselines using the following standard metrics:

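  • Mean absolute relative error (REL): $\mathrm{REL} = \frac{1}{n} \sum_{i=1}^{n} \frac{|\hat{d}_i - d_i|}{d_i}$

  • Root mean square error (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{d}_i - d_i)^2}$

  • Thresholded accuracy ($\delta$): the percentage of pixels for which $\max\left(\frac{\hat{d}_i}{d_i}, \frac{d_i}{\hat{d}_i}\right) = \delta < thr$, with $thr \in \{1.25, 1.25^2, 1.25^3\}$


where $n$ is the number of valid pixels, $\hat{d}_i$ is the predicted depth value at pixel $i$, and $d_i$ is the ground truth depth at pixel $i$. Higher thresholded accuracies $\delta < 1.25$, $\delta < 1.25^2$ and $\delta < 1.25^3$ mean better results, while lower REL and RMSE values are better.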
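The three metrics can be computed as in the sketch below, which uses the standard definitions of REL, RMSE and thresholded accuracy. It operates on flattened lists of depth values; the function name, the dictionary keys, and the choice to exclude pixels whose ground truth is exactly zero (to avoid division by zero) are assumptions of this illustration.

```python
import math


def depth_metrics(pred, gt, min_depth=0.0, max_depth=10.0):
    """Compute REL, RMSE and thresholded accuracies over valid pixels.

    pred, gt: flat lists of predicted and ground truth depths.
    A pixel is valid when min_depth < gt <= max_depth (zero is excluded by assumption).
    """
    valid = [(p, g) for p, g in zip(pred, gt) if min_depth < g <= max_depth]
    if not valid:
        raise ValueError("no valid ground truth pixels")
    n = len(valid)
    rel = sum(abs(p - g) / g for p, g in valid) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in valid) / n)
    # A non-positive prediction can never satisfy a ratio threshold.
    ratios = [max(p / g, g / p) if p > 0 else math.inf for p, g in valid]
    deltas = [sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)]
    return {"REL": rel, "RMSE": rmse,
            "delta1": deltas[0], "delta2": deltas[1], "delta3": deltas[2]}


# Toy example: the last pixel has no ground truth and is ignored.
metrics = depth_metrics(pred=[1.0, 2.0, 3.0, 5.0], gt=[1.0, 2.0, 2.0, 0.0])
```

For KITTI, the same function would be called with `max_depth=90.0` to match the valid range stated above.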