Depth estimation from 2D images is a classical computer vision problem that has been mostly tackled with methods from multiple view geometry. Conventional stereo, structure-from-motion and SLAM approaches are already well-established and integrated into many practical applications. However, they rely on feature detection and matching, which can be challenging especially when the scene lacks distinct details, and as a result the 3D reconstruction often becomes sparse and incomplete.
More recently, learning-based approaches have been introduced that enable dense depth estimation by exploiting priors learned from training images. In particular, monocular depth estimation, which leverages only a single image together with learned priors, has become a popular area of research, where deep neural networks are used to implement models that directly predict a depth map for a given input image [ramamonjisoa2019sharpnet, chen2019structure, Hu2018RevisitingSI, liu2018planenet, Yin2019enforcing, huynh2020guiding]. While the basic idea is simple and attractive, the accuracy of monocular depth estimation methods is limited by the lack of strong geometric constraints such as parallax. Far more accurate depth maps can be achieved with deep-learning-based multi-view stereo methods [yao2018mvsnet, yao2019recurrent, Luo-VideoDepth-2020]. However, the accuracy comes at the cost of increased computational complexity, as multiple images need to be aggregated by the network to produce a single depth map.
In this paper, we adopt a hybrid approach where we combine geometry-based depth estimation with monocular depth. More specifically, we use a sparse point cloud produced by a conventional pipeline, such as a SLAM system, and feed it as an input to a network together with a single RGB image. We argue that in this way, we can retain low computational cost while achieving state-of-the-art accuracy in dense depth estimation. We also point out that there are already many efficient implementations available for 3D reconstruction, such as COLMAP [schoenberger2016mvs, schoenberger2016sfm], that can be plugged into our framework. Moreover, AR frameworks including ARCore [google_arcore_2019], ARKit [apple_arkit_2015] and AREngine [huawei_arengine_2019] run in real time on mobile phones and provide the 3D point cloud essentially for free.
Another problem that learning-based approaches often suffer from is poor generalization to unseen scenes. In our approach, the additional 3D points serve as a skeleton that forces the network to maintain the overall structure. Therefore, we also argue that this property helps our method to better generalize to different environments.
We make the following contributions: i) we propose a lightweight, yet effective, architecture for estimating the dense depth map from an RGB image and a sparse set of depth measurements; ii) we introduce a 3D point fusion network that extracts and fuses dense 2D representations with geometric 3D features at multiple scales; iii) we demonstrate state-of-the-art results on the NYU-v2 dataset while using only a fraction of the parameters compared to the recent baseline methods. In addition, we show that it is possible to obtain reliable dense depth estimation results already from 32 input depth measurements even on unseen data.
2 Related work
Single image depth estimation (SIDE):
SIDE was first introduced by Saxena et al. [saxena2006learning] and it gained momentum from the work by Eigen et al. [eigen2014depth, eigen2015predicting]. Since then, the number of related studies has grown rapidly [laina2016deeper, fu2018deep, qi2018geonet, ren2019deep, lee2019monocular, jiao2018look, Hu2018RevisitingSI, chen2019structure, facil2019cam, ramamonjisoa2019sharpnet, liu2018planenet, liu2019planercnn, lee2019big, huynh2020guiding]. At first, the proposed SIDE methods improved the accuracy by employing large architectures [laina2016deeper, Hu2018RevisitingSI] and more complex encoding-decoding schemes [chen2019structure]. Then, they started to diverge into using semantic labels [jiao2018look], exploiting the relationship between depth and surface normals [qi2018geonet], reformulating depth estimation as a classification problem [fu2018deep], or mixing both [ren2019deep]. Other studies suggested estimating relative depth [lee2019monocular] or learning calibration patterns to improve the generalization ability. Recent SIDE approaches exploit monocular priors such as occlusion [ramamonjisoa2019sharpnet] and planar structures, either explicitly [liu2018planenet, liu2019planercnn, Yin2019enforcing] or implicitly [huynh2020guiding]. Despite these efforts, SIDE still generalizes quite poorly to unseen data. In this work, we leverage SIDE’s ability to produce dense depth estimates and augment it with a small set of depth measurements to boost the accuracy while further shrinking the network size.
Dense depth estimation from sparse depth:
Depth completion is a related problem where the aim is to densify or inpaint an incomplete depth map. The work of Diebel and Thrun [diebel2006application] is one of the first studies to tackle this problem, using Markov random fields. Hawe et al. [hawe2011dense] estimate disparity using wavelet analysis. The problem gained popularity as commodity depth sensors and laser scanners (or LiDARs) became more available. Uhrig et al. [uhrig2017sparsity] proposed sparse convolution to train a sparsity-invariant network. Jaritz et al. [jaritz2018sparse] leveraged semantics to train the network at varying sparsity levels. Ma et al. [mal2018sparse] concatenated the sparse depth map to an RGB image, and used this RGBD volume for training. Xu et al. [xu2019depth] filled in the missing depth values using the depth normal constraint. Imran et al. [imran2019depth] addressed the depth completion problem using depth coefficients as a representation. Qiu et al. [Qiu_2019_CVPR] suggested depth and normal fusion using learned attention maps. Methods based on a spatial propagation network (SPN) iteratively optimize the dense depth map using either local [cheng2018depth, cheng2019learning] or non-local [park2020non] affinity. Chen et al. [chen2019learning] suggested fusing features from an image and 3D points to produce the dense depth. However, these depth completion methods usually aim for outdoor environments and street views where the points come from a LiDAR.
The difficulty of the depth completion problem depends heavily on the density of the 3D points used as an input to the algorithm. For example, LiDARs can produce relatively dense and regularly sampled point clouds without large holes, while passive image-based 3D reconstruction techniques, such as stereo or SLAM, result in a substantially sparser set of points where the sampling is highly irregular and depends on the surface details. Thus, we argue that depth completion becomes a much harder problem when using a sparse point cloud from image-based reconstruction rather than from a LiDAR, and consequently, it also requires better regularization for the depth. To this end, we introduce a novel 3D point fusion network that efficiently learns to fuse image and geometric features to boost the performance of a monocular depth estimation network, particularly in indoor environments, which are often more diverse and challenging than street view scenes. Our work is inspired by [chen2019learning], but instead of sequentially fusing features at the same resolution, we build a deeper model to extract and fuse features at multiple scales. This is crucial since [chen2019learning] has been developed for depth completion of LiDAR data and, as shown in our experiments, it fails with a sparse set of points. Thanks to the multi-scale approach, our method can produce decent depth maps at a low resolution even from a few depth measurements.
An overview of our 3D point fusion network is shown in Figure 2. It is a fully convolutional framework that takes an RGB image and sparse 3D points as inputs to estimate a dense depth map. The 3D points serve as constraints to fix the overall geometry of the depth map produced by the network. To deal with the unstructured 3D point cloud, the points are first projected to the image plane and their coordinates are used to create a sparse depth map. Next, the RGB image is stacked with the sparse depth to form an RGBD image. We also apply two convolutional layers to the sparse depth and the RGBD image separately. The two outputs are concatenated to build the low-level input features that are fed to the first fusion-net. The core network consists of five fusion-nets that operate at different feature resolutions. Each fusion-net contains a feature fusion encoder (E), a confidence predictor (C), a decoder (D), and a refinement (R) module, as illustrated in Figure 3. We describe these modules in the following subsections and finish this section by giving some details about our loss function.
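The input preparation described above (projecting the 3D points to the image plane, writing their depths into a sparse map, and stacking that map with the RGB image) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera with intrinsic matrix `K`, points already expressed in the camera frame, zero as the marker for "no measurement", and a nearest-point rule for collisions. The function names are hypothetical.

```python
import numpy as np

def project_to_sparse_depth(points_cam, K, height, width):
    """Project (N, 3) camera-frame points to a sparse (H, W) depth map.

    Pixels without a projected point stay at 0 (assumed "no data" marker).
    If several points fall on the same pixel, the nearest one is kept.
    """
    depth = np.zeros((height, width), dtype=np.float32)
    z = points_cam[:, 2]
    in_front = z > 0
    pts, z = points_cam[in_front], z[in_front]
    # Pinhole projection: u = fx * x / z + cx, v = fy * y / z + cy.
    u = np.round(K[0, 0] * pts[:, 0] / z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts[:, 1] / z + K[1, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]
    # Write far points first so that nearer points overwrite them.
    order = np.argsort(-z)
    depth[v[order], u[order]] = z[order]
    return depth

def stack_rgbd(rgb, sparse_depth):
    """Concatenate an (H, W, 3) image and an (H, W) depth map to (H, W, 4)."""
    return np.concatenate([rgb, sparse_depth[..., None]], axis=-1)
```

The row and column indices `v`, `u` computed here are the same projected 2D indices that later stages can use to look up per-point features in a feature tensor.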
3.1 Features Fusion Encoder
Convolutional neural networks are good at processing regularly sampled data in tensor form. Because our input point clouds are sparse and, unlike the image data, represent geometric constraints, we cannot rely on simple concatenation to fuse the information; we need better representations. Inspired by a recent depth completion method [chen2019learning], we design a feature fusion encoder to extract low-level features from RGB images and 3D points.
Our feature fusion encoder takes a 3D tensor and a set of sparse points as inputs. The output is a 3D tensor with a similar shape to the input tensor. Details of the feature fusion encoder are shown in the gray box of Figure 3. It consists of two 2D branches, one 3D branch, and one convolutional layer for feature fusion.
The 2D convolutional branches:
The 2D branches operate at two different resolutions to learn multi-scale representations from the input 3D tensor. The first 2D branch has one convolutional layer with stride one to extract features at the same size as the input volume. The second 2D branch is a cascade of a stride-two convolution, a stride-one convolution, and an upsampling layer to obtain coarser features of the input tensor. The two outputs are summed to aggregate appearance features at different resolutions.
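To make the two-branch layout concrete, the sketch below implements it with plain NumPy. It is only an illustration: the 3x3 kernel size, zero padding, nearest-neighbour upsampling and the absence of normalisation and activation layers are assumptions, since the text does not specify them, and all function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3x3(x, w, stride=1):
    """3x3 convolution with zero padding 1. x: (C, H, W), w: (O, C, 3, 3)."""
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    windows = sliding_window_view(xp, (3, 3), axis=(1, 2))  # (C, H, W, 3, 3)
    out = np.einsum("chwij,ocij->ohw", windows, w)
    # Subsampling the stride-1 result is equivalent to a strided convolution.
    return out[:, ::stride, ::stride]

def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of two (assumed variant)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def two_branch_features(x, w_fine, w_down, w_coarse):
    """Sum of a full-resolution branch and a half-resolution branch.

    x is (C, H, W) with even H and W; the result has shape (O, H, W).
    """
    fine = conv3x3(x, w_fine, stride=1)
    coarse = conv3x3(conv3x3(x, w_down, stride=2), w_coarse, stride=1)
    return fine + upsample2x(coarse)
```

The coarse branch sees a receptive field roughly twice as large as the fine branch for the same kernel size, which is the reason for summing the two outputs.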
The 3D point convolution branch: The 3D branch aims to extract structural features from the sparse points. This is difficult for 2D convolutions that operate on local neighbors, as 3D points are located on an irregular grid. Therefore, we utilize the feature-kernel alignment convolution (FKAConv) [boulch2020fka], which operates directly on 3D points to avoid this problem. The key idea of FKAConv is to learn a linear transformation that aligns the neighboring points with the grid-like kernel. After that, it performs a weighted sum between this kernel and the features of the 3D points. One can see that 2D convolution is a special case where the learned linear transformation is always an identity matrix.
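The alignment idea can be illustrated for a single point neighbourhood. The sketch below is not the FKAConv implementation of [boulch2020fka]: the alignment matrix is produced by one hypothetical linear map of the relative coordinates, whereas the original uses a learned network, and the shapes (`k` neighbours, `K` kernel elements, `C` input and `O` output channels) are chosen here for illustration.

```python
import numpy as np

def alignment_matrix(rel_coords, proj):
    """Hypothetical stand-in for the learned alignment: a single linear map.

    rel_coords: (k, 3) neighbour coordinates relative to the centre point.
    proj: (3, K) weights assigning each neighbour to K kernel elements.
    Returns the alignment matrix A with shape (K, k).
    """
    return (rel_coords @ proj).T

def aligned_point_conv(neigh_feats, A, kernel):
    """neigh_feats: (k, C), A: (K, k), kernel: (O, C, K) -> output (O,)."""
    aligned = A @ neigh_feats  # (K, C): features placed on kernel elements
    # Weighted sum between the kernel and the aligned features.
    return np.einsum("ock,kc->o", kernel, aligned)
```

When `K == k` and `A` is the identity, `aligned_point_conv` reduces to an ordinary weighted sum over a fixed neighbourhood, which is the special case corresponding to a regular 2D convolution.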
As shown in Figure 3, our 3D branch consists of two FKAConv layers. We first extract the features of the 3D points from the input tensor using their projected 2D indices on the image plane. This volume has the size of $C \times N$, where $C$ is the number of feature channels, and $N$ is the number of 3D points. Next, we feed the point features and their 3D coordinates to the FKAConv layers. FKAConv selects a set of k-neighboring points for every input point and learns a transformation matrix to align the 3D points with its kernel. The point features are then convolved with the aligned 3D points to produce a 2D tensor of shape $C \times N$. The output features are projected back to an empty 3D tensor of the same size as the input tensor using the projected 2D indices. Features of other positions are set to zero.
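The bookkeeping around the point convolution (gathering per-point features at the projected pixel indices, selecting neighbours, and scattering the processed features back into an otherwise empty tensor) can be sketched in NumPy as below. The function names are hypothetical, the brute-force neighbour search is chosen for clarity rather than speed, and a point is treated as its own nearest neighbour, which is an assumption.

```python
import numpy as np

def gather_point_features(feat, rows, cols):
    """feat: (C, H, W); rows, cols: (N,) pixel indices -> (C, N)."""
    return feat[:, rows, cols]

def k_nearest_neighbours(xyz, k):
    """xyz: (N, 3) -> (N, k) indices of the k closest points.

    Brute-force search; every point is its own first neighbour.
    """
    diff = xyz[:, None, :] - xyz[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    return np.argsort(dist2, axis=1)[:, :k]

def scatter_point_features(point_feat, rows, cols, height, width):
    """point_feat: (C, N) -> (C, H, W), zero wherever no point projects."""
    out = np.zeros((point_feat.shape[0], height, width), dtype=point_feat.dtype)
    out[:, rows, cols] = point_feat
    return out
```

Gathering followed by scattering leaves the features at the point locations unchanged and zeroes everything else, so the scattered tensor can be summed directly with dense 2D features of the same shape.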
2D-3D Feature Fusion: Output volumes from the 2D and 3D branches have the same shape as the input tensor. Therefore, to fuse these features, we sum them together before applying a 2D convolutional layer to output a 3D tensor with a similar shape to the input. Finally, we add a residual connection to avoid vanishing gradients during training.
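The fusion step amounts to a sum, a convolution and a skip connection. The sketch below assumes a 1x1 fusion convolution with as many output as input channels so that the residual addition is well defined; the kernel size actually used is not stated in the text, and `fuse_features` is a hypothetical name.

```python
import numpy as np

def fuse_features(x, feat_2d, feat_3d, w):
    """Fuse 2D and 3D branch outputs and add a residual connection.

    x, feat_2d, feat_3d: (C, H, W); w: (C, C) weights of an assumed
    1x1 convolution.
    """
    summed = feat_2d + feat_3d
    fused = np.einsum("oc,chw->ohw", w, summed)
    return x + fused  # residual connection back to the encoder input
```

With all-zero fusion weights the block returns its input unchanged, which is the property that keeps gradients flowing through a deep stack of such blocks.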
3.2 Encoder, Decoder, and Confidence Predictor modules
Encoder and Decoder Module: Designing efficient decoder and refinement modules is essential for the depth estimation problem [fang2020towards, wojna2019devil]. One common practice is to create large and complex decoders to produce accurate depth maps with sharp edges and fine details. However, we argue that by iteratively fusing relevant depth measurements from the 3D points with appearance features from image pixels, we can significantly reduce the size of our decoder and refinement designs. Specifically, our decoder and refinement modules have only two convolutional layers each. To simplify further, we use the same decoder and refinement designs for all fusion-nets.
As shown in the orange box of Figure 3, the decoder transforms the fused features from the encoder before feeding them to the refinement module (the yellow box in Figure 3). We then initially obtain a depth map and an output volume of the decoder. The estimated confidence map later modifies these two outputs.
|Approach||Method||#3D points||#Params||REL||RMSE||δ<1.25||δ<1.25²||δ<1.25³|
|Revisited mono-depth||Hu’19 [Hu2018RevisitingSI]||0||157.0M||0.115||0.530||0.866||0.975||0.993|
|Sparse and Dense||Jaritz’18 [jaritz2018sparse]||200||58.3M||0.050||0.194||0.930||0.960||0.991|
|Depth Coefficients||Imran’19 [imran2019depth]||500||45.7M||0.013||0.118||0.994||0.999||-|
|Consistent depth||Luo’20 [Luo-VideoDepth-2020]||-||178.2M||0.086||0.345||0.916||0.959||0.984|
Confidence Predictor: Although the input 3D sparse points provide useful depth measurements, they can also contain noise. Hence, we propose a simple yet efficient confidence predictor to attenuate the effect of noise. As illustrated in the cyan box of Figure 3, the output volumes from the feature fusion encoder are fed to three convolutional layers followed by a sigmoid to output a probability for every pixel. This information is then used to alter the initial depth map and the output features of the decoder. Moreover, we add residual connections at the end of the decoder and refinement blocks to prevent the vanishing gradient problem and to regularize the confidence map’s errors. The initial depth map is corrected based on the confidence map, as shown in Figure 4.
3.3 Multi-scale Loss function
We calculate our loss at multiple feature resolutions to train our network. The full loss is defined as:
$$L = \sum_{i=1}^{S} \lambda_i \left( L_{depth}^{i} + L_{grad}^{i} + L_{normal}^{i} \right) \qquad (1)$$
where $S$ is the number of resolution scales and $\lambda_i$ is the loss weight at scale $i$, $L_{depth}$ is a variation of the $L_1$ norm that minimizes error on the sparse depth pixels, $L_{grad}$ optimizes the error on edge structures, and $L_{normal}$ penalizes angular error between the ground truth and predicted surface normals. These loss terms were introduced by Hu et al. [Hu2018RevisitingSI] and are widely adopted by state-of-the-art monocular depth estimation methods [chen2019structure, huynh2020guiding]. Subsection 4.2 describes in detail how the network is trained using these loss functions.
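A possible NumPy rendering of such a multi-scale loss is given below. It follows the general form of the three terms of Hu et al. [Hu2018RevisitingSI] (a logarithmic variant of the absolute error, the same penalty on the gradients of the error, and one minus the cosine between surface normals), but several details are assumptions made for this sketch: the offset `ALPHA = 0.5`, forward differences instead of Sobel filters, evaluation over all pixels without a validity mask, and the scale weights used in the example.

```python
import numpy as np

ALPHA = 0.5  # assumed offset inside the logarithm

def _log_abs(x):
    return np.log(np.abs(x) + ALPHA)

def _gradients(d):
    """Forward differences along x and y, cropped to a common (H-1, W-1) shape."""
    dx = d[:-1, 1:] - d[:-1, :-1]
    dy = d[1:, :-1] - d[:-1, :-1]
    return dx, dy

def depth_loss(pred, gt):
    """Logarithmic variant of the mean absolute depth error."""
    return _log_abs(pred - gt).mean()

def gradient_loss(pred, gt):
    """Penalises errors in the depth gradients, i.e. around edges."""
    ex, ey = _gradients(pred - gt)
    return (_log_abs(ex) + _log_abs(ey)).mean()

def normal_loss(pred, gt):
    """One minus the cosine between normals n = (-dz/dx, -dz/dy, 1)."""
    px, py = _gradients(pred)
    gx, gy = _gradients(gt)
    dot = px * gx + py * gy + 1.0
    norm = np.sqrt(px**2 + py**2 + 1.0) * np.sqrt(gx**2 + gy**2 + 1.0)
    return (1.0 - dot / norm).mean()

def multi_scale_loss(preds, gts, scale_weights):
    """Weighted sum over scales of the three per-scale loss terms."""
    total = 0.0
    for pred, gt, lam in zip(preds, gts, scale_weights):
        total += lam * (depth_loss(pred, gt)
                        + gradient_loss(pred, gt)
                        + normal_loss(pred, gt))
    return total
```

Because of the logarithm, a perfect prediction yields a negative constant rather than zero; only differences between loss values are meaningful.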
4 Experimental Evaluation
In this section, we evaluate the performance of the proposed method and compare it with several baselines on the NYU-Depth-v2, real world and KITTI datasets.
4.1 Dataset and Evaluation metrics
The NYU-Depth-v2 dataset contains RGB-D images recorded from 464 indoor scenes. We extract the raw RGB frames from the original videos and reconstruct sparse 3D point clouds using the COLMAP [schoenberger2016sfm, schoenberger2016mvs] structure-from-motion software. COLMAP is also used to extract the camera poses for multi-view stereo methods. The 3D points are back-projected to each input view to obtain a sparse set of depth values. We use 60K images for training and 654 images from the official test set for evaluating the methods. For outdoor data, we use 1000 images from the validation set of the KITTI depth completion benchmark [uhrig2017sparsity] to test our method.
We report the results in terms of standard metrics, namely, the mean absolute relative error (REL), root mean square error (RMSE), and thresholded accuracy ($\delta$). The detailed definitions of the measures are provided in the supplementary material.
4.2 Implementation details
The proposed model is trained for 150 epochs on a single TITAN RTX using a batch size of 32, the Adam optimizer [kingma2014adam], and the loss function presented in Eq. (1). From epoch 10 onwards, the learning rate is reduced at regular intervals. We set the number of scales in Eq. (1) to 5 and use fixed values for the loss weight coefficients and for the per-scale loss weights. For training, we augment the input RGB images using random rotation ([-5.0, +5.0] degrees), horizontal flip, rectangular window droppings, and colorization. We also add random noise to the XYZ-coordinates of the sparse input points.
4.3 Comparison with State-of-the-art
The proposed method is related to multiple partially overlapping problem areas and, therefore, we compare it with several baseline methods in monocular depth estimation [ramamonjisoa2019sharpnet, Hu2018RevisitingSI, chen2019structure, Yin2019enforcing, huynh2020guiding], depth completion [jaritz2018sparse, mal2018sparse, cheng2018depth, cheng2019learning, Qiu_2019_CVPR, xu2019depth, park2020non], deep multi-view stereo [yao2018mvsnet], and deep structure-from-motion [Luo-VideoDepth-2020]. The baseline results are obtained using the pre-trained models [chen2019structure, Hu2018RevisitingSI, ramamonjisoa2019sharpnet, Yin2019enforcing, park2020non, Luo-VideoDepth-2020], by re-training on NYU-v2 using the official code [yao2018mvsnet, mal2018sparse], using our own re-implementations [huynh2020guiding, jaritz2018sparse], and from the original papers [cheng2019learning, Qiu_2019_CVPR, xu2019depth, imran2019depth].
The performance metrics, computed between the estimated depth maps and the ground truth, are provided in Table 1. In addition, we report the number of model parameters and the number of 3D points used in the estimation. Compared to the monocular depth estimation works, the proposed method provides a substantial improvement according to all metrics: REL, RMSE and thresholded accuracy all improve while using only a fraction of the model parameters and 32 additional 3D points.
Compared to the depth completion methods, we achieve state-of-the-art performance while using clearly fewer model parameters. Moreover, our model needs fewer 3D points than [cheng2018depth, cheng2019learning, Qiu_2019_CVPR, xu2019depth] and produces comparable results already with 32 input 3D points. The best performing baselines, Park et al. [park2020non] and Xu et al. [xu2019depth], use several times more parameters than our method. Instead of using explicit 3D points, the multi-view stereo [yao2018mvsnet] and structure-from-motion [Luo-VideoDepth-2020] methods utilise multiple RGB images with camera poses. The results in Table 1 indicate that the proposed model also outperforms these methods using only a fraction of the model parameters.
Figure 5 shows qualitative results of the predicted depth maps and reconstructed point clouds for our method and for [park2020non]. The baseline [park2020non] results are obtained using the pre-trained model from the original authors. Although both methods produce high quality depth maps, the proposed model is better at recovering fine details in challenging regions and introduces fewer distortions on flat surfaces. Additional results are provided in the supplementary material.
Dense depth prediction using COLMAP points.
To assess the generalization properties of the proposed method, we recorded an additional set of videos using Kinect-v2. The test set consists of 597 RGB frames with ground truth depth maps from indoor environments. The frames are pre-processed with COLMAP to obtain the sparse 3D point cloud. The new dataset will be made publicly available upon the publication of the paper. Table 2 contains the performance metrics for our method and for Park et al. [park2020non] on the collected dataset. Compared to the NYU-v2 results, we obtain similar performance, while the performance of Park et al. [park2020non] degrades clearly. These results suggest that our method is able to generalise to environments unseen at training time.
Figure 8 illustrates qualitative examples from the Kinect-v2 test set. The proposed method clearly preserves scene structure and details better than the state-of-the-art depth completion baseline.
Dense depth prediction using ARCore points.
Recent AR frameworks provide 3D points of the environment, which can be utilised for dense depth estimation. To this end, we collected video sequences using an Android phone and used ARCore to produce a sparse 3D point cloud of the scene. The dense depth maps obtained from this data with our method and Park et al. [park2020non] are illustrated in Figure 7. Our method provides a high quality depth map with significantly fewer distortions compared to [park2020non].
4.4 Analysis of the number and pattern of input 3D points
To assess how the quantity and spatial distribution of the input 3D points affect the results, we performed an experiment with varying 3D point patterns. For this purpose, we generate sparse point sets by randomly sampling from the dense ground truth or from the COLMAP output.
We expect that sampling from the dense depth map provides better results than using the COLMAP points. This is because the dense depth map also covers flat textureless surfaces such as walls, floors, and doors. However, such points might not be easy to obtain in practice, whereas COLMAP points represent locations that are typically reconstructed by SfM or SLAM methods.
Figure 9 presents the RMSE for different numbers of input points for both types. The results confirm the initial assumption that sampling from the dense depth map results in better performance. Moreover, we notice that the proposed method obtains higher accuracy than Park et al. [park2020non] with all point sets. In fact, we obtain similar performance using COLMAP points as Park et al. [park2020non] obtain using points from the dense depth map.
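Random sampling from a dense depth map, as used for the first type of point set, can be sketched as follows. The function name and the convention that non-positive values mark invalid pixels are assumptions for illustration.

```python
import numpy as np

def sample_sparse_depth(dense_depth, num_points, rng):
    """Keep num_points randomly chosen valid pixels of a dense depth map.

    Pixels with depth <= 0 are treated as invalid (assumed convention).
    All pixels that are not selected are set to 0.
    """
    valid = np.flatnonzero(dense_depth > 0)
    count = min(num_points, valid.size)
    chosen = rng.choice(valid, size=count, replace=False)
    sparse = np.zeros_like(dense_depth)
    sparse.flat[chosen] = dense_depth.flat[chosen]
    return sparse
```

For example, `sample_sparse_depth(depth, 32, np.random.default_rng(0))` produces a map with at most 32 non-zero depth values, uniformly distributed over the valid region. Points reconstructed by an SfM pipeline would instead cluster on textured regions.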
4.5 From NYU-v2 to KITTI
To further assess the generalization abilities of the proposed method, we directly test our pretrained NYU model on the KITTI validation set. As shown in Figure 11, the proposed method can produce plausible results on completely unseen data even when using a very sparse set of points. Figure 10 shows the RMSE for different numbers of input points. We notice that the proposed method obtains higher accuracy compared to non-learning based methods [silberman2012indoor, barron2016fast] and Zhang et al. [zhang2018deep] when using 72 points.
We propose a lightweight method that fuses RGB monocular depth estimation with information from a sparse set of 3D points at multiple scales. Experiments show that the proposed method achieves state-of-the-art results on NYU-v2 while having 2-6 times fewer parameters than baseline methods. Evaluation on real world images and the KITTI dataset demonstrates good generalization properties of our approach. We believe that it provides a practical solution for obtaining high-quality depth maps for various applications where dense depth is needed.
A Outdoor experiments using KITTI
Although our main focus is on indoor environments, we performed an experiment to analyse how the proposed method generalizes to outdoor scenes. For this purpose, we train our model using 85K images from the KITTI depth completion data [uhrig2017sparsity] and then evaluate using 1000 images from the KITTI validation set. Figure 12 shows the results for our method and for [roy2016monocular, barron2016fast, zhang2018deep, park2020non, Qiu_2019_CVPR] using a varying number of input 3D points. The proposed method obtains the best results in all tested setups, and the difference is largest with a small number of input 3D points. Figure 14 illustrates examples of the obtained depth maps. We note that the proposed method is able to obtain a reasonable depth map already with a single input point, as shown in Figure 13. In addition, Figure 15 presents examples with a varying number of input points for our method and for [park2020non]. The results suggest that high-quality depth maps can be obtained by using only a few LiDAR points, enabling more cost-efficient solutions.
B Additional qualitative results
Figure 16 presents additional results for the NYU-Depth-v2 dataset [silberman2012indoor] using a varying number of randomly selected input points. The coarse structures and the overall scene geometry are well-preserved in all tested cases, whereas using 32, 64 or 200 input points also retains the finer details.
Figure 17 presents a challenging sample from real-world data using the ARCore [google_arcore_2019] points. The results with different numbers of input points confirm that, although coarse details are relatively well-preserved for all sparsity cases, depth estimates are consistently better when using more points.
C Definitions of the evaluation metrics
For NYU-Depth-v2 [silberman2012indoor] the evaluation results are calculated for pixels with depth values in the range [0.0, 10.0] while for KITTI [uhrig2017sparsity] the valid range is [0.0, 90.0]. We evaluate the performance for our model and for baselines using the following standard metrics:
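Mean absolute relative error (REL):
$$\mathrm{REL} = \frac{1}{n} \sum_{i=1}^{n} \frac{|d_i - \hat{d}_i|}{\hat{d}_i}$$
Root mean square error (RMSE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(d_i - \hat{d}_i\right)^2}$$
Thresholded accuracy ($\delta_j$):
$$\delta_j = \frac{1}{n} \left| \left\{ i : \max\left(\frac{d_i}{\hat{d}_i}, \frac{\hat{d}_i}{d_i}\right) < 1.25^j \right\} \right|, \quad j \in \{1, 2, 3\}$$
where $n$ is the number of valid pixels, $d_i$ is the predicted depth value at pixel $i$, and $\hat{d}_i$ is the ground truth depth at pixel $i$. Higher thresholded accuracies $\delta_1$, $\delta_2$ and $\delta_3$ mean better results, while lower REL and RMSE values are better.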
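The three metrics can be computed in a few lines of NumPy. The sketch below uses the standard definitions of REL, RMSE and thresholded accuracy with thresholds 1.25, 1.25² and 1.25³. Treating the lower bound of the valid range as exclusive, so that pixels with zero ground truth are skipped, is an assumption made here to avoid division by zero; predictions are assumed to be strictly positive.

```python
import numpy as np

def depth_metrics(pred, gt, min_depth=0.0, max_depth=10.0):
    """Return (REL, RMSE, [delta_1, delta_2, delta_3]) over valid pixels.

    A pixel is valid if min_depth < gt <= max_depth. The defaults match the
    NYU-Depth-v2 range; use max_depth=90.0 for KITTI.
    """
    valid = (gt > min_depth) & (gt <= max_depth)
    p, g = pred[valid], gt[valid]
    rel = np.mean(np.abs(p - g) / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    ratio = np.maximum(p / g, g / p)
    deltas = [np.mean(ratio < 1.25 ** j) for j in (1, 2, 3)]
    return rel, rmse, deltas
```

REL and RMSE are errors, so lower is better, while each thresholded accuracy is the fraction of valid pixels whose prediction is within the given ratio of the ground truth.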