DFuseNet: Deep Fusion of RGB and Sparse Depth Information for Image Guided Dense Depth Completion

02/02/2019 ∙ by Shreyas S. Shivakumar, et al. ∙ University of Pennsylvania

In this paper we propose a convolutional neural network that is designed to upsample a series of sparse range measurements based on the contextual cues gleaned from a high resolution intensity image. Our approach draws inspiration from related work on super-resolution and in-painting. We propose a novel architecture that seeks to pull contextual cues separately from the intensity image and the depth features and then fuse them later in the network. We argue that this approach effectively exploits the relationship between the two modalities and produces accurate results while respecting salient image structures. We present experimental results to demonstrate that our approach is comparable with state of the art methods and generalizes well across multiple datasets.




1 Introduction

Dense depth estimation is a critical component in autonomous driving, robot navigation and augmented reality. Popular sensing schemes in these domains involve a high resolution camera and a low resolution depth sensor such as a LiDAR or Time-of-Flight sensor. The density of points returned from commonly available depth sensors is typically an order of magnitude lower than the resolution of the camera image. Additionally, higher resolution variants of these sensors are expensive, making them impractical for most applications. However, a number of applications such as planning, obstacle avoidance and fine-grained layout estimation can benefit from higher resolution range data which motivates us to consider approaches that can up-sample the sparse available depth measurements to the resolution of the available imagery.

Traditionally, interpolation and diffusion based schemes have been used to up-sample these sparse points into a smooth dense depth image, often using the corresponding color image as a guide [10]. Convolutional neural networks (CNNs) have had tremendous success in depth estimation tasks using monocular image data [30, 3, 13, 51, 8, 26, 27, 23, 28, 11, 42, 52, 47], stereo image data [39, 40, 2, 21, 32, 50, 46, 20] and sparse depth data on its own [41, 34, 25, 9, 6].

Monocular depth estimation networks have been proposed that learn to extrapolate depth information from image data alone, and recent methods show that such strategies can even be carried out in an unsupervised manner  [13, 52, 47]. However, the depth estimates generated from such networks are less accurate and often relative, making them impractical for navigation and planning. While the work from this field is relevant to us, we focus on the depth completion task, where sparse depth data is available and can be used with high resolution image data to produce accurate dense depth images.

One way to view both the monocular depth prediction problem and the depth completion problem is in terms of a posterior distribution $P(D \mid I)$, which represents the probability of a given depth image, $D$, given an input intensity image, $I$. In both cases the approaches implicitly assume that the resulting posterior distribution is highly concentrated along a low dimensional manifold, which makes it possible to infer the complete depth map from relatively few depth samples.

We wish to design a CNN architecture that can learn sufficient global and contextual information from the color images and use this information along with sparse depth input to accurately predict depth estimates for the entire image, while enforcing edge preservation and smoothness constraints. Once designed, such a network could be used to upsample information from a variety of depth sensors including LIDAR systems, stereo algorithms or structure from motion algorithms. To summarize, we propose the following contributions:

  1. A CNN architecture that uses a dual branch architecture, spatial pyramid pooling layers and a sequence of multi-scale deconvolutions to effectively exploit contextual cues from the input color image and the available depth measurements.

  2. A training regime that makes use of different sources of information, such as stereo imagery, to learn how to extrapolate depth effectively in regions where no depth measurements are available.

  3. An evaluation of our methods on the KITTI Depth Completion Benchmark (http://www.cvlibs.net/datasets/kitti/eval_depth.php), which shows that our strategy is among the top performing algorithms in the benchmark. We also evaluate our algorithm on two other datasets, Virtual KITTI and NYUDepth, and show that our method is able to generalize well across the three different datasets.

Figure 1: Our network architecture uses two input branches, for RGB and depth input respectively. We use Spatial Pyramid Pooling (SPP) blocks in the encoder and use a hierarchical representation of decoder features to predict dense depth images.

2 Related Work

Depth Estimation: Monocular depth estimation is an active research field where CNN based methods are currently the state of the art. Different methods have been proposed that use supervised  [8, 30, 3, 23, 26, 51], unsupervised  [13] and self-supervised  [28] depth estimation strategies. At the time of this writing, the best performing monocular depth estimation algorithm is from Fu et al., achieving an inverse RMSE score of 12.98 on the KITTI depth prediction dataset [11]. The authors propose an ordinal regression based method of predicting depth values, as they state that modelling depth estimation as a regression problem results in slow convergence and unsatisfactory local solutions. Li et al. also discretize the depth prediction problem by formulating it as a classification problem  [27].

CNNs have been successfully used in dense stereo depth estimation tasks. Zbontar et al. proposed a siamese network architecture to learn a similarity measure between two input patches. This similarity measure is then used as a matching cost input for a traditional stereo pipeline  [50]. Recently, many end-to-end methods have been proposed that are able to generate accurate disparity images while preserving edges  [20, 39, 46, 21], of which the work of Chang et al. is most similar to the network we propose, where the authors propose an end-to-end approach using spatial pyramid pooling to better learn global image dependent features  [2].

Incomplete Input Data: Learning dense representations from sparse input is similar to the domain of super resolution and in-painting. Super resolution assumes that the input is a uniformly sub-sampled representation of the desired high resolution output, and the learning problem can be posed as an edge preserving interpolation strategy. A comprehensive review of these methods is presented by Yang et al. [45]. We note that multi-scale architectures with multiple skip connections have been successfully used for image and depth upsampling tasks [44, 18]. Content-aware completion is motivated by a similar problem of learning complete representations from incomplete input data. Image in-painting requires semantically aware completion of missing input regions. Generative networks have been used successfully for context aware image completion tasks [48, 49] but are outside the scope of this paper. However, Liu et al. propose a method relevant to our problem, where partial convolutions are used to effectively complete large irregular missing regions in the input image [31, 22].

Depth Completion: A particular sub-problem of depth estimation with incomplete input data is depth completion. Following the release of the KITTI depth completion benchmark, novel approaches to solve the problem have been proposed. Uhrig et al., the authors of the benchmark, propose a sparsity invariant CNN architecture  [41], using partial normalized convolutions on the input sparse depth image. They propose multiple architectures, to accommodate RGB information and sparse depth input only. Huang et al. propose HMSNet, which uses masked operations on the partial convolutions such as partial summation, up-sampling and concatenation [17].

Schneider et al. and Jaritz et al. propose the use of semantic information to help improve the depth completion problem [37, 19]. Jaritz et al. noted the saturation of convolution masks when using partial convolution based architectures. Depending on the input density, the masks often become completely dense after three to four layers. They choose to not use sparse convolutions and report no loss of accuracy.

Ku et al. propose a non-learning based approach to this problem to highlight the effectiveness of well crafted classical methods, using only commonly available morphological operations to produce dense depth information [25]. Their proposed method currently out-performs multiple deep learning based methods on the KITTI depth completion benchmark. Dimitrievski et al. propose a CNN architecture which uses the work of Ku et al. as a pre-processing step on the sparse depth input [7]. We followed a similar strategy and chose to fill in our sparse input depth image instead of using sparse convolutions. Their network is designed to use traditional morphological operators as well as subsequently learned morphological filters using a U-Net style architecture [36]. They are able to achieve better quantitative results but their model fails to preserve semantic and depth discontinuities as it relies heavily on the filled depth image for its final output. Eldesokey et al. propose a method that also uses normalized masked convolutions, but generates confidence values for each predicted depth by using a continuous confidence mask instead of a binary mask [9]. Cheng et al. propose a depth propagation network to explicitly learn an affinity function and apply it to the depth completion problem [5].

Wang et al. propose a multi-scale feature fusion method for depth completion [43] using sparse LIDAR data. Ma et al. propose two methods, a supervised method for depth completion using a ResNet based architecture [34] and a self-supervised method which is currently the top performing depth completion algorithm on the KITTI depth completion benchmark  [33]. Their proposed self-supervised method uses the sparse LiDAR input along with pose estimates to add additional training information based on depth and photometric losses.

3 Approach

3.1 Design Overview

We propose the following CNN architecture (Fig 1), which has been structured to learn local to global context information from both the color image and the sparse depth data and to fuse them together to produce accurate and consistent dense depth maps. We propose a dual branch encoder design in a similar fashion to previous image comparison networks [50]. Given the differences in input modality provided to the two branches, we choose to not use Siamese networks with coupled weights [1], and use independent branches instead with different design decisions made for each branch. In our encoder, we use spatial pyramid pooling (SPP) blocks to learn a coarse-to-fine representation of features. Spatial pyramid pooling blocks have been effective in learning local to global context information and have been successfully used in depth perception tasks [2]. We concatenate features learned from individual branches and propagate these features through our de-convolution layers. The final layer performs a convolution operation on features combined from different de-convolution layers, up-sampled to the final output resolution, to utilize information from different scales and context to generate the final depth image.
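To make the coarse-to-fine idea behind the SPP blocks concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the network's implementation: it works on a single-channel map, uses non-overlapping pooling tiles and nearest-neighbour upsampling, and omits the convolutions that a real SPP block applies after pooling. The window sizes (64, 32, 16, 8) and the max/average split between branches are the ones given in Section 3.2; the function names `pool2d` and `spp_stack` are hypothetical.

```python
import numpy as np

def pool2d(x, window, mode="avg"):
    """Non-overlapping pooling over window x window tiles of a 2-D feature map.

    Assumes both dimensions of x are divisible by `window`.
    """
    h, w = x.shape
    tiles = x.reshape(h // window, window, w // window, window)
    return tiles.max(axis=(1, 3)) if mode == "max" else tiles.mean(axis=(1, 3))

def spp_stack(x, windows=(64, 32, 16, 8), mode="avg"):
    """Pool at every window size, upsample back to the input size, stack as channels."""
    levels = []
    for k in windows:
        pooled = pool2d(x, k, mode)
        # Nearest-neighbour upsampling (a simplification chosen for this sketch).
        levels.append(np.repeat(np.repeat(pooled, k, axis=0), k, axis=1))
    return np.stack(levels, axis=0)

features = np.arange(64 * 128, dtype=np.float64).reshape(64, 128)
rgb_pyramid = spp_stack(features, mode="avg")    # color branch: average pooling
depth_pyramid = spp_stack(features, mode="max")  # depth branch: max pooling
```

Each output channel carries the same map summarized at a different scale, which is what lets later layers combine global and local context.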

3.2 Feature Extraction

Figure 2: Illustration of the Spatial Pyramid Pooling blocks used in our encoder architecture. We use the same pooling window sizes as Chang et al. [2]. The pooling windows are 64, 32, 16 and 8 respectively. For the depth image we use max pooling to preserve sparse information and for the RGB branch we use average pooling.

Our color and depth branches begin with an initial depth filling step, similar to the approach of Ku et al. [25]. We use a simple sequence of morphological operations and Gaussian blurring operations to fill the holes in the sparse depth image with depth values from nearby valid points such that no holes remain. This is then passed to the feature extraction branch. The filled depth image is then normalized by the maximum depth value in the dataset, resulting in depth values between 0 and 1. For the depth image, we choose to use larger kernel sizes and fewer convolution operations, resulting in fewer layers. For the color image, we use smaller kernel sizes and make use of four residual blocks [14], in addition to two initial convolution layers. The output of these initial feature extraction layers is then passed to spatial pyramid pooling (SPP) blocks. We use the same SPP block structure as proposed by Chang et al. [2], but use max pooling for our depth branch and average pooling for our color branch. An illustration of our SPP block is shown in Fig 2. Our pooling windows are consistent between the two branches and are 64, 32, 16 and 8 for each scale respectively. The output of this layer is an up-sampled stack of feature layers carrying information from different scales.
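The depth filling and normalization step can be sketched as follows. This is a simplified stand-in, not the exact sequence of operations used in the paper: it applies only a repeated 3x3 dilation (no Gaussian blur), assumes missing depth is encoded as 0, and uses a hypothetical `max_depth` of 80 purely for illustration.

```python
import numpy as np

def dilate_fill(sparse, max_iters=100):
    """Fill zero-valued holes by repeatedly copying the largest 3x3 neighbour.

    Pixels that already hold a measurement are never overwritten.
    """
    depth = sparse.astype(np.float64).copy()
    h, w = depth.shape
    for _ in range(max_iters):
        holes = depth == 0
        if not holes.any():
            break
        padded = np.pad(depth, 1, mode="constant")
        # Nine shifted views of the padded map give every pixel's 3x3 neighbourhood.
        neighbours = np.stack([padded[dy:dy + h, dx:dx + w]
                               for dy in range(3) for dx in range(3)])
        depth[holes] = neighbours.max(axis=0)[holes]
    return depth

def normalize(depth, max_depth):
    """Scale depth into [0, 1] by a dataset-wide maximum depth."""
    return np.clip(depth / max_depth, 0.0, 1.0)

sparse = np.zeros((6, 8))
sparse[1, 2] = 10.0
sparse[4, 6] = 40.0
filled = dilate_fill(sparse)
scaled = normalize(filled, max_depth=80.0)  # 80.0 is an assumed value
```

The filled map is only an initialization; the network is still expected to refine it using the image features.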

Figure 3: Dual-branch architecture (L-R,T-B): Input color image, filled input depth, output when input to RGB branch is set to zeros, output when input to the depth branch is set to zeros and predicted depth image when RGB and depth images are provided. This illustration informs us that both branches contribute significantly to the final prediction and that the filled depth is not being naively propagated through the network without any learning.

3.3 Combining Modalities

The features from the previous extraction modules are then concatenated into one volume. The first layer is an intermediate output of the residual blocks from the color branch, which we hypothesize can carry over high level features learned from the color image. The subsequent layers are color and depth features extracted from the SPP blocks of the two branches. We believe that these layers can help learn a joint feature representation between the two input modalities in the following layers. We perform three sequential convolution operations on this volume, reducing the number of channels and increasing the spatial resolution by twice the size of the volume. By forcing a reduction in channels we attempt to force the network to learn a lower dimensional representation of the joint feature space, combining important information from both depth and color branches.
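The fusion step described above can be illustrated with a small NumPy sketch. All channel counts, the use of 1x1 convolutions, the ReLU activation and the nearest-neighbour 2x upsampling are assumptions made for brevity; the paper does not specify them in this section.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, weight):
    """1x1 convolution: mixes channels at every pixel.

    x has shape (C_in, H, W); weight has shape (C_out, C_in).
    """
    return np.einsum("oc,chw->ohw", weight, x)

def upsample2x(x):
    """Double the spatial resolution of a (C, H, W) volume (nearest neighbour)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# Hypothetical feature volumes from the two branches.
rgb_residual = rng.standard_normal((32, 8, 16))  # intermediate residual-block output
rgb_spp = rng.standard_normal((16, 8, 16))       # color SPP features
depth_spp = rng.standard_normal((16, 8, 16))     # depth SPP features

# Concatenate along the channel axis into one volume.
fused = np.concatenate([rgb_residual, rgb_spp, depth_spp], axis=0)

# Three sequential convolutions that shrink the channel dimension.
for c_out in (48, 32, 16):
    weight = rng.standard_normal((c_out, fused.shape[0])) * 0.1
    fused = np.maximum(conv1x1(fused, weight), 0.0)  # ReLU

fused = upsample2x(fused)
```

Shrinking the channel count forces each output channel to mix information from both modalities, which is the intent stated in the paragraph above.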

3.4 Depth Prediction

The following layers perform a sequence of convolutions with batch normalization, and incremental de-convolutions to restore the original image resolution. The final step involves concatenating different layer outputs from the de-convolution pipeline, up-sampling by interpolation to achieve the original input resolution and then performing a final convolution on the multi-scale stack to produce a single channel output. This output is then passed to a sigmoid activation function and re-scaled to the original range of depth values. Odena et al. advise caution in the use of transposed convolutions for spatial upsampling [35]; hence we limit the use of transposed convolutions, and our final output is the result of a 1x1 convolution on a feature volume which mainly consists of interpolated low resolution features, which minimizes the checkerboard effect in the final depth image.
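A minimal sketch of this output head follows, assuming nearest-neighbour interpolation, hypothetical decoder shapes and random weights. It only illustrates the data flow (upsample, concatenate, 1x1 convolution, sigmoid, rescale); the real layer sizes are not given here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) volume by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def predict_depth(decoder_outputs, weights, max_depth):
    """Fuse multi-scale decoder outputs into a single-channel depth map."""
    full_h = max(f.shape[1] for f in decoder_outputs)
    stack = np.concatenate(
        [upsample_nearest(f, full_h // f.shape[1]) for f in decoder_outputs],
        axis=0)
    logits = np.einsum("c,chw->hw", weights, stack)  # 1x1 conv to one channel
    return sigmoid(logits) * max_depth               # rescale to the depth range

rng = np.random.default_rng(0)
decoder_outputs = [rng.standard_normal((4, 4, 8)),    # coarsest scale
                   rng.standard_normal((2, 8, 16)),
                   rng.standard_normal((1, 16, 32))]  # full resolution
weights = rng.standard_normal(7) * 0.1                # 4 + 2 + 1 input channels
depth = predict_depth(decoder_outputs, weights, max_depth=80.0)  # assumed range
```

Because the sigmoid bounds the output to (0, 1), the prediction can never exceed the chosen maximum depth.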

3.5 Training

Our training signal is a weighted average of multiple loss terms, some calculated over the entire image resolution and some calculated only at points where accurate ground truth depth exists. The weights of the individual loss terms are chosen based on a confidence associated with each signal and are varied at different points in time in the training regime.
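A weighted combination of this kind could be written as follows. The symbols $w_p$, $w_s$ and $w_m$ and the subscripted loss names are placeholder notation chosen for this sketch, matching the primary, stereo and smoothness terms described in the subsections below; they are not necessarily the paper's own symbols.

```latex
\mathcal{L}_{\text{total}}
  = w_{p}\,\mathcal{L}_{\text{primary}}
  + w_{s}\,\mathcal{L}_{\text{stereo}}
  + w_{m}\,\mathcal{L}_{\text{smooth}}
```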


3.5.1 Primary Loss

We experimented with both L1 and L2 norms as primary loss functions. For this term, we calculate the loss only at pixels where ground truth depth exists and average over the total number of ground truth points. For better RMSE values on evaluation benchmarks we found L2 to be the better choice as a primary loss term.
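The masked primary loss can be sketched in a few lines of NumPy. The sketch assumes that pixels without ground truth are stored as 0, which is an assumption of this example; the function name `masked_loss` is hypothetical.

```python
import numpy as np

def masked_loss(pred, gt, norm="l2"):
    """L1 or L2 loss averaged over pixels that have ground truth (gt > 0)."""
    valid = gt > 0
    count = valid.sum()
    if count == 0:
        return 0.0
    diff = pred[valid] - gt[valid]
    if norm == "l2":
        return float((diff ** 2).sum() / count)
    return float(np.abs(diff).sum() / count)

pred = np.array([[1.0, 2.0], [3.0, 4.0]])
gt = np.array([[0.0, 4.0], [3.0, 0.0]])  # two valid ground truth pixels
l2 = masked_loss(pred, gt, "l2")
l1 = masked_loss(pred, gt, "l1")
```

Predictions at pixels without ground truth contribute nothing to this term, which is why the auxiliary terms are needed to constrain those regions.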

Figure 4: Learning to extrapolate better using available information: By adding a stereo depth based loss term, we are able to make better extrapolations in regions where no ground truth or LiDAR exists. (T-B) Input image, predicted depth without stereo term and prediction with stereo term.

3.5.2 Optional Stereo Supervision

Since Uhrig et al. provide a large dataset with data from multiple cameras, we propose a means of making better use of this data during training. The KITTI depth completion dataset provides roughly 42k stereo image pairs, and we use these images to provide depth information at points where ground truth LiDAR data is missing. We propose an auxiliary loss term that uses the stereo input image pair to generate a dense depth estimate that can guide the learning process in regions where no ground truth LiDAR measurements exist. We compute this loss term in a self-supervised manner since stereo intrinsics and extrinsics are known. We use Semi Global Matching to generate this dense depth estimate [16], since this algorithm can be run in real-time with GPU acceleration [15]. This loss term is an L2 norm of the difference between the predicted depth and the stereo estimated depth. This term can be computed at almost every pixel in the input image. Some pixels lack depth estimates since we use left-right consistency checks to discard noisy and partially occluded depth estimates.
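The stereo term rests on two standard steps: a left-right consistency check and the conversion from disparity to depth, depth = f * B / d. The sketch below illustrates both on toy disparity maps. The disparities are given directly instead of being computed by Semi Global Matching, and the focal length, baseline and tolerance are hypothetical values.

```python
import numpy as np

def left_right_consistent(disp_left, disp_right, tol=1.0):
    """Mark left-image pixels whose disparity agrees with the right image."""
    h, w = disp_left.shape
    rows, cols = np.indices((h, w))
    # Column in the right image that each left pixel should map to.
    target = np.rint(cols - disp_left).astype(int)
    inside = (disp_left > 0) & (target >= 0) & (target < w)
    matched = np.zeros_like(disp_left)
    matched[inside] = disp_right[rows[inside], target[inside]]
    return inside & (np.abs(disp_left - matched) <= tol)

def disparity_to_depth(disp, focal_px, baseline_m, valid):
    """depth = focal * baseline / disparity at valid pixels, 0 elsewhere."""
    depth = np.zeros_like(disp, dtype=np.float64)
    depth[valid] = focal_px * baseline_m / disp[valid]
    return depth

def stereo_loss(pred, stereo_depth):
    """Mean squared difference over pixels that have a stereo estimate."""
    valid = stereo_depth > 0
    if not valid.any():
        return 0.0
    return float(((pred[valid] - stereo_depth[valid]) ** 2).mean())

disp_left = np.full((2, 8), 4.0)
disp_right = np.full((2, 8), 4.0)
disp_right[0, 1] = 9.0  # inconsistent match, e.g. an occlusion
valid = left_right_consistent(disp_left, disp_right)
stereo_depth = disparity_to_depth(disp_left, focal_px=700.0, baseline_m=0.5,
                                  valid=valid)
```

Pixels rejected by the consistency check keep a depth of 0 and are therefore skipped by `stereo_loss`, mirroring the statement that some pixels lack depth estimates.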

3.5.3 Smoothness

We add a smoothness loss term, which is an L1 norm on the second order derivative of the predicted dense depth image, similar to the strategy used in unsupervised monocular depth estimation and structure from motion networks [13, 52, 42].
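A second-order smoothness penalty can be sketched with finite differences. This is one common discretization, given as an assumption; the paper does not state which one it uses.

```python
import numpy as np

def smoothness_loss(depth):
    """Mean absolute second-order finite difference along x and y."""
    d2x = depth[:, 2:] - 2.0 * depth[:, 1:-1] + depth[:, :-2]
    d2y = depth[2:, :] - 2.0 * depth[1:-1, :] + depth[:-2, :]
    return float(np.abs(d2x).mean() + np.abs(d2y).mean())

# A planar ramp has zero second derivative, so it is not penalized.
ramp = 0.5 * np.add.outer(np.arange(5.0), np.arange(6.0))
# A surface that is quadratic in x has a constant second difference of 2.
quadratic = np.tile(np.arange(6.0) ** 2, (5, 1))

ramp_penalty = smoothness_loss(ramp)
quadratic_penalty = smoothness_loss(quadratic)
```

Unlike a first-order penalty, this term leaves slanted planes such as roads and walls unpenalized and only discourages changes in slope.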

Figure 5: KITTI Depth Completion Results: Top-Bottom: Input color image, corresponding LiDAR scan mask (inverted for visualization), Sparse-to-Dense [33], MorphNet [7], SparseConv [41] and Ours (DFuseNet)

4 Experiments

Implementation Details: All our networks were implemented in PyTorch and we train them from scratch, not using pre-trained weights for any layers. Our models are trained using the Adam optimizer, and we typically use batch sizes of 20-25 and train for roughly 40 epochs in all experiments. We use an initial learning rate of , and drop our learning rate by a factor of 10% after every 5 epochs. We use a weight decay term of . The three weight terms from Eq 1 are set as follows: the first is usually set to 1, the second is 0.01 and the third is 0.001.
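The step schedule can be sketched as below. Two things are assumptions: the initial learning rate of 1e-3 is a placeholder, and "a factor of 10%" is read here as multiplying the rate by 0.1 every 5 epochs. If the intended meaning is a 10% reduction, `gamma=0.9` expresses that instead.

```python
def step_lr(initial_lr, epoch, step=5, gamma=0.1):
    """Learning rate after `epoch` epochs under a step decay schedule."""
    return initial_lr * gamma ** (epoch // step)

# Placeholder initial rate; the value used in the paper is not given here.
schedule = [step_lr(1e-3, epoch) for epoch in range(40)]
```

With `gamma=0.1` and 40 epochs the rate is reduced seven times, so the `gamma=0.9` reading may be the more practical one for a run of that length.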

4.1 KITTI Depth Completion

Method          iRMSE   iMAE     RMSE     MAE
S2D (gd)         2.80   1.21   814.73  249.95
NConv-CNN (gd)   2.60   1.03   829.98  233.26
MSFF-Net         2.63   1.07   836.69  241.54
HMS-Net_v2       2.73   1.13   841.78  253.47
CSPN             2.93   1.15  1019.64  279.46
MsCNN            3.62   1.36  1034.39  301.15
Morph-Net        3.84   1.57  1045.45  310.49
DFN (ours)       3.62   1.79  1206.66  429.93
NN+CNN2         12.80   1.43  1208.87  317.76
IP-Basic         3.78   1.29  1288.46  302.60
S2D (w/o gt)     4.07   1.57  1299.85  350.32
ADNN            59.39   3.19  1325.37  439.48
NN+CNN           3.25   1.29  1419.75  416.14
SparseConvs      4.94   1.78  1601.33  481.27
NadarayaW        6.34   1.84  1852.60  416.77
Table 1: KITTI Depth Completion benchmark: Root mean square error (RMSE) and Mean absolute error (MAE) are in millimeters, while inverse RMSE and inverse MAE are in (1/kilometer).

The ground truth depth provided in the KITTI Depth Completion dataset is created by merging 11 LiDAR scans from frames before and after a given frame using pose estimates provided in the dataset [41]. These projected 3D points are refined using stereo depth estimation algorithms to discard outliers. During evaluation, the final scores are based only on these refined ground truth LiDAR points.
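The four benchmark metrics can be computed as in the sketch below, which assumes depth maps in metres with 0 marking pixels without ground truth and strictly positive predictions at valid pixels. The unit conversions follow the caption of Table 1 (millimetres for RMSE and MAE, 1/km for the inverse metrics); the function name is hypothetical and this is not the official evaluation code.

```python
import numpy as np

def kitti_metrics(pred_m, gt_m):
    """RMSE/MAE in mm and iRMSE/iMAE in 1/km over pixels with ground truth."""
    valid = gt_m > 0
    pred, gt = pred_m[valid], gt_m[valid]
    err_mm = (pred - gt) * 1000.0                  # metres to millimetres
    inv_err_km = (1.0 / pred - 1.0 / gt) * 1000.0  # 1/m to 1/km
    return {
        "rmse_mm": float(np.sqrt(np.mean(err_mm ** 2))),
        "mae_mm": float(np.mean(np.abs(err_mm))),
        "irmse_km": float(np.sqrt(np.mean(inv_err_km ** 2))),
        "imae_km": float(np.mean(np.abs(inv_err_km))),
    }

gt = np.array([[10.0, 0.0], [20.0, 0.0]])   # two pixels with ground truth
pred = np.array([[11.0, 5.0], [20.0, 7.0]])
metrics = kitti_metrics(pred, gt)
```

The inverse metrics weight errors at close range more heavily, since a fixed error in metres changes 1/depth far more for nearby points than for distant ones.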

While this corpus provides a large amount of training data, the available range measurements are typically clustered towards the bottom of the available imagery and are often missing at critical contextual regions such as object boundaries. A consequence is that models trained on this data often produce blurry edges, since neither the available measurements nor the evaluation tools penalize such solutions.

Additionally, the data set does not provide information in distant regions like the sky, and many previous approaches involve cropping out regions where no LiDAR data is available. In contrast, we seek to preserve as much contextual information as possible and make depth predictions across as much of the image as possible using all available data.

4.1.1 Quantitative Comparison

The performance of our approach is shown in Table 1, which shows that our method is competitive with the current state of the art. Our method achieves an RMSE of 1206.66 mm, while the current state of the art achieves 814.73 mm [33] (Table 1). We note that this method makes use of information from multiple consecutive frames during the training process while we do not. We are also outperformed by MSFF-Net [43], HMS-Net [17], CSPN [5] and MorphNet [7], but we believe that our model is able to better incorporate RGB image information to generate edge preserving and semantically smooth depth images at the cost of a small loss in metric accuracy. We highlight this in Fig 5, where our method uses contextual information to preserve semantic boundaries as well as or better than methods that outperform us on the benchmark.

4.1.2 Learning to extrapolate with limited ground truth data

We validate the effectiveness of our stereo based loss term by comparing our model with and without this term. Quantitatively the improvements are minimal: the model trains faster and reaches slightly improved accuracy. Qualitatively, however, we noticed that our network can now extrapolate depth values in regions where no input LiDAR scan or ground truth exists. This is especially useful in datasets such as KITTI, where the ground truth information is semi-dense with significant regions of the image missing ground truth LiDAR points. In Figure 4 we show a qualitative comparison of our network demonstrating its ability to extrapolate beyond the range of the LiDAR scans.

4.2 Virtual KITTI

We evaluate our network on the Virtual KITTI dataset [12]. This dataset contains roughly 21k image and depth frames generated in virtual worlds with simulated lighting and weather conditions, in a driving dataset similar to KITTI . The maximum depth range for this dataset is (sky), but for simplicity and similarity to our previous dataset, we set our perception limit to and train our model accordingly. We use 60% of this data as our training set and evaluate our model on the remaining images. To generate sparse depth input, we randomly sample 10% of the ground truth depth data uniformly. We apply the same input filling step as in the previous dataset, using the same parameters and morphological window sizes. We then pass this filled depth image along with the RGB image to our network and evaluate our accuracy in the - range.
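Generating the sparse input by uniform random sampling can be sketched as follows. The 10% fraction comes from the text; the seed, the toy ground truth map and the convention that 0 marks a missing value are assumptions of this example.

```python
import numpy as np

def sample_sparse(gt, fraction=0.1, seed=0):
    """Keep a uniformly random fraction of the valid ground truth pixels."""
    rng = np.random.default_rng(seed)
    valid = np.flatnonzero(gt > 0)
    count = int(round(fraction * valid.size))
    keep = rng.choice(valid, size=count, replace=False)
    sparse = np.zeros_like(gt)
    sparse.flat[keep] = gt.flat[keep]
    return sparse

gt = np.linspace(1.0, 50.0, 200).reshape(10, 20)  # toy dense ground truth
sparse = sample_sparse(gt, fraction=0.1)
```

The resulting map has the same shape as the ground truth, with the retained measurements in place and zeros elsewhere, ready for the hole filling step.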

Figure 6: Virtual KITTI Results (L-R,T-B): Color image, input sparse depth (10% randomly selected ground truth points) image with filled depth after pre-processing, prediction and ground truth

While the Virtual KITTI dataset is not an accurate representation of real life data, we show that our method is able to learn to accurately generate dense depth images, while preserving edges and contextual information. Figure 6 shows our results on this dataset. We achieve an RMSE of and MAE of on our validation set.

4.3 NYU Depth V2

In our evaluation on the NYUDepthV2 dataset [38], we use the 1449 densely labelled pairs of aligned RGB and depth images, and split our dataset into 70% training and 30% validation. All our errors are reported on the 30% validation set and we compare our errors against the errors reported by other authors in their respective papers [5, 34, 29]. We use the full resolution 640×480 images as our input and use the same method of subsampling as above to generate sparse input depth measurements from the ground truth. We use this dataset to verify that our model is able to learn in different environments using different sources of input data, since here a Kinect RGBD sensor is used to collect data in various common environments such as offices and homes.

Table 2 shows the performance of our model at multiple levels of sparsity compared to the work of Ma et al. and Liao et al. at 200 samples [34, 29]. Our approach performs comparably to the approach of Ma et al. and better than that of Liao et al. Sample depth prediction results are shown in Figure 8. We use the same morphological window size and operations as in the Virtual KITTI and KITTI datasets and our method is able to generate accurate results even with noisy input filling. Again the filling process helps us by removing all zeros in the depth image and providing a reasonable initialization, but the final depth prediction is based on the combined features from the RGB and depth branches of the network. It must be noted that the results reported here were computed using a different randomly chosen set of samples, so a direct comparison would be unfair.

Method - number of depth samples   RMSE    REL     δ<1.25 (%)  δ<1.25² (%)  δ<1.25³ (%)
DFuseNet (ours) - 200              0.2966  0.0609  95.88       99.27        99.82
DFuseNet - 500                     0.2195  0.0441  98.04       99.70        99.93
DFuseNet - 1k                      0.1759  0.0371  98.78       99.82        99.96
Cheng et al. (rgbd) [5] - 500      0.117   0.016   99.2        99.9         100.0
Ma et al. (rgbd) [34] - 200        0.230   0.044   97.1        99.4         99.8
Liao et al. (rgbd) [29] - 225      0.442   0.104   87.8        96.4         98.9
Table 2: NYUDepthV2 Results: Comparisons are made to the errors reported by respective authors. Note: the authors use different training and validation sample sets, and errors here were not computed on the same data.
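For reference, the metrics usually reported on this dataset can be computed as in the sketch below: RMSE, mean absolute relative error (REL) and the percentage of pixels whose ratio max(pred/gt, gt/pred) falls below 1.25, 1.25² and 1.25³. These are the definitions commonly used in the depth estimation literature and are assumed here; the function name is hypothetical.

```python
import numpy as np

def nyu_metrics(pred, gt):
    """RMSE, REL and threshold accuracies (in percent) over valid pixels."""
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    ratio = np.maximum(p / g, g / p)
    return {
        "rmse": float(np.sqrt(np.mean((p - g) ** 2))),
        "rel": float(np.mean(np.abs(p - g) / g)),
        "delta1": float(100.0 * np.mean(ratio < 1.25)),
        "delta2": float(100.0 * np.mean(ratio < 1.25 ** 2)),
        "delta3": float(100.0 * np.mean(ratio < 1.25 ** 3)),
    }

gt = np.array([2.0, 4.0, 1.0, 0.0])    # last pixel has no ground truth
pred = np.array([2.0, 4.8, 1.5, 9.0])
scores = nyu_metrics(pred, gt)
```

In this toy case two of the three valid pixels are within a factor of 1.25 of the ground truth and all three are within 1.25², so the first threshold accuracy is about 66.7% and the second is 100%.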

4.4 Number of depth samples

For this experiment, we use the NYUDepthV2 dataset as we are provided with dense ground truth information resulting in more consistent accuracy results. We train a different model for every sample size, limiting the training time to a fixed number of epochs each. We initialize our model with weights learned from our KITTI Depth Completion dataset to reduce our training time. We evaluate RMSE values on our validation set and a plot of this can be seen in Figure 7. As previously observed by Ma et al. in their network, the performance gained by adding more sparse input samples tends to saturate. We notice a saturation at around 5000 depth samples, roughly 1.7% of the image resolution. Qualitatively we can see in Fig 8 that even with an extremely sparse input sample set, the RGB branch of our network is able to guide the depth prediction using mostly image based contextual cues.
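The quoted percentage follows from the 640×480 input resolution stated in Section 4.3. A small helper (hypothetical name) makes the arithmetic explicit:

```python
def sample_fraction(num_samples, width=640, height=480):
    """Depth samples as a percentage of the total number of image pixels."""
    return 100.0 * num_samples / (width * height)

saturation_point = sample_fraction(5000)  # about 1.63, i.e. roughly 1.7%
sparsest_setting = sample_fraction(200)   # about 0.065
```

Even the saturation point therefore corresponds to fewer than 2 in every 100 pixels carrying a depth measurement.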

Figure 7: Number of input depth samples vs RMSE on NYUDepthV2 as a percentage of total image resolution. Note the use of log scale in the X-axis.

5 Discussion and Conclusion

Figure 8: NYUDepthV2 Results with 1000 input depth samples (first row) and 500 input samples (second row). We use the same depth filling parameters as in our previous datasets; L-R: Input image, filled input depth, prediction and ground truth.

In this section we discuss our observations and the motivation behind our design decisions in the context of datasets such as KITTI and NYU Depth.

5.1 Architecture

Jaritz et al. talk briefly about the benefits of a late fusion architecture over an early fusion one [19]. We agree with their statement and reaffirm the belief that given the different representations of RGB and depth modalities, the correct way to jointly combine this information is by learning to first transform it into a common feature space. While previous work has proposed single path architectures, where RGB and the sparse depth channels are concatenated into a single 4-channel input and passed to a network [34], we propose the use of a number of individual and independent convolution and pyramid pooling operators on the individual modalities in a dual branch manner. We experimented with implementations where both modalities were fused prior to the SPP blocks and noticed a drop in performance, hinting that the additional independent information learned was useful to the final fusion and prediction. Figure 3 shows the information gained from having two branches in our network.

In terms of input sparsity, we experimented with replacing all our convolutions in the depth branch with sparse convolutions [41] but noticed a significant drop in performance. Huang et al. propose the use of additional sparse operations such as sparsity invariant upsampling, addition and concatenation in addition to convolution and were able to achieve much better results [17]. However, we are more inclined to believe that desirable performance can be achieved with the use of regular convolutions and operations for multi-modal input with simple pre-processing hole filling operations such as morphological filters, fill maps and nearest neighbor interpolation [7, 25, 4]. This is simple and effective in providing the network with a good initialization.

We did notice, however, that with hole filling pre-processing steps, care must be taken in the use of residual connections from the depth channels to the penultimate layers. We found that using a residual connection from the second and third layers of our depth channel to the penultimate layer of our deconvolution layers led to similar accuracy as IPBasic [25], but the network failed to learn to use information from the RGB branch. Perhaps such a network must be trained more carefully, with carefully selected hyperparameters. However, in a single channel or early fusion network, adding residual connections from the depth input to the final layers has been shown to be highly effective [33, 5].

5.2 Generalization

We also test the model we trained for the KITTI Depth Completion benchmark on Virtual KITTI and NYUDepthV2. Figure 9 shows a few examples of predictions made using our KITTI Depth Completion model on Virtual KITTI and NYUDepthV2. Qualitatively, we noticed that the network is able to use sufficient RGB cues to generate semantically valid depth predictions, but quantitatively we noticed errors in predicted depth values. This is due to the difference in depth scale across the three datasets (KITTI maximum depth is , Virtual KITTI is and NYUDepth is roughly ) and is quickly corrected after minimal training using the KITTI model as an initialization. We believe that the separate RGB image branch and the density of features learned independently in this branch are a large contributor to the generalization of this network. Our model trained on KITTI is able to achieve an RMSE value of and MAE of on our NYUDepthV2 test set using 10% of ground truth as input depth samples.

Figure 9: Generalization across datasets: Our model trained on the KITTI dataset is able to generalize to new datasets such as Virtual KITTI (left) and NYUDepthV2 (right). No retraining or fine-tuning was performed. From top to bottom: the color image, predicted depth image and the ground truth.

5.3 Conclusion

We have proposed a CNN architecture that can be used to upsample sparse range data using the available high resolution intensity imagery. Our architecture is designed to extract contextual cues from the image to guide the upsampling process which leads to a network with separate branches for the image and depth data. We have demonstrated its performance on relevant datasets and shown that the approach appears to capture salient cues from the image data and produce upsampled depth results that respect relevant image boundaries and correlate well with the available ground truth.