Dense depth estimation is a critical component in autonomous driving, robot navigation and augmented reality. Popular sensing schemes in these domains involve a high resolution camera and a low resolution depth sensor such as a LiDAR or Time-of-Flight sensor. The density of points returned from commonly available depth sensors is typically an order of magnitude lower than the resolution of the camera image. Additionally, higher resolution variants of these sensors are expensive, making them impractical for most applications. However, a number of applications such as planning, obstacle avoidance and fine-grained layout estimation can benefit from higher resolution range data. This motivates us to consider approaches that up-sample the sparse available depth measurements to the resolution of the available imagery.
Traditionally, interpolation and diffusion based schemes have been used to up-sample these sparse points into a smooth dense depth image, often using the corresponding color image as a guide. Convolutional neural networks (CNNs) have had tremendous success in depth estimation tasks using monocular image data [30, 3, 13, 51, 8, 26, 27, 23, 28, 11, 42, 52, 47], stereo image data [39, 40, 2, 21, 32, 50, 46, 20] and sparse depth data on its own [41, 34, 25, 9, 6].
Monocular depth estimation networks have been proposed that learn to extrapolate depth information from image data alone, and recent methods show that such strategies can even be carried out in an unsupervised manner [13, 52, 47]. However, the depth estimates generated from such networks are less accurate and often relative, making them impractical for navigation and planning. While the work from this field is relevant to us, we focus on the depth completion task, where sparse depth data is available and can be used with high resolution image data to produce accurate dense depth images.
One way to view both the monocular depth prediction problem and the depth completion problem is in terms of a posterior distribution $P(D \mid I)$, which represents the probability of a given depth image, $D$, given an input intensity image, $I$. In both cases the approaches implicitly assume that the resulting posterior distribution is highly concentrated along a low dimensional manifold, which makes it possible to infer the complete depth map from relatively few depth samples.
We wish to design a CNN architecture that can learn sufficient global and contextual information from the color images and use this information along with sparse depth input to accurately predict depth estimates for the entire image, while enforcing edge preservation and smoothness constraints. Once designed, such a network could be used to upsample information from a variety of depth sensors including LIDAR systems, stereo algorithms or structure from motion algorithms. To summarize, we propose the following contributions:
- A CNN architecture that uses a dual branch design, spatial pyramid pooling layers and a sequence of multi-scale deconvolutions to effectively exploit contextual cues from the input color image and the available depth measurements.
- A training regime that makes use of different sources of information, such as stereo imagery, to learn how to extrapolate depth effectively in regions where no depth measurements are available.
- An evaluation of our methods on the KITTI Depth Completion Benchmark (http://www.cvlibs.net/datasets/kitti/eval_depth.php) which shows that our strategy is among the top performing algorithms in the benchmark. We also evaluate our algorithm on two other datasets, Virtual KITTI and NYUDepth, and show that our method is able to generalize well across the three different datasets.
2 Related Work
Depth Estimation: Monocular depth estimation is an active research field where CNN based methods are currently the state of the art. Different methods have been proposed that use supervised [8, 30, 3, 23, 26, 51], unsupervised and self-supervised depth estimation strategies. At the time of this writing, the best performing monocular depth estimation algorithm is from Fu et al., achieving an inverse RMSE score of 12.98 on the KITTI depth prediction dataset. The authors propose an ordinal regression based method of predicting depth values, as they state that modelling depth estimation as a regression problem results in slow convergence and unsatisfactory local solutions. Li et al. also discretize the depth prediction problem by formulating it as a classification problem.
CNNs have been successfully used in dense stereo depth estimation tasks. Zbontar et al. proposed a siamese network architecture to learn a similarity measure between two input patches. This similarity measure is then used as a matching cost input for a traditional stereo pipeline. Recently, many end-to-end methods have been proposed that are able to generate accurate disparity images while preserving edges [20, 39, 46, 21]. Of these, the work of Chang et al. is most similar to the network we propose: the authors present an end-to-end approach that uses spatial pyramid pooling to better learn global image dependent features.
Incomplete Input Data: Learning dense representations from sparse input is similar to the domain of super resolution and in-painting. Super resolution assumes that the input is a uniformly sub-sampled representation of the desired high resolution output, and the learning problem can be posed as an edge preserving interpolation strategy. A comprehensive review of these methods is presented by Yang et al. We note that multi-scale architectures with multiple skip connections have been successfully used for image and depth upsampling tasks [44, 18]. Content-aware completion is motivated by a similar problem of learning complete representations from incomplete input data. Image in-painting requires semantically aware completion of missing input regions. Generative networks have been used successfully for context aware image completion tasks [48, 49] but are outside the scope of this paper. However, Liu et al. propose a method relevant to our problem, where partial convolutions are used to effectively complete large irregular missing regions in the input image [31, 22].
Depth Completion: A particular sub-problem of depth estimation with incomplete input data is depth completion. Following the release of the KITTI depth completion benchmark, novel approaches to solve the problem have been proposed. Uhrig et al., the authors of the benchmark, propose a sparsity invariant CNN architecture, using partial normalized convolutions on the input sparse depth image. They propose multiple architectures, to accommodate RGB information and sparse depth input only. Huang et al. propose HMSNet, which uses masked operations on the partial convolutions such as partial summation, up-sampling and concatenation.
Schneider et al. and Jaritz et al. propose the use of semantic information to help improve the depth completion problem [37, 19]. Jaritz et al. noted the saturation of convolution masks when using partial convolution based architectures. Depending on the input density, the masks often become completely dense after three to four layers. They choose to not use sparse convolutions and report no loss of accuracy.
Ku et al. propose a non-learning based approach to this problem to highlight the effectiveness of well crafted classical methods, using only commonly available morphological operations to produce dense depth information. Their proposed method currently outperforms multiple deep learning based methods on the KITTI depth completion benchmark. Dimitrievski et al. propose a CNN architecture which uses the work of Ku et al. as a pre-processing step on the sparse depth input. We followed a similar strategy and chose to fill in our sparse input depth image instead of using sparse convolutions. Their network is designed to use traditional morphological operators as well as subsequently learned morphological filters using a U-Net style architecture. They are able to achieve better quantitative results but their model fails to preserve semantic and depth discontinuities as it relies heavily on the filled depth image for their final output. Eldesokey et al. propose a method that also uses normalized masked convolutions, but generates confidence values for each predicted depth by using a continuous confidence mask instead of a binary mask. Cheng et al. propose a depth propagation network to explicitly learn an affinity function and apply it to the depth completion problem.
Wang et al. propose a multi-scale feature fusion method for depth completion using sparse LiDAR data. Ma et al. propose two methods, a supervised method for depth completion using a ResNet based architecture and a self-supervised method which is currently the top performing depth completion algorithm on the KITTI depth completion benchmark. Their proposed self-supervised method uses the sparse LiDAR input along with pose estimates to add additional training information based on depth and photometric losses.
3.1 Design Overview
We propose the following CNN architecture (Fig 1), which has been structured to learn local to global context information from both the color image and the sparse depth data and to fuse them together to produce accurate and consistent dense depth maps. We propose a dual branch encoder design in a similar fashion to previous image comparison networks. Given the differences in input modality provided to the two branches, we choose to not use Siamese networks with coupled weights, and use independent branches instead with different design decisions made for each branch. In our encoder, we use spatial pyramid pooling (SPP) blocks to learn a coarse-to-fine representation of features. Spatial pyramid pooling blocks have been effective in learning local to global context information and have been successfully used in depth perception tasks. We concatenate features learned from individual branches and propagate these features through our de-convolution layers. The final layer performs a convolution operation on features combined from different de-convolution layers, up-sampled to the final output resolution, to utilize information from different scales and context to generate the final depth image.
3.2 Feature Extraction
Our color and depth branches begin with an initial depth filling step, similar to the approach of Ku et al. We use a simple sequence of morphological operations and Gaussian blurring operations to fill the holes in the sparse depth image with depth values from nearby valid points such that no holes remain. The filled depth image is then normalized by the maximum depth value in the dataset, resulting in depth values between 0 and 1, and passed to the feature extraction branch. For the depth image, we choose to use larger kernel sizes and fewer convolution operations, resulting in fewer layers. For the color image, we use smaller kernel sizes and make use of four residual blocks, in addition to two initial convolution layers. The output of these initial feature extraction layers is then passed to spatial pyramid pooling (SPP) blocks. We use the same SPP block structure as proposed by Chang et al., but use max pooling for our depth branch and average pooling for our color branch. An illustration of our SPP block is shown in Fig 2. Our pooling windows are consistent between the two branches and are 64, 32, 16 and 8 for each scale respectively. The output of this layer is an up-sampled stack of feature layers carrying information from different scales.
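The filling and normalization step can be illustrated with a minimal sketch. It is not the pipeline of Ku et al. or the one used in our experiments: it replaces the full sequence of morphological and Gaussian operations with a single repeated 3x3 neighbourhood pass, and it assumes that a value of 0 marks a missing measurement. The function name `fill_sparse_depth` and the choice of keeping the closest valid neighbour are assumptions made for illustration.

```python
def fill_sparse_depth(depth, max_depth, window=1):
    """Fill holes (zeros) in a sparse depth image and normalize to [0, 1].

    depth: list of rows of floats, 0.0 marks a missing measurement (assumption).
    max_depth: normalization constant, e.g. the maximum depth in the dataset.
    window: half-size of the square neighbourhood (1 gives a 3x3 window).
    """
    h, w = len(depth), len(depth[0])
    filled = [row[:] for row in depth]
    if not any(v > 0 for row in filled for v in row):
        raise ValueError("depth image has no valid measurements")

    # Repeat a dilation-like pass until no holes remain.
    while any(v == 0 for row in filled for v in row):
        nxt = [row[:] for row in filled]
        for y in range(h):
            for x in range(w):
                if filled[y][x] > 0:
                    continue
                neighbours = [
                    filled[j][i]
                    for j in range(max(0, y - window), min(h, y + window + 1))
                    for i in range(max(0, x - window), min(w, x + window + 1))
                    if filled[j][i] > 0
                ]
                if neighbours:
                    # Keep the closest surface, an illustrative choice.
                    nxt[y][x] = min(neighbours)
        filled = nxt

    # Clamp and normalize so that every value lies in [0, 1].
    return [[min(v, max_depth) / max_depth for v in row] for row in filled]


sparse = [[0.0, 0.0, 0.0],
          [0.0, 10.0, 0.0],
          [0.0, 0.0, 40.0]]
dense = fill_sparse_depth(sparse, max_depth=80.0)
```

The filled image is only a coarse initialization; Gaussian smoothing is omitted here for brevity.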
3.3 Combining Modalities
The features from the previous extraction modules are then concatenated into one volume. The first layer is an intermediate output of the residual blocks from the color branch, which we hypothesize can carry over high level features learned from the color image. The subsequent layers are color and depth features extracted from the SPP blocks of the two branches. We believe that these layers can help learn a joint feature representation between the two input modalities in the following layers. We perform three sequential convolution operations on this volume, reducing the number of channels and doubling its spatial resolution. By forcing a reduction in channels we attempt to force the network to learn a lower dimensional representation of the joint feature space, combining important information from both depth and color branches.
3.4 Depth Prediction
The following layers perform a sequence of convolutions with batch normalization, and incremental de-convolutions to restore the original image resolution. The final step involves concatenating different layer outputs from the de-convolution pipeline, up-sampling by interpolation to achieve the original input resolution and then performing a final convolution on the multi-scale stack to produce a single channel output. This output is then passed to a sigmoid activation function and re-scaled to the original range of depth values. Odena et al. advise caution in the use of transposed convolutions for spatial upsampling, hence we limit the use of transposed convolutions. Our final output is the result of a 1x1 convolution on a feature volume that mainly consists of interpolated low resolution features, which minimizes the checkerboard effect in the final depth image.
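To make the last step concrete, the sketch below treats the 1x1 convolution at a single pixel as a weighted sum over the channels of the multi-scale stack, followed by a sigmoid and a re-scaling to metric depth. The weights, bias and the `max_depth` value of 80 are hypothetical values chosen for illustration, not parameters of the trained network.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_pixel_depth(features, weights, bias, max_depth):
    """Apply a 1x1 convolution (per-pixel dot product), sigmoid and re-scaling.

    features: channel values of the multi-scale stack at one pixel.
    weights, bias: hypothetical parameters of the final 1x1 convolution.
    max_depth: assumed depth range used to undo the [0, 1] normalization.
    """
    logit = sum(f * w for f, w in zip(features, weights)) + bias
    return max_depth * sigmoid(logit)


depth_m = predict_pixel_depth([1.0, 2.0], [0.5, -0.25], bias=0.0, max_depth=80.0)
```

Because the sigmoid output lies between 0 and 1, the prediction is bounded by the chosen depth range.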
Our training signal is a weighted average of multiple loss terms, some calculated over the entire image resolution and some calculated only at points where accurate ground truth depth exists. The weights of the individual terms are chosen based on a confidence associated with each signal and are varied at different points in time in the training regime.
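One plausible way to write this combination is sketched below. The symbols $w_1$, $w_2$, $w_3$ and the names of the three loss terms are assumptions introduced here for illustration, and the terms are listed in the order in which the primary, stereo and smoothness losses are described in the text.

```latex
\begin{equation}
\mathcal{L}_{\text{total}}
  = w_1 \, \mathcal{L}_{\text{primary}}
  + w_2 \, \mathcal{L}_{\text{stereo}}
  + w_3 \, \mathcal{L}_{\text{smooth}}
\end{equation}
```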
3.5.1 Primary Loss
We experimented with both L1 and L2 norms as primary loss functions. For this term, we calculate the loss only at pixels where ground truth depth exists and average over the total number of ground truth points. For better RMSE values on evaluation benchmarks we found L2 to be the better choice as a primary loss term.
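A minimal sketch of such a masked loss is given below. It assumes that pixels without ground truth are stored as 0, which is a convention chosen for this example; the function name `masked_loss` is likewise hypothetical.

```python
def masked_loss(pred, target, norm="l2"):
    """Average L1 or L2 error over pixels that have ground truth depth.

    pred, target: images given as lists of rows of floats.
    A target value of 0 is assumed to mean "no ground truth here".
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t > 0:
                diff = p - t
                total += diff * diff if norm == "l2" else abs(diff)
                count += 1
    # Average over the number of ground truth points only.
    return total / count if count else 0.0


pred = [[1.0, 5.0], [3.0, 2.0]]
target = [[0.0, 4.0], [1.0, 0.0]]
l2 = masked_loss(pred, target, norm="l2")
l1 = masked_loss(pred, target, norm="l1")
```

Only two of the four pixels contribute in this example, since the other two have no ground truth.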
3.5.2 Optional Stereo Supervision
Since Uhrig et al. provide a large dataset with data from multiple cameras, we propose a means of making better use of this data during training. The KITTI depth completion dataset provides roughly 42k stereo image pairs, and we use these images to provide depth information at points where ground truth LiDAR data is missing. We propose an auxiliary loss term that uses the stereo input image pair to generate a dense depth estimate that can guide the learning process in regions where no ground truth LiDAR measurements exist. We compute this loss term in a self-supervised manner since stereo intrinsics and extrinsics are known. We use Semi Global Matching to generate this dense depth estimate, since this algorithm can be run in real-time with GPU acceleration. This loss term is an L2 norm of the difference between the predicted depth and the stereo estimated depth. This term can be computed at almost every pixel in the input image. Some pixels lack depth estimates since we use left-right consistency checks to discard noisy and partially occluded depth estimates.
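The sketch below illustrates how a left-right consistency check and the known stereo geometry could turn disparity maps into a partially dense depth target. It assumes rectified images and disparity maps that have already been computed by some stereo matcher; the function name, the tolerance of one pixel and the focal length and baseline values in the usage lines are hypothetical.

```python
def lr_consistent_depth(disp_left, disp_right, focal_px, baseline_m, tol=1.0):
    """Convert left disparities to depth, discarding inconsistent pixels.

    disp_left, disp_right: disparity maps (lists of rows), 0 means no match.
    focal_px, baseline_m: focal length in pixels and stereo baseline in metres.
    Rejected or unmatched pixels are returned as 0.0 (no depth estimate).
    """
    depth = []
    for row_l, row_r in zip(disp_left, disp_right):
        out = []
        for x, d in enumerate(row_l):
            # In a rectified pair, left pixel x corresponds to right pixel x - d.
            xr = int(round(x - d))
            consistent = (
                d > 0
                and 0 <= xr < len(row_r)
                and abs(row_r[xr] - d) <= tol
            )
            out.append(focal_px * baseline_m / d if consistent else 0.0)
        depth.append(out)
    return depth


disp_l = [[0.0, 0.0, 2.0, 2.0]]
disp_r = [[2.0, 5.0, 0.0, 0.0]]
stereo_depth = lr_consistent_depth(disp_l, disp_r, focal_px=700.0, baseline_m=0.5)
```

The resulting map can be compared with the prediction using a masked L2 loss, with the zero entries excluded.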
We add a smoothness loss term, which is an L1 norm on the second order derivative of the predicted dense image, similar to the strategy used in unsupervised monocular depth estimation and structure from motion networks [13, 52, 42].
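As a rough illustration, the second order derivative can be approximated with central finite differences along rows and columns. The sketch below averages the absolute values of these differences; the averaging and the omission of any image-based edge weighting are simplifying assumptions.

```python
def smoothness_loss(depth):
    """Mean absolute second-order finite difference of a depth image.

    depth: list of rows of floats. Border pixels without two neighbours
    along a given axis are skipped for that axis.
    """
    h, w = len(depth), len(depth[0])
    total, count = 0.0, 0
    # Horizontal second differences: d[x-1] - 2 d[x] + d[x+1].
    for y in range(h):
        for x in range(1, w - 1):
            total += abs(depth[y][x - 1] - 2.0 * depth[y][x] + depth[y][x + 1])
            count += 1
    # Vertical second differences.
    for y in range(1, h - 1):
        for x in range(w):
            total += abs(depth[y - 1][x] - 2.0 * depth[y][x] + depth[y + 1][x])
            count += 1
    return total / count if count else 0.0


ramp = [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
spike = [[0.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
```

A planar ramp incurs no penalty, while an isolated spike is penalized, which is the behaviour a second order term is meant to encourage.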
All our networks were implemented in PyTorch (http://pytorch.org) and we train them from scratch, not using pre-trained weights for any layers. Our models are trained using the ADAM optimizer, and we typically use batch sizes of 20-25 for our experiments and train for roughly 40 epochs for all experiments. We use an initial learning rate of, and drop our learning rate by a factor of 10% after every 5 epochs. We use a weight decay term of . The first weight term from Eq 1 is usually set to 1, the second is 0.01 and the third is 0.001.
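A step schedule of this kind can be sketched as follows. The initial learning rate is passed in as a parameter and is not a value taken from our experiments. The decay factor is also left as a parameter, since the description could be read either as multiplying the rate by 0.1 or as reducing it by 10% at each step; the default of 0.1 is only an assumption.

```python
def step_learning_rate(initial_lr, epoch, step_size=5, factor=0.1):
    """Learning rate after a given number of completed epochs (0-indexed).

    The rate is multiplied by `factor` once every `step_size` epochs.
    initial_lr and factor are placeholders, not values from the paper.
    """
    if epoch < 0 or step_size <= 0:
        raise ValueError("epoch must be >= 0 and step_size must be > 0")
    return initial_lr * factor ** (epoch // step_size)


# Hypothetical schedule over 40 epochs with a placeholder initial rate of 1.0.
schedule = [step_learning_rate(1.0, epoch, factor=0.5) for epoch in range(40)]
```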
4.1 KITTI Depth Completion
The ground truth depth provided in the KITTI Depth Completion dataset is created by merging 11 LiDAR scans from frames before and after a given frame using pose estimates provided in the dataset. These projected 3D points are refined using stereo depth estimation algorithms to discard outliers. During evaluation, the final scores are based only on these refined ground truth LiDAR points.
While this corpus provides a large amount of training data, the available range measurements are typically clustered towards the bottom of the available imagery and are often missing at critical contextual regions such as object boundaries. A consequence is that models trained on this data often produce blurry edges, since neither the available measurements nor the evaluation tools penalize such solutions.
Additionally, the dataset does not provide information in distant regions like the sky, and many previous approaches involve cropping out regions where no LiDAR data is available. In contrast, we seek to preserve as much contextual information as possible and make depth predictions across as much of the image as possible using all available data.
4.1.1 Quantitative Comparison
The performance of our approach is shown in Table 1, which shows that our method is competitive with the current state of the art. Our method achieves a mean RMSE score of and the current state of the art is . We note that the current state-of-the-art method makes use of information from multiple consecutive frames during the training process while we do not. We are also outperformed by MSFF-Net, HMS-Net, CSPN and MorphNet, but we believe that our model is able to better incorporate RGB image information to generate edge preserving and semantically smooth depth images at the cost of a small loss in metric accuracy. We highlight this in Fig 5, where our method uses contextual information to preserve semantic boundaries as well as or better than methods that outperform us on the benchmark.
4.1.2 Learning to extrapolate with limited ground truth data
We validate the effectiveness of our stereo based loss term by comparing our model with and without this term. Quantitatively the improvements are minimal: the model trains faster and results in slightly improved accuracy. Qualitatively, however, we noticed that our network can now extrapolate depth values in regions where no input LiDAR scan or ground truth exists. This is especially useful in datasets such as KITTI where the ground truth information is semi-dense, with significant regions of the image missing ground truth LiDAR points. In Figure 4 we show a qualitative comparison of our network demonstrating its ability to extrapolate beyond the range of the LiDAR scans.
4.2 Virtual KITTI
We evaluate our network on the Virtual KITTI dataset. This dataset contains roughly 21k image and depth frames generated in virtual worlds with simulated lighting and weather conditions, in a driving dataset similar to KITTI. The maximum depth range for this dataset is (sky), but for simplicity and similarity to our previous dataset, we set our perception limit to and train our model accordingly. We use 60% of this data as our training set and evaluate our model on the remaining images. To generate sparse depth input, we randomly sample 10% of the ground truth depth data uniformly. We apply the same input filling step as in the previous dataset, using the same parameters and morphological window sizes. We then pass this filled depth image along with the RGB image to our network and evaluate our accuracy in the - range.
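The uniform sub-sampling of the ground truth can be sketched as below. The function name, the fixed seed and the convention that 0 marks a missing value are assumptions made for this example.

```python
import random


def sample_sparse_depth(dense, fraction=0.10, seed=0):
    """Keep a uniformly random fraction of the valid depth pixels.

    dense: ground truth depth as a list of rows, 0 marks invalid pixels.
    Returns an image of the same size with all other pixels set to 0.
    """
    rng = random.Random(seed)
    valid = [
        (y, x)
        for y, row in enumerate(dense)
        for x, value in enumerate(row)
        if value > 0
    ]
    keep = rng.sample(valid, int(round(fraction * len(valid))))
    sparse = [[0.0] * len(row) for row in dense]
    for y, x in keep:
        sparse[y][x] = dense[y][x]
    return sparse


ground_truth = [[5.0] * 10 for _ in range(10)]
sparse_input = sample_sparse_depth(ground_truth, fraction=0.10)
```

The sparse image produced this way would then go through the same hole filling step as the LiDAR input.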
While the Virtual KITTI dataset is not an accurate representation of real life data, we show that our method is able to learn to accurately generate dense depth images, while preserving edges and contextual information. Figure 6 shows our results on this dataset. We achieve an RMSE of and MAE of on our validation set.
4.3 NYU Depth V2
In our evaluation on the NYUDepthV2 dataset, we use the 1449 densely labelled pairs of aligned RGB and depth images, and split our dataset into 70% training and 30% validation. All our errors are reported on the 30% validation set and we compare our errors against the errors reported by other authors in their respective papers [5, 34, 29]. We use the full resolution 640×480 images as our input and use the same method of subsampling as above to generate sparse input depth measurements from the ground truth. We use this dataset to verify that our model is able to learn in different environments using different sources of input data, since here a Kinect RGBD sensor is used to collect data in various common environments such as offices and homes.
Table 2 shows the performance of our model at multiple levels of sparsity compared to the work of Ma et al. and Liao et al. at 200 samples [34, 29]. Our approach performs comparably to the approach of Ma et al. and better than that of Liao et al. Sample depth prediction results are shown in Figure 8. We use the same morphological window size and operations as in the Virtual KITTI and KITTI datasets and our method is able to generate accurate results even with noisy input filling. Again the filling process helps us by removing all zeros in the depth image and providing a reasonable initialization, but the final depth prediction is based on the combined features from the RGB and depth branches of the network. It must be noted that the results reported here were computed using a different randomly chosen set of samples, so a direct comparison would be unfair.
|Method - number of depth samples|RMSE (m)|REL|δ < 1.25 (%)|δ < 1.25² (%)|δ < 1.25³ (%)|
|---|---|---|---|---|---|
|DFuseNet (ours) - 200|0.2966|0.0609|95.88|99.27|99.82|
|DFuseNet - 500|0.2195|0.0441|98.04|99.70|99.93|
|DFuseNet - 1k|0.1759|0.0371|98.78|99.82|99.96|
|Cheng et al. (rgbd) - 500|0.117|0.016|99.2|99.9|100.0|
|Ma et al. (rgbd) - 200|0.230|0.044|97.1|99.4|99.8|
|Liao et al. (rgbd) - 225|0.442|0.104|87.8|96.4|98.9|
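For reference, the error measures in the table can be computed as in the following sketch. It uses the definitions that are common in the depth estimation literature (root mean squared error, mean absolute relative error, and the percentage of pixels whose ratio to the ground truth is below 1.25, 1.25² and 1.25³); that these match the exact evaluation scripts behind each reported number is an assumption.

```python
import math


def depth_metrics(pred, target, threshold=1.25):
    """Compute RMSE, REL and the three delta accuracies in percent.

    pred, target: flattened depth images (lists of floats).
    Pixels with target <= 0 are treated as missing and ignored.
    """
    pairs = [(p, t) for p, t in zip(pred, target) if t > 0]
    n = len(pairs)
    if n == 0:
        raise ValueError("no valid ground truth pixels")

    rmse = math.sqrt(sum((p - t) ** 2 for p, t in pairs) / n)
    rel = sum(abs(p - t) / t for p, t in pairs) / n

    def ratio(p, t):
        # A non-positive prediction can never satisfy the threshold.
        return max(p / t, t / p) if p > 0 else float("inf")

    deltas = [
        100.0 * sum(ratio(p, t) < threshold ** k for p, t in pairs) / n
        for k in (1, 2, 3)
    ]
    return rmse, rel, deltas


rmse, rel, deltas = depth_metrics([2.0, 4.0, 3.0], [2.0, 2.0, 0.0])
```

In this toy example the third pixel has no ground truth and is excluded from all three measures.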
4.4 Number of depth samples
For this experiment, we use the NYUDepthV2 dataset as we are provided with dense ground truth information resulting in more consistent accuracy results. We train a different model for every sample size, limiting the training time to a fixed number of epochs each. We initialize our model with weights learned from our KITTI Depth Completion dataset to reduce our training time. We evaluate RMSE values on our validation set and a plot of this can be seen in Figure 7. As previously observed by Ma et al. in their network, the performance gained by adding more sparse input samples tends to saturate. We notice a saturation at around 5000 depth samples, roughly 1.7% of the image resolution. Qualitatively we can see in Fig 8 that even with an extremely sparse input sample set, the RGB branch of our network is able to guide the depth prediction using mostly image based contextual cues.
5 Discussion and Conclusion
In this section we discuss our observations and the motivation behind our design decisions in the context of datasets such as KITTI and NYU Depth.
Jaritz et al. talk briefly about the benefits of a late fusion architecture over an early fusion one. We agree with their statement and reaffirm the belief that given the different representations of RGB and depth modalities, the correct way to jointly combine this information is by learning to first transform it into a common feature space. While previous work has proposed single path architectures, where RGB and the sparse depth channels are concatenated into a single four-channel input and passed to a network, we propose the use of a number of individual and independent convolution and pyramid pooling operators on the individual modalities in a dual branch manner. We experimented with implementations where both modalities were fused prior to the SPP blocks and noticed a drop in performance, hinting that the additional independent information learned was useful to the final fusion and prediction. Figure 3 shows the information gained from having two branches in our network.
In terms of input sparsity, we experimented with replacing all our convolutions in the depth branch with sparse convolutions but noticed a significant drop in performance. Huang et al. propose the use of additional sparse operations such as sparsity invariant upsampling, addition and concatenation in addition to convolution and were able to achieve much better results. However, we are more inclined to believe that desirable performance can be achieved with the use of regular convolutions and operations for multi-modal input with simple pre-processing hole filling operations such as morphological filters, fill maps and nearest neighbor interpolation [7, 25, 4]. This is simple and effective in providing the network with a good initialization.
We did notice, however, that with hole filling pre-processing steps, care must be taken in the use of residual connections from the depth channels to the penultimate layers. We found that using a residual connection from the second and third layers of our depth channel to the penultimate layer of our deconvolution layers led to accuracy similar to IPBasic, but the network failed to learn to use information from the RGB branch. Perhaps such a network must be more carefully trained with carefully selected hyperparameters. However, in a single channel or early fusion network, adding residual connections from the depth input to the final layers has been shown to be highly effective [33, 5].
We also test the model we trained for the KITTI Depth Completion benchmark on Virtual KITTI and NYUDepthV2. Figure 9 shows a few examples of predictions made using our KITTI Depth Completion model on Virtual KITTI and NYUDepthV2. Qualitatively, we noticed that the network is able to use sufficient RGB cues to generate semantically valid depth predictions, but quantitatively we noticed errors in predicted depth values. This is due to the difference in depth scale across the three datasets (KITTI maximum depth is , virtual kitti is and NYUDepth is roughly ) and is quickly corrected after minimal training using the KITTI model as an initialization. We believe that the separate RGB image branch and the density of features learned independently in this branch are a large contributor to the generalization of this network. Our model trained on KITTI is able to achieve an RMSE value of and MAE of on our NYUDepthV2 test set using 10% of ground truth as input depth samples.
We have proposed a CNN architecture that can be used to upsample sparse range data using the available high resolution intensity imagery. Our architecture is designed to extract contextual cues from the image to guide the upsampling process which leads to a network with separate branches for the image and depth data. We have demonstrated its performance on relevant datasets and shown that the approach appears to capture salient cues from the image data and produce upsampled depth results that respect relevant image boundaries and correlate well with the available ground truth.
- [1] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pages 737–744, 1994.
- [2] J.-R. Chang and Y.-S. Chen. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- [3] W. Chen, Z. Fu, D. Yang, and J. Deng. Single-image depth perception in the wild. In Advances in Neural Information Processing Systems, pages 730–738, 2016.
- [4] Z. Chen, V. Badrinarayanan, G. Drozdov, and A. Rabinovich. Estimating depth from rgb and sparse sensing. arXiv preprint arXiv:1804.02771, 2018.
- [5] X. Cheng, P. Wang, and R. Yang. Depth estimation via affinity learned with convolutional spatial propagation network. In European Conference on Computer Vision, pages 108–125. Springer, 2018.
- [6] N. Chodosh, C. Wang, and S. Lucey. Deep convolutional compressed sensing for lidar depth completion. arXiv preprint arXiv:1803.08949, 2018.
- [7] M. Dimitrievski, P. Veelaert, and W. Philips. Learning morphological operators for depth completion. In International Conference on Advanced Concepts for Intelligent Vision Systems, pages 450–461. Springer, 2018.
- [8] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
- [9] A. Eldesokey, M. Felsberg, and F. S. Khan. Propagating confidences through cnns for sparse data regression. arXiv preprint arXiv:1805.11913, 2018.
- [10] D. Ferstl, C. Reinbacher, R. Ranftl, M. Rüther, and H. Bischof. Image guided depth upsampling using anisotropic total generalized variation. In 2013 IEEE International Conference on Computer Vision, pages 993–1000. IEEE, 2013.
- [11] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [12] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4340–4349, 2016.
- [13] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, volume 2, page 7, 2017.
- [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [15] D. Hernandez-Juarez, A. Chacón, A. Espinosa, D. Vázquez, J. C. Moure, and A. M. López. Embedded real-time stereo estimation via semi-global matching on the gpu. Procedia Computer Science, 80:143–153, 2016.
- [16] H. Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 807–814. IEEE, 2005.
- [17] Z. Huang, J. Fan, S. Yi, X. Wang, and H. Li. Hms-net: Hierarchical multi-scale sparsity-invariant network for sparse depth completion. arXiv preprint arXiv:1808.08685, 2018.
- [18] T.-W. Hui, C. C. Loy, and X. Tang. Depth map super-resolution by deep multi-scale guidance. In European Conference on Computer Vision, pages 353–369. Springer, 2016.
- [19] M. Jaritz, R. De Charette, E. Wirbel, X. Perrotton, and F. Nashashibi. Sparse and dense data with cnns: Depth completion and semantic segmentation. In 2018 International Conference on 3D Vision (3DV), pages 52–60. IEEE, 2018.
- [20] A. Kendall, H. Martirosyan, S. Dasgupta, and P. Henry. End-to-end learning of geometry and context for deep stereo regression. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 66–75. IEEE, 2017.
- [21] S. Khamis, S. Fanello, C. Rhemann, A. Kowdle, J. Valentin, and S. Izadi. Stereonet: Guided hierarchical refinement for real-time edge-aware depth prediction. In European Conference on Computer Vision, pages 596–613. Springer, 2018.
- [22] H. Knutsson and C.-F. Westin. Normalized and differential convolution. In Computer Vision and Pattern Recognition, 1993. Proceedings CVPR'93., 1993 IEEE Computer Society Conference on, pages 515–523. IEEE, 1993.
- [23] S. Kong and C. Fowlkes. Pixel-wise attentional gating for parsimonious pixel labeling. arXiv preprint arXiv:1805.01556, 2018.
- [24] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele. Joint bilateral upsampling. ACM Transactions on Graphics (ToG), 26(3):96, 2007.
- [25] J. Ku, A. Harakeh, and S. L. Waslander. In defense of classical image processing: Fast depth completion on the cpu. arXiv preprint arXiv:1802.00036, 2018.
- [26] B. Li, Y. Dai, and M. He. Monocular depth estimation with hierarchical fusion of dilated cnns and soft-weighted-sum inference. Pattern Recognition, 2018.
- [27] R. Li, K. Xian, C. Shen, Z. Cao, H. Lu, and L. Hang. Deep attention-based classification network for robust depth prediction. arXiv preprint arXiv:1807.03959, 2018.
- [28] Z. Li and N. Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Computer Vision and Pattern Recognition (CVPR), 2018.
- [29] Y. Liao, L. Huang, Y. Wang, S. Kodagoda, Y. Yu, and Y. Liu. Parse geometry from a line: Monocular depth estimation with partial laser observation. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 5059–5066. IEEE, 2017.
- [30] F. Liu, C. Shen, G. Lin, and I. D. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell., 38(10):2024–2039, 2016.
- [31] G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro. Image inpainting for irregular holes using partial convolutions. arXiv preprint arXiv:1804.07723, 2018.
- [32] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5695–5703, 2016.
- [33] F. Ma, G. V. Cavalheiro, and S. Karaman. Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera. arXiv preprint arXiv:1807.00275, 2018.
- [34] F. Ma and S. Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.
- [35] A. Odena, V. Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
- [36] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
- [37] N. Schneider, L. Schneider, P. Pinggera, U. Franke, M. Pollefeys, and C. Stiller. Semantically guided depth upsampling. In German Conference on Pattern Recognition, pages 37–48. Springer, 2016.
- [38] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.
- [39] X. Song, X. Zhao, H. Hu, and L. Fang. Edgestereo: A context integrated residual pyramid network for stereo matching. arXiv preprint arXiv:1803.05196, 2018.
-  S. Tulyakov, A. Ivanov, and F. Fleuret. Practical deep stereo (pds): Toward applications-friendly deep stereo matching. arXiv preprint arXiv:1806.01677, 2018.
-  J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger. Sparsity invariant cnns. In 2017 International Conference on 3D Vision (3DV), pages 11–20. IEEE, 2017.
-  S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki. Sfm-net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804, 2017.
-  B. Wang, Y. Feng, and H. Liu. Multi-scale features fusion from sparse lidar data and single image for depth completion. Electronics Letters, 2018.
-  J. Yamanaka, S. Kuwashima, and T. Kurita. Fast and accurate image super resolution by deep cnn with skip connection and network in network. In Neural Information Processing, pages 217–225. Springer, 2017.
-  C.-Y. Yang, C. Ma, and M.-H. Yang. Single-image super-resolution: A benchmark. In European Conference on Computer Vision, pages 372–386. Springer, 2014.
-  G. Yang, H. Zhao, J. Shi, Z. Deng, and J. Jia. Segstereo: Exploiting semantic information for disparity estimation. In European Conference on Computer Vision, pages 660–676. Springer, 2018.
-  Z. Yang, P. Wang, Y. Wang, W. Xu, and R. Nevatia. Lego: Learning edge with geometry all at once by watching videos. arXiv preprint arXiv:1803.05648, 2018.
-  R. A. Yeh, C. Chen, T.-Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with deep generative models. In CVPR, volume 2, page 4, 2017.
-  J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. arXiv preprint, 2018.
-  J. Zbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(1-32):2, 2016.
-  Z. Zhang, C. Xu, J. Yang, Y. Tai, and L. Chen. Deep hierarchical guidance and regularization learning for end-to-end depth estimation. Pattern Recognition, 83:430–442, 2018.
-  T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, volume 2, page 7, 2017.