Sparse-to-Continuous: Enhancing Monocular Depth Estimation using Occupancy Maps

09/24/2018 ∙ by Nícolas Rosa, et al. ∙ The University of Sydney

This paper addresses the problem of single image depth estimation (SIDE), focusing on improving the accuracy of deep neural network predictions. In a supervised learning scenario, the quality of predictions is intrinsically related to the training labels, which guide the optimization process. For indoor scenes, structured-light-based depth sensors (e.g. Kinect) are able to provide dense, albeit short-range, depth maps. On the other hand, for outdoor scenes, LiDARs are still considered the standard sensor, which comparatively provide much sparser measurements, especially in areas further away. Rather than modifying the neural network structure to deal with sparse depth maps, this paper introduces a novel technique for the densification of depth maps based on the Hilbert Maps framework. A continuous occupancy map is produced based on 3D points from LiDAR scans, and the resulting reconstructed surface is projected into a 2D depth map with arbitrary resolution. Experiments conducted with various subsets of the KITTI dataset show the improvement produced by the proposed Sparse-to-Continuous technique, without the introduction of extra information into the training methodology.




I Introduction

Robotic platforms have been increasingly present in our society, performing progressively more complex activities in the most diverse environments. One of the driving factors behind this breakthrough is the development of sophisticated perceptual systems, which allow these platforms to understand the environment around them as well as, or better than, humans.

Nowadays, distinct sensors allow the capture of three-dimensional information, and amongst the most advanced are rangefinders using LiDAR technology [1]. Nonetheless, these sensors can be extremely expensive depending on the range and level of detail required by the application. Since it is also possible to reconstruct 3D structures from 2D observations of the scene [2], visual systems have been employed as an alternative due to their reduced cost and size, with the added benefit of perceiving colors. However, estimating depths from 2D images is a challenging task and is described as an ill-posed problem, since the observed images may result from several possible projections of the actual real-world scene [3]. This problem has been extensively studied in Stereo Vision [4, 5, 6, 7] and Single Image Depth Estimation (SIDE) [3, 8, 9, 10], and in this work we focus on the second approach.

Fig. 1: Sparsity comparison between the (a) KITTI Discrete (sparse), (b) KITTI Depth (semi-dense) and (c) KITTI Continuous (dense, ours) datasets, respectively. Warmer and colder colors represent larger and smaller distances, respectively.

Deep Convolutional Neural Networks (CNNs) have had a strong impact on how recent works address the SIDE task, with significant improvements in the accuracy of estimates and the level of detail present in depth maps. Many of these methods model monocular depth estimation as a regression problem and are supervised, often using sparse depth maps as ground truth, since these are readily available from other sensors (e.g. LiDAR rangefinders).

However, the degree of sparsity present in these maps is very high. For instance, in an image from the KITTI Depth dataset, 84.78% of the pixels do not contain valid information. One of the palliatives found is to use datasets that provide a large number of examples, such as KITTI Raw Data [1] and KITTI Depth [11], for the effective training of deep neural networks. Some recent works also propose the use of secondary information (e.g. low-resolution depth maps, surface normals, semantic maps), feeding it alongside the RGB images as extra inputs [12, 13, 14], or focus on the development of network architectures that are more suitable for processing sparse information [11].

Similar existing works have also proposed the use of rendered depth images from synthetic datasets, which are also continuous [15, 16]. However, the generalization power of networks trained using this type of dataset is still questioned [11], mainly due to the existing gap in terms of the degree of realism between virtual environments and real-world scenes.

In this work, we address the SIDE task and propose the use of occupancy models to interpolate raw sparse LiDAR measurements and generate continuous depth images, which then serve to train a ResNet-based architecture in a supervised manner. These continuous images have four times more information than sparse ones, which makes network convergence faster and easier, reduces the number of images needed for training and, more importantly, improves the quality of network predictions. Finally, to demonstrate the benefit of training deep convolutional networks using our proposed method, we compare the estimates obtained when training on three different datasets with varying levels of sparsity, as illustrated in Figure 1. To produce the occupancy models necessary for continuous projections, we employ the Hilbert Maps framework [17], due to its efficient training and query properties and scalability to large datasets.

II Related Work

Depth Estimation from a single image is an ill-posed problem, since the observed image may be generated from several possible projections of the actual real-world scene [3]. In addition, it is inherently ambiguous, as the proposed methods attempt to retrieve depth information directly from color intensities [18]. Despite these difficulties, other tasks such as obstacle detection, semantic segmentation and scene structure estimation benefit greatly from depth estimates [19], which makes this task particularly useful.

Previous approaches relied on handcrafted features, where the most suitable features for the application were manually selected, and on strong geometric assumptions [20, 21]. More recently, optimization-based methods were proposed to automate the generation of visual cues [3, 8, 22, 23, 19, 12, 10]. Commonly, monocular depth estimation is modeled as a regression problem whose parameters are optimized by minimizing a cost function, often using sparse depth maps as ground truth for supervised learning.

Early works employed techniques such as Markov Random Fields (MRFs) [24, 25] and Conditional Random Fields (CRFs) [26] to perform this task. More recently, deep learning concepts have also been used to address the SIDE problem, where deep convolutional neural networks (CNNs) are responsible for extracting the visual features [3, 27, 8, 18]. The success of these techniques highly impacted how subsequent works began to address the SIDE task, which in turn significantly improved the accuracy of estimates and the level of detail present in depth maps [12, 10].

In parallel to the supervised learning approach, some works focus on minimizing photometric reconstruction errors between the stereo images [28, 9] or video sequences [29], which allow them to be trained in an unsupervised way (i.e. without depth estimates as ground-truth).

Depth Map Completion has been widely studied in computer vision and image processing, and deals with decreasing the sparsity level of depth maps. Monocular Depth Estimation differs from it in that it seeks to map RGB images directly to depth maps. Briefly, existing Depth Map Completion methods seek to predict distances for pixels where the depth sensor doesn’t have information. Currently, there are two types of approaches associated with this problem.

The first one, non-Guided Depth Upsampling, aims to generate denser maps using only sparse maps obtained directly from 3D data or SLAM features. These methods resemble those proposed for the Depth Super-Resolution task [11], where the goal is to retrieve accurate high-resolution depth maps. More recently, deep convolutional neural networks have also been employed in super-resolution for both image [30, 31] and depth [32, 33] applications. Other works focus on inpainting the missing depth information, e.g., Uhrig et al. [11] employed sparse convolutional layers to process the irregularly distributed 3D laser data. However, methods that predict depth when trained only on raw information usually do not perform well [14].

The second approach, Image Guided Depth Completion, incorporates additional guidance to achieve superior performance, e.g. using sparse maps and RGB images of the scene (RGB-D data) as inputs. Besides low-resolution sparse samples obtained from low-cost LiDAR or SLAM features [12, 34], other auxiliary information can also be employed, such as semantic labels [35], 2D laser points [13], and surface normal and occlusion boundary maps [14].

Synthetic Datasets have also been employed to retrieve depth information [15, 16]. These datasets provide high-quality dense depth maps that are extracted directly from virtual environments. Some of the most used available datasets are: Apolloscape [36], SUNCG [37], SYNTHIA [38] and Virtual KITTI [39]. However, it remains open for discussion whether the complexity and realism of the information in such synthetic datasets are sufficient to train algorithms that can be successfully deployed in real-world situations [11].

III Methodology

III-A Occupancy Maps

A common way to store range-based sensor data is through the use of pointclouds, which can be projected back into a 2D plane to produce depth images, containing distance estimates for all pixels that have a corresponding world point. Assuming a rectified camera projection matrix $\mathbf{P}_{rect}$, a rectifying rotation matrix $\mathbf{R}_{rect}$ and a rigid body transformation matrix $\mathbf{T}^{cam}_{velo}$ from the range-based sensor frame to the camera frame, a 3D point $\mathbf{P}$ can be projected into pixel $\mathbf{u}$ as such:

$$\mathbf{u} = \mathbf{P}_{rect} \, \mathbf{R}_{rect} \, \mathbf{T}^{cam}_{velo} \, \mathbf{P}$$

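An example of this projection can be seen in Figure 1a, where we can see the sparsity generated by directly projecting pointcloud information, most notably in areas further away from the sensor. Spatial dependency modeling is a crucial aspect in computer vision, and the introduction of such irregular gaps can severely impact performance. Because of that, here we propose projecting not the pointcloud itself, but rather its occupancy model, as generated by the Hilbert Maps (HM) framework [17]. This methodology has recently been successfully applied to the modeling of large-scale 3D environments [40], producing a continuous occupancy function that can be queried at arbitrary resolutions. Assuming a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^3$ is a point in the three-dimensional space and $y_i \in \{-1, +1\}$ is a classification variable that indicates the occupancy property of $\mathbf{x}_i$, the probability of non-occupancy for a query point $\mathbf{x}_*$ is given by:

$$p(y_* = -1 \mid \Phi(\mathbf{x}_*), \mathbf{w}) = \frac{1}{1 + \exp\left(\mathbf{w}^T \Phi(\mathbf{x}_*)\right)},$$

where $\Phi(\mathbf{x}_*)$ is the feature vector and $\mathbf{w}$ are the weight parameters that describe the discriminative model $p(y \mid \Phi(\mathbf{x}), \mathbf{w})$. We employ the same feature vector from [40], defined by a series of squared exponential kernel evaluations against an inducing point set $\mathcal{M} = \{(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)\}_{j=1}^{M}$, obtained by clustering the pointcloud and calculating mean $\boldsymbol{\mu}_j$ and variance $\boldsymbol{\Sigma}_j$ estimates for each subset of points:

$$\Phi(\mathbf{x}) = \left[ k(\mathbf{x}, \mathcal{M}_1), \ldots, k(\mathbf{x}, \mathcal{M}_M) \right], \qquad k(\mathbf{x}, \mathcal{M}_j) = \exp\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j) \right)$$
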
Clustering is performed using the Quick-Means algorithm proposed in [41], due to its computational efficiency and ability to produce consistent cluster densities. However, this algorithm is modified so that cluster density can vary as a function of some quantity, in this case the distance from the origin. This is achieved by scaling the inner and outer radii that define cluster size with that distance, through a scaling constant. The intuition is that areas further from the center will have fewer points, and therefore larger clusters are necessary to properly interpolate over such sparse structures. The trade-off for this increased interpolative power is a loss of structural detail, since a larger volume will be modeled by the same cluster. The optimal weight parameters $\mathbf{w}$ are calculated by minimizing the following negative log-likelihood loss function:

$$\mathcal{L}(\mathbf{w}) = \sum_{i=1}^{N} \log\left(1 + \exp\left(-y_i \mathbf{w}^T \Phi(\mathbf{x}_i)\right)\right) + R(\mathbf{w}),$$

where $R(\mathbf{w})$ is a regularization function such as the elastic net [42]. Once the occupancy model has been trained, it can be used to produce a reconstruction of the environment, and each pixel is then checked for collision in the 3D space, producing depth estimates. An example of a reconstructed depth image is depicted in Figure 1c, where we can see that virtually all previously empty areas were filled by the occupancy model, while maintaining spatial dependencies intact (up to the reconstructive capabilities of the HM framework).
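As a rough illustration of the occupancy model described above, the following NumPy sketch evaluates squared exponential kernel features against a set of cluster means and covariances, the resulting non-occupancy probability, and a regularized negative log-likelihood. It is only a sketch under stated assumptions, not the authors' implementation: the function names, the label convention $y \in \{-1, +1\}$, and the regularization weights `l1` and `l2` are all hypothetical choices made for the example.

```python
import numpy as np

def hm_features(x, means, covs):
    """Squared exponential kernel of each query point against each cluster.

    x: (N, 3) query points; means: (M, 3); covs: (M, 3, 3). Returns (N, M).
    """
    diff = x[:, None, :] - means[None, :, :]               # (N, M, 3)
    prec = np.linalg.inv(covs)                              # (M, 3, 3) inverse covariances
    maha = np.einsum("nmi,mij,nmj->nm", diff, prec, diff)   # squared Mahalanobis distances
    return np.exp(-0.5 * maha)

def prob_free(x, means, covs, w):
    """Probability of non-occupancy, p(y = -1 | x, w), from a logistic model."""
    return 1.0 / (1.0 + np.exp(hm_features(x, means, covs) @ w))

def nll(w, x, y, means, covs, l1=1e-3, l2=1e-3):
    """Logistic negative log-likelihood plus an elastic-net penalty.

    y holds labels in {-1, +1}; l1 and l2 are illustrative weights only.
    """
    margins = y * (hm_features(x, means, covs) @ w)
    penalty = l1 * np.sum(np.abs(w)) + l2 * np.sum(w ** 2)
    return np.sum(np.log1p(np.exp(-margins))) + penalty
```

With positive weights, a query at a cluster center gets a low non-occupancy probability, while a query far from every cluster falls back to 0.5, since all kernel responses vanish there.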

III-B Continuous Depth Images

When datasets do not provide ground truths directly, it is still possible to obtain them using the 3D LiDAR scans and extrinsic/intrinsic parameters from the RGB cameras. In this case, a sparse depth image can be generated by directly projecting the cloud of points of the scene to the image plane of the visual sensor [3, 9]. Continuous depth images, in turn, can be obtained by interpolating the measured points into continuous surfaces prior to the projection. In this work, the Hilbert Maps technique was used on the LiDAR scans to generate these surfaces. After restricting the continuous map to the region under the left camera’s field of view, we projected the remaining depth values in the image plane.
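The direct projection that yields a sparse depth image can be sketched as follows. This is a minimal NumPy illustration rather than the authors' code: `project_to_depth` and its argument names are hypothetical, the calibration matrices are assumed to follow the KITTI convention (a 3×4 projection matrix, and the rectifying rotation and sensor-to-camera transform both expanded to 4×4), and keeping the nearest point when several fall on the same pixel is an assumption.

```python
import numpy as np

def project_to_depth(points, P_rect, R_rect, T_velo_cam, height, width):
    """Project LiDAR points (N, 3) into a sparse depth image (0 means no data)."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # (N, 4) homogeneous points
    cam = R_rect @ T_velo_cam @ pts_h.T                      # (4, N) rectified camera frame
    cam = cam[:, cam[2] > 0]                                 # drop points behind the camera
    uvw = P_rect @ cam                                       # (3, N) homogeneous pixels
    u = np.round(uvw[0] / uvw[2]).astype(int)
    v = np.round(uvw[1] / uvw[2]).astype(int)
    z = cam[2]                                               # depth along the optical axis
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]
    depth = np.zeros((height, width), dtype=np.float32)
    order = np.argsort(-z)                                   # far first, so near points win
    depth[v[order], u[order]] = z[order]
    return depth
```

Every pixel that receives no point stays at zero, which is exactly the sparsity that the continuous depth images are meant to reduce.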

III-C Data Augmentation

Two types of random online transformations were performed, thus artificially increasing the number of training data samples.

Flips: The input image and the corresponding depth map are flipped horizontally with 50% probability.

Color Distortion: Adjusts the intensity of color components on an RGB image randomly. The order of the following transformations is also chosen randomly:

  • Brightness by a random value

  • Saturation by a random value

  • Hue by a random value

  • Contrast by a random value

As pointed out by [3], the world-space geometry of the scene is not preserved by image scaling and translation transformations. Therefore, we opted not to use them. We also left out rotations, even though these are geometry-preserving. We believe that aggressive color distortions prevent the network from becoming biased towards relating pixel intensity to depth values, so that it focuses on learning the scene’s geometric relationships.
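The two online transformations can be sketched in NumPy as below. This is an illustrative sketch only: the function name `augment`, every sampling range, and the use of a YIQ rotation for the hue shift are assumptions made for the example, since the ranges are not given in the text above.

```python
import numpy as np

# RGB -> YIQ matrix, used here only to rotate hue (an illustrative choice).
RGB2YIQ = np.array([[0.299, 0.587, 0.114],
                    [0.596, -0.274, -0.322],
                    [0.211, -0.523, 0.312]])
LUMA = RGB2YIQ[0]

def augment(image, depth, rng):
    """image: float RGB in [0, 1], shape (H, W, 3); depth: shape (H, W)."""
    if rng.random() < 0.5:                       # flip image and depth together
        image, depth = image[:, ::-1], depth[:, ::-1]

    def brightness(im):
        return im + rng.uniform(-0.1, 0.1)       # hypothetical range

    def saturation(im):
        gray = (im @ LUMA)[..., None]
        return gray + rng.uniform(0.6, 1.4) * (im - gray)

    def hue(im):
        t = rng.uniform(-0.3, 0.3)               # rotation in radians, hypothetical
        c, s = np.cos(t), np.sin(t)
        rot = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
        return im @ (np.linalg.inv(RGB2YIQ) @ rot @ RGB2YIQ).T

    def contrast(im):
        mean = im.mean(axis=(0, 1), keepdims=True)
        return mean + rng.uniform(0.6, 1.4) * (im - mean)

    ops = [brightness, saturation, hue, contrast]
    for k in rng.permutation(len(ops)):          # distortions applied in random order
        image = ops[k](image)
    return np.clip(image, 0.0, 1.0), depth
```

Only the flip touches the depth map; the color distortions leave it unchanged, so the pixel-to-depth correspondence is preserved.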

III-D Loss Functions

We employed three different loss functions for adjusting the internal parameters of the presented deep neural network. The motivation behind this is simply to determine which one is more suitable for approximating the outputs ($y_i$) to the reference values ($y_i^*$) for the $i$-th pixel. The mathematical expressions for each one are presented as follows:

III-D1 Squared Euclidean Norm (mse)

Also known as the $\mathcal{L}_2$ norm, it is the most commonly used cost function for neural network optimization, and consists of computing the Euclidean distances (Equation 6) between predictions ($y_i$) and ground truths ($y_i^*$) [18].

$$\mathcal{L}_{mse}(y, y^*) = \frac{1}{n} \sum_{i} (y_i - y_i^*)^2 \qquad (6)$$

III-D2 Scale Invariant Mean Squared Error (eigen)

In the context of depth prediction, discovering the global scale of a scene is a naturally ambiguous task. Eigen & Fergus [27] verified that subtracting the mean scale from the evaluated scenes (second term in the equation below) results in a significant improvement in network predictions.

$$\mathcal{L}_{eigen\_grads}(y, y^*) = \frac{1}{n} \sum_i d_i^2 - \frac{\lambda}{n^2} \left( \sum_i d_i \right)^2 + \frac{1}{n} \sum_i \left[ (\nabla_x d_i)^2 + (\nabla_y d_i)^2 \right],$$

where $d_i$ denotes the per-pixel difference between prediction and reference.

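III-D3 Adaptive BerHu Penalty (berhu)

Also known as the reverse Huber function, it was proposed in order to obtain more robust estimates by penalizing predictions that are close to the source (i.e. the vehicle) differently from those that are distant [18, 43, 44].

$$\mathcal{L}_{berhu}(y, y^*) = \begin{cases} |y_i - y_i^*| & \text{if } |y_i - y_i^*| \le c, \\ \dfrac{(y_i - y_i^*)^2 + c^2}{2c} & \text{otherwise} \end{cases} \qquad (7)$$

As shown in Equation 7, the penalization adapts according to how far the predictions are from the reference depths: errors up to the threshold $c$ are subject to the $\mathcal{L}_1$ norm, whereas larger ones are subject to the $\mathcal{L}_2$ norm.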
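The three losses can be written compactly in NumPy, as sketched below. The function names are hypothetical, and two details are assumptions rather than statements from the text: the BerHu threshold is set to one fifth of the largest absolute error, following the convention of Laina et al. [18], and the image gradients in the scale-invariant loss are taken as forward differences. Whether the difference is computed on metric or log depths depends on what is passed in.

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error over all pixels."""
    return np.mean((pred - target) ** 2)

def eigen_grads_loss(pred, target, lam=0.5):
    """Scale-invariant error with gradient terms; pred and target are (H, W)."""
    d = pred - target
    n = d.size
    gx = d[:, 1:] - d[:, :-1]          # horizontal forward differences (assumption)
    gy = d[1:, :] - d[:-1, :]          # vertical forward differences (assumption)
    return ((d ** 2).sum() / n
            - lam * d.sum() ** 2 / n ** 2
            + ((gx ** 2).sum() + (gy ** 2).sum()) / n)

def berhu_loss(pred, target):
    """Reverse Huber: linear below the threshold c, quadratic above it."""
    abs_err = np.abs(pred - target)
    c = 0.2 * abs_err.max()            # adaptive threshold, assumed as in [18]
    if c == 0:
        return 0.0                     # perfect prediction, nothing to penalize
    quad = (abs_err ** 2 + c ** 2) / (2 * c)
    return np.mean(np.where(abs_err <= c, abs_err, quad))
```

With `lam=1.0`, adding a constant offset to every pixel of `pred` leaves `eigen_grads_loss` at zero, which is the invariance the second term is meant to provide.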

III-E Network Architecture

Fig. 2: Network architecture. The architecture used was proposed by Laina et al. [18] and is inspired by ResNet-50, with its fully-connected layers replaced by upsampling blocks. Unlike the original authors, we changed the training framework so that the network predicts distances in meters instead of distances in log-space.

In this work, we used the Fully Convolutional Residual Network (FCRN) topology proposed by Laina et al. [18]. This network was selected because it has fewer trainable parameters and requires fewer training images, without losing performance. In addition, the residual blocks present in the architecture allow the construction of a deeper model capable of predicting more accurate and higher-resolution output maps. More specifically, the FCRN (Figure 2) is based on the ResNet-50 topology, but the fully-connected layers have been replaced by a set of residual upsampling blocks, also referred to as up-projections, which are layers responsible for deconvolving and recovering the spatial resolution of feature maps. This network was trained end-to-end in a supervised way but, unlike the authors who proposed it, we modified its output to predict distances in meters rather than distances in log-space. The network takes RGB images as inputs for training its 63M trainable parameters and provides a depth map as output.

IV Experimental Results

IV-A Implementation details

We implemented the network using TensorFlow [45]. Our models were trained on the KITTI Depth and KITTI Raw Data datasets using an NVIDIA Titan X GPU with 12 GB of memory. We used a batch size of 4 and 300000 training steps. The initial learning rate was 0.0001, reduced by 5% every 1000 steps. Besides learning rate decay, we also employed a dropout of 50% and normalization as regularization [46, 47].
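The learning rate schedule can be made concrete with a few lines of Python. This is a sketch that assumes a staircase decay, i.e. the rate drops by 5% at each multiple of 1000 steps and stays constant in between; the text does not say whether the decay is stepped or continuous, and the function name is hypothetical.

```python
def learning_rate(step, initial=1e-4, decay=0.95, decay_steps=1000):
    """Staircase exponential decay: multiply by `decay` every `decay_steps` steps."""
    return initial * decay ** (step // decay_steps)
```

Under this assumption the rate has been reduced 300 times by step 300000, which puts it on the order of 1e-11 at the end of training.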

IV-B Datasets

Three different datasets are considered in this work: KITTI Discrete (sparse), KITTI Depth (semi-dense), and KITTI Continuous (dense), including frames from the “city”, “residential”, “road”, “campus” and “person” sequences. Typically, the resolution of the RGB images and depth maps used is $1242 \times 375$ pixels.

KITTI Depth: Due to their complexity, training and evaluating deep convolutional neural networks requires a large number of annotated image pairs. KITTI Depth is a large-scale dataset created to allow supervised end-to-end training, since other datasets such as Middlebury [48], Make3D [25], and KITTI [49, 50] do not have enough data to adjust all internal parameters of a deep neural network [11]. The dataset is paired with scenes presented in the KITTI Raw Data dataset [1] and consists of 92750 semi-dense depth maps (ground truth). The depth images were obtained by accumulating 11 laser scans, whose outliers were removed by enforcing consistency between the LiDAR points and the reconstructed depth maps generated by semi-global matching (SGM).


KITTI Discrete/Continuous: Unlike the procedure performed in KITTI Depth to make the depth images less sparse, which consisted of accumulating different scans of the laser sensor, we used an occupancy model to make the depth maps denser. In other words, the goal was to increase the number of valid pixels available in depth images for training. In this sense, this alternative requires a smaller number of training images than other techniques that use datasets with sparse/semi-dense ground truth information. The KITTI Continuous is also based on 3D Velodyne pointclouds, but first we interpolate its measurements as surfaces to generate the continuous depth images (more details in section III-B). The KITTI Discrete dataset was built alongside KITTI Continuous and consists of depth images which are the direct projections of pointclouds on the 2D image plane. However, in this dataset version the generated depth maps are very sparse since they use only one LiDAR scan.

For the Discrete and Continuous datasets, we randomly shuffled the images and corresponding depth maps before splitting them into training and test sets with an 80/20 ratio. The resulting number of pairs for each dataset is presented in Table I. On average, the number of valid pixels available in the KITTI Discrete and KITTI Depth datasets amounts to only 6.63% and 24.33% of the number available in the KITTI Continuous dataset, i.e. roughly 15 and 4 times fewer, respectively.

| Dataset | Train | Test | Total | Average number of valid pixels |
|---|---|---|---|---|
| KITTI Depth | 85898 | 6852 (1) | 92750 | 70910 |
| KITTI Discrete | 25742 | 6436 | 32178 | 19323 |
| KITTI Continuous | 25742 | 6436 | 32178 | 291453 |

(1) Validation subset used instead of test subset. The test subset doesn’t have depth maps for the corresponding input images.

TABLE I: Number of image pairs used for Train/Test for each Dataset
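The valid-pixel ratios quoted in the text follow directly from the last column of Table I, as the short check below shows. The variable names are only illustrative.

```python
# Average number of valid pixels per depth map, copied from Table I.
avg_valid = {
    "KITTI Depth": 70910,
    "KITTI Discrete": 19323,
    "KITTI Continuous": 291453,
}

reference = avg_valid["KITTI Continuous"]
# Share of valid pixels relative to KITTI Continuous, in percent.
share = {name: 100.0 * count / reference for name, count in avg_valid.items()}
# How many times fewer valid pixels each dataset has than KITTI Continuous.
factor = {name: reference / count for name, count in avg_valid.items()}
```

Here `share` gives about 6.63% for KITTI Discrete and 24.33% for KITTI Depth, and `factor` gives roughly 15 and 4, matching the figures in the text.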

IV-C Evaluation Metrics

Since the final results are generally a set of predictions of the testing set images, qualitative (visual) analysis may be biased and not sufficient to say if one approach is better than another. This way, several works use the following metrics to evaluate their methods and thus compare them with other techniques in the literature [3, 51, 19]:

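Threshold ($\delta$): % of $y_i$ s.t. $\max\left(\frac{y_i}{y_i^*}, \frac{y_i^*}{y_i}\right) = \delta < thr$, with $thr \in \{1.25, 1.25^2, 1.25^3\}$

Abs Relative Difference: $\frac{1}{N} \sum_i \frac{|y_i - y_i^*|}{y_i^*}$

Squared Relative Difference: $\frac{1}{N} \sum_i \frac{(y_i - y_i^*)^2}{y_i^*}$

RMSE (linear): $\sqrt{\frac{1}{N} \sum_i (y_i - y_i^*)^2}$

RMSE (log): $\sqrt{\frac{1}{N} \sum_i (\log y_i - \log y_i^*)^2}$

where $N$ is the number of valid pixels in all evaluated images. In addition, in order to compare our results with other works, we also follow the evaluation protocol of restricting ground-truth depth values and predictions to a fixed range: depths below the lower bound are discarded and distances above the upper bound are capped. Some works [28, 9] require different intervals for a fair comparison.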
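These metrics are straightforward to compute once predictions and ground truth are restricted to valid pixels. The NumPy sketch below follows the standard definitions from [3]; the function name, the dictionary keys, and the exact way the range restriction is applied (discarding ground-truth values below `min_depth`, capping both maps at `max_depth`) are assumptions made for illustration, and `min_depth` must be positive for the logarithm to be defined.

```python
import numpy as np

def depth_metrics(pred, gt, min_depth, max_depth):
    """Standard depth-estimation metrics over pixels with valid ground truth."""
    mask = gt >= min_depth                         # discard missing or too-close depths
    p = np.clip(pred[mask], min_depth, max_depth)  # cap predictions to the range
    g = np.minimum(gt[mask], max_depth)            # cap ground truth at the upper bound
    ratio = np.maximum(p / g, g / p)
    return {
        "abs_rel": np.mean(np.abs(p - g) / g),
        "sq_rel": np.mean((p - g) ** 2 / g),
        "rmse": np.sqrt(np.mean((p - g) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2)),
        "delta_1": np.mean(ratio < 1.25),
        "delta_2": np.mean(ratio < 1.25 ** 2),
        "delta_3": np.mean(ratio < 1.25 ** 3),
    }
```

The three threshold entries are fractions in [0, 1]; multiply by 100 for percentages.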

IV-D Benchmark Evaluation

In this section, we perform the benchmark evaluation of our method trained on the presented datasets and compare them with existing works. Since there are no test images for the KITTI Raw Data, we evaluated the network predictions on two different test splits, which were resized to the original size using bilinear upsampling, and compared them to the corresponding ground-truth depth maps.

IV-D1 Eigen Split

As already mentioned, the KITTI Raw dataset does not have an official training/test split, so Eigen et al. subdivided the available images into 33,131 for training and 697 for evaluation [3]. Like other works in the literature, we also use the test subset to evaluate our methods, which allows us to compare them directly with state-of-the-art algorithms. Since this dataset doesn’t provide ground truth depth images, they need to be generated using the methodology presented in section III-B. In Table II we detail how our approaches perform on this test split alongside the results of other leaderboard algorithms. The last two rows show how our method improved over the baseline, causing the network to reach current state-of-the-art results. A qualitative comparison between our results and the current state-of-the-art is presented in Figure 3.

Fig. 3: Qualitative comparison between our estimates and previous state-of-the-art works.

Like DORN [10], our method also detects obstacles present in the scenes well, with the noticeable difference that ours provides a certain margin of safety around the obstacles, due to the reconstructive properties of the Hilbert Maps framework, as shown in section III-A, while achieving similar performance with a simpler architecture.

| Approach | Supervised | Range | Abs Rel | Sqr Rel | RMSE | RMSE (log) | $\delta < 1.25$ | $\delta < 1.25^2$ | $\delta < 1.25^3$ |
|---|---|---|---|---|---|---|---|---|---|
| Make 3D [25] | Yes | | 0.280 | 3.012 | 8.734 | 0.361 | 0.601 | 0.820 | 0.926 |
| Mancini et al. [52] | Yes | | - | - | 7.508 | 0.524 | 0.318 | 0.617 | 0.813 |
| Eigen et al. [3], coarse 28×144 | Yes | | 0.194 | 1.531 | 7.216 | 0.273 | 0.679 | 0.897 | 0.967 |
| Eigen et al. [3], fine 27×142 | Yes | | 0.190 | 1.515 | 7.156 | 0.270 | 0.692 | 0.899 | 0.967 |
| Liu et al. [8], DCNF-FCSP FT | Yes | | 0.217 | 1.841 | 6.986 | 0.289 | 0.647 | 0.882 | 0.961 |
| Ma & Karaman [12], only RGB | Yes | | 0.208 | - | 6.266 | - | 0.591 | 0.900 | 0.962 |
| Fu et al. [10], DORN (ResNet) | Yes | | 0.072 | 0.307 | 2.727 | 0.120 | 0.932 | 0.984 | 0.994 |
| Kuznietsov et al. [53] | No | | 0.262 | 4.537 | 6.182 | 0.338 | 0.768 | 0.912 | 0.955 |
| Godard et al. [9] (CS+K) | No | | 0.136 | 1.512 | 5.763 | 0.236 | 0.836 | 0.935 | 0.968 |
| Zhou et al. [29] (w/o explainability) | No | | 0.208 | 1.551 | 5.452 | 0.273 | 0.695 | 0.900 | 0.964 |
| Garg et al. [28], L12 Aug 8x | No | | 0.169 | 1.080 | 5.104 | 0.273 | 0.740 | 0.904 | 0.962 |
| Zhou et al. [29] (CS+K) | No | | 0.190 | 1.436 | 4.975 | 0.258 | 0.735 | 0.915 | 0.968 |
| Godard et al. [9] (CS+K) | No | | 0.118 | 0.932 | 4.941 | 0.215 | 0.858 | 0.947 | 0.974 |
| Kuznietsov et al. [53] | Yes | | 0.117 | 0.597 | 3.531 | 0.183 | 0.861 | 0.964 | 0.989 |
| Ma & Karaman [12], RGBd 500 Samples | Yes | | 0.073 | - | 3.378 | - | 0.935 | 0.976 | 0.989 |
| Fu et al. [10], DORN (ResNet) | Yes | | 0.071 | 0.268 | 2.271 | 0.116 | 0.936 | 0.985 | 0.995 |
| KITTI Depth, only valid, (Ours) | Yes | | 0.195 | 1.417 | 4.040 | 0.236 | 0.718 | 0.841 | 0.883 |
| KITTI Continuous, only valid, (Ours) | Yes | | 0.071 | 0.267 | 2.536 | 0.133 | 0.820 | 0.894 | 0.908 |

For Abs Rel, Sqr Rel, RMSE and RMSE (log), lower is better; for the $\delta$ thresholds, higher is better.

TABLE II: Comparison with state-of-the-art algorithms on KITTI Raw Eigen test split [3]. The reported results are those presented by the authors from their respective original papers.

IV-D2 Eigen Split (Continuous)

The above-mentioned Eigen test split has been used in the literature for evaluating depth estimation methods for years. However, the depth maps (ground truths) proposed by the split set are sparse. For a fair comparison between the methods trained on the sparse and continuous datasets, we increased the number of evaluation points, since our technique improves the prediction quality not only for sparse points of the original depth map but also for the scene as a whole. This modification makes it possible to further highlight the benefits of our technique. More specifically, we generated an evaluation split using 638 test images from the original testing set, but using their corresponding continuous version.

IV-E Ablation Studies

Besides evaluating the presented architecture on the different versions of the KITTI datasets, we conducted various ablation studies to identify the best training combination. More specifically, we trained using different datasets and loss functions. We also studied the influence of using all pixels, including sky and reflective surfaces, or only valid pixels, which have corresponding depth information. The obtained results are presented in Table III. The models trained on the continuous dataset showed a decrease of 63.6% and 37.2%, respectively, in the AbsRel and RMSE error metrics compared to models trained under the same circumstances but using semi-dense maps.
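The reported reductions can be reproduced from the best "valid" rows of Table III for the KITTI Depth and KITTI Continuous datasets. The helper below is a hypothetical convenience function written only for this check.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# Values copied from Table III (depth/valid vs. best continuous/valid row).
abs_rel_drop = relative_reduction(0.195, 0.071)
rmse_drop = relative_reduction(4.040, 2.536)
```

Rounded to one decimal place, `abs_rel_drop` is 63.6 and `rmse_drop` is 37.2, in agreement with the percentages quoted above.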

| Dataset | Pixels | Loss | Abs Rel | Sqr Rel | RMSE | RMSE (log) | $\delta < 1.25$ | $\delta < 1.25^2$ | $\delta < 1.25^3$ |
|---|---|---|---|---|---|---|---|---|---|
| discrete | valid | | 0.200 | 1.525 | 4.315 | 0.244 | 0.715 | 0.833 | 0.878 |
| depth | valid | | 0.195 | 1.417 | 4.040 | 0.236 | 0.718 | 0.841 | 0.883 |
| continuous’city | all | | 0.180 | 1.413 | 6.088 | 0.486 | 0.659 | 0.806 | 0.858 |
| continuous’city | all | | 0.144 | 1.099 | 4.548 | 0.717 | 0.727 | 0.837 | 0.875 |
| continuous | valid | | 0.125 | 0.661 | 4.232 | 0.195 | 0.728 | 0.860 | 0.898 |
| continuous | valid | | 0.124 | 0.653 | 4.176 | 0.197 | 0.733 | 0.861 | 0.898 |
| continuous | all | | 0.118 | 0.697 | 4.272 | 0.636 | 0.756 | 0.865 | 0.891 |
| continuous | all | | 0.110 | 0.488 | 3.300 | 0.297 | 0.773 | 0.872 | 0.898 |
| continuous | valid | | 0.103 | 0.408 | 2.976 | 0.159 | 0.782 | 0.882 | 0.906 |
| continuous | all | | 0.093 | 0.500 | 3.561 | 0.481 | 0.790 | 0.878 | 0.899 |
| continuous | valid | | 0.071 | 0.267 | 2.536 | 0.133 | 0.820 | 0.894 | 0.908 |

For Abs Rel, Sqr Rel, RMSE and RMSE (log), lower is better; for the $\delta$ thresholds, higher is better.

TABLE III: Quantitative results of different variants of our approach on KITTI using the proposed Continuous Eigen test split. The predictions of all results were capped to range.
Fig. 4: Qualitative comparison between depth predictions when trained on the proposed datasets. (a) Input RGB Image. (b) KITTI Discrete (Sparse, 1 Scan). (c) KITTI Depth (Semi-Dense, 11 Scans). (d) KITTI Continuous (Dense, Continuous Occupancy Maps). (e) Ground Truth (Continuous).

Figure 4 shows a qualitative comparison between the predictions obtained when training on the proposed datasets. As can be noted, the continuous depth images improved the quality of the distance estimates: the predicted images are much less blurred, with better-defined object edges, and their measurements agree more closely with the ground truth maps. The main cause of blur in predictions trained on sparse datasets is the use of 2D convolutional filters over widely sparse regions, combined with the intermittent nature of the depth information, since whether a given pixel holds a distance value depends on where the laser points are reprojected.

V Conclusion

In this paper, we presented a novel data pre-processing step that applies occupancy models (i.e. Hilbert Maps) to the Single Image Depth Estimation problem, generating continuous depth maps for training our deep residual network, in contrast to typical supervised approaches that use sparse ones. This training process does not require any other type of sensor or extra information, only RGB images as input and continuous maps as supervision, which significantly improved the quality of network predictions over typical sparse maps. Moreover, the proposed methodology achieved superior performance even when using 60% fewer examples than models trained on the KITTI Depth dataset, as a consequence of increasing the valid information present in the ground truth maps from 15.2% to 62.6%. The main limitation of the proposed pre-processing method is the computational cost required to compute each continuous depth map used for training. Future work will focus on optimizing the method itself, mainly tackling the aforementioned problem, and on honing the network topology by incorporating new layers that are more suited to the SIDE task. Uhrig et al. [11] employed sparse convolutions to deal with the sparsity present in depth maps; similarly, we suggest developing or using more suitable layers, e.g. sub-pixel convolutional layers [54], for processing the available information, which is now continuous but still has empty areas.


This research was supported by funding from the Brazilian National Council for Scientific and Technological Development (CNPq), under grants 130463/2017-5 and 465755/2014-3, the São Paulo Research Foundation (FAPESP) grant 2014/50851-0, and the Faculty of Engineering & Information Technologies, The University of Sydney, under the Faculty Research Cluster Program. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X GPUs used for this research.


  • [1] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research (IJRR), vol. 32, no. 11, pp. 1231–1237, 2013.
  • [2] D. J. Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess, “Unsupervised learning of 3d structure from images,” in Proc. of the 30th International Conference on Neural Information Processing Systems (NIPS). USA: Curran Associates Inc., 2016, pp. 5003–5011.
  • [3] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Proc. of the 27th International Conference on Neural Information Processing Systems (NIPS), 2014, pp. 2366–2374.
  • [4] Z. Chen, X. Sun, L. Wang, Y. Yu, and C. Huang, “A Deep Visual Correspondence Embedding Model for Stereo Matching Costs,” in Proc. of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 972–980.
  • [5] J. Žbontar and Y. LeCun, “Computing the stereo matching cost with a convolutional neural network,” in Proc. of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1592–1599.
  • [6] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation,” in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4040–4048.
  • [7] W. Luo, A. G. Schwing, and R. Urtasun, “Efficient Deep Learning for Stereo Matching,” in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5695–5703.
  • [8] F. Liu, Chunhua Shen, and Guosheng Lin, “Deep convolutional neural fields for depth estimation from a single image,” in Proc. of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5162–5170.
  • [9] C. Godard, O. M. Aodha, and G. J. Brostow, “Unsupervised Monocular Depth Estimation with Left-Right Consistency,” in Proc. of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6602–6611.
  • [10] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep Ordinal Regression Network for Monocular Depth Estimation,” in Proc. of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [11] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant cnns,” in 2017 International Conference on 3D Vision (3DV), Oct 2017, pp. 11–20.
  • [12] F. Ma and S. Karaman, “Sparse-to-Dense: Depth Prediction from Sparse Depth Samples and a Single Image,” in Proc. of the 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018.
  • [13] Y. Liao, L. Huang, Y. Wang, S. Kodagoda, Y. Yu, and Y. Liu, “Parse geometry from a line: Monocular depth estimation with partial laser observation,” in Proc. of the IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 5059–5066.
  • [14] Y. Zhang and T. Funkhouser, “Deep Depth Completion of a Single RGB-D Image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 175–185.
  • [15] M. Mancini, G. Costante, P. Valigi, T. A. Ciarfuglia, J. Delmerico, and D. Scaramuzza, “Toward Domain Independence for Learning-Based Monocular Depth Estimation,” IEEE Robotics and Automation Letters, vol. 2, no. 3, pp. 1778–1785, 2017.
  • [16] M. Mancini, G. Costante, P. Valigi, and T. A. Ciarfuglia, “J-mod 2: Joint monocular obstacle detection and depth estimation,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1490–1497, 2018.
  • [17] F. Ramos and L. Ott, “Hilbert maps: Scalable continuous occupancy mapping with stochastic gradient descent,” International Journal of Robotics Research (IJRR), vol. 35, no. 14, pp. 1717–1730, 2016.
  • [18] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in Proc. of the 4th International Conference on 3D Vision (3DV), 2016, pp. 239–248.
  • [19] Y. Cao, Z. Wu, and C. Shen, “Estimating depth from monocular images as classification using deep fully convolutional residual networks,” IEEE Transactions on Circuits and Systems for Video Technology, 2017.
  • [20] D. Hoiem, A. A. Efros, and M. Hebert, “Automatic photo pop-up,” ACM Transactions on Graphics, vol. 24, no. 3, p. 577, 2005.
  • [21] V. Hedau, D. Hoiem, and D. Forsyth, “Thinking inside the box: Using appearance models and context based on room geometry,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 6316 LNCS, no. PART 6, pp. 224–237, 2010.
  • [22] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. Yuille, “Towards unified depth and semantic prediction from a single image,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2800–2809.
  • [23] A. Roy, “Monocular Depth Estimation Using Neural Regression Forest,” in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5506–5514.
  • [24] A. Saxena, S. H. Chung, and A. Y. Ng, “Learning Depth from Single Monocular Images,” Advances in Neural Information Processing Systems, vol. 18, pp. 1161–1168, 2006.
  • [25] A. Saxena, M. Sun, and A. Y. Ng, “Make3D: Learning 3D scene structure from a single still image,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, 2009.
  • [26] M. Liu, M. Salzmann, and X. He, “Discrete-Continuous Depth Estimation from a Single Image,” in Proc. of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 716–723.
  • [27] D. Eigen and R. Fergus, “Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture,” in Proc. of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2650–2658.
  • [28] R. Garg, B. G. Vijay Kumar, G. Carneiro, and I. Reid, “Unsupervised CNN for single view depth estimation: Geometry to the rescue,” in Proc. of the European Conference on Computer Vision (ECCV), 2016, pp. 740–756.
  • [29] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in Proc. of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, no. 6, 2017, p. 7.
  • [30] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.
  • [31] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Proc. of the European Conference on Computer Vision (ECCV). Springer, 2014, pp. 184–199.
  • [32] G. Riegler, M. Rüther, and H. Bischof, “Atgv-net: Accurate depth super-resolution,” in Proc. of the European Conference on Computer Vision (ECCV). Springer, 2016, pp. 268–284.
  • [33] X. Song, Y. Dai, and X. Qin, “Deep depth super-resolution: Learning depth super-resolution using deep convolutional neural network,” in Asian Conference on Computer Vision. Springer, 2016, pp. 360–376.
  • [34] C. S. Weerasekera, T. Dharmasiri, R. Garg, T. Drummond, and I. Reid, “Just-in-Time Reconstruction: Inpainting Sparse Maps using Single View Depth Predictors as Priors,” in Proc. of the 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018.
  • [35] N. Schneider, L. Schneider, P. Pinggera, U. Franke, M. Pollefeys, and C. Stiller, “Semantically guided depth upsampling,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9796 LNCS, pp. 37–48, 2016.
  • [36] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang, “The ApolloScape Dataset for Autonomous Driving,” 2018. [Online]. Available:
  • [37] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, “Semantic Scene Completion from a Single Depth Image,” in Proc. of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 190–198.
  • [38] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, “The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3234–3243.
  • [39] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, “Virtual Worlds as Proxy for Multi-object Tracking Analysis,” in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4340–4349.
  • [40] V. Guizilini and F. Ramos, “Large-scale 3d scene reconstruction with Hilbert maps,” in Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), 2016.
  • [41] ——, “Learning to reconstruct 3d structures for occupancy mapping,” in Proceedings of Robotics: Science and Systems (RSS), 2017.
  • [42] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society, Series B, vol. 67, pp. 301–320, 2005.
  • [43] A. B. Owen, “A robust hybrid of lasso and ridge regression,” Contemporary Mathematics, vol. 443, no. 7, pp. 59–72, 2007.
  • [44] L. Zwald and S. Lambert-Lacroix, “The BerHu penalty and the grouped effect,” jul 2012. [Online]. Available:
  • [45] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, and others, “Tensorflow: a system for large-scale machine learning,” in Proc. of the 12th USENIX Symposium on Operating Systems Design and Implementation, vol. 16, 2016, pp. 265–283.
  • [46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
  • [47] I. Goodfellow, Y. Bengio, and A. Courville, “Deep Learning,” Nature Methods, vol. 13, no. 1, pp. 35–35, 2015.
  • [48] D. Scharstein and R. Szeliski, “A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms,” International Journal of Computer Vision, vol. 47, no. 1-3, pp. 7–42, 2002.
  • [49] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite,” in Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [50] M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3061–3070.
  • [51] B. Li, C. Shen, Y. Dai, A. Van Den Hengel, and M. He, “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 07-12-June, 2015, pp. 1119–1127.
  • [52] M. Mancini, G. Costante, P. Valigi, and T. A. Ciarfuglia, “Fast Robust Monocular Depth Estimation for Obstacle Detection with Fully Convolutional Networks,” in International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 4296–4303.
  • [53] Y. Kuznietsov, J. Stückler, and B. Leibe, “Semi-supervised deep learning for monocular depth map prediction,” in Proc. of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6647–6655.
  • [54] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.