ICRA 2018 "Sparse-to-Dense: Depth Prediction from Sparse Depth Samples and a Single Image" (Torch Implementation)
We consider the problem of dense depth prediction from a sparse set of depth measurements and a single RGB image. Since depth estimation from monocular images alone is inherently ambiguous and unreliable, we introduce additional sparse depth samples, which are either collected from a low-resolution depth sensor or computed from SLAM, to attain a higher level of robustness and accuracy. We propose the use of a single regression network to learn directly from the RGB-D raw data, and explore the impact of the number of depth samples on prediction accuracy. Our experiments show that, as compared to using only RGB images, the addition of 100 spatially random depth samples reduces the prediction root-mean-square error by half in the NYU-Depth-v2 indoor dataset. It also boosts the percentage of reliable prediction from 59% to 92% on the more challenging KITTI driving dataset. We demonstrate two applications of the proposed algorithm: serving as a plug-in module in SLAM to convert sparse maps to dense maps, and creating much denser point clouds from low-resolution LiDARs. Code and a video demonstration are publicly available.
Depth sensing and estimation are of vital importance in a wide range of engineering applications, such as robotics, autonomous driving, augmented reality (AR) and 3D mapping. However, existing depth sensors, including LiDARs, structured-light-based depth sensors, and stereo cameras, all have their own limitations. For instance, the top-of-the-range 3D LiDARs are cost-prohibitive (with up to $75,000 cost per unit), and yet provide only sparse measurements for distant objects. Structured-light-based depth sensors (e.g. Kinect) are sunlight-sensitive and power-consuming, with a short ranging distance. Finally, stereo cameras require a large baseline and careful calibration for accurate triangulation, which demands a large amount of computation and usually fails in featureless regions. Because of these limitations, there has always been a strong interest in depth estimation using a single camera, which is small, low-cost, energy-efficient, and ubiquitous in consumer electronic products.
However, the accuracy and reliability of such methods are still far from being practical, despite over a decade of research effort devoted to RGB-based depth prediction, including the recent improvements with deep learning approaches. For instance, the state-of-the-art RGB-based depth prediction methods [1, 2, 3] produce an average error (measured by the root mean squared error) of over 50cm in indoor scenarios (e.g., on the NYU-Depth-v2 dataset). Such methods perform even worse outdoors, with at least 4 meters of average error on the Make3D and KITTI datasets [5, 6].
To address the potential fundamental limitations of RGB-based depth estimation, we consider the utilization of sparse depth measurements, along with RGB data, to reconstruct depth in full resolution. Sparse depth measurements are readily available in many applications. For instance, low-resolution depth sensors (e.g., low-cost LiDARs) provide such measurements. Sparse depth measurements can also be computed from the output of SLAM and visual-inertial odometry algorithms (a typical feature-based SLAM algorithm, such as ORB-SLAM, keeps track of hundreds of 3D landmarks in each frame). In this work, we demonstrate the effectiveness of using sparse depth measurements, in addition to the RGB images, as part of the input to the system. We use a single convolutional neural network to learn a deep regression model for depth image prediction. Our experimental results show that the addition of as few as 100 depth samples reduces the root mean squared error by over 50% on the NYU-Depth-v2 dataset, and boosts the percentage of reliable prediction from 59% to 92% on the more challenging KITTI outdoor dataset. In general, our results show that the addition of a few sparse depth samples drastically improves depth reconstruction performance. Our quantitative results may help inform the development of sensors for future robotic vehicles and consumer devices.
The main contribution of this paper is a deep regression model that takes both a sparse set of depth samples and RGB images as input and predicts a full-resolution depth image. The prediction accuracy of our method significantly outperforms state-of-the-art methods, including both RGB-based and fusion-based techniques. Furthermore, we demonstrate in experiments that our method can be used as a plug-in module to sparse visual odometry / SLAM algorithms to create an accurate, dense point cloud. In addition, we show that our method can also be used in 3D LiDARs to create much denser measurements.
RGB-based depth prediction. Early works on depth estimation using RGB images usually relied on hand-crafted features and probabilistic graphical models. For instance, Saxena et al. estimated the absolute scales of different image patches and inferred the depth image using a Markov Random Field model. Non-parametric approaches [9, 10, 11, 12] were also exploited to estimate the depth of a query image by combining the depths of images with similar photometric content retrieved from a database.
Recently, deep learning has been successfully applied to the depth estimation problem. Eigen et al. suggested a two-stack convolutional neural network (CNN), with one stack predicting the global coarse scale and the other refining local details. Eigen and Fergus further incorporated other auxiliary prediction tasks into the same architecture. Liu et al. combined a deep CNN and a continuous conditional random field, and attained visually sharper transitions and local details. Laina et al. developed a deep residual network based on the ResNet and achieved higher accuracy than [1, 2]. Semi-supervised [16, 17, 18] setups have also been explored for disparity image prediction. For instance, Godard et al. formulated disparity estimation as an image reconstruction problem, where neural networks were trained to warp left images to match the right.
Depth reconstruction from sparse samples. Another line of related work is depth reconstruction from sparse samples. A common ground of many approaches in this area is the use of sparse representations for depth signals. For instance, Hawe et al. assumed that disparity maps were sparse on the Wavelet basis and reconstructed a dense disparity image with a conjugate sub-gradient method. Liu et al. combined wavelet and contourlet dictionaries for more accurate reconstruction. Our previous work on sparse depth sensing [21, 22] exploited the sparsity underlying the second-order derivatives of depth images, and outperformed both [19, 1] in reconstruction accuracy and speed.
Sensor fusion. A wide range of techniques have attempted to improve depth prediction by fusing additional information from different sensor modalities. For instance, Mancini et al. proposed a CNN that took both RGB images and optical flow images as input to predict distance. Liao et al. studied the use of a 2D laser scanner mounted on a mobile ground robot to provide an additional reference depth signal as input and obtained higher accuracy than using RGB images alone. Compared to the approach by Liao et al., this work makes no assumption regarding the orientation or position of sensors, nor the spatial distribution of input depth samples in the pixel space. Cadena et al. developed a multi-modal auto-encoder to learn from three input modalities, including RGB, depth, and semantic labels. In their experiments, Cadena et al. used sparse depth on extracted FAST corner features as part of the input to the system to produce a low-resolution depth prediction. The accuracy was comparable to using RGB alone. In comparison, our method predicts a full-resolution depth image, learns a better cross-modality representation for RGB and sparse depth, and attains a significantly higher accuracy.
In this section, we describe the architecture of the convolutional neural network. We also discuss the depth sampling strategy, the data augmentation techniques, and the loss functions used for training.
We found in our experiments that many bottleneck architectures (with an encoder and a decoder) could result in good performance. We chose the final structure based on  for the sake of benchmarking, because it achieved state-of-the-art accuracy in RGB-based depth prediction. The network is tailored to our problem with input data of different modalities, sizes and dimensions. We use two different networks for KITTI and NYU-Depth-v2. This is because the KITTI image is triple the size of NYU-Depth-v2 and consequently the same architecture would require 3 times the GPU memory, exceeding the current hardware capacity. The final structure is illustrated in fig:cnn.
The feature extraction (encoding) layers of the network, highlighted in blue, consist of a ResNet followed by a convolution layer. More specifically, the ResNet-18 is used for KITTI, and ResNet-50 is used for NYU-Depth-v2. The last average pooling layer and linear transformation layer of the original ResNet have been removed. The second component of the encoding structure, the convolution layer, has a kernel size of 3-by-3.
The decoding layers, highlighted in yellow, are composed of 4 upsampling layers followed by a bilinear upsampling layer. We use the module proposed by Laina et al.  as our upsampling layer, but a deconvolution with larger kernel size can also achieve the same level of accuracy. An empirical comparison of different upsampling layers is shown in sec:results-architecture.
In this section, we introduce the sampling strategy for creating the input sparse depth image from the ground truth.
During training, the input sparse depth $D$ is sampled randomly from the ground truth depth image $D^*$ on the fly. In particular, for any targeted number of depth samples $m$ (fixed during training), we compute a Bernoulli probability $p = m/n$, where $n$ is the total number of valid depth pixels in $D^*$. Then, for any pixel $(i, j)$,
$$D(i, j) = \begin{cases} D^*(i, j), & \text{with probability } p, \\ 0, & \text{otherwise.} \end{cases}$$
With this sampling strategy, the actual number of non-zero depth pixels varies for each training sample around the expectation $m$. Note that this sampling strategy is different from dropout, which scales up the output by $1/p$ during training to compensate for deactivated neurons. The purpose of our sampling strategy is to increase the robustness of the network against different numbers of inputs and to create more training data (i.e., a data augmentation technique). It is worth exploring how injection of random noise and a different sampling strategy (e.g., feature points) would affect the performance of the network.
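The sampling step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' Torch code: the function name `sample_sparse_depth`, the nested-list image representation, and the convention that a zero depth means "no measurement" are assumptions made for the example. The keep probability is taken to be the target sample count divided by the number of valid pixels, which is what makes the expected number of kept samples equal to the target.

```python
import random


def sample_sparse_depth(dense_depth, num_samples, rng=None):
    """Keep each valid depth pixel independently with probability m / n.

    dense_depth: 2D list of floats, 0 meaning "no measurement" (assumption).
    num_samples: targeted number of samples m.
    """
    rng = rng or random.Random()
    # n: number of valid (non-zero) ground-truth depth pixels.
    n_valid = sum(1 for row in dense_depth for d in row if d > 0)
    if n_valid == 0:
        return [[0.0 for _ in row] for row in dense_depth]
    # Clamp so that asking for more samples than valid pixels keeps everything.
    p = min(1.0, num_samples / n_valid)
    # Unlike dropout, kept values are NOT rescaled by 1 / p.
    return [
        [d if d > 0 and rng.random() < p else 0.0 for d in row]
        for row in dense_depth
    ]
```

Because each pixel is drawn independently, the number of kept samples fluctuates from call to call around the target, which matches the behavior described above.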
We augment the training data in an online manner with random transformations, including:
Scale: color images are scaled by a random number $s$, and depths are divided by $s$.
Rotation: color and depths are both rotated with a random degree $r$.
Color Jitter: the brightness, contrast, and saturation of color images are each scaled by a random factor $k_i$.
Color Normalization: RGB is normalized through mean subtraction and division by standard deviation.
Flips: color and depths are both horizontally flipped with a 50% chance.
Nearest neighbor interpolation, rather than the more common bi-linear or bi-cubic interpolation, is used in both scaling and rotation to avoid creating spurious sparse depth points. We take the center crop from the augmented image so that the input size to the network is consistent.
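The geometric part of this augmentation, applied to a depth image, might look like the following sketch. It is an illustration under stated assumptions, not the original implementation: the helper names (`resize_nearest`, `center_crop`, `augment_depth`) are hypothetical, images are plain nested lists, the scale factor is assumed to be at least 1 so that the center crop recovers the original size, and rotation and the color transforms are omitted.

```python
def resize_nearest(img, scale):
    """Nearest-neighbor resize of a 2D list; every output value is copied
    from an input pixel, so no interpolated depth values are created."""
    h, w = len(img), len(img[0])
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    return [
        [img[min(h - 1, int(i / scale))][min(w - 1, int(j / scale))]
         for j in range(new_w)]
        for i in range(new_h)
    ]


def center_crop(img, out_h, out_w):
    """Take the central out_h x out_w window of a 2D list."""
    h, w = len(img), len(img[0])
    top, left = (h - out_h) // 2, (w - out_w) // 2
    return [row[left:left + out_w] for row in img[top:top + out_h]]


def augment_depth(depth, scale, flip):
    """Scale (assumed >= 1), optionally flip, then center-crop a depth image."""
    h, w = len(depth), len(depth[0])
    scaled = resize_nearest(depth, scale)
    # Depth values are divided by the same factor used to enlarge the image.
    scaled = [[d / scale for d in row] for row in scaled]
    if flip:
        scaled = [row[::-1] for row in scaled]
    return center_crop(scaled, h, w)
```

With bilinear or bicubic interpolation, a zero (missing) pixel next to a valid one would blend into a small non-zero value, which is the kind of spurious depth point the nearest-neighbor choice avoids.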
One common and default choice of loss function for regression problems is the mean squared error ($\mathcal{L}_2$). $\mathcal{L}_2$ is sensitive to outliers in the training data since it penalizes larger errors more heavily. During our experiments we found that the $\mathcal{L}_2$ loss function also yields visually undesirable, over-smooth boundaries instead of sharp transitions.
Another common choice is the Reversed Huber (denoted as berHu) loss function, defined as
$$\mathcal{B}(e) = \begin{cases} |e|, & \text{if } |e| \le c, \\ \dfrac{e^2 + c^2}{2c}, & \text{otherwise.} \end{cases}$$
 uses a batch-dependent parameter $c$, computed as 20% of the maximum absolute error over all pixels in a batch. Intuitively, berHu acts as the mean absolute error ($\mathcal{L}_1$) when the element-wise error falls below $c$, and behaves approximately as $\mathcal{L}_2$ when the error exceeds $c$.
In our experiments, besides the aforementioned two loss functions, we also tested $\mathcal{L}_1$ and found that it produced slightly better results on the RGB-based depth prediction problem. The empirical comparison is shown in sec:results-architecture. As a result, we use $\mathcal{L}_1$ as our default choice throughout the paper for its simplicity and performance.
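For concreteness, the three losses discussed here can be written as below. This is a sketch on flat lists of per-pixel values, not the authors' code. The berHu form used is the standard one (absolute error below the threshold, a scaled quadratic above it), with the threshold set to 20% of the largest absolute error in the batch; the function names and the `c_ratio` parameter are chosen for illustration.

```python
def l1_loss(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def berhu_loss(pred, target, c_ratio=0.2):
    """Reversed Huber loss with a batch-dependent threshold c."""
    errors = [abs(p - t) for p, t in zip(pred, target)]
    c = c_ratio * max(errors)
    if c == 0:
        return 0.0  # perfect prediction; also avoids dividing by zero
    total = 0.0
    for e in errors:
        if e <= c:
            total += e  # linear (L1-like) branch for small errors
        else:
            total += (e * e + c * c) / (2 * c)  # quadratic branch for large errors
    return total / len(errors)
```

The two branches meet at an error equal to the threshold, where the quadratic branch also evaluates to the threshold, so the loss is continuous.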
We implement the network using Torch. Our models are trained on the NYU-Depth-v2 and KITTI odometry datasets using an NVIDIA Tesla P100 GPU with 16GB memory. The weights of the ResNet in the encoding layers (except for the first layer, which has a different number of input channels) are initialized with models pretrained on the ImageNet dataset. We use a small batch size of 16 and train for 20 epochs. The learning rate starts at 0.01, and is reduced to 20% every 5 epochs. A small weight decay is applied for regularization.
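The learning-rate schedule can be read as a step decay. The sketch below assumes that "reduced to 20%" means multiplying the current rate by 0.2 and that epochs are counted from zero; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, decay=0.2, step=5):
    """Step decay: multiply the rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)


# Rate in effect at the start of each 5-epoch stage of a 20-epoch run.
schedule = [learning_rate(e) for e in (0, 5, 10, 15)]
```

Under these assumptions the rate takes four values over the 20 epochs, ending at 0.01 multiplied by 0.2 three times.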
The NYU-Depth-v2 dataset  consists of RGB and depth images collected from 464 different indoor scenes with a Microsoft Kinect. We use the official split of data, where 249 scenes are used for training and the remaining 215 for testing. In particular, for the sake of benchmarking, the small labeled test dataset with 654 images is used for evaluating the final performance, as seen in previous work [3, 13].
For training, we sample spatially evenly from each raw video sequence from the training dataset, generating roughly 48k synchronized depth-RGB image pairs. The depth values are projected onto the RGB image and in-painted with a cross-bilateral filter using the official toolbox. Following [3, 13], the original frames of size 640×480 are first down-sampled to half and then center-cropped, producing a final size of 304×228.
In this work we use the odometry dataset, which includes both camera and LiDAR measurements. The odometry dataset consists of 22 sequences. Among them, one half is used for training while the other half is for evaluation. We use all 46k images from the training sequences for training the neural network, and a random subset of 3200 images from the test sequences for the final evaluation.
We use both left and right RGB cameras as unassociated shots. The Velodyne LiDAR measurements are projected onto the RGB images. Only the bottom crop (912×228) is used, since the LiDAR returns no measurement to the upper part of the images. Compared with NYU-Depth-v2, even the ground truth is sparse for KITTI, typically with only 18k projected measurements out of the 208k image pixels.
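We evaluate each method using the following metrics:
RMSE: root mean squared error
REL: mean absolute relative error
$\delta_j$: percentage of predicted pixels where the relative error is within a threshold. Specifically,
$$\delta_j = \frac{\mathrm{card}\left(\left\{\hat{y}_i : \max\left\{\frac{\hat{y}_i}{y_i}, \frac{y_i}{\hat{y}_i}\right\} < 1.25^j\right\}\right)}{\mathrm{card}\left(\{y_i\}\right)},$$
where $y_i$ and $\hat{y}_i$ are respectively the ground truth and the prediction, and $\mathrm{card}$ is the cardinality of a set. A higher $\delta_j$ indicates better prediction.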
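These metrics are straightforward to compute. The sketch below is one possible implementation on flat lists of per-pixel depths, under a few assumptions that are not stated in the text: the threshold base is the conventional 1.25 raised to the powers 1, 2 and 3, pixels with zero ground truth are treated as invalid and skipped, and the function name `evaluate` and the result keys are hypothetical.

```python
import math


def evaluate(pred, target):
    """Return RMSE, REL and the three threshold accuracies (in percent)."""
    # Only pixels with a valid (positive) ground-truth depth are scored.
    pairs = [(p, t) for p, t in zip(pred, target) if t > 0]
    n = len(pairs)
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in pairs) / n)
    rel = sum(abs(p - t) / t for p, t in pairs) / n

    def delta(j):
        threshold = 1.25 ** j
        # A non-positive prediction can never satisfy the ratio test.
        hits = sum(
            1 for p, t in pairs
            if p > 0 and max(p / t, t / p) < threshold
        )
        return 100.0 * hits / n

    return {"rmse": rmse, "rel": rel,
            "delta1": delta(1), "delta2": delta(2), "delta3": delta(3)}
```

For example, a prediction that is 50% too deep at one pixel fails the first threshold (ratio 1.5 versus 1.25) but passes the second (1.5625).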
In this section we present all experimental results. First, we evaluate the performance of our proposed method with different loss functions and network components on the prediction accuracy in sec:results-architecture. Second, we compare the proposed method with state-of-the-art methods on both the NYU-Depth-v2 and the KITTI datasets in sec:results-literature. Third, in sec:results-samples, we explore the impact of the number of sparse depth samples on the performance. Finally, in sec:results-dense and sec:results-lidar, we demonstrate two use cases of our proposed algorithm in creating dense maps and LiDAR super-resolution.
In this section we present an empirical study on the impact of different loss functions and network components on the depth prediction accuracy. The results are listed in tab:architecture.
To compare the loss functions we use the same network architecture, where the upsampling layers are simple deconvolution with a 2×2 kernel (denoted as DeConv2). The $\mathcal{L}_2$, berHu and $\mathcal{L}_1$ loss functions are listed in the first three rows in tab:architecture for comparison. As shown in the table, both berHu and $\mathcal{L}_1$ significantly outperform $\mathcal{L}_2$. In addition, $\mathcal{L}_1$ produces slightly better results than berHu. Therefore, we use $\mathcal{L}_1$ as our default choice of loss function.
We perform an empirical evaluation of different upsampling layers, including deconvolution with kernels of different sizes (DeConv2 and DeConv3), as well as the UpConv and UpProj modules proposed by Laina et al. The results are listed from row 3 to 6 in tab:architecture.
We make several observations. Firstly, deconvolution with a 3×3 kernel (i.e., DeConv3) outperforms the same component with only a 2×2 kernel (i.e., DeConv2) in every single metric. Secondly, since both DeConv3 and UpConv have a receptive field of 3×3 (meaning each output neuron is computed from a neighborhood of 9 input neurons), they have comparable performance. Thirdly, with an even larger receptive field of 4×4, the UpProj module outperforms the others. We choose to use UpProj as a default choice.
Since our RGBd input data comes from different sensing modalities, its 4 input channels (R, G, B, and depth) have vastly different distributions and support. We perform a simple analysis on the first convolution layer and explore three different options.
The first option is the regular spatial convolution (Conv). The second option is depthwise separable convolution (denoted as DepthWise), which consists of a spatial convolution performed independently on each input channel, followed by a pointwise convolution across different channels with a window size of 1. The third choice is channel dropout (denoted as ChanDrop), through which each input channel is preserved as is with some probability $p$, and zeroed out with probability $1 - p$.
The bottom 3 rows compare the results from the 3 options. The networks are trained using RGBd input with an average of 100 sparse input samples. Conv and ChanDrop yield very similar results, and both significantly outperform the DepthWise layer. Since the difference is small, for the sake of comparison consistency, we will use the Conv convolution layer for all experiments.
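Of the three options, channel dropout is the simplest to illustrate. The following sketch shows the idea on a list of 2D channels; the function name `channel_dropout` and the nested-list representation are assumptions for the example, and it is not the Torch layer used in the experiments.

```python
import random


def channel_dropout(channels, keep_prob, rng=None):
    """Keep each whole input channel unchanged with probability keep_prob,
    otherwise replace it with zeros. Kept channels are not rescaled."""
    rng = rng or random.Random()
    out = []
    for channel in channels:
        if rng.random() < keep_prob:
            out.append([row[:] for row in channel])  # preserved as is
        else:
            out.append([[0.0] * len(row) for row in channel])  # zeroed out
    return out
```

For an RGBd input, `channels` would hold four 2D arrays (R, G, B, and sparse depth), so a single call may, for instance, blank the depth channel while leaving the color channels untouched.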
In this section, we compare with existing methods.
We compare with RGB-based approaches [30, 13, 3], as well as the fusion approach  that utilizes an additional 2D laser scanner mounted on a ground robot. The quantitative results are listed in tab:methods-nyu.
|Problem|#Samples|Method|RMSE (m)|REL|$\delta_1$|$\delta_2$|$\delta_3$|
|---|---|---|---|---|---|---|---|
|RGB|0|Roy et al.|0.744|0.187|-|-|-|
|RGB|0|Eigen et al.|0.641|0.158|76.9|95.0|98.8|
|RGB|0|Laina et al.|0.573|0.127|81.1|95.3|98.8|
|RGBd|225|Liao et al.|0.442|0.104|87.8|96.4|98.9|
Our first observation from Row 2 and Row 3 is that, with the same network architecture, we can achieve a slightly better result (albeit higher REL) by replacing the berHu loss function proposed in  with a simple $\mathcal{L}_1$ loss. Secondly, by comparing problem group RGB (Row 3) and problem group sd (e.g., Row 4), we draw the conclusion that an extremely small set of 20 sparse depth samples (without color information) already produces significantly better predictions than using RGB. Thirdly, by comparing problem group sd and problem group RGBd row by row with the same number of samples, it is clear that the color information does help improve the prediction accuracy. In other words, our proposed method is able to learn a suitable representation from both the RGB images and the sparse depth images. Finally, we compare against  (bottom row). Our proposed method, even using only 100 samples, outperforms  with 225 laser measurements. This is because our samples are spatially uniform, and thus provide more information than a line measurement. A few examples of our predictions with different inputs are displayed in fig:examples-nyu.
|#Samples|Method|RMSE (m)|REL|$\delta_1$|$\delta_2$|$\delta_3$|
|---|---|---|---|---|---|---|
|0|Eigen et al.|7.156|0.190|69.2|89.9|96.7|
|225|Liao et al.|4.50|0.113|87.4|96.0|98.4|
The KITTI dataset is more challenging for depth prediction, since the maximum distance is 100 meters as opposed to only 10 meters in the NYU-Depth-v2 dataset. A greater performance boost can be obtained from using our approach. Although the training and test data are not the same across different methods, the scenes are similar in the sense that they all come from the same sensor setup on a car and the data were collected during driving. We report the values from each work in tab:methods-kitti.
The results in the first RGB group demonstrate that RGB-based depth prediction methods fail in outdoor scenarios, with a pixel-wise RMSE of close to 7 meters. Note that we use sparsely labeled depth images projected from LiDAR, instead of dense disparity maps computed from stereo cameras as in . In other words, we have a much smaller training dataset compared with [23, 13].
In this section, we explore the relation between the prediction accuracy and the number of available depth samples. We train a network for each different input size for optimal performance. We compare the performance for all three kinds of input data, including RGB, sd, and RGBd. The performance of RGB-based depth prediction is independent of input sample size and is thus plotted as a horizontal line for benchmarking.
On the NYU-Depth-v2 dataset in fig:acc_vs_samples_nyu, the RGBd outperforms RGB with over 10 depth samples and the performance gap quickly increases with the number of samples. With a set of 100 samples, the RMSE of RGBd decreases to around 25cm, half of RGB (51cm). The REL sees a larger improvement (from 0.15 to 0.05, reduced by two thirds). On one hand, the RGBd approach consistently outperforms sd, which indicates that the learned model is indeed able to extract information not only from the sparse samples alone, but also from the colors. On the other hand, the performance gap between RGBd and sd shrinks as the sample size increases. Both approaches perform equally well when the sample size goes up to 1000, which accounts for less than 1.5% of the image pixels and is still a small number compared with the image size. This observation indicates that the information extracted from the sparse sample set dominates the prediction when the sample size is sufficiently large, and in this case the color cue becomes almost irrelevant.
The performance gain on the KITTI dataset is almost identical to NYU-Depth-v2, as shown in fig:acc_vs_samples_kitti. With 100 samples the RMSE of RGBd decreases from 7 meters to a half, 3.5 meters. This is the same percentage of improvement as on the NYU-Depth-v2 dataset. Similarly, the REL is reduced from 0.21 to 0.07, again the same percentage of improvement as on NYU-Depth-v2.
On both datasets, the accuracy saturates as the number of depth samples increases. Additionally, the prediction has blurry boundaries even with many depth samples (see fig:super-resolution). We believe both phenomena can be attributed to the fact that fine details are lost in bottleneck network architectures. Whether additional skip connections from encoders to decoders help improve performance remains a subject for further study.
In this section, we demonstrate a use case of our proposed method in sparse visual SLAM and visual inertial odometry (VIO). The best-performing algorithms for SLAM and VIO are usually sparse methods, which represent the environment with sparse 3D landmarks. Although sparse SLAM/VIO algorithms are robust and efficient, the output map is in the form of sparse point clouds and is not useful for other applications (e.g. motion planning).
To demonstrate the effectiveness of our proposed methods, we implement a simple visual odometry (VO) algorithm with data from one of the test scenes in the NYU-Depth-v2 dataset. For simplicity, the absolute scale is derived from ground truth depth image of the first frame. The 3D landmarks produced by VO are back-projected onto the RGB image space to create a sparse depth image. We use both RGB and sparse depth images as input for prediction. Only pixels within a trusted region, which we define as the convex hull on the pixel space formed by the input sparse depth samples, are preserved since they are well constrained and thus more reliable. Dense point clouds are then reconstructed from these reliable predictions, and are stitched together using the trajectory estimation from VIO.
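The trusted-region filter can be sketched as a convex hull computation followed by a point-in-polygon test. The code below uses Andrew's monotone chain algorithm, which is one common way to build the hull and is not necessarily what the authors used; the function names, the `(x, y)` pixel convention, and the use of zero to mark removed pixels are assumptions for the example.

```python
def cross(o, a, b):
    """2D cross product of vectors OA and OB (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(hull, q):
    """True if q lies inside or on the boundary of a counter-clockwise hull."""
    if len(hull) < 3:
        return False  # degenerate hull encloses no area
    return all(
        cross(hull[i], hull[(i + 1) % len(hull)], q) >= 0
        for i in range(len(hull))
    )


def mask_to_trusted_region(pred, sample_pixels):
    """Zero out predicted depths outside the hull of the sparse sample pixels."""
    hull = convex_hull(sample_pixels)
    return [
        [d if inside_hull(hull, (x, y)) else 0.0 for x, d in enumerate(row)]
        for y, row in enumerate(pred)
    ]
```

Pixels that fall outside the hull, such as those on a featureless wall where no landmark was tracked, end up with zero depth and would be dropped before the point cloud is built.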
The results are displayed in fig:slam. The prediction map closely resembles the ground truth map, and is much denser than the sparse point cloud from VO. The major difference between our prediction and the ground truth is that the prediction map has few points on the white wall, where no feature is extracted or tracked by the VO. As a result, pixels corresponding to the white walls fall outside the trusted region and are thus removed.
We present another demonstration of our method in super-resolution of LiDAR measurements. 3D LiDARs have a low vertical angular resolution and thus generate a vertically sparse point cloud. We use all measurements in the sparse depth image and RGB images as input to our network. The average REL is 4.9%, as compared to 20.8% when using only RGB. An example is shown in fig:super-resolution. Cars are much more recognizable in the prediction than in the raw scans.
We introduced a new depth prediction method for predicting dense depth images from both RGB images and sparse depth images, which is well suited for sensor fusion and sparse SLAM. We demonstrated that this method significantly outperforms depth prediction using only RGB images, and other existing RGB-D fusion techniques. This method can be used as a plug-in module in sparse SLAM and visual inertial odometry algorithms, as well as in super-resolution of LiDAR measurements. We believe that this new method opens up an important avenue for research into RGB-D learning and the more general 3D perception problems, which might benefit substantially from sparse depth samples.
This work was supported in part by the Office of Naval Research (ONR) through the ONR YIP program. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the DGX-1 used for this research.
C. Cadena, A. R. Dick, and I. D. Reid, “Multi-modal auto-encoders as joint estimators for robotics scene understanding,” in Robotics: Science and Systems, 2016.
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
A. B. Owen, “A robust hybrid of lasso and ridge regression,” Contemporary Mathematics, vol. 443, pp. 59–72, 2007.