1 Introduction
Deep convolutional networks have proven to be effective tools for solving deep regression problems like depth prediction and depth completion [8]. Most networks proposed for this regression task share a common structure in which the penultimate features are reduced to a single channel by a final convolutional layer. This final convolutional output is then passed through a nonlinear function to map it onto the range of acceptable depth values.
This observation motivates the main contribution of this paper: instead of using a fixed set of weights in the final layer, we perform a least squares fit from the penultimate features to the sparse depths to obtain a set of data-dependent weights. The rest of the network parameters are still shared across input data and learned using stochastic gradient descent. From a regression point of view, the network that produces the penultimate layer of features is an adaptive basis function [2], and we refer to the features before the final layer as depth bases. We argue that explicitly carrying out a regression from the depth bases to the sparse depths allows the network to learn a different representation that better enforces consistency between its predictions and the measurements, which manifests as a significant performance gain. To this end, we first demonstrate how one can circumvent the nonlinearity of the depth activation function by solving a linear least squares problem with transformed target sparse depths. We then address the full robustified nonlinear least squares problem in order to deal with noisy measurements and outliers in real-world data. Finally, to make our module truly a drop-in replacement for the final convolutional layer, we show how to adapt it to output predictions at multiple scales with progressively increasing detail, a feature required by self-supervised training schemes.
2 Related Work
2.1 Depth Estimation
Supervised Learning. Estimating dense depths from a single image is a fundamentally ill-posed problem. Recent learning-based approaches try to solve this by leveraging the predictive power of deep convolutional neural networks (CNNs) with strong regularization [8, 27, 10]. These works require dense or semi-dense ground truth annotations, which are costly to obtain in large quantities in practice. Synthetic data [36, 11, 39], on the other hand, can be generated more easily from current graphics systems. However, it is nontrivial to generate synthetic data that closely matches the appearance and structure of the real world, so the resulting networks may require an extra step of fine-tuning or domain adaptation [1].
Self-Supervised Learning. When ground truth depths are not available, one could instead seek to use view synthesis as a supervisory signal [43]. This so-called self-supervised training has gained popularity in recent years [30, 34, 48]. The network still takes a single image as input and predicts depths, but the loss is computed on a set of images. This is achieved by warping pixels from a set of source images to the target image using the predicted depths, along with estimated camera poses and camera intrinsics. Under various constancy assumptions [33], errors between target and synthesized images are computed and backpropagated through the network for learning.
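To make the warping step concrete, the following is a minimal sketch of inverse warping with a pinhole camera model; the function name, tensor layout, and the target-to-source convention for the transform T are our own illustrative assumptions, not tied to any specific cited implementation:

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(src, depth, K, K_inv, T):
    """Inverse-warp a source image into the target frame for view synthesis.

    src: (3, H, W) source image, depth: (H, W) predicted target depth,
    K, K_inv: (3, 3) camera intrinsics and inverse, T: (4, 4) target-to-source pose.
    """
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).reshape(3, -1)  # homogeneous pixels
    cam = (K_inv @ pix) * depth.reshape(1, -1)       # back-project to 3D in target frame
    cam_h = torch.cat([cam, torch.ones(1, H * W)], 0)
    src_cam = (T @ cam_h)[:3]                        # transform into the source frame
    src_pix = K @ src_cam                            # re-project into source pixels
    src_pix = src_pix[:2] / src_pix[2].clamp_min(1e-6)
    # normalize coordinates to [-1, 1] for grid_sample
    gx = 2 * src_pix[0] / (W - 1) - 1
    gy = 2 * src_pix[1] / (H - 1) - 1
    grid = torch.stack([gx, gy], -1).reshape(1, H, W, 2)
    return F.grid_sample(src[None], grid, align_corners=True)[0]
```

The photometric loss is then computed between the target image and this synthesized image, and its gradient flows back into the depth prediction.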
Another version of self-supervision utilizes synchronized stereo pairs [13] during training. In this setting, the network predicts the depth for the left view and uses the known focal length and baseline to reconstruct the right view, and vice versa. A more complex form utilizes the motion in monocular videos [57]. In these approaches the network also needs to predict the transformation between frames. The biggest challenge faced by monocular self-supervision is handling moving objects. Many authors try to address this issue by predicting an explainability mask [57], motion segmentation [47], or joint optical flow estimation [55]. We refer readers to [16] for a more detailed review.
2.2 Depth Completion
Depth completion is an extension of the depth estimation task in which sparse depths are available as input. Uhrig et al. [46] propose a sparse convolution layer that explicitly handles missing data, which renders it invariant to different levels of sparsity. Ma et al. [29] adopt an early-fusion strategy to combine color and sparse depth inputs in a self-supervised training framework. On the other hand, Jaritz et al. [25] and Shivakumar et al. [41] advocate a late-fusion approach to transform both inputs into a common feature space. Zhang et al. [56] and Qiu et al. [35] estimate surface normals as a secondary task to help densify the sparse depths. Imran et al. [22] identify the cause of artifacts arising from convolution on sparse data and propose a novel scheme, Depth Coefficients, to address this problem. Eldesokey et al. [9] and Gansbeke et al. [12] propose to use a confidence mask to handle noise and uncertainty in sparse data. Yang et al. [54] infer the posterior distribution of depth given an image and sparse depths with a Conditional Prior Network. While most of the above works deal with data from LiDARs or depth cameras, Wong et al. [53] design a system that works with very sparse data from a visual-inertial odometry system. Weerasekera et al. [52] attach a fully-connected Conditional Random Field to the output of a depth prediction network, which can also handle any input sparsity pattern.
Cheng et al. [4] propose a convolutional spatial propagation network that learns the affinity matrix to complete sparse depths. This is similar to a diffusion process and uses several iterations to update the depth map. Another iterative approach is described by Wang et al. [50], in which they design a module that can be integrated into many existing methods to improve the performance of a pretrained network without retraining. This is done by iteratively updating the intermediate feature map to make the model output consistent with the given sparse depths. Like [50], our approach can be readily integrated into many of the previously proposed depth completion networks. In other related work, Tang et al. [44] propose to parameterize the depth map with a set of basis depth maps and optimize weights to minimize a feature-metric distance. In contrast, our bases are multiscale by construction and are fit directly to the sparse depths.
3 Method
In this section, we describe our proposed method for the task of monocular image-guided depth completion (from now on we will refer to this task simply as depth completion). Given an image $I$ and a sparse depth map $S$, we wish to predict a dense depth image $D = f(I, S; \theta)$ from a depth estimation function $f$ that minimizes some loss function $\mathcal{L}$ with respect to the ground truth depth $D_{gt}$. Typically, $I$ is a color image, $S$ is a sparse depth map where invalid pixels are encoded by $0$, and $f$ is a fully convolutional neural network whose parameters are denoted by $\theta$. When ground-truth depth is available, the learning problem is to determine $\theta$ according to
$$\theta^* = \arg\min_{\theta} \; \mathcal{L}\big(f(I, S; \theta),\, D_{gt}\big). \tag{1}$$
For supervised training we choose $\mathcal{L}$ to be the L1 norm on depth, and for self-supervised training we use a combination of L1 + SSIM on the intensity values [51], coupled with an edge-aware smoothness term [16].
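For reference, the edge-aware smoothness term of [16] penalizes depth gradients except where the image itself has strong gradients; the following is a minimal sketch, where the tensor shapes and the omission of mean-normalization are simplifying assumptions on our part:

```python
import torch

def edge_aware_smoothness(depth, image):
    """Edge-aware smoothness in the spirit of [16]: depth gradients are
    down-weighted where the image has strong gradients (likely true edges)."""
    # depth: (1, H, W), image: (3, H, W)
    dx_d = (depth[:, :, 1:] - depth[:, :, :-1]).abs()
    dy_d = (depth[:, 1:, :] - depth[:, :-1, :]).abs()
    dx_i = (image[:, :, 1:] - image[:, :, :-1]).abs().mean(0, keepdim=True)
    dy_i = (image[:, 1:, :] - image[:, :-1, :]).abs().mean(0, keepdim=True)
    # weight depth gradients by exp(-|image gradient|)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```

A perfectly flat depth map incurs zero smoothness loss regardless of image content, while depth discontinuities are only penalized where the image is smooth.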
3.1 Linear Least-Squares Fitting (LSF) Module
Existing depth prediction networks usually employ a final convolutional layer to convert an $N$-channel set of basis features, $\mathbf{B}$, to a single-channel result, $\mathbf{z}$, which is sometimes referred to as the logits layer. The inputs to this final layer are allowed to range freely between $-\infty$ and $+\infty$, while the logit outputs are mapped to positive depth values by a nonlinear activation function, $h$. Following common practice in the depth completion literature [16], we choose $h$ as follows:
$$D = h(z) = \frac{\alpha}{s(z)}, \tag{2}$$
where $\alpha$ is a scaling factor that controls the minimum depth and $s(\cdot)$ is the sigmoid function. In this work, we fix $\alpha$ to a constant. For simplicity we assume that the final convolution filter that maps the basis features, $\mathbf{B}$, onto the logits, $\mathbf{z}$, has a kernel size of $1 \times 1$ with bias $b$, but one could easily extend our result to arbitrary kernel sizes. $\mathbf{z}$ is, therefore, an affine combination of the channels in $\mathbf{B}$, and the predicted depth at pixel $i$ is
$$D(i) = h\big(\mathbf{w}^\top \mathbf{x}(i)\big), \tag{3}$$
where $\mathbf{w} \in \mathbb{R}^{N+1}$ represents the combined filter weights and bias, and $\mathbf{x}(i) = [\mathbf{B}(i)^\top, 1]^\top$ is the basis (feature) vector at pixel $i$, with $(\cdot)(i)$ the pixel index operator. To simplify notation, we use lower-case letters, e.g. $\mathbf{x}$, to denote values at a particular pixel location. The weights $\mathbf{w}$ are updated via backpropagation [28] using stochastic gradient descent [3]. Once learned, they are typically fixed at inference time.

When enough sparse depth measurements are available, the weights can instead be computed directly from data. Specifically, our weights are obtained through a least squares fit from the bases to the sparse depths at valid pixels, which can then be used in place of the final convolutional layer. An overview of our proposed method is shown in Figure 1.
The objective function we wish to minimize for the least squares problem is
$$E(\mathbf{w}) = \sum_{k=1}^{m} r_k(\mathbf{w})^2, \tag{4}$$
with residual function
$$r_k(\mathbf{w}) = h\big(\mathbf{w}^\top \mathbf{x}_k\big) - s_k, \tag{5}$$
where $s_k$ denotes an individual sparse depth measurement, $m$ is the number of valid pixels in $S$, $N$ the number of channels in $\mathbf{B}$, and $h$ a nonlinear activation function.
The residual function is obviously nonlinear in the weights $\mathbf{w}$ due to the nonlinearity in $h$. A simple workaround is to transform the target variable by $h^{-1}$ to arrive at a new linear residual function
$$\tilde{r}_k(\mathbf{w}) = \mathbf{w}^\top \mathbf{x}_k - h^{-1}(s_k). \tag{6}$$
We can then rewrite the new objective function (4) in matrix form to obtain a linear least squares problem
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \; \|\mathbf{X}\mathbf{w} - \tilde{\mathbf{s}}\|^2, \tag{7}$$
where $\mathbf{X} \in \mathbb{R}^{m \times (N+1)}$ denotes the matrix of stacked features at valid pixel locations and $\tilde{\mathbf{s}}$ the corresponding transformed sparse depth vector. The solution to (7) is given by the well-known Moore-Penrose pseudo-inverse, which can be further regularized with a parameter $\lambda$ [2]:
$$\mathbf{w}^* = \big(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}\big)^{-1} \mathbf{X}^\top \tilde{\mathbf{s}}. \tag{8}$$
Notice here that our weights are calculated deterministically as a function of the bases $\mathbf{B}$ and the sparse depths $S$, while the original convolution filter is independent of both. In practice, this problem is usually solved via LU or Cholesky decomposition, both of which are differentiable [31]. Thus, the entire training process including our LSF module is differentiable, which means it can be trained in an end-to-end manner. This is an important point, since we have found that retraining the network with this fitting module produces much better results than simply adding the fitting procedure to a pretrained network without retraining. Effectively, the retraining allows the network to make the best use of the new adaptive fitting layer.
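As a concrete illustration, the linear fitting of Eq. (8) takes only a few lines of PyTorch. The helper name `lsf_linear`, the activation form $h(z) = \alpha / s(z)$ assumed from Eq. (2), and the default regularization value are illustrative assumptions on our part, not the exact implementation:

```python
import torch

def lsf_linear(bases, sparse, mask, alpha=1.0, lam=1e-4):
    """Regularized least squares fit (Eq. 8) from depth bases to sparse depths.

    bases:  (N, H, W) penultimate features; sparse: (H, W) depths (> alpha at
    valid pixels); mask: (H, W) bool, True at valid measurements.
    Assumes the activation h(z) = alpha / sigmoid(z), so h^{-1}(d) = logit(alpha/d).
    """
    N, H, W = bases.shape
    ones = torch.ones(H * W, 1, device=bases.device)
    X = torch.cat([bases.reshape(N, -1).t(), ones], dim=1)   # (HW, N+1), bias column
    valid = mask.reshape(-1)
    Xv = X[valid]                                            # stacked features at valid pixels
    y = torch.logit(alpha / sparse.reshape(-1)[valid])       # transformed targets h^{-1}(s_k)
    A = Xv.t() @ Xv + lam * torch.eye(N + 1, device=bases.device)
    w = torch.linalg.solve(A, Xv.t() @ y)                    # regularized normal equations
    depth = alpha / torch.sigmoid(X @ w)                     # dense prediction h(Xw)
    return depth.reshape(H, W), w
```

Because `torch.linalg.solve` is differentiable, gradients flow through the fitted weights back into the bases, matching the end-to-end training described above.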
3.2 Robustified Nonlinear Fitting
The linear LSF module is readily usable as a replacement for the final convolution layer in many depth prediction networks. One problem remains to be addressed, namely that the original objective function in Equation (5) is nonlinear in the weights $\mathbf{w}$. Although applying the inverse transformation to the sparse depths is a simple yet effective solution, we show that performing a full robustified nonlinear least squares fit provides further performance improvements and outlier rejection, at the cost of extra computation time.
Real-world data often contain noise and outliers that are hard to model or eliminate. Cheng et al. [5] point out that there exist many different types of noise in LiDAR data from the well-known KITTI dataset [14]. These include: 1) noise in the LiDAR measurement itself, 2) LiDAR-camera misalignment, 3) moving objects, and 4) transparent and reflective surfaces. They propose a novel feedback loop that utilizes stereo matching from the network to clean erroneous data points in the sparse depths. Gansbeke et al. [12] let the network predict a confidence map to weight information from different input branches. To handle these cases, we employ M-estimators [20], which fit well within our least squares framework.
Recall the objective function in Equation (4); taking the derivative with respect to $\mathbf{w}$, setting it to zero, and ignoring higher-order terms yields the following linear equation (the Gauss-Newton approximation)
$$\mathbf{J}^\top \mathbf{J}\, \delta\mathbf{w} = -\mathbf{J}^\top \mathbf{r}, \tag{9}$$
where $\mathbf{J}$ is the Jacobian matrix formed by stacking the per-measurement Jacobians $\mathbf{J}_k = \partial r_k / \partial \mathbf{w}$, and $\mathbf{r}$ is the residual vector formed by stacking $r_k$. Following standard practice in Triggs et al. [45], we minimize the effective squared error, where the cost function is statistically weighted and robustified, which is equivalent to solving for $\delta\mathbf{w}$ in
$$\mathbf{J}^\top \mathbf{W} \mathbf{J}\, \delta\mathbf{w} = -\mathbf{J}^\top \mathbf{W} \mathbf{r}, \tag{10}$$
$$\mathbf{W} = \boldsymbol{\Sigma}^{-1} \operatorname{diag}\!\big(\rho'(r_k^2 / \sigma_k^2)\big), \tag{11}$$
where $\boldsymbol{\Sigma}^{-1} = \operatorname{diag}(1/\sigma_k^2)$ is a diagonal matrix with terms inversely proportional to the noise in each measurement, which we assume to be Gaussian for LiDARs, and $\rho$ is the Huber loss [21] with first derivative $\rho'$:
$$\rho(t) = \begin{cases} t, & t \le 1 \\ 2\sqrt{t} - 1, & t > 1 \end{cases}, \qquad \rho'(t) = \begin{cases} 1, & t \le 1 \\ 1/\sqrt{t}, & t > 1. \end{cases} \tag{12}$$
We iteratively calculate $\delta\mathbf{w}$ by solving (10) and update
$$\mathbf{w} \leftarrow \mathbf{w} + \delta\mathbf{w}, \tag{13}$$
with $\mathbf{w}$ initialized from the linear fitting in Section 3.1.
Theoretically, one should repeat this until convergence, but to alleviate the problem of vanishing or exploding gradients [19], we adopt the fixed-iteration approach used in [44], which is also known as incomplete optimization [7]. Despite its limitations, it has the advantage of a fixed training/inference time and reduced memory consumption, which is often desirable in robotic systems with limited computational resources. As discussed in Section 3.1, solving a linear system like Equation (10) via Cholesky decomposition is differentiable, so optimizing this nonlinear objective function with a fixed number of Gauss-Newton steps maintains the differentiability of the entire system.
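The fixed-iteration Gauss-Newton refinement of Eqs. (9)-(13) can be sketched in the same style. The IRLS-style Huber weight min(1, δ/|r|) (with unit measurement noise assumed, i.e. Σ = I) and the helper names are our own illustrative choices, not the exact implementation:

```python
import torch

def lsf_robust(bases, sparse, mask, w0, alpha=1.0, lam=1e-4, iters=2, delta=1.0):
    """A fixed number of robustified Gauss-Newton steps on the objective (Eq. 4).

    w0 is the warm start from the linear fit of Section 3.1; assumes the
    activation h(z) = alpha / sigmoid(z), whose derivative is
    h'(z) = -alpha * (1 - sigmoid(z)) / sigmoid(z).
    """
    N, H, W = bases.shape
    ones = torch.ones(H * W, 1, device=bases.device)
    X = torch.cat([bases.reshape(N, -1).t(), ones], dim=1)
    valid = mask.reshape(-1)
    Xv, s = X[valid], sparse.reshape(-1)[valid]
    w = w0.clone()
    reg = lam * torch.eye(N + 1, device=bases.device)
    for _ in range(iters):                                   # incomplete optimization
        z = Xv @ w
        sig = torch.sigmoid(z)
        r = alpha / sig - s                                  # residuals h(x^T w) - s_k
        J = (-alpha * (1 - sig) / sig).unsqueeze(1) * Xv     # rows: dr_k / dw
        wgt = torch.clamp(delta / r.abs().clamp_min(1e-12), max=1.0)  # Huber weights
        A = J.t() @ (wgt.unsqueeze(1) * J) + reg
        w = w + torch.linalg.solve(A, -(J.t() @ (wgt * r)))  # Eq. (13) update
    depth = alpha / torch.sigmoid(X @ w)
    return depth.reshape(H, W), w
```

Large residuals (likely outliers) receive weights below one and thus contribute less to each normal-equation solve, which is the down-weighting behavior the robust norm is meant to provide.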
3.3 Multiscale Prediction for Self-supervision
Self-supervised training formulates the learning problem as novel view synthesis, where the network-predicted depth is used to synthesize the target image from other viewpoints. To overcome the gradient locality problem of the bilinear sampler [24] during image warping, previous works [15, 57] adopt a multiscale prediction and image reconstruction scheme, predicting a depth map at each decoder layer's resolution. According to Godard et al. [16], this has the side effect of creating artifacts in large textureless regions of the lower-resolution depth maps due to ambiguities in photometric errors. They later improved upon this multiscale formulation by upsampling all the lower-resolution depth maps to the input image resolution.
This technique greatly reduces various artifacts in the final depth prediction, but it still has one undesired property: the depth maps predicted at each scale are largely independent. The lower-resolution depth maps are used during training but discarded at inference, resulting in wasted parameters.
Rather than predicting a depth map at each scale separately, we propose to predict a set of bases $\{\mathbf{B}_0, \ldots, \mathbf{B}_S\}$, as shown in Figure 2. Each basis is obtained by upsampling features from the corresponding scale in the decoder, so the resulting basis images are band-limited by construction, with coarser basis images corresponding to earlier layers in the decoder. The depth prediction at a particular scale $s$ is then reconstructed using bases up to that scale,
$$D_s(i) = h\Big(\sum_{j=0}^{s} \mathbf{w}_j^\top \mathbf{x}_j(i)\Big). \tag{14}$$
The final depth prediction at the highest scale $S$ is
$$D(i) = h\big(\mathbf{w}^\top \mathbf{x}(i)\big), \tag{15}$$
where $\mathbf{w} = [\mathbf{w}_0^\top, \ldots, \mathbf{w}_S^\top]^\top$ and $\mathbf{x}(i) = [\mathbf{x}_0(i)^\top, \ldots, \mathbf{x}_S(i)^\top]^\top$.
With this formulation, predictions at different scales will work towards the same goal, which is to reconstruct the full resolution depth map. This approach is analogous to wavelet or Fourier encodings of an image where the basis maps are organized into bandlimited components to represent the signal at various scales.
Our LSF module handles this multiscale approach quite naturally, since we can simply allocate the basis maps among the desired scales, then upsample and group them back together to perform the fitting step. Henceforth we use this new multiscale prediction scheme in all our experiments, even for supervised training where only the full-resolution depth prediction is required.
Table 1: Summary of all datasets we evaluate on.

| Dataset | Resolution | # Train | # Val | Cap [m] |
|---|---|---|---|---|
| KITTI [14, 46] | 375 × 1242 | 38412 | 3347 | 80 |
| VKITTI [11] | 188 × 621 | 5156 | 837 | 130 |
| Synthia [39] | 304 × 512 | 3634 | 901 | 130 |
| NYUV2 [42] | 480 × 640 | 1086 | 363 | – |
4 Experiments
Table 2: Supervised Training. For each dataset we report MAE, RMSE, and δ₁ (%). lsf (no retrain) denotes evaluating a pretrained conv baseline with our fitting module substituted in without retraining.

| | | | NYUV2 | | | VKITTI | | | Synthia | | | KITTI | | |
| Input | Method | Sparse | MAE | RMSE | δ₁ | MAE | RMSE | δ₁ | MAE | RMSE | δ₁ | MAE | RMSE | δ₁ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rgb | conv | – | 0.6244 | 0.8693 | 58.44 | 6.9998 | 14.653 | 66.43 | 2.3911 | 6.3915 | 76.09 | 1.8915 | 4.1164 | 86.24 |
| rgb | pnp | 0.2% | 0.5517 | 0.7976 | 64.23 | 6.4701 | 13.990 | 70.18 | 2.1716 | 6.0084 | 81.37 | 1.6581 | 3.8019 | 88.67 |
| rgb | lsf (no retrain) | 0.2% | 0.4081 | 0.6124 | 77.86 | 5.8379 | 12.712 | 71.62 | 2.4089 | 6.2520 | 78.49 | 1.7033 | 3.5986 | 91.80 |
| rgb | lsf | 0.2% | 0.1826 | 0.3165 | 96.11 | 4.5122 | 9.7933 | 77.18 | 2.0104 | 5.6285 | 84.37 | 0.7716 | 2.0808 | 97.69 |
| | (conv − lsf) / conv | | +71% | +64% | | +36% | +33% | | +16% | +12% | | +59% | +50% | |
| rgbd | conv | 4% | 0.1089 | 0.1679 | 99.20 | 1.5683 | 4.8982 | 94.71 | 0.7506 | 3.3322 | 96.50 | 0.3033 | 1.1392 | 99.57 |
| rgbd | pnp | 4% | 0.1008 | 0.1604 | 99.24 | 1.5301 | 4.8798 | 94.81 | 0.7311 | 3.3217 | 96.60 | 0.2993 | 1.1343 | 99.57 |
| rgbd | lsf (no retrain) | 4% | 0.1127 | 0.1853 | 99.34 | 2.1049 | 6.1901 | 95.30 | 1.3220 | 4.6594 | 94.27 | 0.6319 | 2.2895 | 98.46 |
| rgbd | lsf | 4% | 0.0300 | 0.0735 | 99.83 | 1.2598 | 4.6227 | 97.43 | 0.5317 | 3.1146 | 97.85 | 0.2266 | 0.9988 | 99.67 |
| | (conv − lsf) / conv | | +72% | +56% | | +20% | +6% | | +29% | +7% | | +25% | +12% | |
Noise and Outliers. For each dataset we report MAE, RMSE, and δ₁ (%).

| | | | NYUV2 | | | VKITTI | | | Synthia | | | KITTI | | |
| Input | Method | Sparse | MAE | RMSE | δ₁ | MAE | RMSE | δ₁ | MAE | RMSE | δ₁ | MAE | RMSE | δ₁ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rgb | pnp | 0.2% | 0.5587 | 0.8019 | 63.66 | 6.5099 | 14.018 | 69.86 | 2.2044 | 6.0268 | 80.89 | 1.6571 | 3.8019 | 88.67 |
| rgb | lsf | 0.2% | 0.2439 | 0.3815 | 92.93 | 5.2670 | 10.696 | 65.00 | 2.2197 | 5.9136 | 78.34 | 0.7716 | 2.0808 | 97.69 |
| rgb | lsf2 | 0.2% | 0.2304 | 0.3519 | 92.70 | 6.0025 | 10.768 | 51.01 | 3.2160 | 7.2096 | 59.68 | 1.0111 | 2.4547 | 95.88 |
| rgb | lsf2+ | 0.2% | 0.1880 | 0.3217 | 94.97 | 4.6786 | 9.7402 | 70.16 | 2.1032 | 5.7685 | 79.00 | 0.6775 | 1.9651 | 98.28 |
| | (lsf − lsf2+) / lsf | | +23% | +16% | | +11% | +9% | | +5% | +2% | | +12% | +6% | |
| rgbd | conv | 4% | 0.1173 | 0.1788 | 99.07 | 1.8748 | 5.1880 | 94.17 | 0.8774 | 3.4660 | 96.03 | 0.3033 | 1.1392 | 99.57 |
| rgbd | pnp | 4% | 0.1061 | 0.1688 | 99.15 | 1.8067 | 5.1342 | 94.46 | 0.8452 | 3.4511 | 96.19 | 0.2993 | 1.1343 | 99.57 |
| rgbd | lsf | 4% | 0.0606 | 0.1102 | 99.73 | 1.8599 | 5.1987 | 95.90 | 0.7082 | 3.2426 | 97.41 | 0.2266 | 0.9988 | 99.67 |
| rgbd | lsf2 | 4% | 0.0577 | 0.1080 | 99.72 | 1.8008 | 5.0008 | 94.58 | 0.7890 | 3.4142 | 96.78 | 0.2305 | 1.0417 | 99.67 |
| rgbd | lsf2+ | 4% | 0.0493 | 0.1003 | 99.73 | 1.7273 | 5.0422 | 95.50 | 0.7188 | 3.2579 | 97.31 | 0.2208 | 0.9758 | 99.71 |
| | (conv − lsf2+) / conv | | +58% | +44% | | +8% | +3% | | +18% | +6% | | +27% | +14% | |
Table 3: Quantitative results of supervised training with noisy data and outliers. For all datasets except KITTI, noise is additive Gaussian with a standard deviation of 0.05 m, and we randomly sample 30% of the sparse depths to be outliers. conv denotes the baseline network; pnp denotes running the PnP [50] module on the trained conv network without retraining; lsf is our linear fitting module; lsf2 is our nonlinear fitting module with 2 iterations; lsf2+ is lsf2 with a robust (Huber) norm. Best results in each category are in bold.

Table 4: Self-Supervised Training. For each dataset we report MAE, RMSE, and δ₁ (%).

| | | | VKITTI Mono | | | Synthia Mono | | | Synthia Stereo | | | KITTI Stereo | | |
| Input | Method | Sparse | MAE | RMSE | δ₁ | MAE | RMSE | δ₁ | MAE | RMSE | δ₁ | MAE | RMSE | δ₁ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rgbd | convms | 4% | 2.9904 | 7.4517 | 86.87 | 3.0191 | 9.1076 | 66.43 | 1.3498 | 5.8643 | 92.73 | 0.6295 | 2.0950 | 99.00 |
| rgbd | lsf | 4% | 2.3804 | 6.7326 | 93.76 | 1.4564 | 4.6260 | 91.76 | 0.8619 | 3.9523 | 96.30 | 0.5820 | 1.7370 | 98.79 |
| | (convms − lsf) / convms | | +20% | +10% | | +52% | +49% | | +36% | +33% | | +8% | +17% | |
4.1 Implementation Details
Network Architecture.
All networks and training are implemented in PyTorch (http://pytorch.org). To investigate the effectiveness of the proposed LSF module, we adopt the network used in Ma et al. [29] as our main baseline. The network is a symmetric encoder-decoder [38] with skip connections. We make the following modifications for better training: 1) transposed convolutions are replaced with resize convolutions [32] for better upsampling, 2) the extra convolution layers between the encoder and the decoder are removed, and 3) the encoder is based on ResNet18, as opposed to ResNet34 [18], and is initialized with parameters pretrained on ImageNet [40].

We let the decoder output 4-, 8-, 16-, and 32-dimensional bases at each scale. These are then upsampled to the image resolution and concatenated together to form a 60-dimensional basis. For the baseline network, this basis is fed directly into a final convolution layer, while for ours, it is passed into the LSF module together with the sparse depths. The two methods therefore have exactly the same network parameters up to the last convolution layer.
Training Parameters. Following [29], we use the Adam optimizer [26] with an initial learning rate of 1e-4, which we reduce by half every 5 epochs. Training is carried out on a single Tesla V100 GPU for 15 epochs, and the best validation result is reported. Batch sizes may vary across datasets due to GPU memory constraints, but are kept the same for experiments on the same dataset. Only random horizontal flips are used to augment the data for supervised training; no data augmentation is performed for self-supervised training. The above settings are used across all experiments in this work (unless explicitly stated) with the same random seed to ensure controlled experiments with fair and meaningful comparisons.

4.2 Datasets
A summary of all datasets we evaluate on is shown in Table 1.
KITTI Depth Completion. We evaluate on the newly introduced KITTI depth completion dataset [46] and follow the official training/validation split. The ground truth depth is generated by merging several consecutive LiDAR scans around a given frame and refining them with a stereo matching algorithm. The sparse depth map is generated by projecting LiDAR measurements onto the closest image, covering on average 4% of the image pixels. We use all categories from the KITTI raw dataset [14] except for Person, as it contains mostly static scenes with moving objects, which are not suitable for self-supervised training.
Virtual KITTI. The Virtual KITTI (VKITTI) dataset is a synthetic video dataset [11], which contains 50 monocular videos generated with various simulated lighting and weather conditions, with dense ground truth annotations. We adopt an out-of-distribution testing scheme for this dataset. Specifically, we use sequences 1, 2, 6, and 18 with variations clone, morning, overcast, and sunset for training, and sequence 20 with variation clone for validation. Thus the testing sequence is never seen during training. The sparse depths are generated by randomly sampling pixels that have a depth value of less than 130 meters. We intentionally increase the depth cap to 130 meters for all synthetic datasets, since recent LiDAR units (https://www.ouster.io/) can easily achieve this range.
Synthia. Synthia [39] is another synthetic dataset in urban settings with dense ground truth. We use the SYNTHIA-Seqs version, which simulates four video sequences acquired from a virtual car across different seasons. Following the training protocol for VKITTI, we use sequences 1, 2, 5, and 6 for training and sequence 4 for validation, all under the summer variation. We include this dataset because it has simulated stereo images, which serve as a complement to the monocular-only VKITTI. Again, ground truth and sparse depths are capped at 130 meters.
NYU Depth V2. In addition to the outdoor datasets, we also validate our approach on NYU Depth V2 (NYUV2) [42], an indoor dataset. We use the 1449 densely labeled pairs of aligned RGB and depth images instead of the full dataset, which comprises raw image and depth data from the Kinect sensor. The dataset is split into approximately 75% training and 25% validation. We use the same strategy as above for sampling sparse depths, but place no cap on the maximum depth.
4.3 Results
We evaluate performance using standard metrics from the depth estimation literature. Note that for accuracy (the δ thresholds) [8] we only report δ₁, due to space limitations and the fact that δ₂ and δ₃ are typically above 99% in our experiments and thus provide limited insight. Following [50], we group results based on input modalities, where rgb denotes a network that only takes a color image as input. In contrast, rgbd indicates a network that takes both the color image and the sparse depths as inputs.
Performance of Linear Fitting. Table 2 shows quantitative comparisons between our proposed linear LSF module from Section 3.1 and the baseline under supervised training. We see consistent improvements of our linear LSF module over the baseline in all metrics across all datasets. Note that for rgb input only, the baseline doesn’t use any sparse depth information at all. Thus the large improvement achieved by our fitting method using depth measurements for only 0.2% of the pixels is quite significant. For the rgbd case, although the sparse depth map is already used as the input to the baseline network, adding our fitting module better constrains the final prediction to be in accordance with the measurements and improves the baseline network. Since we use the L1 norm as our loss function, the improvement in MAE is bigger than that in RMSE. Examples of depth prediction are shown in Figure 3 for qualitative comparisons.
We also perform experiments in which we take a pretrained baseline method, replace the final convolutional layer with our LSF module, and evaluate without retraining; this is denoted by lsf (no retrain). The results show that retraining a baseline network with the LSF module allows it to achieve significantly better performance.
Additionally, we compare with PnP [50], a similar method that can be applied to many existing networks to improve performance (see Tables 2 and 3). The main difference is that PnP does not require retraining. We use the authors' official implementation on our baseline network, updating the output of the encoder and running for 5 iterations with update rate 0.01 as suggested in the paper. We found that although PnP has the advantage of requiring no retraining, it takes much longer to run, uses a large amount of memory, and yields a smaller improvement compared to ours. Comparisons of runtime are provided in the supplementary material.
Table 5 compares our results to those achieved with CSPN [4]. The numbers for the CSPN system are taken directly from their paper and the official KITTI depth completion benchmark. For NYUV2 we use the same data split they used and sample 500 sparse depths. These results show the improvement afforded by our method.
Table 5: Comparison with CSPN [4].

| | | NYUV2 | | KITTI | | |
| Input | Method | RMSE | δ₁ | MAE | RMSE | iRMSE |
|---|---|---|---|---|---|---|
| rgbd | cspn | 0.136 | 99.0 | 0.2795 | 1.0196 | 2.93 |
| rgbd | lsf2+ | 0.134 | 99.3 | 0.2552 | 0.8850 | 3.40 |
Dealing with Noise and Outliers. To verify the effectiveness of our proposed robustified nonlinear fitting module, we inject additive Gaussian noise with a standard deviation of 0.05 meters into the sparse depths from NYUV2, VKITTI, and Synthia. We then randomly select 30% of the available sparse depths to be outliers and set them to random values drawn uniformly from a range around the true depth value. We leave KITTI untouched, as it already contains noise and outliers [5]. All nonlinear variants of LSF run for 2 iterations, which we empirically found to achieve a good balance between performance and efficiency. We refer the reader to our supplementary material for further discussion on the number of iterations. We then train models with different configurations on the corrupted data, again grouped by input modalities. Quantitative results are shown in Table 3.
For the rgb case, we omit the baseline conv, as it does not use sparse depths and is therefore unaffected by noise. We again see consistent improvements in all metrics across all datasets. Notice that for our nonlinear fitting without the Huber loss (lsf2), we get worse numbers on some datasets compared to our linear variant (lsf). This is because least squares fitting is sensitive to outliers without a robust norm. There are also some models in the rgbd case where the robustified version (lsf2+) does not outperform the linear and nonlinear ones. We hypothesize that this is caused by using the corrupted sparse depths as network input, which degrades the network's performance early on. We show in Figure 4 that our proposed method is able to identify outliers in the sparse depths and downplay them during fitting.
These results can also be cross-compared with those in Table 2, which are all trained on clean data. Clearly, models trained with clean data outperform those trained with corrupted data under the same configuration. But our variant with nonlinear fitting and the Huber loss (lsf2+) can sometimes reach performance similar to models trained with clean data, even when significant noise and outliers are present.
Self-supervised Training with Multiscale Prediction. Table 4 shows quantitative comparisons between our linear LSF module with multiscale bases and the baseline network under both monocular and stereo self-supervised training. In this case, the baseline network has more parameters because it needs to predict depths at different scales independently. We again witness consistent improvement in all metrics across all datasets, except for δ₁ on KITTI. Qualitative results are shown in Figure 5. For all self-supervised training, we use the same hyperparameters for the photometric and smoothness losses as in [16]. Note that in monocular training, we use the ground truth poses directly, as opposed to having a dedicated pose network.
5 Conclusions
In this paper we propose a novel approach to the depth completion problem that augments deep convolutional networks with a least squares fitting procedure. This method allows us to combine some of the best features of modern deep networks and classical regression algorithms. This scheme could be applied to a number of proposed depth completion networks or other regression problems to improve performance. Our proposed module is differentiable which means the modified networks can still be trained from end to end. This is important because retraining the networks allows them to make better use of the new fitting layer and allows them to produce better depth bases from the input data. We then describe how a linear least squares fitting scheme could be extended to incorporate robust estimation to improve resilience to noise and outliers which are common in real world data. We also show the method can be employed in selfsupervised settings where no ground truth is available. We validate our fitting module on a stateoftheart depth completion network with various input modalities, training frameworks, and datasets.
6 Acknowledgement
We would like to acknowledge the support of Novateur Research Solutions and an Nvidia NVAIL grant.
7 Ablation Study
Table 6: Generalization. Models trained on KITTI and evaluated on the other datasets.

| | | | NYUV2 | | | VKITTI | | | Synthia | | |
| Input | Method | Sparse | MAE | RMSE | δ₁ | MAE | RMSE | δ₁ | MAE | RMSE | δ₁ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rgbd | conv | 4% | 0.5318 | 0.8670 | 67.93 | 2.8855 | 9.1813 | 90.02 | 4.7380 | 14.408 | 69.51 |
| rgbd | lsf | 4% | 0.2590 | 0.8155 | 90.59 | 1.9189 | 6.9789 | 92.80 | 3.1198 | 9.5432 | 83.07 |
| | (conv − lsf) / conv | | +51% | +6% | | +33% | +24% | | +34% | +34% | |
7.1 Generalization
We demonstrate the generalization capability of our proposed module compared to the s2d baseline. We train both models on the KITTI depth completion dataset and evaluate on the rest (NYUV2, VKITTI, and Synthia). Table 6 shows the quantitative results of this set of experiments. We change the evaluation strategy of Synthia and VKITTI to cap at 80 meters since the maximum depth of KITTI is around 85 meters, but the maximum input sparse depth is still 130 meters for both datasets. We observe similar improvements in networks with our LSF module, which shows that it is able to generalize well to other datasets.
7.2 Convergence Rate
Figure 6 shows snapshots of TensorBoard (https://www.tensorflow.org/guide/summaries_and_tensorboard) records from several training sessions of the s2d baseline with and without our LSF module. Notice that on both the NYUV2 and KITTI datasets (and others not shown here), training with our LSF module has a faster convergence rate, a more stable learning curve and, as a result, better validation performance. This trend is observed in all our experiments with various datasets and baseline networks [41, 12, 29]. Given this property, we hypothesize that our LSF module could be used to quickly fine-tune a pretrained baseline model on another dataset.
7.3 Multiscale Bases
We have already shown in the main paper that our least squares fitting with multiscale bases outperforms the baseline in various self-supervised training frameworks. Here, we present additional experiments comparing multiscale and single-scale bases in supervised learning, using multiscale bases of size (4, 8, 16, 32) versus a single-scale basis of size 60. These two schemes have exactly the same number of parameters. The results in Table 7 show that the multiscale formulation is beneficial even in settings where only a single full-resolution depth map is needed. This improvement can be partially explained by the fact that gradients can now flow directly back to the intermediate decoder layers rather than indirectly from the final layer. This is also related to the idea of deep supervision [49], which demonstrated that adding intermediate supervision facilitates the training of very deep neural networks. We therefore use multiscale bases in all our experiments with the baseline, as they introduce no extra parameters, yield superior performance, and are compatible with both supervised and self-supervised training paradigms. We use single-scale bases for FusionNet [12] because their network is not designed for multiscale prediction.
Table 7:
Basis      | NYUv2                   | Synthia
           | MAE     RMSE    Acc (%) | MAE     RMSE    Acc (%)
60         | 0.0315  0.0757  99.83   | 0.5332  3.1353  97.83
4,8,16,32  | 0.0300  0.0735  99.83   | 0.5317  3.1057  97.84
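The parameter equivalence of the two schemes can be illustrated with a small sketch. The nearest-neighbor upsampling and the particular channel-to-scale assignment below are our assumptions for illustration; the point is only that upsampling and stacking per-scale bases of sizes 4 + 8 + 16 + 32 yields the same 60-channel basis tensor as a single-scale head of size 60:

```python
import numpy as np

def upsample_nn(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) feature map by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

# hypothetical decoder features at 4 scales (4 channels at the coarsest,
# 32 at full resolution), with a full resolution of 32 x 32
H = W = 32
rng = np.random.default_rng(0)
scales = (8, 4, 2, 1)
channels = (4, 8, 16, 32)
feats = [rng.standard_normal((c, H // s, W // s)) for c, s in zip(channels, scales)]

# stack the upsampled per-scale bases into one 60-channel basis tensor
bases = np.concatenate([upsample_nn(f, s) for f, s in zip(feats, scales)], axis=0)
assert bases.shape == (60, H, W)   # same basis count as a single-scale head of 60
```

Since each per-scale head is a 1x1 convolution over the same decoder features, the total weight count matches the single 60-channel head exactly.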
7.4 Underdetermined Systems
In our setup, the linear system becomes underdetermined when the number of sparse samples is smaller than the number of basis functions. Although such extreme cases have rarely been evaluated in prior literature (100 samples in [50]), we test our module with only 50 samples.
We use 50 sparse depth samples for both training and validation, which is fewer than the dimension of the basis (60) and therefore makes the linear system in the LSF module underdetermined. Due to the small number of samples, we increase the regularization parameter to 0.01. Note that PnP [50] does not require training and operates directly on the pretrained baseline. Table 8 shows the results of this experiment. In this case, our LSF module outperforms neither the baseline nor PnP, but it still produces reasonable results thanks to the regularization in the fitting.
When the number of sparse depth samples drops to zero, our module inevitably fails, while the other baselines can still output a solution. This is a genuine drawback of our method, and we plan to address it in future work.
Table 8:
Method | NYUv2                   | VKITTI
       | MAE     RMSE    Acc (%) | MAE     RMSE    Acc (%)
conv   | 0.2218  0.4170  92.03   | 6.1841  14.273  74.13
pnp    | 0.2233  0.4170  92.12   | 6.0465  14.119  75.21
lsf    | 0.3313  0.5464  83.63   | 8.2031  16.686  55.75
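The ridge-regularized fit that keeps the underdetermined system solvable can be sketched as follows. This is a toy numpy example: the basis matrix is random rather than produced by the network, and `lam` plays the role of the regularization parameter mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_bases, lam = 50, 60, 0.01   # fewer samples than bases: underdetermined

A = rng.standard_normal((n_samples, n_bases))   # basis values at the sparse pixels
w_true = rng.standard_normal(n_bases)
y = A @ w_true                                   # sparse depth targets (noise-free toy)

# ridge-regularized normal equations: (A^T A + lam * I) w = A^T y
# the lam * I term makes the k x k system full rank even when n < k
w = np.linalg.solve(A.T @ A + lam * np.eye(n_bases), A.T @ y)

# the fit still reproduces the measurements closely despite the rank deficiency
residual = np.linalg.norm(A @ w - y) / np.linalg.norm(y)
```

Without the `lam * I` term, the normal-equations matrix would be singular here, which is why the regularization matters most in this low-sample regime.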
7.5 Number of Iterations for Nonlinear Fitting
We ran several experiments training the baseline network with our nonlinear LSF module while varying the number of iterations. Results are shown in Table 9. As with many iterative approaches, we see diminishing returns as the number of iterations increases. Empirically, we found that 2 iterations strike a good balance between performance and efficiency, and we therefore use this setting across all our nonlinear fitting experiments. We did not observe any instability with more iterations, only marginal variation in the validation metrics.
Table 9:
Method | NYUv2                   | VKITTI
       | MAE     RMSE    Acc (%) | MAE     RMSE    Acc (%)
conv   | 0.1089  0.1679  99.20   | 1.5683  4.8982  94.68
lsf0   | 0.0300  0.0735  99.83   | 1.2598  4.6227  97.32
lsf1   | 0.0293  0.0721  99.83   | 1.2932  4.5717  96.92
lsf2   | 0.0293  0.0720  99.83   | 1.2643  4.6114  97.07
lsf3   | 0.0292  0.0720  99.83   | 1.3047  4.6159  96.78
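As a hedged illustration of why a couple of robustified iterations suffice, the sketch below uses iteratively reweighted least squares with a Huber weight function as a stand-in for our full nonlinear solver; the data sizes, loss parameters, and outlier model are illustrative, not the paper's exact setup:

```python
import numpy as np

def huber_weights(r, delta=1.0):
    """IRLS weights for the Huber loss: 1 inside delta, delta/|r| outside."""
    a = np.abs(r)
    return np.where(a <= delta, 1.0, delta / np.maximum(a, 1e-12))

def robust_fit(A, y, n_iters, lam=1e-4):
    """Iteratively reweighted least squares; each iteration is one weighted solve."""
    k = A.shape[1]
    w = np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ y)   # plain LS initialization
    for _ in range(n_iters):
        s = huber_weights(A @ w - y)          # downweight large residuals (outliers)
        As = A * s[:, None]
        w = np.linalg.solve(As.T @ A + lam * np.eye(k), As.T @ y)
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((1024, 60))
w_true = rng.standard_normal(60)
y = A @ w_true
y[:20] += 50.0                                # a few gross outliers among the depths

err_ls = np.linalg.norm(robust_fit(A, y, 0) - w_true)    # 0 iters = plain least squares
err_irls = np.linalg.norm(robust_fit(A, y, 2) - w_true)  # 2 robust iterations
```

In this toy setting, two reweighted solves already recover the weights far more accurately than the plain fit, consistent with the diminishing returns observed in Table 9.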
7.6 Runtime Comparison
In Figure 7, we show runtime comparisons between variants of our LSF module and the baseline, in both training and inference. The increase in computation time is due to the (repeated) solving of a linear system of equations, whose complexity depends on the size of the basis. The number of sparse depth samples has very little impact on the runtime (as explained above), and we fix it to 1024 in this experiment. Our linear LSF module adds on average 46% to the inference time of the baseline network. Note that the times in the graph represent the total time for a complete forward/backward pass through the network.
Method    | conv | lsf  | lsf2 | cspn | pnp
Time [ms] | 34.4 | 42.3 | 45.9 | 53.9 | 335.2
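The reason the sample count barely affects runtime is that with k basis functions, the normal-equations system is always k x k regardless of the number n of sparse samples; forming it costs a single O(n k^2) matrix product, while the O(k^3) solve dominates for small n. A minimal sketch with our own toy setup:

```python
import numpy as np

def lsf_solve(A, y, lam=1e-4):
    """Forming A^T A is O(n k^2); the solve itself is O(k^3), independent of n."""
    k = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ y)

rng = np.random.default_rng(0)
k = 60
# doubling the sample count n leaves the solved k x k system unchanged in size
for n in (1024, 2048):
    A, y = rng.standard_normal((n, k)), rng.standard_normal(n)
    w = lsf_solve(A, y)
    assert w.shape == (k,)          # the solved system is always k x k
```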
8 More Experiments
Table 11: Supervised training.
Input | Method | Sparse | NYUv2                   | VKITTI                  | Synthia                 | KITTI
      |        |        | MAE     RMSE    Acc (%) | MAE     RMSE    Acc (%) | MAE     RMSE    Acc (%) | MAE     RMSE    Acc (%)
rgbd  | conv   | 4%     | 0.1035  0.1454  99.00   | 2.3531  5.5823  86.28   | 0.8052  3.1054  95.87   | 0.2790  1.0001  99.58
rgbd  | lsf    | 4%     | 0.0338  0.0752  99.82   | 1.4440  4.5085  95.83   | 0.6754  2.9411  96.72   | 0.2707  0.9142  99.68
(conv - lsf) / conv      | +67%    +48%            | +39%    +19%            | +16%    +5%             | +3%     +9%
Table 12: Supervised training.
Input | Method | Sparse | NYUv2                   | VKITTI                  | Synthia                 | KITTI
      |        |        | MAE     RMSE    Acc (%) | MAE     RMSE    Acc (%) | MAE     RMSE    Acc (%) | MAE     RMSE    Acc (%)
rgbd  | conv   | 4%     | 0.1217  0.2198  97.57   | 3.5029  8.2014  83.05   | 1.5143  4.8709  90.19   | 0.7732  1.9537  98.21
rgbd  | lsf    | 4%     | 0.0722  0.1517  99.18   | 2.6840  6.6656  90.89   | 1.2929  4.2834  92.40   | 0.5526  1.6380  98.60
(conv - lsf) / conv      | +41%    +31%            | +23%    +19%            | +14%    +13%            | +29%    +16%
8.1 FusionNet
FusionNet [12] generates dense depth predictions by combining global and local information guided by color images. Their network also learns two uncertainty maps that fuse the global and local depth maps. The global branch generates a global depth prediction with uncertainty as well as a guidance map, which is then used by the local branch to predict a local depth map with uncertainty. The two depth maps are then linearly combined with normalized weights derived from the corresponding uncertainty maps. In terms of architectural differences, their network uses an ERFNet [37] in the global branch and two hourglass networks in the local branch. ERFNet is a network designed for efficient semantic segmentation and has around 3M parameters (ResNet18 has around 15M). This baseline has no multi-scale bases, and its final activation function is a ReLU.
We use the network implementation from the official repository (https://github.com/wvangansbeke/Sparse-Depth-Completion). We make the following modifications so that our LSF module can be attached: 1) instead of using the uncertainty maps to weight the predicted depth maps, we use them to weight the penultimate feature maps (bases); because of the linearity of the convolution operations, this is equivalent to the original implementation. 2) we train the network from scratch as a whole, rather than initializing from weights pretrained on CityScapes [6] and breaking the training into two steps (first global, then local). We otherwise follow the training settings in the original paper as closely as possible, with a starting learning rate of 1e-3 and an L2 loss instead of L1. All networks are trained for 15 epochs.
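The claimed equivalence follows from the linearity of the final 1x1 convolution (assuming no bias term): scaling the predicted depth per pixel is the same as scaling the bases first and then combining them. A small numpy check, with illustrative shapes and names:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
f = rng.standard_normal((C, H, W))       # penultimate feature maps (bases)
theta = rng.standard_normal(C)           # weights of the final 1x1 conv (no bias)
u = rng.random((H, W))                   # per-pixel uncertainty-derived weight map

# weighting the predicted depth map ...
depth_weighted = u * np.tensordot(theta, f, axes=1)

# ... equals weighting the bases first, by linearity of the 1x1 convolution
weighted_depth = np.tensordot(theta, u[None] * f, axes=1)

assert np.allclose(depth_weighted, weighted_depth)
```

With a bias term in the final layer, the bias would also need to be scaled by the weight map for the two forms to remain identical.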
Quantitative results are shown in Table 11. Note that some of the results may differ from the numbers reported in their paper, which can be attributed to many factors such as differing random seeds, training epochs, and weight initialization. However, we made an honest effort to keep the network architecture the same as the original, and we believe that the performance improvements offered by our method are representative.
8.2 DFuseNet
The main baseline [29] that we use adopts an early fusion strategy to combine color and depth information, where the two streams are combined after the first convolution layer. DFuseNet [41] instead favors a late fusion approach and uses a Spatial Pyramid Pooling (SPP) [17] block in each branch to incorporate more contextual information. In terms of architectural differences, their network does not use skip connections and is trained from scratch.
We use the network implementation from the official repository (https://github.com/ShreyasSkandanS/DFuseNet), but make the following modifications for more stable training: 1) we add a batch normalization [23] layer after every convolution except the last one; 2) we remove one scale (pool64) from the SPP block to make the network trainable with a reasonable batch size on our GPU; 3) we change the decoder outputs from (64, 32, 16, 1) channels to (4, 8, 16, 32)-channel bases. The rest of the training parameters are kept as described in their paper. Quantitative results are shown in Table 12. Our LSF module again improves the baseline by a significant margin under the same training setting.
References

[1] (2018) Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer. In CVPR, pp. 2800–2810.
[2] (2006) Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg. ISBN 0387310738.
[3] (2010) Large-scale machine learning with stochastic gradient descent.
[4] (2018) Depth estimation via affinity learned with convolutional spatial propagation network. In ECCV.
[5] (2019) Noise-aware unsupervised deep lidar-stereo fusion. arXiv:1904.03868.
[6] (2016) The Cityscapes dataset for semantic urban scene understanding. In CVPR.
[7] (2012) Generic methods for optimization-based modeling. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, PMLR Vol. 22, pp. 318–326.
[8] (2014) Depth map prediction from a single image using a multi-scale deep network. In NIPS.
[9] (2018) Propagating confidences through CNNs for sparse data regression. In BMVC.
[10] (2018) Deep ordinal regression network for monocular depth estimation. In CVPR, pp. 2002–2011.
[11] (2016) Virtual worlds as proxy for multi-object tracking analysis. In CVPR, pp. 4340–4349.
[12] (2019) Sparse and noisy lidar completion with RGB guidance and uncertainty. In MVA, pp. 1–6.
[13] (2016) Unsupervised CNN for single view depth estimation: geometry to the rescue. arXiv:1603.04992.
[14] (2013) Vision meets robotics: the KITTI dataset. International Journal of Robotics Research (IJRR).
[15] (2016) Unsupervised monocular depth estimation with left-right consistency. In CVPR 2017, pp. 6602–6611.
[16] (2018) Digging into self-supervised monocular depth estimation. arXiv:1806.01260.
[17] (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE TPAMI 37, pp. 1904–1916.
[18] (2015) Deep residual learning for image recognition. In CVPR 2016, pp. 770–778.
[19] (2001) Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen (Eds.).
[20] (1981) Robust Statistics. Wiley, New York.
[21] (1964) Robust estimation of a location parameter. Annals of Mathematical Statistics 35 (1), pp. 73–101.
[22] (2019) Depth coefficients for depth completion. arXiv:1903.05421.
[23] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.
[24] (2015) Spatial transformer networks. arXiv:1506.02025.
[25] (2018) Sparse and dense data with CNNs: depth completion and semantic segmentation. In 3DV, pp. 52–60.
[26] (2015) Adam: a method for stochastic optimization. arXiv:1412.6980.
[27] (2016) Deeper depth prediction with fully convolutional residual networks. In 3DV, pp. 239–248.
[28] (1988) A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School, CMU, Pittsburgh, PA, pp. 21–28.
[29] (2018) Self-supervised sparse-to-dense: self-supervised depth completion from lidar and monocular camera. arXiv:1807.00275.
[30] (2018) Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In CVPR, pp. 5667–5675.
[31] (2016) Differentiation of the Cholesky decomposition. arXiv:1602.07527.
[32] (2016) Deconvolution and checkerboard artifacts. Distill.
[33] (2005) Highly accurate optic flow computation with theoretically justified warping. IJCV 67, pp. 141–158.
[34] (2018) SuperDepth: self-supervised, super-resolved monocular depth estimation. arXiv:1810.01849.
[35] (2018) DeepLiDAR: deep surface normal guided depth prediction for outdoor scene from sparse lidar data and single color image. arXiv:1812.00488.
[36] (2016) UnrealCV: connecting computer vision to Unreal Engine. arXiv:1609.01326.
[37] (2018) ERFNet: efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems 19, pp. 263–272.
[38] (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI, LNCS Vol. 9351, pp. 234–241.
[39] (2016) The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, pp. 3234–3243.
[40] (2014) ImageNet large scale visual recognition challenge. IJCV 115, pp. 211–252.
[41] (2019) DFuseNet: deep fusion of RGB and sparse depth information for image guided dense depth completion. arXiv:1902.00761.
[42] (2012) Indoor segmentation and support inference from RGBD images. In ECCV.
[43] (1999) Prediction error as a quality metric for motion and stereo. In ICCV, pp. 781–788.
[44] (2019) BA-Net: dense bundle adjustment networks. In ICLR.
[45] (1999) Bundle adjustment: a modern synthesis. In Workshop on Vision Algorithms.
[46] (2017) Sparsity invariant CNNs. In 3DV, pp. 11–20.
[47] (2017) SfM-Net: learning of structure and motion from video. arXiv:1704.07804.
[48] (2017) Learning depth from monocular videos using direct methods. In CVPR 2018, pp. 2022–2030.
[49] (2015) Training deeper convolutional networks with deep supervision. arXiv:1505.02496.
[50] (2018) Plug-and-play: improve depth estimation via sparse data propagation. arXiv:1812.08350.
[51] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, pp. 600–612.
[52] (2018) Just-in-time reconstruction: inpainting sparse maps using single view depth predictors as priors. In ICRA, pp. 1–9.
[53] (2019) VOICED: depth completion from inertial odometry and vision. arXiv:1905.08616.
[54] (2019) Dense depth posterior (DDP) from single image and sparse range. arXiv:1901.10034.
[55] (2018) GeoNet: unsupervised learning of dense depth, optical flow and camera pose. In CVPR, pp. 1983–1992.
[56] (2018) Deep depth completion of a single RGB-D image. In CVPR, pp. 175–185.
[57] (2017) Unsupervised learning of depth and ego-motion from video. In CVPR, pp. 6612–6619.