Dense depth estimation is crucial in 3D reconstruction [geiger2011stereoscan], 3D object detection [wang2019pseudo, you2020pseudo], and robotic vision [marapane1989region, nalpantidis2010stereo]. Many works estimate depth from RGB images or stereo pairs. Yet stereo estimation can be unreliable on homogeneous planes, under large illumination changes, and on repetitive textures [shivakumar2019real, wang20193d], while monocular depth estimation is an ill-posed problem [eigen2014depth] and is inherently ambiguous and unreliable [lee2019monocular, mal2018sparse]. To attain higher robustness and accuracy, modern solutions commonly leverage raw sparse signals, such as LiDAR [ahmad2020extensible, qiu2019deeplidar, mal2018sparse] and Radar [chadwick2019distant, nobis2019deep], to improve depth estimation or object detection in challenging outdoor scenes. We term this use of sparse signals guidance in this paper.
Despite the success of these sparse-guidance methods, we find two major problems with the sparse signal. First, the raw sparse signal can be ignored by a network when it differs greatly from the depth predicted from RGB (shown in Figure 0(a)). This stems from the low density of the sparse signal, which is common in many large-scale datasets. For example, KITTI dataset [geiger2013vision] wraps up an average density of and nuScenes dataset [nuscenes2019] has an average of less than Radar points over a image. In effect, the guidance module tends to ignore accurate but sparse signals when they strongly disagree with the original prediction.
Second, imbalanced guidance is also a major problem. As shown in Figure 0(b), the algorithms focus only on the small regions with high signal density and barely correct the low-density regions between scanning lines, which produces non-smooth results. However, low density implies neither lower importance nor lower confidence. The uneven spatial distribution of the signal is caused by the sensing devices themselves. For example, LiDAR signals are mostly localized on scanning lines that share the same polar angle in spherical coordinates, and the azimuth resolution of Radar signals is poor [daniel2018application, sheeny2020300]. In fact, the importance of a sparse signal point should depend on how many nearby points it can affect. Thus, methods that are evaluated under the assumption of a uniformly distributed signal can be unreliable in real cases, where the signal distribution is imbalanced.
To tackle the critical low density and imbalanced distribution problems, we propose a novel framework, Sparse Signal Superdensity, to enhance the density of the sparse signal and mitigate its imbalanced distribution for guided depth estimation. The framework consists of two components: (1) sparse signal expansion and (2) confidence weighting. For sparse signal expansion, it first estimates the expanded area for each sparse signal based on the RGB image and then assigns an appropriate depth value to the expanded region. For confidence weighting, it measures the confidence of the assigned depth to control how much influence that depth has on the sparse-fusion method. In this way, our method uses the confidence measure to increase the density of the sparse signal effectively.
We implement the framework with a light-weight network, which can be embedded in existing depth estimation networks and trained in an end-to-end fashion. We conduct qualitative experiments to show the effectiveness of the network on LiDAR and Radar guidance methods. The experimental results show that the proposed framework can alleviate the low density and imbalanced distribution problems. Our method greatly increases the utility of the sparse signal and brings substantial improvement to four typical sparse-guidance schemes on the KITTI [Geiger2012CVPR, Menze2015CVPR] and nuScenes [nuscenes2019] datasets.
To sum up, our main contributions are as follows:
Our work is the first to point out the defective properties of the sparse signal and their subsequent influence on depth estimation results.
The novel and general framework, Sparse Signal Superdensity, enhances the density of the sparse signal, mitigates the imbalanced distribution problem, and provides extra confidence cues for depth estimation.
The framework greatly increases robustness and accuracy on depth estimation tasks that use sparse signals, e.g., LiDAR and Radar.
2 Related Work
In this section, we will introduce guided depth estimation approaches and review related ideas about signal expansion.
Guided Mono Estimation.
Previous works guide monocular depth estimation networks with external active sensors to address the ill-posed problem [eigen2014depth] and improve performance [zhang2018deep, huang2019indoor, mal2018sparse, ma2019self, shivakumar2019dfusenet, zhong2019deep, uhrig2017sparsity], a task known as Depth Completion. Cheng et al. [cheng2019learning] fuse the sparse depth as input and propagate the information to the surrounding pixels. Cadena et al. [cadena2016multi] concatenate the features of the cross-modality data to learn an auto-encoder that completes partial or noisy depth. Ma and Karaman [mal2018sparse] fuse different modalities in the first convolution layer to generate high-resolution depth. These methods aim at completing the depth from a sparse depth signal and an image.
Guided Stereo Estimation.
Previous works guide stereo matching with an external sparse signal to obtain better predictions [li2020lidar, ahmad2020extensible, park2018high, shivakumar2019real, cheng2019noise]. Stereo matching leverages epipolar geometry to match pixels across image pairs and produce disparity [zhang1998determining], which can be transformed to depth by triangulation. PSMNet [chang2018pyramid] and GANet [zhang2019ga] are renowned stereo backbones. Poggi et al. [poggi2019guided] propose guidance techniques on the cost volume to alleviate domain shift. Yet their method assumes the sparse signal to be uniformly distributed and thus does not consider the imbalanced signal problem. You et al. [you2020pseudo] propose a graph-based depth correction algorithm that refines stereo results in the 3D domain with cheap LiDAR sensors. Nonetheless, their algorithm design does not take the imbalanced signal issue into account. Wang et al. [wang20193d] propose input fusion and regularize batch normalization conditioned on the LiDAR signal. The above methods use the raw sparse signal for guidance or correction and put little emphasis on the inherent problems of the sparse signal mentioned above.
The idea of expansion has appeared in tasks such as superpixel segmentation [achanta2012slic, van2012seeds, yao2015real, rubio2016bass], depth completion, and depth sampling [hawe2011dense, liu2015depth, wolff2020super]. Superpixel methods aggregate pixels with similar semantics, but similar semantics do not imply similar depth values. Depth completion and depth sampling complete the sparse depth, but most previous works do not measure the confidence of the expanded depth and rely on heavy computational resources.
Shivakumar et al. [shivakumar2019real] propose promoting the depth signal to the neighboring pixels in the cost volume to improve depth estimation. Their motivation for promoting the sparse signal is close to that of our application on the cost volume. However, their method is applicable only to the Semi Global Matching [hirschmuller2005accurate] algorithm. Furthermore, it involves many hand-tuned hyper-parameters and assumptions, such as promotion with a Gaussian, which may not hold for real data.
To the best of our knowledge, none of the previous works propose a general guidance framework to solve the inherent problems of the sparse signal.
3.1 Intuition of Sparse Signal Superdensity
To solve the issues of low density and imbalanced distribution, we propose expanding the sparse cues to their neighboring region. Our idea is that neighboring pixels with similar color intensities belong to the same image structure or object and thus have similar depth values.
Intuitively, an ad-hoc method is to expand points by color thresholds, inspired by the cross-based support window method [zhang2009cross]. To be specific, consider a color intensity map, a sparse signal map, and an expanded map. Given a central pixel (the coordinate of a source point), we greedily expand from the central pixel to its neighboring pixels and fill the expanded pixels with the depth value of the source point, as shown in Figure 3. The expansion stops when the maximum color intensity difference exceeds a threshold or the expansion size reaches its limit.
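The ad-hoc expansion described above can be sketched as a bounded region growing. This is a minimal illustration and not the authors' implementation: it assumes a single-channel intensity map stored as nested lists, compares every candidate pixel with the intensity of the source pixel, and limits the expansion by a square radius. The function name `expand_point` and the parameters `threshold` and `max_radius` are hypothetical.

```python
from collections import deque


def expand_point(intensity, center, depth, threshold, max_radius):
    """Grow a region around `center` and assign `depth` to every pixel in it.

    A neighbor is accepted when its intensity differs from the source pixel
    by at most `threshold` and it lies within `max_radius` (Chebyshev
    distance) of the source. Returns a dict {(row, col): depth}.
    """
    rows, cols = len(intensity), len(intensity[0])
    center_row, center_col = center
    center_value = intensity[center_row][center_col]
    expanded = {center: depth}
    queue = deque([center])
    while queue:
        row, col = queue.popleft()
        for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            n_row, n_col = row + d_row, col + d_col
            if not (0 <= n_row < rows and 0 <= n_col < cols):
                continue  # outside the image
            if (n_row, n_col) in expanded:
                continue  # already filled
            if max(abs(n_row - center_row), abs(n_col - center_col)) > max_radius:
                continue  # expansion size limit reached
            if abs(intensity[n_row][n_col] - center_value) > threshold:
                continue  # color difference too large, likely another object
            expanded[(n_row, n_col)] = depth
            queue.append((n_row, n_col))
    return expanded


# Toy image: a dark region (left two columns) next to a bright edge.
image = [[10, 10, 200],
         [10, 12, 200],
         [10, 10, 200]]
region = expand_point(image, center=(1, 1), depth=5.0, threshold=5, max_radius=1)
```

In this toy image the expansion fills the dark region and stops at the bright column, which is the behavior the color threshold is meant to produce at object boundaries.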
Although the expanded map can substitute for the sparse map in any fusion technique for depth estimation, the expanded points may provide false guidance to the estimation process, especially at occlusions or for pixels across object boundaries. As a result, instead of applying the same level of guidance to all pixels, we provide a confidence map that measures the reliability of each expanded value and determines the level of guidance to apply for depth estimation.
3.2 Learnable Sparse Signal Superdensity
We propose leveraging a neural network to learn how to expand sparse signals and estimate the corresponding confidence, following the concepts of sparse signal expansion and confidence weighting from Section 3.1. The network expands each sparse signal to a patch, and we aggregate all the expanded patches to form the final output.
To be specific, the network predicts how confidently the sparse depth can be expanded from the center pixel to each neighboring pixel. We set the expansion space to be a square patch around each sparse signal. The input of the network is a crop of the intensity map, and the output is a confidence patch of the same size. Each confidence patch is stored in a full-resolution map indexed by its sparse depth signal, with zero confidence for pixels outside the patch. Then, we aggregate the confidence patches into the expanded depth map by the following interpolation equation.
Here, each sparse signal is identified by its pixel coordinate and by an index drawn from the set of sparse-signal indices. The operation assigns every pixel without a signal a depth value interpolated from its nearby sparse signal values. Consequently, the higher the confidence from a source signal, the closer the assigned depth is to the value of that signal. Finally, we aggregate the confidence maps by taking the maximum among the confidence patches.
Note that if has no expanded signal. and if for a .
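The aggregation step can be illustrated with a small sketch. It assumes that the interpolation is a confidence-weighted average of the source depths whose patches cover a pixel, which is one reading of the description above and not necessarily the exact published equation. The maximum-based confidence aggregation follows the text. The function name `aggregate_patches` and the data layout are hypothetical.

```python
def aggregate_patches(shape, signals, patches):
    """Combine per-signal confidence patches into dense maps.

    shape   : (rows, cols) of the output maps
    signals : list of (row, col, depth) for each sparse signal
    patches : list of square confidence patches (odd side length),
              one per signal, centered on that signal
    Returns (expanded_depth, confidence) as nested lists.
    """
    rows, cols = shape
    weighted_sum = [[0.0] * cols for _ in range(rows)]
    weight_total = [[0.0] * cols for _ in range(rows)]
    confidence = [[0.0] * cols for _ in range(rows)]
    for (src_row, src_col, depth), patch in zip(signals, patches):
        half = len(patch) // 2
        for i, patch_row in enumerate(patch):
            for j, conf in enumerate(patch_row):
                row, col = src_row + i - half, src_col + j - half
                if 0 <= row < rows and 0 <= col < cols:
                    weighted_sum[row][col] += conf * depth
                    weight_total[row][col] += conf
                    # Assumed from the text: keep the maximum confidence.
                    confidence[row][col] = max(confidence[row][col], conf)
    expanded = [
        [ws / wt if wt > 0 else 0.0 for ws, wt in zip(ws_row, wt_row)]
        for ws_row, wt_row in zip(weighted_sum, weight_total)
    ]
    return expanded, confidence


# Two signals on a 1x3 image whose patches overlap on the middle pixel.
signals = [(0, 0, 10.0), (0, 2, 20.0)]
patches = [
    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 0.0]],
]
expanded, confidence = aggregate_patches((1, 3), signals, patches)
```

The middle pixel receives equal confidence from both sources, so under the weighted-average assumption its depth lies halfway between the two source depths, while each source pixel keeps its own depth.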
We formulate a general method to train the expansion network jointly with any depth backbone. Here, the confidence value acts as the weight between the guided depth and the depth originally estimated by monocular estimation or stereo matching. That is,
With the depth ground truth , the supervised loss on the output depth can be formed as . We also supervise with and add regularization
The first term means the more confident about the expanded depth, the more accurate the depth should be, while the second term prevents excessive confidence for pretraining. In practice, the gradient of of the first term is detached, otherwise, can be a bad local minimum. The model is trained end-to-end so that the expansion process is learned from data.
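The confidence-weighted fusion described in this section can be sketched as a per-pixel convex combination. This assumes the confidence lies in the range from 0 to 1 and multiplies the expanded depth, with the remainder going to the backbone estimate. The function name `fuse_depth` is hypothetical, and the loss terms are not reproduced here.

```python
def fuse_depth(estimated, expanded, confidence):
    """Blend the backbone estimate with the expanded guidance per pixel.

    Assumed form: output = c * expanded + (1 - c) * estimated,
    where c is the confidence of the expanded depth at that pixel.
    """
    return [
        [c * e + (1.0 - c) * d for d, e, c in zip(d_row, e_row, c_row)]
        for d_row, e_row, c_row in zip(estimated, expanded, confidence)
    ]


# Zero confidence keeps the backbone estimate; 0.5 averages the two values.
fused = fuse_depth([[10.0, 10.0]], [[20.0, 20.0]], [[0.0, 0.5]])
```

With zero confidence the backbone prediction passes through unchanged, so pixels that no patch covers are unaffected by the guidance.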
4 Application of Sparse Signal Superdensity
The network can learn to expand data of different modalities, including the most widely used LiDAR and Radar. Furthermore, it works on both depth and disparity representations, allowing users to apply our module in various applications. For instance, disparity is preferred for robotic tasks because they require higher accuracy in the nearby region [wang20193d].
Many works have proposed signal-guidance schemes to enhance the depth estimated from RGB, as discussed in Sections 1 and 2. These methods can be divided into three categories: (1) Guidance on Input and Output, (2) Guidance on Cost Volume, and (3) Guidance on 3D Space. In the following, we describe how to apply our module to each type of method (overview in Figure 2).
4.1 Guidance on Input and Output
For guidance on input, the most intuitive way is to concatenate the external sparse signal as one of the inputs to the neural network. This strategy is widely used in dense depth estimation for either monocular [zhang2018deep, ma2019self, mal2018sparse] or stereo [wang20193d] depth estimation. For these approaches, we can simply replace the original raw sparse signal with our expanded signal along with the confidence map.
For guidance after the output of the depth prediction network, a naive way is to add the accurate but sparse signal to the predicted depth. Similar schemes are used by Chen et al. [chen2019learning], who call it a shortcut connection, and by You et al. [you2020pseudo], who ignore the sparse signals that differ greatly from the stereo results to avoid numerical error and add those signals back to the corrected depth. We modify the naive method by interpolating with Equation 3, so that more pixels are guided by the expanded depth and its confidence.
4.2 Guidance on Cost Volume
In the field of stereo matching, many works modify the cost volume, an intermediate representation of the matching relationships between pixels, either with guidance from external cues [poggi2019guided, spyropoulos2014learning, shivakumar2019real] or with a confidence measure [poggi2017quantitative]. The cost volume in a stereo network consists of 3D features with geometric and contextual information, which allows the subsequent convolutions to regress the disparity probability [kendall2017end, chang2018pyramid, zhang2019ga]. Here, we take Guided Stereo Matching (GSM) [poggi2019guided] as an example and explain how our framework is applied to it.
GSM [poggi2019guided] amplifies the cost-volume features that agree with the sparse signal by means of a Gaussian function, thereby providing guidance to the network. Specifically, the external sparse but accurate data come with a binary mask that specifies whether a signal exists at each pixel coordinate, and the cost volume is indexed by pixel coordinate, by candidate disparity up to the maximum disparity to match, and by feature channel. Given a pixel coordinate and the disparity value of its external cue, GSM applies a Gaussian function to the features of the cost volume, where two hyper-parameters control the height and width of the Gaussian. The function enlarges the feature values that are positively related to the sparse cues while suppressing the others.
We propose fusing the expanded disparity map and the corresponding confidence map in the following novel approach
The shift range preserves the minimum feature value when is large or . When is positive, value in cost volume will not be suppressed to zero so that the gradient of network would not be blocked during back-propagating. can be a learnable parameter for training. The confidence value acts as a switch to control how much guidance should be applied according to the expanded guidance .
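The cost-volume guidance can be illustrated for a single pixel. The first function follows the Gaussian modulation of GSM as commonly stated, with a height `k` and a width `c`. The second is only a hypothetical sketch of a confidence-weighted variant with a `shift` term, chosen so that it reduces to the GSM weight when the confidence is 1 and the shift is 0. It should not be read as the exact formula of this paper, and the default values are picked for illustration.

```python
import math


def gsm_weight(disparity, hint, k=10.0, c=1.0):
    """Gaussian weight that peaks at the hinted disparity (height k, width c)."""
    return k * math.exp(-((disparity - hint) ** 2) / (2.0 * c * c))


def guide_features(features, hint, confidence, k=10.0, c=1.0, shift=0.0):
    """Modulate the features of one pixel along the disparity axis.

    features   : one feature value per candidate disparity 0..len-1
    hint       : expanded disparity value at this pixel
    confidence : value in [0, 1]; 0 leaves the features untouched
    shift      : hypothetical offset that keeps the weight above zero
    """
    guided = []
    for disparity, feature in enumerate(features):
        gaussian = gsm_weight(disparity, hint, k, c) + shift
        # Assumed blend: confidence switches between "no change" and guidance.
        weight = (1.0 - confidence) + confidence * gaussian
        guided.append(weight * feature)
    return guided


features = [1.0, 1.0, 1.0, 1.0, 1.0]
untouched = guide_features(features, hint=2, confidence=0.0)
peaked = guide_features(features, hint=2, confidence=1.0)
shifted = guide_features(features, hint=2, confidence=1.0, shift=0.1)
```

In this sketch a positive `shift` keeps the weights of distant disparities above zero, which matches the stated purpose of not blocking gradients during back-propagation.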
The largest differences between our approach and others are the learnable and confidence-based expansion, which are visualized in Figure 4. Additionally, GSM is a special case of our formulation. Lastly, our module can be flexibly applied to other guidance-based approaches on the cost volume.
4.3 Guidance on 3D Space
In addition to using sparse signal information on the input or the cost volume, performing sparse signal guidance in 3D space is an intuitive alternative. Take the Graph-based Depth Correction (GDC) algorithm proposed by You et al. [you2020pseudo] as an example. The algorithm first projects the dense depth estimated by a monocular or stereo network to 3D space. Then, it forms a neighborhood-relation graph over the depth values via k-nearest neighbors.
The graph is defined over the depth vector of the points, with an edge weight between each pair of neighboring points. Given the sparse 3D point cloud data, the algorithm then corrects the projected points with the relation graph through optimization.
where . The first points are set to their correct depth value from the hint of the sparse signals, and the algorithm corrects the rest of points by minimizing the reconstruction loss. The algorithm corrects the neighbors of the sparse signal points via the relation built from , and the neighbors of the neighbors would also be corrected. The algorithm would propagate the correct depth value via the graph relation for the sparse signals in the long run.
We improve the algorithm with the expanded depth and its confidence as follows. Given the expanded points and the points to be corrected, we first build the graph in Equation 7 and then minimize the reconstruction loss while taking the confidence into account.
Here is a diagonal matrix, where for , for , and , otherwise. The modification differs from Equation 8 is that is interpolated to the suggested value with confidence . For close to , the influence of the guidance value is negligible. For close to , the guidance value is as confident as the one from sparse signal. Such modification not only allows more points to be corrected by the algorithm, but also takes the magnitude of guidance into consideration.
5.1 Experimental Setting
We use SceneFlow [mayer2016large], KITTI Stereo 2012 [Geiger2012CVPR], and KITTI Stereo 2015 [Menze2015CVPR] to conduct experiments with the LiDAR sparse signal, and the nuScenes v1.0 dataset [nuscenes2019] for the Radar sparse signal. SceneFlow [mayer2016large] is a large-scale synthetic stereo dataset used mainly for pretraining. The KITTI Stereo 2012 [Geiger2012CVPR] and KITTI Stereo 2015 [Menze2015CVPR] datasets contain stereo and LiDAR data targeted at autonomous driving. Because nuScenes provides no dense depth ground truth, we accumulate consecutive frames of LiDAR signals (5 before and 5 after the frame of interest) for evaluation, as was done for the KITTI dataset in [Geiger2012CVPR].
The sparse signal for the KITTI Stereo datasets is obtained following the methods of the original papers. For the Guided Stereo Matching (GSM) [poggi2019guided] experiments, we sub-sample 15% of the pixels from the semi-dense disparity maps. For the Graph-based Depth Correction (GDC) [you2020pseudo] experiments, we obtain the 4-beam LiDAR signal by slicing the point clouds into separate lines with a fixed elevation step and choosing elevation angles similar to those of the cheap ScaLa LiDAR sensor.
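The sub-sampling of hints for the GSM experiments can be sketched as follows. The sketch assumes that missing disparities are stored as zeros and that the 15% ratio is taken over the valid pixels. Both are assumptions for illustration, and the function name `subsample_hints` is hypothetical.

```python
import random


def subsample_hints(disparity, keep_ratio=0.15, seed=0):
    """Keep a random fraction of the valid pixels of a semi-dense map.

    Pixels with a value of 0 are treated as missing (an assumption).
    Returns a map of the same size with 0 wherever no hint is kept.
    """
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    valid = [
        (row, col)
        for row, values in enumerate(disparity)
        for col, value in enumerate(values)
        if value > 0
    ]
    kept = rng.sample(valid, round(len(valid) * keep_ratio))
    hints = [[0.0] * len(values) for values in disparity]
    for row, col in kept:
        hints[row][col] = disparity[row][col]
    return hints


# A 10x10 map with every pixel valid: 15 of 100 pixels remain as hints.
dense = [[5.0] * 10 for _ in range(10)]
hints = subsample_hints(dense)
num_hints = sum(value > 0 for row in hints for value in row)
```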
For GSM [poggi2019guided], we pretrain on SceneFlow, fine-tune on the training set of KITTI Stereo 2012, and test on the training set of KITTI Stereo 2015, following the protocol of the original paper. We also fine-tune on KITTI Stereo 2015 and test on KITTI Stereo 2012. For GDC [you2020pseudo], we use the officially released SceneFlow-pretrained PSMNet [chang2018pyramid], fine-tune on the training sets of KITTI Stereo 2012 and 2015, and test on 2015 and 2012, respectively. For monocular depth estimation on the nuScenes dataset, the network is trained with supervision using an L1 loss on the LiDAR signal and guided with two algorithms: (1) Guidance on Output in Section 4.1 and (2) GDC in Section 4.3.
We implement the proposed methods with the PyTorch [paszke2019pytorch] framework. The architecture of the network is a light-weight version of the U-Net [ronneberger2015u] structure with patch size . The number of parameters for the network is M and only takes of the depth network like PSMNet [chang2018pyramid]. The inference time of the module is ms per patch for a single thread on one NVIDIA Tesla V100 GPU with batch size . The network is pretrained on SceneFlow for iterations end-to-end with PSMNet [chang2018pyramid], optimized with Adam [kingma2014adam] and learning rate. Following previous works [chang2018pyramid, zhang2019ga], we randomly crop by for training and pad to full resolution for testing on the SceneFlow and KITTI datasets. In the nuScenes experiment, we rescale the input image and train the sparse-to-dense [ma2019self] depth backbone from scratch for k iterations. Then, the depth is guided by the network pretrained on SceneFlow.
We follow standard metrics to evaluate the results. For disparity maps, we use average pixel error (Avg) and -pixel error rate (). The Avg is defined as , where denotes the number of pixels included in valid ground truth disparity map. The represents the percentage of disparity error that is greater than . We evaluate depth maps with root mean squared (RMS) error, mean absolute relative error (REL), and . The means the percentage of the relative error within a threshold of . Except for , the other metrics are the smaller the better.
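The metrics above can be sketched with their commonly used definitions. The sketch assumes that a ground-truth value of 0 marks an invalid pixel and that the threshold accuracy uses the ratio max(pred/gt, gt/pred). The function names and the default threshold of 1.25 are assumptions, not values taken from this paper.

```python
import math


def disparity_metrics(pred, gt, threshold):
    """Average pixel error and percentage of errors above `threshold`.

    Only pixels with a valid (non-zero) ground truth are evaluated.
    """
    errors = [abs(p - g) for p, g in zip(pred, gt) if g > 0]
    average = sum(errors) / len(errors)
    error_rate = 100.0 * sum(e > threshold for e in errors) / len(errors)
    return average, error_rate


def depth_metrics(pred, gt, delta=1.25):
    """RMS error, mean absolute relative error, and threshold accuracy (%)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    count = len(pairs)
    rms = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / count)
    rel = sum(abs(p - g) / g for p, g in pairs) / count
    accuracy = 100.0 * sum(max(p / g, g / p) < delta for p, g in pairs) / count
    return rms, rel, accuracy


# The third pixel has no ground truth and is skipped.
avg, rate = disparity_metrics([1.0, 2.0, 5.0], [1.0, 4.0, 0.0], threshold=1)
rms, rel, acc = depth_metrics([2.0, 4.0], [2.0, 2.0])
```

Flattened lists are used here for brevity; a full evaluation would apply the same formulas to the valid pixels of each image.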
5.2 Guidance Experiment
5.2.1 Guidance on Input and Output.
|Model||Avg Disp Error||Disp Error Rate (%)|
|In + Ours||0.851||21.93||5.98||2.77||1.78||1.34|
|Out + Ours||0.418||8.90||1.97||1.05||0.73||0.55|
In Table 1, even though our input guidance simply concatenates the superdensity as input, our approach still improves upon the guided results with PSMNet. On the other hand, we attribute the large gain of our output guidance to the increased density of the sparse signal, since the only difference is that more pixels are guided by the expanded signal. The improvement also supports our idea that neighboring pixels of a sparse signal have similar depth and can be modeled, with a confidence, by the depth value at the center.
5.2.2 Guidance on Cost Volume
In Table 2, applying our method from Section 4.2 to GSM yields a large performance gain. In the visualization results of Figure 5, GSM alone corrects few depth pixels of the stereo output, but it corrects many more when our module is applied. This indicates that the network tends to ignore the sparse signal when its density is not high enough, which agrees with the motivation of our method. Note that we use GANet [zhang2019ga] as the backbone for the cases without fine-tuning because we failed to reproduce the GSM results on PSMNet [chang2018pyramid].
|KITTI 2015||GANet [zhang2019ga]||1.949||20.72||12.43||8.78||6.73|
|+ GSM + Ours||1.027||6.65||2.86||1.92||1.51|
|KITTI 2015 (ft)||PSMNet [chang2018pyramid]||1.200||6.34||3.12||2.18||1.75|
|+ GSM + Ours||0.443||1.65||0.96||0.71||0.57|
|KITTI 2012||GANet [zhang2019ga]||1.640||17.41||11.32||8.28||6.45|
|+ GSM + Ours||0.836||4.70||2.27||1.54||1.18|
|KITTI 2012 (ft)||PSMNet [chang2018pyramid]||1.010||7.19||4.77||3.65||2.96|
|+ GSM + Ours||0.342||1.37||0.86||0.65||0.52|
5.2.3 Guidance on 3D Space
|Model||Fine-tune||KITTI Stereo 2012||KITTI Stereo 2015|
|+ GDC + Ours||7.776||80.32||71.27||62.34||53.45||45.01||8.479||81.84||69.60||57.78||47.03||37.74|
|+ GDC + Ours||✓||0.904||14.53||6.31||4.20||3.22||2.62||0.915||20.07||5.76||3.05||2.17||1.75|
In Table 3, the results show consistent improvement when applying our method from Section 4.3. The performance gain for GDC is smaller than that for GSM because the 4-beam LiDAR provides fewer points than the sub-sampled signal used for GSM. The visualization in the fourth row of Figure 5 illustrates that the imbalanced signal distribution problem is reduced with our method. The results are presented in the disparity domain, since the Pseudo-LiDAR point cloud [wang2019pseudo] originates from stereo matching. Also, we evaluate on depth estimation instead of object detection because the focus of this paper is to improve depth estimation results.
5.3 Radar Guidance
We test the effectiveness of our module for the Radar signal on the nuScenes [nuscenes2019] dataset, which is one of the first datasets containing Camera, Radar, and LiDAR in diverse scenes and weather conditions. We choose guidance on output and guidance on 3D space (GDC [you2020pseudo]) to improve the prediction of monocular depth estimation, as shown in Table 4. The improvement of GDC + Ours on the LiDAR modality is significant compared to Table 3 because the LiDAR source is 32-beam instead of 4-beam. As a result, much more expanded guidance is available to correct the unreliable predictions of monocular estimation. The improvement from the Radar modality is minor compared to LiDAR because the Radar point cloud is extremely sparse owing to its small elevation range. However, with the help of our module, the performance gain can be amplified. The experiment demonstrates the success of the proposed framework on both Radar and LiDAR sparse signals.
5.4 Ablation Study
We perform the ablation study by decomposing our module into the expansion part and the confidence part. In Table 5, the main improvement comes from the expansion design, which supports our argument that expanding the sparse signal before guidance improves the results. When the confidence of the expanded signal is also considered, the network can learn how strongly the expanded signal should influence the guidance.
|+ Sparse Signal||0.526||6.45||2.68||1.76||1.34||1.10|
We also discuss how to expand the sparse signal in Table 6. Two baseline models closely related to the idea of expansion are chosen for the experiment: (1) the ad-hoc method mentioned in Section 3.1 and (2) a superpixel algorithm, SLIC [achanta2012slic], which iteratively clusters neighboring pixels based on color and distance. Confidence weighting is applied to the baselines by considering the inverse distance from the expanded point to the source point, i.e., expanded depth closer to the source has higher confidence.
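The inverse-distance confidence used for the baselines can be sketched as below. The text only states that confidence decreases with distance from the source, so the specific form 1 / (1 + distance) and the function name are assumptions chosen to keep the value within (0, 1].

```python
import math


def inverse_distance_confidence(source, pixel):
    """Confidence that decays with Euclidean distance from the source point."""
    distance = math.hypot(pixel[0] - source[0], pixel[1] - source[1])
    return 1.0 / (1.0 + distance)


at_source = inverse_distance_confidence((4, 4), (4, 4))
nearby = inverse_distance_confidence((4, 4), (4, 5))
far = inverse_distance_confidence((0, 0), (3, 4))
```

Unlike the learned confidence, this hand-crafted weight depends only on geometry and ignores image content, so it cannot lower the confidence across a depth discontinuity that lies close to the source.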
In Table 6, performing expansion on the sparse signal is better than no expansion when no fine-tuning is applied. This indicates that increasing the density of the external signal helps reduce the domain shift problem, where a network is initially trained on a synthetic dataset and tested on real imagery because real data is insufficient. This also meets the goal of improving the overall accuracy without retraining mentioned in GSM [poggi2019guided].
With fine-tuning, simple expansion by color thresholds, such as the ad-hoc expansion, is worse than no expansion. This implies that the stereo network can learn to leverage the raw sparse signal better than simple expansion techniques allow. Nevertheless, our proposed module can be learned jointly with the depth network to achieve better results.
The assumption behind the confidence weighting of the baseline methods may not always hold. The expansion of the baselines can enlarge the guided field, but it also provides false guidance in disparity-discontinuous areas, where the disparity changes sharply. The ablation results demonstrate that learnable confidence weighting avoids this flawed assumption and improves performance.
|Expansion Model||Avg Error||Avg Error (Fine-tune)|
We also test the robustness of by sampling different density of the external signal (shown in Figure 6). Surprisingly, our method with merely of sparse data beats GSM with , which strongly supports the idea to increase density of sparse data for guidance. In addition, our prediction suffers little performance drop until the external cue is extremely sparse, which emphasizes the robustness of to work under extreme environment.
5.5 Impact of Sparse Signal Superdensity
Our analysis about the impact of sparse signal focuses on the following questions: (1) How much improvement comes from sparse signal guidance? (2) How many more pixels are further improved due to the proposed method? and (3) are further improved pixels easy or hard cases? In Table 7, relatively less pixels are largely improved by comparing the “ d” and “ d” columns. Furthermore, with our method, more pixels are guided and thus average pixel error is lower. Finally, our method shows about times of improvement on “ d”, which is much larger than “ d”. This highlights that our can improve more on hard cases.
|Method||% of pixel improved||Avg Error|
|GSM + Ours||8.2||15.2||27.5||96.9||1.125|
|GDC + Ours||1.1||2.7||5.9||21.0||0.904|
In this paper, we propose a framework that improves depth estimation results by addressing the defective properties of sparse signals. Our idea is deployable to existing sparse-guidance methods. Extensive experiments show consistent improvement across guidance approaches and support the idea that expanding the sparse signal can alleviate the low density and imbalanced distribution problems. Our framework could become an important reference for future exploration of sparse-guided methods.
This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 110-2634-F-002-026 and FIH Mobile Limited. We benefit from NVIDIA DGX-1 AI Supercomputer and are grateful to the National Center for High-performance Computing.