CSPN++: Learning Context and Resource Aware Convolutional Spatial Propagation Networks for Depth Completion

11/13/2019 ∙ by Xinjing Cheng, et al. ∙ Baidu, Inc.

Depth completion deals with the problem of converting a sparse depth map to a dense one, given the corresponding color image. The convolutional spatial propagation network (CSPN) is one of the state-of-the-art (SoTA) methods for depth completion, recovering structural details of the scene. In this paper, we propose CSPN++, which further improves its effectiveness and efficiency by learning adaptive convolutional kernel sizes and numbers of iterations for the propagation, so that the context and computational resources needed at each pixel can be assigned dynamically on demand. Specifically, we formulate the learning of the two hyper-parameters as an architecture selection problem: various configurations of kernel sizes and numbers of iterations are first defined, and a set of soft weighting parameters is then trained to either properly assemble or select from the pre-defined configurations at each pixel. In our experiments, we find that weighted assembling leads to significant accuracy improvements, which we refer to as "context-aware CSPN", while weighted selection, which we refer to as "resource-aware CSPN", reduces the computational resources significantly with similar or better accuracy. Moreover, the resources needed by CSPN++ can be adjusted automatically w.r.t. the computational budget. Finally, to avoid the side effects of noisy or inaccurate sparse depths, we embed a gated network inside CSPN++, which further improves performance. We demonstrate the effectiveness of CSPN++ on the KITTI depth completion benchmark, where it significantly improves over CSPN and other SoTA methods.





Image guided depth completion, or depth completion for short in this paper, is the task of converting a sparse depth map from devices such as LiDAR [36] or algorithms such as structure-from-motion (SfM) [41] and simultaneous localization and mapping (SLAM) [12] to a per-pixel dense depth map with the help of reference images. The technique has a wide range of applications in the perception of indoor/outdoor moving robots such as self-driving vehicles [2] and home/indoor robots [8], or in applications such as augmented reality [37].

Figure 1: Output assembling or selection over an unrolled CSPN. The color of each dot indicates the computational resources needed at that point, where blue indicates low resource usage and red indicates high resource usage.

One of the state-of-the-art (SoTA) methods for this task is CSPN, an efficient local linear propagation model with affinities learned from a convolutional neural network (CNN). In CSPN, the authors claim that three important properties should be considered for the depth completion task: 1) depth preservation, where the depth values at sparse points should be maintained; 2) structure alignment, where the detailed structures, such as edges and object boundaries in the estimated depth map, should be aligned with the given image; and 3) transition smoothness, where the depth transition between sparse points and their neighborhoods should be smooth.

In real applications, depths from devices like LiDAR, or from algorithms such as SfM or SLAM, can be noisy [35] due to system or environmental errors. Datasets like KITTI adopt stereo and multiple frames to compensate for these errors for evaluation. In this paper, we do not assume that the sparse depth map is ground truth; rather, we consider that it may include errors as well, so the depth value at sparse points should be conditionally maintained with respect to its accuracy. Secondly, all pixels are treated equally in CSPN, while intuitively the pixels at geometric edges and object boundaries deserve more attention for structure alignment and transition smoothness. Therefore, in CSPN++, we propose to find a proper propagation context to further improve the performance of depth completion.

To be specific, as illustrated in Fig. 1, in CSPN++, numerous configurations of convolutional kernel size and number of iterations are first defined for each pixel; we then utilize learned per-pixel weights α to weight the different kernel-size proposals, and weights λ to weight the outputs after different numbers of iterations. Based on these hyper-parameters, we derive context-aware and resource-aware variants of CSPN++. In context-aware CSPN (CA-CSPN), we propose to assemble the outputs, making CSPN++ structurally similar to networks such as InceptionNet [32] or DenseNet [16], where the gradient from the final output can be directly back-propagated to earlier propagation stages. We find the model learns stronger representations, yielding a significant performance boost compared to CSPN.

In resource-aware CSPN (RA-CSPN), CSPN++ sequentially selects one convolutional kernel and one number of iterations for each pixel by minimizing the computational resource usage; the learned resource allocation speeds up CSPN significantly (nearly 3× in our experiments) while improving accuracy. In addition, RA-CSPN can be automatically adapted to a provided computational budget, with awareness of accuracy, through a budget rounding operation during training and inference.

In summary, our contribution lies in two aspects:

  1. Based on the observation that input sparse depths can be erroneous, we propose a gated network to guide the depth preservation and make the output more robust to noisy sparse depths.

  2. We propose an effective method to adapt the kernel sizes and iteration numbers for each pixel with respect to the image content for CSPN, which induces two variants, named context-aware and resource-aware CSPN. The former significantly improves performance, while the latter speeds up the algorithm and makes CSPN++ adapt to computational budgets.

Figure 2: Framework of our network for depth completion with resource- and context-aware CSPN (best viewed in color). At the end of the network, we generate the depth confidence for each sparse point, the affinity matrix for CSPN, and the weighting variables for model assembling and selection.

Related Work

Depth estimation, completion, enhancement/refinement, and models for dense prediction with dynamic context and compression have long been central problems in computer vision and robotics. Here we summarize these works in several aspects without enumerating them all due to space limits, and we mainly clarify their core relationships with the CSPN++ proposed in this paper.

Depth Completion.

The task of depth completion [34] has recently attracted much interest in robotics due to its wide application in enhancing 3D perception [24]. The provided depths are usually from LiDAR, SfM or SLAM, yielding a map with valid depth available at only some of the pixels. Within this field, some works directly convert sparse 3D points to dense ones without image guidance [48, 21, 34], producing impressive results with deep learning. Conventionally, however, jointly considering the structures from reference images to guide depth completion/enhancement [25, 13] yields better results. With the rise of deep learning for depth estimation from a single image [10, 38], researchers have adopted similar strategies for image-guided depth completion. For example, [27] propose to treat the sparse depth map as an additional input to a ResNet-based depth predictor [22], producing superior results compared to CNN outputs from image inputs alone. Later works further improve on this by focusing on efficiency [20], separately modeling the features from the image and the sparse depths [33], recovering the structural details of depth maps [5], combining with multi-level CRFs [42], or adopting auxiliary training losses using normals [44] or 3D representations [29, 4] from self-supervised learning strategies.

Among all of these works, we treat CSPN [5] as our baseline strategy due to its clear motivation and good theoretical guarantees on the stability of training and inference, and the resulting CSPN++ provides a significant boost in both effectiveness and efficiency.

Context Aware Architectures.

Assembling multiple contexts inside a network for dense prediction has been an effective component for recognition tasks in computer vision. In our view, the assembling strategies can be horizontal or vertical. Horizontal strategies assemble outputs from multiple branches in a single layer of a network, including modules such as Inception/Xception [32], pyramid spatial pooling (PSP) [45], and atrous spatial pyramid pooling (ASPP) [3]; vertical strategies assemble outputs from different layers, including modules such as HighwayNet [30] and DenseNet [16]. Some recent works combine the two strategies, such as HRNet [31] or DenseASPP [43]. Most recently, to make the context better conditioned on each pixel or on the provided image, attention mechanisms, at the cost of additional computation, have been introduced inside networks for context selection, such as skipnet [40] and non-local networks [39], or for context deformation, such as spatial transformer networks [18] and deformable networks [47].

In the field of depth completion, [6] propose the atrous convolutional spatial pyramid fusion (ACSF) module, which extends ASPP by additionally adding an affinity for each pixel, yielding stronger performance; it can be treated as a case of combining horizontal assembling with attention from affinity values. In our case, CA-CSPN of CSPN++ extends the context assembling idea to CSPN with both horizontal and vertical strategies via attention. Horizontally, it assembles multiple kernel sizes, and vertically it assembles the outputs from different iteration stages, as illustrated in Fig. 1. Here we would like to note that although, mathematically, in the forward pass performing one step of CA-CSPN with 7×7, 5×5, and 3×3 kernels together is equivalent to performing CSPN with a single 7×7 kernel, since the whole process is linear, the backward learning process is different due to the auxiliary weighting parameters, and our results are significantly better.

Resource Aware Inference.

In addition, the dynamic-context intuition can also be applied for efficient prediction by stopping the computation once a proper context has been obtained, which is also known as adaptive inference [14]. Specifically, relevant strategies have been adopted in image classification, such as the multi-scale dense network (MSDNet) [15]; in object detection, such as speed/accuracy trade-off balancing [17]; and in semantic segmentation, such as the regional convolution network (RCN), which treats each pixel differently [23].

In RA-CSPN of CSPN++, we first embed such an idea in depth completion, and adopt the functionality of RCN within CSPN for efficient inference. To minimize the computation, each pixel chooses one kernel size and then one number of iterations sequentially from the proposed configurations. Besides, we can easily add a provided computational budget, such as latency or memory constraints, to our optimization target, which can be back-propagated for operation selection, similar to resource-constrained architecture search algorithms [46, 1].


To make the paper self-contained, we first briefly review CSPN [6], and then demonstrate how we extend it with context and resource awareness. Given a depth map $D$ output from a network taking an image $I$ as input, CSPN updates the depth map to a new depth map. Without loss of generality, we follow their formulation by embedding the depth into a hidden representation $\mathbf{H}$, and the updating equation for one step of propagation can be written as,

$\mathbf{H}_{x}^{t+1} = \Phi_k(\mathbf{H}^{t})_x = \kappa_{x}(x)\,\mathbf{H}_{x}^{t} + \sum_{x_n \in \mathcal{N}_k(x)} \kappa_{x}(x_n)\,\mathbf{H}_{x_n}^{t},$ (1)

where $\Phi_k$ represents one step of CSPN given a predefined convolutional kernel size $k$. $\mathcal{N}_k(x)$ is the set of neighborhood pixels in a $k \times k$ kernel, and the affinities $\kappa_x(\cdot)$ output from a network are properly normalized (the neighbor weights by their absolute sum, with the center weight $\kappa_x(x) = 1 - \sum_{x_n} \kappa_x(x_n)$), which guarantees the stability of the module. The whole process iterates $t$ times to obtain the final result. Here, the number of iterations needs to be tuned in experiments, and it impacts the final performance significantly in their paper.

For depth completion, CSPN preserves the depth values at valid pixels in the sparse depth map by adding a replacement operation at the end of each step. Formally, let $\mathbf{H}^{s}$ be the corresponding embedding of the sparse depth map $D^{s}$; the replacement step after performing Eq. (1) is,

$\mathbf{H}_{x}^{t+1} = (1 - m_{x})\,\mathbf{H}_{x}^{t+1} + m_{x}\,\mathbf{H}_{x}^{s},$ (2)

where $m_{x}$ is an indicator for the validity of the sparse depth at pixel $x$.
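To make the propagation and replacement concrete, here is a minimal NumPy sketch of one CSPN propagation step followed by the replacement step; the tensor shapes, edge padding, and function names are illustrative assumptions rather than the paper's GPU implementation:

```python
import numpy as np

def cspn_step(H, affinity, sparse_H=None, valid=None, k=3):
    """One CSPN propagation step with a k x k kernel (sketch).

    H:        (h, w) hidden representation (embedded depth)
    affinity: (k*k, h, w) raw affinities from the network; neighbor
              weights are normalized by their absolute sum and the
              center weight is derived so all weights sum to 1 per
              pixel, which keeps the propagation stable.
    sparse_H, valid: optional sparse-depth embedding and validity mask
              for the replacement step.
    """
    kk = k * k
    # Normalize neighbor affinities by their absolute sum (as in CSPN).
    neighbor = np.delete(affinity, kk // 2, axis=0)           # drop center
    denom = np.abs(neighbor).sum(axis=0, keepdims=True) + 1e-8
    kappa = neighbor / denom                                  # (k*k-1, h, w)
    center = 1.0 - kappa.sum(axis=0)                          # center weight

    # Gather shifted copies of H for each neighbor offset.
    pad = k // 2
    Hp = np.pad(H, pad, mode="edge")
    out = center * H
    idx = 0
    for dy in range(-pad, pad + 1):
        for dx in range(-pad, pad + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = Hp[pad + dy: pad + dy + H.shape[0],
                         pad + dx: pad + dx + H.shape[1]]
            out += kappa[idx] * shifted
            idx += 1

    if sparse_H is not None:  # replacement step: keep valid sparse depths
        out = (1 - valid) * out + valid * sparse_H
    return out
```

Because the per-pixel weights sum to one, a constant input stays constant under propagation, which is the stability property the normalization buys.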

Context and Resource Aware CSPN

In this section, we elaborate on how CSPN++ enhances CSPN by learning a proper configuration for each pixel through additional predicted parameters: per-pixel weights α for weighting the various convolutional kernel sizes, and weights λ for weighting the different numbers of iterations given a kernel size k. As shown in Fig. 2, both variables are image-content dependent, and are predicted from a shared backbone together with the CSPN affinity and the estimated depths.

Context-Aware CSPN

Given the provided per-pixel weights α and λ, context-aware CSPN (CA-CSPN) first assembles the results from different steps: at each step, the output of one CSPN iteration is progressively aggregated into a running estimate, weighted by λ obtained after a sigmoid function applied to the network outputs. Finally, after all iterations, we assemble the different outputs produced by the various kernel sizes, weighted by α. Here, both α and λ are properly normalized by their L1 norm, so that our output maintains the stabilization property of CSPN for training and inference.
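The weighted assembling can be sketched as follows; the softmax normalization (an L1 normalization of positive scores) and the array layouts are our simplifying assumptions:

```python
import numpy as np

def assemble(outputs, kernel_logits, iter_logits):
    """Context-aware assembling sketch.

    outputs:       (K, T, h, w) CSPN results for K kernel sizes after
                   each of T iterations.
    kernel_logits: (K, h, w) per-pixel logits for kernel-size weights.
    iter_logits:   (K, T, h, w) per-pixel logits for iteration weights.
    Both weight sets are normalized per pixel so the assembled output
    stays a convex combination, preserving CSPN's stability.
    """
    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    lam = softmax(iter_logits, axis=1)        # weights over T iterations
    alpha = softmax(kernel_logits, axis=0)    # weights over K kernels
    per_kernel = (lam * outputs).sum(axis=1)  # (K, h, w)
    return (alpha * per_kernel).sum(axis=0)   # (h, w)
```

Since both weight sets sum to one per pixel, the assembled result is a convex combination of the candidate outputs.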

When sparse points are available, CSPN++ adopts a confidence variable predicted at each valid depth in the sparse depth map, which is output from the shared backbone in our framework (Fig. 2). The replacement step of CSPN++ is therefore modified accordingly,

$\mathbf{H}_{x}^{t+1} = (1 - g_{x} m_{x})\,\mathbf{H}_{x}^{t+1} + g_{x} m_{x}\,\mathbf{H}_{x}^{s},$ (5)

where $g_{x} \in [0, 1]$ is the confidence predicted from the network after a convolutional layer.

Complexity and computational resource analysis.

From CSPN, we know that theoretically, with a sufficient number of GPU cores and large memory storage, the overall complexity of CSPN with a kernel size of $k$ and $t$ iterations is $O(k^2 t)$. In CA-CSPN, with $K$ induced convolutional kernels, the computational complexity is $O(k_{max}^2 t)$, where $k_{max}$ is the maximum kernel size, since all branches can be performed simultaneously.

However, in real applications, the expected computational resource is limited, and the memory requests of large convolutional kernels can be time consuming. Therefore, we need a better metric for estimating the cost. Here, we adopt the commonly used memory cost and Mult-Adds/FLOPs as an indicator of latency or computational resource usage on a device. Specifically, based on the CUDA implementation of convolution with im2col [19], performing one step of CSPN with a $k \times k$ kernel requires a memory cost of $O(k^2 chw)$ and $O(k^2 chw)$ FLOPs, given a single feature block of size $c \times h \times w$. In summary, given $K$ kernels, the latency from this big-$O$ estimation for CA-CSPN would be $O(t \sum_{k} k^2 chw)$. Finally, we would like to note that the memory and computational configuration varies across devices, and so does the latency estimation. A better strategy would be to test directly on the target device, as proposed in [1]; here, we just provide a reasonable estimate for a commonly used GPU.
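This cost model is simple enough to sketch directly; it is a rough estimate under the im2col assumption above, not a device measurement:

```python
def cspn_cost(k, t, c, h, w):
    """Rough im2col-based cost model for t steps of CSPN with a k x k
    kernel on a c x h x w feature block (a sketch of the estimate in
    the text).

    Returns (memory_elems, flops): the im2col buffer holds k*k*c*h*w
    elements, and each propagation step costs about k*k*c*h*w
    multiply-adds.
    """
    memory = k * k * c * h * w      # im2col buffer per step
    flops = t * k * k * c * h * w   # multiply-adds over t steps
    return memory, flops
```

Under this model, a 7×7 kernel is 49/9 ≈ 5.4× more expensive per step than a 3×3 kernel on the same feature block, which is why per-pixel kernel selection pays off.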

Network architectures. As illustrated in Fig. 2, for the backbone network we adopt the same ResNet-34 structure proposed in [26]. The only modification is at the end of the network: it outputs the per-pixel estimates of the assembling parameters α and λ, the confidence guidance for replacement, and the affinity matrix, using a final convolutional layer. For the affinity values of the various propagation kernels, we use a shared affinity matrix, since the affinity between different pixels should be irrelevant to the context of propagation; this also saves memory inside the network.

Training context-aware CSPN. Given the proposed architecture and our computational resource analysis w.r.t. latency, we add an additional regularization term to the general optimization target, which minimizes the expected computational cost by treating the assembling weights as probabilities of configuration selection. This proves effective in improving the final performance in our experiments. Formally, the overall target for training CA-CSPN can be written as,

$\min_{\theta}\; \|\hat{D} - D^{gt}\|_2^2 + \gamma \|\theta\|_2^2 + \frac{\eta}{HW} \sum_{x} \mathcal{C}(\alpha_x, \lambda_x),$ (6)

where $\theta$ denotes the network parameters and $\gamma \|\theta\|_2^2$ is the weight decay regularization. $\mathcal{C}(\alpha_x, \lambda_x)$ is the expected computational cost at pixel $x$ given the assembling variables, based on our analysis, and $H$ and $W$ are the height and width of the feature, respectively. $\hat{D}$ and $D^{gt}$ are the output depth map from CA-CSPN and the ground-truth depth map, correspondingly. The whole system can be trained end-to-end.
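A minimal sketch of this training objective follows; the exact form of the expected per-pixel cost (kernel area times expected iteration count) is our assumption, while the default coefficients match the values reported later in the implementation details:

```python
import numpy as np

def ca_cspn_loss(pred, gt, params, alpha, lam, sizes,
                 gamma=5e-4, eta=0.1):
    """Training objective sketch for CA-CSPN: L2 reconstruction +
    weight decay + expected-cost regularization, with the assembling
    weights read as selection probabilities.

    alpha: (K, h, w) kernel weights; lam: (K, T, h, w) iteration
    weights; sizes: list of the K kernel sizes.
    """
    recon = np.mean((pred - gt) ** 2)
    decay = sum(np.sum(p ** 2) for p in params)
    # Expected cost per pixel: sum_k alpha_k * k^2 * E[#iterations].
    t_idx = np.arange(lam.shape[1]).reshape(1, -1, 1, 1) + 1
    exp_iters = (lam * t_idx).sum(axis=1)             # (K, h, w)
    k2 = np.array(sizes).reshape(-1, 1, 1) ** 2
    exp_cost = np.mean((alpha * k2 * exp_iters).sum(axis=0))
    return recon + gamma * decay + eta * exp_cost
```

Gradients of the cost term flow into the weights alpha and lam, pushing the network toward cheap configurations unless accuracy demands otherwise.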

Figure 3: The proposed regional im2col and convolution operations for efficient testing. Here, let the green (1), red (2) and blue (3) regions have kernel sizes of 3, 7, 5 and iteration numbers of t, t+1, t+1, respectively. We convert each region to a matrix for parallel convolution through im2col, with the feature dimension along one axis and the number of pixels in the corresponding region along the other. If the pixels of a region do not need propagation (e.g., region 1 at the illustrated time step), we directly copy their features to the next step.

Resource Aware Configuration

As shown in our complexity analysis, CSPN with a large kernel size and long propagation is time consuming. To accelerate it, we further propose resource-aware CSPN (RA-CSPN), which selects the best kernel size and number of iterations for each pixel based on the estimated weights α and λ. Formally, its propagation step can be written as,

$\mathbf{H}_{x}^{t+1} = \Phi_{k^*_x}(\mathbf{H}^{t})_x, \quad \text{where } k^*_x = \arg\max_{k} \alpha_x(k), \;\; t^*_x = \arg\max_{t} \lambda_x(k^*_x, t),$ (7)

and $\Phi_{k}$ denotes one CSPN propagation step with a $k \times k$ kernel, so pixel $x$ propagates with kernel size $k^*_x$ for $t^*_x$ steps. Here each pixel is treated differently by selecting its best learned configuration, and we follow the same replacement process as Eq. (2) for handling depth completion.

Computational resource analysis.

Given the selected configuration of convolutional kernel and number of iterations at each pixel, the per-image latency estimate proposed in our earlier complexity analysis changes to $O(\bar{t}\,\bar{k}^2 chw)$, where $\bar{t}$ and $\bar{k}$ are the average number of iterations and the average kernel size in the image, respectively. Both numbers are guaranteed to be no larger than the maximum number of iterations and the maximum kernel size.

Training RA-CSPN.

In our case, training RA-CSPN does not require modifying the multi-branch architecture shown in Fig. 1; it only switches from the weighted-average assembling described above for CA-CSPN to max selection, so that only one path is adopted for each pixel. In addition, we modify the loss function by changing the expected computational cost to its selected counterpart,

$\mathcal{C}_x = (k^*_x)^2\, t^*_x, \quad \text{where } k^*_x \text{ and } t^*_x \text{ are the per-pixel selections of Eq. (7)}.$ (8)

In practice, to implement the configuration selection, we reuse the same training pipeline as CA-CSPN by converting the obtained soft weighting values α and λ to a one-hot representation through an argmax operation.
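The sequential argmax conversion can be sketched as follows (the array layouts are illustrative assumptions):

```python
import numpy as np

def to_one_hot_selection(kernel_w, iter_w):
    """Convert soft assembling weights to hard per-pixel selections
    for RA-CSPN (sketch).

    kernel_w: (K, h, w) per-pixel kernel-size weights.
    iter_w:   (K, T, h, w) per-pixel iteration weights per kernel.
    Returns the chosen kernel index and, for that kernel, the chosen
    iteration index at each pixel, mirroring the sequential argmax.
    """
    k_idx = kernel_w.argmax(axis=0)                         # (h, w)
    h_idx, w_idx = np.indices(k_idx.shape)
    # Pick the iteration weights of the selected kernel per pixel.
    t_idx = iter_w[k_idx, :, h_idx, w_idx].argmax(axis=-1)  # (h, w)
    return k_idx, t_idx
```

A one-hot representation is then obtained by scattering these indices back into arrays of the original weight shapes.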

Efficient testing.

Practically, there are two issues to handle to make the algorithm efficient at test time: 1) how to perform different convolutions simultaneously at different pixels, and 2) how to continue the propagation for pixels whose neighborhood pixels have stopped their diffusion/propagation. To handle these issues, we follow the idea of regional convolution [23].

Specifically, as shown in Fig. 3, to tackle the first issue, we group pixels into multiple regions based on the predicted kernel size, and prepare a corresponding matrix for each group using region-wise im2col before the convolution. The generated matrices can then be processed simultaneously using region-wise convolution. To tackle the second issue, when the propagation of a pixel stops at time step $t$, we directly copy its feature to the next step for computing convolutions at later stages. In summary, RA-CSPN can be performed in a single forward pass with less resource usage.
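The grouping step of the regional convolution can be sketched as follows (only the grouping; the region-wise im2col and convolution themselves are omitted):

```python
import numpy as np

def group_regions(kernel_map):
    """Group pixel coordinates by their selected kernel size so each
    group can be processed with one region-wise im2col + convolution
    call (sketch of the grouping step only).

    kernel_map: (h, w) array of the selected kernel size per pixel.
    Returns {kernel_size: (rows, cols)} index arrays per group.
    """
    groups = {}
    for k in np.unique(kernel_map):
        groups[int(k)] = np.nonzero(kernel_map == k)
    return groups
```

Each index group then drives one batched gather (im2col) and one matrix multiply, so pixels sharing a kernel size are convolved together in a single call.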

| Method | SPP | Normal | assemble kernel | assemble iter. | GR | LR | RMSE (mm) | MAE (mm) |
|---|---|---|---|---|---|---|---|---|
| [26] | | | | | | | 799.08 | 265.98 |
| [26] | ✓ | | | | | | 788.23 | 247.55 |
| CSPN | ✓ | ✓ | | | | | 765.78 | 213.54 |
| CSPN | ✓ | ✓ | | | ✓ | | 756.27 | 215.21 |
| CA-CSPN | ✓ | | ✓ | | ✓ | | 732.46 | 210.61 |
| CA-CSPN | ✓ | | ✓ | ✓ | ✓ | | 732.34 | 209.20 |
| CA-CSPN | ✓ | | ✓ | ✓ | ✓ | ✓ | 725.43 | 207.88 |

Table 1: Ablation study on the KITTI depth completion validation set (lower is better). 'Normal', 'assemble kernel' and 'assemble iter.' are CSPN configurations; 'GR' stands for guided replacement and 'LR' for latency regularization; 'CA-CSPN' is our proposed strategy.
Figure 4: Visualization of the learned per-pixel configurations of our context-aware CSPN (best viewed in color).

Learning with provided computational budget.

Finally, in real applications, rather than being given optimal computational resources, a deployed model usually faces a hard constraint on memory or inference latency. Thanks to the adaptive resource usage of CSPN++, we can directly incorporate the required budget into our optimization target during training. Formally, given a target memory budget $B_m$ and a latency budget $B_l$ for resource-aware CSPN, our optimization target becomes,

$\min_{\theta}\; \|\hat{D} - D^{gt}\|_2^2 + \gamma \|\theta\|_2^2 \quad \text{s.t. } \mathbb{E}[\mathcal{C}_m] \le B_m, \;\; \mathbb{E}[\mathcal{C}_l] \le B_l,$ (9)

where $\mathbb{E}[\mathcal{C}_m]$ is the expected memory cost and $\mathbb{E}[\mathcal{C}_l]$ is the expected latency cost defined via Eq. (8). The two constraints can be added to our target easily with Lagrange multipliers. Formally, our optimization target with resource budgets is,

$\min_{\theta}\; \|\hat{D} - D^{gt}\|_2^2 + \gamma \|\theta\|_2^2 + \mu_m \max(0, \mathbb{E}[\mathcal{C}_m] - B_m) + \mu_l \max(0, \mathbb{E}[\mathcal{C}_l] - B_l),$ (10)

where the hinge loss is adopted as a surrogate for satisfying the constraints.

Last but not least, since our primal problem, i.e., optimization with a deep neural network, is highly non-convex, there is no guarantee during training that all samples will satisfy the constraints. In addition, during testing, the predicted configuration might also violate the given constraints. For these cases, we propose a resource rounding strategy that hard-constrains the overall computation within the budgets. Specifically, we calculate the average cost at each pixel, and for the pixels violating it, as illustrated in Fig. 1, we find the Pareto optimal frontier [28] that satisfies the constraint and pick the configuration with the largest number of iterations, since it obtains the largest receptive field.
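The rounding strategy can be sketched as a simple selection over candidate configurations, using kernel area times iteration count as the per-pixel cost proxy (the proxy and the tie-breaking are our assumptions):

```python
def budget_round(configs, budget):
    """Resource-rounding sketch: among candidate (kernel, iterations)
    configurations whose estimated per-pixel cost k^2 * t fits the
    budget, pick the one with the most iterations (largest receptive
    field), as described in the text.
    """
    feasible = [(k, t) for k, t in configs if k * k * t <= budget]
    if not feasible:  # fall back to the cheapest configuration
        return min(configs, key=lambda kt: kt[0] ** 2 * kt[1])
    return max(feasible, key=lambda kt: (kt[1], kt[0]))
```

Applied per violating pixel, this keeps the total computation within the hard budget while preserving as much propagation context as the budget allows.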


| Method | kernel | iter. | m.c. | l.c. | E[kernel] | E[iter.] | mem. (MB) | lat. (ms) | RMSE (mm) |
|---|---|---|---|---|---|---|---|---|---|
| CSPN | 7×7 | 12 | – | – | 1.0 | 1.0 | 829 | 28.88 | 756.27 |
| CA-CSPN | assemble | 12 | – | – | 0.680 | 1.0 | 2125 | 67.23 | 732.46 |
| CA-CSPN | assemble | assemble | – | – | 0.316 | 0.446 | 2125 | 67.23 | 725.43 |
| RA-CSPN | select | select | – | – | 0.268 | 0.439 | 626.29 | 10.03 | 732.32 |
| RA-CSPN | select | select | 0.35 | 0.35 | 0.333 | 0.303 | 625.30 | 9.84 | 742.17 |

Table 2: Comparison of efficiency between CSPN and CSPN++ (lower is better). E[kernel] and E[iter.] are the expected kernel size and expected number of iterations under the learned weights; mem. and lat. are the real memory cost and real on-device latency. 'm.c.' is short for memory constraint and 'l.c.' for latency constraint. Both the constraints and the expected values are normalized by the corresponding resource used in the CSPN baseline. Note that the memory cost is not proportional to the expected kernel size, since the majority is taken by the affinity matrix in our case. Here we set a minimum cost corresponding to the smallest kernel size and fewest propagation steps; one may achieve additional acceleration by dropping this minimum cost.

For experiments, we mainly evaluate CSPN++ on the KITTI depth completion benchmark [34]. In this section, we first introduce the dataset, metrics and our implementation details. Then, an extensive ablation study of CSPN++ is conducted on the validation set to verify the insight behind each proposed component. Finally, we provide a qualitative comparison of CSPN++ against other SoTA methods on the test set.

Experimental setup


The KITTI Depth Completion benchmark is a large real-world self-driving dataset with street views from a driving vehicle. It consists of 86k training, 7k validation and 1k testing depth maps with corresponding raw LiDAR scans and reference images. The sparse depth maps are obtained by projecting the raw LiDAR points through the camera view, and the ground-truth dense depth maps are generated by first accumulating and projecting the LiDAR scans of multiple timestamps, and then removing outlier depths caused by occlusion and moving objects through comparison with stereo depths from image pairs.

Metrics. We adopt the same error metrics as the KITTI depth completion benchmark, including root mean square error (RMSE), mean absolute error (MAE), inverse RMSE (iRMSE) and inverse MAE (iMAE), where 'inverse' indicates the inverse depth representation, i.e., converting depth $d$ to $1/d$.
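These metrics are straightforward to compute; a sketch follows, where the masking to valid ground-truth pixels and the 1/km unit convention for the inverse metrics are our assumptions:

```python
import numpy as np

def kitti_metrics(pred, gt):
    """Compute depth-completion metrics on valid ground-truth pixels.

    Depths are assumed to be in meters; the inverse metrics are
    computed on 1000/d, i.e., inverse depth in 1/km.
    """
    mask = gt > 0                    # evaluate only where GT exists
    d, g = pred[mask], gt[mask]
    rmse = float(np.sqrt(np.mean((d - g) ** 2)))
    mae = float(np.mean(np.abs(d - g)))
    inv_d, inv_g = 1000.0 / d, 1000.0 / g
    irmse = float(np.sqrt(np.mean((inv_d - inv_g) ** 2)))
    imae = float(np.mean(np.abs(inv_d - inv_g)))
    return {"RMSE": rmse, "MAE": mae, "iRMSE": irmse, "iMAE": imae}
```

The inverse metrics weight nearby errors more heavily than distant ones, complementing the plain RMSE/MAE.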

Implementation details. We train our network on four NVIDIA Tesla P40 GPUs with a batch size of 8. In all our experiments, we adopt kernel sizes of 3×3, 5×5 and 7×7, and sample outputs after each of the 12 propagation steps. All our models are trained with the Adam optimizer. The learning rate starts from its initial value and is halved every 5 epochs. For training context-aware CSPN, the parameter for weight decay is set to 0.0005, and the parameter for resource regularization is set to 0.1. For training resource-aware CSPN with the objective of Eq. (8), the corresponding balancing parameters are set analogously. All these parameters are chosen to balance the value scales of the different losses, without exhaustive tuning.

Ablation studies

Ablation study of context-aware CSPN (CA-CSPN). Here, we conduct experiments to verify each module adopted in our framework, including our baselines, i.e., CSPN with spatial pyramid pooling (SPP), and the newly proposed modules in context-aware CSPN. To keep the validation efficient, we train each network for only 10 epochs. For SPP, we adopt a set of pooling sizes, and for CSPN we use a 7×7 kernel with 12 iterations. As shown in Tab. 1, adding the SPP and CSPN modules to the baseline from [26] significantly reduces the depth error, due to the pyramid context induced by SPP and the structure refinement from CSPN. With the additional confidence-guided replacement (GR) (Eq. (5)), our module better handles noisy sparse depths, and the RMSE is significantly reduced from 765.78 to 756.27. Then, in the rows marked 'assemble kernel', we add the component that learns to horizontally assemble predictions from different kernel sizes via the learned weights α, which further reduces the error from 756.27 to 732.46. In the rows marked 'assemble iter.', we include the component that learns to vertically assemble outputs after different numbers of iterations via the learned weights λ. Finally, in the rows marked 'LR', we add our proposed latency regularization term to the training losses, yielding the best results of our context-aware CSPN.

In Fig. 4, we visualize the learned configurations of kernel sizes and iteration numbers at each pixel. We find that the majority of pixels on the ground and walls need only a small kernel and few iterations for recovery, while pixels farther away and around object and surface boundaries need larger kernels and more iterations to obtain a larger context for reconstruction. This agrees with our intuition, since in real scenes the sparse points are denser nearby and the structure of planar regions is simpler, making depth estimation easier there.

Figure 5: Qualitative comparison with UberATG-FuseNet on the KITTI test set, where the zoomed regions show that our method recovers better, more detailed structures.

Ablation study of resource-aware CSPN (RA-CSPN). To verify the efficiency of our proposed RA-CSPN, we study the computational improvement w.r.t. vanilla CSPN and CA-CSPN. As listed in Tab. 2, the row 'CSPN' gives its memory cost and latency on device. For 'CA-CSPN', although the memory cost and latency are larger in practice, the expected kernel size and number of iterations are much smaller thanks to our latency regularization term, indicating that most pixels need only a small kernel and few iterations to obtain better results. In the 'RA-CSPN' rows, we train with the resource-aware objective of Eq. (8) and show that RA-CSPN not only outperforms CSPN in efficiency (almost 3× faster) but also improves the RMSE from 756.27 to 732.32. More importantly, we can train RA-CSPN with a computational budget to fit different devices, as proposed in Eq. (10). In the last row, under a hard constraint that the memory and latency costs be less than 35% of those of vanilla CSPN, our method adjusts the kernel sizes and iterations actively: the expected number of iterations drops from 0.439 to 0.303 while the expected kernel size increases from 0.268 to 0.333, meaning the network automatically chooses larger kernels with fewer iterations to satisfy the hard constraints, while still producing better results than CSPN, demonstrating the effectiveness of our method.

Comparisons against other methods

Finally, to compare depth-estimation accuracy against other SoTA methods, we take our best model from CA-CSPN and finetune it for another 30 epochs before submitting the results to the KITTI test server. As summarized in Tab. 3, CA-CSPN outperforms all other methods significantly and currently ranks 2nd on the benchmark, with better results in three out of the four metrics. We would like to note that our results are also better than those of methods adopting additional data, e.g., DeepLiDAR [29] uses CARLA [9] to jointly learn dense depth and surface normals, and FusionNet [35] uses semantic segmentation models pre-trained on CityScapes [7]. Our plain model is trained only on KITTI and outperforms all these methods.

In Fig. 5, we qualitatively compare the dense depth maps estimated by our proposed method with those of UberATG-FuseNet [4], together with the corresponding error maps. We find our results are better at recovering detailed scene structure.

| Method | iRMSE (1/km) | iMAE (1/km) | RMSE (mm) | MAE (mm) |
|---|---|---|---|---|
| SC [34] | 4.94 | 1.78 | 1601.33 | 481.27 |
| CSPN [5] | 2.93 | 1.15 | 1019.64 | 279.46 |
| NConv [11] | 2.60 | 1.03 | 829.98 | 233.26 |
| StD [26] | 2.80 | 1.21 | 814.73 | 249.95 |
| FN [35] | 2.19 | 0.93 | 772.87 | 215.02 |
| DL [29] | 2.56 | 1.15 | 758.38 | 226.25 |
| Uber [4] | 2.34 | 1.14 | 752.88 | 221.19 |
| CA-CSPN | 2.07 | 0.90 | 743.69 | 209.28 |
Table 3: Comparisons against state-of-the-art methods on KITTI Depth Completion benchmark.


In this paper, we propose CSPN++ for depth completion, which outperforms the previous SoTA strategy CSPN [6] by a large margin. Specifically, we elaborate two variants within the same framework of model selection, i.e., context-aware CSPN and resource-aware CSPN. The former significantly reduces estimation error, while the latter achieves much better efficiency with accuracy comparable to the former. We hope CSPN++ can motivate researchers to adopt data-driven strategies for effectively learning hyper-parameters in various tasks. In the future, we would like to merge the two variants, and to consider replacing more network modules with CSPN for multiple tasks such as segmentation and detection.


  • [1] H. Cai, L. Zhu, and S. Han (2019) Proxylessnas: direct neural architecture search on target task and hardware. ICLR. Cited by: Resource Aware Inference., Complexity and computational resource analysis..
  • [2] C. Chen, A. Seff, A. Kornhauser, and J. Xiao (2015) Deepdriving: learning affordance for direct perception in autonomous driving. In ICCV, pp. 2722–2730. Cited by: Introduction.
  • [3] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587. Cited by: Context Aware Architectures..
  • [4] Y. Chen, M. Liang, B. Yang, and R. Urtasun (2019) Learning joint 2d-3d representations for depth completion.. ICCV. Cited by: Depth Completion., Comparisons against other methods, Table 3.
  • [5] X. Cheng, P. Wang, and R. Yang (2018) Depth estimation via affinity learned with convolutional spatial propagation network. In ECCV, pp. 103–119. Cited by: Depth Completion., Depth Completion., Table 3.
  • [6] X. Cheng, P. Wang, and R. Yang (2018) Learning depth with convolutional spatial propagation network. arXiv preprint arXiv:1810.02695. Cited by: Introduction, Context Aware Architectures., Preliminaries, Conclusion.
  • [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223. Cited by: Comparisons against other methods.
  • [8] G. N. DeSouza and A. C. Kak (2002) Vision for mobile robot navigation: a survey. TPAMI 24 (2), pp. 237–267. Cited by: Introduction.
  • [9] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: an open urban driving simulator. arXiv preprint arXiv:1711.03938. Cited by: Comparisons against other methods.
  • [10] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In NIPS, Cited by: Depth Completion..
  • [11] A. Eldesokey, M. Felsberg, and F. S. Khan (2019) Confidence propagation through cnns for guided sparse depth regression. TPAMI. Cited by: Table 3.
  • [12] J. Engel, T. Schöps, and D. Cremers (2014) LSD-slam: large-scale direct monocular slam. In ECCV, Cited by: Introduction.
  • [13] D. Ferstl, C. Reinbacher, R. Ranftl, M. Rüther, and H. Bischof (2013) Image guided depth upsampling using anisotropic total generalized variation. In ICCV, pp. 993–1000. Cited by: Depth Completion..
  • [14] A. Graves (2016) Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983. Cited by: Resource Aware Inference..
  • [15] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Weinberger (2018) Multi-scale dense networks for resource efficient image classification. ICLR. Cited by: Resource Aware Inference..
  • [16] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, pp. 4700–4708. Cited by: Introduction, Context Aware Architectures..
  • [17] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, pp. 7310–7311. Cited by: Resource Aware Inference..
  • [18] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In NIPS, pp. 2017–2025. Cited by: Context Aware Architectures..
  • [19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell (2014) Caffe: convolutional architecture for fast feature embedding. In MM, pp. 675–678. Cited by: Complexity and computational resource analysis..
  • [20] J. Ku, A. Harakeh, and S. L. Waslander (2018) In defense of classical image processing: fast depth completion on the cpu. In CRV, pp. 16–22. Cited by: Depth Completion..
  • [21] L. Ladicky, O. Saurer, S. Jeong, F. Maninchedda, and M. Pollefeys (2017) From point clouds to mesh using regression. In ICCV, Cited by: Depth Completion..
  • [22] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab (2016) Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pp. 239–248. Cited by: Depth Completion..
  • [23] X. Li, Z. Liu, P. Luo, C. Change Loy, and X. Tang (2017) Not all pixels are equal: difficulty-aware semantic segmentation via deep layer cascade. In CVPR, pp. 3193–3202. Cited by: Resource Aware Inference., Efficient testing..
  • [24] Y. Liao, L. Huang, Y. Wang, S. Kodagoda, Y. Yu, and Y. Liu (2017) Parse geometry from a line: monocular depth estimation with partial laser observation. ICRA. Cited by: Depth Completion..
  • [25] J. Liu and X. Gong (2013) Guided depth enhancement via anisotropic diffusion. In Pacific-Rim Conference on Multimedia, pp. 408–417. Cited by: Depth Completion..
  • [26] F. Ma, G. V. Cavalheiro, and S. Karaman (2019) Self-supervised sparse-to-dense: self-supervised depth completion from lidar and monocular camera. In ICRA, pp. 3288–3295. Cited by: Depth Completion., Complexity and computational resource analysis., Table 1, Ablation studies, Table 3.
  • [27] F. Ma and S. Karaman (2018) Sparse-to-dense: depth prediction from sparse depth samples and a single image. ICRA. Cited by: Depth Completion..
  • [28] W. B. Mock (2011) Pareto optimality. Encyclopedia of Global Justice, pp. 808–809. Cited by: Learning with provided computational budget..
  • [29] J. Qiu, Z. Cui, Y. Zhang, X. Zhang, S. Liu, B. Zeng, and M. Pollefeys (2019) Deeplidar: deep surface normal guided depth prediction for outdoor scene from sparse lidar data and single color image. In CVPR, pp. 3313–3322. Cited by: Depth Completion., Comparisons against other methods, Table 3.
  • [30] R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Highway networks. arXiv preprint arXiv:1505.00387. Cited by: Context Aware Architectures..
  • [31] K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. CVPR. Cited by: Context Aware Architectures..
  • [32] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In CVPR, pp. 2818–2826. Cited by: Introduction, Context Aware Architectures..
  • [33] J. Tang, F. Tian, W. Feng, J. Li, and P. Tan (2019) Learning guided convolutional network for depth completion. arXiv preprint arXiv:1908.01238. Cited by: Depth Completion..
  • [34] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger (2017) Sparsity invariant cnns. 3DV. Cited by: Depth Completion., Table 3, Experiments.
  • [35] W. Van Gansbeke, D. Neven, B. De Brabandere, and L. Van Gool (2019) Sparse and noisy lidar completion with rgb guidance and uncertainty. arXiv preprint arXiv:1902.05356. Cited by: Introduction, Comparisons against other methods, Table 3.
  • [36] Velodyne Lidar (2018) HDL-64E. Note: http://velodynelidar.com/[Online; accessed 01-March-2018] Cited by: Introduction.
  • [37] J. Ventura and T. Höllerer (2008) Depth compositing for augmented reality.. In SIGGRAPH posters, pp. 64. Cited by: Introduction.
  • [38] P. Wang, X. Shen, B. Russell, S. Cohen, B. Price, and A. L. Yuille (2016) Surge: surface regularized geometry estimation from a single image. In NIPS, pp. 172–180. Cited by: Depth Completion..
  • [39] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, pp. 7794–7803. Cited by: Context Aware Architectures..
  • [40] X. Wang, F. Yu, Z. Dou, T. Darrell, and J. E. Gonzalez (2018) Skipnet: learning dynamic routing in convolutional networks. In ECCV, pp. 409–424. Cited by: Context Aware Architectures..
  • [41] C. Wu et al. (2011) VisualSFM: a visual structure from motion system. Cited by: Introduction.
  • [42] D. Xu, W. Wang, H. Tang, H. Liu, N. Sebe, and E. Ricci (2018) Structured attention guided convolutional neural fields for monocular depth estimation. In CVPR, pp. 3917–3925. Cited by: Depth Completion..
  • [43] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang (2018) Denseaspp for semantic segmentation in street scenes. In CVPR, pp. 3684–3692. Cited by: Context Aware Architectures..
  • [44] Y. Zhang and T. Funkhouser (2018) Deep depth completion of a single rgb-d image. In CVPR, pp. 175–185. Cited by: Depth Completion..
  • [45] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2016) Pyramid scene parsing network. CVPR. Cited by: Context Aware Architectures..
  • [46] Y. Zhou, P. Wang, S. Arik, H. Yu, S. Zawad, F. Yan, and G. Diamos (2019) EPNAS: efficient progressive neural architecture search. BMVC. Cited by: Resource Aware Inference..
  • [47] X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable convnets v2: more deformable, better results. In CVPR, pp. 9308–9316. Cited by: Context Aware Architectures..
  • [48] K. Zimmermann, T. Petricek, V. Salansky, and T. Svoboda (2017) Learning for active 3d mapping. ICCV. Cited by: Depth Completion..