Learning Joint 2D-3D Representations for Depth Completion

by   Yun Chen, et al.

In this paper, we tackle the problem of depth completion from RGBD data. Towards this goal, we design a simple yet effective neural network block that learns to extract joint 2D and 3D features. Specifically, the block consists of two domain-specific sub-networks that apply 2D convolution on image pixels and continuous convolution on 3D points, with their output features fused in image space. We build the depth completion network simply by stacking the proposed block, which has the advantage of learning hierarchical representations that are fully fused between 2D and 3D spaces at multiple levels. We demonstrate the effectiveness of our approach on the challenging KITTI depth completion benchmark and show that our approach outperforms the state-of-the-art.



1 Introduction

In the past few years, the use of sensors that capture both image and depth information has increased significantly. They are typically used in applications such as self-driving vehicles, robotic manipulation and gaming. While passive sensors like cameras typically generate dense data, active sensors like LiDAR (Light Detection and Ranging) produce sparse depth observations of the environment. As a result, this semi-dense representation of the world can be inaccurate at regions close to object boundaries. One solution is to use high-end depth sensors with higher data density, but they are usually very expensive. A more affordable alternative is depth completion (shown in Figure 1), which takes the sparse depth observation and dense image as input, and estimates the dense depth map. In practice, depth completion is often employed as a precursor to downstream perception tasks such as detection, semantic segmentation or instance segmentation.

Figure 1: Illustration of the depth completion task. The model takes a sparse depth map (projection of the LiDAR point cloud) and a dense RGB image as input, and produces a dense depth map.

Despite many attempts to solve the problem, depth completion remains unsolved. Challenges such as the inherent ambiguity in extracting depth from images, as well as the noise and uncertainty in the unstructured sparse depth observation, make depth completion a non-trivial task.

Many approaches [33, 7, 21, 26, 34] reason in the 2D space only by projecting the 3D point cloud to 2D image space. Convolutional neural networks (CNNs) are typically used to learn multi-modality representations in 2D space. However, as the metric space is distorted after the camera projection, such approaches have difficulty capturing precise 3D geometric clues. As a result, auxiliary tasks such as surface normal estimation are added to better supervise the feature learning [26]. Other methods [32] reason in 3D space only, extracting 3D features (Truncated Signed Distance Function [24]) from the sparse depth image of the scene and applying a 3D CNN to learn 3D representations and complete the scene densely in 3D. The drawback is that they do not exploit the dense image data, which can provide discriminative appearance clues.

In contrast, in this paper, we take advantage of representations in both 2D and 3D spaces and design a simple yet effective architecture that fuses the information between these representations at multiple levels. Specifically, we design a 2D-3D fuse block that takes a feature map in 2D image space as input, branches into two sub-networks that learn representations in 2D and 3D spaces via multi-scale 2D convolutions and continuous convolutions [37] respectively, and then fuses their outputs back into the 2D image space. Thanks to the modular design, we can create networks of various model sizes by simply stacking the 2D-3D fuse blocks sequentially. Compared with other multi-sensor fusion based representations [38, 17] that typically fuse the features from each sensor only once in the whole network, our modular model has the advantage of dense feature fusion at multiple levels through the network. As a result, while the domain-specific sub-networks inside the block extract specialized 2D and 3D representations separately, stacking such blocks together leads to hierarchical joint representation learning that fully exploits the complementary information between the two sensor modalities.

We validate our approach on the challenging KITTI depth completion benchmark [33], and show that our approach outperforms all previous state-of-the-art methods in terms of Root Mean Square Error (RMSE) on depth. Note that our model is trained from scratch using KITTI training data only, and still surpasses other methods that exploit external data or multi-task learning. This further showcases the superiority of the proposed model in learning joint 2D-3D representations. We also conduct a detailed ablation study to investigate the effect of each component of the model, and show that our model achieves a better trade-off between accuracy and model size than the state-of-the-art.

2 Related Work

In this section, we review previous literature on the topics of depth estimation from RGB data, depth completion from RGBD data, and representation learning for RGBD data.

2.1 Depth Estimation from RGB data

Early approaches [20, 14, 15, 28] estimated depth from single RGB images by applying probabilistic graphical models to hand-crafted features. With the recent advances in image recognition by deep convolutional neural networks (CNNs), CNN based methods have been applied to depth estimation as well. Eigen et al. [6] designed a multi-scale deep network for depth estimation from a single image. Laina et al. [16] tackled the problem at a single scale by using a deep fully convolutional neural network. Liu et al. [18] combined deep representation with a continuous conditional random field (CRF) to get smoother estimations. Roy and Todorovic [27] proposed to combine deep representations with random forests and achieved a good trade-off between prediction smoothness and efficiency. Recently, unsupervised approaches [9, 10] exploited view synthesis as the supervisory signal, while some [22, 35, 40] further extended the idea to videos. However, due to the inherent ambiguity in depth from images, these approaches have difficulty producing high-quality dense depth.

2.2 Depth Completion from RGBD data

Different from depth estimation, the task of depth completion tries to exploit a sparse depth map (point cloud scan from a LiDAR sensor) and possibly image data as well to predict high-resolution dense depth. Early work [11, 19] resorted to wavelet analysis to generate dense depth/disparity from sparse samples. Recently, deep learning methods have achieved superior performance in depth completion. Uhrig et al. [33] proposed sparse invariant CNNs to extract better representations from sparse input only. Ma et al. [23] proposed to concatenate sparse depth with the RGB image and feed the result into an encoder-decoder based CNN for depth completion. A similar approach was also applied to the self-supervised setting [21]. Instead of using a CNN, Cheng et al. [2] used a recurrent convolution to estimate the affinity matrix for depth completion. Apart from the network architecture side, other methods exploited semantic contexts from multi-task learning. Schneider et al. [29] extracted object boundary cues for cleaner depth estimation. The semantic segmentation task was also exploited to jointly learn better semantic features of the scene [13, 34]. Qiu et al. [26] added the auxiliary task of surface normal estimation to depth completion. Yang et al. [39] learned a depth prior on images by training on large-scale simulation data. Compared with these approaches, which focus on better network architectures or on exploiting more context or priors from other datasets and labels, our method improves performance simply by learning better representations. This is achieved by a new neural network block that is specially designed for RGBD data. We show in experiments that we are able to learn strong joint 2D-3D representations from the RGBD data with the proposed method and achieve state-of-the-art performance in depth completion.

2.3 Representation for RGBD data

Figure 2: Architecture of the 2D-3D fuse block. The 2D-3D fuse block consists of two branches, a multi-scale 2D convolution branch and a 3D continuous convolution branch. Conv denotes 2D convolution with kernel size $k$, stride $s$ and output channels $c$. The gray numbers in brackets denote the shape of features. The multi-scale 2D branch has two scales. One has the same scale as the input and is composed of one convolution. The other is downsampled by a stride 2 convolution, followed by a convolution and then bilinearly upsampled by 2. In the 3D branch, we first extract point features as the image features at the projection locations of the points, then apply two continuous convolutions, and finally project the points to image space to form a sparse image feature map. Continuous convolution uses the K-Nearest-Neighbors algorithm to find the neighbors of each point. In the figure, we use K=3 as an example and only show the convolution operation on the red point. Note that the neighboring points in 2D space are not necessarily close to each other in 3D space. All convolutions are followed by batch normalization and ReLU.

Song et al. [30] extracted multiple hand-crafted features (TSDF [24], point density, 3D normal, 3D shape) from the depth image for 3D object detection. In [31], an RGBD based joint representation was learned by applying a 3D CNN to a 3D volume of the depth image and a 2D CNN to the RGB image and concatenating them together. Chen et al. [1] extracted 3D features by applying a 2D CNN on multi-view projections of the 3D point cloud and combining them with image features at ROI level. Xu et al. [38] used a similar approach but adopted a PointNet [25] to extract 3D features on raw points directly. In [36], the same representation was further extended to pixel level by fusing pixel features with point features. Liang et al. [17] first discretized the sparse LiDAR points into a dense bird's eye view (BEV) voxel representation, and applied a 2D CNN to extract BEV representations. The 2D image features are fused back to BEV space densely via continuous convolution [37] to interpolate the sparse correspondence. Compared with these methods, our approach uses domain-specific networks for 2D and 3D representation learning, and both features are fused back to 2D image space at multiple levels across the whole network instead of only once. As a result, we are able to learn more densely fused representations from the RGBD data.

3 Learning Joint 2D-3D Representations

We tackle the problem of depth completion from RGBD sensors. Existing approaches typically rely on either 2D or 3D representations to solve this task. In contrast, in this paper, we take advantage of both types of representations and design a simple yet effective architecture that fuses the information between these representations at multiple levels. In particular, we propose a new building block for neural networks that operates on RGBD data. It is composed of two branches that live in different metric spaces. In one branch we use traditional 2D convolutions to extract appearance features from dense pixels in 2D metric space. In the other branch, we use continuous convolutions [37] to capture geometric dependencies from sparse points in 3D metric space. Our approach can be seen as spreading features to both 2D and 3D metric spaces, learning appearance and geometric features in each metric space separately, and then fusing them together.

We build our depth completion networks simply by stacking the 2D-3D fuse blocks. This modular design has two benefits. First, the network is able to learn joint 2D and 3D representations which are fully fused at multiple levels (all blocks). Second, the network architecture is simple and convenient to modify for the desired trade-off of performance and efficiency.

The remainder of the section is organized as follows: we first introduce our 2D-3D fuse block. We then give an example of deploying the proposed block to build a neural network for depth completion. Finally, we provide training and inference details of our depth completion network.

3.1 2D-3D Fuse Block

We show a diagram of the proposed 2D-3D fuse block in Figure 2. The block takes as input a 2D feature map of shape $C \times H \times W$ and a set of $M$ 3D points of shape $M \times 3$. We assume that we are also given the projection matrix with which we can project the points from the 3D metric space to the 2D feature map. The output of the block is a 2D feature map with the same resolution as the input, which makes it straightforward to build a network by stacking the blocks for pixel-wise prediction tasks like depth completion. Inside the block, the architecture can be divided into two sub-networks: a multi-scale 2D convolution network and a 3D continuous convolution network. The input features are distributed to and processed in each sub-network, and their outputs are combined with a simple fusion layer.

Figure 3: Example receptive fields of the stride-1 2D convolution, the stride-2 2D convolution and continuous convolution. In 2D convolution, the neighbors are defined over image grids and are not necessarily close to each other in 3D space. The receptive field may cover both foreground and background objects. In the shown example convolution is performed at the red pixel. Green pixels are on the near car, and yellow pixels are on the distant car. In contrast, the neighbors in continuous convolution are based on the exact 3D geometric correlation.

Multi-scale 2D convolution net:

We use a 2D convolution network to extract appearance features. We denote a 2D convolutional layer as conv($k$, $s$, $c$), where $k$ represents the filter size, $s$ denotes the convolution stride, and $c$ denotes the number of output channels. We adopt a two-branch network structure in order to extract multi-scale features. The first branch has the same resolution as the input and we simply apply conv(3, 1, $C$), where $C$ is the number of input channels. The second branch consists of conv(3, 2, $C$), conv(3, 1, $C$) and upsample(2), where the first layer down-samples the feature map by a factor of 2, and the last layer up-samples the feature map back to the original resolution via bilinear interpolation. Batch normalization and ReLU non-linearity are used after each convolution. The outputs of both branches have the same shape as the input, and we combine them simply by element-wise summation.
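To make the shape bookkeeping of the two 2D branches concrete, the following is a minimal sketch in plain Python. It assumes a padding of 1 for every 3x3 convolution (the text does not state the padding), and the helper names `conv_out` and `multiscale_shapes` are illustrative, not taken from the paper.

```python
def conv_out(size, kernel, stride, padding=1):
    """Spatial output size of a convolution along one axis (padding=1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


def multiscale_shapes(height, width):
    """Return (full-resolution branch, half-resolution branch, up-sampled branch) sizes."""
    # Branch 1: a single 3x3, stride-1 convolution keeps the resolution.
    full = (conv_out(height, 3, 1), conv_out(width, 3, 1))
    # Branch 2: 3x3 stride-2 convolution, then 3x3 stride-1 convolution.
    half = (conv_out(height, 3, 2), conv_out(width, 3, 2))
    half = (conv_out(half[0], 3, 1), conv_out(half[1], 3, 1))
    # Bilinear up-sampling by 2 doubles each spatial dimension.
    up = (2 * half[0], 2 * half[1])
    return full, half, up


even = multiscale_shapes(96, 320)
odd = multiscale_shapes(5, 5)
print(even)  # both branches end at (96, 320), so they can be summed
print(odd)   # the up-sampled branch is (6, 6) while the input is (5, 5)
```

Under this padding assumption the two branches only agree in shape when the input height and width are even; for odd sizes an implementation would presumably interpolate to the exact target size instead of using a fixed factor of 2.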

Figure 4: Depth completion network based on 2D-3D fuse blocks. The 2D-3D fused network takes image and sparse depth as input and predicts dense depth output. The main part of the network is the stacking of 2D-3D fuse blocks. We also apply some convolution layers at the input and the output stage.

3D continuous convolution net:

We exploit continuous convolutions [37] directly on the 3D points to learn geometric features in 3D metric space. The key concept of continuous convolution is the same as traditional 2D convolution, in that the output feature of each point is a weighted sum of transformed features of its neighbors in a geometric space. The two differ in how they find neighbors and how they perform the weighted sum. For 2D convolution the data is grid-structured, so it is natural to use surrounding pixels as the neighbors of a center pixel. Moreover, each neighbor has its own weight, which is used to transform its feature before the summation. However, 3D points can be arbitrarily placed, and their neighbors are not as naturally defined as in grid data. In continuous convolution, we use the K-Nearest-Neighbors algorithm to find the $K$ neighbors of a point based on the Euclidean distance. We also parameterize the weighting function using a Multi-layer Perceptron (MLP). In practice, we use the following implementation of continuous convolution:

$$h_i = W \Big( \sum_{k} \mathrm{MLP}(x_i - x_k) \odot f_k \Big)$$

where $i$ is the index of points, $k$ is the index of neighbors, $x$ denotes the 3-dimensional location of points, $f$ and $h$ denote the features, $W$ is a weight matrix, and $\odot$ denotes element-wise product. Note that the output of the MLP has the same shape as $f_k$. This implementation can be regarded as a continuous version of separable convolution. The MLP and weighted sum perform depth-wise convolution, while the linear transformation resembles $1 \times 1$ convolution. We make this separation to reduce the memory and computation overhead.
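The separable form described here (per-channel weights produced by an MLP from the relative position of each neighbor, a sum over the K nearest neighbors, then a linear map) can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions that the text does not fix: neighbors are found by brute force, a point counts as its own nearest neighbor, the MLP sees the offset from neighbor to center, and all function and parameter names are hypothetical.

```python
import math


def knn_indices(points, i, k):
    """Indices of the k points closest to points[i] (brute force; includes i itself)."""
    return sorted(range(len(points)), key=lambda j: math.dist(points[i], points[j]))[:k]


def mlp(offset, w1, b1, w2, b2):
    """Two-layer perceptron: 3D offset -> hidden (ReLU) -> one weight per feature channel."""
    hidden = [max(0.0, sum(w * o for w, o in zip(row, offset)) + b) for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


def continuous_conv(points, feats, k, mlp_params, weight):
    """For each point: sum_k MLP(x_i - x_k) * f_k (element-wise), then a linear map."""
    out = []
    for i in range(len(points)):
        acc = [0.0] * len(feats[i])
        for j in knn_indices(points, i, k):
            offset = [a - b for a, b in zip(points[i], points[j])]
            channel_weights = mlp(offset, *mlp_params)       # same length as feats[j]
            acc = [a + w * f for a, w, f in zip(acc, channel_weights, feats[j])]
        out.append([sum(w * a for w, a in zip(row, acc)) for row in weight])
    return out


# Toy check: an MLP that always outputs 1 per channel and an identity linear map
# reduce the operation to a plain sum of neighbor features.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ones_mlp = ([[0.0] * 3, [0.0] * 3], [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], [1.0, 1.0])
identity = [[1.0, 0.0], [0.0, 1.0]]
out = continuous_conv(points, feats, k=2, mlp_params=ones_mlp, weight=identity)
print(out)
```

In the toy check the third point is far from the other two, yet with K=2 it still aggregates the feature of its closest neighbor; a real implementation would use a spatial index rather than the quadratic search shown here.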

In our block, we first query the feature of each 3D point by projecting the point to the 2D feature map and extracting the feature at the projected pixel. After this step, we get $M$ 3D points of shape $M \times 3$ along with point features of shape $M \times C$, where $C$ is the number of feature channels. We then apply two continuous convolutions to the point features. The weighting function is a two-layer MLP whose output dimension matches the point feature dimension $C$. Each continuous convolution is followed by batch normalization and ReLU non-linearity. We then project the 3D points back to an empty 2D feature map and assign the point features to the corresponding projected pixels. In this way, we obtain a sparse 2D feature map as the output of the 3D sub-network. The output has the same shape as the outputs of the 2D sub-network.
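The gather and scatter steps around the continuous convolutions can be illustrated with a small plain-Python sketch. It assumes a 3x4 projection matrix, nearest-pixel lookup, a feature map stored as `fmap[channel][row][column]`, and that later points overwrite earlier ones when two points fall on the same pixel; none of these details are specified in the text, and the function names are hypothetical.

```python
import math


def project(point, P):
    """Project a 3D point with a 3x4 matrix P; return (u, v) or None if behind the camera."""
    x, y, z = point
    h = [r[0] * x + r[1] * y + r[2] * z + r[3] for r in P]
    if h[2] <= 0:
        return None
    # Nearest-pixel rounding (an assumption; bilinear sampling would also be possible).
    return math.floor(h[0] / h[2] + 0.5), math.floor(h[1] / h[2] + 0.5)


def gather(points, fmap, P):
    """Read the feature vector at the projected pixel of every point that lands in the image."""
    height, width = len(fmap[0]), len(fmap[0][0])
    pixels, feats = [], []
    for p in points:
        uv = project(p, P)
        if uv is None or not (0 <= uv[0] < width and 0 <= uv[1] < height):
            continue  # drop points outside the image
        u, v = uv
        pixels.append((u, v))
        feats.append([channel[v][u] for channel in fmap])
    return pixels, feats


def scatter(pixels, feats, channels, height, width):
    """Write point features into an otherwise empty (all-zero) feature map."""
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for (u, v), f in zip(pixels, feats):
        for c in range(channels):
            out[c][v][u] = f[c]
    return out


P = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]            # toy pinhole camera, focal length 1
fmap = [[[0, 1, 2], [3, 4, 5]]]                            # 1 channel, 2 rows, 3 columns
points = [(2, 1, 1), (4, 0, 2), (0, 0, -1), (9, 0, 1)]     # last two are invalid / off-image
pixels, feats = gather(points, fmap, P)
sparse = scatter(pixels, [[50.0], [20.0]], channels=1, height=2, width=3)
print(pixels, feats, sparse)
```

In the block itself the features written back would be the outputs of the continuous convolutions rather than the placeholder values used in this toy call.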


Since the output feature maps of the 2D and 3D sub-networks have the same shape, we fuse them simply by element-wise summation. We then apply a 3x3 convolution layer with stride 1 to get the output of the 2D-3D fuse block. To facilitate training, we also add a shortcut connection from the input to the output when they have the same feature dimension.
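As a rough illustration of the fusion step, the sketch below operates on the channel vector of a single pixel. The convolution is passed in as a callable stand-in, and adding the shortcut after the convolution is an assumption; the text only says that a shortcut is used when the input and output feature dimensions agree. The name `fuse_pixel` is hypothetical.

```python
def fuse_pixel(x, out_2d, out_3d, fuse_conv):
    """Fuse the 2D and 3D branch outputs for one pixel and add a conditional shortcut."""
    # Element-wise sum of the two branch outputs (they share one shape).
    summed = [a + b for a, b in zip(out_2d, out_3d)]
    y = fuse_conv(summed)  # stand-in for the 3x3, stride-1 convolution
    # Shortcut only when input and output have the same feature dimension.
    if len(y) == len(x):
        y = [a + b for a, b in zip(y, x)]
    return y


same_dim = fuse_pixel([1.0, 2.0], [10.0, 20.0], [0.0, 5.0], lambda v: v)
new_dim = fuse_pixel([1.0, 2.0], [10.0, 20.0], [0.0, 5.0], lambda v: v + [0.0])
print(same_dim)  # shortcut applied
print(new_dim)   # channel count changed, so no shortcut
```

At pixels with no projected 3D point the sparse 3D branch contributes zeros, so the sum falls back to the 2D branch output alone.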

Figure 3 illustrates the receptive field of 2D convolution and continuous convolution. While 2D convolution operates on neighboring pixels on grid-structured image feature maps, continuous convolution finds neighbors based on distance in 3D geometric space. By fusing the outputs of the two branches, the learned representation captures correlations in both spaces. At object boundaries, where depth estimation is usually hard for 2D convolution based methods, our approach has the potential to capture non-smooth representations for more accurate shape reconstruction by leveraging the geometric features in 3D space. We will show in experiments that our model predicts sharper and clearer borders than other 2D representation methods.

3.2 Stack 2D-3D Fuse Blocks into a Network

Our 2D-3D fuse block can be used as a basic module to build the network. We simply stack a set of blocks plus a few convolution layers at the input and output stages to get our depth completion model. In Figure 4 we show the architecture of an example network with 2D-3D fuse blocks.

The inputs to the network include a depth image and an RGBD image. We first apply two convolution layers separately to each of the inputs. For the depth image, we use conv(3, 2, 16) and conv(3, 1, 16). For the RGBD image, we use conv(3, 2, 32) and conv(3, 1, 32). We then concatenate the two outputs and feed them to a stack of 2D-3D fuse blocks. The 3D points are obtained from the depth image and used by the blocks. We up-sample the output of the block set by 2 so that it has the same size as the input images. Finally, we apply another two convolution layers to obtain the output dense depth image. By stacking the blocks, the deep network is able to capture both large-scale context and local-scale clues, and the geometric and appearance features are fully fused in multiple levels.

3.3 Learning and Inference

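We use a weighted sum of $\ell_2$ loss and smooth $\ell_1$ loss, averaged over all image pixels that have depth labels, as our default objective function:

$$L(\hat{y}, y) = \ell_2(\hat{y}, y) + \gamma \, \mathrm{smooth}\,\ell_1(\hat{y}, y)$$

where $\gamma$ is the coefficient to control the balance between the two losses. The smooth $\ell_1$ loss of a pixel is defined as:

$$\mathrm{smooth}\,\ell_1(\hat{y}, y) = \begin{cases} 0.5\,(\hat{y} - y)^2 & \text{if } |\hat{y} - y| \le 1 \\ |\hat{y} - y| - 0.5 & \text{otherwise} \end{cases}$$

where $\hat{y}$ and $y$ are the predicted and ground truth depth, respectively.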
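A minimal sketch of such an objective in plain Python is given below. It assumes the widely used smooth L1 form with a threshold of 1, takes the L2 term to be the mean squared error, and represents pixels without a depth label as `None`; the balance coefficient is called `gamma` here purely for illustration.

```python
def smooth_l1(pred, target):
    """Common smooth L1 form: quadratic for small errors, linear for large ones."""
    d = abs(pred - target)
    return 0.5 * d * d if d <= 1.0 else d - 0.5


def completion_loss(pred, target, gamma=1.0):
    """L2 + gamma * smooth L1, averaged over pixels that have a depth label."""
    valid = [(p, t) for p, t in zip(pred, target) if t is not None]
    if not valid:
        raise ValueError("no labeled pixels")
    l2 = sum((p - t) ** 2 for p, t in valid) / len(valid)
    s1 = sum(smooth_l1(p, t) for p, t in valid) / len(valid)
    return l2 + gamma * s1


pred = [1.0, 3.0, 5.0]
target = [1.5, None, 3.0]       # the middle pixel has no label and is ignored
loss = completion_loss(pred, target)
print(loss)
```

Because the L2 term grows quadratically while the smooth L1 term grows linearly for large errors, the coefficient effectively controls how strongly large errors dominate the gradient.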

Note that some other approaches use multi-task objective functions which leverage other tasks such as semantic segmentation to improve depth completion. Although we expect further performance gain with the multi-task objective function, we opt for the single task loss as the objective function is orthogonal to this work. During both training and inference, we pre-compute the indexes of nearest neighbors for all 3D points for continuous convolution, and apply the network to RGBD data and get the predicted results. No post-processing is required.

4 Experimental Evaluation

We conduct extensive experiments on KITTI depth completion benchmark [33] to validate the effectiveness of our approach. Specifically, we compare with other depth completion methods on the test set by submitting to the KITTI evaluation server and show that our approach surpasses all previous state-of-the-art methods. We also conduct extensive ablation studies on the validation set to compare and analyze different model variants. Lastly, we provide some qualitative results of our approach.

4.1 Experimental Setting


The KITTI depth completion benchmark [33] provides separate sets of frames for training, validation and testing. Each frame has one sweep of LiDAR scan and an RGB image from the camera. The LiDAR and camera are already calibrated, with a known transformation matrix. For each frame, a sparse depth image is generated by projecting the 3D LiDAR point cloud to the image. The ground-truth for depth completion is represented as a dense depth image, which is generated by accumulating multiple sweeps of LiDAR scans and projecting them to the image. Note that depth outliers that are inconsistent with the stereo disparity label [12] (caused by occlusion, dynamic objects or measurement artifacts) are removed from the ground-truth by ignoring the corresponding pixels during training and evaluation. We use both the RGB image and the sparse depth image as the input to our model.

Evaluation metrics:

Four metrics are reported by the KITTI depth completion benchmark, which are Root Mean Square Error and Mean Absolute Error on depth (RMSE, MAE) and inverse depth (iRMSE, iMAE) respectively. We mainly focus on RMSE among all these metrics when comparing to other methods as it measures the error directly on depth and penalizes more on larger errors. The KITTI leaderboard also ranks methods based on RMSE. Additionally, we conduct an ablation study where we optimize the model with different objective functions and show that trade-off in different metrics can be controlled by different objective functions. Finding the best objective function for depth completion is out of the scope of this paper and we leave that for future work.
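For reference, the four metrics can be computed as in the following plain-Python sketch. It assumes depths are given in meters, reports depth errors in millimeters and inverse-depth errors in 1/km (the units used in the result tables), and skips pixels without ground truth; the function name `depth_metrics` is illustrative and this is not the official evaluation code.

```python
import math


def depth_metrics(pred_m, gt_m):
    """Return RMSE (mm), MAE (mm), iRMSE (1/km), iMAE (1/km) over labeled pixels."""
    pairs = [(p, g) for p, g in zip(pred_m, gt_m) if g is not None and g > 0]
    n = len(pairs)
    err_mm = [(p - g) * 1000.0 for p, g in pairs]            # meters -> millimeters
    inv_err = [1000.0 / p - 1000.0 / g for p, g in pairs]    # 1/m -> 1/km
    return {
        "RMSE": math.sqrt(sum(e * e for e in err_mm) / n),
        "MAE": sum(abs(e) for e in err_mm) / n,
        "iRMSE": math.sqrt(sum(e * e for e in inv_err) / n),
        "iMAE": sum(abs(e) for e in inv_err) / n,
    }


metrics = depth_metrics([10.0, 20.0, 7.0], [8.0, 25.0, None])
print(metrics)
```

In this toy example the 5 m error at 25 m dominates RMSE, while the 2 m error at 8 m dominates the inverse-depth metrics, which shows why the depth and inverse-depth metrics can rank methods differently.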

Implementation details:

All images in the KITTI validation and test sets are already cropped to a uniform size, while the training images are not. Therefore we randomly crop the training images (RGB, sparse depth and dense depth) to a fixed size during training. Thanks to the modular design of the proposed model, we can create different variants by changing the width (number of feature channels $C$) and depth (number of blocks $N$) of the network. For all model variants we initialize the network weights randomly, and train on 16 GPUs with a batch size of 32 frames. The training schedule is as follows. We first train the model with $\ell_2$ loss for 100 epochs, with an initial learning rate of 0.0016, which is decayed by a factor of 0.1 at epochs 65, 80, 85 and 90. We then fine-tune the model with the sum of $\ell_2$ and smooth $\ell_1$ loss for 50 epochs, with an initial learning rate of 0.00016, which is decayed by a factor of 0.1 at epoch 30. In the 3D continuous convolution branch of the 2D-3D fuse block, we randomly sample 10,000 points and apply a K-D tree to calculate the indices of the 9 nearest neighbors and their relative distances for each point in advance.
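The step schedule described in this paragraph can be written as a small function. The sketch assumes zero-based epoch indices and that a decay takes effect from the listed epoch onward; the text does not say which convention was used, and the function name is illustrative.

```python
def learning_rate(epoch, base_lr=0.0016, milestones=(65, 80, 85, 90), factor=0.1):
    """Step decay: multiply the base rate by `factor` once per milestone already reached."""
    passed = sum(epoch >= m for m in milestones)
    return base_lr * factor ** passed


# First stage: 100 epochs starting from 0.0016.
stage1 = [learning_rate(e) for e in (0, 64, 65, 80, 99)]
# Fine-tuning stage: 50 epochs starting from 0.00016 with a single decay at epoch 30.
stage2 = [learning_rate(e, base_lr=0.00016, milestones=(30,)) for e in (0, 29, 30, 49)]
print(stage1)
print(stage2)
```

With four milestones the first stage ends at a rate four orders of magnitude below its starting value, so the last ten epochs act mostly as a fine-grained settling phase.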

4.2 Comparison with State-of-the-art

Method               RMSE (mm)   MAE (mm)   iRMSE (1/km)   iMAE (1/km)
SparseConvs [33]       1601.33     481.27           4.94          1.78
NN+CNN [33]            1419.75     416.14           3.25          1.29
MorphNet [4]           1045.45     310.49           3.84          1.57
CSPN [2]               1019.64     279.46           2.93          1.15
Spade-RGBsD [13]        917.64     234.81           2.17          0.95
NConv-CNN-L1 [7]        859.22     207.77           2.52          0.92
DDP [39] †              832.94     203.96           2.10          0.85
NConv-CNN-L2 [7]        829.98     233.26           2.60          1.03
Sparse2Dense [21]       814.73     249.95           2.80          1.21
DeepLiDAR [26] †        775.52     245.28           2.79          1.25
FusionNet [34] †        772.87     215.02           2.19          0.93
Our FuseNet             752.88     221.19           2.34          1.14
Table 1: Comparison with state-of-the-art methods on the test set of KITTI depth completion benchmark, ranked by RMSE. † marks the models that the text identifies as trained with additional data and labels.

We evaluate our best single model on the KITTI test set, which has $N = 12$ blocks stacked sequentially in the network, each with $C = 64$ feature channels. We show the comparison results with other state-of-the-art methods on the KITTI depth completion benchmark in Table 1. For a fair comparison, we mark methods that use external training data and labels in addition to KITTI training data. For example, DDP [39] exploits the Virtual KITTI dataset [8] to learn the conditional prior of dense depth given an image. DeepLiDAR [26] pre-trains the model on a synthetic dataset generated from the CARLA simulator [5] to jointly learn the dense depth and surface normal tasks. FusionNet [34] uses a semantic segmentation network pre-trained on the Cityscapes dataset [3]. These methods rely on more data and various types of labels to learn good representations for depth completion. In contrast, our model, which is trained on KITTI training data only, outperforms all these methods considerably. This shows the superiority of the proposed model in learning joint 2D-3D representations from RGBD data over other methods. Specifically, our model surpasses the second-best method with/without external data by 20/62 mm in RMSE respectively. We also achieve state-of-the-art results in the other three metrics among methods that are trained on KITTI data only.
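The 20/62 mm margins can be checked directly against the RMSE column of Table 1. The short sketch below hard-codes the test-set RMSE values of the closest competitors and treats DDP, DeepLiDAR and FusionNet as the methods using external data, as named in the text.

```python
# Test-set RMSE in mm, copied from Table 1; the flag marks use of external data or labels.
rmse = {
    "NConv-CNN-L2": (829.98, False),
    "Sparse2Dense": (814.73, False),
    "DDP": (832.94, True),
    "DeepLiDAR": (775.52, True),
    "FusionNet": (772.87, True),
}
ours = 752.88

best_external = min(v for v, external in rmse.values() if external)
best_kitti_only = min(v for v, external in rmse.values() if not external)
margin_external = best_external - ours
margin_kitti_only = best_kitti_only - ours
print(round(margin_external, 2), round(margin_kitti_only, 2))
```

The two differences round to the 20 mm and 62 mm quoted above, with FusionNet and Sparse2Dense as the respective runners-up.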

4.3 Ablation Studies

We conduct extensive ablation studies on the validation set of KITTI depth completion benchmark to justify the micro and macro design choices in the proposed model. We first compare different variants of the 2D-3D fuse block and then analyze the effect of different network configurations and objective functions. For faster experimentation, we conduct ablation studies on different network configurations with 100 training epochs only.

Receptive field of the continuous convolution branch:

The proposed 2D-3D fuse block is composed of three branches: one 2D convolution branch, another 2D convolution branch with stride 2, and one 3D continuous convolution branch. Since we vary the receptive field of the 2D convolution by explicitly enumerating two different scales (stride 1 and stride 2), a natural question is how to choose the receptive field of the 3D continuous convolution branch, which is controlled by the number of nearest neighbors. We show the ablation results in Table 2, where we can see that the model is quite robust to this hyper-parameter. In practice, we use $K = 9$ nearest neighbors.

Architecture of the 2D-3D fuse block:

We compare different architecture designs of the 2D-3D fuse block in Table 3. In particular, we want to know how much each convolution branch (the stride 1 and stride 2 2D convolutions and the continuous convolution) contributes to the final performance. As shown in Table 3, multi-scale 2D convolution and continuous convolution are complementary. We rely on stride 1 convolution to extract the local features and continuous convolution to get 3D geometric features. Also, we need stride 2 convolution to extract better global features and propagate the sparse 3D geometric feature to a larger field. The results indicate that these three components are all necessary to the design of the 2D-3D fuse block for depth completion.

K nearest neighbors     3     6     9    12    15
RMSE (mm)             813   810   810   816   812
Table 2: Ablation study on number of nearest neighbors in the continuous convolution branch.

Table 3: Ablation study on the architecture of the 2D-3D fuse block. Columns: stride 1 conv, stride 2 conv, continuous conv, RMSE (mm).

Objective                      RMSE   MAE   iRMSE   iMAE
$\ell_2$                        790   232    2.51   1.16
smooth $\ell_1$                 839   197    2.23   0.91
$\ell_2$ + smooth $\ell_1$      785   217    2.36   1.08
Table 4: Ablation study on objective function.

Network configuration:

We compare different network configurations by varying the width (number of feature channels $C$) and depth (number of blocks $N$) of the network. As a result, we are able to achieve different trade-offs between performance and model size. We plot the results in comparison with other methods in Figure 5, where we show that our model achieves better performance with a smaller model size compared with other methods.

Objective function:

We note that performance on different metrics can be controlled by employing different loss functions. Intuitively, a better RMSE could be achieved by $\ell_2$ loss, while a better MAE could be achieved by $\ell_1$ loss. We validate this by comparing models trained with $\ell_2$ loss and smooth $\ell_1$ loss respectively for 100 epochs. The results are shown in Table 4. To get a better balance on all four metrics, our best single model is trained with $\ell_2$ loss for 100 epochs first and then trained with the sum of $\ell_2$ and smooth $\ell_1$ loss for another 50 epochs.

Method                   Model size   RMSE
[A] Sparse2Dense [21]          5540    857
[B] Spade-RGBsD [13]           5300    917
[C] NConv-CNN-L2 [7]            355    872
[D] FusionNet [34]             2091    811
FuseNet-C32-N6                  322    830
FuseNet-C32-N9                  445    810
FuseNet-C32-N12                 568    803
FuseNet-C32-N15                 692    799
FuseNet-C64-N12                1898    785
Figure 5: Trade-off between accuracy and model size by varying feature channel number and block number of the network.
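The C32 variants in Figure 5 make it possible to estimate how much each additional block adds to the model size. The sketch below uses only the numbers listed for the figure; the unit of the model-size column is not stated in the extracted text, so the values are treated as opaque size units, and the split into a per-block cost and a fixed remainder is a simple linear reading rather than a reported result.

```python
# (number of blocks N, model size) for the C32 variants, read from Figure 5.
c32 = [(6, 322), (9, 445), (12, 568), (15, 692)]

# Size added per block between consecutive variants.
per_block = [(s1 - s0) / (n1 - n0) for (n0, s0), (n1, s1) in zip(c32, c32[1:])]

# Remainder attributable to the input and output layers under a linear model.
fixed = c32[0][1] - c32[0][0] * per_block[0]
print(per_block, fixed)
```

Each block adds roughly 41 size units at 32 channels, so the growth in model size is close to linear in the number of blocks, while the RMSE gains shrink from 20 to 7 to 4 across the same steps.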
Figure 6: Qualitative results in comparison with two state-of-the-art methods (better viewed in color). Panels: Sparse2Dense [21], NConv-CNN-L2 [7], Ours. Our model produces sharper boundaries of objects especially in the long range.

4.4 Qualitative Results

We show some qualitative results of the proposed method in comparison with two state-of-the-art methods, NConv-CNN [7] and Sparse2Dense [21], on the test set of KITTI depth completion benchmark. As shown in Figure 6, due to the use of continuous convolution that captures accurate 3D geometric features, our approach produces cleaner and sharper object boundaries in both near and distant regions. We get significantly better results for distant objects, which 2D convolution can barely handle due to limited appearance clues. This suggests that in the task of depth completion, the description of the scale-invariant geometric feature in 3D is very important, and the proposed 2D-3D fuse block provides a simple yet effective solution to learn joint 2D and 3D representations.

5 Conclusion

In this paper, we have proposed a simple yet effective architecture that fuses information between 2D and 3D representations at multiple levels. We have demonstrated the effectiveness of our approach on the challenging KITTI depth completion benchmark and shown that it outperforms the state-of-the-art. In the future, we plan to extend our approach to fuse other sensors and reason about video sequences.


  • [1] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In CVPR, 2017.
  • [2] Xinjing Cheng, Peng Wang, and Ruigang Yang. Depth estimation via affinity learned with convolutional spatial propagation network. In ECCV, 2018.
  • [3] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [4] Martin Dimitrievski, Peter Veelaert, and Wilfried Philips. Learning morphological operators for depth completion. In Advanced Concepts for Intelligent Vision Systems, 2018.
  • [5] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In CoRL, 2017.
  • [6] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
  • [7] Abdelrahman Eldesokey, Michael Felsberg, and Fahad Shahbaz Khan. Confidence propagation through cnns for guided sparse depth regression. arXiv preprint arXiv:1811.01791, 2018.
  • [8] Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.
  • [9] Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In ECCV, 2016.
  • [10] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.
  • [11] Simon Hawe, Martin Kleinsteuber, and Klaus Diepold. Dense disparity maps from sparse disparity measurements. In ICCV, 2011.
  • [12] Heiko Hirschmuller. Stereo processing by semiglobal matching and mutual information. TPAMI, 2008.
  • [13] Maximilian Jaritz, Raoul de Charette, Emilie Wirbel, Xavier Perrotton, and Fawzi Nashashibi. Sparse and dense data with cnns: Depth completion and semantic segmentation. In 3DV, 2018.
  • [14] Kevin Karsch, Ce Liu, and Sing Bing Kang. Depth transfer: Depth extraction from video using non-parametric sampling. TPAMI, 2014.
  • [15] Janusz Konrad, Meng Wang, and Prakash Ishwar. 2d-to-3d image conversion by learning depth from examples. In CVPRW, 2012.
  • [16] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In 3DV, 2016.
  • [17] Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep continuous fusion for multi-sensor 3d object detection. In ECCV, 2018.
  • [18] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields. TPAMI, 2016.
  • [19] Lee-Kang Liu, Stanley H Chan, and Truong Q Nguyen. Depth reconstruction from sparse samples: Representation, algorithm, and sampling. TIP, 2015.
  • [20] Miaomiao Liu, Mathieu Salzmann, and Xuming He. Discrete-continuous depth estimation from a single image. In CVPR, 2014.
  • [21] Fangchang Ma, Guilherme Venturelli Cavalheiro, and Sertac Karaman. Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera. In ICRA, 2019.
  • [22] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In CVPR, 2018.
  • [23] Fangchang Ma and Sertac Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In ICRA, 2018.
  • [24] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In 2011 IEEE International Symposium on Mixed and Augmented Reality, 2011.
  • [25] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
  • [26] Jiaxiong Qiu, Zhaopeng Cui, Yinda Zhang, Xingdi Zhang, Shuaicheng Liu, Bing Zeng, and Marc Pollefeys. Deeplidar: Deep surface normal guided depth prediction for outdoor scene from sparse lidar data and single color image. In CVPR, 2019.
  • [27] Anirban Roy and Sinisa Todorovic. Monocular depth estimation using neural regression forest. In CVPR, 2016.
  • [28] Ashutosh Saxena, Sung H Chung, and Andrew Y Ng. Learning depth from single monocular images. In NIPS, 2006.
  • [29] Nick Schneider, Lukas Schneider, Peter Pinggera, Uwe Franke, Marc Pollefeys, and Christoph Stiller. Semantically guided depth upsampling. In German Conference on Pattern Recognition, 2016.
  • [30] Shuran Song and Jianxiong Xiao. Sliding shapes for 3d object detection in depth images. In ECCV, 2014.
  • [31] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In CVPR, 2016.
  • [32] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.
  • [33] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant cnns. In 3DV, 2017.
  • [34] Wouter Van Gansbeke, Davy Neven, Bert De Brabandere, and Luc Van Gool. Sparse and noisy lidar completion with rgb guidance and uncertainty. In International Conference on Machine Vision Applications (MVA), 2019.
  • [35] Chaoyang Wang, José Miguel Buenaposada, Rui Zhu, and Simon Lucey. Learning depth from monocular videos using direct methods. In CVPR, 2018.
  • [36] Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martín-Martín, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. In CVPR, 2019.
  • [37] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In CVPR, 2018.
  • [38] Danfei Xu, Dragomir Anguelov, and Ashesh Jain. Pointfusion: Deep sensor fusion for 3d bounding box estimation. In CVPR, 2018.
  • [39] Yanchao Yang, Alex Wong, and Stefano Soatto. Dense depth posterior (ddp) from single image and sparse range. In CVPR, 2019.
  • [40] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.