Deep Depth Completion: A Survey

Depth completion aims at predicting dense pixel-wise depth from a sparse map captured from a depth sensor. It plays an essential role in various applications such as autonomous driving, 3D reconstruction, augmented reality, and robot navigation. Recent successes on the task have been demonstrated and dominated by deep learning based solutions. In this article, for the first time, we provide a comprehensive literature review that helps readers better grasp the research trends and clearly understand the current advances. We investigate the related studies from the design aspects of network architectures, loss functions, benchmark datasets, and learning strategies with a proposal of a novel taxonomy that categorizes existing methods. Besides, we present a quantitative comparison of model performance on two widely used benchmark datasets, including an indoor and an outdoor dataset. Finally, we discuss the challenges of prior works and provide readers with some insights for future research directions.




1 Introduction

Acquiring correct pixel-wise scene depth plays a substantial role in various tasks such as scene understanding [max-S-and-D], autonomous driving [song2021self], robotic navigation [ma2019sparse, aerial_depth], simultaneous localization and mapping [XiyueGuo2021], intelligent farming [farkhani2019sparse], and augmented reality [Du2020DepthLab]. Thus, it has been a long-term goal studied in past decades. One cost-effective way of obtaining scene depth is to directly estimate it from a single image with monocular depth estimation algorithms [Godard2017UnsupervisedMD, Laina2016DeeperDP, Hu2019RevisitingSI, Fu2018DeepOR]. However, visual methods often yield low inference accuracy and poor generalizability and are thus ill-suited to real-world deployment.

On the other hand, depth sensors provide accurate and robust distance measurements with true scene scales. Therefore, they are better suited to applications that require a safety guarantee and high performance [S-d-selfsuper, song2021self, fu2019lidar], e.g., self-driving cars. In fact, measuring depths with LiDARs is probably still the most deployable way to obtain reliable depth in industrial applications. However, neither LiDAR nor commonly used RGBD cameras, like Microsoft Kinect, can provide a dense pixel-wise depth map. As shown in Fig. 1, the depth map captured by Kinect has small holes and the map captured by LiDAR is significantly sparser. It is, therefore, necessary to fill the void pixels in practice.

Since there is a clear difference between depth maps captured by Kinect and LiDAR, following [eldesokey2019confidence, huang2019hms], we distinguish the enhancement and completion tasks for Kinect and LiDAR data as follows:

  1. Depth enhancement: Also referred to as depth hole-filling, it aims at filling the small, irregular, and infrequent holes in a dense raw depth map. A typical application is the enhancement of Kinect depth maps.

  2. Depth completion: Aims at recovering a dense depth map from a highly sparse input depth map, usually dealing with LiDAR data. Intuitively, depth completion is more challenging than depth enhancement due to the extreme sparsity of inputs.

Fig. 1: Comparison of depth maps captured by different sensors. The raw sparse depth maps are shown in the middle. The left one is captured by a Kinect in an indoor scenario, and the right one is captured by a LiDAR in an outdoor street. Clearly, the map captured by LiDAR is significantly sparser. The bottom row shows the completed depth map from the raw sparse map.

Fig. 2: A timeline for deep learning based depth completion methods. We show some selected works to visualize the evolution process. Unguided methods: SI-CNN [SICNN], ADNN [chodosh2018deep], HMS-Net [huang2019hms], Ncon-CNN [eldesokey2019confidence], IR L2 [from_depth_what], pNCNN [eldesokey2020uncertainty]. RGB guided methods: 1) Early fusion models: S2D [S-D-single-image], SS-S2D [S-d-selfsuper], 3coeff [depth_coefficient], S2DNet [hambarde2020s2dnet], Qu et al. [qu2020depth], Long et al. [long2021depth]. 2) Late fusion models: Spade-RGBD [max-S-and-D], DDP [ddp], DfineNet [zhang2019dfinenet], GuideNet [learning-guided], VOICED [wong2020unsupervised], MSG-CHN [cascade], KBNet [wong2021unsupervised], RigNet [yan2021rignet], ScaffFusion [wong2021scaffnet]. 3) Explicit 3D representation models: PwP [xu2019depth], DeepLidar [DeepLiDAR], 2D-3D fuseNet [2d-3d], ABCD [jeon2021abcd], ACMNet [zhao2021adaptive], Du et al. [Du2022DepthCU]. 4) SPN-based models: CSPN [cspn], NLSPN [nlspn], CSPN++ [cspn++], PENet [penet], DySPN [Lin2022DynamicSP]. 5) Residual depth models: FCFR-Net [fcfr], KernelNet [liu2021learning], DenseLidar [gu2021denselidar], Zhu et al. [zhu2021robust].

In recent years, deep learning based methods have shown compelling performance on the task and now drive its development. Prior works have shown that a network with several convolutional layers [SICNN], or a simple auto-encoder [Lu2022DepthCA], can complete missing depths. Moreover, depth completion can be further improved by leveraging RGB information. A typical method of this type [max-S-and-D, shivakumar2019dfusenet] uses dual encoders to extract features from a sparse depth map and its corresponding RGB image, respectively, and later fuses them with a decoder.

To push the envelope of depth completion, recent approaches tend to use complicated network structures and complex learning strategies. In addition to the multiple branches used for feature extraction from multi-modality data, e.g., images and sparse depths, researchers have begun to integrate surface normals, affinity matrices [cspn], residual depth maps [gu2021denselidar], etc., into their frameworks. Besides, to cope with the lack of supervised pixels, some works exploit multi-view geometric constraints [S-d-selfsuper] and adversarial regularization [khan2021sparse]. These efforts have greatly facilitated progress in the depth completion task.

Despite the tremendous progress made by learning based approaches, to the best of our knowledge, a comprehensive survey is lacking. This article aims to depict the development of learning based depth completion by hierarchically analyzing and categorizing existing methods, and to provide readers with a straightforward understanding of deep depth completion along with some practical guidance. Specifically, we hope to answer the following questions:

  1. What are the common characteristics of previous methods for achieving highly accurate depth completion?

  2. What are the pros and cons of RGB guided approaches compared to unguided methods?

  3. Since most previous works employed both visual and LiDAR data, what are the most effective strategies for multi-modal data fusion?

  4. What are the current challenges?

With the above questions in mind, we survey the related works from January 2017 to May 2022 (the time of writing). Fig. 2 visualizes the timeline of the selected methods based on the proposed taxonomy, where the bottom and the top show the unguided and the five types of RGB guided methods, respectively. Although early studies tackled depth completion in an unguided fashion, studies published after 2020 have been gradually dominated by RGB guided methods. In this article, we investigate the previous studies from the aspects of network structure, loss function, learning strategy, and benchmark datasets. We especially stress methods that propose novel algorithms or deliver significant performance boosts, and provide visual descriptions of their technical contributions where this aids understanding. Furthermore, we provide quantitative comparisons of existing methods, together with their essential characteristics, on the most popular benchmark datasets. Through the in-depth analysis of previous studies, we hope the reader can gain a clear understanding of deep depth completion.

| Main categories | Sub-categories | Major characteristics |
|---|---|---|
| Unguided methods (Sec. 3) | Sparsity-aware CNNs (SACNN, Sec. 3.1) | Using the binary validity mask to indicate missing elements during convolution. |
| | Normalized CNNs (NCNN, Sec. 3.2) | 1) Built on normalized convolution. 2) Replacing the validity mask with a continuous confidence mask. |
| | Training with Auxiliary Images (TwAI, Sec. 3.3) | Integrating image reconstruction into latent or output space to encourage learning semantic cues. Image guided training and unguided inference are employed. |
| RGB guided methods (Sec. 4) | Early fusion models (EFM, Sec. 4.1): Encoder-decoder networks (EDN, Sec. 4.1.1); Coarse to refinement prediction (C2RP, Sec. 4.1.2) | Directly aggregating the image and sparse depth map input or fusing the multi-modality features at the first convolutional layer. |
| | Late fusion models (LFM, Sec. 4.2): Dual-encoder networks (DEN, Sec. 4.2.1); Double encoder-decoder networks (DEDN, Sec. 4.2.2); Global and Local Depth Prediction (GLDP, Sec. 4.2.3) | The framework usually consists of dual encoders or two sub-networks; one is used for extracting RGB features and the other for extracting depth features. Fusion is conducted at the intermediate layers, e.g., fusing extracted features from encoders. |
| | Explicit 3D representation models (E3DR, Sec. 4.3): 3D-aware convolution (3DAC, Sec. 4.3.1); Intermediate surface normal representation (ISNR, Sec. 4.3.2); Learning from point clouds (LfPC, Sec. 4.3.3) | Explicitly learning 3D representations, such as applying 3D convolutions, embedding surface normals, and learning from 3D point clouds. |
| | Residual depth models (RDM, Sec. 4.4) | Learning a coarse depth map and a residual depth map. Their combination generates the final depth map. |
| | SPN-based models (SPM, Sec. 4.5) | 1) Based on the spatial propagation network. 2) First learning the affinity matrix, and then applying affinity based depth refinement. |
TABLE I: A brief overview of the proposed taxonomy.

In summary, our key contributions are as follows:

  • To the best of our knowledge, this is the first survey for depth completion. We give an in-depth and comprehensive review, including both unguided and RGB guided methods.

  • We propose a novel taxonomy to categorize previous methods and visualize their main characteristics, including network structures, loss functions, and learning strategies.

  • The article covers the most advanced and recent progress of deep learning based depth completion with performance comparison on benchmark datasets. It gives readers an overview of state-of-the-art methods.

  • We provide several open issues and promising future research directions.

The remainder of this article is organized as follows: Section 2 gives the formulation of deep learning based depth completion and provides the proposed taxonomy. Section 3 reviews unguided methods, and Section 4 elaborates RGB guided methods. Section 5 introduces the loss functions employed in previous approaches. Section 6 lists the benchmark datasets and introduces the evaluation metrics for the depth completion task. Section 7 comprehensively compares the previous methods from different perspectives. Section 8 summarizes the open challenges and provides directions for future research. Section 9 gives the conclusion.

2 Deep Learning Based Depth Completion

In this section, we first give a common formulation of the depth completion task. Then, we outline the proposed taxonomy. Noting that some methods share common characteristics, we group them by jointly considering network structures and main technical contributions.

2.1 Problem Formulation

In depth completion, a deep neural network $\mathcal{N}$ with parameters $W$ predicts a dense depth map $\hat{Y}$ of a given sparse depth map $Y'$ by

$$\hat{Y} = \mathcal{N}(Y'; W). \tag{1}$$

Unguided depth completion: In (1), depth completion is performed using only the sparse input without guidance from different modality data. Therefore, it is called unguided depth completion. These methods are reviewed in detail in Section 3.

RGB guided depth completion: In many works, both the sparse depth map and its corresponding RGB image are utilized as inputs. In this case, the task is formulated by

$$\hat{Y} = \mathcal{N}(Y', I; W), \tag{2}$$

where $I$ denotes the RGB image whose pixels are aligned with $Y'$. The task formulated by (2) is referred to as RGB guided depth completion, which is elucidated in Section 4.

The parameters of the network are optimized by solving

$$\hat{W} = \arg\min_{W} \mathcal{L}(\hat{Y}, Y; W), \tag{3}$$

where $Y$ denotes the set of ground truth depth maps, and $\mathcal{L}$ is a loss function which is usually defined to penalize the pixel-wise discrepancy between the prediction and the ground truth on the valid pixels; it is minimized through back-propagation while training $\mathcal{N}$. Depending on the specific learning strategies, some other losses, such as unsupervised photometric loss, adversarial loss, and regularization terms on depth maps, are also applied. An in-depth discussion of learning objectives and loss functions is given in Section 5.
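To make the phrase "on the valid pixels" concrete, the following is a minimal sketch of a masked L2 loss. It assumes, purely for illustration, that a ground-truth value of 0 marks a pixel without supervision; the function name `masked_l2_loss` is hypothetical and not taken from any surveyed method.

```python
import numpy as np

def masked_l2_loss(pred, gt):
    """Mean squared error computed only where ground truth exists."""
    # Assumption: gt == 0 encodes "no ground-truth depth at this pixel".
    valid = gt > 0
    if not valid.any():
        return 0.0
    diff = pred[valid] - gt[valid]
    return float(np.mean(diff ** 2))

# Toy 2x2 example with two supervised pixels.
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
gt = np.array([[1.5, 0.0], [0.0, 4.0]])
loss = masked_l2_loss(pred, gt)
```

Pixels without ground truth contribute nothing to the loss, so the prediction there is constrained only by what the network generalizes from supervised pixels or by additional loss terms.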

2.2 Taxonomy

In this article, we propose a detailed taxonomy by jointly considering network structures and main technical contributions. An existing method is firstly categorized into either an unguided method or an RGB guided approach. Then, it is further classified into a more specific sub-category. Table I gives an overview of the proposed taxonomy with descriptions of the major factors for identifying categories.

As seen, unguided methods have three sub-categories, including methods 1) employing sparsity-aware CNNs, 2) employing normalized CNNs, and 3) training with auxiliary images. Guided methods include five sub-categories, some of which are divided into more concrete classes. For the first and second categories, i.e., early fusion and late fusion models, the fusion strategy is the main factor considered in our taxonomy. For the latter three categories, i.e., explicit 3D representation models, residual depth models, and spatial propagation network (SPN) based models, the fusion strategy is not the major factor in identifying their types, since they hold distinct characteristics and both early fusion and late fusion are used in previous methods.

3 Unguided Depth Completion

Given a sparse depth map, unguided methods aim at directly completing it with a deep neural network model. Previous methods can be generally categorized into three groups: methods using 1) sparsity-aware CNN, 2) normalized CNN, and 3) training with auxiliary images.

3.1 Sparsity-Aware CNNs

Uhrig et al. [SICNN] proposed the first deep learning based unguided method. They first verified that normal convolutions are not able to handle sparse input and proposed a new sparse convolution operation. Then, they introduced a 6-layer CNN assembled with the proposed sparse convolution. The sparse convolution uses a binary validity mask to distinguish between valid and missing values and performs convolution among only valid data. The value of the validity mask is determined by its local neighbors via max-pooling. This first deep learning based method outperforms non-learning methods and shows the potential of deep learning on the task. Moreover, it inspired many subsequent studies.
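The mask-normalized convolution and the max-pooled mask update described above can be sketched as follows. This is a single-channel, loop-based illustration under our own assumptions (zero padding, odd kernel size, a small `eps` to avoid division by zero); the function name `sparse_conv2d` is hypothetical and the code is not the authors' implementation.

```python
import numpy as np

def sparse_conv2d(x, mask, weight, bias=0.0, eps=1e-8):
    """Sparsity-aware convolution on one channel with 'same' zero padding.

    x, mask: arrays of shape (H, W); mask is 1 for valid pixels, 0 otherwise.
    weight:  array of shape (k, k) with odd k.
    """
    k = weight.shape[0]
    p = k // 2
    xp = np.pad(x * mask, p)      # missing values are zeroed out
    mp = np.pad(mask, p)
    H, W = x.shape
    out = np.zeros((H, W))
    new_mask = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            xw = xp[i:i + k, j:j + k]
            mw = mp[i:i + k, j:j + k]
            # Normalize by the number of valid pixels in the window.
            out[i, j] = (xw * weight).sum() / (mw.sum() + eps) + bias
            # Mask update via max-pooling: valid if any neighbor is valid.
            new_mask[i, j] = mw.max()
    return out, new_mask

# Two valid measurements of depth 5 in an otherwise empty 5x5 map.
x = np.zeros((5, 5))
mask = np.zeros((5, 5))
x[0, 0] = x[2, 2] = 5.0
mask[0, 0] = mask[2, 2] = 1.0
out, new_mask = sparse_conv2d(x, mask, np.ones((3, 3)))
```

With an all-ones kernel, every output pixel that sees at least one valid neighbor recovers the value 5 regardless of how many neighbors are valid, which is the invariance to sparsity that the operation is designed for.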

However, the sparse convolution cannot be directly applied to classical encoder-decoder networks, which are able to fully leverage multi-scale features. Huang et al. [huang2019hms] introduced three sparsity invariant (SI) operations, including SI upsampling, SI average, and SI concatenation, and built the encoder-decoder based HMS-Net. They also demonstrated an application using RGB inputs by adding a small branch to HMS-Net.

Chodosh et al. [chodosh2018deep] formulated depth completion as a multi-layer convolutional compressed sensing problem and proposed an end-to-end multi-layer dictionary learning algorithm. It is achieved by applying compressed sensing to the deep component analysis (DeepCA) objective [murdock2018deep] and optimizing with ADMM (alternating direction method of multipliers). The over-complete dictionaries are learned with a few convolutional layers via back-propagation.

3.2 Normalized CNNs

The sparsity-aware methods require validity masks to identify missing values for performing convolutions. As argued in [Eldesokey2018PropagatingCT, max-S-and-D, eldesokey2019confidence], validity masks can degrade the model performance due to the saturation of the mask at early layers in CNNs. To tackle this issue, inspired by normalized convolution [Knutsson1993NormalizedAD], Eldesokey et al. [Eldesokey2018PropagatingCT] introduced the normalized convolutional neural network (NCNN) that generates continuous confidence maps for depth completion. The essential difference is that features obtained using the NCNN are weighted with continuous confidence maps instead of binary validity masks. In addition, convolution filters are constrained to be non-negative by the SoftPlus function [glorot2011deep] for faster convergence.
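A minimal single-channel sketch of a normalized convolution layer is given below. It assumes zero padding, an odd kernel size, and a confidence propagation rule that divides the confidence-weighted filter response by the sum of the filter weights; the names `softplus` and `normalized_conv2d` are ours, and the code is an illustration rather than the NCNN implementation.

```python
import numpy as np

def softplus(z):
    """Maps unconstrained filter parameters to non-negative weights."""
    return np.log1p(np.exp(z))

def normalized_conv2d(x, conf, raw_weight, eps=1e-8):
    """Confidence-weighted convolution on one channel ('same' zero padding).

    x, conf:    arrays of shape (H, W); conf holds values in [0, 1].
    raw_weight: array of shape (k, k) with odd k, before SoftPlus.
    Returns the filtered signal and the propagated confidence map.
    """
    w = softplus(raw_weight)
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x * conf, p)
    cp = np.pad(conf, p)
    H, W = x.shape
    out = np.zeros((H, W))
    conf_out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            num = (xp[i:i + k, j:j + k] * w).sum()
            den = (cp[i:i + k, j:j + k] * w).sum()
            out[i, j] = num / (den + eps)
            conf_out[i, j] = den / w.sum()   # stays continuous, in [0, 1]
    return out, conf_out

# Constant depth 3 observed with confidence 1.0 and 0.5 at two pixels.
x = np.full((4, 4), 3.0)
conf = np.zeros((4, 4))
conf[1, 1] = 1.0
conf[2, 3] = 0.5
out, conf_out = normalized_conv2d(x, conf, np.zeros((3, 3)))
```

Unlike a max-pooled binary mask, which saturates to all ones after a few layers, the propagated confidence here reflects how much trusted evidence fell inside each window.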

Although NCNN still takes a sparse mask as an initial input, it yields a continuous confidence map to indicate useful information across the intermediate layers. In reality, disturbed measurements exist due to the LiDAR projection errors. The initial sparse confidence input cannot exclude such noisy inputs. To solve this problem, Eldesokey et al. [eldesokey2020uncertainty] further developed a self-supervised approach to estimate a continuous input confidence map for suppressing the disturbed measurements with a network. NCNN is also applied to RGB guided depth completion in [hua2018ANC, eldesokey2019confidence].

3.3 Training with Auxiliary Images

To overcome the lack of semantic cues, Lu et al. [from_depth_what] employed an auxiliary learning branch in their framework. Instead of directly using an image as input, they only take a sparse depth map as input and simultaneously predict a reconstructed image and a dense depth map. The RGB images are only used in the training stage as a learning objective to encourage acquiring more complementary image features. A similar method is also seen in [DenseLivox], where RGB images and surface normals are used for auxiliary training. In [Lu2022DepthCA], an auto-encoder is employed to generate RGB data in latent space, and then the auto-encoder predicts the final depth from it. Although these methods are RGB guided in training, they aim at performing unguided depth completion in inference. Therefore, we categorize them as unguided methods.

4 RGB Guided Depth Completion

Unguided methods usually underperform RGB guided methods and suffer from blurring and distorted object boundaries. RGB images provide plentiful semantic cues, which are critical for filling missing values of objects with irregular shapes. Therefore, the vast majority of previous works seek to employ RGB information to boost depth completion and demonstrate significantly better performance than unguided methods. To date, different types of methods have been proposed, and they can be categorized into five main types: 1) early fusion models, 2) late fusion models, 3) explicit 3D representation models, 4) residual depth models, and 5) spatial propagation network (SPN) based models.

4.1 Early Fusion Models

Early fusion methods directly concatenate a sparse depth map and an RGB image before passing them through a deep model [S-D-single-image, dimitrievski2018learningmorph, DeepLiDAR], or aggregate multi-modal features at the first convolutional layer of a model [depth_coefficient, xu2019depth, long2021depth]. Previous methods of early fusion can be divided into two types: methods employing 1) encoder-decoder network and 2) two-stage coarse to refinement prediction.

4.1.1 Encoder-decoder Networks

This type of method utilizes a traditional encoder-decoder network (EDN) to solve the pixel-to-pixel regression problem. An early work is shown in [S-D-single-image] where Ma et al. proposed to accomplish depth completion from both a sparse depth map and its corresponding RGB image. Toward this end, they directly concatenated the RGB image and the sparse depth map and then fed them to an encoder-decoder network built on a ResNet-50 network [He2016DeepRL].

To better enforce the prediction to be consistent with the measurements, Qu et al. [qu2020depth] replaced the last convolutional layer with a least squares fitting module. In this model, the extracted features obtained from the penultimate layer are treated as a set of bases, and the weights of these bases are obtained through a least squares fit on the depths at valid pixels. As discussed in the paper [qu2020depth], the method is unable to handle extremely sparse input due to the lack of supervision with enough depth points.
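The least squares fitting step can be illustrated with a small sketch. It assumes that the penultimate-layer features are given as an array of shape (H, W, C) and that a sparse depth value of 0 marks a missing measurement; the function name `least_squares_head` and the toy bases are hypothetical.

```python
import numpy as np

def least_squares_head(features, sparse_depth):
    """Fit per-image weights of the feature bases on the valid depth pixels.

    features:     (H, W, C) array, each channel treated as one basis.
    sparse_depth: (H, W) array; 0 marks a missing measurement (assumption).
    Returns the dense prediction and the fitted weights.
    """
    valid = sparse_depth > 0
    A = features[valid]            # (N, C) basis values at measured pixels
    b = sparse_depth[valid]        # (N,)  measured depths
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return features @ w, w

# Toy bases: a constant, the row index, and the column index.
rows, cols = np.mgrid[0:4, 0:5]
features = np.stack([np.ones((4, 5)), rows, cols], axis=-1).astype(float)
true_depth = 2.0 + 0.5 * rows + 0.1 * cols

# Keep only four measurements.
sparse_depth = np.zeros((4, 5))
for r, c in [(0, 0), (1, 3), (3, 1), (2, 4)]:
    sparse_depth[r, c] = true_depth[r, c]

dense, w = least_squares_head(features, sparse_depth)
```

The fit is only well determined when the number of valid pixels is at least the number of bases, which is one way to see why such a module struggles with extremely sparse input.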

Motivated by spatially-adaptive denormalization (SPADE) [park2019semantic], Senushkin et al. [senushkin2020decoder] proposed to learn spatially-dependent scale and bias for normalized features. They introduced a novel decoder assembled with SPADE blocks with a modulation branch. The modulation branch takes the validity mask as input and predicts multi-scale modulation signals. These modulation signals are sent to the multiple SPADE blocks in the decoder at each spatial scale to update features. The method's effectiveness has been validated on both indoor depth enhancement and outdoor depth completion.

Instead of the direct concatenation, several approaches [S-d-selfsuper, zhang2019dfinenet, depth_coefficient, long2021depth] used two separate convolutional units to extract features from RGB and depth input at the first layer of the encoder-decoder network, respectively. Then, the multi-modal features were concatenated and sent to the rest of the layers to obtain a complete depth map.

4.1.2 Coarse to Refinement Prediction

Some methods employ a two-stage coarse to refinement prediction (C2RP) to achieve more accurate depth estimation. Methods of this kind first estimate a coarse depth map in the coarse prediction stage, and then predict a refined map from the coarse depth map and the RGB image in the refinement stage. For instance, Dimitrievski et al. [dimitrievski2018learningmorph] integrated a learnable morphological operator (two contraharmonic mean filter layers [masci2013learning-CHM]) into a U-Net [Ronneberger2015UNetCN] based framework. After the morphological operation, the predicted coarse depth map and the RGB image are passed through a U-Net to get a refined output. Similarly, Hambarde et al. [hambarde2020s2dnet] proposed S2DNet, which consists of two pyramid networks: S2DCNet and S2DFNet. The S2DCNet performs the first coarse prediction, and the S2DFNet performs the second refinement.

Unlike the above methods, several methods proposed to generate multiple depth maps in the coarse prediction stage. For instance, Chen et al. [chen2018estimating] generated a dense map with nearest neighbor interpolation and a prior distance map between depth points based on a Euclidean distance transform. Recently, Hegde et al. [hegde2021deepdnet] proposed DeepDNet. The main difference from [chen2018estimating] is that, in the coarse prediction stage, the original sparse input is first transformed into a grid sparse depth map with quad tree based preprocessing. Then, two coarse maps are generated by applying nearest neighbor interpolation and bicubic interpolation to the grid sparse map, respectively.
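The two coarse maps mentioned for [chen2018estimating], a nearest-neighbor filled depth map and a Euclidean distance map to the closest measurement, can be sketched with a brute-force NumPy routine. The function name is hypothetical, a value of 0 is assumed to mark missing depth, and the quadratic memory cost makes this suitable only for small illustrative inputs.

```python
import numpy as np

def nearest_fill_and_distance(sparse_depth):
    """Return (dense, dist): depth copied from the nearest valid pixel and
    the Euclidean pixel distance to that nearest valid pixel."""
    H, W = sparse_depth.shape
    pts = np.argwhere(sparse_depth > 0)                 # (N, 2) valid pixels
    grid = np.stack(np.mgrid[0:H, 0:W], axis=-1)        # (H, W, 2)
    grid = grid.reshape(-1, 1, 2)                       # (H*W, 1, 2)
    d2 = ((grid - pts[None, :, :]) ** 2).sum(axis=-1)   # (H*W, N)
    nearest = d2.argmin(axis=1)
    dense = sparse_depth[pts[nearest, 0], pts[nearest, 1]].reshape(H, W)
    dist = np.sqrt(d2.min(axis=1)).reshape(H, W)
    return dense, dist

# Two measurements in opposite corners of a 3x3 map.
sparse_depth = np.zeros((3, 3))
sparse_depth[0, 0] = 2.0
sparse_depth[2, 2] = 8.0
dense, dist = nearest_fill_and_distance(sparse_depth)
```

The distance map tells the refinement network how far each filled value is from a real measurement, and therefore how much it should be trusted.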

The idea of rectifying from a coarse prediction is also frequently leveraged in subsequent studies, such as those built on SPNs and residual depth learning frameworks.

4.2 Late Fusion Models

Late fusion models usually employ two sub-networks to extract features from (i) RGB images using an RGB encoder network, and (ii) sparse depth inputs using a depth encoder network. The fusion is conducted at intermediate layers of the two sub-networks. Most of the previous methods exploit the late fusion strategy with various network structures. Specifically, they are categorized into three types: methods employing 1) dual-encoder network, 2) double encoder-decoder network, and 3) global and local depth prediction.

4.2.1 Dual-encoder Networks

Methods built on a dual-encoder network (DEN) commonly use an RGB encoder and a depth encoder for extracting multi-modal features. Then, these features are aggregated and fed into a decoder. In [max-S-and-D], Jaritz et al. introduced a two-branch encoder network based on a modified NASNet [nasnet], where the intermediate features extracted from all encoders are directly concatenated and then passed to a decoder. Notably, Jaritz et al. verified that the validity mask is not necessary for performance improvement for large networks. Instead of direct channel-wise concatenation, features extracted from the RGB encoder and the depth encoder are fused by element-wise summation in [shivakumar2019dfusenet, ryu2021scanline].

Lately, more complicated fusion strategies have been explored. Fu et al. [fu2020depth] improved the straightforward concatenation of RGB and depth features with an inductive fusion adapted from the conditional neural process [Garnelo2018ConditionalNP]. Zhong et al. [zhong2019deep] suggested using the correlation between RGB and depth information. For this purpose, they proposed the CFCNet which extracts the most semantically correlated features from multi-modal inputs by applying deep canonical correlation analysis [yang2017canonical] between the sparse depth points and their corresponding pixels in RGB images.

The above approaches only fuse the output features of the RGB branch and the depth branch at a single spatial scale. To establish a hierarchical joint representation, Zhang et al. [zhang2020multiscale] proposed a multi-scale adaptation fusion network (MAFN). The main contribution of MAFN is the adaptation fusion module (AFM) that incorporates features extracted from RGB and depth modalities and passes them to a neighbor attention module to enhance their local neighboring relational information. AFM is applied between the RGB and depth branches at multiple scales, as seen in Fig. 3.

Fig. 3: The diagram of the multi-scale adaptation fusion network (MAFN). The framework is a dual-encoder network where features extracted from the RGB encoder and the depth encoder are fused with the adaptation fusion module (AFM) at multiple scales. From [zhang2020multiscale].

Li et al. [cascade] introduced a cascaded hourglass network that consists of a branch (image encoder) used to extract features from images and three hourglass branches used to extract features from depth at different scales (1/4, 1/2, 1). The feature maps obtained from the image encoder at different scales are merged with the corresponding depth features by skip connection. The ground truth is down-sampled to different scales to make use of the multi-scale supervision.

To better tackle the sparsity, many works seek to exploit additional constraints to guide the learning process. A common solution is to apply epipolar constraints between temporally adjacent frames [wong2020unsupervised, wong2021adaptive, feng2022advancing, wong2021scaffnet, wong2021unsupervised, song2021self, choi2021selfdeco], or stereo pairs [ddp, shivakumar2019dfusenet]. Another constraint is adversarial loss which comes from adversarial training with the use of a generative adversarial network (GAN) [Goodfellow2014GenerativeAN]. Although these constraints provide unsupervised guidance to the models for the depth completion task, they require additional inputs or other guidance networks during their training.

4.2.2 Double encoder-decoder Networks

As discussed above, DEN-based methods usually consist of an RGB encoder, a depth encoder, and a decoder. The fusion is conducted between the two encoders. A double encoder-decoder network (DEDN) is an improvement of the dual-encoder network. A vanilla DEDN contains two encoder-decoder networks. As in a DEN, one takes an image input, and the other takes a sparse depth input. The image network is also called the guided network. For methods built on DEDN, the fusion is usually conducted between the decoder of the image branch and the encoder of the depth branch at multiple scales.

As a representative method depicted in Fig. 4, GuideNet [learning-guided] aims to learn a more effective fusion of RGB and depth features. Inspired by guided image filtering [guided-image-filtering] and bilateral filtering [bilateral-filtering], GuideNet introduced the guided convolution which automatically generates spatially-variant kernels from the image features and applies them to assign weights to the depth features. The guided convolution is applied to multi-scale image features. To reduce the computational complexity, motivated by MobileNet-V2 [mobilenetv2], the guided convolution is factorized into a channel-wise and a cross-channel convolution.
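The factorized guided convolution can be sketched as a spatially-variant channel-wise filtering step followed by a cross-channel mixing step. In the sketch below the per-pixel kernels and the mixing matrix are passed in as plain arrays; in a real model they would be predicted from image features by learned layers, which are omitted here. The function name `guided_conv` and the array layout are our assumptions, not GuideNet's code.

```python
import numpy as np

def guided_conv(depth_feat, kernels, mix):
    """Spatially-variant channel-wise convolution plus cross-channel mixing.

    depth_feat: (C, H, W) depth features.
    kernels:    (C, H, W, k, k) one k x k kernel per channel and pixel
                (assumed to be generated from image features).
    mix:        (D, C) cross-channel weights (a 1x1 convolution).
    """
    C, H, W = depth_feat.shape
    k = kernels.shape[-1]
    p = k // 2
    padded = np.pad(depth_feat, ((0, 0), (p, p), (p, p)))
    out = np.zeros((C, H, W))
    # Channel-wise step: each pixel is filtered with its own kernel.
    for u in range(k):
        for v in range(k):
            out += kernels[:, :, :, u, v] * padded[:, u:u + H, v:v + W]
    # Cross-channel step: mix channels at every pixel.
    return np.einsum('dc,chw->dhw', mix, out)

depth_feat = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
kernels = np.zeros((2, 3, 4, 3, 3))
kernels[..., 1, 1] = 1.0          # identity kernel everywhere ...
kernels[0, 0, 0] = 0.0
kernels[0, 0, 0, 1, 2] = 1.0      # ... except one pixel that copies its right neighbor
out = guided_conv(depth_feat, kernels, np.eye(2))
```

Storing a full kernel for every channel, pixel, and output channel would be far more expensive, which is the motivation for splitting the operation into these two cheaper steps.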

Fig. 4: The architecture of the GuideNet. The framework is a double encoder-decoder network in which the guided convolution learns fusion kernels from RGB features and applies them to depth features. From [learning-guided].

Inspired by [learning-guided] and [SICNN], Schuster et al. [schuster2021ssgp] proposed sparse spatial guided propagation (SSGP), which combines image guided spatial propagation and sparsity-aware convolution. SSGP is applicable not only to depth completion but also to other interpolation problems such as optical flow and scene flow. More recently, Yan et al. [yan2021rignet] proposed RigNet with a novel repetitive design to handle blurry object boundaries and better recover scene structures. In RigNet, the branch used for extracting image features is implemented using a repetitive hourglass network (RHN), i.e., multiple encoder-decoder networks, to produce perceptually clear image features. The branch of RigNet used for extracting depth features is also an hourglass network stacked with a repetitive guidance module (RG). RG plays a similar role to the guided convolution [learning-guided] and is built on dynamic convolution [chen2020dynamic]. Since RG implements dynamic convolution repetitively, the convolution factorization proposed in [learning-guided] becomes less efficient. Thus, they designed an efficient guidance algorithm in which the kernel size in the channel-wise convolution drops from 3×3 to 1×1 by using global average pooling. RigNet achieves strong performance and, at the time of writing, ranks second on the KITTI depth completion dataset [SICNN].

4.2.3 Global and Local Depth Prediction

In several prior works, RGB and LiDAR data are referred to as global information, and the LiDAR data is referred to as local information. The global and local depth prediction (GLDP) methods employ a global network to infer depth from global information and a local network to estimate depth from local information. The final dense depth map is obtained by merging the outputs of the global and local networks.

To exploit both the global and local features, a global depth map and a local depth map, as well as related confidence maps, were predicted in [sparse-and-noisy]. The confidence map predicted at each branch was used as cross-guidance to refine the depth map predicted by the other branch. A similar method was also introduced in [lee2020deep], where Lee et al. made two improvements. First, in order to extend the receptive field, they designed a residual atrous spatial pyramid (RASP) block to replace the traditional residual block. Second, unlike [sparse-and-noisy], where the confidence map was directly used to refine a depth map via element-wise multiplication, they introduced a new guidance module that applies both channel-wise and pixel-wise attention operations. The same framework was likewise used in [lu2021sgtbn] to address depth completion from extremely sparse, single-line depth maps.
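One simple way to merge the outputs of a global and a local branch using their confidence maps is a per-pixel softmax weighting, sketched below. This particular fusion rule and the function name `fuse_global_local` are illustrative assumptions; the surveyed methods differ in how they use the confidence maps.

```python
import numpy as np

def fuse_global_local(d_global, c_global, d_local, c_local):
    """Per-pixel softmax over two confidence maps, then a weighted sum."""
    m = np.maximum(c_global, c_local)      # subtract the max for stability
    e_g = np.exp(c_global - m)
    e_l = np.exp(c_local - m)
    w_g = e_g / (e_g + e_l)
    return w_g * d_global + (1.0 - w_g) * d_local

d_global = np.array([[10.0, 10.0]])
d_local = np.array([[20.0, 20.0]])
c_global = np.array([[0.0, 50.0]])   # equal confidence, then strongly global
c_local = np.array([[0.0, 0.0]])
fused = fuse_global_local(d_global, c_global, d_local, c_local)
```

Where both branches are equally confident the result is their average, and where one branch is far more confident its prediction dominates.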

4.3 Explicit 3D Representation Models

Most previous studies of RGB guided depth completion learn 3D geometric relationships in an implicit yet ineffective manner. Typically, the difficulty comes from the incapability of normal 2D convolution to capture the 3D geometric clues from the sparse input where the observed depth values are irregularly distributed. Hence, another type of approach promotes explicit 3D representations (E3DR). Previous methods of this type can be classified into those employing 1) 3D-aware convolution, 2) an intermediate surface normal representation, and 3) geometric representations learned from point clouds.

4.3.1 3D-aware Convolution

In [2d-3d], features extracted from an RGB branch and a depth branch are fused by several 2D-3D fusion blocks that jointly learn 2D and 3D representations. The 2D-3D fusion block uses a multi-scale branch to extract appearance features in 2D grid space with normal convolution operations, and a branch to learn 3D geometric representations by applying two continuous convolutions [wang2018deep] on K-nearest neighbors of a center point in 3D space. The idea of learning from spatially close K-nearest neighbors is then commonly employed in subsequent studies.
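The neighbor search underlying such 3D-aware convolutions can be sketched as follows. The sketch assumes a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, treats `depth > 0` as the validity test, and uses a brute-force search; the helper names are illustrative and the cited methods may select neighbors differently.

```python
def backproject_valid(depth, fx, fy, cx, cy):
    """Lift every valid pixel (depth > 0) of an H x W depth map to a 3D point."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

def k_nearest(points, center, k):
    """Indices of the k points closest to points[center] (Euclidean distance in 3D)."""
    px, py, pz = points[center]
    ranked = sorted(
        (i for i in range(len(points)) if i != center),
        key=lambda i: (points[i][0] - px) ** 2
        + (points[i][1] - py) ** 2
        + (points[i][2] - pz) ** 2,
    )
    return ranked[:k]
```

Pixels that are adjacent on the 2D grid can be far apart in 3D when their depths differ strongly, which is why the neighbors are selected in 3D space rather than on the image grid.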

For instance, in the ACMNet [zhao2021adaptive], the nearest neighbors are identified similarly by comparing the spatial differences. Unlike [2d-3d], the non-grid convolution is implemented by graph propagation. As seen in Fig. 5, ACMNet has a DEDN structure where the encoder is composed of co-attention guided graph propagation modules (CGPMs), and the decoder is a stack of symmetric gated fusion modules (SGFMs). CGPM adaptively applies attention based graph propagation in both the image and depth encoders for multi-modality feature extraction, and SGFMs apply symmetric cross guidance between two decoders for multi-modality feature fusion.

Fig. 5: The diagram of the ACMNet where the encoder uses several co-attention guided graph propagation modules (CGPMs) for multi-modality feature extraction and the decoder uses several symmetric gated fusion modules (SGFMs) for multi-modality feature fusion. From [zhao2021adaptive].

Xiong et al. [Xiong2020SparsetoDenseDC] considered a graph model for depth completion and introduced a graph neural network (GNN) based depth completion algorithm. Note that the 3D graph of nearest neighbors is only constructed for a valid point in [2d-3d, zhao2021adaptive], while it is constructed for each point from a pre-enhanced dense depth map in [Xiong2020SparsetoDenseDC].

4.3.2 Intermediate Surface Normal Representation

A few works utilized surface normal as an intermediate 3D representation of the depth map and introduced methods employing surface normal guided completion. As studied in [DDC_of_single, huang2019indoor], surface normal is a reasonable intermediate representation and can promote indoor depth enhancement. However, as pointed out by Qiu et al. [DeepLiDAR], reconstructing depth from normal in outdoor scenes is more sensitive to noise and occlusion, and how to utilize surface normal in this case is still an open question. To address this issue, they proposed DeepLIDAR, a two-branch network consisting of a color pathway and a surface normal pathway depicted in Fig. 6. Both branches produce a dense depth map. In the surface normal branch, surface normal is utilized as the intermediate representation of the produced depth map.

Fig. 6: The pipeline of the DeepLIDAR where surface normal is used as an intermediate representation of a depth map. From [DeepLiDAR].

The use of surface normal is straightforward for the method proposed in [DeepLiDAR]. As argued in [xu2019depth], the relation between depth and surface normal can be established via the tangent plane equation in the camera coordinate system. By this intuition, Xu et al. [xu2019depth] proposed the plane-origin distance that forces the consistency between depth and surface normal to regularize depth completion.
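The tangent plane relation between depth and surface normal can be sketched as follows. The sketch assumes a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy` and a unit normal per pixel: a pixel with depth $z$ back-projects to a 3D point $X$, and the distance from the camera origin to the tangent plane at $X$ is the dot product of the normal and $X$. The function names are illustrative, and the sign convention in [xu2019depth] may differ.

```python
def backproject(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth z to a 3D point in the camera frame."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)

def plane_origin_distance(point, normal):
    """Signed distance from the camera origin to the tangent plane n . X = dist."""
    return sum(n * x for n, x in zip(normal, point))

def depth_from_plane(u, v, normal, dist, fx, fy, cx, cy):
    """Recover the depth at pixel (u, v) from the tangent plane n . X = dist."""
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)  # viewing ray with unit depth
    return dist / sum(n * r for n, r in zip(normal, ray))
```

For a fronto-parallel plane the distance equals the depth at every pixel, so a depth map and a normal map that agree with each other yield the same plane-origin distance across a planar region.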

4.3.3 Learning from Point Clouds

Recently, a few studies directly learned geometric representations from point clouds. For example, Du et al. [Du2022DepthCU] proposed to firstly learn a geometric-aware embedding from point clouds with edge convolution [Wang2019DynamicGC]. Then, a DEN was utilized to perform depth completion from RGB images and geometric embeddings. Jeon et al. [jeon2021abcd] also used a point cloud as input. By incorporating the attention mechanism into bilateral convolution [su2018splatnet], they designed an attention bilateral convolutional layer (ABCL) based encoder for feature extraction from 3D point clouds. Their framework also implements a DEN where a point cloud encoder is used to extract 3D features, and an image encoder is used to extract 2D features from an RGB image and a sparse depth input.

4.4 Residual Depth Models

Residual depth models (RDMs) predict a depth map and a residual map, and the final depth is obtained as their linear combination. Through the prediction of the residual map, the model can refine the blurry depth prediction and yield finer results on object boundaries.

These methods usually apply a two-stage coarse-to-fine prediction procedure. A simple application is shown in [liao2017parse] where a sparse depth map is first completed to a dense map, and a residual map is then predicted. Finally, their element-wise summation generates the final depth map. Gu et al. [gu2021denselidar] proposed DenseLIDAR, a similar method as shown in Fig. 7. In DenseLIDAR, a pseudo depth map is first generated with morphological operations. Then, the pseudo depth map, the RGB image, and the sparse depth input are sent to a CNN to predict a residual map. Finally, the pseudo depth map is rectified with the residual map to yield the final depth map.

Fig. 7: The pipeline of the DenseLIDAR where depth completion is decomposed as learning of a coarse depth map and a residual depth map. From [gu2021denselidar].

For other approaches, the improvement is derived from boosting the estimation of either the coarse depth map or the residual depth map. For instance, motivated by kernel regression, a differentiable kernel regression network was proposed to replace the hand-crafted interpolation for performing the coarse depth prediction from the sparse input in [Nadaraya1964OnER, liu2021learning]. In addition, FCFR-Net [fcfr] implemented an energy based operation for multi-modal feature fusion to boost the residual map learning.

Aiming at handling the uneven distribution and dealing with the outlier issue, Zhu et al. [zhu2021robust] introduced a novel uncertainty based framework which consists of two networks: a multi-scale depth completion block and an uncertainty attention residual learning network. Like other residual based methods, the former network yields a coarse prediction, and the latter network performs refinement. The uncertainty based framework prevents over-fitting to outliers by relaxing constraints of the highly uncertain regions in the first completion stage and guides the network to generate the residual map in the refinement stage. Zhang et al. [zhang2021multi] combined the late fusion with residual learning and proposed a DEN-based multi-cue guidance network. Unlike other methods, the final depth is the combination of the sparse input and the estimated residual map.

4.5 SPN-based Models

An affinity matrix, also called a similarity matrix, expresses how close or similar data points are to each other. It is used to refine and gain a fine-grained prediction in vision tasks. In spatial propagation networks (SPN) [spn], learning an affinity matrix is formulated as learning a group of transformation matrices. Following [spn, nlspn], the affinity refinement process of SPN is defined by

$$
x^{t}_{m,n} = w^{c}_{m,n}\, x^{t-1}_{m,n} + \sum_{(i,j) \in N_{m,n}} w^{i,j}_{m,n}\, x^{t-1}_{i,j}, \tag{4}
$$

where $(m, n)$ and $(i, j)$ denote the coordinates of reference and neighbor pixels, respectively, and $N_{m,n}$ is a set of neighbor pixels of the reference pixel at $(m, n)$. $t$ denotes the iteration step of refinement. $w^{c}_{m,n}$ and $w^{i,j}_{m,n}$ are the affinity of the reference pixel and the affinity between the pixels at $(m, n)$ and $(i, j)$, respectively, where $w^{c}_{m,n} = 1 - \sum_{(i,j) \in N_{m,n}} w^{i,j}_{m,n}$.
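One refinement step of this kind can be sketched as follows. The sketch assumes a fixed 8-neighborhood, given per-pixel neighbor affinities, and a reference affinity equal to one minus the sum of the neighbor affinities actually used; neighbors falling outside the image are ignored. The names `OFFSETS` and `spn_step` are illustrative, and in the cited networks the affinities are learned and normalized rather than supplied by hand.

```python
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def spn_step(x, affinity):
    """One propagation step on an H x W depth map.

    x        : H x W list of depths from the previous iteration.
    affinity : dict mapping each offset in OFFSETS to an H x W list of weights,
               where affinity[(di, dj)][m][n] links pixel (m, n) to (m + di, n + dj).
    """
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for m in range(h):
        for n in range(w):
            acc, w_sum = 0.0, 0.0
            for di, dj in OFFSETS:
                i, j = m + di, n + dj
                if 0 <= i < h and 0 <= j < w:  # skip neighbors outside the image
                    wk = affinity[(di, dj)][m][n]
                    acc += wk * x[i][j]
                    w_sum += wk
            # the reference pixel keeps the remaining weight, so the weights sum to one
            out[m][n] = (1.0 - w_sum) * x[m][n] + acc
    return out
```

Because the weights at each pixel sum to one, a constant depth map is left unchanged, while an isolated spike is diffused to its neighbors over the iterations.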

Since a depth point is correlated to its neighbors, the SPN is reasonably applicable to depth regression problems, and a family of previous studies developed their algorithms based on SPNs. Cheng et al. proposed the pioneering convolutional spatial propagation network (CSPN) [cspn, Cheng2020LearningDW], which is the first SPN-based model used for depth completion. Compared to the original SPN [spn], CSPN has two major improvements. First, in SPN, a point is linked to three local neighbors from the nearest row or column, while in CSPN, a local window is used to connect local neighbors. Second, CSPN efficiently propagates a local area in all directions via a convolution operation instead of propagating in different directions and integrating with max-pooling as in SPN. The final value of a depth point is determined by its local neighbors via the diffusion process with the affinity matrix. Specifically, the network proposed in [S-D-single-image] is modified with skip connections and an additional output branch to generate the affinity matrix. Given a coarse predicted depth map and the affinity matrix, a CSPN is plugged into the network [S-D-single-image] for refinement, as shown in Fig. 8. The hyper-parameters, including the kernel size (size of local neighbors) and the number of iterations, need to be tuned by hyper-parameter search.

Fig. 8: The framework of CSPN based depth completion. The CSPN module is plugged into the network to rectify a coarsely predicted depth map. From [Cheng2020LearningDW].

To solve the difficulty of determining kernel sizes and iteration numbers, Cheng et al. further proposed CSPN++ [cspn++] that enables a context-aware CSPN (CA-CSPN) and a resource-aware CSPN (RA-CSPN). For the implementation of CA-CSPN, various configurations of kernel sizes and numbers of iterations are first defined, and two extra hyper-parameters are introduced to weigh different kernel sizes and iterations adaptively. As a result, CA-CSPN consumes a large amount of computational resources. To tackle this issue, RA-CSPN selects the best kernel size and number of iterations for each pixel by minimizing the computational resource usage. To this end, a computational cost function is added to the optimization target to balance the trade-off between accuracy and training time.

While CSPN and CSPN++ mainly focus on the refinement from an existing encoder-decoder method [S-D-single-image], PENet [penet] takes advantage of both SPN and late fusion models. PENet uses the DEDN structure where one network predicts from RGB images and sparse depths, and the other network predicts from sparse depths and a pre-densified depth map. A CSPN++ is then applied to the fused depth map of these predictions.

The above methods use fixed local neighbors for spatial propagation during affinity learning. However, this will involve the unnecessary use of irrelevant local neighbors. To address this problem, Park et al. proposed a non-local SPN [nlspn] where non-local neighbors with affinities and a depth confidence map are learned, and the propagation is implemented through the K non-local neighbors. Besides, they also designed the confidence-incorporated affinity normalization module to encourage more affinity combinations and reduce the negative effect of unreliable depth values.

In [dspn], a deformable spatial propagation network (DSPN) is proposed to adaptively generate different receptive fields and affinity matrices for each pixel. Likewise, [Lin2022DynamicSP] introduced an attention based dynamic SPN (DySPN) that learns an adaptive affinity matrix by decoupling neighboring pixels based on their distances. This attention mechanism recursively generates different attention maps to refine the affinity matrix, yielding a new state-of-the-art method for depth completion. DySPN currently ranks first on the KITTI depth completion benchmark [SICNN].

5 Learning Objectives for Training Models

Since depth completion and monocular depth estimation have the same target outputs, i.e., predicting dense depth maps, they share the same learning objectives, such as depth loss, surface normal loss, and photometric loss. In this section, we describe the learning objectives used in previous studies. A brief overview is given in Table II, and the commonly used objectives are reviewed in detail in the following sections.

5.1 Depth Consistency

Given a sparse input $Y'$, the predicted dense map $\hat{Y}$ produced by the completion network, and the semi-dense ground truth depth map $Y$, many works [max-S-and-D, ddp, long2021depth, tsuji2018non, shivakumar2019dfusenet] used the $\ell_1$ loss (mean absolute error) between the predicted depth map and the ground truth depth map on valid pixels by

$$
\mathcal{L}_{\ell_1} = \frac{1}{n} \sum_{i=1}^{n} \left\| \hat{y}_i - y_i \right\|_1, \tag{5}
$$

where $\|\cdot\|_1$ denotes the $\ell_1$ norm, $\hat{y}_i$ and $y_i$ denote the predicted depth and the ground truth depth at the $i$-th pixel, and $n$ is the total number of valid depth points from $Y$. Also, most existing methods [S-D-single-image, zhang2020multiscale, Du2022DepthCU] used the $\ell_2$ loss, also known as root mean squared error (RMSE), by

$$
\mathcal{L}_{\ell_2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left\| \hat{y}_i - y_i \right\|_2^2}, \tag{6}
$$

where $\|\cdot\|_2$ denotes the $\ell_2$ norm. Note that in many methods [S-D-single-image, DeepLiDAR, S-d-selfsuper, lee2020deep], the $\ell_2$ loss is referred to as MSE. Therefore, in this article, we do not technically distinguish between the RMSE and MSE when they are used as loss functions.

The $\ell_1$ loss treats each valid pixel equally, while the $\ell_2$ loss is more sensitive to outliers and usually penalizes distant depth points more heavily. To take advantage of both losses, some methods attempt to combine them from different aspects. For example, several approaches [hambarde2020s2dnet, Lin2022DynamicSP] linearly combined them as a loss function. Van Gansbeke et al. [sparse-and-noisy] proposed focal-MSE where the mean absolute error was taken as a focal term for weighting the depth loss. Also, some works [qu2020depth, eldesokey2019confidence] used the Huber loss [huber1992robust] combining $\ell_1$ and $\ell_2$ to reduce the influence of large errors. Writing $e_i = \hat{y}_i - y_i$ for the depth error at the $i$-th of the $n$ valid pixels, it is defined by

$$
\mathcal{L}_{huber} = \frac{1}{n} \sum_{i=1}^{n}
\begin{cases}
\frac{1}{2} e_i^2, & |e_i| \le \delta, \\
\delta \left( |e_i| - \frac{1}{2}\delta \right), & \text{otherwise},
\end{cases} \tag{7}
$$

where $|\cdot|$ denotes the absolute value operator and $\delta$ is usually set to 1. Besides, a few studies [lopez2020project, eldesokey2019confidence] employ the Berhu loss [owen2007robust_berhu], which is a reversion of the Huber loss defined by

$$
\mathcal{L}_{berhu} = \frac{1}{n} \sum_{i=1}^{n}
\begin{cases}
|e_i|, & |e_i| \le \delta, \\
\dfrac{e_i^2 + \delta^2}{2\delta}, & \text{otherwise}.
\end{cases} \tag{8}
$$

Fig. 9 visualizes the comparisons of MAE, MSE, Huber, and the Berhu loss functions for $\delta = 1$. As shown, the Huber norm acts as $\ell_2$ when the error is less than $\delta$ and acts as $\ell_1$ otherwise. On the other hand, the Berhu norm acts inversely to the Huber norm, i.e., it acts as $\ell_1$ when the error is less than $\delta$ and acts as $\ell_2$ otherwise.

Fig. 9: The comparison of MAE, MSE, Huber and Berhu norm.
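The four depth consistency losses compared in Fig. 9 can be written compactly as below. The sketch uses the standard textbook forms of the Huber and Berhu functions, takes flattened depth lists as input, and assumes that a ground truth value greater than zero marks a valid pixel; the function names and the validity convention are illustrative choices.

```python
import math

def _valid_errors(pred, gt):
    """Errors pred - gt on valid pixels only (assumption: gt > 0 marks a valid pixel)."""
    return [p - g for p, g in zip(pred, gt) if g > 0]

def l1_loss(pred, gt):
    """Mean absolute error over valid pixels."""
    errs = _valid_errors(pred, gt)
    return sum(abs(e) for e in errs) / len(errs)

def l2_loss(pred, gt):
    """Root mean squared error over valid pixels."""
    errs = _valid_errors(pred, gt)
    return math.sqrt(sum(e * e for e in errs) / len(errs))

def huber_loss(pred, gt, delta=1.0):
    """Quadratic for small errors, linear for large errors."""
    errs = _valid_errors(pred, gt)
    return sum(0.5 * e * e if abs(e) <= delta else delta * (abs(e) - 0.5 * delta)
               for e in errs) / len(errs)

def berhu_loss(pred, gt, delta=1.0):
    """Linear for small errors, quadratic for large errors."""
    errs = _valid_errors(pred, gt)
    return sum(abs(e) if abs(e) <= delta else (e * e + delta * delta) / (2.0 * delta)
               for e in errs) / len(errs)
```

For an error of 3 with a threshold of 1, the Huber term is 2.5 while the Berhu term is 5, which shows how the two functions trade off the penalty on large errors.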

Another way to handle the above issue of regression is to formulate depth prediction as a classification problem, as in an early work [cao2017estimating] on monocular depth estimation. In this case, the depth range is discretized into a set of bins and a cross entropy loss is used. For depth completion, [depth_coefficient, liu2021learning] exploit this setting.
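The classification formulation can be sketched as follows. The sketch assumes uniformly spaced bins over a hypothetical depth range `[d_min, d_max]` and per-pixel class scores (logits); the cited works may use other binning schemes, and the function names are illustrative.

```python
import math

def depth_to_bin(depth, d_min, d_max, num_bins):
    """Map a depth value to the index of a uniform bin over [d_min, d_max]."""
    depth = min(max(depth, d_min), d_max)  # clamp to the covered range
    idx = int((depth - d_min) / (d_max - d_min) * num_bins)
    return min(idx, num_bins - 1)  # d_max falls into the last bin

def cross_entropy(logits, target_bin):
    """Cross entropy of one pixel: -log softmax(logits)[target_bin]."""
    m = max(logits)  # log-sum-exp with the maximum subtracted for stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target_bin]
```

At inference time, a depth value can be recovered from the predicted distribution, for instance from the center of the most likely bin.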

| Loss function | Type | Notation | Explanation |
|---|---|---|---|
| Depth consistency | Supervised | $\mathcal{L}_{\ell_1}$ | $\ell_1$ loss of depth on valid pixels, Eq. (5). |
| | | $\mathcal{L}_{\ell_2}$ | $\ell_2$ loss of depth on valid pixels, Eq. (6). |
| | | $\mathcal{L}_{huber}$ | Huber loss of depth on valid pixels, Eq. (7). |
| | | $\mathcal{L}_{berhu}$ | Berhu loss of depth on valid pixels, Eq. (8). |
| | | $\mathcal{L}_{ce}$ | Cross entropy of depth on valid pixels by formulating depth regression as a classification task. |
| | | $\mathcal{L}_{uncertainty}$ | Uncertainty driven loss of depth on valid pixels, Eq. (12). |
| Structural loss | Supervised | $\mathcal{L}_{grad}$ | Gradient loss between the predicted depth map and the pseudo ground truth depth map. |
| | | $\mathcal{L}_{normal}$ | Negative cosine difference of surface normal. |
| | | $\mathcal{L}_{ssim}$ | SSIM loss between the predicted depth map and the pseudo ground truth depth map. |
| Smoothness regularization | Unsupervised | [chodosh2018deep] | Total variation of the predicted depth map. |
| | | $\mathcal{L}_{smooth}$ | Norm on the second-order derivative of the predicted depth map, Eq. (13), or edge-aware smoothness loss, Eq. (14). |
| Geometric constraint | Unsupervised | $\mathcal{L}_{photo}$ | Photometric loss derived from temporally adjacent images or stereo images, Eq. (16). |
| | | | Loss of depth between the predicted depth map and the pseudo ground truth depth map generated from stereo images. |
| Adversarial loss | Unsupervised | $\mathcal{L}_{adv}$ | Adversarial loss between the predicted depth map and the pseudo ground truth depth map, Eq. (17). |
| | Supervised | [wong2021scaffnet] | Loss between the prior (initial) depth map and the final estimated depth map. |
| | | [ddp] | Loss between an estimated depth map and its reconstruction from the conditional prior network. |
| | | [long2021depth] | Cosine similarity between the predicted depth map and the pseudo ground truth depth map. |
| | | [xu2019depth] | Loss for learning the confidence map. |
| | | [from_depth_what] | Loss for image reconstruction. |
| | | [zhu2021robust] | Uncertainty aware loss for learning the residual depth map. |

TABLE II: A list of loss functions used for depth completion in previous works.

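Besides the loss functions discussed above, uncertainty aware learning objectives are also exploited to tackle the outliers and inherent noise of the sparse input. Uncertainty estimation [kendall2017uncertainties] was originally proposed to improve the robustness and accuracy of deep models. Inspired by [kendall2017uncertainties], a couple of methods [eldesokey2020uncertainty, zhu2021robust] introduce the uncertainty driven depth loss function where the completion is posed as maximizing the posterior probability. Assuming the likelihood term $p(y_i \mid \hat{y}_i, \sigma_i)$ is modeled by a Gaussian distribution, following [eldesokey2020uncertainty, zhu2021robust], then

$$
p(y_i \mid \hat{y}_i, \sigma_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left( -\frac{(\hat{y}_i - y_i)^2}{2\sigma_i^2} \right), \tag{9}
$$

and $(\hat{y}_i, \sigma_i)$ can be obtained via maximum likelihood estimation by

$$
(\hat{y}_i, \sigma_i) = \arg\max_{\hat{y}_i, \sigma_i} \log p(y_i \mid \hat{y}_i, \sigma_i) = \arg\min_{\hat{y}_i, \sigma_i} \left( \frac{(\hat{y}_i - y_i)^2}{2\sigma_i^2} + \log \sigma_i \right), \tag{10}
$$

where $\hat{y}_i$ and $y_i$ are the predicted depth and the ground truth depth at the $i$-th pixel, and $\sigma_i$ denotes the uncertainty of prediction at the $i$-th pixel. Given equation (10), the uncertainty driven depth loss for depth completion over the $n$ valid pixels is defined by

$$
\mathcal{L}_{uncertainty} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{(\hat{y}_i - y_i)^2}{2\sigma_i^2} + \log \sigma_i \right). \tag{11}
$$

In practice, an exponential function is usually applied to avoid division by zero during training. With $s_i = \log \sigma_i^2$, the following uncertainty aware learning objective is used instead:

$$
\mathcal{L}_{uncertainty} = \frac{1}{2n} \sum_{i=1}^{n} \left( e^{-s_i} (\hat{y}_i - y_i)^2 + s_i \right). \tag{12}
$$

In both works [eldesokey2020uncertainty, zhu2021robust], the uncertainty map is estimated with an additional branch within the depth completion framework.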
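A minimal sketch of such an uncertainty aware objective is given below. It assumes the Gaussian form in which the network predicts the log-variance $s_i = \log \sigma_i^2$ at each pixel, so that each valid pixel contributes $\frac{1}{2}\left(e^{-s_i}(\hat{y}_i - y_i)^2 + s_i\right)$; constant factors in the cited works may differ, and treating `gt > 0` as valid is an illustrative convention.

```python
import math

def uncertainty_loss(pred, gt, log_var):
    """Mean of 0.5 * (exp(-s) * (pred - gt)^2 + s) over valid pixels, with s = log(sigma^2)."""
    terms = [0.5 * (math.exp(-s) * (p - g) ** 2 + s)
             for p, g, s in zip(pred, gt, log_var) if g > 0]
    return sum(terms) / len(terms)
```

Raising the predicted uncertainty at an outlier pixel reduces the weight of its squared error, while the additive $s_i$ term prevents the network from declaring every pixel uncertain.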

5.2 Structural Loss Functions

A common problem of previous works is that the predicted depth maps suffer from blur effects and distorted boundaries. To overcome this problem, researchers proposed to apply regularization to scene structures by introducing loss functions of depth gradient, surface normal, and perceptual quality. Specifically, the gradient loss, denoted by $\mathcal{L}_{grad}$, is implemented by minimizing the mean absolute error of depth gradients [gu2021denselidar, liu2021learning]. For the surface normal difference, denoted by $\mathcal{L}_{normal}$, the negative cosine difference is commonly utilized [xu2019depth, DeepLiDAR]. The effect of the gradient and surface normal losses has been well studied in [Hu2019RevisitingSI]. As shown in Fig. 10, the gradient loss contributes to penalizing errors emerging at the boundary of an object, while the surface normal loss can alleviate minor structural errors. Lastly, the structural similarity index measure (SSIM) loss [Wang2004ImageQA], denoted by $\mathcal{L}_{ssim}$, is used to ensure the perceptual quality [gu2021denselidar, multitask_gan]. Since dense ground truth depth maps are required, previous methods using the structural loss need to generate pseudo dense ground truth maps if they are not available from training data.

Fig. 10: Robustness of depth, gradient, and surface normal loss to depth differences. For simplicity, the solid and dotted lines denote two one-dimensional depth maps, respectively. It is observed that depth loss is insensitive to the shift and the occlusion of edges, while gradient and surface normal loss can handle these structural differences. From [Hu2019RevisitingSI].

5.3 Smoothness Regularization

Smoothness regularization is utilized to suppress noise and ensure local smoothness for depth prediction. Two learning objectives are frequently used for imposing depth smoothness. The first objective, used in [revisiting-scinn, S-d-selfsuper, zhang2019dfinenet, shivakumar2019dfusenet], is to minimize the $\ell_1$ norm on the second-order derivative of the predicted depth map $\hat{Y}$ by

$$
\mathcal{L}_{smooth} = \left\| \nabla_x^2 \hat{Y} \right\|_1 + \left\| \nabla_y^2 \hat{Y} \right\|_1, \tag{13}
$$

where $\nabla_x$ and $\nabla_y$ denote the gradients along the horizontal and vertical directions of the dense depth map. The second is the edge-aware smoothness loss used in [wong2020unsupervised, ryu2021scanline, wong2021scaffnet, wong2021unsupervised, song2021self, choi2021selfdeco], which allows depth discontinuity at boundaries of the RGB image $I$ by

$$
\mathcal{L}_{smooth} = \left| \nabla_x \hat{Y} \right| e^{-\left| \nabla_x I \right|} + \left| \nabla_y \hat{Y} \right| e^{-\left| \nabla_y I \right|}. \tag{14}
$$

Besides, the total variation is also used in [chodosh2018deep] for noise suppression.
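Both smoothness terms can be sketched with finite differences as below. The sketch assumes a single-channel (grayscale) image, forward differences for first-order gradients, central second differences for second-order ones, and averaging over all difference terms; these discretization choices and the function names are assumptions made for illustration. The depth map needs at least three rows and three columns for the second-order term.

```python
import math

def second_order_smoothness(depth):
    """Mean absolute second-order finite difference along x and y."""
    h, w = len(depth), len(depth[0])
    total, count = 0.0, 0
    for r in range(h):
        for c in range(1, w - 1):
            total += abs(depth[r][c - 1] - 2 * depth[r][c] + depth[r][c + 1])
            count += 1
    for r in range(1, h - 1):
        for c in range(w):
            total += abs(depth[r - 1][c] - 2 * depth[r][c] + depth[r + 1][c])
            count += 1
    return total / count

def edge_aware_smoothness(depth, image):
    """Mean first-order depth difference, down-weighted by exp(-|image difference|)."""
    h, w = len(depth), len(depth[0])
    total, count = 0.0, 0
    for r in range(h):
        for c in range(w - 1):  # horizontal neighbor pairs
            weight = math.exp(-abs(image[r][c + 1] - image[r][c]))
            total += abs(depth[r][c + 1] - depth[r][c]) * weight
            count += 1
    for r in range(h - 1):
        for c in range(w):  # vertical neighbor pairs
            weight = math.exp(-abs(image[r + 1][c] - image[r][c]))
            total += abs(depth[r + 1][c] - depth[r][c]) * weight
            count += 1
    return total / count
```

A planar depth ramp has zero second-order penalty, and a depth step that coincides with a strong image edge is penalized far less than the same step in a textureless region.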

5.4 Multi-view Geometric Constraints

One of the most challenging issues for depth completion is the lack of dense and high quality ground truth. To cope with this problem, researchers also attempt to seek solutions from the perspective of utilizing loss functions. Among them, temporal photometric loss obtained from consecutive images provides an unsupervised supervision signal (also called a self-supervised signal in some studies [ito2021seeing, S-d-selfsuper, song2021self]) to guide depth completion.

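Ma et al. [S-d-selfsuper] were the first to introduce photometric loss for depth completion. Based on the epipolar geometry, the predicted depth map of an image is warped to the nearby frame. Then, the differences at corresponding pixels are penalized. Formally, given two consecutive images $I_t$ and $I_{t+1}$, the warping of a pixel $p_t$ from $I_t$ to $I_{t+1}$ is computed by

$$
p_{t+1} = K\, T_{t \to t+1}\, \hat{d}_t(p_t)\, K^{-1} p_t, \tag{15}
$$

where $K$ denotes the camera intrinsic matrix and $T_{t \to t+1}$ denotes the relative pose from time $t$ to $t+1$. $\hat{d}_t(p_t)$ is the predicted depth of the pixel $p_t$ of the image $I_t$. Then, the photometric loss between two images is defined by

$$
\mathcal{L}_{photo} = \frac{1}{m} \sum_{p_t} \left| I_t(p_t) - I_{t+1}(p_{t+1}) \right|, \tag{16}
$$

where the sum runs over the missing pixels and $m$ denotes the number of missing pixels.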
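The warping and the resulting photometric error can be sketched as follows. The sketch assumes a pinhole camera given as `(fx, fy, cx, cy)`, a relative pose given as a rotation matrix `R` and translation `t` with $X_{t+1} = R X_t + t$, single-channel images, and nearest-neighbor sampling in place of the bilinear sampling typically used for differentiability. The `mask` argument selects the pixels on which the loss is evaluated, and all names are illustrative.

```python
def warp_pixel(u, v, depth, K, R, t):
    """Project pixel (u, v) of frame t with the given depth into frame t+1."""
    fx, fy, cx, cy = K
    X = ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)  # back-projection
    Xn = [sum(R[i][k] * X[k] for k in range(3)) + t[i] for i in range(3)]  # rigid motion
    return (fx * Xn[0] / Xn[2] + cx, fy * Xn[1] / Xn[2] + cy)  # re-projection

def photometric_loss(img_t, img_t1, depth, K, R, t, mask):
    """Mean absolute intensity difference between masked pixels and their warped matches."""
    h, w = len(img_t), len(img_t[0])
    total, count = 0.0, 0
    for v in range(h):
        for u in range(w):
            if not mask[v][u]:
                continue
            u2, v2 = warp_pixel(u, v, depth[v][u], K, R, t)
            ui, vi = int(round(u2)), int(round(v2))
            if 0 <= ui < w and 0 <= vi < h:  # skip pixels that leave the image
                total += abs(img_t[v][u] - img_t1[vi][ui])
                count += 1
    return total / count if count else 0.0
```

When the predicted depth and the relative pose are correct and the scene is static, corresponding pixels have similar intensities and the loss is small, which is what makes it usable as a training signal without dense ground truth.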

Researchers attempt to improve the above photometric loss from different perspectives in subsequent studies. The photometric loss is susceptible to moving objects. To alleviate this problem, Chen et al. [chen2020spatiotemporal] integrated a MaskNet into the self-supervised framework. The MaskNet predicts the masks of moving objects such that the influence of moving objects can be reduced. Also, to ensure the perceptual consistency, Wong et al. [wong2020unsupervised, wong2021unsupervised] integrated the SSIM difference between warped and original images into the photometric loss.

Different approaches to calculating the photometric loss have also been explored. In [ito2021seeing], optical flow is used to estimate the relative pose between two consecutive frames, and a pose estimation net is used for this purpose in [zhang2019dfinenet]. In [song2021self], relative poses are calculated in feature spaces at multiple scales. Specifically, consecutive frames are sent to the FeatNet for multi-scale feature extraction. The relative pose is calculated with the Gauss-Newton algorithm [NavarroPedreo1996NumericalMF] at each scale.

As pointed out in [wong2021adaptive], the conventional use of the photometric loss treats each pixel equally. Unfortunately, this incurs significant meaningless errors at occluded regions. To address this issue, Wong et al. [wong2021adaptive] proposed an adaptive weighting function that acts as a flipped sigmoid function. The weights for the photometric loss are approximately equal to 1 at each pixel at the beginning, and get smaller if the residual at a certain pixel increases during the training procedure.

In addition to the temporal photometric loss, a few previous works studied depth completion under the stereo setting. When a stereo pair is available, as seen in [ddp], the multi-view photometric consistency can be derived in a different fashion. Besides, in order to handle the lack of supervision, stereo images are used to generate ground truth depths for missing pixels in [shivakumar2019dfusenet]. However, despite these advantages, the stereo setting inevitably lowers the generalizability of these methods [ddp, shivakumar2019dfusenet] in practice.

5.5 Adversarial Loss

Several approaches also adopt adversarial loss to promote depth completion [tsuji2018non, to_complete_or_to_estimate, multitask_gan, khan2021sparse]. In these works, a generator $G$ is used to infer a depth map from the RGB image $I$ and the sparse depth map $Y'$, and a discriminator $D$ is used to distinguish between the reconstructed depth map and ground truth by

$$
\mathcal{L}_{adv} = \min_{G} \max_{D} \; \mathbb{E}_{\tilde{Y}} \left[ \log D(\tilde{Y}) \right] + \mathbb{E}_{I, Y'} \left[ \log \left( 1 - D(G(I, Y')) \right) \right], \tag{17}
$$

where $\tilde{Y}$ is dense ground truth which is usually obtained by other completion algorithms, and $G$ and $D$ are the generator and discriminator, respectively.

6 Datasets and Evaluation Metrics

In this section, we introduce the benchmark datasets commonly used in previous works in detail. We also comprehensively survey the related datasets for reference.

6.1 Real-world Datasets

KITTI depth completion dataset [SICNN]: The KITTI dataset is a widely used large-scale outdoor dataset that contains over 93,000 semi-dense depth maps with the corresponding raw sparse LiDAR scans and RGB images. The training, validation, and test set have 86,000, 7,000 and 1,000 samples, respectively. The full resolution of the images and depth maps is larger than that of most existing RGBD datasets. The raw LiDAR scans are captured by a Velodyne HDL-64E. In order to have a semi-dense ground truth depth map with rare outliers, Uhrig et al. [SICNN] purified the raw data with semi-global matching (SGM) and densified the sparse depth map by accumulating 11 laser scans. KITTI is the most popular dataset used for performance evaluation of depth completion methods and provides a public benchmark that ranks existing methods. We compare essential characteristics of these methods, including network structures, loss functions, and RMSE on the benchmark KITTI dataset in the following sections.

It should be noted that the ground truths can be used differently in implementing previous methods. The original sparse depth maps have a very low density (as observed in Fig. 11 (b)), whereas the semi-dense ground truths provided by the KITTI benchmark are considerably denser (as visualized in Fig. 11 (c)). Most previous works take the denser ground truths to implement their methods, whereas several unsupervised approaches [ddp, wong2020unsupervised, wong2021scaffnet, wong2021adaptive, wong2021unsupervised] assume that only original sparse depth maps are available. In this case, the depth consistency is only applied to those valid pixels.

(a) (b) (c)
Fig. 11: Sample images from the KITTI depth completion dataset [SICNN]. (a) RGB images. (b) Raw sparse depth maps. (c) Ground truth depth maps.

NYU-v2 [NYUv2]: The NYU-v2 dataset consists of 464 indoor scenes with 408,000 RGBD images captured by Microsoft Kinect with an original resolution of $640 \times 480$. Although the original RGBD data is only applicable to depth enhancement methods, previous studies of depth completion implement their methods by randomly selecting 200 (Fig. 12 (b)) or 500 depth points (Fig. 12 (c)) as sparse inputs. The valid pixels account for less than 1% of the image in both cases. Most methods evaluated on the NYU-v2 dataset are RGB guided. In the following sections, we show the essential characteristics of previous methods, including network structure, loss function, RMSE, etc., on the dataset.
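The random sampling of sparse inputs from a dense depth map can be sketched as below. The sketch assumes that a depth of zero marks an invalid pixel and that sampling is uniform over valid pixels; the function name and the `seed` parameter are illustrative.

```python
import random

def sample_sparse_depth(dense, num_points, seed=None):
    """Keep `num_points` randomly chosen valid pixels of a dense depth map and zero the rest."""
    rng = random.Random(seed)
    h, w = len(dense), len(dense[0])
    valid = [(r, c) for r in range(h) for c in range(w) if dense[r][c] > 0]
    chosen = rng.sample(valid, min(num_points, len(valid)))
    sparse = [[0.0] * w for _ in range(h)]
    for r, c in chosen:
        sparse[r][c] = dense[r][c]
    return sparse
```

Calling `sample_sparse_depth(dense, 500)` on a dense map yields a map in which at most 500 pixels carry depth values, which serves as the sparse input while the original map serves as ground truth.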

(a) (b) (c) (d)
Fig. 12: A sample image from the NYU-v2 dataset [NYUv2]. (a) An RGB image. (b) A sparse depth map (200 points). (c) A sparse depth map (500 points). (d) The corresponding ground truth depth map.

| Method | Publication | Year | Type | Loss Function | Learning | RMSE (mm) | Platform | Code |
|---|---|---|---|---|---|---|---|---|
| SI-CNN [SICNN] | 3DV | 2017 | SACNN | | S | 1601.33 | TensorFlow | |
| DCCS [chodosh2018deep] | ACCV | 2018 | SACNN | + | S | 1325.37 | TensorFlow | |
| HMS-Net [huang2019hms] | TIP | 2019 | SACNN | | S | 937.48 | - | - |
| NConv-CNN [Eldesokey2018PropagatingCT] | BMVC | 2018 | NCNN | | S | 1268.22 | PyTorch | |
| pNCNN [eldesokey2020uncertainty] | CVPR | 2020 | NCNN | | S | 960.05 | PyTorch | |
| DCAE [Lu2022DepthCA] | WACVW | 2022 | TwAI | + | S | 1464.69 | PyTorch | - |
| IR L1 [from_depth_what] | CVPR | 2020 | TwAI | + | S | 915.86 | PyTorch | - |
| IR L2 [from_depth_what] | CVPR | 2020 | TwAI | + | S | 901.43 | PyTorch | - |

Summary of essential characteristics of existing unguided methods on the KITTI dataset. For denoting the loss function, we omit the coefficient of each loss term for simplicity. S denotes supervised learning of models.

| Method | Publication | Year | Type | Loss Function | Learning | RMSE (mm) | Platform | Code |
|---|---|---|---|---|---|---|---|---|
| 3coef [depth_coefficient] | CVPR | 2019 | EFM/EDN | | S | 965.87 | TensorFlow | |
| EncDec-Net[EF] [eldesokey2019confidence] | TPAMI | 2019 | EFM/EDN | | S | 965.45 | PyTorch | |
| Qu et al. [qu2020depth] | WACV | 2020 | EFM/EDN | | S | 998.80 | PyTorch | - |
| Long et al. [long2021depth] | JVCIR | 2021 | EFM/EDN | | S | 776.13 | - | - |
| Morph-Net [dimitrievski2018learningmorph] | ACIVS | 2018 | EFM/C2RP | | S | 1045.45 | Matlab | |
| S2DNet [hambarde2020s2dnet] | TCI | 2020 | EFM/C2RP | | S | 830.57 | PyTorch | - |
| | 3DV | 2018 | LFM/DEN | iMAE | S | 917.64 | - | - |
| MS-Net[LF] [eldesokey2019confidence] | TPAMI | 2019 | LFM/DEN | | S | 859.22 | PyTorch | |
| | WACV | 2020 | LFM/DEN | | S | 762.19 | PyTorch | |
| MAFN [zhang2020multiscale] | IJCNN | 2020 | LFM/DEN | | S | 803.50 | - | - |
| Ryu et al. [ryu2021scanline] | RAL | 2021 | LFM/DEN | | S | 809.09 | - | - |
| DVMN [reichardt2021dvmn] | ITSC | 2021 | LFM/DEN | | S | 776.31 | - | - |
| | TIP | 2020 | LFM/DEDN | | S | 736.24 | PyTorch | |
| SSGP [schuster2021ssgp] | WACV | 2021 | LFM/DEDN | | S | 838.22 | - | - |
| RigNet [yan2021rignet] | arXiv | 2021 | LFM/DEDN | | S | 712.66 | PyTorch | - |
| Van et al. [sparse-and-noisy] | MVA | 2019 | LFM/GLDP | focal-MSE | S | 772.87 | PyTorch | |
| CrossGuidance [lee2020deep] | Access | 2020 | LFM/GLDP | | S | 807.42 | PyTorch | - |
| 2D-3D FuseNet [2d-3d] | ICCV | 2019 | E3DR/3DAC | | S | 752.88 | - | - |
| ACMNet [zhao2021adaptive] | TIP | 2021 | E3DR/3DAC | | S | 732.99 | PyTorch | |
| | CVPR | 2019 | E3DR/ISNR | + | S | 758.38 | PyTorch | |
| PwP [xu2019depth] | ICCV | 2019 | E3DR/ISNR | | S | 777.05 | PyTorch | - |
| | RAL | 2021 | E3DR/LfPC | | S | 764.61 | PyTorch | - |
| Du et al. [Du2022DepthCU] | arXiv | 2022 | E3DR/LfPC | | S | 773.90 | PyTorch | |
| | AAAI | 2020 | RDM | | S | 735.81 | - | - |
| | RAL | 2021 | RDM | | S | 755.41 | - | - |
| Zhu et al. [zhu2021robust] | AAAI | 2022 | RDM | | S | 751.59 | PyTorch | - |
| | ECCV | 2018 | SPM | | S | 1019.64 | PyTorch | |
| CSPN++ [cspn++] | AAAI | 2020 | SPM | | S | 743.69 | - | - |
| NLSPN [nlspn] | ECCV | 2020 | SPM | | S | 741.68 | PyTorch | |
| PENet [penet] | ICRA | 2021 | SPM | | S | 730.08 | PyTorch | |
| DySPN [Lin2022DynamicSP] | AAAI | 2022 | SPM | | S | 709.12 | PyTorch | - |
| SS-S2D (d) [S-d-selfsuper] | ICRA | 2019 | EFM/EDN | + | U | 1299.85 | PyTorch | |
| DFineNet [zhang2019dfinenet] | arXiv | 2019 | EFM/EDN | + + | S&U | 943.89 | PyTorch | |
| DDP [ddp] | CVPR | 2019 | LFM/DEN | | S&U | 1263.19 | TensorFlow | - |
| DFuseNet [shivakumar2019dfusenet] | ITSC | 2019 | LFM/DEN | + + | S&U | 1206.66 | PyTorch | |
| VOICED [wong2020unsupervised] | RAL | 2020 | LFM/DEN | + + | S&U | 1169.97 | TensorFlow | - |
| ScaffFusion-U [wong2021scaffnet] | RAL | 2021 | LFM/DEN | | S&U | 847.22 | TensorFlow | |
| ScaffFusion-S&U [wong2021scaffnet] | RAL | 2021 | LFM/DEN | | U | 1121.89 | TensorFlow | |
| KBNet [wong2021unsupervised] | ICCV | 2021 | LFM/DEN | | S&U | 1069.47 | PyTorch | |
| Song et al. [song2021self] | TITS | 2021 | LFM/DEN | | S&U | 1216.26 | PyTorch | |

Summary of essential characteristics of selected existing RGB guided methods on the KITTI dataset. For denoting loss functions, we omit the coefficient of each loss term for simplicity. S and U denote supervised learning and unsupervised learning of models, respectively. Accordingly, the top and bottom parts of the table show the supervised and unsupervised methods implemented for depth completion, respectively.

DenseLivox [DenseLivox]: One challenge of acquiring ground truth depth in outdoor scenarios is the price of a high-end LiDAR. For instance, the Velodyne HDL-64E mentioned above in KITTI costs nearly 100,000. Therefore, Yu et al. [DenseLivox] argue that the cheaper and more reliable solid-state LiDAR seems to be a more reasonable choice in practice. Consequently, they use the Livox LiDAR to obtain dense ground truth of both indoor and outdoor scenes. DenseLivox also provides some extra data such as bound-occlusion and normal. This dataset was employed to evaluate the method in [DenseLivox].

VOID [wong2020unsupervised]: The VOID dataset contains 56 sequences collected with the Intel RealSense D435i camera from both indoor and outdoor scenes, in which 48 sequences (approximately 47,000 frames) are designated for training, and the remaining 8 sequences are used for testing. The resolution of each frame is $640 \times 480$. Each sequence has three different density levels with 1500, 500, and 150 points. This dataset was employed to evaluate the methods in [wong2020unsupervised, wong2021adaptive, wong2021scaffnet, wong2021unsupervised].

ScanNet [dai2017scannet]: ScanNet contains 2.5 million RGBD images from 1513 indoor scenes. Unlike other indoor datasets, the RGBD dataset is collected by an iPad Air2 equipped with a Structure sensor. This leads to input with different resolutions between depth maps ($640 \times 480$) and RGB images ($1296 \times 968$). This dataset was employed to evaluate the methods in [DEN, senushkin2020decoder].

6.2 Synthetic Datasets

It is incredibly challenging to acquire pixel-level ground truth depth maps in real-world applications because depth sensors are prone to failures for scenarios like capturing the sky and transparent or reflective surfaces. For the implementation of depth completion algorithms, some researchers turn to virtual datasets, from which dense and high-quality ground truth can be generated faithfully and efficiently. In this section, we review various synthetic datasets.

SYNTHIA [Synthia]: The SYNTHIA dataset is a virtual dataset of urban driving scenes. Ros et al. [Synthia] employed the Unity game engine to create a virtual city that includes street blocks, highways, suburban areas, and other common objects in the urban environment. To make the rendered output realistic, the virtual city has four different appearances corresponding to the four seasons. In addition, different illumination conditions and other details are applied to improve the quality of the virtual RGB images. The full dataset is divided into two complementary sets, SYNTHIA-Rand and SYNTHIA-Seqs. The former (13,400 frames with the resolution of ) is sampled randomly within the city, while the latter (200,000 frames with the same resolution) is captured from a virtual vehicle across different seasons. This dataset was employed to evaluate the methods in [max-S-and-D, qu2020depth].

Aerial depth [aerial_depth]: Because of the large variation of viewpoints during a flight, models trained with the above-mentioned ground datasets are far from satisfactory when dealing with aerial images. Therefore, Teixeira et al. proposed the Aerial depth dataset, a virtual outdoor dataset specially designed for simulating data captured in UAV working conditions. The dataset contains 83,797 RGB and depth images from 18 virtual 3D models, of which 67,435 are selected for training and the rest for validation. This dataset was employed to evaluate the method in [aerial_depth].

Virtual KITTI [gaidon2016virtual]: As the name suggests, this dataset is a virtual version of the KITTI dataset. Five KITTI videos (0001/0002/0006/0018/0020) are cloned using the Unity engine. The whole dataset consists of 35 virtual videos (about 17,000 frames). Each cloned virtual video is further modified to obtain 7 variations. The modifications include changing features of the objects, changing the camera’s position and orientation, and changing the lighting conditions. This dataset was employed to evaluate the methods in [shivakumar2019dfusenet, qu2020depth, jeon2021abcd, wong2021scaffnet].

SceneNet RGB-D [mccormac2016scenenet]: The SceneNet RGB-D dataset aims at providing a large-scale dataset suitable for model pre-training. This dataset contains 5 million RGBD indoor images from over 15,000 synthetic trajectories with image resolution. Each trajectory has 300 rendered frames. Thanks to ray tracing, the rendered images approach photorealistic quality. This dataset was employed to evaluate the method in [wong2021scaffnet].

6.3 Evaluation Metrics

Depth completion and monocular depth estimation generally share the same evaluation metrics. We list the most commonly used measures as follows:

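  • RMSE: Root mean squared error defined in equation (6).

  • MAE: Mean absolute error defined in equation (5).

  • iRMSE: RMSE of the inverse depth, defined by $\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{1}{\hat{d}_i}-\frac{1}{d_i}\right)^{2}}$, where $\hat{d}_i$ and $d_i$ are the predicted and ground truth depths at the $i$-th valid pixel and $N$ is the number of valid pixels.

  • iMAE: MAE of the inverse depth, defined by $\frac{1}{N}\sum_{i=1}^{N}\left|\frac{1}{\hat{d}_i}-\frac{1}{d_i}\right|$.

The above four measures are the metrics commonly used to evaluate models on the KITTI benchmark. Among them, KITTI ranks algorithms by RMSE. Thus, many previous methods choose RMSE ($\ell_2$) as a loss function to train models. Several other metrics are also frequently used for depth evaluation, such as

  • REL: Mean relative error defined by $\frac{1}{N}\sum_{i=1}^{N}\frac{|\hat{d}_i-d_i|}{d_i}$.

  • $\delta_\tau$: Thresholded accuracy, i.e., the fraction of valid pixels satisfying $\max\left(\frac{\hat{d}_i}{d_i},\frac{d_i}{\hat{d}_i}\right)<\tau$, where $\tau$ is a given threshold.

REL and $\delta_\tau$ are commonly used for evaluation of models on indoor datasets, e.g., NYU-v2.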
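The measures listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than an official benchmark implementation: the function name `depth_metrics`, the convention that a ground truth value of 0 marks a missing pixel, and the default thresholds 1.25, 1.25^2, and 1.25^3 are assumptions made for this example.

```python
import numpy as np


def depth_metrics(pred, gt, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Compute common depth metrics over pixels with valid ground truth.

    Assumption: gt == 0 marks pixels without ground truth. Pixels with a
    non-positive prediction are also skipped so that inverse depth is defined.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = (gt > 0) & (pred > 0)
    p, g = pred[valid], gt[valid]
    if p.size == 0:
        raise ValueError("no valid pixels to evaluate")

    err = p - g                       # error in depth
    inv_err = 1.0 / p - 1.0 / g       # error in inverse depth
    ratio = np.maximum(p / g, g / p)  # symmetric ratio used by the thresholded accuracy

    metrics = {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "irmse": float(np.sqrt(np.mean(inv_err ** 2))),
        "imae": float(np.mean(np.abs(inv_err))),
        "rel": float(np.mean(np.abs(err) / g)),
    }
    # Fraction of valid pixels whose ratio is below each threshold.
    for k, t in enumerate(thresholds, start=1):
        metrics[f"delta_{k}"] = float(np.mean(ratio < t))
    return metrics
```

Note that the inverse-depth metrics depend on the unit of the input: passing depth in meters or in millimeters changes the scale of iRMSE and iMAE, so both arrays should use the unit expected by the benchmark being reproduced.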

Evaluation of depth maps is an open issue. The above metrics cannot precisely measure the quality of reconstructed compositional patterns such as objects. Therefore, researchers have also proposed new evaluation metrics. In [Hu2019RevisitingSI], object boundaries extracted from the depth map are measured. Koch et al. [Koch2020ComparisonOM] introduced the planarity error and the location accuracy of depth boundaries. Jiang et al. [jiang2021plnet] proposed two metrics for quantifying the flatness of planes and the straightness of lines in depth maps. However, owing to the lack of dense ground truth, such metrics are still difficult to apply to depth completion.

7 Experimental Analyses

In this section, we compare and review previous methods from several aspects. Specifically, we select some representative works from each category and elucidate their major characteristics, including network structure, loss function, learning strategy, model performance, etc. Table III and Table IV show a comparison of existing unguided and RGB guided methods on the KITTI dataset, respectively, where the RMSE values are taken from either the public KITTI benchmark or the original papers. Table V shows a comparison of RGB guided methods on the NYU-v2 dataset. We use the RMSE metric for performance comparison. We summarize our findings in the following sections.

7.1 Main Characteristics of Existing Methods

  1. Relatively few prior works perform completion from the sparse depth input alone. In comparison, most recent works are RGB guided, and the majority of them perform late fusion of RGB and depth images instead of early fusion.

  2. PyTorch is the most popular deep learning library for implementing depth completion methods. The overwhelming majority of previous studies implement their methods with PyTorch.

  3. KITTI is the most widely used evaluation benchmark. Almost all leading methods provide results on this dataset. NYU-v2 is the second most popular dataset. Since the depth maps of NYU-v2 are captured by Kinect, previous works simulate sparse input by randomly and uniformly sampling 200 or 500 pixels as valid depth points.

  4. More complicated neural network modules have been recently developed to advance the performance of depth completion models. For example, many methods propose to embed surface normal, affinity matrices, and residual maps into their network models.

  5. The learning objectives identified for depth completion tasks are intuitive and relatively straightforward to optimize. For example, many methods penalize just the $\ell_1$ or $\ell_2$ loss of depth maps, and still achieve good performance.
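The sampling protocol mentioned in item 3 can be sketched as follows. This is an illustrative example under stated assumptions, not the exact procedure of any cited method: the function name `sample_sparse_depth`, the use of 0 to mark missing depth, and the choice to sample only among pixels with valid depth are all hypothetical.

```python
import numpy as np


def sample_sparse_depth(dense_depth, num_samples=500, rng=None):
    """Keep `num_samples` uniformly random valid pixels and zero out the rest.

    Assumption: a value of 0 marks a pixel without a depth measurement.
    """
    rng = np.random.default_rng() if rng is None else rng
    dense_depth = np.asarray(dense_depth)
    # Flat (C-order) indices of pixels that carry a depth measurement.
    valid_idx = np.flatnonzero(dense_depth > 0)
    num_samples = min(num_samples, valid_idx.size)
    chosen = rng.choice(valid_idx, size=num_samples, replace=False)
    sparse = np.zeros_like(dense_depth)
    sparse.flat[chosen] = dense_depth.flat[chosen]
    return sparse
```

Passing a seeded generator, e.g. `np.random.default_rng(0)`, makes the sampled pattern reproducible, which matters when results are compared across runs at a fixed number of points.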

7.2 Unguided and Guided Methods

Unguided methods have two benefits. First, they are more robust to environments with light or weather changes since they take only sparse depth maps as inputs. Second, for the same reason, they are more computationally efficient. However, unguided methods show inferior performance due to the lack of semantic cues and the irregular distribution of captured depth points. As seen in Table III, the best unguided method [from_depth_what] yields an RMSE of 901.43 mm on the KITTI dataset. Note that [from_depth_what] also uses RGB images to guide model training. The best result obtained by a method that is RGB-free in both the training and inference stages is reported for HMS-Net [huang2019hms], with an RMSE of 937.48 mm. On the other hand, as seen in Table IV, the best RGB guided method, i.e., DySPN, achieves a significantly better result with an RMSE of 709.12 mm. Moreover, many RGB guided methods easily beat the best unguided approach. Specifically, except for 3coef [depth_coefficient], EncDec-Net[EF] [eldesokey2019confidence], Morph-Net [dimitrievski2018learningmorph] and CSPN [cspn], all other RGB guided methods with supervised learning outperform HMS-Net, showing the advantage of leveraging RGB information. Another difference is that unguided methods cannot utilize additional unsupervised losses derived from images, e.g., photometric loss.

| Method | Publication | Year | Type | Loss Function | Learning | RMSE (mm) | Platform | Code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3coef [depth_coefficient] | CVPR | 2019 | EFM/EDN | | S | 131 | TensorFlow | |
| Long et al. [long2021depth] | JVCIR | 2021 | EFM/EDN | | S | 100 | - | - |
| DFuseNet [shivakumar2019dfusenet] | ITSC | 2019 | LFM/EDN | + + | S&U | 219 | PyTorch | |
| MS-Net[LF] [eldesokey2019confidence] | TPAMI | 2019 | LFM/DEN | | S | 129 | PyTorch | |
| | ICCV | 2021 | LFM/DEN | | S&U | 105 | PyTorch | |
| SelfDeco [choi2021selfdeco] | ICRA | 2021 | LFM/DEN | | S&U | 178 | PyTorch | - |
| | TIP | 2020 | LFM/DEDN | | S | 101 | PyTorch | |
| | arXiv | 2021 | LFM/DEDN | | S | 90 | PyTorch | - |
| PwP [xu2019depth] | ICCV | 2019 | E3DR/ISNR | | S | 112 | PyTorch | - |
| DeepLiDAR [DeepLiDAR] | CVPR | 2019 | E3DR/ISNR | | S | 115 | PyTorch | |
| ACMNet [zhao2021adaptive] | TIP | 2021 | E3DR/3DAC | | S | 105 | PyTorch | |
| FCFR-Net [fcfr] | AAAI | 2020 | RDM | | S | 106 | - | - |
| KernelNet [liu2021learning] | TIP | 2021 | RDM | | S | 111 | PyTorch | |
| CSPN [cspn] | ECCV | 2018 | SPM | | S | 117 | PyTorch | |
| CSPN++ [cspn++] | AAAI | 2020 | SPM | | S | 115 | - | - |
| NLSPN [nlspn] | ECCV | 2020 | SPM | | S | 92 | PyTorch | |
| DySPN [Lin2022DynamicSP] | AAAI | 2022 | SPM | + | S | 90 | PyTorch | - |

TABLE V: Summary of essential characteristics of existing RGB guided methods on the NYU-v2 dataset. For denoting loss functions, we omit the coefficient of each loss term for simplicity. S and U denote supervised learning and unsupervised learning of models, respectively.

7.3 Comparison of RGB Guided Methods

For RGB guided methods, from Table IV, we can observe the following results:

  • Early fusion models generally underperform other types of methods.

  • For late fusion approaches, although a considerable number of methods are built on DEN, approaches [learning-guided, yan2021rignet] based on DEDN demonstrate more significant performance improvement.

  • Explicit 3D representation methods, SPN-based methods, and residual depth methods show more advanced performance and generally outperform other approaches.

More specifically, the top-10 performing methods on the KITTI dataset are (i) four SPN-based models: DySPN [Lin2022DynamicSP], PENet [penet], NLSPN [nlspn], and CSPN++ [cspn++]; (ii) two residual depth models: FCFR-Net [fcfr] and the method of [zhu2021robust]; (iii) two late fusion methods built on DEDN: RigNet [yan2021rignet] and GuideNet [learning-guided]; and (iv) two explicit 3D representation models: ACMNet [zhao2021adaptive] and 2D-3D FuseNet [2d-3d]. This suggests that naive fusion strategies, such as aggregating inputs at an early stage or concatenating features extracted by a dual-encoder network at a late stage, are not sufficient for achieving satisfactory performance. What the top-10 methods have in common is that they either explicitly model the geometric relationships among depth points (through 3D-aware convolution in ACMNet and 2D-3D FuseNet, refinement with a residual depth map in residual depth models, or refinement with an affinity matrix in SPN-based methods), or learn more effective guided kernels to weigh depth features with a complicated network design, as in RigNet and GuideNet.

Consistent results are also observed in analyses on the NYU-v2 dataset. As shown in Table V, the best results are demonstrated by DySPN and RigNet. Besides, GuideNet, ACMNet, FCFR-Net, and NLSPN also show improved performance compared to other methods.

Intuitively, the performance of depth completion has the potential to be further improved by aggregating core technical components of the above methods. For instance, by taking advantage of 3D representation networks and spatial propagation networks, we can not only learn the 3D relationship within the model in a feature space but also apply post-refinement with an affinity matrix in output space. In addition, we can also incorporate a DEDN with guided kernel learning into residual depth learning models. Such combinations are straightforward, but they can nevertheless be considered in practical applications that pursue high accuracy.

7.4 Results of Unsupervised Approaches

The bottom of Table IV shows methods trained with an unsupervised photometric loss. Results of purely unsupervised methods (without a depth consistency loss) are calculated by aligning the scale of the predicted depth map to the scale of the ground truth. First, for methods that do not leverage depth consistency, such as SS-S2D (d) [S-d-selfsuper] and ScaffFusion-U [wong2021scaffnet], we can see that purely unsupervised learning yields unsatisfactory performance. Second, we observe that methods leveraging both a depth consistency loss and an additional photometric loss are still inferior to supervised methods. As also discussed in Sec. 6.1, this is because these methods [wong2021scaffnet, wong2021unsupervised, ddp, song2021self, choi2021selfdeco] use sparser depth maps as ground truth with a density of than supervised methods with a density of .
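The text does not state which alignment procedure is used for the purely unsupervised results. One common choice in the depth estimation literature is median scaling, sketched below as an assumption; the function name `align_scale_median` and the use of 0 to mark missing ground truth are hypothetical.

```python
import numpy as np


def align_scale_median(pred, gt):
    """Rescale `pred` so that its median matches the ground truth median.

    Assumption: gt == 0 marks pixels without ground truth. Only pixels
    that are valid in both maps contribute to the two medians.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = (gt > 0) & (pred > 0)
    if not np.any(valid):
        raise ValueError("no valid pixels for scale alignment")
    scale = np.median(gt[valid]) / np.median(pred[valid])
    return pred * scale, float(scale)
```

Because the scale is estimated from the ground truth itself, metrics computed after such an alignment measure only the relative structure of the prediction and are not directly comparable to those of methods that predict metric depth.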

8 Open Challenges and Future Directions

8.1 Depth Mixing Problem

The depth mixing problem, also called the depth smearing problem, is attributed to the difficulty of correctly identifying pixels near object boundaries, and usually causes blurry edges and artifacts. To alleviate this problem, 3coef [depth_coefficient] formulates depth completion as a one-hot encoding problem by dividing a depth map into a set of bins with fixed depth ranges. Imran et al. [Imran2021DepthCW] isolate the foreground and background depths in occlusion-boundary regions and model them separately. NLSPN [nlspn] makes the network learn non-local relative neighbors such that the pixels can be separated during iterative propagation. A simpler way of achieving this separation is to leverage the K-nearest neighbor algorithm [2d-3d, zhao2021adaptive, dan-conv]. Besides, a boundary consistency network has been added after depth completion to encourage sharper and clearer boundaries [huang2019indoor, tao2021dilated]. However, this problem is still difficult for depth estimation tasks and needs further investigation.
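To make the binning idea concrete, the sketch below encodes each depth value as a one-hot vector over fixed-range bins and decodes it back to the bin center. It is a simplified illustration and not the exact formulation of [depth_coefficient]: the uniform bin layout, the function names, and the treatment of 0 as "no measurement" are assumptions.

```python
import numpy as np


def depth_to_one_hot(depth, d_min, d_max, num_bins):
    """Encode depth as one-hot vectors over `num_bins` uniform bins."""
    depth = np.asarray(depth, dtype=np.float64)
    edges = np.linspace(d_min, d_max, num_bins + 1)
    # Using only the interior edges maps every value to an index in
    # [0, num_bins - 1]; out-of-range values fall into the first or last bin.
    bin_idx = np.digitize(depth, edges[1:-1])
    one_hot = np.eye(num_bins, dtype=np.float32)[bin_idx]
    one_hot[depth <= 0] = 0.0  # pixels without a measurement get an all-zero code
    return one_hot, edges


def one_hot_to_depth(one_hot, edges):
    """Decode one-hot vectors back to depth using the bin centers."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    depth = centers[np.argmax(one_hot, axis=-1)]
    depth[one_hot.sum(axis=-1) == 0] = 0.0
    return depth
```

With such a representation, a pixel on an object boundary is assigned to either a foreground bin or a background bin instead of being regressed to an intermediate value between the two surfaces, which is the intuition behind treating the task as classification. The price is a quantization error of up to half a bin width.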

8.2 Flawed Ground Truth

Another problem is the existence of defects in ground truth depths. First, unlike semantic segmentation, none of the existing real-world datasets can provide pixel-wise ground truth because of the limitations of depth sensors. Although many existing methods are trained in a supervised way, most pixels cannot be sufficiently supervised. Second, the semi-dense annotations are not entirely reliable due to outliers caused by occlusions, dynamic objects, etc. To overcome the sparsity problem, some researchers [S-d-selfsuper, song2021self] turn to self-supervised frameworks to alleviate the lack of ground truth depths. To cope with the second problem, Zhu et al. [zhu2021robust] handle outliers by incorporating uncertainty estimation into the depth completion network. Besides, a few works [to_complete_or_to_estimate, multitask_gan] leverage synthetic datasets for model training. However, the domain gap between real-world and synthetic data prevents a wide application of these methods. Despite the above efforts, how to exclude the effects of unreliable depths remains an open issue, and there is still much room for improvement.

8.3 Light-weight Networks

Most previous methods have complex network structures with a large number of parameters. Moreover, many of them adopt a two-stage coarse-to-fine prediction. Thus, these methods are time-consuming and require substantial hardware resources. However, for applications such as autonomous driving and robotic navigation, computation resources are limited and real-time inference is required. Although a few prior studies [tao2021dilated, dan-conv, eldesokey2019confidence, bai2020depthnet] have partially considered the real-time problem, the number of model parameters is still about 10 times greater than that of models used in other vision tasks, e.g., monocular depth estimation [Wofk2019FastDepthFM, hu2021boosting]. Developing light-weight methods with fast inference speed, without sacrificing too much accuracy, has enormous potential for real-world deployment and is thus a valuable direction for future work.

8.4 Un/self-supervised Frameworks

As discussed before, un/self-supervised learning frameworks are commonly employed solutions in the absence of dense ground truths. As shown in Table IV, the accuracy of current un/self-supervised methods is still lower than that of supervised learning based methods, and thus there is much room for further improvement. In addition, this kind of method is not robust to dynamic objects, distant regions, etc. Therefore, improvements may be brought by leveraging more effective network structures for performing auxiliary tasks, such as pose estimation and outlier removal.

8.5 Loss Functions and Evaluation Metrics

Employing proper loss functions is also critical to achieving satisfactory performance for depth completion. Commonly used loss functions are usually defined as a weighted sum of the $\ell_1$ or $\ell_2$ loss and other auxiliary loss functions, e.g., smoothness loss and SSIM loss. However, as discussed in [depth_coefficient], both the $\ell_1$ and $\ell_2$ loss functions have their own drawbacks. The choice between them is usually dataset dependent. Similarly, current metrics cannot precisely measure the quality of scene structures. Although several new metrics have been introduced in [Hu2019RevisitingSI, depth_coefficient, Koch2020ComparisonOM, jiang2021plnet] for evaluating depth maps, they have not gained broad popularity. Thus, designing more effective loss functions and convincing evaluation metrics is also a potential future research direction.
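A weighted objective of the kind described above can be sketched as follows. This is a minimal NumPy illustration, not the loss of any particular method: the weights, the plain first-order smoothness term, and the function names are assumptions, and the SSIM term is omitted for brevity.

```python
import numpy as np


def masked_l1(pred, gt):
    """Mean absolute error over pixels with ground truth (gt > 0 assumed valid)."""
    valid = gt > 0
    if not np.any(valid):
        raise ValueError("no valid ground truth pixels")
    return float(np.mean(np.abs(pred[valid] - gt[valid])))


def masked_l2(pred, gt):
    """Mean squared error over pixels with ground truth."""
    valid = gt > 0
    if not np.any(valid):
        raise ValueError("no valid ground truth pixels")
    return float(np.mean((pred[valid] - gt[valid]) ** 2))


def smoothness(pred):
    """First-order smoothness: mean absolute horizontal and vertical gradients."""
    dx = np.abs(np.diff(pred, axis=1))
    dy = np.abs(np.diff(pred, axis=0))
    return float(dx.mean() + dy.mean())


def total_loss(pred, gt, w_l1=1.0, w_l2=0.0, w_smooth=0.1):
    """Weighted sum of the data terms and the auxiliary smoothness term."""
    return (w_l1 * masked_l1(pred, gt)
            + w_l2 * masked_l2(pred, gt)
            + w_smooth * smoothness(pred))
```

The two data terms weigh errors differently: the squared error is dominated by large residuals, such as outliers or distant points, whereas the absolute error treats all residuals linearly. This is one reason why the preferred choice can differ between datasets.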

8.6 Domain Adaptation

Current benchmark datasets face the challenge of a lack of reliable depth points. Moreover, the data is captured under ideal lighting conditions in limited scenarios. Thus, models trained on this type of data have no guarantee of generalizing to different working conditions and domains. Accordingly, it is reasonable to train deep networks in simulated environments, where we can have not only per-pixel ground truth but also changeable lighting or weather conditions across a great number of different scenarios. Moreover, simulation encourages the development of more advanced methods that are difficult to implement in the real world. The challenge is then how to transfer the model from simulated environments to real-world scenarios. A few works have explored domain adaptation methods for depth completion [lopez2020project, to_complete_or_to_estimate]. However, this problem remains under-explored and is worthy of further study.

8.7 Transformer-based Network Structures

Recently, vision transformers (ViT) have attracted extensive attention and have continuously set new state-of-the-art results for many perception tasks, including classification [DosovitskiyB0WZ21], semantic segmentation [Strudel2021SegmenterTF], object detection [zhang2021vit] and monocular depth estimation [Bhat2021AdaBinsDE]. Unlike CNNs, ViT receives a set of image patches as input and uses self-attention for local and global feature interactions. It may bring a paradigm shift for depth completion, in which more effective multi-modality data fusion and novel strategies for handling input sparsity may emerge.
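The two ingredients named above, patch tokens and self-attention, can be sketched in NumPy. This is a single-head, untrained toy example intended only to show the data flow: the function names are hypothetical, the projection matrices are random placeholders, and stacking RGB with sparse depth into a four-channel input is an assumption of this example rather than a design taken from the cited works.

```python
import numpy as np


def image_to_patches(image, patch):
    """Split an (H, W, C) array into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Each row of `weights` is a distribution over all patches, so every token can aggregate information from any other location in one step. This global interaction is what distinguishes the mechanism from a convolution with a local receptive field.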

8.8 Visualization and Interpretability

A few works have attempted to understand and visualize the mechanism of CNNs for monocular depth estimation. It is shown in [hu2019analysis, hu2019visualization, dijk2019neural] that CNNs tend to use some monocular cues from RGB images for inferring depths. In addition, it is observed in [you2021towards] that the features generated inside CNNs are highly disentangled and are activated by different depth ranges. An intriguing question is what changes when a few sparse depth points are available in the input. Exploring and answering this question is essential to the interpretation of learning based approaches, and has promising applications for improving the generalization ability (e.g., by facilitating domain adaptation) and the robustness of deep learning based depth completion methods.

8.9 Robustness to Different Sensors

Existing methods are only applicable to particular sensors. For instance, the most frequently used KITTI dataset is captured by a 64-line LiDAR. There is no guarantee that previous methods can be applied to lower scanline sensors, such as 32-line, 16-line, and 1-line LiDARs. As demonstrated by [lu2021sgtbn, S-d-selfsuper, Yoon2020BalancedDC, ryu2021scanline], the performance degradation is significant when moving from a 64-line sensor to lower scanline sensors. Hence, maintaining the same level of accuracy for lower scanline sensors is challenging. This under-explored problem is also of practical importance in real-world applications since higher scanline sensors are more expensive than lower ones. Therefore, ensuring the accuracy of learning based methods for various lower scanline sensors is also an important and valuable research topic.
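One simple way to approximate a lower scanline sensor from 64-line data is to keep only a subset of the scanlines. The sketch below assumes that a per-point scanline (ring) index is available, which is not the case for every dataset and may have to be estimated from the elevation angle; the function name and parameters are hypothetical, and the cited works may use different protocols.

```python
import numpy as np


def subsample_scanlines(points, ring, keep_every=4, offset=0):
    """Keep points whose ring index satisfies ring % keep_every == offset.

    points: (N, 3) array of LiDAR points.
    ring:   (N,) integer scanline index per point (assumed to be given).
    With 64 rings, keep_every=4 simulates a 16-line sensor.
    """
    points = np.asarray(points)
    ring = np.asarray(ring)
    mask = (ring % keep_every) == offset
    return points[mask], ring[mask]
```

The retained points can then be projected into the image plane in the usual way to produce the sparse depth input, which makes it possible to evaluate a model trained on 64-line data against sparser patterns.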

9 Conclusion

In this article, we present a comprehensive survey of deep learning based depth completion methods. Our review covers traditional and state-of-the-art network structures, loss functions, learning strategies, benchmark datasets, and evaluation metrics. To depict the evolution process and draw the connections between existing works, we provide a fine-grained taxonomy that categorizes existing methods by jointly considering network structures and main technical contributions. Moreover, we visualize the main characteristics of existing methods as well as their quantitative performance on the most popular benchmark datasets to provide an intuitive and straightforward comparison. We then perform in-depth analyses that summarize their performances, similarities, and differences. Finally, we provide open challenges and promising future research directions. Through the above efforts, we hope our work can help readers navigate this field.