1 Introduction
Multisensor systems that capture both image and depth information have recently gained increasing attention and are widely used in navigation applications such as autonomous driving and robotics. For these applications, accurate and dense depth information is crucial for obstacle avoidance [1], object detection [2, 3], and 3D scene reconstruction [4]. On an autonomous driving perception platform, sensor fusion requires the sensors to be time-synchronized. However, LiDAR sensors have an inherent limitation: they provide dependable 3D spatial information only at a low frequency (around 10 Hz). To achieve time synchronization, the frame rate of the camera (around 20 Hz) has to be reduced, which makes the multisensor system inefficient. In addition, LiDAR sensors obtain only sparse depth measurements, e.g., 64 scan lines in the vertical direction. Such low-frequency and sparse depth sensing is insufficient for practical applications. Therefore, for synchronous sensing in multisensor systems, it is promising to increase the frequency of the LiDAR data to match the higher frequency of the camera. High-frequency and dense point cloud sequences are of great significance in high-speed and complicated application scenarios.
Due to the huge volume of the point clouds captured by LiDAR, processing and learning directly in 3D space is time-consuming. PLIN [5] presents the first pipeline for the PseudoLiDAR point cloud interpolation task, in which the PseudoLiDAR point cloud is obtained from the interpolated dense depth map and the camera parameters. PLIN increases the frequency of LiDAR sensors by interpolating between adjacent point clouds, which addresses the frequency mismatch between LiDAR and camera. Using a coarse-to-fine architecture, PLIN progressively perceives multimodal information and generates the intermediate PseudoLiDAR point clouds. However, PLIN has the following limitations.
1) The spatial motion information is derived from the 2D optical flow between the color images of adjacent frames. However, 2D optical flow only represents the displacement of pixels in the image plane and cannot represent movement in 3D space. Thus, the motion relationship used in PLIN leads to inferior temporal interpolation of the point cloud sequence.
2) PLIN only supervises the generation of intermediate depth maps during training. Consequently, the quality of the generated point clouds depends solely on the synthesized depth maps. Moreover, the network imposes no spatial constraint on the point cloud generation and does not measure the quality of the point cloud.
3) The fusion of multimodal features is simplistic. PLIN merely concatenates the texture and depth features and feeds them into an interpolation network. This type of fusion cannot emphasize the effective passing of complementary information between different features.
Our work addresses these limitations. In this paper, we present a novel network that improves the motion representation and spatial supervision for PseudoLiDAR point cloud interpolation. In particular, since optical flow does not describe motion in 3D space, we use the scene flow to guide the generation of the PseudoLiDAR point cloud. The scene flow represents a 3D motion field between two consecutive stereo pairs, and we design a spatial motion perception module to estimate it. In addition, we implement a point cloud reconstruction loss to constrain the interpolation of the PseudoLiDAR point cloud, which enables us to generate results with a more realistic spatial distribution.
For the architecture of our approach, we design a dual-branch structure consisting of texture completion and temporal interpolation. In the texture completion branch, the intermediate color image provides rich textures for dense depth map generation. In the temporal interpolation branch, we apply a warping layer to two adjacent point clouds and the estimated scene flow to synthesize the intermediate depth map. To facilitate the efficient fusion of texture and depth features, we introduce a multimodal deep aggregation module. Benefiting from the improved motion representation, loss function, and model structure, our approach achieves significant improvements on the PseudoLiDAR point cloud interpolation task. As illustrated in Fig. 1, we compare the depth maps and point clouds interpolated by PLIN and by our approach; our results are more accurate in appearance than those of PLIN and denser than the original data.
The contributions of this work are summarized as follows.

Considering the full representation of 3D motion information, we design a spatial motion perception module to guide the generation of PseudoLiDAR point cloud.

We design a reconstruction loss function to supervise and guide the generation of PseudoLiDAR point cloud in 3D space, and further introduce a quality metric of the point cloud.

We propose a multimodal deep aggregation module to effectively fuse the feature of the texture completion branch and temporal interpolation branch.
The remainder of this paper is organized as follows. In Sec. 2, we describe the related work of point cloud interpolation. In Sec. 3, we introduce the overall model structure and describe each module in detail. Finally, we conduct experiments, and the performance and the results are presented in Sec. 4.
2 Related Work
In this section, related works on the topic of depth estimation, depth completion, and video interpolation will be discussed.
2.1 Depth Estimation
Depth estimation focuses on obtaining the depth information of each pixel from a single color image. Earlier works used traditional methods [6, 7], which applied handcrafted features to infer the depth of color images in probabilistic graphical models. Recently, with the popularity of convolutional neural networks in image segmentation and detection, learning-based methods have been applied to the depth estimation task. Among supervised methods, Eigen et al. [8] adopted a multi-scale convolutional architecture to obtain depth information. Li et al. [9] used a conditional random field (CRF) post-processing refinement step to perform regression on the features and obtain high-quality depth output. For unsupervised methods, the supervision is provided by view synthesis. Xie et al. [10] constructed a stereo pair and estimated an intermediate disparity image by generating the corresponding view. To further improve the performance, [11] used geometric constraints to enforce the consistency of the disparities between the left and right images. However, due to the inherent ambiguity and uncertainty of depth information obtained from color images, the depth maps produced by these depth estimation methods are still too inaccurate for navigation systems.
2.2 Depth Completion
In contrast to depth estimation, the depth completion task aims to obtain an accurate dense depth map from a sparse depth map and, where available, image data. Depth completion covers a series of problems related to different input modalities. When the input is a relatively dense depth map that contains missing values, depth completion can be addressed with a variety of techniques, such as exemplar-based depth inpainting [12, 13], object-aware interpolation [14], Fourier-based depth filling [15], and depth enhancement [16].
LiDAR is an indispensable sensor in 3D sensing systems such as autonomous driving and robots. When the acquired LiDAR depth data is projected into the 2D image space, the available depth information covers only a small fraction of the image pixels [17]. To make better use of such sparse depth measurements, various methods use the sparse depth values to estimate dense and accurate depth maps. For example, Uhrig et al. [17] proposed a simple and effective sparse convolutional layer that takes data sparsity into account. Ma et al. [18] treated depth completion as a regression problem and fed the color image and sparse depth map into an encoder-decoder structure. A similar method in [19] used a self-supervised training mechanism to achieve the completion task. To propagate confidence through consecutive convolutional layers, [20] proposed a constrained normalized convolution operator. [21] proposed boundary constraints to enhance the structure and quality of the depth map. [22] jointly learned the semantic segmentation and completion tasks to improve performance. In [23], surface normal estimation is used for the depth completion task. Chen et al. [24] designed an effective network fusion block that jointly learns 2D and 3D representations. Compared with these spatial depth completion methods, our method generates point cloud sequences of high quality both temporally and spatially.
2.3 Video Interpolation
In the field of video processing, video interpolation is a popular research topic. Video frame interpolation aims to synthesize nonexistent frames from the original adjacent frames, which makes it possible to generate high-quality slow-motion videos from existing videos. Liu et al. [25] proposed a deep voxel flow network that synthesizes video frames by flowing the pixel values of existing frames. To achieve real-time temporal interpolation, Peleg et al. [26] adopted an economical structured framework and treated the interpolation task as a classification problem rather than a regression problem. Jiang et al. [27] jointly modeled motion interpretation and occlusion inference to achieve variable-length multi-frame video interpolation. Bao et al. [28] proposed a depth-aware flow projection layer that uses depth information to guide occlusion detection in the video frame interpolation task. Although video frame interpolation has been studied extensively, the point cloud sequence interpolation task has received little attention due to the huge volume and complicated structure of point clouds.
PLIN [5] is the first work to interpolate the intermediate PseudoLiDAR point cloud given two consecutive stereo pairs. It uses a coarse-to-fine network structure to facilitate the perception of multimodal information such as optical flow and color images. Compared with PLIN, our approach uses an improved motion representation, training loss function, and model structure, achieving significant improvements on the PseudoLiDAR point cloud interpolation task.
3 Method
In this section, we introduce the overall model structure and describe each module in detail. Given two consecutive sparse depth maps ($D_{t-1}$ and $D_{t+1}$) and an RGB image ($I_t$), PseudoLiDAR point cloud interpolation aims to produce an intermediate dense depth map $D_t$, which is back-projected to obtain the intermediate PseudoLiDAR point cloud ($P_t$) using known camera parameters. As illustrated in Fig. 2, the whole framework mainly consists of two branches: the texture completion branch and the temporal interpolation branch. The texture completion branch takes the image and the consecutive sparse depth maps as inputs and outputs feature maps encoded with rich textures. One of these feature maps is further combined with the sparse depth maps and the scene flow estimated by the spatial motion perception module, generating feature maps guided by spatial motion information. The feature maps derived from the two branches are then integrated by the fusion layer to produce the intermediate dense depth map. Finally, the intermediate PseudoLiDAR point cloud is generated by back-projecting the intermediate dense depth map.
3.1 Texture Completion Branch
Because most of its pixel values are missing, a sparse depth map can hardly represent detailed contextual relationships. The rich texture information of color images therefore helps depth prediction, especially in boundary regions. Many works estimate depth from color images, which indicates that color images provide useful cues for depth inference. In our work, to extract texture and semantic features, the adjacent sparse depth maps and the color image are concatenated and fed into the texture completion branch. Moreover, the output of the texture completion branch can be used as a prior to guide the temporal interpolation branch.
We consider the interpolation of the PseudoLiDAR point cloud as a regression problem. The texture completion network implements an encoder-decoder structure with skip connections. We concatenate the color image and the adjacent sparse depth maps into a tensor, which is fed into the residual blocks of the encoder. In the encoder, the backbone network is the residual network ResNet34 [29]. In the decoder, the low-dimensional feature maps are upsampled to the same size as the original feature map through five deconvolution operations. In addition, multiple skip connections are used to combine low-level features with high-level features. Except for the last convolutional layer, ReLU and batch normalization are applied after all convolutions. Finally, the last layer of the texture completion branch uses a convolution to reduce the multi-channel feature map to a 3-channel feature map. Note that these features only contain texture and structure information and cannot yet describe accurate motion information. Thus, we introduce a spatial motion perception module to further improve the interpolation performance.
3.2 Spatial Motion Perception Module
In the video interpolation task, optical flow is indispensable since it captures the motion relationship between adjacent frames. Optical flow represents displacement in the 2D image plane, while scene flow is a motion field in 3D space. As the counterpart of optical flow in three-dimensional space, scene flow represents the real spatial motion of objects more explicitly. In PLIN, the optical flow between color images is used to represent the motion relationship of depth maps, but optical flow only captures the displacement of pixels in the image plane and cannot fully describe motion in real 3D space. Therefore, our approach exploits the scene flow to generate a more realistic PseudoLiDAR point cloud. As shown in Fig. 3, we conducted a comparative experiment: with the same network structure, we use optical flow and scene flow separately to guide the interpolation of the PseudoLiDAR point cloud. The results show that the scene flow helps generate a more realistic point cloud. Compared with the point cloud guided by optical flow, the point cloud guided by scene flow has a more reasonable shape, which we attribute to the better motion representation.
The scene flow estimation method can be described as follows. The input is two adjacent point clouds: $P_{t-1}$ at time $t-1$ and $P_{t+1}$ at time $t+1$. A point cloud is a set of points $\{x_i \in \mathbb{R}^3\}_{i=1}^{n}$, where $n$ is the number of points, and each point may also carry attribute features $f_i \in \mathbb{R}^c$, where $c$ is the dimension of the attribute feature, such as the reflection intensity, color, and normal. The output is the estimated scene flow for each point in $P_{t-1}$. FlowNet3D [30] estimates motion on top of PointNet++ [31]; it processes each point and aggregates information through pooling layers. Our scene flow estimation network is based on FlowNet3D and improved bilateral convolutional layer operations, which restore spatial information from unstructured point clouds. The input of the network is the 3D point clouds of consecutive frames, and the output is the corresponding displacement of each point. The scene flow computation can be expressed as
$$F_{t-1 \to t+1} = \mathcal{H}(P_{t-1}, P_{t+1}) \tag{1}$$
where $P_{t-1}$ and $P_{t+1}$ denote the input point clouds of adjacent frames, $\mathcal{H}$ is the scene flow estimation function, and $F_{t-1 \to t+1}$ refers to the estimated scene flow.
Since the input of the scene flow estimation task is a pair of adjacent point clouds, we first convert the adjacent depth maps into point clouds using the prior camera parameters. The specific transformation formulas are discussed in Section 3.4. The adjacent point clouds are then fed into our spatial motion perception module to estimate the scene flow. Our scene flow estimation network resembles an encoder-decoder structure. In the downsampling stage, we adopt a dual-input structure in which all layers share weights to extract the features of the point clouds, and we stack improved bilateral convolutional layers (BCL) [32] to progressively reduce the scale. We also fuse features of different scales. In the upsampling stage, we gradually increase the scale by stacking improved bilateral convolutional layers to refine the prediction. In each BCL, we consider the relative position of the input. Finally, the scene flow is obtained. We use a warping operation on $P_{t-1}$ or $P_{t+1}$ to synthesize the point cloud $\tilde{P}_t$ at time $t$, which can be expressed as
$$\tilde{P}_t = P_{t-1} + F_{t-1 \to t} \tag{2}$$
or
$$\tilde{P}_t = P_{t+1} + F_{t+1 \to t} \tag{3}$$
where $F_{t-1 \to t}$ and $F_{t+1 \to t}$ denote the estimated scene flow from the respective adjacent frame to time $t$.
To speed up spatial information sensing, we project the obtained intermediate point cloud onto the 2D image plane, which yields an accurate but sparse intermediate depth map. To effectively integrate multimodal features and generate an accurate and dense depth map, we introduce a multimodal deep aggregation module that facilitates the efficient fusion of texture and depth features.
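The warping and projection steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the linear-motion factor `alpha=0.5` (moving each point half-way along a flow vector estimated between the two adjacent frames), and the toy intrinsics are assumptions made only for illustration.

```python
import numpy as np

def warp_points(points, flow, alpha=0.5):
    """Move each point along a fraction `alpha` of its scene-flow vector.

    alpha=0.5 assumes linear motion between the two adjacent frames
    (an assumption of this sketch, not a statement of the paper's method).
    """
    return points + alpha * flow

def project_to_depth(points, f_u, f_v, c_u, c_v, height, width):
    """Project (N, 3) camera-frame points to a sparse (height, width) depth map."""
    depth = np.zeros((height, width), dtype=np.float64)
    pts = points[points[:, 2] > 0]                  # keep points in front of the camera
    u = np.round(pts[:, 0] * f_u / pts[:, 2] + c_u).astype(int)
    v = np.round(pts[:, 1] * f_v / pts[:, 2] + c_v).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], pts[inside, 2]
    order = np.argsort(-z)                          # write far points first so the nearest one wins
    depth[v[order], u[order]] = z[order]
    return depth

# Toy data: two points and a hypothetical flow between the adjacent frames.
points_prev = np.array([[0.0, 0.0, 10.0], [2.0, 1.0, 10.0]])
flow = np.array([[0.0, 0.0, -2.0], [2.0, 0.0, 0.0]])
warped = warp_points(points_prev, flow)
sparse_depth = project_to_depth(warped, f_u=10.0, f_v=10.0, c_u=4.0, c_v=2.0,
                                height=4, width=8)
```

Only pixels hit by a warped point receive a depth value, which is why the resulting map is accurate but sparse and still needs the texture branch to be densified.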
3.3 Multimodal Deep Aggregation Module
To generate an accurate and dense depth map, we design a multimodal deep aggregation module to fuse the feature maps of the texture completion branch and the temporal interpolation branch. The texture features guide the network to pay more attention to salient objects, which contain clearer structure and edge information. The depth features, in turn, provide precise spatial information derived from the estimated scene flow.
In particular, we adopt a stacked aggregation unit architecture for the multimodal deep aggregation module. It consists of three aggregation units, each of which has a top-down and a bottom-up process. Inspired by ResNet [29], we use residual learning between aggregation units. In addition, skip connections are applied to merge low-level features into high-level features of the same dimension.
In each aggregation unit, the encoder and the decoder are each composed of three convolutional layers. The encoder uses two strided convolutions to downsample the feature resolution to 1/4 of the original size; the decoder uses two deconvolution operations to upsample the features fused from the encoder and the previous network block. Considering the sparseness of the data, the encoder in the first network block does not use batch normalization after convolution. All convolutional layers use a 3×3 convolution kernel with a small receptive field. The output of the multimodal deep aggregation module is a 2-channel feature map containing dense spatial distribution information.
At the end of the dual-branch architecture, we use a fusion layer to further combine the different feature maps and obtain the final result. The fusion layer consists of three convolutional layers with 32, 32, and 1 filters, respectively. Except for the last layer, each convolutional layer is followed by batch normalization and a ReLU activation function.
3.4 Back-Projection
In this part, we obtain the 3D point cloud by back-projecting the generated intermediate depth map into 3D space. According to the pinhole camera model, if the depth value $D(u, v)$ of a pixel $(u, v)$ in the image is known, we can derive the corresponding 3D position $(x, y, z)$. The relationship is described as follows:
$$z = D(u, v) \tag{4}$$
$$x = \frac{(u - c_U)\, z}{f_U} \tag{5}$$
$$y = \frac{(v - c_V)\, z}{f_V} \tag{6}$$
where $f_V$ and $f_U$ are the vertical and horizontal focal lengths, respectively, and $(c_U, c_V)$ is the center of the camera aperture. Based on the prior camera parameters, the generated depth map is back-projected into a 3D point cloud. Since this point cloud is obtained by transforming the depth map, we refer to it as a PseudoLiDAR point cloud [33].
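As a minimal sketch of the back-projection described above, the following NumPy function converts a depth map into an (N, 3) point cloud under the pinhole model. The argument names (`f_u`, `f_v`, `c_u`, `c_v`) and the toy intrinsics are illustrative assumptions, not KITTI calibration values; pixels with non-positive depth are treated as missing.

```python
import numpy as np

def backproject_depth(depth, f_u, f_v, c_u, c_v):
    """Back-project a depth map of shape (H, W) into an (N, 3) point cloud.

    Pixels with depth <= 0 are treated as missing and skipped.
    """
    v, u = np.nonzero(depth > 0)      # row (v) and column (u) indices of valid pixels
    z = depth[v, u]                   # depth value along the optical axis
    x = (u - c_u) * z / f_u           # horizontal coordinate
    y = (v - c_v) * z / f_v           # vertical coordinate
    return np.stack([x, y, z], axis=1)

# Toy example with made-up intrinsics.
depth = np.zeros((4, 6), dtype=np.float64)
depth[2, 3] = 10.0                    # pixel at the principal point
depth[1, 5] = 20.0
points = backproject_depth(depth, f_u=2.0, f_v=2.0, c_u=3.0, c_v=2.0)
```

A pixel at the principal point maps to a point on the optical axis, while pixels farther from the centre are pushed outwards in proportion to their depth.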
3.5 Loss Function
Previous work only supervises the generated dense depth maps and does not constrain the 3D structure of the target point cloud. To address this, we design a point cloud reconstruction loss to supervise the generation of PseudoLiDAR point clouds. Constructing the distance function between the predicted point cloud and the ground truth point cloud is an important step. A suitable distance function should meet at least two conditions: 1) it is differentiable; 2) it is efficient to compute, since data is forwarded and back-propagated many times [34]. The training objective can be expressed as
$$L(\{P_i^{pred}\}, \{P_i^{gt}\}) = \sum_i d(P_i^{pred}, P_i^{gt}) \tag{7}$$
where $P_i^{pred}$ and $P_i^{gt}$ indicate the prediction and ground truth of the $i$-th sample, respectively, and $d(\cdot, \cdot)$ is the distance function.
We need a distance metric to minimize the difference between the generated point cloud and the ground truth point cloud. There are two candidates: the Earth Mover’s distance (EMD) and the Chamfer distance (CD).
Earth Mover’s distance: for two point sets $S_1, S_2 \subseteq \mathbb{R}^3$ of the same size, the EMD can be defined as
$$d_{EMD}(S_1, S_2) = \min_{\phi: S_1 \to S_2} \sum_{x \in S_1} \|x - \phi(x)\|_2 \tag{8}$$
where $\phi: S_1 \to S_2$ is a bijection. EMD is differentiable almost everywhere, but its exact computation is expensive for learning models.
Chamfer distance: between two point sets $S_1, S_2 \subseteq \mathbb{R}^3$, it can be defined as
$$d_{CD}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2^2 \tag{9}$$
For each point, this algorithm finds the nearest point in the other point set and sums up the squared distances. The nearest-point search is independent for each point and easily parallelized. To speed up the search, a spatial data structure such as a KD-tree can be applied. Since EMD requires both point sets to have the same number of points, we use the simple and effective CD as our reconstruction loss to evaluate the similarity between the generated point cloud and the ground truth point cloud. Our reconstruction loss is formulated as
$$L_{rec} = d_{CD}(P^{pred}, P^{gt}) \tag{10}$$
where $d_{CD}$ denotes the Chamfer distance metric, and $P^{pred}$ and $P^{gt}$ are the predicted and ground truth point clouds, respectively.
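To make the two candidate distances concrete, here is a small NumPy sketch. It is illustrative only: the Chamfer distance uses a dense pairwise distance matrix (quadratic memory, so a KD-tree or GPU kernel would be preferable for real point clouds), and the EMD is computed by brute force over all bijections, which is feasible only for a handful of points. The function names are assumptions of this sketch.

```python
import itertools
import numpy as np

def chamfer_distance(s1, s2):
    """Sum of squared nearest-neighbour distances in both directions."""
    d2 = np.sum((s1[:, None, :] - s2[None, :, :]) ** 2, axis=-1)   # shape (n1, n2)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

def emd_bruteforce(s1, s2):
    """Exact EMD for two tiny, equally sized point sets (tries every bijection)."""
    assert len(s1) == len(s2), "EMD as defined here needs equally sized sets"
    dist = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=-1)
    n = len(s1)
    return min(sum(dist[i, perm[i]] for i in range(n))
               for perm in itertools.permutations(range(n)))

s1 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
s2 = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
s3 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])

cd_equal_size = chamfer_distance(s1, s2)   # works for equally sized sets
cd_diff_size = chamfer_distance(s1, s3)    # and for sets of different size
emd_value = emd_bruteforce(s1, s2)         # requires equal size
```

The example also shows the practical difference mentioned in the text: the Chamfer distance accepts point sets of different sizes, whereas the bijection-based EMD does not.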
In addition to the point cloud supervision, we also perform the 2D supervision on dense depth maps. We use loss for the generated depth map and ground truth :
(11) 
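Our entire loss function is a linear combination of the point cloud reconstruction loss and the depth map reconstruction loss, which can be expressed as
$$L = w_1 L_{rec} + w_2 L_{d} \tag{12}$$
where $L_{rec}$ is the point cloud reconstruction loss, $L_{d}$ is the depth map reconstruction loss, and $w_1$ and $w_2$ are the balance weights, which are set empirically.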
4 Experiments
In this section, we conduct extensive experiments to verify the effectiveness of our proposed approach. We compare with previous works and perform a series of ablation studies to show the effectiveness of each module. Since the main application of our model is onboard LiDAR in a multisensor system, our experiments are based on the KITTI dataset [35]. As illustrated in Fig. 4, the depth maps obtained by our approach show visually clear boundaries and are denser than the ground truth dense depth maps.
4.1 Experimental Setting
Dataset: Our experiments are performed on the KITTI depth completion dataset and the raw dataset. The KITTI dataset contains 86,898 frames of training data, 6,852 frames of evaluation data, and 1,000 frames of test data. This dataset provides sparse depth maps and color images. Each frame contains LiDAR scan data and an RGB color image, and the sparse depth map corresponds to the projection of the 3D LiDAR scan point cloud. The ground truth corresponding to each sparse depth map is a relatively dense depth map. Our application scenario is outdoor onboard LiDAR, which generally involves relative motion. Since the training dataset contains scenes in which consecutive frames are static, we select 48,000 frames with obvious motion.
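Evaluation Metrics: Although our task is not depth completion, we can use the evaluation metrics of depth completion to evaluate the quality of the generated dense depth map. There are four evaluation metrics in the depth completion task: the root mean squared error (RMSE), the mean absolute error (MAE), the root mean squared error of the inverse depth (iRMSE), and the mean absolute error of the inverse depth (iMAE). We mainly focus on the RMSE when comparing methods because it directly measures the error in depth, penalizes larger errors, and is the leading metric for depth completion. The four metrics are defined as follows, where $V$ is the set of pixels with valid ground truth, and $d_v^{gt}$ and $d_v^{pred}$ are the ground truth and predicted depth at pixel $v$.

Root mean squared error (RMSE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{|V|} \sum_{v \in V} \left| d_v^{gt} - d_v^{pred} \right|^2} \tag{13}$$
Mean absolute error (MAE):
$$\mathrm{MAE} = \frac{1}{|V|} \sum_{v \in V} \left| d_v^{gt} - d_v^{pred} \right| \tag{14}$$
Root mean squared error of the inverse depth [1/km] (iRMSE):
$$\mathrm{iRMSE} = \sqrt{\frac{1}{|V|} \sum_{v \in V} \left| \frac{1}{d_v^{gt}} - \frac{1}{d_v^{pred}} \right|^2} \tag{15}$$
Mean absolute error of the inverse depth [1/km] (iMAE):
$$\mathrm{iMAE} = \frac{1}{|V|} \sum_{v \in V} \left| \frac{1}{d_v^{gt}} - \frac{1}{d_v^{pred}} \right| \tag{16}$$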
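The four depth metrics can be computed with a few lines of NumPy. The sketch below assumes that both depth maps are given in millimetres, that a ground truth value of 0 marks a missing pixel, and that the prediction is strictly positive wherever the ground truth is valid; the function name and the dictionary keys are illustrative.

```python
import numpy as np

def depth_metrics(pred_mm, gt_mm):
    """RMSE/MAE in mm and iRMSE/iMAE in 1/km over pixels with valid ground truth."""
    valid = gt_mm > 0                       # ground truth of 0 means "no measurement"
    pred, gt = pred_mm[valid], gt_mm[valid]
    err = pred - gt
    # A depth of d millimetres is d * 1e-6 kilometres, so inverse depth is 1e6 / d in 1/km.
    inv_err = 1e6 / pred - 1e6 / gt
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "irmse": float(np.sqrt(np.mean(inv_err ** 2))),
        "imae": float(np.mean(np.abs(inv_err))),
    }

gt = np.array([[0.0, 1000.0], [2000.0, 4000.0]])      # one missing pixel
pred = np.array([[500.0, 1000.0], [2500.0, 4000.0]])  # one pixel is off by 0.5 m
metrics = depth_metrics(pred, gt)
```

Because the squared term weights large deviations more strongly, a single badly predicted pixel raises the RMSE more than the MAE, which is why RMSE is treated as the leading metric.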
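To evaluate the quality of the generated point cloud, we introduce an additional evaluation metric, the Chamfer distance (CD) between the generated point cloud $S_1$ and the ground truth point cloud $S_2$:
$$d_{CD}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2^2 \tag{17}$$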
Implementation Details: The upper part of each depth map contains only zeros and therefore provides no depth information. Hence, all our data (RGB images, sparse depth maps, and dense depth maps) are cropped from the bottom to a uniform size of 1216×256. Data augmentation, such as random flips and color adjustments, is also applied to the training data. For the scene flow computation, we randomly sample 17,500 points from the point cloud of each frame as the input of the scene flow network, which is designed based on HPLFlowNet [36]. The Adam optimizer is used during training, and the initial learning rate is decayed by a factor of 0.1 every 4 epochs. We implement our network in PyTorch [37] and train it on a 1080Ti GPU with a batch size of 2 for about 60 hours.

Table I: Ablation study. Each configuration adds one component to the configuration in the previous row.
Configuration         RMSE [mm]  MAE [mm]  iRMSE [1/km]  iMAE [1/km]  CD
Baseline              1408.80    513.06    7.63          3.01         0.21
+Aggregation module   1224.91    409.69    4.69          1.95         0.16
+Scene flow           1124.76    382.15    4.39          1.89         0.14
+Reconstruction loss  1091.99    371.56    4.21          1.83         0.12
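The cropping and point sampling steps from the implementation details can be sketched as follows. This is a hedged illustration, not the authors' code: the text only states that data are cropped from the bottom, so the horizontal centring, the function names, and the fallback of sampling with replacement for clouds with fewer than 17,500 points are assumptions of this sketch.

```python
import numpy as np

def bottom_crop(image, out_h=256, out_w=1216):
    """Keep the bottom `out_h` rows and the horizontally centred `out_w` columns.

    Assumes the input is at least out_h x out_w; works for (H, W) and (H, W, C).
    """
    h, w = image.shape[:2]
    left = (w - out_w) // 2
    return image[h - out_h:, left:left + out_w]

def sample_points(points, n=17500, rng=None):
    """Randomly pick `n` points; sample with replacement only if the cloud is too small."""
    rng = np.random.default_rng() if rng is None else rng
    replace = len(points) < n
    idx = rng.choice(len(points), size=n, replace=replace)
    return points[idx]

rng = np.random.default_rng(0)
depth = np.arange(352 * 1216, dtype=np.float64).reshape(352, 1216)
rgb = np.zeros((375, 1242, 3), dtype=np.uint8)
depth_crop = bottom_crop(depth)
rgb_crop = bottom_crop(rgb)

large_cloud = rng.normal(size=(20000, 3))
small_cloud = rng.normal(size=(100, 3))
sampled_large = sample_points(large_cloud, rng=rng)
sampled_small = sample_points(small_cloud, rng=rng)
```

Fixing the number of sampled points gives the scene flow network inputs of a constant size regardless of how many LiDAR returns a frame contains.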
4.2 Ablation Study
We perform an extensive ablation study to verify the effectiveness of each module. The performance comparison of the proposed approach is shown in Table I. Specifically, we perform four ablation experiments, each of which is based on the addition of a new network element or module to the previous network configuration.
As listed in Table I, the result shows that the complete network achieves the best interpolation performance. For the baseline network, we take two consecutive sparse depth maps and the intermediate color image as the input of the texture completion branch and obtain the intermediate dense depth map. By comparing the experimental results, we have the following observations: 1) Without the spatial motion guidance, our multimodal deep aggregation module can also produce good interpolation results, as it combines the features of the dual branch and is more conducive to the fusion of features. 2) Under the guidance of the scene flow containing motion information, we have greatly improved the performance of interpolation. This benefits from a better representation of spatial motion information. 3) Point cloud reconstruction constraints also further improve the interpolation performance. It can be observed that the value of our evaluation metrics decreases as the number of modules increases, which also proves the effectiveness of each of our network modules. To intuitively compare these different performances, we visualize the interpolated results of two scenes obtained by the above methods in Fig. 5. The complete network generates the most realistic details and distributions of the intermediate point cloud. Note that in the enlarged area, the shape distribution of the car obtained by the complete network is the most similar to the ground truth.
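Table II: Quantitative comparison with other interpolation methods on the KITTI depth completion dataset.
Method                     RMSE [mm]  MAE [mm]  iRMSE [1/km]  iMAE [1/km]  CD
Traditional Interpolation  12552.46   3868.80   -             -            -
Super Slomo [27]           16055.19   11692.72  -             -            27.38
PLIN [5]                   1168.27    546.37    6.84          3.68         0.21
Ours                       1091.99    371.56    4.21          1.83         0.12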
4.3 Comparison with State-of-the-Art
We evaluate our model on the KITTI depth completion dataset. Table II shows the comparison with other state-of-the-art point cloud interpolation methods. Since PLIN is the pioneering work in this field, it is our main point of comparison. In addition, we compare with the traditional average depth interpolation method and a video interpolation method. For the video frame interpolation method, the Super Slomo [27] network is retrained on the depth completion dataset.
Quantitative Comparison. Table II reports quantitative results comparing our proposed approach with existing methods. The results show that our approach is superior to the other methods in learning point cloud interpolation from RGB-D data. In particular, we achieve state-of-the-art results in all metrics. For the traditional method, the intermediate depth map is obtained by averaging consecutive depth maps. Its poor performance is understandable because the same pixel in consecutive depth maps generally does not correspond to the same scene point. For the video frame interpolation method, the motion relationship between depth maps cannot be obtained, so it is difficult to generate satisfactory results. Guided by the color images and bidirectional optical flow, PLIN is designed for the task of point cloud interpolation and achieves good performance, but it lacks point cloud supervision and a spatial motion representation. Compared with these methods, our approach improves PseudoLiDAR point cloud interpolation by adopting the scene flow, a 3D space supervision mechanism, and the multimodal deep aggregation module. As a result, our approach outperforms these existing methods.
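A toy example may help to see why averaging consecutive depth maps performs poorly. The sketch below is an assumption-laden illustration: the handling of missing pixels (keep whichever value is valid) and the one-row scene are invented for the example and are not taken from the evaluated baseline.

```python
import numpy as np

def average_depth(d_prev, d_next):
    """Pixel-wise average; where only one map has a value (> 0), keep that value."""
    both = (d_prev > 0) & (d_next > 0)
    # If one of the two values is 0, the sum equals the single valid value.
    return np.where(both, (d_prev + d_next) / 2.0, d_prev + d_next)

# One image row: an object at 10 m moves from the left pixel to the right pixel
# in front of a background at 50 m.
d_prev = np.array([[10.0, 50.0, 50.0]])
d_next = np.array([[50.0, 50.0, 10.0]])
d_true = np.array([[50.0, 10.0, 50.0]])   # object is at the centre pixel in between

d_avg = average_depth(d_prev, d_next)
max_error = float(np.max(np.abs(d_avg - d_true)))
```

The averaged row contains depths of 30 m where the object was and will be, and no object at the centre pixel, because averaging mixes values that belong to different scene points.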
Visual Comparison. For the visual comparison, we show different interpolation results in Fig. 6. PLIN has already shown that the traditional interpolation method cannot handle the point cloud interpolation problem well and that its visual performance is poor. Therefore, we only compare Super Slomo [27], PLIN, our approach, and the ground truth. As illustrated in Fig. 6, our approach produces a more reasonable distribution and shape than PLIN. The overall distribution of our PseudoLiDAR point cloud is more similar to that of the ground truth point cloud. In the zoomed regions, our method recovers better 3D details for the car, road, and tree. This benefits from the optimized motion representation, the 3D space supervision mechanism, and the model structure.
5 Conclusion
In this paper, we propose a novel PseudoLiDAR point cloud interpolation network with better interpolation performance than previous works. To represent spatial motion information more accurately, we use the point cloud scene flow to guide the point cloud interpolation task. We design a multimodal deep aggregation module to facilitate the efficient fusion of the features of the two branches. In addition, we adopt a supervision mechanism in 3D space to supervise the generation of the PseudoLiDAR point cloud. Benefiting from the optimized motion representation, training loss function, and model structure, the proposed pipeline significantly improves interpolation performance. We have shown the effectiveness of our approach on the KITTI dataset, where it outperforms state-of-the-art point cloud interpolation techniques by a large margin.
References
 [1] X. Yang, H. Luo, Y. Wu, Y. Gao, C. Liao, and K.T. Cheng, “Reactive obstacle avoidance of monocular quadrotors with online adapted depth prediction network,” Neurocomputing, vol. 325, pp. 142–158, 2019.

[2]
S. Shi, X. Wang, and H. Li, “PointRcnn: 3D object proposal generation and
detection from point cloud,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, 2019, pp. 770–779.  [3] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum pointnets for 3D object detection from RGBD data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 918–927.
 [4] D. Shin, Z. Ren, E. B. Sudderth, and C. C. Fowlkes, “3D scene reconstruction with multi-layer depth and epipolar transformers,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2172–2182.
 [5] H. Liu, K. Liao, C. Lin, Y. Zhao, and M. Liu, “PLIN: A network for pseudo-LiDAR point cloud interpolation,” Sensors, vol. 20, no. 6, p. 1573, 2020.
 [6] M. Liu, M. Salzmann, and X. He, “Discrete-continuous depth estimation from a single image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 716–723.
 [7] K. Karsch, C. Liu, and S. B. Kang, “Depth transfer: Depth extraction from video using non-parametric sampling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 11, pp. 2144–2158, 2014.
 [8] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2650–2658.

 [9] B. Li, C. Shen, Y. Dai, A. Van Den Hengel, and M. He, “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1119–1127.
 [10] J. Xie, R. Girshick, and A. Farhadi, “Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016, pp. 842–857.
 [11] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 270–279.
 [12] A. Atapour-Abarghouei and T. P. Breckon, “Extended patch prioritization for depth filling within constrained exemplar-based RGB-D image completion,” in International Conference Image Analysis and Recognition. Springer, 2018, pp. 306–314.
 [13] M. Kulkarni and A. N. Rajagopalan, “Depth inpainting by tensor voting,” JOSA A, vol. 30, no. 6, pp. 1155–1165, 2013.
 [14] A. Atapour-Abarghouei and T. P. Breckon, “DepthComp: Real-time depth image completion based on prior semantic scene segmentation,” 2017.
 [15] A. Atapour-Abarghouei, G. P. de La Garanderie, and T. P. Breckon, “Back to Butterworth - a Fourier basis for 3D surface relief hole filling within RGB-D imagery,” in 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, 2016, pp. 2813–2818.
 [16] M. Camplani and L. Salgado, “Efficient spatio-temporal hole filling strategy for Kinect depth maps,” in Three-Dimensional Image Processing (3DIP) and Applications II, vol. 8290. International Society for Optics and Photonics, 2012, p. 82900E.
 [17] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant CNNs,” in 2017 International Conference on 3D Vision (3DV). IEEE, 2017, pp. 11–20.
 [18] F. Ma and S. Karaman, “Sparse-to-dense: Depth prediction from sparse depth samples and a single image,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–8.
 [19] F. Ma, G. V. Cavalheiro, and S. Karaman, “Self-supervised sparse-to-dense: Self-supervised depth completion from LiDAR and monocular camera,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 3288–3295.
 [20] A. Eldesokey, M. Felsberg, and F. S. Khan, “Propagating confidences through CNNs for sparse data regression,” arXiv preprint arXiv:1805.11913, 2018.
 [21] Y.-K. Huang, T.-H. Wu, Y.-C. Liu, and W. H. Hsu, “Indoor depth completion with boundary consistency and self-attention,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, pp. 0–0.
 [22] M. Jaritz, R. De Charette, E. Wirbel, X. Perrotton, and F. Nashashibi, “Sparse and dense data with CNNs: Depth completion and semantic segmentation,” in 2018 International Conference on 3D Vision (3DV). IEEE, 2018, pp. 52–60.
 [23] J. Qiu, Z. Cui, Y. Zhang, X. Zhang, S. Liu, B. Zeng, and M. Pollefeys, “DeepLiDAR: Deep surface normal guided depth prediction for outdoor scene from sparse LiDAR data and single color image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3313–3322.
 [24] Y. Chen, B. Yang, M. Liang, and R. Urtasun, “Learning joint 2D-3D representations for depth completion,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 10 023–10 032.
 [25] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala, “Video frame synthesis using deep voxel flow,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4463–4471.
 [26] T. Peleg, P. Szekely, D. Sabo, and O. Sendik, “IM-Net for high resolution video frame interpolation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2398–2407.
 [27] H. Jiang, D. Sun, V. Jampani, M.-H. Yang, E. Learned-Miller, and J. Kautz, “Super SloMo: High quality estimation of multiple intermediate frames for video interpolation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9000–9008.
 [28] W. Bao, W.-S. Lai, C. Ma, X. Zhang, Z. Gao, and M.-H. Yang, “Depth-aware video frame interpolation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3703–3712.
 [29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
 [30] X. Liu, C. R. Qi, and L. J. Guibas, “FlowNet3D: Learning scene flow in 3D point clouds,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 529–537.
 [31] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in NeurIPS, 2017.
 [32] M. Kiefel, V. Jampani, and P. V. Gehler, “Permutohedral lattice CNNs,” 2015.
 [33] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8445–8453.
 [34] H. Fan, H. Su, and L. J. Guibas, “A point set generation network for 3D object reconstruction from a single image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 605–613.
 [35] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
 [36] X. Gu, Y. Wang, C. Wu, Y. J. Lee, and P. Wang, “HPLFlowNet: Hierarchical permutohedral lattice FlowNet for scene flow estimation on large-scale point clouds,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3254–3263.
 [37] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” 2017.