Dense and precise depth information about outdoor environments is essential for many computer vision applications in autonomous driving and robotics. Recently, in 3D object detection [18, 15], 3D semantic segmentation [23, 24], and depth completion [12, 21, 16], point clouds obtained by LiDAR have gained increasing attention due to their accurate spatial information. However, LiDAR sensors operate at a low frequency for an autonomous driving system (for example, the Velodyne HDL-64E rotating 3D laser scanner runs at 10 Hz). As a result, the time stamps of the LiDAR do not match those of other sensors such as cameras (for example, the PointGray Flea2 color cameras run at 15 Hz). To match the scenes collected by these two sensors and achieve system synchronization, the frequency of the camera has to be decreased to that of the LiDAR sensor, which wastes resources and degrades performance in high-speed applications. It is therefore appealing for autonomous driving systems to increase the frequency of the LiDAR sensor and thereby pursue high-quality synchronized perception in a multi-sensor system.
To address the above problem, one possible solution is to interpolate an intermediate point cloud from two consecutive point clouds. However, working directly in 3D space to generate a new point cloud is challenging. Therefore, previous studies [19, 9, 3] prefer to perform their tasks on 2D depth maps or other projected views. Moreover, the target point cloud can be constructed in the form of Pseudo-LiDAR using known camera intrinsics. For Pseudo-LiDAR interpolation, an intermediate depth map is first generated and then back-projected into 3D space. This approach is more feasible and efficient than direct point cloud interpolation.
Interpolation techniques have been widely used in many computer vision and robotics tasks and can be classified into two categories, i.e., temporal interpolation [14, 1, 8] and spatial interpolation [12, 26, 10]. In video processing, video interpolation aims to temporally generate an intermediate frame from two consecutive frames. This technique has attracted increasing attention due to the growing demand for high-quality slow-motion videos. For example, Peleg et al. formulated the interpolated motion estimation problem as classification rather than regression, achieving real-time temporal interpolation for high-resolution videos. Bao et al. exploited both depth and flow information to address strong occlusion during new frame synthesis. To interpolate more in-between frames, Jiang et al. proposed a variable-length multi-frame interpolation method that generates a frame at any time step between two given frames. However, temporal interpolation has not yet been investigated for point clouds.
In contrast to video interpolation, depth completion spatially fills missing depth values in a sparse depth map to generate a dense depth map. This technique has become an essential enhancement step for LiDARs, which usually provide only sparse measurements. For example, Ma et al. fed the concatenation of a sparse depth map and a color image into an encoder-decoder network to produce a dense depth map using self-supervised learning. To obtain more accurate dense depth maps, Zhang et al. employed a weight matrix to describe a surface normal and occlusion boundary. To perceive both surface normal and contextual information, Lee et al. presented an end-to-end convolutional neural network for depth completion, which consists of a geometry network and a context network.
Motivated by the aforementioned studies, we propose a Pseudo-LiDAR point cloud interpolation network (PLIN) to generate both temporally and spatially high-quality point cloud sequences. The overall pipeline of the proposed method is illustrated in Fig. 1. Specifically, PLIN consists of a motion guidance module, a scene guidance module, and a transformation module. In the motion guidance module, we obtain bidirectional optical flow maps from color images, then warp two sparse depth maps into an approximate intermediate depth map. The original and warped depth maps, as well as the estimated optical flow maps, are fed into a coarse interpolation network to generate a coarse intermediate depth map. Subsequently, to produce a more accurate and dense depth map, we design a refined interpolation network guided by the realistic scene. This scene guidance module generates a refined depth map using the intermediate color image and the estimated depth map. Finally, in the transformation module, the refined depth map is used to construct the target Pseudo-LiDAR point cloud using camera parameters. Compared with video interpolation and depth completion tasks, our method simultaneously performs interpolation in both the spatial and temporal domains. To the best of our knowledge, this is the first deep framework for Pseudo-LiDAR point cloud interpolation. Experimental results demonstrate that PLIN achieves promising performance on the KITTI dataset.
In summary, the contributions of the proposed method are threefold:
- To mitigate the low frequency limitation of LiDAR sensors, we present a Pseudo-LiDAR interpolation network to generate both temporally and spatially high-quality point cloud sequences.
- We use the bidirectional optical flow as explicit motion guidance for interpolation. In addition, a warping layer is applied to improve the accuracy of depth prediction by approximating an intermediate frame. Finally, the in-between color image is leveraged to provide rich texture information of the realistic scene for more accurate and dense spatial reconstruction.
- We evaluate the proposed model on the KITTI benchmark, where it reasonably recovers the original intermediate 3D scene and outperforms other interpolation methods.
In this section, we describe the detailed architecture of the proposed Pseudo-LiDAR interpolation network (PLIN). Given three consecutive color images captured by a camera and two depth maps obtained by LiDAR, we first interpolate an intermediate 2D dense depth map and then back-project the depth map into Pseudo-LiDAR using prior camera intrinsics. In addition, we explore the guidance of motion and scene to generate a realistic dense depth map, and adopt a warping layer to improve the accuracy of spatial reconstruction. As illustrated in Fig. 2, PLIN consists of a motion guidance module, a scene guidance module, and a transformation module. Thanks to this coarse-to-fine cascade structure, our method progressively perceives the multi-modal information and generates a temporally and spatially high-quality point cloud sequence. Finally, we introduce the overall training loss function of PLIN in this section.
II-A Intermediate Depth Map Interpolation
In this part, we introduce the method for intermediate depth map synthesis. We first present a baseline network to generate an interpolation map using only two consecutive sparse depth maps. Then, to construct more reasonable slow-motion results, we use the motion information included in a bidirectional optical flow to guide the interpolation process. Moreover, a warping operation is applied to input depth maps to produce an intermediate coarse depth map, which contains the explicit motion relationship. Finally, we use the in-between color image to refine the coarse depth map with the guidance of the scene, resulting in a more accurate and dense intermediate depth map.
II-A1 Baseline Network
As mentioned in Section I, due to challenges of 3D point clouds, previous works [19, 9, 3] perform different vision tasks on 2D depth maps or other projected views. Inspired by this principle, we first interpolate an intermediate depth map using two consecutive depth maps.
Given two sparse depth maps $D_{t-1}$ and $D_{t+1}$, our goal is to synthesize a depth map $D_t$ for the intermediate frame. A straightforward way is to train a baseline network (i.e., an encoder-decoder structure) to predict the depth value of each pixel in the intermediate frame. Specifically, the encoder consists of a set of convolutions that increase the number of channels and reduce the feature resolution. The decoder has a symmetric structure. Moreover, to combine complementary information, the network contains multiple skip connections that merge low-level and high-level features at the same spatial resolution. Before being fed into the encoder, the consecutive sparse depth maps are processed using a convolutional layer with eight 3×3 kernels, and the extracted features are concatenated as the input to the network. All convolutions in the baseline network are followed by batch normalization and a ReLU layer, with the exception of the last convolution layer, where a linear activation function is used. In the encoder, we use ResNet-34 as our backbone. In the decoder, five fractionally-strided convolutional layers increase the resolution of the feature maps. After these five layers, a multi-channel feature map is obtained, which is then passed through a 1×1 convolution to generate a single-channel depth map. Thus, the intermediate dense depth map derived from the baseline network can be expressed as:

$$D_t = \mathcal{H}(D_{t-1}, D_{t+1}),$$

where $D_{t-1}$ is the previous frame depth map, $D_{t+1}$ is the latter frame depth map, and $\mathcal{H}$ is an interpolation function learned by the baseline network.
The baseline network takes a brute-force approach: it learns the relationship between the intermediate depth map and the adjacent depth maps without any other guidance. However, due to the sparsity of the data, the complex motion relationship among $D_{t-1}$, $D_t$, and $D_{t+1}$ is difficult to estimate using only depth maps, and thus the obtained result usually shows an inferior appearance with blur artifacts. The results generated by the baseline network are shown in Section III-B. For the interpolation task, neural networks should not only learn to generate the appearance of the two input depth maps, but also accurately perceive the motion relationship among consecutive depth maps. To achieve more reasonable interpolation results, we introduce optical flow to guide the generation of dense depth maps.
II-A2 Motion Guidance Module
To consider the motion relationship between consecutive sparse depth maps, we design a motion guidance module to exploit optical flow to explicitly guide the generation of dense depth maps. In the video interpolation task, the optical flow is often used as an important input component, because it represents the direction and level of motion. Inspired by video interpolation, we introduce the optical flow into our Pseudo-LiDAR interpolation problem.
Instead of directly investigating the optical flow on sparse depth maps, we learn the motion relationship on dense color images, which carry abundant and precise contextual information. Recently, deep neural networks have shown excellent performance in optical flow estimation [6, 25, 7]. Given two consecutive color images $I_{t-1}$ and $I_{t+1}$, video interpolation [14, 1, 8] aims to generate the intermediate color image $I_t$ using a bidirectional optical flow:

$$F_{t-1 \to t+1} = \mathcal{F}(I_{t-1}, I_{t+1}), \qquad F_{t+1 \to t-1} = \mathcal{F}(I_{t+1}, I_{t-1}),$$

where $\mathcal{F}$ is an optical flow estimation function learned by neural networks. Assuming that the motion between adjacent frames is smooth, the optical flow $F_{t \to t-1}$ for color images $I_t$ and $I_{t-1}$, and the optical flow $F_{t \to t+1}$ for color images $I_t$ and $I_{t+1}$, can then be approximated from this bidirectional flow.
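As an illustration of such an approximation (not necessarily the exact form used in this paper), the linear-motion model of the Super Slomo video interpolation method (Jiang et al.) can be written as follows. The frame labels 0 and 1 for the two input frames and the normalized time $t \in (0, 1)$ are notation assumed for this sketch; in the present setting, frame 0 is the previous frame, frame 1 is the next frame, and $t = 1/2$.

```latex
\begin{align}
\hat{F}_{t \to 0} &= -(1-t)\,t\,F_{0 \to 1} + t^{2}\,F_{1 \to 0}, \\
\hat{F}_{t \to 1} &= (1-t)^{2}\,F_{0 \to 1} - t\,(1-t)\,F_{1 \to 0}.
\end{align}
% At the temporal midpoint t = 1/2 these reduce to
\begin{align}
\hat{F}_{t \to 0} &= \tfrac{1}{4}\left(F_{1 \to 0} - F_{0 \to 1}\right), &
\hat{F}_{t \to 1} &= \tfrac{1}{4}\left(F_{0 \to 1} - F_{1 \to 0}\right).
\end{align}
```

Such an approximation is only needed when the intermediate image is unavailable, which is the usual situation in video interpolation.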
Unlike the aforementioned video interpolation task, we focus on generating an intermediate point cloud. Note that the intermediate color image $I_t$ is available in this problem because the frequency of the camera is higher than that of the LiDAR. Therefore, we can directly obtain $F_{t \to t-1}$ and $F_{t \to t+1}$ using the optical flow estimation network:

$$F_{t \to t-1} = \mathcal{F}(I_t, I_{t-1}), \qquad F_{t \to t+1} = \mathcal{F}(I_t, I_{t+1}).$$
Considering that LiteFlowNet outperforms FlowNet2 on the challenging Sintel final pass and KITTI benchmarks, while being 30 times smaller in model size and 1.36 times faster in running speed, we use LiteFlowNet to estimate the optical flow between consecutive color images.
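The original sparse depth maps and the bidirectional optical flow are jointly fed into the motion guidance module. Thus, the intermediate depth map can be expressed by:

$$D_t = \mathcal{M}(D_{t-1}, D_{t+1}, F_{t \to t-1}, F_{t \to t+1}),$$

where $D_{t-1}$ and $D_{t+1}$ are the two input sparse depth maps, $F_{t \to t-1}$ and $F_{t \to t+1}$ are the optical flows from the intermediate frame to its two neighbors, and $\mathcal{M}$ is a depth map interpolation function learned by the motion guidance module.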
To make full use of the information provided by the optical flow, we leverage the bidirectional optical flow to directly produce an approximate intermediate depth map $\hat{D}_t$ by means of a warping operation:

$$\hat{D}_t = \alpha \, g(D_{t-1}, F_{t \to t-1}) + (1 - \alpha) \, g(D_{t+1}, F_{t \to t+1}),$$

where $\alpha$ refers to the weighting factor of the two input depth maps and $g(\cdot, \cdot)$ is a backward warping function that can be implemented using bilinear interpolation [8, 11]. This warping layer transfers the depth maps of the adjacent frames to the position of the intermediate frame using the estimated optical flow. Instead of simply feeding the optical flow into the neural network, we further use the explicit motion relationship to build an approximate intermediate depth map, contributing to a more accurate 3D reconstruction.
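The warping operation described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `(H, W, 2)` flow layout holding horizontal and vertical offsets, the default weighting factor of 0.5, and the choice to write 0 where a sample falls outside the image are all assumptions made for the sketch.

```python
import numpy as np


def backward_warp(depth, flow):
    """Bilinearly sample `depth` at (x + u, y + v) for every target pixel (x, y).

    depth: (H, W) float array with H, W >= 2.
    flow:  (H, W, 2) array of (u, v) offsets in pixels.
    Targets whose sample point falls outside the image get 0 ("no depth").
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    sx = xs + flow[..., 0]
    sy = ys + flow[..., 1]
    inside = (sx >= 0) & (sx <= w - 1) & (sy >= 0) & (sy <= h - 1)

    # Clamp so that the four neighbours are always valid indices.
    sx = np.clip(sx, 0, w - 1)
    sy = np.clip(sy, 0, h - 1)
    x0 = np.clip(np.floor(sx).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(sy).astype(int), 0, h - 2)
    wx = sx - x0
    wy = sy - y0

    top = (1 - wx) * depth[y0, x0] + wx * depth[y0, x0 + 1]
    bottom = (1 - wx) * depth[y0 + 1, x0] + wx * depth[y0 + 1, x0 + 1]
    warped = (1 - wy) * top + wy * bottom
    return np.where(inside, warped, 0.0)


def warp_intermediate(d_prev, d_next, flow_t_prev, flow_t_next, alpha=0.5):
    """Blend the two warped neighbours into an approximate intermediate map.

    alpha is the weighting factor of the two inputs (0.5 is an assumed default).
    """
    return (alpha * backward_warp(d_prev, flow_t_prev)
            + (1 - alpha) * backward_warp(d_next, flow_t_next))
```

Note that plain bilinear sampling mixes empty (zero) pixels with valid ones when the input is sparse, so the result is only an approximate intermediate map.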
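Therefore, the input of the motion guidance module includes the two consecutive sparse depth maps, the estimated bidirectional optical flow, and the warped intermediate depth maps. The final interpolated intermediate depth map can be formulated by:

$$D_t = \mathcal{M}\big(D_{t-1}, D_{t+1}, F_{t \to t-1}, F_{t \to t+1}, g(D_{t-1}, F_{t \to t-1}), g(D_{t+1}, F_{t \to t+1})\big),$$

where $\mathcal{M}$ is the interpolation function learned by the motion guidance module and $g(\cdot, \cdot)$ is the backward warping function.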
II-A3 Scene Guidance Module
In contrast to the sparse point cloud, color images have richer and denser texture information, which significantly benefits complete scene understanding. To obtain more precise and dense interpolation results, we design a scene guidance module to refine the coarse depth map derived by the motion guidance module. We first use two convolutional layers with 8 channels to extract features of the coarse depth map and the color image, respectively. Subsequently, the convolved features are fused to form the input of the refined interpolation network. The refined interpolation network is a lightweight U-Net structure with fewer layers than the coarse interpolation network. Specifically, the encoder contains five convolutional layers and the decoder contains four deconvolutional layers. Batch normalization and the ReLU activation function are applied to all convolutional and deconvolutional layers, except for the last deconvolutional layer, which uses a linear activation function. In addition, there are skip connections between feature maps with the same spatial resolution to facilitate the complementation of local and global information. Thus, the intermediate dense depth map generated by the scene guidance module can be expressed as:

$$D_t^{r} = \mathcal{R}(D_t^{c}, I_t),$$

where $D_t^{c}$ is the coarse depth map, $I_t$ is the intermediate color image, and $\mathcal{R}$ is the refinement function learned by the scene guidance module.
II-B Transformation Module
Once the intermediate depth map is generated, the point cloud can be constructed in the form of Pseudo-LiDAR. According to the pinhole camera model, each spatial point $(x, y, z)$ corresponds to pixel coordinates $(u, v, d)$, where $d$ refers to the depth value. Using the camera parameters, the interpolated depth map can be converted into the coordinates of a point cloud. Here, we can derive the 3D position of each pixel in the camera coordinate system as

$$z = d, \qquad x = \frac{(u - c_u)\, z}{f_u}, \qquad y = \frac{(v - c_v)\, z}{f_v},$$

where $(c_u, c_v)$ is the pixel position corresponding to the center of the camera aperture, and $f_v$ and $f_u$ are the vertical and horizontal focal lengths, respectively.
By converting all pixels in the depth map into 3D coordinates, we obtain a set of points $\{(x_i, y_i, z_i)\}_{i=1}^{N}$, where $N$ is the number of points. The point cloud obtained from the intermediate depth map is referred to as Pseudo-LiDAR.
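The back-projection step can be sketched in a few lines of NumPy. This is an illustrative sketch under the pinhole model described above; the function name, the argument names (`fu`, `fv` for the horizontal and vertical focal lengths, `cu`, `cv` for the principal point), and the convention that a depth of 0 marks an empty pixel are assumptions.

```python
import numpy as np


def depth_to_pseudo_lidar(depth, fu, fv, cu, cv):
    """Back-project an (H, W) depth map into an (N, 3) point array.

    Pixels with depth <= 0 are treated as empty and skipped (assumption).
    The points are expressed in the camera coordinate system.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]          # v: row index, u: column index
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cu) * z / fu
    y = (v[valid] - cv) * z / fv
    return np.stack([x, y, z], axis=1)


# Usage with made-up intrinsics: a single valid pixel at (u, v) = (1, 1).
depth = np.zeros((2, 2))
depth[1, 1] = 10.0
points = depth_to_pseudo_lidar(depth, fu=2.0, fv=2.0, cu=0.0, cv=0.0)
```

Transforming the points from the camera frame into the LiDAR frame would additionally require the extrinsic calibration, which is not covered by this sketch.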
II-C Loss Function
The overall loss function of PLIN is a linear combination of the coarse depth loss and the refined depth loss. The ground truth depth map of the intermediate frame is used to supervise the network prediction. We adopt an L2 loss between the predicted dense depth map $D$ and the ground truth $D_{gt}$ as follows:

$$L(D) = \left\| D - D_{gt} \right\|_2^2.$$

Our final loss function can be expressed as follows:

$$L = w_1 L(D_t^{c}) + w_2 L(D_t^{r}),$$

where $D_t^{c}$ refers to the intermediate depth map of the coarse interpolation network, $D_t^{r}$ refers to the intermediate dense depth map of the refined interpolation network, and $w_1$ and $w_2$ are weights that balance the two loss terms. In this work, $w_1$ and $w_2$ are empirically set to 0.1 and 1, respectively.
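The weighted combination of the two losses can be sketched as below. The weights 0.1 and 1 come from the text; the function names, the averaging over pixels, and the restriction of the loss to pixels where the ground truth is positive are assumptions of this sketch (the ground truth may itself contain empty pixels), not details confirmed by the paper.

```python
import numpy as np


def l2_depth_loss(pred, gt):
    """Mean squared error over pixels that have ground-truth depth.

    Masking with gt > 0 and averaging are assumptions of this sketch.
    """
    mask = gt > 0
    if not mask.any():
        return 0.0
    diff = pred[mask] - gt[mask]
    return float(np.mean(diff ** 2))


def plin_loss(coarse, refined, gt, w_coarse=0.1, w_refined=1.0):
    """Linear combination of the coarse and refined depth losses."""
    return (w_coarse * l2_depth_loss(coarse, gt)
            + w_refined * l2_depth_loss(refined, gt))
```

Giving the coarse term the smaller weight keeps it as auxiliary supervision while the refined output dominates the objective.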
In this section, we first describe the training dataset and strategy of the proposed PLIN network. We then perform several ablation experiments to verify the effectiveness of different modules in our network. To demonstrate its superiority, we also compare our method with a traditional method and an advanced video interpolation method. As illustrated in Fig. 3, the depth maps obtained by our method show clear boundaries in visual effects and display denser distributions than the ground truth dense depth maps.
III-A Dataset and Strategy
The main application scenario of our model is on-board LiDARs for outdoor scenes. Our experiments were performed on the KITTI depth completion dataset and the KITTI raw dataset. The KITTI dataset provides depth information and color images. The dataset contains 85,898 training samples, 6,852 validation samples, and 1,000 test samples. Considering that the training set contains some frame sequences with tiny motion, we select 40,000 scenes with relatively large motion to train our network.
Since the upper part of the depth map obtained by LiDAR projection does not provide any depth information, our network takes 1216×256 images obtained by bottom-cropping the original images. In addition, we perform data augmentation operations such as random flipping and color adjustment on the training data. The whole network was trained in an end-to-end manner. We used the Adam optimizer and PyTorch [13] with a batch size of 1 and trained on a 1080Ti GPU for about 60 hours.
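The cropping and flipping steps can be sketched as follows. This is a minimal illustration: the function names, the horizontal centering of the crop, and the flip probability are assumptions, and color adjustment is not covered here.

```python
import numpy as np


def bottom_crop(img, out_h=256, out_w=1216):
    """Keep the bottom out_h rows; center the crop horizontally (assumption).

    Works for (H, W) depth maps and (H, W, C) images with H >= out_h, W >= out_w.
    """
    h, w = img.shape[:2]
    left = (w - out_w) // 2
    return img[h - out_h:, left:left + out_w]


def maybe_flip(arrays, rng, p=0.5):
    """Horizontally flip all arrays together with probability p.

    The same decision is applied to every array so that the depth maps and
    color images of one training sample stay aligned.
    """
    if rng.random() < p:
        return [np.ascontiguousarray(a[:, ::-1]) for a in arrays]
    return list(arrays)


# Usage on a dummy 352 x 1216 depth map.
rng = np.random.default_rng(0)
depth = np.zeros((352, 1216), dtype=np.float32)
cropped = bottom_crop(depth)
(augmented,) = maybe_flip([cropped], rng)
```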
III-B Ablation Study
To evaluate the effectiveness of each module, we perform an ablation study on the proposed network. First, we conduct three experiments on the coarse interpolation network.
- The baseline network takes only two consecutive sparse depth maps as the input (baseline).
- The forward and backward sparse depth maps and the estimated optical flow maps are fed into the baseline network (baseline + flow).
- The baseline network receives the forward and backward depth maps, the bidirectional optical flow, and the depth maps derived by the warping layer (baseline + warp_flow).
Similarly, there are two experiments on the refined interpolation network as follows.
- The refined network takes the intermediate color image and two depth maps as its inputs (baseline + rgb).
- The complete configuration includes the coarse interpolation network with motion guidance using the warping operation and the refined interpolation network with scene guidance (ours).
For the evaluation of interpolated depth maps, we choose four metrics: the root mean square error (RMSE), the mean absolute error (MAE), the root mean square error of the inverse depth (iRMSE), and the mean absolute error of the inverse depth (iMAE). Similar to [12, 26, 10], we primarily focus on RMSE, which is the leading metric on the depth completion benchmark. The results of the ablation study are listed in Table I. The ablation study shows that the complete network (ours) achieves the best interpolation performance. Considering each module of PLIN, the baseline network achieves more accurate results with motion guidance because of the optical flow provided between consecutive frames. Moreover, compared with the direct use of optical flow, the warping layer significantly improves the interpolation performance because it provides a more explicit intermediate representation. Benefiting from the rich texture information in color images, the baseline network with scene guidance outperforms the baseline network that takes only two consecutive depth maps as inputs. For the other, minor evaluation metrics, the motion guidance module slightly increases their values, as the optical flow estimated by LiteFlowNet contains some noise. To compare these results intuitively, we visualize the interpolated results of a scene obtained by the above methods in Fig. 4; the complete network generates the most realistic details and distributions of the intermediate point cloud.
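The four metrics named above can be computed as in the following sketch. The function name, the evaluation mask (pixels with positive ground truth), and the small `eps` that guards the inverse of non-positive predictions are assumptions; the units of the results follow the units of the inputs, so any conversion to benchmark units is left to the caller.

```python
import numpy as np


def depth_metrics(pred, gt, eps=1e-6):
    """RMSE, MAE, iRMSE and iMAE over pixels with ground-truth depth.

    Requires at least one pixel with gt > 0.
    """
    valid = gt > 0
    p = pred[valid]
    g = gt[valid]
    err = p - g
    # Inverse-depth error; eps avoids division by zero for bad predictions.
    inv_err = 1.0 / np.maximum(p, eps) - 1.0 / g
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "irmse": float(np.sqrt(np.mean(inv_err ** 2))),
        "imae": float(np.mean(np.abs(inv_err))),
    }


# Usage on a toy example with two valid ground-truth pixels.
gt = np.array([[0.0, 2.0], [4.0, 0.0]])
pred = np.array([[9.0, 1.0], [5.0, 9.0]])
metrics = depth_metrics(pred, gt)
```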
III-C Comparison Results
Because PLIN is the first work on point cloud interpolation, we compare our method only with the traditional interpolation method that averages the two consecutive depth maps and with the state-of-the-art video interpolation network Super Slomo. Note that we retrain the Super Slomo network on the KITTI depth completion dataset.
III-C1 Quantitative Comparison
Table II reports the quantitative evaluation results of different methods. The traditional method averages two consecutive depth maps to obtain an intermediate depth map. However, because the valid pixels are relatively sparse and there is no obvious correspondence between them, the traditional method is not suitable for the interpolation of point clouds. Moreover, the video interpolation network cannot handle the point cloud interpolation problem because motion is difficult to perceive on sparse depth maps. Compared with these two methods, the proposed PLIN network is specially designed for the point cloud interpolation task and is jointly guided by explicit motion and realistic scenes, achieving the best performance.
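A toy example makes the weakness of plain averaging concrete. The sketch below assumes that empty pixels are stored as 0 and that the average is taken pixel by pixel without any masking; the paper does not spell out how the traditional baseline treats empty pixels, so this is only one plausible reading.

```python
import numpy as np


def average_interpolation(d_prev, d_next):
    """Pixel-wise average of two sparse depth maps (empty pixels are 0)."""
    return 0.5 * (d_prev + d_next)


# One LiDAR return at 10 m that moves two pixels to the right between frames.
d_prev = np.array([[0.0, 10.0, 0.0, 0.0]])
d_next = np.array([[0.0, 0.0, 0.0, 10.0]])
d_mid = average_interpolation(d_prev, d_next)
```

Instead of a single return at 10 m located between the two input positions, `d_mid` contains two returns at 5 m at the original positions, that is, points that do not exist in the scene.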
III-C2 Visual Comparison
For visual comparison, we show three interpolated results achieved by different methods. As illustrated in Fig. 5, the traditional method, which relies on plain averaging, generates fake objects that do not exist in the original scene. Super Slomo, in turn, produces disordered point clouds due to its insufficient learning capability on motion and scenes. In contrast, our model produces sharper outlines and boundaries for small objects such as cars and people. In addition, the overall distribution of the Pseudo-LiDAR is more similar to that of the ground truth point cloud.
In this paper, we have proposed a network to generate both temporally and spatially high-quality point cloud sequences. To gradually perceive different modal conditions, we adopted a coarse-to-fine cascade structure. Specifically, the bidirectional optical flow explicitly guides consecutive sparse depth maps to generate an intermediate depth map, which is further improved by the warping layer. To obtain more accurate and dense depth information, the scene guidance module exploits the intermediate color image to refine the coarse depth map. To the best of our knowledge, this is the first deep framework for Pseudo-LiDAR interpolation; it increases the frequency of the LiDAR sensor and is appealing for more efficient multi-sensor systems.
This work was supported in part by the National Key R&D Program of China (2018YFB1201601), in part by the National Natural Science Foundation of China (No. 61772066, 61972028, 61602499, 61972435), and in part by the Fundamental Research Funds for the Central Universities (2018JBZ001, 18lgzd06).
[1] Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Depth-aware video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3703–3712, 2019.
[2] Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision (ECCV), pages 611–625. Springer, 2012.
[3] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3773–3777, 2017.
[4] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[6] Tak-Wai Hui, Xiaoou Tang, and Chen Change Loy. Liteflownet: A lightweight convolutional neural network for optical flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8981–8989, 2018.
[7] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2462–2470, 2017.
[8] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9000–9008, 2018.
[9] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven Lake Waslander. Joint 3D proposal generation and object detection from view aggregation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–8, 2018.
[10] Byeong-Uk Lee, Hae-Gon Jeon, Sunghoon Im, and In So Kweon. Depth completion with deep geometry and context guidance. In IEEE International Conference on Robotics and Automation (ICRA), pages 3281–3287, 2019.
[11] Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4463–4471, 2017.
[12] Fangchang Ma, Guilherme Venturelli Cavalheiro, and Sertac Karaman. Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera. In IEEE International Conference on Robotics and Automation (ICRA), pages 3288–3295, 2019.
[13] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
[14] Tomer Peleg, Pablo Szekely, Doron Sabo, and Omry Sendik. Im-net for high resolution video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2398–2407, 2019.
[15] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3D object detection from RGB-D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 918–927, 2018.
[16] Jiaxiong Qiu, Zhaopeng Cui, Yinda Zhang, Xingdi Zhang, Shuaicheng Liu, Bing Zeng, and Marc Pollefeys. Deeplidar: Deep surface normal guided depth prediction for outdoor scene from sparse lidar data and single color image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3313–3322, 2019.
[17] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[18] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRcnn: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–779, 2019.
[19] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik G. Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In IEEE International Conference on Computer Vision (ICCV), 2015.
[20] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant cnns. In 2017 International Conference on 3D Vision (3DV), pages 11–20. IEEE, 2017.
[21] W. Van Gansbeke, D. Neven, B. De Brabandere, and L. Van Gool. Sparse and noisy lidar completion with rgb guidance and uncertainty. In International Conference on Machine Vision Applications (MVA), pages 1–6, 2019.
[22] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8445–8453, 2019.
[23] Bichen Wu, Alvin Wan, Xiangyu Yue, and Kurt Keutzer. Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3D lidar point cloud. In IEEE International Conference on Robotics and Automation (ICRA), pages 1887–1893, 2018.
[24] Bichen Wu, Xuanyu Zhou, Sicheng Zhao, Xiangyu Yue, and Kurt Keutzer. Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In IEEE International Conference on Robotics and Automation (ICRA), pages 4376–4382, 2019.
[25] Zhichao Yin, Trevor Darrell, and Fisher Yu. Hierarchical discrete distribution decomposition for match density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6044–6053, 2019.
[26] Yinda Zhang and Thomas A. Funkhouser. Deep depth completion of a single RGB-D image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 175–185, 2018.