A broad range of robotics and computer vision algorithms rely on depth information alongside image data. Recently, several deep learning based methods have been proposed that estimate depth from monocular, stereo or multiple cameras [1, 2, 3]. However, the performance of these methods degrades significantly in environments with poor illumination and structural variability. To overcome these limitations of passive camera sensing, LiDAR based methods of depth estimation have been proposed [4, 5, 6, 7]. Although LiDAR based methods are more robust to poor illumination and lack of environmental structure, their accuracy depends on the calibration of the LiDAR and camera sensors. Most of the works available in the literature assume that an accurate registration is given and thus focus only on up-sampling or super-resolving a sparse depth map generated by projecting a LiDAR point cloud into the corresponding image reference frame. Unfortunately, generating depth super-resolution (SR) is problematic in the presence of even slight registration offsets. The problems typically take the form of reconstruction artifacts caused, for example, by strong depth discontinuities in individual objects.
Thus we formulate our problem based on the idea that the quality of the depth SR is tightly coupled both with the effectiveness of the super-resolution approach in yielding accurate reconstructions and with the accuracy of the approach used for calibration or registration of the sensing modalities. This idea was first explored in [8], in which LiDAR-camera registration was performed by aligning edges of an intensity image and a corresponding depth SR image. In this paper we propose a real-time depth super-resolution reconstruction method that uses only projections of the LiDAR data into the camera reference frame. We found that the LiDAR-to-camera registration used in [9] to obtain the calibration parameters provided in the KITTI benchmark is not accurate enough for depth super-resolution reconstructions. Therefore, we developed a new automatic calibration method based on motion vectors estimated independently from LiDAR and camera data. The calibration parameters obtained from the proposed method yield better super-resolution depth maps.
The rest of the paper is organized as follows. Section 2 provides an overview of the state of the art on depth SR from LiDAR-camera fusion and also discusses LiDAR-camera rigid body registration methods. In Section 3 we present and formulate both our depth SR and our motion based registration frameworks. Section 4 presents experimental results validating depth SR and motion based registration with real data from the KITTI benchmark. Finally, Section 5 concludes our findings.
2 Related Work
In mobile robotics, most automatic and target-less LiDAR-camera registration approaches in the literature have focused on determining modality-invariant features and cost functions that measure alignment between modalities. One of the first effective works within this realm computes edges simultaneously in both modalities and uses a correlation-based cost function to measure alignment [10]. This was followed by [11], which instead measures alignment using the mutual information between the reflectivity measurements from LiDAR and the intensities from camera images, assuming these have overlapping views. One of the problems with these methods, however, is that target-less registration from natural scenes tends to yield, in some instances, non-smooth and non-convex cost functions, which may result in convergence to local minima. To resolve these issues, [12] exploits temporal information and proposes an approach to automatically select specific time and place instances that better constrain the registration cost function.
Among the approaches developed for depth super-resolution, the early work of [4] demonstrated the idea of constraining a Markov random field (MRF) based optimization to yield depth-maps with edges that co-occur with those from the camera image, assuming the modalities have been co-registered. He et al. [5] proposed an image guided filtering approach to guide and improve edge selection in the reconstruction of depth SR. The work of [6] proposed a segmentation based approach in which sparse LiDAR points are used to segment the image, followed by smoothing of the sparse points within each of these segments to generate depth. The sparsity-promoting regularization based work of [7] proposed a weighted total generalized variation (TGV) formulation to promote consistency with edges from a high-resolution image in the depth SR reconstruction.
More recent works have also exploited temporal information to further enhance super-resolution methods. For example, the work of [13] formulated an inverse problem that uses a low-rank penalty on groups of similar patches found in the corresponding image sequence. The work of [14] instead learns group-sparse representations and uses total variation (TV) regularization for depth SR reconstruction. The authors of [8] considered instead a joint optimization formulation of LiDAR-camera registration and depth upsampling based on TV minimization. Although the aforementioned methods yield accurate depth reconstructions, they are not currently ready for real-time deployment and are thus not suitable for robotics and, in particular, for autonomous vehicle applications.
3 Proposed Approach
In this paper we propose a method for real-time depth-map super-resolution reconstruction from LiDAR data. In addition, we propose a novel motion guided method for automatic calibration of the LiDAR and camera sensors.
3.1 Depth Super-resolution
This section describes the depth super-resolution method shown in Figure 1, which performs super-resolution (SR) depth-map reconstruction given a low resolution (LR) depth-map input. The steps followed to attain depth SR are completely described by a mathematical cost function that encodes the characteristics the SR depth-map reconstruction should have. These characteristics are:
(i) Available LR depth-map pixel values from the depth sensor should be preserved exactly in the SR depth-map reconstruction. The reason is that the LiDAR sensors from which depth measurements are typically collected are highly reliable, so their measurements can be treated as true depth.
(ii) The strength and occurrence of depth discontinuities or edges in the SR depth-map should be minimal.
To ensure SR depth-map reconstructions with these two characteristics, we minimize a norm of the depth-map gradients. This searches for solutions that minimize the magnitude (i.e., strength) and occurrence (i.e., sparsity) of depth discontinuities, regardless of their direction (i.e., horizontal or vertical), over all pixels in the depth-map image, while maintaining equality constraints at the pixels where the LR depth-map contains observed depth measurements. These requirements result in the constrained objective function referred to as (1).
In (1), the constraint set is the subset of pixels in the LR depth-map that carry a depth measurement from the LiDAR projection, and the optimization variable is the SR depth-map. A similar problem formulation was envisioned in [15] for the problem of mapping the ground reflectivity with laser scanners. In (1) the gradient is the first-order forward-difference discrete gradient, defined point-wise as the differences between each pixel and its next horizontal and vertical neighbors.
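In assumed notation, the constrained problem and the forward-difference gradient described above can be sketched as follows. The symbols are illustrative choices rather than the paper's own: $x$ is the SR depth-map, $y$ the LR depth-map, $\Omega$ the set of pixels carrying a LiDAR measurement, and $\|\cdot\|$ stands for whichever norm is used. Boundary handling of the differences is left unspecified.

```latex
\hat{x} \;=\; \arg\min_{x} \; \|\nabla x\|
\quad \text{subject to} \quad x_{\Omega} = y_{\Omega},
\qquad
[\nabla x]_{i,j} \;=\;
\begin{pmatrix}
  x_{i,j+1} - x_{i,j} \\
  x_{i+1,j} - x_{i,j}
\end{pmatrix}.
```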
Since the cost function in (1) is convex, one can use any gradient based optimization algorithm. Here we use the accelerated gradient projection method of Nesterov [16], which achieves a rate of convergence of $O(1/k^2)$ versus the $O(1/k)$ of standard gradient descent, where $k$ is the iteration number. A summary of the algorithm is included in Algorithm 1.
In Algorithm 1, the selection operator is the rectangular matrix with 1's in the diagonal elements indexed by the set of measured pixels; it selects only the pixels in that set. In words, in every iteration of Algorithm 1 the initial depth measurements from the LR depth-map propagate spatially to every pixel in the SR depth-map reconstruction, provided this propagation takes the current estimate of the reconstruction closer to satisfying the characteristics described by the cost function. The way in which depth values are propagated is dictated by the gradient of the cost function, with a strength set by the learning-rate parameter. The gradient based optimization and the convex nature of the cost function guarantee that the reconstruction converges towards a minimizer of the cost function as the iterations proceed. The additional steps described in lines 5, 6 and 7 of the algorithm propagate depth values a little further, where the strength of this additional propagation is set by a separate scalar weight.
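A minimal sketch of such an accelerated projected-gradient loop is given below. It is not Algorithm 1 itself: it assumes a squared $\ell_2$ norm of forward-difference gradients as a smooth stand-in for the cost in (1), and the names `cost_gradient`, `upsample_depth`, `measured` and `step` are illustrative.

```python
import math

def cost_gradient(x):
    """Gradient of f(x) = 0.5 * (sum of squared forward differences of x)."""
    rows, cols = len(x), len(x[0])
    g = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:                 # horizontal forward difference
                d = x[i][j + 1] - x[i][j]
                g[i][j] -= d
                g[i][j + 1] += d
            if i + 1 < rows:                 # vertical forward difference
                d = x[i + 1][j] - x[i][j]
                g[i][j] -= d
                g[i + 1][j] += d
    return g

def upsample_depth(y, measured, step=0.125, iters=500):
    """Accelerated projected gradient for: min f(x) s.t. x = y on measured pixels.

    y        : 2D list of depths (values at unmeasured pixels are ignored)
    measured : 2D list of booleans marking pixels with a LiDAR measurement
    step     : gradient step; 1/8 is safe because the Lipschitz constant of
               the gradient of f stays below 8 on a 2D grid
    """
    rows, cols = len(y), len(y[0])
    known = [y[i][j] for i in range(rows) for j in range(cols) if measured[i][j]]
    fill = sum(known) / len(known)           # start unmeasured pixels at the mean depth
    x = [[y[i][j] if measured[i][j] else fill for j in range(cols)]
         for i in range(rows)]
    s_prev = [row[:] for row in x]
    q_prev = 1.0
    for _ in range(iters):
        g = cost_gradient(x)
        # Gradient step followed by projection onto the equality constraints.
        s = [[y[i][j] if measured[i][j] else x[i][j] - step * g[i][j]
              for j in range(cols)] for i in range(rows)]
        # Nesterov extrapolation along the last displacement.
        q = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * q_prev ** 2))
        lam = (q_prev - 1.0) / q
        x = [[s[i][j] + lam * (s[i][j] - s_prev[i][j]) for j in range(cols)]
             for i in range(rows)]
        s_prev, q_prev = s, q
    return s_prev                            # last projected iterate: constraints hold exactly
```

Because the projection simply overwrites the measured pixels with their LiDAR values, the returned map reproduces those measurements exactly, which mirrors characteristic (i) above. The per-pixel loops are written for readability; a practical implementation would vectorize them.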
3.2 Motion based calibration
In this section we describe the method we developed for automatic and target-less multi-modal sensor rigid-body registration. Our method registers data from LiDAR and camera sensors arranged in any configuration on a mobile or static platform. The general idea is that, at the correct registration parameters, motion in the modality to be registered should follow the motion in the reference modality. The procedure used to implement this idea uses a sequence of independent, time-synchronized point clouds from the LiDAR and the corresponding images from the camera. Given these sequences, a set of motion vectors is computed independently for each modality from time-consecutive image or depth-map pairs, respectively. The resulting motion vectors from the modality being registered are then compared against those of the reference modality using a norm-based similarity measure. Note that we assume the modalities have overlapping views of the scene and that there is motion in at least the underlying scene or the mobile robot, so that the motion vectors carry significant information. The schematic illustrating this registration process is shown in Figure 2. Mathematically, this idea is described as finding the registration parameters that minimize the cost function referred to as (3).
In (3), each motion vector is formed from a horizontal and a vertical motion operator. These operators are applied to the image pair used to estimate the motion vectors at a given time instant and to the corresponding pair of super-resolution depth maps. As illustrated in Figure 2, the sequence of SR depth-maps is obtained by a 3D-to-2D projection of a point cloud into the image reference frame, followed by super-resolving it using the optimization described in (1). This super-resolution step is used here to simplify the computation of motion vectors, which are obtained from dense rather than sparse depth-maps, since one no longer needs to account for or normalize against the varying amount of overlap when matching sparse depth sequence data. The advantages of registering modalities from the motion vectors are two-fold: (i) it avoids the challenges related to modalities measuring different units (e.g., intensity, depth, etc.) and (ii) it constrains the alignment both spatially and temporally.
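In assumed notation, the registration problem described above can be sketched as follows. All symbols are illustrative choices rather than the paper's own: $\theta$ collects the registration parameters, $I_n$ and $I_{n+1}$ are time-consecutive images, $D_n(\theta)$ and $D_{n+1}(\theta)$ are the SR depth-maps obtained with parameters $\theta$, $\mathcal{M}_h$ and $\mathcal{M}_v$ are the horizontal and vertical motion operators, $N$ is the number of motion vectors, and $\|\cdot\|$ stands for whichever norm is used.

```latex
\hat{\theta} \;=\; \arg\min_{\theta} \; \sum_{n=1}^{N}
\left\| \mathcal{M}(I_{n}, I_{n+1})
      - \mathcal{M}\big(D_{n}(\theta), D_{n+1}(\theta)\big) \right\|,
\qquad
\mathcal{M} = (\mathcal{M}_{h}, \mathcal{M}_{v}).
```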
To minimize the cost function in (3), we use the Nelder-Mead simplex optimization method described in [17]. Our method is not limited to the simplex method, and other optimization approaches could be used instead. However, we decided to use the simplex method because it is derivative-free and capable of escaping local minima.
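The simplex search can be sketched in a few lines. The version below is a compact textbook variant of Nelder-Mead with the usual reflection, expansion, contraction and shrink steps, not the implementation used in the experiments. The quadratic `toy_cost` and the values in `truth` are hypothetical stand-ins for the motion based cost in (3) and for six registration parameters.

```python
def nelder_mead(cost, simplex, iters=300):
    """Derivative-free simplex search (reflection, expansion, contraction, shrink)."""
    pts = [list(p) for p in simplex]
    n = len(pts[0])                  # number of parameters; the simplex has n + 1 vertices
    for _ in range(iters):
        pts.sort(key=cost)
        best, worst = pts[0], pts[-1]
        f_best, f_second, f_worst = cost(best), cost(pts[-2]), cost(worst)
        # Centroid of every vertex except the worst one.
        centroid = [sum(p[k] for p in pts[:-1]) / n for k in range(n)]
        reflected = [2.0 * c - w for c, w in zip(centroid, worst)]
        f_ref = cost(reflected)
        if f_ref < f_best:
            # Reflection found a new best point: try going twice as far.
            expanded = [3.0 * c - 2.0 * w for c, w in zip(centroid, worst)]
            pts[-1] = expanded if cost(expanded) < f_ref else reflected
        elif f_ref < f_second:
            pts[-1] = reflected
        else:
            # Contract towards the better of the reflected and worst points.
            target = reflected if f_ref < f_worst else worst
            contracted = [0.5 * (c + t) for c, t in zip(centroid, target)]
            if cost(contracted) < min(f_ref, f_worst):
                pts[-1] = contracted
            else:
                # Shrink every vertex halfway towards the best one.
                pts = [best] + [[0.5 * (b + v) for b, v in zip(best, p)]
                                for p in pts[1:]]
    return min(pts, key=cost)

# Hypothetical example: 3 rotations and 3 translations, so 7 simplex vertices.
truth = [0.01, -0.02, 0.03, 0.1, -0.2, 0.3]

def toy_cost(params):
    return sum((a - b) ** 2 for a, b in zip(params, truth))

start = [0.0] * 6
simplex = [start] + [[0.05 if k == i else 0.0 for k in range(6)] for i in range(6)]
estimate = nelder_mead(toy_cost, simplex)
```

Only cost evaluations are needed, which is what makes the method attractive when the cost is built from block-matching motion vectors and has no convenient analytic gradient.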
4 Experiments
To validate our approach, we used the KITTI benchmark dataset [9]. This data was collected with a vehicle outfitted with several perception and inertial sensors. For our experiments we use only the data from the Velodyne HDL-64E 3D-LiDAR scanner and the left PointGray Flea2 grayscale camera. However, the approach we propose here for both depth SR and motion guided registration extends readily to configurations with multiple cameras and LiDARs.
4.1 Analysis of calibration parameters in the KITTI dataset
The KITTI dataset provides the extrinsic calibration parameters for each sensor mounted on the vehicle, including those for the LiDAR and camera pair chosen for our experiments. However, if we take these parameters to be true and estimate the super-resolution depth map using the technique described in Section 3.1, we observe artifacts in the resulting depth SR, especially for objects closer to the sensors. A representative example is shown in Figure 3, using data collected from a vehicle following a biker. Figures 3.a and b show the result of depth super-resolution using different registration parameters: those of KITTI in (a) and our adjustment of the KITTI parameters in (b). Note that there is almost no difference between the resulting depth-maps. However, as the biker on the right part of the depth-map gets closer to the LiDAR sensor, the effect of miscalibration becomes clear and significant, as can be seen when comparing Figures 3.c and d. To show this effect more clearly we have included the corresponding zoomed-in patches in Figures 3.e, f and g, where (e) is the patch from the camera image and (f) and (g) are the depth SR reconstructions using the KITTI parameters and our own adjustment, respectively.
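One geometric reason why close objects are more sensitive can be illustrated with a pinhole projection. The sketch below considers only a translational offset in the extrinsics, and the focal length, principal point and 5 cm offset are made-up values rather than the KITTI calibration; `project` and `pixel_shift` are illustrative names.

```python
def project(point, rotation, translation, focal, center):
    """Project a 3D LiDAR point to pixel coordinates with a pinhole model.

    rotation (3x3, row-major) and translation (3-vector) map LiDAR
    coordinates into the camera frame, whose z axis points forward.
    """
    cam = [sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
           for r in range(3)]
    u = focal * cam[0] / cam[2] + center[0]
    v = focal * cam[1] / cam[2] + center[1]
    return u, v, cam[2]

def pixel_shift(point, offset, focal=700.0, center=(600.0, 180.0)):
    """Pixel displacement caused by a translation error `offset` in the extrinsics."""
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    u0, v0, _ = project(point, identity, [0.0, 0.0, 0.0], focal, center)
    u1, v1, _ = project(point, identity, offset, focal, center)
    return ((u1 - u0) ** 2 + (v1 - v0) ** 2) ** 0.5

# Same 5 cm lateral offset, seen at 5 m and at 50 m.
shift_near = pixel_shift((1.0, 0.5, 5.0), [0.05, 0.0, 0.0])
shift_far = pixel_shift((1.0, 0.5, 50.0), [0.05, 0.0, 0.0])
```

Under these assumptions the displacement scales with the inverse of the depth, so the point at 5 m moves ten times further in the image than the point at 50 m. A purely rotational error would not show this depth dependence.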
In general, we note that the accuracy of the method used to obtain the KITTI calibration parameters is not sufficient for depth SR reconstructions. This issue is also visible, without discussion, in the work of [18] and more recently in [19], in which the regions showing this artifact are treated as patches of high depth uncertainty and reduced using a pre-filtering process.
4.2 Motion based calibration
To resolve the miscalibration issue in the KITTI dataset, we use the motion based method proposed in Section 3.2. First, given a pair of intensity and depth video sequences from the camera and LiDAR, we compute the corresponding motion vectors for both modalities. These motion vectors are computed using a grid based search only at uniformly spaced sample positions of the image space, and the search space is restricted to a spatial neighborhood of size 30 × 30 pixels in the subsequent frame. Figure 4 plots the behavior of the motion based cost function in (3) for a given parameter offset. We observe that the proposed cost function is locally convex and shows a clear global minimum at the correct calibration parameters.
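The grid based search can be sketched as exhaustive block matching. In the sketch below the block size, grid spacing and search radius are illustrative defaults chosen to keep the example small (a radius of about 15 pixels would approximate the 30 × 30 neighborhood), and the Euclidean distance in `motion_cost` is an assumed choice of norm for comparing the two modalities.

```python
import math

def block_motion(frame_a, frame_b, row, col, block=2, radius=3):
    """Shift (d_row, d_col) that best matches the block of frame_a centred
    at (row, col) inside frame_b, found by exhaustive search."""
    best, best_err = (0, 0), float("inf")
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            err = 0.0
            for i in range(-block, block + 1):
                for j in range(-block, block + 1):
                    diff = (frame_b[row + dr + i][col + dc + j]
                            - frame_a[row + i][col + j])
                    err += diff * diff
            if err < best_err:
                best, best_err = (dr, dc), err
    return best

def motion_field(frame_a, frame_b, spacing=4, block=2, radius=3):
    """Motion vectors at uniformly spaced grid positions, away from the border."""
    margin = block + radius
    rows, cols = len(frame_a), len(frame_a[0])
    return [block_motion(frame_a, frame_b, r, c, block, radius)
            for r in range(margin, rows - margin, spacing)
            for c in range(margin, cols - margin, spacing)]

def motion_cost(field_image, field_depth):
    """Sum of Euclidean distances between corresponding motion vectors."""
    return sum(math.hypot(a[0] - b[0], a[1] - b[1])
               for a, b in zip(field_image, field_depth))

# Synthetic texture whose content moves by one row and two columns.
def texture(r, c):
    return r * r + 3 * c * c + r * c

frame_a = [[texture(r, c) for c in range(20)] for r in range(20)]
frame_b = [[texture(r - 1, c - 2) for c in range(20)] for r in range(20)]
field = motion_field(frame_a, frame_b)
```

The same `motion_field` routine would be applied once to consecutive camera images and once to consecutive SR depth-maps, and `motion_cost` then compares the two fields for a given set of registration parameters.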
4.2.1 Convergence of the cost function
In this experiment we illustrate the convergence behavior of the simplex optimization of (3). For this purpose, we initialize the simplex optimization with 7 points, where 6 of these were chosen randomly within a fixed range of the correct parameters and one was chosen within a second range. A representative example of the convergence of the optimization is illustrated in Figure 5, which shows the first 100 iterations of 20 independent experiments for two different numbers of motion vectors. This convergence behavior is consistent across scenes and numbers of motion vectors. However, as shown in the figure, the cost range near 100 iterations becomes a bit wider when fewer motion vectors are used, since there are fewer motion vectors to constrain the alignment. Also note that in both cases the solution becomes stable after the same number of iterations, which is also consistent with other scenes and numbers of motion vectors as long as there is motion in the platform or the scene.
4.2.2 Effect of number of images on the registration algorithm
In the second set of experiments, we focused on illustrating the dependence of the registration accuracy on scene motion. For this purpose, we compared the root mean squared error (RMSE) of the registration on three different scenes, using motion vectors from image sequences of different lengths. The RMSE is measured against a ground truth calibration which was obtained manually by aligning edges in corresponding images and depth SR. Such an alignment of edges has been shown to be a robust method for calibration [8]. In our experiments, we conducted a total of 100 trials of the simplex optimization for each case and obtained the registration parameters after 100 iterations in every case. Table 1 summarizes the RMSE results for individual registration parameters.
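The per-parameter RMSE over repeated trials can be computed as sketched below. The function name and the example values are illustrative: each row of `estimates` holds the parameters returned by one trial, in the same order as `ground_truth`.

```python
import math

def rmse_per_parameter(estimates, ground_truth):
    """Root mean squared error of each registration parameter over all trials."""
    trials = len(estimates)
    return [math.sqrt(sum((e[k] - g) ** 2 for e in estimates) / trials)
            for k, g in enumerate(ground_truth)]

# Two hypothetical trials of a two-parameter problem.
errors = rmse_per_parameter([[1.0, 2.0], [3.0, 2.0]], [2.0, 2.0])
```

Reporting one value per parameter, rather than a single aggregate, keeps rotation and translation errors in their own units.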
We observe that the RMSE depends on the number of motion vectors used in the optimization of (3): as we increase the number of motion vectors, the RMSE decreases. In "Scene1", though, the effect of the number of motion vectors on the RMSE is less significant, because this scene corresponds to a crowded area with many moving pedestrians, which generates motion vectors carrying significant amounts of information. In general, the higher the number of motion vector frames used in the optimization, the better the convergence and RMSE performance of the registration approach, since these impose stronger constraints on the 6 DOF calibration parameters.
4.2.3 Robustness of registration algorithm
In this experiment we show the robustness of our motion based registration formulation against the initial guess of the parameters used in the optimization algorithm. Here, we conducted 100 registration trials using motion vectors from an image sequence of fixed length. In each of these independent registration trials we randomly selected different scenes and randomly initialized the registration parameters within bounded offsets in rotation and, in meters, in translation. In this experiment we again use the manually adjusted parameters as ground truth. Figure 6 reports the results, showing the RMSE performance over these trials for each registration parameter. Note that our registration formulation is capable of bringing the calibration parameters to within the RMSE accuracies reported in Figure 6, expressed in centimeters for translation.
4.3 Qualitative analysis of the proposed depth super-resolution reconstruction algorithm
In this experiment we present a qualitative analysis of the proposed depth SR reconstruction algorithm and compare it with the total generalized variation (TGV) approach described in [7]. Throughout the experiments, our reconstruction uses Algorithm 1 with a fixed learning rate. Other values also work well; we chose this one because we found experimentally that it gives a good trade-off between depth-map reconstruction quality and convergence time. To illustrate the performance comparison, we first refer to Figure 7, which shows the ability of our method to resolve finer details using only the sparse, low-resolution (LR) depth from LiDAR in Figure 7.b, in comparison to TGV, which additionally uses Figure 7.a. In Figure 7.b the gray pixels represent pixels with missing depth measurements from the LiDAR sensor. Comparing the depth SR reconstructions in Figures 7.c and d, we see that our method is able to resolve finer details. This is further illustrated in the zoomed patches showing the bike and its wheels (bottom left patch) and the chain between the poles (bottom right patch), which are hardly distinguishable in the TGV result. Figure 8 illustrates other SR depth reconstruction examples. The second and fourth columns of Figure 8 represent the LR and the SR reconstruction depth-maps, respectively. Note that the proposed approach results in better reconstructions and avoids the oversmoothing of edges seen in TGV, which loses resolution especially at objects with sharp edge discontinuities. Such characteristics can be seen, for example, when resolving the legs of the walking pedestrians in Figures 8(c) versus (d) and in (k) versus (l), and the poles in (g) versus (h) and in (o) versus (p). Note also that in (s) details, especially those in the right part of the scene, are hard to resolve, whereas in (t) one is able to see the tree trunks and bikes present in the scene.
In addition to the gain in edge sharpness, our SR depth reconstruction implementation took 0.1 s per frame, less than the time required by the TGV method. This advantage is due to the fixed equality constraint in equation (1), as opposed to its corresponding relaxation in the TGV method. Finally, we would like to add that some points which may appear as artifacts at the edges in our depth reconstruction are not caused by our depth SR method but rather by the LiDAR scanning mechanism. Specifically, the vertically arranged lasers in the LiDAR do not fire at the same time, and the LiDAR's internal factory calibration did not compensate enough to completely eliminate the resulting artifacts.
5 Conclusion
In this work, we proposed a novel method to generate depth super-resolution from LiDAR data and a motion based registration of LiDAR and camera modalities. The results of our experiments show state of the art real-time depth super-resolution reconstruction performance. We also found that motion based registration is an efficient metric which constrains the alignment both spatially and temporally and effectively decouples the alignment function from the different modality metrics (e.g., intensity, distance). Our results validate that our motion based registration formulation yields accurate parameter estimates for calibrating the LiDAR-camera sensing modalities and in turn produces accurate super-resolution depth maps.
- [1] Ashutosh Saxena, Sung H. Chung, and Andrew Y. Ng, “Learning depth from single monocular images,” in Proceedings of the 18th International Conference on Neural Information Processing Systems, Cambridge, MA, USA, 2005, NIPS’05, pp. 1161–1168, MIT Press.
- [2] Jure Žbontar and Yann LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” J. Mach. Learn. Res., vol. 17, no. 1, pp. 2287–2318, Jan. 2016.
- [3] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” in Proceedings of the International Conference on Computer Vision (ICCV), 2017.
- [4] J. Diebel and S. Thrun, “An application of Markov random fields to range sensing,” in Proceedings of the 18th International Conference on Neural Information Processing Systems (NIPS), Vancouver, Canada, 2005, pp. 291–298.
- [5] K. He, J. Sun, and X. Tang, “Guided image filtering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 6, pp. 1397–1409, November 2013.
- [6] J. Lu and D. Forsyth, “Sparse depth super resolution,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 2015.
- [7] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruether, and H. Bischof, “Image guided depth upsampling using anisotropic total generalized variation,” in Proc. IEEE Int. Conf. Comp. Vis., Sydney, NSW, Australia, December 1-8, 2013, pp. 993–1000.
- [8] J. Castorena, U. Kamilov, and P.T. Boufounos, “Autocalibration of lidar and optical cameras via edge alignment,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, March 20-25, 2016, pp. 2862–2866.
- [9] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun, “Vision meets robotics: The KITTI dataset,” Int. J. of Rob. Res., 2013.
- [10] J. Levinson and S. Thrun, “Automatic online calibration of cameras and lasers,” in Robotics: Science and Systems, Berlin, Germany, June 24-28, 2013, pp. 29–36.
- [11] G. Pandey, J.R. McBride, S. Savarese, and R.M. Eustice, “Automatic extrinsic calibration of vision and lidar by maximizing mutual information,” Journal of Field Robotics, vol. 32, no. 5, pp. 1–27, August 2014.
- [12] T. Scott, A.A. Morye, P. Piniés, L.M. Paz, I. Posner, and P. Newman, “Choosing a time and place for calibration of lidar-camera systems,” in IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 2016.
- [13] U.S. Kamilov and P.T. Boufounos, “Motion-adaptive depth superresolution,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 1723–1731, 2017.
- [14] K. Degraux, U.S. Kamilov, P.T. Boufounos, and D. Liu, “Online convolutional dictionary learning for multimodal imaging,” in IEEE International Conference on Image Processing, Beijing, China, 2017.
- [15] J. Castorena, “Computational mapping of the ground reflectivity with laser scanners,” CoRR, vol. abs/1611.09203, 2017.
- [16] Y.E. Nesterov, “A method for solving convex programming problem with convergence rate $O(1/k^2)$,” Dokl. Akad. Nauk SSSR, vol. 269, no. 3, pp. 543–547, 1983.
- [17] J. C. Lagarias, J. A. Reeds, M. H. Wright, and P. E. Wright, “Convergence properties of the Nelder-Mead simplex method in low dimensions,” SIAM Journal on Optimization, vol. 9, no. 1, pp. 112–147, 1998.
- [18] C. Premebida, L. Garrote, A. Asvadi, A.P. Ribeiro, and U. Nunes, “High-resolution lidar-based depth mapping using bilateral filter,” in IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), Rio de Janeiro, Brazil, 2016.
- [19] L. Chen, Y. He, J. Chen, Q. Li, and Q. Zou, “Transforming a 3-d lidar point cloud into a 2-d dense depth map through a parameter self-adaptive framework,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 1, pp. 165–176, 2017.