This paper deals with the problem of obtaining a depth map, in pixel coordinates, from a single 3D point-cloud generated by a multi-channel LIDAR mounted on-board an instrumented vehicle. Assuming the LIDAR is calibrated wrt a monocular camera, the transformation of the point-cloud in to the image plane generates a sparse distribution of points, as shown in Fig. 1
(zoom at the bottom-right), where the majority (more than 90%) of pixels are unsampled. Some difficulties arise in obtaining a consistent and dense (high resolution) depth map, where consistency has to do with obtaining a good estimation of depth values in edges and smooth regions of the depth-map. On the other hand, density is related to the spatial resolution of the map where the problem is caused by sparse and incomplete data. Dense depth maps are applicable in several fields, such as artificial perception, sensor fusion, scene reconstruction and semantic segmentation. Upsampling sparse 3D point-clouds to obtain a high resolution depth image can be explored in the context of artificial sensory-perception for intelligent/autonomous vehicles and ADAS applications, such as: road and obstacle detection , curb detection , vehicle detection .
The majority of the works in this research area use, in conjunction with range data, information from monocular camera as in [3, 1, 8, 10]. This work, instead, differs from existing solutions because the proposed approach uses only data from a LIDAR, and therefore monocular camera is considered just for calibration and visualization purposes. More specifically, a Bilateral Filter (BF) based framework is described which generates, from single 3D point-cloud delivered by a Velodyne HDL-64, a high resolution depth-map. Upsampling 3D point-clouds to produce a dense depth map based solely on data provided by a LIDAR i.e., color/texture information from camera is not been used, is a research area with very few scientific literature. The recent work of Miksik et al.  shares some common points with the approach described in this paper but, data from a stereo-image system is used in their algorithm. Conversely, methods that combine data from LIDAR and camera i.e., the depth upsampling strategy is enhanced by color and texture information, are much more common as evidenced by a number of interesting scientific works, such as , , . Besides the problem of the low number of sampled points provided by a single point-cloud, which corresponds to less than 10% of the region within the FOV (see Fig. 1), particular attention has to be given to the ambiguity problem of foreground vs background, that occurs in the areas of discontinuities (edges) between objects (such as a pedestrian or a vehicle) and far away objects (background). This is an aspect that will be addressed in detail in Sect III and is one of the main reasons to use edge-preserving methods like the BF. This work utilizes the structure of edge-preserving filters, namely the BF, which means that the sampled points in the local window are weighted by combining a neighborhood distance function and a range based function. The first weighting function does not contribute significantly to the performance of the system, while the second function plays a key role on the final results. These aspects will be further discussed and demonstrated.
Edge-preserving filters are very popular in the computer vision community, and they are used in many applications such as noise reduction, stereo matching, image deconvolution, image upsampling. Examples of this type of filter are the Anisotropic Diffusion filter, the Bilateral filter, and the recent Guided filter. The Anisotropic Diffusion filter was proposed by Perona and Malik to be a method to preserve region boundaries (edges) of objects in images. One of the criteria enunciated in  is that the region boundaries should be sharp and coincide with the semantically meaningful boundaries. Their idea was to prioritize smoothing within a local region over to smoothing across boundaries. On the other hand, the BF, named in , gained much popularity and demonstrated to be a very useful and efficient method in many applications. Examples of works which employed the BF to obtain dense depth-maps from LIDAR data are , . Recently, the Guided filter  was proposed as an alternative to the BF in the sense that it is also an edge-preserving smoothing filter . However, so far the Guided filter has been used only in the domain of computer vision applications and, therefore, work on LIDAR data upsampling seems to be an open problem to be addressed.
In terms of contributions, this paper proposes an upsampling framework for high-resolution depth mapping based on the BF, using an original approach to process the range values, within a local mask, where the discontinuities (edges) are meaningfully preserved. The proposed approach is particularly suitable for LIDAR only depth maps. Moreover, this work provides a thorough experimental validation of the framework, considering qualitative and quantitative criteria, and reports experimental results and comparative evaluation with several other techniques.
The remainder of this paper is organized as follows, in the next section the problem formulation is presented, Sect. III is devoted to edge-preserving filtering, while Sect. III-B focuses on the proposed solution to the discontinuity problem using BF. Sect. IV describes the evaluation methodology and datasets; experiments and results are detailed in Sect. V, and finally Sect. VI concludes this paper.
Ii Depth-map upsampling formulation
Given a 3D point-cloud produced by a LIDAR at a given time-stamp (from now on -index will be omitted to simplify the notations), and assuming the LIDAR is calibrated wrt a monocular camera with image-plane denoted by , the formulation starts by considering the set of points that lie within i.e., is obtained after the transformation from to the camera coordinate system and then to the image-plane using the calibration matrix. Given the set of points , with having non-integer pixel coordinates and being sparse, the goal here is to obtain a high resolution depth-map (high density of points) limited by the size of () and by the horizon line. In terms of locations, is restricted to positive-integer values i.e., pixel locations, in the range and respectively for the horizontal and vertical positions.
The elements of are the points , where each point is represented by the position in pixel coordinates and by the range value as measured by the LIDAR. However, since and are finite real numbers, the position coordinates of can be rounded to integer values in for purpose of computational efficiency. At this stage i.e., without further processing, three cases may occur: (case1) locations of with just one point; (case2) locations with more than one point from ; and (case3) locations of without corresponding point (“empty”). In this paper, which deals with the problem of obtaining depth-maps from a Velodyne HDL-64 sensor, the case1 occurs in approximately 6.7% of pixel locations of the , in the case2 we have about 0.1% of locations with more than a single point and finally, for the case3, about 93.2% of positions in are empty ( unsampled). Those percentages were calculated from experimental data using 100 frames of the KITTI dataset (more details in Sect. V-A). Although negligible for one pixel-size (let’s say a mask of ), the case2 deserves particular attention as the number of sample points rapidly increases as the size of the area of interest, the local window, increases to , , and so on. This situation is better understood after viewing Fig. 2 where, due to the LIDAR’s perspective viewing, range points pertaining to the cyclist, located in the foreground, and points belonging to the background are situated in the same ’s location (the zoomed view in the bottom row of Fig. 1 illustrates this problem.). This yields large deviations for the average range value at a local window and is the main cause of errors in the depth map. As explained before, the set of measurement points from a LIDAR sensor, in the form of the point-cloud , are converted to discrete locations in a two-dimensional map , in pixel coordinates, and is represented by . The main difference from other upsampling problems is that has a very low density of points: of the order of 6.8%. Therefore, the goal is to find a technique to estimate the value of in unsampled locations of and keeping those estimated values of depth (range-distance to the LIDAR) consistent through the depth-map.
The discussion above allows us to concentrate on solving the case3 thus, we need a solution of estimating the desirable variable of the unsampled positions in and, at the same time, the errors due to the boundaries regions (edges) should be minimized. Carrying out a solution of estimating depth values in locations without measurement points will result in a depth-map with higher resolution than the input; this is known as “upsampling” and can be formulated as a spatial estimation problem as detailed in the next sections.
Ii-a Defining the region of interest
Spatial data interpolation, or estimation, is typically performed under the assumption that a local region of interest is previously defined and the desired point-value to be estimated is located within ; this is known as local interpolation. In a more general case, any polygon shape can be used to define , however, in the problem considered here the most usual solutions are square regions (usually called mask or window). Nevertheless, for purpose of completeness, solutions based on Delaunay-triangles will also be considered. Delaunay-triangulation is effective in obtaining depth-maps with close to 100% of density i.e., all locations with unknown depth-values are estimated, because this method interpolates all points regardless the distance between the points of a triangle. Except for Delaunay based methods, all the interpolation methods discussed in this paper are applied to estimate the range value of locations centred at the local and square window with size defined by (in pixel units). The value of is not simple to decide and has direct impact on the number of available points to estimate the variable of interest and consequently on the computational effort. Moreover, an important aspect to take into account is the minimum number of points in necessary to guarantee consistency, statistical significance and efficiency of the estimator. This paper will address this issue by experiments, investigating the spatial-resolution of the (i.e., its density) and some statistics for increasing values of . The implementation consists in ‘moving’ through the locations of using the sliding window technique; then, all the points within are considered for estimating, locally, the depth value of the centre point of the window. A variety of interpolation (or estimation) methods can be applied to estimate the desired depth-value, some of them are described next.
Ii-B Local interpolation algorithms
The role of an interpolation method is to estimate values of range measurements, from a LIDAR, in both sampled and unsampled (empty) locations of the depth-map (). The locations in are in pixel coordinates and the interpolation is performed locally i.e., restricted to a local region . Let be the location of interest, which is the centre of , and let be the variable to be estimated, that is, the range-distance at . Given a finite set of measured points , where , spatial interpolation can be formulated in terms of a function that weights the depth values of the points according to spatial (position) based parameters thus, . There are many possibilities for to solve the problem but, for obvious reasons, we will address a limited number of methods. Besides, the following basic operators will be considered: sample average, minimum, maximum, median, nearest neighbor. The following three classes of interpolation techniques will be considered in this work: 1) inverse distance weighting (), the simplest variant of the Shepard’s Method; 2) the Ordinary Kriging () and 3) a Polygon based method using Delaunay triangles ().
The can be expressed as , where , is a given distance function (a metric), and is a power parameter (positive real number). In a general form, the weights can be generalized to an arbitrary kernel function yielding, for Kernel-based methods, the expression: . is an optimal linear estimator that estimates the value of a random function at the location of interest, , from samples located in the local region of interest. The points are weighted according to a covariance function or the equivalent semivariogram . The parameters in this method are the values of nugget, sill and range; where represents the lag distance i.e., a distance measure between points. In the approach, the value of the unsampled point of interest, which lies within the triangle, can be estimated by different techniques. In this work, we use the Matlab classes delaunayTriangulation and scatteredInterpolant to interpolate the three points of a given triangle; the available techniques in scatteredInterpolant are: linear, nearest neighbor, and natural neighbor interpolation.
In common with all those interpolation methods, the weighting is performed as function of the position of the sampled points , and the variable of interest () is not considered in the problem formulation. Conversely, the BF  allows one of its weighting terms to be dependent of the variable of interest, in our case the range distance (). For this reason, and due to the successful performance of Bilateral filtering in edge-preserving applications , we propose a modified version of BF to upsample depth maps.
Iii Edge-preserving on depth map upsampling
In this section we briefly review the BF, a well-known edge-preserving filter, and propose a new range-weighting technique for upsampling depth-maps from 3D-LIDAR’s data (as shown in Fig. 1). For a detailed review on edge-preserving filters, in the domain of image processing, please see .
Iii-a Bilateral filter
where weights the points inversely to their distance to the position of interest , controls the influence of the sampled points as function of their range values , and finally is a normalization factor that ensures weights sum to one. In (1), we set to be inversely proportional to the Euclidean distance between the center of the mask and the sampled locations , yielding
The influence of
is not very significant because the problem of jump discontinuities is caused by differences in range. On the other hand,is the key component to be explored in order to provide improvement in the estimation of , under the influence of discontinuities between foreground and background. A common form of weighting the range values, as in , is given by
However, as mentioned in Sect. II, the average percentage of centred pixel with range values is (only) 6.8% therefore, the nearest value has been chosen at an unsampled location .
Iii-B Modified Bilateral filter
We propose a modification in the BF, henceforth indicated as , by expressing the weighting element as a function of the ‘dispersion’ of in the mask . Assuming that an edge is characterized by a discontinuity in the range values, we propose to use a clustering algorithm to detect discontinuities and, if it is the case, the number of clusters () will be at least two; therefore, an edge (or a discontinuity) occurs if (as shown in Fig. 3). The algorithm used to perform clustering is based on the popular DBSCAN , which is a simple and effective algorithm that depends on two parameters, and . The implementation of the DBSCAN should take into consideration six definitions, as detailed in , and a distance function between points. In this work, we consider the distance function () as given by:
where is the number of points . If i.e., a discontinuity has been detected, the occurrence of more than one cluster is likely true. A cluster () is accepted only if . Ideally, the number of clusters corresponding to a window where an edge occurs should be i.e., one clustered set of points belonging to the foreground and another to the background. We conducted experiments using the DBSCAN algorithm, with and , and found out that is equal to two in the majority of the regions where an edge occurs. Figure 3 provides a visual display where the light-grey pixels correspond to regions with . In some circumstances, however, is greater than 2 and it is not easy to detect the ‘optimal’ boundary between foreground vs background. Whenever , the approach presented in this section considers at most two clusters; conversely, if equ. (1) is applied to all the points in . When the uses a ratio () between the number of points of, at most, two clusters.
Let and be the number of points belonging to the clusters and , where is the ratio of interest. The variable corresponds to the cluster that has the closest average distance to the LIDAR (denoted cluster ); this holds regardless of the number of remaining clusters. However, if and excluding the cluster , then a decision process is carried out, to select , according to the following rule: is chosen as the cluster with more points and, in case of clusters with the same number of points, then corresponds to the one with the closest average distance. Once the pair has been selected, then a threshold-based rule is applied to the ratio in order to penalize or . This rule is exclusive in the sense that one of the clusters will be excluded, therefore if , then the will be run on the points belonging to else, only the points in will be considered. The key idea is to strongly penalize, based on the ratio , one of two clusters and, as consequence, only points belonging to one of the clusters will be considered in (1).
Iv Dataset and evaluation methodology
Publicly datasets for purpose of LIDAR-based depth map evaluation and performance assessment are not available at the time of this writing. Therefore, to provide quantitative evaluation, we resort to the KITTI Stereo 2015 which provides groundtruth for disparity maps evaluation and benchmark. But, because that dataset was built to evaluate stereo systems, we had to find a solution that makes the evaluation of LIDAR-based depth maps possible; this is explained in the following section.
Iv-a KITTI dataset for depth-map evaluation
The present KITTI Stereo 2015111 http://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=stereo comprises a total of 400 frames from stereo images, 200 for training with their corresponding disparity maps (the groundtruth), and 200 for benchmarking purposes, where each frame has two associated images: one from each camera of the stereo pair. The KITTI Stereo 2015 is part of the KITTI Vision Benchmark Suite, being the latter composed of thousands of frames from different sensors: a Velodyne LIDAR, cameras, and GPS/IMU. The groundtruth for disparity evaluation was created using a process that incorporates a set of consecutive scans from the LIDAR (5 before and 5 after the actual frame of interest), where this sequence of point clouds were conveniently merged by a ICP technique as reported in , followed by a manually correction step to rectify eventual ambiguities. Nevertheless, the actual KITTI Stereo 2015 dataset provides a more challenging and accurate groundtruth, described in , where “objects” (vehicles) in the scenes were recovered by fitting detailed CAD models to the point clouds. The consequence is a set of groundtruth frames where some objects are very well delineated as shown in Fig. 4. To make possible the evaluation of depth-maps generated solely by LIDAR, a new dataset - using data from KITTI - had to be built. First, the groundtruth, originally in the form of disparity maps, has to be converted to depth maps by known geometry. To calculate depth of a disparity map , it is required to know the values of the baseline and the focal length , and the conversion is given by . Once the KITTI Stereo 2015 made available image-frames and the corresponding disparity maps, it is necessary to find the LIDAR scans that match the groundtruth frames. In this work, we performed a non-exhaustive search in the “Raw Data” recordings of the KITTI Vision Benchmark Suite and established the correspondence between 100 LIDAR scans and their counterpart in the groundtruth set in KITTI Stereo 2015. Henceforth, the experiments presented in the next sections were carried out using this set of 100 scans.
The evaluation methodology adopted in this work is similar to the one used in the KITTI Stereo 2015 benchmark, excepting the following: the total number of frames used in the evaluation is 100, and the separation of training and testing set is not carried out once our approach does not depend on a learning strategy. In short, we adapted the C++ codes of the development kit package, from KITTI website, to run the evaluation routine on the dataset described above. Four performance measures, detailed in , are calculated: D-bg, D-fg, D-all, and Density. The meaning of these terms are:
: % of outliers averaged over background regions;
- D-fg: % of outliers averaged over foreground (objects);
- D-all: % of outliers averaged over all groundtruth pixels;
- Density: % of the average number of pixels, in the depth-maps, with coincident positions of the groundtruth pixels.
Although the groundtruth, as already mentioned, has been obtained by a combination of 10 point clouds and by complementary object CAD-models, the number of unsampled pixels in the groundtruth depth-maps is still significant (as shown in Fig. 4). So, the groundtruth depth maps are not 100% dense i.e., the value of Density is calculated considering only the sampled pixels .
V Experiments and evaluation
This section describes the experiments conducted on the dataset detailed in Sect. IV-A, with the goals of assessing the performance of the local-region () and evaluating the approaches discussed in Sect. II-B and Sect. III. The reported results are from the non-occluded set (). An appropriate solution to mitigate the errors caused by discontinuities between foreground and background will depend, primary, on the detection of the occurrence of such discontinuities. As mentioned in Sect. III-A, a BF-modified version, , was proposed to obtain consistent depth-map from solely LIDAR data.
V-a The role of the region of interest
The size of , defined by in pixel units as described in Sect.II-A, controls the number of points (within ) to be used by a given local-interpolation approach. Table I provides the results, averaged over the 100 frames of the dataset, for increasing values of . The average number of points in is denoted by , while indicates the maximum number of points. These values were obtained by applying a sliding-window strategy with step of 1 pixel. The density of the map, denoted by and given in percentage, is calculated considering the region covered by the LIDAR’s field-of-view (FOV). As shown in Fig. 1, the LIDAR measurements cover a limited vertical FOV i.e., the upper part of the image plane is empty, and hence only the pixels below a certain ‘horizon line’ (see Fig. 1) are used to compute . A density of 100% means that the window had, in all locations of the sparse map, at least 1 point inside . The ‘horizon line’ was calculated by averaging, for each column of , the points of that have the smallest value in vertical-axis.
The KITTI Database provides the MATLAB/C++ utility package used in the evaluation of the algorithms. Density is evaluated against the KITTI groundtruth and, therefore, it follows a methodology which is not the same as the above. For that reason, the values of density (denoted by ) as calculated by the KITTI evaluation package, shown in the last row of Table I, are not the same of .
V-B Evaluation of local-spatial interpolation algorithms
The first experiments conducted in this work involve methods that do not depend on the size of , which are the cases of the Nearest () operator and the Delaunay-based techniques using linear (), nearest neighbor (), and natural neighbor () interpolation. In terms of error performance, as shown in Table II, and achieved comparable results, however the main difference resides in the fact that the density resulted from varies with the window size (), while the Delaunay-based approaches attain the same value of density, being equal to 99.96%.
|17.47 %||3.78 %||5.53 %|
|22.05 %||4.15 %||6.48 %|
|17.10 %||3.82 %||5.55 %|
|23.54 %||4.21 %||6.73 %|
|8.23 %||2.63 %||3.35 %|
|7.57 %||4.20 %||4.63 %|
|14.64 %||3.32 %||4.77 %|
|20.37 %||4.91 %||6.88 %|
|25.84 %||4.41 %||7.14 %|
|25.77 %||4.54 %||7.25 %|
|26.67 %||4.77 %||7.56 %|
|34.12 %||15.37 %||17.76 %|
The methods and , the operators sample average (), minimum (), maximum () and median (), and also the BF and have different performances for different values of . For this reason, and to show the relationship between the errors and , Figure 5 provides the values of the error measures (D1-bg, D1-fg and D1-all) for increasing values of window size.
Considering the results shown in Fig. 5 the error (D1-fg) on the foreground objects (Fig. 5(a)) demonstrated to be the most challenging case, particularly for the operators , , and methods , , where the error increases monotonically with . For the median operator (), the situation is also not favorable. The shows a good behavior up to , while and are relatively robust for all values of , although the proposed attained the best results. In terms of background-errors (D1-bg), depicted in Fig. 5(b), most of the methods are generally satisfactory except, clearly, the operators , and . Finally, and as consequence of the combined errors (D1-all = D1-fg + D1-bg), from Fig. 5(c) it is possible to conclude that achieved the lowest error among the implemented methods. Based on the values of density, provided in Table I, and considering the error plots in Fig. 5, the ‘optimum’ value of the window size is . Therefore, Table III shows the values of error, in percentage and for a window size of , which facilitates the comparison of results among the methods and techniques, including those reported in Table II.
In terms of qualitative results, Fig. 6 shows, for a given frame, the resulting depth-maps (left part) and the error images (in the right). The color image and the groundtruth are provided to facilitate the analysis; the output depth-maps are shown using a colormap which is proportional to the distance, while the error images were mapped using the KITTI dev-kit . Notice that the top part of the depth-maps where removed because the LIDAR’s FOV does not cover the entire frame (as discussed in Sect. V-A).
A high-resolution, LIDAR-based only, depth mapping approach is presented in this work as a promising solution to be used in sensory perception systems, as part of applications such as: road detection, object recognition and tracking, environment modeling, cooperative perception. The approach is based on the Bilateral Filter (BF) and contributes with a technique that influences the weighting range-term of the BF. Experiments using the KITTI database were carried out to assess the performance of the proposed approach as well as other usual interpolation techniques, namely: Kriging, IDW, Delaunay interpolation, Median, Nearest neighbor, and others. From the experimental results reported in this paper, the proposed approach, denoted by , attained the best results among the methods and techniques tested.
This work has been supported by FCT and COMPETE under projects “AMSHMI2012-RECI/EEIAUT/0181/2012” and “UID/EEA/00048/2013”.
-  H. Andreasson, R. Triebel, and A. Lilienthal. Non-iterative vision-based interpolation of 3D laser scans. In Autonomous Robots and Agents, volume 76.
-  A. Broggi, P. Grisleri, and P. Zani. Sensors technologies for intelligent vehicles perception systems: A comparison between vision and 3D-LIDAR. In IEEE ITSC, pages 887–892, Oct 2013.
-  J. Dolson, J. Baek, C. Plagemann, and S. Thrun. Upsampling range data in dynamic environments. In IEEE CVPR, 2010.
-  C. Fernandez, D. F. Llorca, C. Stiller, and M. A. Sotelo. Curvature-based curb detection method in urban environments using stereo and laser. In IEEE Intelligent Vehicles Symposium (IV), 2015.
-  A. Geiger, P. Lenz, and R. Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In IEEE CVPR, 2012.
A. Gonzalez, G. Villalonga, J. Xu, D. Vazquez, J. Amores, and A. M. Lopez.
Multiview random forest of local experts combining RGB and LIDAR data for pedestrian detection.In IEEE IV, 2015.
-  K. He, J. Sun, and X. Tang. Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(6):1397–1409, 2013.
Y. He, L. Chen, and M. Li.
Sparse depth map upsampling with rgb image and anisotropic diffusion tensor.In Intelligent Vehicles Symposium (IV), 2015 IEEE, pages 205–210, June 2015.
-  M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In IEEE CVPR, 2015.
-  O. Miksik, Y. Amar, V. Vineet, P. Perez, and P. Torr. Incremental dense multi-modal 3d scene reconstruction. In IEEE/RSJ IROS, 2015.
-  S. Paris, P. Kornprobst, J. Tumblin, and F. Durand. Bilateral filtering: Theory and applications. Foundations and Trends in Computer Graphics and Vision, 4(1):1–73, 2009.
-  P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):629–639, Jul 1990.
-  C. Premebida, J. Carreira, J. Batista, and U. Nunes. Pedestrian detection combining RGB and dense LIDAR data. In IROS, 2014.
-  J. Sander, M. Ester, H.-P. Kriegel, and X. Xu. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Min. Knowl. Discov., 2(2):169–194, June 1998.
-  P. Y. Shinzato, D. F. Wolf, and C. Stiller. Road terrain detection: Avoiding common obstacle detection assumptions using sensor fusion. In IEEE IV, pages 687–692, June 2014.
-  C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In IEEE CVPR, 1998.
-  F. Zhang, D. Clarke, and A. Knoll. Vehicle detection based on LiDAR and camera fusion. In IEEE ITSC, pages 1620–1625, Oct 2014.